This article is the first in a short blog series about our journey from a single-control-node environment to a highly available OpenStack cloud. We started using OpenStack as an experiment while our company was running on a different environment, but, as sometimes happens, the experiment suddenly became a production environment without any redundancy. That improved only when we migrated to Ceph as our storage back-end, so that at least the (Glance) images, (Nova) ephemeral disks and (Cinder) volumes were redundant. The (single) control node, however, didn’t even have a RAID configuration, just a regular backup of the database, config files etc.
OpenStack: Upgrade to high availability (Part I)
Obstacles for OpenStack: disk_format is not the same as disk_format
I bet you’re wondering about the title and whether there’s a typo or some other mistake. I promise, there’s nothing wrong with the title. It’s the result of some research I did in two different environments (Ocata and Rocky); I already wrote an article about some of the findings.
OpenStack with Ceph: orphaned instances part II
If you read my first article about orphaned instances, this might also be relevant for you. I’ll just add a few more notes on that topic, so this article will be rather short. This report applies to a production Ocata environment and to a Rocky lab environment.
If you tend to use ephemeral disks for Nova instances (I do, because launching an instance takes only a couple of seconds), be aware that deleting an instance can give you the impression that all is fine: the instance is indeed marked as deleted in the database, so it’s not visible anymore. But its disk can still exist inside Ceph if you created rbd snapshots of it. I’ll show you what I mean:
    # only one rbd object exists (Glance image)
    root@host:~ # rbd -p cinder ls
    root@host:~ # rbd -p glance ls
    5e61eaf9-f988-4086-8d03-982fdd656497

    # launch instance with ephemeral disk
    root@host:~ # openstack server create --flavor 1 --image 5e61eaf9-f988-4086-8d03-982fdd656497 --nic net-id= vm1
    root@host:~ # rbd -p cinder info ced41764-9481-4280-a2b1-1cd5ce894b5a_disk | grep parent
    parent: glance9/5e61eaf9-f988-4086-8d03-982fdd656497@snap

    # create snapshot
    root@host:~ # rbd -p cinder snap create ced41764-9481-4280-a2b1-1cd5ce894b5a_disk --snap test-snap
    root@host:~ # rbd -p cinder snap ls ced41764-9481-4280-a2b1-1cd5ce894b5a_disk
    SNAPID NAME      SIZE TIMESTAMP
        18 test-snap 1GiB Wed Mar 6 13:28:28 2019

    # delete instance
    root@host:~ # openstack server delete vm1
    Request to delete server c1 has been accepted.
    root@host:~ # openstack server list
    +----+------+--------+------------+-------------+----------+
    | ID | Name | Status | Task State | Power State | Networks |
    +----+------+--------+------------+-------------+----------+
    +----+------+--------+------------+-------------+----------+
So all is fine, isn’t it? The instance is deleted and our Ceph cluster has gained some more free space, right? Well, the answer is no:
    # disk still exists
    root@host:~ # rbd -p cinder ls
    ced41764-9481-4280-a2b1-1cd5ce894b5a_disk

    # and the snapshot is still present
    root@host:~ # rbd -p cinder snap ls ced41764-9481-4280-a2b1-1cd5ce894b5a_disk
    SNAPID NAME      SIZE TIMESTAMP
        18 test-snap 1GiB Wed Mar 6 13:28:28 2019
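To find leftovers like this one, the list of RBD images can be compared with the instance IDs Nova still knows about. The following is only a minimal sketch, not a tested tool: the function name `find_orphaned_disks` and the two input files are my own assumptions, and it relies on the `<uuid>_disk` naming seen in the output above.

```shell
#!/bin/sh
# Sketch: print every RBD image named <uuid>_disk whose UUID is not in the
# list of known Nova instances. Function name and file arguments are
# hypothetical, chosen for illustration only.
#   $1  file with one instance UUID per line
#   $2  file with one RBD image name per line
find_orphaned_disks() {
    ids_file=$1
    images_file=$2
    while IFS= read -r image; do
        # Only ephemeral disks follow the <uuid>_disk naming scheme
        case $image in
            *_disk) ;;
            *) continue ;;
        esac
        uuid=${image%_disk}
        # -x: match the whole line, -F: fixed string, -q: no output
        if ! grep -qxF -- "$uuid" "$ids_file"; then
            printf '%s\n' "$image"
        fi
    done < "$images_file"
}

# Possible usage (needs admin credentials; the pool name is taken from the example above):
#   openstack server list --all-projects -f value -c ID > /tmp/nova_ids
#   rbd -p cinder ls > /tmp/rbd_images
#   find_orphaned_disks /tmp/nova_ids /tmp/rbd_images
```

Treat the result as a list of candidates only. An image that shows up here still has to be checked by hand before anything is removed.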
The expected workflow should be the same as it is for Cinder. Deleting a volume-backed instance leaves a Cinder volume behind, which cannot be deleted while it has rbd snapshots. The attempt to remove such a volume shows a correct error message:
    root@host:~ # cinder delete 8a1d789d-e73e-4db1-b21e-0ed79855ecce
    Delete for volume 8a1d789d-e73e-4db1-b21e-0ed79855ecce failed: Invalid volume: Volume status must be available or error or error_restoring or error_extending or error_managing and must not be migrating, attached, belong to a group, have snapshots or be disassociated from snapshots after volume transfer. (HTTP 400) (Request-ID: req-85777afb-b52f-4805-ac06-cbb0cce55c12)
    ERROR: Unable to delete any of the specified volumes.
The Cinder workflow also has its flaws, though: depending on how the instance and its snapshots were created, the Cinder CLI command can also return successfully after trying to delete a volume with snapshots. You’ll have to look in the Cinder logs to find out that the volume was not actually deleted because of existing snapshots. These processes could (or should) be improved, but for Cinder they work, at least in general; for Nova they do not. So if you delete Nova instances, make sure you check for existing snapshots of that instance with:
    root@host:~ # rbd -p <pool> snap ls <instance-uuid>_disk
Be careful which snapshots you delete; they might be important. But on the other hand, if you already decided to delete the instance, the snapshots can’t be that important. ;-)
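The snapshot check can also be wrapped in a small helper that runs before an instance is deleted. This is a rough sketch under a few assumptions: the function name `has_rbd_snapshots` is made up for illustration, the ephemeral disk is expected to be named `<uuid>_disk` as in the example above, and `rbd snap ls` is expected to print nothing for an image without snapshots.

```shell
#!/bin/sh
# Sketch: has_rbd_snapshots POOL INSTANCE_UUID
# Prints the snapshot listing and returns 0 if the instance's ephemeral disk
# has rbd snapshots; returns 1 otherwise. The function name is hypothetical.
# Errors from rbd are discarded, so a missing image also yields 1.
has_rbd_snapshots() {
    pool=$1
    image="${2}_disk"
    snaps=$(rbd -p "$pool" snap ls "$image" 2>/dev/null)
    if [ -n "$snaps" ]; then
        printf '%s\n' "$snaps"
        return 0
    fi
    return 1
}

# Possible usage with the instance from the example above:
#   if has_rbd_snapshots cinder ced41764-9481-4280-a2b1-1cd5ce894b5a; then
#       echo "snapshots exist, review them before deleting the instance" >&2
#   else
#       openstack server delete vm1
#   fi
```

The helper only reports. Whether the snapshots should be removed is left to you on purpose.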