OpenStack with Ceph: Clean up orphaned instances

Working with Ceph and OpenStack can make your life as a cloud administrator really easy, but sometimes you discover its downsides. From time to time I share such findings in this blog; it serves as documentation for me, and hopefully it helps you avoid the mistakes I made.

I discovered an orphaned instance in a user’s project; fortunately it was not an important one. The instance’s disk was not a volume but a clone of the Glance image (<INSTANCE_ID>_disk), so it depended on that base image. However, the base image no longer existed in the backend; somehow it must have been deleted even though clones of it still existed. I assume it had to do with a cache tier incident a couple of months earlier; something must have destroyed the relationship between the image and its clones.

Anyway, how can you clean that up? You can’t simply delete the instance from OpenStack, since Nova will try to remove the clone from the backend, which fails because of the missing parent:

ImageNotFound: [errno 2] error opening image 187cb991-1038-476e-b643-001db259eba7_disk at snapshot None
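
To confirm what you are dealing with, you can try to open the clone directly; if the parent is really gone, this should fail with a similar “No such file or directory” error (the pool name matches the one used in the commands below):

# Opening the orphaned clone directly should fail if its parent image is missing
control:~ # rbd -p images info 187cb991-1038-476e-b643-001db259eba7_disk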

Digging in the database won’t help you either, except that you could update the instance to a deleted state so it no longer shows up in the instance list. That would still leave the data objects in the Ceph pool. You could clean those up manually, of course, but don’t! There’s a better way! The steps described in the following could still help with other issues, so I’ll leave them here anyway, but please read till the end!
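
Just for illustration, the database shortcut mentioned above would look roughly like the following. This is a hypothetical sketch against the Nova database; it only hides the instance and is not recommended:

# Hypothetical only: soft-delete the instance in the Nova database (leaves all RBD objects behind)
control:~ # mysql nova -e "UPDATE instances SET deleted = id, deleted_at = NOW(), vm_state = 'deleted' WHERE uuid = '187cb991-1038-476e-b643-001db259eba7';"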

# Get keys and values of the affected instance
control:~ # rados -p images listomapvals rbd_directory | grep -A5 187cb991-1038-476e-b643-001db259eba7 
value (18 bytes) : 
00000000  0e 00 00 00 32 31 66 32  39 38 31 66 34 34 37 39  |....21f2981f4479| 
00000010  63 63                                             |cc| 

control:~ # rados -p images listomapvals rbd_directory | grep -A5 21f2981f4479cc                       
value (45 bytes) : 
00000000  29 00 00 00 31 38 37 63  62 39 39 31 2d 31 30 33  |)...187cb991-103| 
00000010  38 2d 34 37 36 65 2d 62  36 34 33 2d 30 30 31 64  |8-476e-b643-001d| 
00000020  62 32 35 39 65 62 61 37  5f 64 69 73 6b           |b259eba7_disk| 

# Remove omapkeys
control:~ # rados -p images rmomapkey rbd_directory name_187cb991-1038-476e-b643-001db259eba7_disk
control:~ # rados -p images rmomapkey rbd_directory id_21f2981f4479cc

# Remove remaining objects
control:~ # rados -p images rm rbd_id.187cb991-1038-476e-b643-001db259eba7_disk
control:~ # rados -p images ls | grep 21f2981f4479cc | xargs rados -p images rm
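
Afterwards you can verify that nothing referring to the orphaned disk is left in the pool; both commands should come back empty:

# Verify that no objects or rbd_directory entries for the orphaned disk remain
control:~ # rados -p images ls | grep -e 187cb991 -e 21f2981f4479cc
control:~ # rados -p images listomapvals rbd_directory | grep 187cb991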

This would work and you wouldn’t see the orphaned instance anymore, but there’s still a chance that some of this leaves inconsistencies in your cluster, so once again: be careful with this approach!

There’s a much more elegant way to achieve this, so don’t try to delete the orphaned instance yet:

# Rebuild the instance from a different image
control:~ # openstack server rebuild --image <NEW_IMAGE_ID> 187cb991-1038-476e-b643-001db259eba7

This will remove the corrupt references to a non-existing Glance image and update the relevant database entries. Now you can either keep using this rebuilt instance for existing or new purposes, or finally delete it.
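
A quick way to double-check the result; the column selection with -c is only there to keep the output short:

# Confirm the rebuilt instance is healthy and its disk now references the new base image
control:~ # openstack server show 187cb991-1038-476e-b643-001db259eba7 -c status -c image
control:~ # rbd -p images info 187cb991-1038-476e-b643-001db259eba7_disk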

There’s a nice command in the rbd client to list all images within a specified pool, showing all relationships between base images and clones:

control:~ # rbd -p images list --long 
2018-04-24 15:26:46.454115 7ff10a7fc700 -1 librbd::image::RefreshParentRequest: failed to open parent image: (2) No such file or directory 
2018-04-24 15:26:46.454130 7ff10a7fc700 -1 librbd::image::RefreshRequest: failed to refresh parent image: (2) No such file or directory 
2018-04-24 15:26:46.454142 7ff10a7fc700 -1 librbd::image::OpenRequest: failed to refresh image: (2) No such file or directory 
2018-04-24 15:26:46.454254 7ff10a7fc700 -1 librbd::io::AioCompletion: 0x5565e04c7910 fail: (2) No such file or directory 
NAME                                          SIZE         PARENT      FMT PROT LOCK  
284007bf-cd6b-42ee-9529-274d259e6812_disk     20480M 

If the output of that command starts with errors like these, you have orphans in your cluster. Check your instances and their base images to identify which one it is.
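
If the pool contains a lot of images, a small loop like this sketch may help narrow it down, since the broken clones are exactly the ones that can no longer be opened:

# Sketch: print every image in the pool that cannot be opened anymore
control:~ # for img in $(rbd -p images list); do rbd -p images info "$img" >/dev/null 2>&1 || echo "broken: $img"; done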

Disclaimer, and please leave a comment below

And as always with such articles: the steps above do work for us, but they may not work for you. So if anything goes wrong while you try to reproduce the procedure above, it’s not our fault, but yours, and it’s yours to fix. But whether it works for you or not, please leave a comment below so that others will know.


Obstacles for OpenStack: cinder-volume tears down control node

I’d like to share another finding from my work with OpenStack.
I was asked for assistance with a small private cloud based on Ocata, with a single control node, a handful of compute nodes, and Ceph as the storage backend.

During tests with a Heat template (containing 6 instances, 7 volumes and a small network infrastructure), the control node became unresponsive due to the load caused by cinder-volume. The reason was that some of the volumes had to be created from (large) images, in which case Cinder has to download the Glance images, convert them locally and upload the result back to Ceph as volumes.

The conversion happens on the local disk of the control node. This was known, so the directory /var/lib/cinder had been placed on a separate logical volume with enough disk space. That setup was fine for creating single volumes, but this was the first real performance test for the environment, and it failed! While the stack was being created – which took ages! – the control node was almost inoperable.

So we decided to put the conversion directory on an SSD. Not only did this lead to a much faster stack creation, it also kept the control node “alive” and responsive.
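
A sketch of what such a setup can look like; the device name is a placeholder, image_conversion_dir is the Cinder option that controls where the conversion happens, and the service name may differ on your distribution:

# Example only: put the Cinder conversion directory on a dedicated SSD
control:~ # mkfs.xfs /dev/nvme0n1
control:~ # mkdir -p /var/lib/cinder/conversion
control:~ # mount /dev/nvme0n1 /var/lib/cinder/conversion
control:~ # crudini --set /etc/cinder/cinder.conf DEFAULT image_conversion_dir /var/lib/cinder/conversion
control:~ # systemctl restart openstack-cinder-volume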

This should be considered when planning a cloud infrastructure, although it can also be fixed quickly later if you have an empty drive slot available in your server. Until there’s a way for qemu-img convert to avoid the local conversion step altogether, it’s a good idea to move the conversion directory onto a faster device.


Migrating BlueStore’s block.db

Ceph’s BlueStore storage engine is rather new, so the big wave of migrations due to failing block devices is still ahead of us – on the other hand, non-optimal device selection caused by lack of experience or “heritage environments” may have left you with a setup you’d rather change.

Such an issue can be the location of the OSDs’ RocksDB devices. As a recap: BlueStore allows you to separate the storage for the write-ahead log (WAL), its metadata store (RocksDB) and the actual content. When using spinning disks for the content, the most common setup is probably to split off RocksDB onto an SSD. If you have the money, you may have put the WAL onto NVMe storage; if not, it will automatically end up on the SSD (if you have one) or on the main block device if that is all you have.
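
If you are not sure how your existing OSDs are laid out, ceph-volume and the OSD metadata can tell you which devices back block, block.db and block.wal (the OSD id below is just an example):

# Show the block / block.db / block.wal devices of the OSDs on this host
control:~ # ceph-volume lvm list
# The cluster's view of a single OSD's devices
control:~ # ceph osd metadata 0 | grep -e bluefs -e bluestore_bdev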

So when setting up such an “HDD, plus RocksDB on SSD” OSD, you have to decide how to provide the RocksDB block device. Since roughly 10 GB of RocksDB per terabyte of main storage is recommended, assigning a full SSD is a waste of resources. You basically end up with two options: partition the SSD, or turn it into a PV and create an LVM volume group from it. But whatever you decide: once set up, there’s no documented way to move the RocksDB to a different block device – you’d need to recreate the OSD.
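
For completeness, recreating such an “HDD plus RocksDB on SSD” OSD with the LVM variant could look like this sketch; device names, VG/LV names and the size are examples and should follow the roughly 10 GB per terabyte rule mentioned above:

# Sketch: HDD for the data, a logical volume on the shared SSD for RocksDB
control:~ # pvcreate /dev/sdk
control:~ # vgcreate ceph-db /dev/sdk
control:~ # lvcreate -L 40G -n db-osd0 ceph-db
control:~ # ceph-volume lvm create --bluestore --data /dev/sdb --block.db ceph-db/db-osd0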
