During an attempt to migrate some OSDs’ BlueStore RocksDB to a different block device, we noticed previously undetected fatal read errors on the existing RocksDB. The only way to recover from this situation is to re-create the OSD and have its content rebuilt from the other copies.
There are standard procedures to delete and to create OSDs, both BlueStore and FileStore. But during our transition from FileStore to BlueStore, we ran into a problem where we could not specify the new OSD’s id, and we hit other minor difficulties. This time we also wanted to cause as little data movement as possible, all while replacing the RocksDB block device.
To make a long story short: We were looking for a “mkfs”-style approach.
Moving the RocksDB block device
If you have already read the article on migrating the RocksDB block device, this part should be pretty familiar to you.
As the current RocksDB block device is unusable, a new one needs to be put in place. And most likely (unless you’re using logical volumes for this), the new device’s reference will differ from that of the existing device.
Since we’re creating a new, empty OSD, we don’t need to fiddle with any attributes or meta-data – “ceph-osd” will take care of that for us.
Preparing the data block device
Our starting point is a Ceph cluster with the “noout” flag set and the failing OSD stopped.
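For reference, on a systemd-based installation these two preconditions typically boil down to something like the following (the service name is the usual one, adjust it if your setup differs):
ceph osd set noout
systemctl stop ceph-osd@<idOfOsd>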
Please note that the procedure below will erase data from your cluster by wiping the existing OSD. Make sure you have enough good copies before continuing. Of course, since our starting point is a hidden failure of a RocksDB device, other OSDs (even on different servers) might be affected by a similar failure. So it might happen that, by following these steps, you wipe your last good copy of some of the PGs, with the other OSDs only holding bad ones. Only proceed if you are willing to take that risk.
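To get at least a rough picture of the redundancy you still have before wiping, the health output and the list of PGs mapped to this OSD are a good start (assuming a Ceph release that knows “pg ls-by-osd”):
ceph health detail
ceph pg ls-by-osd <idOfOsd>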
In order to re-create the OSD’s BlueStore “formatting” of the devices involved, you need to “zap” the current block device’s signature. Otherwise, the “ceph-osd” command will detect the existing BlueStore information and bail out. So, using “dd”, put e.g. 100M of zeros onto the base block device of your OSD:
dd if=/dev/zero of=/var/lib/ceph/osd/ceph-<idOfOsd>/block bs=1M count=100
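A small aside: “block” in the OSD’s data directory is usually just a symlink to the actual device or partition, so if you want to double-check what the “dd” will actually hit (ideally before running it), resolve the link first:
readlink -f /var/lib/ceph/osd/ceph-<idOfOsd>/block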
Once that’s done and you’ve provisioned your new RocksDB block device, you can call ceph-osd to initialize the OSD with the original values (and new RocksDB device):
ceph-osd --cluster <yourClusterName> --osd-objectstore bluestore --mkfs -i <idOfOsd> --bluestore-block-db-path <newDbDev> --osd-data /var/lib/ceph/osd/ceph-<idOfOsd>/ --mkjournal --setuser ceph --setgroup ceph --osd-uuid <uuidOfOsd>
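To make the placeholders a bit more tangible, here is what the call might look like for, say, osd.17 in the default cluster “ceph” with the new RocksDB partition on /dev/nvme0n1p4 (all of these values are made up for illustration, so substitute your own):
ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 17 --bluestore-block-db-path /dev/nvme0n1p4 --osd-data /var/lib/ceph/osd/ceph-17/ --mkjournal --setuser ceph --setgroup ceph --osd-uuid 8d3a1f2c-2b7e-4c3a-9f1e-0a5b6c7d8e9f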
You can get hold of the OSD UUID e.g. via the “ceph-bluestore-tool” command:
ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-<idOfOsd>/block
and looking for the “osd_uuid” entry.
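Keep in mind that the “dd” from above wipes the BlueStore label, so reading it only works before you zap the device. If that window has passed, the uuid is also part of the OSD map, so something along these lines should do (the grep is merely for convenience):
ceph osd dump | grep "^osd.<idOfOsd> "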
Once you’ve started the OSD, Ceph will refill it with new copies of the according PGs, so this may well take some time to finish. But at least no rebalancing took place (as the OSD tree did not change), nor did we have to go through the hassle of deleting the OSD and creating a new one.
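For completeness, the remaining steps on a systemd-based setup look roughly like this; “ceph -w” lets you watch the recovery, and the “noout” flag set at the beginning should only be cleared once all PGs are back to active+clean:
systemctl start ceph-osd@<idOfOsd>
ceph -w
ceph osd unset noout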
Disclaimer, and please leave a comment below
And as always with such articles: the steps above do work for us, but needn’t work for you. So if anything goes wrong while you try to reproduce the above procedure, it’s not our fault, but yours. And it’s yours to fix! But whether it works for you or not, please leave a comment below so that others will know.