Ceph's BlueStore storage engine is still rather new, so the big wave of migrations due to failing block devices is yet to come. On the other hand, less-than-ideal device choices, made for lack of experience or inherited from "legacy environments", may have left you with a setup you would rather change.
Such an issue can be the location of the OSDs' RocksDB devices. As a recap: BlueStore lets you separate the storage for the write-ahead log (WAL), its metadata store (RocksDB) and the actual content. When using spinning disks for content, the most common setup is probably to split off the RocksDB onto an SSD. If you have the money, you may have put the WAL onto NVMe storage; if not, it will automatically end up on the SSD (if you have one) or on the main block device, if that is all you have.
So when setting up such a "HDD, plus RocksDB on SSD" OSD, you had to decide how to lay out the RocksDB block device. As 10 GB of RocksDB per terabyte of main storage is recommended, assigning a full SSD would be a waste of resources. That leaves you with basically two options: partition the SSD, or turn it into a PV and create an LVM volume group from it. But whatever you decide: once set up, there is no documented way to move the RocksDB to a different block device; you would need to recreate the OSD.
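For reference, such an OSD is typically created with "ceph-volume", pointing "--block.db" at either a partition or a logical volume. A minimal sketch (the device names and the "ceph-journals" volume group are placeholders, not taken from the setup described later):

    # variant 1: RocksDB on a partition of the SSD
    ceph-volume lvm create --bluestore --data /dev/sdb --block.db /dev/sdc1
    # variant 2: RocksDB on a logical volume
    lvcreate -n db-osd3 -L 10G ceph-journals
    ceph-volume lvm create --bluestore --data /dev/sdb --block.db ceph-journals/db-osd3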
Where to put the RocksDB
In terms of "continuous operations", putting the RocksDB on a partition of your SSD creates a limitation: if the SSD fails (or you want to replace the physical device for whatever other reason), you will at least have to shut down the OSD in question.
Putting the RocksDB on a logical volume (LV) avoids that limitation: you have the option to simply add another physical volume (i.e. a new SSD) to the volume group and then move all blocks from the old PV to the new one, live. So migrating to a new SSD does not interrupt your cluster.
A growing server, with an increasing number of OSDs, may also saturate the single SSD you had installed for all the OSDs' RocksDB block devices. Distributing them across multiple SSDs after the initial OSD setup will not work without downtime if you use partitions; with LVs, you can simply add another SSD to the volume group and move individual LVs to that new physical volume, spreading the load across multiple devices. Again live, without any interruption to your Ceph cluster.
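For illustration, a sketch of that live migration with plain LVM commands, assuming a volume group "ceph-journals" currently backed by the old SSD /dev/sdc and a new SSD /dev/sdd (names are placeholders):

    vgextend ceph-journals /dev/sdd     # add the new SSD as a PV to the volume group
    pvmove /dev/sdc /dev/sdd            # move all extents off the old SSD, live
    vgreduce ceph-journals /dev/sdc     # remove the old SSD from the volume group
    pvremove /dev/sdc                   # and wipe its PV label
    # to spread load instead of replacing the SSD, move only selected LVs:
    pvmove -n journal-osd3 /dev/sdc /dev/sdd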
In short: as far as I can see, putting RocksDBs on logical volumes offers a lot of flexibility.
Migrating an existing RocksDB to a new block device
But what if the initial decisions were different? And even if you want to stick with partitions: what do you do if the underlying device starts to fail?
The “supported” answer probably is: Create a new OSD and let Ceph migrate the data. But in fact this will cause a lot of data movement and likely put some strain on the Ceph cluster. So if all you want to do (or even need to do) is to move the RocksDB to a new block device, this approach looks like a big gun pointing at a small fly. Been there, done that.
Looking at the content of the OSD's "directory" on your Ceph server, you'll notice a symlink "/var/lib/ceph/osd/ceph-*/block.db" pointing to the current RocksDB block device. So for a first test, you may be tempted to stop the OSD, copy the content of the current block device (i.e. some partition) to its new location (i.e. a same-sized LV) and update the symlink to point to the new device. After restarting the OSD, you'll notice it's up and running again. Was it really that easy? No, it wasn't: if you look at the list of open file handles of the newly started OSD process, you'll see it is still accessing the old block device, and the OSD's metadata still points to the old device as well. While you have changed the symlink, the OSD keeps using the old RocksDB block device. Glad you haven't zapped it yet, aren't you?
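If you want to check for yourself which RocksDB device a running OSD really uses, here is a quick sketch (fill in the OSD's PID and id yourself; the metadata keys shown are what recent BlueStore releases report and may vary):

    # open block-device handles of the OSD process
    lsof -p <osd-pid> | grep -e /dev/sd -e /dev/dm-
    # the OSD's own metadata, as reported to the monitors
    ceph osd metadata 3 | grep bluefs_db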
That symlink is actually derived information, generated from the OSD's metadata. Searching around, you'll probably stumble across a tool called "ceph-bluestore-tool". It's a sort of Swiss army knife for BlueStore, so you'd better be careful when exploring its possibilities; but for the sake of this blog article, let's just say it's the right tool for the job:
ceph-bluestore-tool set-label-key --dev <dataDev> -k path_block.db -v <newDbDev>
Once that's set (while the OSD is stopped), you can restart the OSD and it will use the new RocksDB block device, as if it had never been anywhere else. And there is no rebalancing of the Ceph cluster.
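To verify the change before restarting the OSD, you can dump the BlueStore labels of the main device and check that "path_block.db" now points to the new location, e.g.:

    ceph-bluestore-tool show-label --dev <dataDev> | grep path_block.db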
The full procedure
Let's not leave it unmentioned that, especially with logical volumes, a bit more is done by tools such as "ceph-volume" when setting up a new OSD; since we're emulating that, those extra steps need to be done as well when changing the RocksDB block device. So here's the list of steps we'd recommend:
1. Check that both your Ceph cluster (and you, as the operator) are healthy. While the procedure documented here has been used many times, there's still a chance that something will go wrong, and you won't want more than a single OSD to drop out of your cluster, unless you have a really high replication count…
2. Check that the OSD in question really is a BlueStore OSD and already has a separate RocksDB block device. If there's no "block.db" symlink in your OSD's directory, then stop now and re-check item #1:
    ceph-2:~ # ls -l /var/lib/ceph/osd/ceph-3/block*
    lrwxrwxrwx 1 ceph ceph 93  4. Apr 13:01 block -> /dev/ceph-4a2d2d43-2a5e-4a9f-ae69-b84d4887b3f6/osd-block-774b5027-ecdd-4cc5-b4f3-cec5bfd2a2eb
    lrwxrwxrwx 1 root root 29  4. Apr 13:06 block.db -> /dev/sdc1
    ceph-2:~ # cat /var/lib/ceph/osd/ceph-3/bluefs
    1
3. Determine the size of the current RocksDB block device. This procedure will *not* let you change that size, neither shrink nor grow it:
    ceph-2:~ # blockdev --getsize64 /dev/sdc1
    314572800
4. Create a logical volume (or other block device) of the same size as the original device to serve as the new location for the RocksDB. "Larger" will work, but wastes space for now; "smaller" will likely make your OSD fail, now or later:
    ceph-2:~ # lvcreate -n journal-osd3 -L314572800b ceph-journals
    Logical volume "journal-osd3" created.
5. Set the "noout" flag, since we'll be stopping the OSD in question and don't want any re-balancing to occur (this and the other steps without a listing are sketched below, after the procedure).
6. Stop the OSD in question.
7. Just for good measure, flush the OSD's journal:
    ceph-2:~ # ceph-osd -i 3 --flush-journal
8. Copy the RocksDB from the old to the new device, e.g. via "dd":
    ceph-2:~ # dd if=/dev/sdc1 of=/dev/ceph-journals/journal-osd3 bs=1M
9. Change ownership of the new RocksDB device to "ceph:ceph":
    ceph-2:~ # chown ceph:ceph $(realpath /dev/ceph-journals/journal-osd3)
10. Update the symlink /var/lib/ceph/osd/ceph-$osdnr/block.db to point to the new device:
    ceph-2:~ # rm /var/lib/ceph/osd/ceph-3/block.db
    ceph-2:~ # ln -s /dev/ceph-journals/journal-osd3 /var/lib/ceph/osd/ceph-3/block.db
11. If you have not rebooted your server since OSD creation, you may see a file /var/lib/ceph/osd/ceph-$osdnr/path_block.db. Just move it aside, for good measure.
12. Update the LV tag on the main device to point to the new RocksDB block device:
    ceph-2:~ # lvchange --deltag "ceph.db_device=/dev/sdc1" /var/lib/ceph/osd/ceph-3/block
    ceph-2:~ # lvchange --addtag "ceph.db_device=/dev/ceph-journals/journal-osd3" /var/lib/ceph/osd/ceph-3/block
13. Update BlueStore's meta-data to point to the new RocksDB block device:
    ceph-2:~ # ceph-bluestore-tool set-label-key --dev /var/lib/ceph/osd/ceph-3/block -k path_block.db -v /dev/ceph-journals/journal-osd3
14. Make sure the OSD still has its keyring data (updated May 25, 2018):
    ceph-2:~ # ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-3/block | grep osd_key
    "osd_key": "AQBQov5amnLSOhAACfWs94hHctLraAe51xOD8w==",
15. (optional) If 'osd_key' is missing, set the label:
    ceph-2:~ # ceph auth get osd.3 | grep key
    exported keyring for osd.3
    key = AQBQov5amnLSOhAACfWs94hHctLraAe51xOD8w==
    ceph-2:~ # ceph-bluestore-tool set-label-key --dev /var/lib/ceph/osd/ceph-3/block -k osd_key -v AQBQov5amnLSOhAACfWs94hHctLraAe51xOD8w==
16. Start the OSD.
17. Monitor Ceph to check that the OSD restarts successfully and all PGs recover.
18. Unset the "noout" flag.
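For completeness, the steps without an example listing above (5, 6, 11, 16, 17 and 18) boil down to standard commands. A minimal sketch, assuming OSD 3 and a systemd-managed cluster; adjust the OSD id to your environment:

    # step 5: prevent re-balancing while the OSD is down
    ceph osd set noout
    # step 6: stop the OSD
    systemctl stop ceph-osd@3
    # step 11: move the stale file aside (only if it exists)
    mv /var/lib/ceph/osd/ceph-3/path_block.db /var/lib/ceph/osd/ceph-3/path_block.db.bak
    # step 16: start the OSD again
    systemctl start ceph-osd@3
    # step 17: watch until the OSD is up and all PGs are active+clean
    ceph -s
    # step 18: allow re-balancing again
    ceph osd unset noout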
Depending on the size (and speed) of your RocksDB device, the whole procedure may well be done within a few minutes, so the actual impact on the Ceph cluster (i.e. on redundancy) is pretty low.
Depending on where you come from, you may run into a situation where your current RocksDB device cannot be copied at all anymore. We have, and it came as a surprise (we only noticed when we tried step #8, "dd"). As we didn't want a full rebalance of the Ceph cluster, we came up with a way to "re-format" an existing OSD while changing the RocksDB device at the same time. But that's for another blog article.
So, yes, things may go wrong. We have created a script for these tasks, including result checks, but the above procedure is not officially supported by Ceph and many, many things can go wrong, simply because something is special in *your* environment. We have therefore decided not to publish the script, but rather to leave the description sufficiently unspecific, so that at least some experience running Ceph is required to complete the steps. If you cannot work it out yourself, please take the official route: create a new OSD and let Ceph migrate the data afterwards.
Update (May 25, 2018):
The procedure has been expanded with two additional steps (steps 14 and 15) in order to raise awareness of the LVM tags and BlueStore labels of your OSDs. Missing or wrong tags/labels can prevent OSDs from starting, resulting in an unhealthy cluster. Check the existing tags/labels of OSDs that have been set up correctly, compare them to the OSDs you need to migrate, and be careful when changing them. Alfredo Deza pointed out the risks on the mailing list.
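For comparison against a known-good OSD, those tags and labels can be inspected like this (a sketch; adjust the OSD number to your setup):

    # LVM tags of the OSDs' data and DB volumes
    lvs -o lv_name,vg_name,lv_tags
    # BlueStore labels of an OSD's main device
    ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-3/block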
Disclaimer, and please leave a comment below
And as always with such articles: the steps above do work for us, but may not work for you. So if anything goes wrong while you try to reproduce the above procedure, it's not our fault, it's yours. And it's yours to fix! But whether it works for you or not, please leave a comment below so that others will know.