This is my second blog post about cephadm, the (relatively) new tool to deploy and manage Ceph clusters. From time to time I feel challenged by questions on the ceph-users mailing list or from customers, and then I try to find a solution to a particular problem. For example, my first post about cephadm dealt with the options to change a monitor's IP address. This post briefly describes how you can reactivate OSDs from a reinstalled server.
This (virtual) lab environment was based on SUSE Enterprise Storage 7 and podman (Ceph Octopus version 15.2.8). I won't go into too much detail and will spare you most of the command-line output; this article is meant to show the general idea, not to provide step-by-step instructions. This is also just one way to do it; I haven't tried other ways (yet) and there might be smoother procedures. If you have better ideas or other remarks, please leave a comment; I'd be happy to try a different approach and update this blog post.
Background
The title already reveals the background of this question. If the operating system (OS) of one of the OSD servers breaks and you need to reinstall it, there are two options for dealing with the OSDs on that server. Either you let the cluster rebalance (which is usually the way to go; that's what Ceph is designed for), reinstall the OS, then wipe the OSDs and add the node back to the cluster, which will again trigger a remapping of placement groups (PGs). Or you reinstall the OS and reactivate the existing OSDs, which is what this post is about.
Prior to cephadm and containerized services (but not older than Luminous) it was quite straightforward to bring back OSDs from a reinstalled host: ceph-volume would do almost everything for you. But until there's a solution within the Ceph orchestrator, I currently only see this “hacky” way.
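For comparison, on such a non-containerized host the reactivation after the reinstall essentially boiled down to a single ceph-volume call, run as root on the OSD host:

ceph-volume lvm activate --all

With cephadm-managed, containerized OSDs this shortcut no longer applies, hence the manual steps described below.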
Possible Problem(s)
To prevent Ceph from remapping you would need to set the noout flag, assuming you noticed the server failure in time. This means your PGs will be degraded, and the risk of data loss increases should other disks or hosts fail. On the other hand, depending on the install mechanism, a new installation of the server can be performed quite quickly, which reduces that risk, especially if you consider that remapping an entire host also means a degraded cluster for quite some time. So it might actually be faster to reinstall the server and reactivate its OSDs than to wait for the remapping to finish. It's basically up to the cluster's administrator to decide which strategy suits the individual setup best.
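For reference, the flag is set as soon as you notice the failure and cleared again once the OSDs are back online; both commands can be run from any node with an admin keyring (or inside a cephadm shell):

ceph osd set noout
ceph osd unset noout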
Please also consider that, depending on the cluster details (number of OSDs, stored data, cluster I/O, failure domains, etc.), the PGs on the down OSDs could be outdated to such an extent that adding them back to the cluster causes more load than adding new OSDs would.
Solution
If you decide to retain those down OSDs and bring them back online, these are the basic steps to achieve that.
My virtual lab environment is running in OpenStack, so to simulate a node failure I just deleted a virtual machine (VM), the OSD volumes were not deleted, of course. Then I launched a new VM, prepared it for Ceph usage and attached the OSD volumes to it.
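Re-adding the prepared host to the cluster goes through the orchestrator; a minimal sketch, assuming the cluster's SSH key had already been copied to the new VM and reusing the hostname from this lab:

ceph orch host add ses7-host1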
After the host was added to Ceph, a “crash” container was already deployed successfully, so cephadm seemed to work properly, which was confirmed by the ceph-volume command:
ses7-host1:~ # cephadm ceph-volume lvm list
Inferring fsid 7bdffde0-623f-11eb-b3db-fa163e672db2
Using recent ceph image registry.suse.com/ses/7/ceph/ceph:latest
[...]
Although the OSD activation with ceph-volume failed, I had the required information about those down OSDs:
- Path to block devices (data, db, wal)
- OSD FSID
- OSD ID
- Ceph FSID
- OSD keyring
Four of those five properties can be collected from the cephadm ceph-volume lvm list output. The OSD keyring can be obtained from ceph auth get osd.<ID>.
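The output of that auth command should look roughly like this (key shortened; osd.6 from the listing below used as an example):

[osd.6]
        key = AQ...==
        caps mgr = "allow profile osd"
        caps mon = "allow profile osd"
        caps osd = "allow *"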
Since the crash container was already present the required parent directory was also present, for the rest I used a different OSD server as a template. These are the files I copied from a different server (except for the block and block.db devices, of course):
ses7-host1:~ # ls -1 /var/lib/ceph/7bdffde0-623f-11eb-b3db-fa163e672db2/osd.6/
block
block.db
ceph_fsid
config
fsid
keyring
ready
require_osd_release
type
unit.configured
unit.created
unit.image
unit.poststop
unit.run
whoami
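One possible way to get those template files onto the new host is a plain rsync from a healthy OSD directory on another node, leaving out the device links (ses7-host2 and <OTHER_OSD_ID> are purely hypothetical placeholders here):

rsync -av --exclude block --exclude block.db \
    ses7-host2:/var/lib/ceph/<CEPH_FSID>/osd.<OTHER_OSD_ID>/ \
    /var/lib/ceph/<CEPH_FSID>/osd.<OSD_ID>/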
I only needed to replace the contents of these five files with the correct keyring, OSD FSID, and OSD ID (a rough sketch follows the list):
- fsid
- keyring
- whoami
- unit.run
- unit.poststop
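A rough sketch of how that could look on the shell, assuming a working ceph CLI on the host (otherwise run the auth call inside a cephadm shell) and assuming the template OSD's ID and FSID still appear verbatim in the copied unit files (<OLD_OSD_ID> and <OLD_OSD_FSID> stand for those template values):

cd /var/lib/ceph/<CEPH_FSID>/osd.<OSD_ID>/
echo <OSD_FSID> > fsid                   # this OSD's FSID from the lvm list output
echo <OSD_ID> > whoami                   # this OSD's ID
ceph auth get osd.<OSD_ID> -o keyring    # fetch the OSD keyring
sed -i "s/osd\.<OLD_OSD_ID>/osd.<OSD_ID>/g;s/<OLD_OSD_FSID>/<OSD_FSID>/g" unit.run unit.poststop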
The next step was to create the symbolic links pointing to the correct block and block.db devices and change their ownership:
ses7-host1:/var/lib/ceph/<CEPH_FSID>/osd.<OSD_ID>/ # ln -s /dev/ceph-<VG>/osd-block-<LV> block
ses7-host1:/var/lib/ceph/<CEPH_FSID>/osd.<OSD_ID>/ # ln -s /dev/ceph-<DB_VG>/osd-db-<DB_LV> block.db
ses7-host1:~ # chown -R ceph.ceph /var/lib/ceph/<CEPH_FSID>/osd.<OSD_ID>/
And finally start the systemd unit:
ses7-host1:~ # systemctl start ceph-<CEPH_FSID>@osd.<OSD_ID>.service
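Whether the OSD really rejoined the cluster can then be verified with the usual checks, for example:

systemctl status ceph-<CEPH_FSID>@osd.<OSD_ID>.service
ceph osd tree
ceph -s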
After the first OSD started successfully I repeated this for all remaining OSDs on that server, and all of them came back online without an issue. This has not been tested with encrypted OSDs, though, so I'm not sure what else is necessary in that case, but maybe this procedure helps figure that out. I also don't know if there's a smoother or even automated way to achieve this; I don't think there currently is. Maybe (hopefully) someone is working on it, though.
Disclaimer
The described steps have been executed in a lab environment. They worked for me but they might not work for you. If anything goes wrong while you try to reproduce the procedure, it's not my fault, but yours. And it's yours to fix!