Cephadm: Reusing OSDs on reinstalled server

This is my second blog post about cephadm, the (relatively) new tool to deploy and manage Ceph clusters. From time to time I feel challenged by questions on the ceph-users mailing list or from customers, and then I try to find a solution to that particular problem. For example, my first post about cephadm dealt with the options to change a monitor’s IP address. This post briefly describes how you can reactivate the OSDs of a reinstalled server.

This (virtual) lab environment was based on SUSE Enterprise Storage 7 and podman (Ceph Octopus, version 15.2.8). I won’t go into too much detail and will spare you most of the command line output; this article is meant to show the general idea, not to provide step-by-step instructions. This is also just one way to do it; I haven’t tried other ways (yet) and there might be smoother procedures. If you have better ideas or other remarks, please leave a comment. I’d be happy to try a different approach and update this blog post.

Background

The title already reveals the background of this question. If the operating system (OS) of one of your OSD servers breaks and you need to reinstall it, there are two options for dealing with the OSDs on that server. The first is to let the cluster rebalance (which is usually the way to go, that’s what Ceph is designed for), reinstall the OS, wipe the OSDs and add the node back to the cluster, which will again trigger a remapping of placement groups (PGs). The second is to reinstall the OS and reactivate the existing OSDs, which is what this post is about.

Prior to cephadm and containerized services (but no older than Luminous) it was quite straightforward to bring back OSDs from a reinstalled host: ‘ceph-volume’ would do almost everything for you. Until there’s a comparable solution within the Ceph orchestrator, however, I currently only see this “hacky” way.

Possible Problem(s)

To prevent Ceph from remapping you would need to set the noout flag, assuming you noticed the server failure in time. This means your PGs will be degraded, and the risk of data loss increases in case other disks or hosts fail. Depending on the install mechanisms, a new installation of a server can be performed quite quickly, which reduces that risk, especially if you consider that remapping an entire host also means a degraded cluster for quite some time. So it might actually be faster to reinstall a server and reactivate its OSDs than to wait for the remapping to finish. It’s basically up to the cluster’s administrator which strategy is best for the individual setup.
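For reference, the flag is set and cleared cluster-wide with the standard Ceph commands; set it as soon as you notice the failure and don’t forget to unset it once the OSDs are back up. The admin:~ # prompt below stands for any node with a client.admin keyring, since the failed host itself is obviously not available at that point:

admin:~ # ceph osd set noout
admin:~ # ceph osd unset noout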

Please also consider that, depending on the cluster details (number of OSDs, stored data, cluster I/O, failure domains, etc.), the PGs on the down OSDs could be outdated to such an extent that adding them back to the cluster causes more load than adding new OSDs would.

Solution

If you decide to retain those down OSDs and bring them back online, these are the basic steps to achieve that.

My virtual lab environment is running in OpenStack, so to simulate a node failure I just deleted a virtual machine (VM), the OSD volumes were not deleted, of course. Then I launched a new VM, prepared it for Ceph usage and attached the OSD volumes to it.
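I won’t cover the reinstallation itself, but for orientation this is roughly how a freshly installed node is (re)joined to a cephadm cluster, assuming it reuses its old hostname and the commands are run from a node with an admin keyring (admin:~ # again being a stand-in for such a node):

admin:~ # ceph cephadm get-pub-key > ~/ceph.pub
admin:~ # ssh-copy-id -f -i ~/ceph.pub root@ses7-host1
admin:~ # ceph orch host add ses7-host1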

After the host was added to Ceph, a “crash” container was already successfully deployed, so cephadm seemed to work properly, which the ceph-volume command confirmed:

ses7-host1:~ # cephadm ceph-volume lvm list
Inferring fsid 7bdffde0-623f-11eb-b3db-fa163e672db2
Using recent ceph image registry.suse.com/ses/7/ceph/ceph:latest
[...]

Although the OSD activation with ceph-volume failed, I had the required information about those down OSDs:

  • Path to block devices (data, db, wal)
  • OSD FSID
  • OSD ID
  • Ceph FSID
  • OSD keyring

Four of those five properties can be collected from the cephadm ceph-volume lvm list output. The OSD keyring can be obtained from ceph auth get osd.<ID>.
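For example, for osd.6 from the directory listing further below, the keyring is retrieved like this (key redacted; the exact caps may differ depending on how the OSD was created):

admin:~ # ceph auth get osd.6
[osd.6]
        key = <REDACTED>
        caps mgr = "allow profile osd"
        caps mon = "allow profile osd"
        caps osd = "allow *"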

Since the crash container was already present, the required parent directory (/var/lib/ceph/<CEPH_FSID>/) existed as well; for the rest I used a different OSD server as a template. These are the files I copied from that server (except for the block and block.db devices, of course), as sketched after the listing:

ses7-host1:~ # ls -1 /var/lib/ceph/7bdffde0-623f-11eb-b3db-fa163e672db2/osd.6/
block
block.db
ceph_fsid
config
fsid
keyring
ready
require_osd_release
type
unit.configured
unit.created
unit.image
unit.poststop
unit.run
whoami
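A rough sketch of the copy step; ses7-host2 and osd.3 are just illustrative names for another OSD host of the same cluster and one of its healthy OSDs, and the block and block.db symlinks are excluded because they get recreated in a later step:

ses7-host1:~ # rsync -a --exclude 'block*' ses7-host2:/var/lib/ceph/<CEPH_FSID>/osd.3/ /var/lib/ceph/<CEPH_FSID>/osd.6/

The copied files still reference the template OSD at this point, which is why the next step adjusts them.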

I only needed to replace the contents of these five files with the correct keyring, OSD FSID and OSD ID (see the sketch after the list):

  • fsid
  • keyring
  • whoami
  • unit.run
  • unit.poststop
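To illustrate, this is roughly how the contents could be fixed up. <TEMPLATE_OSD_ID> and <TEMPLATE_OSD_FSID> stand for the values of the OSD that served as a template, the keyring file gets the output of the ceph auth get command shown earlier pasted in, and it’s worth double-checking unit.run and unit.poststop afterwards to make sure no reference to the template OSD is left:

ses7-host1:/var/lib/ceph/<CEPH_FSID>/osd.<OSD_ID>/ # echo "<OSD_FSID>" > fsid
ses7-host1:/var/lib/ceph/<CEPH_FSID>/osd.<OSD_ID>/ # echo "<OSD_ID>" > whoami
ses7-host1:/var/lib/ceph/<CEPH_FSID>/osd.<OSD_ID>/ # vi keyring
ses7-host1:/var/lib/ceph/<CEPH_FSID>/osd.<OSD_ID>/ # sed -i -e 's/osd\.<TEMPLATE_OSD_ID>/osd.<OSD_ID>/g' -e 's/<TEMPLATE_OSD_FSID>/<OSD_FSID>/g' unit.run unit.poststop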

The next step was to create the symbolic links pointing to the correct block and block.db devices and change their ownership:

ses7-host1:/var/lib/ceph/<CEPH_FSID>/osd.<OSD_ID>/ # ln -s /dev/ceph-<VG>/osd-block-<LV> block

ses7-host1:/var/lib/ceph/<CEPH_FSID>/osd.<OSD_ID>/ # ln -s /dev/ceph-<DB_VG>/osd-db-<DB_LV> block.db

ses7-host1:~ # chown -R ceph.ceph /var/lib/ceph/<CEPH_FSID>/osd.<OSD_ID>/

And finally start the systemd unit:

ses7-host1:~ # systemctl start ceph-<CEPH_FSID>@osd.<OSD_ID>.service
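If the unit starts cleanly, the OSD should report as up and in again shortly afterwards. I verified it roughly like this (the ceph commands again run from a node with an admin keyring):

ses7-host1:~ # systemctl status ceph-<CEPH_FSID>@osd.<OSD_ID>.service
admin:~ # ceph osd tree
admin:~ # ceph -s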

After the first OSD started successfully I repeated this for all remaining OSDs on that server, and all of them came back online without an issue. This has not been tested with encrypted OSDs, though, so I’m not sure what else is necessary in that case, but maybe this procedure helps in figuring that out. I also don’t know if there’s a smoother or even automated way to achieve this; I don’t think there currently is. Maybe (hopefully) someone is working on it, though.

Disclaimer

The described steps have been executed in a lab environment. They worked for me, but they might not work for you. If anything goes wrong while you try to reproduce the procedure, it’s not my fault, but yours. And it’s yours to fix!
