A frequent question in the community is: what do I need to do when the operating system of one of my Ceph servers fails, but the OSDs are intact? Can I revive it?
The answer is yes! And it’s quite easy to do!
A few years ago, I wrote an article about the general idea of how to do that. But the process has become much easier, so I decided to write a new blog post.
Although the docs cover the procedure in general, I wanted to add some more details to give a bit more context.
This procedure isn’t exclusive to a host failure: we just reinstalled all our Ceph servers on faster SSD drives for the operating system (“OS”). The required steps are a combination of the procedure to add a new host and the ceph cephadm osd activate <host>... command.
The OS installation is not covered in this post.
After you have successfully installed the OS, you need to configure the host so the orchestrator is able to manage it. Our Ceph servers run on openSUSE, the package manager is zypper, and we use podman. Adapt the required commands to your OS with its package manager and your preferred container engine.
The reinstalled server is “ceph04”. The Ceph commands to reintegrate “ceph04” are executed on “ceph01”, a host with an admin keyring.
# Install required packages
ceph04:~ # zypper in cephadm podman
# Retrieve public key
ceph01:~ # ceph cephadm get-pub-key > ceph.pub
# Copy key to ceph04
ceph01:~ # ssh-copy-id -f -i ceph.pub root@ceph04
# Retrieve private key to test connection
ceph01:~ # ceph config-key get mgr/cephadm/ssh_identity_key > ceph-private.key
# Modify permissions
ceph01:~ # chmod 400 ceph-private.key
# Test login
ceph01:~ # ssh -i ceph-private.key ceph04
Have a lot of fun...
ceph04:~ #
# Clean up
ceph01:~ # rm ceph.pub ceph-private.key
Since the host should still be in the host list, you don’t need to add it. As soon as the reinstalled host is reachable by the orchestrator (ceph orch host ls doesn’t show the host status as offline or maintenance), cephadm will try to deploy missing daemons to that host. If you run your own container registry, the automatic deployment of the missing daemons will fail until the host has successfully logged in to the registry. So we instruct the orchestrator to execute a login for each host:
ceph cephadm registry-login my-registry.domain <user> <password>
Shortly after the orchestrator has performed the registry login, the missing daemons should be successfully deployed to the host, for example crash, node-exporter and all other daemons it used to run before the failure.
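Before moving on, it can help to confirm that the orchestrator really sees the host as reachable. The following is only a minimal sketch: host_ready is a hypothetical helper, not part of Ceph, and it assumes that ceph orch host ls prints one line per host with the host name in the first column and the word Offline or Maintenance somewhere after the address when the host is not available.

```shell
# Hypothetical helper (not a Ceph command): reads `ceph orch host ls` output
# on stdin and succeeds only if the named host is listed and none of the
# fields after the address reads "offline" or "maintenance".
host_ready() {
    awk -v h="$1" '
        $1 == h {
            found = 1
            for (i = 3; i <= NF; i++) {
                s = tolower($i)
                if (s == "offline" || s == "maintenance") bad = 1
            }
        }
        END { if (found && !bad) exit 0; exit 1 }
    '
}

# Usage on the admin host:
#   ceph orch host ls | host_ready ceph04 && echo "ceph04 is reachable"
```

A host that carries a label named exactly "offline" or "maintenance" would be reported as not ready, so treat the result as a hint and look at the full listing if in doubt.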
If that all works, you can activate the existing OSDs simply by running:
ceph cephadm osd activate ceph04
And that’s basically it: the OSDs should boot one after the other. There might be some additional steps required, depending on which daemons are supposed to run on that host, but I’m only focusing on OSD daemons here.
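To watch the OSDs come back, you can check ceph osd tree repeatedly until none of them is reported as down. As a rough sketch, count_down_osds below is a hypothetical helper, not part of Ceph, which assumes the usual ceph osd tree layout where the status (up or down) directly follows the OSD name (osd.N):

```shell
# Hypothetical helper (not a Ceph command): reads `ceph osd tree` output on
# stdin and prints how many OSD lines have "down" right after the OSD name.
count_down_osds() {
    awk '
        {
            for (i = 1; i < NF; i++)
                if ($i ~ /^osd\.[0-9]+$/ && $(i + 1) == "down") n++
        }
        END { print n + 0 }
    '
}

# Usage on the admin host:
#   ceph osd tree | count_down_osds
```

The helper counts down OSDs across the whole cluster, not only on the reinstalled host, so a value above zero may also point to an unrelated OSD.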
Disclaimer: This procedure has worked many times for us, but it might not work for you.