A few weeks ago I helped a Ceph user recover his broken cluster (see this thread). Basically, after his monitors stopped working, he re-deployed a new cluster with the same Ceph FSID and attached the existing OSDs to the re-deployed hosts. But it’s not that easy to re-activate those OSDs because the new monitors don’t have the old osdmap, hence they don’t know anything about the existing OSDs. So how did we fix that?
This scenario can be considered a total monitor store failure. There’s a documented procedure in the upstream docs, but unfortunately it’s written for non-cephadm clusters and doesn’t go into much detail about the further steps. If the daemons run within containers, there are more things to consider. So I decided to write this blog post and add some more details to the procedure, targeting mainly clusters managed by cephadm. But the recovery procedure doesn’t only require cephadm-specific commands, so it can be considered a general guideline for how to recover from a mon store loss, just with some extra details about cephadm deployments.
The procedure covered by said docs is already written in script form, so I used it as a template, extended it for cephadm usage and included some very basic logging. It collects the osdmaps from all OSDs and contains the necessary considerations regarding containers. I decided to only automate the osdmap collection, not all required steps of the store rebuild procedure (e.g. mon store rebuild, client auth, etc.) because the result of the collection has to be inspected carefully before proceeding. Here’s the script I used on one of my lab clusters to collect the osdmaps:
```shell
#!/bin/bash
# /usr/local/bin/ceph-collect-osdmaps.sh

: <<'END'
This script does:

1. Login to each OSD host
2. Create a temporary mon-store in /tmp after cleaning it up
3. Create a temporary mon-store directory within each OSD's filesystem
   (which is then mapped into the cephadm shell) after cleaning it up
4. Collect the osdmaps from each OSD and sync it to each host's temporary mon-store
5. Sync each host's temporary mon-store to the central mon-store
END

# List of all OSD hosts
hosts="squid1 squid2 squid3"

# Ensure this user has passwordless sudo on all OSD hosts (if not root)
# It's also easier if the user has passwordless login, e.g. via ssh public key
user="root"

# Cluster FSID
ceph_fsid="df500e26-1930-11f1-a79f-fa163e65a168"

# Mon-store and logging directories
ms_central="/tmp/mon-store-central"
log_central="/tmp/central-osd-logs"
logdir="/tmp/logdir"
ms_collected="/tmp/mon-store-collected"

# Ensure all OSDs are stopped before proceeding

# Clean up the central mon-store and log directory
rm -rf "$ms_central"
rm -rf "$log_central"
mkdir -p "$log_central"

for host in $hosts; do
  # Unescaped variables are expanded locally before the heredoc is sent,
  # escaped ones (\$) are expanded on the remote host.
  ssh "$user@$host" <<EOF
logdir="$logdir"
ms_collected="$ms_collected"
ceph_fsid="$ceph_fsid"
osd_dir="/var/lib/ceph/\$ceph_fsid"

rm -rf \$ms_collected
rm -rf \$logdir
mkdir -p \$logdir

echo -e "\nHost: \$(hostname -s)"

# ".[][]" is also accepted by jq versions older than 1.7
for osd in \$(cephadm ceph-volume lvm list --format json 2>/dev/null | jq -r '.[][].tags | ."ceph.osd_id"'); do
  echo -e "\nProcessing osd.\$osd ..."
  rm -rf \$osd_dir/osd.\$osd/mon-store-collect
  mkdir -p \$osd_dir/osd.\$osd/mon-store-collect
  # Inside the container the OSD directory is mapped to /var/lib/ceph/osd/ceph-<id>
  cephadm shell --name osd.\$osd -- ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-\$osd --no-mon-config --op update-mon-db --mon-store-path /var/lib/ceph/osd/ceph-\$osd/mon-store-collect 2> >(tee \$logdir/osd.\$osd.\$(hostname -s).log | grep -vEi "quay|infer|using|creating")
  rsync -qaz \$osd_dir/osd.\$osd/mon-store-collect/. \$ms_collected
done
EOF

  # Sync the host's collected mon-store and logs to the central directories
  rsync -qaz -e ssh "$user@$host:$ms_collected/." "$ms_central"
  rsync -qav -e ssh "$user@$host:$logdir/." "$log_central"
done
```
If not all osdmaps are collected successfully, the rebuilt monitor store will be incomplete and some OSDs won’t be able to join the cluster. And there are many more potential obstacles before the OSDs can be activated, so it’s critical to review the logs from each host after the osdmap collection script has finished (in case of errors) and then continue step by step.
Collect the osdmaps
There are a couple of assumptions made in the script which you will need to adapt to your requirements: for example, the list of hosts, the Ceph FSID and the user who logs in to each host. All OSDs have to be stopped before running it, otherwise the ceph-objectstore-tool will fail during the collection. I recommend stopping all other daemons (mgr, mds, ceph-exporter, etc.) as well before starting the recovery, otherwise you might get stuck with a stale mgr.
The procedure has been tested on two different Squid lab clusters, one on 19.2.3, the other one on 19.2.4. Here’s an example of the script output:
```
Host: squid1

Processing osd.1 ...
osd.1   : 0 osdmaps trimmed, 226 osdmaps added.

Processing osd.2 ...
osd.2   : 0 osdmaps trimmed, 228 osdmaps added.

Processing osd.0 ...
osd.0   : 0 osdmaps trimmed, 223 osdmaps added.
```
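Since the collection result has to be reviewed before going any further, a small check over the output can help. The following is only a sketch: `check_osdmap_collection` is a hypothetical helper (not part of the script above) that reads the result lines from stdin and reports every expected OSD that has no result line or reports 0 added osdmaps. Treating "0 added" as suspicious is my assumption.

```shell
#!/bin/bash
# Hypothetical helper: verify that every expected OSD reported added osdmaps.
# Reads lines like "osd.1 : 0 osdmaps trimmed, 226 osdmaps added." from stdin.
check_osdmap_collection() {
    local expected_osds="$1"   # space-separated OSD ids, e.g. "0 1 2"
    local output rc=0 id added
    output="$(cat)"
    for id in $expected_osds; do
        # Extract the "added" counter from the result line of this OSD
        added="$(printf '%s\n' "$output" \
            | sed -n "s/^osd\.$id[[:space:]]*: .* \([0-9][0-9]*\) osdmaps added\.\$/\1/p" \
            | tail -n 1)"
        if [ -z "$added" ]; then
            echo "osd.$id: no result line found"
            rc=1
        elif [ "$added" -eq 0 ]; then
            echo "osd.$id: 0 osdmaps added"
            rc=1
        else
            echo "osd.$id: $added osdmaps added"
        fi
    done
    return $rc
}
```

For example, you could save the script output to a file and run `check_osdmap_collection "0 1 2" < collection-output.txt`, where the file name and the list of OSD ids are placeholders for your own values.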
Ceph FSID
Depending on how much data is actually lost, there are several ways to retrieve the Ceph FSID (to keep the output brief, I use $FSID instead of the actual UUID):
- ceph.conf: If there’s any ceph.conf left on any host (clients also have a ceph.conf), it will contain the FSID. Alternatively, since every daemon also has a copy of the ceph.conf, check any daemon config file in /var/lib/ceph/$FSID/osd.X/config if the filesystem is intact.
- LV tags: On any OSD host, run the following command and look for the tag ceph.cluster_fsid=$FSID: lvs -o tags
- ceph-volume: On any OSD host, run the command: cephadm ceph-volume lvm list | grep "cluster fsid"
Only run “bare” ceph-volume commands if your cluster is not managed by cephadm! Otherwise you’ll most likely end up with more chaos!
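To illustrate the ceph.conf variant, here is a minimal sketch that extracts the FSID from a config file. `fsid_from_conf` is a hypothetical helper name, and it assumes the usual `fsid = <uuid>` line in the file.

```shell
#!/bin/bash
# Hypothetical helper: print the first fsid found in a ceph.conf-style file.
fsid_from_conf() {
    # Match "fsid = <36-character UUID>", ignoring leading whitespace
    sed -n 's/^[[:space:]]*fsid[[:space:]]*=[[:space:]]*\([0-9a-fA-F-]\{36\}\).*$/\1/p' "$1" \
        | head -n 1
}
```

Usage would look like `fsid_from_conf /etc/ceph/ceph.conf`, or with the path of any daemon config file that survived.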
OSD FSID
In some failure scenarios you might need the FSID of each OSD to rebuild an OSD’s directory structure. To collect those, you can also use ceph-volume or LV tags:
- LV tags: On any OSD host, run the following command and look for the tag ceph.osd_fsid=$FSID: lvs -o tags
- ceph-volume: On any OSD host, run the command: cephadm ceph-volume lvm list | grep "osd fsid"
Only run “bare” ceph-volume commands if your cluster is not managed by cephadm! Otherwise you’ll most likely end up with more chaos!
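If there are many OSDs, picking the values out of the LV tags by eye gets tedious. The sketch below pairs each OSD id with its OSD FSID. `osd_fsids_from_tags` is a hypothetical helper, and it assumes the comma-separated `key=value` tag format that `lvs -o tags` prints for ceph-volume managed LVs.

```shell
#!/bin/bash
# Hypothetical helper: read "lvs -o tags" output on stdin and
# print "<osd_id> <osd_fsid>" once per OSD.
osd_fsids_from_tags() {
    local line id fsid
    while IFS= read -r line; do
        # Split the tag list into one tag per line and pick the two we need
        id="$(printf '%s\n' "$line" | tr ',' '\n' | sed -n 's/^[[:space:]]*ceph\.osd_id=//p')"
        fsid="$(printf '%s\n' "$line" | tr ',' '\n' | sed -n 's/^[[:space:]]*ceph\.osd_fsid=//p')"
        [ -n "$id" ] && [ -n "$fsid" ] && echo "$id $fsid"
    done | sort -u   # an OSD with several LVs (e.g. block and db) is listed only once
}
```

A possible invocation is `lvs --noheadings -o tags | osd_fsids_from_tags`.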
SSH user
The easiest way is to have a user that can log in to each host without a password and has (passwordless) sudo permissions to execute the necessary commands. Depending on your Ceph deployment, you could log in as the cephadm user, which usually has the mentioned permissions. Often the root user is configured as the cephadm user, as it is in the lab cluster in which I tested this procedure.
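Before running the collection, it can be worth verifying these requirements on every host. This is a rough sketch: `check_ssh_access` and the `SSH_CMD` override are hypothetical names I introduce here, the latter only so the check can be dry-run without real hosts.

```shell
#!/bin/bash
# SSH_CMD is a hypothetical override for dry runs; it defaults to ssh.
SSH_CMD="${SSH_CMD:-ssh}"

# Hypothetical helper: check passwordless login and passwordless sudo on each host.
check_ssh_access() {
    local user="$1"; shift
    local host rc=0
    for host in "$@"; do
        # BatchMode makes ssh fail instead of prompting for a password,
        # "sudo -n true" fails if sudo would have to ask for one.
        if $SSH_CMD -o BatchMode=yes -o ConnectTimeout=5 "$user@$host" sudo -n true </dev/null; then
            echo "$host: ok"
        else
            echo "$host: FAILED"
            rc=1
        fi
    done
    return $rc
}
```

With the lab values from the script this would be `check_ssh_access root squid1 squid2 squid3`.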
Rebuild the monitor store
Assuming the collection of osdmaps was successful and no errors were logged, you can now rebuild the store. The hostname of the first monitor host in this example is squid1.
```
# Create a backup of the current store (just in case)
root@squid1:~# rsync -a /var/lib/ceph/$FSID/mon.squid1/store.db/ /root/mon-store-backup/

# Ensure all daemons are stopped!

# Remove current mon-store
root@squid1:~# rm -rf /var/lib/ceph/$FSID/mon.squid1/store.db/

# Copy collected store.db from your management node (where you ran the collection script,
# $ms_central in the script) to the first monitor's store.db
rsync -av -e ssh /tmp/mon-store-central/store.db/ root@squid1:/var/lib/ceph/$FSID/mon.squid1/store.db/

# Rebuild mon-store
root@squid1:~# cephadm shell --name mon.squid1 -- ceph-monstore-tool /var/lib/ceph/mon/ceph-squid1/ rebuild -- --keyring /etc/ceph/ceph.keyring --mon-ids squid1 squid2 squid3
...
epoch 0
fsid df500e26-1930-11f1-a79f-fa163e65a168
last_changed 2026-06-27T12:18:38.557560+0000
created 2026-06-27T12:18:38.557560+0000
min_mon_release 0 (unknown)
election_strategy: 1
0: [v2:192.168.124.107:3300/0,v1:192.168.124.107:6789/0] mon.squid2
1: [v2:192.168.124.222:3300/0,v1:192.168.124.222:6789/0] mon.squid3
2: [v2:192.168.124.230:3300/0,v1:192.168.124.230:6789/0] mon.squid1

### Note: you need to specify all monitors in the correct order (as mentioned in the docs)!

# Change ownership
root@squid1:~# chown -R 167:167 /var/lib/ceph/$FSID/mon.squid1/store.db/
```
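A monitor that cannot read its own store will not start, so it doesn’t hurt to double-check the ownership after the chown. The sketch below lists everything under a directory that is not owned by the expected uid and gid (167 in the commands above). `wrong_owner` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: print all entries below a directory that are not
# owned by the given uid:gid (defaults to 167:167 as used above).
wrong_owner() {
    local dir="$1" uid="${2:-167}" gid="${3:-167}"
    find "$dir" \( ! -user "$uid" -o ! -group "$gid" \) -print
}
```

An empty output of `wrong_owner /var/lib/ceph/$FSID/mon.squid1/store.db` would mean that the ownership is as expected.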
Start the monitors
If all above steps were successful, you can try and start the first monitor, either with systemd or cephadm or podman/docker. This also depends on the actual situation. So one of these commands should work:
```
root@squid1:~# systemctl start ceph-$FSID@mon.squid1.service
root@squid1:~# cephadm unit start --name mon.squid1
root@squid1:~# podman start ... / docker start ...

# Now check the monitor log to see if it starts successfully
root@squid1:~# journalctl -fu ceph-$FSID@mon.squid1.service
```
If the first monitor starts, you should copy the contents of its store.db directory to the other monitors, ensure the correct ownership and try starting them as well. Inspect their logs if something goes wrong, and fix any errors before continuing. You can also simply repeat the rsync and rebuild steps to recreate the mon stores on the remaining monitors.
If the monitors start, you should be able to query the Ceph status again (ceph -s), and its output should now contain the OSDs (just an excerpt):
```
ceph -s
  cluster:
    id:     df500e26-1930-11f1-a79f-fa163e65a168
    health: HEALTH_WARN
            mons are allowing insecure global_id reclaim
            no active mgr

  services:
    mon: 3 daemons, quorum squid2,squid3,squid1 (age 57s)
    mgr: no daemons active
    osd: 6 osds: 1 up (since 52m), 6 in (since 54m)

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:
```
As you can see, the monitors are now aware of 6 OSDs (which is correct). Only the “pools”, “objects” and “usage” output is not correct yet, since the OSDs are not up and haven’t been able to report to the MONs and MGRs.
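To compare the number of known OSDs with what you expect (and later to watch them come up), the osd line of the status output can be parsed. This is only a sketch: `osd_counts` is a hypothetical helper and relies on the line format shown in the excerpt above.

```shell
#!/bin/bash
# Hypothetical helper: read "ceph -s" output on stdin and
# print "<total> <up> <in>" taken from the "osd:" line.
osd_counts() {
    sed -n 's/^[[:space:]]*osd:[[:space:]]*\([0-9][0-9]*\) osds: \([0-9][0-9]*\) up[^,]*, \([0-9][0-9]*\) in.*$/\1 \2 \3/p'
}
```

Running `ceph -s | osd_counts` on the cluster above would print the total, up and in counters on one line.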
Before starting OSDs
There are some more steps required before you can activate the OSDs. For example, all custom configs are gone, auth entries are missing, the container_image config value has been reset, and the orchestrator doesn’t work until it has been re-enabled. Not all of these steps necessarily have to be performed before activating the OSDs, but some of them are crucial and others can make life easier when activating the OSDs.
```
# Disable auth_allow_insecure_global_id_reclaim (to clear the warning)
root@squid1:~# ceph config set mon auth_allow_insecure_global_id_reclaim false

# Recreate and import at least one mgr auth keyring to enable the orchestrator
root@squid1:~# cat mgr.squid1.tkgxvc.keyring
[mgr.squid1.tkgxvc]
        key = AQDjJhtqCOWnFRAA4/QHkFo2tnWFG8m37kM5HQ==
        caps mds = allow *
        caps mon = profile mgr
        caps osd = allow *
root@squid1:~# ceph auth import -i mgr.squid1.tkgxvc.keyring

# Recreate and import all OSD keyrings
# If present, use key from /var/lib/ceph/$FSID/osd.$id/keyring to recreate an OSD keyring
root@squid1:~# cat osd.0.keyring
[osd.0]
        key = AQBzGxtqPUu8EBAAC0BqZMcfRJb5ZuGtXFgBCA==
        caps mgr = allow profile osd
        caps mon = allow profile osd
        caps osd = allow *
root@squid1:~# ceph auth import -i osd.0.keyring

# Check ceph auth for completeness
root@squid1:~# ceph auth ls

# Start at least one manager daemon to be able to manage the cluster again
root@squid1:~# systemctl start ceph-$FSID@mgr.squid1.tkgxvc.service

# Set the public network(s), otherwise OSDs will refuse to start
root@squid1:~# ceph config set global public_network 192.168.124.0/24

# Enable cephadm module
root@squid1:~# ceph mgr module enable cephadm

# Set orchestrator backend
root@squid1:~# ceph orch set backend cephadm

# Recreate cephadm user/keyring
root@squid1:~# ceph cephadm generate-key

# Get new public key
root@squid1:~# ceph cephadm get-pub-key

# Add new public key to each host's authorized_keys (usually of the root user)

# Re-add hosts to the orchestrator
root@squid1:~# ceph orch host add squid1 192.168.124.230
root@squid1:~# ceph orch host add squid2 192.168.124.107
root@squid1:~# ceph orch host add squid3 192.168.124.222

# Add labels if necessary

# Configure container_image (choose the correct version for your cluster)
root@squid1:~# ceph config set global container_image quay.io/ceph/ceph:v19.2.4
```
If you don’t change the container_image, some daemons most likely will be deployed using the hard-coded default:
docker.io/ceph/daemon-base:latest-master-devel
Not changing that image could lead to issues when redeploying other daemons such as MDS.
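Coming back to the OSD keyrings: recreating them by hand for every OSD is error-prone, so here is a sketch that builds an importable keyring from the key in an OSD’s on-disk keyring file. `make_osd_keyring` is a hypothetical helper, and the caps simply mirror the osd.0 keyring shown above.

```shell
#!/bin/bash
# Hypothetical helper: print an importable keyring for osd.<id>, using the
# key found in the given on-disk keyring file and the caps shown above.
make_osd_keyring() {
    local id="$1" src="$2" key
    # Take everything after "key =" (the key itself may end with "=")
    key="$(sed -n 's/^[[:space:]]*key[[:space:]]*=[[:space:]]*//p' "$src" | head -n 1)"
    if [ -z "$key" ]; then
        echo "no key found in $src" >&2
        return 1
    fi
    cat <<EOF
[osd.$id]
        key = $key
        caps mgr = allow profile osd
        caps mon = allow profile osd
        caps osd = allow *
EOF
}
```

For osd.0 this could be used as `make_osd_keyring 0 /var/lib/ceph/$FSID/osd.0/keyring > osd.0.keyring`, followed by `ceph auth import -i osd.0.keyring`. Inspect the generated file before importing it.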
Activate the OSDs
There’s a wrapper for cephadm to activate existing OSDs. It covers a slightly different scenario: when an OSD host fails due to a broken operating system, its OSDs can be re-activated after reinstalling the operating system. In that case, after all prerequisites are met, you can just run:
ceph cephadm osd activate <host>
But chances are high that this approach won’t work in a scenario like we’re facing here (total mon store loss). So a more manual workaround like the following might be required.
Now assuming that the previously described steps worked and errors were fixed, we can try starting OSDs. You should check the contents of the OSD directories first and see if the most important files are present:
```
root@squid1:~# ls -l /var/lib/ceph/$FSID/osd.0/
total 68
lrwxrwxrwx 1 167 167   93 Jun 19 14:56 block -> /dev/ceph-3d3a6efd-02d7-4abf-ae23-17bc239feb1e/osd-block-8012c0dc-cb7b-4179-908c-df0a94a4227b
-rw-r--r-- 1 167 167   37 Jun 19 14:56 ceph_fsid
-rw------- 1 167 167  181 Jun 19 14:58 config
-rw-r--r-- 1 167 167   37 Jun 19 14:56 fsid
-rw------- 1 167 167  142 Jun 19 14:58 keyring
-rw------- 1 167 167 1875 Jun 19 14:33 unit.poststop
-rw------- 1 167 167 3457 Jun 19 14:33 unit.run
-rw-r--r-- 1 167 167    2 Jun 19 14:56 whoami
```
Without these files the OSDs will most likely refuse to start. Now let’s give it a try:
```
# systemd
root@squid1:~# systemctl start ceph-$FSID@osd.0.service

# or via cephadm
root@squid1:~# cephadm unit start --name osd.0
```
Alternatively, if the orchestrator works (you could try to deploy node-exporter or other daemons to confirm), you can also just redeploy an OSD:
root@squid1:~# ceph orch daemon redeploy osd.X
This will recreate missing files in the OSD’s directory and hopefully start the OSD successfully.
Inspect the OSD logs after starting them and look out for error messages if they fail to start:
root@squid1:~# journalctl -fu ceph-$FSID@osd.X.service
If you enabled log_to_file you can follow the log file like this:
root@squid1:~# tail -F /var/log/ceph/$FSID/ceph-osd.X.log
More ToDo’s
Let’s assume that we could successfully start the OSDs and recovery is progressing. Now we need to access the data. For RBD pools there’s (probably) not much you need to do: as soon as clients can access the OSDs again, they should be able to map/attach/access the RBD images.
For CephFS it’s a different story. Since the MDS maps are lost as well, all filesystems will have to be recreated, including the MDS daemons. The Ceph docs cover this part. Read the instructions very carefully! The required commands are:
```
root@squid1:~# ceph fs new <fs_name> <metadata_pool> <data_pool> --force --recover
root@squid1:~# ceph fs set <fs_name> joinable true
```
Since there are no MDS daemons yet, you’ll need to deploy them, either via a .yaml file:
root@squid1:~# ceph orch apply -i mds.yaml
or directly in the command line:
root@squid1:~# ceph orch apply mds <fs_name> --placement="label:mds"
Check out the docs for more information on MDS deployment.
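For the .yaml variant, a spec equivalent to the command-line example (placement by the label "mds") could be generated like this. It’s only a sketch: `mds_spec` is a hypothetical helper, and the label has to match whatever labels you assigned to your hosts.

```shell
#!/bin/bash
# Hypothetical helper: print a minimal MDS service spec for the given
# file system name, placed on hosts carrying the given label.
mds_spec() {
    local fs_name="$1" label="${2:-mds}"
    cat <<EOF
service_type: mds
service_id: $fs_name
placement:
  label: $label
EOF
}
```

For example, `mds_spec <fs_name> > mds.yaml` writes the file that is then applied with ceph orch apply -i mds.yaml.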
I won’t go into more detail here, since this post is long enough as it is. But you’ll need to deploy all required daemons for all the services you had running before the crash. So you’ll have to recreate your yaml files for all the services like RGW, Monitoring, NVMe-oF, etc. The existing service specs in a recovered cluster look like this (when no services have been managed yet):
```
root@squid1:~# ceph orch ls
NAME  PORTS  RUNNING  REFRESHED  AGE  PLACEMENT
mgr              2/0  9m ago     -    <unmanaged>
mon              1/0  9m ago     -    <unmanaged>
osd                0  9m ago     -    <unmanaged>
```
So you’ll need to update those as well to match your actual cluster configuration. That’s why I recommend regularly backing up your cluster configuration, auth keyrings, service specs and so on, so you don’t have to recreate everything from scratch in such a failure scenario.
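Such a backup doesn’t have to be fancy. The following sketch dumps the config, the auth entries and the service specs into a directory. `backup_cluster_config`, the file names and the `CEPH_CMD` override (for dry runs) are hypothetical; the three ceph subcommands are the regular ones for dumping config, auth and orchestrator specs.

```shell
#!/bin/bash
# CEPH_CMD is a hypothetical override for dry runs; it defaults to ceph.
CEPH_CMD="${CEPH_CMD:-ceph}"

# Hypothetical helper: dump config, auth entries and service specs into a directory.
# The function body runs in a subshell so the umask change stays local.
backup_cluster_config() (
    dest="$1"
    umask 077   # the auth export contains secrets
    mkdir -p "$dest" || exit 1
    $CEPH_CMD config dump > "$dest/config-dump.txt" || exit 1
    $CEPH_CMD auth export > "$dest/auth-export.keyring" || exit 1
    $CEPH_CMD orch ls --export > "$dest/service-specs.yaml" || exit 1
)
```

You could run something like `backup_cluster_config /root/ceph-backup/$(date +%F)` from a cron job or timer; the target path is just an example, and the result should be stored somewhere outside the cluster.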
After the recovery has been completed, the cluster status looks like this:
```
squid1:~ # ceph -s
  cluster:
    id:     df500e26-1930-11f1-a79f-fa163e65a168
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum squid2,squid3,squid1 (age 80m)
    mgr: squid3.fepqsu(active, since 62m), standbys: squid1.pcgzhi
    mds: 2/2 daemons up, 3 standby
    osd: 6 osds: 6 up (since 65m), 6 in (since 65m)

  data:
    volumes: 2/2 healthy
    pools:   7 pools, 385 pgs
    objects: 49 objects, 4.3 MiB
    usage:   421 MiB used, 60 GiB / 60 GiB avail
    pgs:     385 active+clean
```
Potential obstacles
There are more things that can go wrong while trying to re-activate the OSDs. This is not an exhaustive list, of course, but it can help in specific scenarios.
Missing OSD path contents
As already mentioned above, the contents of the OSD directories need to be present and matching. If OSDs don’t start, the logs might point to permission issues, missing files (unit.poststop, type, and so on), a broken symbolic link to the OSD block device or other failures.
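A quick way to spot missing files or a dangling block symlink is a small check per OSD directory. This is a sketch: `check_osd_dir` is a hypothetical helper, and the file list is taken from the directory listing of osd.0 shown earlier; I assume (but haven’t verified) that this list is sufficient for every OSD.

```shell
#!/bin/bash
# Hypothetical helper: report missing files and broken symlinks in an OSD directory.
check_osd_dir() {
    local dir="$1" f rc=0
    for f in block ceph_fsid config fsid keyring unit.poststop unit.run whoami; do
        if [ -L "$dir/$f" ] && [ ! -e "$dir/$f" ]; then
            # The link exists but its target does not
            echo "broken symlink: $dir/$f"
            rc=1
        elif [ ! -e "$dir/$f" ]; then
            echo "missing: $dir/$f"
            rc=1
        fi
    done
    return $rc
}
```

To check all OSDs of a host, loop over the directories, e.g. `for d in /var/lib/ceph/$FSID/osd.*; do check_osd_dir "$d"; done`.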
Ceph Dashboard
Most likely you’ll have to enable the dashboard again since all configs are gone:
root@squid1:~# ceph mgr module enable dashboard
After that, you’ll have to recreate users, roles etc., whatever you had configured before.
Packages
One of the issues during OSD activation in this case was the presence of the ceph-osd package on at least one of the hosts. Having that package installed on a cephadm-managed cluster is known to be problematic, so make sure the ceph-osd package is removed. Basically, on a cephadm-managed cluster you only require a cephadm package (although the orchestrator uses a binary stored at /var/lib/ceph/$FSID/cephadm.{some_digest} to remotely execute commands) and, for convenience, the ceph-common package. Other packages like ceph-mon, ceph-mgr etc. are not required and should also be removed.
ceph-volume
As I already pointed out before, don’t use the bare ceph-volume commands to try and fix things when OSDs are managed by the orchestrator. I recommend removing the ceph-volume package as well, if present. In this case, such attempts to fix the OSDs made things worse and caused more chaos, leaving more things to clean up.
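To find such leftover packages, you can filter the list of installed packages. The sketch below only looks for the packages named in this and the previous section (ceph-osd, ceph-mon, ceph-mgr, ceph-volume); the "etc." is not covered, so extend the pattern as needed. `unwanted_ceph_packages` is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: read installed package names from stdin (one per line)
# and print those that should not be present on a cephadm-managed host.
unwanted_ceph_packages() {
    # -x matches whole lines only, so ceph-common or cephadm are not reported
    grep -x -E 'ceph-(osd|mon|mgr|volume)' || true
}
```

On Debian-based hosts the input could come from `dpkg-query -W -f='${Package}\n'`, on RPM-based hosts from `rpm -qa --qf '%{NAME}\n'`, each piped into `unwanted_ceph_packages`.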
Container images
As mentioned before, there’s a hard-coded default image (docker.io/ceph/daemon-base:latest-master-devel) which can make life harder. Make sure to configure the correct Ceph container_image as well as the images for other daemons (like nvmeof, grafana, prometheus etc.):
```
root@squid1:~# ceph config ls | grep container_image
container_image
mgr/cephadm/container_image_alertmanager
mgr/cephadm/container_image_base
mgr/cephadm/container_image_elasticsearch
mgr/cephadm/container_image_grafana
mgr/cephadm/container_image_haproxy
mgr/cephadm/container_image_jaeger_agent
mgr/cephadm/container_image_jaeger_collector
mgr/cephadm/container_image_jaeger_query
mgr/cephadm/container_image_keepalived
mgr/cephadm/container_image_loki
mgr/cephadm/container_image_node_exporter
mgr/cephadm/container_image_nvmeof
mgr/cephadm/container_image_prometheus
mgr/cephadm/container_image_promtail
mgr/cephadm/container_image_samba
mgr/cephadm/container_image_snmp_gateway
```
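A simple guard against accidentally keeping the hard-coded default is to compare the configured value with it. This is a minimal sketch; `is_default_image` is a hypothetical helper and the default string is the one quoted above.

```shell
#!/bin/bash
# The hard-coded default image mentioned above
DEFAULT_IMAGE="docker.io/ceph/daemon-base:latest-master-devel"

# Hypothetical helper: succeed if the given image is still the hard-coded default.
is_default_image() {
    [ "$1" = "$DEFAULT_IMAGE" ]
}
```

For example: `is_default_image "$(ceph config get mgr container_image)" && echo "container_image still points to the default"`.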
Disclaimer: The procedure above was tested in two lab environments in order to write this blog post, but the individual steps were executed in a real cluster. It worked for me/us but it might not work for you. If anything goes wrong it’s not my fault, but yours. And it’s yours to fix it!
If you have suggestions on how to improve the procedure, or feedback in case you had to use it, I’d appreciate a comment!
