Cephadm: Recover mon store using OSDs

A few weeks ago I helped a Ceph user to recover his broken cluster (see this thread). Basically, after his monitors stopped working he re-deployed a new cluster with the same Ceph FSID and attached the existing OSDs to the re-deployed hosts. But it’s not that easy to re-activate those OSDs because the new monitors don’t have the old osdmap, hence they don’t know anything about the existing OSDs. So how did we fix that?

This scenario can be considered a total monitor store failure. There’s a documented procedure in the upstream docs, but unfortunately it’s written for non-cephadm clusters and doesn’t go into much detail about the further steps. If the daemons run within containers, there are more things to consider. So I decided to write this blog post and add some more details to the procedure, targeting mainly clusters managed by cephadm. The recovery procedure doesn’t only require cephadm-specific commands, though, so it can be considered a general guideline on how to recover from a mon store loss, just with some extra details about cephadm deployments.

The procedure covered by said docs is already written in script form, so I used it as a template, extended it for cephadm usage and included some very basic logging. It collects the osdmaps from all OSDs and contains the necessary considerations regarding containers. I decided to only automate the osdmap collection, not all required steps of the store rebuild procedure (e.g. mon store rebuild, client auth, etc.), because the result of the collection has to be inspected carefully before proceeding. Here’s the script I used on one of my lab clusters to collect the osdmaps:

cat /usr/local/bin/ceph-collect-osdmaps.sh  
#!/bin/bash

: <<'END'
This script does:
1. Login to each OSD host
2. Create a temporary mon-store in /tmp after cleaning it up 
3. Create a temporary mon-store directory within each OSD's filesystem (which is then mapped into the cephadm shell) after cleaning it up
4. Collect the osdmaps from each OSD and sync it to each host's temporary mon-store  
5. Sync each host's temporary mon-store to the central mon-store
END

# List of all OSD hosts
hosts="squid1 squid2 squid3"

# Ensure this user has passwordless sudo on all OSD hosts (if not root)
# It's also easier if the user has passwordless login, e.g. via ssh public key
user="root"

# Cluster FSID
ceph_fsid="df500e26-1930-11f1-a79f-fa163e65a168"

# Logging directories
ms_central="/tmp/mon-store-central"
log_central="/tmp/central-osd-logs"
logdir="/tmp/logdir"
ms_collected="/tmp/mon-store-collected"

# Ensure all OSDs are stopped before proceeding

rm -rf $ms_central
rm -rf $log_central
mkdir -p $log_central

for host in $hosts; do
ssh $user@$host <<EOF
logdir="/tmp/logdir"
ms_collected="/tmp/mon-store-collected"
ceph_fsid="df500e26-1930-11f1-a79f-fa163e65a168"
osd_dir="/var/lib/ceph/\$ceph_fsid"
rm -rf \$ms_collected
rm -rf \$logdir
mkdir -p \$logdir
echo -e "\nHost: \$(hostname -s)"

for osd in \$(cephadm ceph-volume lvm list --format json 2>/dev/null | jq -r '.[][].tags | ."ceph.osd_id"'); do
echo -e "\nProcessing osd.\$osd ..."
rm -rf \$osd_dir/osd.\$osd/mon-store-collect
mkdir -p \$osd_dir/osd.\$osd/mon-store-collect

cephadm shell --name osd.\$osd -- ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-\$osd --no-mon-config --op update-mon-db --mon-store-path /var/lib/ceph/osd/ceph-\$osd/mon-store-collect 2> >(tee \$logdir/osd.\$osd.\$(hostname -s).log | grep -vEi "quay|infer|using|creating")

rsync -qaz \$osd_dir/osd.\$osd/mon-store-collect/. \$ms_collected
done
EOF
rsync -qaz -e ssh $user@$host:$ms_collected/. $ms_central
rsync -qav -e ssh $user@$host:$logdir/. $log_central
done

If not all osdmaps are collected successfully, the rebuilt monitor store will be incomplete and some OSDs won’t be able to join the cluster. And there are many more potential obstacles before the OSDs can be activated, so it’s critical to review the logs from each host after the osdmap collection script has finished (in case of errors) and then continue step by step.

Collect the osdmaps

There are a couple of assumptions made in the script which you will need to adapt to your requirements, for example the list of hosts, the Ceph FSID and the user who logs in to each host. All OSDs have to be stopped before running the script; if there are still running OSD processes, the ceph-objectstore-tool will fail during the collection. I recommend stopping all other daemons (mgr, mds, ceph-exporter, etc.) as well before starting the recovery, otherwise you might get stuck with a stale mgr.

The procedure has been tested on two different Squid lab clusters, one on 19.2.3, the other one on 19.2.4. Here’s an example of the script output:

Host: squid1

Processing osd.1 ...
osd.1   : 0 osdmaps trimmed, 226 osdmaps added.

Processing osd.2 ...
osd.2   : 0 osdmaps trimmed, 228 osdmaps added.

Processing osd.0 ...
osd.0   : 0 osdmaps trimmed, 223 osdmaps added.
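Reviewing the results by hand gets tedious with many OSDs. Here’s a minimal sketch of a helper that scans a directory of collected output for the summary lines shown above and flags every file that lacks one. It assumes that the summary line ends up in the per-OSD files in the script’s log directory; if it only shows up on the terminal in your environment, redirect the script output to a file ending in .log and point the helper at that directory instead. The function name summarize_osdmap_logs is made up for this example.

```shell
#!/bin/bash
# Hypothetical helper (not part of Ceph): check each *.log file in a
# directory for an "osdmaps trimmed, N osdmaps added" summary line.
# Returns 0 if every file has one, 1 if at least one is missing it,
# 2 if the directory contains no log files at all.
summarize_osdmap_logs() {
    local dir="$1" f line rc=0
    for f in "$dir"/*.log; do
        # An unmatched glob stays literal, so test for existence first
        [ -e "$f" ] || { echo "no log files in $dir"; return 2; }
        # Use the last summary line in case a file contains several runs
        line=$(grep -E 'osdmaps trimmed, [0-9]+ osdmaps added' "$f" | tail -n 1)
        if [ -n "$line" ]; then
            echo "OK      $(basename "$f"): $line"
        else
            echo "MISSING $(basename "$f")"
            rc=1
        fi
    done
    return "$rc"
}
```

On the management node this could be called as summarize_osdmap_logs /tmp/central-osd-logs. Any MISSING entry is a reason to read that log file in full before rebuilding the store.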

Ceph FSID

Depending on how much data is actually lost, there are several ways to retrieve the Ceph FSID (to keep the output brief, I use $FSID instead of the actual UUID):

  • ceph.conf: If there’s any ceph.conf left on any host (clients also have a ceph.conf), it will contain the FSID. Alternatively, since every daemon also has a copy of the ceph.conf, check any daemon config file in /var/lib/ceph/$FSID/osd.X/config if the filesystem is intact.
  • LV tags: On any OSD host run the command and look for the tag ceph.cluster_fsid=$FSID:
    lvs -o tags
  • ceph-volume: On any OSD host run the command:
    cephadm ceph-volume lvm list | grep "cluster fsid"

Only run “bare” ceph-volume commands if your cluster is not managed by cephadm! Otherwise you’ll most likely end up with more chaos!
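For the ceph.conf variant, a small sketch like the following can pull the FSID out of any config file that is still around. The function name fsid_from_conf is hypothetical, and it assumes the usual “fsid = <uuid>” line that cephadm writes into its minimal config files.

```shell
#!/bin/bash
# Hypothetical helper: print the value of the first "fsid = ..." line
# found in a ceph.conf-style file. Prints nothing if there is none.
fsid_from_conf() {
    sed -n 's/^[[:space:]]*fsid[[:space:]]*=[[:space:]]*\([^[:space:]]*\).*/\1/p' "$1" | head -n 1
}
```

It works the same way on /etc/ceph/ceph.conf and on a daemon’s config file, for example fsid_from_conf /var/lib/ceph/*/osd.0/config.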

OSD FSID

In some failure scenarios you might need the FSID of each OSD to rebuild an OSD’s directory structure. To collect those, you can also use ceph-volume or LV tags:

  • LV tags: On any OSD host run the command and look for the tag ceph.osd_fsid=$OSD_FSID (this is the OSD’s own FSID, not the cluster FSID):
    lvs -o tags
  • ceph-volume: On any OSD host run the command:
    cephadm ceph-volume lvm list | grep "osd fsid"

Only run “bare” ceph-volume commands if your cluster is not managed by cephadm! Otherwise you’ll most likely end up with more chaos!
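If you need the FSIDs of many OSDs, picking them out of the tag strings by eye is error-prone. The following sketch reads LV tag strings (one LV per line) and prints one “osd.<id> <osd fsid>” pair per line. It relies only on the ceph.osd_id and ceph.osd_fsid tags mentioned in this post; the function name osd_fsids_from_tags is made up.

```shell
#!/bin/bash
# Hypothetical helper: read LV tag strings on stdin (comma-separated
# key=value pairs, one LV per line) and print "osd.<id> <osd_fsid>".
# Lines without both tags (e.g. non-Ceph LVs) are skipped.
osd_fsids_from_tags() {
    local line id fsid
    while IFS= read -r line; do
        # Split the tags into one per line, then pick the two we need
        id=$(printf '%s\n' "$line" | tr ',' '\n' | sed -n 's/^[[:space:]]*ceph\.osd_id=//p' | head -n 1)
        fsid=$(printf '%s\n' "$line" | tr ',' '\n' | sed -n 's/^[[:space:]]*ceph\.osd_fsid=//p' | head -n 1)
        if [ -n "$id" ] && [ -n "$fsid" ]; then
            echo "osd.$id $fsid"
        fi
    done
    return 0
}
```

A possible call is lvs --noheadings -o lv_tags | osd_fsids_from_tags | sort -u. The sort -u removes duplicates, which I’d expect whenever an OSD has more than one LV (separate DB or WAL devices).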

SSH user

The easiest way is to have a user that can log in to each host without a password and has (passwordless) sudo permissions to execute the necessary commands. Depending on your Ceph deployment, you could log in as the cephadm user, which usually has the mentioned permissions. Often the root user is configured as the cephadm user, as it is in the lab cluster in which I tested this procedure.
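Before running anything on all hosts, it can save time to verify that the login really works without prompts. This is a rough pre-flight sketch; the function names check_host_access and check_all_hosts are hypothetical, and it assumes OpenSSH (for the BatchMode option) and sudo on the target hosts.

```shell
#!/bin/bash
# Hypothetical pre-flight check: returns 0 if key-based login works for
# user@host and, for non-root users, passwordless sudo works as well.
check_host_access() {
    local user="$1" host="$2"
    # BatchMode=yes makes ssh fail instead of asking for a password,
    # -n keeps ssh from reading stdin, and sudo -n fails instead of
    # prompting if a password would be required.
    if [ "$user" = "root" ]; then
        ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$user@$host" true
    else
        ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$user@$host" sudo -n true
    fi
}

# Check a list of hosts and report each one; non-zero if any host failed.
check_all_hosts() {
    local user="$1" host rc=0
    shift
    for host in "$@"; do
        if check_host_access "$user" "$host"; then
            echo "OK   $host"
        else
            echo "FAIL $host"
            rc=1
        fi
    done
    return "$rc"
}
```

With the values from my lab script this would be check_all_hosts root squid1 squid2 squid3.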

Rebuild the monitor store

Assuming the collection of osdmaps was successful and no errors were logged, you can now rebuild the store. The hostname of the first monitor host in this example is squid1.

# Create a backup of the current store (just in case)
root@squid1:~# rsync -a /var/lib/ceph/$FSID/mon.squid1/store.db/ /root/mon-store-backup/

# Ensure all daemons are stopped!

# Remove current mon-store
root@squid1:~# rm -rf /var/lib/ceph/$FSID/mon.squid1/store.db/

# Copy collected store.db from your management node (where you ran the collection script) to the first monitor's store.db
rsync -av -e ssh /tmp/mon-store-central/store.db/ root@squid1:/var/lib/ceph/$FSID/mon.squid1/store.db/

# Rebuild mon-store
root@squid1:~# cephadm shell --name mon.squid1 -- ceph-monstore-tool /var/lib/ceph/mon/ceph-squid1/ rebuild  -- --keyring /etc/ceph/ceph.keyring --mon-ids squid1 squid2 squid3
...
epoch 0
fsid df500e26-1930-11f1-a79f-fa163e65a168
last_changed 2026-06-27T12:18:38.557560+0000
created 2026-06-27T12:18:38.557560+0000
min_mon_release 0 (unknown)
election_strategy: 1
0: [v2:192.168.124.107:3300/0,v1:192.168.124.107:6789/0] mon.squid2
1: [v2:192.168.124.222:3300/0,v1:192.168.124.222:6789/0] mon.squid3
2: [v2:192.168.124.230:3300/0,v1:192.168.124.230:6789/0] mon.squid1

### Note: you need to specify all monitors in the correct order (as mentioned in the docs)!

# Change ownership 
root@squid1:~# chown -R 167:167 /var/lib/ceph/$FSID/mon.squid1/store.db/
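Wrong ownership is an easy mistake to make after copying files around as root. A quick sketch to double-check the result of the chown step: it prints every entry below a directory that is not owned by the expected UID and GID (167 for both, as used above). The function name find_wrong_owner is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: list everything below DIR that is not owned by
# UID:GID. Defaults to 167:167, the IDs used in the chown step above.
# Empty output means the ownership is consistent.
find_wrong_owner() {
    local dir="$1" uid="${2:-167}" gid="${3:-167}"
    find "$dir" \( ! -uid "$uid" -o ! -gid "$gid" \) -print
}
```

For the first monitor that would be find_wrong_owner /var/lib/ceph/$FSID/mon.squid1/store.db, which should print nothing before you try to start the daemon.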

Start the monitors

If all above steps were successful, you can try and start the first monitor, either with systemd or cephadm or podman/docker. This also depends on the actual situation. So one of these commands should work:

root@squid1:~# systemctl start ceph-$FSID@mon.squid1.service 

root@squid1:~# cephadm unit start --name mon.squid1

root@squid1:~# podman start ... / docker start ...

# Now check the monitor log to see if it starts successfully
root@squid1:~# journalctl -fu ceph-$FSID@mon.squid1.service

If the first monitor starts, you should copy the contents of its store.db directory to the other monitors, ensure the correct ownership and try starting them as well. Inspect their logs if something goes wrong, and if it does it has to be fixed before continuing. You can also simply repeat the rsync and rebuild steps to recreate the mon stores on the remaining monitors.

If the monitors start you should be able to see a Ceph status now (ceph -s). The Ceph status should now contain the OSDs in the output (just an excerpt):

ceph -s
cluster:
id: df500e26-1930-11f1-a79f-fa163e65a168
health: HEALTH_WARN
mons are allowing insecure global_id reclaim
no active mgr
services:
mon: 3 daemons, quorum squid2,squid3,squid1 (age 57s)
mgr: no daemons active
osd: 6 osds: 1 up (since 52m), 6 in (since 54m)
data:
pools: 0 pools, 0 pgs
objects: 0 objects, 0 B
usage: 0 B used, 0 B / 0 B avail
pgs:

As you can see, the monitors are now aware of 6 OSDs (which is correct). Only the “pools”, “objects” and “usage” output is not correct yet, since the OSDs are not up and haven’t been able to report to the MONs and MGRs.
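To keep an eye on how many OSDs have come up while you work through the remaining steps, the “osd:” line of the status output can be condensed into two numbers. This is only a sketch that parses the plain-text format shown above, which may differ between releases; for anything long-lived, the JSON output (ceph -s --format json) is the safer thing to parse. The function name osd_up_summary is made up.

```shell
#!/bin/bash
# Hypothetical helper: read plain-text `ceph -s` output on stdin and
# print "<up> <total>" for the OSDs. Returns 0 only if all OSDs are up,
# 1 if some are down, 2 if no "osd:" line was found.
osd_up_summary() {
    local line total up
    line=$(grep -E 'osd: +[0-9]+ osds: +[0-9]+ up' | head -n 1 || true)
    [ -n "$line" ] || { echo "no osd line found" >&2; return 2; }
    total=$(printf '%s\n' "$line" | sed -E 's/.*osd: +([0-9]+) osds:.*/\1/')
    up=$(printf '%s\n' "$line" | sed -E 's/.*osds: +([0-9]+) up.*/\1/')
    echo "$up $total"
    [ "$up" -eq "$total" ]
}
```

Used as ceph -s | osd_up_summary, it would print “1 6” for the status excerpt above and exit with a non-zero code.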

Before starting OSDs

There are some more steps required before you can activate the OSDs. For example, all custom configs are gone, auth entries are missing, the container_image config value has been reset, and the orchestrator doesn’t work until it has been re-enabled. Not all of these steps necessarily have to be performed before activating the OSDs, but some of them are crucial and others can make life easier when activating the OSDs.

# Disable auth_allow_insecure_global_id_reclaim (to clear the warning)
root@squid1:~# ceph config set mon auth_allow_insecure_global_id_reclaim false

# Recreate and import at least one mgr auth keyring to enable the orchestrator
root@squid1:~# cat mgr.squid1.tkgxvc.keyring
[mgr.squid1.tkgxvc]
key = AQDjJhtqCOWnFRAA4/QHkFo2tnWFG8m37kM5HQ==
caps mds = allow *
caps mon = profile mgr
caps osd = allow *

root@squid1:~# ceph auth import -i mgr.squid1.tkgxvc.keyring

# Recreate and import all OSDs keyrings
# If present, use key from /var/lib/ceph/$FSID/osd.$id/keyring to recreate an OSD keyring
root@squid1:~# cat osd.0.keyring
[osd.0]
key = AQBzGxtqPUu8EBAAC0BqZMcfRJb5ZuGtXFgBCA==
caps mgr = allow profile osd
caps mon = allow profile osd
caps osd = allow *

root@squid1:~# ceph auth import -i osd.0.keyring

# Check ceph auth for completeness
root@squid1:~# ceph auth ls

# Start at least one manager daemon to be able to manage the cluster again
root@squid1:~# systemctl start ceph-$FSID@mgr.squid1.tkgxvc.service

# Set the public network(s), otherwise OSDs will refuse to start
root@squid1:~# ceph config set global public_network 192.168.124.0/24

# Enable cephadm module
root@squid1:~# ceph mgr module enable cephadm

# Set orchestrator backend
root@squid1:~# ceph orch set backend cephadm

# Recreate cephadm user/keyring
root@squid1:~# ceph cephadm generate-key

# Get new public key
root@squid1:~# ceph cephadm get-pub-key

# Add new public key to each host's authorized_keys (usually of the root user)

# Re-add hosts to the orchestrator
root@squid1:~# ceph orch host add squid1 192.168.124.230
root@squid1:~# ceph orch host add squid2 192.168.124.107
root@squid1:~# ceph orch host add squid3 192.168.124.222

# Add labels if necessary

# Configure container_image (choose the correct version for your cluster)
root@squid1:~# ceph config set global container_image quay.io/ceph/ceph:v19.2.4

If you don’t change the container_image, some daemons most likely will be deployed using the hard-coded default:

docker.io/ceph/daemon-base:latest-master-devel

Not changing that image could lead to issues when redeploying other daemons such as MDS.
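Recreating the OSD keyrings from the steps above by hand is repetitive, so here’s a sketch that builds an importable keyring from the key that is still present in an OSD’s directory. The caps are the ones from the osd.0 example in this post, written with quotes as “ceph auth get” prints them. The function name make_osd_keyring is hypothetical, and it assumes the existing keyring file contains a “key = ...” line.

```shell
#!/bin/bash
# Hypothetical helper: print a keyring for osd.<id>, reusing the key
# from an existing keyring file and adding the caps from the example.
make_osd_keyring() {
    local id="$1" src="$2" key
    # Take everything after "key = " on the first key line
    key=$(sed -n 's/^[[:space:]]*key[[:space:]]*=[[:space:]]*//p' "$src" | head -n 1)
    [ -n "$key" ] || { echo "no key found in $src" >&2; return 1; }
    printf '[osd.%s]\n' "$id"
    printf '\tkey = %s\n' "$key"
    printf '\tcaps mgr = "allow profile osd"\n'
    printf '\tcaps mon = "allow profile osd"\n'
    printf '\tcaps osd = "allow *"\n'
}
```

For a single OSD this could look like make_osd_keyring 0 /var/lib/ceph/$FSID/osd.0/keyring > osd.0.keyring, followed by ceph auth import -i osd.0.keyring. Have a look at the generated file before importing it.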

Activate the OSDs

There’s a wrapper for cephadm to activate existing OSDs. It covers a slightly different scenario: when an OSD host fails due to a broken operating system, its OSDs can be re-activated after reinstalling the operating system. In that case, after all prerequisites are met, you can just run:

ceph cephadm osd activate <host>

But chances are high that this approach won’t work in a scenario like we’re facing here (total mon store loss). So a more manual workaround like the following might be required.

Now assuming that the previously described steps worked and errors were fixed, we can try starting OSDs. You should check the contents of the OSD directories first and see if the most important files are present:

root@squid1:~# ls -l /var/lib/ceph/$FSID/osd.0/
total 68
lrwxrwxrwx 1 167 167 93 Jun 19 14:56 block -> /dev/ceph-3d3a6efd-02d7-4abf-ae23-17bc239feb1e/osd-block-8012c0dc-cb7b-4179-908c-df0a94a4227b
-rw-r--r-- 1 167 167 37 Jun 19 14:56 ceph_fsid
-rw------- 1 167 167 181 Jun 19 14:58 config
-rw-r--r-- 1 167 167 37 Jun 19 14:56 fsid
-rw------- 1 167 167 142 Jun 19 14:58 keyring
-rw------- 1 167 167 1875 Jun 19 14:33 unit.poststop
-rw------- 1 167 167 3457 Jun 19 14:33 unit.run
-rw-r--r-- 1 167 167 2 Jun 19 14:56 whoami

Without these files the OSDs will most likely refuse to start. Now let’s give it a try:

# systemd
root@squid1:~# systemctl start ceph-$FSID@osd.0.service

or via

# cephadm
root@squid1:~# cephadm unit start --name osd.0

Alternatively, if the orchestrator works (you could try to deploy node-exporter or other daemons to confirm), you can also just redeploy an OSD:

root@squid1:~# ceph orch daemon redeploy osd.X

This will recreate missing files in the OSD’s directory and hopefully start the OSD successfully.

Inspect the OSD logs after starting them and look out for error messages if they fail to start:

root@squid1:~# journalctl -fu ceph-$FSID@osd.X.service

If you enabled log_to_file you can follow the log file like this:

root@squid1:~# tail -F /var/log/ceph/$FSID/ceph-osd.X.log

More ToDo’s

Let’s assume that we could successfully start the OSDs and recovery is progressing. Now we need to access the data. For RBD pools there’s (probably) not much you need to do: as soon as clients can access the OSDs again, they should be able to map/attach/access the RBD images.

For CephFS it’s a different story. Since the MDS maps are lost as well, all filesystems will have to be recreated, including the MDS daemons. The Ceph docs cover this part. Read the instructions very carefully! The required commands are:

root@squid1:~# ceph fs new <fs_name> <metadata_pool> <data_pool> --force --recover

root@squid1:~# ceph fs set <fs_name> joinable true

Since there are no existing MDS daemons yet, you’ll need to deploy them, either via a YAML file:

root@squid1:~# ceph orch apply -i mds.yaml

or directly in the command line:

root@squid1:~# ceph orch apply mds <fs_name> --placement="label:mds"

Check out the docs for more information on MDS deployment.
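In case you don’t have your old mds.yaml anymore, a minimal spec can be generated with a sketch like this one. It mirrors the label-based placement from the command above; the function name write_mds_spec and the filesystem name “cephfs” in the usage example are placeholders.

```shell
#!/bin/bash
# Hypothetical helper: print a minimal cephadm service spec for the MDS
# daemons of one filesystem, placed on hosts with the given label.
write_mds_spec() {
    local fs_name="$1" label="${2:-mds}"
    cat <<EOF
service_type: mds
service_id: ${fs_name}
placement:
  label: ${label}
EOF
}
```

For example: write_mds_spec cephfs mds > mds.yaml, then apply the file with the ceph orch apply -i command shown above. The hosts need to carry the label for the placement to match anything.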

I won’t go into more details; this post is long enough as it is. But you’ll need to deploy all required daemons for all the services you had running before the crash. So you’ll have to recreate your YAML files for all the services like RGW, Monitoring, NVMe-oF, etc. The existing service specs in a recovered cluster look like this (when no services have been managed yet):

root@squid1:~# ceph orch ls
NAME  PORTS  RUNNING  REFRESHED  AGE  PLACEMENT     
mgr              2/0  9m ago     -    <unmanaged>   
mon              1/0  9m ago     -    <unmanaged>   
osd                0  9m ago     -    <unmanaged>

So you’ll need to update those as well to match your actual cluster configuration. That’s why I recommend regularly backing up your cluster configuration, auth keyrings, service specs and so on, so you don’t have to recreate everything from scratch in such a failure scenario.
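Such a backup doesn’t have to be fancy. The following sketch dumps the config database, all auth entries and the service specs into a directory, using ceph config dump, ceph auth export and ceph orch ls --export. The function name backup_cluster_metadata and the file names are made up, and of course this only works while the cluster is still healthy.

```shell
#!/bin/bash
# Hypothetical helper: write config, auth entries and service specs to
# OUTDIR. Runs in a subshell so the restrictive umask stays local; the
# export contains secret keys and must not be world-readable.
backup_cluster_metadata() (
    outdir="$1"
    umask 077
    mkdir -p "$outdir" || exit 1
    ceph config dump      > "$outdir/config-dump.txt"     || exit 1
    ceph auth export      > "$outdir/auth-export.keyring" || exit 1
    ceph orch ls --export > "$outdir/service-specs.yaml"  || exit 1
)
```

A call like backup_cluster_metadata /root/ceph-backup/$(date +%F) from a cron job or systemd timer would do. Store the result somewhere outside of the cluster.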

After the recovery has been completed, the cluster status looks like this:

squid1:~ # ceph -s
 cluster:
   id:     df500e26-1930-11f1-a79f-fa163e65a168
   health: HEALTH_OK
 
 services:
   mon: 3 daemons, quorum squid2,squid3,squid1 (age 80m)
   mgr: squid3.fepqsu(active, since 62m), standbys: squid1.pcgzhi
   mds: 2/2 daemons up, 3 standby
   osd: 6 osds: 6 up (since 65m), 6 in (since 65m)
 
 data:
   volumes: 2/2 healthy
   pools:   7 pools, 385 pgs
   objects: 49 objects, 4.3 MiB
   usage:   421 MiB used, 60 GiB / 60 GiB avail
   pgs:     385 active+clean

Potential obstacles

There are more things that can go wrong while trying to re-activate the OSDs. This is not an exhaustive list, of course, but it can help in specific scenarios.

Missing OSD path contents

As already mentioned above, the contents of the OSD directories need to be present and matching. If OSDs don’t start, the logs might point to permission issues, missing files (unit.poststop, type, and so on), a broken symbolic link to the OSD block device or other failures.
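A quick sanity check of an OSD directory can be scripted. This sketch tests for the files from the osd.0 listing in this post and for a dangling block symlink. The file list is an assumption based on that listing, so your OSDs may need additional files; the function name check_osd_dir is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: report missing files and a broken "block" symlink
# in an OSD directory. Returns 0 if nothing is wrong, 1 otherwise.
check_osd_dir() {
    local dir="$1" f rc=0
    # File list taken from the osd.0 directory listing (assumption)
    for f in block ceph_fsid config fsid keyring unit.run unit.poststop whoami; do
        if [ ! -e "$dir/$f" ] && [ ! -L "$dir/$f" ]; then
            echo "missing: $f"
            rc=1
        fi
    done
    # -L is true for any symlink, -e follows it: a dangling link fails -e
    if [ -L "$dir/block" ] && [ ! -e "$dir/block" ]; then
        echo "broken symlink: block"
        rc=1
    fi
    return "$rc"
}
```

Run it per OSD, for example check_osd_dir /var/lib/ceph/$FSID/osd.0. It doesn’t check ownership or file contents, so a clean result is only a first hint.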

Ceph Dashboard

Most likely you’ll have to enable the dashboard again since all configs are gone:

root@squid1:~# ceph mgr module enable dashboard

After that, you’ll have to recreate users, roles etc., whatever you had configured before.

Packages

One of the issues during OSD activation in this case was the presence of the ceph-osd package on at least one of the hosts. Having that package installed on a cephadm-managed cluster is known to be problematic, so make sure the ceph-osd package is removed. Basically, on a cephadm-managed cluster you only require the cephadm package (although the orchestrator uses a binary stored at /var/lib/ceph/$FSID/cephadm.{some_digest} to remotely execute commands) and, for convenience, the ceph-common package. Other packages like ceph-mon, ceph-mgr etc. are not required and should also be removed.

ceph-volume

As I already pointed out before, don’t use the bare ceph-volume commands to try and fix things when OSDs are managed by the orchestrator. I recommend removing the ceph-volume package as well, if present. In this case, attempts to fix the OSDs with it made things worse and caused more chaos, leaving more things to clean up.
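To find leftover packages on a host, a simple filter over the list of installed package names is enough. This sketch only looks for the packages named in this post (ceph-osd, ceph-mon, ceph-mgr and ceph-volume); the function name find_unwanted_ceph_packages is made up.

```shell
#!/bin/bash
# Hypothetical helper: read package names on stdin (one per line) and
# print those that should not be installed on a cephadm-managed host.
# Prints nothing (and still returns 0) if none of them is present.
find_unwanted_ceph_packages() {
    grep -x -E 'ceph-(osd|mon|mgr|volume)' || true
}
```

On Debian or Ubuntu the input could come from dpkg-query -W -f='${Package}\n', on RPM-based systems from rpm -qa --qf '%{NAME}\n'. Note that dpkg-query may also list removed packages whose config files are still present, so double-check before acting on the output.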

Container images

As mentioned before, there’s a hard-coded default image (docker.io/ceph/daemon-base:latest-master-devel) which can make life harder. Make sure to configure the correct Ceph container_image as well as the images for other daemons (like nvmeof, grafana, prometheus etc.):

root@squid1:~# ceph config ls | grep container_image
container_image
mgr/cephadm/container_image_alertmanager
mgr/cephadm/container_image_base
mgr/cephadm/container_image_elasticsearch
mgr/cephadm/container_image_grafana
mgr/cephadm/container_image_haproxy
mgr/cephadm/container_image_jaeger_agent
mgr/cephadm/container_image_jaeger_collector
mgr/cephadm/container_image_jaeger_query
mgr/cephadm/container_image_keepalived
mgr/cephadm/container_image_loki
mgr/cephadm/container_image_node_exporter
mgr/cephadm/container_image_nvmeof
mgr/cephadm/container_image_prometheus
mgr/cephadm/container_image_promtail
mgr/cephadm/container_image_samba
mgr/cephadm/container_image_snmp_gateway

Disclaimer: The procedure above was tested in two lab environments in order to write this blog post, but the individual steps were executed in a real cluster. It worked for me/us but it might not work for you. If anything goes wrong it’s not my fault, but yours. And it’s yours to fix it!

If you have suggestions on how to improve the procedure, or feedback in case you had to use it, I’d appreciate a comment!
