Cephadm: Recover mon store using OSDs

A few weeks ago I helped a Ceph user to recover his broken cluster (see this thread). Basically, after his monitors stopped working he re-deployed a new cluster with the same Ceph FSID and attached the existing OSDs to the re-deployed hosts. But it’s not that easy to re-activate those OSDs because the new monitors don’t have the old osdmap, hence they don’t know anything about the existing OSDs. So how did we fix that?

This scenario can be considered a total monitor store failure. There’s a documented procedure in the upstream docs, but it’s written for non-cephadm clusters and doesn’t go into much detail about the subsequent steps. If the daemons run within containers, there are more things to consider. So I decided to write this blog post and add some more details to the procedure, mainly targeting clusters managed by cephadm. The recovery doesn’t only require cephadm-specific commands, though, so it can also serve as a general guideline on how to recover from a mon store loss, just with some extra details about cephadm deployments.

This procedure only works for non-encrypted OSDs. If you are using dm-encrypted OSDs, make sure you have a backup of the dm-crypt keys, otherwise your data is lost! There is some development in this area, though: a user in Slack pointed me to this PR which will add a backup mechanism to the monitors, probably available in the Umbrella release. Note that the PR also contains this statement:

Monitor backups complement, but do not replace, the existing Monitor
recovery procedures.

The procedure covered by said docs is already written in script form, so I used it as a template, extended it for cephadm usage and included some very basic logging. It collects the osdmaps from all OSDs and contains the necessary considerations regarding containers. I decided to only automate the osdmap collection, not all required steps of the store rebuild procedure (e.g. mon store rebuild, client auth, etc.), because the result of the collection has to be inspected carefully before proceeding. Here’s the script I used on one of my lab clusters to collect the osdmaps:

cat /usr/local/bin/ceph-collect-osdmaps.sh  
#!/bin/bash

: <<'END'
This script does:
1. Login to each OSD host
2. Create a temporary mon-store in /tmp after cleaning it up 
3. Create a temporary mon-store directory within each OSD's filesystem (which is then mapped into the cephadm shell) after cleaning it up
4. Collect the osdmaps from each OSD and sync it to each host's temporary mon-store  
5. Sync each host's temporary mon-store to the central mon-store
END

# List of all OSD hosts
hosts="squid1 squid2 squid3"

# Ensure this user has passwordless sudo on all OSD hosts (if not root)
# It's also easier if the user has passwordless login, e.g. via ssh public key
user="root"

# Cluster FSID
ceph_fsid="df500e26-1930-11f1-a79f-fa163e65a168"

# Mon store and log directories
ms_central="/tmp/mon-store-central"
log_central="/tmp/central-osd-logs"
logdir="/tmp/logdir"
ms_collected="/tmp/mon-store-collected"

# Ensure all OSDs are stopped before proceeding

rm -rf $ms_central
rm -rf $log_central
mkdir -p $log_central

for host in $hosts; do
ssh $user@$host <<EOF
logdir="/tmp/logdir"
ms_collected="/tmp/mon-store-collected"
ceph_fsid="df500e26-1930-11f1-a79f-fa163e65a168"
osd_dir="/var/lib/ceph/\$ceph_fsid"
rm -rf \$ms_collected
rm -rf \$logdir
mkdir -p \$logdir
echo -e "\nHost: \$(hostname -s)"

for osd in \$(cephadm ceph-volume lvm list --format json 2>/dev/null | jq -r '.[][].tags | ."ceph.osd_id"'); do
echo -e "\nProcessing osd.\$osd ..."
rm -rf \$osd_dir/osd.\$osd/mon-store-collect
mkdir -p \$osd_dir/osd.\$osd/mon-store-collect

cephadm shell --name osd.\$osd -- ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-\$osd --no-mon-config --op update-mon-db --mon-store-path /var/lib/ceph/osd/ceph-\$osd/mon-store-collect 2> >(tee \$logdir/osd.\$osd.\$(hostname -s).log | grep -vEi "quay|infer|using|creating")

rsync -qaz \$osd_dir/osd.\$osd/mon-store-collect/. \$ms_collected
done
EOF
rsync -qaz -e ssh $user@$host:$ms_collected/. $ms_central
rsync -qav -e ssh $user@$host:$logdir/. $log_central
done

If not all osdmaps are collected successfully, the rebuilt monitor store will be incomplete and some OSDs won’t be able to join the cluster. And there are many more potential obstacles before the OSDs can be activated, so it’s critical to review the logs from each host after the osdmap collection script has finished (in case of errors) and then continue step by step.

Collect the osdmaps

There are a couple of assumptions made in the script which you will need to adapt to your environment, for example the list of hosts, the Ceph FSID and the user who logs in to each host. All OSDs have to be stopped: if there are still running OSD processes, the ceph-objectstore-tool will fail during the collection. I recommend stopping all other daemons (mgr, mds, ceph-exporter, etc.) as well before starting the recovery, otherwise you might get stuck with a stale mgr.

The procedure has been tested on two different Squid lab clusters, one on 19.2.3, the other one on 19.2.4. Here’s an example of the script output:

Host: squid1

Processing osd.1 ...
osd.1   : 0 osdmaps trimmed, 226 osdmaps added.

Processing osd.2 ...
osd.2   : 0 osdmaps trimmed, 228 osdmaps added.

Processing osd.0 ...
osd.0   : 0 osdmaps trimmed, 223 osdmaps added.
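With more than a handful of OSDs it helps to check the collection result programmatically instead of scrolling through the output. The following is a minimal Python sketch, assuming the per-OSD summary line looks exactly like in the output above. The function name summarize and the list of expected OSD IDs are hypothetical and not part of any Ceph tool.

```python
import re

# Matches the per-OSD summary line, e.g.
# "osd.1   : 0 osdmaps trimmed, 226 osdmaps added."
SUMMARY = re.compile(r"^osd\.(\d+)\s*:\s*(\d+) osdmaps trimmed, (\d+) osdmaps added\.")

def summarize(output, expected_osds):
    """Return ({osd_id: osdmaps_added}, sorted OSD ids without a summary line)."""
    added = {}
    for line in output.splitlines():
        m = SUMMARY.match(line.strip())
        if m:
            added[int(m.group(1))] = int(m.group(3))
    missing = sorted(set(expected_osds) - set(added))
    return added, missing

sample = """\
Host: squid1

Processing osd.1 ...
osd.1   : 0 osdmaps trimmed, 226 osdmaps added.

Processing osd.2 ...
osd.2   : 0 osdmaps trimmed, 228 osdmaps added.

Processing osd.0 ...
osd.0   : 0 osdmaps trimmed, 223 osdmaps added.
"""

# Assumption: the cluster has six OSDs with the ids 0..5
added, missing = summarize(sample, expected_osds=range(6))
print(added)    # {1: 226, 2: 228, 0: 223}
print(missing)  # [3, 4, 5]
```

Any OSD that ends up in the missing list deserves a look at its log file before you continue with the rebuild.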

Ceph FSID

Depending on how much data is actually lost, there are several ways to retrieve the Ceph FSID (to keep the output brief, I use $FSID instead of the actual UUID):

  • ceph.conf: If there’s any ceph.conf left on any host (clients also have a ceph.conf), it will contain the FSID. Alternatively, since every daemon also has a copy of the ceph.conf, check any daemon config file in /var/lib/ceph/$FSID/osd.X/config if the filesystem is intact.
  • LV tags: On any OSD host run the command and look for the tag ceph.cluster_fsid=$FSID:
    lvs -o tags
  • ceph-volume: On any OSD host run the command:
    cephadm ceph-volume lvm list | grep "cluster fsid"

Only run “bare” ceph-volume commands if your cluster is not managed by cephadm! Otherwise you’ll most likely end up with more chaos!

OSD FSID

In some failure scenarios you might need the FSID of each OSD to rebuild an OSD’s directory structure. To collect those, you can also use ceph-volume or LV tags:

  • LV tags: On any OSD host run the command and look for the tag ceph.osd_fsid=$FSID:
    lvs -o tags
  • ceph-volume: On any OSD host run the command:
    cephadm ceph-volume lvm list | grep "osd fsid"

Only run “bare” ceph-volume commands if your cluster is not managed by cephadm! Otherwise you’ll most likely end up with more chaos!
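If you need the cluster FSID and the OSD FSIDs for many OSDs, the LV tags can be parsed instead of read by eye. This is a minimal Python sketch, assuming the tags are printed as one comma-separated key=value string per LV. The helper parse_lv_tags is hypothetical, and the sample line only reuses UUIDs from this post for illustration.

```python
def parse_lv_tags(tag_line):
    """Turn one comma-separated LV tag string into a dict."""
    tags = {}
    for item in tag_line.strip().split(","):
        if "=" in item:
            key, value = item.split("=", 1)
            tags[key] = value
    return tags

# Hypothetical sample line, shortened to the tags of interest
sample = (
    "ceph.cluster_fsid=df500e26-1930-11f1-a79f-fa163e65a168,"
    "ceph.osd_fsid=8012c0dc-cb7b-4179-908c-df0a94a4227b,"
    "ceph.osd_id=0,ceph.type=block"
)
tags = parse_lv_tags(sample)
print(tags["ceph.osd_id"], tags["ceph.osd_fsid"], tags["ceph.cluster_fsid"])
```

Feeding it each line of the lvs -o tags output gives you a mapping from OSD ID to OSD FSID per host.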

SSH user

The easiest way is to have a user that can log in to each host without a password and has (passwordless) sudo permissions to execute the necessary commands. Depending on your Ceph deployment, you could log in as the cephadm user, which usually has the mentioned permissions. Often the root user is configured as the cephadm user, as it is in the lab cluster in which I tested this procedure.

Rebuild the monitor store

Assuming the collection of osdmaps was successful and no errors were logged, you can now rebuild the store. The hostname of the first monitor host in this example is squid1.

# Create a backup of the current store (just in case)
root@squid1:~# rsync -a /var/lib/ceph/$FSID/mon.squid1/store.db/ /root/mon-store-backup/

# Ensure all daemons are stopped!

# Remove current mon-store
root@squid1:~# rm -rf /var/lib/ceph/$FSID/mon.squid1/store.db/

# Copy collected store.db from your management node (where you ran the collection script) to the first monitor's store.db
rsync -av -e ssh /tmp/mon-store-central/store.db/ root@squid1:/var/lib/ceph/$FSID/mon.squid1/store.db/

# Rebuild mon-store
root@squid1:~# cephadm shell --name mon.squid1 -- ceph-monstore-tool /var/lib/ceph/mon/ceph-squid1/ rebuild  -- --keyring /etc/ceph/ceph.keyring --mon-ids squid1 squid2 squid3
...
epoch 0
fsid df500e26-1930-11f1-a79f-fa163e65a168
last_changed 2026-06-27T12:18:38.557560+0000
created 2026-06-27T12:18:38.557560+0000
min_mon_release 0 (unknown)
election_strategy: 1
0: [v2:192.168.124.107:3300/0,v1:192.168.124.107:6789/0] mon.squid2
1: [v2:192.168.124.222:3300/0,v1:192.168.124.222:6789/0] mon.squid3
2: [v2:192.168.124.230:3300/0,v1:192.168.124.230:6789/0] mon.squid1

### Note: you need to specify all monitors in the correct order (as mentioned in the docs)!

# Change ownership 
root@squid1:~# chown -R 167:167 /var/lib/ceph/$FSID/mon.squid1/store.db/

Start the monitors

If all above steps were successful, you can try and start the first monitor, either with systemd or cephadm or podman/docker. This also depends on the actual situation. So one of these commands should work:

root@squid1:~# systemctl start ceph-$FSID@mon.squid1.service 

root@squid1:~# cephadm unit start --name mon.squid1

root@squid1:~# podman start ... / docker start ...

# Now check the monitor log to see if it starts successfully
root@squid1:~# journalctl -fu ceph-$FSID@mon.squid1.service

If the first monitor starts, you should copy the contents of its store.db directory to the other monitors, ensure the correct ownership and try starting them as well. Inspect their logs if something goes wrong, and if it does it has to be fixed before continuing. You can also simply repeat the rsync and rebuild steps to recreate the mon stores on the remaining monitors.

If the monitors start you should be able to see a Ceph status now (ceph -s). The Ceph status should now contain the OSDs in the output (just an excerpt):

ceph -s
  cluster:
    id:     df500e26-1930-11f1-a79f-fa163e65a168
    health: HEALTH_WARN
            mons are allowing insecure global_id reclaim
            no active mgr

  services:
    mon: 3 daemons, quorum squid2,squid3,squid1 (age 57s)
    mgr: no daemons active
    osd: 6 osds: 1 up (since 52m), 6 in (since 54m)

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs:

As you can see, the monitors are now aware of 6 OSDs (which is correct). Only the “pools”, “objects” and “usage” output is not correct yet since the OSDs are not up and haven’t been able to report to the MONs and MGRs.

Before starting OSDs

There are some more steps required before you can activate the OSDs. For example, all custom configs are gone, auth entries are missing, the container_image config value has been reset, and the orchestrator doesn’t work, it has to be re-enabled first. Not all of these steps have to be performed necessarily before activating the OSDs, but some of them are crucial and others can make life easier when activating the OSDs.

# Disable auth_allow_insecure_global_id_reclaim (to clear the warning)
root@squid1:~# ceph config set mon auth_allow_insecure_global_id_reclaim false

# Recreate and import at least one mgr auth keyring to enable the orchestrator
root@squid1:~# cat mgr.squid1.tkgxvc.keyring
[mgr.squid1.tkgxvc]
key = AQDjJhtqCOWnFRAA4/QHkFo2tnWFG8m37kM5HQ==
caps mds = allow *
caps mon = profile mgr
caps osd = allow *

root@squid1:~# ceph auth import -i mgr.squid1.tkgxvc.keyring

# Recreate and import all OSDs keyrings
# If present, use key from /var/lib/ceph/$FSID/osd.$id/keyring to recreate an OSD keyring
root@squid1:~# cat osd.0.keyring
[osd.0]
key = AQBzGxtqPUu8EBAAC0BqZMcfRJb5ZuGtXFgBCA==
caps mgr = allow profile osd
caps mon = allow profile osd
caps osd = allow *

root@squid1:~# ceph auth import -i osd.0.keyring

# Check ceph auth for completeness
root@squid1:~# ceph auth ls

# Start at least one manager daemon to be able to manage the cluster again
root@squid1:~# systemctl start ceph-$FSID@mgr.squid1.tkgxvc.service

# Set the public network(s), otherwise OSDs will refuse to start
root@squid1:~# ceph config set global public_network 192.168.124.0/24

# Enable cephadm module
root@squid1:~# ceph mgr module enable cephadm

# Set orchestrator backend
root@squid1:~# ceph orch set backend cephadm

# Recreate cephadm user/keyring
root@squid1:~# ceph cephadm generate-key

# Get new public key
root@squid1:~# ceph cephadm get-pub-key

# Add new public key to each host's authorized_keys (usually of the root user)

# Re-add hosts to the orchestrator
root@squid1:~# ceph orch host add squid1 192.168.124.230
root@squid1:~# ceph orch host add squid2 192.168.124.107
root@squid1:~# ceph orch host add squid3 192.168.124.222

# Add labels if necessary

# Configure container_image (choose the correct version for your cluster)
root@squid1:~# ceph config set global container_image quay.io/ceph/ceph:v19.2.4

If you don’t change the container_image, some daemons most likely will be deployed using the hard-coded default:

docker.io/ceph/daemon-base:latest-master-devel

Not changing that image could lead to issues when redeploying other daemons such as MDS.
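Writing the OSD keyring files by hand gets tedious with many OSDs. Here is a minimal Python sketch of how they could be generated, assuming the key is still present in /var/lib/ceph/$FSID/osd.$id/keyring and that the caps shown above for osd.0 apply to every OSD. The helper build_osd_keyring is hypothetical, and the result still has to be imported with ceph auth import.

```python
import re

def build_osd_keyring(osd_id, existing_keyring_text):
    """Build an importable keyring for osd.<osd_id> from its on-disk keyring."""
    m = re.search(r"^\s*key\s*=\s*(\S+)\s*$", existing_keyring_text, re.MULTILINE)
    if not m:
        raise ValueError(f"no key found for osd.{osd_id}")
    # Caps copied from the osd.0 example in this post
    return (
        f"[osd.{osd_id}]\n"
        f"key = {m.group(1)}\n"
        "caps mgr = allow profile osd\n"
        "caps mon = allow profile osd\n"
        "caps osd = allow *\n"
    )

# Assumed content of an on-disk OSD keyring (key taken from the example above)
on_disk = "[osd.0]\nkey = AQBzGxtqPUu8EBAAC0BqZMcfRJb5ZuGtXFgBCA==\n"
print(build_osd_keyring(0, on_disk))
```

Review the generated files before importing them, the key material has to match what the OSD itself uses.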

Activate the OSDs

There’s a wrapper for cephadm to activate existing OSDs. It covers a slightly different scenario: when an OSD host fails due to a broken operating system, its OSDs can be re-activated after reinstalling the operating system. In that case, after all prerequisites are met, you can just run:

ceph cephadm osd activate <host>

But chances are high that this approach won’t work in a scenario like we’re facing here (total mon store loss). So a more manual workaround like the following might be required.

Now assuming that the previously described steps worked and errors were fixed, we can try starting OSDs. You should check the contents of the OSD directories first and see if the most important files are present:

root@squid1:~# ls -l /var/lib/ceph/$FSID/osd.0/
total 68
lrwxrwxrwx 1 167 167 93 Jun 19 14:56 block -> /dev/ceph-3d3a6efd-02d7-4abf-ae23-17bc239feb1e/osd-block-8012c0dc-cb7b-4179-908c-df0a94a4227b
-rw-r--r-- 1 167 167 37 Jun 19 14:56 ceph_fsid
-rw------- 1 167 167 181 Jun 19 14:58 config
-rw-r--r-- 1 167 167 37 Jun 19 14:56 fsid
-rw------- 1 167 167 142 Jun 19 14:58 keyring
-rw------- 1 167 167 1875 Jun 19 14:33 unit.poststop
-rw------- 1 167 167 3457 Jun 19 14:33 unit.run
-rw-r--r-- 1 167 167 2 Jun 19 14:56 whoami

Without these files the OSDs will most likely refuse to start. Now let’s give it a try:

# systemd
root@squid1:~# systemctl start ceph-$FSID@osd.0.service

or via

# cephadm
root@squid1:~# cephadm unit start --name osd.0

Alternatively, if the orchestrator works (you could try to deploy node-exporter or other daemons to confirm), you can also just redeploy an OSD:

root@squid1:~# ceph orch daemon redeploy osd.X

This will recreate missing files in the OSD’s directory and hopefully start the OSD successfully.

Inspect the OSD logs after starting them and look out for error messages if they fail to start:

root@squid1:~# journalctl -fu ceph-$FSID@osd.X.service

If you enabled log_to_file you can follow the log file like this:

root@squid1:~# tail -F /var/log/ceph/$FSID/ceph-osd.X.log

More ToDo’s

Let’s assume that we could successfully start the OSDs and recovery is progressing. Now we need to access the data. For RBD pools there’s (probably) not much you need to do: as soon as clients can access the OSDs again, they should be able to map/attach/access the RBD images.

For CephFS it’s a different story. Since the MDS maps are lost as well, all filesystems will have to be recreated, including the MDS daemons. The Ceph docs cover this part. Read the instructions very carefully! The required commands are:

root@squid1:~# ceph fs new <fs_name> <metadata_pool> <data_pool> --force --recover

root@squid1:~# ceph fs set <fs_name> joinable true

Since there are no MDS daemons yet, you’ll need to deploy them either via a .yaml file:

root@squid1:~# ceph orch apply -i mds.yaml

or directly in the command line:

root@squid1:~# ceph orch apply mds <fs_name> --placement="label:mds"

Check out the docs for more information on MDS deployment.

I won’t go into more details, this post is long enough as it is. But you’ll need to deploy all required daemons for all the services you had running before the crash. So you’ll have to recreate your yaml files for all the services like RGW, Monitoring, NVMe-oF, etc. The existing service specs in a recovered cluster look like this (when no services have been managed yet):

root@squid1:~# ceph orch ls
NAME  PORTS  RUNNING  REFRESHED  AGE  PLACEMENT     
mgr              2/0  9m ago     -    <unmanaged>   
mon              1/0  9m ago     -    <unmanaged>   
osd                0  9m ago     -    <unmanaged>

So you’ll need to update those as well to match your actual cluster configuration. That’s why I recommend regularly backing up your cluster configuration, auth keyrings, service specs and so on, so you don’t have to recreate everything from scratch in such a failure scenario.
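Such a backup could be as simple as dumping a few command outputs to files on a regular basis. The following Python sketch shows one possible way, assuming it runs on a node with a working ceph CLI and admin keyring. The file names and the functions backup_commands and run_backup are hypothetical choices, and the list of commands is certainly not complete for every cluster.

```python
import subprocess
from pathlib import Path

def backup_commands():
    """Map output file names to the ceph CLI calls that produce them."""
    return {
        "config-dump.txt": ["ceph", "config", "dump"],
        "auth-export.keyring": ["ceph", "auth", "export"],
        "service-specs.yaml": ["ceph", "orch", "ls", "--export"],
        "hosts.txt": ["ceph", "orch", "host", "ls"],
    }

def run_backup(outdir):
    """Run every command and store its stdout below outdir."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    for name, argv in backup_commands().items():
        result = subprocess.run(argv, check=True, capture_output=True, text=True)
        target = out / name
        target.write_text(result.stdout)
        # The auth export contains every key of the cluster
        target.chmod(0o600)

# Only show what would be executed; run_backup() needs a reachable cluster
for name, argv in backup_commands().items():
    print(name, "<-", " ".join(argv))
```

Calling run_backup with a target directory of your choice from a cron job or systemd timer would do the actual work. Keep the result somewhere outside the cluster, and treat it as a secret.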

After the recovery has been completed, the cluster status looks like this:

squid1:~ # ceph -s
 cluster:
   id:     df500e26-1930-11f1-a79f-fa163e65a168
   health: HEALTH_OK
 
 services:
   mon: 3 daemons, quorum squid2,squid3,squid1 (age 80m)
   mgr: squid3.fepqsu(active, since 62m), standbys: squid1.pcgzhi
   mds: 2/2 daemons up, 3 standby
   osd: 6 osds: 6 up (since 65m), 6 in (since 65m)
 
 data:
   volumes: 2/2 healthy
   pools:   7 pools, 385 pgs
   objects: 49 objects, 4.3 MiB
   usage:   421 MiB used, 60 GiB / 60 GiB avail
   pgs:     385 active+clean

Potential obstacles

There are more things that can go wrong while trying to re-activate the OSDs. This is not an exhaustive list, of course, but it can help in specific scenarios.

Missing OSD path contents

As already mentioned above, the contents of the OSD directories need to be present and matching. If OSDs don’t start, the logs might point to permission issues, missing files (unit.poststop, type, and so on), a broken symbolic link to the OSD block device or other failures.
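A quick check of an OSD directory can save some time before digging through the logs. This is a minimal Python sketch, assuming the file list from the osd.0 directory listing earlier in this post is what your OSDs need as well. The helper check_osd_dir is hypothetical, it does not verify ownership or file contents, and the demo runs on a throwaway directory instead of a real OSD path.

```python
import os
import tempfile

# Files seen in the osd.0 directory listing earlier in this post
EXPECTED = ["block", "ceph_fsid", "config", "fsid", "keyring",
            "unit.poststop", "unit.run", "whoami"]

def check_osd_dir(path):
    """Return a list of human-readable problems found in an OSD directory."""
    problems = []
    for name in EXPECTED:
        full = os.path.join(path, name)
        if not os.path.lexists(full):
            problems.append(f"missing: {name}")
        elif os.path.islink(full) and not os.path.exists(full):
            # The link itself exists, but its target does not
            problems.append(f"broken symlink: {name}")
    return problems

# Demo: a directory with three files and a dangling "block" link
with tempfile.TemporaryDirectory() as d:
    for name in ("ceph_fsid", "fsid", "whoami"):
        open(os.path.join(d, name), "w").close()
    os.symlink(os.path.join(d, "no-such-device"), os.path.join(d, "block"))
    for problem in check_osd_dir(d):
        print(problem)
```

On a real host you would point it at /var/lib/ceph/$FSID/osd.X for each OSD.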

Ceph Dashboard

Most likely you’ll have to enable the dashboard again since all configs are gone:

root@squid1:~# ceph mgr module enable dashboard

After that, you’ll have to recreate users, roles etc., whatever you had configured before.

Packages

One of the issues during OSD activation in this case was the presence of the ceph-osd package on at least one of the hosts. It’s known to be problematic when having that package present on a cephadm-managed cluster. So make sure the ceph-osd package is removed. Basically, on a cephadm-managed cluster you only require a cephadm package (although the orchestrator uses a binary stored at /var/lib/ceph/$FSID/cephadm.{some_digest} to remotely execute commands) and for comfort reasons the ceph-common package. Other packages like ceph-mon, ceph-mgr etc. are not required and should also be removed.

ceph-volume

As I already pointed out, don’t use the bare ceph-volume commands to try and fix things when OSDs are managed by the orchestrator. I recommend removing the ceph-volume package as well, if present. In this case, attempts to fix the OSDs that way made things worse and caused more chaos, leaving more things to clean up.

Container images

As mentioned before, there’s a hard-coded default image (docker.io/ceph/daemon-base:latest-master-devel) which can make life harder. Make sure to configure the correct Ceph container_image as well as other daemons (like nvmeof, grafana, prometheus etc.):

root@squid1:~# ceph config ls | grep container_image
container_image
mgr/cephadm/container_image_alertmanager
mgr/cephadm/container_image_base
mgr/cephadm/container_image_elasticsearch
mgr/cephadm/container_image_grafana
mgr/cephadm/container_image_haproxy
mgr/cephadm/container_image_jaeger_agent
mgr/cephadm/container_image_jaeger_collector
mgr/cephadm/container_image_jaeger_query
mgr/cephadm/container_image_keepalived
mgr/cephadm/container_image_loki
mgr/cephadm/container_image_node_exporter
mgr/cephadm/container_image_nvmeof
mgr/cephadm/container_image_prometheus
mgr/cephadm/container_image_promtail
mgr/cephadm/container_image_samba
mgr/cephadm/container_image_snmp_gateway

Disclaimer: The procedure above was tested in two lab environments in order to write this blog post, but the individual steps were executed in a real cluster. It worked for me/us but it might not work for you. If anything goes wrong it’s not my fault, but yours. And it’s yours to fix it!

If you have suggestions how to improve the procedure or feedback in case you had to use it, I’d appreciate a comment!


Cephadm: Change OSD service specification

The Ceph “Tentacle” version (v20.2.0) was released a couple of weeks ago, and I just discovered a new command for the orchestrator:

tentacle:~ # ceph orch osd set-spec-affinity <new_spec> <osd_id>

But what is the purpose of this command?

One frequent question on the ceph-users mailing list or other channels is how to “move” OSDs to a different service specification. So let’s assume we have two different specs in place:

tentacle:~ # ceph orch ls osd
NAME            PORTS  RUNNING  REFRESHED  AGE  PLACEMENT   
osd.standalone               2  1s ago     7w   tentacle   
osd.test                     1  1s ago     42s  label:osd

Now how can we move the two OSDs from “osd.standalone” to “osd.test” without rebuilding OSDs (which is unnecessary)? My approach until now was to change the unit.meta file of every OSD manually:

# Use an editor of your choice
$ vi /var/lib/ceph/{FSID}/osd.{OSD_ID}/unit.meta  

# and change 
    "service_name": "osd.standalone",
# to 
    "service_name": "osd.test",

After a few minutes, the orchestrator will refresh its specs and update the output accordingly.
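For more than a couple of OSDs the manual edit can be scripted. Here is a minimal Python sketch of the same change, assuming unit.meta is a JSON file with a service_name key as shown above. The helper set_service_name is hypothetical and the sample content is heavily trimmed.

```python
import json

def set_service_name(unit_meta_text, new_service):
    """Return unit.meta content with service_name replaced, plus the old value."""
    meta = json.loads(unit_meta_text)
    old = meta.get("service_name")
    meta["service_name"] = new_service
    return json.dumps(meta, indent=4), old

# Trimmed, hypothetical unit.meta content; real files carry more keys
sample = '{"service_name": "osd.standalone", "ports": []}'
updated, old = set_service_name(sample, "osd.test")
print(old)                                  # osd.standalone
print(json.loads(updated)["service_name"])  # osd.test
```

Working on the parsed JSON instead of using search and replace keeps all other keys untouched and fails loudly if the file is not valid JSON.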

This process can now be executed via orchestrator (or alternatively, locally on a node via cephadm) as mentioned above:

tentacle:~ # ceph orch osd set-spec-affinity osd.test 0
Updated service for osd 0

This will replace "service_name": "osd.standalone" with "service_name": "osd.test" in the unit.meta file of osd.0.

The same can be done locally via cephadm, here for two OSDs at once:

tentacle:~ # cephadm update-osd-service --fsid {FSID} --osd-ids 1,2 --service-name osd.test
Inferring config /var/lib/ceph/{FSID}/mon.tentacle/config
Successfully updated daemon osd.1 with service osd.test
Successfully updated daemon osd.2 with service osd.test

And the result:

tentacle:~ # ceph orch ls osd --refresh
NAME            PORTS  RUNNING  REFRESHED  AGE  PLACEMENT   
osd.standalone               0  -          7w   label:osd   
osd.test                     3  1s ago     19h  label:osd


Cephadm: PG Imbalance

You get the most out of Ceph if you have a balanced cluster, meaning your PGs are well balanced across the OSDs. I’m not going into more detail about the advantages of a balanced cluster; there’s plenty of information in the docs or the mailing lists.

To ensure that a cluster is in a healthy state, Ceph provides a great set of monitoring options pointing you towards potential issues. Depending on which monitoring utilities you have enabled, you could be facing the “CephPGImbalance” alert. The “Prometheus” manager module is responsible for this alert.

Prometheus starts firing this alert if your PG count per OSD deviates by more than 30% for more than 5 minutes:

name: CephPGImbalance
expr: |
abs(((ceph_osd_numpg > 0) - on (job) group_left () avg by (job) (ceph_osd_numpg > 0)) / on (job) group_left () avg by (job) (ceph_osd_numpg > 0)) * on (ceph_daemon) group_left (hostname) ceph_osd_metadata > 0.3
for: 5m

But there’s a flaw in this approach: it relies on uniform OSD sizes. Although a Ceph operator’s life can be much easier having uniform OSD sizes, that’s just not the reality. Many (or even most?) clusters don’t have OSDs of the same size. In a cluster with different OSD sizes, the ratio of the number of PGs and the size of an OSD is better suited to show the occupancy of an OSD. And after a customer asked us for a fix other than adjusting the threshold, we came up with a workaround.
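A small worked example illustrates the flaw. The Python sketch below uses a made-up cluster with two 1 TiB and two 4 TiB OSDs whose PGs are distributed perfectly in proportion to their size. It only mimics the abs((x - avg) / avg) part of the alert expression, not the PromQL label matching.

```python
def relative_deviation(values):
    """abs((x - mean) / mean) for every value, as in the alert expression."""
    mean = sum(values) / len(values)
    return [abs((v - mean) / mean) for v in values]

# Hypothetical cluster: two 1 TiB and two 4 TiB OSDs, PGs proportional to size
pgs = [50, 50, 200, 200]
size_tib = [1, 1, 4, 4]

by_count = relative_deviation(pgs)
by_ratio = relative_deviation([p / s for p, s in zip(pgs, size_tib)])

print(by_count)  # [0.6, 0.6, 0.6, 0.6]
print(by_ratio)  # [0.0, 0.0, 0.0, 0.0]
```

Based on the plain PG count every OSD deviates by 60% from the average and would trigger the alert, although the cluster is as balanced as it can be. Based on PGs per TiB the deviation is zero.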

The main problem was to find out which existing metric we could utilize to reflect OSD sizes. After some digging in the MGR module, we decided to use osd_stats which is already being used for a different metric. Basically, we read the OSD total bytes and transform it into a TiB value, similar to what the Ceph code does to display the OSD crush weight (ceph osd tree). That’s why we called the new function get_osd_crush_weight. I’m gonna spare you more details, I’ll rather focus on how we implemented that change.
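The conversion itself is simple and can be tried outside of the MGR. This is a standalone Python sketch, assuming osd_stats contains a list of entries with the OSD ID and a statfs total in bytes. The function osd_sizes_tib and the sample input are hypothetical and only mirror the idea, not the actual module code.

```python
def osd_sizes_tib(osd_stats):
    """Map 'osd.<id>' to its total size in TiB."""
    return {
        "osd.{}".format(osd["osd"]): osd["statfs"]["total"] / (1 << 40)
        for osd in osd_stats["osd_stats"]
    }

# Hypothetical input with one 10 GiB and one 4 TiB OSD
sample = {"osd_stats": [
    {"osd": 0, "statfs": {"total": 10 * (1 << 30)}},
    {"osd": 1, "statfs": {"total": 4 * (1 << 40)}},
]}
print(osd_sizes_tib(sample))  # {'osd.0': 0.009765625, 'osd.1': 4.0}
```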

This approach works for both Pacific (16.2.13) and Reef (18.2.7); I haven’t tested it in other releases yet. Note that this is just a workaround and will most likely only work for your current version. You need to take care of the modifications before or after each Ceph upgrade since the MGR code will change.

Disclaimer: This approach has worked for us and our customers, it might not work for you. This procedure is for a containerized environment, managed by cephadm. But it should be easy to adapt to non-cephadm deployments.

There are two files to maintain for this to work (the file paths refer to the path within the daemon containers):

  • /usr/share/ceph/mgr/prometheus/module.py (mgr daemon)
  • /etc/prometheus/alerting/ceph_alerts.yml (prometheus daemon)

MGR module.py

There are three code blocks to add to the module.py file. Copy the original file from inside the container into the local filesystem (for example: /etc/ceph/prometheus_module_crush_weight.py.dist) to be able to modify it, then add these changes and save it as a new file (prometheus_module_crush_weight.py):

diff prometheus_module_crush_weight.py.dist prometheus_module_crush_weight.py
694a695,701
>         # add new metric to collect crush weight
>         metrics['osd_crush_weight'] = Metric(
>             'gauge',
>             'osd_crush_weight',
>             'Crush weight for Ceph daemons',
>             ('ceph_daemon',)
>         )
1109a1117,1126
>     # get OSD size in TB from osd_stats as an alternative to the actual crush weight
>     def get_osd_crush_weight(self) -> None:
>         osd_stats = self.get('osd_stats')
>         for osd in osd_stats['osd_stats']:
>             id_ = osd['osd']
>             val = osd['statfs']['total']/(1 << 40)
>             self.metrics['osd_crush_weight'].set(val, (
>                 'osd.{}'.format(id_),
>             ))
>  
1700a1718,1719
>         # get crush weight
>         self.get_osd_crush_weight()

The file /etc/ceph/prometheus_module_crush_weight.py needs to be present on all your MGR nodes. To apply this change, modify the mgr spec file:

# get current mgr spec
ceph orch ls mgr --export > mgr-config.yaml

# add extra_container_args
[...]
spec:
 extra_container_args:
   - -v=/etc/ceph/prometheus_module_crush_weight.py:/usr/share/ceph/mgr/prometheus/module.py:ro

# apply new mgr yaml (try with dry-run first)
ceph orch apply -i mgr-config.yaml (--dry-run)

This will redeploy the MGR daemons, look out for any errors in case they fail to start. You can always roll back to the defaults by removing the extra_container_args from the spec file.

If the MGRs successfully start, you should be able to see the new metric:

curl http://<MGR>:9283/metrics | grep crush
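If you want to verify the values instead of just grepping for the name, the metrics output can be parsed. The following is a minimal Python sketch, assuming the metric is exported as ceph_osd_crush_weight with a single ceph_daemon label. The sample text is a made-up excerpt and crush_weights is a hypothetical helper.

```python
import re

LINE = re.compile(r'^ceph_osd_crush_weight\{ceph_daemon="([^"]+)"\}\s+(\S+)$')

def crush_weights(metrics_text):
    """Extract {daemon: value} for the new metric from a /metrics response."""
    weights = {}
    for line in metrics_text.splitlines():
        m = LINE.match(line.strip())
        if m:
            weights[m.group(1)] = float(m.group(2))
    return weights

# Hypothetical excerpt of what the MGR endpoint might return
sample = """\
# HELP ceph_osd_crush_weight Crush weight for Ceph daemons
# TYPE ceph_osd_crush_weight gauge
ceph_osd_crush_weight{ceph_daemon="osd.0"} 0.009765625
ceph_osd_crush_weight{ceph_daemon="osd.1"} 0.009765625
"""
print(crush_weights(sample))
```

The values should roughly match the weights shown by ceph osd tree, and every OSD should have an entry.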

Now we need to get this metric into Prometheus.

Prometheus ceph_alerts.yml

This works similarly to the MGR service: create a custom alerts file, add it to the extra_container_args, then redeploy prometheus.

Copy the original alerts yaml file to /etc/ceph/prometheus_ceph_alerts.yml, then replace the original CephPGImbalance section with this one:

     - alert: "CephPGImbalanceCrushWeight"
       annotations:
         description: "OSD {{ $labels.ceph_daemon }} on {{ $labels.hostname }} deviates by more than 30% from average PG count, taking OSD size into account."
         summary: "PGs are not balanced across OSDs"
       expr: |
         abs((((ceph_osd_numpg > 0) / on (ceph_daemon) ceph_osd_crush_weight)- on (job) group_left avg((ceph_osd_numpg > 0)/ on (ceph_daemon) ceph_osd_crush_weight)by (job)) / on (job) group_left avg((ceph_osd_numpg > 0)/ on (ceph_daemon) ceph_osd_crush_weight)by (job)) * on (ceph_daemon) group_left(hostname)ceph_osd_metadata > 0.3
       for: "5m"

A co-worker came up with the formula, which is similar to the original one but uses the ratio of the number of PGs and the size of an OSD instead of the number of PGs of an OSD.

To apply this change, add this section to your prometheus spec file:

extra_container_args:
 - -v /etc/ceph/prometheus_ceph_alerts.yml:/etc/prometheus/alerting/ceph_alerts.yml:ro

It might be necessary to change file permissions for Prometheus to be able to read that file, so look out for the prometheus logs after applying that change:

ceph orch apply -i prometheus.yaml (--dry-run)
ceph orch redeploy prometheus

Since the alert has to be pending for 5 minutes before it fires, you’ll have to wait that long until you see if you still get alerts. If everything works as expected, you won’t see those alerts anymore, so how can you know if the formula works correctly? Just use the Prometheus web UI and inspect the query with a much smaller deviation: navigate to your Prometheus host, click on “Alerts”, scroll down to the new “CephPGImbalanceCrushWeight” alert and click on the expression. It will redirect you to the Graph panel where you can edit the query by reducing the default 0.3 to, say, 0.0001. Play around a bit with that threshold, and at some point you should see something other than “Empty result”, which would confirm that the expression is correct.
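The same test can be scripted against the Prometheus HTTP API instead of clicking through the web UI. This is a minimal Python sketch, assuming Prometheus is reachable without authentication. The host name, the port 9095 and the placeholder expression are assumptions, and the helpers with_threshold, query_url and matching_series are hypothetical.

```python
import json
import urllib.parse
import urllib.request

def with_threshold(expr, threshold):
    """Replace the trailing '> <number>' comparison of an alert expression."""
    head, _, _ = expr.rpartition(">")
    return f"{head.rstrip()} > {threshold}"

def query_url(base_url, expr):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": expr})
    return base_url.rstrip("/") + "/api/v1/query?" + query

def matching_series(base_url, expr):
    """Run the query and return the series that satisfy the expression."""
    with urllib.request.urlopen(query_url(base_url, expr), timeout=10) as resp:
        body = json.load(resp)
    return body["data"]["result"]

# Placeholder: paste the real alert expression from ceph_alerts.yml here
ALERT_EXPR = "abs(ratio_deviation) > 0.3"
lowered = with_threshold(ALERT_EXPR, 0.0001)
print(lowered)  # abs(ratio_deviation) > 0.0001
print(query_url("http://prometheus.example:9095", lowered))
```

Calling matching_series with your Prometheus URL and the lowered expression should return a non-empty list once the threshold is small enough.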
