Cephadm: PG Imbalance

You get the most out of Ceph if you have a balanced cluster, meaning your placement groups (PGs) are evenly distributed across the OSDs. I’m not going into more detail about the advantages of a balanced cluster; there’s plenty of information in the docs and on the mailing lists.

To help keep a cluster in a healthy state, Ceph provides a great set of monitoring options that point you towards potential issues. Depending on which monitoring utilities you have enabled, you could be facing the “CephPGImbalance” alert, which is based on metrics provided by the “prometheus” manager module.

Prometheus starts firing this alert if an OSD’s PG count deviates from the average by more than 30% for more than 5 minutes:

name: CephPGImbalance
expr: |
  abs(((ceph_osd_numpg > 0) - on (job) group_left () avg by (job) (ceph_osd_numpg > 0)) / on (job) group_left () avg by (job) (ceph_osd_numpg > 0)) * on (ceph_daemon) group_left (hostname) ceph_osd_metadata > 0.3
for: 5m

But there’s a flaw in this approach: it relies on uniform OSD sizes. Although a Ceph operator’s life is much easier with uniform OSD sizes, that’s just not the reality: many (or even most?) clusters have OSDs of different sizes. In a cluster with different OSD sizes, the ratio of an OSD’s PG count to its size is a better indicator of its occupancy than the PG count alone. And after a customer asked us for a fix other than simply raising the threshold, we came up with a workaround.
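If you want to get a feeling for that relationship in your own cluster first, the SIZE and PGS columns of the OSD utilization output already show it per OSD:

# compare the PGS column against the SIZE column per OSD
ceph osd df tree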

The main challenge was to find an existing metric we could use to reflect OSD sizes. After some digging in the MGR module, we decided to use osd_stats, which is already used for other metrics. Basically, we read the OSD’s total bytes and convert them into a TiB value, similar to what the Ceph code does to display the OSD crush weight (ceph osd tree). That’s why we called the new function get_osd_crush_weight. I’ll spare you further details and rather focus on how we implemented the change.
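As a quick illustration of that conversion (the numbers are only an example): a nominal 4 TB disk reports roughly 4000787030016 bytes in its statfs, and dividing that by 2^40 yields the familiar crush weight of such an OSD:

# statfs total bytes divided by 2^40 (1 TiB), just like the new metric does
echo '4000787030016 / 2^40' | bc -l
# prints roughly 3.64, matching what ceph osd tree shows for a 4 TB OSD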

This approach works for both Pacific (16.2.13) and Reef (18.2.7); I haven’t tested it in other releases yet. Note that this is just a workaround and will most likely only work for your current version: the MGR code changes between releases, so you need to take care of the modifications again before or after a Ceph upgrade.

Disclaimer: This approach has worked for us and our customers; it might not work for you. The procedure is for a containerized environment managed by cephadm, but it should be easy to adapt to non-cephadm deployments.

There are two files to maintain for this to work (the file paths refer to the path within the daemon containers):

  • /usr/share/ceph/mgr/prometheus/module.py (mgr daemon)
  • /etc/prometheus/alerting/ceph_alerts.yml (prometheus daemon)

MGR module.py

There are three code blocks to add to the module.py file. Copy the original file from inside the container into the local filesystem (for example: /etc/ceph/prometheus_module_crush_weight.py.dist) to be able to modify it, then add these changes and save it as a new file (prometheus_module_crush_weight.py):

diff prometheus_module_crush_weight.py.dist prometheus_module_crush_weight.py
694a695,701
>         # add new metric to collect crush weight
>         metrics['osd_crush_weight'] = Metric(
>             'gauge',
>             'osd_crush_weight',
>             'Crush weight for Ceph daemons',
>             ('ceph_daemon',)
>         )
1109a1117,1126
>     # get OSD size in TB from osd_stats as an alternative to the actual crush weight
>     def get_osd_crush_weight(self) -> None:
>         osd_stats = self.get('osd_stats')
>         for osd in osd_stats['osd_stats']:
>             id_ = osd['osd']
>             val = osd['statfs']['total']/(1 << 40)
>             self.metrics['osd_crush_weight'].set(val, (
>                 'osd.{}'.format(id_),
>             ))
>  
1700a1718,1719
>         # get crush weight
>         self.get_osd_crush_weight()
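In case you’re wondering how to get the original module.py out of a running MGR container in the first place, something like this should work (assuming podman and that the name filter matches only the MGR container on that host):

# run on an MGR node: copy module.py out of the running MGR container
podman cp $(podman ps -q --filter name=mgr):/usr/share/ceph/mgr/prometheus/module.py /etc/ceph/prometheus_module_crush_weight.py.dist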

The file /etc/ceph/prometheus_module_crush_weight.py needs to be present on all your MGR nodes. To apply this change, modify the mgr spec file:

# get current mgr spec
ceph orch ls mgr --export > mgr-config.yaml

# add extra_container_args
[...]
spec:
  extra_container_args:
    - -v=/etc/ceph/prometheus_module_crush_weight.py:/usr/share/ceph/mgr/prometheus/module.py:ro

# apply new mgr yaml (try with dry-run first)
ceph orch apply -i mgr-config.yaml (--dry-run)

This will redeploy the MGR daemons; look out for errors in case they fail to start. You can always roll back to the defaults by removing the extra_container_args from the spec file.
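One way to keep an eye on the redeployment is the orchestrator itself:

# check the state of the MGR daemons after the redeploy
ceph orch ps --daemon-type mgr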

If the MGRs successfully start, you should be able to see the new metric:

curl http://<MGR>:9283/metrics | grep crush
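The output should contain one line per OSD, roughly like this (the value is just an illustration):

# HELP ceph_osd_crush_weight Crush weight for Ceph daemons
# TYPE ceph_osd_crush_weight gauge
ceph_osd_crush_weight{ceph_daemon="osd.0"} 3.6386998519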

Now we need to get this metric into Prometheus.

Prometheus ceph_alerts.yml

This works similarly to the MGR service: create a custom alerts file, add it to the extra_container_args, then redeploy Prometheus.

Copy the original alerts yaml file to /etc/ceph/prometheus_ceph_alerts.yml, then replace the original CephPGImbalance section with this one:

     - alert: "CephPGImbalanceCrushWeight"
       annotations:
         description: "OSD {{ $labels.ceph_daemon }} on {{ $labels.hostname }} deviates by more than 30% from average PG count, taking OSD size into account."
         summary: "PGs are not balanced across OSDs"
       expr: |
         abs((((ceph_osd_numpg > 0) / on (ceph_daemon) ceph_osd_crush_weight) - on (job) group_left avg((ceph_osd_numpg > 0) / on (ceph_daemon) ceph_osd_crush_weight) by (job)) / on (job) group_left avg((ceph_osd_numpg > 0) / on (ceph_daemon) ceph_osd_crush_weight) by (job)) * on (ceph_daemon) group_left (hostname) ceph_osd_metadata > 0.3
       for: "5m"

A co-worker came up with the formula; it is similar to the original one but uses the ratio of an OSD’s PG count to its size instead of the raw PG count.

To apply this change, add this section to your prometheus spec file:

extra_container_args:
  - -v=/etc/ceph/prometheus_ceph_alerts.yml:/etc/prometheus/alerting/ceph_alerts.yml:ro

It might be necessary to change file permissions for Prometheus to be able to read that file, so keep an eye on the Prometheus logs after applying the change:

ceph orch apply -i prometheus.yaml (--dry-run)
ceph orch redeploy prometheus
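If the Prometheus logs show permission errors for the mounted alerts file, making it world-readable on the host is one option (adjust to your own security requirements):

# allow the prometheus container user to read the custom alerts file
chmod 644 /etc/ceph/prometheus_ceph_alerts.yml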

Since the alert’s pending period (for: 5m) requires the condition to persist for 5 minutes, you’ll have to wait at least that long to see whether you still get alerts. If everything works as expected, you won’t see those alerts anymore, so how can you tell that the formula works correctly? Use the Prometheus web UI and inspect the query with a much smaller deviation: navigate to your Prometheus host, click on “Alerts”, scroll down to the new “CephPGImbalanceCrushWeight” alert and click on the expression. That redirects you to the Graph panel, where you can edit the query and reduce the default 0.3 to, say, 0.0001. Play around with that threshold a bit; at some point you should see something other than “Empty result”, which confirms that the expression is correct.
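For reference, the test query is simply the alert expression with a lowered threshold, for example:

abs((((ceph_osd_numpg > 0) / on (ceph_daemon) ceph_osd_crush_weight) - on (job) group_left avg((ceph_osd_numpg > 0) / on (ceph_daemon) ceph_osd_crush_weight) by (job)) / on (job) group_left avg((ceph_osd_numpg > 0) / on (ceph_daemon) ceph_osd_crush_weight) by (job)) * on (ceph_daemon) group_left (hostname) ceph_osd_metadata > 0.0001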


Cephadm: Activate existing OSDs

A frequent question in the community is: what do I need to do when the operating system of one of my Ceph servers fails, but the OSDs are intact? Can I revive that host?

The answer is yes! And it’s quite easy to do!

A few years ago, I wrote an article about the general idea of how to do that. But the process has become much easier since then, so I decided to write a new blog post.

Although the docs cover the procedure in general, I wanted to add a few more details for context.

This procedure isn’t exclusive to a host failure; in our case, we simply reinstalled all our Ceph servers on faster SSD drives for the operating system (“OS”). The required steps are a combination of the procedure to add a new host and the ceph cephadm osd activate <host>... command.

The OS installation is not covered in this post.

After you have successfully installed the OS, you need to configure the host so the orchestrator is able to manage it. Our Ceph servers run on openSUSE, the package manager is zypper, and we use podman. Adapt the required commands to your OS, its package manager, and your preferred container engine.
The reinstalled server is “ceph04”; the Ceph commands to reintegrate it are executed on “ceph01”, a host with an admin keyring.

# Install required packages
ceph04:~ # zypper in cephadm podman

# Retrieve public key
ceph01:~ # ceph cephadm get-pub-key > ceph.pub

# Copy key to ceph04
ceph01:~ # ssh-copy-id -f -i ceph.pub root@ceph04

# Retrieve private key to test connection
ceph01:~ # ceph config-key get mgr/cephadm/ssh_identity_key > ceph-private.key

# Modify permissions
ceph01:~ # chmod 400 ceph-private.key

# Test login
ceph01:~ # ssh -i ceph-private.key ceph04
Have a lot of fun...
ceph04:~ #

# Clean up
ceph01:~ # rm ceph.pub ceph-private.key

Since the host should still be in the host list, you don’t need to add it again. As soon as the reinstalled host is reachable by the orchestrator (ceph orch host ls doesn’t show the host status as offline or maintenance), cephadm will try to deploy the missing daemons to that host. In case you run your own container registry, the automatic deployment of the missing daemons will fail until the host has successfully logged in to the registry, so we instruct the orchestrator to execute a login for each host:

ceph cephadm registry-login my-registry.domain <user> <password>

Shortly after the orchestrator has performed the registry login, the missing daemons should be successfully deployed to the host, for example crash, node-exporter and all other daemons it used to run before the failure.
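To verify which daemons ended up on the reinstalled host, you can let the orchestrator list them per host, for example:

# list all daemons on ceph04
ceph01:~ # ceph orch ps ceph04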

If that all works, you can activate the existing OSDs simply by running:

ceph cephadm osd activate ceph04

And that’s basically it; the OSDs should boot one after the other. There might be some additional steps required, depending on which daemons are supposed to run on that host, but I’m only focusing on the OSD daemons here.
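A quick look at the cluster status and the CRUSH tree should confirm that the OSDs on ceph04 are back up, for example:

# overall cluster state
ceph01:~ # ceph -s

# the OSDs below the ceph04 host bucket should show "up" again
ceph01:~ # ceph osd tree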

Disclaimer: This procedure has worked many times for us; it might not work for you.


Cephadm: migrate block.db/block.wal to new device

A couple of years ago, before cephadm took over Ceph deployments, we wrote an article about migrating the DB/WAL of an OSD from slow to fast devices. The procedure has become much easier than it used to be, thanks to ceph-bluestore-tool (or alternatively, ceph-volume). Keep in mind that cephadm-managed clusters typically run in containers, so the DB/WAL migration needs to be performed within the OSD containers.

To keep it brief, I’ll only focus on the DB device (block.db); migrating to a separate WAL device (block.wal) is very similar (see the short sketch further below).

# Create Volume Group
ceph:~ # vgcreate ceph-db /dev/vdf

# Create Logical Volume
ceph:~ # lvcreate -L 5G -n ceph-osd0-db ceph-db

# Set the noout flag for OSD.0
ceph:~ # ceph osd add-noout osd.0

# Stop the OSD
ceph:~ # ceph orch daemon stop osd.0

# Enter the OSD shell
ceph:~ # cephadm shell --name osd.0

# Get the OSD's FSID
[ceph: root@ceph /]# OSD_FSID=$(ceph-volume lvm list 0 | awk '/osd fsid/ {print $3}')
[ceph: root@ceph /]# echo $OSD_FSID
fb69ba54-4d56-4c90-a855-6b350d186df5

# Create the DB device
[ceph: root@ceph /]# ceph-volume lvm new-db --osd-id 0 --osd-fsid $OSD_FSID --target ceph-db/ceph-osd0-db

# Migrate the DB to the new device
[ceph: root@ceph /]# ceph-volume lvm migrate --osd-id 0 --osd-fsid $OSD_FSID --from /var/lib/ceph/osd/ceph-0/block --target ceph-db/ceph-osd0-db

# Exit the shell and start the OSD
ceph:~ # ceph orch daemon start osd.0


# Unset the noout flag
ceph:~ # ceph osd rm-noout osd.0

To verify the new configuration, you can inspect the OSD’s metadata:

ceph:~ # ceph osd metadata 0 -f json | jq -r '.bluefs_dedicated_db,.devices'
1
vdb, vdf

We can confirm that we now have a dedicated db device.

You can also check the OSD’s perf dump:

ceph:~ # ceph tell osd.0 perf dump bluefs | jq -r '.[].db_total_bytes,.[].db_used_bytes'
5368700928
47185920

That’s it: the OSD’s DB is now on a different device! (The db_total_bytes value above matches the 5 GiB LV we created earlier.)
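As mentioned at the beginning, a dedicated WAL device works along the same lines. A minimal sketch with the OSD stopped and from within its cephadm shell, just like above (device, VG/LV names and size are only examples):

# Create a Volume Group and a Logical Volume for the WAL
ceph:~ # vgcreate ceph-wal /dev/vdg
ceph:~ # lvcreate -L 2G -n ceph-osd0-wal ceph-wal

# Attach the new WAL device to the OSD
[ceph: root@ceph /]# ceph-volume lvm new-wal --osd-id 0 --osd-fsid $OSD_FSID --target ceph-wal/ceph-osd0-wal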

Migrate DB back to main device

If you’re looking for the other way around, that’s also possible. Although this works with ceph-volume as well, for the sake of variety I’ll show the way with ceph-bluestore-tool:

# Set the noout flag
ceph:~ # ceph osd add-noout osd.0

# Stop the OSD
ceph:~ # ceph orch daemon stop osd.0

# Enter the shell
ceph:~ # cephadm shell --name osd.0

# Migrate DB to main device
[ceph: root@ceph /]# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0/ --command bluefs-bdev-migrate --devs-source /var/lib/ceph/osd/ceph-0/block.db --dev-target /var/lib/ceph/osd/ceph-0/block
inferring bluefs devices from bluestore path
 device removed:1 /var/lib/ceph/osd/ceph-0/block.db

# IMPORTANT: Remove the DB's Logical Volume before you start the OSD! Otherwise the OSD will use it again because of the LV tags.
ceph:~ # lvremove /dev/ceph-db/ceph-osd0-db

# Alternatively, delete the LV tags of the DB LV before starting the OSD.
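# For example, inspect the tags first, then remove the ceph.* tags
# (the tag shown below is only an illustration, use what lvs reports):
ceph:~ # lvs -o lv_name,lv_tags ceph-db/ceph-osd0-db
ceph:~ # lvchange --deltag "ceph.osd_id=0" ceph-db/ceph-osd0-db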

# Start the OSD
ceph:~ # ceph orch daemon start osd.0

# Unset the noout flag
ceph:~ # ceph osd rm-noout osd.0

Verify the results:

ceph:~ # ceph osd metadata 0 -f json | jq -r '.bluefs_dedicated_db,.devices'
0
vdb

The provided steps were performed on a Reef cluster (version 18.2.4).

Disclaimer: As always with such articles: the steps above work for us, but they needn’t work for you. So if anything goes wrong while you try to reproduce the procedure, it’s not our fault, but yours. And it’s yours to fix! But whether it works for you or not, please leave a comment below so that others will know.
