You get the most out of Ceph if you have a balanced cluster, meaning your PGs are well distributed across the OSDs. I won't go into more detail about the advantages of a balanced cluster; there's plenty of information in the docs and on the mailing lists.
To ensure that a cluster is in a healthy state, Ceph provides a great set of monitoring options pointing you towards potential issues. Depending on which monitoring utilities you have enabled, you could be facing the “CephPGImbalance” alert. The “Prometheus” manager module is responsible for this alert.
Prometheus starts firing this alert if an OSD's PG count deviates from the cluster average by more than 30% for more than 5 minutes:
name: CephPGImbalance
expr: |
  abs(((ceph_osd_numpg > 0) - on (job) group_left () avg by (job) (ceph_osd_numpg > 0)) / on (job) group_left () avg by (job) (ceph_osd_numpg > 0)) * on (ceph_daemon) group_left (hostname) ceph_osd_metadata > 0.3
for: 5m
But there's a flaw in this approach: it relies on uniform OSD sizes. Although a Ceph operator's life can be much easier with uniform OSD sizes, that's just not the reality. Many (or even most?) clusters don't have OSDs of the same size. In a cluster with mixed OSD sizes, the ratio of PG count to OSD size reflects an OSD's occupancy much better than the raw PG count. And after a customer asked us for a fix other than adjusting the threshold, we came up with a workaround.
The main problem was to find an existing metric we could use to reflect OSD sizes. After some digging in the MGR module, we decided to use osd_stats, which is already being used for a different metric. Basically, we read an OSD's total bytes and transform them into a TiB value, similar to what the Ceph code does to display the OSD crush weight (ceph osd tree). That's why we called the new function get_osd_crush_weight. I'll spare you further details and focus on how we implemented that change instead.
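To get a feeling for the numbers, here is a quick sketch of that conversion with a made-up byte count for a typical "4 TB" disk (not taken from a real cluster's osd_stats):

# bytes -> TiB conversion as done in the new function (example value, not real data)
total_bytes = 4_000_787_030_016         # statfs 'total' of a typical "4 TB" disk
crush_weight = total_bytes / (1 << 40)  # convert bytes to TiB
print(round(crush_weight, 5))           # ~3.63869, like the weight shown by `ceph osd tree`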
This approach works for both Pacific (16.2.13) and Reef (18.2.7); I haven't tested it in other releases yet. Note that this is just a workaround and it will most likely only work for your current version. You need to revisit the modifications before or after a Ceph upgrade, as the MGR code will change.
Disclaimer: This approach has worked for us and our customers; it might not work for you. The procedure is for a containerized environment managed by cephadm, but it should be easy to adapt to non-cephadm deployments.
There are two files to maintain for this to work (the file paths refer to locations inside the daemon containers):
- /usr/share/ceph/mgr/prometheus/module.py (mgr daemon)
- /etc/prometheus/alerting/ceph_alerts.yml (prometheus daemon)
MGR module.py
There are three code blocks to add to the module.py file. Copy the original file from inside the container into the local filesystem (for example: /etc/ceph/prometheus_module_crush_weight.py.dist) to be able to modify it, then add these changes and save it as a new file (prometheus_module_crush_weight.py):
diff mgr-prometheus-module.py.dist prometheus_module_crush_weight.py
694a695,701
>         # add new metric to collect crush weight
>         metrics['osd_crush_weight'] = Metric(
>             'gauge',
>             'osd_crush_weight',
>             'Crush weight for Ceph daemons',
>             ('ceph_daemon',)
>         )
1109a1117,1126
>     # get OSD size in TiB from osd_stats as an alternative to the actual crush weight
>     def get_osd_crush_weight(self) -> None:
>         osd_stats = self.get('osd_stats')
>         for osd in osd_stats['osd_stats']:
>             id_ = osd['osd']
>             val = osd['statfs']['total'] / (1 << 40)
>             self.metrics['osd_crush_weight'].set(val, (
>                 'osd.{}'.format(id_),
>             ))
>
1700a1718,1719
>         # get crush weight
>         self.get_osd_crush_weight()
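Before touching the MGR spec, it can't hurt to verify that the patched file is still valid Python; a minimal sketch (it only catches syntax and indentation errors, not logic problems):

# compile the patched module locally before redeploying the MGRs
import py_compile

py_compile.compile('/etc/ceph/prometheus_module_crush_weight.py', doraise=True)
print('patched module compiles cleanly')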
The file /etc/ceph/prometheus_module_crush_weight.py needs to be present on all of your MGR nodes. To apply this change, modify the mgr spec file:
# get current mgr spec
ceph orch ls mgr --export > mgr-config.yaml

# add extra_container_args
[...]
spec:
  extra_container_args:
  - -v=/etc/ceph/prometheus_module_crush_weight.py:/usr/share/ceph/mgr/prometheus/module.py:ro

# apply new mgr yaml (try with dry-run first)
ceph orch apply -i mgr-config.yaml (--dry-run)
This will redeploy the MGR daemons; look out for any errors in case they fail to start. You can always roll back to the defaults by removing the extra_container_args from the spec file.
If the MGRs successfully start, you should be able to see the new metric:
curl http://<MGR>:9283/metrics | grep crush
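If you prefer to script that check, here is a small Python sketch doing the same thing (the hostname is a placeholder, and 9283 is the prometheus module's default port; adjust both to your environment):

# print all crush weight series exposed by the MGR's metrics endpoint
import urllib.request

MGR = "mgr-host.example.com"  # placeholder, use one of your MGR hosts
with urllib.request.urlopen(f"http://{MGR}:9283/metrics") as resp:
    for line in resp.read().decode().splitlines():
        if "ceph_osd_crush_weight" in line:
            print(line)  # e.g. ceph_osd_crush_weight{ceph_daemon="osd.0"} 3.63869...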
Now we need to get this metric into Prometheus.
Prometheus ceph_alerts.yml
This works similarly to the MGR service: create a custom alerts file, add it to the extra_container_args, then redeploy Prometheus. Copy the original alerts yaml file to /etc/ceph/prometheus_ceph_alerts.yml, then replace the original CephPGImbalance section with this one:
- alert: "CephPGImbalanceCrushWeight" annotations: description: "OSD {{ $labels.ceph_daemon }} on {{ $labels.hostname }} deviates by more than 30% from average PG count, taking OSD size into account." summary: "PGs are not balanced across OSDs" expr: | abs((((ceph_osd_numpg > 0) / on (ceph_daemon) ceph_osd_crush_weight)- on (job) group_left avg((ceph_osd_numpg > 0)/ on (ceph_daemon) ceph_osd_crush_weight)by (job)) / on (job) gro up_left avg((ceph_osd_numpg > 0)/ on (ceph_daemon) ceph_osd_crush_weight)by (job)) * on (ceph_daemon) group_left(hostname)ceph_osd_metadata > 0.3 for: "5m"
A co-worker came up with the formula; it's similar to the original one but uses the ratio of PG count to OSD size instead of the raw PG count per OSD.
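To see what that changes in practice, here is a quick sketch with made-up numbers (not from a real cluster): one small OSD and two large ones, where the small OSD holds as many PGs as one of the large ones.

# made-up example: PG count and size in TiB per OSD, i.e. the two series the
# expression divides (ceph_osd_numpg / ceph_osd_crush_weight)
osds = {
    "osd.0": (120, 3.6),  # small OSD
    "osd.1": (120, 7.3),  # large OSD with the same PG count -> actually imbalanced
    "osd.2": (240, 7.3),  # large OSD with proportionally more PGs -> fine
}

ratios = {name: pgs / size for name, (pgs, size) in osds.items()}
avg = sum(ratios.values()) / len(ratios)

for name, ratio in ratios.items():
    deviation = abs(ratio - avg) / avg
    status = "alert" if deviation > 0.3 else "ok"
    print(f"{name}: {ratio:5.1f} PGs/TiB, deviation {deviation:.2f} -> {status}")

With the original rule, osd.2 would be flagged just for having the most PGs, even though its PG count matches its size; with the size-aware ratio, only osd.1 (too few PGs for its capacity) exceeds the 30% deviation.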
To apply this change, add this section to your prometheus spec file:
extra_container_args:
- -v=/etc/ceph/prometheus_ceph_alerts.yml:/etc/prometheus/alerting/ceph_alerts.yml:ro
It might be necessary to change file permissions for Prometheus to be able to read that file, so look out for the prometheus logs after applying that change:
ceph orch apply -i prometheus.yaml (--dry-run)
ceph orch redeploy prometheus
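If Prometheus can't read the mounted file, making it world-readable is usually enough, since the container runs as an unprivileged user; a small sketch of that fix (assuming the path from above):

# ensure the custom alerts file is readable by the unprivileged Prometheus user
import os
import stat

path = "/etc/ceph/prometheus_ceph_alerts.yml"
mode = stat.S_IMODE(os.stat(path).st_mode)
if not mode & stat.S_IROTH:
    os.chmod(path, mode | stat.S_IROTH)  # roughly equivalent to `chmod o+r`
    print(f"made {path} world-readable")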
Since the alert only fires after the condition has been true for 5 minutes (for: "5m"), you'll have to wait at least that long to see whether you still get alerts. If everything works as expected, you won't see those alerts anymore, so how can you know the formula works correctly? Just use the Prometheus web UI and inspect the query with a much smaller deviation: navigate to your Prometheus host, click on "Alerts", scroll down to the new "CephPGImbalanceCrushWeight" alert and click on the expression. That takes you to the Graph panel, where you can edit the query and reduce the default 0.3 to, say, 0.0001. Play around a bit with that threshold, and at some point you should see something other than "Empty result", which confirms that the expression is correct.
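If you'd rather script that check as well, the same query can be run against the Prometheus HTTP API; a rough sketch (the hostname is a placeholder, and 9095 is the port cephadm uses for Prometheus by default; adjust both to your setup):

# run the new alert expression with a lowered threshold against the Prometheus API
import json
import urllib.parse
import urllib.request

PROM = "http://prometheus-host.example.com:9095"  # placeholder
threshold = 0.0001                                # much lower than the real 0.3

expr = (
    "abs((((ceph_osd_numpg > 0) / on (ceph_daemon) ceph_osd_crush_weight)"
    " - on (job) group_left avg((ceph_osd_numpg > 0) / on (ceph_daemon) ceph_osd_crush_weight) by (job))"
    " / on (job) group_left avg((ceph_osd_numpg > 0) / on (ceph_daemon) ceph_osd_crush_weight) by (job))"
    f" * on (ceph_daemon) group_left (hostname) ceph_osd_metadata > {threshold}"
)

url = PROM + "/api/v1/query?" + urllib.parse.urlencode({"query": expr})
with urllib.request.urlopen(url) as resp:
    result = json.loads(resp.read())["data"]["result"]

for sample in result:
    print(sample["metric"].get("ceph_daemon"), sample["value"][1])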