PGs not deep-scrubbed in time

Every now and then, Ceph users or operators stumble across this warning:

# ceph health detail
HEALTH_WARN 1161 pgs not deep-scrubbed in time
[WRN] PG_NOT_DEEP_SCRUBBED: 1161 pgs not deep-scrubbed in time
pg 86.fff not deep-scrubbed since 2024-08-21T02:35:25.733187+0000
...

Although I had increased the deep_scrub_interval from the default of 1 week (osd_deep_scrub_interval = 604800 seconds) to a longer interval, let’s say 2 weeks (ceph config set osd osd_deep_scrub_interval 1209600), the cluster still warned that PGs hadn’t been deep-scrubbed in time, even though the last deep_scrub_stamp was clearly younger than the configured interval. In the example above, the last deep_scrub_stamp was only 13 days old at the time of the warning, so we were wondering why.

I checked the config setting:

# ceph config help osd_deep_scrub_interval
osd_deep_scrub_interval - Deep scrub each PG (i.e., verify data checksums) at least this often
 (float, advanced)
 Default: 604800.000000
 Can update at runtime: true
 Services: [osd]

The config setting is applicable to the OSD service and can be changed at runtime, so I expected changing it there to be enough to clear the warning. Apparently, it was not.

Next, I looked into the code:

pool->opts.get(pool_opts_t::DEEP_SCRUB_INTERVAL, &deep_scrub_interval);
if (deep_scrub_interval <= 0) {
  deep_scrub_interval = cct->_conf->osd_deep_scrub_interval;
}

First, it checks whether the pool has an individual deep_scrub_interval set, which was not the case here. Otherwise, it falls back to the value from the config database, which we had changed to two weeks. So why did the warning persist?

It turns out that the warning is based on the config value of the MGR service, not the OSD service! This wasn’t obvious to me from the code path, but I could confirm it quickly by setting osd_deep_scrub_interval to 1 hour for the MGR service only in a test cluster:

# ceph config set mgr osd_deep_scrub_interval 3600

The cluster then warns immediately about PGs not deep-scrubbed in time, as expected. Once you know what to look for, the solution is already out there. But the documentation doesn’t mention it, and just searching for this issue online gives you all kinds of results and workarounds, none of which point to the actual cause.
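
To see which value each daemon type is actually using, you can query the config database per service, for example:

# ceph config get osd osd_deep_scrub_interval
# ceph config get mgr osd_deep_scrub_interval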

There are several options to deal with this, depending on the cluster, its workload, the OSD tree and other factors. Here are two of them:

  1. Change the default config setting for osd_deep_scrub_interval
    • either globally (for all services):
      • # ceph config set global osd_deep_scrub_interval 1209600
    • or for MGRs and OSDs:
      • # ceph config set osd osd_deep_scrub_interval 1209600
      • # ceph config set mgr osd_deep_scrub_interval 1209600
  2. Configure a different interval for each pool individually:
    • # ceph osd pool set <POOL> deep_scrub_interval 1209600

Note: the config setting has a different meaning for each of those two daemons; I’ll get to that in a bit. But if you decide to change it, you should change the value for both services.

For example, if your cluster is heterogeneous with different device classes and many pools, deep-scrubs might be quick for SSD-backed pools and will take much longer for HDD-backed pools. That’s why I would recommend setting the value per pool so you don’t need to change the default config settings.
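
If you go the per-pool route, you can verify the value afterwards; the pool name below is just an example:

# ceph osd pool set hdd-ec-pool deep_scrub_interval 1209600
# ceph osd pool get hdd-ec-pool deep_scrub_interval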

I’d like to add some more details on how that config option affects the two daemons:

[osd] osd_deep_scrub_interval

OSDs will try to deep-scrub the PGs within this interval. If you increase the value (default one week or 604800 seconds), you give your OSDs more time to deep-scrub all PGs.

[mgr] osd_deep_scrub_interval

The (active) MGR checks whether the last deep-scrub timestamp is older than osd_deep_scrub_interval plus a grace period, giving your OSDs a default of 12.25 days to deep-scrub all PGs before the warning is issued. The actual formula for the warning threshold is:

(mon_warn_pg_not_deep_scrubbed_ratio * deep_scrub_interval) + deep_scrub_interval

For example:

(0.75 * 7 days) + 7 days = 12.25 days

So if you increase osd_deep_scrub_interval only for OSDs but not for MGRs, the OSDs will have more time to finish their regular deep-scrubs, but you’ll still end up with the warning because the MGR still compares the last deep-scrub timestamp against the default threshold of 12.25 days.
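
If you want to check the effective warning threshold on your own cluster, you could compute it from the current config values, for example like this (a rough sketch, assuming bc is available):

interval=$(ceph config get mgr osd_deep_scrub_interval)            # e.g. 604800.000000
ratio=$(ceph config get mgr mon_warn_pg_not_deep_scrubbed_ratio)   # default: 0.750000
echo "($ratio * $interval) + $interval" | bc                       # 1058400 seconds = 12.25 days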

The affected cluster was running Pacific, where the config help output provides slightly less information; it lacks the service type:

## Pacific config help output
# ceph config help osd_deep_scrub_interval
osd_deep_scrub_interval - Deep scrub each PG (i.e., verify data checksums) at least this often
(float, advanced)
Default: 604800.000000
Can update at runtime: true

The information provided above (including the service type) is from a Reef cluster. I reported the configuration inconsistency to the documentation team, and I also asked them to add a more detailed explanation of this health warning to the health checks documentation.


Cephadm: change public network

One question that comes up regularly on the ceph-users mailing list is how to change the Ceph (public) network of a cluster deployed by cephadm. I wrote an article a few years ago, when cephadm was first introduced, that focused on changing the monitors’ IP addresses, but I didn’t address the entire network. So now I will.

To keep the article brief I will not paste all of the terminal output here. This is meant to be a guide, not step-by-step instructions. And for the sake of simplicity, I’ll cover only the Ceph “public_network”, not the “cluster_network”. There are several possible scenarios involving the “public_network”, but I will cover just one specific scenario: moving the entire cluster to a different data center, which involves completely shutting down the cluster. Parts of this procedure can also be used in disaster recovery situations, for example where two out of three monitors are broken and the surviving one needs to be started with a modified monmap to be able to form a quorum.

Disclaimer: The following steps have been executed in a lab environment. They worked for me, but they might not work for you. If anything goes wrong while you try to reproduce the following procedure, it’s not my fault, but yours. And it’s yours to fix!

The Ceph version used in these tests was Reef 18.2.1.

Create backups of all relevant information such as keyrings, config files and a current monmap. Stop the cluster, and then prevent the daemons from starting by disabling the ceph.target.
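
A minimal sketch of these preparation steps (backup paths are just examples; the stop/disable commands need to run on every host):

ceph osd set noout                         # keep OSDs from being marked out during the downtime
ceph mon getmap -o /root/monmap.backup     # backup of the current monmap
ceph auth export -o /root/keyring.backup   # backup of all keyrings

systemctl stop ceph.target                 # on every host
systemctl disable ceph.target              # on every host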

Perform the maintenance procedure (e. g. move servers to a different location) and power on the servers. Now change the network setup (IP addresses, NTP, etc.) according to your requirements.

Now it’s getting serious… The next steps include a bit more detail to provide some context, along with working examples. In this procedure, the “old network” is 10.10.10.0/24 and the “new network” is 192.168.160.0/24.

# Enter shell of first MON
reef1:~ # cephadm shell --name mon.reef1

# Extract current monmap
[ceph: root@reef1 /]# ceph-mon -i reef1 --extract-monmap monmap

# Print content
[ceph: root@reef1 /]# monmaptool --print monmap
monmaptool: monmap file monmap
epoch 5
fsid 2851404a-d09a-11ee-9aaa-fa163e2de51a
last_changed 2024-02-21T09:32:18.292040+0000
created 2024-02-21T09:18:27.136371+0000
min_mon_release 18 (reef)
election_strategy: 1
0: [v2:10.10.10.11:3300/0,v1:10.10.10.11:6789/0] mon.reef1
1: [v2:10.10.10.12:3300/0,v1:10.10.10.12:6789/0] mon.reef2
2: [v2:10.10.10.13:3300/0,v1:10.10.10.13:6789/0] mon.reef3

# Remove MONs with old address
[ceph: root@reef1 /]# monmaptool --rm reef1 --rm reef2 --rm reef3 monmap

# Add MONs with new address
[ceph: root@reef1 /]# monmaptool --addv reef1 [v2:192.168.160.11:3300/0,v1:192.168.160.11:6789/0] --addv reef2 [v2:192.168.160.12:3300/0,v1:192.168.160.12:6789/0] --addv reef3 [v2:192.168.160.13:3300/0,v1:192.168.160.13:6789/0] monmap

# Verify changes
[ceph: root@reef1 /]# monmaptool --print monmap
monmaptool: monmap file monmap
epoch 5
fsid 2851404a-d09a-11ee-9aaa-fa163e2de51a
last_changed 2024-02-21T09:32:18.292040+0000
created 2024-02-21T09:18:27.136371+0000
min_mon_release 18 (reef)
election_strategy: 1
0: [v2:192.168.160.11:3300/0,v1:192.168.160.11:6789/0] mon.reef1
1: [v2:192.168.160.12:3300/0,v1:192.168.160.12:6789/0] mon.reef2
2: [v2:192.168.160.13:3300/0,v1:192.168.160.13:6789/0] mon.reef3

# Inject new monmap
[ceph: root@reef1 /]# ceph-mon -i reef1 --inject-monmap monmap

Repeat this procedure for the remaining monitors. Keep in mind that their ceph.conf (/var/lib/ceph/{FSID}/mon.{MON}/config) still refers to the old network. Update those files accordingly, then start the monitors. If everything went well they should connect to each other and form a quorum.
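
For example, updating and restarting one of the remaining monitors could look like this (adjust the FSID placeholder, hostname and addresses to your environment):

sed -i 's/10\.10\.10\./192.168.160./g' /var/lib/ceph/{FSID}/mon.reef2/config
systemctl start ceph-{FSID}@mon.reef2.service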

Now the ceph public_network needs to be updated:

ceph config set mon public_network 192.168.160.0/24

Update the config files of the MGRs as well (/var/lib/ceph/{FSID}/mgr.{mgr}/config) and start them. Now you should have the orchestrator available again to deal with the OSDs (and other daemons), but it will still try to connect to the old network since the host list still contains the old addresses. Update the host addresses with:

ceph orch host set-addr reef1 192.168.160.11
ceph orch host set-addr reef2 192.168.160.12
ceph orch host set-addr reef3 192.168.160.13

It can take a few minutes for the orchestrator to connect to each host. Once it has, reconfigure the OSDs so their config files are updated automatically:

ceph orch reconfig osd

To verify, check the config files of one or more OSDs (/var/lib/ceph/{FSID}/osd.{OSD_ID}/config); if for some reason they are not updated automatically, you can update them manually.
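
A quick way to spot leftover old addresses in the local daemon configs (adjust the FSID placeholder):

grep -l "10\.10\.10\." /var/lib/ceph/{FSID}/osd.*/config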

Now the OSDs should be able to start successfully and eventually recover. Monitor the ceph status carefully: if you didn’t catch all OSDs and some of them still have the old address configured, you would see that in the osd dump output:

ceph osd dump | grep "10\.10\.10"

If that is the case, fix their config files and restart the affected OSDs. Then unset the noout flag and test whether the cluster works as expected. And don’t forget to re-enable the ceph.target so the daemons start automatically after the next reboot.
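
For completeness, the final steps could look like this (run the enable command on every host):

ceph osd unset noout
systemctl enable ceph.target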

I repeated this procedure a couple of times back and forth and it did work for me every time. Of course, there’s no guarantee that it will work for you; you might encounter issues that don’t come up in a virtual environment. So plan carefully and, if possible, test the procedure in a test environment first.

If there’s anything important missing here or if I made a mistake, feel free to comment!


How to migrate from SUSE Enterprise Storage to upstream Ceph

As most customers who use(d) the SES product (“SUSE Enterprise Storage”) should be aware, the product has been discontinued. The reasoning behind that decision is not part of this article.

Instead, I’d like to address customers, users, operators and admins who haven’t yet decided how to continue with Ceph. Besides quitting Ceph (hopefully you’ll decide against that), there are several options to consider. This post is not a step-by-step guide with technical details but rather an overview of a few possible paths. One approach will not be covered here, though: moving to a different vendor with Ceph support, as they will have their own migration path.

Cephadm clusters

If you decide to continue using Ceph but want to move to an upstream release (at the time of writing, “Reef” had just been released) and your cluster is already managed by cephadm, it’s quite easy: upgrade your cluster to the desired Ceph version (for example, the latest “Quincy” release):

ceph orch upgrade start --image quay.io/ceph/ceph:v17.2.6

Note: make sure your cluster is healthy before upgrading! The upgrade will only work if your cluster nodes have direct Internet access and sufficient free disk space. In an air-gapped environment with a private container registry you’ll need to synchronize the Ceph image to your registry first and point the image URL in the command to your own registry. But basically, that’s all there is to do! Since the cluster is already containerized, it’s independent of the operating system and you can continue using SLES. This has the advantage of retaining support for the operating system (subscriptions are not taken into consideration in this article).
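
In the air-gapped case, the upgrade command might look like this (the registry URL is hypothetical; adjust it to your environment), and you can follow the progress with the upgrade status command:

ceph orch upgrade start --image registry.example.com/ceph/ceph:v17.2.6
ceph orch upgrade status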

SaltStack

You can also continue to use SaltStack, since you have a Salt Master (admin node) and the other nodes are Salt Minions. This can still be used to centrally manage system configuration from your admin node or to issue commands on target nodes (e.g., check the free disk space across all nodes):

salt 'storage*' cmd.run 'df -h'

As for ceph-salt, which basically manages adding or removing cluster nodes: it will probably continue to work for at least some time, but don’t rely on it. As you move to newer Ceph releases, its functionality might become quite limited, and at that point some other strategy will be required to manage new systems. Removing cluster nodes doesn’t depend on ceph-salt, though; it just makes it a little more convenient. But since Ceph clusters usually don’t grow too fast, you can deal with this topic after the migration has finished.

Operating System

If you also need to migrate to a different operating system (“OS”), you can do that as well, of course. You can either remove the nodes one by one entirely from the cluster, wait for the rebalancing to finish, install a new operating system of your choice (make sure the same Ceph version is available for it) and redeploy the OSDs. Or you can preserve the OSDs and just reinstall the OS; cephadm is usually capable of activating existing OSDs. Several factors determine which approach will work best, and I will not cover those details here.
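
If you preserve the OSDs and only reinstall the OS, re-activating them once the host is back under cephadm management might look like this (the hostname is an example):

ceph cephadm osd activate ses-node1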

If you need to move to a different OS and your cluster is not managed by cephadm yet, I strongly recommend upgrading the cluster to a cephadm-managed version first.

Non-cephadm clusters

Now this can be a little more difficult, but it’s still manageable. If your cluster has not yet been upgraded to a containerized version (Octopus or later), this section is relevant for you. Since I cannot cover all possible combinations in this article, I will assume that the cluster is still on SES 6 (Nautilus); if you’re running an older release, you should upgrade at least to Nautilus first. To continue the upgrade process to upstream Ceph, all OSDs need to be BlueStore OSDs managed by ceph-volume, so you might need to redeploy all of them before proceeding with the adoption.

If you still have valid SES subscriptions, I strongly recommend upgrading to a containerized version (the last SES release is 7.1, based on Ceph Pacific) while you still have full SUSE support.

Software Repositories

The details of the upgrade process differ depending on the available software repositories, but the general process stays the same. I will just assume that you have an RMT server running in your environment and your cluster nodes are registered against it. If your nodes are directly connected to the SUSE Customer Center (“SCC”) the OS upgrade process is the same as with an RMT. If you have a SUSE Manager available, the process will be a bit different, but you’re probably aware of that and know how to manage it.

Upgrade SLES

This is usually an easy process (just run zypper migration or zypper dup, respectively) if your subscriptions are valid and all add-on products are available. This might not be the case if your SES subscriptions expired before moving to cephadm. Since there are quite a few things to consider during that process, I can’t cover all the details here. Just note that it might not be straightforward and will probably require some manual steps to preserve cluster availability during the upgrade.

Add upstream repository

If everything went well and your cluster is up and running with a newer SLES underneath, you should be able to add a custom repository to your RMT (if necessary). That is required for the “cephadm” package; “podman” and its dependencies should be available via the sle-module-containers add-on.

There are RPMs available for Ceph Pacific (for example on software.opensuse.org), so you should be able to move from Nautilus to Pacific by simply upgrading those packages. Note that those packages are built for openSUSE Leap 15.3, but they should be compatible with SLE 15 SP3; our tests didn’t reveal any issues. Make sure you upgrade the Monitor nodes first (probably colocated with Managers), then the OSDs, then the rest.

Disclaimer: this might not work without manual intervention, or might not work at all; do not continue if the first node fails! If possible, ensure that you can roll back the first MON node (pull one of the RAID disks for the OS, or create a snapshot if it’s a virtual machine). If it fails and you can’t roll back, deploy a new MON service with Nautilus on one of the other cluster nodes; this should still be possible with DeepSea via your admin node. Once the cluster has recovered, inspect what went wrong and try to fix it.

Cephadm adopt

If you managed to keep your cluster alive and now have Ceph Pacific running, make sure your nodes have the Pacific container images available; then you can move forward and adopt the daemons with cephadm. Note that upgrading from Nautilus directly to Quincy or later is not supported: you need to upgrade to Pacific first, because skipping more than one Ceph release is not supported.

From here on you can basically follow the instructions from the upstream documentation. To break it down a bit for an overview, these are the key steps (a brief command sketch follows the list):

  1. Check requirements
  2. Adopt Monitors/Managers
  3. Adopt OSDs
  4. Redeploy gateways (MDS, RGW, etc.)
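
A minimal command sketch for steps 1 through 3 (daemon names are examples; run the commands on the respective hosts and follow the upstream documentation for the full procedure):

cephadm check-host                              # step 1: verify the host meets the requirements
cephadm adopt --style legacy --name mon.node1   # step 2: adopt the MON (and MGR) daemons
cephadm adopt --style legacy --name mgr.node1
cephadm adopt --style legacy --name osd.0       # step 3: adopt the OSDs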

Although the documentation states that your gateways might not be available during redeployment, there is a way to avoid that if you have dedicated servers for those gateways. If they are colocated with MON or OSD services, it might not be possible to avoid gateway downtime.

Conclusion

Ceph is a robust and resilient storage solution (if configured properly); we have seen it prove that many times in our own clusters as well as in our customers’ clusters. If you have had similar experiences and don’t want to quit on Ceph, I hope I could shed some light on the adoption process without bothering you with too many details. You can get in touch with us if you need assistance with that process. Having been a recognized SUSE partner for many years, we have in-depth knowledge of the SUSE products, and we have corresponding knowledge and experience with upstream Ceph. If you have already migrated to upstream Ceph, I’d be curious how it went for you. If you have additional remarks, don’t hesitate to comment!
