PGs not deep-scrubbed in time

Every now and then, Ceph users or operators stumble across this warning:

# ceph health detail
HEALTH_WARN 1161 pgs not deep-scrubbed in time
[WRN] PG_NOT_DEEP_SCRUBBED: 1161 pgs not deep-scrubbed in time
pg 86.fff not deep-scrubbed since 2024-08-21T02:35:25.733187+0000
...

Although I had increased the deep_scrub_interval from the default of 1 week (osd_deep_scrub_interval = 604800 seconds) to a longer interval, let’s say 2 weeks (ceph config set osd osd_deep_scrub_interval 1209600), the cluster still warns that PGs haven’t been deep-scrubbed in time, even though the last deep_scrub_stamp is clearly younger than the configured interval. In the above example, the last deep_scrub_stamp was only 13 days old at the time of the warning, so we were wondering why.
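You can cross-check the effective value and a PG’s last deep-scrub stamp like this (output abridged for illustration; the PG id and timestamp are the ones from the warning above):

# ceph config get osd osd_deep_scrub_interval
1209600.000000
# ceph pg 86.fff query | grep last_deep_scrub_stamp
    "last_deep_scrub_stamp": "2024-08-21T02:35:25.733187+0000",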

I checked the config setting:

# ceph config help osd_deep_scrub_interval
osd_deep_scrub_interval - Deep scrub each PG (i.e., verify data checksums) at least this often
 (float, advanced)
 Default: 604800.000000
 Can update at runtime: true
 Services: [osd]

The config setting applies to the OSD service and can be changed at runtime, so I expected that to be enough to clear the warning. Apparently, it was not.
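You can also confirm that a running OSD has picked up the new value at runtime (osd.0 is just an arbitrary example daemon, the output is illustrative):

# ceph tell osd.0 config get osd_deep_scrub_interval
{
    "osd_deep_scrub_interval": "1209600.000000"
}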

Next, I looked into the code:

pool->opts.get(pool_opts_t::DEEP_SCRUB_INTERVAL, &deep_scrub_interval);
if (deep_scrub_interval <= 0) {
  deep_scrub_interval = cct->_conf->osd_deep_scrub_interval;
}

First, it checks whether the pool has an individual deep_scrub_interval set, which was not the case here. Next, it falls back to the config value from the config database, which we had changed to two weeks. But the warning still persisted. Why?
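(As an aside, you can check whether a pool overrides the interval with the command below; if the option is not set on the pool, as was the case here, the check falls through to the config value.)

# ceph osd pool get <POOL> deep_scrub_interval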

It turns out that the config is read by the MGR service, not the OSD service! This isn’t obvious from the code path, but I could confirm it quickly in a test cluster by setting osd_deep_scrub_interval to 1 hour for the MGR service only:

# ceph config set mgr osd_deep_scrub_interval 3600

The cluster then immediately warns about PGs not deep-scrubbed in time, as expected. And if you know what you’re looking for, you realize the solution is already out there. But the documentation doesn’t mention it, and searching for this issue online gives you all kinds of results and workarounds, none of them pointing to the actual solution.
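If you want to double-check which scope your MGR is actually using, you can compare the value the config database reports for the mgr with the current health status (just a sketch, the second command simply filters the health output):

# ceph config get mgr osd_deep_scrub_interval
3600.000000
# ceph health detail | grep PG_NOT_DEEP_SCRUBBED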

There are several options to deal with this, depending on the actual cluster, its workload, the OSD tree and some other factors. Here are two of them:

  1. Change the default config setting for osd_deep_scrub_interval
    • either globally (for all services):
      • # ceph config set global osd_deep_scrub_interval 1209600
    • or for MGRs and OSDs:
      • # ceph config set osd osd_deep_scrub_interval 1209600
      • # ceph config set mgr osd_deep_scrub_interval 1209600
  2. Configure a different interval for each pool individually:
    • # ceph osd pool set <POOL> deep_scrub_interval 1209600

Note: the config setting has different meanings for those two daemons; I’ll get to that in a bit. But if you decide to change it, you should change the value for both services.
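Either way, you can verify which scopes now carry an override (the grep is just for readability):

# ceph config dump | grep osd_deep_scrub_interval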

For example, if your cluster is heterogeneous, with different device classes and many pools, deep-scrubs might finish quickly for SSD-backed pools but take much longer for HDD-backed pools. That’s why I would recommend setting the value per pool, so you don’t need to change the default config settings.
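If you go the per-pool route and have many HDD-backed pools, a small loop helps; this is only a sketch and assumes your HDD pools share a common name prefix (hdd_ here), so adjust the filter to your environment:

# assumption: HDD pools are named hdd_*; adjust the grep to your naming scheme
for pool in $(ceph osd pool ls | grep '^hdd_'); do
  ceph osd pool set "$pool" deep_scrub_interval 1209600
done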

I’d like to add some more details on how that config option impacts the two daemons:

[osd] osd_deep_scrub_interval

OSDs will try to deep-scrub the PGs within this interval. If you increase the value (default one week or 604800 seconds), you give your OSDs more time to deep-scrub all PGs.

[mgr] osd_deep_scrub_interval

The (active) MGR checks whether each PG’s last deep-scrub timestamp is younger than the osd_deep_scrub_interval (plus a couple of days), which by default gives your OSDs 12.25 days to deep-scrub all PGs before the warning is issued. The actual formula is:

(mon_warn_pg_not_deep_scrubbed_ratio * deep_scrub_interval) + deep_scrub_interval

For example:

(0.75 * 7 days) + 7 days = 12.25 days

So if you increase osd_deep_scrub_interval only for OSDs but not for MGRs, the OSDs will have more time to finish their regular deep-scrubs, but you’ll still end up with the warning because the MGR compares the last deep-scrub timestamp to those 12.25 days.
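If you want to check which threshold your MGR effectively applies, you can pull both values from the config database and plug them into the formula yourself (just a sketch; it requires bc, and the numbers shown are the defaults):

# interval=$(ceph config get mgr osd_deep_scrub_interval)
# ratio=$(ceph config get mgr mon_warn_pg_not_deep_scrubbed_ratio)
# echo "$interval * $ratio + $interval" | bc
1058400.000000

That is 1058400 seconds, i.e. the 12.25 days mentioned above; with a two-week interval it would be 2116800 seconds, or 24.5 days.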

The affected cluster was on version Pacific, where the config help output contains slightly less information; it lacks the service type:

## Pacific config help output
# ceph config help osd_deep_scrub_interval
osd_deep_scrub_interval - Deep scrub each PG (i.e., verify data checksums) at least this often
(float, advanced)
Default: 604800.000000
Can update at runtime: true

The information provided above (including the service type) is from a Reef cluster. I reported the configuration inconsistency to the documentation team, and I also asked for a more detailed explanation of this health warning to be added to the health check documentation.
