Cephadm: Activate existing OSDs

A frequent question in the community is: what do I need to do when the operating system of one of my Ceph servers fails, but the OSDs are intact? Can I revive the host?

The answer is yes! And it’s quite easy to do!

A few years ago, I wrote an article about the general idea of how to do that. But the process has become much easier since then, so I decided to write a new blog post.

Although the docs cover this in general, I wanted to add a few more details for a bit more context.

This procedure isn’t exclusive to host failures: we simply reinstalled all of our Ceph servers with faster SSD drives for the operating system (“OS”). The required steps are a combination of the procedure to add a new host and the ceph cephadm osd activate <host>... command.

The OS installation is not covered in this post.

After you have successfully installed the OS, you need to configure the host so the orchestrator is able to manage it. Our Ceph servers run on openSUSE, the package manager is zypper, and we use podman. Adapt the commands to your OS, its package manager, and your preferred container engine.
The reinstalled server is “ceph04”; the Ceph commands to reintegrate it are executed on “ceph01”, a host with an admin keyring.

# Install required packages
ceph04:~ # zypper in cephadm podman

# Retrieve public key
ceph01:~ # ceph cephadm get-pub-key > ceph.pub

# Copy key to ceph04
ceph01:~ # ssh-copy-id -f -i ceph.pub root@ceph04

# Retrieve private key to test connection
ceph01:~ # ceph config-key get mgr/cephadm/ssh_identity_key > ceph-private.key

# Modify permissions
ceph01:~ # chmod 400 ceph-private.key

# Test login
ceph01:~ # ssh -i ceph-private.key ceph04
Have a lot of fun...
ceph04:~ #

# Clean up
ceph01:~ # rm ceph.pub ceph-private.key

Since the host should still be in the host list, you don’t need to add it again. As soon as the reinstalled host is reachable by the orchestrator (ceph orch host ls doesn’t show the host as offline or in maintenance), cephadm will try to deploy the missing daemons to that host. If you run your own container registry, the automatic deployment of those daemons will fail until the host has successfully logged in to the registry, so we instruct the orchestrator to execute a login for each host:

ceph cephadm registry-login my-registry.domain <user> <password>

Shortly after the orchestrator has performed the registry login, the missing daemons should be deployed to the host, for example crash, node-exporter, and whatever other daemons it used to run before the failure.
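To verify which daemons the orchestrator now manages on the reinstalled host, you can filter the daemon list by hostname (a quick check, output omitted here):

# List the daemons on ceph04
ceph01:~ # ceph orch ps ceph04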

If that all works, you can activate the existing OSDs simply by running:

ceph cephadm osd activate ceph04

And that’s basically it: the OSDs should boot one after the other. There might be some additional steps required, depending on which daemons are supposed to run on that host, but I’m only focusing on OSD daemons here.
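If you want to confirm that the OSDs are really back up, a quick look at the daemon list and the cluster status should be enough:

# Check the OSD daemons on ceph04
ceph01:~ # ceph orch ps ceph04 --daemon-type osd

# Check the overall cluster status
ceph01:~ # ceph -s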

Disclaimer: This procedure has worked many times for us; it might not work for you.


Cephadm: migrate block.db/block.wal to new device

A couple of years ago, before cephadm took over Ceph deployments, we wrote an article about migrating DB/WAL devices from slow to fast devices. The procedure has become much easier than it used to be, thanks to ceph-bluestore-tool (or alternatively, ceph-volume). Keep in mind that the daemons of cephadm-managed clusters typically run in containers, so the migration of the DB/WAL needs to be performed within the OSD containers.
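If you want to see which OSD containers are running on a host, your container engine can list them; we use podman, so something like this works for us (adapt it to your engine):

# List the running Ceph containers on this host and filter for OSDs
ceph:~ # podman ps --format '{{.Names}}' | grep osd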

To keep it brief, I’ll only focus on the DB device (block.db); migrating to a separate WAL device (block.wal) is very similar (see the short sketch after the DB walkthrough).

# Create Volume Group
ceph:~ # vgcreate ceph-db /dev/vdf

# Create Logical Volume
ceph:~ # lvcreate -L 5G -n ceph-osd0-db ceph-db

# Set the noout flag for OSD.0
ceph:~ # ceph osd add-noout osd.0

# Stop the OSD
ceph:~ # ceph orch daemon stop osd.0

# Enter the OSD shell
ceph:~ # cephadm shell --name osd.0

# Get the OSD's FSID
[ceph: root@ceph /]# OSD_FSID=$(ceph-volume lvm list 0 | awk '/osd fsid/ {print $3}')
[ceph: root@ceph /]# echo $OSD_FSID
fb69ba54-4d56-4c90-a855-6b350d186df5

# Create the DB device
[ceph: root@ceph /]# ceph-volume lvm new-db --osd-id 0 --osd-fsid $OSD_FSID --target ceph-db/ceph-osd0-db

# Migrate the DB to the new device
[ceph: root@ceph /]# ceph-volume lvm migrate --osd-id 0 --osd-fsid $OSD_FSID --from /var/lib/ceph/osd/ceph-0/block --target ceph-db/ceph-osd0-db

# Exit the shell and start the OSD
ceph:~ # ceph orch daemon start osd.0


# Unset the noout flag
ceph:~ # ceph osd rm-noout osd.0

To verify the new configuration, you can inspect the OSD’s metadata:

ceph:~ # ceph osd metadata 0 -f json | jq -r '.bluefs_dedicated_db,.devices'
1
vdb, vdf

We can confirm that we now have a dedicated db device.
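Another way to cross-check is from within the OSD container: ceph-volume should now list a separate [db] entry for the OSD. A sketch, assuming your cephadm version accepts a command after the double dash:

# List the OSD's LVM devices from within its container
ceph:~ # cephadm shell --name osd.0 -- ceph-volume lvm list 0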

You can also check the OSD’s perf dump:

ceph:~ # ceph tell osd.0 perf dump bluefs | jq -r '.[].db_total_bytes,.[].db_used_bytes'
5368700928
47185920

That’s it: the OSD’s DB is now on a different device!
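As mentioned above, the WAL variant is very similar. A minimal sketch could look like this (stop the OSD and set noout first, just like in the DB walkthrough; the device /dev/vdf, the volume group ceph-wal and the LV ceph-osd0-wal are example names, and ceph-volume lvm new-wal takes the same kind of arguments as new-db):

# Create a Volume Group and a Logical Volume for the WAL (example names)
ceph:~ # vgcreate ceph-wal /dev/vdf
ceph:~ # lvcreate -L 2G -n ceph-osd0-wal ceph-wal

# Inside the OSD container (cephadm shell --name osd.0), attach the new WAL device
[ceph: root@ceph /]# ceph-volume lvm new-wal --osd-id 0 --osd-fsid $OSD_FSID --target ceph-wal/ceph-osd0-wal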

Migrate DB back to main device

If you’re looking for the other way around, that’s also possible. Although this works with ceph-volume as well, for the sake of variety I’ll show the way with ceph-bluestore-tool:

# Set the noout flag
ceph:~ # ceph osd add-noout osd.0

# Stop the OSD
ceph:~ # ceph orch daemon stop osd.0

# Enter the shell
ceph:~ # cephadm shell --name osd.0

# Migrate DB to main device

[ceph: root@ceph /]# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0/ --command bluefs-bdev-migrate --devs-source /var/lib/ceph/osd/ceph-0/block.db --dev-target /var/lib/ceph/osd/ceph-0/block
inferring bluefs devices from bluestore path
 device removed:1 /var/lib/ceph/osd/ceph-0/block.db

# IMPORTANT: Exit the shell and remove the DB's Logical Volume before you start the OSD! Otherwise the OSD will pick it up again because of the LV tags.
ceph:~ # lvremove /dev/ceph-db/ceph-osd0-db

# Alternatively, delete the LV tags of the DB LV before starting the OSD.
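# A rough sketch of that alternative (the tag shown below is just an example, check your own output first):
# Inspect the LV tags of the DB LV
ceph:~ # lvs -o lv_name,vg_name,lv_tags ceph-db/ceph-osd0-db
# Remove the ceph.* tags from the DB LV, one --deltag per tag, e.g.:
ceph:~ # lvchange --deltag "ceph.type=db" ceph-db/ceph-osd0-db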

# Start the OSD
ceph:~ # ceph orch daemon start osd.0

# Unset the noout flag
ceph:~ # ceph osd rm-noout osd.0

Verify the results:

ceph:~ # ceph osd metadata 0 -f json | jq -r '.bluefs_dedicated_db,.devices'
0
vdb

The provided steps were performed on a Reef cluster (version 18.2.4).

Disclaimer: As always with such articles: the steps above work for us, but they might not work for you. So if anything goes wrong while you try to reproduce the above procedure, it’s not our fault, but yours. And it’s yours to fix! But whether it works for you or not, please leave a comment below so that others will know.


PGs not deep-scrubbed in time

Every now and then, Ceph users or operators stumble across this warning:

# ceph health detail
HEALTH_WARN 1161 pgs not deep-scrubbed in time
[WRN] PG_NOT_DEEP_SCRUBBED: 1161 pgs not deep-scrubbed in time
pg 86.fff not deep-scrubbed since 2024-08-21T02:35:25.733187+0000
...

Although I had increased the deep_scrub_interval from 1 week (the default, osd_deep_scrub_interval = 604800 seconds) to a longer interval, let’s say 2 weeks (ceph config set osd osd_deep_scrub_interval 1209600), the cluster still warned that PGs hadn’t been deep-scrubbed in time, even though the last deep_scrub_stamp was clearly younger than the configured interval. In the above example, the last deep_scrub_stamp was only 13 days old at the time of the warning, so we were wondering why.
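If you want to check the last deep-scrub timestamp of an individual PG, something like this should work (86.fff is the PG from the warning above; the jq path may differ slightly between releases):

# ceph pg 86.fff query | jq -r '.info.stats.last_deep_scrub_stamp'
2024-08-21T02:35:25.733187+0000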

I checked the config setting:

# ceph config help osd_deep_scrub_interval
osd_deep_scrub_interval - Deep scrub each PG (i.e., verify data checksums) at least this often
 (float, advanced)
 Default: 604800.000000
 Can update at runtime: true
 Services: [osd]

The config setting is applicable to the OSD service and it can be changed at runtime, so I expected it to be enough to clear the warning. Apparently, it was not.

Next, I looked into the code:

pool->opts.get(pool_opts_t::DEEP_SCRUB_INTERVAL, &deep_scrub_interval);
if (deep_scrub_interval <= 0) {
  deep_scrub_interval = cct->_conf->osd_deep_scrub_interval;
}

First, it checks whether the pool has an individual deep_scrub_interval set, which was not the case here. Next, it reads the config value from the config database, which we had changed to two weeks. But the warning still persisted, why?

It turns out that the config is read by the MGR service, not the OSD service! That isn’t obvious to me from the code path, but I could confirm it quickly by setting osd_deep_scrub_interval to 1 hour for the MGR service only in a test cluster:

# ceph config set mgr osd_deep_scrub_interval 3600

The cluster then warns immediately about PGs not deep-scrubbed in time, as expected. And if you know what you’re looking for, you realize the solution is already out there. But the documentation doesn’t mention that, and just searching for this issue online will give you all kinds of results and workarounds, but nothing pointing to the actual solution.

There are several options to deal with this, depending on the actual cluster, its workload, the OSD tree, and some other factors. Here are two of them:

  1. Change the default config setting for osd_deep_scrub_interval
    • either globally (for all services):
      • # ceph config set global osd_deep_scrub_interval 1209600
    • or for MGRs and OSDs:
      • # ceph config set osd osd_deep_scrub_interval 1209600
      • # ceph config set mgr osd_deep_scrub_interval 1209600
  2. Configure a different interval for each pool individually:
    • # ceph osd pool set <POOL> deep_scrub_interval 1209600

Note: the config setting has different meanings for those two daemons; I’ll get to that in a bit. But if you decide to change it, you should change the value for both services.

For example, if your cluster is heterogeneous with different device classes and many pools, deep-scrubs might be quick for SSD-backed pools and will take much longer for HDD-backed pools. That’s why I would recommend setting the value per pool so you don’t need to change the default config settings.
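To check whether a pool already has an individual interval configured, you can query the pool option (if it isn’t set, the command should tell you so):

# ceph osd pool get <POOL> deep_scrub_interval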

I’d like to add some more details on how that config option affects those two daemons:

[osd] osd_deep_scrub_interval

OSDs will try to deep-scrub the PGs within this interval. If you increase the value (default one week or 604800 seconds), you give your OSDs more time to deep-scrub all PGs.

[mgr] osd_deep_scrub_interval

The (active) MGR checks if the last deep-scrub timestamp is younger than the osd_deep_scrub_interval (plus a couple of days), giving your OSDs a default of 12.25 days to deep-scrub all PGs before the warning is issued. The actual formula is:

(mon_warn_pg_not_deep_scrubbed_ratio * deep_scrub_interval) + deep_scrub_interval

For example:

(0.75 * 7 days) + 7 days = 12.25 days

So if you increase osd_deep_scrub_interval only for OSDs but not for MGRs, the OSDs will have more time to finish their regular deep-scrubs, but you’ll still end up with the warning because the MGR compares the last deep-scrub timestamp to those 12.25 days.
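To see which values are currently in effect for the two services (and the warn ratio used in the formula above), you can query the config database:

# ceph config get osd osd_deep_scrub_interval
# ceph config get mgr osd_deep_scrub_interval
# ceph config get mgr mon_warn_pg_not_deep_scrubbed_ratio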

The affected cluster was running Pacific, where the ceph config help output provides slightly less information: it lacks the service type:

## Pacific config help output
# ceph config help osd_deep_scrub_interval
osd_deep_scrub_interval - Deep scrub each PG (i.e., verify data checksums) at least this often
(float, advanced)
Default: 604800.000000
Can update at runtime: true

The information provided above (including the service type) is from a Reef cluster. I reported the configuration inconsistency to the documentation team, and I also asked to add a more detailed explanation about this health warning in the health check documentation.
