Cephadm: migrate block.db/block.wal to new device

A couple of years ago, before cephadm took over Ceph deployments, we wrote an article about migrating DB/WAL devices from slow to fast devices. The procedure has become much easier than it used to be, thanks to ceph-bluestore-tool (or alternatively, ceph-volume). Keep in mind that cephadm-managed clusters typically run their daemons in containers, so the DB/WAL migration needs to be performed within the OSD containers.

To keep it brief, I’ll only focus on the DB device (block.db); migrating to a separate WAL device (block.wal) is very similar (see the sketch at the end of this section).

# Create Volume Group
ceph:~ # vgcreate ceph-db /dev/vdf

# Create Logical Volume
ceph:~ # lvcreate -L 5G -n ceph-osd0-db ceph-db

# Set the noout flag for OSD.0
ceph:~ # ceph osd add-noout osd.0

# Stop the OSD
ceph:~ # ceph orch daemon stop osd.0

# Enter the OSD shell
ceph:~ # cephadm shell --name osd.0

# Get the OSD's FSID
[ceph: root@ceph /]# OSD_FSID=$(ceph-volume lvm list 0 | awk '/osd fsid/ {print $3}')
[ceph: root@ceph /]# echo $OSD_FSID
fb69ba54-4d56-4c90-a855-6b350d186df5

# Create the DB device
[ceph: root@ceph /]# ceph-volume lvm new-db --osd-id 0 --osd-fsid $OSD_FSID --target ceph-db/ceph-osd0-db

# Migrate the DB to the new device
[ceph: root@ceph /]# ceph-volume lvm migrate --osd-id 0 --osd-fsid $OSD_FSID --from /var/lib/ceph/osd/ceph-0/block --target ceph-db/ceph-osd0-db

# Exit the shell and start the OSD
ceph:~ # ceph orch daemon start osd.0


# Unset the noout flag
ceph:~ # ceph osd rm-noout osd.0

To verify the new configuration, you can inspect the OSD’s metadata:

ceph:~ # ceph osd metadata 0 -f json | jq -r '.bluefs_dedicated_db,.devices'
1
vdb, vdf

We can confirm that we now have a dedicated db device.

You can also check the OSD’s perf dump:

ceph:~ # ceph tell osd.0 perf dump bluefs | jq -r '.[].db_total_bytes,.[].db_used_bytes'
5368700928
47185920
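The reported db_total_bytes (5368700928 bytes, roughly 5 GiB) matches the 5G logical volume we created earlier, and db_used_bytes shows that BlueFS has already placed data on it.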

That’s it, the OSD’s DB is now on a different device!
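By the way, attaching a dedicated WAL device works almost the same way. A minimal sketch, assuming you created a separate LV for it (the names ceph-wal/ceph-osd0-wal are just examples) and stopped the OSD as above:

# Create the WAL device on the prepared LV (run inside the OSD shell)
[ceph: root@ceph /]# ceph-volume lvm new-wal --osd-id 0 --osd-fsid $OSD_FSID --target ceph-wal/ceph-osd0-wal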

Migrate DB back to main device

If you’re looking to go the other way around, that’s also possible. It works with ceph-volume as well, but for the sake of variety I’ll show how to do it with ceph-bluestore-tool:

# Set the noout flag
ceph:~ # ceph osd add-noout osd.0

# Stop the OSD
ceph:~ # ceph orch daemon stop osd.0

# Enter the shell
ceph:~ # cephadm shell --name osd.0

# Migrate DB to main device
[ceph: root@ceph /]# ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0/ --command bluefs-bdev-migrate --devs-source /var/lib/ceph/osd/ceph-0/block.db --dev-target /var/lib/ceph/osd/ceph-0/block
inferring bluefs devices from bluestore path
 device removed:1 /var/lib/ceph/osd/ceph-0/block.db

# IMPORTANT: Exit the shell and remove the DB's Logical Volume before you start the OSD! Otherwise the OSD will use it again because of the LV tags.
ceph:~ # lvremove /dev/ceph-db/ceph-osd0-db

# Alternatively, delete the LV tags of the DB LV before starting the OSD.
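# For the tag route, list the ceph.* tags with 'lvs' and remove them with 'lvchange'.
# This is only a sketch: the tag below (ceph.osd_id=0) is just an example, check the
# actual tags with 'ceph-volume lvm list' or 'lvs -o lv_tags' first.
ceph:~ # lvs -o lv_name,lv_tags ceph-db/ceph-osd0-db
ceph:~ # lvchange --deltag ceph.osd_id=0 ceph-db/ceph-osd0-db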

# Start the OSD
ceph:~ # ceph orch daemon start osd.0

# Unset the noout flag
ceph:~ # ceph osd rm-noout osd.0

Verify the results:

ceph:~ # ceph osd metadata 0 -f json | jq -r '.bluefs_dedicated_db,.devices'
0
vdb

The provided steps were performed on a Reef cluster (version 18.2.4).

Disclaimer: As always with such articles: The steps above do work for us, but needn’t work for you. So if anything goes wrong while you try to reproduce above procedure, it’s not our fault, but yours. And it’s yours to fix it! But whether it works for you or not, please leave a comment below so that others will know.


PGs not deep-scrubbed in time

Every now and then, Ceph users or operators stumble across this warning:

# ceph health detail
HEALTH_WARN 1161 pgs not deep-scrubbed in time
[WRN] PG_NOT_DEEP_SCRUBBED: 1161 pgs not deep-scrubbed in time
pg 86.fff not deep-scrubbed since 2024-08-21T02:35:25.733187+0000
...

I had increased the deep_scrub_interval from 1 week (the default, osd_deep_scrub_interval = 604800 seconds) to a longer interval, let’s say 2 weeks (ceph config set osd osd_deep_scrub_interval 1209600). Still, the cluster kept warning that PGs hadn’t been deep-scrubbed in time, even though the last deep_scrub_stamp was clearly younger than the configured interval. In the above example, the last deep-scrub happened only 13 days before the warning, so we were wondering why.

I checked the config setting:

# ceph config help osd_deep_scrub_interval
osd_deep_scrub_interval - Deep scrub each PG (i.e., verify data checksums) at least this often
 (float, advanced)
 Default: 604800.000000
 Can update at runtime: true
 Services: [osd]

The config setting applies to the OSD service and can be changed at runtime, so I expected changing it to be enough to clear the warning. Apparently, it was not.

Next, I looked into the code:

pool->opts.get(pool_opts_t::DEEP_SCRUB_INTERVAL, &deep_scrub_interval);
if (deep_scrub_interval <= 0) {
  deep_scrub_interval = cct->_conf->osd_deep_scrub_interval;
}

First, it checks whether the pool has an individual deep_scrub_interval set, which was not the case here. Next, it reads the config value from the config database, which we had changed to two weeks. But the warning still persisted. Why?

It turns out that the config value is read by the MGR service, not the OSD service! That wasn’t obvious to me from the code path, but I could confirm it quickly by setting osd_deep_scrub_interval to 1 hour for only the MGR service in a test cluster:

# ceph config set mgr osd_deep_scrub_interval 3600

The cluster then immediately warned about PGs not deep-scrubbed in time, as expected. So if you know what you’re looking for, the solution is already out there. But the documentation doesn’t mention it, and just searching for this issue online will give you all kinds of results and workarounds, but nothing pointing to the actual solution.

There are several options to deal with this, depending on the actual cluster, its workload, the OSD tree and a few more factors. Here are two of them:

  1. Change the default config setting for osd_deep_scrub_interval
    • either globally (for all services):
      • # ceph config set global osd_deep_scrub_interval 1209600
    • or for MGRs and OSDs:
      • # ceph config set osd osd_deep_scrub_interval 1209600
      • # ceph config set mgr osd_deep_scrub_interval 1209600
  2. Configure a different interval for each pool individually:
    • # ceph osd pool set <POOL> deep_scrub_interval 1209600

Note: the config setting has different meanings for those two daemons; I’ll get to that in a bit. But if you go with option 1, you should change the value for both services.

For example, if your cluster is heterogeneous with different device classes and many pools, deep-scrubs might be quick for SSD-backed pools and will take much longer for HDD-backed pools. That’s why I would recommend setting the value per pool so you don’t need to change the default config settings.
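If you have many pools, a small loop can save some typing. This is just a sketch; review the pool list first and exclude any pools you want to keep at the default interval:

# for pool in $(ceph osd pool ls); do ceph osd pool set "$pool" deep_scrub_interval 1209600; done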

I’d like to add some more details on how that config option affects the two daemons:

[osd] osd_deep_scrub_interval

OSDs will try to deep-scrub the PGs within this interval. If you increase the value (default one week or 604800 seconds), you give your OSDs more time to deep-scrub all PGs.

[mgr] osd_deep_scrub_interval

The (active) MGR checks whether the last deep-scrub timestamp is older than osd_deep_scrub_interval plus a grace period, which by default gives your OSDs 12.25 days to deep-scrub all PGs before the warning is issued. The actual formula is:

(mon_warn_pg_not_deep_scrubbed_ratio * deep_scrub_interval) + deep_scrub_interval

For example:

(0.75 * 7 days) + 7 days = 12.25 days

So if you increase osd_deep_scrub_interval only for OSDs but not for MGRs, the OSDs will have more time to finish their regular deep-scrubs, but you’ll still end up with the warning because the MGR compares the last deep-scrub timestamp to those 12.25 days.
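With a two-week interval configured for both services, the threshold accordingly moves to (0.75 * 14 days) + 14 days = 24.5 days. To see which PGs are actually lagging behind, you can sort them by their last deep-scrub timestamp, for example like this (a sketch based on the JSON field names of a Reef cluster; older releases may differ):

# ceph pg dump pgs -f json 2>/dev/null | jq -r '.pg_stats[] | "\(.last_deep_scrub_stamp) \(.pgid)"' | sort | head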

The affected cluster was on Pacific, where the config help output provides slightly less information; it lacks the service type:

## Pacific config help output
# ceph config help osd_deep_scrub_interval
osd_deep_scrub_interval - Deep scrub each PG (i.e., verify data checksums) at least this often
(float, advanced)
Default: 604800.000000
Can update at runtime: true

The information provided above (including the service type) is from a Reef cluster. I reported the configuration inconsistency to the documentation team, and I also asked to add a more detailed explanation about this health warning in the health check documentation.


Cephadm: change public network

One question that comes up regularly on the ceph-users mailing list is how to change the Ceph (public) network of a cluster deployed by cephadm. I wrote an article a few years ago, when cephadm was first introduced, that focused on changing the monitors’ IP addresses, but it didn’t address the entire network. So now I will.

To keep the article brief I will not paste all of the terminal output here; this is supposed to be a guide, not step-by-step instructions. For the sake of simplicity, I’ll cover only the Ceph “public_network”, not the “cluster_network”. There are several possible scenarios involving the “public_network”, but I will cover just one: moving the entire cluster to a different data center, which involves completely shutting it down. Parts of this procedure can also be used in disaster recovery situations, for example when two out of three monitors are broken and the surviving one needs to be started with a modified monmap to be able to form a quorum.

Disclaimer: The following steps have been executed in a lab environment. They worked for me but they might not work for you. If anything goes wrong while you try to reproduce the following procedure it’s not my fault, but yours. And it’s yours to fix it!

The Ceph version used in these tests was Reef 18.2.1.

Create backups of all relevant information such as keyrings, config files and a current monmap. Stop the cluster, and then prevent the daemons from starting by disabling the ceph.target.
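In practice that can look roughly like this (a sketch only; adjust paths and the {FSID} placeholder to your cluster, and note that setting the cluster-wide noout flag before a full shutdown is optional but common practice):

# Back up keyrings, config files and a readable copy of the monmap
reef1:~ # mkdir /root/ceph-backup
reef1:~ # cp -a /etc/ceph /root/ceph-backup/
reef1:~ # tar czf /root/ceph-backup/daemon-configs.tgz /var/lib/ceph/{FSID}/*/config /var/lib/ceph/{FSID}/*/keyring
reef1:~ # cephadm shell -- ceph mon dump > /root/ceph-backup/monmap.txt

# Optionally set the noout flag, then stop and disable the daemons (the last two commands on every host)
reef1:~ # cephadm shell -- ceph osd set noout
reef1:~ # systemctl stop ceph.target
reef1:~ # systemctl disable ceph.target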

Perform the maintenance procedure (e.g. move the servers to a different location) and power on the servers. Now change the network setup (IP addresses, NTP, etc.) according to your requirements.

Now it’s getting serious… The next steps include a bit more detail to provide some context and working examples. In this procedure, the “old network” is 10.10.10.0/24 and the “new network” is 192.168.160.0/24.

# Enter shell of first MON
reef1:~ # cephadm shell --name mon.reef1

# Extract current monmap
[ceph: root@reef1 /]# ceph-mon -i reef1 --extract-monmap monmap

# Print content
[ceph: root@reef1 /]# monmaptool --print monmap
monmaptool: monmap file monmap
epoch 5
fsid 2851404a-d09a-11ee-9aaa-fa163e2de51a
last_changed 2024-02-21T09:32:18.292040+0000
created 2024-02-21T09:18:27.136371+0000
min_mon_release 18 (reef)
election_strategy: 1
0: [v2:10.10.10.11:3300/0,v1:10.10.10.11:6789/0] mon.reef1
1: [v2:10.10.10.12:3300/0,v1:10.10.10.12:6789/0] mon.reef2
2: [v2:10.10.10.13:3300/0,v1:10.10.10.13:6789/0] mon.reef3

# Remove MONs with old address
[ceph: root@reef1 /]# monmaptool --rm reef1 --rm reef2 --rm reef3 monmap

# Add MONs with new address
[ceph: root@reef1 /]# monmaptool --addv reef1 [v2:192.168.160.11:3300/0,v1:192.168.160.11:6789/0] --addv reef2 [v2:192.168.160.12:3300/0,v1:192.168.160.12:6789/0] --addv reef3 [v2:192.168.160.13:3300/0,v1:192.168.160.13:6789/0] monmap

# Verify changes
[ceph: root@reef1 /]# monmaptool --print monmap
monmaptool: monmap file monmap
epoch 5
fsid 2851404a-d09a-11ee-9aaa-fa163e2de51a
last_changed 2024-02-21T09:32:18.292040+0000
created 2024-02-21T09:18:27.136371+0000
min_mon_release 18 (reef)
election_strategy: 1
0: [v2:192.168.160.11:3300/0,v1:192.168.160.11:6789/0] mon.reef1
1: [v2:192.168.160.12:3300/0,v1:192.168.160.12:6789/0] mon.reef2
2: [v2:192.168.160.13:3300/0,v1:192.168.160.13:6789/0] mon.reef3

# Inject new monmap
[ceph: root@reef1 /]# ceph-mon -i reef1 --inject-monmap monmap

Repeat this procedure for the remaining monitors. Keep in mind that their ceph.conf (/var/lib/ceph/{FSID}/mon.{MON}/config) still refers to the old network. Update those files accordingly, then start the monitors. If everything went well they should connect to each other and form a quorum.
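On each monitor host this boils down to something like the following (a sketch; verify the result of the sed before starting anything, and note that the systemd unit name contains your cluster’s FSID):

# Point the MON's config file to the new network
reef2:~ # sed -i 's/10\.10\.10\./192.168.160./g' /var/lib/ceph/{FSID}/mon.reef2/config

# Start the monitor via its systemd unit
reef2:~ # systemctl start ceph-{FSID}@mon.reef2.service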

Now the ceph public_network needs to be updated:

ceph config set mon public_network 192.168.160.0/24

Update the config files of the MGRs as well (/var/lib/ceph/{FSID}/mgr.{mgr}/config) and start them. Now you should have the orchestrator available again to deal with the OSDs (and other daemons), but it will still try to connect to the old network since the host list still contains the old addresses. Update the host addresses with:

ceph orch host set-addr reef1 192.168.160.11
ceph orch host set-addr reef2 192.168.160.12
ceph orch host set-addr reef3 192.168.160.13
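You can check the result with the host list, which should now show the new addresses:

ceph orch host ls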

It can take a few minutes for the orchestrator to connect to each host. Once it has, reconfigure the OSDs so their config files are updated automatically:

ceph orch reconfig osd

To verify, check the config files of one or more OSDs (/var/lib/ceph/{FSID}/osd.{OSD_ID}/config). If for some reason they are not updated automatically, you can update them manually.

Now the OSDs should be able to start successfully and eventually recover. Monitor the ceph status carefully. If, for example, you didn’t catch all OSDs and some of them still have the old address configured, you would see that in the osd dump output:

ceph osd dump | grep "10\.10\.10"

If that is the case, modify their config files if necessary and restart the affected OSDs. Unset the noout flag (if you set it before the shutdown) and test whether the cluster works as expected. And don’t forget to re-enable the ceph.target so the daemons start automatically after the next reboot.
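For reference, those last steps are simply:

ceph osd unset noout
systemctl enable ceph.target    # on every host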

I repeated this procedure back and forth a couple of times and it worked for me every time. Of course, there’s no guarantee that it will work for you; you might encounter issues that don’t come up in a virtual environment. So plan carefully and, if possible, test the procedure in a test environment first.

If there’s anything important missing here or if I made a mistake, feel free to comment!
