How to migrate from SUSE Enterprise Storage to upstream Ceph

As most customers who use(d) the SES product (“SUSE Enterprise Storage”) should be aware, the product has been discontinued. The reasoning behind that decision is not part of this article.

Instead I’d like to address customers, users, operators or admins who haven’t yet decided how to continue with Ceph. Besides quitting Ceph (hopefully you’ll decide against this) there are several options to be considered. This post is not a step-by-step guide with technical details but rather an overview of a few possible paths. One approach will not be covered here, though: moving to a different vendor with Ceph support, as they will have their own migration path.

Cephadm clusters

If you decide to continue using Ceph but want to move to an upstream release (at the time of writing “Reef” had just been released) and your cluster is already managed by cephadm, it’s quite easy: upgrade your cluster to the desired Ceph version (for example the latest “Quincy” release):

ceph orch upgrade start --image quay.io/ceph/ceph:v17.2.6

Note: make sure your cluster is healthy before upgrading! The upgrade will only work if your cluster nodes have direct Internet access and sufficient free disk space. In an air-gapped environment with a private container registry you’ll need to synchronize the Ceph image to your registry first and point the image URL in the command line to your own registry. But basically, that’s all there is to do! Since the cluster is already containerized it’s independent of the operating system and you can continue using SLES. This has the advantage of still having support for the operating system (subscriptions are not taken into consideration in this article).
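
For the air-gapped case, the image synchronization could look roughly like this. A minimal sketch, assuming skopeo is available on a machine with Internet access and registry.example.local stands in for your private registry:

# Mirror the Ceph image into the private registry (registry.example.local is a placeholder)
skopeo copy docker://quay.io/ceph/ceph:v17.2.6 docker://registry.example.local/ceph/ceph:v17.2.6

# Then start the upgrade pointing to your own registry
ceph orch upgrade start --image registry.example.local/ceph/ceph:v17.2.6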

SaltStack

You can also continue to use SaltStack since you have a Salt Master (the admin node) and the other nodes are Salt Minions. This can still be used to centrally manage system configuration from your admin node or to issue commands on target nodes (e.g. check the free disk space across all nodes):

salt 'storage*' cmd.run 'df -h'

As for ceph-salt, which basically manages adding or removing cluster nodes, it will probably continue to work properly for at least some time, but don’t rely on it. Moving forward to newer Ceph releases, its functionality might become quite limited, and at that point some other strategy will be required to manage new systems. Removing cluster nodes doesn’t depend on ceph-salt, though; it just can be a little more convenient. But since Ceph clusters usually don’t grow too fast you can deal with this topic at a later point in time, after the migration has finished.

Operating System

If you need to also migrate to a different operating system (“OS”) you can do that as well, of course. You can either remove the nodes one by one entirely from the cluster, wait for the rebalancing to finish, install a new operating system of your choice (make sure that the same Ceph version is available) and redeploy the OSDs as well. Or you can preserve the OSDs and just reinstall the OS; cephadm is usually capable of activating existing OSDs. There are some factors that determine which approach will work best, I will not cover those details here.
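
If you go the second route, reactivating the preserved OSDs can be as simple as the following. A minimal sketch, assuming a Pacific or later cluster and <host> as a placeholder for the reinstalled node:

# After the reinstalled host has been re-added to the cluster, let cephadm
# scan for and activate the existing OSDs on it
ceph cephadm osd activate <host>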

If you need to move to a different OS and your cluster is not managed by cephadm yet, I strongly recommend upgrading the cluster to a cephadm-managed version first.

Non-cephadm clusters

Now this can be a little more difficult, but it’s still manageable. If your cluster has not yet been upgraded to a containerized version (Octopus or later), this section is relevant for you. Since I cannot cover all possible combinations in this article I will just assume that the cluster is still on SES 6 (Nautilus); if you’re still running an older release you should upgrade at least to Nautilus first. To continue the upgrade process to upstream Ceph all OSDs need to be bluestore OSDs managed by ceph-volume, so you might need to redeploy all of them before proceeding with the adoption.
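
A quick way to verify those prerequisites, as a sketch (the first command on any node with an admin keyring, the second on each OSD node):

# All OSDs should report bluestore here
ceph osd count-metadata osd_objectstore

# OSDs managed by ceph-volume show up in this listing,
# legacy ceph-disk OSDs do not
ceph-volume lvm list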

If you still have valid SES subscriptions I strongly recommend upgrading to a containerized version (the last SES release is 7.1, based on Ceph Pacific) while you have full SUSE support.

Software Repositories

The details of the upgrade process differ depending on the available software repositories, but the general process stays the same. I will just assume that you have an RMT server running in your environment and your cluster nodes are registered against it. If your nodes are directly connected to the SUSE Customer Center (“SCC”) the OS upgrade process is the same as with an RMT. If you have a SUSE Manager available, the process will be a bit different, but you’re probably aware of that and know how to manage it.

Upgrade SLES

This is usually an easy process (just run zypper migration or zypper dup respectively) if your subscriptions are valid and all add-on products are available. This might not be the case if your SES subscriptions expired before moving to cephadm. Since there are quite a few things to consider during that process I can’t cover all the details here. Just note that it might not be straightforward and will probably require some manual steps to preserve cluster availability during the upgrade.
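
For registered systems the OS upgrade itself boils down to one of these two commands; this is just a sketch of the happy path, done node by node with the cluster prepared accordingly (e.g. noout set):

# On a system registered against RMT/SCC with valid subscriptions
zypper migration

# Or, after manually switching the repositories to the new service pack
zypper dup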

Add upstream repository

If everything went well and your cluster is up and running with a newer SLES underneath, you should be able to add a custom repository to your RMT (if necessary). That is required for the “cephadm” package; “podman” and its dependencies should be available with the sle-module-containers add-on.

There are RPMs available for Ceph Pacific (for example on software.opensuse.org), so you should be able to move from Nautilus to Pacific by simply upgrading those packages. Note that those packages are built for openSUSE Leap 15.3, but they should be compatible with SLE 15 SP3; our tests didn’t reveal any issues. Make sure that you start your upgrade with the Monitor nodes first (probably colocated with Managers), then the OSDs, then the rest.
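
On each node this could look roughly like the following. A sketch only: the repository URL is a placeholder, point it to whatever you mirrored into your RMT:

# Activate the containers module for podman and its dependencies
SUSEConnect -p sle-module-containers/15.3/x86_64

# Add the repository providing the upstream Ceph Pacific and cephadm packages
zypper ar -f https://<your-rmt-or-mirror>/ceph-pacific/ ceph-pacific
zypper ref
zypper in cephadm podman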

Disclaimer: this might not work without manual intervention or might not work at all, do not continue if the first node fails! If possible, ensure that you can roll back the first MON node (pull one of the RAID disks for the OS, or create a snapshot if it’s a virtual machine). If it fails and you can’t roll back, deploy a new MON service with Nautilus on one of the other cluster nodes; this should still be possible with DeepSea via your admin node. Once the cluster has recovered, inspect what went wrong and try to fix it.

Cephadm adopt

If you managed to keep your cluster alive and now have Ceph Pacific running, make sure your nodes have the Pacific container images available, then you can move forward and adopt the daemons with cephadm. Note that it’s not supported to upgrade from Nautilus to Quincy or later; you need to upgrade to Pacific first because it’s not supported to skip more than one Ceph release.
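
Pre-pulling the image on each node avoids surprises during the adoption. A sketch, the exact tag is up to you (pick the current Pacific v16.2.x release, or your private registry’s copy of it):

# Pull a Pacific container image on every cluster node
podman pull quay.io/ceph/ceph:v16.2.11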

From here on you can basically follow the instructions in the upstream documentation. To break it down a bit for an overview, these are the key steps (a sketch of the corresponding commands follows the list):

  1. Check requirements
  2. Adopt Monitors/Managers
  3. Adopt OSDs
  4. Redeploy gateways (MDS, RGW, etc.)
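
As mentioned above, the adoption itself is mostly a matter of running cephadm against the legacy daemons on each node. A sketch, with hostnames, OSD IDs and the file system name as placeholders:

# Inventory of the legacy daemons on this node
cephadm ls

# Adopt MON, MGR and OSD daemons (run per daemon, per node)
cephadm adopt --style legacy --name mon.host1
cephadm adopt --style legacy --name mgr.host1
cephadm adopt --style legacy --name osd.0

# Gateways (MDS, RGW, etc.) are not adopted but redeployed via the orchestrator, e.g.:
ceph orch apply mds <fs-name> --placement="host1 host2"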

Although the documentation states that your gateways might not be available during redeployment, there is a way to avoid that if you have dedicated servers for those gateways. If they are colocated with MON or OSD services it might not be possible to avoid gateway downtime.

Conclusion

Ceph is a robust and resilient storage solution (if configured properly); we have seen it prove that lots of times in our own as well as in our customers’ clusters. If you had similar experiences and don’t want to quit on Ceph, I hope I could shed some light on the adoption process without bothering you with too many details. You can get in touch with us if you need assistance with that process. As a recognized SUSE partner of many years, we have in-depth knowledge of the SUSE products – and we also have corresponding knowledge and experience regarding upstream Ceph. If you have already migrated to upstream Ceph, I’d be curious how it went for you. If you have additional remarks, don’t hesitate to comment!


Cephadm: Reusing OSDs on reinstalled server

This is my second blog post about cephadm, the (relatively) new tool to deploy and manage Ceph clusters. From time to time I feel challenged by questions in the ceph-users mailing list or by customers, and then I try to find a solution to those particular problems. For example, my first post about cephadm dealt with the options to change a monitor’s IP address. This post will briefly describe how you can reactivate OSDs from a reinstalled server.

This (virtual) lab environment was based on SUSE Enterprise Storage 7 and podman (Ceph Octopus version 15.2.8). I won’t go into too much detail and will spare you most of the command line output; this article is meant to show the general idea, not to provide step-by-step instructions. This is also just one way to do it, I haven’t tried other ways (yet) and there might be smoother procedures. If you have better ideas or other remarks please leave a comment, I’d be happy to try a different approach and update this blog post.

Background

The title already reveals the background of this question. If the operating system (OS) of one of the OSD servers breaks and you need to reinstall it, there are two options for dealing with the OSDs on that server. The first is to let the cluster rebalance (which is usually the way to go, that’s what Ceph is designed for), reinstall the OS, then wipe the OSDs and add the node back to the cluster, which will again trigger a remapping of placement groups (PGs). The second is to reuse the existing OSDs, which is what this post is about.

Prior to cephadm and containerized services (but not older than Luminous) it was quite straightforward to bring back OSDs from a reinstalled host, ‘ceph-volume’ would do almost everything for you. But until there’s a solution within the Ceph orchestrator I currently only see this “hacky” way.

Possible Problem(s)

To prevent Ceph from remapping you would need to set the noout flag, assuming you noticed the server failure in time. This means your PGs will be degraded and the risk of data loss increases in case other disks or hosts fail. Depending on the install mechanisms a new installation of a server can be performed quite quickly, which reduces the risk of data loss. Especially if you consider that the remapping of an entire host also means a degraded cluster for quite some time, it might actually be faster to reinstall a server and reactivate its OSDs than to wait for the remapping to finish. So it’s basically up to the cluster’s administrator what the best strategy is for the individual setup.
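
Setting and clearing the flag is quick; a sketch, with the hostname as a placeholder:

# Prevent Ceph from marking OSDs out (cluster-wide) while the host is reinstalled
ceph osd set noout

# Or, more targeted, only for the affected host (Nautilus and later)
ceph osd set-group noout <hostname>

# Don't forget to clear the flag again once the OSDs are back
ceph osd unset noout
# (or "ceph osd unset-group noout <hostname>" respectively)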

Please also consider that, depending on the cluster details (number of OSDs, stored data, cluster I/O, failure domains, etc.), the PGs on the down OSDs could be outdated to the extent that adding them back to the cluster causes more load than adding new OSDs.

Solution

If you decide to retain those down OSDs and bring them back online, these are the basic steps to achieve that.

My virtual lab environment is running in OpenStack, so to simulate a node failure I just deleted a virtual machine (VM), the OSD volumes were not deleted, of course. Then I launched a new VM, prepared it for Ceph usage and attached the OSD volumes to it.

After the host was added to Ceph, a “crash” container was already successfully deployed, so cephadm seemed to work properly, which was confirmed by the ceph-volume command:

ses7-host1:~ # cephadm ceph-volume lvm list
Inferring fsid 7bdffde0-623f-11eb-b3db-fa163e672db2
Using recent ceph image registry.suse.com/ses/7/ceph/ceph:latest
[...]

Although the OSD activation with ceph-volume failed, I had the required information about those down OSDs:

  • Path to block devices (data, db, wal)
  • OSD FSID
  • OSD ID
  • Ceph FSID
  • OSD keyring

Four of those five properties can be collected from the cephadm ceph-volume lvm list output. The OSD keyring can be obtained from ceph auth get osd.<ID>.
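
To have the keyring ready as a file (for the keyring entry in the OSD directory later), something like this works; the OSD ID 6 is just the one from my lab:

# Dump the OSD's keyring (including its caps) into a file
ceph auth get osd.6 -o osd.6.keyring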

Since the crash container was already present, the required parent directory existed as well; for the rest I used a different OSD server as a template. These are the files I copied from a different server (except for the block and block.db devices, of course):

ses7-host1:~ # ls -1 /var/lib/ceph/7bdffde0-623f-11eb-b3db-fa163e672db2/osd.6/
block
block.db
ceph_fsid
config
fsid
keyring
ready
require_osd_release
type
unit.configured
unit.created
unit.image
unit.poststop
unit.run
whoami

I only needed to replace the contents of these five files with the correct keyring, OSD FSID and OSD ID (see the short sketch after the list):

  • fsid
  • keyring
  • whoami
  • unit.run
  • unit.poststop
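
For example, for OSD 6 in my lab (the other values are placeholders taken from the cephadm ceph-volume lvm list and ceph auth get output above):

# OSD FSID and OSD ID go into fsid and whoami
echo "<OSD_FSID>" > fsid
echo "6" > whoami
# keyring: paste the output of "ceph auth get osd.6"
# unit.run / unit.poststop: replace the template server's OSD ID and OSD FSID
# with the ones of the OSD being reactivated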

The next step was to create the symbolic links pointing to the correct block and block.db devices and change their ownership:

ses7-host1:/var/lib/ceph/<CEPH_FSID>/osd.<OSD_ID>/ # ln -s /dev/ceph-<VG>/osd-block-<LV> block

ses7-host1:/var/lib/ceph/<CEPH_FSID>/osd.<OSD_ID>/ # ln -s /dev/ceph-<DB_VG>/osd-db-<DB_LV> block.db

ses7-host1:~ # chown -R ceph.ceph /var/lib/ceph/<CEPH_FSID>/osd.<OSD_ID>/

And finally start the systemd unit:

ses7-host1:~ # systemctl start ceph-<CEPH_FSID>@osd.<OSD_ID>.service

After the first OSD started successfully I repeated this for all remaining OSDs on that server, and all of them came back online without an issue. This has not been tested with encrypted OSDs, though, so I’m not sure what else is necessary in that case, but maybe this procedure helps with figuring that out. I also don’t know if there’s a smoother or even automated way to achieve this; I don’t think there currently is, but maybe (hopefully) someone is working on it.

Disclaimer

The described steps have been executed in a lab environment. They worked for me but they might not work for you. If anything goes wrong while you try to reproduce the procedure it’s not my fault, but yours. And it’s yours to fix it!


Cephadm: Changing a Monitor’s IP address

With the introduction of Ceph Octopus lots of things have changed in how clusters are deployed and managed. One of those things is cephadm, which allows you to bootstrap a new cluster very quickly. One of the major changes is the now containerized environment based on docker or podman.

This post is not about the details of a containerized environment but rather about one specific task that has been asked about frequently in the ceph-users mailing list: How can I change a MON’s IP address, for example if it moved to a different location? There are basically two ways described in the docs, the “right way” and the “messy way”. In the previous releases (before containers) the process wasn’t very complicated, especially the “right way”. The “messy way” should be avoided if possible since it can, or maybe even will, result in cluster downtime if you only have 3 MONs. But in a disaster recovery scenario, or if the whole cluster moves to a different data center, it can be helpful.

Although the mentioned docs state that a MON should not change its IP address, it can still be necessary: just a couple of weeks ago one of our customers changed the whole public network and I had to rebuild the monmap the “messy way”. Fortunately it worked and the cluster came back healthy. Because the processes described in the docs are designed for the pre-container deployments, I will not only describe the containerized version of the “messy way”, I will also try to provide a version of the “right way” that focuses on changing a MON’s IP address instead of adding a new MON and then removing the old one. My description will also be divided into two parts, the “right way” and the “messy way” (of course). Usually I would add the disclaimer at the bottom but I feel it should be added right here:

The following steps have been executed in a lab environment. They worked for me but they might not work for you. If anything goes wrong while you try to reproduce the following procedure it’s not my fault, but yours. And it’s yours to fix it!

This (virtual) lab environment was based on SUSE Enterprise Storage 7 and podman (Ceph Octopus version 15.2.5). I won’t go into too much detail and will spare you all the command line output; this article is meant to show the general idea, not to provide step-by-step instructions. This is also just one way to do it, I haven’t tried other ways (yet) and there might be much smoother procedures. If you have better ideas or other remarks please leave a comment, I’d be happy to try other/better procedures and update this blog post.

The “right way”

Before you start, make sure the cluster is healthy and that you’ll still have a MON quorum even if one MON goes down, so the clients won’t notice any disruptions.

The actual “right way” would be to add a new MON (host4) to the cluster using the cluster spec file or a specific MON spec file and remove the old MON (host1) afterwards.
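
For reference, that would roughly look like this; a sketch, with the hostnames and the spec file name being placeholders from my lab:

# Add host4 as an additional MON via a spec file
cat > mon-spec.yaml <<EOF
service_type: mon
placement:
  hosts:
    - host2
    - host3
    - host4
EOF
ceph orch apply -i mon-spec.yaml

# Once host4 has joined the quorum, remove the old MON
# (it is no longer in the spec, so cephadm won't redeploy it)
ceph orch daemon rm mon.host1 --force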

But as I already mentioned I actually want to change the MON’s IP address, not add a new one. So the procedure changes a little:

# Change host1 MON's IP address
# Cluster still has quorum but loses one MON

cephadm:~ # ceph -s
cluster:
id: 8f279f36-811c-3270-9f9d-58335b1bb9c0
health: HEALTH_WARN
1/3 mons down, quorum host2,host3

# Cephadm will continue to probe for host1
# Change naming service to update host1's IP address
# Cluster recovers

As soon as the MGR daemons (the active MGR, to be more precise) can reach host1 at its new IP address (DNS should be reconfigured properly), the probing should succeed and the MON container should start successfully.

For full disclosure I need to mention that I did it slightly differently and realized afterwards that some of the steps were not necessary, for example removing host1 from the spec file to prevent cephadm from restarting the container. But as soon as the MON’s IP changes it won’t be reachable anyway, so there’s no point in that.

The “messy way”

As already stated, this is basically a disaster recovery scenario and hopefully you’ll never have to do it. But in case you do, the following procedure might help. Before you start, make sure you have backed up all important files/directories/keyrings etc. You’ll then need a new (modified) monmap to inject into the MON daemons.

# Get monmap
host1:~ # ceph mon getmap -o old-monmap
got monmap epoch 10

# Print old monmap
host1:~ # monmaptool --print old-monmap
monmaptool: monmap file old-monmap
epoch 4
fsid 8f279f36-811c-3270-9f9d-58335b1bb9c0
last_changed 2020-12-17T13:39:13.545453+0100
created 2020-12-17T13:39:13.545453+0100
min_mon_release 15 (octopus)
0: [v2:192.168.168.50:3300/0,v1:192.168.168.50:6789/0] mon.host1
1: [v2:192.168.168.51:3300/0,v1:192.168.168.51:6789/0] mon.host2
2: [v2:192.168.168.52:3300/0,v1:192.168.168.52:6789/0] mon.host3

# Remove host1 from monmap 
host1:~ # monmaptool --rm host1 old-monmap
monmaptool: monmap file old-monmap
monmaptool: removing host1
monmaptool: writing epoch 0 to old-monmap (2 monitors)

# Add host1 with new IP (192.168.168.58) to monmap 
host1:~ # monmaptool --addv host1 [v2:192.168.168.58:3300/0,v1:192.168.168.58:6789/0] old-monmap
monmaptool: monmap file old-monmap
monmaptool: writing epoch 0 to old-monmap (3 monitors)

# Check content
host1:~ # monmaptool --print old-monmap
monmaptool: monmap file old-monmap
epoch 0
fsid 8f279f36-811c-3270-9f9d-58335b1bb9c0
last_changed 2020-12-17T13:39:13.545453+0100
created 2020-12-17T13:39:13.545453+0100
min_mon_release 15 (octopus)
0: [v2:192.168.168.51:3300/0,v1:192.168.168.51:6789/0] mon.host2
1: [v2:192.168.168.52:3300/0,v1:192.168.168.52:6789/0] mon.host3
2: [v2:192.168.168.58:3300/0,v1:192.168.168.58:6789/0] mon.host1

# Rename file and copy it to all MON hosts
host1:~ # mv old-monmap new-monmap
host1:~ # scp new-monmap host2:/tmp/ 
host1:~ # scp new-monmap host3:/tmp/

# Change IP address of host1

Now host1 won’t be able to join the cluster anymore and we need to change the monmap. The difficulty here is that to inject a (modified) monmap you need the binary ‘ceph-mon’, which runs within the container. You can’t inject into (or even extract from) the running service because of the LOCK file:

# Enter the container
cephadm enter --name mon.host2

# Try to extract monmap from running daemon
[ceph: root@host2 /]# ceph-mon -i host2 --extract-monmap monmap.bin
2020-12-17T12:29:00.932+0000 7f6a9935b640 -1 rocksdb: IO error: While lock file: /var/lib/ceph/mon/ceph-host2/store.db/LOCK: Resource temporarily unavailable
2020-12-17T12:29:00.932+0000 7f6a9935b640 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-host2': (22) Invalid argument

But if we shut down the container the binary won’t be available, so what do we do? Exactly, we launch the container without ‘ceph-mon’ running. For that we just need to change the unit.run file under /var/lib/ceph/8f279f36-811c-3270-9f9d-58335b1bb9c0/mon.host2/unit.run. You probably noticed that the UUID in this path is my Ceph cluster’s fsid. Now just replace /usr/bin/ceph-mon with /bin/bash and restart the service:

# Stop MON service
host2:~ # systemctl stop ceph-8f279f36-811c-3270-9f9d-58335b1bb9c0@mon.host2.service

# Replace binary
host2:~ # cd /var/lib/ceph/8f279f36-811c-3270-9f9d-58335b1bb9c0/mon.host2

host2:/var/lib/ceph/8f279f36-811c-3270-9f9d-58335b1bb9c0/mon.host2 # sed -i -e 's/\/usr\/bin\/ceph-mon/\/bin\/bash/' unit.run

# Start service
host2:~ # systemctl start ceph-8f279f36-811c-3270-9f9d-58335b1bb9c0@mon.host2.service

Now there should be a running container without a ‘ceph-mon’ process and I can change the monmap:

# Enter container
host2:~ # cephadm enter --name mon.host2

# Either create new monmap 
[ceph: root@host2 /]# monmaptool --create --fsid 8f279f36-811c-3270-9f9d-58335b1bb9c0 --addv host1 [v2:192.168.168.58:3300,v1:192.168.168.58:6789] --addv host2 [v2:192.168.168.51:3300,v1:192.168.168.51:6789] --addv host3 [v2:192.168.168.52:3300,v1:192.168.168.52:6789] new-monmap
monmaptool: monmap file new-monmap
monmaptool: set fsid to 8f279f36-811c-3270-9f9d-58335b1bb9c0
monmaptool: writing epoch 0 to new-monmap (3 monitors)

[ceph: root@host2 /]# monmaptool --set-min-mon-release octopus new-monmap
monmaptool: monmap file new-monmap
setting min_mon_release = octopus
monmaptool: writing epoch 0 to new-monmap (3 monitors)

# Or use the already modified map
[ceph: root@host2 /]# scp 192.168.168.51:/tmp/new-monmap .

# Inject monmap
[ceph: root@host2 /]# ceph-mon -i host2 --inject-monmap new-monmap
[ceph: root@host2 /]# exit

# Stop service
host2:~ # systemctl stop ceph-8f279f36-811c-3270-9f9d-58335b1bb9c0@mon.host2.service

# Revert changes in unit.run file
host2:/var/lib/ceph/8f279f36-811c-3270-9f9d-58335b1bb9c0/mon.host2 # sed -i -e 's/\/bin\/bash/\/usr\/bin\/ceph-mon/' unit.run

# Reset ownership (it changed while the monmap was injected as root)
host2:/var/lib/ceph/8f279f36-811c-3270-9f9d-58335b1bb9c0/mon.host2 # chown -R ceph.ceph store.db

# Start service
host2:~ # systemctl start ceph-8f279f36-811c-3270-9f9d-58335b1bb9c0@mon.host2.service

Although two of the three MONs still have the right IP addresses, it’s possible that they still can’t form a quorum since their monmaps differ. Repeat the above procedure for all MONs; as soon as the majority of MONs have the new monmap and the containers launch successfully, the cluster should get back to a healthy state.

As I already wrote, this procedure worked for me and the cluster is still up and healthy. I haven’t noticed any side effects, but then again, it’s only a lab cluster without much data in it and no active clients. If there’s a cleaner way to modify the container in order to start it without ‘ceph-mon’, I’d appreciate it if you left a comment.

Update: There’s an easier way to modify an offline container than editing the unit.run file as previously described. It’s sufficient to stop the daemon and enter the shell:

host2:~ # systemctl stop ceph-8f279f36-811c-3270-9f9d-58335b1bb9c0@mon.host2.service

host2:~ # cephadm shell --name mon.host2      
[...]

[ceph: root@host2 /]# ceph mon getmap -o monmap

If you do modify anything within the MON container, make sure that you do it as user “ceph” so the permissions don’t change. Otherwise, set the correct permissions in case the container doesn’t start successfully.
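
Put together, the offline-shell variant of the injection could look like this; a sketch combining the steps shown above, which I haven’t re-run end to end in exactly this form:

# Stop the MON, enter its offline shell, inject the prepared monmap, start it again
host2:~ # systemctl stop ceph-8f279f36-811c-3270-9f9d-58335b1bb9c0@mon.host2.service
host2:~ # cephadm shell --name mon.host2
[ceph: root@host2 /]# scp 192.168.168.51:/tmp/new-monmap .
[ceph: root@host2 /]# ceph-mon -i host2 --inject-monmap new-monmap
[ceph: root@host2 /]# chown -R ceph.ceph /var/lib/ceph/mon/ceph-host2/store.db
[ceph: root@host2 /]# exit
host2:~ # systemctl start ceph-8f279f36-811c-3270-9f9d-58335b1bb9c0@mon.host2.service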
