With the introduction of Ceph Octopus, a lot has changed in how clusters are deployed and managed. One of those changes is cephadm, which allows you to bootstrap a new cluster very quickly. Another major change is the now containerized environment based on docker or podman.
This post is not about the details of a containerized environment but rather about one specific task that has been asked about frequently on the ceph-users mailing list: How can I change a MON’s IP address, for example if it moved to a different location? There are basically two ways described in the docs, the “right way” and the “messy way”. In the previous releases (before containers) the process wasn’t very complicated, especially the “right way”. The “messy way” should be avoided if possible since it can, and probably will, result in cluster downtime if you only have 3 MONs. But in a disaster recovery scenario or if the whole cluster moves to a different data center it can be helpful.
Although the mentioned docs state that a MON should not change its IP address, it can still be necessary. Just a couple of weeks ago one of our customers changed the whole public network and I had to rebuild the monmap the “messy way”. Fortunately it worked and the cluster came back healthy. Because the processes described in the docs are designed for pre-container deployments, I will not only describe the containerized version of the “messy way”. I will also try to provide a version of the “right way” that focuses on changing a MON’s IP address instead of adding a new MON and then removing the old one. My description will also be divided into two parts, the “right way” and the “messy way” (of course). Usually I would add the disclaimer at the bottom but I feel it should be added right here:
The following steps have been executed in a lab environment. They worked for me but they might not work for you. If anything goes wrong while you try to reproduce the following procedure it’s not my fault, but yours. And it’s yours to fix it!
Note that in my test procedure I only changed the IP address but not the network. So to actually move MONs to a different network more caution and preparation is required as well as routable networks if you want to avoid downtime.
This (virtual) lab environment was based on SUSE Enterprise Storage 7 and podman (Ceph Octopus version 15.2.5). I won’t go into too much detail and will spare you most of the command line output. This article is meant to show the general idea and not to provide step-by-step instructions. This is also just one way to do it. I haven’t tried other ways (yet) and there might be much smoother procedures. If you have better ideas or other remarks please leave a comment, I’d be happy to try other/better procedures and update this blog post.
The “right way”
Before you start, make sure the cluster is healthy and you’ll still have a MON quorum even if one MON goes down so the clients won’t notice any disruptions.
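The quorum requirement above comes down to simple arithmetic: the remaining MONs must still form a strict majority of all MONs in the monmap. The following is only a minimal sketch of that check. The function name `quorum_survives_loss` is made up for this post, and the MON count has to be supplied by you (for example from the output of `ceph mon stat`).

```shell
#!/bin/sh
# Hypothetical helper, not part of Ceph: succeeds (exit code 0) if a cluster
# with $1 MONs still has a strict majority after one MON goes down.
quorum_survives_loss() {
    total=$1
    needed=$(( total / 2 + 1 ))    # strict majority of all MONs in the monmap
    remaining=$(( total - 1 ))     # MONs left once one is unreachable
    [ "$remaining" -ge "$needed" ]
}

# Example for the three-MON lab cluster used in this post
if quorum_survives_loss 3; then
    echo "3 MONs: quorum survives the loss of one MON"
fi
```

With only two MONs the check fails, which is why changing one MON's IP address in such a cluster would interrupt client I/O.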
The actual “right way” would be to add a new MON (host4) to the cluster using the cluster spec file or a specific MON spec file and remove the old MON (host1) afterwards.
But as I already mentioned I actually want to change the MON’s IP address, not add a new one. So the procedure changes a little:
# Change host1 MON's IP address
# Cluster still has quorum but loses one MON
cephadm:~ # ceph -s
  cluster:
    id:     8f279f36-811c-3270-9f9d-58335b1bb9c0
    health: HEALTH_WARN
            1/3 mons down, quorum host2,host3
# Cephadm will continue to probe for host1
# Change naming service to update host1's IP address
# Cluster recovers
As soon as the MGR daemons (the active MGR, to be more precise) reach host1 by its new IP address (DNS should be reconfigured properly) the probing should succeed and the MON container should start successfully.
For full disclosure I need to mention that I did it slightly differently and realized afterwards that some of my steps were not necessary, for example removing host1 from the spec file to prevent cephadm from restarting the container. As soon as the MON’s IP changes it won’t be reachable anyway, so there’s no point in that.
The “messy way”
As already stated this is basically a disaster recovery scenario and hopefully you’ll never have to do that. But in case you do, the following procedure might help. Before you start make sure you have backed up all important files/directories/keyrings etc. You’ll then need a new (modified) monmap to inject into the MON daemons.
# Get monmap
host1:~ # ceph mon getmap > old-monmap
got monmap epoch 10

# Print old monmap
host1:~ # monmaptool --print old-monmap
monmaptool: monmap file old-monmap
epoch 4
fsid 8f279f36-811c-3270-9f9d-58335b1bb9c0
last_changed 2020-12-17T13:39:13.545453+0100
created 2020-12-17T13:39:13.545453+0100
min_mon_release 15 (octopus)
0: [v2:192.168.168.50:3300/0,v1:192.168.168.50:6789/0] mon.host1
1: [v2:192.168.168.51:3300/0,v1:192.168.168.51:6789/0] mon.host2
2: [v2:192.168.168.52:3300/0,v1:192.168.168.52:6789/0] mon.host3

# Remove host1 from monmap
host1:~ # monmaptool --rm host1 old-monmap
monmaptool: monmap file old-monmap
monmaptool: removing host1
monmaptool: writing epoch 0 to old-monmap (2 monitors)

# Add host1 with new IP (192.168.168.58) to monmap
host1:~ # monmaptool --addv host1 [v2:192.168.168.58:3300/0,v1:192.168.168.58:6789/0] old-monmap
monmaptool: monmap file old-monmap
monmaptool: writing epoch 0 to old-monmap (3 monitors)

# Check content
host1:~ # monmaptool --print old-monmap
monmaptool: monmap file old-monmap
epoch 0
fsid 8f279f36-811c-3270-9f9d-58335b1bb9c0
last_changed 2020-12-17T13:39:13.545453+0100
created 2020-12-17T13:39:13.545453+0100
min_mon_release 15 (octopus)
0: [v2:192.168.168.51:3300/0,v1:192.168.168.51:6789/0] mon.host2
1: [v2:192.168.168.52:3300/0,v1:192.168.168.52:6789/0] mon.host3
2: [v2:192.168.168.58:3300/0,v1:192.168.168.58:6789/0] mon.host1

# Rename file and copy it to all MON hosts
host1:~ # mv old-monmap new-monmap
host1:~ # scp new-monmap host2:/tmp/
host1:~ # scp new-monmap host3:/tmp/

# Change IP address of host1
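The address vector passed to `--addv` is easy to mistype because the IP address appears twice. A small helper can build it from a single IP. This is just a sketch: the function name `mon_addrvec` is hypothetical, and it assumes the default ports 3300 (v2) and 6789 (v1) that show up in the monmap of my lab cluster.

```shell
#!/bin/sh
# Hypothetical helper: print the address vector for one MON IP,
# assuming the default ports 3300 (v2) and 6789 (v1).
mon_addrvec() {
    ip=$1
    printf '[v2:%s:3300/0,v1:%s:6789/0]\n' "$ip" "$ip"
}

mon_addrvec 192.168.168.58
# prints: [v2:192.168.168.58:3300/0,v1:192.168.168.58:6789/0]
```

It could then be used as `monmaptool --addv host1 "$(mon_addrvec 192.168.168.58)" old-monmap`. The double quotes also keep the shell from treating the square brackets as a glob pattern.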
Now host1 won’t be able to join the cluster anymore and we need to change the monmap. The difficulty here is that to inject a (modified) monmap you need the binary ‘ceph-mon’ which is running within the container. You can’t inject (not even extract) it into the running service because of the LOCK file:
# Enter the container
cephadm enter --name mon.host2

# Try to extract monmap from running daemon
[ceph: root@host2 /]# ceph-mon -i host2 --extract-monmap monmap.bin
2020-12-17T12:29:00.932+0000 7f6a9935b640 -1 rocksdb: IO error: While lock file: /var/lib/ceph/mon/ceph-host2/store.db/LOCK: Resource temporarily unavailable
2020-12-17T12:29:00.932+0000 7f6a9935b640 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-host2': (22) Invalid argument
But if we shut down the container the binary won’t be available, so what do we do? Exactly, we launch the container without ‘ceph-mon’ running. For that we just need to change the unit.run file under /var/lib/ceph/8f279f36-811c-3270-9f9d-58335b1bb9c0/mon.host2/unit.run. You probably noticed that the UUID in this path is my Ceph cluster’s fsid. Now just replace /usr/bin/ceph-mon with /bin/bash and restart the service:
# Stop MON service
host2:~ # systemctl stop ceph-8f279f36-811c-3270-9f9d-58335b1bb9c0@mon.host2.service

# Replace binary
host2:~ # cd /var/lib/ceph/8f279f36-811c-3270-9f9d-58335b1bb9c0/mon.host2
host2:/var/lib/ceph/8f279f36-811c-3270-9f9d-58335b1bb9c0/mon.host2 # sed -i -e 's/\/usr\/bin\/ceph-mon/\/bin\/bash/' unit.run

# Start service
host2:~ # systemctl start ceph-8f279f36-811c-3270-9f9d-58335b1bb9c0@mon.host2.service
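Since unit.run has to be changed and later changed back, it may be safer to keep a backup copy instead of running a second sed to revert. The following is only a sketch of that idea. The function names `swap_path` and `restore_path` are made up for this post, and I haven't used this variant on the lab cluster.

```shell
#!/bin/sh
# Hypothetical helper: replace the first occurrence of a path on each line
# of a file and keep the untouched original as <file>.bak.
swap_path() {
    file=$1
    from=$2
    to=$3
    # '|' as the sed delimiter avoids escaping every slash in the paths
    sed -i.bak "s|$from|$to|" "$file"
}

# Hypothetical helper: undo swap_path by moving the backup copy back.
restore_path() {
    file=$1
    mv "$file.bak" "$file"
}
```

In the MON directory this would be `swap_path unit.run /usr/bin/ceph-mon /bin/bash` before starting the service and `restore_path unit.run` after stopping it again.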
Now there should be a running container without a ‘ceph-mon’ process and I can change the monmap:
# Enter container
host2:~ # cephadm enter --name mon.host2

# Either create new monmap
[ceph: root@host2 /]# monmaptool --create --fsid 8f279f36-811c-3270-9f9d-58335b1bb9c0 --addv host1 [v2:192.168.168.58:3300,v1:192.168.168.58:6789] --addv host2 [v2:192.168.168.51:3300,v1:192.168.168.51:6789] --addv host3 [v2:192.168.168.52:3300,v1:192.168.168.52:6789] new-monmap
monmaptool: monmap file new-monmap
monmaptool: set fsid to 8f279f36-811c-3270-9f9d-58335b1bb9c0
monmaptool: writing epoch 0 to new-monmap (3 monitors)
[ceph: root@host2 /]# monmaptool --set-min-mon-release octopus new-monmap
monmaptool: monmap file new-monmap
setting min_mon_release = octopus
monmaptool: writing epoch 0 to new-monmap (3 monitors)

# Or use the already modified map
[ceph: root@host2 /]# scp 192.168.168.51:/tmp/new-monmap .

# Inject monmap
[ceph: root@host2 /]# ceph-mon -i host2 --inject-monmap new-monmap
[ceph: root@host2 /]# exit

# Stop service
host2:~ # systemctl stop ceph-8f279f36-811c-3270-9f9d-58335b1bb9c0@mon.host2.service

# Revert changes in unit.run file
host2:/var/lib/ceph/8f279f36-811c-3270-9f9d-58335b1bb9c0/mon.host2 # sed -i -e 's/\/bin\/bash/\/usr\/bin\/ceph-mon/' unit.run

# Set ownership (it changed after editing unit.run)
host2:/var/lib/ceph/8f279f36-811c-3270-9f9d-58335b1bb9c0/mon.host2 # chown -R ceph:ceph store.db

# Start service
host2:~ # systemctl start ceph-8f279f36-811c-3270-9f9d-58335b1bb9c0@mon.host2.service
Although two of three MONs still have the right IP addresses it’s possible that they still can’t form a quorum since their monmaps differ. Repeat above procedure for all MONs and as soon as the majority of MONs has the new monmap and the containers launch successfully the cluster should get back to a healthy state.
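Because the same steps have to be repeated on every MON host, the long unit names and paths are a likely source of typos. A minimal sketch of two helpers that derive them from the fsid and the MON name could look like this. The function names are hypothetical, and the patterns are simply the ones visible in the commands above.

```shell
#!/bin/sh
# Hypothetical helper: systemd unit name of a MON, following the pattern
# ceph-<fsid>@mon.<name>.service used in the commands above.
mon_unit() {
    printf 'ceph-%s@mon.%s.service\n' "$1" "$2"
}

# Hypothetical helper: data directory of a MON on its host.
mon_dir() {
    printf '/var/lib/ceph/%s/mon.%s\n' "$1" "$2"
}

fsid=8f279f36-811c-3270-9f9d-58335b1bb9c0
for h in host1 host2 host3; do
    echo "$(mon_unit "$fsid" "$h")  $(mon_dir "$fsid" "$h")"
done
```

On host2, stopping the daemon would then read `systemctl stop "$(mon_unit "$fsid" host2)"`.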
As I already wrote, this procedure worked for me and the cluster is still up and healthy. I haven’t noticed any side effects, but then again it’s only a lab cluster without much data in it and no active clients. If there’s a cleaner way to modify the container in order to start it without ‘ceph-mon’ I’d appreciate it if you left a comment.
Update: There’s an easier way to modify an offline container than modifying the unit.run file as previously described. It’s sufficient to stop the daemon and enter the shell:
host2:~ # systemctl stop ceph-8f279f36-811c-3270-9f9d-58335b1bb9c0@mon.host2.service
host2:~ # cephadm shell --name mon.host2
[...]
[ceph: root@ses7-host3 /]# ceph mon getmap -o monmap
If you do modify anything within the mon container, make sure that you do it as user “ceph” so the ownership of the files doesn’t change. Otherwise, if the container doesn’t start successfully, check the ownership and correct it.
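To see whether the ownership changed, you can list everything below the MON directory that doesn’t belong to the expected user. This is only a sketch: the function name `not_owned_by` is made up, and it assumes the user “ceph” exists on the host, as it did in my lab environment.

```shell
#!/bin/sh
# Hypothetical helper: list all files and directories below $1 that are
# not owned by user $2. Empty output means the ownership is consistent.
not_owned_by() {
    dir=$1
    user=$2
    find "$dir" ! -user "$user" -print
}
```

For host2 this would be `not_owned_by /var/lib/ceph/8f279f36-811c-3270-9f9d-58335b1bb9c0/mon.host2/store.db ceph`. Any path in the output is a candidate for the chown command shown earlier.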