Ceph: How to enable nova live snapshots on Xen hypervisor

Currently, our company is in the middle of a migration process, trying to move our existing virtual machines into our private OpenStack cloud (Mitaka). Some of them will be moved, others will be re-installed. One of the major issues we are facing from time to time is the lack of Xen support.

The current Mitaka environment consists of one control node and three compute nodes, all running openSUSE Leap 42.1. The compute nodes are Xen hypervisors, and our storage backend for nova, glance and cinder is Ceph.

Background

I’m not that familiar with Ceph yet and am still trying to figure out the best configuration to improve performance. Anyway, the cluster is up and running, mostly stable, and we have begun to launch our first soon-to-be-productive instances in it. For maintenance reasons, we needed to create a snapshot of one of the running instances as a backup in case we messed up the VM’s configuration. It wouldn’t be the first time…

Possible problems

Unfortunately, nova decided to take a cold snapshot while a colleague was working on that machine. That was quite annoying, as the VM froze for the duration of the snapshot, so my colleague had to get a coffee…
Here is some output of nova-compute.log:

2017-01-12 12:55:51.919 [instance: 14b75237-7619-481f-9636-792b64d1be17] Beginning cold snapshot process
2017-01-12 12:59:27.085 [instance: 14b75237-7619-481f-9636-792b64d1be17] Snapshot image upload complete
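
To put a number on that freeze, the two timestamps above can be subtracted. This is only a small sketch: the log lines are the trimmed excerpts quoted in this post (a real nova-compute.log line carries more fields), and the helper names parse_timestamp and snapshot_duration are made up for illustration.

```python
from datetime import datetime

# The two log lines quoted above, trimmed to timestamp, instance and message.
LOG_LINES = [
    "2017-01-12 12:55:51.919 [instance: 14b75237-7619-481f-9636-792b64d1be17] "
    "Beginning cold snapshot process",
    "2017-01-12 12:59:27.085 [instance: 14b75237-7619-481f-9636-792b64d1be17] "
    "Snapshot image upload complete",
]


def parse_timestamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.mmm' timestamp of a log line."""
    # 10 characters of date + 1 space + 12 characters of time = 23 characters.
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")


def snapshot_duration(lines):
    """Return the time between snapshot start and upload completion."""
    start = end = None
    for line in lines:
        if "Beginning" in line and "snapshot process" in line:
            start = parse_timestamp(line)
        elif "Snapshot image upload complete" in line:
            end = parse_timestamp(line)
    if start is None or end is None:
        raise ValueError("start or end of snapshot not found in log lines")
    return end - start


duration = snapshot_duration(LOG_LINES)
print("snapshot took %.3f seconds" % duration.total_seconds())
```

For the excerpt above this comes to roughly three and a half minutes, which is how long the instance was unavailable.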

I searched for config options I might have missed and found disable_libvirt_livesnapshot, which I set to false in nova.conf. The result was the same: still a cold snapshot. You can find this option in nova.conf on your compute node(s), in the [workarounds] section:

[workarounds]
...
disable_libvirt_livesnapshot = false
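
If you want to double-check what is actually set on a compute node, a few lines of Python can read the option back. This is a rough sketch using the standard-library configparser rather than nova's own oslo.config machinery; the function name and the assumed_default value are my assumptions, so check the default of your release before relying on it.

```python
import configparser

# Sample content mirroring the nova.conf excerpt above.
SAMPLE_CONF = """
[workarounds]
disable_libvirt_livesnapshot = false
"""


def livesnapshot_disabled(conf_text, assumed_default=True):
    """Return the value of [workarounds]/disable_libvirt_livesnapshot.

    assumed_default is used when the section or option is missing; it is an
    assumption for this sketch, not a value read from nova itself.
    """
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    return parser.getboolean(
        "workarounds", "disable_libvirt_livesnapshot", fallback=assumed_default
    )


print(livesnapshot_disabled(SAMPLE_CONF))
```

On a real node you would pass in the content of your nova.conf instead of SAMPLE_CONF.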

I had no clue why it wouldn’t work, but…

Solution

I knew that on the rbd level live snapshots worked as expected, without any downtime of the instance; we already use(d) them for our backups. So I searched for live_snapshot, identified /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py as the root cause and added some log statements to it. They showed that nova always passes the hypervisor driver QEMU into the function _host.has_min_version(). Because we use Xen, this function always returns false, so live snapshots always get disabled.
So I dug deeper and searched for occurrences of HV_DRIVER_QEMU, as this was the hard-coded value that was passed to has_min_version:

compute1:~ # grep -r HV_DRIVER_XEN /usr/lib/python2.7/site-packages/nova/
/usr/lib/python2.7/site-packages/nova/virt/libvirt/host.py:HV_DRIVER_XEN = "Xen"
/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py: hv_driver = host.HV_DRIVER_XEN

Great, there were only two files, and one of them we already knew (from the log statements), so there was not too much to look into:

compute1:~ # grep -C3 HV_DRIVER_QEMU /usr/lib/python2.7/site-packages/nova/virt/libvirt/host.py
# This list is for libvirt hypervisor drivers that need special handling.
# This is *not* the complete list of supported hypervisor drivers.
HV_DRIVER_QEMU = "QEMU"
HV_DRIVER_XEN = "Xen"

Well, that was helpful: only two different drivers, and the second one seemed to be exactly what I needed. I simply replaced HV_DRIVER_QEMU with HV_DRIVER_XEN in driver.py, and that did the trick! I then added an if-statement so that the driver is chosen based on the configured virt_type. Here is the diff:

compute1:~ # diff -u /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py.mod
--- /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py 2017-01-13 09:33:23.257525708 +0100
+++ /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py.mod 2017-01-13 09:33:46.349105366 +0100
@@ -1649,9 +1649,14 @@
         # redundant because LVM supports only cold snapshots.
         # It is necessary in case this situation changes in the
         # future.
+        if CONF.libvirt.virt_type == 'xen':
+            hv_driver = host.HV_DRIVER_XEN
+        else:
+            hv_driver = host.HV_DRIVER_QEMU
+
         if (self._host.has_min_version(MIN_LIBVIRT_LIVESNAPSHOT_VERSION,
                                        MIN_QEMU_LIVESNAPSHOT_VERSION,
-                                       host.HV_DRIVER_QEMU)
+                                       hv_driver)
              and source_type not in ('lvm')
              and not CONF.ephemeral_storage_encryption.enabled
              and not CONF.workarounds.disable_libvirt_livesnapshot):
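
To illustrate why the hard-coded driver type made the check fail on Xen, here is a simplified stand-alone model of the logic. It is not nova's code: FakeHost, pick_hv_driver and all version numbers are placeholders I made up, and the real has_min_version() talks to libvirt instead of comparing stored tuples. Only the two HV_DRIVER_* values are taken from host.py as shown above.

```python
# Values as found in nova/virt/libvirt/host.py (see the grep output above).
HV_DRIVER_QEMU = "QEMU"
HV_DRIVER_XEN = "Xen"

# Placeholder minimum versions, NOT the real nova constants.
MIN_LIBVIRT_VERSION = (1, 0, 0)
MIN_HYPERVISOR_VERSION = (1, 3, 0)


class FakeHost(object):
    """Hypothetical stand-in for nova's libvirt host wrapper."""

    def __init__(self, libvirt_version, hypervisor_version, hypervisor_type):
        self.libvirt_version = libvirt_version
        self.hypervisor_version = hypervisor_version
        self.hypervisor_type = hypervisor_type

    def has_min_version(self, lv_ver=None, hv_ver=None, hv_type=None):
        """Return True only if versions are new enough AND the type matches."""
        if lv_ver is not None and self.libvirt_version < lv_ver:
            return False
        if hv_ver is not None and self.hypervisor_version < hv_ver:
            return False
        # This is the check that fails when QEMU is requested on a Xen host.
        if hv_type is not None and self.hypervisor_type != hv_type:
            return False
        return True


def pick_hv_driver(virt_type):
    """Mirror the if/else from the diff: Xen for 'xen', QEMU otherwise."""
    return HV_DRIVER_XEN if virt_type == "xen" else HV_DRIVER_QEMU


# A Xen compute node with made-up version numbers.
xen_host = FakeHost((1, 2, 18), (4, 5, 2), HV_DRIVER_XEN)

unpatched = xen_host.has_min_version(
    MIN_LIBVIRT_VERSION, MIN_HYPERVISOR_VERSION, HV_DRIVER_QEMU)
patched = xen_host.has_min_version(
    MIN_LIBVIRT_VERSION, MIN_HYPERVISOR_VERSION, pick_hv_driver("xen"))
print("unpatched:", unpatched, "patched:", patched)
```

In this model the unpatched call is rejected purely because of the type mismatch, regardless of the versions. Note that with the patch the Xen version gets compared against a minimum that was defined for QEMU, so the version check itself says little about Xen.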

For confirmation, here is an excerpt from nova-compute.log:

2017-01-12 17:20:22.760 [instance: 14b75237-7619-481f-9636-792b64d1be17] instance snapshotting
2017-01-12 17:20:24.049 [instance: 14b75237-7619-481f-9636-792b64d1be17] Beginning live snapshot process
2017-01-12 17:24:38.997 [instance: 14b75237-7619-481f-9636-792b64d1be17] Snapshot image upload complete

And indeed, that snapshot was live; the VM did not freeze. I decided to file a bug report; you can find it here.
