OpenStack: Upgrade to high availability (Part I)

This article is the beginning of a little blog series about our journey to upgrade a single-control-node environment to a highly available OpenStack Cloud. We started using OpenStack as an experiment while our company was still running on a different environment, but as it sometimes happens, the experiment suddenly became a production environment without any redundancy. The only exception was our migration to Ceph as the storage back-end, so at least the (Glance) images, (Nova) ephemeral disks and (Cinder) volumes were redundant. The (single) control node, however, didn’t even have a RAID configuration, just regular backups of the database, config files etc.

One of the first things we did after moving our new OpenStack Cloud to production was to add a second disk to the control node and create a software RAID, so we would have at least some kind of failure resiliency. This has worked quite well for a couple of years now; we barely had any downtime except for maintenance, of course. But that is one downside of having only one controller: instances with self-service networks won’t be reachable via their floating IPs as long as the controller is down. So depending on the purpose of the maintenance, a downtime is basically inevitable, not to mention unplanned outages (luckily, we didn’t have any in recent years).

To not overstretch our luck, we decided that we needed to upgrade our environment to at least two control nodes, controlled by pacemaker. I had very limited experience with high availability, so I started researching, but all I could find was old and incomplete documentation (even the official OpenStack documentation) or old blog posts, which helped me understand the big picture, but that was it. I had no other choice, so I gathered all the information from those different sources and started testing with virtual machines and hardware servers, trying to figure out what was (still) applicable to our goal.
Another obstacle we had to overcome was our plan to move from linuxbridge to openvswitch while preserving all networks, routers and ports from the existing Cloud. Between the lack of documentation and the day-to-day business with customers, it took us about two years to actually finish the upgrade process. But we did it!

Overview

These are the main aspects of the old environment:

  • Operating system on all cloud nodes: openSUSE Leap 42.3
  • OpenStack version Ocata (started as an experiment with Kilo)
  • Ceph Nautilus (started with Hammer, also as an experiment)
  • Neutron with linuxbridge
  • SUSE Manager (Salt-Master for automated system installation and configuration)

Since we are a small company we only use a subset of the available OpenStack services:

  • Keystone
  • Glance
  • Cinder
  • Nova
  • Neutron
  • Horizon
  • (Heat)

We had been using Heat in the old environment, but at the time of the migration there were no active stacks, so we decided to ignore it until the migration was finished and simply configure it from scratch in the new Cloud.

The Plan

After the migration the new environment was supposed to look like this:

  • Operating system on all cloud nodes: openSUSE Leap 15.1
  • Two control nodes controlled by pacemaker
  • OpenStack version Train
  • Neutron with openvswitch
  • HAProxy in front of the OpenStack endpoints (virtual IP)

Although this series of articles is very technical, I’ll avoid detailed code blocks as much as possible beyond small sketches like the one above, since they won’t be of much use: the described steps were prepared and tested for our specific environment and won’t necessarily work for other environments.
If you have questions or remarks about any of the described steps, please don’t hesitate to comment.
