OpenStack: Upgrade to high availability (Part II)

This article is the second part of the series about our OpenStack upgrade process (you can find the previous post here). It covers the preparation steps required to reach that goal; the key elements of the preparation were (not necessarily in this order):

  • Upgrade the OpenStack database
  • Create AutoYaST profiles
  • Create salt states
  • Prepare PXE installation

Since we have a SUSE Manager (SUMA) in our Cloud environment, the essential basis for the installation and configuration of our servers was already there: the SUMA server provides repositories for all our existing machines and is also our PXE server, handling automatic installations based on AutoYaST. This setup makes it possible to start from scratch over and over until the results are satisfying.

Setting up and configuring a SUMA server is not part of this blog series as it's quite a huge task and a project of its own. Creating all the necessary salt states was also challenging and can't be covered in detail, but a short description should suffice: basically, I needed to compare the (OpenStack) configuration files between our old Ocata Cloud and a SUSE OpenStack Cloud 9 (Rocky) test installation I had running as a virtual environment, and then merge those configs into salt states. It wasn't very difficult, but it was quite time-consuming. The resulting configuration files would then be tested on one of the new control nodes to identify conflicts or deprecated configuration options.
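
Just to give an idea, the merged configuration files were then rolled out by states roughly like the following sketch; the state name, file paths and Jinja template are made up for illustration and are not our actual states:

# /srv/salt/openstack/nova.sls (hypothetical example)
nova_conf:
  file.managed:
    - name: /etc/nova/nova.conf
    - source: salt://openstack/files/nova.conf.jinja
    - template: jinja
    - user: root
    - group: nova
    - mode: '0640'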

Creating AutoYaST profiles and setting up a PXE server are not part of this series either since they are not applicable to every setup (AutoYaST is SUSE-specific). The SUMA component responsible for the PXE installation is called Cobbler. Since we already have such an environment, much of the information required to create those AutoYaST profiles is provided as variables to Cobbler, which made it a lot easier for us to create different profiles for control and compute nodes. But again, this is a very complex setup and can't be covered here.

Database

The most important thing on the old control node was the (MySQL) database. The main requirement was to import that database into the new Cloud in order to preserve our working environment. Since we decided to install the new control nodes with the latest operating system and OpenStack version (at that time) the import also needed some preparation.

To test the database migration I created a new virtual machine (“ocata-vm”) identical to the old control node (Leap 42.3 with the exact same package versions, which is possible thanks to SUMA repository staging) to play with the database without any risk. Having Ceph as the storage back-end for all services was quite neat: with RBD snapshots a rollback to a defined state was possible within a couple of seconds, so I could start over if I messed up. Of course, OpenStack has its own snapshot mechanisms, but they were not optimal in this case: Nova snapshots result in new, flat Glance images and take longer, and dealing with Cinder volume snapshots was not as quick as simply doing a snapshot rollback within Ceph.
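
The Ceph side of this is just two commands; a rough sketch (pool and image names are placeholders for the test VM's root disk, and the VM should be shut down before a rollback):

# Create a snapshot before an upgrade step
ceph-node:~ # rbd snap create <pool>/<image>@before-upgrade

# Roll back to that snapshot if something went wrong
ceph-node:~ # rbd snap rollback <pool>/<image>@before-upgrade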

Since it was just a test run that would have to be repeated before the actual migration, the current database status didn't really matter; I just needed it to document the necessary steps.

# Dump and copy the databases
old-control:~ # for service in cinder glance keystone nova nova_api nova_cell0 neutron
> do mysqldump $service -r /tmp/20200417-$service.dump
> done

old-control:~ # scp /tmp/20200417-*.dump ocata-vm:/tmp/

I wanted to be as sure as possible that the upgraded database would work in the new Cloud, so I didn't simply execute the various db-manage commands (e.g. nova-manage db sync) and leave it at that. In addition, I (temporarily) edited the keystone endpoints to point to my ocata-vm (replacing “old-control” with “ocata-vm” in the keystone.endpoint table) and tried to start each of the OpenStack services after every upgrade step.

I would only continue if Keystone, Glance, Nova, Neutron and Cinder started successfully. I mainly focused on the API services and didn't pay much attention to services like cinder-scheduler etc. The Horizon dashboard was also not in focus; I figured it couldn't be too difficult to get it running once the rest was working. RabbitMQ was ignored as well since I didn't need the services to talk to each other; I just wanted them to start properly and then continue with the database upgrade.
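
In practice that was just a loop over the relevant services after every step, roughly like the following (apache2 carries Keystone here; the openstack-glance-api and openstack-cinder-api unit names are from memory and may differ depending on the packaging):

# Restart the API services and check if they come up
ocata-vm:~ # for svc in openstack-nova-api openstack-glance-api openstack-cinder-api openstack-neutron apache2
> do systemctl restart $svc; systemctl is-active $svc
> done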

During every startup of the OpenStack services I monitored all related log files to see if any of them failed and then fixed the errors. For example, one of the most important changes was the switch from auth_uri to www_authenticate_uri in the config files. Of course, there were lots of (mostly small) issues during that process, but it doesn't make much sense to list all of them here.
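
To illustrate, in the [keystone_authtoken] section of the service config files (nova.conf, cinder.conf and so on) that change boiled down to the following; the URL is just an example value:

[keystone_authtoken]
# old, deprecated option:
#auth_uri = http://ocata-vm:5000/v3
# replacement:
www_authenticate_uri = http://ocata-vm:5000/v3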

Before importing the databases I had to make sure they already existed on my ocata-vm with the same credentials as in our production environment, so the services could connect to them; how to configure the database credentials is described in the OpenStack deployment guide. After the database import I also needed to change the cell_mapping in the nova_api database so Nova was able to connect, as shown in the commands below.
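
Creating the databases and granting access followed the deployment guide; sketched here for the Nova database only, with a placeholder password:

MariaDB [(none)]> CREATE DATABASE nova;
MariaDB [(none)]> GRANT ALL PRIVILEGES ON nova.* TO 'nova'@'localhost' IDENTIFIED BY '<NOVA_DBPASS>';
MariaDB [(none)]> GRANT ALL PRIVILEGES ON nova.* TO 'nova'@'%' IDENTIFIED BY '<NOVA_DBPASS>';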

# Import databases
ocata-vm:~ # for service in cinder glance keystone nova nova_api nova_cell0 neutron
> do mysql $service < /tmp/20200417-$service.dump
> done

# Replace keystone endpoints
MariaDB [keystone]> select * from keystone.endpoint;
[...]

MariaDB [keystone]> UPDATE keystone.endpoint SET url=REPLACE(url,'old-control','ocata-vm');

# Show cell information (--verbose also shows passwords)
ocata-vm:~ # nova-manage cell_v2 list_cells
+-------+--------------------------------------+----------------------------------------------------+--------------------------------------------------+----------+
|  Name |                 UUID                 |                            Transport URL           |                       Database Connection        | Disabled |
+-------+--------------------------------------+----------------------------------------------------+--------------------------------------------------+----------+
| cell0 | 00000000-0000-0000-0000-000000000000 |                                none:/              | mysql+pymysql://nova:****@old-control/nova_cell0 |  False   |
| cell1 | ae1536e9-48ec-4ad1-9a1c-d826c6289dd7 | rabbit://openstack:****@old-control:5672/openstack |    mysql+pymysql://nova:****@old-control/nova    |  False   |
+-------+--------------------------------------+----------------------------------------------------+--------------------------------------------------+----------+

# Update cell information
ocata-vm:~ # nova-manage cell_v2 update_cell --cell_uuid <CELL0_UUID> \
--database_connection "mysql+pymysql://nova:<NOVA_PASS>@ocata-vm/nova_cell0" \
--transport-url "none:///"

ocata-vm:~ # nova-manage cell_v2 update_cell --cell_uuid <CELL1_UUID> \
--database_connection "mysql+pymysql://nova:<NOVA_PASS>@ocata-vm/nova" \
--transport-url "rabbit://openstack:<RABBIT_PASS>@ocata-vm"

After verifying that the services started successfully I could continue with the actual upgrade process. I won't go into too much detail about all the upgrade steps since they, of course, repeated themselves; I'll just point out the key elements.

Basically, I ran these steps after every OpenStack version upgrade:

Keystone

ocata-vm:~ # keystone-manage db_sync --expand
ocata-vm:~ # keystone-manage db_sync --migrate
ocata-vm:~ # keystone-manage db_sync --contract
ocata-vm:~ # keystone-manage db_sync --check
ocata-vm:~ # systemctl restart apache2
ocata-vm:~ # source admin.sh

# Check keystone
ocata-vm:~ # openstack endpoint list

Since this whole upgrade process also included an upgrade from Python 2.7 to Python 3.6, at some point I had to remove the Python 2 Apache WSGI packages and replace them with the corresponding Python 3 versions in order to get Apache (and therefore Keystone) running.
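
On openSUSE that boiled down to something like the following (the package names are from memory and may differ slightly between releases):

# Swap the Python 2 WSGI module for the Python 3 variant
ocata-vm:~ # zypper remove apache2-mod_wsgi
ocata-vm:~ # zypper install apache2-mod_wsgi-python3
ocata-vm:~ # systemctl restart apache2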

Nova

The placement API was extracted from Nova and became an independently packaged component, so this had to be taken into account during the upgrade as well.
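
I won't cover the whole extraction here; once the placement package, database and configuration were in place (the data itself had been copied over from the nova_api database as described in the upstream upgrade notes), the remaining steps were roughly:

ocata-vm:~ # placement-manage db sync
ocata-vm:~ # placement-status upgrade check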

ocata-vm:~ # nova-manage api_db sync
ocata-vm:~ # nova-manage db sync
ocata-vm:~ # nova-manage db online_data_migrations
ocata-vm:~ # nova-status upgrade check

ocata-vm:~ # systemctl start openstack-nova-api.service
ocata-vm:~ # source admin.sh
ocata-vm:~ # openstack server list -c Name --all 
+------------------+
| Name             |
+------------------+
| ceph             |
| soc9-compute1    |
| soc9-control3    |
| soc9-control2    |
| soc9-control1    |
| soc9-admin       |
| ses6-osd6        |
| ses6-osd5        |
| ses6-osd4        |
| ses6-mon3        |
| ses6-mon2        |
| ses6-mon1        |
| ses6-admin       |
[...]

Yay, all of our instances are still there! Continuing with Glance…

Glance

ocata-vm:~ # glance-manage db_sync

Glance was probably the easiest component to upgrade; I can't recall any issues except for some config file changes, but there were no serious show-stoppers.

Cinder

During the Cinder upgrade from Stein to Train this error occurred:

ocata-vm:~ # cinder-manage db online_data_migrations
ocata-vm:~ # cinder-manage db sync
Error during database migration: Migration cannot continue until all volumes have been migrated to the `__DEFAULT__` volume type. Please run `cinder-manage db online_data_migrations`. There are still untyped volumes unmigrated.

Obviously, the cinder-manage db online_data_migrations command didn't resolve that, so I just did it myself:

# Show all active volume types
MariaDB [cinder]> select id,name from volume_types where deleted=0;
+--------------------------------------+-------------+
| id                                   | name        |
+--------------------------------------+-------------+
| 152cb8cb-0f64-4064-9235-8138418f73b9 | ssd         |
| 5f7f9e7f-a5b4-4061-b228-022d42139467 | hdd         |
| b00620a1-bda1-443e-afbe-b43c9021bf56 | ceph-ec     |
| f27162b0-c3c4-4929-8a95-89b83875885f | __DEFAULT__ |
+--------------------------------------+-------------+

# Update all empty rows
MariaDB [(none)]> update cinder.snapshots set volume_type_id='f27162b0-c3c4-4929-8a95-89b83875885f' where volume_type_id IS NULL;
Query OK, 75 rows affected (0,03 sec)
Rows matched: 75 Changed: 75 Warnings: 0

MariaDB [(none)]> update cinder.volumes set volume_type_id='f27162b0-c3c4-4929-8a95-89b83875885f' where volume_type_id IS NULL;
Query OK, 75 rows affected (0,03 sec)
Rows matched: 75 Changed: 75 Warnings: 0

This manual intervention only had to be done once; after updating these tables I had no major issues with Cinder during the rest of the upgrade process.
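
After updating the tables, the sync could simply be repeated:

ocata-vm:~ # cinder-manage db sync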

Neutron

ocata-vm:~ # neutron-db-manage check_migration
ocata-vm:~ # neutron-db-manage upgrade head
ocata-vm:~ # systemctl start openstack-neutron.service 
ocata-vm:~ # source admin.sh
ocata-vm:~ # openstack network list
+--------------------------------------+----------------+--------------------------------------+
| ID                                   | Name           | Subnets                              |
+--------------------------------------+----------------+--------------------------------------+
| 0523eb01-17ab-4c57-8fe6-5a49edcfb2af | ovsnet         | d442f9aa-56bd-49f0-8d09-eb4f4e60c13b |
| 18db85e5-36aa-4669-9004-9ec43baad3f2 | floating       | dafed1c4-da5b-4557-bf18-cc7f9266f82a |
| 4421e160-d675-49f2-8c29-9722aebf03b2 | adminnet       | ade0def8-76bd-4c24-af49-6ebed152da8e |
[...]

Although there weren’t any major issues with the Neutron DB upgrade there was still no indication if our switch from linuxbridge to openvswitch was even possible. At some point we would either fail or succeed, so we just continued with the preparation.

For the upgrade from Rocky to Stein I also had to upgrade the operating system from Leap 42.3 to Leap 15.1, since there were no Stein packages available for Leap 42.3. But that was not a problem at all: the VM upgraded and rebooted successfully, and the OpenStack services also started without issues on the new operating system.
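
The distribution upgrade itself was the usual zypper procedure; roughly the following, after pointing the VM at the Leap 15.1 repositories (in our case by assigning the matching SUMA channels):

# Refresh the repositories and run the distribution upgrade
ocata-vm:~ # zypper ref
ocata-vm:~ # zypper dup
ocata-vm:~ # reboot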

All these steps were repeated until the database arrived at the Train station. ;-) That was quite promising, but during the first attempt I didn't capture all the steps I took, and I also wanted to avoid documenting faulty steps, so I had to roll back and start over. But in the end I had reliable documentation prepared for the actual upgrade, and I became optimistic about the rest of the process.

The next article will be about the requirements for pacemaker resources and how to configure stateful and stateless OpenStack services.
