OpenStack: Upgrade to high availability (Part III)

The previous post in this little series covered the environment preparation: installing the latest operating system and OpenStack version while preserving the database, in order to migrate an existing Cloud environment to a newer platform. I mainly focused on the database as the most critical component.

This post continues the environment preparation with Galera, RabbitMQ and Memcached. The next article will describe the high availability setup of the OpenStack services, and I’ll conclude with a last article dealing with the required steps for the actual migration.

Highly available database (Galera)

To provide a fail-safe database setup we decided to go with Galera: we already had a virtual SUSE OpenStack Cloud that we used as a template for some of the necessary configuration, and it also uses Galera. To keep this series readable I’ll spare most details and try to focus on the main aspects.

Galera also works with only two nodes, but to avoid split-brain scenarios it’s recommended to have a third tiebreaker node that runs only the garbd service. We chose a different hardware machine in our environment for this task. During my work with OpenStack I already wrote an article about Galera and tiebreakers, so I’ll skip most of those details.
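
For reference, this is roughly how the arbitrator can be run on the tiebreaker node. The hostname is illustrative and the exact package or service wrapper depends on the distribution, but the flags are the standard garbd options, and the group name has to match the wsrep_cluster_name configured below:

# Hypothetical invocation of the Galera arbitrator on the tiebreaker node
tiebreaker:~ # garbd --daemon \
--group NewCloudGalera \
--address "gcomm://controller01:4567,controller02:4567"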

The rest is pretty straightforward; I’ll describe the executed steps in a large code block. All these steps have to be performed on both controller nodes (with node-specific values such as the bind and listen addresses adjusted accordingly). Please note that in our setup this is all done automatically via Salt, so we didn’t actually need to run zypper in mariadb etc. by hand; it’s just shown here for visibility.

For the sake of simplicity the control nodes are called controller01 and controller02, with their IP addresses 10.0.0.1 and 10.0.0.2. The virtual control node is just called controller with 10.0.0.100.

# Install packages
controller01:~ # zypper in mariadb-client mariadb mariadb-galera \
python3-PyMySQL galera-3-wsrep-provider galera-python-clustercheck

# Galera configuration
# Some non-default values (especially max_connections)
controller01:~ # cat /etc/my.cnf.d/74-galera-tuning.cnf
[mysqld]
innodb_buffer_pool_size = 256M
innodb_log_file_size = 64M
innodb_buffer_pool_instances = 1
innodb_flush_method = O_DIRECT
innodb_flush_log_at_trx_commit = 1

max_connections = 2048
tmp_table_size = 64M
max_heap_table_size = 64M
skip_name_resolve = 1

# Regular galera.conf
controller01:~ # cat /etc/my.cnf.d/75-galera-custom.cnf 
[mysqld]
wsrep_on = ON
wsrep_provider = /usr/lib64/galera-3/libgalera_smm.so
wsrep_cluster_name = "NewCloudGalera"

wsrep_cluster_address = "gcomm://controller01,controller02"
wsrep_provider_options = "gmcast.listen_addr=tcp://10.0.0.1:4567; gcs.fc_limit=5; gcs.fc_factor=0.8;"
wsrep_slave_threads = 1
wsrep_max_ws_rows = 0
wsrep_max_ws_size = 2147483647
wsrep_debug = 0
binlog_format = ROW
default_storage_engine = InnoDB
innodb_autoinc_lock_mode = 2
innodb_doublewrite = 1
query_cache_size = 0
query_cache_type = 0
expire_logs_days = 10

user                   = mysql
datadir                = /var/lib/mysql
tmpdir                 = /var/lib/mysqltmp
bind-address           = 10.0.0.1

The galera-python-clustercheck service also needs to be configured; it is what haproxy’s health check on port 8000 (see below) talks to, and pacemaker keeps it running as a resource. Make sure to grant database access to the monitoring user.

controller01:~ # cat /etc/galera-python-clustercheck/my.cnf 
[client]
user=monitoring
password=****
host=10.0.0.1

# Additional options can be specified (defaults are commented)
controller01:~ # cat /etc/galera-python-clustercheck/galera-python-clustercheck.conf
[...]
#   -p PORT, --port=PORT  Port to listen on [default: 8000]
#   -6, --ipv6            Listen to ipv6 only (disabled ipv4) [default: False]
#   -4 IPV4, --ipv4=IPV4  Listen to ipv4 on this address [default: 0.0.0.0]

GALERA_PYTHON_CLUSTERCHECK_OPTIONS="--conf=/etc/galera-python-clustercheck/my.cnf"

controller01:~ # cat /etc/haproxy/haproxy.cfg
# Galera haproxy configuration
# Port 8000 is for the clustercheck
listen galera
        bind 10.0.0.100:3306
        mode tcp
        stick-table type ip size 1000
        stick on dst
        option httpchk
        option clitcpka
        maxconn 2038

        default-server port 8000

        server controller01 10.0.0.1:3306 check inter 2000 fastinter 1000 rise 5 fall 2 backup on-marked-down shutdown-sessions
        server controller02 10.0.0.2:3306 check inter 2000 fastinter 1000 rise 5 fall 2 backup on-marked-down shutdown-sessions

These settings have to be made on all controller nodes.

The respective pacemaker resource is configured as follows (FQDN is masked here):

# Galera (MariaDB cluster)
primitive galera galera \
params check_user=monitoring check_passwd=**** datadir="/var/lib/mysql" \
enable_creation=true log="/var/log/mysql/mysqld.log" socket="/var/run/mysql/mysql.sock" \
wsrep_cluster_address="gcomm://controller01,controller02" cluster_host_map="controller01:controller01.domain;controller02:controller02.domain" \
op demote interval=0 timeout=600s \
op monitor interval=23s \
op monitor interval=20s role=Master \
op promote interval=0 timeout=600s \
op start interval=0 timeout=120s \
op stop interval=0 timeout=120s

# Galera is a multi-state resource
ms ms-galera galera \
meta clone-max=3 interleave=false master-max=3 notify=true ordered=false target-role=Started is-managed=true

# Clustercheck
primitive galera-python-clustercheck systemd:galera-python-clustercheck \
op monitor interval=10s
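
Not shown in the excerpts above is how the virtual address 10.0.0.100 that haproxy binds to moves between the nodes. In a pacemaker setup this is typically an IPaddr2 resource grouped with haproxy; a minimal sketch, assuming the standard ocf:heartbeat:IPaddr2 agent and a systemd-managed haproxy (resource names and the netmask are illustrative, not taken from our actual configuration):

# Hypothetical sketch: virtual IP plus haproxy, kept together on one node
primitive vip-controller ocf:heartbeat:IPaddr2 \
params ip=10.0.0.100 cidr_netmask=24 \
op monitor interval=10s
primitive haproxy systemd:haproxy \
op monitor interval=10s
# The group also enforces ordering: the address exists before haproxy binds to it
group g-haproxy vip-controller haproxy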

To successfully bootstrap and initialize Galera we came up with the following command order. Again, these steps were developed manually, but in the end they were all executed in the specified order by Salt. The reason for this “strange” procedure: we needed to initialize MariaDB with our root password, create the databases, grant access and so on, and that is no longer possible once the pacemaker cluster has taken over the resource. But because we can’t install all control nodes entirely simultaneously (one node initializes the cluster, the other(s) simply join), a straight Galera bootstrap through pacemaker would fail. That’s why we run these initialization steps separately, before the actual Galera bootstrap.

# Workaround to initialize mysql without conflicting with galera
sed -i 's/wsrep_cluster_address = "gcomm:\/\/controller01,controller02"/wsrep_cluster_address = "gcomm:\/\/"/' /etc/my.cnf.d/75-galera-custom.cnf

# Bootstrap galera
galera_new_cluster

# Set root password, drop test db, disable remote access
mysql -u root << EOF
UPDATE mysql.user SET Password=PASSWORD('$PASSWORD') WHERE User='root';
DELETE FROM mysql.user WHERE User='';
DELETE FROM mysql.user WHERE User='root' AND Host NOT IN ('localhost', '127.0.0.1', '::1');
DROP DATABASE IF EXISTS test;
DELETE FROM mysql.db WHERE Db='test' OR Db='test\\_%';
FLUSH PRIVILEGES;
EOF

# Create databases
mysql -u root -p$PASSWORD << EOF
CREATE DATABASE keystone;
CREATE DATABASE glance;
CREATE DATABASE placement;
CREATE DATABASE nova_api;
CREATE DATABASE nova;
CREATE DATABASE nova_cell0;
CREATE DATABASE cinder;
CREATE DATABASE neutron;
GRANT PROCESS, SELECT ON *.* TO 'monitoring'@'localhost' IDENTIFIED BY '****';
GRANT PROCESS, SELECT ON *.* TO 'monitoring'@'%' IDENTIFIED BY '****';
GRANT ALL PRIVILEGES ON keystone.* TO 'keystone'@'localhost' IDENTIFIED BY '****';
GRANT ALL PRIVILEGES ON keystone.* TO 'keystone'@'%' IDENTIFIED BY '****';
[...]
# Repeat for all databases

FLUSH PRIVILEGES;
EOF

# stop mysql and undo config changes
systemctl stop mariadb.service
sed -i 's/wsrep_cluster_address = "gcomm:\/\/"/wsrep_cluster_address = "gcomm:\/\/controller01,controller02"/' /etc/my.cnf.d/75-galera-custom.cnf

After this step, and once the second node had joined the pacemaker cluster, the resource could be started and Galera would bootstrap a new cluster.
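
To verify that everything came up as expected, a quick look at the Galera status variables and the pacemaker resource state is enough; a sketch of such a check (the monitoring grants created above are sufficient for the status query):

# Cluster size and sync state as reported by Galera itself
controller01:~ # mysql -u monitoring -p -e "SHOW GLOBAL STATUS LIKE 'wsrep_%'" \
| grep -E "cluster_size|local_state_comment"

# Resource state as seen by pacemaker
controller01:~ # crm_mon -1 | grep -A 2 ms-galera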

RabbitMQ

The messaging service creates its own cluster, so luckily there was not much to configure. We just needed to make sure to provide a nodename and a working epmd configuration:

controller01:~ # cat /etc/rabbitmq/rabbitmq-env.conf 
NODENAME=rabbit@controller01

controller01:~ # cat /etc/systemd/system/epmd.socket.d/ports.conf 
[Socket]
ListenStream=10.0.0.1:4369
FreeBind=true
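
To verify that epmd really ends up listening only on the intended address after the socket unit has been restarted, a quick look at the listener list is enough (purely illustrative check):

# epmd should be bound to 10.0.0.1:4369 and not to 0.0.0.0
controller01:~ # ss -tlnp | grep 4369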

The pacemaker resource is also defined as a multi-state resource:

# RabbitMQ
primitive rabbitmq ocf:rabbitmq:rabbitmq-server-ha \
params default_vhost=openstack erlang_cookie=XX... pid_file="/var/run/rabbitmq/pid" \
policy_file="/etc/rabbitmq/ocf-promote" rmq_feature_health_check=true rmq_feature_local_list_queues=true \
meta failure-timeout=30s migration-threshold=10 resource-stickiness=100 \
op demote interval=0 timeout=120s \
op monitor interval=30s \
op monitor interval=27s role=Master \
op notify interval=0 timeout=180s \
op promote interval=0 timeout=120s \
op start interval=0 timeout=360s \
op stop interval=0 timeout=120s

primitive rabbitmq-port-blocker ocf:pacemaker:ClusterMon \
params extra_options="-E /usr/bin/rabbitmq-alert-handler.sh --watch-fencing" \
op monitor interval=10s \
meta target-role=started

ms ms-rabbitmq rabbitmq \
meta clone-max=3 interleave=false master-max=1 master-node-max=1 notify=true ordered=false target-role=started

Once the rabbit cluster started successfully, it could be configured for our OpenStack environment:

controller01:~ # rabbitmqctl set_cluster_name rabbit@TRAIN
controller01:~ # rabbitmqctl add_vhost openstack
controller01:~ # rabbitmqctl set_policy --vhost openstack --priority 0 --apply-to queues ha-queues '^(?!amq\.).*' '{"ha-mode": "exactly", "ha-params": 2}'
controller01:~ # rabbitmqctl add_user openstack PASSWORD && rabbitmqctl set_user_tags openstack management
controller01:~ # rabbitmqctl set_permissions --vhost openstack openstack ".*" ".*" ".*"

# Check status
controller01:~ # rabbitmqctl cluster_status
Cluster status of node rabbit@controller01 ...
[{nodes,[{disc,[rabbit@controller01,rabbit@controller02]}]},
 {running_nodes,[rabbit@controller02,rabbit@controller01]},
 {cluster_name,<<"rabbit@TRAIN">>},
 {partitions,[]},
 {alarms,[{rabbit@controller02,[]},{rabbit@controller01,[]}]}]

Now we had a rabbit cluster, and the OpenStack services would (we hoped) be able to communicate with each other, provided the transport_url was updated correctly in all the configuration files. Here’s an excerpt from nova.conf:

controller01:~ # grep transport_url /etc/nova/nova.conf.d/100-custom.conf
transport_url = rabbit://openstack:****@controller01.domain:5672,openstack:****@controller02.domain:5672/openstack
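
The same transport_url has to go into every other service that talks to RabbitMQ (cinder, neutron and so on), only the configuration file differs. A hedged example of what the corresponding cinder excerpt could look like (the drop-in layout is illustrative; the option itself lives in the [DEFAULT] section):

# Hypothetical cinder excerpt, following the same pattern as nova.conf above
[DEFAULT]
transport_url = rabbit://openstack:****@controller01.domain:5672,openstack:****@controller02.domain:5672/openstack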

Memcached

The last part of this post is a rather short one. Memcached caches authentication tokens from the identity service (Keystone); without this cache, the other services would have to validate every single request against Keystone directly.

We configured memcached to listen on both localhost and the primary IP address because during our tests we saw error messages pointing in that direction. We didn’t put more effort into the investigation and simply continued with this configuration:

controller01:~ # grep -Ev "^$|^#" /etc/sysconfig/memcached
MEMCACHED_PARAMS="-U 0 -m 64 -l 127.0.0.1,10.0.0.1 -p 11211 -c 4096"
MEMCACHED_USER="memcached"
MEMCACHED_GROUP="memcached"

Usually, the OpenStack services are configured with a list of memcached servers, so we first tried this as documented:

[keystone_authtoken]
memcached_servers = controller01:11211,controller02:11211

But if controller01 goes down, the second control node does not have the cached token available, so the client request is rejected as unauthorized and has to be authenticated again. The result would be the same with the virtual IP. So we decided to provide only localhost as memcached server:

[keystone_authtoken]
memcached_servers = localhost:11211
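
A quick way to confirm that the local memcached instance is actually reachable on that address (assuming netcat is available; any lines starting with STAT mean the daemon answers):

# Basic liveness check against the local memcached instance
controller01:~ # echo stats | nc -w 1 localhost 11211 | head -5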

This setting has worked quite well for us during several failover scenarios and it still does now that we are in production with the new Cloud environment.

If you have any comment or questions about this setup please let me know!
