This article refers to a four-node environment based on SUSE Cloud 5, consisting of 1 admin node and 1 control node (both SLES11-SP3), plus 2 compute nodes, one with SLES11-SP3 and the other with SLES12. The SLES11 compute node was added later for debugging purposes. I won’t describe the issues during cloud deployment; they are covered in the other posts in this blog.
If you decide to run your compute node on SLES12, it is very likely that your launched instances won’t get an IP address assigned at creation time.
After I had installed the cloud nodes and deployed the OpenStack services onto the control and compute nodes, I created and uploaded an image, configured the security group rules to allow SSH access to the instances, and finally launched an instance.
But there was no access to the instance via SSH, only from the compute node (xl console) or via the web UI. In the Horizon Dashboard I could see that the VM had a fixed IP assigned, but it was not accessible. I logged in to the VM to check the network configuration, but eth0 had no IP assigned. That explained why communication with the VM was impossible, but why was no IP injected?
I tried all kinds of things: made sure that my image had DHCP enabled, checked with our network admin whether our settings were right, and so on. Nothing really helped; I was stuck. So I started a thread in the SUSE Forum and asked on the mailing list, hoping for help.
The answers weren’t really helpful; they just described the designed behavior. Simultaneously, I had an open SR for an openvswitch issue: I had tried to get nova-compute working with ovs because I couldn’t get linuxbridge working. But ovs didn’t really change anything. My compute node seemed to be misconfigured by Chef, because after the second reboot during the deployment of nova-compute it lost its network connectivity and was no longer accessible. Support recommended sticking with linuxbridge, so we switched back and found out that the VLAN interface on the compute node wasn’t created automatically, and hence wasn’t added to the bridge. (We’re using bonding of two Ethernet adapters.) Also, the compute node’s loss of network connectivity was still an issue with linuxbridge.
I tried all kinds of configurations Support suggested and reinstalled the compute node countless times, but without success.
To be able to compare settings and behavior I installed another compute node with SLES11-SP3. Here, everything worked perfectly! Service deployment (nova) worked without problems, instances’ network settings were configured correctly, I could SSH onto the instance, everything was fine! But what was the difference between SLES11 and SLES12 using linuxbridge?
First we got a PTF for wicked that fixed the network problems on SLES12: the nova services could now be deployed without breaking the compute node’s network connectivity. That was a big step forward!
But the next steps were not very successful: we tried different ways to debug why the instances on the SLES12 compute node got no IP assigned, without helpful results. Then one tip from Support pointed me in the right direction. I was supposed to add two lines to the file /usr/lib/python2.7/site-packages/neutron/plugins/linuxbridge/agent/linuxbridge_neutron_agent.py for debugging purposes. Unfortunately, I couldn’t see any additional output, so it seemed that piece of code was never reached. First I tried to trace why, but couldn’t really find anything. Then I tried a simpler step: comparing linuxbridge_neutron_agent.py on SLES11 with the same file on SLES12. That was the right spot!
The file on SLES11 contains a very important comment:
```
# NOTE(toabctl): Don't use /sys/devices/virtual/net here because not all tap
# devices are listed here (i.e. when using Xen)
BRIDGE_FS = "/sys/class/net/"
```
Alright, so what’s on SLES12?
```
BRIDGE_FS = "/sys/devices/virtual/net"
```
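The consequence of this difference can be sketched with a small Python helper (the function name is mine, not the agent’s): the agent resolves device paths relative to BRIDGE_FS, so an existence check like the following fails for physical NICs and for Xen tap devices when BRIDGE_FS points at the virtual-only tree:

```python
import os

def device_exists(device, bridge_fs):
    """Return True if `device` appears under the given sysfs tree.

    With bridge_fs = "/sys/class/net/" every network device is visible.
    With "/sys/devices/virtual/net" only virtual devices (bridges, bonds,
    lo) are listed, so physical NICs and e.g. Xen tap devices are missed.
    """
    return os.path.exists(os.path.join(bridge_fs, device))
```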
Very strange, I thought. Let’s take a look at what’s inside these directories:
```
root@d0c-c4-7a-06-71-f0:/sys/devices/virtual/net # ll
total 0
drwxr-xr-x 6 root root 0 Aug 24 09:02 bond0
drwxr-xr-x 7 root root 0 Aug 24 09:03 brqe423e6c2-54
drwxr-xr-x 5 root root 0 Aug 24 09:01 lo

root@d0c-c4-7a-06-71-f0:/sys/class/net # ll
total 0
lrwxrwxrwx 1 root root    0 Aug 24 09:02 bond0 -> ../../devices/virtual/net/bond0
-rw-r--r-- 1 root root 4096 Aug 24 09:02 bonding_masters
lrwxrwxrwx 1 root root    0 Aug 24 09:03 brqe423e6c2-54 -> ../../devices/virtual/net/brqe423e6c2-54
lrwxrwxrwx 1 root root    0 Aug 24 09:02 eth0 -> ../../devices/pci0000:00/0000:00:01.1/0000:02:00.0/net/eth0
lrwxrwxrwx 1 root root    0 Aug 24 09:02 eth1 -> ../../devices/pci0000:00/0000:00:01.1/0000:02:00.1/net/eth1
lrwxrwxrwx 1 root root    0 Aug 24 09:01 lo -> ../../devices/virtual/net/lo
```
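You can also compute the difference between the two trees directly. Here is a quick sketch (the function name is mine) that lists the entries the SLES12 default path misses:

```python
import os

def missing_from_virtual(class_net="/sys/class/net",
                         virtual_net="/sys/devices/virtual/net"):
    """Entries visible in /sys/class/net but absent from
    /sys/devices/virtual/net: the physical NICs and, on Xen,
    the tap devices the agent actually needs to find."""
    return sorted(set(os.listdir(class_net)) - set(os.listdir(virtual_net)))
```

On the compute node above this returns eth0 and eth1 (plus the bonding_masters pseudo-file), exactly the entries invisible under /sys/devices/virtual/net.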
Now that comment on SLES11 definitely makes sense! Not all devices are listed in the “wrong” directory. So I edited the file on SLES12 to use the same directory as on SLES11 and restarted the linuxbridge agent, and that did the trick! All VMs now get their eth0 configured correctly, and after assigning a floating IP they are accessible via SSH.
I reported my finding to Support, and the answer was that this is a known bug; the fix hasn’t made it to the update channels yet, but it will soon.