Thursday, August 8, 2013

Linux NIC Bonding and VLAN Tagging with IBM Flex Chassis

What is IBM Flex Chassis?

The IBM Flex chassis is IBM's new blade platform, replacing the 10-year-old BladeCenter H chassis. Like the BladeCenter chassis, the Flex can house fully functional network switches (unlike Cisco, which uses dummy pass-thru modules that plug into top-of-rack switches).

Environment Setup

My customer had 1 Flex chassis in the main site and another in the disaster recovery (DR) site. Each chassis had 2 IBM 10Gb EN4093 Scalable Switches, interconnected via Virtual Link Aggregation (VLAG) to load balance traffic between them. Both switches were uplinked to 1 Cisco top-of-rack (ToR) switch. The spanning tree protocol (STP) was PVRST+.

Some of the nodes/servers in the chassis were running VMware & some were running Red Hat Enterprise Linux (RHEL) 6 with Oracle RAC set up on top of that.

The Oracle nodes needed to have multiple IPs belonging to multiple VLANs. The nodes had only 2 internal 10Gb NICs, so NIC Bonding with VLAN tagging was the best choice. I used Linux's native NIC Bonding.

As of this writing, Emulex does not offer NIC Bonding software for their chips on the IBM Flex nodes.

The Problem

The RHEL nodes were configured with active-passive NIC teaming, but they were losing connectivity randomly, and Oracle RAC would report that one of the configured interfaces could no longer communicate and that the cluster was affected.

The Solution

The chassis switches act as 2 switches: 1 switch to the nodes and 1 switch to the outside world. Because of this, even if the switch loses connectivity to the outside world, the internal nodes wouldn't know about the uplink failure. Also, for some reason, the MACs weren't being updated on the Cisco ToR switch.

So, instead of using "miimon", which only monitors the physical link between the node and the internal ports of the switch, I switched to ARP monitoring (the "arp_interval" and "arp_ip_target" options). The bond then sends ARP requests all the way through to the ToR L3 switch, so an uplink failure is noticed by the node, the MAC table stays refreshed, and the IPs stop flapping on the nodes.

Configuration

The following configuration was done on RHEL 6. It should work similarly on all distributions, but the location of the files may differ.

Enabling NIC Bonding and Setting Parameters

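Append these lines to the file /etc/modprobe.conf (this file is deprecated on RHEL 6; a file under /etc/modprobe.d/, such as /etc/modprobe.d/bonding.conf, is the preferred location):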
# bonding config
alias bond0 bonding
options bond0 mode=active-backup arp_interval=50 arp_ip_target=10.10.5.1,10.10.1.1

The arp_interval value is in milliseconds. arp_ip_target accepts multiple comma-separated IPs, up to a maximum of 16; I suggest using 2. Here, each target is the gateway IP of one of the VLANs.
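
The limits above can be checked before touching the node. Here is a minimal Python sketch, assuming Python 3 is available on your workstation; the helper name bonding_options_line is hypothetical and not part of the bonding driver or any RHEL tool. It only builds the text of the options line and refuses more than 16 targets.

```python
import ipaddress

MAX_ARP_TARGETS = 16  # limit mentioned above for arp_ip_target


def bonding_options_line(bond, arp_interval_ms, targets, mode="active-backup"):
    """Return the 'options' line for the bonding module (hypothetical helper)."""
    if arp_interval_ms <= 0:
        raise ValueError("arp_interval must be a positive number of milliseconds")
    if not 1 <= len(targets) <= MAX_ARP_TARGETS:
        raise ValueError("arp_ip_target takes between 1 and 16 addresses")
    # Validate each target as an IPv4 address; bad input raises ValueError.
    checked = [str(ipaddress.IPv4Address(t)) for t in targets]
    return "options {} mode={} arp_interval={} arp_ip_target={}".format(
        bond, mode, arp_interval_ms, ",".join(checked))


# Same values as the configuration shown above.
print(bonding_options_line("bond0", 50, ["10.10.5.1", "10.10.1.1"]))
```

With the values from this post, it prints the same options line as the one shown above.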

Configuring Interfaces with VLAN Tagging

cd to /etc/sysconfig/network-scripts and create the following files

File name: ifcfg-bond0
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes

File name: ifcfg-bond0.105
DEVICE=bond0.105
BOOTPROTO=static
IPADDR=10.10.5.112
NETMASK=255.255.255.0
GATEWAY=10.10.5.1
ONBOOT=yes
VLAN=yes

File name: ifcfg-bond0.101
DEVICE=bond0.101
BOOTPROTO=static
IPADDR=10.10.1.112
NETMASK=255.255.255.0
GATEWAY=10.10.1.1
ONBOOT=yes
VLAN=yes

The file name has to end with bond0.VLANID, and the device name has to match it. The IP address scheme can be whatever the network team defined for that VLAN.
The network engineers I worked with create VLAN IDs & IP schemes like this:
VLAN 105 -> 10.10.5.x
VLAN 1055 -> 10.10.55.x

You don't have to follow the same way, but it makes it easy to know the VLAN ID from the IP.
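
To make the convention concrete, here is a small Python sketch. It assumes the VLAN ID is the second octet followed by the third octet, which fits the two examples above but is my reading of the pattern, not a rule the network team wrote down. The function name vlan_id_from_ip is hypothetical.

```python
def vlan_id_from_ip(ip):
    """Derive the VLAN ID under the naming convention shown above.

    Assumption: the ID is the second octet followed by the third octet,
    so 10.10.5.x -> 105 and 10.10.55.x -> 1055.
    """
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted-quad IPv4 address")
    vlan_id = int(octets[1] + octets[2])
    # 802.1Q VLAN IDs usable for traffic are 1-4094.
    if not 1 <= vlan_id <= 4094:
        raise ValueError("VLAN ID {} is outside 1-4094".format(vlan_id))
    return vlan_id


print(vlan_id_from_ip("10.10.5.112"))
print(vlan_id_from_ip("10.10.55.20"))
```

The range check also shows where the convention stops working: a third octet that pushes the result above 4094 has no valid VLAN ID.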

You can repeat the above steps and create as many files as you have VLANs.
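
If you have many VLANs, the files can be generated instead of typed by hand. This is a minimal Python sketch that only prints the file names and contents; the function render_vlan_ifcfg is hypothetical, and the values are the ones used in this post. Review the output before copying anything into /etc/sysconfig/network-scripts.

```python
def render_vlan_ifcfg(bond, vlan_id, ipaddr, netmask, gateway):
    """Return (file name, file contents) for one tagged interface on the bond."""
    device = "{}.{}".format(bond, vlan_id)
    lines = [
        "DEVICE=" + device,
        "BOOTPROTO=static",
        "IPADDR=" + ipaddr,
        "NETMASK=" + netmask,
        "GATEWAY=" + gateway,
        "ONBOOT=yes",
        "VLAN=yes",
    ]
    return "ifcfg-" + device, "\n".join(lines) + "\n"


# (VLAN ID, IP address, netmask, gateway) taken from the files above.
vlans = [
    (105, "10.10.5.112", "255.255.255.0", "10.10.5.1"),
    (101, "10.10.1.112", "255.255.255.0", "10.10.1.1"),
]
for vlan_id, ipaddr, netmask, gateway in vlans:
    name, body = render_vlan_ifcfg("bond0", vlan_id, ipaddr, netmask, gateway)
    print("# " + name)
    print(body)
```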

Modify the following files:

File name: ifcfg-eth0
DEVICE=eth0
BOOTPROTO=none
MASTER=bond0
SLAVE=yes
ONBOOT=yes

File name: ifcfg-eth1
DEVICE=eth1
BOOTPROTO=none
MASTER=bond0
SLAVE=yes
ONBOOT=yes

Repeat these steps for every NIC that you want to participate in the NIC Bonding group.

Configuring The Chassis Switches

The only thing missing now is creating the VLANs on the switches, then enabling VLAN Tagging on the nodes' NICs (internal ports) on the chassis switches. I'll be using the "iscli" command line interface instead of "ibmcli."

It's better to have firmware 7.5.3+ on the switches before proceeding. Otherwise some commands may differ and some features may be missing (like Auto Spanning Tree Group assignment), which means extra manual work.

Enabling VLAN Tagging on the internal ports:
interface port INTA1-INTA14
tagging

Create VLANs:
vlan 101
enable
member INTA1-INTA12,INTA13,INTA14

This will create VLAN ID 101 and place internal ports 1-14, which belong to nodes 1-14, in it. I wrote the member list this way to show how you can mix a range with individual ports, which is how you would define non-consecutive ports.
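
To show what such a member list expands to, here is a short Python sketch. It models the notation only and is not how the switch parses it; it assumes each port name is a letter prefix followed by a number and that both ends of a range share the prefix. The function name expand_ports is hypothetical.

```python
import re

PORT_PATTERN = re.compile(r"([A-Za-z]+)(\d+)")


def expand_ports(spec):
    """Expand a member list such as 'INTA1-INTA12,INTA13,INTA14' into port names."""
    ports = []
    for item in spec.split(","):
        start, _, end = item.strip().partition("-")
        first = PORT_PATTERN.fullmatch(start)
        last = PORT_PATTERN.fullmatch(end or start)  # a single port is a range of one
        if not first or not last or first.group(1) != last.group(1):
            raise ValueError("cannot parse port item: " + item)
        low, high = int(first.group(2)), int(last.group(2))
        if high < low:
            raise ValueError("descending range: " + item)
        ports.extend("{}{}".format(first.group(1), n) for n in range(low, high + 1))
    return ports


# Both spellings select the same 14 internal ports.
print(expand_ports("INTA1-INTA12,INTA13,INTA14") == expand_ports("INTA1-INTA14"))
```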

vlan 105
enable
member INTA1-INTA14

The default Port VLAN ID (PVID) is 1. This is the native VLAN, and any untagged traffic ends up there. If the customer's native VLAN ID is different, change this value. If the customer does not intend to have any untagged traffic, it's better to change this value to a VLAN that doesn't exist on the customer side, creating a black hole on the internal switches for unwanted untagged traffic.

To change the PVID:
interface port INTA1-INTA14
pvid 5

This assumes the native VLAN at the customer side is 5. To set it to something that doesn't exist, agree with the customer on a VLAN that they'll never use. In my case, I often use 3999.

interface port INTA1-INTA14
pvid 3999

You don't have to create the VLAN beforehand. The switch will automatically create the VLAN, assign it to its own Spanning Tree Group (STG) and change the PVID of the defined node ports.

That's it! Now restart the network service on the nodes and the bonding interfaces should come up.
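
To check the result, the bonding driver exposes its state in /proc/net/bonding/bond0. Here is a minimal Python sketch that reads that file, if it exists, and reports the active slave and the link status of each slave. The function parse_bond_status is a hypothetical helper, and the field names it looks up ("Currently Active Slave", "Slave Interface", "MII Status") are the ones I expect from the driver; verify them against your kernel's output.

```python
import os


def parse_bond_status(text):
    """Collect 'Key: value' fields from bonding status text, grouped per slave."""
    summary, slaves, current = {}, {}, None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank lines and lines without a field
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            # Every field after this line belongs to the named slave.
            current = slaves.setdefault(value, {})
        elif current is None:
            summary[key] = value
        else:
            current[key] = value
    return summary, slaves


path = "/proc/net/bonding/bond0"
if os.path.exists(path):
    with open(path) as handle:
        summary, slaves = parse_bond_status(handle.read())
    print("Active slave:", summary.get("Currently Active Slave"))
    for name, fields in slaves.items():
        print(name, "MII Status:", fields.get("MII Status"))
```

Pulling one cable (or disabling one internal port on the switch) and running it again should show the active slave moving to the other NIC.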

References

  1. RHEL: Linux Bond / Team Multiple Network Interfaces (NIC) Into a Single Interface
  2. Linux Ethernet Bonding Driver HOWTO
  3. NIC Bonding for KVM (has cute graphs)

I highly recommend that you read the 2nd link (the kernel guide) before doing anything. It explains the different bonding modes (active/passive, active/active, EtherChannel, etc.) and whether they require ToR switch support or not.

Caution

Remember that you cannot use active/active in an EtherChannel/PortChannel manner because the 2 internal NICs in each node connect to two different switches, and EtherChannel requires that the ports belong to the same switch. It is possible if you stack the two chassis switches, but I have not attempted this before.

Also, make sure the STP used on the IBM switches matches whatever is in place on the customer side, otherwise you'll cause a network loop and bring down the entire customer network!

May your packets serve you well.

3 comments:

John said...

Regarding the comments in your article about not using an active-active etherchannel on the host NIC; as you already have VLAG/ISL enabled between the two EN4093's in the chassis you should be able to VLAG-enable the LACP key which would allow that active-active link. i.e. using the iSCLI command "vlag adminkey 1000 enable" where 1000 is an LACP key on the appropriate internal interface of the switch.

I would be very interested in knowing whether you have tried NIC bonding over vNIC interfaces, where one of the vNICs is used as an FC link and the other 3 are bonded LAN interfaces. Ideally, using two physical NICs on the blade server in a VLAG scenario, it would seem reasonable that a total of 6 vNICs could be etherchanneled as active-active with LACP and the remaining 2 (1 per physical NIC) used for FC storage. Once I get another PureFlex chassis delivered I'm hoping to test this scenario out.

MBH said...

Hello John,

I faced a lot of problems in this particular setup as the ToR was 1 switch only rather than 2.

I already have a VLAG in place over an LACP channel. I have RHEL 5.9 installed for the customer and I suspected that the Linux bonding was misbehaving, until yesterday, when I found out I had to enable MAC Address Notifications on the Flex switches.

Before enabling that, when 1 NIC fails, the other becomes active, then not all VLANs become reachable. It was quite weird.

In theory, an active-active bond should work, but be careful not to use one that requires an EtherChannel. You need one that does round robin instead.

I haven't tried bonding with vNICs, but I don't see why it won't work.

In general, I try to avoid vNICs as much as possible, as they make the switch function as a pass-through and you cannot use VLAG when vNICs are enabled.

John said...

Hard to say without seeing configs and diagrams, but I assume that you only have two VLAG's setup, one on the ISL link between the two chassis switches and the other to the top-of-rack switch.

It's possible to "VLAG enable" any LACP etherchannel on those cn4093 switches and enable them to span two switches. This includes LACP bundles to a top-of-rack switch or to the server residing in the chassis. Very similar to how VPC on a Cisco Nexus switch or a MEC on a Cisco 6500 VSS stack works.

Assuming your ISL VLAG setup is correct between the two cn4093r's then adding the bit of configuration below to both of the chassis switches will allow you to go with LACP bonding (linux bond mode=4) on the host in an active-active etherchannel setup on the compute node in Bay 1. Once the VLAG is created the host can't tell the difference whether the etherchannel is coming from a single switch or a pair of switches.

!
interface port INTA1
tag-pvid
pvid 121
no flowcontrol
exit
!
interface port INTA1
lacp mode active
lacp key 101
!
! below is the key statement to make the etherchannel work across switches
!
vlag adminkey 101 enable
!

I had not heard that vNIC mode will make the switch act like a pass-through module, that's pretty disappointing. So far I've not used anything but pNIC mode on the 4 or 5 PureFlex chassis that I've built out. When things slow down a bit I'm hoping that the system guys will let me have some time to spend tinkering with this stuff rather than pushing them into production as fast as possible.