Saturday, August 17, 2013

Configuring FCoE on IBM Flex nodes and V7000 Storage

Disclaimer

I work for an IBM partner. The opinions depicted in this post are solely mine. Any performance degradation, bug, problem mentioned here is generic to any vendor, unless otherwise strictly specified.

Article Revisions

v1.0 - August 17 (17082013) - Initial release
v1.1 - August 17 (17082013) - Small additions to CN4093 section
v1.2 - October 11 (11102013) - Correction to FCoE frames and Ethernet frames (Thanks Anonymous!)

What is FCoE?

FCoE is short for Fibre Channel over Ethernet. It encapsulates FC frames inside Ethernet frames, allowing a server/node to communicate with a storage system over standard Ethernet instead of investing in dedicated FC infrastructure.

The idea is to combine, or converge as the industry likes to call it, multiple protocols into a single technology, to reduce the datacenter clutter. With Ethernet, you can now transmit data packets (Ethernet), iSCSI (storage protocol) and FCoE (storage protocol).

Issues with FCoE

No Real Optimization

A payload is the data carried by the protocol from one point to another. A standard Ethernet frame carries a payload of up to 1500 bytes, while an FC frame can carry up to 2112 bytes, and iSCSI, running over TCP/IP, fits neatly into Ethernet frames.
Standard Ethernet supports larger payloads, up to about 9kB, with something called Jumbo Frames. IPv6 allows a maximum payload size of 4GB minus 1 byte (jumbograms), but that requires modifying the Transport Layer so TCP/UDP can carry larger payloads, and it is no longer handled in the Ethernet frame.

FCoE runs directly on Ethernet frames, so it is limited to a maximum of 9kB even when Jumbo Frames are used. iSCSI, however, was designed for Ethernet networks and runs on the Internet Protocol (IP) on top of Ethernet, which allows it to make use of IPv6's large maximum payload size.

So, if you're already accepting the overhead of stacking protocols, you might as well go with iSCSI on IPv6 (assuming the storage supports it) and enable Jumbo Frames (assuming the network backend supports very large Jumbo Frames), instead of going with FC over Ethernet!
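To put rough numbers on the payload argument, here is a minimal Python sketch that counts how many frames it takes to move a single 64 KiB I/O at different maximum payload sizes. It deliberately ignores TCP/IP, iSCSI and FC headers. The payload sizes (1500 bytes for standard Ethernet, 2112 bytes for one FC frame, 9000 bytes for Jumbo Frames) are assumptions for illustration, not measurements from this lab.

```python
import math

# Assumed maximum data payload per frame, in bytes (illustrative values only).
PAYLOADS = {
    "Standard Ethernet (1500-byte MTU)": 1500,
    "FC frame (2112-byte payload)": 2112,
    "Ethernet Jumbo Frames (9000-byte MTU)": 9000,
}


def frames_needed(io_bytes: int, payload_bytes: int) -> int:
    """Frames needed to carry io_bytes of data, ignoring upper-layer headers."""
    if io_bytes < 0 or payload_bytes <= 0:
        raise ValueError("sizes must be positive")
    return math.ceil(io_bytes / payload_bytes)


if __name__ == "__main__":
    io_size = 64 * 1024  # one 64 KiB read or write
    for label, payload in PAYLOADS.items():
        print(f"{label}: {frames_needed(io_size, payload)} frames")
```

With these assumptions, a 64 KiB I/O takes 44 standard frames, 32 FC-sized frames, or 8 jumbo frames. Fewer frames means fewer per-frame headers to build and strip.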

Overhead and Replication

iSCSI has been in the business for a long time (6+ years) and works well with the standard Ethernet payload size, but better with Jumbo Frames. Many storage systems have offered iSCSI for a long time. It also doesn't require any special protocols, and it "just works", including across datacenters, as long as the link is stable (with added latency, obviously).

FCoE is relatively new and requires a heap of protocols to maintain its main "feature": being lossless. All this protocol overhead means extra latency, and from what I've been reading, it's not yet possible to push FCoE between datacenters. This means that whenever you need data replication across datacenters, you'll need to attach your storage box to an FC-only infrastructure, which means investing in an FC infrastructure! (OK, maybe just 2 switches, but they still cost money!)

Note: This IBM document mentions that it is possible to replicate between different V7000 storage systems via FCoE only, but the article above puts the limitation on the router/networking end, not on the storage end. Also, that post is 3 years old, so things might have changed now. Approach with caution, anyway, and validate with your vendors.

Another overhead is the encapsulation and decapsulation process. Disks speak the SCSI protocol. FC puts the SCSI commands and their data inside an FC frame and sends it to the storage, which strips off the FC framing and then executes the SCSI commands with their data.

With FCoE, the server is inserting a SCSI payload inside an FC payload and that is inserted into an Ethernet payload!
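You can make the nesting concrete by adding up the layers of one FCoE frame. The Python sketch below uses the commonly cited FCoE frame layout: an 802.1Q-tagged Ethernet header, the FCoE header with SOF, the FC frame itself, an EOF trailer, and the Ethernet FCS. Treat the byte counts as assumptions to check against the FC-BB-5 standard, not as an authoritative reference.

```python
# Byte sizes of each encapsulation layer in one FCoE frame (assumed from the
# commonly cited FC-BB-5 layout; a VLAN-tagged frame is assumed since FCoE uses VLANs).
ETH_HEADER_WITH_VLAN_TAG = 18  # dst MAC 6 + src MAC 6 + 802.1Q tag 4 + EtherType 2
FCOE_HEADER = 14               # version, reserved bits and SOF
FC_HEADER = 24
FC_CRC = 4
FCOE_TRAILER = 4               # EOF plus reserved bytes
ETH_FCS = 4
FC_MAX_PAYLOAD = 2112


def fcoe_frame_size(fc_payload: int) -> int:
    """On-the-wire FCoE frame size (without preamble/inter-frame gap) for an FC payload."""
    if not 0 <= fc_payload <= FC_MAX_PAYLOAD:
        raise ValueError("FC payload must be between 0 and 2112 bytes")
    fc_frame = FC_HEADER + fc_payload + FC_CRC  # SCSI data lives inside this FC frame
    return ETH_HEADER_WITH_VLAN_TAG + FCOE_HEADER + fc_frame + FCOE_TRAILER + ETH_FCS


if __name__ == "__main__":
    print(fcoe_frame_size(FC_MAX_PAYLOAD))  # 2180
```

Under these assumptions, a full 2112-byte FC payload becomes a 2180-byte Ethernet frame. That is larger than a standard 1522-byte tagged Ethernet frame, which is why every switch in the FCoE path needs at least "baby jumbo" frames enabled.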

Multi-Protocol Failure

Currently, FCoE is enabled on Converged Network Adapters (CNAs) that offer standard Ethernet functions (normal network access) + iSCSI + FCoE. When enabling FCoE, the CNA presents to the Operating System (OS) a bunch of storage adapters of type FC.

What happens when you have a failure on the adapter? You lose both network access and storage access. What happens if the FCoE switch fails? You lose both network access and storage access.

A related scenario is a storage upgrade where one path must go offline to migrate from old equipment to new equipment. On a converged network, this affects both network and storage. Another scenario is the usual network administrator mistake: the spanning tree configuration goes wrong, someone adds a new VLAN, or someone plugs a cable into an unconfigured switch and the network goes into a loop (think of it as a denial of service attack).
One could reasonably argue that if the network is down, you don't need storage access anyway. The issue is that a sudden loss of storage can also lead to data corruption.

That's why I personally prefer to keep the two activities separate: Network and Storage. It can still be done if you buy separate switches for FCoE and Ethernet, but then where is the "convergence" of your datacenter and its cost reductions?

Port Reservation

I don't know about other vendors, but on the IBM Flex chassis switches, using FCoE requires reserving 2 external ports from the switch (must be Omni Ports), even if you're using a V7000 Flex plugged into the chassis.

The FCoE protocol requires having an FC Forwarder (FCF) even if the traffic is internal to the chassis. These ports have to be reserved and configured in pairs. 2 ports are needed for every storage system to be connected via FCoE.

You do not need to plug SFPs into these reserved ports.

Limited Communications to Storage Systems

FCoE communicates through VLANs on the Ethernet network. Each NIC must belong to one VLAN when talking to an FCoE target. Because of that, a NIC can only talk to one storage system. If you need a node to talk to multiple storage systems, you'll need to assign each pair of NICs to a separate FCoE VLAN belonging to each FCoE storage system.
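Here is a small Python sketch of that planning rule. Each storage system gets its own FCoE VLAN and its own pair of NICs, so every additional storage system adds two FCoE NICs to each node. The storage names, the NIC names, and numbering VLANs upward from 1002 are hypothetical and only serve as an illustration.

```python
from typing import Dict, List, Tuple


def plan_fcoe_vlans(
    storage_systems: List[str], fcoe_nics: List[str], first_vlan: int = 1002
) -> Dict[str, Tuple[int, Tuple[str, str]]]:
    """Give every storage system its own FCoE VLAN and a dedicated pair of NICs.

    Each NIC joins exactly one FCoE VLAN, so no NIC is shared between systems.
    """
    needed = 2 * len(storage_systems)
    if len(fcoe_nics) < needed:
        raise ValueError(
            f"{len(storage_systems)} storage systems need {needed} FCoE NICs, "
            f"but only {len(fcoe_nics)} are available"
        )
    plan = {}
    for i, system in enumerate(storage_systems):
        # Assumes consecutive NICs land on different switches (one path per switch).
        pair = (fcoe_nics[2 * i], fcoe_nics[2 * i + 1])
        plan[system] = (first_vlan + i, pair)
    return plan


if __name__ == "__main__":
    # Hypothetical names: one V7000 Flex in the chassis plus one external array.
    print(plan_fcoe_vlans(["v7000-flex", "external-array"], ["nic3", "nic4", "nic5", "nic6"]))
```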

FC infrastructures don't have this limitation. A node's FC adapter registers itself on the FC SAN fabric, and the administrator then zones (groups) each adapter with a storage system. An adapter can be grouped with multiple storage systems, as long as all storage systems use the same FC adapter settings.

You can avoid the FCoE connectivity limitation by virtualizing various storage systems under one storage system and exposing that one system to the nodes. IBM's SAN Volume Controller (SVC) and its little brother, the V7000, can do that.

Administration Role Separation

In large organizations, the roles of network admin and storage admin are separated. With network convergence, who will be responsible for configuring the network switches? Will one admin take responsibility for both network and storage?

Lab Setup

Alright, enough blabbing. Let's get to business. This is the lab setup for this experiment:

  1. IBM Enterprise Flex Chassis
  2. Two x240 nodes (Intel processors)
    1. Windows Server 2012 was preinstalled by a colleague, so I used it for tests
    2. Installed ESXi 5.1 U1 (IBM Customized image) for Boot from SAN tests
  3. One 4-port CN4054 CNA on each node
    1. Firmware: 4.4.180.3
    2. Feature on Demand (FoD) to enable FCoE
  4. V7000 Flex storage (mounted into the chassis)
    1. Firmware: 6.4.1.3
  5. Two CN4093 converged switches
    1. Firmware: 7.5.3
    2. Base license, allowing use of only 2 ports on the 4-port cards
IBM's Flex chassis allows one to contain nodes, Ethernet switches, FC switches, and storage, all into a 10U chassis, and the communication between the components is internal to the chassis at a minimum of 10Gbps. End of shameless plug.

Note0: The 4-port CNA is made by Emulex, and it has the same chipset found on the 2-port LAN on Motherboard (LoM) built into some x240 nodes.

Note1: The firmware levels above are important and you should meet these as a minimum. As of this writing, the storage has newer firmware, but I kept it at this level as it's the minimum required and for testing purposes.

Configuration Overview

  1. Configure x240 nodes and their CNAs
    1. Understanding the CNA
    2. Possible NIC Configurations
    3. Configure FCoE Feature on the NICs
      I won't cover OS configuration nor multipathing driver installation
    4. Configure nodes for SAN Boot via FCoE
  2. Configure V7000 Storage
  3. Configure the CN4093 Converged Switches
    1. Sample Configuration
    2. Configuration Explanation
  4. Profit!
If you need help upgrading the firmware of any component, refer to the device's user manual in the device links posted above. I won't cover these here.

Note: Throughout the guide, screenshots and configuration, I have masked the WWPNs and MACs of the devices used in the lab, because I'm paranoid. Deal with it.

1) Configuring x240 nodes and their CNAs

This is easy, but you could lose yourself within the forest of menus, so I have a few screenshots to make you happy. You can either follow the text description, or spoon-feed yourself with my awesome screenshots.

Understanding the CNA

I'll quickly explain how the CNA is going to function, so that you don't get confused when you configure it.

The 4-port 10Gbps CNA and the 2-port LoM have 4 and 2 physical ports, respectively. When Multichannel is enabled, the CNA automagically splits each physical port into 4 virtual ports (vNICs).

So physical port 1 will have 4 vNICs: A1 = A1.1 + A1.2 + A1.3 + A1.4. Each vNIC can be allocated bandwidth, not exceeding 10Gb, and the total bandwidth of 10Gb is shared among all 4 vNICs, so you cannot over-commit the bandwidth. So, in an OS, you'll see 8 NICs if you have a 2-port LOM, and 16 NICs if you have a 4-port CNA (4 vNICs per physical port).

You can change the bandwidth allocation dynamically from the switch for any port, live. It's up to you how much bandwidth is allocated to the FCoE port. If you have a license to use all 4 ports, I suggest you use Ethernet on the 1st and 2nd NICs, and FCoE on the 3rd and 4th. This way, you'll be able to allocate full 10Gb to FCoE.

Use the NICs in sequence (1+2, 3+4) to make sure Ethernet passes through both CN4093 switches, and FCoE passes through both CN4093 switches. Ports 1 and 3 communicate with switch0 located in Bay1, while ports 2 and 4 communicate with switch1 located in Bay2.
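The port numbering rules in this section can be summed up in a short Python sketch that lists what the OS would see. The rules are: odd physical ports go to switch0 in Bay1, even ports go to switch1 in Bay2, and Multichannel gives 4 vNICs per physical port. Labels such as "P1.1" are made up for illustration. The OS and the switch use their own names.

```python
from typing import List, Tuple


def switch_bay(physical_port: int) -> int:
    """Odd physical ports (1, 3) go to switch0 in Bay1; even ports (2, 4) to switch1 in Bay2."""
    return 1 if physical_port % 2 == 1 else 2


def os_visible_nics(physical_ports: int, multichannel: bool) -> List[Tuple[str, int]]:
    """Return (label, switch bay) for every NIC the OS would see."""
    nics = []
    for port in range(1, physical_ports + 1):
        if multichannel:
            # Multichannel splits each physical port into 4 vNICs sharing its 10Gb.
            nics.extend((f"P{port}.{v}", switch_bay(port)) for v in range(1, 5))
        else:
            nics.append((f"P{port}", switch_bay(port)))
    return nics


if __name__ == "__main__":
    print(len(os_visible_nics(2, multichannel=True)))  # 2-port LoM: 8
    print(len(os_visible_nics(4, multichannel=True)))  # 4-port CN4054: 16
    # A pair such as 3+4 spans both switches, which is what you want for FCoE.
    print({switch_bay(3), switch_bay(4)})
```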

Possible NIC Configurations

  1. Use physical NICs
  2. Use virtual NICs
  3. Use a mix of pNICs and vNICs

Remember that a 2-port LOM will have each of its physical ports connect to 1 switch. port0 to switch0 and port1 to switch1. So, if you have 2 switches only, you have to use option (2): vNICs.

vNICs are mandatory if you want to share Ethernet and FCoE on the same pipe and guarantee bandwidth for FCoE. If you do not use vNICs, FCoE and Ethernet will compete for bandwidth. If your servers are busy, this may lead to delayed I/Os and performance degradation.

My favorite configuration: if you have a 4-port adapter and Upgrade1 switch licenses for your 2 switches, you can use 2 ports as pNICs for FCoE and 2 ports as pNICs for Ethernet. No need for vNIC configuration.

Alternatively, you can also enable vNICs on the first 2 ports, and leave the 3rd and 4th ports as pNICs. Or the opposite. So you can mix, but they'll have to be in pairs.

If you have a 4-port adapter, with the base license of the switches, your options are the same as the LOM, in the first paragraph.
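The choices above come down to one question: how many adapter ports do the switch licenses let you use? The Python sketch below encodes that reasoning as described in this section. It only summarizes the options. The returned strings are descriptions, not settings for any tool.

```python
from typing import List


def nic_layouts(adapter_ports: int, upgrade1_license: bool) -> List[str]:
    """Summarize the NIC layouts discussed above for one node."""
    # Ports 3 and 4 are only usable on a 4-port adapter with Upgrade1 on both switches.
    usable_ports = 4 if (adapter_ports == 4 and upgrade1_license) else 2
    if usable_ports == 2:
        # One port per switch: Ethernet and FCoE must share each port,
        # so vNICs are needed to guarantee FCoE bandwidth.
        return ["vNICs on ports 1+2, sharing Ethernet and FCoE"]
    # Four usable ports: mix and match, but always in pairs (1+2, 3+4).
    return [
        "pNICs only: ports 1+2 for Ethernet, ports 3+4 for FCoE",
        "vNICs on ports 1+2, pNICs on ports 3+4",
        "pNICs on ports 1+2, vNICs on ports 3+4",
    ]


if __name__ == "__main__":
    for ports, licensed in [(2, False), (4, False), (4, True)]:
        print(ports, licensed, nic_layouts(ports, licensed))
```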

Configure FCoE Feature on the NICs

  1. Power on the node and press F1 to login to the UEFI setup
  2. UEFI main menu -> System Settings -> Network -> Select 1st NIC (PFA 17:0:0 here) -> Emulex 10G NIC
    You're now at the Emulex NIC Selection menu
  3. Notice the link speed. It should report a number.
  4. Switch Configuration: Change it to IBM Virtual Fabric -- default: Switch Independent
  5. Personality: Change it to FCoE -- default: NIC
  6. Multichannel: Enable if you want to enable vNICs
  7. Controller Configuration -> View Configuration
  8. The 2nd NIC should report itself as FCoE. Only 1 NIC will have FCoE functions.
    Notice that the numbering of the NICs is all even. These NICs belong to switch0 located in Bay1.
  9. Press Esc until you're back at the Emulex NIC Selection menu.
  10. Feature on Demand -> Install FCoE license
  11. You're now done with the first NIC. The 2nd NIC will have the same settings as the 1st. You will need to repeat the steps above for the 3rd NIC, and that NIC's settings will be applied to the 4th.
  12. Press Esc until you're back at the Network menu and select the 2nd NIC.
    Notice that the NICs have odd numbers. These are mapped to switch1 located in Bay2.
  13. Esc to the System Settings menu -> Emulex Configuration Utility
    If you do not see this option, Esc to UEFI main menu, save, then exit to reboot the node.
  14. Highlight the 1st NIC (001) but don't click on it. Write down the NIC's Port Name and node name in a text file for later use. Highlight the other NICs and write their PNs.
    If you don't have an Upgrade1 license for your CN4093 switches, you won't be able to use the 3rd and 4th NICs, so you can ignore them.
  15. Click on the 1st NIC. You're now at the Emulex Adapter Configuration menu.
  16. Configure DCBX Mode: Change it to CEE -- default: CIN
  17. Later on when you're done configuring the storage and the switch, come back here and run Scan for Fiber Devices and you should see the V7000 listed (ID 2145)
    Also, scroll down and click on Display Adapter Info sub-menu and you'll see the FCoE VLAN ID, if your switch was configured properly. This is auto-discovered.
  18. Esc to the Emulex Adapter Configuration menu, and select the 2nd NIC (002) then repeat the same steps.
  19. Esc to UEFI main menu, save and reboot back to the Scan for Fiber Devices for later use.
With those easy steps, you have completed ONE node. Repeat the same for all nodes. If you're fortunate enough to have ordered the Flex System Manager node, then it's your lucky day! You can create a Configuration Template from the configured node, which captures its hardware component configuration, and deploy that configuration to other nodes. It's magic.

If you intend to use the Configuration Templates, I recommend that you configure all the components (internal disk RAID setup, time, boot order, ...etc.), then create the template out of the node.
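Step 14 asks you to write down each NIC's Port Name for zoning later. The switch's fcalias command (see the CN4093 configuration below) uses colon-separated WWPNs, so here's a small Python helper that converts whatever you copied into that format. It assumes the Port Name is 16 hex digits, with or without ':', '-' or space separators. The alias name in the example is hypothetical.

```python
import re


def to_fcalias_wwn(port_name: str) -> str:
    """Turn a WWPN like '100000000000005D' into '10:00:00:00:00:00:00:5d'."""
    digits = re.sub(r"[\s:\-]", "", port_name)
    if not re.fullmatch(r"[0-9A-Fa-f]{16}", digits):
        raise ValueError(f"not a 16-hex-digit WWPN: {port_name!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 16, 2)).lower()


def fcalias_line(alias: str, port_name: str) -> str:
    """Build one 'fcalias <name> wwn <wwpn>' line in the style used on the CN4093."""
    return f"fcalias {alias} wwn {to_fcalias_wwn(port_name)}"


if __name__ == "__main__":
    print(fcalias_line("node3", "100000000000005D"))
```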

UEFI -> System Settings menu

Network -> Select NIC

Click on that to get the juicy settings

Change settings as listed. Multichan is for vNICs

Showing the available options

Click it!

FCoE vNIC is always the 2nd

FCoE requires a license. Install it.

To the next adapter

The 2nd NIC follows the settings of the 1st


Emulex Configuration Utility for FCoE HBA Settings

Select 1st NIC. Note Port Name for FC zoning

Change settings to CEE

After storage and switch config is done, scan fiber devices

Configure nodes for SAN boot via FCoE

Each canister/controller will have 1 port looking at one switch and the other port looking at the other switch, which means on each switch you'll see both controllers.

This guide is specific to V7000 and V7000 Flex and VMware ESXi 5.1 (IBM Customized Image). For other storage types, I highly recommend you read and follow the steps in the "Storage and Network Convergence Using FCoE and iSCSI" redbook (link in references). It explains booting from SAN with FCoE and iSCSI, and has excellent tips.

  1. Configure the FCoE switches and make sure that the storage and nodes are functioning properly.
  2. Create a volume and assign it to the node that will boot from SAN. It must be the first volume assigned to the node (LUN 0/SCSI ID 0).
  3. Boot the node into UEFI -> System Settings menu -> Emulex Configuration Utility
  4. Select the 1st adapter
  5. Set Boot from SAN: Change it to Enable
  6. Validate storage connectivity and volume/LUN assignment: Navigate to Add Boot Device -> Select 1st Controller
    If you don't see the storage or LUN 0000, then you need to finish configuring the switches, assign a LUN to the node, then come back here.
    Do not select a boot device here. This is only for validation of connectivity.
  7. Configure HBA and Boot Parameters -> Boot Target Scan Method: Select Boot Path Discovered Targets
    Commit Changes.
  8. Esc to Adapter Selection menu and select the 2nd adapter, and repeat the same steps.
  9. Configuring FCoE SAN boot should be sufficient on 2 ports.
  10. Esc to System Settings menu -> Devices and I/O Ports -> Enable/Disable Onboard Devices
  11. SAS Controller: Disable to disable booting from local disks on the node. Do this even if you don't have any local disks.
  12. Esc to Devices and I/O Ports -> Device Boot Priority
  13. Drag the SAS Controller to the bottom of the list. Save/Commit.
  14. Esc to Main Menu -> Boot Manager -> Add Boot Option -> Generic Boot Option
  15. Add Hard Disk 0, 1, 2, 3
    If you configure 2 FCoE ports, you'll have 4 possible paths to boot from. By selecting 4 disks, the UEFI will configure each path into a Hard Disk, and boot from the first available one.
  16. Esc to Boot Manager -> Delete Boot Option: Delete anything that you don't need (PXE, Floppy)
  17. Esc to main menu -> Save
  18. Reboot and install OS
Note: During adapter preparation phase in UEFI, it'll probe the FCoE ports and see which one is online, and will nominate and use one of them only.

Steps 7 and 15 allow high flexibility and reduce configuration time for implementations that have many nodes. The typical method of implementation is defining the boot LUN and path for each node. So if you have 10 nodes, and 2 FCoE ports, you'd need to repeat those configurations 40 times! Using Boot Discovery and auto Hard Disk assignment by UEFI, you avoid this headache.

It does add some extra time to the boot process, but that's a small price for the flexibility.
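To see where the 4 boot paths in step 15 and the 40 repetitions come from, here's a Python sketch that enumerates the paths for one node. It also counts the per-path boot entries you'd otherwise configure by hand. It assumes, as described above, that each FCoE port goes through one switch and that each switch sees one port from each V7000 canister.

```python
from itertools import product
from typing import List, Tuple

# Assumed topology from this guide: one FCoE port per switch, both canisters on each switch.
FCOE_PORTS = [("FCoE port 1", "switch0 (Bay1)"), ("FCoE port 2", "switch1 (Bay2)")]
CANISTERS = ["canister 1", "canister 2"]


def boot_paths() -> List[Tuple[str, str, str]]:
    """Every (host port, switch, canister) combination a node can boot through."""
    return [(port, switch, canister) for (port, switch), canister in product(FCOE_PORTS, CANISTERS)]


def manual_boot_entries(nodes: int, fcoe_ports: int, targets_per_port: int) -> int:
    """Boot target entries to define by hand when every path is configured per node."""
    return nodes * fcoe_ports * targets_per_port


if __name__ == "__main__":
    for path in boot_paths():
        print(" -> ".join(path))
    print(manual_boot_entries(nodes=10, fcoe_ports=2, targets_per_port=2))  # 40
```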

Emulex Configuration Utility

Adapter Selection

Enable Boot from SAN for both adapters

You should be able to see storage and LUNs here

Do not add the LUNs. Just validate connectivity.

Configure HBA and Boot Parameters

Boot Path Discovered Targets

Add Boot Option -> Generic Boot Option

Add Hard Disk 0, 1, 2 and 3 for a total of 4 paths

Devices and I/O Ports

Enable / Disable Onboard Devices

Disable SAS Controller


2) Configure V7000 Storage

If you have purchased the V7000/V7000 Flex with the FCoE daughter cards, there's no configuration for FCoE. If you bought the daughter cards at a later stage, you'll need to activate them from the canisters. This won't be covered here. Please refer to the online manual.

If you login to the V7000's web interface, you'll see each canister's (controller) Port Numbers, for both FC and Ethernet. You'll see these numbers once the switch is configured.

V7000 Flex canister/controller 1

V7000 Flex canister/controller 2

Note that the port type is FC



3) Configure the CN4093 Switches

As mentioned before, a minimum of 2 external Omni ports must be reserved, even if you're using a V7000 Flex inside the same chassis as the nodes.

This switch configuration will assume default bandwidth allocations. I highly advise you to read the CN4093 redbook (link in references) for optimizations.

I'll first write the entire switch config, then explain each section.

Log in to the switch in "iscli" mode, then type "enable" to access enable mode. Now type "config terminal" to enter configuration mode.

version "7.5.3"
switch-type "IBM Flex System Fabric CN4093 10Gb Converged Scalable Switch"
!
system port EXT15-EXT16 type fc
!
interface port INTA1
name "Flex System Manager node"
no flowcontrol
exit
!
interface port INTA2
name "Power p260 node"
no flowcontrol
exit
!
interface port INTA3
name "x240 node1"
tagging
no flowcontrol
exit
!
interface port INTA4
name "x240 node2"
tagging
no flowcontrol
exit
!
interface port INTA5
tagging
no flowcontrol
exit
!
interface port INTA6
tagging
no flowcontrol
exit
!
interface port INTA7
name "v7000 flex node1"
tagging
pvid 1002
no flowcontrol
exit
!
interface port INTA8
name "v7000 flex node2"
tagging
pvid 1002
no flowcontrol
exit
!
interface port INTA9
tagging
no flowcontrol
exit
!
interface port INTA10
tagging
no flowcontrol
exit
!
interface port INTA11
tagging
no flowcontrol
exit
!
interface port INTA12
tagging
no flowcontrol
exit
!
interface port INTA13
tagging
no flowcontrol
exit
!
interface port INTA14
tagging
no flowcontrol
exit
!
vlan 1
member INTA1-INTA6,INTA9-INTA14,EXT1-EXT2,EXT11-EXT16
no member INTA7-INTA8
!
vlan 1002
enable
name "fcoe"
member INTA3-INTA4,INTA7-INTA8,EXT15-EXT16
fcf enable
!
!
vnic enable
vnic port INTA3 index 1
bandwidth 25
enable
exit
!
vnic port INTA4 index 1
bandwidth 25
enable
exit
!
vnic vnicgroup 1
vlan 3001
enable
member INTA3.1
member INTA4.1
exit
!
spanning-tree stp 80 vlan 3001
!
spanning-tree stp 113 vlan 1002
!
!
!
!
fcoe fips enable
!
fcoe fips port INTA3 fcf-mode off
fcoe fips port INTA4 fcf-mode off
fcoe fips port INTA7 fcf-mode on
fcoe fips port INTA8 fcf-mode on
fcoe fips port EXT15 fcf-mode on
fcoe fips port EXT16 fcf-mode on
!
!
cee enable
!
!
fcalias v7k_node1_p1 wwn 50:00:00:00:00:04:00:76
fcalias v7k_node2_p1 wwn 50:00:00:00:00:04:00:77
fcalias node3 wwn 10:00:00:00:00:00:00:5d
fcalias node4 wwn 10:00:00:00:00:00:00:6b
!
zone name v7k_node3
        member fcalias v7k_node1_p1
        member fcalias v7k_node2_p1
        member fcalias node3
zone name v7k_cluster
        member fcalias v7k_node1_p1
        member fcalias v7k_node2_p1
zone name v7k_node4
        member fcalias node4
        member fcalias v7k_node2_p1
        member fcalias v7k_node1_p1
zoneset name ActiveConfig
member v7k_node3
member v7k_cluster
member v7k_node4
zoneset activate name ActiveConfig
!
no ip routing
!
!
end

Configuration Explanation

system port EXT15-EXT16 type fc
This changes the type of the Omni ports from being Ethernet ports to FC ports. This is required to bind the ports to a storage system, whether the storage is internal to the chassis or external. If your storage is external, these are the ports where you have to plug the FC SFPs and cables to your external SAN fabric.

interface port INTA1-INTA14
name "port name"
no flowcontrol
tagging
pvid 1002
interface port : defines which ports you want to work on. You can specify 1 port or a range. If you have Upgrade1 license, you can also define INTA1-INTB14 to modify all 28 ports in one shot.

name "port name" : It's better that you do this on a per port basis, to give each port a unique name, to know which system is using that port.

no flowcontrol : Disables traffic flowcontrol. A requirement for FCoE.

tagging : Enables VLAN tagging on a port, allowing that port to belong to multiple VLANs. Do not enable this on ports that will neither use FCoE nor require VLAN tagging. An example of this is a standalone Windows/Linux node.

pvid 1002 : Sets the Port VLAN ID (PVID, the native VLAN) on the port. The default is 1. On the V7000 Flex ports, this has to be changed to the FCoE VLAN. If you do not have storage in the chassis, no internal port needs this PVID set.

vlan 1
member INTA1-INTA6,INTA9-INTA14,EXT1-EXT2,EXT11-EXT16
no member INTA7-INTA8
!
vlan 1002
enable
name "fcoe"
member INTA3-INTA4,INTA7-INTA8,EXT15-EXT16
fcf enable
!
These are VLAN definitions, and which ports belong to the VLAN and which don't.
1002 is the preferred VLAN ID for FCoE. You can change this to whatever you want, but make sure the customer network doesn't have the same ID on the Ethernet network to not cause confusion for your nodes.

fcf enable : Enables the Fibre Channel Forwarder (FCF) on this VLAN. This is a must on the FCoE VLANs if you have a V7000 Flex or an upstream (Top of Rack) switch that understands FCoE. If you're connecting the chassis to a SAN fabric, you need to enable NPV mode. See the CN4093 redbook for details.

vnic enable
vnic port INTA3 index 1
bandwidth 25
enable
exit
vnic enable : Enables the vNIC feature on the switch. This is only needed if you use vNICs.

vnic port INTA3 index 1 : This is vNIC 1 of internal physical port 3; in other words, INTA3.1.
You only need to set this, if you want to use this specific vNIC. If you do not set these settings, it'll appear as disconnected on the OS.

bandwidth 25 : Allocate 25% of the 10Gb bandwidth, which is 2.5 Gbps to this vNIC.

Note: You do not allocate bandwidth nor define a vNIC index for the FCoE port.
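The bandwidth value is read here as a percentage of the 10Gb physical port, and the vNICs on one physical port cannot over-commit it. The Python sanity check below lets you verify a planned allocation before typing it into the switch. The 10Gb port speed and the percentage reading come from this section. The vNIC names and numbers in the example are hypothetical. Check the CN4093 application guide for the exact range the switch accepts.

```python
from typing import Dict

PORT_SPEED_GBPS = 10  # each internal port is 10Gb


def vnic_gbps(bandwidth_percent: int) -> float:
    """Convert a vNIC 'bandwidth' value (percent of the port) to Gbps."""
    return PORT_SPEED_GBPS * bandwidth_percent / 100


def check_port_allocation(allocations: Dict[str, int]) -> float:
    """Validate the Ethernet vNIC allocations on one physical port; return total Gbps."""
    total = sum(allocations.values())
    if total > 100:
        raise ValueError(f"over-committed: {total}% of {PORT_SPEED_GBPS}Gb allocated")
    return vnic_gbps(total)


if __name__ == "__main__":
    print(vnic_gbps(25))  # 2.5
    print(check_port_allocation({"INTA3.1": 25, "INTA3.3": 50}))  # 7.5
```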

vnic vnicgroup 1
vlan 3001
enable
member INTA3.1
member INTA4.1
exit
vnic vnicgroup : Create a vNIC Group to add members to it. This is a must for vNIC configurations. Not required for non-vNIC setup.
The group members can be vNICs, internal physical ports, and external ports. In the example above, only internal ports were added. No external ports were configured.

vlan 3001 : Each vNIC Group requires its own VLAN, and this must not be an existing VLAN. This is only for internal communication, and will not conflict with the customer side VLANs.

vNICs not added to a vNIC Group, will appear as disconnected.

spanning-tree stp 80 vlan 3001
If spanning tree is enabled, this will place the VLAN 3001 in its own Spanning Tree Group number 80. The firmware will by default assign each VLAN into its own STG without having to do this manually.

fcoe fips enable
!
fcoe fips port INTA3 fcf-mode off
fcoe fips port INTA4 fcf-mode off
fcoe fips port INTA7 fcf-mode on
fcoe fips port INTA8 fcf-mode on
fcoe fips port EXT15 fcf-mode on
fcoe fips port EXT16 fcf-mode on
!
cee enable

Enables FCoE Initialization Protocol (FIP) snooping, which detects which ports support FCoE and which don't.

fcf-mode off/on/auto : This should be off for the internal ports of the compute nodes, and on for the storage and FC ports. To avoid configuration mistakes, you can also set it to auto on all ports.

cee enable : Enable Converged Enhanced Ethernet to allow FC packet encapsulation over Ethernet.

fcalias
Define an alias to make it easy to identify nodes and storage ports.

no fcalias wwn
To remove an already configured fcalias.

zone name
Create a zone and add aliases to this zone.

zoneset name
zoneset activate name
Create a zoneset (a group of zones) and activate it for the entire switch.
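With more than a handful of nodes, typing aliases and zones by hand gets error-prone. This Python sketch generates fcalias, zone and zoneset lines in the same style as the sample configuration above. It creates one zone per host containing the host and all V7000 ports, plus a storage-only zone like v7k_cluster. The "v7k_" naming and the WWPNs come from the masked sample, so replace them with your own. Review the output before pasting it into the switch.

```python
from typing import Dict, List


def zoning_config(
    storage: Dict[str, str], hosts: Dict[str, str], zoneset: str = "ActiveConfig"
) -> List[str]:
    """Build CN4093-style fcalias/zone/zoneset lines from alias-name -> WWPN maps."""
    lines = [f"fcalias {alias} wwn {wwn}" for alias, wwn in {**storage, **hosts}.items()]
    lines.append("!")
    zones = []
    # One zone per host: the host plus every storage port.
    for host in hosts:
        zone = f"v7k_{host}"
        zones.append(zone)
        lines.append(f"zone name {zone}")
        lines.append(f"        member fcalias {host}")
        lines.extend(f"        member fcalias {port}" for port in storage)
    # Storage-only zone, like v7k_cluster in the sample.
    zones.append("v7k_cluster")
    lines.append("zone name v7k_cluster")
    lines.extend(f"        member fcalias {port}" for port in storage)
    lines.append(f"zoneset name {zoneset}")
    lines.extend(f"member {zone}" for zone in zones)
    lines.append(f"zoneset activate name {zoneset}")
    return lines


STORAGE = {"v7k_node1_p1": "50:00:00:00:00:04:00:76", "v7k_node2_p1": "50:00:00:00:00:04:00:77"}
HOSTS = {"node3": "10:00:00:00:00:00:00:5d", "node4": "10:00:00:00:00:00:00:6b"}

if __name__ == "__main__":
    print("\n".join(zoning_config(STORAGE, HOSTS)))
```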

no ip routing
Disable Layer3 routing, and make the switch a Layer2 switch only.

show fcoe database
-----------------------------------------------------------------------
 VLAN  FCID                  WWN                     MAC         Port
-----------------------------------------------------------------------
 1002  011000     50:00:00:00:00:04:00:77      0e:fc:00:01:10:00   INTA8
 1002  011100     50:00:00:00:00:04:00:76      0e:fc:00:01:11:00   INTA7
 1002  011101     10:00:00:00:00:00:00:5d      0e:fc:00:01:11:01   INTA3

 Total number of entries = 3

-----------------------------------------------------------------------
Displays the currently established FCoE connections on the switch. It doesn't show any node-storage associations. It shows the nodes/storage ports that have been detected as using FCoE. The output above is a sample.
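To quickly check which of the expected ports have actually shown up, you can parse this output. This Python sketch assumes the column layout shown above (VLAN, FCID, WWN, MAC, Port), which may change between firmware versions. It compares the listed WWNs against your fcalias list. In the sample, node4 is missing, which is exactly what this check catches.

```python
import re
from typing import Dict, List

# One data row: VLAN, FCID (6 hex digits), 8-byte WWN, 6-byte MAC, port name.
ROW = re.compile(
    r"^\s*(\d+)\s+([0-9A-Fa-f]{6})\s+"
    r"((?:[0-9A-Fa-f]{2}:){7}[0-9A-Fa-f]{2})\s+"
    r"((?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2})\s+(\S+)\s*$"
)


def parse_fcoe_database(text: str) -> List[Dict[str, str]]:
    """Extract the data rows from 'show fcoe database' output, skipping headers and totals."""
    entries = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            vlan, fcid, wwn, mac, port = match.groups()
            entries.append({"vlan": vlan, "fcid": fcid, "wwn": wwn.lower(), "mac": mac, "port": port})
    return entries


def missing_logins(entries: List[Dict[str, str]], expected: Dict[str, str]) -> List[str]:
    """Return the aliases whose WWN does not appear in the FCoE database."""
    seen = {entry["wwn"] for entry in entries}
    return [alias for alias, wwn in expected.items() if wwn.lower() not in seen]


SAMPLE = """\
 VLAN  FCID                  WWN                     MAC         Port
-----------------------------------------------------------------------
 1002  011000     50:00:00:00:00:04:00:77      0e:fc:00:01:10:00   INTA8
 1002  011100     50:00:00:00:00:04:00:76      0e:fc:00:01:11:00   INTA7
 1002  011101     10:00:00:00:00:00:00:5d      0e:fc:00:01:11:01   INTA3

 Total number of entries = 3
"""

ALIASES = {
    "v7k_node1_p1": "50:00:00:00:00:04:00:76",
    "v7k_node2_p1": "50:00:00:00:00:04:00:77",
    "node3": "10:00:00:00:00:00:00:5d",
    "node4": "10:00:00:00:00:00:00:6b",
}

if __name__ == "__main__":
    print(missing_logins(parse_fcoe_database(SAMPLE), ALIASES))  # ['node4']
```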

show zone
List the configured zones on the switch.

For details and explanations of each command, or extra details, do read the CN4093 redbook (linked below in the references).

Note: The above configuration should be the same for the 2nd CN4093 switch, except for the FCalias parts as the WWPNs will be different.

References

  1. IBM V7000 Storage
    1. IBM Storwize V7000 Information Center
    2. Configuration Limits and Restrictions for IBM Storwize V7000
    3. Implementing the IBM Storwize V7000 V6.3
    4. IBM Flex System V7000 Storage Node Introduction and Implementation Guide
  2. Internet Small Computer Systems Interface (iSCSI)
    1. iSCSI Standard by IETF
    2. Comparing Performance Between iSCSI, FCoE and FC
  3. FCoE
    1. Storage and Network Convergence Using FCoE and iSCSI (redbook)
    2. FCoE Between Datacenters
    3. Fixing Stupid, an FCoE Response
    4. FCoE: Additional Considerations (T11 Fiber Channel Committee)
    5. FCoE Questions and Answers (Cisco)
    6. Datacenter Bridging Exchange (DCBX)
  4. Fiber Channel
    1. Fiber Channel Generations (16 Gbps FC)
    2. FC vs iSCSI (Trusted Network Solutions)
    3. FC Frames
  5. IBM CN4093 and EN4093R
    1. Application Guide for EN4093 and EN4093R - Second Edition
    2. Application Guide for CN4093 - First Edition
    3. IBM Networking OS 7.5 Release Notes for CN4093
  6. Emulex
    1. Emulex Universal Multichannel Reference Guide (Guide for the CN4054 VFA)
    2. White papers and documents for cards by Emulex made for IBM
    3. More white papers
    4. Emulex Virtual Fabric Adapter drivers, firmware and user guide
  7. Network Frames
    1. IPv6 Packets
    2. FCoE Frames
    3. Jumbo Frames
    4. Ethernet Frames
    5. Internet Protocol (IP)

Thursday, August 8, 2013

Linux NIC Bonding and VLAN Tagging with IBM Flex Chassis

What is IBM Flex Chassis?

IBM Flex chassis is the new blade technology from IBM, which replaces the 10-year-old BladeCenter H chassis. Like the BladeCenter chassis, the Flex can fit fully functional network switches into the chassis (unlike Cisco, which puts dummy pass-thru modules that plug into top of rack switches).

Environment Setup

My customer had 1 Flex chassis in the main site and another in the disaster recovery (DR) site. Each chassis had IBM 10Gb EN4093 Scalable Switches. The 2 switches in each chassis were interconnected via Virtual Link Aggregation (VLAG) to load balance the traffic between each other. They were connected to 1 Cisco ToR switch. The spanning tree protocol (STP) was PVRST+.

Some of the nodes/servers in the chassis were running VMware, and some were running Red Hat Enterprise Linux (RHEL) 6 with an Oracle RAC setup on top.

The Oracle nodes needed to have multiple IPs belonging to multiple VLANs. The nodes had only 2 internal 10Gb NICs, so NIC Bonding with VLAN tagging was the best choice. I used Linux's native NIC Bonding.

As of this writing, Emulex does not have a NIC Bonding software for their chips on the IBM Flex nodes.

The Problem

The RHEL nodes were configured with active-passive NIC teaming, but they were randomly losing connectivity. Oracle RAC would then report that one of the configured interfaces could no longer communicate, which affected the cluster.

The Solution

The chassis switches act as 2 switches: 1 switch to the nodes and 1 switch to the outside world. Because of this, even if the switch loses connectivity to the outside world, the internal nodes wouldn't know about the uplink failure. Also, for some reason, the MACs weren't being updated on the Cisco ToR switch.

So, instead of using "miimon", which monitors the physical link between the node and the internal switch ports, I switched to ARP monitoring ("arp_interval" with "arp_ip_target"), which sends ARP requests through the ToR L3 switches. That keeps the MAC table refreshed and prevents the IPs from flapping on the nodes.

Configuration

The following configuration was done on RHEL 6. It should work similarly on all distributions, but the location of the files may differ.

Enabling NIC Bonding and Setting Parameters

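Append these lines to the file /etc/modprobe.conf (on RHEL 6, /etc/modprobe.conf is deprecated, and Red Hat's documentation places them in a file under /etc/modprobe.d/ instead, such as /etc/modprobe.d/bonding.conf):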
# bonding config
alias bond0 bonding
options bond0 mode=active-backup arp_interval=50 arp_ip_target=10.10.5.1,10.10.1.1

arp_interval value is in milliseconds. You can specify multiple target IPs. I suggest adding 2. The maximum allowed is 16. The target IPs are the IPs of a VLAN's gateway.
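The rules in that paragraph are easy to get wrong when typing the options line by hand: the interval is in milliseconds, you can give at most 16 ARP targets, and the targets are gateway IPs. Here is a minimal Python sketch that builds and checks the line. It only validates what is described here and that each target is a valid IPv4 address. It does not check whether the gateways actually answer ARP.

```python
import ipaddress
from typing import List

MAX_ARP_TARGETS = 16  # bonding driver limit for arp_ip_target


def bonding_options(arp_interval_ms: int, arp_targets: List[str], mode: str = "active-backup") -> str:
    """Build the 'options bond0 ...' line for ARP monitoring."""
    if arp_interval_ms <= 0:
        raise ValueError("arp_interval must be a positive number of milliseconds")
    if not 1 <= len(arp_targets) <= MAX_ARP_TARGETS:
        raise ValueError(f"need between 1 and {MAX_ARP_TARGETS} ARP targets")
    for target in arp_targets:
        ipaddress.IPv4Address(target)  # raises ValueError on a malformed address
    return (
        f"options bond0 mode={mode} arp_interval={arp_interval_ms} "
        f"arp_ip_target={','.join(arp_targets)}"
    )


if __name__ == "__main__":
    print(bonding_options(50, ["10.10.5.1", "10.10.1.1"]))
```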

Configuring Interfaces with VLAN Tagging

cd to /etc/sysconfig/network-scripts and create the following files

File name: ifcfg-bond0
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes

File name: ifcfg-bond0.105
DEVICE=bond0.105
BOOTPROTO=static
IPADDR=10.10.5.112
NETMASK=255.255.255.0
GATEWAY=10.10.5.1
ONBOOT=yes
VLAN=yes

File name: ifcfg-bond0.101
DEVICE=bond0.101
BOOTPROTO=static
IPADDR=10.10.1.112
NETMASK=255.255.255.0
GATEWAY=10.10.1.1
ONBOOT=yes
VLAN=yes

The file name has to end in bond0.VLANID, and DEVICE has to match it. The IP addressing scheme can be whatever the network team defined for that VLAN.
The network engineers I worked with map VLAN IDs to IP schemes like this:
VLAN 105 -> 10.10.5.x
VLAN 1055 -> 10.10.55.x

You don't have to follow the same convention, but it makes it easy to tell the VLAN ID from the IP.
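That convention can be written down as a tiny function. This is only a sketch of the scheme shown above (VLAN IDs that start with "10", with the remaining digits becoming the third octet of 10.10.x.0/24); your network team's scheme may differ, and the function name `vlan_to_subnet` is hypothetical.

```python
import ipaddress


def vlan_to_subnet(vlan_id: int) -> ipaddress.IPv4Network:
    """Map a VLAN ID to a /24 using the '10' + third-octet convention."""
    if not 1 <= vlan_id <= 4094:
        raise ValueError(f"{vlan_id} is not a valid VLAN ID")
    digits = str(vlan_id)
    octet = digits[2:]
    # The ID must start with "10" and have at least one more digit; a leading
    # zero (e.g. 1005 -> "05") is rejected because it would clash with VLAN 105.
    if not digits.startswith("10") or not octet or (len(octet) > 1 and octet.startswith("0")):
        raise ValueError(f"VLAN {vlan_id} does not follow the 10<octet> convention")
    return ipaddress.IPv4Network(f"10.10.{int(octet)}.0/24")


for vlan in (101, 105, 1055):
    print(vlan, "->", vlan_to_subnet(vlan))
# 101 -> 10.10.1.0/24
# 105 -> 10.10.5.0/24
# 1055 -> 10.10.55.0/24
```

The leading-zero check is my own addition: without it, VLAN 1005 would land on the same subnet as VLAN 105.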

You can repeat the above steps and create as many files as you have VLANs.
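Writing one file per VLAN by hand gets repetitive. Here is a minimal Python sketch that renders the contents of an ifcfg-bond0.VLANID file with the same fields as the examples above. The helper name `ifcfg_vlan` and its arguments are hypothetical, and it only returns text; it does not write anything to /etc/sysconfig/network-scripts.

```python
import ipaddress


def ifcfg_vlan(bond: str, vlan_id: int, ip: str, prefix: int, gateway: str) -> str:
    """Render an ifcfg file for a VLAN-tagged interface on top of a bond."""
    iface = ipaddress.IPv4Interface(f"{ip}/{prefix}")
    # Catch a copy/paste mistake: the gateway must be inside the VLAN's subnet.
    if ipaddress.IPv4Address(gateway) not in iface.network:
        raise ValueError(f"gateway {gateway} is not in {iface.network}")
    lines = [
        f"DEVICE={bond}.{vlan_id}",  # must match the file name suffix
        "BOOTPROTO=static",
        f"IPADDR={iface.ip}",
        f"NETMASK={iface.netmask}",
        f"GATEWAY={gateway}",
        "ONBOOT=yes",
        "VLAN=yes",
    ]
    return "\n".join(lines) + "\n"


print("# contents of ifcfg-bond0.105")
print(ifcfg_vlan("bond0", 105, "10.10.5.112", 24, "10.10.5.1"), end="")
```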

Modify the following files:

File name: ifcfg-eth0
DEVICE=eth0
BOOTPROTO=none
MASTER=bond0
SLAVE=yes
ONBOOT=yes

File name: ifcfg-eth1
DEVICE=eth1
BOOTPROTO=none
MASTER=bond0
SLAVE=yes
ONBOOT=yes

Repeat this for every NIC that you want to participate in the NIC Bonding group.

Configuring The Chassis Switches

The only thing missing now is creating the VLANs on the switches, then enabling VLAN Tagging on the nodes' NICs (internal ports) on the chassis switches. I'll be using the "iscli" command line interface instead of "ibmcli."

It's best to have firmware 7.5.3 or later on the switches before proceeding; otherwise some commands may differ and some features (like automatic Spanning Tree Group assignment) may be missing, which means extra manual work.

Enabling VLAN Tagging on the internal ports:
interface port INTA1-INTA14
tagging

Create VLANs:
vlan 101
enable
member INTA1-INTA12,INTA13,INTA14

This creates VLAN 101 and places internal ports INTA1-INTA14, which connect to nodes 1-14, in it. I wrote the port list this way to show how you can specify non-consecutive ports.
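To double-check which ports a member list like INTA1-INTA12,INTA13,INTA14 covers, here is a small Python sketch that expands it: comma-separated items, each either a single port or a range with a common prefix. It is my own illustration, not the switch's parser; `expand_ports` is a hypothetical name, and it only handles the letters-then-number port names used here.

```python
import re
from typing import List

_PORT = re.compile(r"^([A-Z]+)(\d+)$")


def expand_ports(spec: str) -> List[str]:
    """Expand e.g. 'INTA1-INTA3,INTA5' into ['INTA1', 'INTA2', 'INTA3', 'INTA5']."""
    ports: List[str] = []
    for item in spec.split(","):
        start, _, end = item.strip().partition("-")
        m_start = _PORT.match(start)
        m_end = _PORT.match(end or start)  # a single port is a range of one
        if not m_start or not m_end or m_start.group(1) != m_end.group(1):
            raise ValueError(f"cannot parse port item: {item!r}")
        prefix = m_start.group(1)
        first, last = int(m_start.group(2)), int(m_end.group(2))
        if first > last:
            raise ValueError(f"descending range: {item!r}")
        ports.extend(f"{prefix}{n}" for n in range(first, last + 1))
    return ports


members = expand_ports("INTA1-INTA12,INTA13,INTA14")
print(len(members), members[0], members[-1])  # 14 INTA1 INTA14
```

Both spellings used in this post, INTA1-INTA12,INTA13,INTA14 and INTA1-INTA14, expand to the same 14 ports.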

vlan 105
enable
member INTA1-INTA14

The default Port VLAN ID (PVID) is 1. This is the native VLAN: any untagged traffic is placed in it. If the customer's native VLAN ID is different, change this value. If the customer does not intend to have any untagged traffic, it's better to set this value to a VLAN that doesn't exist on the customer side, creating a black hole on the internal switches for unwanted untagged traffic.

To change the PVID:
interface port INTA1-INTA14
pvid 5

This assumes the customer's native VLAN is 5. To set the PVID to a VLAN that doesn't exist, agree with the customer on a VLAN ID they'll never use. In my case, I often use 3999.

interface port INTA1-INTA14
pvid 3999

You don't have to create the VLAN beforehand. The switch will automatically create the VLAN, assign it to its own Spanning Tree Group (STG) and change the PVID of the defined node ports.
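The PVID behaviour described above can be modelled as a simple ingress rule. This is a simplified sketch under my own assumptions, not the switch firmware's logic: an untagged frame is placed in the port's PVID, a tagged frame is accepted only for a VLAN the port is a member of, and the set of VLANs carried to the customer (`uplink`) is hypothetical. It shows why a PVID like 3999 turns into a black hole.

```python
from typing import FrozenSet, Optional


def classify(tag: Optional[int], pvid: int, member_vlans: FrozenSet[int]) -> Optional[int]:
    """Return the VLAN an ingress frame is placed in, or None if it is dropped."""
    if tag is None:
        return pvid  # untagged traffic lands in the port's PVID
    return tag if tag in member_vlans else None


def reaches_uplink(tag: Optional[int], pvid: int, member_vlans: FrozenSet[int],
                   uplink_vlans: FrozenSet[int]) -> bool:
    """True if the frame ends up in a VLAN that is also carried to the customer."""
    vlan = classify(tag, pvid, member_vlans)
    return vlan is not None and vlan in uplink_vlans


node_port = frozenset({101, 105})
uplink = frozenset({5, 101, 105})  # hypothetical VLANs trunked to the customer's ToR
print(reaches_uplink(105, 3999, node_port, uplink))   # True: tagged, member VLAN
print(reaches_uplink(None, 5, node_port, uplink))     # True: untagged, native VLAN 5
print(reaches_uplink(None, 3999, node_port, uplink))  # False: black-holed
```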

That's it! Now restart the network service ("service network restart" on RHEL 6), and the bonding interfaces should come up.

References

  1. RHEL: Linux Bond / Team Multiple Network Interfaces (NIC) Into a Single Interface
  2. Linux Ethernet Bonding Driver HOWTO
  3. NIC Bonding for KVM (has cute graphs)

I highly recommend that you read the 2nd link (the kernel guide) before doing anything. It explains the different modes (active/passive, active/active, EtherChannel, etc.) and whether or not they require ToR switch support.

Caution

Remember that you cannot use active/active in an EtherChannel/PortChannel manner, because the 2 internal NICs in each node connect to two different switches, and EtherChannel requires that the ports belong to the same switch. It is possible if you stack the two chassis switches, but I have not attempted this before.

Also, make sure the STP mode used on the IBM switches matches whatever the customer runs, otherwise you'll cause a network loop and bring down the entire customer network!

May your packets serve you well.

Friday, July 19, 2013

HOWTO: Tor2Web on Debian

What is Tor2Web?

Tor2Web is open source software that allows the general public to browse Tor hidden services, which are servers running anonymously behind the Tor network. These could be web servers, email servers or any other services. They are normally accessed only through the Tor software, but if for some reason you cannot install it, or just need a quick look, tor2web provides easy access through a standard web browser.

Tor hidden service (HS) addresses usually look like this: http://sx3jvhfgzhw44p3x.onion (WikiLeaks' HS). With tor2web, you can either create a general public proxy, where people enter the HS like this: https://sx3jvhfgzhw44p3x.tor2web.org/, or bind your tor2web node to a specific HS (yours or someone else's), so that anyone who accesses your tor2web domain name sees only that HS.
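In proxy mode, the mapping is just a host name rewrite: the .onion suffix is replaced by the tor2web node's domain, and the result is served over HTTPS. A small Python sketch of that rewrite (the function name `to_tor2web` is hypothetical, and it only accepts the 16-character base32 onion names used in this post):

```python
import re
from urllib.parse import urlsplit

# 16 base32 characters (a-z, 2-7), as in the hidden service names shown above.
ONION_HOST = re.compile(r"^([a-z2-7]{16})\.onion$")


def to_tor2web(onion_url: str, basehost: str) -> str:
    """Rewrite http://<name>.onion/path into https://<name>.<basehost>/path."""
    parts = urlsplit(onion_url)
    match = ONION_HOST.match(parts.hostname or "")
    if not match:
        raise ValueError(f"not a hidden service URL: {onion_url}")
    path = parts.path or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"https://{match.group(1)}.{basehost}{path}{query}"


print(to_tor2web("http://sx3jvhfgzhw44p3x.onion", "tor2web.org"))
# https://sx3jvhfgzhw44p3x.tor2web.org/
```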

In all cases, Tor2Web cannot be used alone: you still need a 2nd node that runs the HS. This guide will not cover how to set up a HS; you can refer to Tor's guide for that. Never run both the HS & tor2web on the same node!

Why Use Tor2Web?

Tor2Web is part of a bigger project called GlobaLeaks, but tor2web can also be used for things unrelated to whistle-blowing, such as providing access to banned/blocked material or offering services anonymously, like an anonymous chat service. Taking down the tor2web node doesn't affect the HS, and it's always easy to bring up a new tor2web node.

Another reason is simply to hide your stuff. With news of governments snooping on everyone and everything, one doesn't feel safe anymore. They could use anything against you at any point in the future. So, by running a HS, you can run your own email server or file server without worrying about some law enforcement goons seizing your hardware and disrupting your work, even if you're only collateral damage (when the Megaupload servers were seized, many other websites were also affected).

Of course, one cannot attain full anonymity without taking all possible precautions, such as registering the host/VPS/domain name through Tor so it can't be traced back to you, paying for the services with Bitcoin, and SSHing into your system through Tor.

Requirements

  1. Debian Squeeze (6) or Wheezy (7)
  2. tor
  3. python2.7
  4. python-Twisted version 13.1.x
  5. tor2web

tor2web v3.0 uses Python Twisted as its web server rather than a standalone web server like Nginx.

Note: This guide has slight differences from the Tor2Web guide. Feel free to refer to both and modify as you please.

1) Debian

Some Virtual Private Server (VPS) providers do not support Wheezy, but tor2web needs packages that are only available in Wheezy. On Squeeze, you will have to install a lot of packages from Wheezy to be able to use python2.7 and its requirements, but you don't need to run a dist-upgrade, so in the end your distribution will still be Squeeze.

If you're already running Wheezy, then you can simply ignore the parts about adding Wheezy's specific repositories and removing them later.

1.1) Initial Sources

Modify the file /etc/apt/sources.list with your favorite editor (vi, vim, emacs, nano, pico):
deb http://ftp.uk.debian.org/debian             squeeze main contrib non-free
deb http://ftp.uk.debian.org/debian-security    squeeze/updates main contrib non-free
deb http://ftp.uk.debian.org/debian-backports squeeze-backports main
deb http://ftp.uk.debian.org/debian-backports squeeze-backports-sloppy main
deb http://deb.torproject.org/torproject.org squeeze main
deb http://dl.bintray.com/globaleaks/deb /

Replace "squeeze" with "wheezy" if that's what you're running, but leave out the backports lines, as you don't need them.

You may use a different mirror for Debian's packages; always use the mirror closest to your host.

1.2) Upgrading the box

You need to be root to run these commands, or have "sudo" installed with the correct privileges.

The last command will upgrade your host to the latest packages of your distribution. It's always good to stay up to date and avoid buggy packages.

sudo apt-get update
sudo apt-get upgrade

1.3) Add Wheezy repositories

Modify the file /etc/apt/sources.list with your favorite editor (vi, vim, emacs, nano, pico) again, and add the Wheezy repositories:
deb http://ftp.uk.debian.org/debian             wheezy main contrib non-free
deb http://ftp.uk.debian.org/debian-security    wheezy/updates main contrib non-free

2) Tor

Import Tor's keyring to authenticate the packages and install Tor.

Note: If you're running as root, or you don't have sudo installed, simply remove "sudo" from the commands below.

sudo apt-get update
sudo apt-get install debian-keyring debian-archive-keyring
sudo apt-get update

gpg --keyserver keys.gnupg.net --recv-keys A3C4F0F979CAA22CDBA8F512EE8CBC9E886DDD89
gpg --export A3C4F0F979CAA22CDBA8F512EE8CBC9E886DDD89 | sudo apt-key add -

sudo apt-get install deb.torproject.org-keyring
sudo apt-get update
sudo apt-get install tor
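A short key ID such as 886DDD89 is simply the last 8 hex digits of the full fingerprint (A3C4F0F979CAA22CDBA8F512EE8CBC9E886DDD89). Because different keys can share the same short ID, it's worth comparing the fingerprint of the key you imported (gpg --fingerprint shows it, in groups of four) against the full value. A small Python sketch of that comparison; the helper names are hypothetical:

```python
def normalize_fpr(fpr: str) -> str:
    """Strip spaces (as printed by gpg --fingerprint) and upper-case the hex."""
    cleaned = "".join(fpr.split()).upper()
    if not cleaned or any(c not in "0123456789ABCDEF" for c in cleaned):
        raise ValueError(f"not a hex fingerprint: {fpr!r}")
    return cleaned


def short_id(fpr: str) -> str:
    """The 8-hex-digit short key ID is the tail of the fingerprint."""
    return normalize_fpr(fpr)[-8:]


EXPECTED = "A3C4F0F979CAA22CDBA8F512EE8CBC9E886DDD89"
imported = "A3C4 F0F9 79CA A22C DBA8  F512 EE8C BC9E 886D DD89"  # grouped as gpg prints it

print(short_id(EXPECTED))                                  # 886DDD89
print(normalize_fpr(imported) == normalize_fpr(EXPECTED))  # True
```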

Now you should have tor installed. You can check the service status using: service tor status

3) Python

Check if you already have python installed and which version:
python --version
python2.7 --version

Now install python and some of its companion packages:
sudo apt-get install -t wheezy python2.7 python-pip python-dev wget ca-certificates

If you're running Squeeze, you'll also be asked to upgrade glibc (the libc6 package), and during the install you'll be asked whether to restart the services that depend on it. If this is a fresh host, it's safe to say "Yes"; if it's a production box with live services, choose "No." This will also upgrade & install a lot of packages from Wheezy, as python-pip depends on them.

If you have multiple versions of python installed (2.6, 2.7 & 3.x), you'll need to change the default to 2.7:
update-alternatives --install /usr/bin/python python /usr/bin/python2.7 10

If you need to change the default later on, run this:
update-alternatives --config python

4) Tor2Web

First, grab the tor2web requirements and install them:
wget https://raw.github.com/globaleaks/GLBackend/master/requirements.txt
sudo pip install -r requirements.txt
sudo apt-get install tor2web
sudo service tor2web status
sudo service tor2web stop

Now that you have tor2web installed, and stopped, you need to create certificates for session encryption between the tor2web node and the user's browser.

When running the 2nd openssl command, make sure the Common Name (FQDN) matches the domain name that will be used to access the tor2web server. If you're planning to access it by IP, use the IP here.

The 3rd command will create a certificate valid for a year (365 days). If you want it to live longer, increase the number of days.

The last command will take a long time to generate the file. If your computer is faster than your host, run it on your computer then copy it there.

cd /home/tor2web/certs/
openssl genrsa -aes256 -out tor2web-key.pem 4096
openssl req -new -key tor2web-key.pem -out tor2web-csr.pem
openssl x509 -req -days 365 -in tor2web-csr.pem -signkey tor2web-key.pem -out tor2web-intermediate.pem
openssl dhparam -out tor2web-dh.pem 4096

Don't change the file names as tor2web is coded to use them as is.

Because the private key was generated with -aes256, it is encrypted, and you'll be asked for its password whenever you [re]start the tor2web service. This is good: if your server is hijacked and your key is taken, it cannot be used unless someone knows the password (which shouldn't be the same as the root password!).

If you don't care, you can strip the password from the key:
openssl rsa -in tor2web-key.pem -out tor2web-key.pem.insecure
mv tor2web-key.pem tor2web-key.pem.secure
mv tor2web-key.pem.insecure  tor2web-key.pem

5) Configuring Tor and Tor2Web

Restrict tor to the localhost IP to prevent external connections, by adding this line to /etc/tor/torrc:
SocksPort 127.0.0.1:9050

Copy the example file into a new one:
cp /etc/tor2web.conf.example /etc/tor2web.conf

Modify /etc/tor2web.conf
nodename = whatever you want here. It can be the IP since it's unique

listen_ipv4 = IP of your VPS
listen_ipv6 = IP of your VPS, if you want to use IPv6
Set these to your VPS's IPs. If you don't want to bind to IPv6 or IPv4, comment out that line by adding a # sign at the beginning. At least one of them must be present.

listen_port_http = 80
listen_port_https = 443
You cannot comment out the HTTP port, as tor2web uses it to automatically redirect visitors to HTTPS, but you can change the default ports.

basehost = this should be the root domain name (example.com not www.example.com or any subdomain, unless you happen to be running behind a subdomain). It can also be an IP. If you do not use the exact name/IP that the tor2web service will be accessed from, tor2web will not function.

sockshost = 127.0.0.1
socksport = 9050
These should match the SocksPort setting in the /etc/tor/torrc file.
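Since a mismatch here is easy to miss, a small Python sketch can compare the two files. It assumes the simple formats shown in this post (a "SocksPort host:port" line in torrc, and "key = value" lines in tor2web.conf, with anything after a # treated as a comment); real torrc files allow other SocksPort forms that it doesn't handle, and all function names are hypothetical.

```python
from typing import Dict, Optional, Tuple


def torrc_socks(torrc: str) -> Optional[Tuple[str, int]]:
    """Return (host, port) from a 'SocksPort host:port' line, if present."""
    for line in torrc.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0].lower() == "socksport" and ":" in fields[1]:
            host, _, port = fields[1].rpartition(":")
            return host, int(port)
    return None


def conf_values(conf: str) -> Dict[str, str]:
    """Parse 'key = value' lines, ignoring comments; keys are lower-cased."""
    values: Dict[str, str] = {}
    for line in conf.splitlines():
        key, sep, value = line.split("#", 1)[0].partition("=")
        if sep:
            values[key.strip().lower()] = value.strip()
    return values


def socks_settings_match(torrc: str, conf: str) -> bool:
    """True if tor2web's sockshost/socksport point at tor's SOCKS listener."""
    socks = torrc_socks(torrc)
    values = conf_values(conf)
    if socks is None or "sockshost" not in values or "socksport" not in values:
        return False
    return socks == (values["sockshost"], int(values["socksport"]))


torrc = "SocksPort 127.0.0.1:9050\n"
conf = "sockshost = 127.0.0.1\nsocksport = 9050\n"
print(socks_settings_match(torrc, conf))  # True
```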

cipher_list = DHE-RSA-AES256-SHA
I do not trust the others: DSS is not very common, and RC4 has known weaknesses.

mode = TRANSLATION
onion = hidden service name: name.onion (jntlesnev5o7zysa.onion - The Pirate Bay's HS)
blockcrawl = True
Setting mode to TRANSLATION binds tor2web to the specific hidden service defined in the onion option. If you want tor2web to function as a general proxy, leave the default options. The blockcrawl option blocks search engine crawlers.

I commented out the email notification options as I don't really care about notifications.

#mirror = ...
Comment it out unless you happen to be running multiple tor2web nodes and want this one to list them in its banner.

6) Clean Up & Run the Services

If you're running Squeeze, remove the Wheezy main repository and keep only the security one, to prevent upgrading the entire host to Wheezy.
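If you'd rather script this than edit by hand, here is a Python sketch that filters the sources list text: it drops deb/deb-src lines whose suite is exactly "wheezy" and keeps everything else, including "wheezy/updates". It assumes the one-line "deb URL suite components" format used above (no "[option]" syntax); the function name `drop_suite` is hypothetical, and it returns the new text instead of writing /etc/apt/sources.list.

```python
def drop_suite(sources: str, suite: str = "wheezy") -> str:
    """Remove 'deb'/'deb-src' lines whose suite field is exactly `suite`."""
    kept = []
    for line in sources.splitlines():
        fields = line.split()
        is_entry = len(fields) >= 3 and fields[0] in ("deb", "deb-src")
        if is_entry and fields[2] == suite:
            continue  # 'wheezy' is dropped, 'wheezy/updates' is kept
        kept.append(line)
    return "\n".join(kept) + "\n"


before = """deb http://ftp.uk.debian.org/debian             squeeze main contrib non-free
deb http://ftp.uk.debian.org/debian             wheezy main contrib non-free
deb http://ftp.uk.debian.org/debian-security    wheezy/updates main contrib non-free
"""
print(drop_suite(before), end="")
```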

Now start the tor2web service (it will start tor for you):
service tor2web restart

You'll be asked to input the password of the certificate every time you [re]start the service, unless you have removed the password. Read the caution about it above.

NEVER ssh into your hidden service node using its real IP from your tor2web node. Should your tor2web node ever get seized or compromised, you do not want anyone looking at the logs or shell history to find the IP of your hidden service, as it won't be hidden anymore.

7) Help

If you need my help setting up a HS, never contact me with your personal email and do not ask for help in the comments section. Create a fake email or use tormail.org. I don't want to know why you're setting it up or what kind of material you're hosting. The less I know, the better for both of us.

You can ask anonymously on my blog about generic config or a problem you're having, but if you need specific help, send me an email with no details and then we'll agree on how to proceed.

8) Sources