Techy Title Here: 2016

Monday, February 29, 2016

Freeing Disk from VMware Virtual Flash Read Cache (vFRC)

I was toying with vFRC in my lab and when I was done, I deleted the volume from the vSphere web client, but the local flash disk had retained its GPT partition format and was still claimed as a VMFS volume. I was unable to use that disk for other applications.

Try deleting using the web client:

Select the host, go to the Manage tab, select Storage, then choose Storage Devices. Select the disk, click the gear icon and choose Erase Partitions. Make sure you've selected the right disk, because this will wipe everything on it.

Via CLI: To delete the disk partition, first enable SSH on the host, then login and list all disks:
ls -l /vmfs/devices/disks/

Sample output:
ls -l /vmfs/devices/disks/
total 495867432
-rw-------    1 root     root     8004304896 Feb 29 08:45 mpx.vmhba32:C0:T0:L0
-rw-------    1 root     root       4161536 Feb 29 08:45 mpx.vmhba32:C0:T0:L0:1
-rw-------    1 root     root     262127616 Feb 29 08:45 mpx.vmhba32:C0:T0:L0:5
-rw-------    1 root     root     262127616 Feb 29 08:45 mpx.vmhba32:C0:T0:L0:6
-rw-------    1 root     root     115326976 Feb 29 08:45 mpx.vmhba32:C0:T0:L0:7
-rw-------    1 root     root     299876352 Feb 29 08:45 mpx.vmhba32:C0:T0:L0:8
-rw-------    1 root     root     2684354560 Feb 29 08:45 mpx.vmhba32:C0:T0:L0:9
-rw-------    1 root     root     128035676160 Feb 29 08:45 t10.ATA_____ADATA_SP600_____________________________7F1820011415________
-rw-------    1 root     root     128033579008 Feb 29 08:45 t10.ATA_____ADATA_SP600_____________________________7F1820011415________:1
-rw-------    1 root     root     120034123776 Feb 29 08:45 t10.ATA_____KINGSTON_SV300S37A120G__________________50026B7255068D61____
-rw-------    1 root     root     120032591872 Feb 29 08:45 t10.ATA_____KINGSTON_SV300S37A120G__________________50026B7255068D61____:1
lrwxrwxrwx    1 root     root            20 Feb 29 08:45 vml.0000000000766d68626133323a303a30 -> mpx.vmhba32:C0:T0:L0
lrwxrwxrwx    1 root     root            22 Feb 29 08:45 vml.0000000000766d68626133323a303a30:1 -> mpx.vmhba32:C0:T0:L0:1
lrwxrwxrwx    1 root     root            22 Feb 29 08:45 vml.0000000000766d68626133323a303a30:5 -> mpx.vmhba32:C0:T0:L0:5
lrwxrwxrwx    1 root     root            22 Feb 29 08:45 vml.0000000000766d68626133323a303a30:6 -> mpx.vmhba32:C0:T0:L0:6
lrwxrwxrwx    1 root     root            22 Feb 29 08:45 vml.0000000000766d68626133323a303a30:7 -> mpx.vmhba32:C0:T0:L0:7
lrwxrwxrwx    1 root     root            22 Feb 29 08:45 vml.0000000000766d68626133323a303a30:8 -> mpx.vmhba32:C0:T0:L0:8
lrwxrwxrwx    1 root     root            22 Feb 29 08:45 vml.0000000000766d68626133323a303a30:9 -> mpx.vmhba32:C0:T0:L0:9
lrwxrwxrwx    1 root     root            72 Feb 29 08:45 vml.010000000035303032364237323535303638443631202020204b494e475354 -> t10.ATA_____KINGSTON_SV300S37A120G__________________50026B7255068D61____
lrwxrwxrwx    1 root     root            74 Feb 29 08:45 vml.010000000035303032364237323535303638443631202020204b494e475354:1 -> t10.ATA_____KINGSTON_SV300S37A120G__________________50026B7255068D61____:1
lrwxrwxrwx    1 root     root            72 Feb 29 08:45 vml.01000000003746313832303031313431352020202020202020414441544120 -> t10.ATA_____ADATA_SP600_____________________________7F1820011415________
lrwxrwxrwx    1 root     root            74 Feb 29 08:45 vml.01000000003746313832303031313431352020202020202020414441544120:1 -> t10.ATA_____ADATA_SP600_____________________________7F1820011415________:1

Find your disk there, and then list its partitions:
partedUtil getptbl /vmfs/devices/disks/<disk name>

Sample output:
partedUtil getptbl /vmfs/devices/disks/vml.010000000035303032364237323535303638443631202020204b494e475354
gpt
14593 255 63 234441648
1 2048 234440703 AA31E02A400F11DB9590000C2911D1B8 vmfs 0

You can see above that there's one partition labeled "vmfs", which is the one we need to get rid of. The leading number on that line (1 here) is the partition number.
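If you'd rather not read the table by eye, the output is easy to parse. The sketch below is only an illustration: the function name parse_getptbl is made up, and it assumes 512-byte sectors and the layout shown in the sample (label line, geometry line ending in the total sector count, then one line per partition).

```python
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors


def parse_getptbl(text):
    """Parse `partedUtil getptbl` output into (label, total_sectors, partitions)."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    label = lines[0][0]                # e.g. "gpt"
    total_sectors = int(lines[1][3])   # geometry line: last field is the sector count
    partitions = []
    for fields in lines[2:]:
        number, start, end = int(fields[0]), int(fields[1]), int(fields[2])
        partitions.append({
            "number": number,          # the value to pass to `partedUtil delete`
            "start": start,
            "end": end,
            "type": fields[4],         # e.g. "vmfs"
            "bytes": (end - start + 1) * SECTOR_BYTES,
        })
    return label, total_sectors, partitions


sample = """gpt
14593 255 63 234441648
1 2048 234440703 AA31E02A400F11DB9590000C2911D1B8 vmfs 0
"""

label, total, parts = parse_getptbl(sample)
for p in parts:
    print(p["number"], p["type"], p["bytes"])  # 1 vmfs 120032591872
```

The byte count printed for partition 1 matches the size of the ":1" entry for the Kingston disk in the ls listing above, which is a handy sanity check that you're looking at the right disk.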

To delete the partition:
partedUtil delete /vmfs/devices/disks/<disk name> <partition number>

Sample output:
partedUtil delete /vmfs/devices/disks/vml.010000000035303032364237323535303638443631202020204b494e475354 1

Done. Look in vSphere web client and it should now report 0 primary partitions on that disk and you're free to use it for something else.

Check the partition table:
partedUtil getptbl /vmfs/devices/disks/vml.010000000035303032364237323535303638443631202020204b494e475354
gpt
14593 255 63 234441648

Saturday, February 6, 2016

NGINX with High Security Ciphers and LetsEncrypt

I want to move away from the bloated Apache web server and NGINX meets my requirements, but this time I want to use SSL/TLS with signed certificates with the highest security ciphers that support Perfect Forward Secrecy, because why not?

Sadly, the information was scattered and not everything is there in the manuals, so this is a documentation of what I've found and done in my setup.

The Let's Encrypt project provides authenticated and validated domain certificates for free! The catch? They expire every 90 days and their official client requires root access & dependencies, but you can (auto)renew and avoid these. Read on to know more.

Article Updates

Mar 3rd

Corrected root's crontab entry.
Corrected headers' content and location
Added more info about security and privacy headers

Environment

My setup consists of the stuff below. This post will presume Debian & NGINX are already installed. In the steps below, a line starting with "#" means it's a command you should type. Type the command without the "#" character (not necessarily as root).

Debian Jessie (8)

#cat /etc/issue

NGINX version 1.6.2, installed from nginx-full package.

#nginx -v

OpenSSL 1.0.1k

#openssl version

Python 2.7.9

#python --version

acme-tiny Dec 29, 2015

If you have an older version of openssl or nginx, you're likely to face problems and failures since new ciphers have been introduced in recent versions of OpenSSL only (1.0.1h) and the same for NGINX's settings. Make sure your distro supports the latest versions, otherwise you'll be leaving yourself and your visitors vulnerable.

Why acme-tiny?

The official letsencrypt client requires installing dependencies such as gcc (GNU C Compiler), and it has to run as root, not just once, but as a daemon or a cron job, since the certificate must be renewed every 90 days!

As much as I appreciate the Let's Encrypt initiative, I'm not granting their software root access to my machines, nor installing gcc on a production machine. That's where acme-tiny comes in: a small client (about 200 lines) that uses the Let's Encrypt API. You can (and should) audit its code before using it, since it's only 200 lines of human-readable Python.

Configuring NGINX for TLS/PFS

SSL is dead. You should be using TLS only, and if you don't have to service old devices (Android 4.x, old IE browsers, Windows XP), then you should be using TLS v1.2 only with a strict set of ciphers.

Perfect Forward Secrecy (PFS) is an old idea but wasn't widely adopted until after Snowden revealed the amount of encrypted data being stored for later decryption. With PFS, every session negotiates its own ephemeral key (using Diffie-Hellman) instead of deriving it from the server's long-term private key, so even if traffic is captured today and the private key leaks later, the recorded sessions still can't be decrypted with it.

TLS Config

If you're going to configure a wildcard certificate, place the config in /etc/nginx/nginx.conf. Otherwise if the certificate is unique to a specific domain/subdomain, you'll need to place the config in a virtual host config file.

In my case, I started with a wildcard, but it was a self-signed certificate and was rejected by browsers, which is normal. Later, when I got a Let's Encrypt certificate, I moved the config to the specific subdomain.

Note: Let's Encrypt doesn't support wildcard certs as of this writing, however, they allow you up to 100 domains/subdomains.

Edit /etc/nginx/nginx.conf

user www-data;
worker_processes 4;
pid /run/nginx.pid;

events {
        worker_connections 256;
        multi_accept on;
}

http {

        ##
        # Basic Settings
        ##

        sendfile on;
        tcp_nopush on;
        tcp_nodelay on;
        keepalive_timeout 65;
        types_hash_max_size 2048;
        server_tokens off;

        # server_names_hash_bucket_size 64;
        server_name_in_redirect off;

        include /etc/nginx/mime.types;
        default_type application/octet-stream;

        ##
        # SSL Settings
        ##

        ssl_protocols TLSv1.2; # Dropping SSLv3, ref: POODLE
        ssl_prefer_server_ciphers on;
        # Change the cache name. Read the manual for more info.
        ssl_session_cache shared:YourSSLCacheNameHere:10m;
        ssl_session_timeout 10m;
        ssl_session_tickets off;
        ssl_stapling on;
        ssl_stapling_verify on;
        # contains CBC AES algs which I do not like
        #ssl_ciphers "EECDH+AESGCM:EDH+AESGCM:AES256+EECDH:AES256+EDH";
        # AES256 GCM is not yet supported on most browsers
        #ssl_ciphers 'ECDHE-RSA-AES256-GCM-SHA384';
        # TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 & TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
        ssl_ciphers "EECDH+AESGCM";
        ssl_ecdh_curve secp384r1;
        # self-generated 4096 DH key range
        ssl_dhparam /etc/nginx/ssl/dhparam.pem;

        # Put IPs of your hosting provider here, or a trusted DNS provider. These are Google's.
        resolver 8.8.8.8 8.8.4.4 [2001:4860:4860::8888] valid=300s;
        resolver_timeout 5s;

        # wildcard cert config should go here, if any
        #ssl_certificate /etc/nginx/ssl/;
        #ssl_certificate_key /etc/nginx/ssl/;

        ##
        # Logging Settings
        ##

        access_log /var/log/nginx/access.log;
        error_log /var/log/nginx/error.log;

        ##
        # Gzip Settings
        ##

        gzip on;
        gzip_disable "msie6";

        # gzip_vary on;
        # gzip_proxied any;
        # gzip_comp_level 6;
        # gzip_buffers 16 8k;
        # gzip_http_version 1.1;
        # gzip_types text/plain text/css application/json application/javascript text/xml application/xml application/xml+rss text/javascript;

        ##
        # Virtual Host Configs  
        ##

        include /etc/nginx/conf.d/*.conf;
        include /etc/nginx/sites-enabled/*;
}
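If you're curious which suites a cipher string such as "EECDH+AESGCM" actually expands to, you can ask OpenSSL through Python's ssl module. This is a rough sketch, not part of my setup: it assumes Python 3.6 or newer (the server above only has 2.7), the function name expand_cipher_string is made up, and the result depends on the OpenSSL build your Python is linked against.

```python
import ssl


def expand_cipher_string(cipher_string):
    """Return the suite names the local OpenSSL selects for a cipher string."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.set_ciphers(cipher_string)
    # TLS 1.3 suites (names starting with "TLS_") are configured separately by
    # OpenSSL and are not affected by this string, so they are filtered out.
    return [c["name"] for c in ctx.get_ciphers()
            if not c["name"].startswith("TLS_")]


for name in expand_cipher_string("EECDH+AESGCM"):
    print(name)
```

Every name printed should be an ECDHE key exchange paired with AES-GCM, which is what the ssl_ciphers line in the config is asking for.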

Make sure you correct the line breaks if you paste. Due to styling on my blog, you may have one line spilling to multiple lines in the config, and this will break your config.

Don't worry about non-existing directories. We'll come to those later as we finish the setup.

Default Virtual Host Config

I don't use a default domain (www, for example) as mine are hidden from public. If you're like me, then this config fits you, otherwise move to the step below.

Edit /etc/nginx/sites-enabled/default

# Default server configuration
server {
        # change IP to match yours
        listen 127.0.0.1:80 default_server;
        # uncomment to enable IPv6
        #listen [::1]:80 default_server;
        # ssl on IPv4; change IP to match yours
        listen 127.0.0.1:443 ssl default_server;
        # uncomment to enable ssl on IPv6
        #listen [::1]:443 ssl default_server;
        server_name _; #default server

        ssl_certificate /etc/nginx/ssl/default_wild.crt;
        ssl_certificate_key /etc/nginx/ssl/default_wild.key;

        root /var/www/html;

        # Add index.php to the list if you are using PHP
        #index index.html index.htm index.nginx-debian.html;
        index index.html;

        location / {
                # First attempt to serve request as file, then
                # as directory, then fall back to displaying a 404.
                try_files $uri $uri/ =404;
                autoindex off;
        }
}

This config will load when someone visits the IP(s) NGINX is configured at.

Virtual Host Config

This is where your subdomain config goes. In my case, the certificate belongs to this specific subdomain, so the certificate lines are added here. If you were using a wildcard cert, you should move them to nginx.conf above.

Create a file for your subdomain /etc/nginx/sites-available/mysubdom

server {
        listen 127.0.0.1:80;
 # uncomment if you want IPv6
        #listen [::1]:80;
        #listen 127.0.0.1:443 ssl;
        #listen [::1]:443 ssl;
        server_name subdomain.domain.com;
        keepalive_timeout 70;

        # The certificate is for subdomain.domain.com only
        #ssl_certificate /var/www/challenge/subdomain_chained.crt;
        #ssl_certificate_key /etc/nginx/ssl/subdomain.key;

        root /var/www/subdomain;

        # Add index.php to the list if you are using PHP
        #index index.html index.htm index.nginx-debian.html;
        index index.html;

        # letsencrypt challenge directory to verify domain
        location /.well-known/acme-challenge/ {
                alias /var/www/challenge/;
                try_files $uri =404;
        }

        location / {
                # First attempt to serve request as file, then
                # as directory, then fall back to displaying a 404.
                try_files $uri $uri/ =404;
                autoindex off; #enable if you want file listing
        }

}

Notice that listening for SSL/TLS is not enabled yet, and both the ssl_certificate and ssl_certificate_key lines are commented out with a hash. This is required for the initial setup: we'll need to reload nginx, and it would fail because the certificate files don't exist yet. We'll enable these lines once everything is done.

To activate this config file in NGINX, you need to link it into sites-enabled:

ln -s /etc/nginx/sites-available/mysubdom /etc/nginx/sites-enabled/mysubdom

Create the directory for your subdomain to serve files:

mkdir /var/www/subdomain
mkdir -p /var/www/challenge/.well-known/acme-challenge
chown -R www-data:www-data /var/www/subdomain
chmod 775 /var/www/subdomain
chmod 771 /var/www/challenge

www-data is the user that NGINX runs as, as shown in the first NGINX config above. Don't worry about the challenge directory owner for now. It'll be taken care of later.

Private Keys and Certificates

The overall config is done. What's left is generating private keys, deriving a certificate for the subdomain, then finally working with Let's Encrypt client.

Create the directory /etc/nginx/ssl to place the subdomain private keys and other things in there:

mkdir /etc/nginx/ssl

Modify its permissions to be restricted to root and only those who know exactly which file to use:

chmod 751 /etc/nginx/ssl
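If the octal modes used in this post are hard to read, Python's stat module can render them the way ls -l would. This is just a small illustration; the helper name describe_dir_mode is made up.

```python
import stat


def describe_dir_mode(mode):
    """Render a directory's octal permission bits the way `ls -l` shows them."""
    return stat.filemode(stat.S_IFDIR | mode)


print(describe_dir_mode(0o751))  # drwxr-x--x
print(describe_dir_mode(0o771))  # drwxrwx--x
```

The trailing "--x" is the interesting part: other users can traverse the directory and open a file if they already know its name, but they can't list its contents.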

Now inside the ssl directory, generate a 4096 bit Diffie-Hellman parameters file (prime numbers) to act as seeds for the PFS/TLS sessions (this will take a VERY LONG time):

openssl dhparam -out dhparam.pem 4096

Generate a self-signed certificate to be used for the default virtual host (i.e., not the one you care about). This will be served to anyone accessing the IP or any subdomain other than the one you specifically define in the virtual host:

openssl req -x509 -nodes -days 3650 -newkey rsa:4096 -sha512 -keyout /etc/nginx/ssl/default_wild.key -out /etc/nginx/ssl/default_wild.crt

If you don't configure this, users will be served your legitimate certificate and they'll be able to find your "hidden" subdomain. Only do the above if you want your domain/subdomain hidden.

Generate a subdomain private key and a certificate request:

openssl genrsa 4096 > subdomain.key
openssl req -new -sha512 -key subdomain.key -subj "/CN=subdomain.domain.com" > subdomain.csr

This is the domain/subdomain that will be valid to the world. It can also be "domain.com" if you like.

Let's Encrypt and ACME-Tiny

For security purposes, it's best to have the client run as a separate user. Should anything go wrong in the future, its access would be quite isolated.

Environment Setup

Create a user for it:

useradd -m letsencrypt

Copy the subdomain csr file and set home directory permissions:

chmod 751 /home/letsencrypt
cp /etc/nginx/ssl/subdomain.csr /home/letsencrypt/
chown -R letsencrypt:letsencrypt /home/letsencrypt
chown -R letsencrypt:letsencrypt /var/www/challenge

Now switch user to become the letsencrypt user for the rest of the commands:

su - letsencrypt
openssl genrsa 4096 > account.key
wget https://raw.githubusercontent.com/diafygi/acme-tiny/master/acme_tiny.py
chmod 400 account.key
chmod 400 acme_tiny.py
chmod 400 subdomain.csr

The account.key is your private key that identifies you to Let's Encrypt. Keep it safe! The directory listing should look something like this:

ls -l
-r-------- 1 letsencrypt letsencrypt 9150 Feb  6 12:13 acme_tiny.py
-r-------- 1 letsencrypt letsencrypt 3247 Feb  6 12:44 private.key
-r-------- 1 letsencrypt letsencrypt 1594 Feb  6 12:38 subdomain.csr

Now exit to be root (or you can use sudo) and restart nginx:

service nginx restart

If there are no errors here, it's all good, otherwise look into /var/log/nginx/error.log for hints.

Script Execution

Now that NGINX is serving on port 80, it will be used to verify ownership of the subdomain. acme-tiny talks to the Let's Encrypt API, which replies with a random token. acme-tiny writes that token to the challenge directory, which NGINX serves on port 80. Let's Encrypt then fetches the token from the subdomain you supplied, and if it's there, your ownership is verified.

su - letsencrypt
python acme_tiny.py --account-key account.key --csr subdomain.csr --acme-dir /var/www/challenge/ > /var/www/challenge/subdomain.crt

All should go through without errors. If there are any, verify the directory paths and the file and directory permissions. Make sure the user "letsencrypt" has access to account.key, subdomain.csr and the challenge directory.
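When troubleshooting, it helps to know exactly which file NGINX will look for when Let's Encrypt requests a challenge URL. The sketch below mimics what the alias directive in the virtual host config does with the request path. It's a simplified model, not NGINX's actual code, and the function name and the token "example-token" are made up for illustration.

```python
import posixpath

LOCATION_PREFIX = "/.well-known/acme-challenge/"   # from the location block
ALIAS_DIR = "/var/www/challenge/"                  # from the alias directive


def challenge_file_for(url_path, prefix=LOCATION_PREFIX, alias_dir=ALIAS_DIR):
    """Mimic nginx `alias`: swap the matched location prefix for the alias dir."""
    if not url_path.startswith(prefix):
        return None  # request would be handled by another location block
    token = url_path[len(prefix):]
    if not token or "/" in token or token in (".", ".."):
        return None  # only plain token file names are expected here
    return posixpath.join(alias_dir, token)


print(challenge_file_for("/.well-known/acme-challenge/example-token"))
# /var/www/challenge/example-token
```

So the token files must land directly in /var/www/challenge/, which is the directory passed to --acme-dir, and the "letsencrypt" user needs write access there while www-data needs to be able to read them.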

NGINX requires concatenating the intermediate certificate to the freshly signed certificate from Let's Encrypt:

wget -O /var/www/challenge/lets-encrypt-x1-cross-signed.pem https://letsencrypt.org/certs/lets-encrypt-x1-cross-signed.pem
cat /var/www/challenge/subdomain.crt /var/www/challenge/lets-encrypt-x1-cross-signed.pem > /var/www/challenge/subdomain_chained.crt

That's it! It should now work after enabling the SSL/TLS settings in NGINX.

Enable TLS in NGINX

Modify the file /etc/nginx/sites-enabled/mysubdom to make it look like this:

server {
        listen 127.0.0.1:80;
 # uncomment if you want IPv6
        #listen [::1]:80;
        server_name subdomain.domain.com;

 # force all traffic to go to HTTPS instead of HTTP
        return 301 https://subdomain.domain.com$request_uri;
}

server {
        listen 127.0.0.1:443 ssl;
        #listen [::1]:443 ssl;
        server_name subdomain.domain.com;
        keepalive_timeout 70;

        # The certificate is for subdomain.domain.com only
        ssl_certificate /var/www/challenge/subdomain_chained.crt;
        ssl_certificate_key /etc/nginx/ssl/subdomain.key;

        add_header Strict-Transport-Security "max-age=63072000; includeSubdomains; preload";
        add_header X-Frame-Options DENY; #or "SAMEORIGIN" always;
        add_header X-Content-Type-Options nosniff;
        add_header Content-Security-Policy 'default-src https://subdomain.domain.com:443';
        add_header X-Xss-Protection '1; mode=block';


        root /var/www/subdomain;

        # Add index.php to the list if you are using PHP
        #index index.html index.htm index.nginx-debian.html;
        index index.html;

        # letsencrypt challenge directory to verify domain
        location /.well-known/acme-challenge/ {
                alias /var/www/challenge/;
                try_files $uri =404;
        }

        location / {
                # First attempt to serve request as file, then
                # as directory, then fall back to displaying a 404.
                try_files $uri $uri/ =404;
                autoindex off; #enable if you want file listing
        }

}

Notice how listening on port 80 (HTTP) has been shifted to its own segment while the rest uses HTTPS exclusively. Future certificate renewals can also go over HTTPS as long as the certificate is still valid. If not, revert the config to be as it was at the beginning.

Reload NGINX to read the certificates and make the settings active:

service nginx reload

Note: Reload reads the settings again without dropping connections. It's advised for live websites.

About Headers

Previously, I had the security headers in the main nginx.conf file, but that applies the same headers to all websites, which is neither scalable nor correct. According to Igor Sysoev (NGINX's creator), he designed the NGINX config not to inherit so that troubleshooting becomes simpler. Duplicating config is good because it makes it easy to find the problem when things go wrong. See the link for his talk below in the references.

This means that headers (and other configs) should be repeated for every virtual host you configure. If you configure a header in the main block in nginx.conf then define another (or modified) header in the subdomain block, the latter will take over and the first one will be ignored.
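The inheritance rule is easier to see in a tiny model. The sketch below is not NGINX code, just my understanding of the add_header rule expressed in Python: a level inherits the headers of the enclosing level only when it declares none of its own. The function name and the sample values are made up.

```python
def effective_headers(*levels):
    """Return the add_header set in effect at the innermost config level.

    `levels` are ordered outermost (http) to innermost (server or location);
    each is a dict of header name -> value declared at that level.
    """
    effective = {}
    for declared in levels:
        if declared:
            # Any add_header at this level replaces the inherited set entirely.
            effective = dict(declared)
    return effective


http_level = {"X-Frame-Options": "DENY", "X-Content-Type-Options": "nosniff"}
server_level = {"Strict-Transport-Security": "max-age=63072000"}

print(effective_headers(http_level, server_level))  # only the HSTS header survives
print(effective_headers(http_level, {}))            # both http-level headers are inherited
```

This is why adding a single header to a virtual host silently drops everything you set in nginx.conf, and why the full list is repeated in the server block above.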

About Security Headers

The SecurityHeaders service recommends using HTTP Public-Key Pinning, or HPKP for short (not to be confused with the OCSP stapling enabled in nginx.conf above), but there are operational concerns with it: pinning means hashes of your public keys are sent in a header, and the browser remembers them and refuses any certificate whose key doesn't match. This prevents a Man-in-the-Middle attack with a forged certificate, but if you ever replace your key without having pinned a backup first, returning visitors will be locked out of your site until the pins expire.

The headers also tell the browser to cache your public keys for a very long period (3+ months) to protect you against forged certificates that could come during that period, but since we're using Let's Encrypt certificates which expire every 3 months, it'll become hectic to manage the headers, aging and other aspects.

With all these concerns, I decided against adding Public-Key Pinning headers in my config. It is up to you to evaluate your case. See the references below for more details about the available options for HPKP in addition to the Content-Protection policies and the XSS protection policies, as they may affect your site when you want to load media/material external to your website.

Auto-Renewing The Certificate

LetsEncrypt issues certificates valid for 90 days only to combat spam and fraudulent uses of domains that have been neglected. That means the certificate needs to be renewed before 90 days expire.

As the user "letsencrypt" put the following in a shell script letsencrypt_renew.sh:

#!/bin/bash
python /home/letsencrypt/acme_tiny.py --account-key /home/letsencrypt/account.key --csr /home/letsencrypt/subdomain.csr --acme-dir /var/www/challenge/ > /var/www/challenge/subdomain.crt || exit

wget -O /var/www/challenge/lets-encrypt-x1-cross-signed.pem https://letsencrypt.org/certs/lets-encrypt-x1-cross-signed.pem

cat /var/www/challenge/subdomain.crt /var/www/challenge/lets-encrypt-x1-cross-signed.pem > /var/www/challenge/subdomain_chained.crt

Make sure every command is complete and on its own line. The styling here could break them into multiple lines.

This should be run as a cron job so set the permissions:

chmod 744 letsencrypt_renew.sh

Now run # crontab -e and add this line:

# LetsEncrypt cert renewal -- nginx will be reloaded by root in another cron job
1 1 27 * * test $(($(date +\%-m)\%2)) -eq 0 && /home/letsencrypt/letsencrypt_renew.sh

This will run the job every 2 months on the 27th day at 01:01 AM (even months of the year). Basically, every 60 days.
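To double-check that this schedule always stays inside the 90-day window, you can compute the gaps between runs. A quick sketch (the function name renewal_dates is made up, and it only models the schedule, not cron itself):

```python
from datetime import date


def renewal_dates(first_year, years=2):
    """Dates matched by day-of-month 27 in even-numbered months."""
    return [date(year, month, 27)
            for year in range(first_year, first_year + years)
            for month in range(2, 13, 2)]


runs = renewal_dates(2016)
gaps = [(later - earlier).days for earlier, later in zip(runs, runs[1:])]
print(min(gaps), max(gaps))  # 59 62
print(max(gaps) < 90)        # True
```

The longest gap is December 27 to February 27 at 62 days, which leaves about four weeks of slack if a renewal fails and you need to fix it by hand.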

Now exit the user "letsencrypt" and as root, run # crontab -e and add this line:

# m h  dom mon dow   command
2 1 27 * * test $(($(date +\%-m)\%2)) -eq 0 && /usr/sbin/service nginx reload

This will reload nginx at 01:02 AM, a minute after the certificate has been refreshed by the previous job.

Test Site Security and Settings

Now go to SSL Labs and test your website (https://subdomain.domain.com)

Then go to Security Headers and test your website (https://subdomain.domain.com)

References

I highly recommend visiting the sites below from bottom to top. I added them last to first in the order of pages I had on my tabs.

Saturday, January 23, 2016

16 Gb Brocade SAN Fabric Merge

Introduction

A customer with an existing setup from HP with HP-branded Brocade switches wanted to connect those switches to the newly acquired IBM setup (also using Brocade switches). The HP switches are the 24-port 8 Gb switches, and the IBM ones are 48-port 16 Gb switches. The final goal is to virtualize the HP storage behind the V7000 storage, but this will not be discussed in this post.

The HP SAN switches had existing configurations & were in production. The IBM switches also had configurations for an ongoing implementation.

To merge the SAN fabrics, there are 2 ways:

Wipe one of them (clear the config), disable it, then enable it. The config of the other switch will be written to this empty one.
Merge 2 different fabrics without wiping any data.

This post will address point (2), because I didn't want to re-do all the zoning from scratch. That's a waste of time. The steps will be done in command line (CLI), because I hate java.

Why Write This Post?

I was reading Brocade's forums and many were talking about using fabric merge tools and that the two fabrics must have different names, and there was a lot of wrong or outdated information that no longer applies to the new Fabric OS 7.x (new switch firmware).

Status

HP switches had Fabric OS (FOS) 7.1.
IBM switches had FOS 7.4.
HP switches had full fabric license.
IBM 48-port switches include the "Full Fabric" license by default, but it doesn't show with the "licenseshow" command. It's bundled & enabled by default.
HP switches had domain ID: 11 & 12.
IBM switches had default domain ID: 1.
Switch configuration name on HP was different from the one on IBM.
IBM switch 1 was connected to HP switch 1 using 1 FC cable, and IBM switch 2 to HP switch 2 using 1 FC cable.
IBM switches had 16 Gb SFPs. HP had 8 Gb SFPs. Speed of IBM SFP used for SAN connection was fixed to 8 Gb (no auto negotiate).

Requirements

Fabric OS has to be 6.x or 7.x on all switches connecting to each other. The minor version ".x" does not have to match, but it's recommended to keep the switches on the same level, if possible.
Full Fabric license must be available on 24-port switches. It's available by default on 48-port switches.
Change Domain ID from default value to a unique value. The 2 switches connecting to each other must have different Domain IDs.
Switch configuration names must be the same for the fabric to merge. If they are different, "Zone Conflict" error will show on the secondary switch.
If you have a lot of traffic going from one switch to another switch, it's advised to purchase the "Trunking License" to allow aggregating multiple FC ports/links together.
Aliases and zone names must be unique before merging the fabric. If you have similar alias names on the 2 different switches, you have to rename the aliases/zones on the secondary switch (the one that you can disable to merge the fabric).
Aliases that have the same WWN on both secondary and primary switches, must have the same name on both fabrics. This is a very unique case, but possible if you're virtualizing the WWNs of your servers.
Make sure switch date, timezone & time are all correct before you merge the switches. Changing the timezone requires a switch restart, so plan for the downtime.
Default user is 'admin' and default password is 'password'.
Do not connect any FC cables between the HP/IBM (different switches) until you're told to do so. Follow the steps exactly as shown below.
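The alias rules above are the ones that bite during a merge, so it's worth checking them before touching any cables. Here is a rough sketch of such a check. It assumes you've already copied the aliases from each fabric into a simple name-to-WWN mapping and that each alias has a single WWN member; the function name, alias names and WWNs are all made up.

```python
def find_alias_conflicts(primary, secondary):
    """Compare alias -> WWN maps from two fabrics and list merge blockers."""
    # Same alias name on both fabrics but pointing at different WWNs.
    same_name_diff_wwn = sorted(
        name for name in primary.keys() & secondary.keys()
        if primary[name].lower() != secondary[name].lower()
    )
    # Same WWN on both fabrics but known under different alias names.
    primary_by_wwn = {wwn.lower(): name for name, wwn in primary.items()}
    same_wwn_diff_name = sorted(
        (primary_by_wwn[wwn.lower()], name)
        for name, wwn in secondary.items()
        if wwn.lower() in primary_by_wwn and primary_by_wwn[wwn.lower()] != name
    )
    return same_name_diff_wwn, same_wwn_diff_name


# Hypothetical aliases, for illustration only.
primary = {
    "esx01_hba0": "10:00:00:00:c9:aa:bb:01",
    "v7000_n1p1": "50:05:07:68:00:00:00:01",
}
secondary = {
    "esx01_hba0": "10:00:00:00:c9:aa:bb:02",
    "node1_port1": "50:05:07:68:00:00:00:01",
    "eva_ctrl_a": "50:00:1f:e1:00:00:00:01",
}

names, wwns = find_alias_conflicts(primary, secondary)
print(names)  # ['esx01_hba0']
print(wwns)   # [('v7000_n1p1', 'node1_port1')]
```

Anything in the first list has to be renamed on the secondary switch; anything in the second list has to be given the primary's name.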

Steps

In the steps below, a line starting with "#" means it's a command you should type. Type the command without the "#" character.

Some steps will require rebooting the switch. Some will require disabling the switch more than one time, which makes it offline, and stops all storage access traffic. It's better to change the paths from the servers to the 2nd switch manually, or if you're sure the multipath drivers are working properly, you can disable server ports.

The primary switch is the one that will remain operational. The secondary switch is the one where we are making all these changes & can afford downtime.

Disable Ports

It's better to disable server ports, to prevent multipath driver from using the paths again when they're online, but before you finish your activity. Do this on ONE switch only! After you successfully merge fabrics on this switch, enable ports, then move to the 2nd switch. Do NOT disable ports on both switches at the same time, if you have active servers connected to the SAN switches.

List available ports and WWNs: # switchshow
# portdisable <port number>
Example: # portdisable 15
This will disable the 16th port (port numbering starts from zero)

Repeat this for all ports.

Change the Timezone

# date
This will show the current time, date & timezone. Example: Tue Jan 12 09:00:03 AST 2016. AST = Arabia Standard Time.
# tstimezone --interactive
Follow the prompts. Choose the continent, then the country.
After finishing, a message will say: "System Time Zone change will take effect at next reboot"
If time is not correct, change it before you reboot. See the steps below.
If the time is correct, you can now reboot the switch: # reboot

Change the Time and Date

date [MMDDhhmm[[CC]YY]]
MM = Month = 01, 02, ..., 12
DD = Day = 01, 02, ..., 31
hh = Hour = 00, 01, 02, ..., 23
mm = Minute = 00, 01, 02, ..., 59
CC = First two digits of the year = 20 for 2016
YY = Last two digits of the year = 16 for 2016
To change the time & date to Jan 23 2016 21:43:00 (9:43 PM)
# date 012321432016
Time change does not require a reboot. If you changed the timezone, you should reboot now.
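If you have several switches to set, it's easy to fat-finger the digit string. A small sketch that builds the argument in the MMDDhhmmCCYY layout described above (the function name fos_date_argument is made up):

```python
from datetime import datetime


def fos_date_argument(moment):
    """Format a datetime as MMDDhhmmCCYY for the switch `date` command."""
    return moment.strftime("%m%d%H%M%Y")


print("date " + fos_date_argument(datetime(2016, 1, 23, 21, 43)))
# date 012321432016
```

Pass datetime.now() instead of a fixed value to generate the command for the current time on your workstation, assuming its clock and timezone are the ones you want on the switch.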

Display Current Domain ID

# switchshow
Top of the output will show a line: switchDomain: 1
1 is the default value.

Change Domain ID

To change the Domain ID of a switch, the switch must be disabled first:
# switchdisable
This will take the switch offline and stop all traffic.
Start the configuration process to change switch parameters:
# configure
Fabric parameters (yes, y, no, n): [no] yes
Domain: (1..239) [1] <Unique ID must be different from the switch you will connect to>
Press Enter for all other parameters to use default values. No need to change any of them.
# switchenable

Rename Zone Configuration

You should rename the zone config to match the primary switch. The primary switch is the one that will remain operational. The secondary switch is the one where we are making all these changes.

# cfgshow
This will print current aliases, zones and zone config information. At the top, you'll see the config name:
Defined configuration:
cfg: HO_SANSW1_Top
The config must be disabled before you can rename it: # cfgdisable
Now, rename the config to be the same as the primary switch: # zoneobjectrename <current name>, <new name>
Example: # zoneobjectrename HO_SANSW1_Top, Production_SAN1
Remember, both primary (HP switch in my case) and secondary (IBM in my case) must have the same config name to be able to merge the fabrics.
Save the new config changes: # cfgsave
Run the command again to see the new config name: # cfgshow
Now activate the config: # cfgenable <config name>

Change Port Speed

All ports are disabled. We need to change the speed of the port to make it fixed instead of using auto negotiate. This must be done on both primary and secondary switches.

# portcfgspeed <port number> <speed>
Example: # portcfgspeed 35 8
This will fix the speed of port 35 to 8 Gbps, the highest speed the 8 Gb SFPs on the HP side support. Auto negotiation will be disabled.
Do this on the port that will connect each primary SAN switch to each secondary SAN switch.
Keep the port disabled on the secondary switch.
Enable the port on the primary switch: # portenable <port number>
Connect your Fibre Channel cables to the ports.

Merging The Fabrics

First, save the current zone names of the secondary switch in a text file. We will need them after this step: # cfgshow
Copy the output and save it in a text/word file.
On the secondary switch, disable the config: # cfgdisable
Now enable the port connecting the secondary & primary switches: # portenable 35
Wait 10-30 seconds before proceeding to give enough time for the link to establish and the 2 switches to talk.
Disable the secondary switch to make it the slave and to add the config from the primary:
# switchdisable
Enable the secondary switch: # switchenable
Wait 10-50 seconds, then check the switch: # switchshow
You should see in the line of the port connecting the switches something like this:
35 35 1f2300 id 8G Online FC E-Port 10:00:00:xx:xx:xx:xx:xx "" (upstream)
Wait some time and the name of the primary switch will appear between the double quotes.
You should also see both switches in the same fabric now: # fabricshow
This should show the names of the primary & secondary switches.
If you type # cfgshow it will show all zones and aliases from both switches, but only those from the primary are in the active config.
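
If you save the switchshow output to a file, the inter-switch link check can be scripted. This is a minimal sketch that assumes the column layout of the single sample line above (index, port, address, media, speed, state, protocol, port type); other FOS versions may print different columns, and the function name is made up for illustration.

```python
import re


def parse_isl_line(line):
    """Pull the interesting fields out of one switchshow port line."""
    fields = line.split()
    # The neighbor switch name sits between double quotes once it is known.
    name = re.search(r'"([^"]*)"', line)
    return {
        "port": int(fields[1]),
        "speed": fields[4],
        "online": fields[5] == "Online",
        "e_port": "E-Port" in fields,
        "neighbor": name.group(1) if name else "",
    }


sample = '35 35 1f2300 id 8G Online FC E-Port 10:00:00:xx:xx:xx:xx:xx "" (upstream)'
info = parse_isl_line(sample)
print(info["online"] and info["e_port"])  # prints True
```

An empty "neighbor" value simply means the primary switch name has not appeared between the quotes yet.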

Enabling Zones of Secondary Switch

The fabrics are now merged, but the zones of the secondary switch are not in the active config yet. We need to add them to the config and enable the config.

Open the text file of the zone names (cfgshow output) from the previous step.
To add the zones, type the command: # cfgadd "<config name>", "zone1; zone2; zone3"
Notice it's a semicolon between the zone names. You can add multiple zones at the same time to the active config.
If you're lazy and java works for you, you can use the graphical interface to select the zones and add them to the config.
When done, type: # cfgsave
Press "y" to save it.
Then type: # cfgenable <config name>
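
With many zones, typing the semicolon-separated list by hand is error-prone. Here is a small sketch that builds the cfgadd line from the zone names saved in the text file. The zone names in the example are placeholders; Production_SAN1 is the config name used earlier in this post.

```python
def build_cfgadd(config_name, zone_names):
    """Build a cfgadd command line with semicolon-separated zone members."""
    if not zone_names:
        raise ValueError("need at least one zone name")
    members = "; ".join(zone_names)
    return f'cfgadd "{config_name}", "{members}"'


# Placeholder zone names; use the ones from your saved cfgshow output.
print(build_cfgadd("Production_SAN1", ["zone1", "zone2", "zone3"]))
# prints: cfgadd "Production_SAN1", "zone1; zone2; zone3"
```

Paste the resulting line into the switch CLI, then continue with cfgsave and cfgenable as above.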

Congratulations! Now all zones are active from both switches. The ports are still disabled, though, so let's enable them.

Enable Ports

List available ports and WWNs: # switchshow
# portenable <port number>
Example: # portenable 0
This will enable the first port (port numbering starts from zero).
Repeat this for all ports.
You can now check your servers and storage and all links should be operational.
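
To avoid typing the command once per port, you can generate the whole list and paste it into the SSH session. This is only a sketch: the port count of 48 is an assumption, so use the number of ports that switchshow lists on your switch.

```python
def portenable_commands(port_count):
    """Return one portenable command per port, numbered from zero."""
    return [f"portenable {port}" for port in range(port_count)]


# Assumed 48-port switch; adjust to your hardware.
for command in portenable_commands(48):
    print(command)
```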

Congratulations! You're now done with the first switch connectivity. Make sure your links are stable, then move on to the remaining switches.

Errors

Zone Conflicts and Segmentation

For some reason, the switch showed "segmented" and "zone conflict" messages and upon a reboot, all ports were disabled. Trying to enable a specific port gave the error: "Port 35: Port enable failed due to unknown system error"

I rebooted the SAN switch again and the ports (and the switch) came back online. It looks like it froze at some point and needed another reboot. If this happens often, upgrade FOS to the latest stable version. For me, it only happened once.

If you still get "zone conflict" after finishing all the steps, then the two switches have aliases with the same WWN but different names. To fix it, rename the alias on one switch using the "zoneobjectrename" command as shown above.
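
Spotting such aliases by eye in two long cfgshow listings is tedious. The sketch below compares two alias tables and reports every WWN that carries a different alias name on each switch. It assumes you have already copied the aliases into dictionaries of alias name to WWN (one WWN per alias, for simplicity); the alias names and WWNs in the example are invented.

```python
def find_alias_conflicts(primary, secondary):
    """Return (wwn, primary_name, secondary_name) for WWNs aliased differently."""
    # Index the primary switch by WWN, ignoring letter case.
    by_wwn = {wwn.lower(): name for name, wwn in primary.items()}
    conflicts = []
    for name, wwn in secondary.items():
        other = by_wwn.get(wwn.lower())
        if other is not None and other != name:
            conflicts.append((wwn.lower(), other, name))
    return sorted(conflicts)


# Invented example data.
primary = {"ESX01_HBA0": "10:00:00:00:00:00:00:01", "V7000_N1P1": "50:05:00:00:00:00:00:0A"}
secondary = {"esx1_p0": "10:00:00:00:00:00:00:01", "V7000_N1P1": "50:05:00:00:00:00:00:0a"}
print(find_alias_conflicts(primary, secondary))
# prints: [('10:00:00:00:00:00:00:01', 'ESX01_HBA0', 'esx1_p0')]
```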

Unstable Ports

I was unlucky enough to have unstable ports. The link kept flapping between online and offline, sometimes connecting at 16 Gbps and sometimes at 8 Gbps (before I fixed the speed to 8 Gbps). This also prevented the switches from forming a fabric connection.

First clear the stats to not carry any old data: # portstatsclear <port number>, then you can check your port statistics by issuing the command: # portshow <port number>

In the output, very large numbers in any of these parameters usually point to a bad SFP, cable or port:

Unknown
Parity_err
2_parity_err
Link_failure
Loss_of_sync
Loss_of_sig
Invalid_word
Invalid_crc

In my case, I had to change 2 SFPs, one on the old HP SAN switch and one on the new IBM SAN switch. I also had to change the port slot on the old HP switch because the port slot itself had problems. I'm glad the FC cable was good.
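
If you capture the portshow output to a file, the counter check can be automated. This is a rough sketch with several assumptions: counters appear as "Name: value" pairs (possibly several per line), the link failure counter is spelled Link_failure, and 1000 is an arbitrary cut-off for "very large". Adjust all of these to what your switch prints.

```python
import re

# Counter names from the list above; the threshold below is an assumed cut-off.
WATCHED = {
    "Unknown", "Parity_err", "2_parity_err", "Link_failure",
    "Loss_of_sync", "Loss_of_sig", "Invalid_word", "Invalid_crc",
}


def suspicious_counters(portshow_text, threshold=1000):
    """Return {counter: value} for watched counters at or above threshold."""
    found = {}
    for name, value in re.findall(r"(\w+):\s*(\d+)", portshow_text):
        if name in WATCHED and int(value) >= threshold:
            found[name] = int(value)
    return found


# Invented sample, not real output from my switches.
sample = """\
Unknown:       0        Link_failure: 1542
Parity_err:    0        Loss_of_sync: 98731
2_parity_err:  0        Loss_of_sig:  12
"""
print(suspicious_counters(sample))
# prints: {'Link_failure': 1542, 'Loss_of_sync': 98731}
```

Run it once after portstatsclear and again a few minutes later; counters that keep climbing are the ones worth chasing.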

References

Implementing IBM b-Type SAN with 8 Gbps Directors and Switches: http://www.redbooks.ibm.com/abstracts/sg246116.html?Open
Fabric merge is section 13.2 - page 636 (pdf) / 608 (redbook)

Wednesday, January 6, 2016

Lenovo G8272 and EN4093R Invalid Signature Firmware Upgrade Problem

While trying to upgrade the firmware of brand new Lenovo G8272 switches from the initial release of 8.2.1.0, I got an error after uploading the new firmware:

Failure: image contains invalid signature.
G8272(config)#
Feb  9 18:58:41 G8272 ERROR   mgmt: Firmware download failed to image1

I found only 2 results online, and both pointed at changelogs that mention the issue has been fixed, but not how! I contacted a great person within Lenovo who checked internal documents, and it turned out that this issue affects G8272 and EN4093R switches manufactured in December 2015 (specifically, 12th week of 2015). (Thank you Zeeshan!)

Cause

"The switch software uses it hardware serial number and the public keys on its kernel file system to generate a private key to decrypt the OS or Boot image being uploaded to it and then proceeds to install it. If the serial number of the switch is changed for some reason, the combination of the hardware serial number and the public keys will fail to generate the appropriate private key to decrypt the uploaded image and reports that the image has an invalid signature."

In my case, the switches were fresh & no one had changed any serial number, but they were still affected.

Fix

"In order to remedy this situation, the way out is to remove the public keys installed on the kernel file system and reboot the switch. During reboot, the switch will generate new set of public keys using the current serial number. With these newly generated public keys, the switch will be able to compute the proper private key to decrypt the uploaded images."

Requirements

Serial cable (mini-USB that came with the switch)
Serial-to-USB kit (you have to buy this on your own)
CAT5E or CAT6 STP or UTP cable
New firmware (8.2.4.0 as of this writing)
PuTTY or your favorite serial/telnet/ssh tool
admin password (default is admin:admin)
ftp/tftp server software. I suggest 3CDaemon (FTP & TFTP) or Filezilla (FTP & SFTP).

On a Flex chassis, you should enable Serial Over LAN (SOL) from the Chassis Management Module (CMM) to be able to access the serial port of the switches. Connect the UTP cable to the CMM port, not the switch.

I highly recommend configuring the management port (RJ45) and using it for the firmware upload, since it is much faster: uploading one OS image takes 45 minutes over the serial connection, while it takes 1 minute on the management port via Ethernet.

Note: The initial firmware (8.2.1.0) does not support SSH. However, SSH is enabled by default once you upgrade to 8.2.4.0. Make sure you disable HTTP & Telnet after the upgrade.

Procedure

Any line that starts with # is a command to be typed (without the # sign).

Connect to serial port on the switch (mini-USB port)
Login as admin user
Reboot the switch: #reload
When the switch shows Memory Test, press Shift+t to enter Manufacturing Mode.
U-Boot 2009.06 (Feb 23 2015 - 07:27:18)

CPU0: P2020, Version: 2.1, (0x80e20021)
Core: E500, Version: 5.1, (0x80211051)
Clock Configuration:
       CPU0:1200 MHz, CPU1:1200 MHz,
       CCB:600 MHz,
       DDR:400 MHz (800 MT/s data rate) (Asynchronous), LBC:37.500 MHz
L1:    D-cache 32 kB enabled
       I-cache 32 kB enabled
Board: Networking OS RackSwitch G8272
I2C:   ready
DRAM:   DDR: 4 GB

Memory Test ..........

Manufacturing Mode
FLASH: 16 MB
L2:    512 KB enabled
PCIe1: Root Complex of PCIe, x2, regs @ 0xffe0a000
PCIe1: Bus 00 - 01
MMC: FSL_ESDHC: 0
Note : Operational Mode has changed.
Net:   eTSEC1, eTSEC2 [PRIME]

Booting OS
Once the OS boots, enter the admin password (default is admin)
You should now be at the prompt where it says: Diagnostics#
From the Diagnostics# prompt, enter the Linux shell: #linux
List the filesystem to see if there are existing public encryption keys: #ls /user/*.pem
> ls /user/*.pem
/user/development_key.pub.pem /user/production_key.pub.pem
The two files above should show. Delete them: #rm /user/*.pem
That's it. Now quit by typing q: #q
Now reboot: #/boot/reset
Press "y" to confirm rebooting. The switch will now reboot and generate new keys to match the current hardware serials and whatnot.
Now connect via Ethernet (or configure an IP interface on the management port then connect) and upgrade the switch
#copy tftp image1 address 192.168.70.13 filename G8272-8.2.4.0_OS.man mgt-port
Change tftp to match whichever protocol you're using.
Change 192.168.70.13 to match your machine's IP where the TFTP/FTP server is running.
Change G8272-8.2.4.0_OS.man to match the file name.
You'll be asked if you want to make image1 the default boot image; press y.
Repeat the same step above for the 2nd image: image2. Do NOT select it as the default image.
Now upload the Boot image:
#copy tftp boot address 192.168.70.13 filename G8272-8.2.4.0_Boot.man mgt-port
We're done. If you have any config unsaved, type: #write
Now that you're done, reboot the switch: #reload
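
If you have several switches to upgrade, the three upload commands can be generated instead of retyped. This sketch only reproduces the copy syntax used in the steps above. It assumes the same syntax works when you swap tftp for another protocol, which I have not verified, and you still have to answer the interactive prompts on the switch yourself.

```python
def upgrade_commands(server_ip, os_file, boot_file, protocol="tftp"):
    """Return the copy commands for image1, image2 and the boot image."""
    return [
        f"copy {protocol} image1 address {server_ip} filename {os_file} mgt-port",
        f"copy {protocol} image2 address {server_ip} filename {os_file} mgt-port",
        f"copy {protocol} boot address {server_ip} filename {boot_file} mgt-port",
    ]


for command in upgrade_commands(
    "192.168.70.13", "G8272-8.2.4.0_OS.man", "G8272-8.2.4.0_Boot.man"
):
    print(command)
```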

Congratulations.

Tip: You may want to change the switches' timezone, date & time (in that exact order). The defaults date back to Feb 2015.

IBM POWER8 Networking via Direct Attach Cables

I recently had a project where my company sold POWER8 servers to the customer along with some Lenovo servers and Lenovo G8272 network switches. The switches have 48x 1/10 Gb ports + 6x 40 Gb ports.

To save on cost, it's possible to use Direct Attach Cables (DACs) to connect servers to the switches without buying SFPs or fiber cables. List price comparison:

Lenovo 10GBASE-SR SFP+ Transceiver (46C3447) = $629
Lenovo 5m LC-LC OM3 MMF Cable (00MN508) = $58
To connect 1 server (4 ports) to switches (4 ports) = 8x $629 + 4x $58 = $5,264.

In contrast, with DACs, you only need 1 cable which includes the SFPs (copper):

Lenovo 5m Passive SFP+ DAC Cable (90Y9433) = $210
Lenovo 5m Active DAC SFP+ Cable (00VX117) = $290
Active cables are often used for switch-to-switch connectivity.
To connect 1 server (4 ports) to switches (4 ports) = 4x $210 = $840.

DACs come to about 16% of the cost, or roughly 6.3 times cheaper. These prices are based on publicly available list prices. They might be different depending on your region and distributor.
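
For anyone who wants to redo the comparison with their own prices or a different number of links, here is the arithmetic as a short Python sketch. The prices are the list prices quoted above; the function name is only for illustration.

```python
SFP_PRICE = 629    # Lenovo 10GBASE-SR SFP+ Transceiver (46C3447)
FIBER_PRICE = 58   # Lenovo 5m LC-LC OM3 MMF Cable (00MN508)
DAC_PRICE = 210    # Lenovo 5m Passive SFP+ DAC Cable (90Y9433)


def link_costs(links):
    """Return (optical_cost, dac_cost) for the given number of links."""
    # Each optical link needs one SFP at each end plus one fiber cable.
    optical = 2 * links * SFP_PRICE + links * FIBER_PRICE
    # Each DAC link is a single cable with both ends built in.
    dac = links * DAC_PRICE
    return optical, dac


optical, dac = link_costs(4)
print(optical, dac)                # prints 5264 840
print(f"{dac / optical:.0%}")      # prints 16%
print(f"{optical / dac:.1f}x")     # prints 6.3x
```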

The POWER8 servers (S822) have the following Ethernet adapter: EN0U -- PCIe2 4-Port (10Gb+1GBE) Copper SFP+RJ45 Adapters. According to the redbook (guide), these adapters require Active Copper DACs.

I actually used the Passive DACs that I used for the Lenovo servers, and the cables worked just fine. The AIX team configured 2 Virtual Input/Output Servers (VIOS) on each POWER8 system, and each POWER8 system had 4 of these adapters. We also configured LACP for each VIOS, so the total bandwidth available to each VIOS was 40 Gb.

So even though the redbook says that Active DACs are required, the passive ones work just fine. Also, the redbook only lists 1 meter, 3 meter & 5 meter cables (since they're active) and makes no mention of passive cables.