Friday, April 17, 2009

TPSA: When Everything Goes Right

I was going to write about a networking story, but yesterday turned out to be a day never to be forgotten, hence this write-up.

TPSA: The Perils of System Administration -- A series of true stories about system administration. The first story is here.

Yesterday, Thursday April 16th, we had a scheduled maintenance job on our production servers. The cluster wasn't functioning as it should, and yesterday's tasks were aimed at rectifying the problems once & for all and updating the installed software.

This was our time-plan for each task on the list:
1400 to 1410:
- Shutdown SAP on the DB and APP servers
1410 to 1440:
- Take a database backup
- Shutdown the database
1440 to 1500:
- Take a filesystem backup (all filesystems on both nodes)
- Change mountpoints for High Availability
- Cluster switchover testing
1500 to 1700:
- Shutdown SAP and the database
- Perform the kernel upgrade on the SAP CI
- Start the database and SAP on the CI instance ONLY (not the APPS)
- Perform technical testing on the CI
- Check the exe mountpoints
- Start SAP on the APPS
- Perform technical testing on the APPS
1700 to 1730:
- Import the ST-A/PI patch 01L_ECC600
- Change SAP parameters based on the document
- Review the parameters on CI and APPS (memory and work processes)
- Restart the SAP CI and APPS
- Perform technical testing on CI and APPS


All tasks were easy and planned out, with all the team members properly lined up:
- ERP software consultant (joined by his colleague later)
- ERP offshore support consultant
- AIX Unix consultant from IBM Kuwait
- Myself

We grabbed lunch and some snacks around 1230 and headed to the Head Office (H.O.).

@1401: ERP Applications were stopped

@1406: We started a full offline backup of the production database

@1437: The backup reached its final stage, then threw an error, stopped, and gladly deleted the backup

*NO!! THIS IS NOT THE TIME!! PLEASE!*

After being stunned and depressed for 2 minutes, I thought of stopping the database and starting it again. Using the user db2ehp, I did that, and when I tried to start the database again, it threw an error!

*JAWS DROPPED -- LAAAAAAAAA!!*

We had faced the same error a few days back due to some user profile changes, so we had to run the command "/usr/opt/db2_08_01/instance/db2iupdt db2ehp" -- I ran that command and it puked an error ...
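For reference, what that repair is supposed to look like, roughly (a sketch from memory; db2ehp is the instance owner, and the commands run as root unless noted):

    su - db2ehp -c "db2stop force"                  # make sure the instance is down
    /usr/opt/db2_08_01/instance/db2iupdt db2ehp     # update/repair the instance -- this is the command that puked the error
    su - db2ehp -c "db2start"                       # then bring the instance back up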

*NOOOO!! WHY WHY!! IT WAS FINE WHY NOW!! I HATE YOU!!*

I navigated around and went to the home directory of the user db2ehp to check the environment variables and profiles, only to find out that these files had been corrupted and turned into binary garbage.

*SHOCKED*
*EYE TWITCHES*
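A quick way to confirm that kind of damage (a minimal sketch; the exact file names are assumptions based on a typical SAP-on-DB2 instance home):

    cd /home/db2ehp
    file .profile .dbenv_*.sh sqllib/db2profile    # reports "data" instead of "ascii text" for a trashed script
    cat -v .profile | head                         # shows the binary garbage as control characters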

I FTPed to the secondary database node, looked for similar files, then decided to copy the whole directory and rename the files to match the hostname of the primary node.

It didn't work for some reason, even though all the environment variable scripts had proper values...

I called the company responsible for supporting our backup software (TSM), and in about half an hour their consultant provided me with a command line to recover a certain directory to a specific location.

Then we restored from our backup software that whole directory: "/usr/tivoli/tsm/client/ba/bin/dsmc restore /home/db2ehp/ -subdir=yes /bkfs2/restoreyaman/"

@1620: All files were recovered, and now we were able to switch to the user db2ehp properly.
I ran the command "/usr/opt/db2_08_01/instance/db2iupdt db2ehp", then started the database, and it worked!

*YESSS!!*

The database backup started again and we waited till it finished.

The DB backup finished and we wanted to take a backup of the filesystems through TSM. Going back & forth, we eventually invoked it manually through TSM, but it timed out & didn't work. Error in communication through TCP/IP, it said!

*THE IP IS WORKING! I CAN PING AND LOGIN WHY CANT YOU!!???? BLOODY $##^%^@*

@1743: The IBM engineer arrived and suggested that, since the directory to be changed was very small, we just copy it somewhere else. So we copied the directory we were going to change to a remote filesystem, using "cp -pR" to preserve permissions.
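Something along these lines (the destination path is a made-up example):

    cp -pR /db2/db2EHP /bkfs2/mountpoint_copy/    # -p preserves ownership, modes & timestamps; -R copies recursively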

Now that everything is ready to be changed, the offshore support was contacted and their consultant logged in to our server through VPN and did the required changes.
The changes were simple: One of the filesystems was part of the cluster resources and its mount point was incorrect. We simply had to change the mount point from "/db2/db2EHP" to "/db2/EHP/db2ehp".
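On a standalone AIX box that change is basically a chfs call (a sketch; since this filesystem was a cluster resource, the consultant presumably also updated the HACMP resource definition, which isn't shown here):

    umount /db2/db2EHP                      # the filesystem has to be unmounted first
    mkdir -p /db2/EHP/db2ehp                # make sure the new mount point directory exists
    chfs -m /db2/EHP/db2ehp /db2/db2EHP     # change the mount point attribute in /etc/filesystems
    mount /db2/EHP/db2ehp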

After that step was done, the IBM dude synchronized the changes between the nodes of the IBM AIX cluster. That was half the work -- now we just needed to make sure that we could fail over back & forth, and then we would proceed with patching the ERP software to the latest version.

@1805: We crashed the primary node, to simulate a failover from the primary DB node to the secondary, tested our ERP software, and it was working.
After crashing the primary node, we booted it up again from the management console and left it to come up. Meanwhile, we checked that the DB was working properly on the 2nd node and that the ERP software was able to communicate with it, and everything was fine.

@1815: The primary DB node still hadn't come up. Fishy. Upon checking on it, it seemed to be stuck, so we restarted it again.

@1825: The machine didn't come up after 10 minutes, which was very suspicious... After checking the management console, we found it stuck at code "0557" -- some Googling later and, to our luck, the code meant problems with the filesystem....

The system couldn't boot because, from what it seemed, the root filesystem (/) was corrupted, hence the operating system couldn't load.

*EYE TWITCHES*
*WHY ???? WHYYYYY??? WHY NOW? WHY ME? *

I was quite disappointed, since I had never expected to face such an issue with AIX, and on a p5 series machine. Even the IBM dude was shocked.

The IBM engineer said he can proceed with the procedure we found here:
http://www.docstoc.com/docs/2801670/AIX-BOOTING-PROBLEM -- page 5

But he said that since his job isn't support, it would be better for us to call IBM's support line and log the call; whatever instructions they gave, he would execute.

I called IBM's branch in Kuwait and dialed the extension which usually takes me to their support in the UAE. No one answered... I called 2 more times, to no avail.

The IBM dude called a colleague and he gave him another extension for off-hours support. We called that extension and someone picked up! (OH JOY)

I told the support dude my company's name & that we're from Kuwait, and he asked me what the problem was & to log an issue by sending an email. I sent the email.

@1953: I received an email from him asking for my company's name, again.

I replied to it and waited for another 10 minutes. Then I called the extension again and asked him what was going on; he said that our support contract had expired in 2005.

*HUH?!*

I told him that we bought the hardware in November 2007!! And that its support contract hadn't expired yet!! He insisted that there was no data on their end to support my claim, and we argued for a good 10 minutes.
The IBM dude with us intervened and said that my claims were correct and that he had been present during the purchase and commissioning of these boxes, but the dude in the UAE said that if their database doesn't show such a thing, there's nothing he can do.

@2024: He emailed me instructions to contact IBM Europe, adding that if they were to help us, they would charge $360 an hour, with a minimum of two hours!

According to all the links we found on Google, the error code points at a corrupt filesystem, so we knew exactly what the problem was, and there was no point in contacting or paying IBM Europe.

@2035: We proceeded to load the first AIX DVD, boot from it into recovery mode on the primary node, and follow the instructions in the document above.

All filesystems were corrupted. ALL of them.

*I looked at the IBM dude and said: I'm this close to sit in a corner and cry*

The AIX guru started fixing them one by one, and all got repaired (including root filesystem), except one: /home, which contains the startup and environment scripts for the database...

*CRAP! but at least the root filesystem is sane!*

We rebooted the machine, entered recovery mode again, and ran fsck (filesystem check) once more to make sure all the filesystems were fine now... All were, except /home -- it was a goner. It couldn't be recovered anymore. The logical volume was corrupt beyond recognition.
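From the recovery shell, the checks were presumably along these lines (a minimal sketch assuming the default AIX rootvg logical volume names):

    fsck -y /dev/hd4       # /
    fsck -y /dev/hd2       # /usr
    fsck -y /dev/hd9var    # /var
    fsck -y /dev/hd3       # /tmp
    fsck -y /dev/hd1       # /home -- the one that kept failing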

The IBM engineer then made sure that the system was bootable from both disks, by issuing the respective commands to rewrite the boot images and boot records on them.
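On AIX that usually means bosboot and bootlist (a sketch; the disk names are assumptions):

    bosboot -ad /dev/hdisk0            # rebuild the boot image on the first rootvg disk
    bosboot -ad /dev/hdisk1            # and on its mirror
    bootlist -m normal hdisk0 hdisk1   # make sure both disks are in the normal boot list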

We exited recovery mode so the machine would boot in normal mode, and we looked anxiously at the error code display... As soon as it passed and the system started to come up, we jumped for joy and hugged.

@2050: Now that we had most of the filesystems working, we decided to back them all up to a DVD (sysback).

@2105: The backup to the DVD failed. Apparently the unix box only likes DVD-RAM media. Luckily, there was an option to take a backup over the LAN to TSM.

@2115: We then proceeded to destroy the corrupt filesystem and its evil logical volume, then create a fresh one and import the /home directory contents to it from TSM: "/usr/tivoli/tsm/client/ba/bin/dsmc restore /home/ -subdir=yes /home/"
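For completeness, the rebuild went something like this (a sketch; whether it was JFS or JFS2, the logical volume name hd1 and the size are assumptions -- the restore line is the one quoted above):

    rmfs -r /home                          # remove the broken filesystem, its mount point and its logical volume
    mklv -y hd1 -t jfs2 rootvg 32          # recreate the logical volume (32 partitions is just a placeholder)
    crfs -v jfs2 -d hd1 -m /home -A yes    # create a fresh filesystem on it, auto-mounted at boot
    mount /home
    /usr/tivoli/tsm/client/ba/bin/dsmc restore /home/ -subdir=yes /home/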

@2135: After restoration was done, we took another full backup of the root volume group (which includes the new /home filesystem).

Then we rebooted the primary node to make sure that the filesystems persisted. It hadn't come up after 5 minutes... When we checked, it turned out the IBM dude had forgotten to abort booting from the CD, and the screen was stuck at that menu. After exiting it, the machine booted normally.

@2145: We rebooted one more time, just to be sure, and everything went fine. At this point, we no longer needed the offshore support nor the IBM engineer since their job was done. The offshore support logged off & the IBM unix guru left, with many warm thanks from me.

Then we proceeded to patch the ERP software; basically, it's just a compressed file with the new binaries and an installer script. We already had the stuff uncompressed on a remote filesystem (NFS), so we just mounted that, renamed the old directory (exe) to "exe_old", then copied the files to the proper location.
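In shell terms, the patch step boiled down to something like this (a sketch; the NFS staging host and its mount point are made-up names):

    mount nfshost:/export/sap_kernel /mnt/newkernel   # hypothetical share holding the uncompressed patch
    cd /sapmnt/EHP
    mv exe exe_old                                    # keep the old kernel around, just in case
    mkdir exe
    cp -pR /mnt/newkernel/* exe/                      # copy the new binaries in, preserving ownership & permissions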

It should be noted that the filesystem we applied the patch to is exported as a network filesystem (NFS) to the other nodes.

As the ERP software came up on the primary node, we started it on the 2nd node, but it crashed...

The ERP dudes tinkered around and found out that all nodes except the primary ERP one were using the old (pre-patch) files !!!

*EYE TWITCHES*
*WHAT THE!!*

We found out a few minutes later that, for some reason, the NFS mount was still pointing at the old directory, which we had renamed! So it seemed that even if you rename the directory, NFS keeps tracking it -- which makes sense in hindsight: NFS file handles refer to the underlying filesystem objects rather than to path names, so the clients that had the directory mounted kept following it under its new name. (Maybe we should've stopped NFS before doing the renaming?)

I proceeded to re-export the NFS directories, in an attempt to refresh any links to them. It didn't work, and now the other nodes were getting an error for this particular NFS mount:
"df: /sapmnt/EHP/exe: A file, file system or message queue is no longer available."

*EYE TWITCHES*
*NO MORE PROBLEMS, PLEASE!!! LET US FINISH AND GO HOME!!*

I stopped the NFS service & started it again. No use. Deleted the old exported directory settings from NFS, added them again, then restarted NFS. No use.

In the end, we decided to try moving the files out of the old directory into a temporary one, putting the new files in the renamed directory "exe_old", then renaming it back to "exe".
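Roughly, the workaround was (a sketch; the stash and throwaway directory names are made up, and the exact order is from memory):

    cd /sapmnt/EHP
    mkdir /tmp/old_kernel                  # made-up stash location for the pre-patch binaries
    mv exe_old/* /tmp/old_kernel/
    cp -pR exe/* exe_old/                  # put the new binaries into the directory the NFS clients still track
    mv exe exe_unused                      # made-up name -- move the freshly created dir out of the way
    mv exe_old exe                         # give the tracked directory its original name back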

It worked!!!!! And I laughed hysterically, not believing what happened and the "solution"

During all this, we had to take down the cluster resources before modifying anything, since all nodes point at the shared NFS filesystem, even though it's not part of the cluster resources!!!

We had to stop & start the cluster services about 5 times before we figured out the solution above.

We brought up all the systems and the ERP guys applied application-level patches & plugins (yes, more of them...)

@0057: I took the ERP dudes to their hotel and I went home.

Tuesday, April 14, 2009

The Perils of System Administration

Whenever you join a company, visit the IT department and you'll find a group of guys calling themselves Systems Administrators, or Systems Engineers (depending on their rank).
These are the guys that keep your IT services running, and even if you see them slacking, playing chess or hide-and-seek, making fun of users, or sleeping at their desks, you can almost always count on them when one of the servers goes down; they will stay at work and use toothpicks to keep their eyes open until your precious(ssss) services are up & running.

I have come across two kinds of admins: those who have ethics and those who don't. In times of crisis you can tell which is which, even if the unethical one was a cunning fox during the casual days of duty.

Ethics dictate that you state clearly what you know & what you don't, take responsibility for your actions, be loyal to your employer, don't abuse your power, and do your job as you should.

I'll share a couple of stories here to further show the dedication, demand and abuse that IT administrators are subjected to.

Sleepless Nights: Data? What Data?


Three weeks ago we had a scheduled maintenance task, starting Thursday at 1400 hours (2 PM) till 1800 (6 PM). The scope of this task was to fix our database server cluster, on which our Enterprise Resource Planning (ERP) software runs. This means HR, Finance, Warehouses, and Sales are all dependent on it.

We stopped the system at 1405 and took an offline full backup of the database before working, then we proceeded to verify the backup to make sure it was consistent.

At 1435, the backup was done, and we proceeded with the ERP system's offshore support to fix the cluster problems on the secondary/standby node.

One thing led to another, and we ended up staying till 2200 (10 PM) and planned to continue working on Friday starting at 0800, hoping to finish before lunch time.
The offshore guys were still logged in through VPN from India and continued to dig around for a few more hours.

On Friday I was at the Head Office (HO) at 0750, contacted the offshore support, and we picked up where we had left off. Around 1100, we got both nodes to work; we did 2 failover tests and the database worked fine, until we switched back to the primary node.

Everything just went downhill from there...

The database got stuck in an infinite loop in recovery mode. What is recovery mode, you say?
Well, it crashes, then comes up again trying to start, then crashes, and so on.
These continuous cycles caused the error dump filesystem to fill up, which caused another crash at a higher level, stopping the recovery cycle and ending with a non-working database server.

Around 1600, we were still trying to bring the database up after investigating many error logs of the database and the operating system.

A few more futile attempts were made to run the database, after increasing the size of the error dump filesystem.
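On AIX, growing that filesystem is a one-liner (a sketch assuming JFS2; the mount point here is a made-up example of where the DB2 dump area might live):

    chfs -a size=+2G /db2/EHP/db2dump     # grow the dump/diagnostic filesystem by 2 GB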

Around 2100, we realized that our database has been corrupted. No more data. No more business.

Enter panic mode.

We knew we had a safe full backup taken after the business closed, so we wouldn't be losing any changes. Now it was all about recovering the database and making sure the ERP software was working; then we could sleep.

I forgot to mention that since we couldn't leave the place, I had a friend of mine bring us lunch to work, and that was the only meal we had that day. THANKS HISHAM!!!

Anyway, we raised a support ticket to the ERP software vendor (SAP) with the highest priority possible and they called me within 30 minutes from Germany. They verified that it is indeed a top priority problem and they assigned one of their elite support guys to help us.

Around 0200, Saturday, we decided to dump the existing, now corrupted, database and import the backup.

At 0300 Saturday, we had wrapped up the unfinished work of the cluster failover task, and we were just happy that the data and the systems were up & running for the business to use, since my company works on Saturdays.



More stories to come...

Monday, April 13, 2009

BumpTop: The New Desktop Experience!

I got this through the mailing list, and I have to admit that even though I'm not a graphical interface kind of person, the idea is awesome!

It's a new concept for the "desktop", where you treat it like you treat your own desk: have stuff lying around, stuff pinned to the wall, stuff piled up, ...etc.

If you're a person who likes graphical effects and likes to interact with your computer, I think this fits you quite well! (It even has themes!)

The original video of the idea is here. It explains the idea & its origin.

Software: http://bumptop.com/ -- Currently Windows only, but you can vote for Linux &/| Mac.

If you run this software, kindly leave your feedback here, or link back to me with your own review of it.

Tuesday, April 7, 2009

Zain e-Go: Show me MY Usage before you charge me!

I have this Zain e-Go device (Huawei E220) and on 3 occasions I had to pay more than the subscription fees and got my bandwidth capped at 16 kB/s because I went over the 30 GB/month limit.

What I don't understand is since Zain knows how much I'm downloading, why doesn't it show me the bandwidth usage per day?

I don't mind paying when I exceed the limit, but Zain/MTC has a history of mis-billing, and now I can't tell why I'm paying extra, unlike the case with calls & SMS.

I'm not the only one using the device, so it doesn't help to install a program on my own machine and log the traffic.

A Request to ISPs: Consider The Users when Going Fiber

Since Kuwait is moving to a fiber connectivity infrastructure (hopefully within 2 years...), I wish that local ISPs would do the following:

When users subscribe for an Internet connection, the ISP would provide a standard link capacity of 5-10Mbps, but cap the Internet bandwidth to the subscribed value.

This would allow us, users, to reach local websites quite fast (newspapers, stock market, online banking, e-government) and would also allow us to form online games locally in Kuwait and enjoy the low lag & fast connectivity.

Heck, we could even help ISPs reduce the total bandwidth they pull from their international Internet links by running local torrent trackers and serving the weekly common things like TV show episodes, anime and whatnot.