TPSA: The Perils of System Administration -- A series of true stories about system administration. The first story is here.
Yesterday, Thursday April 16th, we had a scheduled maintenance job on our production servers. The cluster wasn't functioning as it should, and yesterday's tasks were aimed at rectifying the problems once & for all and at updating the installed software.
This was our time-plan for each task on the list:
1400 to 1410 | Shut down SAP on the DB and APP servers
1410 to 1440 | Take a database backup; shut down the database
1440 to 1500 | Take a filesystem backup (all filesystems on both nodes); change mountpoints for the High Availability cluster switchover testing
1500 to 1700 | Shut down SAP and the database; perform the kernel upgrade on the SAP CI; start the database and SAP on the CI instance only (not the APPs); perform technical testing on the CI; check the exe mountpoint; start SAP on the APPs; perform technical testing on the APPs
1700 to 1730 | Import the ST-A/PI patch 01L_ECC600; change SAP parameters based on the document; review the parameters on the CI and APPs (memory and work processes); restart the SAP CI and APPs; perform technical testing on the CI and APPs
All tasks were simple and planned out, with all the team members properly lined up:
- ERP software consultant (joined by his colleague later)
- ERP offshore support consultant
- AIX Unix consultant from IBM Kuwait
- Myself
We grabbed lunch and some snacks around 1230 and headed to the Head Office (H.O.).
@1401: ERP Applications were stopped
@1406: We started a full offline backup of the production database
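For reference, and assuming the database was named EHP (to match the SAP SID) and that the backup went straight to TSM rather than to disk, a full offline DB2 backup looks roughly like this sketch:

    su - db2ehp                       # switch to the DB2 instance owner
    db2 force application all         # make sure no connections are left on the database
    db2 deactivate database EHP       # take the database offline
    db2 backup database EHP use tsm   # full offline backup, sent straight to the TSM server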
@1437: The backup reached its final stage, then threw an error, stopped, and gladly deleted the backup
*NO!! THIS IS NOT THE TIME!! PLEASE!*
After being stunned and depressed for 2 minutes, I thought of stopping the database and starting it again using the user db2ehp. I did that, but when I tried to start the database again, it threw an error!
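The stop/start attempt itself was nothing exotic; roughly the standard DB2 sequence, sketched below:

    su - db2ehp       # switch to the instance owner
    db2stop force     # stop the instance, kicking off any leftover connections
    db2start          # ...and this is the step that threw the error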
*JAWS DROPPED -- LAAAAAAAAA!!*
We had faced the same error a few days back due to some user profile changes, and back then we had to run the command "/usr/opt/db2_08_01/instance/db2iupdt db2ehp" -- I ran that command now and it puked an error ...
*NOOOO!! WHY WHY!! IT WAS FINE WHY NOW!! I HATE YOU!!*
I navigated to the home directory of the user db2ehp to check the environment variables and profile scripts, only to find out that these files had been corrupted and turned into binary garbage
*SHOCKED*
*EYE TWITCHES*
I FTPed to the secondary database node, looked for similar files, then decided to copy the whole directory and rename the files to match the hostname of the primary node.
It didn't work for some reason, even though all the environment variable scripts had proper values...
I called the company responsible for supporting our backup software (TSM), and in about half an hour their consultant provided me with a command line to recover a certain directory to a specific location.
Then we restored that whole directory through our backup software: "/usr/tivoli/tsm/client/ba/bin/dsmc restore /home/db2ehp/ -subdir=yes /bkfs2/restoreyaman/"
@1620: All files were recovered, and we were now able to switch to the user db2ehp properly.
I ran the command "/usr/opt/db2_08_01/instance/db2iupdt db2ehp", then started the database, and it worked!
*YESSS!!*
The database backup was started again and we waited till it finished.
The DB backup finished, and we wanted to take a backup of the filesystems through TSM. After some going back & forth, we eventually invoked it manually, but it timed out & didn't work. A communication error over TCP/IP, it said!
*THE IP IS WORKING! I CAN PING AND LOGIN WHY CANT YOU!!???? BLOODY $##^%^@*
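For the curious, a manual TSM backup from the client side is normally just an incremental run over the filesystems in question, roughly like the sketch below (the filesystem list is illustrative); dsmc needs a working TCP/IP session to the TSM server, and that session is exactly what kept timing out:

    /usr/tivoli/tsm/client/ba/bin/dsmc incremental /db2/EHP /home /sapmnt/EHP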
@1743: The IBM engineer arrived and suggested that, since the directory to be changed was very small, we should just copy it somewhere else. We copied the directory we were going to change with "cp -pR".
Now that everything is ready to be changed, the offshore support was contacted and their consultant logged in to our server through VPN and did the required changes.
The changes were simple: one of the filesystems was part of the cluster resources and its mount point was incorrect. We simply had to change the mount point from "/db2/db2EHP" to "/db2/EHP/db2ehp".
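On AIX the change itself is a one-liner; a rough sketch, assuming the filesystem is unmounted first and ignoring the fact that on a cluster you would normally let the cluster tooling do it for you:

    umount /db2/db2EHP                    # the filesystem must not be mounted
    chfs -m /db2/EHP/db2ehp /db2/db2EHP   # change its mount point in /etc/filesystems
    mount /db2/EHP/db2ehp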
After that step was done, the IBM dude synchronized the changes between the cluster nodes on the IBM AIX Unix cluster. That was half the work -- now we just needed to make sure that we could fail over back & forth, and then we would proceed with patching the ERP software to the latest version.
@1805: We crashed the primary node, to simulate a failover from the primary DB node to the secondary, tested our ERP software, and it was working.
After crashing the primary node, we booted it up again from the management console and left it to come up. Meanwhile, we checked that the DB was working properly on the 2nd node and that the ERP software was able to communicate with it; everything was fine.
@1815: The primary DB node still hadn't come up. Fishy. Upon checking on it, it seemed to be stuck, so we restarted it again.
@1825: The machine hadn't come up after 10 minutes, which was very suspicious... Checking the management console, it was stuck at code "0557". Some Googling later and, just our luck, the code meant filesystem problems...
The system couldn't boot because, from what it seemed, the root filesystem (/) was corrupted, hence the operating system couldn't load.
*EYE TWITCHES*
*WHY ???? WHYYYYY??? WHY NOW? WHY ME? *
I was quite disappointed, since I had never expected to face such an issue with AIX, and on a p5 series machine at that. Even the IBM dude was shocked.
The IBM engineer said he can proceed with the procedure we found here:
http://www.docstoc.com/docs/2801670/AIX-BOOTING-PROBLEM -- page 5
But he said that since his job wasn't support, it would be better for us to call IBM's support line and log the call, and whatever instructions they gave, he would execute.
I called IBM's branch in Kuwait and dialed the extension that usually takes me to their support in the UAE. No one answered... I called 2 more times, to no avail.
The IBM dude called a colleague, who gave him another extension for off-hours support. We called that extension and someone picked up! (OH JOY)
I told the support dude my company's name & that we were from Kuwait; he asked me what the problem was & told me to log an issue by sending an email. I sent the email.
@1953: I received an email from him asking for my company's name, again.
I replied to it and waited for another 10 minutes, then called the extension again and asked what was going on. He said that our support contract had expired in 2005.
*HUH?!*
I told him that we had bought the hardware in November 2007, and that its support contract hadn't expired yet!! He insisted that there was no data on their end to support my claim, and we argued for a good 10 minutes.
The IBM dude with us intervened and said that my claims were correct and that he had been present during the purchase and commissioning of these boxes, but the dude in the UAE said that if their database didn't show such a thing, there was nothing he could do.
@2024: He emailed me instructions to contact IBM Europe, adding that if they were to help us, they would charge $360 an hour, with a minimum of two hours!
According to all the links we found on Google, the error code pointed at a corrupt filesystem, so we knew exactly what the problem was, and there was no point in contacting, or paying, IBM Europe.
@2035: We proceeded to load the first AIX DVD, booted from it into recovery mode on the primary node, and followed the instructions in the document above.
All filesystems were corrupted. ALL of them.
*I looked at the IBM dude and said: I'm this close to sitting in a corner and crying*
The AIX guru started fixing them one by one, and all got repaired (including the root filesystem), except one: /home, which contained the startup and environment scripts for the database...
*CRAP! but at least the root filesystem is sane!*
We rebooted the machine, entered recovery mode again, and ran fsck (filesystem check) once more to make sure all the filesystems were fine now... all were, except /home -- it was a goner. It couldn't be recovered anymore; the logical volume was corrupt beyond recognition.
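For the record, the repair itself is just fsck run from the maintenance shell against the rootvg logical volumes; the standard AIX device names below are a sketch of what that looked like:

    fsck -y /dev/hd4        # /
    fsck -y /dev/hd2        # /usr
    fsck -y /dev/hd9var     # /var
    fsck -y /dev/hd3        # /tmp
    fsck -y /dev/hd1        # /home -- the one that refused to be fixed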
The IBM engineer then proceeded to make sure that the system was bootable from both disks, by issuing the respective commands to rewrite the boot images and the boot list on them.
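On AIX that boils down to rebuilding the boot image on each rootvg disk and refreshing the boot list; roughly (the disk names are assumed):

    bosboot -ad /dev/hdisk0            # recreate the boot image on the first rootvg disk
    bosboot -ad /dev/hdisk1            # and on its mirror
    bootlist -m normal hdisk0 hdisk1   # keep both disks in the normal-mode boot list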
We exited recovery mode so the machine would boot in normal mode, and we looked anxiously at the error code display... As soon as it passed and the system started to come up, we jumped for joy and hugged.
@2050: Now that we had most of the filesystems working, we decided to back them all up to DVD (sysback).
@2105: The backup to the DVD failed. Apparently the unix box only likes DVD-RAM media. Luckily, there was an option to take a backup over the LAN to TSM.
@2115: We then proceeded to destroy the corrupt filesystem and its evil logical volume, then created a fresh one and imported the /home directory contents into it from TSM: "/usr/tivoli/tsm/client/ba/bin/dsmc restore /home/ -subdir=yes /home/"
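The destroy-and-recreate part is plain AIX LVM work; a sketch, assuming /home sat on the default hd1 logical volume in rootvg and was jfs2 (it might well have been plain jfs, and the size is arbitrary), followed by the dsmc restore above:

    rmfs -r /home                          # remove the filesystem and its underlying logical volume
    mklv -y hd1 -t jfs2 rootvg 32          # recreate the logical volume (32 LPs, illustrative)
    crfs -v jfs2 -d hd1 -m /home -A yes    # build a fresh filesystem on it, mounted at boot
    mount /home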
@2135: After restoration was done, we took another full backup of the root volume group (which includes the new /home filesystem).
Then we rebooted the primary node to make sure the filesystems persisted. It hadn't come up after 5 minutes... When we checked, it turned out the IBM dude had forgotten to abort booting from the CD, and the screen was stuck at that menu. After exiting from it, the machine booted normally.
@2145: We rebooted one more time, just to be sure, and everything went fine. At this point, we no longer needed the offshore support or the IBM engineer, since their job was done. The offshore support logged off & the IBM Unix guru left, with many warm thanks from me.
Then we proceeded to patch the ERP software; basically, it's just a compressed file with the new binaries and an installer script. We already had the stuff uncompressed on a remote filesystem (NFS), so we just mounted that, renamed the old directory (exe) to "exe_old", then copied the files to the proper location.
It should be noted that the filesystem we applied the patch to is exported as a network filesystem (NFS) to the other nodes.
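In other words, the patch step was nothing more than a directory swap under /sapmnt; something like the sketch below, where the NFS source path and mount point are made up:

    mount nfs-server:/export/sap_kernel /mnt/kernel   # the uncompressed patch on the remote NFS share
    cd /sapmnt/EHP
    mv exe exe_old                                    # keep the old kernel around, just in case
    mkdir exe
    cp -pR /mnt/kernel/* exe/                         # copy the new binaries into place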
As the ERP software came up on the primary node, we started it on the 2nd node, but it crashed...
The ERP dudes tinkered around and found out that all nodes except the primary ERP one were using the old (pre-patch) files!!!
*EYE TWITCHES*
*WHAT THE!!*
We found out a few minutes later that, for some reason, the NFS mount was still pointing at the old directory, the one we had renamed! So it seemed that even if you rename the directory, NFS keeps tracking it!! (NFS clients hold on to a file handle that identifies the directory itself, not its name, so the rename didn't break their view of it. Maybe we should've stopped NFS before doing the renaming?)
I proceeded to re-export the NFS directories in an attempt to refresh any references to them. It didn't work, and now the other nodes were getting an error for this particular NFS mount:
"df: /sapmnt/EHP/exe: A file, file system or message queue is no longer available."
*EYE TWITCHES*
*NO MORE PROBLEMS, PLEASE!!! LET US FINISH AND GO HOME!!*
I stopped the NFS service & started it again. No use. I deleted the old export entry from NFS, added it again, then restarted NFS. No use.
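For completeness, the flailing above translates roughly to the following AIX commands (whether the export entry on our box was /sapmnt/EHP or /sapmnt/EHP/exe, and with what options, is a guess):

    stopsrc -g nfs && startsrc -g nfs   # bounce the whole NFS subsystem group
    rmnfsexp -d /sapmnt/EHP/exe         # drop the export entry...
    mknfsexp -d /sapmnt/EHP/exe -t rw   # ...and add it back
    exportfs -va                        # re-export everything, verbosely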
In the end, we decided to try moving the old files out of the renamed directory "exe_old" into a temporary one, putting the new files into "exe_old", then renaming it back to "exe".
It worked!!!!! And I laughed hysterically, not believing what had happened, nor the "solution".
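Spelled out, the "solution" is a shuffle that keeps the original directory (the object the NFS clients still hold a file handle to) alive and just swaps its contents; a sketch, with a made-up temporary directory name:

    cd /sapmnt/EHP
    mkdir exe_tmp
    mv exe_old/* exe_tmp/   # empty the original directory without touching the directory itself
    mv exe/* exe_old/       # move the patched binaries into that same directory
    rmdir exe
    mv exe_old exe          # and give it back its old name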
During all this, we had to take down the cluster resources before modifying anything, since all nodes point at the shared NFS filesystem, even though it's not part of the cluster resources!!!
We had to stop & start the cluster services about 5 times till we figured out the solution above.
We brought up all the systems and the ERP guys applied application-level patches & plugins (yes, more of them...)
@0057: I took the ERP dudes to their hotel and I went home.