Tuesday, April 14, 2009

The Perils of System Administration

Whenever you join a company, visit the IT department and you'll find a group of guys calling themselves Systems Administrators, or Systems Engineers (depending on their rank).
These are the guys who keep your IT services running, and even if you see them slacking, playing chess or hide-and-seek, making fun of users, or sleeping on the desk, you can almost always count on them when one of the servers goes down; they will stay at work and use toothpicks to keep their eyes open until your precious(ssss) services are up & running.

I have come across two kinds of admins: those who have ethics and those who don't. In times of crisis you can tell which is which, even if the unethical one managed to play the cunning fox during the quiet days of duty.

Ethics dictate that you state clearly what you know & what you don't, take responsibility for your actions, be loyal to your employer, not abuse your power, and do your job as you should.

I'll share a couple of stories here to further show the dedication, demand and abuse that IT administrators are subjected to.

Sleepless Nights: Data? What Data?


Three weeks ago we had a maintenance task scheduled for Thursday from 1400 hours (2 PM) to 1800 (6 PM). The scope of this task was to fix the Database Server cluster that our Enterprise Resource Planning (ERP) software runs on. This means HR, Finance, Warehouses, and Sales all depend on it.

We stopped the system at 1405 and took a full offline backup of the database before touching anything, then verified the backup to make sure it was consistent.
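For the curious, here is a rough idea of what the simplest form of "verifying a backup" can look like: comparing a checksum of the copy against the original. This is only a minimal Python sketch for illustration, not the tooling used that day; the function names and file paths are hypothetical, and a real database consistency check goes well beyond comparing bytes.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large backup files never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches_source(source, backup):
    """Return True if the backup file is byte-for-byte identical to the source."""
    source, backup = Path(source), Path(backup)
    # Cheap check first: files of different sizes cannot be identical.
    if source.stat().st_size != backup.stat().st_size:
        return False
    return file_sha256(source) == file_sha256(backup)
```

Calling `backup_matches_source("data.dbf", "backup/data.dbf")` (hypothetical paths) would tell you whether the copy is intact, but only while the source is offline and not changing, which is exactly why the system was stopped first.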

At 1435, the backup was done, and we proceeded, together with the offshore support team for the ERP system, to fix the cluster problems on the secondary/standby node.

One thing led to another, and we ended up staying till 2200 (10 PM) and planned to continue working on Friday starting at 0800, hoping to finish before lunch time.
The offshore guys were still logged in through VPN from India and continued to dig around for a few more hours.

On Friday I was at the Head Office (HO) at 0750, contacted the offshore support and we picked up where we had left off. Around 1100, we got both nodes to work, and we ran two failover tests. The database worked fine, until we switched back to the primary node.

Everything just went downhill from there...

The database entered recovery mode and got stuck there in an infinite loop. What is recovery mode, you say?
Well, it crashes, then comes up again trying to start, then crashes, and so on.
These continuous cycles caused the error dump filesystem to fill up, which caused another crash at a higher level, stopping the recovery cycle and ending with a non-working database server.
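A crash loop like this is something a restart script can guard against. The Python sketch below is purely illustrative and is not how our cluster software behaves: the function names, the retry limit of 3 and the 10% free-space threshold are all assumptions I picked for the example. The idea is simply to stop restarting automatically once the dump filesystem is nearly full or too many attempts have failed.

```python
import shutil


def free_fraction(path):
    """Fraction (0.0 to 1.0) of the filesystem holding `path` that is still free."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.free / usage.total


def should_retry(attempts_so_far, dump_free_fraction,
                 max_attempts=3, min_free_fraction=0.10):
    """Decide whether one more automatic restart is reasonable.

    Both thresholds are arbitrary example values, not vendor recommendations.
    """
    # Give up after too many failed starts: a human should read the logs.
    if attempts_so_far >= max_attempts:
        return False
    # Give up if another crash dump could fill the dump filesystem.
    return dump_free_fraction >= min_free_fraction
```

A supervisor script would call `should_retry(attempts, free_fraction("/path/to/dumps"))` before each restart (the path is a placeholder) and page an admin instead of looping once it returns False.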

Around 1600, we were still trying to bring the database up after investigating many error logs of the database and the operating system.

A few more futile attempts were made to run the database, after increasing the size of the error dump filesystem.

Around 2100, we realized that our database had been corrupted. No more data. No more business.

Enter panic mode.

We knew we had a safe full backup taken after the business closed, so we wouldn't lose any changes. Now it was all about recovering the database and making sure the ERP software was working; then we could sleep.

I forgot to mention that since we couldn't leave the place, I had a friend of mine bring us lunch to work, and that was the only meal we had that day. THANKS HISHAM!!!

Anyway, we raised a support ticket with the ERP software vendor (SAP) at the highest priority possible, and they called me from Germany within 30 minutes. They verified that it was indeed a top-priority problem and assigned one of their elite support guys to help us.

Around 0200 on Saturday, we decided to dump the existing, now-corrupted database and import the backup.

At 0300 on Saturday, we had wrapped up the unfinished work of the cluster failover task, and were just happy that the data and the systems were up & running for the business to use, since my company works on Saturdays.



More stories to come...

4 comments:

Nosayba said...

Stress? Digestive!! (or Kabab).

Nice title for your series up there. Waiting for more stories (not wishing that you face more of these stressful episodes though).

MBH said...

We were two guys in the HO and we had to take naps on shift-basis!

Nosayba said...

Alright, alright.. Chocolate biscuits.

MBH said...

Actually, we had Kabab for lunch.

From Bahar restaurant. Wasn't that good, but it kept us working!