A true story, names withheld to protect the innocent (and a Dilbert in the making). It is an illustration of Business Service Management, of technology impact and of calculating the cost and value of an outage, rather than a Wiki-style definition.
Early in my career, green and wet behind the ears, I was about 8 months into the job, working the 4:00 – 12:00 shift solo in a distributed data center (the shift where stuff gets done but isn't discovered until 7:00 am the next day). You know what I'm talking about and likely already sense the pain that is about to come. I knew how to run the jobs; I didn't know what they were really doing, or how to fix things if they went wrong – at least not until one fateful summer night. I was working for an outsourcer, processing insurance claims so our customer could pay the beneficiaries.
I worked my shift and left at midnight: jobs done, reports printed, tape back-ups done. The woman working the midnight shift was about to have an easy night. That is, until about 6:30 am, when she would attempt to bring 3 mainframes online for the next day's claims processing. Yes, she was greeted with more error codes than she knew what to do with, and The Boss received an early alarm/wake-up call: "I can't bring Rodney to life" (each of our mainframes had a nickname; Rodney was one of them). HELP!!
The Solution – Scavenger Hunt:
I arrive at work at 4:00 to chaos: looks of anger and talk of "irrecoverable damage on your shift yesterday." I look around, the machines are humming, and I say it was not irrecoverable; Rodney is up and running. The phone rings and I answer it. It's my friend Richard in a distant location. He asks how I am; I say it's the worst day of my life. He says, "It was you!" – meaning he had helped restore service, but no one had ratted me out as the root cause.
Richard walks me through my previous night's shift, what I did and what I didn't notice. I had trashed a bunch of files. Not a big deal if you have back-ups, which we did: just hang a tape, reload the files and restart Rodney – 5 minutes.
The Cost – my Penalty:
The Boss comes into the data center and waves at me: come take a walk with me. I figure I'm about to get fired. After all, the data center was down for 7 hours, not a single claim was processed, beneficiaries didn't receive checks, my company missed an SLA, and dozens of people worked 7 hours to fix my mistake. But there was something even worse I was about to experience. At 20-something, I couldn't calculate the number of zeros in the cost of my simple error.
The Boss walks me through a room with the folks who input claims and reminds me that they get paid by the claim, meaning by the number of claims they key each day. My simple mistake caused a 7-hour outage, forced a team of people to hunt for the root cause in order to restore service, may have earned my company a fine, and delayed beneficiary checks. But most heart-wrenching to me was that I had impacted the paychecks of more than 100 folks who were paid by the number of claims they keyed each day. As we walked through the room, they didn't know I was the root cause, but they were glaring at us nonetheless.
The room seemed the length of a football field that day. As we exited the room, The Boss simply asked, "Are you going to do this again?" and I quickly responded, "I hope you fire me if I do!"
Business Service Management – claims processing was my business, and my company caused an outage with a significant cost. This happens every day. The cost is quite easy to calculate, and the insurance policy to mitigate the risk is far less costly. Yet as IT professionals we have a difficult time justifying service-enabling our data centers with proper management until there is an outage. A single outage can cost 1-2% of revenue, while a solution to avoid it can be a fraction of that cost.
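The 1-2% figure makes for an easy back-of-the-envelope comparison. The sketch below illustrates the arithmetic; the revenue and mitigation figures are entirely hypothetical, chosen only to show the shape of the calculation.

```python
# Back-of-the-envelope: cost of one outage vs. cost of mitigation.
# All dollar figures are hypothetical, for illustration only.

annual_revenue = 50_000_000   # hypothetical annual revenue ($)
outage_cost_pct = 0.015       # midpoint of the 1-2% of revenue estimate
outage_cost = annual_revenue * outage_cost_pct

mitigation_cost = 150_000     # hypothetical spend on monitoring/management ($)

print(f"Estimated cost of one outage: ${outage_cost:,.0f}")
print(f"Cost of mitigation:           ${mitigation_cost:,.0f}")
print(f"Mitigation vs. one outage:    {mitigation_cost / outage_cost:.0%}")
```

Under these assumed numbers, the mitigation spend is a fraction of what a single outage costs, which is the whole argument in miniature.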
Data centers are growing more complex. Virtualization and cloud computing are seen as low-cost options because they remove hardware and software costs, but the cost of support is overlooked, and we are entering a familiar cycle of short-sighted savings over long-term cost, repeating the dot-com bust with hosting providers and web services. Service-enabling infrastructure with an end-to-end view to pinpoint root cause, and with the visibility to read the indicators before impact so that restoration takes minutes rather than hours, greatly reduces the cost of an outage and has to be factored into the solution. Service-enabling with management up front allows you to take risks and be agile with new technology, because the right management is in place to monitor for thresholds, errors, etc., avoiding and mitigating outages.
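"Reading the indicators before impact" boils down to threshold-based monitoring: watch a few leading metrics and alert before they reach the level where an outage becomes likely. A minimal sketch, assuming hypothetical metric names and threshold values:

```python
# Minimal threshold-based monitoring sketch. Metric names and limits
# below are hypothetical examples, not any particular product's API.

def check_thresholds(metrics, thresholds):
    """Return (name, value, limit) for every metric at or over its limit."""
    alerts = []
    for name, value in metrics.items():
        limit = thresholds.get(name)
        if limit is not None and value >= limit:
            alerts.append((name, value, limit))
    return alerts

# Hypothetical readings and limits for one server.
metrics = {"disk_used_pct": 92, "queue_depth": 40, "cpu_pct": 55}
thresholds = {"disk_used_pct": 90, "queue_depth": 100, "cpu_pct": 85}

for name, value, limit in check_thresholds(metrics, thresholds):
    print(f"WARNING: {name}={value} exceeds threshold {limit}")
```

Here the disk is at 92% against a 90% limit, so the check fires before the disk fills and takes Rodney down, rather than after. That is the difference between a 5-minute fix and a 7-hour scavenger hunt.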
I know my Boss wasn't really mad that I was the root cause of an outage; he was mad that a 5-minute fix relied upon a 7-hour scavenger hunt! This is my Dilbert – what's yours?