Tag Archive | "Outages"

Outages, Monitoring and Being Prepared

Tags: Amazon EC2, Business Service Management, Cloud Computing, Disasters, Enterprise IT, Outages


Last week, Lori MacVittie had a blog post on DevCentral about earning your data center merit badge. The message was delivered up front and it was simple to understand: Be prepared. MacVittie is right, of course. The best way to stay out of trouble is to put systems in place to prevent it from happening in the first place.

But today’s outage at Amazon EC2 showed us something else: no matter how well prepared you are, stuff happens that’s totally out of your control, and it can spiral pretty quickly. And lest you think that because you don’t use a public cloud Infrastructure as a Service (IaaS) like Amazon EC2 you have nothing to worry about, think again.

If it can happen to Amazon, it can happen to you, because at its heart, what is Amazon but a giant data center whose core business is keeping other businesses going? That suggests Thursday’s outage was something extraordinary, since it got past all of the fail-safes that a system like Amazon’s has to have in place. Today it all fell apart, and it could just as easily happen to you because chances are your data center doesn’t have nearly the number of contingencies in place that Amazon has.

That means that ultimately you’re probably closer to a disaster like this one than Amazon ever was (yet it happened to Amazon anyway).

None of this is meant to scare you, because IT pros know the score about these things, but it is a reminder that having systems in place to monitor and alert you *before* disaster strikes is more important than ever. Of course, it may turn out not to matter how prepared you are if a disaster strikes that’s completely beyond the scope of anything you could have imagined in a reasonable contingency plan.

All you can do is follow MacVittie’s simple advice and be prepared for whatever comes. It might not always be enough, but if you do your best, you’ll minimize those major outages and be ready to deal with them when they do happen. But remember, disasters happen to everyone at some point, whether you’re in the cloud or in-house in a data center, and you need to be ready.

Photo by rbrwr on Flickr. Used under Creative Commons License.

Outages Can Wreak Havoc on Productivity

Tags: Availability, Business Service Management, Gmail, Intuit, IT, Monitoring, Networking, Outages, Skype


In September 2009, Gmail went down for two hours. To hear the complaining on social networks like Twitter at the time, you would have thought the entire world had come to a standstill, and for many people it had. That’s because Gmail had become more than a nice-to-have free service for them. People had come to depend on it for business and personal communication.

Other high-profile outages have followed, including the Intuit outage last June and the Skype outage in December. Both lasted more than a day, leaving many unhappy users in their wake and providing a snapshot of what happens when your systems go down.

People who need these services to do their jobs are left looking for workarounds that IT might not ultimately be happy with (like using unauthorized services to try to get something done).

The fact is that as you sit there looking at your monitoring dashboard, there are real people behind those red lights trying to get their work done. These stories illustrate in a very concrete fashion that when services go down, whether public or private, it can have a profound impact on actual users. It can be easy to forget that as you look at the data on the monitors in front of you, but it’s important to keep in mind that the data is not just some abstract representation of the service levels inside your company.

In fact, for every red light you see on the dashboard, there is another person unable to complete a task using that service, and the more mission-critical the service is, the bigger the effect.

So as you monitor your systems, review your data, and watch the activity streaming through your equipment, always remember that there are humans who depend on these tools to do their jobs, and when a service goes down, even for a little while, it can have major ramifications.

Photo by nan palmero on Flickr. Used under Creative Commons License.