Why aren’t we at Service Level Agreement (SLA) nirvana? I mean really, we have had SLA tools for 10, 15 years or more. You probably have anywhere from one to ten or more tools that measure SLAs, most of which probably aren’t used. Why aren’t all of our data centers, applications, servers and everything else just numbers on some dashboard that we glance at to make sure everything is good to go and that we are open for business? This troubled me, so I decided to list some of the possible reasons:
1. Too many different tools, specialties and areas of focus
You have tools that measure SLAs for the network, different ones for the infrastructure, different ones for virtual machines, different ones for the cloud, and the list goes on and on. I think this is one of the biggest issues with SLA reporting. Who wants to look at 3 to 10 different tools to know whether they are passing all of their SLAs? Who wants to maintain integrations with all of those tools just to pull the data into one dashboard? And what do you do when someone wants to see historical data? This becomes a very deep and very big hole. So companies move on to my number 2 reason.
2. SLA monitoring via trouble tickets
Wow, this is great. Finally, one source for all of our SLA data. All we have to do is make sure every issue we have gets opened as a ticket in our help desk tool. Right! Eventually, though, you miss an outage, and that outage causes you to violate your SLA. Then the logic that pervades the company goes something like: ‘If our tool missed that SLA, what else is it missing?’ And eventually: ‘We just can’t trust this tool’ or ‘We just can’t trust our monitoring’ etc. This approach is also dependent on someone entering the correct data and times. That is not to say anyone would purposely fudge the numbers, but how long would you say something was down if you were the one responsible for it?
3. SLA status based on Network availability
OK, we have all been guilty of it. If you have ever had to guarantee five nines of availability, you reported on just the network availability. Why? Because you had the data, your data met what was expected (five nines) and you could easily report on it. Did that meet the intention of the SLA? No, but (insert your excuse here). When someone who cares about an SLA defines it as 99.999% availability, they truly want to be able to access the application or business function 99.999% of the time, not just the network. This is discussed further in item 5.
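To make that number concrete, here is a minimal sketch of the arithmetic behind an availability target. The function name and the 365-day year are my own assumptions for illustration; your contract may define the measurement period differently.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Return the downtime budget, in minutes, for a given availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_minutes * (1 - availability_pct / 100)


# 99.999% leaves roughly 5.26 minutes of downtime per year;
# 99.9% leaves roughly 525.6 minutes (under 9 hours).
five_nines_budget = allowed_downtime_minutes(99.999)
three_nines_budget = allowed_downtime_minutes(99.9)
```

A budget of about five minutes a year applies to whatever the SLA actually covers, which is why it matters whether that is the network or the whole business function.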
4. Can’t get the data
Sometimes we just can’t get at the data we would need, in an automated fashion, to define an SLA. This may be due to political or technical issues; I am sure you have seen both. It must be resolved either by the customer pushing for access or by someone pushing on the customer’s behalf. In the IT world we live in today, virtually all data is accessible with permission and ingenuity.
5. Technical vs business data
This one is also very common. You report that you are meeting your SLA of 99.999% uptime and the customer says, ‘but it is never available when I need to use it.’ Been there? Why is this? Because you are reporting that all of the things you are technically responsible for are available. But when the customer goes to use the application or business service, some piece that they use, and that you might not be responsible for, isn’t functioning or isn’t responding in a timely manner. Does this make your SLA data wrong? Yes, from a customer perspective (and does anything else really matter?). Your SLA must be looked at from the business point of view as much as possible. Now, you won’t be able to account for the customer’s home network being down and then having that blamed on you, but if you have enough data showing the service was available from a business point of view, you will be able to push back on them.
What do I mean by monitoring the SLA from a business point of view? It means measuring a few things, and these will change depending on how your customer uses the service: throughput, response time, transactions processed per time period, synthetic transactions, and the functional status of every single point of failure for the service.
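As a rough illustration of the difference, the sketch below scores the same set of synthetic transaction results two ways: once as "did it respond at all" and once as "did it respond fast enough to be usable." The `ProbeResult` type, the function names and the 2-second threshold are hypothetical choices of mine, not something any particular tool provides.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """One synthetic transaction run (hypothetical record layout)."""
    succeeded: bool
    response_seconds: float


def technical_availability(results):
    """Percentage of probes that completed at all."""
    if not results:
        raise ValueError("no probe results to score")
    return 100 * sum(r.succeeded for r in results) / len(results)


def business_availability(results, max_response_seconds=2.0):
    """Percentage of probes that completed within an acceptable response time."""
    if not results:
        raise ValueError("no probe results to score")
    usable = sum(
        r.succeeded and r.response_seconds <= max_response_seconds
        for r in results
    )
    return 100 * usable / len(results)


sample = [
    ProbeResult(True, 0.4),
    ProbeResult(True, 0.6),
    ProbeResult(True, 5.0),   # completed, but too slow to be usable
    ProbeResult(False, 0.0),  # failed outright
]
print(technical_availability(sample))  # 75.0
print(business_availability(sample))   # 50.0
```

The slow-but-successful probe is exactly the case where the technical report says "up" and the customer says "it is never available when I need it."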
6. Data is too bad
When you do get everything monitored and all of the data in one source, sometimes the data is just too bad. Instead of five nines, you’re showing five sevens. So instead of showing this to the customer or management, you (insert your excuse here). This issue can be overcome either by going into the underlying tools and fixing the monitoring so that it only reports outages when they really are outages, or by fixing your applications and infrastructure.
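One common way to keep monitoring from reporting blips as outages is to require several consecutive failed checks before counting downtime. The sketch below is only an illustration under that assumption; the function name and the threshold of three are mine, and the right threshold depends on your check interval and your SLA.

```python
def confirmed_outage_checks(check_ok, min_consecutive=3):
    """Count failed checks that belong to a run of at least
    min_consecutive failures; shorter runs are treated as noise."""
    confirmed = 0
    run = 0
    for ok in check_ok:
        if ok:
            if run >= min_consecutive:
                confirmed += run
            run = 0
        else:
            run += 1
    # Account for a failure run that is still open at the end of the data.
    if run >= min_consecutive:
        confirmed += run
    return confirmed


# One isolated failure, one real three-check outage, one trailing blip.
history = [True, False, True, False, False, False, True, False]
print(confirmed_outage_checks(history))  # 3
```

Be careful with this kind of filter: it hides short outages that are real, so it only makes sense if the customer would not have noticed them either.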
7. SLA’s just a punishment tool
I have seen this in many different companies. You struggle to meet the SLAs and whenever you miss, here comes the stick. This will motivate you either to fix the issues or to quit reporting. Too often I have seen the latter. It doesn’t have to be this way. Used correctly, SLAs can be a carrot and a stick. They allow you to qualify exactly what is part of the SLA and during which hours you are responsible for meeting it. That reduces or eliminates penalties for off hours and for devices that aren’t part of the service or aren’t in your control, and lets you better meet the SLA during the true service times. SLAs need to have the carrot to be managed effectively.
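To show how much the service hours matter, here is a minimal sketch that counts only the part of an outage that falls inside agreed hours. It assumes a weekday window of 08:00 to 18:00 and naive local datetimes; both are assumptions for illustration, and the function name is my own.

```python
from datetime import datetime, time, timedelta


def in_hours_downtime_minutes(outage_start, outage_end,
                              open_at=time(8, 0), close_at=time(18, 0)):
    """Minutes of an outage that overlap weekday service hours."""
    total = timedelta()
    day = outage_start.date()
    while day <= outage_end.date():
        if day.weekday() < 5:  # Monday is 0, Friday is 4
            window_start = datetime.combine(day, open_at)
            window_end = datetime.combine(day, close_at)
            overlap_start = max(outage_start, window_start)
            overlap_end = min(outage_end, window_end)
            if overlap_end > overlap_start:
                total += overlap_end - overlap_start
        day += timedelta(days=1)
    return total.total_seconds() / 60


# An outage from Friday 17:00 to the following Monday 09:00.
start = datetime(2024, 3, 1, 17, 0)
end = datetime(2024, 3, 4, 9, 0)
raw_minutes = (end - start).total_seconds() / 60
print(raw_minutes)                            # 3840.0
print(in_hours_downtime_minutes(start, end))  # 120.0
```

The same weekend outage is 64 hours against a 24x7 SLA but only 2 hours against a business-hours SLA, which is the kind of qualification that turns the SLA into something you can actually manage to.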
We have remained in a reactive mode for many years; now is the time to turn that around, become proactive, and align with the objectives of the business. In the next post we’ll talk about how you turn this around and stitch together a successful Service Level strategy.
What would you add to this list of challenges?