Ok, so we aren’t there yet. The first part of getting over a problem is admitting that you have one. How can we resolve the issues I brought up in my previous post? Let’s talk about that now.
1. Too many tools…
You are never going to reduce the number of tools you have down to one. Someone will always need this tool or that functionality. So, to resolve this you need a tool that can pull data from multiple sources through integration: databases, APIs, web interfaces, traps, etc. These tools do exist.
2. SLA monitoring via trouble tickets
As I mentioned in my previous post, there is a lot of potential for human error here. I would suggest that trouble tickets back up, or provide the background reasons for, a service level agreement (SLA) violation, but they should never be used as the SLA measurement itself. You also need your SLA to potentially have different thresholds for different parts of the service. Once you have integrated with the sources of information in item 1, you should be able to build out your SLAs based on the business service, taking into account the different parts of the service and the areas where you have redundancy versus single points of failure. Then you can roll all of that up to a dashboard where you can see the results.
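To make the redundancy versus single point of failure idea concrete, here is a minimal sketch. It assumes you already have an up/down status for each component from your integrated sources. The `Group` class, the `service_is_up` function, and the component names are all made up for illustration, not taken from any particular tool.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Group:
    """Hypothetical model element: components that back each other up.

    A group with a single member is a single point of failure.
    """
    name: str
    members: List[str]


def service_is_up(groups: List[Group], status: Dict[str, bool]) -> bool:
    # A group survives if ANY member is up (redundancy).
    # The service is up only if EVERY group survives.
    # A component with no reported status is treated as down.
    return all(
        any(status.get(member, False) for member in group.members)
        for group in groups
    )


# Illustrative service: two redundant web servers, one database.
groups = [
    Group("web", ["web1", "web2"]),
    Group("database", ["db1"]),
]

# Losing one web server does not take the service down...
print(service_is_up(groups, {"web1": False, "web2": True, "db1": True}))
# ...but losing the lone database does.
print(service_is_up(groups, {"web1": True, "web2": True, "db1": False}))
```

A real model would likely need more than this (weights, partial degradation, network routes), but even this much keeps a failed redundant part from counting against the SLA.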
3. SLA status based on Network availability
Total network availability should never be part of an SLA! Your SLA should only include the parts/routes of the network that your service depends on. The network availability is important, but not as important as the service availability. Ultimately the SLA is there to ensure that the customer can use the service. If the service functions then the SLA is good, in the customer’s eyes. You need to build a model for the service so that you can take into account all of the parts of the service, both physical and logical, and include a synthetic transaction to confirm that the service is functioning. One last point here: if the service is available but it takes 5 minutes to log in, the customer sees the service as down. A well-defined SLA looks at all SLA components from the customer’s point of view.
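Here is a rough sketch of what a synthetic transaction check could look like if you treat "too slow" the same as "down." The function name `run_synthetic_check` and the `max_seconds` threshold are my own placeholders. The transaction itself is whatever scripted login or request makes sense for your service, so it is passed in as a plain callable here.

```python
import time
from typing import Callable


def run_synthetic_check(transaction: Callable[[], bool], max_seconds: float) -> bool:
    """Return True only if the transaction succeeds AND finishes in time.

    `transaction` is a hypothetical stand-in for a scripted customer action
    (for example a login); it should return True on success.
    """
    start = time.perf_counter()
    try:
        succeeded = transaction()
    except Exception:
        # Any error during the transaction counts as a failed check.
        succeeded = False
    elapsed = time.perf_counter() - start
    # A slow success is still a failure from the customer's point of view.
    return succeeded and elapsed <= max_seconds


def fake_login() -> bool:
    # Placeholder: a real check would exercise the actual service.
    return True


print(run_synthetic_check(fake_login, max_seconds=5.0))
```

The threshold is the part worth arguing about with the customer, since it defines how slow is too slow.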
4. Can’t get the data
This can be a hard nut to crack. If you have the technical ability to get the data but can’t get it for political reasons, then you have to involve the customer or customer advocate. Ask things like: How important is it to you? Point out the holes and the areas you will be blind to. What happens if this part fails and we don’t know it? Ultimately this is either a big deal or it isn’t. If it isn’t, fine. If it is a big deal, then you can leverage the pain that the customer conveyed to you to get at that forbidden data. Use the customer as the club to get at the data if needed. No one can argue (successfully) against providing good service to the customer.
5. Technical vs business data
You have integrated your data from the different sources and built out a model of the service, but the customer still complains? Look at the service from a business point of view. What tells me that the service is functioning? Things like: transactions processed per (time period), web hits, database rows updated, etc. Now treat this as more data you need to integrate. Pull in this data alongside your model to validate the technical data with the business data.
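As a sketch of how the business data can validate the technical model, you could compare a business metric against a normal baseline and flag the cases where the two disagree. Everything here is an assumption for illustration: the function name `compare_views`, the labels it returns, and the 50% default for `min_ratio`.

```python
def compare_views(technical_up: bool, transactions: float,
                  baseline: float, min_ratio: float = 0.5) -> str:
    """Cross-check the technical model against a business metric.

    `transactions` is the business metric for the current period (for example
    transactions processed); `baseline` is what is normal for that period.
    `min_ratio` is an illustrative threshold, not a recommendation.
    """
    business_ok = baseline > 0 and transactions >= baseline * min_ratio
    if technical_up and business_ok:
        return "healthy"
    if technical_up and not business_ok:
        # The model says up but the business is not flowing:
        # the model is probably missing a component.
        return "model_gap"
    if not technical_up and business_ok:
        # The model says down but customers are working fine:
        # the model may be counting something the service does not need.
        return "model_too_strict"
    return "down"


print(compare_views(technical_up=True, transactions=120, baseline=1000))
```

The two disagreement cases are the interesting ones, because they tell you where the model needs work.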
6. Data is too bad
Ok, valid point, but everyone starts somewhere and if you don’t start now, maybe your successor will do it. To overcome this one, simply do everything as above, only don’t show the results to anyone yet. Instead, use this data to improve the service, validate the model, confirm the SLA hours of availability, etc. before the data is shown to the customer or management. Use this time to improve the monitoring and functionality of your environment.
7. SLAs just a punishment tool
Although I am sure you have seen this, it doesn’t have to be this way. Instead of struggling to meet the SLA, change it, further define it, eliminate the false information. Include the business information as mentioned above in item 5. I have seen companies do this well and be willing to raise the penalties they would pay during business hours, because they eliminated all of the non-production information that they were paying for and that had nothing to do with the SLA. They were also able to define the SLA hours exactly: when five 9s were needed and when five 7s were fine. This can give you some breathing room and make it easier to meet the defined SLA. It can also allow you to set up different levels of SLA, which in turn lets you charge more for those services that ‘must always be available.’
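To see why defining the SLA hours matters, it helps to turn an availability target into the downtime it actually allows. This is just the arithmetic, with a function name (`allowed_downtime_minutes`) and example hours that I picked for illustration.

```python
def allowed_downtime_minutes(target_percent: float, sla_hours: float) -> float:
    """Minutes of downtime allowed within the SLA hours for a given target."""
    return (1 - target_percent / 100) * sla_hours * 60


# Example periods (assumed): a 30-day month measured 24x7, versus
# 22 business days of 10 hours each.
around_the_clock = 30 * 24   # 720 hours
business_hours = 22 * 10     # 220 hours

for target in (99.999, 99.9):
    print(target,
          round(allowed_downtime_minutes(target, around_the_clock), 3),
          round(allowed_downtime_minutes(target, business_hours), 3))
```

Five 9s over a full month leaves well under a minute of downtime, which is why it should only apply to the hours and services that really need it.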
8. SLAs are only historical and I need real time
I hear it all of the time: ‘I can’t worry about SLAs. I am trying to deal with right now.’ A well-defined SLA allows you to see the state of things right now AND it can give you predictive warnings as well. You can be notified not just when there is an outage but also when (if nothing changes) you will violate your SLA in X hours or n minutes. This can take the service you provide to a whole other level, letting you see potentially customer-impacting issues before they violate your SLA. How can you afford NOT to set up your SLAs?
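One very simple way to get that kind of warning is to treat the downtime your SLA allows in a period as a budget and track how much of it is already used. The sketch below assumes you know both numbers in minutes. The function name `minutes_until_breach` and its parameters are hypothetical.

```python
from typing import Optional


def minutes_until_breach(budget_minutes: float, used_minutes: float,
                         outage_in_progress: bool) -> Optional[float]:
    """Estimate how long until the SLA is violated if nothing changes.

    Returns 0.0 if the budget is already spent, None if nothing is
    currently consuming the budget, otherwise the minutes remaining.
    """
    remaining = budget_minutes - used_minutes
    if remaining <= 0:
        return 0.0  # already in violation
    if not outage_in_progress:
        return None  # no active outage, so no breach is approaching
    # During an outage the budget burns one minute per minute.
    return remaining


# Illustrative numbers: 43.2 minutes allowed this month, 30 already used,
# and an outage is happening right now.
print(minutes_until_breach(43.2, 30.0, outage_in_progress=True))
```

A fuller version might project from partial degradation or recent trends, but even this lets you alert on "about 13 minutes left" instead of on the violation itself.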
At the end of the day, well-defined, monitored SLAs can improve how you are perceived by the customer and improve the service you provide as well. Can we ever get to SLA nirvana? Yes, I think we can. It’s just a process that really works for you when it is managed well and the correct information is gathered.