It’s obvious that when a critical production system goes offline, restoring it to operation is a mission worthy of focused resources. How many times have you sat in a meeting where ridiculous ideas are thrown around in a vain attempt to restore the business service? Probably many! A great deal of time and energy is spent “brainstorming” in an attempt to restore the system, often with unproductive results.
Here are a couple of classic mistakes that organizations make when a system crashes:
1) You start fixing the issue without knowing what the problem is.
I’m not talking about Problem Management, so you can put your ITIL bible down for a second. What I mean is: what is the impact? What’s not happening from a business perspective because of this condition?
Because IT Operations folks are traditionally reactive, they all too often start fixing system issues without knowing what is truly broken. Engineers will start applying solutions to situations without fully understanding what happened in the first place. It is imperative that you take a moment to define what “resolved” will mean in each scenario. The concern must be clearly identified before solutions are proposed. Putting a system back on-line may not resolve the business impact.
HINT: Identify the Critical Success Factor (CSF) and then identify 3 Key Performance Indicators (KPIs) that will confirm that the issues are resolved. I’m not talking about writing a book here. Simply state what success will look like and how we will measure it. For instance, 40 users are able to input 200 policies in a workday. That’s the CSF. Now how will you measure this? We will record how many policies are entered into the system per hour. We will measure the number of users logged on to the system per hour. We will measure the number of failed attempts. Thus we now have Key Performance Indicators that we can state: 1) x number of policies can be done per hour. 2) x number of policies fail out of y. 3) x number of agents will experience a y% fail rate.
The CSF and KPIs should be measurable so that an evaluation can be made as to whether the issue has been resolved. In this example, if we hit KPI 1 (30 policies per hour), KPI 2 (fewer than 1 out of 6 fail), and KPI 3 (5 agents or fewer experience a 15% fail rate), then we know that we are achieving the required business condition, and we have solved the problem.
Do you notice the mindset difference here? Have we solved all the problems? Well, if you ask the engineering team, the answer is obviously no. We are still experiencing failures. However, from a business perspective, we have solved the problem.
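If it helps to see the KPI check written down, here is a minimal sketch in Python. The names (AgentStats, evaluate_kpis, is_resolved) are made up for illustration, and I am assuming that an agent counts against KPI 3 when their personal fail rate is 15% or higher; your own definition may differ. The thresholds are the ones from the example above.

```python
from dataclasses import dataclass


@dataclass
class AgentStats:
    """Hypothetical per-agent counts collected during the measurement window."""
    attempts: int
    failures: int


def evaluate_kpis(policies_per_hour, agents,
                  min_policies_per_hour=30,   # KPI 1 threshold from the example
                  max_fail_ratio=1 / 6,       # KPI 2: fewer than 1 out of 6 fail
                  agent_fail_rate=0.15,       # KPI 3: per-agent fail rate (assumed >=)
                  max_failing_agents=5):      # KPI 3: 5 agents or fewer
    """Return a pass/fail flag for each of the three KPIs."""
    total_attempts = sum(a.attempts for a in agents)
    total_failures = sum(a.failures for a in agents)
    # Guard against division by zero when nothing was attempted.
    overall_ratio = total_failures / total_attempts if total_attempts else 0.0

    # Count agents whose own fail rate reaches the threshold.
    failing_agents = sum(
        1 for a in agents
        if a.attempts and a.failures / a.attempts >= agent_fail_rate
    )

    return {
        "kpi1_throughput": policies_per_hour >= min_policies_per_hour,
        "kpi2_fail_ratio": overall_ratio < max_fail_ratio,
        "kpi3_failing_agents": failing_agents <= max_failing_agents,
    }


def is_resolved(results):
    """The business condition is met only when every KPI passes."""
    return all(results.values())


# Example: 40 agents, two of them failing 2 out of 10 attempts.
agents = [AgentStats(10, 1) for _ in range(38)] + [AgentStats(10, 2) for _ in range(2)]
results = evaluate_kpis(policies_per_hour=32, agents=agents)
print(results, is_resolved(results))
```

Notice that the sample data still contains failures, yet the check can report the incident as resolved. That is the business view of “resolved” rather than the engineering view.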
Stuff breaks, crap happens, that’s life. IT cannot and never will be able to provide a perfect operating environment. That is why we need SLAs that are realistic and service driven, not metric driven. So remember, before you dive deep and call in the SWAT team (please don’t do that; we’ll get to that in an upcoming blog), make sure you define what the problem and success look like. This little bit of effort and time will get the business up and running faster and save you lots of frustration.
Next blog, let’s break down another classic mistake in handling incidents: Assuming anything.
You hit on a very critical area: IT having visibility into business metrics. So much time and energy is spent developing BI, specifically business performance information, so the business can adjust and improve. Giving IT visibility into those performance indicators, combined with the operational intelligence an IT group should be measuring, significantly improves the event management of the incident. How critical is it really? Spot-on, Matt!