Handling major system failure mistake number 2: Broad-based assumptions without facts. “It’s probably the network” or “the server” or “the last code release”… or any other scapegoat. A systemic failure that has been plaguing an organization is rarely caused by one of these discrete components in and of itself. Assuming that they are the cause predisposes the team to certain opinions. You’ve heard the expression about what happens when you assume. I’ll spare you the spelling lesson and get to the point: don’t send your team chasing red herrings. When you point fingers and play the blame game, you waste time and generally end up in the same place you started. It’s not productive, it creates animosity, and most of all it distracts from the issue at hand.
Every possible cause of the problem across the entire infrastructure must be considered for fault isolation. That’s a waste of time and impossible, you say! Not at all. In fact, once you have mastered this troubleshooting technique, you will find that isolating the “where” of a problem will almost immediately tell you the “what”. Here is how you do it: Start at the user.
Use remote tools or, better still if you are able, stand over the shoulder of an affected user and watch the issue in action. It shows the user community a visible presence from IT, which is good politically, and it helps quantify the business impact. Most importantly, it lets you run the process of elimination rapidly. (Need a tool? See below.)
By starting at the user interface, you can very quickly determine the availability and response of many of the underlying components. For instance, if the user is able to get past the login screen, you have validated PC, LAN, WAN, DC, and server-to-server connectivity in one shot. DNS, authentication, and some business logic are also validated as the user works through the system to the point of impact. Note: If you are having sporadic performance or availability failures, you may have to use an end-user experience tool to capture data over longer periods of time or across geographic locations.
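
The process of elimination described above can be written down as a small lookup. This is only a sketch: apart from the first checkpoint, which follows the login example, the checkpoint names and the components each one exercises are hypothetical and should be replaced with what your own application actually touches.

```python
from typing import List, Optional, Tuple

# Ordered checkpoints a user passes on the way to the point of impact.
# Only the first entry comes from the login example in the text; the rest
# are made-up placeholders for illustration.
CHECKPOINTS = [
    ("past login screen", ["PC", "LAN", "WAN", "DC", "server-to-server connectivity"]),
    ("dashboard loads", ["DNS", "authentication"]),
    ("order search returns", ["business logic"]),
    ("order saves", ["database writes"]),
]


def eliminate(last_good: Optional[str], checkpoints=CHECKPOINTS) -> Tuple[List[str], List[str]]:
    """Split components into (validated, suspects).

    last_good is the furthest checkpoint the user reached without trouble,
    or None if the user could not get past the first one.
    """
    names = [name for name, _ in checkpoints]
    if last_good is not None and last_good not in names:
        raise ValueError(f"unknown checkpoint: {last_good!r}")
    # Everything up to and including the last good checkpoint is excused.
    cut = names.index(last_good) + 1 if last_good is not None else 0
    validated = [c for _, comps in checkpoints[:cut] for c in comps]
    suspects = [c for _, comps in checkpoints[cut:] for c in comps]
    return validated, suspects


validated, suspects = eliminate("past login screen")
print("Validated:", validated)
print("Still suspect:", suspects)
```

The list of suspects is a starting point for the whiteboard, not a verdict. A component that passes one checkpoint can still be slow or flaky under a different transaction.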
Collect your data and go to the whiteboard. Starting at the user interface, trace the transaction through the underlying systems that support that specific problematic business process. This is where a well-configured Configuration Management System pays for itself tenfold. Now you can deploy very targeted diagnostic tools at these specific locations to measure the response times that are driving the overall business impact. Only then can you accuse or excuse a particular component.
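
As a rough illustration of the "accuse or excuse" step, the sketch below compares a measured response time for each component on the transaction path against a response-time budget. The component names, the measurements, and the budgets are invented for the example. In practice the measurements would come from your own diagnostic tools, and `time_probe` is just one hypothetical way to time a check that you can call from Python.

```python
import time
from typing import Callable, Dict


def time_probe(probe: Callable[[], object]) -> float:
    """Return how long a single call to probe() takes, in seconds."""
    start = time.perf_counter()
    probe()
    return time.perf_counter() - start


def accuse_or_excuse(measured: Dict[str, float], budget: Dict[str, float]) -> Dict[str, str]:
    """Label each measured component 'accuse', 'excuse', or 'no budget'."""
    verdicts = {}
    for component, seconds in measured.items():
        limit = budget.get(component)
        if limit is None:
            # Without an agreed budget there are no facts to judge by.
            verdicts[component] = "no budget"
        elif seconds > limit:
            verdicts[component] = "accuse"
        else:
            verdicts[component] = "excuse"
    return verdicts


# Made-up numbers, in seconds, for a three-hop transaction path.
measured = {"web tier": 0.12, "app tier": 0.30, "database": 2.40}
budget = {"web tier": 0.50, "app tier": 0.50, "database": 0.50}

for component, verdict in accuse_or_excuse(measured, budget).items():
    print(f"{component}: {verdict}")
```

Keeping measurement separate from judgment means the same comparison works whether the numbers come from a packet capture, an application log, or a stopwatch at the user's desk.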
Obviously, having a well-executed end-to-end monitoring system accelerates this process. Nonetheless, it should not take more than one week to get to the diagnostic phase and start isolating bottlenecks within the infrastructure.
Email me at Matthew.Hooper@inforonics.com and I will send you a utility we used to trace the end-user experience; it inserts markers into the network stream for easier diagnostics. I am also happy to share links to blog posts about tools, both my own and those written by others. Of course, I’m happy to have the Inforonics team speak to you about getting tools as a service.