
As part of my career as a dad in IT, I’ve become quite the problem solver and have worked my way up through the ranks to be a senior technical consultant, which I like to think of as chief problem solver and wearer of many hats. I was pondering how I approach a problem and why I am usually successful at working out why something has happened or what the root cause is. Do I approach this differently to others?
So I thought, why not share my approach and how I like to tackle a problem.
Please note that the examples given are not exhaustive; they are just here to demonstrate what I typically do.
Start at the Start
It is all too easy to dive straight into technical solutions without considering how the situation is impacting the customer or clearly understanding the problem itself.
I like to ask myself 5 relevant questions under each heading (questions starting with that heading’s word): What, Where, When, Who, Why, How. For example: When does the problem occur, and is it all the time? What changes have been made recently? How is it impacting the customer, and is everyone affected? Who are the key stakeholders?
I then answer these questions factually based on what I know of the issue so far. If there are any gaps I try and plug them quickly at this stage – like going off to ask who was affected or speaking with the team doing the investigation and delegating them each a question to answer. This stage can sometimes feel a little silly, but continue with the process… try to keep your answers in simple language which doesn’t need a technical skillset to understand.
With the answers now written down, I move them around to group them together logically and bin any which turn out to be duplicates of another answer. I then turn the answers into paragraphs describing the underlying problem, the final version being my “Problem Mission Statement”.
The number of investigations I’ve seen go off track because they started too soon, purely because of the severity of the situation, is another story. Take the time to prepare, even if it is just 10 minutes. When you’re a few hours into the investigation, you’ll thank me!
Plan to Succeed
Once I’ve clearly defined the problem in my mission statement, I create a plan of the steps required to eliminate the most likely root causes first. My gut feeling is derived from 20+ years in IT, so I should trust it. Whether you’ve got more experience under your belt or are just starting out, you will get some feelings about where the problem may be coming from at this stage.
Again, try to group these areas of investigation together, aligning them with the problem at hand. For example, the problem may exist in an on-premises application which depends upon SQL Database, Windows OS, Hypervisor, Network (Switching/LAN, Physical/Virtual), WAN, End User Devices and maybe the Internet Connection. If only users at the head office are affected, then you can eliminate the Internet Connection or WAN links straight away.
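As a rough illustration of that elimination step, here is a minimal Python sketch. The area names come from the example above, but the `AREAS` mapping, the location labels and the `areas_in_play` helper are all hypothetical, made up purely to show the idea of striking off areas that the affected users never touch.

```python
# Hypothetical mapping for illustration: each area of investigation and the
# user locations that depend on it. The location labels are assumptions.
AREAS = {
    "Application": {"head_office", "branch", "remote"},
    "SQL Database": {"head_office", "branch", "remote"},
    "Windows OS": {"head_office", "branch", "remote"},
    "Hypervisor": {"head_office", "branch", "remote"},
    "Network (LAN)": {"head_office", "branch", "remote"},
    "End User Devices": {"head_office", "branch", "remote"},
    "WAN": {"branch"},
    "Internet Connection": {"remote"},
}


def areas_in_play(areas, affected_locations):
    """Keep only the areas that at least one affected location depends on."""
    return [name for name, used_by in areas.items() if used_by & affected_locations]


# Only head office users are reporting the problem.
remaining = areas_in_play(AREAS, {"head_office"})
print(remaining)
```

With only head office affected, WAN and Internet Connection drop out of the list, leaving the areas that still deserve a battle card.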
I will then think about my gut feeling of the root cause based on the areas of investigation left. Here is where I start to plan the approach for each area into “Battle Cards”. Over time you may want to keep the battle cards you’ve created to save time the next time you need to review them.
A battle card for Windows OS is an example of where your investigation may follow a similar approach each time. For example: I must check the System, Application and Security logs on the server. I must check that the services set to Automatic are running. I must check recently installed updates to see if the timeline matches.
Keeping these battle cards can shorten future planning, because I can start with this planned approach the next time I need to review that area.
It is critically important to go back to your Problem Mission Statement when writing your battle cards for this investigation. Are there specific times, places, users or devices that you’re looking for in the logs?
There is one more battle card to add to all investigations, and it is vitally important. Do you have current backups and configuration backups of the devices involved, before any changes are made? Consider what backups you need to make. If you’re fortunate enough to have a team of colleagues you can delegate to, set them the priority task of checking backup outcomes and refreshing configuration backups now.
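One lightweight way to keep a battle card for reuse is as plain data. The sketch below assumes Python; `BattleCard`, its fields and the `windows_os_card` helper are hypothetical names of my own, and the checks are simply the Windows OS ones listed above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class BattleCard:
    """A reusable checklist for one area of investigation (illustrative only)."""
    area: str
    checks: list[str]
    findings: dict[str, str] = field(default_factory=dict)

    def record(self, check: str, finding: str) -> None:
        """Note what was found for a check that is on this card."""
        if check not in self.checks:
            raise ValueError(f"{check!r} is not on the {self.area} card")
        self.findings[check] = finding

    def outstanding(self) -> list[str]:
        """Return the checks that have no finding recorded yet."""
        return [c for c in self.checks if c not in self.findings]


def windows_os_card() -> BattleCard:
    """Build a fresh copy of the Windows OS card for a new investigation."""
    return BattleCard(
        area="Windows OS",
        checks=[
            "Check System, Application and Security logs",
            "Check services set to Automatic",
            "Check recently installed updates against the timeline",
        ],
    )


card = windows_os_card()
card.record("Check services set to Automatic", "All running")
print(card.outstanding())
```

Building the card from a function means each investigation starts with a clean set of findings while the list of checks stays the same.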
Process of Elimination
The Sherlock moment: we are going to use our battle cards to whittle down the large stack of potential root causes. Once you’ve eliminated the impossible…
Follow your gut: stack your battle cards in the order you feel is most likely to yield a result. You may not be right every time, but if you’ve experience and a good understanding of the problem (using your W5H mission statement) then you’ll probably be right 80% of the time, which is a great first time fix rate!
Use the battle card to keep yourself aligned to the mission statement.
Don’t get too bogged down in one area of investigation here. Set yourself time limits, like 30 minutes max per area on the first pass. If there is nothing obviously wrong in that area, strike it off for now. If there is potentially something wrong in that area, or this is your gut feeling, allow yourself some extra time, like 45 minutes max. Hold yourself responsible for not spending hours on one area; move on to find additional clues.
When you’ve done your initial pass, revisit the Problem Mission Statement, checking if the clues you’ve found may be a symptom or cause of the problem.
Tweak, Monitor and Test
As with starting too soon, I’ve seen a number of investigations go off the rails due to the sheer number of tweaks and changes being made to try and find the cause, only to find out later that by flicking that seemingly minor setting from on to off you’ve made a much bigger issue down the road. Back pedalling then turns into a nightmare as you’re trying to remember all the things that were done and in what order. Utter chaos!
You must be strict at this stage. Only make changes at the same time if they are independent of each other. Ask yourself whether the changes will impact each other, and whether the results can be monitored separately without one masking the other. Document the changes you’re making. Think about how you can test whether a change affects the problem, and carry out your testing both before and after you make the change to see if anything has changed.
Remember to give things time to work; sometimes flicking a switch takes a little while to change the outcome of the testing.
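To make the “test, change, wait, test again, write it down” discipline concrete, here is a small Python sketch. Everything in it is an assumption for illustration: `ChangeRecord`, `apply_with_test` and the `settle_seconds` parameter are hypothetical names, and the `settings` dictionary stands in for whatever real setting you are changing.

```python
import time
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable


@dataclass
class ChangeRecord:
    """One documented change, with the test outcome either side of it."""
    description: str
    made_at: datetime
    passed_before: bool
    passed_after: bool


def apply_with_test(
    description: str,
    change: Callable[[], None],
    test: Callable[[], bool],
    log: list,
    settle_seconds: float = 0.0,
) -> ChangeRecord:
    """Test, make one change, wait for it to settle, test again, log it."""
    passed_before = test()
    change()
    time.sleep(settle_seconds)  # give the change time to take effect
    passed_after = test()
    record = ChangeRecord(
        description, datetime.now(timezone.utc), passed_before, passed_after
    )
    log.append(record)
    return record


# Stand-in for a real system: a single setting and a test that reads it.
settings = {"feature_enabled": False}
change_log = []
result = apply_with_test(
    "Enable feature_enabled",
    change=lambda: settings.update(feature_enabled=True),
    test=lambda: settings["feature_enabled"],
    log=change_log,
)
print(result.passed_before, result.passed_after)
```

Because every change goes through one function, the log ends up in the order the changes were made, which is exactly what you need if you have to back pedal.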
No such thing as too much communication
Lastly, when you’ve spent the time to prepare a plan of action and a problem mission statement it can make communicating the problem to non-technical colleagues or customers much easier to do.
It is best, when you can afford to and have people around you that you can rely on, to assign someone to manage the communication. They can be someone non-technical (this often works best, to be honest) who can share an update on the investigation with key stakeholders.
Set clear time limits for when they can expect the next communication. A “nothing to update, investigation ongoing” update is more valuable in the long run than no update at all, as the stakeholders will be less likely to start worrying about whether you care about resolving the issue. Handing over to another team mid-way through an investigation is so unproductive!
Another pitfall to avoid is that an engineer working on a problem is likely to tell you they think they may have found the problem when all they have identified is another clue. By communicating these too soon and without reviewing them against the Problem Mission Statement, you run the risk of planting large seeds of doubt into the customer’s mind. They may walk away thinking “my network is broken”, then in the next update “my server is broken”, then “my database is broken” and very quickly you’ll see the despair on their face as they start to think their years of investment in their business are rapidly going down the drain. Trying to rebuild the trust that they’ve bought the right things and have the right people running them after that event is not a nice scenario to be in.
In Conclusion
To summarise, I don’t think there is a single right way to run an investigation, as each problem is typically different and some are naturally way more complex than others. But if you invest some time and effort in making the process something that you can quickly repeat and that gets a good outcome 80% of the time (or more), you’ll be going in the right direction.
I’ve shared some insights into what I’ve learned over the years, and hope this may help you travel the challenging path of problem solving too. I’m always learning and constantly trying to improve the way I do things, so I’m under no illusion that what I’ve written is 100% the best way to do it. Please do get in touch if you’ve any thoughts or experiences to share; I would love any insights.