Troubleshooting generally consists of the following steps. Different
methodologies give them slightly different names, but the underlying
steps are much the same.
- Investigation
  - Problem Statement: Create a clear, concise statement of the
    problem.
  - Problem Description: Identify the symptoms. What works?
    What doesn't?
  - Identify Differences and Changes: What has changed recently?
    What is unique about this system?
- Analysis
  - Brainstorm: Gather hypotheses. What might have caused the
    problem?
  - Identify Likely Causes: Which hypotheses are most likely?
  - Test Possible Causes: Schedule the testing for the most
    likely hypotheses. Perform any non-disruptive testing immediately.
- Implementation
  - Implement the Fix: Complete the repair.
  - Verify the Fix: Is the problem really fixed?
  - Document the Resolution: What did we do? Get a sign-off
    from the system owner.
Problem Statement
The problem statement must be broad enough to describe the problem,
but narrow enough to focus the investigation. It should not
contain value judgements. It should be a factual answer to the
question "What is wrong?"
Problem Description
Gather all symptoms, including error messages, core dumps,
descriptions of any service outages, and contrasting descriptions of
what still works. We also need to pin down the time of the incident
as precisely as possible.
Identify Differences and Changes
Identify differences between the faulted system and any similar
working systems. Also identify any recent changes to the system.
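One way to make this comparison concrete is to diff the settings of the
faulted system against those of a similar working system. The following
is a minimal Python sketch, not a definitive tool. The function name
diff_settings and the sample settings are hypothetical, and the sketch
assumes each system's settings have already been collected into a
dictionary.

```python
def diff_settings(faulted, working):
    """Return {key: (faulted_value, working_value)} for every setting that differs.

    A key that is missing on one side is reported with None on that side.
    """
    differences = {}
    for key in sorted(set(faulted) | set(working)):
        f_val = faulted.get(key)
        w_val = working.get(key)
        if f_val != w_val:
            differences[key] = (f_val, w_val)
    return differences


# Hypothetical settings gathered from two otherwise similar hosts.
faulted_host = {"patch_level": "B", "mtu": 9000, "ntp": "enabled"}
working_host = {"patch_level": "A", "mtu": 1500, "ntp": "enabled"}

for setting, (bad, good) in diff_settings(faulted_host, working_host).items():
    print(f"{setting}: faulted={bad!r} working={good!r}")
```

Each reported difference is a candidate for the brainstorming stage; it
is not proof of a cause on its own.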
Brainstorm
In this stage, we need to come up with as many explanations for
the problem as we can. It is sometimes helpful (especially
in a group setting) to use an Ishikawa diagram to organize our
thoughts so that we don't leave any possibilities unconsidered.
Generate an Ishikawa diagram by drawing a backbone arrow pointing
to the right at the problem statement. Then attach 4-6 ribs,
each of which represents a broad category of items that may
contribute to the problem. Each component or hypothesis under
consideration should fit on one of these ribs.
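When no whiteboard is handy, the same structure can be kept as plain
data. The sketch below is only an illustration: the problem statement,
the four rib categories, and the function name render_fishbone are all
hypothetical choices, not part of any standard.

```python
# Hypothetical problem statement at the head of the backbone arrow.
problem = "Nightly backup job fails"

# Each rib is a broad category; its entries are candidate causes.
ribs = {
    "Hardware": ["failing disk", "bad network cable"],
    "Software": ["recent patch", "expired license"],
    "Configuration": ["changed mount options"],
    "People and process": ["undocumented manual change"],
}


def render_fishbone(problem, ribs):
    """Return a text outline: one block per rib, then the problem statement."""
    lines = []
    for category, causes in ribs.items():
        lines.append(category)
        for cause in causes:
            lines.append(f"    - {cause}")
    lines.append(f"==> {problem}")
    return "\n".join(lines)


print(render_fishbone(problem, ribs))
```

A hypothesis that does not fit under any rib is a hint that a category
is missing.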
Identify Likely Causes
We need to consider how likely each potential cause is. We
should only eliminate hypotheses when they are absolutely
disproven.
For more complex problems, something like an Interrelationship
Diagram may be useful in identifying which potential cause
might be a root cause.
Interrelationship Diagrams use boxes containing phrases describing
the potential causes. Arrows between the potential causes demonstrate
influence relationships between these issues. Each relationship
can only have an arrow pointing in one direction. (Where the
relationship's influence runs in both directions, the troubleshooters
must decide which one is predominant.) Items with more outgoing
than incoming arrows are causes. Items with more incoming arrows
are effects.
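The arrow-counting rule can be sketched in a few lines of Python. The
function name classify and the example arrows are hypothetical. The
"undetermined" label for items with equal counts is an assumption of
this sketch, since the rule above does not cover ties.

```python
from collections import Counter


def classify(arrows):
    """Label each item by comparing its outgoing and incoming arrow counts.

    arrows is a list of (influencer, influenced) pairs, one per arrow.
    """
    out_count = Counter(src for src, _ in arrows)
    in_count = Counter(dst for _, dst in arrows)
    labels = {}
    for item in set(out_count) | set(in_count):
        outs, ins = out_count[item], in_count[item]
        if outs > ins:
            labels[item] = "cause"
        elif ins > outs:
            labels[item] = "effect"
        else:
            labels[item] = "undetermined"  # tie: an assumption of this sketch
    return labels


# Hypothetical diagram: each pair is one arrow, pointing in one direction only.
arrows = [
    ("disk full", "log writes fail"),
    ("disk full", "database stalls"),
    ("log writes fail", "web timeouts"),
    ("database stalls", "web timeouts"),
]

for item, label in sorted(classify(arrows).items()):
    print(f"{item}: {label}")
```

Items labeled as causes are the ones to examine first when looking for
a root cause.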
Test Possible Causes
We need to perform testing in the least disruptive fashion possible.
Data should be backed up if possible before testing proceeds.
The best approach is to schedule tests of the most likely
hypotheses right away. While those scheduled tests are pending,
begin any non-disruptive or minimally disruptive tests. If several
of the most likely hypotheses can be tested non-disruptively, so
much the better. Start with them.
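The ordering described above can be sketched as a simple sort. This is
only an illustration: the Hypothesis class, the plan_tests function,
and the numeric likelihood estimates are hypothetical, and in practice
the likelihoods are rough judgments rather than measured values.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    description: str
    likelihood: float  # rough estimate between 0 and 1 (assumed for this sketch)
    disruptive: bool   # True if testing it would interrupt service


def plan_tests(hypotheses):
    """Split hypotheses into (run_now, schedule), each ordered most likely first."""
    ranked = sorted(hypotheses, key=lambda h: h.likelihood, reverse=True)
    run_now = [h for h in ranked if not h.disruptive]
    schedule = [h for h in ranked if h.disruptive]
    return run_now, schedule


# Hypothetical hypotheses from the brainstorming stage.
hypotheses = [
    Hypothesis("expired certificate", 0.1, False),
    Hypothesis("failing disk", 0.6, True),
    Hypothesis("full filesystem", 0.5, False),
]

run_now, schedule = plan_tests(hypotheses)
print("Test immediately:", [h.description for h in run_now])
print("Schedule a window:", [h.description for h in schedule])
```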
In some cases, it may be possible to test the hypothesis directly
in some sort of test environment. This may be as simple as running
an alternative copy of a program without overwriting the original.
Or it may be as complex as setting up a near copy of the faulted
system in a test lab. If a realistic test can be carried out
without too great a cost in terms of money or time, it can really
help nail down whether we have identified the root cause of the problem.
Depending on the situation, it may even be appropriate to test
a hypothesis by directly applying the fix associated with it.
If this approach is used, it is important to perform only
one test at a time, and to back out the changes made for each failed
hypothesis before trying the next one. Otherwise, you will not have
a good handle on the root cause of the problem, and you may never be
confident that it will not re-emerge at the worst possible moment.
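The one-change-at-a-time discipline can be sketched as a loop that
applies a candidate fix, checks whether the problem persists, and backs
the change out before moving on. Everything here is hypothetical: the
try_fixes function, the simulated state dictionary, and the two
candidate fixes stand in for real changes and a real check.

```python
def try_fixes(candidates, problem_present):
    """Apply candidate fixes one at a time, backing out each one that fails.

    candidates is a list of (name, apply_fix, revert_fix) tuples.
    Returns the name of the first fix that clears the problem, or None.
    """
    for name, apply_fix, revert_fix in candidates:
        apply_fix()
        if not problem_present():
            return name
        revert_fix()  # restore the previous state before the next test
    return None


# Simulated system: the problem exists only while the resolver entry is stale.
state = {"mtu": 9000, "resolver": "stale"}


def problem_present():
    return state["resolver"] == "stale"


candidates = [
    ("lower the MTU",
     lambda: state.update(mtu=1500),
     lambda: state.update(mtu=9000)),
    ("repoint the resolver",
     lambda: state.update(resolver="current"),
     lambda: state.update(resolver="stale")),
]

confirmed = try_fixes(candidates, problem_present)
print(confirmed, state)
```

Because the failed MTU change is reverted before the second test, the
final state differs from the original in exactly one setting, which
identifies the change that resolved the problem.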
Implement the Fix
The fix needs to be implemented in the least-disruptive, lowest-cost
manner possible. Ideally, the fix should be applied in a way that
lets us confirm that the fix itself, rather than some incidental
change, has resolved the problem.
Verify the Fix
We need to check that the problem is resolved, and also that we
have not introduced any new problems. Each service in our environment
should have a test suite associated with it so that we can quickly
rule out the possibility that the fix has introduced a new problem.
Part of this verification should include a root-cause analysis to make
sure that the real problem has been resolved. Band-Aid solutions are
not really solutions.
Document the Resolution
Over time, the collection of data on resolved problems can become a
valuable resource. It can be referenced to deal with similar problems.
It can be used to track recurring problems over time, which can help
with a root-cause analysis. Or it can be used to continue the
troubleshooting process if it turns out that the problem was not
really resolved after all.