Intermittent problems are extremely difficult to troubleshoot.
A reproducible problem can always be troubleshot, if for no other
reason than that each individual component can be ruled out
through experimentation. Problems that are not reproducible
cannot be approached in the same way.
Problems present as intermittent for one of two reasons:
- We have not identified the real cause of the problem.
- The problem is being caused by failing or flaky hardware.
The first possibility should be addressed by going back to
brainstorming hypotheses.
It may be helpful to bring a fresh perspective into the
brainstorming session, either by bringing in different people,
or by sleeping on the problem.
The second possibility is tougher. Hardware diagnostic tests
can be run to try to identify the failing piece of
hardware.
The first thing to do is to perform general maintenance on the system.
Re-seat memory chips, processors, expansion boards and hard drives.
Once general maintenance has been performed, test suites like
SunVTS can perform stress-testing on a system to try to
trigger the failure and identify the failing part.
It may be the case, however, that the costs associated with this
level of troubleshooting are prohibitive. In this case, we may
want to attempt to shotgun the problem.
Shotgunning is the practice of replacing potentially failing parts
without having identified them as actually being flaky. In general,
parts are replaced by price point, with the cheapest parts being replaced first.
Though we are likely to inadvertently replace working parts,
the cost of the replacements may be lower than the cost of the
alternatives (like the downtime cost associated with stress testing).
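The cheapest-first ordering and the cost comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the part names, the prices, the `diagnosis_cost` figure and the `shotgun_plan` helper are all hypothetical, and a real decision would also weigh labor and the downtime needed for each swap.

```python
from dataclasses import dataclass


@dataclass
class Part:
    name: str
    price: float  # replacement cost, in any consistent currency unit


def shotgun_plan(parts, diagnosis_cost):
    """Pick parts to replace, cheapest first.

    Stops before the running total would exceed diagnosis_cost, the
    (assumed) cost of the alternative, such as downtime for stress testing.
    """
    plan = []
    total = 0.0
    for part in sorted(parts, key=lambda p: p.price):
        if total + part.price > diagnosis_cost:
            break
        plan.append(part)
        total += part.price
    return plan, total


# Hypothetical candidates and prices, for illustration only.
candidates = [
    Part("system board", 900.0),
    Part("memory module", 120.0),
    Part("disk cable", 15.0),
    Part("power supply", 200.0),
]

plan, total = shotgun_plan(candidates, diagnosis_cost=500.0)
for part in plan:
    print(f"replace {part.name} ({part.price:.2f})")
print(f"total spent: {total:.2f}")
```

With these made-up numbers the plan covers the cable, the memory module and the power supply, and stops short of the system board because replacing it would cost more than the assumed alternative.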
When parts are removed during shotgunning, it is important to discard
them rather than keep them as spares. Any part you remove as part of
a troubleshooting exercise is questionable. (After all, what if a power
surge caused multiple parts to fail? Or what if there was a cascading failure?)
It does not make sense to have questionable parts in inventory; such parts
would be useless for troubleshooting, and putting questionable parts into
service just generates additional downtime down the road.
Shotgunning may violate your service contract if performed without
the knowledge and consent of your service provider.
Regardless of the method used to deal with intermittent problems, it is
essential to keep good records. Relationships between our problem and other
events may only become clear when we look at patterns over time. We can
only be confident that we have really resolved the problem if we can demonstrate
that we've gone well beyond the usual recurrence interval without
the problem re-emerging.
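As a rough sketch of how such records might be used, the Python below computes the average gap between logged failures and checks whether the current quiet period is well beyond it. The `mean_interval` and `looks_resolved` helpers, the sample timestamps and the `safety_factor` of 3 are assumptions chosen for illustration; how far "well beyond" needs to be is a judgment call for each site.

```python
from datetime import datetime, timedelta


def mean_interval(failure_times):
    """Average gap between consecutive recorded failures."""
    ordered = sorted(failure_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("need at least two recorded failures")
    return sum(gaps, timedelta()) / len(gaps)


def looks_resolved(failure_times, now, safety_factor=3.0):
    """True if the time since the last failure exceeds
    safety_factor times the usual gap between failures.

    safety_factor is an assumed threshold, not an established standard.
    """
    quiet_period = now - max(failure_times)
    return quiet_period > mean_interval(failure_times) * safety_factor


# Hypothetical failure log: one failure per week for three weeks.
failures = [
    datetime(2024, 3, 1, 9, 0),
    datetime(2024, 3, 8, 9, 0),
    datetime(2024, 3, 15, 9, 0),
]

print(mean_interval(failures))
print(looks_resolved(failures, now=datetime(2024, 3, 20, 9, 0)))
print(looks_resolved(failures, now=datetime(2024, 4, 10, 9, 0)))
```

Five quiet days after a weekly failure pattern proves little; twenty-six quiet days is more than three times the usual gap, so the second check passes.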