About Me

Experienced Information Technology leader, author, system administrator, and systems architect.

Tuesday, April 2, 2013

Root Cause Analysis

Sometimes we end up "fixing" the same problem over and over. Root Cause Analysis helps us make sure that we have actually resolved the root cause of the problem.

5 Whys

For most problems, we can get to the root cause by drilling into each proposed explanation and repeatedly asking "Why?" The 5 Whys method was developed by the Toyota Motor Corporation. It is based on the observation that five iterations of asking "Why?" are usually enough to get to the root cause of most real-world problems.

For example:
Problem Statement: The system crashed. (Why?)
A memory chip failed. (Why?)
The machine room temperature exceeds recommendations. (Why?)
The HVAC unit is undersized given our heat load. (Why?)
Our projections for heat load were lower than what has been observed. (Why?)
We did the heat load projections ourselves rather than bringing in a qualified expert.
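
The chain above can be written down as a simple ordered list, which makes it easy to see how far a given number of "Why?" questions takes us. This is only a minimal sketch: the names `why_chain`, `answer_after`, and `proposed_root_cause` are made up for illustration and are not part of the 5 Whys method itself.

```python
# Hypothetical representation: each entry answers "Why?" about the entry before it.
why_chain = [
    "The system crashed.",
    "A memory chip failed.",
    "The machine room temperature exceeds recommendations.",
    "The HVAC unit is undersized given our heat load.",
    "Our projections for heat load were lower than what has been observed.",
    "We did the heat load projections ourselves rather than bringing in a qualified expert.",
]


def answer_after(chain, whys):
    """Return the explanation reached after asking "Why?" `whys` times."""
    if not 0 <= whys < len(chain):
        raise ValueError("no answer recorded at that depth")
    return chain[whys]


def proposed_root_cause(chain):
    """The last recorded answer is only as deep as the participants chose to go."""
    return chain[-1]


# Print the chain with the problem statement first, then each numbered "Why?".
for depth, statement in enumerate(why_chain):
    label = "Problem" if depth == 0 else f"Why #{depth}"
    print(f"{label}: {statement}")
```

Calling `answer_after(why_chain, 3)` returns the undersized HVAC unit, which is where a team that stops two questions early would end up.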

Some disadvantages of the 5 Whys method are:

  • The results are not repeatable. We may well end up with different results depending on who runs the exercise. For example, what if we had answered the second "why" with some other plausible explanation?
  • We are limited to the participants' knowledge of the system. In particular, we aren't going to find any answers that the participants don't already suspect.
  • We may not ask "why?" about the right symptoms of the problem.
  • We may stop short and not proceed to the actual root cause of the problem. For example, people may stop at the point about the HVAC unit being undersized, run the estimates themselves, and promptly purchase a larger (but still undersized) unit.

Current Reality Tree

The primary components of a Current Reality Tree (CRT) are boxes describing symptoms and arrows representing the cause-and-effect relationships between them. Symptoms are divided into Undesirable Effects (UDE) and Neutral Effects (NE). This allows us to recognize the effects of things in our environment that are not viewed as undesirable, but which may contribute to a UDE.

Arrows may flow in both directions if necessary. In particular, this allows us to identify a self-reinforcing loop (a vicious cycle) in which an effect feeds back into its own cause.

Two or more symptoms may have their arrows combined with an ellipse. This means that the combination of those symptoms is sufficient to provoke the following UDE, and that all of them are required.

To build a CRT, we ask a Key Question with our Problem Statement. The question will usually be of the form "Why is this happening?" Next, we need to create a list of several Undesirable Effects which are related to the Key Question. Each symptom (UDE or NE) gets a box. Wherever we can say something like "If A, then B," we would draw an arrow from A to B. Where we can say something like "If A is combined with B, then we get C," we would draw arrows from A and B to C, then group the arrows with an ellipse.

At the lowest level of the CRT, we should ask "Why?" and continue to build the tree down until we are at the Root Causes, also known as "Problems." If the lowest level boxes are still just symptoms of an underlying problem, build down as far as possible by asking "Why?" at each stage.
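
One way to think about the structure described above is as a directed graph: boxes are nodes, arrows are links, and an ellipse is a link with more than one cause. The sketch below is an assumption-laden illustration, not Goldratt's notation or any standard tool. The class `CurrentRealityTree` and its methods are hypothetical, and the example symptoms are adapted from the 5 Whys example, with "Machine room is densely packed" invented to show an ellipse.

```python
from dataclasses import dataclass, field


@dataclass
class CurrentRealityTree:
    # Symptom name -> "UDE" (Undesirable Effect) or "NE" (Neutral Effect).
    kinds: dict = field(default_factory=dict)
    # Each link is (causes, effect). More than one cause models arrows
    # grouped with an ellipse: all of the causes are required together.
    links: list = field(default_factory=list)

    def add_symptom(self, name, kind):
        if kind not in ("UDE", "NE"):
            raise ValueError("kind must be 'UDE' or 'NE'")
        self.kinds[name] = kind

    def add_link(self, causes, effect):
        for name in (*causes, effect):
            if name not in self.kinds:
                raise KeyError(f"unknown symptom: {name}")
        self.links.append((tuple(causes), effect))

    def candidate_root_causes(self):
        """Boxes that cause something but have no arrow pointing at them."""
        effects = {effect for _, effect in self.links}
        causes = {cause for group, _ in self.links for cause in group}
        return sorted(causes - effects)


crt = CurrentRealityTree()
crt.add_symptom("System crashes", "UDE")
crt.add_symptom("Memory chip fails", "UDE")
crt.add_symptom("Machine room is too hot", "UDE")
crt.add_symptom("HVAC unit is undersized", "UDE")
crt.add_symptom("Heat load projections were too low", "UDE")
crt.add_symptom("Projections were done in-house", "NE")
crt.add_symptom("Machine room is densely packed", "NE")  # invented for illustration

crt.add_link(["Projections were done in-house"], "Heat load projections were too low")
crt.add_link(["Heat load projections were too low"], "HVAC unit is undersized")
# Two causes grouped with an ellipse: both are needed to produce the effect.
crt.add_link(["HVAC unit is undersized", "Machine room is densely packed"],
             "Machine room is too hot")
crt.add_link(["Machine room is too hot"], "Memory chip fails")
crt.add_link(["Memory chip fails"], "System crashes")

print(crt.candidate_root_causes())
```

Here the boxes with no incoming arrows are the two Neutral Effects, so those are the places to keep asking "Why?" Note that this simple check does not handle loops: a box inside a two-way arrow always has an incoming arrow, so it would never be reported as a candidate.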

In some cases, like the one diagrammed here, the root cause turns out to be a conflict between two Neutral Effects.

Evaporating Cloud and Future Reality Diagrams

The Evaporating Cloud is Eliyahu Goldratt's method for dealing with conflicts. In particular, Goldratt discusses the Core Conflict Cloud, which represents the Core Conflict in our CRT.

In an Evaporating Cloud Diagram, the end goal (also known as the Systemic Objective) is placed in a box on the left. The two conflicting Prerequisite Conditions are placed in boxes at the right-hand side of the drawing, with a lightning bolt arrow between them. The Necessary Conditions for the Systemic Objective are placed in boxes between the objective and their respective conflicting Prerequisite Conditions.

The Evaporating Cloud Diagram illustrates the age-old conflict between upgrades and system stability. On the one hand, upgrades increase system reliability and performance, and neglecting them for too long will eventually result in system problems. On the other hand, changes always carry some risk, so there is a strong desire to avoid the pain of changes, including upgrades.

In this case, we need to recognize the end goal of providing a reliable service. Upgrades need to be performed, but should be performed in a way that allows for adequate planning and testing in order to avoid introducing problems to a working system. This sort of solution "evaporates" the cloud.
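
The five boxes of the cloud can also be captured in a small record and read back as "In order to ..., we must ..." statements, which is a common way of checking a cloud's logic. This is a rough sketch only. The `EvaporatingCloud` class and its field names are hypothetical, and the two Necessary Conditions are my paraphrase of the upgrade example above rather than the wording in the original diagram.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaporatingCloud:
    objective: str        # Systemic Objective (left-hand box)
    condition_a: str      # Necessary Condition on one branch
    prerequisite_a: str   # Prerequisite Condition for condition_a
    condition_b: str      # Necessary Condition on the other branch
    prerequisite_b: str   # Prerequisite Condition that conflicts with prerequisite_a

    def read_aloud(self):
        """Spell out each arrow so that its underlying assumption can be challenged."""
        return [
            f"In order to {self.objective}, we must {self.condition_a}.",
            f"In order to {self.condition_a}, we must {self.prerequisite_a}.",
            f"In order to {self.objective}, we must {self.condition_b}.",
            f"In order to {self.condition_b}, we must {self.prerequisite_b}.",
            f"But we cannot both {self.prerequisite_a} and {self.prerequisite_b}.",
        ]


# Assumed wording for the upgrade-versus-stability conflict.
cloud = EvaporatingCloud(
    objective="provide a reliable service",
    condition_a="keep the system reliable and performant over time",
    prerequisite_a="perform upgrades",
    condition_b="avoid introducing problems to a working system",
    prerequisite_b="avoid changes",
)

for sentence in cloud.read_aloud():
    print(sentence)
```

Reading the fourth statement aloud exposes the assumption that gets challenged: avoiding problems does not strictly require avoiding changes if the changes are planned and tested, and that is the injection that evaporates the cloud.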

We can use this solution to build a Future Reality Tree, which is like a Current Reality Tree, but with our solution injected into the diagram:
