This post was originally written while I was at LShift / Oliver Wyman.
Adverse events happen – a website breaks down, a project doesn’t get delivered on time – and one proposed technique for finding ‘the root cause’ is to ask the “5 Whys”. Attributed to Sakichi Toyoda in the 1930s and adopted by Toyota and by other formal methodologies, it is basically the technique of stating a fault and then asking “Why did that happen?”, repeating until you get to a cause that ‘feels’ like the root. The name comes from the observation that 5 repetitions are usually enough.
I find this problematic for several reasons:
- This is the analysis technique of an attention-seeking three-year-old
- It often raises the question “Why did you do that!!!” and the resulting blame game never helps…
- In my experience there is never a single root cause
Let’s pick a well-known example: “why did so many people die in the Titanic disaster?”
- The lookout didn’t see the iceberg
- The message didn’t get back to the helm
- The management insisted on cheap rivets
- The bulkheads were too low because of the ballroom
- There weren’t enough lifeboats
- Nearby ships didn’t recognise the meaning of the flares
- The “SOS” radio sequence wasn’t well known (at the time)
And so on… Maybe some of these are debatable, but the point still stands: in any significant failure it’s usually the case that a whole sequence of partial failures had to happen for the main failure to occur. Fixing any one of them could have prevented, or at least reduced, the disaster, but it’s clearly better to fix as many as possible (which may also prevent other, related failure scenarios).
This is why I prefer the 4 (or 5) whats:
- What happened (what were the symptoms)? Be precise and objective:
  - “the service melted down” is not enough (if you look in the server room there will be no puddles of plastic or aluminium in sight)
  - “the service had a latency greater than N seconds resulting in new connections being rejected” is objective.
- What did we do as an immediate workaround?
- What was the damage (and what do we need to do to make up for it)?
- What do we have to do to ensure it never happens again? This will usually be a set of actions, not just one.
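If you want to keep these answers in a consistent shape, the four whats can be captured as a simple record. What follows is only a minimal sketch in Python: the `IncidentReview` class, its field names and the `problems` check are my own illustrative assumptions, not an established template, and the example values (apart from the symptom quoted above) are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentReview:
    """Hypothetical record with one field per 'what'."""
    what_happened: str          # precise, objective symptoms
    immediate_workaround: str   # what was done straight away
    damage: str                 # impact, and what is needed to make up for it
    preventive_actions: list[str] = field(default_factory=list)  # usually several

    def problems(self) -> list[str]:
        """Return the gaps in this review; an empty list means none were found."""
        gaps = []
        for name in ("what_happened", "immediate_workaround", "damage"):
            if not getattr(self, name).strip():
                gaps.append(f"{name} is empty")
        # The last 'what' is usually a set of actions, not just one.
        if len(self.preventive_actions) < 2:
            gaps.append("expected a set of preventive actions, not just one")
        return gaps


# Placeholder values for illustration only.
review = IncidentReview(
    what_happened=(
        "The service had a latency greater than N seconds, "
        "resulting in new connections being rejected."
    ),
    immediate_workaround="<placeholder: what we did on the day>",
    damage="<placeholder: who was affected and how we make up for it>",
    preventive_actions=["<placeholder action 1>", "<placeholder action 2>"],
)
print(review.problems())
```

Note that there is deliberately no `root_cause` field and no `who` field: the record has room for a list of actions, but nowhere to put blame.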
This avoids any blame game, avoids the futile attempt to pin down a single ‘root’ cause and, most importantly, gives your engineering team the new information and the ownership they need to apply their creativity and come up with the best solutions. After all, they should be the people who know most about the system.