The Tolerance Stack: How Small Errors Compound Into Catastrophic Failures

Engineers who design mechanical parts know about tolerance stacking. Every manufactured component has tiny dimensional variations: a bolt that's supposed to be 10mm might be 10.02mm, a hole drilled at 9.98mm, a gasket compressed 0.05mm beyond spec. Each deviation is acceptable on its own. Stack five components in sequence, each at the edge of its tolerance range, and suddenly your assembly doesn't fit. The whole thing fails not because any single part was defective, but because all the small imprecisions accumulated in the same direction.

Artistic arrangement of red and blue dice in stacks casting shadows on a white surface. Photo by DS stories on Pexels.

This principle extends far beyond manufacturing. Most organizational disasters follow the same pattern, and most decision-makers miss it entirely.

Why We're Wired to Think in Single Causes

Human cognition gravitates toward single-cause explanations. A plane crashes; investigators look for the one critical failure. A project collapses; the postmortem identifies the one bad decision. This isn't laziness. It's how narrative works, and narrative is how we make sense of events. But complex systems rarely fail from a single cause. They fail when multiple small deviations align.

James Reason's Swiss Cheese Model captures this: each layer of a system has holes (weaknesses), but as long as the holes don't line up, no harm passes through. When they align, a failure pathway opens. Reason developed this model studying industrial accidents, but it describes nearly every organizational catastrophe you've seen. The point is the alignment, not any individual hole.

The problem with the Swiss Cheese framing is that it implies a relatively static picture. Holes are either there or not. Tolerance stacking is more dynamic and more insidious. Deviations accumulate gradually, each small drift normalized by the people living inside the system.

Normalization of Deviance

Sociologist Diane Vaughan coined the term "normalization of deviance" studying the Challenger disaster. Over years, NASA engineers observed O-ring erosion at temperatures above the design threshold. Each launch that didn't catastrophically fail taught them that the erosion was acceptable. The deviation from design spec became the new normal. On January 28, 1986, the temperature was 29 degrees Fahrenheit. The tolerances stacked past their limit.

The mechanism repeats across industries. A hospital repeatedly diverts ambulances because of capacity issues; staff begins to treat diversion as standard operating procedure. A financial institution repeatedly makes exceptions to lending criteria during a bull market; exceptions become policy. Each individual decision carries a justification. The cumulative drift is invisible until it isn't.

Asking "was this individual decision defensible?" will not catch tolerance stacking. You need a different question: "How far has our actual practice drifted from the standard this system was designed around?"

Detecting the Stack Before It Collapses

Here's what makes this genuinely hard: the components of a tolerance stack often don't look like problems. They look like successful adaptations. Workarounds that stuck. Pragmatic adjustments. The team that quietly bypasses a bureaucratic approval step because it never adds value. The contractor who always delivers late but whose work is good. The monitoring system that everyone knows gives false positives, so alerts get ignored.

Some practical detection approaches:

graph TD
    A[Identify system design baseline] --> B(Audit actual current practice)
    B --> C{Deviation present?}
    C -->|No| D[Document and continue]
    C -->|Yes| E(Assess deviation direction)
    E --> F{Does deviation reduce margin?}
    F -->|No| D
    F -->|Yes| G[Flag for tolerance stack review]
    G --> H((Decide: correct or formally rebaseline))

The branching point at H matters more than it looks. Sometimes the right answer is to correct the drift. Sometimes the drift has revealed that the original design standard was wrong, and the system should be formally rebaselined. Conflating the two is its own failure mode: organizations that pretend the drift doesn't exist, or that paper over it by rewriting the standard without examining whether multiple drifts are now compounding.

The Margin Question

Good engineers build margin into designs precisely because tolerances stack. Safety factors. Redundancy. Buffer capacity. The question for any decision-maker running a complex operation is: where are your margins, and how much have they eroded?

This requires more than a checklist. It requires understanding which deviations are independent and which are correlated. Correlated deviations are catastrophically more dangerous. If your team is understaffed, fatigued, and operating aging equipment simultaneously, those aren't three separate small problems. They're a stack, and they push in the same direction.

The disasters that feel sudden almost never are. They're the moment when accumulated drift finally exceeded the remaining margin. Someone knew about each individual piece. Nobody added them up.

Start adding them up now, before the stack settles.

The Tolerance Stack: How Small Errors Compound Into Catastrophic Failures

Related Reading

The Cobra Effect: When Your Solution Becomes the Problem

The Illusion of Control: Why Tightening Your Grip on Uncertainty Makes Things Worse

The Weight of Weak Signals: How to Detect Early Warning Signs Before They Become Crises