The Architecture of the Exception: Why Temporary Safety Overrides Become Permanent Risk

How the bypass log becomes the most honest document in your facility — and why nobody reads it.

Eight of twelve channels in override. The oldest waiver, 218 days old, was authorized for 72 hours. Each bypass carries a signed MOC, a documented risk assessment, and a compensatory measure on paper: the same architecture found in every major process safety investigation of the last three decades.

Walk into any control room in a Tier-1 process facility and ask one question: “How many active overrides are on the SIS right now?”

If the answer comes back instantly with a precise number, you are in a well-run facility. If the answer is “let me check” followed by a number larger than the operator expected, you are looking at the most under-managed category of risk in heavy industry: the accumulated inventory of temporary safety bypasses.

This is not a theoretical concern. It is the common forensic signature found in the root cause analyses of Longford (1998), Texas City (2005), Buncefield (2005), Deepwater Horizon (2010), and the 2019 Philadelphia Energy Solutions refinery explosion. In each case, the formal investigation identified one or more safety-critical elements operating under an authorized override at the moment of the event. None of these overrides were illegal. All were documented. Most had compensatory measures on paper.

That is the problem worth examining.

The mechanism: how a 72-hour waiver becomes the operational baseline

The lifecycle of a safety override is well understood and remarkably consistent across industries.

A safety-critical element fails or requires maintenance. The replacement part has a lead time. Production loss from a full unit shutdown is calculated in six or seven figures per day. A Management of Change (MOC) is initiated, a risk assessment is drafted, compensatory measures are defined (typically manual monitoring at a defined frequency), and the override is authorized for a specific duration — usually 24 to 72 hours.

What happens next is where the safety case quietly degrades.

The CSB’s analysis of the 2005 Texas City refinery explosion documented that critical instruments on the raffinate splitter had been malfunctioning for years and that operating procedures had drifted to accommodate them. The Baker Panel report afterward identified “normalization of deviance” as a systemic finding, not an isolated lapse. The same pattern is visible in the Longford Royal Commission findings: alarms that had been persistently active or bypassed had lost their signal value to operators. The Esso plant did not explode because of a single bypass. It exploded because the cumulative effect of many small deviations had reshaped what operators considered “normal.”

This is the mechanism worth naming. A waiver authorized for 72 hours has a half-life measured in organizational memory, not engineering tolerance. By day four, hypervigilance has decayed. By week two, the compensatory manual check is competing with five other tasks. By month three, a shift handover transfers the override to operators who never knew the system in its undegraded state.

The override has not been forgotten. It has been absorbed.

Why compensatory measures rarely compensate

The risk assessment underpinning most overrides assumes that human action can substitute for an automated function during the bypass period. This assumption deserves more scrutiny than it usually receives.

A SIL 2 or SIL 3 rated interlock operating in low-demand mode has an average Probability of Failure on Demand (PFD) between 10⁻⁴ and 10⁻²: 10⁻³ to 10⁻² for SIL 2, and 10⁻⁴ to 10⁻³ for SIL 3. The functional safety standards (IEC 61508, IEC 61511) define these tiers, and an interlock can claim one only because its hardware and logic have been engineered, tested, and proof-tested to deliver that reliability.

The “compensatory measure” of a manual gas check every two hours, an operator round every four hours, or an “increased vigilance” instruction has no defined PFD. It cannot have one. Human reliability in sustained low-frequency monitoring tasks degrades in ways that are well documented in the human factors literature: vigilance decrement is observable within 30 minutes of task onset under controlled conditions, and it is substantially worse under shift work, in alarm-rich environments, and under competing demands.

The substitution is not equivalent. It is rarely close to equivalent. And the gap is widest precisely in the failure modes the original interlock was designed to catch: fast-onset, low-frequency, high-consequence events.

This does not mean compensatory measures are useless. It means the MOC process should treat the override period as a window of elevated risk requiring a specific exposure budget — not as a state of “equivalent safety” that can extend indefinitely.
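The size of that exposure budget can be made concrete with a little arithmetic. The sketch below is illustrative only: it uses the low-demand PFD bands from IEC 61508 and assumes, pessimistically, that the compensatory measure earns no credit (an effective PFD of 1.0) and that demands arrive at a constant rate. The function names and the no-credit default are assumptions made for this example, not part of any standard.

```python
# Low-demand-mode average PFD bands per SIL (IEC 61508 / IEC 61511),
# stored as (lower bound, upper bound).
SIL_PFD_BANDS = {
    1: (1e-2, 1e-1),
    2: (1e-3, 1e-2),
    3: (1e-4, 1e-3),
    4: (1e-5, 1e-4),
}


def risk_reduction_factor(pfd_avg: float) -> float:
    """Risk reduction factor delivered by a function with this average PFD."""
    if not 0.0 < pfd_avg <= 1.0:
        raise ValueError("pfd_avg must be in (0, 1]")
    return 1.0 / pfd_avg


def equivalent_normal_hours(bypass_hours: float, pfd_avg: float,
                            compensatory_pfd: float = 1.0) -> float:
    """Hours of normal operation that carry the same expected number of
    unmitigated demands as `bypass_hours` spent in bypass.

    ASSUMPTIONS (hypothetical): constant demand rate, and the compensatory
    measure fails on demand with probability `compensatory_pfd`
    (default 1.0, i.e. no credit at all).
    """
    return bypass_hours * compensatory_pfd / pfd_avg


if __name__ == "__main__":
    sil2_weakest = SIL_PFD_BANDS[2][1]        # 1e-2, the weakest SIL 2 claim
    for hours in (72, 218 * 24):              # authorized waiver vs. 218 days
        eq_years = equivalent_normal_hours(hours, sil2_weakest) / 8760
        print(f"{hours:>5} h in bypass ~ {eq_years:.1f} years of normal exposure")
```

Under these assumptions, a 72-hour bypass of an interlock at the weakest SIL 2 claim spends roughly ten months of normal exposure, and 218 days spends roughly sixty years. Crediting the compensatory measure shrinks those figures in proportion, but that credit has to be argued, not assumed.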

The cumulative risk problem

A single override is a localized degradation. The standard MOC risk assessment is designed to evaluate exactly that — one barrier at a time, against one defined hazard scenario.

The CSB has repeatedly flagged the limitation of this approach. Process facilities are tightly coupled systems where safety barriers are not independent. The hydrocarbon detection layer is not separable from the deluge system; the high-level alarm on a vessel is not separable from the relief valve sizing assumption; the compressor vibration monitor is not separable from the trip logic that protects downstream equipment.

When multiple overrides are active simultaneously — even in different units — the layer-of-protection analysis (LOPA) that originally justified the facility’s design is no longer accurate. The effective risk has moved, but no individual MOC captures this because each was assessed in isolation.

A practical implication: any QHSE function tracking overrides only at the unit level is structurally blind to the facility’s actual risk posture. A dashboard showing “12 active overrides, all individually assessed” is not equivalent to a facility operating within its design envelope. It is a facility operating in a configuration that was never engineered, tested, or modeled.
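A simplified LOPA calculation shows why isolated assessments understate the combined effect. This is a minimal sketch with hypothetical layer names and PFD values; it gives a bypassed layer no credit (PFD of 1.0) and, like LOPA itself, treats the layers as independent, which is the optimistic case.

```python
from math import prod


def mitigated_frequency(initiating_per_year, layer_pfds, bypassed=frozenset()):
    """Event frequency after crediting each protection layer.

    A bypassed layer is given a PFD of 1.0 (no credit). Layers are assumed
    independent, as in a standard LOPA; shared dependencies would make the
    real figure worse.
    """
    unknown = set(bypassed) - set(layer_pfds)
    if unknown:
        raise KeyError(f"unknown layers: {sorted(unknown)}")
    return initiating_per_year * prod(
        1.0 if name in bypassed else pfd for name, pfd in layer_pfds.items()
    )


# HYPOTHETICAL scenario: initiating event once per 10 years, three layers.
LAYERS = {"high_level_alarm": 1e-1, "sis_trip": 1e-2, "relief_valve": 1e-2}

if __name__ == "__main__":
    design = mitigated_frequency(0.1, LAYERS)
    one = mitigated_frequency(0.1, LAYERS, {"sis_trip"})
    two = mitigated_frequency(0.1, LAYERS, {"sis_trip", "high_level_alarm"})
    print(f"as designed:             {design:.1e}/yr")
    print(f"SIS trip bypassed:       {one:.1e}/yr ({one / design:.0f}x)")
    print(f"trip and alarm bypassed: {two:.1e}/yr ({two / design:.0f}x)")
```

Assessed alone, the trip bypass is a 100-fold increase and the alarm bypass a 10-fold one. Together they are 1000-fold, and that is before any shared dependency between layers is counted.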

What a serious override discipline looks like

The most rigorous override regimes I have seen share a small number of practices. None are novel. All are uncomfortable to implement because they remove flexibility from operations management.

Hard expiry enforced by the control system itself. When the authorized period expires, the override is removed automatically and the unit either trips or escalates to senior management for an extension. The bypass cannot persist by inattention. This is consistent with IEC 61511 guidance on disabling protective functions and is implemented in some form by most major operators in their highest-tier facilities — though enforcement varies.
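The rule itself fits in a few lines. The sketch below illustrates the decision logic only and is not logic solver code; the `Override` record, its field names, the `Action` values, and the example tag are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class Action(Enum):
    KEEP = "keep bypass active"
    REMOVE_AND_ESCALATE = "remove bypass; trip or escalate for extension"


@dataclass(frozen=True)
class Override:
    tag: str                   # instrument tag (hypothetical)
    authorized_at: datetime
    authorized_for: timedelta  # duration stated on the MOC

    @property
    def expires_at(self) -> datetime:
        return self.authorized_at + self.authorized_for


def enforce_expiry(override: Override, now: datetime) -> Action:
    """Expiry is evaluated by the system, not by someone remembering to check."""
    if now >= override.expires_at:
        return Action.REMOVE_AND_ESCALATE
    return Action.KEEP


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 8, 0)
    waiver = Override("LT-101", start, timedelta(hours=72))
    for elapsed in (timedelta(hours=71), timedelta(hours=72), timedelta(days=218)):
        print(elapsed, "->", enforce_expiry(waiver, start + elapsed).value)
```

The design choice that matters is the default: when the clock runs out, the only path that keeps the bypass in place is an explicit new authorization.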

Facility-level override aggregation. A single dashboard maintained by the process safety function, visible to the plant manager and reported weekly to the asset’s senior leadership. Not just count — total exposure (override-hours weighted by criticality), age of oldest override, and overrides that have been renewed more than once.

MOC quality review, not just MOC volume tracking. Sample-based audit of recent MOCs by an independent assessor, focused on whether compensatory measures are realistic and verifiable. Many MOCs fail this test on inspection. The MOC process was never the problem; the absence of any audit of MOC quality is.

Override aging review. Any override older than 30 days receives executive-level review with a defined repair date. Any override older than 90 days is treated as a structural deficiency requiring a capital decision, not an operational one.
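The aggregation and aging practices above reduce to a handful of metrics computed across every unit at once. The following is a minimal sketch, assuming a hypothetical `ActiveOverride` record and an assumed criticality weighting; the 30-day and 90-day thresholds come from the text, and everything else is illustrative.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ActiveOverride:
    tag: str            # instrument or function tag (hypothetical)
    unit: str
    started_at: datetime
    criticality: float  # ASSUMED weighting scheme, e.g. 1.0 for SIL 1, 3.0 for SIL 3
    renewals: int       # times the original authorization was extended

    def age_days(self, now: datetime) -> float:
        return (now - self.started_at).total_seconds() / 86400.0


def review_tier(age_days: float) -> str:
    """Map override age to the review level described in the text."""
    if age_days > 90:
        return "capital decision"
    if age_days > 30:
        return "executive review"
    return "routine"


def facility_summary(overrides, now):
    """Facility-wide exposure metrics, aggregated across all units."""
    ages = [o.age_days(now) for o in overrides]
    return {
        "count": len(overrides),
        "weighted_override_hours": sum(
            a * 24.0 * o.criticality for a, o in zip(ages, overrides)
        ),
        "oldest_age_days": max(ages, default=0.0),
        "renewed_more_than_once": sorted(o.tag for o in overrides if o.renewals > 1),
        "by_tier": {o.tag: review_tier(a) for a, o in zip(ages, overrides)},
    }
```

A count of two overrides looks very different once one of them turns out to be 218 days old, renewed four times, and sitting in the "capital decision" tier.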

Decoupling override authorization from production pressure. The authority to extend an override should not sit with the same person whose performance metrics include production output. This is the single hardest organizational change because it cuts against how operations are typically managed, but it is the one that most directly addresses the root cause.

The honest conversation

Temporary overrides do not accumulate because QHSE functions are weak or operators are reckless. They accumulate because the cost of the override is borne in a probability distribution that the organization rarely sees, while the cost of the shutdown is borne in a P&L line item that everyone sees on Monday morning.

This asymmetry is the actual engineering problem. Until the organization makes the cost of a sustained override visible — in budget, in executive attention, in board reporting — the gravitational pull toward bypass-and-continue will dominate every other intention in the safety management system.

The bypass log is, in many facilities, the most honest document in the building. It records what the organization actually tolerates, as opposed to what its policies claim. For QHSE functions, the question worth asking this quarter is straightforward: when did you last read yours in full, and when did the executive team last see it?


Sources referenced: CSB investigation reports on Texas City (2007), Macondo (2016), and PES Philadelphia (2022); Longford Royal Commission Report (1999); Baker Panel Report (2007); IEC 61511-1:2016.
