2024-11-14

Air traffic failure caused by two locations 3600nm apart sharing 3-letter code

Abbreviation confusion (“nm”)

Many readers initially interpreted “3600nm” as nanometers, not nautical miles.
Discussion over correct aviation notation: some say “NM” or “nmi” are standard; others report widespread real-world use of lowercase “nm” in cockpits and marine systems.
Several note the irony of an article about identifier collisions itself using an ambiguous unit abbreviation.
Broader gripe that aviation still mixes legacy units (feet, inches of mercury, gallons) and inconsistent conventions.

Failure mode and safety trade‑offs

The system correctly detected inconsistent flight-plan data and refused to propagate it, but then shut down entirely (including the backup, which ran the same code).
One camp argues full shutdown is appropriate for safety:
- If a “valid” plan can’t be processed, either upstream data is untrustworthy or the system itself is faulty.
- In both cases, continued automatic processing might produce undetected unsafe states, so forcing manual control is safer.
Others argue it’s unreasonable for a single corrupt plan to halt a national system; better to:
- Reject or flag the individual plan.
- Continue tracking the physical aircraft via radar/transponder.
- Use manual handling only for the problematic flight.
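
The second camp's proposal can be sketched roughly as below. This is a minimal illustration, not how the real system works: `PlanError`, `process_plan`, `process_all`, and the duplicate-waypoint check are all hypothetical names and rules invented for the example.

```python
from dataclasses import dataclass, field


class PlanError(Exception):
    """Raised when one flight plan cannot be processed (hypothetical)."""


@dataclass
class Result:
    accepted: list = field(default_factory=list)
    # (plan_id, reason) pairs handed to a human controller
    quarantined: list = field(default_factory=list)


def process_plan(plan: dict) -> dict:
    # Assumed rule for illustration only: a route that names the same
    # waypoint code twice is treated as ambiguous and rejected.
    route = plan["route"]
    if len(set(route)) != len(route):
        raise PlanError(f"ambiguous duplicate waypoint in {route}")
    return {"id": plan["id"], "route": route}


def process_all(plans) -> Result:
    result = Result()
    for plan in plans:
        try:
            result.accepted.append(process_plan(plan))
        except PlanError as exc:
            # Isolate the bad plan instead of halting the whole pipeline.
            result.quarantined.append((plan["id"], str(exc)))
    return result


result = process_all([
    {"id": "A1", "route": ["ABC", "DVL", "XYZ"]},
    {"id": "B2", "route": ["DVL", "ABC", "DVL"]},
])
print(result.quarantined)
```

The first camp's objection still applies to a design like this: catching the error per plan assumes the fault really is confined to that plan, which the system cannot always know.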

Identifiers and waypoint code collisions

Core technical trigger: two distinct navigation locations (Deauville VOR in France and Devil’s Lake VOR in the US) share the three-letter code “DVL”.
Some developers say they would have assumed three-letter identifiers are globally unique; aviation practitioners note that the codes are only regionally unique and have long been known to collide.
Suggestions include:
- Namespacing codes by issuing authority.
- Using surrogate keys internally while still accepting non-unique human-facing codes.
- Moving toward globally unique waypoints, though commenters note enormous retrofit cost.
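
The first two suggestions can be combined in a small sketch. It assumes a hypothetical `WaypointRegistry` in which each waypoint gets an internal surrogate key and is namespaced by a `region` label; the `"FR"` and `"US"` labels are illustrative and not an actual aviation namespace scheme.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Waypoint:
    key: int     # surrogate key, unique inside this system only
    region: str  # hypothetical namespace (e.g. issuing authority)
    code: str    # human-facing code, not assumed globally unique
    name: str


class WaypointRegistry:
    def __init__(self):
        self._next_key = count(1)
        self._by_namespace = {}  # (region, code) -> Waypoint
        self._by_code = {}       # code -> list of Waypoints

    def add(self, region, code, name):
        if (region, code) in self._by_namespace:
            raise ValueError(f"{region}:{code} already registered")
        wp = Waypoint(next(self._next_key), region, code, name)
        self._by_namespace[(region, code)] = wp
        self._by_code.setdefault(code, []).append(wp)
        return wp

    def lookup(self, code, region=None):
        if region is not None:
            return self._by_namespace[(region, code)]
        matches = self._by_code.get(code, [])
        if len(matches) != 1:
            # Refuse to guess when the bare code is missing or ambiguous.
            raise LookupError(
                f"{code!r} matches {len(matches)} waypoints; region required"
            )
        return matches[0]


registry = WaypointRegistry()
registry.add("FR", "DVL", "Deauville VOR")
registry.add("US", "DVL", "Devil's Lake VOR")
print(registry.lookup("DVL", region="FR").name)
```

Here a bare `lookup("DVL")` raises `LookupError` rather than silently picking one of the two stations, which is the behaviour the surrogate-key suggestion is meant to make possible.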

Robustness, DoS, and input validation

Multiple comments frame this as a denial-of-service vulnerability: one crafted or unlucky flight plan can halt all automated processing.
Proposed mitigations:
- Harden parsers to treat weird but parseable plans as “reject/flag” rather than fatal.
- Fuzzing and better test suites around edge cases like duplicate waypoints, implicit segments, and long overflight routes.
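
As a rough illustration of the fuzzing suggestion, the sketch below throws random strings at a toy route parser and checks that the only failure it ever produces is a controlled rejection. The parser, the `RejectPlan` exception, and the route format (space-separated three-letter codes) are assumptions made for the example, not the real flight-plan format.

```python
import random


class RejectPlan(Exception):
    """Controlled rejection of one plan (hypothetical)."""


def parse_route(text: str) -> list:
    # Toy format: space-separated, three-letter, uppercase codes.
    codes = text.split()
    if not codes:
        raise RejectPlan("empty route")
    for code in codes:
        if len(code) != 3 or not code.isalpha() or not code.isupper():
            raise RejectPlan(f"bad code {code!r}")
    # Duplicate codes are deliberately allowed through to later stages.
    return codes


def fuzz(parser, iterations=1000, seed=0):
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    alphabet = "ABDLV 1x"
    outcomes = {"ok": 0, "rejected": 0}
    for _ in range(iterations):
        length = rng.randint(0, 24)
        text = "".join(rng.choice(alphabet) for _ in range(length))
        try:
            parser(text)
            outcomes["ok"] += 1
        except RejectPlan:
            outcomes["rejected"] += 1
        # Any other exception propagates and fails the fuzz run.
    return outcomes


print(fuzz(parse_route))
```

A real harness would more likely use a dedicated fuzzing or property-based testing tool and mutate realistic plans rather than random characters; the point here is only the invariant being checked.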
Some see the system as “over‑vouchsafing” (failing too hard), given other safety nets (procedural separation, TCAS, etc.), especially over oceanic tracks with limited radar.

Incident management and operational issues

Excerpts from UK CAA reports highlight non-software contributors to the length and impact of the outage:
- The key engineer was on call offsite, but physical presence was required for a full restart.
- Escalation to higher-level engineers and the vendor was slow.
- The Level 3 engineer didn’t recognize the fault message and needed vendor help.
- System connectivity and data status documentation were unclear.
- A password-database placement issue delayed restarting one server, as correct credentials couldn’t be validated.
Some argue this shows remote-only staffing is insufficient for critical infrastructure; others focus on the need for architectures that support full remote recovery.

Broader statistical and risk discussion

A long subthread debates whether shutting down air traffic truly results in “zero excess deaths.”
One side cites research around 9/11 suggesting increased road injuries (and likely deaths) when people substituted driving for flying.
The other side emphasizes lack of statistically significant excess mortality directly attributable to the airspace shutdown and stresses careful use of “excess deaths” vs raw increases.
Meta-point: in safety discussions, statistical significance, causality, and practical risk modeling can diverge.

Software engineering lessons and meta‑discussion

Recurring themes:
- “Falsehoods programmers believe about identifiers” – assuming uniqueness or invariance of human-generated keys.
- Desire for type systems with physical units and richer invariants; commenters mention languages and libraries that support units but note that mainstream stacks rarely enforce them.
- Debate over bug-tracker hygiene: whether to keep very old/low-priority bugs open vs close as “won’t fix,” balancing honesty, triage overhead, and the value of long-lived records.
- Comparisons to chaos engineering (e.g., Netflix’s Chaos Monkey) and periodic synthetic failures to keep fallback paths realistic and well-practiced.
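
To make the units theme concrete, here is a minimal sketch of a length type that forces the caller to name the unit, so "3600nm" cannot be read two ways. The `Length` class and its constructors are hypothetical and do not follow any particular units library.

```python
from dataclasses import dataclass

METERS_PER_NAUTICAL_MILE = 1852.0  # exact by international definition


@dataclass(frozen=True)
class Length:
    meters: float  # single canonical internal unit

    @classmethod
    def nautical_miles(cls, value):
        return cls(value * METERS_PER_NAUTICAL_MILE)

    @classmethod
    def nanometers(cls, value):
        return cls(value * 1e-9)

    @property
    def km(self):
        return self.meters / 1000.0

    def __add__(self, other):
        # Adding a bare number to a Length is rejected with TypeError.
        if not isinstance(other, Length):
            return NotImplemented
        return Length(self.meters + other.meters)


separation = Length.nautical_miles(3600)
misread = Length.nanometers(3600)
print(separation.km, misread.meters)
```

Read as nautical miles, 3600 nm is about 6,667 km; read as nanometers it is 3.6 micrometers, and the type makes the author choose one explicitly.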
