Air traffic failure caused by two locations 3600nm apart sharing 3-letter code
Abbreviation confusion (“nm”)
- Many readers initially interpreted “3600nm” as nanometers, not nautical miles.
- Discussion over correct aviation notation: some say “NM” or “nmi” are standard; others report widespread real-world use of lowercase “nm” in cockpits and marine systems.
- Several note the irony of an article about identifier collisions itself using an ambiguous unit abbreviation.
- Broader gripe that aviation still mixes legacy units (feet, inches of mercury, gallons) and inconsistent conventions.
Failure mode and safety trade‑offs
- The system correctly detected inconsistent flight-plan data and refused to propagate it, but then shut down entirely (including the backup, which ran the same code).
- One camp argues full shutdown is appropriate for safety:
  - If a “valid” plan can’t be processed, either upstream data is untrustworthy or the system itself is faulty.
  - In both cases, continued automatic processing might produce undetected unsafe states, so forcing manual control is safer.
- Others argue it’s unreasonable for a single corrupt plan to halt a national system; better to:
  - Reject or flag the individual plan.
  - Continue tracking the physical aircraft via radar/transponder.
  - Use manual handling only for the problematic flight.
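The second camp's proposal (reject or flag the one plan, keep everything else running) can be sketched as below. This is a minimal illustration under stated assumptions, not the actual system's design: `PlanError`, `process_plan`, `process_all`, and the dictionary shape of a plan are all hypothetical names invented for the example, and the duplicate-code check merely stands in for whatever inconsistency the real system detected.

```python
from __future__ import annotations

from dataclasses import dataclass, field


class PlanError(Exception):
    """Raised when one flight plan cannot be processed safely (hypothetical)."""


@dataclass
class Result:
    accepted: list = field(default_factory=list)     # (plan_id, exit_point) pairs
    quarantined: list = field(default_factory=list)  # (plan_id, reason) pairs


def process_plan(plan: dict) -> str:
    """Hypothetical per-plan step: return the last waypoint of the route."""
    route = plan["route"]
    if len(set(route)) != len(route):
        # Stand-in for "inconsistent flight-plan data".
        raise PlanError(f"duplicate waypoint code in route {route}")
    return route[-1]


def process_all(plans: list[dict]) -> Result:
    """Process every plan; a bad plan is set aside instead of stopping the loop."""
    result = Result()
    for plan in plans:
        try:
            exit_point = process_plan(plan)
        except PlanError as exc:
            # Flag this plan for manual handling and carry on with the rest.
            result.quarantined.append((plan["id"], str(exc)))
        else:
            result.accepted.append((plan["id"], exit_point))
    return result
```

Note that this only encodes one side of the debate: the first camp's objection is precisely that a failure on a "valid" plan may mean the processing code itself cannot be trusted, which a per-plan `try`/`except` does not address.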
Identifiers and waypoint code collisions
- Core technical trigger: two distinct navigation locations (Deauville VOR in France and Devils Lake VOR in the US) share the three-letter code “DVL”.
- Some developers say they would have assumed three-letter identifiers are globally unique; aviation practitioners note that they are only regionally unique and have long been known to collide.
- Suggestions include:
  - Namespacing codes by issuing authority.
  - Using surrogate keys internally while still accepting non-unique human-facing codes.
  - Moving toward globally unique waypoints, though commenters note enormous retrofit cost.
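The first two suggestions above (namespacing by issuing authority, and surrogate keys behind non-unique human-facing codes) can be combined in a small registry. This is a sketch only: the `Waypoint` and `WaypointRegistry` classes and the authority labels `"FR"` and `"US"` are assumptions made for illustration, not part of any real aviation data model.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Waypoint:
    key: int        # surrogate key: unique internally, never typed by a human
    authority: str  # hypothetical namespace, e.g. the issuing state
    code: str       # human-facing code, NOT assumed to be globally unique
    name: str


class WaypointRegistry:
    def __init__(self):
        self._next_key = count(1)
        self._by_code = {}  # code -> list of every waypoint using that code

    def add(self, authority, code, name):
        wp = Waypoint(next(self._next_key), authority, code, name)
        self._by_code.setdefault(code, []).append(wp)
        return wp

    def resolve(self, code, authority=None):
        """Return the single matching waypoint, or refuse to guess."""
        candidates = [
            wp for wp in self._by_code.get(code, [])
            if authority is None or wp.authority == authority
        ]
        if len(candidates) != 1:
            raise LookupError(
                f"{code!r} matches {len(candidates)} waypoints; "
                "an authority is needed to disambiguate"
            )
        return candidates[0]
```

With both DVL entries registered, `resolve("DVL", "FR")` returns the Deauville record, while a bare `resolve("DVL")` raises `LookupError` instead of silently picking one. Once resolved, downstream code would compare the `key` field rather than the three-letter code.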
Robustness, DoS, and input validation
- Multiple comments frame this as a denial-of-service vulnerability: one crafted or unlucky flight plan can halt all automated processing.
- Proposed mitigations:
  - Harden parsers to treat weird but parseable plans as “reject/flag” rather than fatal.
  - Fuzzing and better test suites around edge cases like duplicate waypoints, implicit segments, and long overflight routes.
- Some see the system as “over‑vouchsafing” (failing too hard), given other safety nets (procedural separation, TCAS, etc.), especially over oceanic tracks with limited radar.
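The fuzzing suggestion among the mitigations above can be illustrated with a toy harness built on the standard library's `random` module. Everything here is hypothetical: `parse_route`, `RouteRejected`, and the space-separated route format are invented for the sketch and do not reflect real flight-plan syntax. The property being checked is the one commenters ask for: odd input may be rejected, but it must never fail in any other way.

```python
import random


class RouteRejected(ValueError):
    """The only failure a caller should ever see from parse_route."""


def parse_route(text):
    """Toy parser (hypothetical format): space-separated 3-letter codes."""
    codes = text.split()
    if not codes:
        raise RouteRejected("empty route")
    for code in codes:
        if len(code) != 3 or not code.isalpha() or not code.isupper():
            raise RouteRejected(f"malformed code {code!r}")
    if len(set(codes)) != len(codes):
        raise RouteRejected("duplicate waypoint code")
    return codes


def fuzz(iterations=2000, seed=0):
    """Feed random routes to the parser and count parsed vs rejected inputs."""
    rng = random.Random(seed)  # seeded so a failing run can be reproduced
    tokens = ["DVL", "ABC", "XYZ", "dvl", "D1L", "", "TOOLONG"]
    outcomes = {"parsed": 0, "rejected": 0}
    for _ in range(iterations):
        text = " ".join(rng.choice(tokens) for _ in range(rng.randint(0, 12)))
        try:
            parse_route(text)
            outcomes["parsed"] += 1
        except RouteRejected:
            outcomes["rejected"] += 1
        # Any other exception type propagates and fails the fuzz run.
    return outcomes
```

A production effort would more likely use a coverage-guided fuzzer or a property-based testing library and a corpus of real routes; the point of the sketch is only the contract that every input ends in either a parsed route or a clean rejection.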
Incident management and operational issues
- Excerpts from UK CAA reports highlight non-software contributors to the length and impact of the outage:
  - Key engineer was on-call offsite; physical presence was required for a full restart.
  - Escalation to higher-level engineers and the vendor was slow.
  - The Level 3 engineer didn’t recognize the fault message and needed vendor help.
  - System connectivity and data status documentation were unclear.
  - A password-database placement issue delayed restarting one server, as correct credentials couldn’t be validated.
- Some argue this shows remote-only staffing is insufficient for critical infrastructure; others focus on the need for architectures that support full remote recovery.
Broader statistical and risk discussion
- A long subthread debates whether shutting down air traffic truly results in “zero excess deaths.”
  - One side cites research around 9/11 suggesting increased road injuries (and likely deaths) when people substituted driving for flying.
  - The other side emphasizes lack of statistically significant excess mortality directly attributable to the airspace shutdown and stresses careful use of “excess deaths” vs raw increases.
- Meta-point: in safety discussions, statistical significance, causality, and practical risk modeling can diverge.
Software engineering lessons and meta‑discussion
- Recurring themes:
  - “Falsehoods programmers believe about identifiers”: assuming uniqueness or invariance of human-generated keys.
  - Desire for type systems with physical units and richer invariants; commenters mention languages and libraries that support units but note that mainstream stacks rarely enforce them.
  - Debate over bug-tracker hygiene: whether to keep very old/low-priority bugs open vs close as “won’t fix,” balancing honesty, triage overhead, and the value of long-lived records.
  - Comparisons to chaos engineering (e.g., Netflix’s Chaos Monkey) and periodic synthetic failures to keep fallback paths realistic and well-practiced.
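The wish for unit-aware types in the recurring themes above ties back to the “nm” confusion at the top of the thread. A minimal sketch of the idea, assuming two hypothetical wrapper classes (`NauticalMiles` and `Nanometers`, neither taken from any real library), shows how mixing the two becomes an error rather than a silent number. The check here happens at runtime; languages with unit support in the type system can reject the same mistake at compile time.

```python
from dataclasses import dataclass

METERS_PER_NAUTICAL_MILE = 1852  # exact, by international definition


@dataclass(frozen=True)
class NauticalMiles:
    value: float

    def to_meters(self):
        return self.value * METERS_PER_NAUTICAL_MILE

    def __add__(self, other):
        if not isinstance(other, NauticalMiles):
            return NotImplemented  # Python then raises TypeError
        return NauticalMiles(self.value + other.value)


@dataclass(frozen=True)
class Nanometers:
    value: float

    def to_meters(self):
        return self.value * 1e-9

    def __add__(self, other):
        if not isinstance(other, Nanometers):
            return NotImplemented
        return Nanometers(self.value + other.value)
```

With these types, `NauticalMiles(3600) + NauticalMiles(100)` yields `NauticalMiles(3700)`, while `NauticalMiles(3600) + Nanometers(3600)` raises `TypeError`, so the ambiguity of a bare “3600nm” has to be resolved where the value is constructed.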