How NASA built Artemis II’s fault-tolerant computer

Agile/DevOps vs Deterministic Space Systems

  • Several commenters contrast modern Agile/DevOps practices with the discipline needed for deterministic, safety‑critical systems.
  • Some argue Agile can be compatible with strong architecture and reliability; others say in practice it degrades rigor and obscures worst‑case behavior.
  • There’s debate over whether “agile” still means anything or is just a buzzword used to justify rushed, low‑quality work.

Redundancy and Fail-Silent Architecture

  • The article’s “fail-silent” quad-redundant design attracts attention: each pair of CPUs checks its own results and goes silent when it detects an error, while a higher-level system uses the first channel that is still healthy.
  • Commenters contrast this with classic triple-voting systems, noting the different trust model (self-detection vs external voting).
  • Several commenters ask what happens if both CPUs in a pair produce the same wrong result; replies note this is extremely unlikely but not impossible.
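To make the two trust models concrete, here is a minimal Python sketch. It is not Orion’s flight software: the function names (`self_checking_pair`, `first_healthy`, `majority_vote`) and the toy “CPUs” are hypothetical, and it assumes each CPU is a pure function whose outputs can be compared for exact equality. It also shows the common-mode case raised above, where two identical wrong answers pass the pair’s self-check.

```python
from collections import Counter
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical model: each "CPU" is just a function from an input to an output.
Cpu = Callable[[int], int]


def self_checking_pair(cpu_a: Cpu, cpu_b: Cpu, x: int) -> Optional[int]:
    """Fail-silent pair: both CPUs compute the same step; a mismatch means the pair emits nothing."""
    a, b = cpu_a(x), cpu_b(x)
    return a if a == b else None  # None models the pair "going silent"


def first_healthy(pairs: Sequence[Tuple[Cpu, Cpu]], x: int) -> Optional[int]:
    """Higher-level selection: use the output of the first pair that did not go silent."""
    for cpu_a, cpu_b in pairs:
        out = self_checking_pair(cpu_a, cpu_b, x)
        if out is not None:
            return out
    return None  # every channel went silent


def majority_vote(cpus: Sequence[Cpu], x: int) -> Optional[int]:
    """Classic voting: an external voter trusts whichever output a strict majority agrees on."""
    if not cpus:
        return None
    value, count = Counter(cpu(x) for cpu in cpus).most_common(1)[0]
    return value if count > len(cpus) // 2 else None  # no majority -> no trusted output


def good(x: int) -> int:
    return 2 * x


def faulty(x: int) -> int:
    return 2 * x + 1  # hypothetical fault: off-by-one result


if __name__ == "__main__":
    # Channel 1 disagrees internally and goes silent, so channel 2's output is used.
    print(first_healthy([(good, faulty), (good, good), (good, good), (good, good)], 21))  # 42
    # Common-mode case: both CPUs produce the same wrong answer, so the self-check passes.
    print(first_healthy([(faulty, faulty)], 21))  # 43
    # Voting masks one bad CPU out of three.
    print(majority_vote([good, good, faulty], 21))  # 42
```

The design difference shows up in where trust lives: in the pair model each channel vouches for itself and the selector only needs to notice silence, while in the voting model no single unit is trusted and the voter must compare all outputs.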

Hardware, RTOS, and Radiation Hardening

  • Discussion mentions radiation-hardened PowerPC (RAD750-class) CPUs, small amounts of RAM, and RTOSes such as INTEGRITY-178 and VxWorks.
  • Time-Triggered Ethernet and ARINC-style scheduling are framed as long-established in aerospace/automotive safety systems.
  • Others note that rad-hard fabrication processes lag commercial nodes by many generations, and that such systems rely heavily on redundancy.

Backup Flight Software and Dissimilar Redundancy

  • A detailed comment describes Orion’s separate Backup Flight Software stack, which runs on a different CPU and OS and is built on NASA’s core Flight System (cFS) framework.
  • This “dissimilar redundancy” is praised for guarding against common-mode software failures, though others point out that even independent teams can reproduce the same design bug.

Skepticism, Cost, and Comparisons

  • Some see the system as overengineered and bureaucratic, “throwing money at redundancy,” and point to Artemis’ cost and schedule issues.
  • Others stress that human-rated spaceflight demands extreme reliability and can’t be compared to web apps or even unmanned systems.

Reliability Culture vs Modern Software

  • Multiple comments lament the perceived decline in software quality in mainstream development.
  • Some reflect on how “good enough” differs by domain: what is acceptable for CRUD apps is not acceptable for life-critical guidance systems.