2026-04-14

Lean proved this program correct; then I found a bug

What actually went wrong

The verified part was the core zlib compression/decompression logic in Lean.
Two bugs surfaced:
- A denial-of-service via unverified archive-header parsing in the Lean-zip code.
- A heap buffer overflow in the Lean runtime’s scalar array support, affecting any Lean program using that feature.
Both issues were outside the formal proof’s stated scope, so the core correctness proof was not invalidated.

Limits of formal verification

Multiple comments stress: proofs are always “correct with respect to a specification,” not to human intent or the full system.
Common failure modes:
- Spec bugs (the spec doesn’t capture what was really wanted).
- Spec gaps (important behaviors like parsers, resource limits, or adversarial inputs not modeled).
- Unverified layers (runtime, OS, hardware, compilers).
Some argue a “verified system” should be understood as the whole binary stack; others say that’s unrealistic unless every component is verified.

Trust in verification tools

Concern: why trust the verifier more than the verified code?
Replies: verifiers can be simpler, heavily scrutinized, and used by many, making serious undetected bugs less likely (though not impossible).
Distinction emphasized between Lean’s kernel (logic/proofs) and runtime (execution/FFI); a runtime bug doesn’t directly falsify existing mathematical proofs.

Spec correctness and completeness

Discussion about needing methods to “verify the specs” themselves, including adversarial thinking and high-level property proofs.
Recognized gap between:
- Implementation soundness (“the code does X”) and
- Design soundness (“doing X is actually the right, safe thing”).
Compression is highlighted as a good domain: simple correctness spec, complex implementation, but still vulnerable to unmodeled behaviors (e.g., malformed inputs).

AI, fuzzing, and “vibe-coded” proofs

Several commenters report success using LLMs to help write proofs and models (Lean, TLA+, etc.).
The AI-assisted fuzzing that found the Lean runtime overflow is seen as a significant, positive result.
Enthusiasm that combining AI, fuzzing, and formal methods can narrow where bugs hide, even if perfection remains impossible.

Title and framing debate

Many see the article title as misleading or clickbait, implying a flaw in the proof itself.
Others defend it as fair from an end-user perspective: if the binary crashes or is exploitable, the “verified program” is still buggy in practice.

Related topics