Formatting a 25M-line codebase overnight
Codebase Size and Nature
- Many are struck by the scale: 25M–42M lines in one repo. Others note that it’s a monorepo containing most server-side code from ~16 years, not a single app.
- Comparisons are made to other large companies with massive monorepos; vendored dependencies are noted as inflating raw LoC counts.
- Some infer that much code likely isn’t directly handling card transactions; sensitive PCI/vaulting logic is described as being isolated in a separate, locked-down repo with mixed languages.
Typing and Language Choices
- Questions arise whether tens of millions of Ruby lines are untyped.
- Commenters point out Stripe’s Ruby type-checker (Sorbet) and link to broader discussions of Ruby typing.
AI, Code Growth, and Quality
- One company reports ~4.5M lines and “exponential” growth, with AI-generated code significantly accelerating bloat.
- Several examples describe LLMs producing overly complex, class-heavy, repetitive code that looks impressive but is fragile or incorrect.
- There’s concern about turning millions of reasonable lines into many more millions of “fluffy filler.”
Formatter Performance and Implementation
- Some are more surprised by the “overnight” aspect than by size, comparing it to running clang-format on Chromium’s ~21M C++ lines in minutes on old hardware.
- Debate over whether Ruby formatting is inherently slower, or whether the difference comes down mainly to tooling and implementation details.
- Clarification that the discussed Ruby formatter is written in Rust and uses a C parser, countering assumptions it’s Ruby-based.
Big-Bang vs Incremental Reformat
- Several question why Stripe did a single massive reformat instead of incremental opt-in or “ratcheting” approaches.
- Arguments for big-bang: single, well-tested transition; avoids repeated test cycles; clear before/after state; can combine with git blame ignore-revs.
- Arguments for incremental: fewer conflicts with active PRs; can skip files under active development; easier for some teams to manage.
- Others describe practical playbooks: training sessions, staged rollouts, daily scripts that avoid files in open PRs.
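
The "daily script that avoids files in open PRs" idea can be sketched roughly as follows. This is an illustration only, not the playbook any commenter actually described: the function name `next_batch`, its parameters, and the batch size are all hypothetical, and it assumes the caller has already gathered the list of repo files, the set of files already reformatted, and the set of files touched by open PRs.

```python
from fnmatch import fnmatch


def next_batch(all_files, already_formatted, open_pr_files,
               batch_size=500, skip_globs=()):
    """Pick the next files to reformat in an incremental ("ratcheting") rollout.

    Hypothetical helper: skips files that are already formatted, files touched
    by an open PR, and anything matching an opt-out glob such as "vendor/*".
    """
    done = set(already_formatted)
    busy = set(open_pr_files)
    batch = []
    for path in sorted(all_files):  # sorted so each run is deterministic
        if len(batch) >= batch_size:
            break
        if path in done or path in busy:
            continue
        if any(fnmatch(path, pattern) for pattern in skip_globs):
            continue
        batch.append(path)
    return batch
```

Each run would reformat the returned batch and add it to the already-formatted set, so the covered portion of the repo only ever grows. Files skipped because of an open PR get picked up on a later run once that PR merges or closes.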
Correctness Guarantees and Sanity Checks
- Discussion of techniques to ensure formatters are semantics-preserving:
  - Simple checks that input/output match ignoring whitespace.
  - Considering AST or token-stream comparisons, though complexity and performance tradeoffs are noted.
- Acknowledgment that formatters can still introduce subtle bugs, especially around new or unusual syntax.
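
The simplest of these checks, comparing input and output while ignoring whitespace, can be sketched in a few lines. This is a minimal illustration with hypothetical function names, not the check any particular formatter ships. It assumes the formatter only adds, removes, or moves whitespace; a formatter that also normalizes quotes, parentheses, or trailing commas would fail it by design.

```python
def strip_whitespace(source: str) -> str:
    """Remove every whitespace character (spaces, tabs, newlines)."""
    return "".join(source.split())


def same_ignoring_whitespace(before: str, after: str) -> bool:
    """True if the two sources are identical once all whitespace is dropped."""
    return strip_whitespace(before) == strip_whitespace(after)


def first_mismatch(before: str, after: str):
    """Index (in the whitespace-stripped text) of the first difference, or None."""
    a, b = strip_whitespace(before), strip_whitespace(after)
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # No differing character found: either identical, or one is a prefix of the other.
    return None if len(a) == len(b) else min(len(a), len(b))
```

The check is deliberately coarse. It cannot catch a change in whitespace that is itself significant, such as inside a string literal, which is one reason the thread also raises AST or token-stream comparison as a stricter but more expensive alternative.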
Why Format at All?
- Some ask why we care about formatting in an AI-driven future or for machine consumers.
- Replies stress: human readability, easier understanding of complex logic, clean diffs, and reduced “noise” in version control.
- LLMs themselves are reported (by users) to perform better with pretty-printed code, though others question how an LLM could “know” this.