Formatting a 25M-line codebase overnight

Codebase Size and Nature

  • Many are struck by the scale (25M–42M lines in one repo), but others note that it is a monorepo holding most of the server-side code written over ~16 years, not a single app.
  • Commenters compare it to other large companies with massive monorepos and note that vendored dependencies inflate raw LoC counts.
  • Some infer that much of the code likely does not handle card transactions directly; sensitive PCI/vaulting logic is described as isolated in a separate, locked-down repo with mixed languages.

Typing and Language Choices

  • Some ask whether the tens of millions of lines of Ruby are untyped.
  • Commenters point out Stripe’s Ruby type-checker (Sorbet) and link to broader discussions of Ruby typing.

AI, Code Growth, and Quality

  • One company reports ~4.5M lines and “exponential” growth, with AI-generated code significantly accelerating bloat.
  • Several examples describe LLMs producing overly complex, class-heavy, repetitive code that looks impressive but is fragile or incorrect.
  • There’s concern about turning millions of reasonable lines into many more millions of “fluffy filler.”

Formatter Performance and Implementation

  • Some are more surprised by the “overnight” aspect than by size, comparing it to running clang-format on Chromium’s ~21M C++ lines in minutes on old hardware.
  • Commenters debate whether Ruby formatting is inherently slower or whether the difference comes mainly from tooling and implementation details.
  • One clarification: the Ruby formatter under discussion is written in Rust and uses a C parser, countering the assumption that it is Ruby-based.

Big-Bang vs Incremental Reformat

  • Several question why Stripe did a single massive reformat instead of an incremental opt-in or “ratcheting” approach.
  • Arguments for big-bang: a single, well-tested transition; no repeated test cycles; a clear before/after state; and it pairs well with git blame's --ignore-revs-file option, which keeps the reformat commit out of blame output.
  • Arguments for incremental: fewer conflicts with active PRs; can skip files under active development; easier for some teams to manage.
  • Others describe practical playbooks: training sessions, staged rollouts, daily scripts that avoid files in open PRs.
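
The "daily script that avoids files in open PRs" idea can be sketched roughly as follows. This is a minimal illustration, not anyone's actual tooling: the function name files_safe_to_format, the batch_size parameter, and the file paths are all hypothetical, and how the set of files touched by open PRs is obtained is left out entirely.

```python
from typing import Iterable, List, Set


def files_safe_to_format(
    candidates: Iterable[str],
    files_in_open_prs: Set[str],
    batch_size: int,
) -> List[str]:
    """Pick up to batch_size not-yet-formatted files that no open PR touches.

    Hypothetical helper: a real script would fetch files_in_open_prs
    from the code-review system before each run.
    """
    # Sort so each run is deterministic and easy to audit.
    safe = [path for path in sorted(candidates) if path not in files_in_open_prs]
    return safe[:batch_size]


# Illustrative data only.
candidates = ["lib/a.rb", "lib/b.rb", "lib/c.rb", "lib/d.rb"]
in_flight = {"lib/b.rb"}

batch = files_safe_to_format(candidates, in_flight, batch_size=2)
print(batch)  # ['lib/a.rb', 'lib/c.rb']
```

Files skipped today stay in the candidate list, so they are picked up on a later run once their PRs merge or close.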

Correctness Guarantees and Sanity Checks

  • Commenters discuss techniques for checking that a formatter preserves semantics:
    • Simple checks that input and output match once whitespace is ignored.
    • AST or token-stream comparisons, with the added complexity and performance cost noted as tradeoffs.
  • Acknowledgment that formatters can still introduce subtle bugs, especially around new or unusual syntax.
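
The simplest of these checks, comparing input and output with whitespace ignored, might look like the sketch below. It is an illustration under assumptions, not the method any particular formatter uses; the function names are hypothetical and the Ruby snippets are only sample strings.

```python
def strip_whitespace(source: str) -> str:
    """Remove every whitespace character (spaces, tabs, newlines)."""
    return "".join(source.split())


def same_ignoring_whitespace(before: str, after: str) -> bool:
    """True if the two texts differ only in whitespace."""
    return strip_whitespace(before) == strip_whitespace(after)


original = "def add(a,b)\n  a+b\nend\n"
formatted = "def add(a, b)\n  a + b\nend\n"
corrupted = "def add(a, b)\n  a - b\nend\n"

print(same_ignoring_whitespace(original, formatted))  # True
print(same_ignoring_whitespace(original, corrupted))  # False

# Limitation: whitespace inside string literals is also ignored,
# so this real change slips through the check.
print(same_ignoring_whitespace("puts 'a b'", "puts 'ab'"))  # True
```

The check is cheap but coarse in both directions: it misses changes inside string literals, and it would reject a formatter that makes non-whitespace edits such as normalizing quote style. That gap is what motivates the token-stream and AST comparisons mentioned above.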

Why Format at All?

  • Some ask why formatting matters in an AI-driven future, or when the code's consumers are machines.
  • Replies stress: human readability, easier understanding of complex logic, clean diffs, and reduced “noise” in version control.
  • LLMs themselves are reported (by users) to perform better with pretty-printed code, though others question how an LLM could “know” this.