2025-06-06

How we decreased GitLab repo backup times from 48 hours to 41 minutes

Nature of the bug and the fix

Backup slowness was traced to a 15-year-old git bundle create function with an O(N²) duplicate-ref check implemented as a nested loop over refs.
The fix replaced this with a set/map-based deduplication, turning the critical section into O(N) (or N·logN depending on implementation), yielding a ~6x improvement locally but much larger end-to-end impact for GitLab (48 hours → 41 minutes).
Commenters note this is a textbook “use a hash/set instead of a quadratic scan” situation, and that in C it’s common to default to arrays/loops because container usage is more cumbersome.

Debate over “exponential” language

The article’s phrase “reducing backup times exponentially” in the same sentence as big-O notation drew heavy criticism.
Many argued that in a technical performance post, “exponential” should not be used colloquially; it created confusion and wasted readers’ time trying to locate an actual exponential-time algorithm.
Suggested alternatives: “from quadratic to linear,” “dramatically,” “by a factor of n,” or explicitly stating the new big-O.
A few defended colloquial use, but most saw it as sloppy or misleading in this context; the author later acknowledged the issue.

Quadratic complexity in practice

Multiple anecdotes support the idea that O(N²) is the “sweet spot of bad algorithms”: fast in tests, disastrous at scale (sometimes far beyond quadratic in the wild).
Discussion covers when N is small and bounded (e.g., hardware-limited cases) where quadratic or even cubic can be acceptable, versus unbounded N where it’s a latent production time bomb.
Some advocate systematically eliminating N² unless N has a hard, low upper bound, and adding explicit safeguards if assumptions about N might change.

C, data structures, and Git implementations

Several comments note this as an example that C alone doesn’t guarantee speed; algorithm and data structure choices dominate.
Others argue C makes these mistakes likelier because sets/maps aren’t standard and are harder to integrate than in languages with built-in containers.
Thread lists alternative Git implementations/libraries (Rust, Go, Java, Haskell, C libraries), with some using them successfully in production.

Backup strategies and alternatives

Many question why GitLab relies on git bundle instead of filesystem-level backups (e.g., ZFS/Btrfs snapshots plus send/receive).
Defenses: direct filesystem copying can bypass Git’s integrity checking and is tricky across diverse filesystems and self-hosted environments; bundles provide a portable, Git-aware backup format.
Offsite replication of ZFS snapshots and snapshot consistency issues (e.g., with Syncthing) are discussed; Git’s lack of a WAL-like mechanism makes naive snapshotting risky in some setups.

Profiling and tooling

The flame graph in the article was produced via perf plus FlameGraph tooling; alternatives recommended include pprof, gperftools, and Samply with Firefox’s profiler.
Several commenters emphasize that algorithmic changes uncovered via profiling dwarf typical micro-optimizations.

Reactions to GitLab and the write-up

Some praise the upstream contribution (landing in Git v2.50.0) and note GitLab now has a dedicated Git team.
Others complain about GitLab’s slow UI and the blog post’s style: too long, possibly LLM-like, and missing concrete code snippets, though still useful once skimmed to the core technical section.

Related topics