2025-12-26

Package managers keep using Git as a database, it never works out

Scope of the Problem: Git vs. GitHub vs. Filesystems

Several commenters argue the core issues are not Git’s Merkle-tree data model, but:
- Git’s network protocol (inefficient transfers, shallow/sparse behavior).
- GitHub’s hosting constraints, rate limits, and monorepo scaling.
Others agree that “having every client clone the whole index” is the real design mistake: each client does O(n) work in the size of the index when it only cares about an O(1) subset.
Some push back that the Nixpkgs example is misused: it’s literally a source repo, and many of its pain points are about GitHub scale and monorepo size, not “Git as a database” per se.
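The O(n) versus O(1) point can be made concrete with a toy model. This is only a sketch: the index is an in-memory dict, and the package names, metadata shape, and function names (`make_index`, `bytes_for_full_clone`, `bytes_for_lookup`) are invented for illustration, not taken from any real package manager.

```python
# Toy model of an index: package name -> serialized metadata.
# Every name and size below is hypothetical.

def make_index(n_packages: int) -> dict[str, bytes]:
    """Build a fake index with one small metadata record per package."""
    return {
        f"pkg{i}": f'{{"name": "pkg{i}", "version": "1.0.0"}}'.encode()
        for i in range(n_packages)
    }

def bytes_for_full_clone(index: dict[str, bytes]) -> int:
    """Cost when every client downloads the whole index: grows with n."""
    return sum(len(record) for record in index.values())

def bytes_for_lookup(index: dict[str, bytes], wanted: list[str]) -> int:
    """Cost when a client fetches only the records it asks for."""
    return sum(len(index[name]) for name in wanted)

if __name__ == "__main__":
    index = make_index(100_000)
    deps = ["pkg1", "pkg42", "pkg999"]
    print("full clone:", bytes_for_full_clone(index), "bytes")
    print("per-package:", bytes_for_lookup(index, deps), "bytes")
```

The second cost depends only on how many packages the user actually resolves, which is the property the per-package HTTP and sparse-index designs in the thread are after.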

Architectural Alternatives and Patterns

Common suggested pattern:
- Keep Git as authoritative source for manifests/recipes.
- Generate a compact index (often SQLite or similar) and/or static metadata, then distribute via HTTP/CDN, rsync, or OCI registries.
- Examples cited: MacPorts (rsync + index), Gentoo (git → rsync), WinGet (SQLite index), Hackage (append-only tar index), Nix’s older channel tarballs + SQLite index, OCI backends for Homebrew.
SQLite is frequently mentioned as an “ideal” local index, but people warn against storing a monolithic SQLite file directly in Git (binary, no good diffs/merges). Better: text manifests in Git → compiled to SQLite.
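A minimal sketch of the "text manifests in Git, compiled to SQLite" idea, using only the Python standard library. The manifest layout (one JSON file per package version with `name`, `version`, and `url` keys), the table schema, and the function names are assumptions made for this example, not the format of any project named above.

```python
import json
import sqlite3
from pathlib import Path

def build_index(manifest_dir: Path, db_path: Path) -> int:
    """Compile every *.json manifest in manifest_dir into a SQLite index.

    Returns the number of rows written. The database is a build artifact:
    it is regenerated from the text manifests and never committed to Git.
    """
    rows = []
    for path in sorted(manifest_dir.glob("*.json")):
        manifest = json.loads(path.read_text(encoding="utf-8"))
        rows.append((manifest["name"], manifest["version"], manifest["url"]))

    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commit on success, roll back on error
            conn.execute("DROP TABLE IF EXISTS packages")
            conn.execute(
                "CREATE TABLE packages ("
                " name TEXT NOT NULL,"
                " version TEXT NOT NULL,"
                " url TEXT NOT NULL,"
                " PRIMARY KEY (name, version))"
            )
            conn.executemany("INSERT INTO packages VALUES (?, ?, ?)", rows)
    finally:
        conn.close()
    return len(rows)

def lookup(db_path: Path, name: str) -> list[tuple[str, str]]:
    """Return (version, url) pairs for one package via an indexed query."""
    conn = sqlite3.connect(db_path)
    try:
        # Note: ORDER BY here is plain string ordering, not semver ordering.
        return conn.execute(
            "SELECT version, url FROM packages WHERE name = ? ORDER BY version",
            (name,),
        ).fetchall()
    finally:
        conn.close()
```

In this arrangement reviewers still diff and merge small text files, while clients download one compact file and answer lookups locally.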
Some see Fossil/other SCMs or distributed databases (CRDT-based, ledger-like, TUF-inspired designs) as promising, but adoption and complexity are open questions.

Scaling, Ethics, and “Do the Easy Thing First”

One camp: starting on Git/GitHub is rational—free hosting, trivial to implement, great for early adoption. When scale hurts, migrate; many successful ecosystems (Cargo, Homebrew, Julia) did exactly that.
Opposing camp: this is short‑sighted or even “unethical”; known scaling pitfalls are deferred until change is extremely expensive or impossible, creating long‑term technical debt and user pain.
Counter‑argument: most projects never reach that scale; over‑engineering early wastes scarce volunteer time. For package managers, though, “if it succeeds, it will hit scale,” so design should anticipate that.

Ecosystem-Specific Notes

Go modules: discussion of the old go get behavior (cloning repos to read go.mod), the dramatic speedup from module proxies, and workarounds for private/self‑hosted Git (GOPRIVATE, SSH, custom CAs).
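To illustrate why a module proxy avoids cloning, here is a sketch of how the proxy's plain HTTP endpoints are addressed, following the GOPROXY protocol as documented in the Go modules reference. The helper names `escape_path` and `proxy_urls` are made up for this example, and the code only builds URLs; it performs no network requests.

```python
def escape_path(path: str) -> str:
    """Escape a module path or version for use in a proxy URL.

    The GOPROXY protocol replaces each uppercase ASCII letter with '!'
    followed by its lowercase form, so paths stay distinct on
    case-insensitive file systems.
    """
    return "".join(
        "!" + ch.lower() if "A" <= ch <= "Z" else ch
        for ch in path
    )

def proxy_urls(module: str, version: str,
               proxy: str = "https://proxy.golang.org") -> dict[str, str]:
    """Build the per-module endpoints a client can fetch with simple GETs."""
    base = f"{proxy.rstrip('/')}/{escape_path(module)}/@v"
    v = escape_path(version)
    return {
        "list": f"{base}/list",     # known versions, one per line
        "info": f"{base}/{v}.info", # JSON metadata for one version
        "mod": f"{base}/{v}.mod",   # just the go.mod file
        "zip": f"{base}/{v}.zip",   # module source archive
    }

if __name__ == "__main__":
    for kind, url in proxy_urls("golang.org/x/text", "v0.3.0").items():
        print(kind, url)
```

Fetching the `mod` endpoint returns only the go.mod file for a version, which is the piece of information the old behavior had to clone a repository to read.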
Julia: registry still lives in Git, but most clients use a separate “Pkg protocol,” avoiding Git at scale.
Nix/AUR/Gentoo contrasts: monorepo vs. per‑package repos vs. rsync trees, with different scaling and tooling tradeoffs.

Externalities and User Time

Broader tangent on “tragedy of the commons”: using free GitHub bandwidth and user time as unpriced externalities.
Long debate on whether micro‑performance improvements are worth engineering time, and how much companies actually optimize for user latency in practice.
