Package managers keep using Git as a database, it never works out

Scope of the Problem: Git vs. GitHub vs. Filesystems

  • Several commenters argue the core problem is not Git’s Merkle-tree data model, but:
    • Git’s network protocol (inefficient transfers, awkward shallow and sparse clone behavior).
    • GitHub’s hosting constraints, rate limits, and monorepo scaling.
  • Others agree that “having every client clone the whole index” is the real design mistake: each client does O(n) work when it only cares about an O(1) subset of packages.
  • Some push back that the Nixpkgs example is misused: it’s literally a source repo, and many of its pain points are about GitHub scale and monorepo size, not “Git as a database” per se.
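
To make the O(n) versus O(1) point concrete, here is a minimal sketch of a sharded, per-package index layout in which a client fetches one small file per package instead of cloning everything. The directory scheme follows the one used by Cargo's registry index as far as I understand it; the function name index_path is hypothetical and chosen for illustration only.

```python
def index_path(name: str) -> str:
    """Map a package name to the relative path of its own index file.

    Assumption: a Cargo-style layout where short names get dedicated
    top-level directories and longer names are sharded by their first
    four characters.
    """
    n = name.lower()
    if not n:
        raise ValueError("package name must not be empty")
    if len(n) == 1:
        return f"1/{n}"
    if len(n) == 2:
        return f"2/{n}"
    if len(n) == 3:
        return f"3/{n[0]}/{n}"
    # Two levels of two-character shards keep every directory small.
    return f"{n[:2]}/{n[2:4]}/{n}"


if __name__ == "__main__":
    for pkg in ("a", "io", "syn", "serde"):
        print(pkg, "->", index_path(pkg))
```

With a layout like this, resolving a dependency costs one HTTP GET per package, and each response can be cached by a CDN, so client work scales with the size of the dependency set rather than the size of the whole registry.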

Architectural Alternatives and Patterns

  • Common suggested pattern:
    • Keep Git as authoritative source for manifests/recipes.
    • Generate a compact index (often SQLite or similar) and/or static metadata, then distribute via HTTP/CDN, rsync, or OCI registries.
    • Examples cited: MacPorts (rsync + index), Gentoo (git → rsync), WinGet (SQLite index), Hackage (append-only tar index), Nix’s older channel tarballs + SQLite index, OCI backends for Homebrew.
  • SQLite is frequently mentioned as an “ideal” local index, but people warn against storing a monolithic SQLite file directly in Git (binary, no good diffs/merges). Better: text manifests in Git → compiled to SQLite.
  • Some see Fossil/other SCMs or distributed databases (CRDT-based, ledger-like, TUF-inspired designs) as promising, but adoption and complexity are open questions.
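
The "text manifests in Git, compiled to SQLite" pattern can be sketched as follows. This is an illustration under stated assumptions, not any particular project's build step: the manifest format (one JSON file per package version with name, version, and url fields), the table schema, and the function names build_index and lookup are all hypothetical.

```python
import json
import sqlite3
from pathlib import Path


def build_index(manifest_dir: str, db_path: str) -> int:
    """Compile every *.json manifest in manifest_dir into a SQLite index.

    Returns the number of manifests indexed. The database is a build
    artifact: it is regenerated from the text manifests and never
    committed to Git itself.
    """
    rows = []
    for path in sorted(Path(manifest_dir).glob("*.json")):
        m = json.loads(path.read_text(encoding="utf-8"))
        rows.append((m["name"], m["version"], m["url"]))

    conn = sqlite3.connect(db_path)
    try:
        conn.execute("DROP TABLE IF EXISTS packages")
        conn.execute(
            "CREATE TABLE packages ("
            " name TEXT NOT NULL,"
            " version TEXT NOT NULL,"
            " url TEXT NOT NULL,"
            " PRIMARY KEY (name, version))"
        )
        with conn:  # commits on success, rolls back on error
            conn.executemany("INSERT INTO packages VALUES (?, ?, ?)", rows)
    finally:
        conn.close()
    return len(rows)


def lookup(db_path: str, name: str) -> list:
    """Return (version, url) pairs for one package via the primary-key index."""
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(
            "SELECT version, url FROM packages WHERE name = ? ORDER BY version",
            (name,),
        )
        return cur.fetchall()
    finally:
        conn.close()
```

The manifests stay diffable and mergeable in Git, while clients download the compiled database over HTTP or a CDN and query it locally without ever touching the repository.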

Scaling, Ethics, and “Do the Easy Thing First”

  • One camp: starting on Git/GitHub is rational because it offers free hosting, is trivial to implement, and is great for early adoption. When scale hurts, migrate; many successful ecosystems (Cargo, Homebrew, Julia) did exactly that.
  • Opposing camp: this is short‑sighted or even “unethical”; known scaling pitfalls are deferred until change is extremely expensive or impossible, creating long‑term technical debt and user pain.
  • Counter‑argument: most projects never reach that scale; over‑engineering early wastes scarce volunteer time. For package managers, though, “if it succeeds, it will hit scale,” so design should anticipate that.

Ecosystem-Specific Notes

  • Go modules: discussion of the old go get behavior (cloning repos to read go.mod), the dramatic speedup from module proxies, and workarounds for private/self‑hosted Git (GOPRIVATE, SSH, custom CAs).
  • Julia: registry still lives in Git, but most clients use a separate “Pkg protocol,” avoiding Git at scale.
  • Nix/AUR/Gentoo contrasts: monorepo vs. per‑package repos vs. rsync trees, with different scaling and tooling tradeoffs.
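
To illustrate why module proxies are so much faster than cloning a repository just to read go.mod, here is a small sketch of how a client can address a single file on a Go module proxy. The URL shape and the case-escaping rule (an uppercase letter becomes "!" followed by its lowercase form) follow the GOPROXY protocol as documented in the Go modules reference; the helper names escape_path and proxy_url are hypothetical, and the sketch only builds URLs without performing any network requests.

```python
def escape_path(s: str) -> str:
    """Apply the proxy's case-escaping: 'A' becomes '!a', and so on."""
    return "".join("!" + c.lower() if "A" <= c <= "Z" else c for c in s)


def proxy_url(module: str, version: str, kind: str = "mod",
              base: str = "https://proxy.golang.org") -> str:
    """Build the URL for one artifact of one module version.

    kind is 'info' (JSON metadata), 'mod' (the go.mod file), or
    'zip' (the module source archive).
    """
    if kind not in ("info", "mod", "zip"):
        raise ValueError(f"unsupported artifact kind: {kind!r}")
    return f"{base}/{escape_path(module)}/@v/{escape_path(version)}.{kind}"


if __name__ == "__main__":
    print(proxy_url("github.com/BurntSushi/toml", "v1.3.2"))
```

Reading a dependency's go.mod this way is a single small HTTP request, whereas the older behavior described in the thread required cloning the repository first.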

Externalities and User Time

  • Broader tangent on “tragedy of the commons”: using free GitHub bandwidth and user time as unpriced externalities.
  • Long debate on whether micro‑performance improvements are worth engineering time, and how much companies actually optimize for user latency in practice.