Package managers keep using Git as a database, it never works out
Scope of the Problem: Git vs. GitHub vs. Filesystems
- Several commenters argue the core issues are not Git’s Merkle-tree data model, but:
  - Git’s network protocol (inefficient transfers, shallow/sparse behavior).
  - GitHub’s hosting constraints, rate limits, and monorepo scaling.
- Others agree that “having every client clone the whole index” is the real design mistake: each client does O(n) work when it only cares about an O(1) subset of the index.
- Some push back that the Nixpkgs example is misused: it’s literally a source repo, and many of its pain points are about GitHub scale and monorepo size, not “Git as a database” per se.
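To make the O(n) versus O(1) point above concrete, here is a minimal Python sketch of a sharded, per-package index layout. The layout, the two-character fan-out, and the names `index_path` and `files_to_fetch` are illustrative assumptions, not the scheme of any particular package manager. The idea is only that a client can compute the location of the metadata it needs and fetch those few files, instead of cloning everything.

```python
def index_path(name: str, fanout: int = 2) -> str:
    """Map a package name to one small file in a sharded index.

    Hypothetical layout: <first `fanout` chars>/<name>.json, with short
    names padded by "_" so every shard directory has the same width.
    """
    key = name.lower()
    shard = key[:fanout].ljust(fanout, "_")
    return f"{shard}/{key}.json"


def files_to_fetch(wanted: list[str]) -> list[str]:
    """Return the index files a client needs for the packages it wants.

    The result grows with len(wanted), not with the size of the registry.
    """
    return sorted({index_path(name) for name in wanted})


if __name__ == "__main__":
    # A client that needs three packages requests three small files,
    # no matter how many packages the registry holds in total.
    for path in files_to_fetch(["serde", "tokio", "a"]):
        print(path)
```

Because each path is plain static content, files like these can be served from any HTTP host or CDN and cached individually.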
Architectural Alternatives and Patterns
- Common suggested pattern:
  - Keep Git as authoritative source for manifests/recipes.
  - Generate a compact index (often SQLite or similar) and/or static metadata, then distribute via HTTP/CDN, rsync, or OCI registries.
- Examples cited: MacPorts (rsync + index), Gentoo (git → rsync), WinGet (SQLite index), Hackage (append-only tar index), Nix’s older channel tarballs + SQLite index, OCI backends for Homebrew.
- SQLite is frequently mentioned as an “ideal” local index, but people warn against storing a monolithic SQLite file directly in Git (binary, no good diffs/merges). Better: text manifests in Git → compiled to SQLite.
- Some see Fossil/other SCMs or distributed databases (CRDT-based, ledger-like, TUF-inspired designs) as promising, but adoption and complexity are open questions.
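The “text manifests in Git → compiled to SQLite” pattern above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any of the cited projects build their indexes: the manifest format (one JSON file per package version with `name`, `version`, and `url` keys), the `packages` table schema, and the function names `build_index` and `lookup` are all hypothetical.

```python
import json
import sqlite3
from pathlib import Path


def build_index(manifest_dir: Path, db_path: Path) -> int:
    """Compile every *.json manifest under manifest_dir into one SQLite file.

    Assumed manifest shape: {"name": ..., "version": ..., "url": ...}.
    Returns the number of manifests indexed.
    """
    if db_path.exists():
        db_path.unlink()  # always rebuild; the Git checkout stays the source of truth
    conn = sqlite3.connect(str(db_path))
    try:
        conn.execute(
            "CREATE TABLE packages ("
            " name TEXT NOT NULL, version TEXT NOT NULL, url TEXT NOT NULL,"
            " PRIMARY KEY (name, version))"
        )
        rows = []
        for path in sorted(manifest_dir.rglob("*.json")):
            manifest = json.loads(path.read_text(encoding="utf-8"))
            rows.append((manifest["name"], manifest["version"], manifest["url"]))
        conn.executemany("INSERT INTO packages VALUES (?, ?, ?)", rows)
        conn.commit()
        return len(rows)
    finally:
        conn.close()


def lookup(db_path: Path, name: str) -> list[tuple[str, str]]:
    """Client-side query: (version, url) rows for one package.

    Versions are sorted as plain text here, not by semantic version.
    """
    conn = sqlite3.connect(str(db_path))
    try:
        cur = conn.execute(
            "SELECT version, url FROM packages WHERE name = ? ORDER BY version",
            (name,),
        )
        return cur.fetchall()
    finally:
        conn.close()
```

In this arrangement the generated database is a build artifact: it would be published over HTTP/CDN or rsync rather than committed to Git, which sidesteps the binary diff/merge problem mentioned above.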
Scaling, Ethics, and “Do the Easy Thing First”
- One camp: starting on Git/GitHub is rational—free hosting, trivial to implement, great for early adoption. When scale hurts, migrate; many successful ecosystems (Cargo, Homebrew, Julia) did exactly that.
- Opposing camp: this is short‑sighted or even “unethical”; known scaling pitfalls are deferred until change is extremely expensive or impossible, creating long‑term technical debt and user pain.
- Counter‑argument: most projects never reach that scale; over‑engineering early wastes scarce volunteer time. For package managers, though, “if it succeeds, it will hit scale,” so design should anticipate that.
Ecosystem-Specific Notes
- Go modules: discussion of the old “go get” behavior (cloning repos to read go.mod), the dramatic speedup from module proxies, and workarounds for private/self‑hosted Git (GOPRIVATE, SSH, custom CAs).
- Julia: registry still lives in Git, but most clients use a separate “Pkg protocol,” avoiding Git at scale.
- Nix/AUR/Gentoo contrasts: monorepo vs. per‑package repos vs. rsync trees, with different scaling and tooling tradeoffs.
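As a rough illustration of why module proxies help in the Go case above, the Python sketch below builds the plain HTTP URLs a client can request from a proxy that follows the GOPROXY protocol, so go.mod can be read with a single GET instead of a repository clone. The helper names `escape_path` and `proxy_urls` are made up for this example, and the code only constructs URLs; it performs no network access.

```python
def escape_path(path: str) -> str:
    """Escape a module path or version for use in a proxy URL.

    In the GOPROXY protocol each uppercase letter is written as "!"
    followed by its lowercase form, so paths stay unambiguous on
    case-insensitive filesystems.
    """
    return "".join("!" + c.lower() if "A" <= c <= "Z" else c for c in path)


def proxy_urls(module: str, version: str,
               base: str = "https://proxy.golang.org") -> dict[str, str]:
    """Return the per-version endpoints for one module on a proxy."""
    prefix = f"{base.rstrip('/')}/{escape_path(module)}/@v"
    ver = escape_path(version)
    return {
        "list": f"{prefix}/list",        # known versions, one per line
        "info": f"{prefix}/{ver}.info",  # JSON metadata for this version
        "mod": f"{prefix}/{ver}.mod",    # the go.mod file alone
        "zip": f"{prefix}/{ver}.zip",    # full module source archive
    }


if __name__ == "__main__":
    for kind, url in proxy_urls("github.com/Azure/azure-sdk-for-go", "v68.0.0").items():
        print(kind, url)
```

Dependency resolution mostly needs the small `.mod` files, which is why moving from cloning to per-file HTTP requests can cut so much work.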
Externalities and User Time
- Broader tangent on “tragedy of the commons”: using free GitHub bandwidth and user time as unpriced externalities.
- Long debate on whether micro‑performance improvements are worth engineering time, and how much companies actually optimize for user latency in practice.