The future of large files in Git is Git
Enthusiasm for native large-file support
- Many welcome large-file handling moving into core Git rather than external tools.
- Separate “large object remotes” and partial clones are seen as enabling broader use cases, including asset-heavy projects.
How Git already handles binaries
- Several comments stress that all Git objects are binary and packfiles already use binary deltas.
- The real pain is with files where small logical changes rewrite the whole binary (compressed, encrypted, some archives), inflating history.
- Another pain point: once a big file is committed, it lives forever in history unless you rewrite it.
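As a rough illustration of that last point, a minimal sketch (standard Git plumbing, run from any clone; the `head -n 10` cutoff is arbitrary) of how to list the largest blobs still buried in history:

```sh
# Every object reachable from any ref stays in .git/objects even if the
# file was deleted in a later commit, unless history is rewritten.
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  awk '$1 == "blob" {print $3, $4}' |
  sort -rn |
  head -n 10
```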
Critique of Git LFS
- Criticisms: awkward opt-in (extra install, hooks, `.gitattributes`), confusing pointer files, poor server UX, multiple auth prompts, and bad offline/sneakernet behavior (see the sketch after this list).
- Migration tooling can rewrite history in surprising ways (e.g., `.gitattributes` “pollution” in older commits).
- Some argue “vendor lock-in” is mostly about GitHub’s pricing and behavior, not the open LFS protocol itself; others say it practically locks you in once used.
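For context on the opt-in and pointer-file complaints, a minimal sketch of the standard LFS flow (the file name, oid, and size below are illustrative, not from the thread):

```sh
git lfs install               # one-time: installs the smudge/clean hooks
git lfs track "*.psd"         # writes a filter rule into .gitattributes
git add .gitattributes design.psd
git commit -m "Add design asset via LFS"

# What the Git object database actually stores for design.psd is a small
# pointer file along these lines (oid and size are illustrative):
#   version https://git-lfs.github.com/spec/v1
#   oid sha256:4d7a21...
#   size 104857600
```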
Partial clones & large object promisors
- `--filter` and promisors are seen as addressing history bloat by not downloading unused large blobs (see the clone sketch after this list).
- Clarification: even with filters, the checked-out working tree should be complete; only historical versions are lazily fetched.
- Skeptics worry about:
  - New flags on `git clone` that beginners won’t know.
  - Broken behavior if promisor storage is lost/migrated.
  - Server support being uneven; many forges don’t support partial clones yet.
- Debate over whether these should become safe defaults vs niche power‑user options.
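A minimal sketch of the partial-clone workflow under discussion (the repository URL and size threshold are placeholders; server support varies by forge):

```sh
# Blobless clone: fetch commits and trees now, defer blobs until needed.
git clone --filter=blob:none https://example.com/big-repo.git

# Or only defer blobs above a size threshold.
git clone --filter=blob:limit=1m https://example.com/big-repo.git

# The initial checkout still materializes a complete working tree; blobs
# for historical versions are fetched from the promisor remote lazily,
# e.g. when running `git log -p` or checking out an old commit.
```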
Should Git manage large/binary assets?
- One camp: Git is a general SCM for whole projects; splitting code and assets (e.g., separate artifact store, submodules) is harmful to reproducibility and release tracking.
- Other camp: Git is fundamentally for text source; large binaries belong in Perforce/SVN/artifact stores; forcing Git into that role is a “square peg in a round hole”.
- Game and media developers report Git/LFS struggling at hundreds of GB–TB scales; Perforce or Plastic often fare better, despite weaker surrounding tooling.
Alternatives and ecosystem tools
- Mentioned tools: git‑annex, datalad, DVC, dud, Oxen, Xet, datamon, jj (future roadmap), DVC‑style indirection layers, artifact repos (Artifactory), and S3‑backed setups.
- git‑annex praised for private, multi‑remote, N‑copies workflows but considered too complex and not well suited for public multi‑user projects.
- DVC appreciated for decoupling data storage from Git history; complaints include hashing overhead and revisions accumulating without bound unless pruned (see the sketch after this list).
- Several projects pitch themselves as “Git‑like but large‑file‑first”, often with chunking, dedupe, or custom backends.
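To make the DVC-style indirection concrete, a minimal sketch (remote name, bucket, and file paths are placeholders):

```sh
dvc init                                     # adds .dvc/ next to .git/
dvc remote add -d storage s3://my-bucket/dvc # placeholder remote
dvc add data/train.csv                       # hashes the file, writes data/train.csv.dvc
git add data/train.csv.dvc data/.gitignore   # only the small pointer is committed to Git
git commit -m "Track dataset with DVC"
dvc push                                     # uploads the data itself to the remote
```

The hashing step in `dvc add` is where the overhead complaint above comes from.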
Ideas for better large-file storage
- Proposals include:
  - Content-defined chunking and dedup (borg/restic style) inside Git or a new SCM (see the toy sketch after this list).
  - Prolly trees or similar structures for huge mutable blobs with efficient partial updates.
  - Format-aware diff/merge (e.g., for Office docs, archives, JSON, scenes) or reversible text-like representations.
- Some argue Git should instead focus on fixing shallow/partial clones and pruning policies so any repo can be an efficient mirror, without pointer schemes.
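As a toy illustration of the chunk-and-dedupe idea in the first proposal (fixed-size chunks for brevity; borg/restic-style tools instead cut chunks at content-defined boundaries with a rolling hash, so an insertion early in a file doesn’t shift every later chunk’s hash; paths, chunk size, and file names are placeholders, and GNU coreutils is assumed):

```sh
store=/tmp/chunk-store
mkdir -p "$store"

dedup_store() {
  # Split the input into 1 MiB chunks, hash each, and copy only chunks
  # whose hash is not already present; the ordered hash list (manifest)
  # is enough to reconstruct the original file later.
  split -b 1M "$1" /tmp/chunk.
  for c in /tmp/chunk.*; do
    h=$(sha256sum "$c" | cut -d' ' -f1)
    [ -e "$store/$h" ] || cp "$c" "$store/$h"
    echo "$h"
  done > "$1.manifest"
  rm -f /tmp/chunk.*
}

dedup_store big-asset-v1.bin   # placeholder files
dedup_store big-asset-v2.bin   # chunks shared between versions are stored once
```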
DX, defaults, and scale
- Repeated complaints that Git “fixes” issues by adding flags, not changing defaults; beginners are left exposed to poor UX (slow clones, obscure options).
- Others counter that Git’s decentralized model and local full history are core strengths and worth preserving, especially for offline and OSS workflows.
- Thread ends without consensus: many see the new features as a big step forward; others think a fundamentally new SCM may be needed for petabyte‑scale, asset‑heavy projects.