2026-01-27

I made my own Git

Alternative VCS Designs & SQLite Idea

A suggestion to back the toy Git with SQLite leads to discussion of Fossil, which already uses SQLite internally and bundles issues, wiki, docs, and forum as first-class, local-first data.
Several commenters like Fossil for personal/small-team projects and offline work, but note that it has no rebasing, a different collaboration model, and a weaker story for drive‑by contributions than Git forges.
Others argue Git is also “local-first”, but are reminded that its ecosystem typically offloads issues/docs to external platforms.
Other VCSes mentioned: Sapling (Meta’s Mercurial fork, zstd-based deltas), Pijul and Jujutsu (first-class conflict objects), Got (encrypted, large-data friendly).
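
A minimal sketch of what the SQLite idea could look like, using Python's standard sqlite3 module: a single table keyed by content hash. The table layout and the names open_store, put_object, and get_object are assumptions made for illustration; they are not taken from the toy project or from Fossil's actual schema.

```python
import hashlib
import sqlite3
import zlib


def open_store(path=":memory:"):
    """Open (or create) a hypothetical single-table object store."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS objects ("
        " id TEXT PRIMARY KEY,"
        " kind TEXT NOT NULL,"
        " data BLOB NOT NULL)"
    )
    return db


def put_object(db, kind, content):
    """Store content under the hash of its bytes; duplicates are ignored."""
    header = f"{kind} {len(content)}\0".encode()
    oid = hashlib.sha256(header + content).hexdigest()
    with db:  # run the insert inside one transaction
        db.execute(
            "INSERT OR IGNORE INTO objects (id, kind, data) VALUES (?, ?, ?)",
            (oid, kind, zlib.compress(content)),
        )
    return oid


def get_object(db, oid):
    """Return (kind, content) for an object id, or raise KeyError."""
    row = db.execute(
        "SELECT kind, data FROM objects WHERE id = ?", (oid,)
    ).fetchone()
    if row is None:
        raise KeyError(oid)
    return row[0], zlib.decompress(row[1])
```

Storing the same bytes twice yields the same id and a single row, so deduplication comes from the primary key rather than from any extra bookkeeping.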

Learning Git Internals & DIY Reimplementations

Many links are shared to “build your own Git” resources and explanations (Python/Rust implementations, “Git from the Bottom Up”, “The Git Parable”).
Reimplementing Git is seen as a powerful way to expose its hidden complexity and improve intuition about everyday commands.
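
One of the first things such a reimplementation exposes is how Git names objects: the id is a hash over a short header plus the content. The sketch below reproduces that framing with Python's hashlib; the function name git_object_id is an assumption for illustration.

```python
import hashlib


def git_object_id(kind, content, algo="sha1"):
    """Hash '<kind> <size>\\0<content>', the framing Git uses for object ids."""
    header = f"{kind} {len(content)}\0".encode()
    return hashlib.new(algo, header + content).hexdigest()


# The empty blob and the empty tree have well-known SHA-1 ids.
print(git_object_id("blob", b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(git_object_id("tree", b""))  # 4b825dc642cb6eb9a060e54bf8d69288fbee4904

# The same framing with SHA-256 gives a 64-character hex id.
print(len(git_object_id("blob", b"", algo="sha256")))  # 64
```

Because the object type and size are part of the hashed bytes, a blob and a tree with identical content still get different ids.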

Storage, Compression & Hashing Choices

Some think focusing early on compression (zstd vs zlib) is less interesting than Git’s object model, but others note that all implementation details matter when learning.
Discussion of SHA‑1 vs SHA‑256: collisions are a concern even when hashes serve as “just identifiers” (a practical SHA‑1 collision was demonstrated in 2017); Git’s migration to SHA‑256 is noted.
Multiple comments argue SHA‑256 is slow; BLAKE3 or similar parallel-friendly hashes can be much faster, depending on hardware.
Git’s file-based object model is criticized as suboptimal for many small or large files; content-defined chunking is proposed as a better long-term approach.
Some question whether compression belongs in the VCS or at the filesystem layer (e.g., btrfs with transparent zstd).
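
To make the content-defined chunking proposal concrete, here is a minimal sketch that cuts a byte stream wherever a rolling hash over recent bytes hits a chosen bit pattern. It uses a simple gear-style rolling hash; the table seed, size limits, and the names chunk_boundaries and chunk_ids are assumptions for illustration, not parameters from any particular VCS.

```python
import hashlib
import random

# Hypothetical gear table: one fixed pseudo-random 64-bit value per byte.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def chunk_boundaries(data, avg_bits=12, min_size=1024, max_size=16384):
    """Yield (start, end) offsets of content-defined chunks in data."""
    cut_mask = (1 << avg_bits) - 1
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Shifting ages out old bytes, so h depends only on recent input.
        h = ((h << 1) + GEAR[byte]) & MASK64
        length = i + 1 - start
        if length < min_size:
            continue
        if (h & cut_mask) == 0 or length >= max_size:
            yield start, i + 1
            start = i + 1
            h = 0
    if start < len(data):
        yield start, len(data)  # trailing partial chunk


def chunk_ids(data):
    """Return the SHA-256 id of every chunk, in order."""
    return [hashlib.sha256(data[s:e]).hexdigest()
            for s, e in chunk_boundaries(data)]
```

Because cut points depend on nearby content rather than on absolute offsets, appending to a file leaves every earlier chunk except possibly the last one unchanged, which is what lets a store deduplicate large files that differ only locally.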

Performance, Caching & Empty Directories

The toy VCS recomputes hashes for all files on each operation; commenters point out that this will not scale, and reference Git’s use of timestamps and file size as a change heuristic, including its “racy Git” handling.
Git’s data model technically supports empty trees, but its index doesn’t track empty directories; the toy implementation does support empty folders explicitly.
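
The timestamp-plus-size heuristic can be sketched in a few lines: keep the last seen size and modification time per path and rehash only when either changes. The in-memory dict below is a hypothetical stand-in for an on-disk index, and the function names are assumptions for illustration.

```python
import hashlib
import os


def file_digest(path):
    """Hash the full file contents (the expensive step to avoid)."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def cached_digest(path, cache):
    """Return (digest, rehashed) using size and mtime as a change heuristic.

    `cache` maps path -> (size, mtime_ns, digest).
    """
    st = os.stat(path)
    key = (st.st_size, st.st_mtime_ns)
    entry = cache.get(path)
    if entry is not None and entry[:2] == key:
        return entry[2], False  # stat data unchanged: trust the cached hash
    digest = file_digest(path)
    cache[path] = (*key, digest)
    return digest, True
```

This sketch deliberately ignores the racy case: a file rewritten with the same size within the same timestamp tick would be treated as unchanged. Handling that, for example by distrusting entries whose mtime is not older than the time the cache itself was written, is the part a real implementation would need to add.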

Merging, Rebasing & Conflict Handling

Git’s recursive merge strategy is praised for remembering conflict resolutions; rerere (“reuse recorded resolution”) is mentioned but considered local-only and sometimes dangerous.
Some advocate merge-based workflows over squash/rebase to preserve history and past attempts.
Newer systems (Pijul, Jujutsu) that model conflicts as first-class objects are highlighted as more principled.

AI Training, Scraping & “Self-Eating” Models

Several comments pivot to LLMs: code repos are clearly being scraped (e.g., many unexplained GitHub clones).
People discuss blocking AI crawlers, model training on imperfect code, and the possibility of “poisoning” training data (mostly as a thought experiment).
Broader concerns are raised about humans over-trusting AI outputs and about the feedback loop when authors also use LLMs to write content.

Format & UX Choices

Strong pushback against YAML for machine-generated metadata; JSON or TOML are preferred for simplicity and fewer edge cases.
A question is raised about why the project introduces a new ignore file instead of reusing .gitignore.
