I made my own Git

Alternative VCS Designs & SQLite Idea

  • A suggestion to back the toy Git with SQLite leads to discussion of Fossil, which already uses SQLite internally and bundles issues, wiki, docs, and forum as first-class, local-first data.
  • Several commenters like Fossil for personal/small-team projects and offline work, but note its trade-offs: no rebasing, a different collaboration model, and a weaker story for drive‑by contributions than Git forges offer.
  • Others argue Git is also “local-first”, but are reminded that its ecosystem typically offloads issues/docs to external platforms.
  • Other VCSes mentioned: Sapling (Meta’s Mercurial fork, zstd-based deltas), Pijul and Jujutsu (first-class conflict objects), Got (encrypted, large-data friendly).
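
To make the SQLite suggestion concrete, here is a minimal sketch of a content-addressed object store kept in a single SQLite table. It is not how Fossil or the toy Git stores data; the table layout, the function names (open_store, put_object, get_object), and the choice of SHA-256 plus zlib are all assumptions made for illustration.

```python
import hashlib
import sqlite3
import zlib


def open_store(path=":memory:"):
    """Open (or create) a hypothetical single-table object store."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS objects ("
        " id TEXT PRIMARY KEY,"   # hex digest of kind + payload
        " kind TEXT NOT NULL,"    # e.g. 'blob', 'tree', 'commit'
        " data BLOB NOT NULL)"    # zlib-compressed payload
    )
    return conn


def put_object(conn, kind, payload):
    """Store payload under its content hash; duplicates are ignored."""
    oid = hashlib.sha256(kind.encode() + b"\0" + payload).hexdigest()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO objects (id, kind, data) VALUES (?, ?, ?)",
            (oid, kind, zlib.compress(payload)),
        )
    return oid


def get_object(conn, oid):
    """Return (kind, payload) for an object id, or raise KeyError."""
    row = conn.execute(
        "SELECT kind, data FROM objects WHERE id = ?", (oid,)
    ).fetchone()
    if row is None:
        raise KeyError(oid)
    kind, data = row
    return kind, zlib.decompress(data)
```

Storing the same content twice returns the same id and leaves one row, which is the deduplication property a content-addressed VCS relies on.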

Learning Git Internals & DIY Reimplementations

  • Many links are shared to “build your own Git” resources and explanations (Python/Rust implementations, “Git from the Bottom Up”, “The Git Parable”).
  • Reimplementing Git is seen as a powerful way to expose its hidden complexity and improve intuition about everyday commands.
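
As a small taste of what such reimplementations start with, the sketch below computes the id Git assigns to a blob under its default SHA-1 object format: the hash covers a short header ("blob", the byte length, a NUL byte) followed by the file content. The function name git_blob_id is made up for this example.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object id Git would give a blob with this content."""
    # Git hashes "blob <size>\0" + content, not the raw content alone.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # Should match the output of: echo hello | git hash-object --stdin
    print(git_blob_id(b"hello\n"))
```

Because of the header, the id of a blob differs from a plain SHA-1 of the same file, which is a common early surprise when comparing against sha1sum.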

Storage, Compression & Hashing Choices

  • Some think focusing early on compression (zstd vs zlib) is less interesting than Git’s object model, but others note that every implementation detail matters when learning.
  • Discussion of SHA‑1 vs SHA‑256: collisions are a theoretical concern even for “just identifiers”; Git’s migration to SHA‑256 is noted.
  • Multiple comments argue SHA‑256 is slow; BLAKE3 or similar parallel-friendly hashes can be much faster, depending on hardware.
  • Git’s file-based object model is criticized as suboptimal for many small or large files; content-defined chunking is proposed as a better long-term approach.
  • Some question whether compression belongs in the VCS or at the filesystem layer (e.g., btrfs with transparent zstd).
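
The content-defined chunking idea can be sketched as follows. This is a simplified gear-style rolling hash, not the algorithm of any particular tool; the table seed, the 10-bit boundary test, and the size limits are arbitrary assumptions. The point it illustrates is that cut points depend on nearby bytes rather than on absolute offsets, so an insertion near the start of a file should disturb only the chunks around it.

```python
import random

# Hypothetical gear table: one fixed pseudo-random 32-bit value per byte value.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def chunk(data: bytes, bits=10, min_size=256, max_size=8192):
    """Split data at content-defined boundaries (illustrative parameters)."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Shifting left drops bytes older than 32 positions out of the hash,
        # so h reflects only a sliding window of recent content.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        size = i - start + 1
        at_boundary = size >= min_size and (h >> (32 - bits)) == 0
        if at_boundary or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks
```

Hashing each chunk and storing it by digest then gives deduplication below the file level, which is the property commenters want for very large files.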

Performance, Caching & Empty Directories

  • The toy VCS recomputes hashes for all files on each operation; commenters point out this will not scale and point to Git’s approach: the index caches timestamps + filesize as a change heuristic, with special “racy git” handling for files modified within the same timestamp tick.
  • Git’s data model technically supports empty trees, but its index doesn’t track empty directories; the toy implementation does support empty folders explicitly.
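
The timestamp + filesize heuristic can be sketched as a small cache in front of the hash function. This is an illustration of the general idea only, not Git's index format; the names file_digest and cached_digest and the in-memory dict cache are assumptions.

```python
import hashlib
import os


def file_digest(path):
    """Hash a file's content in fixed-size blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()


def cached_digest(path, cache):
    """Rehash only when the file's (mtime, size) pair has changed.

    cache is a plain dict mapping path -> ((mtime_ns, size), digest).
    """
    st = os.stat(path)
    key = (st.st_mtime_ns, st.st_size)
    entry = cache.get(path)
    if entry is not None and entry[0] == key:
        return entry[1]  # stat data unchanged: trust the cached digest
    digest = file_digest(path)
    cache[path] = (key, digest)
    return digest
```

The heuristic has a blind spot: an edit that keeps the size and lands within the filesystem's timestamp granularity looks unchanged. That case is what the “racy git” handling exists for, and this sketch does not attempt it.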

Merging, Rebasing & Conflict Handling

  • Git’s recursive merge strategy is praised for remembering conflict resolutions; rerere is mentioned but considered local and sometimes dangerous.
  • Some advocate merge-based workflows over squash/rebase to preserve history and past attempts.
  • Newer systems (Pijul, Jujutsu) that model conflicts as first-class objects are highlighted as more principled.

AI Training, Scraping & “Self-Eating” Models

  • Several comments pivot to LLMs: code repos are clearly being scraped (e.g., many unexplained GitHub clones).
  • People discuss blocking AI crawlers, model training on imperfect code, and the possibility of “poisoning” training data (mostly as a thought experiment).
  • Broader concerns are raised about humans over-trusting AI outputs and about the feedback loop created when authors also use LLMs to write content.

Format & UX Choices

  • Strong pushback against YAML for machine-generated metadata; JSON or TOML are preferred for simplicity and fewer edge cases.
  • A question is raised about why the project introduces a new ignore file instead of reusing .gitignore.