DiffX – Next-Generation Extensible Diff Format

Existing Tools vs. “New Standard”

  • Many commenters argue the problems DiffX claims to solve are already covered by:
    • git format-patch/git am and mbox for multi-commit patch sets.
    • Git-style unified diffs with rich headers.
    • RFC822/email-style headers above diffs for metadata.
  • Several see DiffX as “standard n+1” (invoking xkcd 927), especially since Git’s format is de facto canonical for many workflows.
  • Others point out these Git-centric solutions don’t help tools that must integrate with many different SCMs (SVN, Perforce, ClearCase, in-house systems) that lack consistent or rich diff formats.
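
To make the "email-style headers above diffs" option concrete, here is a minimal sketch using Python's standard email module. The patch text is a hand-written sample loosely modeled on git format-patch output, not real tool output, and the names PATCH, metadata and diff_text are chosen only for illustration.

```python
from email import message_from_string

# Hand-written sample: RFC822-style headers, a blank line, then the
# commit message and a Git-style unified diff.
PATCH = """\
From: Jane Doe <jane@example.com>
Date: Mon, 1 Jan 2024 12:00:00 +0000
Subject: [PATCH 1/2] Fix typo in greeting

Fix a typo in the greeting.
---
 hello.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hello.txt b/hello.txt
--- a/hello.txt
+++ b/hello.txt
@@ -1 +1 @@
-helo world
+hello world
"""

# The email parser splits the headers from the body at the first blank line.
msg = message_from_string(PATCH)
body = msg.get_payload()

# Metadata comes from the headers; the diff is the tail of the body.
metadata = {
    "author": msg["From"],
    "date": msg["Date"],
    "subject": msg["Subject"],
}
diff_text = body[body.index("diff --git"):]

print(metadata["subject"])
print(diff_text)
```

This only works because the producer is Git, which is the objection raised in the last bullet above: an SCM that emits no such headers gives a parser like this nothing to read.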

Who Actually Has the Problem?

  • Proponents (notably from the Review Board side) say the real pain falls on tool authors:
    • Every SCM has its own diff quirks, often undocumented, requiring bespoke parsers.
    • Some SCMs have no diff format, or omit crucial info: revisions, modes, symlinks, deletions, encodings, binary changes.
    • Large diffs (hundreds of MB) are expensive to parse without clear sectioning and lengths.
  • Skeptics respond that:
    • Most users stick to a single SCM per project and never see these issues.
    • Better SCMs or documented Git-style formats would be preferable to inventing a new one.
    • Scenarios involving massive binary/versioning setups are dismissed by some as edge cases or “imaginary problems.”

Design of DiffX Format

  • Structure:
    • Hierarchical section headers (#..meta, #...diff, etc.) plus explicit length= fields.
    • Metadata in JSON blobs, with a simple key/value header syntax indicating format and length.
  • Critiques:
    • The dot-based hierarchy is hard to read and error-prone, and sections at different levels are all named “meta.”
    • Mixing custom header syntax and JSON means two parsers, less friendly to grep/awk-style tooling.
    • Length fields are seen as fragile when humans edit patches.
    • JSON is criticized as noisy and awkward for hand-editing; some argue JSON5 would be nicer, others insist on baseline JSON for maximal compatibility.
  • Defenses:
    • DiffX is intended to be machine-generated/consumed; human editing is not the main use case.
    • Lengths and hierarchy allow efficient partial parsing and mutation in large diffs.
    • JSON was chosen after other grammars were tried; it is widely supported and has unambiguous types.
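
The structure under debate can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the DiffX reference parser: the header syntax (a "#", dots for depth, a name, a colon, then comma-separated key=value options) is a simplified reading of the description above, and make_section and iter_sections are hypothetical helpers. The sketch shows both points at once: a header parser plus a JSON parser are needed, and length= lets a reader jump over a body without scanning it.

```python
import json


def make_section(level, name, body=b"", **options):
    """Build one section: a header line followed by `body` (hypothetical helper)."""
    if body:
        options["length"] = len(body)  # byte length of the body
    opts = ", ".join(f"{key}={value}" for key, value in options.items())
    header = f"#{'.' * level}{name}:" + (f" {opts}" if opts else "")
    return header.encode("utf-8") + b"\n" + body


def iter_sections(data):
    """Yield (level, name, options, body), skipping each body via its length."""
    pos = 0
    while pos < len(data):
        end = data.index(b"\n", pos)
        header = data[pos:end].decode("utf-8")
        pos = end + 1
        if not header.startswith("#"):
            raise ValueError(f"expected a section header, got {header!r}")
        marker, _, rest = header.partition(":")
        name = marker.lstrip("#").lstrip(".")
        level = len(marker) - 1 - len(name)  # number of dots
        options = dict(
            pair.split("=", 1) for pair in rest.strip().split(", ") if pair
        )
        length = int(options.get("length", 0))
        # The body is sliced out by length, never scanned line by line,
        # so a diff line that happens to start with "#" cannot confuse us.
        yield level, name, options, data[pos:pos + length]
        pos += length


# Assumed sample document: change-level and file-level "meta" sections.
change_meta = json.dumps({"author": "Jane Doe"}).encode("utf-8")
file_meta = json.dumps({"path": "hello.txt"}).encode("utf-8")
diff = b"--- hello.txt\n+++ hello.txt\n@@ -1 +1 @@\n-helo world\n+hello world\n"
doc = (
    make_section(0, "diffx", version="1.0")
    + make_section(1, "change")
    + make_section(2, "meta", change_meta, format="json")
    + make_section(2, "file")
    + make_section(3, "meta", file_meta, format="json")
    + make_section(3, "diff", diff)
)

sections = list(iter_sections(doc))
file_metas = [
    json.loads(body)
    for level, name, _, body in sections
    if name == "meta" and level == 3
]
print(file_metas)
```

The fragility critics mention is visible here too: if someone hand-edits the diff body without updating length=, the slice lands in the wrong place and the next header fails to parse.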

Scope: Diff vs. Patch, Metadata, Commits, Binary

  • Some say DiffX conflates concepts:
    • A “diff” should just be line changes; commit lists and metadata belong in the VCS/transport (or in patch sets).
    • Encoding and metadata problems should be solved by standardizing on UTF‑8 and one SCM.
  • Others argue:
    • In practice, many VCSs expose only textified content with local encoding, mixed newlines, or incomplete metadata.
    • Tools need a portable representation of “delta state” including commit ranges, per-file revisions, symlinks, modes, and binary deltas to reconstruct or analyze changes across diverse backends.
    • Bundling multiple commits in one file is valuable because it avoids ordering and missing-patch issues in downstream tools.

Alternatives and Broader Perspectives

  • Suggestions include:
    • Formalizing Git’s diff header grammar and/or email-style headers instead of creating DiffX.
    • Using more semantic/AST-based diffs (e.g., difftastic) or structured formats for JSON/AST changes.
    • In some scenarios, just shipping both full file versions (or compressed pairs) may be simpler.
  • Some note diffs are still important for:
    • Code review pipelines.
    • Tools interacting with LLMs where diffs can dramatically reduce token usage and latency.
  • Adoption concerns:
    • DiffX currently appears to be used mostly inside the Review Board ecosystem.
    • Without buy-in from major VCSs, some doubt it will gain wide traction, though others see it as a useful documented format that others may adopt if they share similar pain points.
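
The trade-off between shipping both full file versions and sending a diff can be illustrated with Python's standard difflib. This is a rough sketch: the 500-line file is synthetic, the file names are made up, and character count is only an assumed stand-in for token usage.

```python
import difflib

# Synthetic input: a 500-line file with a single edited line.
old_lines = [f"line {i}\n" for i in range(1, 501)]
new_lines = list(old_lines)
new_lines[249] = "line 250 (edited)\n"

# A unified diff carries only the changed line plus a few lines of context.
diff_text = "".join(
    difflib.unified_diff(
        old_lines, new_lines, fromfile="a/notes.txt", tofile="b/notes.txt"
    )
)

# Shipping both versions means sending every line twice.
full_pair_size = len("".join(old_lines)) + len("".join(new_lines))

print("diff characters:", len(diff_text))
print("full pair characters:", full_pair_size)
```

The gap narrows as the proportion of changed lines grows, which is when the "just ship both versions" suggestion becomes more attractive.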