2025-06-04

DiffX – Next-Generation Extensible Diff Format

Existing Tools vs. “New Standard”

Many commenters argue the problems DiffX claims to solve are already covered by:
- git format-patch/git am and mbox for multi-commit patch sets.
- Git-style unified diffs with rich headers.
- RFC822/email-style headers above diffs for metadata.
Several see DiffX as “standard n+1” (invoking xkcd 927), especially since Git’s format is de facto canonical for many workflows.
Others point out these Git-centric solutions don’t help tools that must integrate with many different SCMs (SVN, Perforce, ClearCase, in-house systems) that lack consistent or rich diff formats.
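
As a rough illustration of the email-style approach commenters describe, the sketch below uses Python's standard `email` module to separate RFC822-style headers from the diff that follows. The patch text, author, and file names are made up, and the `split_patch` helper is hypothetical. It also assumes a lone `---` line separates the commit message from the diff, as in `git format-patch` output, and omits the leading mbox `From <hash> ...` line for simplicity.

```python
from email import message_from_string

# Hypothetical patch text in the style of `git format-patch` output;
# the author, subject, and file names are invented for illustration.
PATCH = """\
From: Jane Doe <jane@example.com>
Date: Wed, 4 Jun 2025 10:00:00 +0000
Subject: [PATCH 1/2] Fix off-by-one in parser

Explain the fix here.
---
diff --git a/parser.py b/parser.py
--- a/parser.py
+++ b/parser.py
@@ -1 +1 @@
-x = n
+x = n + 1
"""


def split_patch(text):
    """Return (headers, commit message, diff text) from an email-style patch."""
    msg = message_from_string(text)
    body = msg.get_payload()
    # Assumption: a line containing only '---' ends the commit message.
    message, _, diff = body.partition("\n---\n")
    return dict(msg.items()), message.strip(), diff


headers, message, diff = split_patch(PATCH)
print(headers["Subject"])
print(diff.splitlines()[0])
```

This works well when every producer emits Git-style patches; it does not by itself address the SCMs whose output carries no such headers.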

Who Actually Has the Problem?

Proponents (notably from the Review Board side) say the real pain falls on tool authors:
- Every SCM has its own diff quirks, often undocumented, requiring bespoke parsers.
- Some SCMs have no diff format at all, or one that omits crucial information: revisions, modes, symlinks, deletions, encodings, binary changes.
- Large diffs (hundreds of MB) are expensive to parse without clear sectioning and lengths.
Skeptics respond that:
- Most users stick to a single SCM per project and never see these issues.
- Better SCMs or documented Git-style formats would be preferable to inventing a new one.
- The massive binary/versioning setups cited by proponents are viewed by some as edge cases or “imaginary problems.”

Design of DiffX Format

Structure:
- Hierarchical section headers (#..meta, #...diff, etc.) plus explicit length= fields.
- Metadata in JSON blobs, with a simple key/value header syntax indicating format and length.
Critiques:
- The dot-based hierarchy is hard to read and error-prone; sections at different levels are all named “meta.”
- Mixing custom header syntax and JSON means two parsers, less friendly to grep/awk-style tooling.
- Length fields are seen as fragile when humans edit patches.
- JSON is criticized as noisy and awkward for hand-editing; some argue JSON5 would be nicer, others insist on baseline JSON for maximal compatibility.
Defenses:
- DiffX is intended to be machine-generated/consumed; human editing is not the main use case.
- Lengths and hierarchy allow efficient partial parsing and mutation in large diffs.
- JSON was chosen after trying other grammars; it is widely supported and has unambiguous types.
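
To make the partial-parsing argument concrete, here is a minimal sketch of a scanner that reads only section headers and uses `length=` to seek past content. It assumes a simplified header shape (`#`, dots for depth, a name, a colon, then comma-separated `key=value` options) and byte-counted lengths. The `scan_sections` and `build_example` helpers and the sample data are hypothetical, not part of any DiffX library.

```python
import io
import json
import re

# Assumed header shape: '#', dots for depth, section name, ':', options.
HEADER_RE = re.compile(rb"^#(\.*)([a-z]+):\s*(.*)$")


def scan_sections(stream):
    """Yield (depth, name, options, content_offset) without reading content."""
    while True:
        line = stream.readline()
        if not line:
            break
        m = HEADER_RE.match(line.rstrip(b"\r\n"))
        if m is None:
            raise ValueError(f"expected section header, got {line!r}")
        dots, name, raw_opts = m.groups()
        options = {}
        for pair in filter(None, raw_opts.decode("ascii").split(",")):
            key, _, value = pair.strip().partition("=")
            options[key] = value
        offset = stream.tell()
        # Skip the content entirely; sections without length= have none.
        stream.seek(int(options.get("length", 0)), io.SEEK_CUR)
        yield len(dots), name.decode("ascii"), options, offset


def build_example():
    """Assemble a tiny sample file with correct byte lengths."""
    meta = json.dumps({"path": "a.txt"}).encode("utf-8") + b"\n"
    diff = b"--- a.txt\n+++ a.txt\n@@ -1 +1 @@\n-old\n+new\n"
    return b"".join([
        b"#diffx: encoding=utf-8, version=1.0\n",
        b"#.change:\n",
        b"#..file:\n",
        b"#...meta: format=json, length=%d\n" % len(meta),
        meta,
        b"#...diff: length=%d\n" % len(diff),
        diff,
    ])


data = build_example()
sections = list(scan_sections(io.BytesIO(data)))
for depth, name, options, offset in sections:
    print(depth, name, options, offset)
```

The same sketch hints at the fragility critics raise: if someone edits a diff body by hand without updating `length=`, the seek lands mid-content and the next read would typically fail to match a header.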

Scope: Diff vs Patch, Metadata, Commits, Binary

Some say DiffX conflates concepts:
- A “diff” should just be line changes; commit lists and metadata belong in the VCS/transport (or in patch sets).
- Encoding and metadata problems should be solved by standardizing on UTF‑8 and one SCM.
Others argue:
- In practice, many VCSs expose only textified content with local encoding, mixed newlines, or incomplete metadata.
- Tools need a portable representation of “delta state” including commit ranges, per-file revisions, symlinks, modes, and binary deltas to reconstruct or analyze changes across diverse backends.
- Bundling multiple commits in one file is valuable because it avoids ordering and missing-patch issues for downstream tools.

Alternatives and Broader Perspectives

Suggestions include:
- Formalizing Git’s diff header grammar and/or email-style headers instead of creating DiffX.
- Using more semantic/AST-based diffs (e.g., difftastic) or structured formats for JSON/AST changes.
- In some scenarios, just shipping both full file versions (or compressed pairs) may be simpler.
Some note diffs are still important for:
- Code review pipelines.
- Tools interacting with LLMs where diffs can dramatically reduce token usage and latency.
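
The size argument behind the LLM point can be sketched with Python's standard `difflib`. The example below compares the character count of a full file against a unified diff of a one-line change. Character count is only a rough stand-in for tokens, and the file contents, path, and `unified` helper are invented for illustration.

```python
import difflib


def unified(old, new, path="example.py"):
    """Return a unified diff of two strings with one line of context."""
    return "".join(difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
        n=1,
    ))


# Hypothetical 200-line file with a single changed line.
old = "".join(f"line {i}\n" for i in range(200))
new = old.replace("line 100\n", "line one hundred\n")
patch = unified(old, new)

print(len(new), "characters in the full new file")
print(len(patch), "characters in the diff")
```

The saving shrinks as the proportion of changed lines grows, which is consistent with the suggestion that shipping full file versions is sometimes simpler.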
Adoption concerns:
- DiffX currently appears to be used mostly inside the Review Board ecosystem.
- Without buy-in from major VCSs, some doubt it will gain wide traction, though others see it as a useful documented format that tool authors with similar pain points may adopt.
