On File Formats

Streamability, indexing, and updates

  • Several comments stress making formats streamable or at least efficient over remote/seekable I/O.
  • Strong debate about where to place indexes/TOCs:
    • Index-at-end favors appending, in‑place updates, concatenation, and large archives, and suits workflows like PDF's incremental updates, where small edits just append data (a minimal index-at-end layout is sketched after this list).
    • Index-at-start favors non-seekable streams and immediate discovery of contents.
    • Some suggest hybrid or linked index structures; others note “it’s just a tradeoff, not one right answer.”
  • Many real‑world workflows recreate files rather than update in place, but formats supporting cheap updates still bring UX and performance wins.
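
A minimal sketch of the index-at-end layout discussed above; the TOYF magic, the JSON index, and the 12-byte footer are all invented for illustration:

```python
import json
import struct

MAGIC = b"TOYF"  # hypothetical magic number, invented for this sketch

def write_container(path, records):
    """Write named payloads, then a JSON index, then a fixed-size footer
    holding the index offset -- the classic index-at-end layout."""
    index = {}
    with open(path, "wb") as f:
        f.write(MAGIC)
        for name, payload in records.items():
            index[name] = {"offset": f.tell(), "size": len(payload)}
            f.write(payload)
        index_offset = f.tell()
        f.write(json.dumps(index).encode("utf-8"))
        # Footer: 8-byte index offset + 4-byte magic, always the last 12 bytes.
        f.write(struct.pack("<Q", index_offset) + MAGIC)

def read_index(path):
    """Read the footer first, then jump straight to the index."""
    with open(path, "rb") as f:
        end = f.seek(0, 2)                      # seek to EOF, returns file size
        f.seek(end - 12)
        index_offset, magic = struct.unpack("<Q4s", f.read(12))
        if magic != MAGIC:
            raise ValueError("not a TOYF container")
        f.seek(index_offset)
        return json.loads(f.read(end - 12 - index_offset))

write_container("demo.toyf", {"a": b"first blob", "b": b"second blob"})
print(read_index("demo.toyf"))
```

Appending a small edit works like PDF's incremental updates: write the new data plus a fresh index and footer at the end, and a reader that trusts only the last footer sees the new state. With an index-at-start layout the same edit means rewriting the file, but a non-seekable streaming reader could discover the contents immediately.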

Compression and performance tradeoffs

  • Compression is “probably desired” for large data, but the algorithm and level should match the use case: high compression effort only pays off for data that is copied or decompressed far more often than it is written (see the toy benchmark after this list).
  • General vs domain-specific compression is noted; specialized schemes may outperform generic ones in narrow domains.
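
As a rough illustration of the effort-versus-payoff point, a toy benchmark over a synthetic repetitive payload (zlib only because it is in the standard library; zstd or a domain-specific coder may behave very differently):

```python
import time
import zlib

payload = b"timestamp=1700000000 level=INFO msg=heartbeat\n" * 100_000

for level in (1, 6, 9):
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    ratio = len(compressed) / len(payload)
    print(f"level {level}: ratio {ratio:.3f} in {elapsed * 1000:.1f} ms")
```

The slow high levels are worth it for write-once, copy-many data; for files recompressed on every save, a fast level (or no compression) is usually the better trade.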

Chunking, partial parsing, and versioning

  • Chunked/binary formats are praised for incremental/partial parsing and robustness, but commenters warn that chunking alone doesn’t guarantee reorderability or backward/forward compatibility; explicit versioning is essential (a minimal chunk layout is sketched after this list).
  • DER/ASN.1 is cited as an example of structured, partially skippable binary encoding; others find ASN.1 overkill for most custom formats.
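
A minimal chunked layout, loosely PNG-style: every chunk carries a tag and a length so readers can skip tags they do not understand, and an explicit version chunk records what the writer assumed. The tags and field sizes are invented for the sketch:

```python
import io
import struct

def write_chunk(f, tag, payload):
    """One chunk = 4-byte ASCII tag + 4-byte little-endian length + payload."""
    assert len(tag) == 4
    f.write(tag + struct.pack("<I", len(payload)) + payload)

def read_chunks(f):
    """Yield (tag, payload) pairs until EOF."""
    while True:
        header = f.read(8)
        if len(header) < 8:
            return
        (length,) = struct.unpack("<I", header[4:])
        yield header[:4], f.read(length)

buf = io.BytesIO()
write_chunk(buf, b"VERS", struct.pack("<H", 2))       # explicit format version
write_chunk(buf, b"DATA", b"hello")
write_chunk(buf, b"EXTN", b"added by a newer writer")

buf.seek(0)
for tag, payload in read_chunks(buf):
    if tag == b"VERS":
        (version,) = struct.unpack("<H", payload)
        if version > 2:
            raise ValueError("file is newer than this reader understands")
    elif tag == b"DATA":
        print(payload.decode())
    # Unknown tags (here, EXTN) are silently skipped.
```

Skipping unknown chunks is only safe if the spec states which chunks may be ignored and whether their order matters, which is exactly the versioning caveat above.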

Using existing containers (ZIP, SQLite, etc.)

  • Strong encouragement to reuse existing containers (ZIP, tar, sBOX, CBOR tags, HDF5) instead of inventing from scratch.
  • ZIP as a multipurpose container is praised; many complex formats (Office, APK, EPUB, etc.) already use it (see the sketch after this list).
  • SQLite as a file format/container splits opinion:
    • Pro: great for composite/stateful data, metadata, queries, incremental updates, encryption extensions; multiple real projects use it successfully.
    • Con: overhead, complexity, blob limits, nontrivial format, possibly inferior to ZIP for simple archives or large monolithic blobs.
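
For the ZIP side, a sketch of a hypothetical document stored as a JSON manifest plus named binary assets inside an ordinary archive:

```python
import json
import zipfile

def save_document(path, manifest, assets):
    """Store the document as a ZIP: one well-known manifest entry + assets."""
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as z:
        z.writestr("manifest.json", json.dumps(manifest))
        for name, blob in assets.items():
            z.writestr(f"assets/{name}", blob)

def load_document(path):
    with zipfile.ZipFile(path) as z:
        manifest = json.loads(z.read("manifest.json"))
        assets = {info.filename.removeprefix("assets/"): z.read(info.filename)
                  for info in z.infolist()
                  if info.filename.startswith("assets/")}
    return manifest, assets

save_document("drawing.mydoc", {"version": 1, "title": "demo"},
              {"page1.bin": b"\x00\x01\x02"})
print(load_document("drawing.mydoc")[0])
```

Any ZIP-aware tool (unzip -l, archive managers) can inspect or salvage such a file, which is much of the appeal; SQLite instead buys queries, transactions, and incremental updates at the cost of a heavier, less greppable format.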

Human-readable vs binary, numbers and floats

  • Consensus that human-readable formats should be extremely simple; otherwise binary is safer and clearer.
  • Textual numbers, especially floats, are called tricky to parse and round-trip correctly; binary IEEE 754 with a fixed endianness is seen as easier and less error-prone (see the sketch after this list).
  • Ideas like hex floats, or editor support for visualizing binary floats, come up, but they trade away readability or add complexity.
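
A small illustration of the round-trip point; the formatting choices are examples, not recommendations:

```python
import struct

x = 0.1 + 0.2   # 0.30000000000000004

# Text: printing with a fixed number of digits silently loses precision,
# while repr() emits enough digits to round-trip exactly.
print(float(f"{x:.6f}") == x)   # False
print(float(repr(x)) == x)      # True

# Binary: 8 bytes of little-endian IEEE 754 always round-trips exactly.
(y,) = struct.unpack("<d", struct.pack("<d", x))
print(y == x)                   # True

# Hex floats are exact and textual, but far less readable.
print(x.hex(), float.fromhex(x.hex()) == x)
```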

Directories vs single files, diffability, and tooling

  • Some advocate directory-based “formats” (structured folders, or unzipped equivalents of ZIP-based formats) for better version control, experimentation, and debugging; ZIP can then serve as an export format (see the sketch after this list).
  • Others note that dumping runtime data (pickle, raw object graphs, SQLite snapshots) is convenient but harms portability and can enlarge the attack surface; deserializers must be strictly bounded by a spec.
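
One way the directory-first workflow can look, assuming the working tree lives in project_dir/ (both paths are placeholders):

```python
import shutil

# Work on the format as a plain directory tree (easy to diff, grep, and
# keep under version control), then export a single-file archive on save.
shutil.make_archive("project.save", "zip", root_dir="project_dir")

# Importing goes the other way: unpack back into an editable tree.
shutil.unpack_archive("project.save.zip", "project_dir_copy")
```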

File extensions and type detection

  • Suggestion: long, app-specific extensions (e.g., .mustachemingle) to minimize collisions.
  • Counterpoints: Windows hides “known” extensions by default; Linux desktops often rely on MIME types and magic bytes instead (a sniffer is sketched after this list); long extensions can hurt UX (truncation, typing).
  • Agreement that clear, specific extensions like .sqlite are still useful; distinction between generic shared formats and app-specific ones is highlighted.
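
A sketch of content-based detection using a few well-known magic prefixes; the table is illustrative, not exhaustive:

```python
MAGIC_PREFIXES = {
    b"PK\x03\x04": "zip",
    b"SQLite format 3\x00": "sqlite3",
    b"\x89PNG\r\n\x1a\n": "png",
    b"%PDF-": "pdf",
}

def sniff(path):
    """Identify a file by its leading bytes, regardless of its extension."""
    with open(path, "rb") as f:
        head = f.read(16)
    for prefix, kind in MAGIC_PREFIXES.items():
        if head.startswith(prefix):
            return kind
    return "unknown"
```

The extension is for humans and desktop shells; the magic bytes are for tools that cannot afford to trust the name.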

Design pitfalls and backward compatibility

  • Warnings against over-clever bit-packing (splitting flags across nibbles/bytes) that later prevents extension. Real examples show such schemes becoming brittle.
  • Concern that some parsers ignore documented flexibility (e.g., headers that are allowed to grow, signalled by offset/size fields) and hard-code assumptions instead, breaking future versions (see the sketch after this list).
  • One view holds that human-editable formats can tempt developers to skip proper UI support, degrading usability.
  • Emphasis on documenting formats thoroughly; good specs and tables clarify intent more than code/flowcharts alone.
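
A sketch of the “header that can grow” pattern referenced above: the header records its own size, and readers seek past any bytes they do not understand instead of hard-coding the version-1 layout. The magic, field order, and sizes are invented for the example:

```python
import struct

HEADER_FMT = "<4sHHI"          # magic, version, header_size, record_count
KNOWN_SIZE = struct.calcsize(HEADER_FMT)   # 12 bytes in version 1

def read_header(f):
    magic, version, header_size, record_count = struct.unpack(
        HEADER_FMT, f.read(KNOWN_SIZE))
    if magic != b"TOYF":
        raise ValueError("bad magic")
    if header_size < KNOWN_SIZE:
        raise ValueError("header smaller than the fields it claims to hold")
    # A newer writer may have appended extra header fields; skip them
    # instead of assuming the header is exactly KNOWN_SIZE bytes.
    f.seek(header_size - KNOWN_SIZE, 1)
    return version, record_count
```

The complaint in the list above is about readers that ignore header_size and hard-code the 12 bytes; as soon as version 2 grows the header, those parsers start reading records in the middle of it.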