On File Formats

Streamability, indexing, and updates

  • Several comments stress making formats streamable or at least efficient over remote/seekable I/O.
  • Strong debate about where to place indexes/TOCs:
    • Index-at-end favors appending, in‑place updates, concatenation, and large archives, and suits workflows like PDF's incremental updates, where small edits just append data (a minimal index-at-end layout is sketched after this list).
    • Index-at-start favors non-seekable streams and immediate discovery of contents.
    • Some suggest hybrid or linked index structures; others note “it’s just a tradeoff, not one right answer.”
  • Many real‑world workflows recreate files rather than update in place, but formats supporting cheap updates still bring UX and performance wins.
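
A minimal sketch of the index-at-end layout discussed above; the TOYF magic, the JSON index, and the 12-byte footer are all invented for illustration:

```python
import json
import struct

MAGIC = b"TOYF"  # hypothetical magic number, invented for this sketch

def write_container(path, records):
    """Write named payloads, then a JSON index, then a fixed-size footer
    holding the index offset -- the classic index-at-end layout."""
    index = {}
    with open(path, "wb") as f:
        f.write(MAGIC)
        for name, payload in records.items():
            index[name] = {"offset": f.tell(), "size": len(payload)}
            f.write(payload)
        index_offset = f.tell()
        f.write(json.dumps(index).encode("utf-8"))
        # Footer: 8-byte index offset + 4-byte magic, always the last 12 bytes.
        f.write(struct.pack("<Q", index_offset) + MAGIC)

def read_index(path):
    """Read the footer first, then jump straight to the index."""
    with open(path, "rb") as f:
        end = f.seek(0, 2)                      # seek to EOF, returns file size
        f.seek(end - 12)
        index_offset, magic = struct.unpack("<Q4s", f.read(12))
        if magic != MAGIC:
            raise ValueError("not a TOYF container")
        f.seek(index_offset)
        return json.loads(f.read(end - 12 - index_offset))

write_container("demo.toyf", {"a": b"first blob", "b": b"second blob"})
print(read_index("demo.toyf"))
```

Appending a small edit works like PDF's incremental updates: write the new data plus a fresh index and footer at the end, and a reader that trusts only the last footer sees the new state. With an index-at-start layout the same edit means rewriting the file, but a non-seekable streaming reader could discover the contents immediately.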

Compression and performance tradeoffs

  • Compression is “probably desired” for large data, but the algorithm and level should match the use case: high compression effort only pays off for data that is copied or decompressed far more often than it is written (see the toy benchmark after this list).
  • General vs domain-specific compression is noted; specialized schemes may outperform generic ones in narrow domains.
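
As a rough illustration of the effort-versus-payoff point, a toy benchmark over a synthetic repetitive payload (zlib only because it is in the standard library; zstd or a domain-specific coder may behave very differently):

```python
import time
import zlib

payload = b"timestamp=1700000000 level=INFO msg=heartbeat\n" * 100_000

for level in (1, 6, 9):
    start = time.perf_counter()
    compressed = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    ratio = len(compressed) / len(payload)
    print(f"level {level}: ratio {ratio:.3f} in {elapsed * 1000:.1f} ms")
```

The slow high levels are worth it for write-once, copy-many data; for files recompressed on every save, a fast level (or no compression) is usually the better trade.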

Chunking, partial parsing, and versioning

  • Chunked/binary formats are praised for incremental/partial parsing and robustness, but commenters warn that chunking alone doesn’t guarantee reorderability or backward/forward compatibility; explicit versioning is essential (a minimal chunk layout is sketched after this list).
  • DER/ASN.1 is cited as an example of structured, partially skippable binary encoding; others find ASN.1 overkill for most custom formats.
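
A minimal chunked layout, loosely PNG-style: every chunk carries a tag and a length so readers can skip tags they do not understand, and an explicit version chunk records what the writer assumed. The tags and field sizes are invented for the sketch:

```python
import io
import struct

def write_chunk(f, tag, payload):
    """One chunk = 4-byte ASCII tag + 4-byte little-endian length + payload."""
    assert len(tag) == 4
    f.write(tag + struct.pack("<I", len(payload)) + payload)

def read_chunks(f):
    """Yield (tag, payload) pairs until EOF."""
    while True:
        header = f.read(8)
        if len(header) < 8:
            return
        (length,) = struct.unpack("<I", header[4:])
        yield header[:4], f.read(length)

buf = io.BytesIO()
write_chunk(buf, b"VERS", struct.pack("<H", 2))       # explicit format version
write_chunk(buf, b"DATA", b"hello")
write_chunk(buf, b"EXTN", b"added by a newer writer")

buf.seek(0)
for tag, payload in read_chunks(buf):
    if tag == b"VERS":
        (version,) = struct.unpack("<H", payload)
        if version > 2:
            raise ValueError("file is newer than this reader understands")
    elif tag == b"DATA":
        print(payload.decode())
    # Unknown tags (here, EXTN) are silently skipped.
```

Skipping unknown chunks is only safe if the spec states which chunks may be ignored and whether their order matters, which is exactly the versioning caveat above.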

Using existing containers (ZIP, SQLite, etc.)

  • Strong encouragement to reuse existing containers (ZIP, tar, sBOX, CBOR tags, HDF5) instead of inventing from scratch.
  • ZIP as a multipurpose container is praised; many complex formats (Office, APK, EPUB, etc.) already use it (see the sketch after this list).
  • SQLite as a file format/container splits opinion:
    • Pro: great for composite/stateful data, metadata, queries, incremental updates, encryption extensions; multiple real projects use it successfully.
    • Con: overhead, complexity, blob limits, nontrivial format, possibly inferior to ZIP for simple archives or large monolithic blobs.
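
For the ZIP side, a sketch of a hypothetical document stored as a JSON manifest plus named binary assets inside an ordinary archive:

```python
import json
import zipfile

def save_document(path, manifest, assets):
    """Store the document as a ZIP: one well-known manifest entry + assets."""
    with zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED) as z:
        z.writestr("manifest.json", json.dumps(manifest))
        for name, blob in assets.items():
            z.writestr(f"assets/{name}", blob)

def load_document(path):
    with zipfile.ZipFile(path) as z:
        manifest = json.loads(z.read("manifest.json"))
        assets = {info.filename.removeprefix("assets/"): z.read(info.filename)
                  for info in z.infolist()
                  if info.filename.startswith("assets/")}
    return manifest, assets

save_document("drawing.mydoc", {"version": 1, "title": "demo"},
              {"page1.bin": b"\x00\x01\x02"})
print(load_document("drawing.mydoc")[0])
```

Any ZIP-aware tool (unzip -l, archive managers) can inspect or salvage such a file, which is much of the appeal; SQLite instead buys queries, transactions, and incremental updates at the cost of a heavier, less greppable format.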

Human-readable vs binary, numbers and floats

  • Consensus that human-readable formats should be extremely simple; otherwise binary is safer and clearer.
  • Textual numbers, especially floats, are called tricky to parse and round-trip correctly; binary IEEE 754 with a fixed endianness is seen as easier and less error-prone (see the sketch after this list).
  • Ideas like hex floats, or editor support for visualizing binary floats, come up, but they trade away readability or add complexity.
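
A small illustration of the round-trip point; the formatting choices are examples, not recommendations:

```python
import struct

x = 0.1 + 0.2   # 0.30000000000000004

# Text: printing with a fixed number of digits silently loses precision,
# while repr() emits enough digits to round-trip exactly.
print(float(f"{x:.6f}") == x)   # False
print(float(repr(x)) == x)      # True

# Binary: 8 bytes of little-endian IEEE 754 always round-trips exactly.
(y,) = struct.unpack("<d", struct.pack("<d", x))
print(y == x)                   # True

# Hex floats are exact and textual, but far less readable.
print(x.hex(), float.fromhex(x.hex()) == x)
```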

Directories vs single files, diffability, and tooling

  • Some advocate directory-based “formats” (structured folders, or unzipped equivalents of ZIP-based formats) for better version control, experimentation, and debugging; ZIP can then serve as an export format (see the sketch after this list).
  • Others note that dumping runtime data (pickle, raw object graphs, SQLite snapshots) is convenient but harms portability and can enlarge the attack surface; deserializers must be strictly bounded by a spec.
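
One way the directory-first workflow can look, assuming the working tree lives in project_dir/ (both paths are placeholders):

```python
import shutil

# Work on the format as a plain directory tree (easy to diff, grep, and
# keep under version control), then export a single-file archive on save.
shutil.make_archive("project.save", "zip", root_dir="project_dir")

# Importing goes the other way: unpack back into an editable tree.
shutil.unpack_archive("project.save.zip", "project_dir_copy")
```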

File extensions and type detection

  • Suggestion: long, app-specific extensions (e.g., .mustachemingle) to minimize collisions.
  • Counterpoints: Windows hides “known” extensions by default; Linux desktops often rely on MIME types and magic bytes instead (a sniffer is sketched after this list); long extensions can hurt UX (truncation, typing).
  • Agreement that clear, specific extensions like .sqlite are still useful; distinction between generic shared formats and app-specific ones is highlighted.
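
A sketch of content-based detection using a few well-known magic prefixes; the table is illustrative, not exhaustive:

```python
MAGIC_PREFIXES = {
    b"PK\x03\x04": "zip",
    b"SQLite format 3\x00": "sqlite3",
    b"\x89PNG\r\n\x1a\n": "png",
    b"%PDF-": "pdf",
}

def sniff(path):
    """Identify a file by its leading bytes, regardless of its extension."""
    with open(path, "rb") as f:
        head = f.read(16)
    for prefix, kind in MAGIC_PREFIXES.items():
        if head.startswith(prefix):
            return kind
    return "unknown"
```

The extension is for humans and desktop shells; the magic bytes are for tools that cannot afford to trust the name.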

Design pitfalls and backward compatibility

  • Warnings against over-clever bit-packing (splitting flags across nibbles/bytes) that later prevents extension. Real examples show such schemes becoming brittle.
  • Concern that some parsers ignore documented flexibility (e.g., headers that are allowed to grow, signalled by offset/size fields) and hard-code assumptions instead, breaking future versions (see the sketch after this list).
  • One view holds that human-editable formats can tempt developers to skip proper UI support, degrading usability.
  • Emphasis on documenting formats thoroughly; good specs and tables clarify intent more than code/flowcharts alone.
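
A sketch of the “header that can grow” pattern referenced above: the header records its own size, and readers seek past any bytes they do not understand instead of hard-coding the version-1 layout. The magic, field order, and sizes are invented for the example:

```python
import struct

HEADER_FMT = "<4sHHI"          # magic, version, header_size, record_count
KNOWN_SIZE = struct.calcsize(HEADER_FMT)   # 12 bytes in version 1

def read_header(f):
    magic, version, header_size, record_count = struct.unpack(
        HEADER_FMT, f.read(KNOWN_SIZE))
    if magic != b"TOYF":
        raise ValueError("bad magic")
    if header_size < KNOWN_SIZE:
        raise ValueError("header smaller than the fields it claims to hold")
    # A newer writer may have appended extra header fields; skip them
    # instead of assuming the header is exactly KNOWN_SIZE bytes.
    f.seek(header_size - KNOWN_SIZE, 1)
    return version, record_count
```

The complaint in the list above is about readers that ignore header_size and hard-code the 12 bytes; as soon as version 2 grows the header, those parsers start reading records in the middle of it.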