On File Formats
Streamability, indexing, and updates
- Several comments stress making formats streamable or at least efficient over remote/seekable I/O.
- Strong debate about where to place indexes/TOCs:
  - Index-at-end favors appending, in-place updates, concatenation, large archives, and workflows like PDFs where small edits just append data.
  - Index-at-start favors non-seekable streams and immediate discovery of contents.
- Some suggest hybrid or linked index structures; others note “it’s just a tradeoff, not one right answer.”
- Many real‑world workflows recreate files rather than update in place, but formats supporting cheap updates still bring UX and performance wins.
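The index-at-end position in the debate above can be sketched roughly as follows. This is a minimal illustration, not a real format: the `IDX1` magic, the 16-byte name field, and the function names are all assumptions made up for the example.

```python
import io
import struct

MAGIC = b"IDX1"                   # hypothetical footer magic
FOOTER = struct.Struct("<QI4s")   # index offset, entry count, magic
ENTRY = struct.Struct("<16sQQ")   # name (NUL-padded to 16 bytes), offset, length

def write_archive(out, items):
    """Write payloads first, then the index, then a fixed-size footer."""
    entries = []
    for name, data in items:
        # Names longer than 16 bytes would be truncated by struct; fine for a sketch.
        entries.append((name.encode("utf-8"), out.tell(), len(data)))
        out.write(data)
    index_offset = out.tell()
    for name, offset, length in entries:
        out.write(ENTRY.pack(name, offset, length))
    out.write(FOOTER.pack(index_offset, len(entries), MAGIC))

def read_index(f):
    """Locate the index by reading the footer at the end (needs seekable I/O)."""
    f.seek(-FOOTER.size, io.SEEK_END)
    index_offset, count, magic = FOOTER.unpack(f.read(FOOTER.size))
    if magic != MAGIC:
        raise ValueError("bad footer magic")
    f.seek(index_offset)
    index = {}
    for _ in range(count):
        name, offset, length = ENTRY.unpack(f.read(ENTRY.size))
        index[name.rstrip(b"\0").decode("utf-8")] = (offset, length)
    return index

def read_member(f, index, name):
    """Read one payload directly, without scanning the rest of the file."""
    offset, length = index[name]
    f.seek(offset)
    return f.read(length)
```

The writer never seeks backwards, which is why this layout suits appending. The reader has to seek to the end first, which is the cost the index-at-start camp points out for non-seekable streams.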
Compression and performance tradeoffs
- Compression is “probably desired” for large data, but the algorithm and level should match the use: high compression effort only pays off for data that is copied or decompressed frequently.
- Commenters contrast general-purpose and domain-specific compression; specialized schemes may outperform generic ones in narrow domains.
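One way to check whether a higher compression level is worth it for a given kind of data is to measure it. The sketch below uses zlib from the Python standard library purely as a stand-in; the thread does not name a specific algorithm, and `compare_levels` and the sample data are hypothetical.

```python
import time
import zlib

def compare_levels(data, levels=(1, 6, 9)):
    """Return {level: (compressed_size, seconds)} for a few zlib levels."""
    results = {}
    for level in levels:
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        # Verify the round trip so the size numbers are trustworthy.
        if zlib.decompress(compressed) != data:
            raise RuntimeError("round trip failed")
        results[level] = (len(compressed), elapsed)
    return results

if __name__ == "__main__":
    # Made-up, highly repetitive sample; real data will behave differently.
    sample = b"timestamp=1700000000 level=INFO msg=ok\n" * 2000
    for level, (size, seconds) in compare_levels(sample).items():
        print(f"level {level}: {size} bytes in {seconds:.6f}s")
```

Running this on representative data shows how much size each extra unit of CPU time buys, which is the tradeoff the comments describe.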
Chunking, partial parsing, and versioning
- Chunked/binary formats are praised for incremental/partial parsing and robustness, but commenters warn chunking alone doesn’t guarantee reorderability or backward/forward compatibility; explicit versioning is essential.
- DER/ASN.1 is cited as an example of structured, partially skippable binary encoding; others find ASN.1 overkill for most custom formats.
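To make the chunking and versioning points concrete, here is a rough sketch of a type-length-value layout with an explicit version field. The `DEMO` magic, the chunk type names, and the field widths are assumptions invented for this example, not part of any format mentioned in the thread.

```python
import io
import struct

HEADER = struct.Struct("<4sH")   # magic, format version
CHUNK = struct.Struct("<4sI")    # chunk type, payload length

def write_file(out, version, chunks):
    """Write a header followed by (type, payload) chunks."""
    out.write(HEADER.pack(b"DEMO", version))
    for ctype, payload in chunks:
        out.write(CHUNK.pack(ctype, len(payload)))
        out.write(payload)

def read_known_chunks(f, known, max_version=1):
    """Return chunks whose type is in `known`, skipping everything else."""
    magic, version = HEADER.unpack(f.read(HEADER.size))
    if magic != b"DEMO":
        raise ValueError("not a DEMO file")
    if version > max_version:
        # Chunking alone does not make a newer file safe to read.
        raise ValueError(f"unsupported version {version}")
    found = []
    while True:
        head = f.read(CHUNK.size)
        if not head:
            break
        if len(head) < CHUNK.size:
            raise ValueError("truncated chunk header")
        ctype, length = CHUNK.unpack(head)
        if ctype in known:
            payload = f.read(length)
            if len(payload) < length:
                raise ValueError("truncated chunk payload")
            found.append((ctype, payload))
        else:
            # The length prefix lets the reader skip unknown chunks unparsed.
            f.seek(length, io.SEEK_CUR)
    return found
```

The length prefix is what enables partial parsing. The version check is a separate decision: a reader still needs a rule for which unknown chunks are safe to ignore, which is why commenters call explicit versioning essential.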
Using existing containers (ZIP, SQLite, etc.)
- Strong encouragement to reuse existing containers (ZIP, tar, sBOX, CBOR tags, HDF5) instead of inventing from scratch.
- ZIP as a multipurpose container is praised; many complex formats (Office, APK, EPUB, etc.) already use it.
- SQLite as a file format/container splits opinion:
  - Pro: great for composite/stateful data, metadata, queries, incremental updates, encryption extensions; multiple real projects use it successfully.
  - Con: overhead, complexity, blob limits, nontrivial format, possibly inferior to ZIP for simple archives or large monolithic blobs.
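As an illustration of the "Pro" side, a SQLite-backed document might look something like the sketch below. The table layout, the `APP_ID` value, and the helper names are hypothetical; `PRAGMA application_id` and `PRAGMA user_version` are standard SQLite pragmas for tagging a database file and its schema version.

```python
import sqlite3

APP_ID = 0x44454D4F      # hypothetical application id (ASCII "DEMO")
SCHEMA_VERSION = 1       # hypothetical schema version

def create_document(path=":memory:"):
    """Create a new document database with a tagged header and two tables."""
    db = sqlite3.connect(path)
    db.execute(f"PRAGMA application_id = {APP_ID}")
    db.execute(f"PRAGMA user_version = {SCHEMA_VERSION}")
    db.execute("CREATE TABLE meta (key TEXT PRIMARY KEY, value TEXT)")
    db.execute("CREATE TABLE assets (name TEXT PRIMARY KEY, data BLOB)")
    db.commit()
    return db

def put_asset(db, name, data):
    """Insert or replace one asset; only the affected pages are rewritten."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT OR REPLACE INTO assets (name, data) VALUES (?, ?)",
            (name, data),
        )

def get_asset(db, name):
    """Return the asset bytes, or None if the name is unknown."""
    row = db.execute(
        "SELECT data FROM assets WHERE name = ?", (name,)
    ).fetchone()
    return None if row is None else row[0]
```

For a simple bundle of large, rarely changed blobs, the "Con" arguments still apply and a ZIP archive may be the lighter choice.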
Human-readable vs binary, numbers and floats
- Consensus that human-readable formats should be extremely simple; otherwise binary is safer and clearer.
- Textual numbers, especially floats, are called tricky to parse/round-trip correctly; binary IEEE754 with fixed endianness is seen as easier and less error-prone.
- Ideas like hex floats or editor support for visualizing binary floats appear, but each either costs readability or adds complexity.
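The float round-trip concern can be shown with a short sketch. It assumes Python 3, where `repr` of a float produces the shortest string that round-trips; the helper names are made up for the example.

```python
import struct

def to_binary(x):
    """Encode a float as 8 bytes of little-endian IEEE 754 binary64."""
    return struct.pack("<d", x)

def from_binary(b):
    """Decode 8 bytes of little-endian IEEE 754 binary64."""
    return struct.unpack("<d", b)[0]

x = 0.1 + 0.2

lossy_text = f"{x:.6f}"   # fixed-precision decimal text silently drops bits
exact_text = repr(x)      # round-trips, but only if the writer knows to use it
hex_text = x.hex()        # hex float: exact, but harder for humans to read

print(float(lossy_text) == x)       # text with too few digits changes the value
print(float(exact_text) == x)
print(float.fromhex(hex_text) == x)
print(from_binary(to_binary(x)) == x)
```

The binary form has a fixed size and one obvious way to read it back, which is the argument for fixed-endianness IEEE 754. The text forms work only when both writer and parser get the formatting rules right.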
Directories vs single files, diffability, and tooling
- Some advocate directory-based “formats” (structured folders, or unzipped equivalents of ZIP-based formats) for better version control, experimentation, and debugging; ZIP can then be an export format.
- Others note that dumping runtime data (pickle, raw object graphs, SQLite snapshots) is convenient but harms portability and can enlarge attack surface; deserializers must be strictly bounded by a spec.
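A deserializer that is "strictly bounded by a spec" might look roughly like this. The layout (a 32-bit count followed by length-prefixed UTF-8 strings) and the limits are assumptions for illustration, standing in for whatever a real spec would define.

```python
import io
import struct

MAX_ITEMS = 1000          # hypothetical limit from an imagined spec
MAX_STRING_BYTES = 4096   # hypothetical limit from an imagined spec

def read_exact(f, n):
    """Read exactly n bytes or fail; never accept a short read."""
    data = f.read(n)
    if len(data) != n:
        raise ValueError("unexpected end of input")
    return data

def read_string_list(f):
    """Parse: u32 count, then `count` items of (u16 length, UTF-8 bytes)."""
    (count,) = struct.unpack("<I", read_exact(f, 4))
    if count > MAX_ITEMS:
        # Reject before allocating or looping on attacker-controlled sizes.
        raise ValueError("too many items")
    items = []
    for _ in range(count):
        (length,) = struct.unpack("<H", read_exact(f, 2))
        if length > MAX_STRING_BYTES:
            raise ValueError("string too long")
        items.append(read_exact(f, length).decode("utf-8"))
    return items
```

Unlike unpickling an object graph, this parser can only ever produce a list of strings, and every size it trusts is checked against a documented maximum first.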
File extensions and type detection
- Suggestion: long, app-specific extensions (e.g., .mustachemingle) to minimize collisions.
- Counterpoints: Windows hides “known” extensions; Linux often relies on MIME/magic; long extensions can hurt UX (truncation, typing).
- Agreement that clear, specific extensions like .sqlite are still useful; distinction between generic shared formats and app-specific ones is highlighted.
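The magic-based detection mentioned in the counterpoints can be sketched as a lookup on a file's leading bytes instead of its extension. The signatures below are the well-known ones for these formats; the `sniff` helper and the choice of formats are just for illustration.

```python
SIGNATURES = [
    (b"SQLite format 3\x00", "sqlite"),
    (b"PK\x03\x04", "zip"),            # also matches ZIP-based formats like EPUB or APK
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
]

def sniff(head):
    """Guess a file type from its first bytes; return None if unknown."""
    for magic, kind in SIGNATURES:
        if head.startswith(magic):
            return kind
    return None

def sniff_path(path):
    """Read just enough of a file to compare against the known signatures."""
    longest = max(len(magic) for magic, _ in SIGNATURES)
    with open(path, "rb") as f:
        return sniff(f.read(longest))
```

Note that a ZIP signature only identifies the container: telling an EPUB from an APK still needs a look inside, which is one reason specific extensions remain useful.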
Design pitfalls and backward compatibility
- Warnings against over-clever bit-packing (splitting flags across nibbles/bytes) that later prevents extension. Real examples show such schemes becoming brittle.
- Concern that some parsers ignore documented flexibility (e.g., header growth with offset fields) and hard-code assumptions, breaking future versions.
- One view holds that human-editable formats can tempt developers to skip proper UI support, degrading usability.
- Emphasis on documenting formats thoroughly; good specs and tables clarify intent more than code/flowcharts alone.
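The header-growth pitfall can be illustrated with a small sketch. The `HDR0` magic and the field layout are hypothetical. The point is only that the reader follows the documented offset field instead of assuming the header has a fixed size.

```python
import struct

FIXED = struct.Struct("<4sHI")   # magic, version, data_offset

def build(version, extra_header, payload):
    """Build a file whose header may grow; data_offset says where data starts."""
    data_offset = FIXED.size + len(extra_header)
    return FIXED.pack(b"HDR0", version, data_offset) + extra_header + payload

def read_payload(blob):
    """Locate the payload via data_offset, ignoring header fields we don't know."""
    magic, version, data_offset = FIXED.unpack_from(blob, 0)
    if magic != b"HDR0":
        raise ValueError("bad magic")
    if not FIXED.size <= data_offset <= len(blob):
        raise ValueError("data offset out of range")
    # A brittle parser would hard-code blob[FIXED.size:] here and break
    # as soon as a later version adds header fields.
    return blob[data_offset:]
```

With this reader, a file written by a later version with six extra header bytes still yields the same payload as one written with none.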