F3: Open-source data file format for the future [pdf]

Overview of F3 and Its Goals

  • Columnar, Arrow-oriented file format where each file is self-describing.
  • Encoders/decoders are shipped as embedded WebAssembly, allowing new encodings without changing the global standard or clients.
  • Intended to be a “universal”, future-proof successor to Parquet/ORC for analytical workloads.

Embedding WASM Decoders: Value vs Complexity

  • Supporters see embedded WASM as:
    • A compatibility layer so old software can read future encodings.
    • A way to ship experimental/specialized encodings (e.g., better compression, new float layouts) without waiting years for ecosystem upgrades.
    • A fallback path: readers use native decoders when available and fall back to the embedded WASM decoder, at a modest (10–30%) overhead, when they are not.
  • Critics argue:
    • You’re effectively bundling a decoder with every file, reminiscent of self-extracting archives.
    • Programs already need decoders; requiring a WASM runtime can be heavier than adding one more codec.
    • It risks a proliferation of incompatible, per-file encodings.
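
The "native first, embedded fallback" idea that supporters describe can be sketched roughly as below. This is only an illustration in plain Python, not F3's API: `NATIVE_DECODERS`, `embedded_rle_decoder`, `decode_column`, and the encoding names are hypothetical, and an ordinary Python function stands in for the WASM module a real file would carry.

```python
from typing import Callable, Dict, List, Tuple

Decoder = Callable[[bytes], List[int]]

# Hypothetical registry of decoders compiled into the reader itself.
NATIVE_DECODERS: Dict[str, Decoder] = {
    "plain_u8": lambda payload: list(payload),
}


def embedded_rle_decoder(payload: bytes) -> List[int]:
    """Stand-in for a decoder shipped inside the file (WASM in F3's design).

    The payload is assumed to be a sequence of (run_length, value) byte pairs.
    """
    if len(payload) % 2 != 0:
        raise ValueError("RLE payload must contain whole (length, value) pairs")
    out: List[int] = []
    for i in range(0, len(payload), 2):
        run_length, value = payload[i], payload[i + 1]
        out.extend([value] * run_length)
    return out


def decode_column(
    encoding: str, payload: bytes, embedded: Dict[str, Decoder]
) -> Tuple[str, List[int]]:
    """Prefer a native decoder; otherwise use the one embedded in the file."""
    if encoding in NATIVE_DECODERS:
        return "native", NATIVE_DECODERS[encoding](payload)
    if encoding in embedded:
        return "embedded", embedded[encoding](payload)
    raise LookupError(f"no decoder available for encoding {encoding!r}")
```

An old reader that has never heard of `rle_u8` can still decode it as long as the file supplies the decoder, which is the compatibility argument; the critics' point is that every file then carries that extra code.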

Security, Bugs, and “Code-as-Data”

  • Concern about repeating the mistakes of macro-enabled documents and scripted file formats.
  • The paper relies on WASM sandboxing and on explicitly copying data into guest memory; the authors acknowledge the overhead and accept it as the price of isolation.
  • Open issues:
    • Sandbox vulnerabilities and side channels.
    • Denial of service and non-termination: time and space bounds are suggested, but enforcing them is non-trivial.
    • Shipping buggy decoders inside datasets, version skew, and how to roll out bugfixes.
    • Potential for data-dependent malicious decoders; some see this as far-fetched but possible.
  • Authors mention future ideas like verified module registries; commenters see “hope” rather than a concrete safety story.
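
To make the time/space-bound idea concrete, here is a minimal sketch of a host that caps both the number of steps and the output size of an untrusted decoder. It is an assumption-laden toy, not the paper's mechanism: the decoder is modeled as a Python generator, `run_with_budget` and `BudgetExceeded` are hypothetical names, and the budget is only checked each time the decoder hands back a chunk. A real WASM host would have to meter execution inside the guest, which is part of why commenters call this non-trivial.

```python
from typing import Iterable, Iterator


class BudgetExceeded(Exception):
    """Raised when an untrusted decoder exceeds its time or space budget."""


def run_with_budget(
    chunks: Iterable[bytes], max_steps: int, max_output_bytes: int
) -> bytes:
    """Collect decoder output, aborting once either budget is exhausted."""
    out = bytearray()
    for steps, chunk in enumerate(chunks, start=1):
        if steps > max_steps:
            raise BudgetExceeded("step budget exhausted")
        if len(out) + len(chunk) > max_output_bytes:
            raise BudgetExceeded("output size limit exceeded")
        out.extend(chunk)
    return bytes(out)


def well_behaved_decoder() -> Iterator[bytes]:
    """A decoder that terminates after producing a small output."""
    yield b"abc"
    yield b"def"


def runaway_decoder() -> Iterator[bytes]:
    """A decoder that never terminates on its own."""
    while True:
        yield b"x"
```

With `max_steps=100`, `run_with_budget(runaway_decoder(), 100, 10**6)` raises `BudgetExceeded` instead of looping forever, while the well-behaved decoder completes normally. Choosing limits that stop abuse without rejecting legitimate large columns is the hard part this sketch does not address.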

Relation to Other Formats and Fragmentation

  • Backstory of an attempted consortium that collapsed, leading to multiple competing formats:
    • F3, Vortex, Nimble, FastLanes, AnyBlox, plus bespoke scientific formats (e.g., CERN).
  • F3 prototype reuses Vortex encoders but has its own type/API model; Vortex is “orthogonal” and more engineering-focused.
  • Some see this as healthy exploration; others as a “format mess” that complicates adoption and interoperability.

Columnar Storage, Parquet Pain Points, and F3 Improvements

  • Several comments explain columnar vs row-based storage and why columnar is suited to OLAP (scans, aggregates).
  • Parquet is described as:
    • Arcane, with fragile Thrift metadata and Dremel shredding.
    • Hard to implement optimally (especially in Java).
    • Using variable-size pages and heavyweight compressors that add many dependencies.
  • F3 praised for:
    • Composable, lightweight encodings and direct Arrow buffer access.
    • Fixed-size IO units and random-access metadata.
    • Avoiding Dremel-style complexity.
  • Skepticism about using FlatBuffers (safety concerns), and questions about why Arrow is not simply stored directly (the answer given: Arrow isn’t compressed).
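
As a rough illustration of what "composable, lightweight encodings" means, the sketch below chains two textbook encodings: dictionary encoding of a string column, then run-length encoding of the resulting integer codes. These are generic examples chosen for illustration; the thread does not say which encodings F3 actually ships, and all function names here are hypothetical.

```python
from typing import Dict, List, Tuple


def dict_encode(values: List[str]) -> Tuple[List[str], List[int]]:
    """Replace each string with a small integer code into a dictionary."""
    dictionary: List[str] = []
    index: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def rle_encode(codes: List[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive repeats into (code, run_length) pairs."""
    runs: List[Tuple[int, int]] = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1] = (code, runs[-1][1] + 1)
        else:
            runs.append((code, 1))
    return runs


def rle_decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Expand (code, run_length) pairs back into the code sequence."""
    return [code for code, length in runs for _ in range(length)]


def dict_decode(dictionary: List[str], codes: List[int]) -> List[str]:
    """Map integer codes back to their original strings."""
    return [dictionary[code] for code in codes]


column = ["us", "us", "us", "de", "de", "us"]
dictionary, codes = dict_encode(column)   # first layer
runs = rle_encode(codes)                  # second layer, applied to the codes
restored = dict_decode(dictionary, rle_decode(runs))
```

Each layer is cheap to decode and can be swapped independently, which is the contrast commenters draw with a single heavyweight compressor applied to whole pages.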

Performance, Compression, and WASM Overhead

  • A 10–30% WASM slowdown is seen by some as an unacceptable baseline; others consider it fine for a fallback path.
  • Debate over whether lightweight encodings suffice or whether heavyweight compression (zstd/brotli) is still needed for some string-heavy columns.
  • Idea that specialized, per-file compressors could yield big archival wins, but at the cost of more complex decoders.
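
The disagreement over the 10–30% figure depends largely on how often the fallback path is taken. A back-of-the-envelope model, assuming decode time dominates and the overhead applies uniformly to whatever fraction of the work goes through WASM, is sketched below. The function name and the model itself are illustrative assumptions, not something taken from the paper.

```python
def blended_slowdown(fallback_fraction: float, wasm_overhead: float) -> float:
    """Overall slowdown factor when only part of the decoding uses WASM.

    fallback_fraction: share of decode work with no native decoder (0.0 to 1.0).
    wasm_overhead: relative overhead of the WASM path, e.g. 0.3 for 30%.
    """
    if not 0.0 <= fallback_fraction <= 1.0:
        raise ValueError("fallback_fraction must be between 0 and 1")
    if wasm_overhead < 0.0:
        raise ValueError("wasm_overhead must be non-negative")
    return 1.0 + fallback_fraction * wasm_overhead


# If every column goes through WASM, the full overhead applies.
always_wasm = blended_slowdown(1.0, 0.30)

# If only one tenth of the decode work lacks a native decoder, the
# blended cost is far smaller than the headline figure.
rare_fallback = blended_slowdown(0.10, 0.30)
```

Under this simple model the headline number is a worst case: it describes a reader with no native decoders at all, which is the scenario the critics have in mind, while the "fine for a fallback" camp assumes the fallback fraction stays small.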

Adoption, Inertia, and WASM’s Future

  • Strong recognition that Parquet/ORC’s installed base and tooling create path dependence; better tech may lose.
  • Success would require high-quality connectors (DuckDB, Iceberg, Spark, etc.).
  • Divergent views on WASM’s longevity and versioning:
    • Optimists highlight multiple runtimes and likely long-term support.
    • Skeptics note that small spec changes can strand old bytecode; “nothing screams future-proof like WASM” is used both sincerely and sarcastically.

Miscellaneous Reactions

  • Some see F3 as a clever, overdue rethinking of file formats; others as a late-night brainstorm that will age poorly.
  • Concerns about environments that intentionally minimize dependencies, where requiring a WASM runtime is a non-starter.
  • Curiosity about encryption support via WASM, and about “optimal” standard formats for rows vs columns.
  • Light humor about the irony of a “file format for the future” being presented as a PDF, and about an embedded chess move challenge in the paper.