F3: Open-source data file format for the future [pdf]
Overview of F3 and Its Goals
- Columnar, Arrow-oriented file format where each file is self-describing.
- Encoders/decoders are shipped as embedded WebAssembly, allowing new encodings without changing the global standard or clients.
- Intended to be a “universal”, future-proof successor to Parquet/ORC for analytical workloads.
Embedding WASM Decoders: Value vs Complexity
- Supporters see embedded WASM as:
  - A compatibility layer so old software can read future encodings.
  - A way to ship experimental/specialized encodings (e.g., better compression, new float layouts) without waiting years for ecosystem upgrades.
  - A fallback path: use native decoders when available, and fall back to the embedded WASM at a modest (10–30%) slowdown.
- Critics argue:
  - You’re effectively bundling a decoder with every file, reminiscent of self-extracting archives.
  - Programs already need decoders; requiring a WASM runtime can be heavier than adding one more codec.
  - It risks a proliferation of incompatible, per-file encodings.
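
The native-first, WASM-fallback path that supporters describe can be sketched as a small dispatch routine. This is a minimal illustration, not F3's actual API: the registry, the encoding names, and both toy decoders are hypothetical, and a plain Python callable stands in for a sandboxed WASM module.

```python
from typing import Callable, Dict, List, Tuple

Decoder = Callable[[bytes], List[int]]


def decode_rle_native(payload: bytes) -> List[int]:
    """Native decoder for a toy run-length encoding of (count, value) byte pairs."""
    out: List[int] = []
    for i in range(0, len(payload) - 1, 2):
        count, value = payload[i], payload[i + 1]
        out.extend([value] * count)
    return out


def decode_delta_embedded(payload: bytes) -> List[int]:
    """Toy 'future' encoding: the output is the running sum of the input bytes.

    Hypothetical stand-in for a decoder shipped inside the file as WASM.
    """
    values: List[int] = []
    current = 0
    for b in payload:
        current += b
        values.append(current)
    return values


# Hypothetical registry of encodings this reader implements natively.
NATIVE_DECODERS: Dict[str, Decoder] = {"rle": decode_rle_native}


def decode_column(encoding: str, payload: bytes, embedded: Decoder) -> Tuple[str, List[int]]:
    """Prefer a native decoder; otherwise run the decoder embedded in the file."""
    native = NATIVE_DECODERS.get(encoding)
    if native is not None:
        return "native", native(payload)
    # A real reader would instantiate the embedded module in a sandbox here.
    return "embedded", embedded(payload)


print(decode_column("rle", bytes([3, 7, 2, 9]), decode_delta_embedded))
print(decode_column("delta-v2", bytes([10, 1, 2]), decode_delta_embedded))
```

An old reader that has never heard of the made-up `delta-v2` encoding still decodes the column, which is the compatibility argument; the critics' point is that every file then has to carry `decode_delta_embedded` with it.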
Security, Bugs, and “Code-as-Data”
- Concern about repeating the mistakes of macro-enabled documents and scripted file formats.
- The paper relies on WASM sandboxing and on explicitly copying data into guest memory; it acknowledges the resulting overhead and accepts it as the price of isolation.
- Open issues:
  - Sandbox vulnerabilities and side channels.
  - Denial-of-service / non-termination; time/space bounds are suggested but non-trivial to enforce.
  - Shipping buggy decoders inside datasets, version skew, and how to roll out bugfixes.
  - Potential for data-dependent malicious decoders; some see this as far-fetched but possible.
- Authors mention future ideas like verified module registries; commenters see “hope” rather than a concrete safety story.
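
To make the time/space-bound idea from the open issues concrete, here is a minimal sketch of a host that caps both the number of decoding steps and the total output size. It assumes the untrusted decoder is exposed as an iterator of output chunks; `run_with_budget`, `BudgetExceeded`, and the toy decoders are hypothetical names, not part of F3 or any WASM runtime.

```python
from typing import Iterable, Iterator


class BudgetExceeded(Exception):
    """Raised when an untrusted decoder exceeds its step or output budget."""


def run_with_budget(chunks: Iterable[bytes], max_steps: int, max_output: int) -> bytes:
    """Collect decoder output, aborting once either budget is exhausted."""
    out = bytearray()
    for steps, chunk in enumerate(chunks, start=1):
        if steps > max_steps:
            raise BudgetExceeded("step budget exhausted")
        if len(out) + len(chunk) > max_output:
            raise BudgetExceeded("output budget exhausted")
        out.extend(chunk)
    return bytes(out)


def rle_chunks(payload: bytes) -> Iterator[bytes]:
    """Toy decoder: yields one chunk per (count, value) byte pair."""
    for i in range(0, len(payload) - 1, 2):
        yield bytes([payload[i + 1]]) * payload[i]


def endless_chunks() -> Iterator[bytes]:
    """Misbehaving decoder that never terminates on its own."""
    while True:
        yield b"\x00"


print(run_with_budget(rle_chunks(bytes([2, 65, 3, 66])), max_steps=10, max_output=100))
try:
    run_with_budget(endless_chunks(), max_steps=1000, max_output=10**6)
except BudgetExceeded as exc:
    print("aborted:", exc)
```

This only bounds a decoder that cooperates by yielding; one that spins without producing output would never be interrupted. Catching that case needs metering inside the runtime itself, which may be part of why commenters call such bounds non-trivial.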
Relation to Other Formats and Fragmentation
- Backstory of an attempted consortium that collapsed, leading to multiple competing formats:
  - F3, Vortex, Nimble, FastLanes, AnyBlox, plus bespoke scientific formats (e.g., those used at CERN).
- F3 prototype reuses Vortex encoders but has its own type/API model; Vortex is “orthogonal” and more engineering-focused.
- Some see this as healthy exploration; others as a “format mess” that complicates adoption and interoperability.
Columnar Storage, Parquet Pain Points, and F3 Improvements
- Several comments explain columnar vs row-based storage and why columnar is suited to OLAP (scans, aggregates).
- Parquet is described as:
  - Arcane, with fragile Thrift metadata and Dremel shredding.
  - Hard to implement optimally (especially in Java).
  - Using variable-size pages and heavyweight compressors that add many dependencies.
- F3 praised for:
  - Composable, lightweight encodings and direct Arrow buffer access.
  - Fixed-size IO units and random-access metadata.
  - Avoiding Dremel-style complexity.
- Skepticism about using FlatBuffers (safety concerns), and questions about why data isn’t simply stored as Arrow directly (answer: Arrow isn’t compressed).
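
The row-versus-columnar distinction explained at the top of this section can be illustrated with a toy table. The sketch below uses only the Python standard library; the table contents and function names are made up for illustration, and it does not model Parquet or F3 internals.

```python
from array import array

# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"user_id": 1, "country": "DE", "amount": 12.5},
    {"user_id": 2, "country": "US", "amount": 7.0},
    {"user_id": 3, "country": "DE", "amount": 30.25},
]

# Columnar layout: one contiguous buffer per column.
columns = {
    "user_id": array("q", (r["user_id"] for r in rows)),
    "country": [r["country"] for r in rows],
    "amount": array("d", (r["amount"] for r in rows)),
}


def total_amount_rowwise(table) -> float:
    """Aggregate by visiting every record, even though only one field is needed."""
    return sum(r["amount"] for r in table)


def total_amount_columnar(cols) -> float:
    """Aggregate by scanning just the one column the query needs."""
    return sum(cols["amount"])


amount_bytes = columns["amount"].itemsize * len(columns["amount"])
print(total_amount_rowwise(rows), total_amount_columnar(columns), amount_bytes)
```

Both functions return the same total, but the columnar version touches only the `amount` buffer, which is why scans and aggregates in OLAP workloads favor this layout.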
Performance, Compression, and WASM Overhead
- 10–30% WASM slowdown is seen by some as an unacceptable baseline; others see it as fine for a fallback path.
- Debate on whether lighter encodings suffice vs needing heavyweight compression (zstd/brotli) for some string-heavy columns.
- Idea that specialized, per-file compressors could yield big archival wins, but at the cost of more complex decoders.
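
The lightweight-versus-heavyweight debate can be explored with a small experiment. The sketch below dictionary-encodes a low-cardinality string column and compares sizes against a general-purpose compressor. It is an assumption-laden toy: `zlib` from the Python standard library stands in for zstd/brotli, the column is synthetic, and `dict_encode`/`dict_decode` are hypothetical helpers rather than any format's real encoders.

```python
import zlib
from typing import List, Tuple


def dict_encode(values: List[str]) -> Tuple[List[str], bytes]:
    """Dictionary-encode a string column with at most 256 distinct values."""
    dictionary = sorted(set(values))
    if len(dictionary) > 256:
        raise ValueError("toy encoder supports at most 256 distinct values")
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, bytes(index[v] for v in values)


def dict_decode(dictionary: List[str], codes: bytes) -> List[str]:
    """Map one-byte codes back to their dictionary entries."""
    return [dictionary[c] for c in codes]


# Synthetic low-cardinality column (illustrative data only).
column = ["pending", "shipped", "delivered", "shipped"] * 250
raw = "\n".join(column).encode("utf-8")
dictionary, codes = dict_encode(column)

sizes = {
    "raw": len(raw),
    "dictionary": len(codes) + sum(len(d.encode("utf-8")) for d in dictionary),
    "zlib(raw)": len(zlib.compress(raw)),
    "zlib(dictionary codes)": len(zlib.compress(codes)),
}
for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

On a column like this the lightweight encoding already removes most of the redundancy and stays cheap to decode; free-text columns with high cardinality are where commenters expect heavyweight compressors to still matter.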
Adoption, Inertia, and WASM’s Future
- Strong recognition that Parquet/ORC’s installed base and tooling create path dependence; better tech may lose.
- Success would require high-quality connectors (DuckDB, Iceberg, Spark, etc.).
- Divergent views on WASM’s longevity and versioning:
  - Optimists highlight multiple runtimes and likely long-term support.
  - Skeptics note that small spec changes can strand old bytecode; “nothing screams future-proof like WASM” is used both sincerely and sarcastically.
Miscellaneous Reactions
- Some see F3 as a clever, overdue rethinking of file formats; others as a late-night brainstorm that will age poorly.
- Concerns about environments that intentionally minimize dependencies, where requiring a WASM runtime is a non-starter.
- Curiosity about encryption support via WASM, and about “optimal” standard formats for rows vs columns.
- Light humor about the irony of a “file format for the future” being presented as a PDF, and about an embedded chess move challenge in the paper.