F3

What F3 Is (as inferred from the thread)

  • Columnar data storage format intended as an alternative to Parquet/ORC/Nimble/Lance, rather than a general-purpose file format.
  • Designed for analytics / “big data” workloads, with a focus on random access and extensibility.
  • Embeds WebAssembly (Wasm) decoders in each file as a self-describing, forward-compatible mechanism.
  • Decoders appear to output Arrow-style buffers; format metadata itself is defined via FlatBuffers.
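
The forward-compatibility idea in the bullets above can be sketched as a toy reader. This is only an illustration of the concept as described in the thread, not F3's actual layout or API: `Chunk`, `NATIVE_DECODERS`, and `read_chunk` are hypothetical names, a plain Python callable stands in for an embedded Wasm module, and a list of integers stands in for an Arrow-style buffer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical decoder signature: raw bytes in, decoded column values out.
Decoder = Callable[[bytes], List[int]]


@dataclass
class Chunk:
    """Toy stand-in for one encoded column chunk inside a file."""
    encoding: str              # name of the encoding used for the payload
    payload: bytes             # encoded column data
    embedded_decoder: Decoder  # stand-in for the Wasm decoder shipped in the file


def decode_plain(payload: bytes) -> List[int]:
    """Each byte is one value."""
    return list(payload)


def decode_delta(payload: bytes) -> List[int]:
    """Each byte is the difference from the previous value (running sum)."""
    values, total = [], 0
    for b in payload:
        total += b
        values.append(total)
    return values


# Encodings this (old) reader was built with. It has never heard of "delta".
NATIVE_DECODERS: Dict[str, Decoder] = {"plain": decode_plain}


def read_chunk(chunk: Chunk) -> List[int]:
    """Use a native decoder when one exists, else the decoder shipped with the data."""
    decoder = NATIVE_DECODERS.get(chunk.encoding, chunk.embedded_decoder)
    return decoder(chunk.payload)


if __name__ == "__main__":
    newer = Chunk("delta", bytes([10, 1, 1, 5]), decode_delta)
    print(read_chunk(newer))
```

In this sketch the old reader can still decode the "delta" chunk because the decoder travels with the data, which is the property proponents point to. Preferring a native decoder when one is available is an assumption made here for illustration.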

Critique of Documentation and “Why”

  • Many readers find the GitHub README vague and marketing-heavy: unclear what the format does, what problems it solves, or where it should be used.
  • The core rationale is mostly in the linked research paper; the repo alone is considered hard to understand.
  • Several commenters ask that the advantages over Parquet, with supporting metrics, be summarized directly in the README.

Motivation vs Parquet and Other Formats

  • Cited shortcomings of Parquet include: hardware-oblivious design, global/awkward metadata, difficulty adding new encodings while maintaining compatibility, and weak random access.
  • Some argue these could be addressed by investing more engineering into Parquet or alternative formats like Vortex or Lance.
  • Others see value in new formats for mixed batch + random access and ML workloads, though Parquet’s broad compatibility remains a major moat.

Embedded Wasm Decoders: Pros and Cons

  • Proponents:
    • Solves forward-compatibility for new encodings without updating every reader.
    • Platform-independent, sandboxed VM; decoders can be pure functions returning buffers.
    • Similar ideas have existed (RAR VM, fonts, Anyblox); Wasm runtimes can limit memory and instruction counts.
  • Skeptics:
    • Embedding executable code in data files increases attack surface (RCE, DoS, compression bombs).
    • Even with sandboxing, bugs in Wasm engines or host interfaces are likely.
    • Makes ingestion of untrusted data risky unless Wasm is disabled, which undercuts a key selling point.
    • Debugging third-party Wasm decoders can be painful.
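
Both sides of this debate turn on whether a host can keep an untrusted decoder within resource limits. A minimal sketch of that idea, assuming a toy run-length decoder and a host that meters it, is shown here. It does not use a real Wasm runtime: `run_with_budget`, `rle_chunks`, and `BudgetExceeded` are hypothetical names, and the step and output caps are only loose analogues of the instruction-count and memory limits the thread says Wasm runtimes can impose.

```python
from typing import Callable, Iterator


class BudgetExceeded(Exception):
    """Raised by the host when a decoder exceeds an imposed limit."""


def rle_chunks(payload: bytes) -> Iterator[bytes]:
    """Toy untrusted decoder: payload is (count, value) byte pairs.

    Yields one decoded run at a time, so the host can meter it.
    A trailing unpaired byte is ignored.
    """
    for i in range(0, len(payload) - 1, 2):
        count, value = payload[i], payload[i + 1]
        yield bytes([value]) * count


def run_with_budget(
    decoder: Callable[[bytes], Iterator[bytes]],
    payload: bytes,
    max_output_bytes: int,
    max_steps: int,
) -> bytes:
    """Run a decoder, aborting once it exceeds the step or output budget."""
    out = bytearray()
    for step, chunk in enumerate(decoder(payload), start=1):
        if step > max_steps:
            raise BudgetExceeded("step budget exhausted")
        # Each chunk here is at most 255 bytes, so checking after it is
        # produced still bounds total memory use.
        if len(out) + len(chunk) > max_output_bytes:
            raise BudgetExceeded("output budget exhausted")
        out.extend(chunk)
    return bytes(out)


if __name__ == "__main__":
    print(run_with_budget(rle_chunks, bytes([3, 65, 2, 66]), 16, 16))
    bomb = bytes([255, 0]) * 1000  # 2 KB of input expanding to ~255 KB
    try:
        run_with_budget(rle_chunks, bomb, max_output_bytes=1024, max_steps=10_000)
    except BudgetExceeded as exc:
        print("rejected:", exc)
```

A budget like this addresses the compression-bomb and DoS concerns only. It does nothing about bugs in the engine or host interface, which is the skeptics' other main objection.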

Performance, Adoption, and Longevity

  • Concerns that Wasm-based decoding may be slower and interfere with engine-level optimizations (e.g., DuckDB-style vectorization).
  • Doubts about whether a research project with few recent commits and no ecosystem support can displace Parquet.
  • Some see F3 as potentially better for archival, but others argue that simple, text-like formats (CSV/JSON) or Parquet itself are more future-proof.
  • Overall sentiment: interesting, clever idea with serious practical, security, and adoption hurdles.