A love letter to the CSV format
Excel and CSV Frictions
- Many comments argue that Excel “hates” CSV by default: double-clicking a file triggers locale-based parsing that silently transforms data and may drop or mangle columns.
- Locale coupling causes major breakage: in many European locales Excel uses commas as decimal separators and silently switches CSV delimiters to semicolons; different machines/OS languages produce different “CSV” for the same workflow.
- Excel historically mishandled UTF‑8 (requiring a byte-order mark; see the sketch after this list) and still auto‑coerces values (dates, large integers, ZIP codes, gene names); the coercion is severe enough that the HGNC renamed genes such as SEPT1 and MARCH1 specifically to stop Excel converting them into dates.
- Using the “From Text/CSV” importer or Power Query mitigates many issues but is seen as non-obvious, clunky, and not round‑trippable without manual fixes.
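A minimal sketch of the encoding workaround mentioned above, assuming Python: writing with the `utf-8-sig` codec prepends the byte-order mark that lets older Excel builds decode UTF‑8 on double-click (the file name and rows here are hypothetical). Note this only fixes the encoding; it does nothing about locale-driven delimiter handling.

```python
import csv

# Hypothetical rows; "utf-8-sig" prepends the BOM legacy Excel needs to
# recognize UTF-8, and newline="" leaves line endings to the csv module
# (whose default "\r\n" matches RFC 4180 and Excel's expectations).
rows = [["id", "city"], ["1", "Zürich"], ["2", "São Paulo"]]
with open("export.csv", "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(rows)
```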
CSV’s Underspecification and RFC 4180
- A recurring theme: there is no single CSV, only dialects (delimiters, quoting rules, encodings, headers, line endings).
- RFC 4180 exists but arrived late (2005, long after CSV was entrenched), is only an informational and partial spec, and is often ignored (especially around Unicode encodings and multiline quoted fields).
- This leads to brittle integrations, especially when ingesting “wild” CSV from banks, ERPs, or legacy tools; developers often end up writing ad‑hoc heuristics and per‑partner parsers (a dialect-sniffing sketch follows this list).
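One common mitigation is dialect sniffing. A minimal sketch using Python's stdlib `csv.Sniffer`, which guesses the delimiter and quoting from a sample; the file name is hypothetical, and sniffing is itself a heuristic that can guess wrong, which is rather the thread's point.

```python
import csv

# Guess the dialect from the first 8 KB, then parse the whole file with it.
# Constraining the candidate delimiters reduces misdetection on odd inputs.
with open("partner_feed.csv", newline="", encoding="utf-8") as f:
    sample = f.read(8192)
    dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
    f.seek(0)
    for row in csv.reader(f, dialect):
        print(row)
```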
TSV, Pipe, and ASCII Control Delimiters
- Many prefer TSV: tabs occur less often than commas and are handled well by tools and copy‑paste into spreadsheets.
- Others propose pipe‑separated files or ASCII unit/record separators (0x1F/0x1E) to avoid quoting entirely (see the sketch after this list); the pushback is that these break plain-text editing and will eventually need escaping too.
- Consensus: any delimiter will appear in data eventually; robust escaping or quoting is unavoidable.
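A minimal sketch of the ASCII-separator idea, assuming Python and hypothetical data: fields joined by the unit separator (0x1F), records by the record separator (0x1E), with no quoting layer at all. It also illustrates the pushback in the list above: the scheme only works by refusing data that contains the separator bytes, at which point escaping sneaks back in.

```python
US, RS = "\x1f", "\x1e"  # ASCII unit separator / record separator

def encode(records):
    # No quoting needed, but only because separator bytes are banned from
    # the data; the moment they can appear, an escaping scheme is needed anyway.
    for rec in records:
        if any(US in field or RS in field for field in rec):
            raise ValueError("field contains a separator byte")
    return RS.join(US.join(rec) for rec in records)

def decode(blob):
    return [rec.split(US) for rec in blob.split(RS)] if blob else []

data = [["id", "note"], ["1", "plain text, commas are fine"]]
assert decode(encode(data)) == data
```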
Quoting, Corruption, and Parallelism
- A key criticism: CSV quoting has “non‑local” effects; a single missing or extra quote can corrupt the interpretation of everything after it, and because a newline inside quotes is data rather than a record boundary, a reader cannot safely start parsing at an arbitrary byte offset, which blocks parallel reads.
- Some advocate escape-based schemes (e.g., backslash‑escaping commas and newlines; a sketch follows) or length‑delimited/binary formats for reliability and parallelism.
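A minimal sketch of a backslash-escaping scheme of the kind described above, assuming commas as the delimiter. Because every literal newline inside a field is rewritten as the two characters `\n`, any raw newline in the file is a guaranteed record boundary, so a reader can seek to an arbitrary byte offset, skip to the next newline, and parse from there in parallel.

```python
def escape_field(s: str) -> str:
    # Escape the backslash first so the later escapes stay unambiguous.
    return (s.replace("\\", "\\\\")
             .replace("\n", "\\n")
             .replace(",", "\\,"))

def split_record(line: str) -> list[str]:
    # Decode one record: a comma is a field boundary unless escaped.
    fields, buf, i = [], [], 0
    while i < len(line):
        c = line[i]
        if c == "\\" and i + 1 < len(line):
            nxt = line[i + 1]
            buf.append({"n": "\n", ",": ",", "\\": "\\"}.get(nxt, nxt))
            i += 2
        elif c == ",":
            fields.append("".join(buf))
            buf = []
            i += 1
        else:
            buf.append(c)
            i += 1
    fields.append("".join(buf))
    return fields

row = ["a,b", "line1\nline2", "back\\slash"]
line = ",".join(escape_field(f) for f in row)
assert "\n" not in line          # raw newlines only ever end records
assert split_record(line) == row
```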
Alternatives: JSON(L), Parquet, SQLite, Others
- JSON/JSONL/NDJSON are seen as better-specified, typed, streamable replacements for many CSV uses; keys cost space but compress well and reuse ubiquitous JSON tooling (a JSONL sketch follows this list).
- Columnar/binary formats (Parquet, Arrow) are preferred for large analytical datasets; SQLite as an interchange format is debated—powerful but too feature-rich and heavy for generic consumption.
- XML, YAML, and S‑expressions come up as more rigorous but heavier options; many view CSV as “good enough” only for flat tables.
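A minimal sketch of the JSONL pattern, with hypothetical records: one JSON object per line, written and read as a stream. `json.dumps` escapes embedded newlines and quotes, so a record can never spill across lines, and the repeated keys that cost bytes on disk compress well.

```python
import json

records = [
    {"id": 1, "city": "Zürich"},
    {"id": 2, "note": "has \"quotes\"\nand a newline"},
]

with open("events.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        # dumps escapes the newline in the note, keeping one record per line
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

with open("events.jsonl", encoding="utf-8") as f:
    assert [json.loads(line) for line in f] == records
```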
Ubiquity, Tools, and Pragmatism
- Despite flaws, CSV remains the de facto “data plumbing” format in finance, insurance, government, and ETL pipelines because non‑technical users understand it and spreadsheets open it.
- Numerous CLI and library tools (xsv/xan, Miller, csvkit, awk/gawk, VisiData, ClickHouse/DuckDB import, etc.) exist to tame real-world CSV; a DuckDB-based sketch follows this list.
- Several comments frame CSV as the “lowest common denominator”: ugly, underspecified, but incredibly practical when you control both ends or are willing to own the compatibility layer.
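As one illustration of that tooling, a hedged sketch using DuckDB's Python client, whose built-in sniffer infers delimiter, quoting, header, and column types; `messy.csv` is a hypothetical input file.

```python
import duckdb  # assumes the duckdb package is installed

# read_csv_auto sniffs delimiter, quote character, header, and types,
# turning a "wild" CSV into a typed relation that can be queried directly.
rel = duckdb.sql("SELECT * FROM read_csv_auto('messy.csv')")
print(rel.fetchall()[:5])
```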