A love letter to the CSV format
Excel and CSV Frictions
- Many comments argue that Excel “hates” CSV by default: double-clicking a file triggers locale-based parsing that silently transforms data and may drop or mangle columns.
- Locale coupling causes major breakage: in many European locales Excel uses commas as decimal separators and silently switches CSV delimiters to semicolons; different machines/OS languages produce different “CSV” for the same workflow.
- Excel historically mishandled UTF‑8 (requiring a byte-order mark; see the sketch after this list) and still auto‑coerces values (dates, large integers, ZIP codes, gene names); the coercion is severe enough that the HGNC renamed genes such as SEPT1 and MARCH1 specifically to stop Excel converting them into dates.
- Using the “From Text/CSV” importer or Power Query mitigates many issues but is seen as non-obvious, clunky, and not round‑trippable without manual fixes.
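A minimal sketch of the encoding workaround mentioned above, assuming Python: writing with the `utf-8-sig` codec prepends the byte-order mark that lets older Excel builds decode UTF‑8 on double-click (the file name and rows here are hypothetical). Note this only fixes the encoding; it does nothing about locale-driven delimiter handling.

```python
import csv

# Hypothetical rows; "utf-8-sig" prepends the BOM legacy Excel needs to
# recognize UTF-8, and newline="" leaves line endings to the csv module
# (whose default "\r\n" matches RFC 4180 and Excel's expectations).
rows = [["id", "city"], ["1", "Zürich"], ["2", "São Paulo"]]
with open("export.csv", "w", encoding="utf-8-sig", newline="") as f:
    csv.writer(f).writerows(rows)
```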
CSV’s Underspecification and RFC 4180
- A recurring theme: there is no single CSV, only dialects (delimiters, quoting rules, encodings, headers, line endings).
- RFC 4180 exists but arrived late (2005, long after CSV was entrenched), is only an informational and partial spec, and is often ignored (especially around Unicode encodings and multiline quoted fields).
- This leads to brittle integrations, especially when ingesting “wild” CSV from banks, ERPs, or legacy tools; developers often end up writing ad‑hoc heuristics and per‑partner parsers (a dialect-sniffing sketch follows this list).
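One common mitigation is dialect sniffing. A minimal sketch using Python's stdlib `csv.Sniffer`, which guesses the delimiter and quoting from a sample; the file name is hypothetical, and sniffing is itself a heuristic that can guess wrong, which is rather the thread's point.

```python
import csv

# Guess the dialect from the first 8 KB, then parse the whole file with it.
# Constraining the candidate delimiters reduces misdetection on odd inputs.
with open("partner_feed.csv", newline="", encoding="utf-8") as f:
    sample = f.read(8192)
    dialect = csv.Sniffer().sniff(sample, delimiters=",;\t|")
    f.seek(0)
    for row in csv.reader(f, dialect):
        print(row)
```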
TSV, Pipe, and ASCII Control Delimiters
- Many prefer TSV: tabs occur less often than commas and are handled well by tools and copy‑paste into spreadsheets.
- Others propose pipe‑separated files or ASCII unit/record separators (0x1F/0x1E) to avoid quoting entirely (see the sketch after this list); the pushback is that these break plain-text editing and will eventually need escaping too.
- Consensus: any delimiter will appear in data eventually; robust escaping or quoting is unavoidable.
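A minimal sketch of the ASCII-separator idea, assuming Python and hypothetical data: fields joined by the unit separator (0x1F), records by the record separator (0x1E), with no quoting layer at all. It also illustrates the pushback in the list above: the scheme only works by refusing data that contains the separator bytes, at which point escaping sneaks back in.

```python
US, RS = "\x1f", "\x1e"  # ASCII unit separator / record separator

def encode(records):
    # No quoting needed, but only because separator bytes are banned from
    # the data; the moment they can appear, an escaping scheme is needed anyway.
    for rec in records:
        if any(US in field or RS in field for field in rec):
            raise ValueError("field contains a separator byte")
    return RS.join(US.join(rec) for rec in records)

def decode(blob):
    return [rec.split(US) for rec in blob.split(RS)] if blob else []

data = [["id", "note"], ["1", "plain text, commas are fine"]]
assert decode(encode(data)) == data
```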
Quoting, Corruption, and Parallelism
- A key criticism: CSV quoting has “non‑local” effects; a single missing or extra quote can corrupt the interpretation of everything after it, and because a newline inside quotes is data rather than a record boundary, a reader cannot safely start parsing at an arbitrary byte offset, which blocks parallel reads.
- Some advocate escape-based schemes (e.g., backslash‑escaping commas and newlines; a sketch follows) or length‑delimited/binary formats for reliability and parallelism.
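A minimal sketch of a backslash-escaping scheme of the kind described above, assuming commas as the delimiter. Because every literal newline inside a field is rewritten as the two characters `\n`, any raw newline in the file is a guaranteed record boundary, so a reader can seek to an arbitrary byte offset, skip to the next newline, and parse from there in parallel.

```python
def escape_field(s: str) -> str:
    # Escape the backslash first so the later escapes stay unambiguous.
    return (s.replace("\\", "\\\\")
             .replace("\n", "\\n")
             .replace(",", "\\,"))

def split_record(line: str) -> list[str]:
    # Decode one record: a comma is a field boundary unless escaped.
    fields, buf, i = [], [], 0
    while i < len(line):
        c = line[i]
        if c == "\\" and i + 1 < len(line):
            nxt = line[i + 1]
            buf.append({"n": "\n", ",": ",", "\\": "\\"}.get(nxt, nxt))
            i += 2
        elif c == ",":
            fields.append("".join(buf))
            buf = []
            i += 1
        else:
            buf.append(c)
            i += 1
    fields.append("".join(buf))
    return fields

row = ["a,b", "line1\nline2", "back\\slash"]
line = ",".join(escape_field(f) for f in row)
assert "\n" not in line          # raw newlines only ever end records
assert split_record(line) == row
```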
Alternatives: JSON(L), Parquet, SQLite, Others
- JSON/JSONL/NDJSON are seen as better-specified, typed, streamable replacements for many CSV uses; keys cost space but compress well and reuse ubiquitous JSON tooling (a JSONL sketch follows this list).
- Columnar/binary formats (Parquet, Arrow) are preferred for large analytical datasets; SQLite as an interchange format is debated—powerful but too feature-rich and heavy for generic consumption.
- XML, YAML, and S‑expressions come up as more rigorous but heavier options; many view CSV as “good enough” only for flat tables.
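A minimal sketch of the JSONL pattern, with hypothetical records: one JSON object per line, written and read as a stream. `json.dumps` escapes embedded newlines and quotes, so a record can never spill across lines, and the repeated keys that cost bytes on disk compress well.

```python
import json

records = [
    {"id": 1, "city": "Zürich"},
    {"id": 2, "note": "has \"quotes\"\nand a newline"},
]

with open("events.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        # dumps escapes the newline in the note, keeping one record per line
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

with open("events.jsonl", encoding="utf-8") as f:
    assert [json.loads(line) for line in f] == records
```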
Ubiquity, Tools, and Pragmatism
- Despite flaws, CSV remains the de facto “data plumbing” format in finance, insurance, government, and ETL pipelines because non‑technical users understand it and spreadsheets open it.
- Numerous CLI and library tools (xsv/xan, Miller, csvkit, awk/gawk, VisiData, ClickHouse/DuckDB import, etc.) exist to tame real-world CSV; a DuckDB-based sketch follows this list.
- Several comments frame CSV as the “lowest common denominator”: ugly, underspecified, but incredibly practical when you control both ends or are willing to own the compatibility layer.
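As one illustration of that tooling, a hedged sketch using DuckDB's Python client, whose built-in sniffer infers delimiter, quoting, header, and column types; `messy.csv` is a hypothetical input file.

```python
import duckdb  # assumes the duckdb package is installed

# read_csv_auto sniffs delimiter, quote character, header, and types,
# turning a "wild" CSV into a typed relation that can be queried directly.
rel = duckdb.sql("SELECT * FROM read_csv_auto('messy.csv')")
print(rel.fetchall()[:5])
```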