Which table format do LLMs understand best?

Overall result and initial reactions

  • The article finds GPT‑4.1‑nano performs best with Markdown key–value (KV) “records”: modestly better than YAML or JSON, and clearly better than CSV, Markdown tables, or pipe‑delimited text, with overall accuracy around 60% on a large table.
  • Many are surprised that KV‑Markdown wins; the key explanation offered is that explicit key–value pairing and clear record boundaries reduce misalignment between column headers and values.
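
To make the comparison concrete, here is a minimal sketch of the two layouts being contrasted; the article's exact formatting may differ.

```python
# Hypothetical sketch of the layouts under comparison; the article's exact
# formatting may differ. In KV "records" every value sits next to its key,
# so the model never has to count columns to re-associate headers.
rows = [
    {"id": 1, "name": "Alice", "score": 91},
    {"id": 2, "name": "Bob", "score": 78},
]

# Markdown table: headers appear once, far from the values they label.
cols = list(rows[0])
table = "\n".join(
    ["| " + " | ".join(cols) + " |", "|" + "---|" * len(cols)]
    + ["| " + " | ".join(str(v) for v in r.values()) + " |" for r in rows]
)

# KV-Markdown records: explicit key-value pairs with clear record boundaries.
records = "\n\n".join(
    "\n".join(f"{k}: {v}" for k, v in r.items()) for r in rows
)

print(table, records, sep="\n\n")
```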

Format characteristics and tokenization

  • CSV and classic Markdown tables are criticized for making it too easy for the model to associate a cell with the wrong header.
  • JSON and XML are viewed as noisy and token-heavy; one commenter notes XML used ~50% more tokens for similar accuracy, hinting that extra syntax harms performance at long context lengths.
  • Several people stress that token‑efficient formats (CSV, Markdown tables) may outperform more “legible” ones once you approach context limits.
  • A side discussion on abbreviating field names (e.g., f vs. function) concludes that both are often a single token, so the savings can be negligible, and full words may carry useful semantic context.
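
A quick way to check both claims is to count tokens directly. A minimal sketch using OpenAI's tiktoken library (an assumption here; the thread doesn't name a tokenizer):

```python
# Assumed sketch using OpenAI's tiktoken library; the article's exact
# tokenizer is not specified. Checks the single-token claim, then compares
# the token cost of the same rows in two serializations.
import csv, io, json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Short vs. long field name: both usually encode to a single token.
for name in ("f", "function"):
    print(name, "->", len(enc.encode(name)), "token(s)")

rows = [{"id": i, "name": f"user{i}", "score": i * 3} for i in range(100)]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)

# JSON repeats every key in every row, so it typically costs noticeably more.
for label, text in (("csv", buf.getvalue()), ("json", json.dumps(rows))):
    print(label, len(enc.encode(text)), "tokens")
```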

Critiques of methodology

  • Strong pushback that only one small model (GPT‑4.1‑nano) and one table size were tested, making generalization to “LLMs” questionable.
  • Commenters want:
    • Multiple models and sizes (nano/mini/full/frontier).
    • Multiple table sizes (e.g., 50–5000 rows).
    • Randomized row and question orders to probe positional bias and “lost in the middle” effects (a minimal harness is sketched after this list).
  • Several highlight that ~50–60% accuracy is practically useless; the author explains this was intentional to magnify differences between formats.
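
A minimal sketch of the kind of randomized probe commenters ask for, with a hypothetical ask_llm helper standing in for whatever model API is used:

```python
# Sketch of a randomized lookup benchmark. `ask_llm` is a hypothetical
# text-in/text-out helper, not something from the article or thread.
import random

def make_rows(n):
    return [{"id": i, "name": f"user{i}", "score": random.randint(0, 100)}
            for i in range(n)]

def to_kv_markdown(rows):
    return "\n\n".join(
        "\n".join(f"{k}: {v}" for k, v in r.items()) for r in rows
    )

def run_trial(n_rows, ask_llm, trials=20):
    rows = make_rows(n_rows)
    random.shuffle(rows)                 # remove positional ordering cues
    table = to_kv_markdown(rows)
    correct = 0
    for _ in range(trials):
        target = random.choice(rows)     # probe positions across the table
        prompt = f"What is the score of {target['name']}?\n\n{table}"
        # Crude substring check; a real harness should parse the answer.
        if str(target["score"]) in ask_llm(prompt):
            correct += 1
    return correct / trials

# e.g. for n in (50, 500, 5000): print(n, run_trial(n, ask_llm))
```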

Follow‑up benchmarks with larger models

  • Independent re-runs on ~30 models report near‑100% recall across formats for many frontier models, with format differences shrinking; CSV and Markdown tables come out slightly best in that broader test.
  • Another replication shows, on 1000‑row KV‑Markdown:
    • GPT‑4.1‑nano ≈ 52%, 4.1‑mini ≈ 72%, 4.1 ≈ 93%, GPT‑5 ≈ 100% (999/1000 on a repeat run).
    • GPT‑5 also hits 100% on CSV and JSON at 100 samples.
  • Consensus from these replications: model quality and table size matter more than format; with strong models and modest row counts, almost any reasonable format works.

When (and whether) to use LLMs on tables

  • Many argue this is a “solved problem”: code, SQL, and Pandas already handle structured queries, so using an LLM just to query structured tables is wasteful and error‑prone.
  • Counterpoint: the hard part is understanding natural‑language questions; a good pattern is:
    • Use traditional tools for table operations.
    • Have the LLM generate and/or interpret code, and explain or work with the resulting (smaller) tables (a sketch of this pattern follows this list).
  • Several note that in practice they mostly:
    • Use LLMs to create tables from unstructured text, not to scan large tables.
    • Rely on LLMs for analysis/interpretation of small result tables, and want to know how small is “safe.”
  • Some suggest tool‑use or agentic patterns (SQL, Pandas, code execution) and database‑backed workflows; dumping raw tables into the context window is considered brittle beyond small sizes.
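
A minimal sketch of that division of labor, with a hypothetical text‑in/text‑out llm helper (no specific API is named in the thread); sandbox the exec in any real use:

```python
# Sketch of the tool-use pattern several commenters describe: the LLM writes
# the query, traditional tooling executes it, and only the small result goes
# back to the model. `llm` is a hypothetical helper, not a named API.
import pandas as pd

def answer_with_pandas(df: pd.DataFrame, question: str, llm) -> str:
    # 1. The LLM sees only the schema, never the full table.
    schema = ", ".join(f"{c}:{t}" for c, t in df.dtypes.astype(str).items())
    code = llm(
        "Write one line of Python assigning a pandas expression over `df` "
        f"to a variable named `result`. Columns: {schema}. "
        f"Question: {question}"
    )
    # 2. Traditional tooling executes the query (sandbox this in practice).
    scope = {"df": df, "pd": pd}
    exec(code, scope)
    # 3. Only the small result table goes back into the context window.
    return llm(f"Question: {question}\nResult:\n{scope['result']}\n"
               "Answer concisely.")
```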

Alternative representations and upstream issues

  • Mentions of XML and TOML: anecdotal reports that XML can work well for deeply nested data, and TOML/YAML‑like formats are generally serviceable.
  • A vision‑language suggestion: instead of linearizing the table, pass an image of it plus the question to a VLM, preserving its 2D structure (see the sketch after this list).
  • Others point out that the bigger real‑world challenge is often upstream: robustly extracting tables and layout from PDFs and scans; if structure is lost there, downstream format choice matters less.
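
A hedged sketch of the VLM idea, assuming matplotlib for rendering and the OpenAI Python SDK with a vision‑capable model (neither is specified in the thread):

```python
# Assumed stack: matplotlib renders the table, the OpenAI SDK queries a VLM.
import base64
import io

import matplotlib.pyplot as plt
from openai import OpenAI

def table_to_png(rows, headers):
    # Render a small table to PNG bytes, preserving its 2D layout.
    fig, ax = plt.subplots(figsize=(4, 0.5 * (len(rows) + 1)))
    ax.axis("off")
    ax.table(cellText=rows, colLabels=headers, loc="center")
    buf = io.BytesIO()
    fig.savefig(buf, format="png", bbox_inches="tight")
    plt.close(fig)
    return buf.getvalue()

png = table_to_png([["Alice", "91"], ["Bob", "78"]], ["name", "score"])
b64 = base64.b64encode(png).decode()

client = OpenAI()  # assumes OPENAI_API_KEY is set
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is Bob's score?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```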

Broader reliability concerns

  • Several commenters see the 60% result as evidence that LLMs “don’t understand tables,” arguing anything short of 100% is unacceptable for numerical lookup.
  • Others distinguish between:
    • Deterministic calculation/lookup (should use traditional tools or code), and
    • Higher-level math or reasoning, where LLMs can still add value even with occasional mistakes.
  • Overall takeaway from the thread:
    • For strong models on moderate data sizes, format choice is a second‑order concern (CSV/Markdown/YAML all fine).
    • For weaker models or huge contexts, explicit key–value formats help, but better tooling and code execution are usually a superior solution.