Which table format do LLMs understand best?

Overall result and initial reactions

  • The article finds GPT‑4.1‑nano performs best with Markdown key–value (KV) “records”: modestly better than YAML or JSON, and clearly better than CSV, Markdown tables, or pipe‑delimited text, with overall accuracy around 60% on a large table.
  • Many are surprised that KV‑Markdown wins; the key explanation offered is that explicit key–value pairing and clear record boundaries reduce misalignment between column headers and values.
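
To make the comparison concrete, here is a minimal sketch of the two layouts being contrasted; the article's exact formatting may differ.

```python
# Hypothetical sketch of the layouts under comparison; the article's exact
# formatting may differ. In KV "records" every value sits next to its key,
# so the model never has to count columns to re-associate headers.
rows = [
    {"id": 1, "name": "Alice", "score": 91},
    {"id": 2, "name": "Bob", "score": 78},
]

# Markdown table: headers appear once, far from the values they label.
cols = list(rows[0])
table = "\n".join(
    ["| " + " | ".join(cols) + " |", "|" + "---|" * len(cols)]
    + ["| " + " | ".join(str(v) for v in r.values()) + " |" for r in rows]
)

# KV-Markdown records: explicit key-value pairs with clear record boundaries.
records = "\n\n".join(
    "\n".join(f"{k}: {v}" for k, v in r.items()) for r in rows
)

print(table, records, sep="\n\n")
```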

Format characteristics and tokenization

  • CSV and classic Markdown tables are criticized for making it too easy for the model to associate a cell with the wrong header.
  • JSON and XML are viewed as noisy and token-heavy; one commenter notes XML used ~50% more tokens for similar accuracy, hinting that extra syntax harms performance at long context lengths.
  • Several people stress that token‑efficient formats (CSV, Markdown tables) may outperform more “legible” ones once you approach context limits.
  • A side discussion on abbreviating field names (e.g., f vs. function) concludes that both are often a single token, so the savings can be negligible, and full words may carry useful semantic context.
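
A quick way to check both claims is to count tokens directly. A minimal sketch using OpenAI's tiktoken library (an assumption here; the thread doesn't name a tokenizer):

```python
# Assumed sketch using OpenAI's tiktoken library; the article's exact
# tokenizer is not specified. Checks the single-token claim, then compares
# the token cost of the same rows in two serializations.
import csv, io, json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Short vs. long field name: both usually encode to a single token.
for name in ("f", "function"):
    print(name, "->", len(enc.encode(name)), "token(s)")

rows = [{"id": i, "name": f"user{i}", "score": i * 3} for i in range(100)]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)

# JSON repeats every key in every row, so it typically costs noticeably more.
for label, text in (("csv", buf.getvalue()), ("json", json.dumps(rows))):
    print(label, len(enc.encode(text)), "tokens")
```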

Critiques of methodology

  • Strong pushback that only one small model (GPT‑4.1‑nano) and one table size were tested, making generalization to “LLMs” questionable.
  • Commenters want:
    • Multiple models and sizes (nano/mini/full/frontier).
    • Multiple table sizes (e.g., 50–5000 rows).
    • Randomized row and question orders to probe positional bias and “lost in the middle” effects (a minimal harness is sketched after this list).
  • Several highlight that ~50–60% accuracy is practically useless; the author explains this was intentional to magnify differences between formats.
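
A minimal sketch of the kind of randomized probe commenters ask for, with a hypothetical ask_llm helper standing in for whatever model API is used:

```python
# Sketch of a randomized lookup benchmark. `ask_llm` is a hypothetical
# text-in/text-out helper, not something from the article or thread.
import random

def make_rows(n):
    return [{"id": i, "name": f"user{i}", "score": random.randint(0, 100)}
            for i in range(n)]

def to_kv_markdown(rows):
    return "\n\n".join(
        "\n".join(f"{k}: {v}" for k, v in r.items()) for r in rows
    )

def run_trial(n_rows, ask_llm, trials=20):
    rows = make_rows(n_rows)
    random.shuffle(rows)                 # remove positional ordering cues
    table = to_kv_markdown(rows)
    correct = 0
    for _ in range(trials):
        target = random.choice(rows)     # probe positions across the table
        prompt = f"What is the score of {target['name']}?\n\n{table}"
        # Crude substring check; a real harness should parse the answer.
        if str(target["score"]) in ask_llm(prompt):
            correct += 1
    return correct / trials

# e.g. for n in (50, 500, 5000): print(n, run_trial(n, ask_llm))
```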

Follow‑up benchmarks with larger models

  • Independent re-runs on ~30 models report near‑100% recall across formats for many frontier models, with format differences shrinking; CSV and Markdown tables come out slightly best in that broader test.
  • Another replication shows, on 1000‑row KV‑Markdown:
    • GPT‑4.1‑nano ≈ 52%, 4.1‑mini ≈ 72%, 4.1 ≈ 93%, GPT‑5 ≈ 100% (999/1000 on a repeat run).
    • GPT‑5 also hits 100% on CSV and JSON at 100 samples.
  • Consensus from these replications: model quality and table size matter more than format; with strong models and modest row counts, almost any reasonable format works.

When (and whether) to use LLMs on tables

  • Many argue this is a “solved problem”: code, SQL, and Pandas already handle structured queries, so using an LLM just to query structured tables is wasteful and error‑prone.
  • Counterpoint: the hard part is understanding natural‑language questions; a good pattern is:
    • Use traditional tools for table operations.
    • Have the LLM generate and/or interpret code, and explain or work with the resulting (smaller) tables (a sketch of this pattern follows this list).
  • Several note that in practice they mostly:
    • Use LLMs to create tables from unstructured text, not to scan large tables.
    • Rely on LLMs for analysis/interpretation of small result tables, and want to know how small is “safe.”
  • Some suggest tool‑use or agentic patterns (SQL, Pandas, code execution) and database‑backed workflows; dumping raw tables into the context window is considered brittle beyond small sizes.
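
A minimal sketch of that division of labor, with a hypothetical text‑in/text‑out llm helper (no specific API is named in the thread); sandbox the exec in any real use:

```python
# Sketch of the tool-use pattern several commenters describe: the LLM writes
# the query, traditional tooling executes it, and only the small result goes
# back to the model. `llm` is a hypothetical helper, not a named API.
import pandas as pd

def answer_with_pandas(df: pd.DataFrame, question: str, llm) -> str:
    # 1. The LLM sees only the schema, never the full table.
    schema = ", ".join(f"{c}:{t}" for c, t in df.dtypes.astype(str).items())
    code = llm(
        "Write one line of Python assigning a pandas expression over `df` "
        f"to a variable named `result`. Columns: {schema}. "
        f"Question: {question}"
    )
    # 2. Traditional tooling executes the query (sandbox this in practice).
    scope = {"df": df, "pd": pd}
    exec(code, scope)
    # 3. Only the small result table goes back into the context window.
    return llm(f"Question: {question}\nResult:\n{scope['result']}\n"
               "Answer concisely.")
```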

Alternative representations and upstream issues

  • Mentions of XML and TOML: anecdotal reports that XML can work well for deeply nested data, and TOML/YAML‑like formats are generally serviceable.
  • A vision‑language suggestion: instead of linearizing the table, pass an image of it plus the question to a VLM, preserving its 2D structure (see the sketch after this list).
  • Others point out that the bigger real‑world challenge is often upstream: robustly extracting tables and layout from PDFs and scans; if structure is lost there, downstream format choice matters less.
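
A hedged sketch of the VLM idea, assuming matplotlib for rendering and the OpenAI Python SDK with a vision‑capable model (neither is specified in the thread):

```python
# Assumed stack: matplotlib renders the table, the OpenAI SDK queries a VLM.
import base64
import io

import matplotlib.pyplot as plt
from openai import OpenAI

def table_to_png(rows, headers):
    # Render a small table to PNG bytes, preserving its 2D layout.
    fig, ax = plt.subplots(figsize=(4, 0.5 * (len(rows) + 1)))
    ax.axis("off")
    ax.table(cellText=rows, colLabels=headers, loc="center")
    buf = io.BytesIO()
    fig.savefig(buf, format="png", bbox_inches="tight")
    plt.close(fig)
    return buf.getvalue()

png = table_to_png([["Alice", "91"], ["Bob", "78"]], ["name", "score"])
b64 = base64.b64encode(png).decode()

client = OpenAI()  # assumes OPENAI_API_KEY is set
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; any vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is Bob's score?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```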

Broader reliability concerns

  • Several commenters see the 60% result as evidence that LLMs “don’t understand tables,” arguing anything short of 100% is unacceptable for numerical lookup.
  • Others distinguish between:
    • Deterministic calculation/lookup (should use traditional tools or code), and
    • Higher-level math or reasoning, where LLMs can still add value even with occasional mistakes.
  • Overall takeaway from the thread:
    • For strong models on moderate data sizes, format choice is a second‑order concern (CSV/Markdown/YAML all fine).
    • For weaker models or huge contexts, explicit key–value formats help, but better tooling and code execution are usually a superior solution.