Which table format do LLMs understand best?
Overall result and initial reactions
- The article finds GPT‑4.1‑nano does best with Markdown key–value (KV) “records,” modestly better than YAML/JSON and clearly better than CSV, Markdown tables, and pipe‑delimited text, with overall accuracy around 60% on a large table.
- Many are surprised that KV‑Markdown wins; the key explanation offered is that explicit key–value pairing and clear record boundaries reduce misalignment between column headers and values.
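For concreteness, here is a minimal Python sketch of the two renderings being contrasted; the rows and field names are hypothetical stand‑ins, not the article's actual benchmark data:

```python
import csv
import io

# Hypothetical rows standing in for the benchmark table.
rows = [
    {"id": 1, "name": "widget", "price": 9.99},
    {"id": 2, "name": "gadget", "price": 4.50},
]

def to_kv_markdown(rows):
    """Markdown KV 'records': one labeled field per line, blank line
    between records, so every value sits right next to its key."""
    blocks = ["\n".join(f"- **{k}**: {v}" for k, v in r.items()) for r in rows]
    return "\n\n".join(blocks)

def to_csv(rows):
    """Plain CSV: compact, but values are tied to headers only by position."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

print(to_kv_markdown(rows))
print()
print(to_csv(rows))
```

The KV rendering repeats every key next to its value, which is exactly the redundancy commenters credit for reducing header–value misalignment.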
Format characteristics and tokenization
- CSV and classic Markdown tables are criticized for making it too easy for the model to associate a cell with the wrong header.
- JSON and XML are viewed as noisy and token‑heavy; one commenter notes XML used ~50% more tokens for similar accuracy, suggesting the extra syntax mostly burns context budget without improving results.
- Several people stress that token‑efficient formats (CSV/Markdown tables) may outperform more “legible” formats once you approach context limits; the token‑counting sketch after this list makes the cost differences concrete.
- A minor discussion on abbreviating field names (e.g., `f` vs `function`) ends with: often both are a single token, so savings may be negligible, and common words may carry useful semantic context.
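A rough way to check these token‑cost claims, assuming the `tiktoken` library and the `o200k_base` encoding used by recent OpenAI models (the sample rows are made up):

```python
import json
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("o200k_base")  # encoding for recent OpenAI models

rows = [{"id": i, "name": f"item{i}", "price": i * 1.5} for i in range(100)]

variants = {
    "csv": "id,name,price\n"
           + "\n".join(f"{r['id']},{r['name']},{r['price']}" for r in rows),
    "json": json.dumps(rows),
    "kv_markdown": "\n\n".join(
        "\n".join(f"- **{k}**: {v}" for k, v in r.items()) for r in rows
    ),
}

for name, text in variants.items():
    print(f"{name:12s} {len(enc.encode(text)):6d} tokens")

# The abbreviation question: common words are often a single token anyway.
for word in ("function", "f"):
    print(word, "->", len(enc.encode(word)), "token(s)")
```

CSV typically comes out lightest and the key‑repeating formats noticeably heavier, which is the trade‑off the thread is weighing.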
Critiques of methodology
- Strong pushback that only one small model (GPT‑4.1‑nano) and one data size were tested, making generalization to “LLMs” questionable.
- Commenters want:
- Multiple models and sizes (nano/mini/full/frontier).
- Multiple table sizes (e.g., 50–5000 rows).
- Randomized row and question orders to probe positional bias and “lost in the middle” effects (a minimal harness along these lines is sketched after this list).
- Several highlight that ~50–60% accuracy is practically useless; the author explains this was intentional to magnify differences between formats.
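Such a harness might look like the sketch below. It uses the OpenAI Python SDK, but the model list, prompt, and synthetic table are illustrative assumptions, not the article's actual setup:

```python
import random
from openai import OpenAI  # pip install openai; needs OPENAI_API_KEY

client = OpenAI()

MODELS = ["gpt-4.1-nano", "gpt-4.1-mini", "gpt-4.1"]  # sizes commenters asked for
ROW_COUNTS = [50, 500, 5000]                          # sweep table sizes

def make_table(n_rows, seed):
    """Synthetic table; rows are shuffled to probe positional bias."""
    rng = random.Random(seed)
    rows = [{"id": i, "value": rng.randint(0, 9999)} for i in range(n_rows)]
    rng.shuffle(rows)
    return rows

def render_csv(rows):
    return "id,value\n" + "\n".join(f"{r['id']},{r['value']}" for r in rows)

def ask(model, table_text, row_id):
    prompt = (f"Here is a table:\n{table_text}\n\n"
              f"What is the value for id {row_id}? Answer with the number only.")
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content.strip()

for model in MODELS:
    for n in ROW_COUNTS:
        rows = make_table(n, seed=0)
        table_text = render_csv(rows)
        probes = random.Random(1).sample(rows, k=min(20, n))  # randomized questions
        hits = sum(ask(model, table_text, r["id"]) == str(r["value"]) for r in probes)
        print(f"{model} rows={n}: {hits}/{len(probes)}")
```

Sweeping the same loop over the other serializations (JSON, KV‑Markdown, etc.) would cover the rest of the wishlist.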
Follow‑up benchmarks with larger models
- Independent re-runs on ~30 models report near‑100% recall across formats for many frontier models, with format differences shrinking; CSV and Markdown tables come out slightly best in that broader test.
- Another replication shows, on 1000‑row KV‑Markdown:
- GPT‑4.1‑nano ≈ 52%, GPT‑4.1‑mini ≈ 72%, GPT‑4.1 ≈ 93%, GPT‑5 ≈ 100% (999/1000 on a repeat run).
- GPT‑5 also hits 100% on CSV and JSON at 100 samples.
- Consensus from these replications: model quality and table size matter more than format; with strong models and modest row counts, almost any reasonable format works.
When (and whether) to use LLMs on tables
- Many argue this is a “solved problem” for code/SQL/Pandas; using an LLM just to query structured tables is wasteful and error‑prone.
- Counterpoint: the hard part is understanding natural‑language questions; a good pattern is:
- Use traditional tools for table operations.
- Have the LLM generate and/or interpret code, and explain or work with the resulting (smaller) tables (a sketch of this pattern follows the list).
- Several note that in practice they mostly:
- Use LLMs to create tables from unstructured text, not to scan large tables.
- Rely on LLMs for analysis/interpretation of small result tables, and want to know how small is “safe.”
- Some suggest tool-use or agentic patterns (SQL, Pandas, code execution) and database-backed workflows; dumping raw tables into context is considered brittle beyond small sizes.
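The generate‑then‑execute pattern might be sketched as below; the model choice, file name, and prompts are hypothetical, and `eval` on model output would need real sandboxing before any serious use:

```python
import pandas as pd
from openai import OpenAI

client = OpenAI()
df = pd.read_csv("sales.csv")  # hypothetical input table

# Step 1: the LLM translates the natural-language question into code.
question = "Which region had the highest total revenue last quarter?"
schema = ", ".join(f"{col} ({dt})" for col, dt in df.dtypes.astype(str).items())
gen = client.chat.completions.create(
    model="gpt-4.1",  # model choice is illustrative
    messages=[{
        "role": "user",
        "content": (f"Columns: {schema}. Write one pandas expression over a "
                    f"DataFrame named df that answers: {question}. "
                    "Return only the expression."),
    }],
)
expr = gen.choices[0].message.content.strip()

# Step 2: traditional tooling does the table work. WARNING: eval'ing model
# output is unsafe without sandboxing; this is a sketch, not production code.
result = eval(expr, {"df": df, "pd": pd})

# Step 3: the LLM interprets the (now small) result for the user.
explain = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user",
               "content": f"Question: {question}\nResult: {result}\nExplain briefly."}],
)
print(explain.choices[0].message.content)
```

The point of the pattern is that the LLM only ever sees the schema, the question, and a small result, never the full table.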
Alternative representations and upstream issues
- Mention of XML and TOML: anecdotal reports that XML can work well for deeply nested tables; TOML/YAML-like formats are generally serviceable.
- Vision‑language model (VLM) suggestion: instead of linearizing tables, pass the table image plus the question to a VLM, preserving the 2D structure (see the sketch after this list).
- Others point out that an even bigger real‑world challenge is upstream: robustly extracting tables and layout from PDFs/scans; if structure is lost there, downstream format choice matters less.
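A minimal sketch of the VLM idea, assuming matplotlib for rendering and a vision‑capable OpenAI chat model (the table contents and model name are illustrative):

```python
import base64
import io
import matplotlib.pyplot as plt
from openai import OpenAI

# Render a small hypothetical table to PNG, preserving its 2D layout.
fig, ax = plt.subplots()
ax.axis("off")
ax.table(cellText=[["1", "widget", "9.99"], ["2", "gadget", "4.50"]],
         colLabels=["id", "name", "price"], loc="center")
buf = io.BytesIO()
fig.savefig(buf, format="png", bbox_inches="tight")
image_b64 = base64.b64encode(buf.getvalue()).decode()

# Send the image plus the question to a vision-capable model.
client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4.1",  # any vision-capable model; choice is illustrative
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the price of the gadget?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```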
Broader reliability concerns
- Several commenters see the 60% result as evidence that LLMs “don’t understand tables,” arguing anything short of 100% is unacceptable for numerical lookup.
- Others distinguish between:
- Deterministic calculation/lookup (should use traditional tools or code), and
- Higher-level math or reasoning, where LLMs can still add value even with occasional mistakes.
- Overall takeaway from the thread:
- For strong models on moderate data sizes, format choice is a second‑order concern (CSV/Markdown/YAML all fine).
- For weaker models or huge contexts, explicit key–value formats help, but better tooling and code execution are usually a superior solution.