The case for zero-error horizons in trustworthy LLMs
Paper’s Claim and Setup
- Thread centers on a paper showing GPT‑5.2 failing basic tasks (parity of “11000”, balancing “((((( )))))”, small multiplications) despite strong performance elsewhere.
- Many say this is unsurprising for “bare” LLMs; others argue it’s surprising given marketing claims and public expectations of “reasoning.”
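Part of why these failures drew attention is that each task is checkable in a few lines of code. A minimal sketch (the bitstring and bracket string are the examples cited from the paper; everything else is illustrative):

```python
def parity(bits: str) -> str:
    # Parity of a bitstring: is the count of 1s even or odd?
    return "odd" if bits.count("1") % 2 else "even"

def balanced(s: str) -> bool:
    # Bracket balancing with a running depth counter,
    # ignoring non-bracket characters.
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # a close before its open
                return False
    return depth == 0

print(parity("11000"))           # two 1s -> "even"
print(balanced("((((( )))))"))   # five opens, five closes -> True
print(37 * 43)                   # a "small multiplication" -> 1591
```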
Reasoning Tokens and Experimental Design
- Major critique: the authors ran GPT‑5.2 with reasoning.effort left at its default of “none”, i.e., zero reasoning tokens, akin to an instant model.
- Critics call this misleading: the model is advertised as needing reasoning tokens for hard problems, and “no one uses it this way” in serious applications.
- Defenders respond that the paper explicitly evaluates the LLM without tools or extra thinking, to map intrinsic limits.
Tokenization, Counting, and Architecture
- Debate over whether failures arise mainly from tokenization (no direct character access; “strawberry” split into opaque tokens) or deeper issues.
- Some note LLMs can spell and manipulate characters when forced, suggesting broader limitations: counting items, strict ordering, scattered retrieval, lack of explicit state (stack/accumulator).
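The contrast the thread draws is easy to see at the string level: Python has the direct character access, strict ordering, and explicit counting that a model working through an opaque token vocabulary lacks. A small illustration (the word is the thread's own example):

```python
word = "strawberry"

# Direct character access: spelling the word out is a list conversion.
print(list(word))

# Counting items: how many "r"s? Trivial with character access.
print(word.count("r"))   # 3

# Strict ordering: sorting characters requires seeing each one.
print(sorted(word))
```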
Tools vs Core LLM Ability
- Many argue production systems rely on tools (Python, calculators, spreadsheets); with tools, these tasks are trivial and the zero-error horizon (ZEH) can be effectively infinite.
- Others question whether outsourcing to tools counts as “reasoning” or just harness design, especially for AGI claims.
- It’s noted LLMs don’t reliably know when to invoke tools or where their knowledge boundaries lie.
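A minimal sketch of the harness pattern this debate is about, with everything invented for illustration: a hypothetical convention where the model emits "CALC: <expr>" to request a calculator, and a dispatcher that routes such requests. The point critics make is that the routing decision lives in the harness, not in the model:

```python
import ast
import operator

# Arithmetic operators the sandboxed evaluator will accept.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr: str) -> float:
    # Evaluate arithmetic by walking the AST, so arbitrary code can't run.
    def walk(node):
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

def route(model_output: str) -> str:
    # The harness, not the model, decides when the tool runs.
    if model_output.startswith("CALC: "):
        return str(safe_eval(model_output[len("CALC: "):]))
    return model_output

print(route("CALC: 37 * 43"))  # tool path: exact arithmetic
print(route("Paris"))          # plain answer passes through unchanged
```

Whether the model reliably chooses to emit the tool request in the first place is exactly the open question raised above.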
Usefulness of ZEH and Reliability
- Supporters see ZEH as a way to quantify reliability per task, not to declare LLMs useless.
- Critics argue a system that can’t robustly count small numbers yet can solve advanced math exposes a fundamental mismatch with human learning and undermines “trustworthy” branding.
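One operational reading of ZEH (an assumption here, not necessarily the paper's exact protocol): the largest input size at which a task-specific checker observes zero errors over repeated trials. A sketch with a stand-in "model" whose reliability degrades with input length:

```python
import random

def zero_error_horizon(run_model, check, max_n=64, trials=20):
    # Illustrative estimator: largest n such that every trial at
    # every size up to n passed the checker.
    for n in range(1, max_n + 1):
        for _ in range(trials):
            case = "".join(random.choice("01") for _ in range(n))
            if not check(case, run_model(case)):
                return n - 1
    return max_n

def flaky_parity(bits):
    # Stand-in model: exact on short inputs, unreliable past length 8.
    ans = "odd" if bits.count("1") % 2 else "even"
    if len(bits) > 8 and random.random() < 0.3:
        return "even" if ans == "odd" else "odd"
    return ans

check = lambda bits, out: out == ("odd" if bits.count("1") % 2 else "even")
print(zero_error_horizon(flaky_parity, check))
```

On this stand-in the estimate lands near 8, which is the sense in which ZEH quantifies per-task reliability rather than declaring the system useless.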
Patching, Generalization, and Sociological Themes
- Some suspect ad‑hoc prompt-specific patches when viral “LLM fails” get fixed quickly in UIs but remain for nearby variants.
- Separate discussion claims LLMs tend to converge on the same frontend template across many different prompts, suggesting weaker abstraction than humans.
- Meta-thread: strong polarization (AGI hype vs “useless grift”), identity-like attachment to views, and worries about misleading, sensationalist papers.