The case for zero-error horizons in trustworthy LLMs

Paper’s Claim and Setup

  • Thread centers on a paper showing GPT‑5.2 failing basic tasks (parity of “11000”, balancing “((((( )))))”, small multiplications) despite strong performance elsewhere.
  • Many say this is unsurprising for “bare” LLMs; others argue it’s surprising given marketing claims and public expectations of “reasoning.”
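Both headline tasks are deterministic one-liners for ordinary code; a minimal sketch, using the example inputs quoted above (the paper's exact prompts may differ):

```python
def parity(bits: str) -> int:
    # Parity of a bitstring: 1 if the number of '1' bits is odd, else 0.
    return bits.count("1") % 2

def balanced(s: str) -> bool:
    # Balanced-parentheses check with an explicit depth counter --
    # exactly the kind of running state commenters say a bare LLM lacks.
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a close with nothing open
                return False
    return depth == 0

print(parity("11000"))          # two '1' bits -> prints 0 (even)
print(balanced("((((( )))))"))  # prints True
```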

Reasoning Tokens and Experimental Design

  • Major critique: authors used GPT‑5.2 with reasoning.effort left at its default “none”, i.e., zero reasoning tokens, akin to an instant model.
  • Critics call this misleading: the model is advertised as needing reasoning tokens for hard problems, and “no one uses it this way” in serious applications.
  • Defenders respond that the paper explicitly evaluates the LLM without tools or extra thinking, to map intrinsic limits.

Tokenization, Counting, and Architecture

  • Debate over whether failures arise mainly from tokenization (no direct character access; “strawberry” split into opaque tokens) or deeper issues.
  • Some note LLMs can spell and manipulate characters when forced, suggesting broader limitations: counting items, strict ordering, scattered retrieval, lack of explicit state (stack/accumulator).
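For contrast, the character-level operations at issue are trivial for a program, which sees individual characters rather than opaque tokens:

```python
word = "strawberry"
print(len(word))        # 10 characters
print(word.count("r"))  # 3 -- the famous letter-counting question
print(word[::-1])       # 'yrrebwarts' -- strict character ordering
```

The tokenization camp's point is that the model never receives this character view; the skeptics' point is that failures persist even on tasks where the token boundaries don't matter.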

Tools vs Core LLM Ability

  • Many argue production systems rely on tools (Python, calculators, spreadsheets); with tools, these tasks are trivial and the zero-error horizon (ZEH) can be effectively infinite.
  • Others question whether outsourcing to tools counts as “reasoning” or just harness design, especially for AGI claims.
  • It’s noted LLMs don’t reliably know when to invoke tools or where their knowledge boundaries lie.
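The tools argument can be made concrete with a toy harness; everything here (the routing regex, the fallback sentinel) is illustrative, not any vendor's or paper's design:

```python
import re

def answer(query: str) -> str:
    # Toy harness: route simple multiplication to exact code,
    # fall back to a (stubbed) model call for everything else.
    m = re.fullmatch(r"\s*(\d+)\s*[x*]\s*(\d+)\s*", query)
    if m:  # note: the harness, not the model, decides to use the tool
        return str(int(m.group(1)) * int(m.group(2)))
    return "LLM_FALLBACK"  # placeholder for an actual model call

print(answer("123 * 456"))         # exact: 56088
print(answer("who wrote Hamlet?")) # LLM_FALLBACK
```

The multiply branch is exact for arbitrarily large integers, which is the sense in which commenters call the ZEH "effectively infinite" with tools; the thread's open question is whether the routing step itself, done by the model rather than a regex, can be made similarly reliable.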

Usefulness of ZEH and Reliability

  • Supporters see ZEH as a way to quantify reliability per task, not to declare LLMs useless.
  • Critics argue that a system that cannot robustly count small numbers yet solves advanced math exposes a fundamental mismatch with how humans learn, and undermines the “trustworthy” branding.

Patching, Generalization, and Sociological Themes

  • Some suspect ad‑hoc, prompt‑specific patches: viral “LLM fails” get fixed quickly in UIs but persist for nearby variants of the same prompt.
  • A separate discussion claims LLMs often overproduce a single frontend template, suggesting weaker abstraction than humans show.
  • Meta-thread: strong polarization (AGI hype vs “useless grift”), identity-like attachment to views, and worries about misleading, sensationalist papers.