The case for zero-error horizons in trustworthy LLMs

Paper’s Claim and Setup

  • Thread centers on a paper showing GPT‑5.2 failing basic tasks (parity of “11000”, balancing “((((( )))))”, small multiplications) despite strong performance elsewhere.
  • Many say this is unsurprising for “bare” LLMs; others argue it’s surprising given marketing claims and public expectations of “reasoning.”
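Both headline tasks are deterministic one-liners for ordinary code; a minimal sketch, using the example inputs quoted above (the paper's exact prompts may differ):

```python
def parity(bits: str) -> int:
    # Parity of a bitstring: 1 if the number of '1' bits is odd, else 0.
    return bits.count("1") % 2

def balanced(s: str) -> bool:
    # Balanced-parentheses check with an explicit depth counter --
    # exactly the kind of running state commenters say a bare LLM lacks.
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a close with nothing open
                return False
    return depth == 0

print(parity("11000"))          # two '1' bits -> prints 0 (even)
print(balanced("((((( )))))"))  # prints True
```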

Reasoning Tokens and Experimental Design

  • Major critique: authors used GPT‑5.2 with reasoning.effort left at its default “none”, i.e., zero reasoning tokens, akin to an instant model.
  • Critics call this misleading: the model is advertised as needing reasoning tokens for hard problems, and “no one uses it this way” in serious applications.
  • Defenders respond that the paper explicitly evaluates the LLM without tools or extra thinking, to map intrinsic limits.

Tokenization, Counting, and Architecture

  • Debate over whether failures arise mainly from tokenization (no direct character access; “strawberry” split into opaque tokens) or deeper issues.
  • Some note LLMs can spell and manipulate characters when forced, suggesting broader limitations: counting items, strict ordering, scattered retrieval, lack of explicit state (stack/accumulator).
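For contrast, the character-level operations at issue are trivial for a program, which sees individual characters rather than opaque tokens:

```python
word = "strawberry"
print(len(word))        # 10 characters
print(word.count("r"))  # 3 -- the famous letter-counting question
print(word[::-1])       # 'yrrebwarts' -- strict character ordering
```

The tokenization camp's point is that the model never receives this character view; the skeptics' point is that failures persist even on tasks where the token boundaries don't matter.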

Tools vs Core LLM Ability

  • Many argue production systems rely on tools (Python, calculators, spreadsheets); with tools, these tasks are trivial and the zero-error horizon (ZEH) can be effectively infinite.
  • Others question whether outsourcing to tools counts as “reasoning” or just harness design, especially for AGI claims.
  • It’s noted LLMs don’t reliably know when to invoke tools or where their knowledge boundaries lie.
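The tools argument can be made concrete with a toy harness; everything here (the routing regex, the fallback sentinel) is illustrative, not any vendor's or paper's design:

```python
import re

def answer(query: str) -> str:
    # Toy harness: route simple multiplication to exact code,
    # fall back to a (stubbed) model call for everything else.
    m = re.fullmatch(r"\s*(\d+)\s*[x*]\s*(\d+)\s*", query)
    if m:  # note: the harness, not the model, decides to use the tool
        return str(int(m.group(1)) * int(m.group(2)))
    return "LLM_FALLBACK"  # placeholder for an actual model call

print(answer("123 * 456"))         # exact: 56088
print(answer("who wrote Hamlet?")) # LLM_FALLBACK
```

The multiply branch is exact for arbitrarily large integers, which is the sense in which commenters call the ZEH "effectively infinite" with tools; the thread's open question is whether the routing step itself, done by the model rather than a regex, can be made similarly reliable.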

Usefulness of ZEH and Reliability

  • Supporters see ZEH as a way to quantify reliability per task, not to declare LLMs useless.
  • Critics argue that a system that cannot robustly count small numbers yet solves advanced math exposes a fundamental mismatch with how humans learn, and undermines the “trustworthy” branding.

Patching, Generalization, and Sociological Themes

  • Some suspect ad‑hoc, prompt‑specific patches: viral “LLM fails” get fixed quickly in UIs but persist for nearby variants of the same prompt.
  • A separate discussion claims LLMs often overproduce a single frontend template, suggesting weaker abstraction than humans show.
  • Meta-thread: strong polarization (AGI hype vs “useless grift”), identity-like attachment to views, and worries about misleading, sensationalist papers.