GPT-5: "How many times does the letter b appear in blueberry?"

Context and reaction to the “blueberry” failure

  • GPT‑5 repeatedly answering that there are 3 b’s in “blueberry” (there are 2) is used as a vivid counterexample to claims of “PhD‑level” intelligence.
  • Commenters highlight the model’s confident, wrong explanations (“extra bounce,” invented spellings) as emblematic of LLM overconfidence and inability to absorb correction.
  • Some see it as poetic: a system marketed as expert failing a trivial perceptual task.

Tokenization, tools, and the counting blind spot

  • Many attribute the failure to tokenization: models operate on tokens/embeddings, not raw characters, so “count letters” is structurally hard.
  • Others argue tokenization alone doesn’t fully explain the persistent errors, especially when the word is spelled out letter by letter in the prompt.
  • Several suggest giving LLMs explicit tools (Python, shell, math engines) and prompting them to offload such tasks, likening this to humans using calculators; a minimal sketch follows this list.
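
A minimal sketch tying the two points above together, assuming the tiktoken package (cl100k_base is used purely as an example encoding, not necessarily GPT‑5’s actual tokenizer): the model works over token ids rather than characters, while a one‑line tool call settles the count exactly.

```python
# Why letter counting is awkward for a token-based model, and how a
# deterministic tool call settles it. Assumes the `tiktoken` package;
# cl100k_base is an example encoding, not necessarily GPT-5's.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
word = "blueberry"

# What the model actually "sees": a few multi-character chunks, not letters.
token_ids = enc.encode(word)
pieces = [enc.decode_single_token_bytes(t).decode("utf-8") for t in token_ids]
print(token_ids, pieces)

# The offloaded version: exact, cheap, and never confidently wrong.
print(word.count("b"))  # 2
```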

Reasoning models, routing, and cost tradeoffs

  • “Thinking” / reasoning variants (GPT‑5 Thinking, o3, some Qwen and Claude modes) often get the answer right, sometimes by spelling and counting internally.
  • Non‑reasoning or “chat” variants frequently fail, leading to speculation that routers send seemingly simple queries to cheaper models to save compute (a toy router sketch follows this list).
  • Some see this as economics, not capability: full power may be reserved for internal or paying use.
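
A toy router sketch to make the cost speculation concrete; the heuristic, threshold, and model names are assumptions for illustration only and do not reflect how GPT‑5’s routing actually works.

```python
# Purely illustrative router. The heuristic, threshold, and model names are
# assumptions made for this example, not anything OpenAI has documented.
def looks_simple(prompt: str) -> bool:
    # Naive cost-saving heuristic: short prompts without obvious
    # "hard problem" markers are classified as easy.
    markers = ("```", "prove", "derive", "step by step")
    return len(prompt) < 120 and not any(m in prompt.lower() for m in markers)

def route(prompt: str) -> str:
    # Cheap chat model for "easy" queries, reasoning model otherwise.
    return "cheap-chat-model" if looks_simple(prompt) else "reasoning-model"

print(route("How many times does the letter b appear in blueberry?"))
# -> "cheap-chat-model": the question looks trivial, so it never reaches
#    the variant that would spell the word out and count.
```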

Intelligence, reasoning, and consciousness debate

  • Long subthreads argue whether LLMs “really” reason or just scale pattern‑matching and auto‑completion.
  • One side stresses functional tests (they can beat many humans on reasoning benchmarks); the other insists reasoning requires conscious, reflective checking that these models lack.
  • Analogy disputes: are these like humans fooled by optical illusions, or more like hearsay machines without true understanding?

Reliability, safety, and appropriate use

  • Several insist LLMs should not be treated as truth engines: they’re useful for drafts or low‑stakes tasks, but every factual claim should be checked.
  • Others argue that anything articulate yet “merely guessing” must either be kept out of consequential domains or augmented with robust “handles” (tools, validation layers); a minimal validation sketch follows this list.
  • The blueberry test is seen as a good teaching example of systemic limits and a warning against AGI hype, not just a meme.
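
A minimal sketch of such a validation layer, assuming a hypothetical ask_model function standing in for whatever LLM call the application makes: a checkable claim is recomputed deterministically before it is surfaced.

```python
# Minimal "validation layer" sketch: a checkable claim from the model is
# recomputed deterministically before being surfaced. `ask_model` is a
# hypothetical stand-in for a real LLM call.
import re

def ask_model(prompt: str) -> str:
    # Hypothetical LLM call; hard-coded here to mimic the famous failure.
    return "The letter b appears 3 times in blueberry."

def checked_letter_count(word: str, letter: str) -> int:
    answer = ask_model(f"How many times does the letter {letter} appear in {word}?")
    claimed = int(re.search(r"\d+", answer).group())
    actual = word.count(letter)
    # Never pass a confidently wrong answer through; correct (or flag) it.
    return actual if claimed != actual else claimed

print(checked_letter_count("blueberry", "b"))  # 2, not the model's 3
```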

Model variation, patching, and synthetic data

  • Some report other models (Gemini, Qwen, OSS models) getting such questions right on first try; others show those same models failing on similar prompts.
  • There’s discussion of whether fixes are narrow patches or genuine capability improvements, and speculation about synthetic data or even intentional “watermark‑like” behaviors.
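
One way to probe the patch‑versus‑capability question, sketched below with a simulated ask_model stand‑in (an assumption, not a real API): score the model on random pseudo‑words, so that memorized answers to “blueberry” cannot help.

```python
# Sketch of a patch-vs-capability probe: random pseudo-words rule out
# memorized answers to the meme prompt. `ask_model` is a simulated
# stand-in here; swap in a real model call to run the probe for real.
import random
import string

def ask_model(prompt: str, word: str, letter: str) -> int:
    # Simulated model: usually right, occasionally off by one, just so the
    # harness has something to measure.
    return word.count(letter) + random.choice([0, 0, 0, 1])

def accuracy(trials: int = 50) -> float:
    correct = 0
    for _ in range(trials):
        word = "".join(random.choices(string.ascii_lowercase, k=random.randint(6, 12)))
        letter = random.choice(word)
        prompt = f"How many times does the letter {letter} appear in {word}?"
        correct += ask_model(prompt, word, letter) == word.count(letter)
    return correct / trials

print(f"accuracy on novel words: {accuracy():.0%}")
```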