GPT-5: "How many times does the letter b appear in blueberry?"

Context and reaction to the “blueberry” failure

  • GPT‑5 repeatedly answering that there are 3 b’s in “blueberry” (there are 2) is used as a vivid counterexample to claims of “PhD‑level” intelligence.
  • Commenters highlight the model’s confident, wrong explanations (“extra bounce,” invented spellings) as emblematic of LLM overconfidence and inability to absorb correction.
  • Some see it as poetic: a system marketed as expert failing a trivial perceptual task.

Tokenization, tools, and the counting blind spot

  • Many attribute the failure to tokenization: models operate on tokens/embeddings, not raw characters, so “count letters” is structurally hard.
  • Others argue tokenization alone doesn’t fully explain the persistent errors, especially when the word is spelled out letter by letter in the prompt.
  • Several suggest giving LLMs explicit tools (Python, shell, math engines) and prompting them to offload such tasks, likening this to humans using calculators; a minimal sketch follows this list.
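
A minimal sketch tying the two points above together, assuming the tiktoken package (cl100k_base is used purely as an example encoding, not necessarily GPT‑5’s actual tokenizer): the model works over token ids rather than characters, while a one‑line tool call settles the count exactly.

```python
# Why letter counting is awkward for a token-based model, and how a
# deterministic tool call settles it. Assumes the `tiktoken` package;
# cl100k_base is an example encoding, not necessarily GPT-5's.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
word = "blueberry"

# What the model actually "sees": a few multi-character chunks, not letters.
token_ids = enc.encode(word)
pieces = [enc.decode_single_token_bytes(t).decode("utf-8") for t in token_ids]
print(token_ids, pieces)

# The offloaded version: exact, cheap, and never confidently wrong.
print(word.count("b"))  # 2
```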

Reasoning models, routing, and cost tradeoffs

  • “Thinking” / reasoning variants (GPT‑5 Thinking, o3, some Qwen and Claude modes) often get the answer right, sometimes by spelling and counting internally.
  • Non‑reasoning or “chat” variants frequently fail, leading to speculation that routers send seemingly simple queries to cheaper models to save compute (a toy router sketch follows this list).
  • Some see this as economics, not capability: full power may be reserved for internal or paying use.
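
A toy router sketch to make the cost speculation concrete; the heuristic, threshold, and model names are assumptions for illustration only and do not reflect how GPT‑5’s routing actually works.

```python
# Purely illustrative router. The heuristic, threshold, and model names are
# assumptions made for this example, not anything OpenAI has documented.
def looks_simple(prompt: str) -> bool:
    # Naive cost-saving heuristic: short prompts without obvious
    # "hard problem" markers are classified as easy.
    markers = ("```", "prove", "derive", "step by step")
    return len(prompt) < 120 and not any(m in prompt.lower() for m in markers)

def route(prompt: str) -> str:
    # Cheap chat model for "easy" queries, reasoning model otherwise.
    return "cheap-chat-model" if looks_simple(prompt) else "reasoning-model"

print(route("How many times does the letter b appear in blueberry?"))
# -> "cheap-chat-model": the question looks trivial, so it never reaches
#    the variant that would spell the word out and count.
```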

Intelligence, reasoning, and consciousness debate

  • Long subthreads argue whether LLMs “really” reason or just scale pattern‑matching and auto‑completion.
  • One side stresses functional tests (they can beat many humans on reasoning benchmarks); the other insists reasoning requires conscious, reflective checking that these models lack.
  • Analogy disputes: are these like humans fooled by optical illusions, or more like hearsay machines without true understanding?

Reliability, safety, and appropriate use

  • Several insist LLMs should not be treated as truth engines: they’re useful for drafts or low‑stakes tasks, but every factual claim should be checked.
  • Others argue that anything articulate yet “merely guessing” must either be kept out of consequential domains or augmented with robust “handles” (tools, validation layers); a minimal validation sketch follows this list.
  • The blueberry test is seen as a good teaching example of systemic limits and a warning against AGI hype, not just a meme.
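
A minimal sketch of such a validation layer, assuming a hypothetical ask_model function standing in for whatever LLM call the application makes: a checkable claim is recomputed deterministically before it is surfaced.

```python
# Minimal "validation layer" sketch: a checkable claim from the model is
# recomputed deterministically before being surfaced. `ask_model` is a
# hypothetical stand-in for a real LLM call.
import re

def ask_model(prompt: str) -> str:
    # Hypothetical LLM call; hard-coded here to mimic the famous failure.
    return "The letter b appears 3 times in blueberry."

def checked_letter_count(word: str, letter: str) -> int:
    answer = ask_model(f"How many times does the letter {letter} appear in {word}?")
    claimed = int(re.search(r"\d+", answer).group())
    actual = word.count(letter)
    # Never pass a confidently wrong answer through; correct (or flag) it.
    return actual if claimed != actual else claimed

print(checked_letter_count("blueberry", "b"))  # 2, not the model's 3
```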

Model variation, patching, and synthetic data

  • Some report other models (Gemini, Qwen, OSS models) getting such questions right on first try; others show those same models failing on similar prompts.
  • There’s discussion of whether fixes are narrow patches or genuine capability improvements, and speculation about synthetic data or even intentional “watermark‑like” behaviors.
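
One way to probe the patch‑versus‑capability question, sketched below with a simulated ask_model stand‑in (an assumption, not a real API): score the model on random pseudo‑words, so that memorized answers to “blueberry” cannot help.

```python
# Sketch of a patch-vs-capability probe: random pseudo-words rule out
# memorized answers to the meme prompt. `ask_model` is a simulated
# stand-in here; swap in a real model call to run the probe for real.
import random
import string

def ask_model(prompt: str, word: str, letter: str) -> int:
    # Simulated model: usually right, occasionally off by one, just so the
    # harness has something to measure.
    return word.count(letter) + random.choice([0, 0, 0, 1])

def accuracy(trials: int = 50) -> float:
    correct = 0
    for _ in range(trials):
        word = "".join(random.choices(string.ascii_lowercase, k=random.randint(6, 12)))
        letter = random.choice(word)
        prompt = f"How many times does the letter {letter} appear in {word}?"
        correct += ask_model(prompt, word, letter) == word.count(letter)
    return correct / trials

print(f"accuracy on novel words: {accuracy():.0%}")
```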