Why do LLMs freak out over the seahorse emoji?
Mechanistic cause of the “seahorse emoji” failure
- Many commenters align with the article’s explanation: the model forms a coherent internal representation of “seahorse emoji”, but there is no corresponding output token in the tokenizer.
- The final projection layer (lm_head) is forced to pick the closest existing emoji token embedding (horse, fish, shell, etc.), so the model outputs the "wrong" emoji.
- Because the model is trained to explain and justify its answers, it then sees its own wrong output as input, detects the inconsistency ("this isn't a seahorse"), and enters a repair loop rather than stopping.
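A minimal PyTorch sketch of that bottleneck, using a toy vocabulary and a made-up hidden state rather than any real model's weights: the output layer can only score tokens that exist, so a hidden state representing a missing concept lands on its nearest existing neighbor.

```python
import torch

vocab = ["🐟", "🐎", "🐚", "🦐"]  # toy vocabulary: no seahorse token exists
d_model = 8
torch.manual_seed(0)

# Each output token corresponds to one row of the lm_head weight matrix.
lm_head = torch.nn.Linear(d_model, len(vocab), bias=False)

# Pretend this is the model's internal "seahorse" representation:
# partway between "fish" and "horse", identical to neither.
hidden = 0.5 * lm_head.weight[0] + 0.5 * lm_head.weight[1]

logits = lm_head(hidden)              # scores over *existing* tokens only
print(vocab[logits.argmax().item()])  # some existing emoji, never a seahorse
```

Real models do the same argmax over ~100k tokens; the point is that a softmax over an incomplete vocabulary has no way to express "no such token".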
Hallucination vs tokenization vs knowledge
- Debate over whether this is “classic hallucination”:
- One view: the hallucination starts as soon as the model asserts "yes, it exists" about an emoji that doesn't.
- Another view: the failure is more like a tokenization/representation gap plus incorrect prior “knowledge” from training data that a seahorse emoji exists.
- Several note that humans also “remember” a seahorse emoji (Mandela effect, legacy MSN/Skype/custom emoji), so training data likely contains both “it exists” and “it doesn’t” claims.
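The "it doesn't exist" half is checkable against the Unicode character database itself, e.g. with Python's standard unicodedata module (results reflect whatever Unicode version your Python bundles):

```python
import unicodedata

# Nearby sea/horse characters exist by name; SEAHORSE does not.
for name in ["HORSE", "HORSE FACE", "TROPICAL FISH", "SEAHORSE"]:
    try:
        print(f"{name}: {unicodedata.lookup(name)}")
    except KeyError:
        print(f"{name}: no such Unicode character")
```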
Self-correction, reasoning, and “freakout” behavior
- People highlight the striking, human-like behavior: models contradict themselves mid-answer, apologize, retry, and sometimes spiral into long, frantic sequences (or emoji spam).
- Explanations offered:
- Transformers generate strictly left-to-right with a fixed compute budget per token; there's no built-in "silent revision" pass (see the decoding sketch after this list).
- “Thinking” / reasoning modes are effectively hidden self‑conversation: the model does the same repair process but off-screen, sometimes with web search to ground facts.
- Attempts to add “backspace” or revision tokens exist in research, but don’t seem to scale as well as chain‑of‑thought and external tools.
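A minimal sketch of why revision is structurally absent, using a hypothetical stand-in model: greedy decoding spends one fixed-size forward pass per token, appends it, and nothing in the loop can retract a token once emitted, so any "repair" can only take the form of additional tokens.

```python
import torch

class DummyLM(torch.nn.Module):
    """Hypothetical stand-in for a real LM: untrained logits over a toy vocab."""
    def __init__(self, vocab_size=16, d=8):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab_size, d)
        self.head = torch.nn.Linear(d, vocab_size)

    def forward(self, ids):              # ids: (batch, seq)
        return self.head(self.emb(ids))  # logits: (batch, seq, vocab)

def greedy_decode(model, ids, max_new_tokens=8):
    for _ in range(max_new_tokens):
        logits = model(ids)                     # one forward pass per token
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1) # committed: no backspace token
    return ids                                  # repair = emit *more* tokens

print(greedy_decode(DummyLM(), torch.tensor([[1, 2, 3]])))
```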
Comparisons across models and prompts
- Different models behave differently:
- Some (especially with web search or explicit “thinking”) quickly answer “no, there is no seahorse emoji” and frame it as a Mandela effect.
- Others loop, emit near-miss emoji, or confidently invent fake Unicode code points and then retract them.
- Wording matters: “Is there a seahorse emoji?” sometimes elicits a clean “no”; “show me the seahorse emoji” more often triggers the meltdown.
Broader implications and proposed fixes
- Suggestions include: adding explicit training examples (“there is no seahorse emoji”), relying on web search, or simply lobbying Unicode to add one.
- Several see this as an illustrative, fundamental limitation: LLMs are excellent at fluent interpolation in their learned manifold, but brittle on “negative knowledge” and on concepts that are linguistically common yet lack direct symbols.