Why do LLMs freak out over the seahorse emoji?
Mechanistic cause of the “seahorse emoji” failure
- Many commenters align with the article’s explanation: the model forms a coherent internal representation of “seahorse emoji”, but there is no corresponding output token in the tokenizer.
- The final projection layer (lm_head) is forced to pick the closest existing emoji token embedding (horse, fish, shell, etc.), so the model outputs the "wrong" emoji.
- Because the model is trained to explain and justify its answers, it then sees its own wrong output as input, detects the inconsistency ("this isn't a seahorse"), and enters a repair loop rather than stopping.
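A minimal PyTorch sketch of that bottleneck, using a toy vocabulary and a made-up hidden state rather than any real model's weights: the output layer can only score tokens that exist, so a hidden state representing a missing concept lands on its nearest existing neighbor.

```python
import torch

vocab = ["🐟", "🐎", "🐚", "🦐"]  # toy vocabulary: no seahorse token exists
d_model = 8
torch.manual_seed(0)

# Each output token corresponds to one row of the lm_head weight matrix.
lm_head = torch.nn.Linear(d_model, len(vocab), bias=False)

# Pretend this is the model's internal "seahorse" representation:
# partway between "fish" and "horse", identical to neither.
hidden = 0.5 * lm_head.weight[0] + 0.5 * lm_head.weight[1]

logits = lm_head(hidden)              # scores over *existing* tokens only
print(vocab[logits.argmax().item()])  # some existing emoji, never a seahorse
```

Real models do the same argmax over ~100k tokens; the point is that a softmax over an incomplete vocabulary has no way to express "no such token".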
Hallucination vs tokenization vs knowledge
- Debate over whether this is “classic hallucination”:
- One view: the hallucination starts as soon as the model asserts "yes, it exists" about an emoji that doesn't.
- Another view: the failure is more like a tokenization/representation gap plus incorrect prior “knowledge” from training data that a seahorse emoji exists.
- Several note that humans also “remember” a seahorse emoji (Mandela effect, legacy MSN/Skype/custom emoji), so training data likely contains both “it exists” and “it doesn’t” claims.
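The "it doesn't exist" half is checkable against the Unicode character database itself, e.g. with Python's standard unicodedata module (results reflect whatever Unicode version your Python bundles):

```python
import unicodedata

# Nearby sea/horse characters exist by name; SEAHORSE does not.
for name in ["HORSE", "HORSE FACE", "TROPICAL FISH", "SEAHORSE"]:
    try:
        print(f"{name}: {unicodedata.lookup(name)}")
    except KeyError:
        print(f"{name}: no such Unicode character")
```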
Self-correction, reasoning, and “freakout” behavior
- People highlight the striking, human-like behavior: models contradict themselves mid-answer, apologize, retry, and sometimes spiral into long, frantic sequences (or emoji spam).
- Explanations offered:
- Transformers generate strictly left-to-right with a fixed compute budget per token; there's no built-in "silent revision" pass (see the decoding sketch after this list).
- “Thinking” / reasoning modes are effectively hidden self‑conversation: the model does the same repair process but off-screen, sometimes with web search to ground facts.
- Attempts to add “backspace” or revision tokens exist in research, but don’t seem to scale as well as chain‑of‑thought and external tools.
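A minimal sketch of why revision is structurally absent, using a hypothetical stand-in model: greedy decoding spends one fixed-size forward pass per token, appends it, and nothing in the loop can retract a token once emitted, so any "repair" can only take the form of additional tokens.

```python
import torch

class DummyLM(torch.nn.Module):
    """Hypothetical stand-in for a real LM: untrained logits over a toy vocab."""
    def __init__(self, vocab_size=16, d=8):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab_size, d)
        self.head = torch.nn.Linear(d, vocab_size)

    def forward(self, ids):              # ids: (batch, seq)
        return self.head(self.emb(ids))  # logits: (batch, seq, vocab)

def greedy_decode(model, ids, max_new_tokens=8):
    for _ in range(max_new_tokens):
        logits = model(ids)                     # one forward pass per token
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1) # committed: no backspace token
    return ids                                  # repair = emit *more* tokens

print(greedy_decode(DummyLM(), torch.tensor([[1, 2, 3]])))
```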
Comparisons across models and prompts
- Different models behave differently:
- Some (especially with web search or explicit “thinking”) quickly answer “no, there is no seahorse emoji” and frame it as a Mandela effect.
- Others loop, emit near-miss emoji, or confidently invent fake Unicode code points and then retract them.
- Wording matters: “Is there a seahorse emoji?” sometimes elicits a clean “no”; “show me the seahorse emoji” more often triggers the meltdown.
Broader implications and proposed fixes
- Suggestions include: adding explicit training examples (“there is no seahorse emoji”), relying on web search, or simply lobbying Unicode to add one.
- Several see this as an illustrative, fundamental limitation: LLMs are excellent at fluent interpolation in their learned manifold, but brittle on “negative knowledge” and on concepts that are linguistically common yet lack direct symbols.