Why do LLMs freak out over the seahorse emoji?

Mechanistic cause of the “seahorse emoji” failure

  • Many commenters align with the article’s explanation: the model forms a coherent internal representation of “seahorse emoji”, but there is no corresponding output token in the tokenizer.
  • The final projection layer (lm_head) can only score tokens that exist in the vocabulary, so the probability mass lands on the closest existing emoji embeddings (horse, fish, shell, etc.) and the model emits a near-miss emoji (see the sketch after this list).
  • Because the model is trained to explain and justify its answers, it then sees its own wrong output as input, detects an inconsistency (“this isn’t a seahorse”), and enters a repair loop rather than stopping.
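To make the projection-bottleneck argument concrete, here is a minimal sketch with toy numbers (a hypothetical four-emoji vocabulary and made-up vectors, not any real model's weights): the final hidden state is scored against every row of the unembedding matrix, and since no seahorse row exists, the argmax lands on whichever existing emoji is nearest.

```python
import numpy as np

# Toy unembedding matrix: one row per output token (hypothetical 4-emoji vocab).
# In a real model this is lm_head.weight with ~100k rows -- none of them a seahorse.
vocab = ["🐴 horse", "🐟 fish", "🐚 shell", "🦐 shrimp"]
W_U = np.array([
    [0.9, 0.1, 0.0],   # horse:  mostly "equine"
    [0.1, 0.9, 0.0],   # fish:   mostly "aquatic"
    [0.0, 0.3, 0.9],   # shell:  "ocean decoration"
    [0.2, 0.8, 0.1],   # shrimp: "small aquatic creature"
])

# Made-up final hidden state for the concept "seahorse":
# part equine, mostly aquatic -- a direction with no dedicated token.
h_seahorse = np.array([0.5, 0.8, 0.1])

logits = W_U @ h_seahorse                      # lm_head projection: score every token
probs = np.exp(logits) / np.exp(logits).sum()  # softmax over the existing vocabulary

for tok, p in zip(vocab, probs):
    print(f"{tok}: {p:.2f}")
print("output:", vocab[int(np.argmax(logits))])  # the nearest existing emoji wins
```

With these toy numbers the fish token narrowly beats the shrimp token, which is the same shape of failure people report: a confident, plausible, wrong emoji.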

Hallucination vs tokenization vs knowledge

  • Debate over whether this is “classic hallucination”:
    • One view: the hallucination starts as soon as it asserts “yes, it exists” when it doesn’t.
    • Another view: the failure is more like a tokenization/representation gap plus incorrect prior “knowledge” from training data that a seahorse emoji exists.
  • Several note that humans also “remember” a seahorse emoji (Mandela effect, legacy MSN/Skype/custom emoji), so training data likely contains both “it exists” and “it doesn’t” claims, even though the Unicode database itself is unambiguous (see the check after this list).
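For what it’s worth, the symbol’s absence is easy to verify directly against the Unicode character database (standard-library Python, no model involved): the usual fallback emoji all resolve to real code points, while looking up “SEAHORSE” finds nothing.

```python
import unicodedata

# The emoji the model tends to fall back on all exist in Unicode...
for name in ["HORSE FACE", "FISH", "SPIRAL SHELL", "SHRIMP"]:
    ch = unicodedata.lookup(name)
    print(f"U+{ord(ch):05X} {ch}  {name}")

# ...but there is no seahorse code point to emit, so lookup raises KeyError.
try:
    unicodedata.lookup("SEAHORSE")
except KeyError:
    print("no SEAHORSE character in the Unicode database")
```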

Self-correction, reasoning, and “freakout” behavior

  • People highlight the striking, human-like behavior: models contradict themselves mid-answer, apologize, retry, and sometimes spiral into long, frantic sequences (or emoji spam).
  • Explanations offered:
    • Transformers generate strictly left-to-right with a fixed compute budget per token; there is no built‑in “silent revision” pass (a toy decoding loop after this list makes this concrete).
    • “Thinking” / reasoning modes are effectively hidden self‑conversation: the model does the same repair process but off-screen, sometimes with web search to ground facts.
    • Attempts to add “backspace” or revision tokens exist in research, but don’t seem to scale as well as chain‑of‑thought and external tools.
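A toy greedy-decoding loop illustrates the “no silent revision” point (the next_token function here is a hand-written stand-in, not a real model): each emitted token is appended to the context and immediately conditions the next step, and there is no operation that deletes or rewrites an earlier token, so the only available “fix” is to emit more tokens about the mistake.

```python
# Toy stand-in for a model's next-token step (not a real LLM): it reacts to
# its own previous output, which is exactly how the repair loop arises.
def next_token(context: list[str]) -> str:
    if not context or context[-1].endswith("meant:"):
        return "🐟"                                   # nearest existing emoji
    return "wait, that's not a seahorse, I meant:"    # self-detected mismatch

context: list[str] = []
for _ in range(8):                         # fixed budget: one forward pass per token
    context.append(next_token(context))    # append-only: no backspace, no revision

print(" ".join(context))
# 🐟 wait, that's not a seahorse, I meant: 🐟 wait, ... (until the budget cuts it off)
```

A “backspace” token would amount to letting next_token return something that pops the context instead of appending to it, which is roughly what the revision-token research mentioned above explores.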

Comparisons across models and prompts

  • Different models behave differently:
    • Some (especially with web search or explicit “thinking”) quickly answer “no, there is no seahorse emoji” and frame it as a Mandela effect.
    • Others loop, emit near-miss emoji, or confidently invent fake Unicode code points and then retract them.
  • Wording matters: “Is there a seahorse emoji?” sometimes elicits a clean “no”; “show me the seahorse emoji” more often triggers the meltdown.

Broader implications and proposed fixes

  • Suggested fixes include adding explicit training examples (“there is no seahorse emoji”), relying on web search, or simply lobbying Unicode to add one.
  • Several see this as an illustration of a fundamental limitation: LLMs excel at fluent interpolation within their learned manifold, but are brittle on “negative knowledge” and on concepts that are linguistically common yet lack a direct symbol.