Ask HN: Share your AI prompt that stumps every model

Logic, Riddles, and Ambiguous Language

  • Many prompts exploit subtle wording changes to break pattern-matching:
    • Variants of classic riddles (farmer–wolf–goat–cabbage, “surgeon is the mother”, cousin vs son) show models often answer the famous version, not the text actually given.
    • Short stories and literary vignettes (e.g., pickpocket twists, “King in Yellow” fragment) reveal weak theory-of-mind and failure to infer implied actors or blame.
    • Simple relational puzzles (“Alice has 3 brothers and 6 sisters… how many sisters does her brother have?”) still trip many models; the intended answer is 7, because the brother’s sisters include Alice herself.

Math, Counting, and Tokenization Limits

  • Models struggle with:
    • Word/letter constraints (“20 sentences ending with ‘p/o’”, syllable splits in Romanian, constrained word lists).
    • Exact arithmetic and combinatorics (card probabilities, the bowling-alley bat/ball riddle reworded with “stole”, Busy Beaver values, long polynomials), which models get right only when they write code (see the sketch after this list).
    • Geometry and spatial reasoning conveyed in text (room navigation, towels/dryer capacity, cube corners, Rubik’s cube, chess positions without an engine).
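
Several replies note the reliable workaround: have the model write code and run it. A minimal sketch of that pattern using only the Python standard library (the specific ace question is an assumed stand-in; the thread's card puzzles vary):

```python
from fractions import Fraction
from math import comb

# Exact combinatorics of the kind models botch in-head:
# P(at least one ace in a 5-card hand from a standard 52-card deck).
# Complement rule: subtract the probability that all 5 cards
# come from the 48 non-aces.
p_no_ace = Fraction(comb(48, 5), comb(52, 5))
p_at_least_one_ace = 1 - p_no_ace

print(p_at_least_one_ace)         # 18472/54145
print(float(p_at_least_one_ace))  # ~0.3412
```

Fraction keeps the result exact. The letter-level stumpers in the first bullet fail for a related reason (models operate on multi-character tokens, not individual letters) and yield to the same code-based fix.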

Hallucinations and “I Don’t Know” Failures

  • Intentionally non-existent entities (“Marathon crater”, invented rituals, fake quotes, goblins, pub lyrics, obscure movies/books/art) prompt confident fabrications.
  • Some newer models occasionally respond “I don’t know” or flag fiction, but many still:
    • Backfill elaborate but wrong histories.
    • Double down when challenged, then retroactively rationalize.
  • This is cited as evidence that LLMs lack true knowledge boundaries and are optimized to produce plausible text, not epistemic humility.

Multimodal and Formatting Weaknesses

  • Image failures: clocks showing any time other than 10:10 (the stock-photo default), a wine glass filled to the brim, exact chess positions, “find Waldo”, horizontal trains, simple puzzles (“find 10 things wrong”), non-Scottish bagpipes.
  • ASCII/structured output: stable boxes, mazes, skulls, 12 identical squares, grids transcribed from screenshots, and game boards all remain unreliable because output is generated as a flat token sequence; see the sketch below.
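
The fix reported for these is the same as for arithmetic: emit code that renders the structure rather than drawing it token by token. A sketch (the helper functions are illustrative, not from the thread):

```python
# Render identical ASCII squares by construction; a model drawing them
# one token at a time tends to drift between copies.
def square(size=4):
    edge = "+" + "-" * (size - 2) + "+"
    body = "|" + " " * (size - 2) + "|"
    return [edge] + [body] * (size - 2) + [edge]

def grid(rows, cols, size=4, gap="  "):
    lines = square(size)
    block = "\n".join(gap.join([line] * cols) for line in lines)
    return "\n\n".join([block] * rows)

print(grid(3, 4))  # the "12 identical squares" stumper, 3 rows of 4
```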

Behavioral Patterns: Eager Beavers and Overhelpfulness

  • Models rarely push back on nonsensical or under-specified tasks; they will design flying submarines, write impossible SQL, walk through contrived card tricks, or produce huge overengineered codebases instead of saying “this is ill-posed.”
  • Users report “yes‑man” tendencies and excessive optimism unless explicitly asked to critique ideas.

Meta: Benchmarks, Training Leakage, and Prompt Hoarding

  • Some commenters refuse to share their best “stump” prompts, preferring to keep private eval sets that have not been absorbed into training data.
  • Others argue sharing hard cases improves models and interpretability, while critics worry about Goodharting on public benchmarks.
  • Several note clear progress: reasoning models using tools (Python, search) can now solve earlier stumpers, but often by external computation rather than internal reasoning.
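
For concreteness, the tool-use pattern the thread describes looks like this (example inputs are hypothetical): the model writes a few lines of Python and reads back the result instead of reasoning over tokens:

```python
# Letter counting, a classic stumper, is trivial as external computation.
print("strawberry".count("r"))  # 3

# So is checking a formatting constraint like "every sentence ends in 'p'".
sentences = ["Keep it up.", "Time to stop.", "Drink the syrup."]
print(all(s.rstrip(".!?").endswith("p") for s in sentences))  # True
```

Which is exactly the caveat several commenters raise: the answer comes out right, but the reasoning happened outside the model.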