Ask HN: Share your AI prompt that stumps every model
Logic, Riddles, and Ambiguous Language
- Many prompts exploit subtle wording changes to break pattern-matching:
  - Variants of classic riddles (farmer–wolf–goat–cabbage, “surgeon is the mother”, cousin vs. son) show models often answer the famous version, not the text actually given.
  - Short stories and literary vignettes (e.g., pickpocket twists, a “King in Yellow” fragment) reveal weak theory of mind and failure to infer implied actors or blame.
  - Simple relational puzzles (“Alice has 3 brothers and 6 sisters… how many sisters does her brother have?”) still trip many models; the worked count after this list shows why the answer is 7, not 6.
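A minimal sketch of the Alice puzzle counted explicitly, assuming (as the puzzle intends) that all siblings share the same parents:

```python
# Alice has 3 brothers and 6 sisters. Enumerating the whole sibling set
# makes the trap obvious: a brother's sisters include Alice herself.
alice_brothers = 3
alice_sisters = 6

girls = alice_sisters + 1      # Alice's 6 sisters plus Alice = 7 girls
boys = alice_brothers          # the 3 brothers

sisters_per_brother = girls    # each brother's sisters are all 7 girls
print(sisters_per_brother)     # 7 -- pattern-matching the surface text gives 6
```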
Math, Counting, and Tokenization Limits
- Models struggle with:
  - Word/letter constraints (“20 sentences ending with ‘p’ or ‘o’”, syllable splits in Romanian, constrained word lists), where tokenization hides individual letters from the model.
  - Exact arithmetic and combinatorics (card probabilities, the bowling-alley variant of the bat-and-ball riddle with “stole” swapped in, Busy Beaver values, long polynomials) unless they write code; see the fraction sketch after this list.
  - Geometry and spatial reasoning via text (room navigation, towel-drying capacity, cube corners, Rubik’s cube states, chess positions without an engine).
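A hedged sketch of the “write code” escape hatch: exact card probability via rational arithmetic. The specific question, the chance that the top two cards of a shuffled 52-card deck are both aces, is a stand-in for the thread’s card-probability prompts, not a quote from them:

```python
from fractions import Fraction

# 4 aces among 52 cards, then 3 aces among the remaining 51.
p_two_aces = Fraction(4, 52) * Fraction(3, 51)

print(p_two_aces)          # 1/221, exact rational arithmetic
print(float(p_two_aces))   # ~0.00452
```

Executed as code the result is exact; generated token by token, the same arithmetic frequently drifts.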
Hallucinations and “I Don’t Know” Failures
- Deliberately non-existent entities (“Marathon crater”, invented rituals, fake quotes, goblins, pub lyrics, obscure movies/books/art) elicit confident fabrications.
- Some newer models occasionally respond “I don’t know” or flag fiction, but many still:
  - Backfill elaborate but wrong histories.
  - Double down when challenged, then retroactively rationalize.
- This is cited as evidence that LLMs lack true knowledge boundaries and are optimized to produce plausible text rather than to exercise epistemic humility.
Multimodal and Formatting Weaknesses
- Image failures: clocks showing any time other than 10:10 (the time watch photos are conventionally staged at), a wine glass filled to the brim, exact chess positions, “find Waldo”, horizontal trains, simple puzzles (“find 10 things wrong”), non-Scottish bagpipes.
- ASCII/structured output: stable boxes, mazes, skulls, 12 identical squares, grids from screenshots, and game boards all remain unreliable because output is generated one token at a time; contrast the deterministic sketch below.
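For contrast, a minimal sketch of why “12 identical squares” is trivial once it becomes a program: the layout is computed once and repeated exactly (grid dimensions and square size are arbitrary choices here):

```python
def square(side: int) -> list[str]:
    """One ASCII square, returned as a list of text rows."""
    top = "+" + "-" * side + "+"
    mid = "|" + " " * side + "|"
    return [top] + [mid] * side + [top]

rows, cols, side = 3, 4, 2    # a 3 x 4 grid of 12 identical squares
cell = square(side)
for _ in range(rows):
    for line in cell:         # emit each text row across the whole grid row
        print(" ".join([line] * cols))
```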
Behavioral Patterns: Eager Beavers and Overhelpfulness
- Models rarely push back on nonsensical or under-specified tasks; they will happily design flying submarines, write impossible SQL, invent contrived card tricks, or produce huge overengineered codebases instead of saying “this is ill-posed.”
- Users report “yes‑man” tendencies and excessive optimism unless explicitly asked to critique ideas.
Meta: Benchmarks, Training Leakage, and Prompt Hoarding
- Some commenters refuse to share their best “stump” prompts, wanting private eval sets that never get absorbed into training data.
- Others argue sharing hard cases improves models and interpretability, while critics worry about Goodharting on public benchmarks.
- Several note clear progress: reasoning models with tools (Python, search) now solve earlier stumpers, but often by external computation rather than internal reasoning, as in the sketch below.
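A sketch of that external-computation pattern, using letter counting (the word is the stock example, assumed here rather than quoted from the thread): the model emits a snippet, a sandbox runs it, and the exact answer is read back.

```python
# Counting letters is hard over tokens but trivial as executed code.
word = "strawberry"
print(word.count("r"))   # 3 -- the interpreter, not the model, does the counting
```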