Ask HN: Share your AI prompt that stumps every model
Logic, Riddles, and Ambiguous Language
- Many prompts exploit subtle wording changes to break pattern-matching:
  - Variants of classic riddles (farmer–wolf–goat–cabbage, “surgeon is the mother”, cousin vs. son) show models often answer the famous version, not the text actually given.
  - Short stories and literary vignettes (e.g., pickpocket twists, a “King in Yellow” fragment) reveal weak theory of mind and failure to infer implied actors or blame.
  - Simple relational puzzles (“Alice has 3 brothers and 6 sisters… how many sisters does her brother have?”) still trip many models; the worked count after this list shows why the answer is 7, not 6.
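A minimal sketch of the Alice puzzle counted explicitly, assuming (as the puzzle intends) that all siblings share the same parents:

```python
# Alice has 3 brothers and 6 sisters. Enumerating the whole sibling set
# makes the trap obvious: a brother's sisters include Alice herself.
alice_brothers = 3
alice_sisters = 6

girls = alice_sisters + 1      # Alice's 6 sisters plus Alice = 7 girls
boys = alice_brothers          # the 3 brothers

sisters_per_brother = girls    # each brother's sisters are all 7 girls
print(sisters_per_brother)     # 7 -- pattern-matching the surface text gives 6
```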
Math, Counting, and Tokenization Limits
- Models struggle with:
  - Word/letter constraints (“20 sentences ending with ‘p’ or ‘o’”, syllable splits in Romanian, constrained word lists), where tokenization hides individual letters from the model.
  - Exact arithmetic and combinatorics (card probabilities, the bowling-alley variant of the bat-and-ball riddle with “stole” swapped in, Busy Beaver values, long polynomials) unless they write code; see the fraction sketch after this list.
  - Geometry and spatial reasoning via text (room navigation, towel-drying capacity, cube corners, Rubik’s cube states, chess positions without an engine).
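A hedged sketch of the “write code” escape hatch: exact card probability via rational arithmetic. The specific question, the chance that the top two cards of a shuffled 52-card deck are both aces, is a stand-in for the thread’s card-probability prompts, not a quote from them:

```python
from fractions import Fraction

# 4 aces among 52 cards, then 3 aces among the remaining 51.
p_two_aces = Fraction(4, 52) * Fraction(3, 51)

print(p_two_aces)          # 1/221, exact rational arithmetic
print(float(p_two_aces))   # ~0.00452
```

Executed as code the result is exact; generated token by token, the same arithmetic frequently drifts.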
Hallucinations and “I Don’t Know” Failures
- Deliberately non-existent entities (“Marathon crater”, invented rituals, fake quotes, goblins, pub lyrics, obscure movies/books/art) elicit confident fabrications.
- Some newer models occasionally respond “I don’t know” or flag fiction, but many still:
  - Backfill elaborate but wrong histories.
  - Double down when challenged, then retroactively rationalize.
- This is cited as evidence that LLMs lack true knowledge boundaries and are optimized to produce plausible text rather than to exercise epistemic humility.
Multimodal and Formatting Weaknesses
- Image failures: clocks showing any time other than 10:10 (the time watch photos are conventionally staged at), a wine glass filled to the brim, exact chess positions, “find Waldo”, horizontal trains, simple puzzles (“find 10 things wrong”), non-Scottish bagpipes.
- ASCII/structured output: stable boxes, mazes, skulls, 12 identical squares, grids from screenshots, and game boards all remain unreliable because output is generated one token at a time; contrast the deterministic sketch below.
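For contrast, a minimal sketch of why “12 identical squares” is trivial once it becomes a program: the layout is computed once and repeated exactly (grid dimensions and square size are arbitrary choices here):

```python
def square(side: int) -> list[str]:
    """One ASCII square, returned as a list of text rows."""
    top = "+" + "-" * side + "+"
    mid = "|" + " " * side + "|"
    return [top] + [mid] * side + [top]

rows, cols, side = 3, 4, 2    # a 3 x 4 grid of 12 identical squares
cell = square(side)
for _ in range(rows):
    for line in cell:         # emit each text row across the whole grid row
        print(" ".join([line] * cols))
```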
Behavioral Patterns: Eager Beavers and Overhelpfulness
- Models rarely push back on nonsensical or under-specified tasks; they will happily design flying submarines, write impossible SQL, invent contrived card tricks, or produce huge overengineered codebases instead of saying “this is ill-posed.”
- Users report “yes‑man” tendencies and excessive optimism unless explicitly asked to critique ideas.
Meta: Benchmarks, Training Leakage, and Prompt Hoarding
- Some commenters refuse to share their best “stump” prompts, wanting private eval sets that never get absorbed into training data.
- Others argue sharing hard cases improves models and interpretability, while critics worry about Goodharting on public benchmarks.
- Several note clear progress: reasoning models with tools (Python, search) now solve earlier stumpers, but often by external computation rather than internal reasoning, as in the sketch below.
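A sketch of that external-computation pattern, using letter counting (the word is the stock example, assumed here rather than quoted from the thread): the model emits a snippet, a sandbox runs it, and the exact answer is read back.

```python
# Counting letters is hard over tokens but trivial as executed code.
word = "strawberry"
print(word.count("r"))   # 3 -- the interpreter, not the model, does the counting
```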