I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

The Car-Wash Question & Model Behavior

  • The prompt “I want to wash my car. The car wash is 50 meters away. Should I walk or drive?” elicits divergent answers: some models say “drive” (explicitly noting the car must be present), others confidently say “walk” and justify it with health, environment, or convenience arguments.
  • The behavior is noticeably non-deterministic: the same model, even at identical settings, often alternates between “walk” and “drive” across runs, languages, or surrounding context.
  • Several people report that newer or higher‑tier “reasoning” models (Gemini Pro/Thinking, some Claude and Grok variants, some Codex/GPT variants) usually get it right, but not reliably.
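The run-to-run flipping described above is easy to quantify: sample the same prompt many times and tally which option each reply commits to. A minimal sketch, using a hypothetical `stub_model` stand-in (an assumption; swap in a real chat-completion call from your provider) and a crude keyword classifier:

```python
import random
from collections import Counter

PROMPT = ("I want to wash my car. The car wash is 50 meters away. "
          "Should I walk or drive?")

def classify_answer(text: str) -> str:
    """Crude classifier: which option does the reply mention first?"""
    lowered = text.lower()
    walk, drive = lowered.find("walk"), lowered.find("drive")
    if walk == -1 and drive == -1:
        return "unclear"
    if drive == -1:
        return "walk"
    if walk == -1:
        return "drive"
    return "walk" if walk < drive else "drive"

def stub_model(prompt: str, rng: random.Random) -> str:
    # Placeholder for a real API call; at temperature > 0 a sampled
    # model answers non-deterministically, which we mimic here.
    return rng.choice([
        "Walk! It's only 50 meters and good exercise.",
        "Drive, since the car has to be at the car wash.",
    ])

def tally(n: int = 20, seed: int = 0) -> Counter:
    """Distribution of answers over n samples of the same prompt."""
    rng = random.Random(seed)
    return Counter(classify_answer(stub_model(PROMPT, rng)) for _ in range(n))
```

With a real model behind `stub_model`, the resulting distribution (rather than any single reply) is what supports or refutes the “sometimes walk, sometimes drive” claim.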

Is It a Trick Question or a Reasoning Failure?

  • Some see it as a classic riddle / “Cognitive Reflection Test” style trap: the surface pattern (“short trip: walk vs drive?”) misleads you away from the key constraint (the car must move).
  • Others argue it should still be a trivial everyday inference and that failing it exposes a lack of practical, embodied “common sense.”
  • A recurring comparison is to human trick questions (“How many Rs in ‘strawberry’?”, “Where do you bury the survivors?”): humans also get these wrong, but they can typically ask clarifying questions—something LLMs rarely do by default.

What It Suggests About LLMs’ “Understanding”

  • One camp says this shows LLMs don’t really understand the world; they’re powerful text predictors that latch on to high‑frequency patterns (“short distance → walk”) and ignore physical preconditions.
  • Others push back: the same models can handle quite complex code, math, and domain reasoning; a single toy failure doesn’t falsify “reasoning,” just shows brittle generalization under ambiguity.

Training, Alignment, and Bias

  • Several comments link “walk” answers to alignment and RLHF: models are heavily rewarded for sounding eco‑friendly, health‑conscious, and non‑committal, which nudges them toward “walk” over “drive.”
  • There’s suspicion that once such prompts go viral, providers “patch” them via fine‑tuning, routing, or system prompts, creating the illusion of deeper understanding.

Prompting, Reasoning Modes, and Clarification

  • Adding cues like “this is a logic puzzle,” “think carefully,” or “state assumptions first” often flips the answer to “drive,” showing that chain‑of‑thought modes can override shallow heuristics.
  • Many argue the real missing behavior is meta‑cognition: models almost never respond with “this question is underspecified/odd—where is the car?” even though that’s what a careful human would do.

Implications for Use and Evaluation

  • Commenters stress that one‑shot screenshots are a poor evaluation of probabilistic systems; you need multiple samples and families of similar prompts.
  • Still, this kind of failure is used as a warning: LLMs are useful tools (especially with tests, compilers, or external checks) but should not be treated as unsupervised agents with reliable real‑world reasoning.
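The “multiple samples and families of similar prompts” point can be made concrete with a tiny evaluation harness. The sketch below assumes a hypothetical `stub_model` in place of a real API client, and varies only the distance to build a prompt family with a known expected answer:

```python
import random

TEMPLATE = ("I want to wash my car. The car wash is {dist} meters away. "
            "Should I walk or drive?")
DISTANCES = [50, 100, 500, 2000]  # a small family of near-identical prompts
EXPECTED = "drive"                 # the car must be at the car wash

def stub_model(prompt: str, rng: random.Random) -> str:
    # Placeholder for a real chat-completion call; here it just returns
    # the final choice directly, sampled at random.
    return rng.choice(["walk", "drive"])

def evaluate(samples: int = 25, seed: int = 0) -> dict:
    """Per-variant accuracy, estimated from many samples rather than one."""
    rng = random.Random(seed)
    results = {}
    for dist in DISTANCES:
        prompt = TEMPLATE.format(dist=dist)
        hits = sum(stub_model(prompt, rng) == EXPECTED
                   for _ in range(samples))
        results[dist] = hits / samples
    return results
```

A one-shot screenshot is a single draw from these distributions; reporting per-variant accuracies over many samples is the minimum needed to say anything about a probabilistic system.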