I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

The Car-Wash Question & Model Behavior

  • The prompt “I want to wash my car. The car wash is 50 meters away. Should I walk or drive?” elicits divergent answers: some models say “drive” (explicitly noting the car must be present), others confidently say “walk” and justify it with health, environment, or convenience arguments.
  • The behavior is noticeably non-deterministic: the same model, even at identical settings, often alternates between “walk” and “drive” across runs, languages, or surrounding context.
  • Several people report that newer or higher‑tier “reasoning” models (Gemini Pro/Thinking, some Claude and Grok variants, some Codex/GPT variants) usually get it right, but not reliably.
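The run-to-run flipping described above is easy to quantify: sample the same prompt many times and tally which option each reply commits to. A minimal sketch, using a hypothetical `stub_model` stand-in (an assumption; swap in a real chat-completion call from your provider) and a crude keyword classifier:

```python
import random
from collections import Counter

PROMPT = ("I want to wash my car. The car wash is 50 meters away. "
          "Should I walk or drive?")

def classify_answer(text: str) -> str:
    """Crude classifier: which option does the reply mention first?"""
    lowered = text.lower()
    walk, drive = lowered.find("walk"), lowered.find("drive")
    if walk == -1 and drive == -1:
        return "unclear"
    if drive == -1:
        return "walk"
    if walk == -1:
        return "drive"
    return "walk" if walk < drive else "drive"

def stub_model(prompt: str, rng: random.Random) -> str:
    # Placeholder for a real API call; at temperature > 0 a sampled
    # model answers non-deterministically, which we mimic here.
    return rng.choice([
        "Walk! It's only 50 meters and good exercise.",
        "Drive, since the car has to be at the car wash.",
    ])

def tally(n: int = 20, seed: int = 0) -> Counter:
    """Distribution of answers over n samples of the same prompt."""
    rng = random.Random(seed)
    return Counter(classify_answer(stub_model(PROMPT, rng)) for _ in range(n))
```

With a real model behind `stub_model`, the resulting distribution (rather than any single reply) is what supports or refutes the “sometimes walk, sometimes drive” claim.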

Is It a Trick Question or a Reasoning Failure?

  • Some see it as a classic riddle / “Cognitive Reflection Test” style trap: the surface pattern (“short trip: walk vs drive?”) misleads you away from the key constraint (the car must move).
  • Others argue it should still be a trivial everyday inference and that failing it exposes a lack of practical, embodied “common sense.”
  • A recurring comparison is to human trick questions (“How many Rs in ‘strawberry’?”, “Where do you bury the survivors?”): humans also get these wrong, but they can typically ask clarifying questions—something LLMs rarely do by default.

What It Suggests About LLMs’ “Understanding”

  • One camp says this shows LLMs don’t really understand the world; they’re powerful text predictors that latch on to high‑frequency patterns (“short distance → walk”) and ignore physical preconditions.
  • Others push back: the same models can handle quite complex code, math, and domain reasoning; a single toy failure doesn’t falsify “reasoning,” just shows brittle generalization under ambiguity.

Training, Alignment, and Bias

  • Several comments link “walk” answers to alignment and RLHF: models are heavily rewarded for sounding eco‑friendly, health‑conscious, and non‑committal, which nudges them toward “walk” over “drive.”
  • There’s suspicion that once such prompts go viral, providers “patch” them via fine‑tuning, routing, or system prompts, creating the illusion of deeper understanding.

Prompting, Reasoning Modes, and Clarification

  • Adding cues like “this is a logic puzzle,” “think carefully,” or “state assumptions first” often flips the answer to “drive,” showing that chain‑of‑thought modes can override shallow heuristics.
  • Many argue the real missing behavior is meta‑cognition: models almost never respond with “this question is underspecified/odd—where is the car?” even though that’s what a careful human would do.

Implications for Use and Evaluation

  • Commenters stress that one‑shot screenshots are a poor evaluation of probabilistic systems; you need multiple samples and families of similar prompts.
  • Still, this kind of failure is used as a warning: LLMs are useful tools (especially with tests, compilers, or external checks) but should not be treated as unsupervised agents with reliable real‑world reasoning.
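The “multiple samples and families of similar prompts” point can be made concrete with a tiny evaluation harness. The sketch below assumes a hypothetical `stub_model` in place of a real API client, and varies only the distance to build a prompt family with a known expected answer:

```python
import random

TEMPLATE = ("I want to wash my car. The car wash is {dist} meters away. "
            "Should I walk or drive?")
DISTANCES = [50, 100, 500, 2000]  # a small family of near-identical prompts
EXPECTED = "drive"                 # the car must be at the car wash

def stub_model(prompt: str, rng: random.Random) -> str:
    # Placeholder for a real chat-completion call; here it just returns
    # the final choice directly, sampled at random.
    return rng.choice(["walk", "drive"])

def evaluate(samples: int = 25, seed: int = 0) -> dict:
    """Per-variant accuracy, estimated from many samples rather than one."""
    rng = random.Random(seed)
    results = {}
    for dist in DISTANCES:
        prompt = TEMPLATE.format(dist=dist)
        hits = sum(stub_model(prompt, rng) == EXPECTED
                   for _ in range(samples))
        results[dist] = hits / samples
    return results
```

A one-shot screenshot is a single draw from these distributions; reporting per-variant accuracies over many samples is the minimum needed to say anything about a probabilistic system.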