“Car Wash” Test with 53 Models
Why Models Fail the “Car Wash” Question
- Many commenters see this as pattern-matching, not reasoning: models strongly associate “short distance + walk vs drive” with “walk for health/environment,” and follow that script.
- Alignment and “sycophancy” are blamed: systems are tuned to give agreeable, socially desirable, eco‑friendly answers rather than challenge premises.
- Some argue the failure is in attention: models overweight the “50 meters” token and underweight the goal “wash my car,” so they never explicitly reason that the car must be present at the wash.
Ambiguity, Pragmatics, and Trick‑Question Nature
- Several people argue the question itself is underspecified: it never states where the car is, or that it will be washed at the car wash.
- Others say a truly intelligent agent should ask clarifying questions like “Where is your car now?” or treat it as a riddle and push back.
- The 71.5% human “drive” rate is seen as evidence the task is partly about pragmatics: humans infer intent from conversational context, not just literal text.
Prompting, Reasoning Modes, and Sensitivity
- Multiple commenters report that “reasoning”/“thinking” modes or high reasoning effort consistently flip some models to the correct “drive” answer.
- Small prompt tweaks matter:
  - Adding hints (“this is a logic test” or “use symbolic reasoning”) markedly improves accuracy.
  - Reordering clauses (“The car wash is 50m away. I want to wash my car…”) also helps.
- Some models overthink under extended reasoning, talking themselves into the wrong answer.
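The prompt-sensitivity findings above can be sketched as a tiny eval harness. This is a minimal illustration, not anyone's actual setup: `query_model` is a hypothetical stand-in for a real LLM API client, stubbed here (to mirror the reported effect of hints and reordering) so the scoring loop runs standalone.

```python
# Base question and the prompt variants commenters reported testing.
BASE = "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"

VARIANTS = {
    "baseline": BASE,
    "hint": "This is a logic test. " + BASE,
    "reordered": "The car wash is 50 meters away. I want to wash my car. Should I walk or drive?",
}

def query_model(prompt: str) -> str:
    """Hypothetical model call; swap in a real API client here.
    The stub mimics the reported pattern: hints/reordering help."""
    return "drive" if prompt != VARIANTS["baseline"] else "walk"

def accuracy(prompt: str, runs: int = 10) -> float:
    """Fraction of runs answering 'drive' (the expected answer)."""
    answers = [query_model(prompt) for _ in range(runs)]
    return sum(a == "drive" for a in answers) / runs

scores = {name: accuracy(p) for name, p in VARIANTS.items()}
```

With a real, stochastic model behind `query_model`, running each variant many times is what surfaces the run-to-run variance discussed later in the thread.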
Human Baseline and Rapidata Concerns
- Commenters question the Rapidata baseline: possible low‑effort clicks, language barriers, trolling, or even bots. Others note that Rapidata does pre‑screen respondents.
- Still, many accept that a sizable minority of humans will miss trick questions when stakes are low or attention is minimal.
Verbosity, “Hot Air,” and Reasoning Tokens
- Long, essay‑style answers are widely criticized; users see them as “high‑school word count padding.”
- Others point out those extra tokens are the computation: chain‑of‑thought or hidden reasoning streams give the model more “passes” to think.
- Active research is mentioned on cutting reasoning tokens while preserving performance.
Evaluation and Reliability Takeaways
- The test is praised as a useful “messy real world” eval that exposes gaps traditional benchmarks miss.
- Key worry: models that answer correctly only ~70–80% of the time are unreliable decision functions; variance across runs is as concerning as outright failure.
- Several suggest that future systems should more often reject the premise or ask clarifying questions rather than confidently choose “walk.”
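The reliability worry can be made concrete with back-of-envelope arithmetic. A rough sketch, assuming a 75% single-call accuracy (the midpoint of the ~70–80% figure above) and independent errors across calls; real LLM errors often correlate across runs, so treat the majority-vote numbers as optimistic.

```python
from math import comb

p = 0.75  # assumed per-question accuracy, from the ~70-80% range cited above

def chained_success(p: float, k: int) -> float:
    """A pipeline needing k independent correct answers succeeds with p**k."""
    return p ** k

def majority_vote_accuracy(p: float, n: int) -> float:
    """P(majority of n i.i.d. calls are correct), n odd: binomial upper tail."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(n // 2 + 1, n + 1))

# A 75%-accurate decision function collapses quickly when chained...
print(chained_success(p, 5))        # ~0.237 after five dependent decisions
# ...while repeated sampling plus majority vote partially recovers reliability.
print(majority_vote_accuracy(p, 5))  # ~0.896 with best-of-5 voting
```

This is why variance matters as much as mean accuracy: a model at 75% is unusable as a building block without repetition, verification, or the premise-rejection behavior suggested above.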