o3 beats a master-level GeoGuessr player, even with fake EXIF data
Reasoning vs memorization
- Some see o3’s step-by-step explanations and accurate geolocation as evidence of “reasoning,” or at least of adaptation to novelty, echoing the idea that “can it handle novelty?” is a more useful question than “can it reason?”
- Others argue GeoGuessr is mostly pattern recognition and memorized world-knowledge (plants, signs, infrastructure), something LLMs plus vision models are naturally good at.
- Several comments stress that human “reasoning” is itself often post‑hoc storytelling; an LLM’s inner monologue is likewise not a reliable window into its internal process.
- On the Turing Test: LLMs show the test is a weak proxy for “thinking”; the Chinese Room and Chomsky are invoked to argue that passing conversation tests doesn’t settle the intelligence question.
GeoGuessr as a benchmark
- The “master-level” label is downplayed: o3 plays solidly but well below top competitive players, and specialized geolocation models as well as Gemini 1.5/2.5 already outperform it on structured benchmarks.
- Many note that CNN/vision-only models have done similar geolocation before; the novelty here is an integrated system that explains its reasoning in natural language.
Web search, cheating, and model boundaries
- GeoGuessr rules ban Google and other external aids; several commenters call o3’s use of web search “cheating” and the headline misleading.
- The author later reran the problematic rounds with search disabled and reports nearly identical guesses, arguing search wasn’t actually decisive in this case.
- Debate centers on what counts as the “system”: is “o3 + web” a fair comparator to a human without web, and does that matter if the goal is measuring raw capability, not game fairness?
- This ties into AI alignment: systems will exploit whatever tools they are allowed unless constraints are specified very precisely.
Real‑world capability and limitations
- Multiple users test o3 (and related models) on personal photos: it’s “scarily good” in well-photographed US/European locations, often nailing cities or specific landmarks, but much weaker in deserts, rural Latin America, and other less-photographed regions.
- It sometimes fabricates confident but wrong specifics (e.g., misidentifying which Mars rover took an image, or misinterpreting recent events without web search).
- The model appears capable of surprisingly rich inferences (e.g., night-sky shots → latitude, light pollution → approximate metro area).
Privacy, OSINT, and societal impact
- Many highlight doxxing/OSINT implications: mass, cheap geolocation of social media photos could make stalking and location profiling easier, especially for women.
- Others argue motivated humans with Street View and OSINT tools already had this power; LLMs mainly lower the skill and time barrier.
- Potential positive applications are also raised (child exploitation investigations, law enforcement, historical/film location matching), with concern about over-trusting fallible AI in legal contexts.
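For the metadata side of the privacy discussion: GPS coordinates ride in a JPEG's APP1 (EXIF) segment, and stripping that segment at the byte level is a common mitigation before sharing photos. A minimal sketch of such a stripper, assuming a well-formed JPEG and simplified marker parsing (the function name and structure are illustrative, not from the original post):

```python
import struct


def strip_exif(jpeg_bytes: bytes) -> bytes:
    """Remove APP1 (EXIF) segments, which carry GPS tags, from a JPEG stream."""
    assert jpeg_bytes[:2] == b"\xff\xd8", "not a JPEG (missing SOI marker)"
    out = bytearray(b"\xff\xd8")
    i = 2
    while i < len(jpeg_bytes):
        if jpeg_bytes[i] != 0xFF:
            # Unexpected byte outside a marker; copy the remainder verbatim.
            out += jpeg_bytes[i:]
            break
        marker = jpeg_bytes[i + 1]
        if marker == 0xD9:  # EOI: end of image.
            out += jpeg_bytes[i:i + 2]
            break
        if marker == 0xDA:  # SOS: entropy-coded data follows; copy the rest.
            out += jpeg_bytes[i:]
            break
        # Segment length covers the two length bytes but not the marker.
        seg_len = struct.unpack(">H", jpeg_bytes[i + 2:i + 4])[0]
        segment = jpeg_bytes[i:i + 2 + seg_len]
        # Drop only APP1 segments whose payload identifies as EXIF.
        if not (marker == 0xE1 and segment[4:10] == b"Exif\x00\x00"):
            out += segment
        i += 2 + seg_len
    return bytes(out)
```

As the headline experiment underscores, though, this only removes metadata: the model geolocates from pixels alone, so stripping (or faking) EXIF does not prevent visual geolocation.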
Reactions to the blog post and “goalpost moving”
- Some think the title oversells the result and blurs the cheating issue; others find the capability striking regardless of strict GeoGuessr rules.
- Broader meta‑theme: every time AI clears a previously touted benchmark (Turing test, high-level GeoGuessr), critics shift standards—seen by some as healthy skepticism, by others as perpetual goalpost moving.