o3 beats a master-level GeoGuessr player, even with fake EXIF data

Reasoning vs memorization

  • Some see o3’s step-by-step explanations and accurate geolocation as evidence of “reasoning” or at least adaptation to novelty, echoing the idea that “can it handle novelty?” is a better question than “can it reason?”
  • Others argue GeoGuessr is mostly pattern recognition and memorized world knowledge (plants, signs, infrastructure), something LLMs plus vision models are naturally good at.
  • Several comments stress that human “reasoning” is itself often post‑hoc storytelling; an LLM’s inner monologue is likewise not a reliable window into its internal process.
  • Turing test discussion: LLMs show the test is a weak proxy for “thinking”; the Chinese Room and Chomsky are invoked to argue that passing conversation tests doesn’t settle the intelligence question.

GeoGuessr as a benchmark

  • The “master-level” label is downplayed: it’s solid but well below top competitive players. Specialized geo models and Gemini 1.5/2.5 already outperform o3 on structured benchmarks.
  • Many note that CNN/vision-only models have done similar geolocation before; the novelty here is an integrated system that explains its reasoning in natural language.

Web search, cheating, and model boundaries

  • GeoGuessr rules ban Google and other external aids; several commenters call o3’s use of web search “cheating” and the headline misleading.
  • The author later reran the problematic rounds with search disabled and reports nearly identical guesses, arguing search wasn’t actually decisive in this case.
  • Debate centers on what counts as the “system”: is “o3 + web” a fair comparator to a human without web, and does that matter if the goal is measuring raw capability, not game fairness?
  • This is linked to AI alignment: systems will exploit whatever tools are allowed unless constraints are specified extremely clearly.

Real‑world capability and limitations

  • Multiple users test o3 (and related models) on personal photos: it’s “scarily good” in well-photographed US/European locations, often nailing cities or specific landmarks; much weaker in deserts, rural Latin America, and less-photographed regions.
  • It sometimes fabricates confident but wrong specifics (e.g., misidentifying which Mars rover took an image, or misinterpreting recent events without web search).
  • The model appears capable of surprisingly rich inferences (e.g., night-sky shots → latitude, light pollution → approximate metro area).
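The night-sky inference in the last bullet has a textbook basis in celestial navigation: for a star crossing the meridian south of the zenith (northern hemisphere), altitude, latitude, and declination satisfy alt = 90 - lat + dec, so a known star's measured altitude pins down latitude. A hypothetical one-liner (not from the post, just the standard identity) making that explicit:

```python
def latitude_from_culmination(altitude_deg, declination_deg):
    """Celestial-navigation identity: for a star crossing the meridian
    south of the zenith, altitude = 90 - latitude + declination,
    so latitude = 90 - altitude + declination (all in degrees)."""
    return 90.0 - altitude_deg + declination_deg
```

For example, a star on the celestial equator (declination 0°) culminating at 50° altitude implies an observer at 40°N; light pollution then narrows longitude-scale guesses to metro areas, as commenters describe.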

Privacy, OSINT, and societal impact

  • Many highlight doxxing/OSINT implications: mass, cheap geolocation of social media photos could make stalking and location profiling easier, especially for women.
  • Others argue motivated humans with Street View and OSINT tools already had this power; LLMs mainly lower the skill and time barrier.
  • Potential positive applications are also raised (child exploitation investigations, law enforcement, historical/film location matching), with concern about over-trusting fallible AI in legal contexts.
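The OSINT concern above turns on how low the skill barrier already is. As a minimal illustration (not code from the post; `exif_gps` and `dms_to_decimal` are hypothetical helper names), here is a sketch using Pillow to read a photo's embedded GPS coordinates, the same EXIF metadata the author deliberately faked in the experiment:

```python
def dms_to_decimal(dms, ref):
    """Convert EXIF (degrees, minutes, seconds) rationals to signed
    decimal degrees; S and W hemispheres are negative."""
    d, m, s = (float(x) for x in dms)
    deg = d + m / 60.0 + s / 3600.0
    return -deg if ref in ("S", "W") else deg

def exif_gps(path):
    """Return (lat, lon) from a photo's EXIF GPS tags, or None if absent."""
    from PIL import Image, ExifTags  # third-party: Pillow
    exif = Image.open(path).getexif()
    gps = exif.get_ifd(ExifTags.IFD.GPSInfo)  # empty dict if no GPS block
    if not gps:
        return None
    lat = dms_to_decimal(gps[ExifTags.GPS.GPSLatitude],
                         gps[ExifTags.GPS.GPSLatitudeRef])
    lon = dms_to_decimal(gps[ExifTags.GPS.GPSLongitude],
                         gps[ExifTags.GPS.GPSLongitudeRef])
    return lat, lon
```

Most social platforms already strip this metadata on upload; the point of the experiment, and of the privacy worry, is that pixel-only geolocation works even when the EXIF is absent or deliberately falsified.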

Reactions to the blog post and “goalpost moving”

  • Some think the title oversells the result and blurs the cheating issue; others find the capability striking regardless of strict GeoGuessr rules.
  • Broader meta‑theme: every time AI clears a previously touted benchmark (Turing test, high-level GeoGuessr), critics shift standards—seen by some as healthy skepticism, by others as perpetual goalpost moving.