Gemini 3 Deep Think

Model performance and positioning

  • Gemini 3 Deep Think lands “healthily ahead” of Claude Opus 4.6 on several reasoning benchmarks, especially ARC‑AGI‑2 and vision/world‑modeling tasks.
  • Many commenters think Google now leads on raw model capability and visual intelligence, but lags OpenAI/Anthropic on agentic behavior, coding assistants, and overall product polish.
  • Others argue this is just the usual leapfrogging: stretch the time window and every frontier model looks roughly comparable.

ARC‑AGI‑2 and benchmarks

  • Deep Think scores 84.6% on the semi‑private ARC‑AGI‑2 set versus ~69% for Opus 4.6, widely seen as a major jump, but at ~$13.62 per task versus ~$3.64 for Opus (see the cost‑per‑solve arithmetic after this list).
  • Debate over significance: some see ARC‑AGI as “toast” and overhyped (narrow visual puzzles), others stress it’s still one of the few fluid‑intelligence‑style tests not obviously saturated.
  • Concerns about “benchmarkmaxxing” and possible leakage from semi‑private sets; counter‑argument is that certified results still indicate real progress, though exact percentages may be inflated.
  • Several note that solving ARC‑AGI does not equal AGI; newer versions (ARC‑AGI‑3/4) will add trial‑and‑error and game‑like exploration.
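
A rough way to read the cost numbers above is price per solved task rather than per attempt. A minimal sketch of that normalization, using only the figures quoted in this list (the comparison itself is illustrative, not from the thread):

    # Cost per *solved* ARC-AGI-2 task, from the scores and per-task
    # prices quoted above; the normalization is illustrative arithmetic.
    results = {
        "Gemini 3 Deep Think": {"score": 0.846, "usd_per_task": 13.62},
        "Claude Opus 4.6":     {"score": 0.69,  "usd_per_task": 3.64},
    }

    for name, r in results.items():
        print(f"{name}: ~${r['usd_per_task'] / r['score']:.2f} per solved task")

    # Gemini 3 Deep Think: ~$16.10 per solved task
    # Claude Opus 4.6: ~$5.28 per solved task

On this view Deep Think’s accuracy lead costs roughly 3x per solve, the same tradeoff the agent‑cost discussion below keeps returning to.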

Real‑world usage: strengths and weaknesses

  • Fans report Gemini 3 Pro/Flash are excellent for science/engineering, biology, math, document understanding, and OCR of historical texts, and even handle tasks they were presumably never trained on, such as playing Balatro from a text description.
  • Deep Think is praised for very strong visual reasoning (e.g., hard Raven’s‑matrices‑style puzzles, CAD/3D demos, high‑quality SVG output).
  • Critics find Gemini “garbage” for day‑to‑day coding, tool calling, legal/regulatory research, and instruction following, with more hallucinations than GPT/Claude; some suspect over‑optimization for benchmarks versus production reliability.
  • Experiences vary wildly; several note that prompting style and “learning” a particular model matter a lot.

Agentic workflows, “thinking” modes, and cost

  • Deep Think and GPT‑5.x Pro are described as high test‑time‑compute “best‑of‑N” / parallel‑trace models: powerful but too expensive for most agents at current prices.
  • Discussion of “non‑thinking” vs “thinking” vs best‑of‑N models, agent swarms, and pass@N metrics (see the estimator sketch after this list); consensus is that these methods help but are computationally heavy.
  • Google is seen as behind on ready‑made coding agents (its VS Code extension, Antigravity) compared with Claude Code and OpenAI’s tooling, despite strong base models.
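
For reference, pass@N is normally computed with the unbiased estimator from the HumanEval/Codex evaluations: draw n samples per task, count the c correct ones, and estimate the chance that at least one of N draws would have succeeded. A minimal sketch:

    # Unbiased pass@k estimator (Chen et al., "Evaluating Large Language
    # Models Trained on Code"): probability that at least one of k
    # samples is correct, given c correct answers out of n drawn.
    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        if n - c < k:
            return 1.0  # every size-k subset contains a correct sample
        return 1.0 - comb(n - c, k) / comb(n, k)

    # e.g. 3 correct answers out of 20 samples:
    print(f"{pass_at_k(20, 3, 1):.2f}")  # 0.15
    print(f"{pass_at_k(20, 3, 5):.2f}")  # 0.60

Best‑of‑N systems chase that pass@N ceiling but only reach it with a reliable verifier to pick the winning trace, and compute grows linearly in N, which is exactly the cost objection raised above.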

Product, UX, access, and trust

  • Many complain about Gemini’s web/app UX, VS Code plugin instability, missing features (projects, stable context), and inconsistent “Deep Research.”
  • Access to Deep Think is limited (Ultra subscription or early‑access API), leading to frustration that top models are locked behind $250/month tiers.
  • Ongoing distrust of Google’s privacy posture and product longevity makes some hesitant to adopt Gemini even if it’s technically strong.

AGI, consciousness, and societal impact

  • Long subthread debates whether high ARC scores imply “smarter than average human,” what would constitute AGI, and whether consciousness is required or even testable.
  • Others focus on economics: rapid capability gains plus agentic workflows may displace many white‑collar jobs; some frame the real problem as capitalism, not AI itself.
  • There’s pushback against “singularity soon” narratives, noting that benchmarks and spectacular demos haven’t yet translated into broadly reliable autonomous systems.

Pelican‑on‑a‑bicycle and visual reasoning

  • The now‑traditional “pelican riding a bicycle” SVG test shows Deep Think producing the best result so far; this is treated as lighthearted but also as a telling indicator of improved spatial and vector‑graphics reasoning.
  • Some worry even this informal benchmark could be gamed, though others argue its combinatorial nature (any animal/vehicle pair) makes systematic overfitting costly; a toy sketch of that argument follows.
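
A toy sketch of that combinatorial point (the subject and vehicle lists are arbitrary examples, not from the discussion):

    # Even short lists of subjects and vehicles yield many distinct
    # prompts, so pre-baking an answer per pair scales badly.
    # All list entries here are made up for illustration.
    import random

    subjects = ["pelican", "otter", "giraffe", "hedgehog", "octopus", "flamingo"]
    vehicles = ["bicycle", "unicycle", "skateboard", "tractor", "canoe", "pogo stick"]

    print(len(subjects) * len(vehicles), "distinct prompts")  # 36

    s, v = random.choice(subjects), random.choice(vehicles)
    print(f"Generate an SVG of a {s} riding a {v}.")

With a few dozen plausible entries per list the prompt space runs into the thousands, so memorizing outputs scales far worse than actually modeling “X riding Y.”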