Gemini 3 Pro Model Card [pdf]
Leak, authenticity, and rollout
- Model card appeared briefly on an official Google storage bucket, then was removed; archived copies confirm it as a genuine, slightly early publication.
- Document title and date suggest a coordinated release on the same day; users later report Gemini 3 Pro is live in AI Studio and in some third‑party tools (e.g., Cursor via a preview model name).
- Some mirrors of the PDF are blocked in certain countries by ISP‑level filtering (CSAM blocklists, sports‑piracy enforcement); this sparks a side discussion about DNS blocking and overbroad content filters rather than about the model itself.
Training data, privacy, and trust
- Model card explicitly lists: web crawl, public datasets, licensed data, Google business data, workforce‑generated data, synthetic data, and user data from Google products “pursuant to user controls.”
- Commenters connect this to Gemini being enabled by default in products like Gmail and note ongoing lawsuits; several express distrust that Google will respect its own privacy policies.
- Some see this as a strong data advantage; others see it as a major reason to avoid Google models.
Architecture, TPUs, and “from scratch”
- The card states Gemini 3 Pro is not a fine‑tune of prior models; commenters read this as a new base architecture, likely built on the Pathways system with MoE‑style scaling.
- Training is reported as running entirely on TPUs; commenters see this as a strategic win (cost, independence from Nvidia) but note that the card's "faster than CPUs" comparison reads oddly and is presumably a typo for "GPUs."
- Long training/post‑training timeline (knowledge cutoff Jan 2025, release Nov 2025) is seen as evidence that compute is still a bottleneck.
Benchmark results and skepticism
- On many reasoning and multimodal benchmarks (ARC‑AGI‑2, MathArena, HLE, GPQA, ScreenSpot, τ²‑bench, Vending‑Bench, various multimodal suites), Gemini 3 Pro significantly outperforms Gemini 2.5 and usually beats GPT‑5.1 and Claude Sonnet 4.5.
- ARC‑AGI‑2 semi‑private scores are viewed as particularly impressive and as evidence of major reasoning gains, possibly via better synthetic data or self‑play (details unclear).
- Coding is notably not a blowout:
- SWE‑Bench Verified: Gemini 3 ≈ GPT‑5.1, slightly behind Sonnet 4.5.
- LiveCodeBench/Terminal‑Bench: Gemini 3 is strong but comparable to GPT‑5.1 Codex; some wins, some losses.
- Several point out benchmarks are saturating and easy to “benchmaxx” by training/tuning on them; others counter that all labs are equally incentivized, so relative rankings still matter.
- Comparisons to strong open models (e.g., Kimi K2) show Gemini 3 is no longer uniformly ahead; for some aggregate views, it’s only clearly best because of a few standout benchmarks.
Real‑world coding and tools
- Multiple users report Claude Code and GPT‑5.1/Codex still feel better for day‑to‑day agentic coding, especially with mature IDE tooling; Gemini CLI is described as rough, buggy, and less polished, though improving quickly.
- Some users nevertheless find Gemini 2.5 already excellent for contextual reasoning on large codebases and SQL, and expect 3.0’s big context window and speed to be a major draw even if raw coding quality is only “on par.”
- SWE‑Bench’s limited domain (old Python/Django tasks) and near‑saturation are cited as reasons it may no longer distinguish real coding ability well.
Google Antigravity and agentic workflows
- The model card, together with DNS hints, leaks "Google Antigravity," later described on its landing page as an "agent‑first" development platform:
- An AI‑centric IDE/workbench where agents operate across editor, terminal, and browser to autonomously plan and execute software tasks.
- Widely interpreted as a Cursor/Windsurf‑style environment tightly integrated with Gemini 3.
- Some see this as Google betting heavily on agentic coding as the main high‑value LLM use case.
Pricing, business impact, and competition
- Gemini 3 Pro API pricing is higher than 2.5 Pro and GPT‑5.1 ($2/M input and $12/M output up to 200k tokens, with rates doubling in the long‑context tier); see the cost sketch after this list.
- Opinions diverge:
- Optimists: if benchmarks translate to practice and Google stays cheaper than Anthropic/OpenAI at similar capability, enterprises and cost‑sensitive users will migrate.
- Skeptics: labs leapfrog each other frequently; no one is “done,” and differences often feel marginal in real use.
- Many argue Google has the most sustainable position (massive existing cash flow, TPUs, vast data, Cloud distribution), while pure‑play labs are still dependent on external funding and haven’t proven durable business models.
- Others emphasize moats in brand and integration: OpenAI via habit and Microsoft bundling, Anthropic via enterprise relationships and coding focus.
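A back‑of‑envelope cost sketch, assuming the per‑token rates and the 200k‑token tier boundary quoted in the thread; the long‑context doubling follows the thread's description, not official pricing pages:

```python
# Rough per-request cost estimate for Gemini 3 Pro, using the rates quoted
# in the thread: $2/M input, $12/M output below 200k tokens, doubled above.
# These figures come from the discussion, not from official documentation.

def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for a single API call."""
    long_context = input_tokens > 200_000        # long-context tier boundary
    multiplier = 2 if long_context else 1        # thread says rates double here
    input_cost = input_tokens / 1e6 * 2.0 * multiplier
    output_cost = output_tokens / 1e6 * 12.0 * multiplier
    return input_cost + output_cost

# A 50k-token codebase prompt with a 4k-token answer stays in the cheap tier:
print(f"${estimate_cost_usd(50_000, 4_000):.3f}")    # $0.148
# Padding the same prompt to 300k tokens crosses into the doubled tier:
print(f"${estimate_cost_usd(300_000, 4_000):.3f}")   # $1.296
```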
Model behavior, UX, and benchmarks people want
- Gemini is widely criticized for sycophancy/over‑agreeableness and low "self‑esteem"; some users explicitly tune system prompts to make it more direct and less flattering (a minimal sketch follows this list).
- A few propose explicit “sycophancy” and even safety‑harm benchmarks (e.g., induced suicides per user count).
- Some users ask for instruction‑adherence benchmarks (how many detailed instructions can be followed reliably) and argue that improving this may be more valuable than further IQ‑style gains.
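A minimal sketch of the kind of system‑prompt tuning commenters describe, using the google‑genai Python SDK; the model id is a placeholder and the instruction text is illustrative, not a prompt quoted from the thread:

```python
# Illustrative anti-sycophancy system prompt via the google-genai SDK.
# The model id below is a placeholder; substitute whatever Gemini 3 Pro
# identifier your account exposes (e.g., a preview name in AI Studio).
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from GOOGLE_API_KEY / GEMINI_API_KEY

BLUNT_REVIEWER = (
    "Be direct and concise. Do not compliment the question or the user. "
    "If the premise is wrong, say so and explain why. "
    "State uncertainty plainly instead of agreeing to be agreeable."
)

response = client.models.generate_content(
    model="gemini-3-pro-preview",  # placeholder model id, not confirmed by the card
    contents="Review this schema migration plan for risks: ...",
    config=types.GenerateContentConfig(system_instruction=BLUNT_REVIEWER),
)
print(response.text)
```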
Economic and societal threads
- Some argue AI will only justify its cost if it can do serious engineering work (the SWE‑Bench plateau worries them); others counter that even 1.5× productivity for highly paid engineers, or broad consumer subscriptions, could be enough; see the back‑of‑envelope sketch after this list.
- There is disagreement over whether we're in an AI bubble: coding is a small share of tokens, most tokens are consumer "chat," and commenters expect long‑term consumer monetization and eventual "enshittification."
- A subset of commenters express fatigue and indifference to yet another “frontier” release and benchmark table, despite the technical progress.
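A toy version of the 1.5× productivity argument; every figure here is an illustrative assumption, not a number from the thread:

```python
# Back-of-envelope for the "1.5x productivity could be enough" claim.
# All inputs are assumed for illustration (salary, multiplier, spend).
fully_loaded_cost = 300_000      # assumed annual cost of a senior engineer, USD
productivity_multiplier = 1.5    # the thread's hypothetical uplift
annual_ai_spend = 6_000          # assumed per-engineer model spend, USD

extra_output_value = fully_loaded_cost * (productivity_multiplier - 1)
print(extra_output_value)                    # 150000.0 of extra output per year
print(extra_output_value / annual_ai_spend)  # ~25x return on the AI spend
```

The arithmetic only holds if the multiplier is real, which is exactly what the SWE‑Bench‑plateau skeptics dispute.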