Gemini 3.0 spotted in the wild through A/B testing

Caution about A/B tests and hype

  • Several commenters stress that current “Gemini 3.0” sightings are just A/B tests on single prompts (often SVG/controller examples), which are a poor proxy for real-world performance.
  • Single-prompt comparisons can show speed/latency and rough instruction adherence, but say nothing about tool use, multi-file workflows, or robustness.
  • Many are irritated by Twitter/X-style “game changer!!!” hype built on unprofessional evaluations and urge waiting for an official release.

Where Gemini 2.5 shines (for some)

  • Many report Gemini 2.5 Pro as their best general model, especially for:
    • UI/UX and web work (notably Angular/HTML/CSS), and large-context codebase reading.
    • Creative writing, critique of fiction/poetry, and generating/structuring essays.
    • Factual Q&A, explanations (including medical/lab results), and summarizing papers.
    • Complex math and theoretical physics for some users (others disagree).
    • OCR-like tasks (e.g., receipts) and structured extraction (e.g., questions -> CSV; a minimal sketch follows this list).
  • Deep Think / Deep Research modes are praised for long, detailed, and well-grounded analyses.
  • Some clinical/workflow users prefer Gemini for quality, price, and speed.
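
For readers who want to try the extraction pattern commenters describe, here is a minimal sketch. The call_model() wrapper is hypothetical, standing in for whichever Gemini SDK or endpoint you use; only the CSV parsing via the standard library is concrete.

```python
import csv
import io

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for a Gemini API call (SDK/endpoint intentionally unspecified)."""
    raise NotImplementedError

def extract_questions_as_csv(document_text: str) -> list[dict]:
    # Ask the model to answer with CSV only, so the output stays machine-parseable.
    prompt = (
        "Extract every question in the text below.\n"
        "Respond with CSV only, header: question,topic\n\n"
        f"{document_text}"
    )
    raw = call_model(prompt)
    # Parse with the stdlib csv module instead of trusting the model's formatting blindly.
    return list(csv.DictReader(io.StringIO(raw.strip())))
```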

Where it falls short (for others)

  • A large contingent finds Gemini markedly worse than GPT‑5 Thinking and Claude Sonnet/Code for:
    • Coding (especially agentic tasks, CLI and CLI-like tools, MCP calls, and multi-file refactors).
    • Iterative work: it loops, repeats itself verbatim, or “multishots itself to death.”
    • Web-grounded questions: it runs few searches, grounds answers shallowly, or hallucinates; Google’s “AI Mode” and AI Overviews draw specific criticism, with concrete examples of false answers.
  • Reports of context collapse: quality degrades quickly in long chats despite big advertised windows; some suspect aggressive context truncation.
  • Several users feel Gemini 2.5 has regressed over time (faster but “dumber” or more hallucinatory).

Style, alignment, and steerability

  • Many dislike Gemini’s verbosity, “glazing”/sycophantic praise, and blog-post tone; some mitigate this with system prompts or personal context.
  • It’s seen as more censored than ChatGPT on medical topics.
  • Others appreciate the verbosity and narrative style for “high-stakes” reasoning and writing.
  • Several say Gemini is “theoretically smarter” but harder to steer; Claude and GPT feel more forgiving of vague prompts.

Coding workflows and model mixing

  • Split experiences: some call Gemini 2.5 Pro their primary coding model and the “uncontested king,” while others say they’ve “never gotten a single useful result” from it compared with Sonnet 4.5 or GPT‑5 Codex.
  • Common pattern:
    • Gemini for big-picture design, understanding large codebases, or one-shot analyses.
    • Claude Code / Codex CLI / Cursor for agentic editing, CLI use, and multi-file work.
  • Tools like repomix + Gemini (via AI Studio or the CLI) are popular for loading entire repos into context, though commenters report effective limits around ~50k–256k tokens; a minimal stand-in for the workflow is sketched below.
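
repomix itself handles the packaging in the workflow above; as a rough illustration of the same idea (not repomix’s actual behavior or CLI), the sketch below concatenates source files under a token budget approximated at ~4 characters per token. The extension list and budget are arbitrary placeholders.

```python
from pathlib import Path

TOKEN_BUDGET = 200_000     # illustrative; commenters report effective limits around ~50k–256k tokens
CHARS_PER_TOKEN = 4        # crude heuristic; real tokenizers vary
EXTENSIONS = {".py", ".ts", ".md", ".html", ".css"}  # placeholder file types

def pack_repo(root: str) -> str:
    """Concatenate source files into one prompt-ready blob, capped at a rough token budget."""
    chunks, used = [], 0
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in EXTENSIONS:
            continue
        text = path.read_text(errors="ignore")
        cost = len(text) // CHARS_PER_TOKEN
        if used + cost > TOKEN_BUDGET:
            break  # stop rather than overrun the effective context window
        chunks.append(f"### {path}\n{text}")
        used += cost
    return "\n\n".join(chunks)
```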

Creative writing and authenticity debate

  • Some argue Gemini (or DeepSeek at extreme temperatures) is uniquely good for generating surprising, high-quality raw text and for critiquing human writing.
  • Others see LLM-assisted “creative writing” as inauthentic, especially in collaborative storytelling (e.g., D&D); the counter-argument is that outcomes and reader experience matter more than process.
  • High-temperature sampling, SVG/generative art, and “pelican riding a bike” benchmarks spark debate: fun, visual proxies vs. shallow, overfitted party tricks.

Product confusion and internal status

  • Confusion over Google’s many fronts: Gemini app/site, AI Studio, AI Mode, AI Overviews, and fine‑tuned “Gemini for Google” variants; users want clearer guidance on when to use what.
  • Googlers in the thread say Gemini 3.0 is not broadly available internally yet; most internal coding tools still run 2.5-based models.

Divergent experiences and benchmarking difficulty

  • Commenters note that wildly different tasks, prompting skill, expectations, and tolerance for fixing AI output lead to seemingly contradictory opinions.
  • Many now routinely run the same prompt across multiple models and pick the best result, instead of betting on a single “winner” (a bare-bones version of this habit is sketched below).
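
A bare-bones version of that multi-model habit might look like the sketch below. It assumes each provider is reachable through an OpenAI-compatible chat-completions endpoint via the openai Python SDK; the base URLs, API keys, and model names are placeholders, not real endpoints.

```python
from openai import OpenAI  # each provider is assumed to expose an OpenAI-compatible endpoint

# Placeholder clients: substitute the providers, base URLs, and keys you actually use.
PROVIDERS = {
    "gemini-2.5-pro": OpenAI(base_url="https://example-gemini-endpoint/v1", api_key="..."),
    "claude-sonnet":  OpenAI(base_url="https://example-claude-endpoint/v1", api_key="..."),
    "gpt-5":          OpenAI(base_url="https://example-openai-endpoint/v1", api_key="..."),
}

def fan_out(prompt: str) -> dict[str, str]:
    """Send the same prompt to every configured model and return the answers side by side."""
    answers = {}
    for model, client in PROVIDERS.items():
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        answers[model] = resp.choices[0].message.content
    return answers  # pick the best answer by hand, as commenters describe
```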