Gemini 3.0 spotted in the wild through A/B testing
Caution about A/B tests and hype
- Several commenters stress that current “Gemini 3.0” sightings are just A/B tests on single prompts (often SVG/controller examples), which are a poor proxy for real-world performance.
- Single-prompt comparisons can show speed/latency and rough instruction adherence, but they say nothing about tool use, multi-file workflows, or robustness.
- Many are irritated by Twitter/X-style “game changer!!!” hype built on unprofessional evaluations and urge waiting for an official release.
Where Gemini 2.5 shines (for some)
- Many report Gemini 2.5 Pro as their best general model, especially for:
  - UI/UX and web work (notably Angular/HTML/CSS), and large-context codebase reading.
  - Creative writing, critique of fiction/poetry, and generating/structuring essays.
  - Factual Q&A, explanations (including medical/lab results), and summarizing papers.
  - Complex math and theoretical physics for some users (others disagree).
  - OCR-like tasks (e.g., receipts) and structured extraction (e.g., questions -> CSV).
- Deep Think / Deep Research modes are praised for long, detailed, and well-grounded analyses.
- Some clinical/workflow users prefer Gemini for quality, price, and speed.
Where it falls short (for others)
- A large contingent finds Gemini markedly worse than GPT‑5 Thinking and Claude Sonnet/Code for:
  - Coding (especially agentic tasks, CLI and CLI-like tools, MCP calls, and multi-file refactors).
  - Iterative work: it loops, repeats itself verbatim, or “multishots itself to death.”
  - Web-grounded questions: it runs too few searches, grounds answers shallowly, or hallucinates; Google “AI Mode” and AI Overviews draw specific criticism, with concrete false examples cited.
- Reports of context collapse: quality degrades quickly in long chats despite big advertised windows; some suspect aggressive context truncation.
- Several users feel Gemini 2.5 has regressed over time (faster, but “dumber” and more prone to hallucination).
Style, alignment, and steerability
- Many dislike Gemini’s verbosity, “glazing”/sycophantic praise, and blog-post tone; some mitigate with system prompts or personal context.
- It’s seen as more censored than ChatGPT on medical topics.
- Others appreciate the verbosity and narrative style for “high-stakes” reasoning and writing.
- Several say Gemini is “theoretically smarter” but harder to steer; Claude and GPT feel more forgiving of vague prompts.
Coding workflows and model mixing
- Split experiences: some say Gemini 2.5 Pro is their primary coder and “uncontested king,” others say they’ve “never gotten a single useful result” compared to Sonnet 4.5 or GPT‑5 Codex.
- Common pattern:
  - Gemini for big-picture design, understanding large codebases, or one-shot analyses.
  - Claude Code / Codex CLI / Cursor for agentic editing, CLI use, and multi-file work.
- Tools like repomix + Gemini (via AI Studio or CLI) are popular for loading entire repos, but people see effective limits around ~50k–256k tokens.
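
For readers unfamiliar with that workflow, here is a minimal Python sketch of the repo-packing step. It is an illustration of the pattern, not repomix's actual implementation; the file filters, the 4-characters-per-token heuristic, and the 256k ceiling are assumptions drawn from the figures commenters mention.

```python
import os

# Illustrative stand-in for a repomix-style "pack the repo into one prompt" step:
# walk a repo, concatenate source files with path headers, and estimate tokens so
# you can tell whether the dump fits the ~50k-256k effective window commenters
# report. The extension list and 4-chars-per-token heuristic are assumptions.

SOURCE_EXTS = {".py", ".ts", ".tsx", ".html", ".css", ".md"}  # adjust per project
EFFECTIVE_TOKEN_LIMIT = 256_000  # upper end of the range reported in the thread


def pack_repo(root: str) -> str:
    """Concatenate source files under `root` into one prompt-sized blob."""
    chunks = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip directories that only add noise to the context window.
        dirnames[:] = [d for d in dirnames if d not in {".git", "node_modules"}]
        for name in sorted(filenames):
            if os.path.splitext(name)[1] not in SOURCE_EXTS:
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                chunks.append(f"===== {os.path.relpath(path, root)} =====\n{f.read()}")
    return "\n\n".join(chunks)


if __name__ == "__main__":
    blob = pack_repo(".")
    approx_tokens = len(blob) // 4  # crude heuristic, not a real tokenizer
    verdict = "over budget" if approx_tokens > EFFECTIVE_TOKEN_LIMIT else "fits"
    print(f"~{approx_tokens:,} tokens ({verdict})")
```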
Creative writing and authenticity debate
- Some argue Gemini (or DeepSeek at extreme temperatures) is uniquely good for generating surprising, high-quality raw text and for critiquing human writing.
- Others see LLM-assisted “creative writing” as inauthentic, especially in collaborative storytelling (e.g., D&D); counter-argument is that outcomes and reader experience matter more than process.
- High-temperature sampling, SVG/generative art, and “pelican riding a bike” benchmarks spark debate: fun, visual proxies vs. shallow, overfitted party tricks.
Product confusion and internal status
- Confusion over Google’s many fronts: Gemini app/site, AI Studio, AI Mode, AI Overviews, and fine‑tuned “Gemini for Google” variants; users want clearer guidance on when to use what.
- Googlers in the thread say Gemini 3.0 is not broadly available internally yet; most internal coding tools still run 2.5-based models.
Divergent experiences and benchmarking difficulty
- Commenters note that wildly different tasks, prompting skill, expectations, and tolerance for fixing AI output lead to seemingly contradictory opinions.
- Many now routinely run the same prompt across multiple models and pick the best, instead of betting on a single “winner.”
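
A minimal, hypothetical Python sketch of that fan-out habit is below; `call_model` is a stub to be wired to whatever SDKs or HTTP APIs you actually use, and the model names are placeholders rather than exact product identifiers.

```python
from concurrent.futures import ThreadPoolExecutor

# Run the same prompt against several models in parallel and compare the answers,
# instead of betting on a single "winner". Everything provider-specific is stubbed
# out: replace call_model() with real API calls for your providers.

MODELS = ["gemini-2.5-pro", "gpt-5-thinking", "claude-sonnet"]  # placeholder names


def call_model(model: str, prompt: str) -> str:
    """Stub: swap in a real API call per provider; returns a canned string here."""
    return f"[{model}] (stub) answer to: {prompt[:60]}"


def fan_out(prompt: str) -> dict[str, str]:
    """Send one prompt to every model concurrently and collect all the answers."""
    with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
        futures = {m: pool.submit(call_model, m, prompt) for m in MODELS}
        return {m: fut.result() for m, fut in futures.items()}


if __name__ == "__main__":
    answers = fan_out("Explain this stack trace and suggest a fix: ...")
    for model, answer in answers.items():
        print(f"--- {model} ---\n{answer}\n")
    # Pick the best answer by eye, or add a scoring pass, rather than
    # committing to one model up front.
```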