Gemini 3.0 spotted in the wild through A/B testing

Caution about A/B tests and hype

  • Several commenters stress that current “Gemini 3.0” sightings are just A/B tests on single prompts (often SVG/controller examples), which are a poor proxy for real-world performance.
  • Single-prompt comparisons can show speed/latency and rough instruction adherence, but say nothing about tool use, multi-file workflows, or robustness.
  • Many are irritated by Twitter/X-style “game changer!!!” hype built on unprofessional evaluations and urge waiting for an official release.

Where Gemini 2.5 shines (for some)

  • Many report Gemini 2.5 Pro as their best general model, especially for:
    • UI/UX and web work (notably Angular/HTML/CSS), and large-context codebase reading.
    • Creative writing, critique of fiction/poetry, and generating/structuring essays.
    • Factual Q&A, explanations (including medical/lab results), and summarizing papers.
    • Complex math and theoretical physics for some users (others disagree).
    • OCR-like tasks (e.g., receipts) and structured extraction (e.g., questions -> CSV; a minimal sketch follows this list).
  • Deep Think / Deep Research modes are praised for long, detailed, and well-grounded analyses.
  • Some clinical/workflow users prefer Gemini for quality, price, and speed.
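
For readers who want to try the extraction pattern commenters describe, here is a minimal sketch. The call_model() wrapper is hypothetical, standing in for whichever Gemini SDK or endpoint you use; only the CSV parsing via the standard library is concrete.

```python
import csv
import io

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for a Gemini API call (SDK/endpoint intentionally unspecified)."""
    raise NotImplementedError

def extract_questions_as_csv(document_text: str) -> list[dict]:
    # Ask the model to answer with CSV only, so the output stays machine-parseable.
    prompt = (
        "Extract every question in the text below.\n"
        "Respond with CSV only, header: question,topic\n\n"
        f"{document_text}"
    )
    raw = call_model(prompt)
    # Parse with the stdlib csv module instead of trusting the model's formatting blindly.
    return list(csv.DictReader(io.StringIO(raw.strip())))
```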

Where it falls short (for others)

  • A large contingent finds Gemini markedly worse than GPT‑5 Thinking and Claude Sonnet/Code for:
    • Coding (especially agentic tasks, CLI and CLI-like tools, MCP calls, and multi-file refactors).
    • Iterative work: it loops, repeats itself verbatim, or “multishots itself to death.”
    • Web-grounded questions: it runs few searches, grounds answers shallowly, or hallucinates; Google’s “AI Mode” and AI Overviews draw specific criticism, with concrete examples of false answers.
  • Reports of context collapse: quality degrades quickly in long chats despite big advertised windows; some suspect aggressive context truncation.
  • Several users feel Gemini 2.5 has regressed over time (faster but “dumber” or more hallucinatory).

Style, alignment, and steerability

  • Many dislike Gemini’s verbosity, “glazing”/sycophantic praise, and blog-post tone; some mitigate this with system prompts or personal context.
  • It’s seen as more censored than ChatGPT on medical topics.
  • Others appreciate the verbosity and narrative style for “high-stakes” reasoning and writing.
  • Several say Gemini is “theoretically smarter” but harder to steer; Claude and GPT feel more forgiving of vague prompts.

Coding workflows and model mixing

  • Split experiences: some call Gemini 2.5 Pro their primary coding model and the “uncontested king,” while others say they’ve “never gotten a single useful result” from it compared with Sonnet 4.5 or GPT‑5 Codex.
  • Common pattern:
    • Gemini for big-picture design, understanding large codebases, or one-shot analyses.
    • Claude Code / Codex CLI / Cursor for agentic editing, CLI use, and multi-file work.
  • Tools like repomix + Gemini (via AI Studio or the CLI) are popular for loading entire repos into context, though commenters report effective limits around ~50k–256k tokens; a minimal stand-in for the workflow is sketched below.
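
repomix itself handles the packaging in the workflow above; as a rough illustration of the same idea (not repomix’s actual behavior or CLI), the sketch below concatenates source files under a token budget approximated at ~4 characters per token. The extension list and budget are arbitrary placeholders.

```python
from pathlib import Path

TOKEN_BUDGET = 200_000     # illustrative; commenters report effective limits around ~50k–256k tokens
CHARS_PER_TOKEN = 4        # crude heuristic; real tokenizers vary
EXTENSIONS = {".py", ".ts", ".md", ".html", ".css"}  # placeholder file types

def pack_repo(root: str) -> str:
    """Concatenate source files into one prompt-ready blob, capped at a rough token budget."""
    chunks, used = [], 0
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in EXTENSIONS:
            continue
        text = path.read_text(errors="ignore")
        cost = len(text) // CHARS_PER_TOKEN
        if used + cost > TOKEN_BUDGET:
            break  # stop rather than overrun the effective context window
        chunks.append(f"### {path}\n{text}")
        used += cost
    return "\n\n".join(chunks)
```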

Creative writing and authenticity debate

  • Some argue Gemini (or DeepSeek at extreme temperatures) is uniquely good for generating surprising, high-quality raw text and for critiquing human writing.
  • Others see LLM-assisted “creative writing” as inauthentic, especially in collaborative storytelling (e.g., D&D); the counter-argument is that outcomes and reader experience matter more than process.
  • High-temperature sampling, SVG/generative art, and “pelican riding a bike” benchmarks spark debate: fun, visual proxies vs. shallow, overfitted party tricks.

Product confusion and internal status

  • Confusion over Google’s many fronts: Gemini app/site, AI Studio, AI Mode, AI Overviews, and fine‑tuned “Gemini for Google” variants; users want clearer guidance on when to use what.
  • Googlers in the thread say Gemini 3.0 is not broadly available internally yet; most internal coding tools still run 2.5-based models.

Divergent experiences and benchmarking difficulty

  • Commenters note that wildly different tasks, prompting skill, expectations, and tolerance for fixing AI output lead to seemingly contradictory opinions.
  • Many now routinely run the same prompt across multiple models and pick the best result, instead of betting on a single “winner” (a bare-bones version of this habit is sketched below).
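
A bare-bones version of that multi-model habit might look like the sketch below. It assumes each provider is reachable through an OpenAI-compatible chat-completions endpoint via the openai Python SDK; the base URLs, API keys, and model names are placeholders, not real endpoints.

```python
from openai import OpenAI  # each provider is assumed to expose an OpenAI-compatible endpoint

# Placeholder clients: substitute the providers, base URLs, and keys you actually use.
PROVIDERS = {
    "gemini-2.5-pro": OpenAI(base_url="https://example-gemini-endpoint/v1", api_key="..."),
    "claude-sonnet":  OpenAI(base_url="https://example-claude-endpoint/v1", api_key="..."),
    "gpt-5":          OpenAI(base_url="https://example-openai-endpoint/v1", api_key="..."),
}

def fan_out(prompt: str) -> dict[str, str]:
    """Send the same prompt to every configured model and return the answers side by side."""
    answers = {}
    for model, client in PROVIDERS.items():
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        answers[model] = resp.choices[0].message.content
    return answers  # pick the best answer by hand, as commenters describe
```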