Gemini 3.0 Pro – early tests

Unclear nature of “Gemini 3.0 Pro” tests

  • Many assume the flashy Twitter demos come from an A/B test in Google AI Studio, but it’s unclear whether the model behind them is actually Gemini 3.0.
  • Some find the showcased HTML/CSS/JS outputs unimpressive or pedestrian when inspected closely.

Benchmarks, SVG “pelican” test, and training data leakage

  • Several comments center on the “SVG of X riding Y” benchmark (e.g., pelican on a bicycle) as a private way to test models beyond public benchmarks.
  • Concern: once a benchmark becomes popular, it seeps into training sets (directly or via discussion), weakening its value.
  • Others argue that “being in the training data” is overrated; models still fail on many memorized problems, so overfitting to small, quirky tests is unlikely at scale.
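The “SVG of X riding Y” idea can be run as a tiny private eval: send the prompt to a model and at least check that the reply is well-formed SVG before eyeballing it. A minimal sketch, where `generate()` is a hypothetical stand-in for whatever model API you use (here it returns a canned reply so the snippet is self-contained):

```python
import xml.etree.ElementTree as ET

def generate(prompt: str) -> str:
    # Hypothetical stand-in for a real model API call.
    # Returns a canned SVG so the sketch runs without credentials.
    return ('<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">'
            '<circle cx="50" cy="50" r="20"/></svg>')

def looks_like_svg(text: str) -> bool:
    """Weak sanity check: the reply parses as XML and has an <svg> root."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False
    # With a namespace the tag is "{http://www.w3.org/2000/svg}svg".
    return root.tag.endswith("svg")

reply = generate("Generate an SVG of a pelican riding a bicycle")
print(looks_like_svg(reply))  # True for the canned reply above
```

This only catches malformed output, not a bad drawing; the visual judgment that makes the test useful stays manual, which is also what keeps it private.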

Skepticism about “vibe” demos

  • Many dismiss influencer demos (bouncing balls, fake Apple pages) as shallow and easy to one-shot with existing models.
  • Some are tired of visually impressive but practically irrelevant tests that don’t reflect hard, real-world software problems.

Comparisons across frontier models

  • No consensus “best” model: different people report Claude, Gemini, GPT‑5, or others as superior, often based on narrow coding workflows.
  • One synthesis:
    • Gemini: highest “ceiling” and best long-context/multimodal, but weak on token-level accuracy, tool-calling, and steering.
    • Claude: most consistent and steerable, strong on detail, but can lose track in very complex contexts.
    • GPT‑5: for some, best at long instruction-following and large feature builds; for others, erratic and inconsistent.

Gemini-specific pain points and strengths

  • Multi-turn instruction following and conversation “loops” (the model repeating itself or ignoring feedback) are a major complaint.
  • Tool-calling and structured JSON output are described as “terrible” or broken, limiting agentic coding.
  • On the plus side, Gemini’s long context and PDF handling are praised for tasks like reading huge spec documents or logs.
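When structured output is unreliable, the usual workaround is caller-side defensiveness: treat the model’s reply as untrusted text and validate it before acting on it. A minimal standard-library sketch, assuming a hypothetical reply string (models often wrap JSON in markdown fences):

```python
import json
from typing import Optional

# Hypothetical model reply wrapped in a markdown fence.
reply = '```json\n{"file": "app.py", "action": "edit", "line": 42}\n```'

def parse_tool_call(text: str, required: set) -> Optional[dict]:
    """Strip an optional markdown fence, parse JSON, and check required keys."""
    stripped = text.strip()
    if stripped.startswith("```"):
        # Drop the opening fence line and the trailing fence.
        stripped = stripped.split("\n", 1)[1].rsplit("```", 1)[0]
    try:
        data = json.loads(stripped)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not required <= data.keys():
        return None
    return data

call = parse_tool_call(reply, {"file", "action"})
print(call)  # {'file': 'app.py', 'action': 'edit', 'line': 42}
```

Returning `None` on any failure gives the agent loop a clean place to retry or re-prompt instead of crashing on malformed tool calls.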

Google’s product culture and packaging issues

  • Recurrent theme: Google has strong research and engineering but weak product vision and integration.
  • People find Gemini and other Google AI offerings hard to discover, configure, and pay for; APIs, billing, and docs are called confusing and fragmented.
  • Some believe Google had the tech for ChatGPT‑like systems early but lacked the product culture to ship; OpenAI forced their hand.

Hype fatigue, AGI chatter, and eval difficulty

  • Commenters recall past GPT‑5/AGI hype and see similar cycles around each new Google announcement.
  • There’s broad agreement that reliable evaluations are hard: public benchmarks get gamed, private ones risk being ingested, and subjective reports conflict.

Privacy and policy concerns

  • One criticism: on consumer plans, Gemini reportedly trains on user data unless chat history is disabled, which some see as weaker privacy than other major providers offer.