Gemini 3 Pro: the frontier of vision AI
Launch, Links & Product Positioning
- Several commenters note broken or internal-only links in the blog post and confusing “see in AI Studio” prompts.
- Some confusion over branding: Gemini 3 Pro (reasoning + vision) vs Nano Banana (image generation) vs other variants; users find the alphabet soup of names erodes trust and muddles expectations.
- A few point out that, functionally, this is more a showcase of Gemini 3’s vision abilities than a truly new model.
Benchmarks & Vision Capabilities
- ScreenSpot-Pro benchmark scores impress many: Gemini 3 Pro ~73% vs Claude Opus 4.5 ~50%, Gemini 2.5 ~11%, GPT‑5.1 ~3.5%, suggesting a large leap in GUI grounding and screen understanding (a grounding-query sketch follows after this list).
- GPT‑5.x is widely reported as weak at OCR and high-res UI tasks, likely due to aggressive downscaling and token limits; earlier GPT‑4 was seen as better.
- Commenters see a “data flywheel”: better OCR → more usable scanned books/documents → better models.
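As an illustration of what a ScreenSpot-style grounding query looks like in practice, the sketch below sends a screenshot to Gemini and asks for the pixel position of a named UI element. It is a minimal sketch only: it assumes the google-genai Python SDK, a GEMINI_API_KEY in the environment, and a placeholder model string ("gemini-3-pro-preview") that may not match the actual identifier.

```python
# Minimal sketch of a ScreenSpot-style GUI-grounding query: give the model a
# screenshot and ask for the pixel location of a named UI element. Uses the
# google-genai Python SDK; the model string is a placeholder, not confirmed
# in the thread.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

with open("screenshot.png", "rb") as f:
    screenshot = f.read()

prompt = (
    "Locate the 'Export as PDF' button in this screenshot. "
    'Reply with JSON: {"x": <center x in px>, "y": <center y in px>}.'
)

response = client.models.generate_content(
    model="gemini-3-pro-preview",  # assumed identifier
    contents=[
        types.Part.from_bytes(data=screenshot, mime_type="image/png"),
        prompt,
    ],
)
print(response.text)  # e.g. {"x": 1482, "y": 96} -- still needs validation
```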
Real-World Experiments & Use Cases
- Users report strong performance on:
  - Complex OCR (including puzzles and timestamp-based letter extraction) where other models failed.
  - Electrical drafting workflows (reading PDFs, mapping outlets into Revit, using code tools).
  - Plant health assessment via live camera.
  - Detailed video descriptions (e.g., Zelda and Witcher gameplay) and potential for audio-described YouTube.
- Others compare against Amazon Textract: Textract still wins on handwritten character accuracy, while Gemini wins on context and flexible reasoning (see the comparison sketch below).
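The Textract comparison above can be reproduced with a small harness like the one below, which runs both services on the same scanned page and prints the two transcripts for side-by-side inspection. It is a sketch under assumptions: boto3 with AWS credentials already configured, the google-genai SDK, an illustrative file name, and the same placeholder Gemini model string as above.

```python
# Rough harness for the Textract-vs-Gemini comparison described above:
# run both services on the same scanned page and compare the transcripts.
# File name, prompt, and the Gemini model string are illustrative only.
import boto3
from google import genai
from google.genai import types

with open("scanned_page.jpg", "rb") as f:
    page = f.read()

# Amazon Textract: reportedly stronger on raw handwritten-character accuracy.
textract = boto3.client("textract")
tex_resp = textract.detect_document_text(Document={"Bytes": page})
textract_lines = [
    block["Text"] for block in tex_resp["Blocks"] if block["BlockType"] == "LINE"
]

# Gemini: can lean on surrounding context and flexible reasoning.
client = genai.Client()
gem_resp = client.models.generate_content(
    model="gemini-3-pro-preview",  # assumed identifier
    contents=[
        types.Part.from_bytes(data=page, mime_type="image/jpeg"),
        "Transcribe this page exactly, preserving line breaks.",
    ],
)

print("--- Textract ---")
print("\n".join(textract_lines))
print("--- Gemini ---")
print(gem_resp.text)
```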
Image Generation vs Understanding
- Multiple tests show a gap between “understanding” and “generation”:
  - Prompts like "wine glass full to the brim" often yield ~2/3-full glasses.
  - Nano Banana can sometimes draw 5‑legged dogs or odd objects but fails to recognize them as such later.
- Word-search highlighting and maze-solving remain brittle: models can solve via code (a minimal maze-solving sketch follows below), but one-shot visual editing is unreliable.
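The "solve via code" route commenters describe amounts to reducing the maze image to a walkable grid and then running an ordinary search, rather than asking the model to paint the solution into the image in one pass. The sketch below shows that second step; the grid is hard-coded as a stand-in for whatever cells the model extracts.

```python
# The "solve via code" route: once the maze image is reduced to a walkable
# grid (hard-coded here as a stand-in for model-extracted cells, 0 = open,
# 1 = wall), a plain breadth-first search finds the path deterministically.
from collections import deque

GRID = [
    [0, 0, 1, 0, 0],
    [1, 0, 1, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 1, 0],
]
START, GOAL = (0, 0), (4, 4)

def solve(grid, start, goal):
    """Return the list of (row, col) cells from start to goal, or None."""
    rows, cols = len(grid), len(grid[0])
    parents = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in parents):
                parents[(nr, nc)] = (r, c)
                queue.append((nr, nc))
    return None

print(solve(GRID, START, GOAL))
```

The returned path can then be drawn back onto the image programmatically, which is far more dependable than one-shot visual editing.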
Limits: Counting, Novel Configurations & “Cognition”
- Extensive discussion around failures on:
  - Counting legs on 5‑legged animals, fingers on hands, or designing 13‑hour clocks.
  - Identifying the hippocampus in MRI slices or solving mazes directly in images.
- Some view these as evidence that models are pattern-matchers lacking robust conceptual grounding; others argue this is an efficiency trade‑off and similar to human perceptual biases.
- Long subthreads debate “hallucination” vs generalization, and whether it’s meaningful to call LLM behavior “cognition” or “intelligence.”
Cloud Dependence, Privacy & Market Dynamics
- Strong concerns about mandatory cloud use, data harvesting, and reliance on US companies, especially for sensitive corporate or governmental data.
- Some argue this “centralized AI” market ignores a substantial offline/industrial segment needing on‑device or on‑prem models.
- Others note that most mainstream users do not care and expect Google’s free, data‑subsidized offerings to be highly competitive.
Jobs, Automation & Broader Impact
- Vision+tooling is perceived as a key bottleneck for full software and CAD automation; several see Gemini 3 as a big step toward agentic “software genies.”
- Debate over whether this threatens engineering and drafting roles vs mainly automating repetitive tasks with humans still steering.
- Turing test, “moving goalposts,” and the gap between marketing claims (“true visual and spatial reasoning”) and edge-case behavior are recurring themes.