Gemini 2.0: our new AI model for the agentic era

Model capabilities & demos

  • Gemini 2.0 Flash adds native multimodality (text, images, audio; video via Multimodal Live) with low latency.
  • Native image and audio output are delayed until early next year; for now, image generation routes through Imagen 3.
  • The Multimodal Live API and AI Studio “Live” UI impress many: real‑time voice plus camera/screen sharing can identify objects, read text, critique physical movement, and tutor in tools like Blender.
  • Code execution inside the model sandbox works for local Python, but it has no outbound network access and runs into missing-package issues.

Benchmarks, quality & comparisons

  • Google claims Gemini 2.0 Flash beats 1.5 Pro on most benchmarks; some users see Flash ≈ old Pro, others say experimental 1206 is stronger.
  • On community leaderboards (e.g., LM Arena), Gemini 2.0 Flash ranks near GPT‑4o and other top models, but many distrust benchmarks as over‑optimized.
  • One hallucination benchmark shows a very low hallucination rate for 2.0 Flash, but several hands‑on reports still see confident errors and verbose “reasoning” that can mislead.
  • Mixed anecdotes: some say coding, Advent of Code, and vision tasks are now competitive with GPT‑4o / Claude; others find GPT‑4o or o1 clearly superior for reasoning and hard debugging.

Search integration & hallucinations

  • Strong disagreement about Gemini in Search: some find it increasingly useful; others report frequent factual errors (locations, chemistry definitions, counts of islands, corporate facts) presented as authoritative.
  • A few note that some failures are likely inherited from underlying web search, not just the model.

Pricing, access & quotas

  • Gemini Advanced subscription is ~£18/month.
  • API usage for Flash 2.0 is currently free in preview, capped at 10 requests/minute and ~1,500 requests/day; developers complain this is too low for “agentic” workloads.
  • Multimodal Live API is free during preview; many hope production pricing will undercut OpenAI’s relatively expensive audio I/O.
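A common workaround for tight preview quotas is a client-side throttle. A minimal sliding-window sketch (a hypothetical helper, not part of any Google SDK), assuming the 10 requests/minute figure reported above:

```python
import time
from collections import deque

class RateLimiter:
    """Sliding-window limiter: at most `max_calls` per `per_seconds`."""

    def __init__(self, max_calls: int, per_seconds: float):
        self.max_calls = max_calls
        self.per_seconds = per_seconds
        self.calls = deque()  # monotonic timestamps of recent calls

    def delay(self, now: float) -> float:
        """Seconds to wait before the next call fits in the window."""
        # Evict timestamps that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.per_seconds:
            self.calls.popleft()
        if len(self.calls) < self.max_calls:
            return 0.0
        return self.per_seconds - (now - self.calls[0])

    def acquire(self) -> None:
        """Sleep if needed, then record the call."""
        wait = self.delay(time.monotonic())
        if wait > 0:
            time.sleep(wait)
        self.calls.append(time.monotonic())

# Usage: call limiter.acquire() before each API request.
limiter = RateLimiter(max_calls=10, per_seconds=60.0)
```

This keeps a burst of up to 10 calls available while guaranteeing the per-minute average stays under quota; the daily cap would need a second limiter with a 24-hour window.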

On‑device vs cloud, hardware & economics

  • Long debate on whether training or inference is the real moat:
    • One side: training compute (TPUs, data) is the scarce asset; inference is a commodity many hardware vendors can provide.
    • Other side: at scale, inference costs dominate; without cheap inference or good on‑device performance, economics and adoption suffer.
  • Discussion about whether on‑device models (Apple, Android Tensor chips) will become “good enough” to erode demand for paid cloud services.
  • Several argue Google doesn’t need to “win” on‑device if cloud inference remains cheap and fast; others think Apple’s eventual strong on‑device AI will force Android to respond.

“Agentic” models & terminology

  • “Agentic” is widely mocked as vague marketing jargon; people prefer plain terms like “autonomous” or “tool‑using.”
  • Some insist most “agents” are just LLMs plus tools and static workflows; complex multi‑agent handoff systems often underperform a single strong model with tools and long context.
  • Others see real promise in browser‑control projects (e.g., Project Mariner) and live multimodal agents, but think the term is over‑applied.
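The “LLMs plus tools” view above reduces to a single loop around a model call and a tool registry. A toy sketch with a stubbed model (`fake_model` is a hypothetical stand-in for a real LLM API call, not any vendor’s interface):

```python
# One loop, one tool registry, no multi-agent choreography.
TOOLS = {
    "add": lambda a, b: a + b,
}

def fake_model(messages):
    """Stub: a real agent would send `messages` to an LLM API.
    Here it requests a tool call once, then gives a final answer."""
    tool_results = [m for m in messages if m["role"] == "tool"]
    if not tool_results:
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"answer": f"The sum is {tool_results[-1]['content']}"}

def run_agent(user_msg: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_steps):
        reply = fake_model(messages)
        if "answer" in reply:
            return reply["answer"]
        # Execute the requested tool and feed the result back.
        result = TOOLS[reply["tool"]](**reply["args"])
        messages.append({"role": "tool", "content": result})
    return "gave up"

print(run_agent("What is 2 + 3?"))  # -> The sum is 5
```

The point of the skeptics’ argument is that this loop, with a strong model and long context, often beats elaborate multi-agent handoff schemes.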

Trust, product longevity & ecosystem

  • Persistent concern about Google’s habit of killing or deprecating products and APIs (Reader, messaging apps, Stadia, GCP deprecations).
  • Some organizations explicitly avoid Google for core infra, preferring AWS/Anthropic due to perceived stability and clearer long‑term support.
  • Fears that violations of vaguely defined AI terms of service could trigger bans affecting entire Google accounts (Gmail, Docs, Photos), with little recourse.
  • Counterpoint: core products like Search, Gmail, and Workspace are long‑lived and widely relied upon; AI is seen as strategically central and unlikely to be abandoned.

Developer tooling & practical use

  • The new Python SDK (googleapis/python‑genai) is praised as more modern; it supports structured outputs via schemas (including Pydantic models).
  • Developers like Gemini’s large context windows for RAG and for dumping big docs into the prompt; they also note good speed compared with GPT‑4o’s “dog slow” feel.
  • Some find Gemini’s web UI weaker than its raw API, which can integrate well into tools (VS Code via Cline, CLI tools like llm, custom MCP/agent setups).
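A hedged sketch of the structured-output pattern mentioned above: the typed schema and JSON validation below run offline, while the actual google‑genai call is shown only in comments, with method and parameter names taken from the SDK’s README (assumptions — verify against current docs):

```python
from dataclasses import dataclass
import json

@dataclass
class Recipe:
    name: str
    minutes: int

def parse_recipes(raw_json: str) -> list:
    """Validate a JSON reply into typed records."""
    return [Recipe(**item) for item in json.loads(raw_json)]

# With the SDK (requires `pip install google-genai` and an API key);
# the call shape here is an assumption based on the project README:
#
#   from google import genai
#   client = genai.Client(api_key="...")
#   resp = client.models.generate_content(
#       model="gemini-2.0-flash-exp",
#       contents="List two quick dinner recipes as JSON.",
#       config={"response_mime_type": "application/json",
#               "response_schema": list[Recipe]},
#   )
#   recipes = parse_recipes(resp.text)

# Offline demonstration with a canned reply:
canned = '[{"name": "stir fry", "minutes": 15}]'
print(parse_recipes(canned)[0].name)  # -> stir fry
```

Keeping a local validation step like `parse_recipes` is useful regardless of SDK: it turns a schema violation into an immediate, typed error instead of a silent downstream bug.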