Making 2.5 Flash and 2.5 Pro GA, and introducing Gemini 2.5 Flash-Lite

Real‑world usage outside coding

  • Frequent non-coding uses: translation, long-document summarization, research reports, web/YouTube summarization, web scraping → semi-structured data, NDA/contract extraction, converting handwritten or scanned text to spreadsheets, real-estate listing feeds, home automation, math exploration, audio transcription, and “book club” / journaling / self‑reflection.
  • Vision and multimodal: praised for handling large batches of images cheaply and reliably (e.g., building a product lexicon); YouTube access and the giant context window were repeatedly cited as differentiators.
  • Many use Flash / Flash-Lite for “cheap and fast” tasks, often as a delegate for a larger model, generating or editing structured objects on its behalf.
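The delegate pattern described above usually needs a thin validation layer so the orchestrating model only ever sees well-formed output. A minimal sketch, assuming the cheap model returns JSON text (function name and schema are illustrative, not any commenter's actual code):

```python
import json

def parse_delegate_output(raw: str, required_keys: set[str]) -> dict:
    """Validate JSON emitted by a cheap 'delegate' model (e.g. Flash-Lite)
    before handing it back to the orchestrating model.

    Raises ValueError if the payload is missing any required key, so a
    caller can retry the cheap model instead of propagating bad data.
    """
    obj = json.loads(raw)
    missing = required_keys - obj.keys()
    if missing:
        raise ValueError(f"delegate output missing keys: {sorted(missing)}")
    return obj

# Example: a contract-extraction delegate expected to return title + parties.
record = parse_delegate_output(
    '{"title": "NDA", "parties": ["Acme", "Globex"]}',
    {"title", "parties"},
)
```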

Model quality & comparisons

  • Several users say Gemini 2.5 Pro is strong for translation, summarization, law-like writing, math help, and long-context drafting; some prefer its writing tone and research depth to ChatGPT.
  • Others find Gemini worse than Claude or OpenAI for serious coding or complex reasoning, describing it as verbose, off-topic, or “BuzzFeed-style” in tone.
  • Some report very good coding performance and stable code from 2.5 Pro (especially via tools like Aider), but complain about excessive comments and try/except clutter.
  • There’s a sense that preview versions of 2.5 Pro felt smarter, more willing to push back, and less sycophantic than the GA release.

Long context & benchmarks

  • Users praise the 1M-token window for translation, big-doc Q&A, and “pile of NDAs”‑type workflows.
  • A detailed subthread debates long-context evals (NIAH, MRCR, RULER, LongProc, HELMET).
    • One side: Gemini 2.5 Pro collapses after ~32k tokens on internal enterprise benchmarks; long-context reasoning is still weak across all models.
    • Other side: in real-world doc‑assembly tasks (reports/proposals) 2.5 Pro performs uniquely well.
  • Consensus: long context is useful but far from “solved,” and benchmark choice strongly affects perceived performance.
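For a concrete sense of what the NIAH-style evals in that subthread measure, here is a minimal harness with a stub standing in for a real model; `model_fn` and the ~4-characters-per-token estimate are assumptions, not any benchmark's actual code:

```python
def needle_in_haystack(model_fn, approx_tokens: int,
                       needle: str, question: str, expected: str) -> bool:
    """NIAH-style probe: bury one fact mid-context, then ask for it back.

    model_fn(prompt) is a placeholder for any LLM call; context length is
    approximated at ~4 characters per token.
    """
    sentence = "The grass is green and the sky is blue. "
    filler = sentence * max(1, approx_tokens * 4 // len(sentence))
    mid = len(filler) // 2
    haystack = filler[:mid] + " " + needle + " " + filler[mid:]
    answer = model_fn(haystack + "\n\nQuestion: " + question)
    return expected.lower() in answer.lower()
```

Sweeping `approx_tokens` from a few thousand up toward 1M (and varying the needle position) is how one would probe the claimed collapse past ~32k tokens; real suites like MRCR and RULER add multi-needle and reasoning variants on top of this basic shape.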

Product tiers, UX, privacy

  • Gemini app, AI Studio, and Vertex behave differently:
    • Gemini app: smaller thinking budgets, stronger safety filters, nerfed behavior; often underperforms API/AI Studio.
    • AI Studio: better control (system instructions, temperature, schemas) but confusion over when data may be used for training; clarified that any account with a billed project gets private treatment.
    • Vertex: same models with higher, more negotiable rate limits.
  • Many dislike Gemini’s chat UX and file-handling; want native Git/FTP/file integration instead of copy-paste.

“Thinking” mode and behavior

  • “Thinking” is described as scratchpad / chain-of-thought tokens before the final answer. It improves quality but adds latency and lots of tokens.
  • Users question the value of a “thinking” variant that’s weaker than regular Flash, and some see no need for thinking on latency‑sensitive tasks (voice, real-time apps).
  • Reports that Flash sometimes emits thinking tokens even when thinking budget is set to zero.

Pricing and “bait‑and‑switch” concerns

  • Major point of contention: 2.5 Flash price changes vs preview and vs 2.0 Flash:
    • Input text/image/video: 2x increase over 2.5 preview (and higher than 2.0).
    • Output: a single $2.50/M rate replaces $0.60 (non-thinking) and $3.50 (thinking), an effective ~4x increase for prior non-thinking use.
    • Audio for Flash-Lite up ~6.3x over 2.0 Flash-Lite.
  • Many see this as a “bait‑and‑switch”: developers built on cheap preview pricing, then face steep increases as models go GA.
  • Others argue earlier pricing was clearly subsidized to gain adoption; as Gemini becomes competitive, it converges toward market rates.
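The “~4x” figure for prior non-thinking use checks out arithmetically from the rates quoted above; a quick sketch (the workload size is hypothetical):

```python
# 2.5 Flash output pricing, $/M output tokens, as quoted in the thread.
OLD_NON_THINKING = 0.60   # preview, non-thinking
OLD_THINKING     = 3.50   # preview, thinking
NEW_UNIFIED      = 2.50   # GA, single rate for all output

def output_cost(tokens_millions: float, rate: float) -> float:
    """Dollar cost for a given volume of output tokens (in millions)."""
    return tokens_millions * rate

# A hypothetical workload of 100M non-thinking output tokens per month:
old_cost = output_cost(100, OLD_NON_THINKING)  # $60
new_cost = output_cost(100, NEW_UNIFIED)       # $250
increase = new_cost / old_cost                 # ~4.17x
```

For thinking-heavy workloads the same rate change is a price *cut* ($3.50 → $2.50), which is why the two sides of the thread read the change so differently.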

Limits, reliability, and access issues

  • Complaints about:
    • Low default rate limits (e.g., 10k RPD), opaque upgrade process, getting 403s mid‑batch; some moved back to OpenAI for throughput.
    • Empty responses or loops (e.g., Flash-Lite repeating phrases in transcripts), often tied to length limits or safety filters.
    • 2.5 Pro being unavailable or tricky to access via some API endpoints.
  • Some note that Vertex alleviates many rate-limit issues and offers more formal throughput guarantees.
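For the 403s-mid-batch problem, the usual client-side mitigation is exponential backoff with jitter. A minimal stdlib sketch; the helper name and parameters are illustrative, and `fn` stands in for any API call that raises on a retryable failure:

```python
import random
import time

def call_with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0,
                      sleep=time.sleep):
    """Retry a zero-arg callable with exponential backoff plus jitter.

    Each failed attempt waits base_delay * 2**attempt (with up to 10%
    random jitter to avoid synchronized retries); the final failure is
    re-raised to the caller.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) * (1 + random.random() * 0.1)
            sleep(delay)
```

The injectable `sleep` makes the helper testable without real delays; in production it defaults to `time.sleep`. This smooths over transient 429/403 bursts but does not raise the underlying quota, which is where the Vertex route comes in.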

Cloud vs local models

  • A few users consider local LLMs due to API pricing and limits, but others argue:
    • Hardware costs and rapid model churn make local generally worse economics unless you’re processing huge volumes.
    • Quality gap: local models on 24–48 GB GPUs land closer to Flash-Lite quality, while running slower than top hosted models.
  • Local is framed as mainly for hobby use and privacy, not efficiency, at least for now.

General sentiment

  • Many have shifted primary usage from ChatGPT/Claude to Gemini (especially 2.5 Pro and 2.0/2.5 Flash) and are impressed by speed, multimodal capabilities, and long context.
  • At the same time, there’s strong frustration about:
    • Perceived “nerfs”/quantization, particularly in the consumer app.
    • Overly cheerful, verbose tone.
    • Sudden price hikes and confusing thinking/non-thinking semantics.
  • Overall, Gemini is seen as technically strong and rapidly improving, but trust is undermined by pricing moves, behavior changes, and product fragmentation.