Speed up responses with fast mode
Pricing & Value Perception
- Fast mode is widely seen as extremely expensive: ~$30/MTok input and $150/MTok output, about 6× the normal Opus API price for ~2.5× speed.
- Multiple users report burning through $10–$100 of credit within minutes to a couple of hours of typical “serious dev” usage; some say that at fast-mode rates their normal $200/month subscription would be exhausted in a day.
- The docs cause confusion: fast mode is “available” on Pro/Max/Team/Enterprise plans, but its usage is not included in those plans and is billed only against extra-usage credit.
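The 6× figure follows directly from the rates quoted in the thread, assuming standard Opus API pricing of $5/MTok input and $25/MTok output. A back-of-envelope sketch (the session token counts are purely hypothetical):

```python
# Cost comparison between assumed standard Opus pricing and the fast-mode
# rates quoted in the thread. Session usage numbers are hypothetical.

NORMAL = {"in": 5.0, "out": 25.0}    # $/MTok, assumed standard Opus API rates
FAST = {"in": 30.0, "out": 150.0}    # $/MTok, fast-mode rates from the thread

def session_cost(rates, in_mtok, out_mtok):
    """Dollar cost of a session consuming the given megatokens."""
    return rates["in"] * in_mtok + rates["out"] * out_mtok

# Hypothetical heavy dev session: 4 MTok in (large contexts, re-reads), 0.5 MTok out.
normal = session_cost(NORMAL, 4.0, 0.5)   # $32.50
fast = session_cost(FAST, 4.0, 0.5)       # $195.00
print(f"normal ${normal:.2f}  fast ${fast:.2f}  ratio {fast / normal:.1f}x")
```

Because both input and output rates scale by the same factor, the ratio stays 6× regardless of the input/output mix, which is why a $200/month budget evaporates so quickly.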
Speed & Developer Experience
- Supporters argue that latency is a real bottleneck: waiting 1–2+ minutes per step forces context switching, increases mental load, and breaks “single-threaded” deep work.
- Fast mode is seen as especially attractive for short, serial, blocking tasks (e.g., small merges, UI iteration, planning phases) where humans must wait for the agent.
- Others say their bottleneck is reading, understanding, and validating AI-generated code, so faster output doesn’t help much.
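The disagreement reduces to a break-even calculation: the premium only pays off when a human is actually blocked waiting. A rough sketch, where every number (tokens per step, baseline wait, speedup) is an illustrative assumption, not a measurement:

```python
# Rough break-even: at what hourly rate does paying the fast-mode premium
# beat waiting? All inputs are illustrative assumptions.

def break_even_rate(extra_cost_usd, seconds_saved):
    """Hourly rate ($/h) above which the time saved is worth the extra spend."""
    return extra_cost_usd / (seconds_saved / 3600.0)

# Assume a step emits 2k output tokens: output premium is
# (150 - 25) $/MTok * 0.002 MTok = $0.25 (input premium would add more),
# and a 2.5x speedup turns a 100 s wait into 40 s, saving 60 s.
rate = break_even_rate(0.25, 60)
print(f"worth it above ~${rate:.0f}/h")  # ~$15/h
```

Under these assumptions the premium is cheap for anyone blocked on serial steps, but worthless if the real bottleneck is reading and validating the output afterward, which is exactly the split in the thread.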
Implementation & Infrastructure Speculation
- Many assume this is primarily about prioritization/queue-skipping and retuning serving infrastructure: fewer concurrent users per GPU, smaller batches, higher per-user tokens/sec at lower overall throughput.
- Alternatives raised: newer hardware (GB200/Blackwell, TPUs), speculative decoding, keeping KV cache in GPU memory; debate over how much each could contribute.
- Some emphasize that large-scale serving always trades off throughput vs. per-request latency; “premium” speed simply chooses a different point on that curve.
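The throughput-vs-latency curve the commenters describe can be sketched with a toy decode-step model. The constants below are invented for illustration; real serving stacks have far more structure (prefill vs decode, memory-bandwidth limits, continuous batching):

```python
# Toy model of the batching trade-off in LLM serving: larger batches raise
# aggregate throughput but lower each user's tokens/sec. Constants are invented.

def serve(batch_size, step_ms_base=20.0, step_ms_per_req=1.5):
    """Model one decode step whose latency grows with batch size;
    each request in the batch receives one token per step."""
    step_ms = step_ms_base + step_ms_per_req * batch_size
    per_user_tps = 1000.0 / step_ms                 # tokens/sec per request
    aggregate_tps = per_user_tps * batch_size       # tokens/sec across the batch
    return per_user_tps, aggregate_tps

for b in (1, 8, 64):
    u, a = serve(b)
    print(f"batch={b:3d}  per-user {u:5.1f} tok/s  aggregate {a:7.1f} tok/s")
```

In this model a "fast lane" is just a small-batch deployment: each user gets several times the tokens/sec, but the hardware produces far fewer total tokens, which is the usual argument for pricing it at a steep multiple.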
Business Model & “Enshittification” Concerns
- Strong worry that introducing a paid “fast lane” creates incentives to degrade the free/standard lane over time, analogized to airline “speedy boarding” or food-delivery premium tiers.
- Others call this conspiratorial, arguing there’s no evidence of intentional slowdowns and intense competition would punish obvious degradation.
Desire for Slow/Cheap Modes & Alternatives
- Many request a cheaper slow mode or easier integration of batch processing/spot-style pricing, especially for overnight/background agents.
- Comparisons: OpenAI’s priority tier and batch API, Gemini 3 Pro’s better speed/price but weaker coding, and fast local/open models (Groq/Cerebras, large local GPUs) as eventual substitutes.
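The economics behind the slow-mode request can be sketched the same way. The 50% discount below mirrors what OpenAI's Batch API offers for deferred (up-to-24h) jobs; applying it to Opus-style rates for overnight agents is purely hypothetical:

```python
# Sketch of the economics users are asking for: a batch/slow tier at a
# 50% discount (the rate OpenAI's Batch API offers for deferred jobs),
# applied to hypothetical overnight background-agent workloads.

OPUS = {"in": 5.0, "out": 25.0}   # $/MTok, assumed standard rates
DISCOUNT = 0.5                    # batch-style discount; hypothetical here

def overnight_cost(in_mtok, out_mtok, discount=DISCOUNT):
    """Cost of a deferred workload after the assumed slow-tier discount."""
    full = OPUS["in"] * in_mtok + OPUS["out"] * out_mtok
    return full * (1.0 - discount)

# 8 hours of background agents chewing through 20 MTok in / 3 MTok out:
print(f"${overnight_cost(20, 3):.2f} discounted vs ${overnight_cost(20, 3, 0):.2f} at full price")
```

The appeal is symmetric to fast mode: overnight agents do not care about latency at all, so they are the natural buyers for the cheap end of the same throughput/latency curve.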