Speed up responses with fast mode
Pricing & Value Perception
- Fast mode is widely seen as extremely expensive: ~$30/MTok input and $150/MTok output, about 6× the normal Opus API price for ~2.5× speed.
- Multiple users report burning through $10–$100 of credit within minutes to a couple of hours of typical “serious dev” usage; some say that at fast-mode rates their normal $200/month subscription would be exhausted in a day.
- The docs cause confusion: fast mode is “available” on Pro/Max/Team/Enterprise plans, but its usage is not included in those plans and is billed only against extra-usage credit.
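The 6× figure follows directly from the rates quoted in the thread, assuming standard Opus API pricing of $5/MTok input and $25/MTok output. A back-of-envelope sketch (the session token counts are purely hypothetical):

```python
# Cost comparison between assumed standard Opus pricing and the fast-mode
# rates quoted in the thread. Session usage numbers are hypothetical.

NORMAL = {"in": 5.0, "out": 25.0}    # $/MTok, assumed standard Opus API rates
FAST = {"in": 30.0, "out": 150.0}    # $/MTok, fast-mode rates from the thread

def session_cost(rates, in_mtok, out_mtok):
    """Dollar cost of a session consuming the given megatokens."""
    return rates["in"] * in_mtok + rates["out"] * out_mtok

# Hypothetical heavy dev session: 4 MTok in (large contexts, re-reads), 0.5 MTok out.
normal = session_cost(NORMAL, 4.0, 0.5)   # $32.50
fast = session_cost(FAST, 4.0, 0.5)       # $195.00
print(f"normal ${normal:.2f}  fast ${fast:.2f}  ratio {fast / normal:.1f}x")
```

Because both input and output rates scale by the same factor, the ratio stays 6× regardless of the input/output mix, which is why a $200/month budget evaporates so quickly.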
Speed & Developer Experience
- Supporters argue that latency is a real bottleneck: waiting 1–2+ minutes per step forces context switching, increases mental load, and breaks “single-threaded” deep work.
- Fast mode is seen as especially attractive for short, serial, blocking tasks (e.g., small merges, UI iteration, planning phases) where humans must wait for the agent.
- Others say their bottleneck is reading, understanding, and validating AI-generated code, so faster output doesn’t help much.
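The disagreement reduces to a break-even calculation: the premium only pays off when a human is actually blocked waiting. A rough sketch, where every number (tokens per step, baseline wait, speedup) is an illustrative assumption, not a measurement:

```python
# Rough break-even: at what hourly rate does paying the fast-mode premium
# beat waiting? All inputs are illustrative assumptions.

def break_even_rate(extra_cost_usd, seconds_saved):
    """Hourly rate ($/h) above which the time saved is worth the extra spend."""
    return extra_cost_usd / (seconds_saved / 3600.0)

# Assume a step emits 2k output tokens: output premium is
# (150 - 25) $/MTok * 0.002 MTok = $0.25 (input premium would add more),
# and a 2.5x speedup turns a 100 s wait into 40 s, saving 60 s.
rate = break_even_rate(0.25, 60)
print(f"worth it above ~${rate:.0f}/h")  # ~$15/h
```

Under these assumptions the premium is cheap for anyone blocked on serial steps, but worthless if the real bottleneck is reading and validating the output afterward, which is exactly the split in the thread.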
Implementation & Infrastructure Speculation
- Many assume this is primarily about prioritization/queue-skipping and retuning serving infrastructure: fewer concurrent users per GPU, smaller batches, higher per-user tokens/sec at lower overall throughput.
- Alternatives raised: newer hardware (GB200/Blackwell, TPUs), speculative decoding, keeping KV cache in GPU memory; debate over how much each could contribute.
- Some emphasize that large-scale serving always trades off throughput vs. per-request latency; “premium” speed simply chooses a different point on that curve.
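The throughput-vs-latency curve the commenters describe can be sketched with a toy decode-step model. The constants below are invented for illustration; real serving stacks have far more structure (prefill vs decode, memory-bandwidth limits, continuous batching):

```python
# Toy model of the batching trade-off in LLM serving: larger batches raise
# aggregate throughput but lower each user's tokens/sec. Constants are invented.

def serve(batch_size, step_ms_base=20.0, step_ms_per_req=1.5):
    """Model one decode step whose latency grows with batch size;
    each request in the batch receives one token per step."""
    step_ms = step_ms_base + step_ms_per_req * batch_size
    per_user_tps = 1000.0 / step_ms                 # tokens/sec per request
    aggregate_tps = per_user_tps * batch_size       # tokens/sec across the batch
    return per_user_tps, aggregate_tps

for b in (1, 8, 64):
    u, a = serve(b)
    print(f"batch={b:3d}  per-user {u:5.1f} tok/s  aggregate {a:7.1f} tok/s")
```

In this model a "fast lane" is just a small-batch deployment: each user gets several times the tokens/sec, but the hardware produces far fewer total tokens, which is the usual argument for pricing it at a steep multiple.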
Business Model & “Enshittification” Concerns
- Strong worry that introducing a paid “fast lane” creates incentives to degrade the free/standard lane over time, analogized to airline “speedy boarding” or food-delivery premium tiers.
- Others call this conspiratorial, arguing there’s no evidence of intentional slowdowns and intense competition would punish obvious degradation.
Desire for Slow/Cheap Modes & Alternatives
- Many request a cheaper slow mode or easier integration of batch processing/spot-style pricing, especially for overnight/background agents.
- Comparisons: OpenAI’s priority tier and batch API, Gemini 3 Pro’s better speed/price but weaker coding, and fast local/open models (Groq/Cerebras, large local GPUs) as eventual substitutes.
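The economics behind the slow-mode request can be sketched the same way. The 50% discount below mirrors what OpenAI's Batch API offers for deferred (up-to-24h) jobs; applying it to Opus-style rates for overnight agents is purely hypothetical:

```python
# Sketch of the economics users are asking for: a batch/slow tier at a
# 50% discount (the rate OpenAI's Batch API offers for deferred jobs),
# applied to hypothetical overnight background-agent workloads.

OPUS = {"in": 5.0, "out": 25.0}   # $/MTok, assumed standard rates
DISCOUNT = 0.5                    # batch-style discount; hypothetical here

def overnight_cost(in_mtok, out_mtok, discount=DISCOUNT):
    """Cost of a deferred workload after the assumed slow-tier discount."""
    full = OPUS["in"] * in_mtok + OPUS["out"] * out_mtok
    return full * (1.0 - discount)

# 8 hours of background agents chewing through 20 MTok in / 3 MTok out:
print(f"${overnight_cost(20, 3):.2f} discounted vs ${overnight_cost(20, 3, 0):.2f} at full price")
```

The appeal is symmetric to fast mode: overnight agents do not care about latency at all, so they are the natural buyers for the cheap end of the same throughput/latency curve.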