Cloud Run GPUs, now GA, makes running AI workloads easier for everyone
Serverless GPU model and use cases
- Commenters interpret Cloud Run GPUs as a way to run arbitrary models (e.g. from Hugging Face) behind an API, paying only for what is used and scaling to zero between bursts (see the serving sketch after this list).
- The main value is seen in serving small, custom, or cutting-edge open-weight models where managed APIs don’t exist or are too restrictive.
- Several note this is best for bursty or early-stage workloads (new apps without clear steady traffic), not for consistently busy services where VMs+GPUs are cheaper.
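As a sketch of the pattern commenters describe, the following is a minimal HTTP serving container for an open-weight model. The FastAPI framework, the model ID, and the endpoint shape are illustrative assumptions, not anything prescribed by Cloud Run.

```python
# Minimal sketch of a scale-to-zero GPU service: a container that serves an
# open-weight Hugging Face model over HTTP. Model ID and route are illustrative.
import os

import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = os.environ.get("MODEL_ID", "Qwen/Qwen2.5-0.5B-Instruct")  # hypothetical default

app = FastAPI()

# Load at import time so the cost is paid once per cold start, not per request.
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto").to(device)


class Prompt(BaseModel):
    text: str
    max_new_tokens: int = 128


@app.post("/generate")
def generate(prompt: Prompt) -> dict:
    inputs = tokenizer(prompt.text, return_tensors="pt").to(device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=prompt.max_new_tokens)
    return {"completion": tokenizer.decode(output[0], skip_special_tokens=True)}
```

Cloud Run injects the listening port via the PORT environment variable, so the container would typically be launched with something like `uvicorn main:app --host 0.0.0.0 --port $PORT`; loading the model at import time keeps per-request latency low but pushes all of the cost into the cold start, which is the tension the next section covers.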
Cold starts and latency
- Cold start is a major concern. Reported cold starts for CPU-only Cloud Run range from ~200 ms to 30 s+, depending heavily on the language runtime and on whether the gen1 or gen2 execution environment is used.
- For GPUs, a cited example is ~19 s time-to-first-token for a 4B model, including container start and model load; some see this as unacceptable for interactive UX, while others say it’s fine when it only affects the first request, or for batch/agent use.
- Downloading model weights and loading them into GPU memory can add significantly to startup time (see the timing sketch below); several say you’ll likely end up keeping at least one warm instance, so “scale to zero” is not always practical.
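One way to see where that startup time goes is to time each phase separately. A rough sketch, assuming the weights are fetched from the Hugging Face Hub at startup (baking them into the container image removes the download step, but not the deserialize or load-to-GPU steps); the model ID is illustrative:

```python
# Rough breakdown of GPU cold-start phases after the container itself is up.
import time

import torch
from huggingface_hub import snapshot_download
from transformers import AutoModelForCausalLM

MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"  # hypothetical model

t0 = time.monotonic()
local_dir = snapshot_download(MODEL_ID)           # network download (skipped if baked into the image)
t1 = time.monotonic()
model = AutoModelForCausalLM.from_pretrained(local_dir, torch_dtype="auto")  # deserialize weights
t2 = time.monotonic()
model.to("cuda")                                  # copy weights into GPU memory
torch.cuda.synchronize()
t3 = time.monotonic()

print(f"download: {t1 - t0:.1f}s, load: {t2 - t1:.1f}s, to GPU: {t3 - t2:.1f}s")
```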
Pricing, billing, and cost controls
- Pricing of Google’s GPUs (especially beyond the L4) is widely viewed as uncompetitive versus specialized providers; an L4 on other platforms is quoted at ~40¢/hr versus ~67–71¢/hr here.
- Cloud Run GPUs bill per use but with a ~15-minute idle window; if at least one request arrives every 15 minutes, the instance never scales to zero and you effectively pay 24/7, often several times the cost of a comparable VM (see the arithmetic sketch after this list).
- The lack of hard spending caps on GCP is a major worry. Budgets and alerts exist, but they are delayed and can’t prevent “runaway” bills; some wire up automation that disables billing when a threshold is hit, but fear it will break their projects.
- Limiting max instances and concurrency can cap Cloud Run service spend, but not other APIs (e.g. Gemini). Several argue real stop-loss billing is essential for individuals and small teams.
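A back-of-the-envelope sketch of the idle-window point, using the rough per-hour figures quoted in the discussion rather than official pricing:

```python
# If at least one request arrives per ~15-minute idle window, the instance never
# scales to zero, so per-use billing converges on always-on pricing.
CLOUD_RUN_L4_PER_HOUR = 0.70   # ~67-71 cents/hr cited for Cloud Run's L4 bundle
DEDICATED_L4_PER_HOUR = 0.40   # ~40 cents/hr cited for L4s on other platforms
HOURS_PER_MONTH = 730

# One request every 15 minutes -> the idle window never expires -> billed continuously.
always_on_cloud_run = CLOUD_RUN_L4_PER_HOUR * HOURS_PER_MONTH
always_on_dedicated = DEDICATED_L4_PER_HOUR * HOURS_PER_MONTH

print(f"Cloud Run GPU, effectively always-on: ~${always_on_cloud_run:,.0f}/month")
print(f"Dedicated L4 elsewhere:               ~${always_on_dedicated:,.0f}/month")
```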
Comparisons to other providers
- Runpod, vast.ai, Coreweave, Modal, Coiled, DataCrunch, Lambda Labs, Fly, and others are discussed as cheaper or more flexible GPU options, often with per-second billing and/or true caps or prepaid credit.
- Modal, in particular, is praised for fast cold starts, good documentation, and scale-to-zero GPUs.
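For comparison, the scale-to-zero pattern on Modal looks roughly like the sketch below; the decorator arguments and the model are illustrative and written from memory, so treat the exact API surface as an assumption.

```python
# Sketch of a scale-to-zero GPU function on Modal: a GPU container is started on
# demand for the call and torn down when idle. Details are assumptions, not a
# verified reference implementation.
import modal

app = modal.App("scale-to-zero-demo")
image = modal.Image.debian_slim().pip_install("torch", "transformers")


@app.function(gpu="L4", image=image)
def generate(prompt: str) -> str:
    from transformers import pipeline

    pipe = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct", device=0)
    return pipe(prompt, max_new_tokens=64)[0]["generated_text"]


@app.local_entrypoint()
def main():
    print(generate.remote("Hello from a scale-to-zero GPU:"))
```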
Cloud Run experience and architecture
- Many praise Cloud Run’s developer experience and autoscaling, often preferring it to AWS Lambda, ECS, Fargate, or App Runner; some report large-scale, cost-effective production use.
- Others report mysterious scale-outs, restarts, and outages that support couldn’t fully explain, prompting moves to self-managed VMs or Kubernetes.
- Differences between Cloud Run gen1 (faster startup) and gen2 (microVM-based, slower startup) are noted; Cloud Run Jobs (non-HTTP batch containers that run to completion) are highlighted (see the job sketch after this list).
- Root access is not yet generally available but is being worked on; GPU types are currently limited (mainly L4), with more promised.
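Cloud Run Jobs shard work across numbered tasks instead of serving HTTP. A minimal sketch; the task-index environment variables are the ones Cloud Run Jobs documents (to the best of my recollection), while the workload itself is a placeholder:

```python
# Sketch of a Cloud Run Jobs worker: no HTTP server, runs to completion, and
# splits a batch across tasks using the job's task-index environment variables.
import os

task_index = int(os.environ.get("CLOUD_RUN_TASK_INDEX", "0"))
task_count = int(os.environ.get("CLOUD_RUN_TASK_COUNT", "1"))

items = [f"batch-item-{i}" for i in range(100)]  # placeholder workload

# Each task processes its own slice, so a job with N tasks splits the batch N ways.
for item in items[task_index::task_count]:
    print(f"task {task_index}/{task_count} processing {item}")
```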
Local vs cloud AI and GPU market
- Some wish for consumer-grade, local “AI appliance” hardware, arguing that many LLMs could already run locally if the UX were better.
- Others counter that large-scale training and heavy inference still demand cloud GPUs; GPU supply on major clouds is described as constrained and expensive, fueling the rise of “neo-cloud” GPU providers.