Cloud Run GPUs, now GA, makes running AI workloads easier for everyone

Serverless GPU model and use cases

  • Commenters interpret Cloud Run GPUs as a way to run arbitrary models (e.g. from Hugging Face) behind an API, paying only while instances run and scaling to zero between bursts.
  • Main value seen in small/custom or cutting-edge open-weight models where managed APIs don’t exist or are too restrictive.
  • Several note this is best for bursty or early-stage workloads (new apps without clear steady traffic), not for consistently busy services where VMs+GPUs are cheaper.
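The deployment shape commenters describe can be sketched with gcloud. This is a hedged illustration, not a verbatim recipe: the service name, image path, region, and resource values below are assumptions, and exact GPU requirements (CPU/memory minimums, CPU always allocated) should be checked against the current docs.

```shell
# Sketch: deploy a container with one L4 GPU that scales to zero when idle.
# All names and values here are illustrative placeholders.
gcloud run deploy my-llm-api \
  --image=us-docker.pkg.dev/my-project/repo/llm-server:latest \
  --region=us-central1 \
  --gpu=1 \
  --gpu-type=nvidia-l4 \
  --cpu=4 \
  --memory=16Gi \
  --no-cpu-throttling \
  --min-instances=0 \
  --max-instances=3
```

`--min-instances=0` is what gives the pay-per-burst behavior discussed above; `--max-instances` bounds worst-case spend.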

Cold starts and latency

  • Cold start is a major concern. Reports for CPU-only Cloud Run range from ~200 ms to 30 s+, depending heavily on the language and on the gen1 vs gen2 execution environment.
  • For GPUs, a cited example is ~19 s time-to-first-token for a 4B-parameter model, including container start and model load; some see this as unacceptable for interactive UX, others say it’s fine for first-request-only, batch, or agent use.
  • Downloading model weights and loading them into GPU memory can add significantly to startup time; several say you’ll likely keep at least one warm instance, so “scale to zero” is not always practical.
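The warm-instance mitigation commenters land on maps to a single setting. A sketch, with the service name as a placeholder:

```shell
# Keep one instance warm so interactive requests skip the cold start;
# note this trades away scale-to-zero billing.
gcloud run services update my-llm-api --min-instances=1

# To shrink the remaining startup cost, commenters suggest baking model
# weights into the container image (or a mounted volume) rather than
# downloading them at boot.
```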

Pricing, billing, and cost controls

  • Pricing of Google’s GPUs (especially beyond L4) is widely viewed as uncompetitive versus specialized providers; L4 on other platforms is quoted at ~40¢/hr vs ~67–71¢/hr here.
  • Cloud Run GPUs bill per use but with a ~15-minute idle window; if you get at least one request every 15 minutes you effectively pay 24/7, often several times the cost of a comparable VM.
  • The lack of hard spending caps on GCP is a major worry. Budgets and alerts exist but are delayed and can’t prevent “runaway” bills; some wire up automation that disables billing when a budget alert fires, but fear it breaking.
  • Capping max instances and concurrency bounds a Cloud Run service’s spend, but not spend on other APIs (e.g. Gemini). Several argue real stop-loss billing is essential for individuals and small teams.
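The idle-window arithmetic above can be made concrete. Assuming the quoted ~67¢/hr L4-class rate (an assumption; actual rates vary by region and SKU), a service that gets at least one request every 15 minutes never idles out and bills roughly like an always-on instance:

```shell
# Worst case under the ~15-minute idle window: one request every 15 min
# keeps the instance billed around the clock.
hourly_cents=67                              # assumed rate, ~$0.67/hr
monthly_cents=$(( hourly_cents * 24 * 30 ))  # 30-day month, billed 24/7
printf 'always-warm cost: $%d.%02d/month\n' \
  $(( monthly_cents / 100 )) $(( monthly_cents % 100 ))
# prints: always-warm cost: $482.40/month

# Spend on the service itself can be bounded by capping instances and
# per-instance concurrency (values illustrative):
#   gcloud run services update my-llm-api --max-instances=2 --concurrency=4
```

That ~$482/month is the figure commenters compare against a dedicated VM with an attached GPU, which is often several times cheaper at sustained load.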

Comparisons to other providers

  • RunPod, Vast.ai, CoreWeave, Modal, Coiled, DataCrunch, Lambda Labs, Fly.io, and others are discussed as cheaper or more flexible GPU options, often with per-second billing and/or true caps or prepaid credit.
  • Modal, in particular, is praised for fast cold starts, good documentation, and scale-to-zero GPUs.

Cloud Run experience and architecture

  • Many praise Cloud Run’s developer experience and autoscaling, often preferring it to AWS Lambda/ECS/Fargate/App Runner; some report large-scale, cost-effective production use.
  • Others report mysterious scale-outs, restarts, and outages that support couldn’t fully explain, prompting moves to self-managed VMs or Kubernetes.
  • Differences between Cloud Run gen1 (faster startup) and gen2 (microVM-based, slower startup) are noted; Cloud Run Jobs (non-HTTP batch) are highlighted.
  • Root access is not yet generally available but is being worked on; GPU types are currently limited (mainly L4), with more promised.
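For the batch-style work the thread highlights, Cloud Run Jobs run a container to completion without serving HTTP. A minimal sketch, with job name, image, and task count as illustrative placeholders:

```shell
# Create a non-HTTP batch job that fans out across 10 parallel tasks,
# then execute it. Each task can read CLOUD_RUN_TASK_INDEX to shard work.
gcloud run jobs create embed-batch \
  --image=us-docker.pkg.dev/my-project/repo/batch-worker:latest \
  --region=us-central1 \
  --tasks=10 \
  --max-retries=1

gcloud run jobs execute embed-batch --region=us-central1
```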

Local vs cloud AI and GPU market

  • Some wish for consumer-grade, local “AI appliance” hardware, arguing many LLMs can run locally if UX were better.
  • Others counter that large-scale training and heavy inference still demand cloud GPUs; GPU supply on major clouds is described as constrained and expensive, fueling the rise of “neo-cloud” GPU providers.