Cloud Run GPUs, now GA, makes running AI workloads easier for everyone

Serverless GPU model and use cases

  • Commenters interpret Cloud Run GPUs as a way to run arbitrary models (e.g. from Hugging Face) behind an API, paying only while instances run and scaling to zero between bursts.
  • Main value seen in small/custom or cutting-edge open-weight models where managed APIs don’t exist or are too restrictive.
  • Several note this is best for bursty or early-stage workloads (new apps without clear steady traffic), not for consistently busy services where VMs+GPUs are cheaper.
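The deployment shape commenters describe can be sketched with gcloud. This is a hedged illustration, not a verbatim recipe: the service name, image path, region, and resource values below are assumptions, and exact GPU requirements (CPU/memory minimums, CPU always allocated) should be checked against the current docs.

```shell
# Sketch: deploy a container with one L4 GPU that scales to zero when idle.
# All names and values here are illustrative placeholders.
gcloud run deploy my-llm-api \
  --image=us-docker.pkg.dev/my-project/repo/llm-server:latest \
  --region=us-central1 \
  --gpu=1 \
  --gpu-type=nvidia-l4 \
  --cpu=4 \
  --memory=16Gi \
  --no-cpu-throttling \
  --min-instances=0 \
  --max-instances=3
```

`--min-instances=0` is what gives the pay-per-burst behavior discussed above; `--max-instances` bounds worst-case spend.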

Cold starts and latency

  • Cold start is a major concern. Reports for CPU-only Cloud Run range from ~200 ms to 30 s+, depending heavily on the language and on the gen1 vs gen2 execution environment.
  • For GPUs, a cited example is ~19 s time-to-first-token for a 4B-parameter model, including container start and model load; some see this as unacceptable for interactive UX, others say it’s fine for first-request-only, batch, or agent use.
  • Downloading model weights and loading them into GPU memory can add significantly to startup time; several say you’ll likely keep at least one warm instance, so “scale to zero” is not always practical.
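The warm-instance mitigation commenters land on maps to a single setting. A sketch, with the service name as a placeholder:

```shell
# Keep one instance warm so interactive requests skip the cold start;
# note this trades away scale-to-zero billing.
gcloud run services update my-llm-api --min-instances=1

# To shrink the remaining startup cost, commenters suggest baking model
# weights into the container image (or a mounted volume) rather than
# downloading them at boot.
```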

Pricing, billing, and cost controls

  • Pricing of Google’s GPUs (especially beyond L4) is widely viewed as uncompetitive versus specialized providers; L4 on other platforms is quoted at ~40¢/hr vs ~67–71¢/hr here.
  • Cloud Run GPUs bill per use but with a ~15-minute idle window; if you get at least one request every 15 minutes you effectively pay 24/7, often several times the cost of a comparable VM.
  • The lack of hard spending caps on GCP is a major worry. Budgets and alerts exist but are delayed and can’t prevent “runaway” bills; some wire up automation that disables billing when a budget alert fires, but fear it breaking.
  • Capping max instances and concurrency bounds a Cloud Run service’s spend, but not spend on other APIs (e.g. Gemini). Several argue real stop-loss billing is essential for individuals and small teams.
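The idle-window arithmetic above can be made concrete. Assuming the quoted ~67¢/hr L4-class rate (an assumption; actual rates vary by region and SKU), a service that gets at least one request every 15 minutes never idles out and bills roughly like an always-on instance:

```shell
# Worst case under the ~15-minute idle window: one request every 15 min
# keeps the instance billed around the clock.
hourly_cents=67                              # assumed rate, ~$0.67/hr
monthly_cents=$(( hourly_cents * 24 * 30 ))  # 30-day month, billed 24/7
printf 'always-warm cost: $%d.%02d/month\n' \
  $(( monthly_cents / 100 )) $(( monthly_cents % 100 ))
# prints: always-warm cost: $482.40/month

# Spend on the service itself can be bounded by capping instances and
# per-instance concurrency (values illustrative):
#   gcloud run services update my-llm-api --max-instances=2 --concurrency=4
```

That ~$482/month is the figure commenters compare against a dedicated VM with an attached GPU, which is often several times cheaper at sustained load.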

Comparisons to other providers

  • RunPod, Vast.ai, CoreWeave, Modal, Coiled, DataCrunch, Lambda Labs, Fly.io, and others are discussed as cheaper or more flexible GPU options, often with per-second billing and/or true caps or prepaid credit.
  • Modal, in particular, is praised for fast cold starts, good documentation, and scale-to-zero GPUs.

Cloud Run experience and architecture

  • Many praise Cloud Run’s developer experience and autoscaling, often preferring it to AWS Lambda/ECS/Fargate/App Runner; some report large-scale, cost-effective production use.
  • Others report mysterious scale-outs, restarts, and outages that support couldn’t fully explain, prompting moves to self-managed VMs or Kubernetes.
  • Differences between Cloud Run gen1 (faster startup) and gen2 (microVM-based, slower startup) are noted; Cloud Run Jobs (non-HTTP batch) are highlighted.
  • Root access is not yet generally available but is being worked on; GPU types are currently limited (mainly L4), with more promised.
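For the batch-style work the thread highlights, Cloud Run Jobs run a container to completion without serving HTTP. A minimal sketch, with job name, image, and task count as illustrative placeholders:

```shell
# Create a non-HTTP batch job that fans out across 10 parallel tasks,
# then execute it. Each task can read CLOUD_RUN_TASK_INDEX to shard work.
gcloud run jobs create embed-batch \
  --image=us-docker.pkg.dev/my-project/repo/batch-worker:latest \
  --region=us-central1 \
  --tasks=10 \
  --max-retries=1

gcloud run jobs execute embed-batch --region=us-central1
```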

Local vs cloud AI and GPU market

  • Some wish for consumer-grade, local “AI appliance” hardware, arguing many LLMs can run locally if UX were better.
  • Others counter that large-scale training and heavy inference still demand cloud GPUs; GPU supply on major clouds is described as constrained and expensive, fueling the rise of “neo-cloud” GPU providers.