So you want to rent an NVIDIA H100 cluster? 2024 Consumer Guide

Pricing, Transparency, and Economics

  • Several commenters note the article is light on concrete prices; many vendors insist on private, quote-based deals.
  • One data point: a meta-source puts the median price for an InfiniBand-connected H100 cluster at around $2.30–2.47 per GPU-hour.
  • A back-of-envelope estimate for 256 H100s at $2.47 per GPU-hour gives ~$455k/month, implying ~22 months to recoup $10M of GPUs (excluding power, infra, and ops); commenters see this as a tough business with rapid depreciation risk.
  • Discussion highlights deliberate price discrimination in high-end markets and the difficulty of “price discovery,” especially for non-technical buyers.
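
The back-of-envelope payback estimate can be reproduced with a short script. This is a minimal sketch, not an operator's actual model: the 30-day month, the `utilization` parameter, and the function names are assumptions added for illustration, and power, infrastructure, and ops costs are left out, as in the thread.

```python
# Sketch of the payback arithmetic from the discussion.
# Assumptions (not from the thread): a 30-day month and a utilization knob.
HOURS_PER_MONTH = 24 * 30  # 720 hours


def monthly_revenue(num_gpus, price_per_gpu_hour, utilization=1.0):
    """Gross monthly rental revenue in dollars."""
    return num_gpus * price_per_gpu_hour * HOURS_PER_MONTH * utilization


def payback_months(capex, num_gpus, price_per_gpu_hour, utilization=1.0):
    """Months of gross revenue needed to equal the hardware cost.

    Ignores power, infrastructure, ops, and depreciation.
    """
    return capex / monthly_revenue(num_gpus, price_per_gpu_hour, utilization)


if __name__ == "__main__":
    # Both ends of the quoted price range, for 256 GPUs and $10M of hardware.
    for price in (2.30, 2.47):
        revenue = monthly_revenue(256, price)
        months = payback_months(10_000_000, 256, price)
        print(f"${price:.2f}/GPU-h: ${revenue:,.0f}/month, payback {months:.1f} months")
```

At full utilization and $2.47 per GPU-hour, the payback comes out just under 22 months. Lowering `utilization` to a more realistic value lengthens it proportionally, which is the depreciation concern commenters raise.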

Networking: InfiniBand vs Ethernet

  • InfiniBand is reported to systematically outperform Ethernet for large distributed training (3–10% at 16 nodes; gap widens with scale) and to be more reliable.
  • Ethernet-based RoCE clusters can work, but require careful design (lossless Ethernet, separate compute vs storage networks, avoiding packet reordering).
  • Lead times for current-gen IB switches are reported as ~50+ weeks, pushing some to Ethernet.
  • PCIe topology and switches are stressed as critical when combining many GPUs and high-speed NICs.

Cluster Design: New vs Used Hardware

  • Some argue used Mellanox/IB gear is cheap and reliable for labs.
  • Others reject used gear for a commercial GPU cloud, prioritizing vendor support contracts and uptime guarantees, and wanting to avoid being blamed for failures on unsupported hardware.

AMD MI300X vs NVIDIA H100 and Software Stack

  • One operator is building a 128-GPU MI300X Ethernet cluster, positioning AMD as the “underdog” with strong vendor support and lower strategic risk than joining the NVIDIA crowd.
  • ROCm is recognized as AMD's CUDA equivalent but widely viewed as immature; some ML users say that alone is a deal-breaker today.
  • Technical limitations (e.g., missing PCIe passthrough for multi-tenant VMs) currently block fine-grained GPU rental; container-based workarounds have drawbacks.

Datacenter Power and “Green” Claims

  • Multiple comments discuss “green” datacenters, hydro-heavy regions (Pacific Northwest, Quebec, Iceland), and future solar+battery-powered datacenters.
  • There is debate over whether buying “green” power actually reduces global CO₂, versus needing strong regulation.
  • Large AI clusters are estimated to consume on the order of tens to hundreds of MW; renewables plus large batteries are argued by some to be already cost-competitive at that scale.
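
The "tens to hundreds of MW" range can be sanity-checked with a rough calculation. This is only a sketch: the 700 W figure is the published maximum TDP of the H100 SXM, while the `server_overhead` factor (CPUs, NICs, fans, storage) and the `pue` value are illustrative assumptions that do not come from the discussion.

```python
# Rough facility power estimate for an H100 cluster.
H100_SXM_TDP_WATTS = 700  # published max TDP for the H100 SXM module


def cluster_power_mw(num_gpus, server_overhead=1.5, pue=1.3):
    """Estimated facility power draw in megawatts.

    server_overhead: assumed multiplier for non-GPU server and network power.
    pue: assumed power usage effectiveness (cooling and distribution losses).
    Both defaults are illustrative guesses, not measured values.
    """
    it_watts = num_gpus * H100_SXM_TDP_WATTS * server_overhead
    return it_watts * pue / 1e6


if __name__ == "__main__":
    for gpus in (256, 25_000, 100_000):
        print(f"{gpus:>7,} GPUs: ~{cluster_power_mw(gpus):.1f} MW")
```

Under these assumptions a 256-GPU cluster draws well under 1 MW, while clusters in the tens of thousands of GPUs land in the tens to low hundreds of MW.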