We bought the whole GPU, so we're damn well going to use the whole GPU

Hardware-specific optimization & historical parallels

  • Several comments relate the work to console programming and the demoscene: when hardware is fixed and known, extreme efficiencies become possible.
  • Others note that even consoles are now heterogeneous (multiple SKUs, docked/undocked modes), so truly “coding to the metal” is rare outside demos and niche environments.
  • Historical examples (BeOS, early PlayStation, Itanium, dual-CPU BeBox) are cited as proof that hardware can be driven much harder—but that users usually prefer software ecosystems and portability over maximal efficiency.

Cost, skills, and practicality

  • Many emphasize that in commercial settings it is usually cheaper to ship “fast-enough” code and lean on compilers than to hyper-optimize by hand.
  • There is a skills bottleneck: people who deeply understand CUDA and modern ML architectures are rare, and they face many competing high-impact tasks.
  • One person with game-optimization experience notes that “just get it done” code tends to become very expensive to fix later, prompting internal performance education efforts.

Compilers, AI, and “functionally equivalent” optimization

  • Some hope that future AI tools will automatically optimize code, turning performance tuning into a reinforcement-learning problem (same behavior, faster runtime).
  • Others push back that verifying true functional equivalence is hard, especially in languages with undefined behavior, and that even advanced compiler optimizations like automatic vectorization remain challenging (the sketch below shows one reason why).
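
A rough sketch of why certifying “same behavior, just faster” is hard (illustrative code, not from the thread; the function names are made up, and it compiles as plain C++ or CUDA host code): the two loops below compute the same result only if the pointers never overlap, and that is exactly the kind of side condition an optimizer (human, compiler, or RL agent) must establish before it may vectorize.

    // Without more information the compiler must assume dst and src may alias,
    // so it cannot blindly rewrite this loop with wide SIMD loads and stores.
    void scale_add(float* dst, const float* src, float k, int n) {
        for (int i = 0; i < n; ++i)
            dst[i] = dst[i] + k * src[i];  // a store to dst[i] could change a later src[j]
    }

    // __restrict__ promises "no aliasing", which unlocks vectorization, but now
    // someone has to prove that the promise holds for every caller.
    void scale_add_noalias(float* __restrict__ dst, const float* __restrict__ src,
                           float k, int n) {
        for (int i = 0; i < n; ++i)
            dst[i] = dst[i] + k * src[i];
    }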

GPU sharing, MIG, and security

  • Discussion covers NVIDIA’s MIG and MPS as ways to slice a GPU or share it across processes.
  • Opinions differ on how useful MIG is: some call it “weak” and awkward, while HPC operators report that it works well in practice for subdividing big GPUs into smaller, isolated instances (a sketch of how a slice appears to a CUDA process follows this list).
  • On security, participants say cross-tenant leakage on shared GPUs is “very real” in general, but the specific risk for MIG isolation is described as currently low/unclear, with no widely known breakouts.
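
For context (a minimal sketch, not code from the discussion): once a MIG instance is exposed to a process, for example by pointing CUDA_VISIBLE_DEVICES at the instance's MIG UUID as reported by nvidia-smi -L, it appears to the CUDA runtime as an ordinary, smaller device.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Minimal probe: from inside a job, a MIG slice looks like any other device,
    // just with fewer SMs and less memory. Run with CUDA_VISIBLE_DEVICES set to a
    // slice's MIG UUID to see a single isolated instance.
    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            std::printf("device %d: %s, %d SMs, %.1f GiB\n", i, prop.name,
                        prop.multiProcessorCount,
                        prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        }
        return 0;
    }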

CUDA moat, custom kernels, and abstraction losses

  • The article is praised for showing how much performance generic frameworks leave on the table, especially via “megakernel” approaches tightly tuned to a specific model and chip (the toy fusion sketch after this list gives the flavor).
  • Several note this is exactly why CUDA is such a moat: vendor libraries and generic kernels trade performance for generality, and replicating that stack elsewhere (e.g., AMD) is nontrivial.
  • A few readers are surprised that this much low-hanging optimization is still being discovered in 2025, but others explain that rapid architectural change makes it rational not to chase the last few percent of performance everywhere.
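
To give a flavor of what the “megakernel” bullets are pointing at (a toy sketch only; the article's megakernel fuses an entire forward pass and is far more elaborate): the unfused version below launches two kernels and round-trips every element through global memory, while the fused one does the same arithmetic in a single pass.

    // Unfused: two launches; x is written to and re-read from global memory in between.
    __global__ void add_bias(float* x, const float* b, int n, int hidden) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += b[i % hidden];
    }
    __global__ void relu(float* x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = fmaxf(x[i], 0.0f);
    }

    // Fused: one launch, one read and one write per element. Same math, less memory
    // traffic and launch overhead; megakernels push this idea across an entire model.
    __global__ void add_bias_relu(float* x, const float* b, int n, int hidden) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = fmaxf(x[i] + b[i % hidden], 0.0f);
    }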

Miscellaneous reactions

  • Some appreciate the author’s honesty about the fragility of the research code.
  • There is mild criticism of the writing style as dense or overwrought, while still acknowledging the technical value.
  • A side thread explores how much consumer GPUs could do for non-graphics signal processing (e.g., audio) if the tooling and drivers were more open and accessible; a toy kernel in that spirit follows.
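
In the spirit of that side thread (a minimal sketch, not code from the discussion): mapping one thread to one output sample makes even a naive FIR filter trivially parallel; the obstacles the commenters raise are latency, tooling, and driver access rather than the arithmetic itself.

    // Naive causal FIR filter: one thread per output sample. Illustrative only;
    // a real implementation would stage the input and taps in shared memory.
    __global__ void fir(const float* in, float* out, const float* taps,
                        int ntaps, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float acc = 0.0f;
        for (int k = 0; k < ntaps && k <= i; ++k)
            acc += taps[k] * in[i - k];
        out[i] = acc;
    }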