Surprisingly fast AI-generated kernels we didn't mean to publish yet

Fixed-size kernels and PyTorch as a baseline

  • Some note the experiment seems to assume fixed input sizes; others explain PyTorch already uses multiple specialized kernels and tiling, but not for every possible shape.
  • A few suspect the speedups may reflect PyTorch choosing a suboptimal kernel for that exact shape, not fundamental superiority of the AI-generated code.
  • Others point out that beating generic framework kernels on a single fixed configuration has long been feasible.
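The shape-specialization point can be made concrete with a toy dispatcher. This is a hypothetical sketch, not PyTorch's actual machinery: a framework routes among a finite set of pre-tuned kernels, so a kernel hand-specialized (or search-specialized) for one exact shape can beat the generic path on that shape alone.

```python
# Hypothetical sketch of shape-based kernel selection. A generic kernel
# is correct for any shape but tuned for none; a kernel specialized for
# a single known shape can exploit that knowledge (here, full unrolling
# via fixed bounds), which is why beating a generic library on one fixed
# configuration has long been feasible.

def generic_matmul(a, b):
    # Generic triple loop: handles arbitrary conforming shapes.
    n, k, m = len(a), len(a[0]), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            out[i][j] = s
    return out

def specialized_matmul_4x4(a, b):
    # Specialized for the single shape (4, 4) x (4, 4): fixed trip
    # counts, the kind of tuning that only pays off when the shape
    # is known ahead of time.
    return [[sum(a[i][p] * b[p][j] for p in range(4)) for j in range(4)]
            for i in range(4)]

def dispatch(a, b):
    # Miniature version of a framework's kernel-selection heuristic:
    # route to the specialized kernel only for its exact shape.
    if len(a) == len(a[0]) == len(b) == len(b[0]) == 4:
        return specialized_matmul_4x4(a, b)
    return generic_matmul(a, b)
```

If the heuristic picks a suboptimal kernel for some shape, a per-shape search wins there without being fundamentally better anywhere else, which is the skeptics' reading of the reported speedups.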

Numerical precision, correctness, and evaluation

  • Several comments focus on the 1e-2 tolerance used to validate the FP32 kernels, arguing it is loose enough to admit FP16-like rounding behavior, which makes FP32-vs-FP32 comparisons misleading.
  • One user reports large mean squared error (~0.056) and slower performance than PyTorch on their RTX 3060M, suggesting results are hardware- and workload-dependent.
  • There is concern that random-input testing, unlike formal verification, risks kernels that pass every sampled test yet are wrong on untested inputs; this is contrasted with work that proves algebraic correctness.
  • Some kernels (e.g., LayerNorm) were initially found to be numerically unstable and were later regenerated.
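The tolerance objection is easy to demonstrate. The sketch below (stdlib only; the `'e'` struct format is IEEE half precision) rounds a value through FP16 and shows that the resulting error, while real, passes comfortably under a 1e-2 absolute tolerance:

```python
import struct

def to_fp16(x):
    # Round-trip a float through IEEE half precision
    # ('e' struct format, available since Python 3.6).
    return struct.unpack('e', struct.pack('e', x))[0]

# FP16 has a 10-bit mantissa (~3 decimal digits). Near 1.0 each value
# carries absolute rounding error up to ~5e-4, and accumulated sums
# drift further, so a 1e-2 tolerance on "FP32" outputs accepts results
# no more accurate than half precision.
ref = 0.1234567        # reference value, not exactly representable in FP16
half = to_fp16(ref)    # half-precision approximation
err = abs(ref - half)
assert err > 0         # precision was genuinely lost...
assert err < 1e-2      # ...yet it sails under the 1e-2 tolerance
```

This is why commenters argue the comparison should either tighten the tolerance to FP32 levels (~1e-5 relative) or benchmark against FP16 kernels instead.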

Novelty vs existing optimization techniques

  • Multiple commenters argue there is nothing obviously novel in the example kernels; similar gains have been achieved for years via ML-guided scheduling (e.g., Halide, TVM) and vendor libraries.
  • Others emphasize that NVIDIA/PyTorch FP32 kernels are relatively neglected and that AI may just be porting known FP16/BF16 tricks.
  • Skeptics stress that “beating heavily optimized libraries” often ignores kernel-selection heuristics and real-world constraints (alignment, stability, accuracy).
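The "nothing obviously novel" argument rests on how autotuners like Halide's and TVM's already work: enumerate candidate schedules and keep the best under a measured or learned cost model. The toy below keeps only the search skeleton, with an entirely hypothetical analytical cost model standing in for on-hardware measurement:

```python
# Toy autotuner in the Halide/TVM mold: search candidate tile sizes,
# keep the cheapest under a cost model. Real systems measure on
# hardware or learn the model with ML; the loop structure is the same.

def cost(tile, n=1024, cache_lines=512):
    # Hypothetical cost model: reject tiles whose working set
    # overflows the "cache"; penalize tiny tiles via loop overhead.
    if tile * tile > cache_lines * 8:
        return float('inf')           # working set too large
    traffic = (n / tile) * n * 2      # coarse memory-traffic estimate
    overhead = (n / tile) ** 2        # per-tile loop overhead
    return traffic + overhead

def autotune(candidates):
    # Exhaustive search; real autotuners use evolutionary or
    # model-guided search over far larger spaces.
    return min(candidates, key=cost)

best_tile = autotune([4, 8, 16, 32, 64, 128])
```

An LLM proposing code rewrites is, in this framing, a richer mutation operator plugged into the same search-and-measure loop.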

Hardware, microarchitecture, and optimality

  • Discussion of NVIDIA’s poorly documented microarchitecture: opacity may make AI-guided exploratory search particularly effective, since search can discover behavior the documentation never exposes.
  • Counterpoint: even with perfect documentation, globally optimal scheduling and register allocation are combinatorially hard; compilers don’t attempt fully optimal code because compile-time budgets forbid it.
  • Some note that certain operations (e.g., matrix multiply on tensor cores) are already near hardware limits, leaving limited headroom.
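The "limited headroom" claim follows from roofline arithmetic: a large matmul has high arithmetic intensity, so a tensor-core implementation is compute-bound and already near the ALU ceiling. The figures below are illustrative datasheet-style numbers (roughly A100-class), not measurements:

```python
# Back-of-envelope roofline check. If arithmetic intensity (FLOP/byte)
# exceeds the machine balance (peak FLOP/s divided by memory bandwidth),
# the kernel is compute-bound and the only remaining headroom is ALU
# utilization, which tensor-core GEMMs already push close to peak.

peak_flops = 312e12     # FP16 tensor-core peak, FLOP/s (illustrative)
mem_bw = 1.5e12         # HBM bandwidth, bytes/s (illustrative)

n = 4096                          # square matmul dimension
flops = 2 * n ** 3                # one multiply + one add per term
bytes_moved = 3 * n * n * 2       # read A and B, write C, FP16 elements

intensity = flops / bytes_moved         # ~1365 FLOP/byte
machine_balance = peak_flops / mem_bw   # ~208 FLOP/byte
assert intensity > machine_balance      # compute-bound: near the ceiling
```

By contrast, memory-bound elementwise and normalization kernels sit far below the balance point, which is where the reported speedups concentrate.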

Implications for AI capabilities and “self-improvement”

  • One camp sees this, AlphaEvolve, and o3-based bug-finding as evidence that recent models plus automated search cross a new capability threshold.
  • Others say it’s closer to genetic programming with a strong mutation operator and a clear objective; not direct evidence of broad recursive self-improvement.
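The "genetic programming with a strong mutation operator" framing can be shown in miniature. In the sketch below, Gaussian noise stands in for the LLM's code rewrites and a fixed objective stands in for the kernel benchmark; everything here is a hypothetical stand-in:

```python
import random

# Minimal evolutionary search: keep a population, mutate, select the
# fittest. In the kernel setting, "mutate" would be an LLM rewriting
# code and "fitness" a benchmark run; the outer loop is unchanged.

random.seed(0)

def fitness(x):
    # Hypothetical objective to maximize; the optimum is at x = 3.
    return -(x - 3.0) ** 2

def mutate(x):
    # Stand-in mutation operator: small random perturbation.
    return x + random.gauss(0, 0.5)

def evolve(pop, generations=200):
    for _ in range(generations):
        children = [mutate(p) for p in pop]
        # Elitist selection: parents compete with children,
        # so the best candidate found so far is never lost.
        pop = sorted(pop + children, key=fitness, reverse=True)[:len(pop)]
    return pop[0]

best = evolve([random.uniform(-10, 10) for _ in range(8)])
```

The skeptical point is that nothing in this loop is self-referential: the objective is fixed and external, so success here is evidence of effective search, not of a system improving its own ability to improve.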

Agent methodology and parallel LLM usage

  • Commenters highlight the interesting use of many short-lived “agents” in parallel, each exploring variants with an explicit reasoning step rather than pure low-level hill climbing.
  • This is contrasted with typical “one long-lived agent” patterns; some see fan-out/fan-in task graphs as a more natural fit for LLMs, though merging results is costly and lossy.
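The fan-out/fan-in pattern the commenters describe can be sketched with a thread pool. Each "agent" here is a placeholder function scoring one idea; in the original work it would be a short-lived LLM session that reasons about a variant, rewrites the kernel, and benchmarks it. All names and the scoring rule are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

# Fan-out/fan-in task graph: spawn many independent short-lived
# explorers, then merge at a single fan-in point. The merge here is a
# trivial argmax; the comments note that merging richer artifacts
# (e.g., combining insights from several variants) is costly and lossy.

def explore_variant(idea):
    # Stand-in for one agent: reason about a single optimization idea,
    # produce an artifact, and score it. Scoring rule is a placeholder.
    score = len(idea)
    return score, f"kernel using {idea}"

ideas = ["vectorize loads", "swap loop order", "use shared memory tiles"]

with ThreadPoolExecutor(max_workers=len(ideas)) as pool:
    results = list(pool.map(explore_variant, ideas))   # fan-out

best_score, best_kernel = max(results)                 # fan-in
```

Because each exploration is independent, this parallelizes trivially, unlike a single long-lived agent whose steps must be sequential over one growing context.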

LLMs, reasoning, and “understanding” (meta-discussion)

  • Extended debate over whether LLMs “reason” or “understand,” or merely approximate patterns well enough to pass tests.
  • Some argue behaviorally they meet practical notions of understanding; others insist that anthropomorphic language obscures real limits, especially under novel conditions or strict logical demands.