Surprisingly fast AI-generated kernels we didn't mean to publish yet

Fixed-size kernels and PyTorch as a baseline

  • Some note the experiment seems to assume fixed input sizes; others explain PyTorch already uses multiple specialized kernels and tiling, but not for every possible shape.
  • A few suspect the speedups may reflect PyTorch choosing a suboptimal kernel for that exact shape, not fundamental superiority of the AI-generated code.
  • Others point out that beating generic framework kernels on a single fixed configuration has long been feasible.
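The shape-specialization point can be made concrete with a toy dispatcher. This is a hypothetical sketch, not PyTorch's actual machinery: a framework routes among a finite set of pre-tuned kernels, so a kernel hand-specialized (or search-specialized) for one exact shape can beat the generic path on that shape alone.

```python
# Hypothetical sketch of shape-based kernel selection. A generic kernel
# is correct for any shape but tuned for none; a kernel specialized for
# a single known shape can exploit that knowledge (here, full unrolling
# via fixed bounds), which is why beating a generic library on one fixed
# configuration has long been feasible.

def generic_matmul(a, b):
    # Generic triple loop: handles arbitrary conforming shapes.
    n, k, m = len(a), len(a[0]), len(b[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for p in range(k):
                s += a[i][p] * b[p][j]
            out[i][j] = s
    return out

def specialized_matmul_4x4(a, b):
    # Specialized for the single shape (4, 4) x (4, 4): fixed trip
    # counts, the kind of tuning that only pays off when the shape
    # is known ahead of time.
    return [[sum(a[i][p] * b[p][j] for p in range(4)) for j in range(4)]
            for i in range(4)]

def dispatch(a, b):
    # Miniature version of a framework's kernel-selection heuristic:
    # route to the specialized kernel only for its exact shape.
    if len(a) == len(a[0]) == len(b) == len(b[0]) == 4:
        return specialized_matmul_4x4(a, b)
    return generic_matmul(a, b)
```

If the heuristic picks a suboptimal kernel for some shape, a per-shape search wins there without being fundamentally better anywhere else, which is the skeptics' reading of the reported speedups.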

Numerical precision, correctness, and evaluation

  • Several comments focus on the 1e-2 tolerance used to validate the FP32 kernels, arguing it is loose enough to admit FP16-like rounding behavior, which makes FP32-vs-FP32 comparisons misleading.
  • One user reports large mean squared error (~0.056) and slower performance than PyTorch on their RTX 3060M, suggesting results are hardware- and workload-dependent.
  • There is concern that random-input testing, unlike formal verification, risks kernels that pass every sampled test yet are wrong on untested inputs; this is contrasted with work that proves algebraic correctness.
  • Some kernels (e.g., LayerNorm) were initially found to be numerically unstable and were later regenerated.
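The tolerance objection is easy to demonstrate. The sketch below (stdlib only; the `'e'` struct format is IEEE half precision) rounds a value through FP16 and shows that the resulting error, while real, passes comfortably under a 1e-2 absolute tolerance:

```python
import struct

def to_fp16(x):
    # Round-trip a float through IEEE half precision
    # ('e' struct format, available since Python 3.6).
    return struct.unpack('e', struct.pack('e', x))[0]

# FP16 has a 10-bit mantissa (~3 decimal digits). Near 1.0 each value
# carries absolute rounding error up to ~5e-4, and accumulated sums
# drift further, so a 1e-2 tolerance on "FP32" outputs accepts results
# no more accurate than half precision.
ref = 0.1234567        # reference value, not exactly representable in FP16
half = to_fp16(ref)    # half-precision approximation
err = abs(ref - half)
assert err > 0         # precision was genuinely lost...
assert err < 1e-2      # ...yet it sails under the 1e-2 tolerance
```

This is why commenters argue the comparison should either tighten the tolerance to FP32 levels (~1e-5 relative) or benchmark against FP16 kernels instead.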

Novelty vs existing optimization techniques

  • Multiple commenters argue there is nothing obviously novel in the example kernels; similar gains have been achieved for years via ML-guided scheduling (e.g., Halide, TVM) and vendor libraries.
  • Others emphasize that NVIDIA/PyTorch FP32 kernels are relatively neglected and that AI may just be porting known FP16/BF16 tricks.
  • Skeptics stress that “beating heavily optimized libraries” often ignores kernel-selection heuristics and real-world constraints (alignment, stability, accuracy).
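The "nothing obviously novel" argument rests on how autotuners like Halide's and TVM's already work: enumerate candidate schedules and keep the best under a measured or learned cost model. The toy below keeps only the search skeleton, with an entirely hypothetical analytical cost model standing in for on-hardware measurement:

```python
# Toy autotuner in the Halide/TVM mold: search candidate tile sizes,
# keep the cheapest under a cost model. Real systems measure on
# hardware or learn the model with ML; the loop structure is the same.

def cost(tile, n=1024, cache_lines=512):
    # Hypothetical cost model: reject tiles whose working set
    # overflows the "cache"; penalize tiny tiles via loop overhead.
    if tile * tile > cache_lines * 8:
        return float('inf')           # working set too large
    traffic = (n / tile) * n * 2      # coarse memory-traffic estimate
    overhead = (n / tile) ** 2        # per-tile loop overhead
    return traffic + overhead

def autotune(candidates):
    # Exhaustive search; real autotuners use evolutionary or
    # model-guided search over far larger spaces.
    return min(candidates, key=cost)

best_tile = autotune([4, 8, 16, 32, 64, 128])
```

An LLM proposing code rewrites is, in this framing, a richer mutation operator plugged into the same search-and-measure loop.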

Hardware, microarchitecture, and optimality

  • Discussion of NVIDIA’s poorly documented microarchitecture: opacity may make AI-guided exploratory search particularly effective, since search can discover behavior the documentation never exposes.
  • Counterpoint: even with perfect documentation, globally optimal scheduling and register allocation are combinatorially hard; compilers don’t attempt fully optimal code because compile-time budgets forbid it.
  • Some note that certain operations (e.g., matrix multiply on tensor cores) are already near hardware limits, leaving limited headroom.
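The "limited headroom" claim follows from roofline arithmetic: a large matmul has high arithmetic intensity, so a tensor-core implementation is compute-bound and already near the ALU ceiling. The figures below are illustrative datasheet-style numbers (roughly A100-class), not measurements:

```python
# Back-of-envelope roofline check. If arithmetic intensity (FLOP/byte)
# exceeds the machine balance (peak FLOP/s divided by memory bandwidth),
# the kernel is compute-bound and the only remaining headroom is ALU
# utilization, which tensor-core GEMMs already push close to peak.

peak_flops = 312e12     # FP16 tensor-core peak, FLOP/s (illustrative)
mem_bw = 1.5e12         # HBM bandwidth, bytes/s (illustrative)

n = 4096                          # square matmul dimension
flops = 2 * n ** 3                # one multiply + one add per term
bytes_moved = 3 * n * n * 2       # read A and B, write C, FP16 elements

intensity = flops / bytes_moved         # ~1365 FLOP/byte
machine_balance = peak_flops / mem_bw   # ~208 FLOP/byte
assert intensity > machine_balance      # compute-bound: near the ceiling
```

By contrast, memory-bound elementwise and normalization kernels sit far below the balance point, which is where the reported speedups concentrate.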

Implications for AI capabilities and “self-improvement”

  • One camp sees this, AlphaEvolve, and o3-based bug-finding as evidence that recent models plus automated search cross a new capability threshold.
  • Others say it’s closer to genetic programming with a strong mutation operator and a clear objective; not direct evidence of broad recursive self-improvement.
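The "genetic programming with a strong mutation operator" framing can be shown in miniature. In the sketch below, Gaussian noise stands in for the LLM's code rewrites and a fixed objective stands in for the kernel benchmark; everything here is a hypothetical stand-in:

```python
import random

# Minimal evolutionary search: keep a population, mutate, select the
# fittest. In the kernel setting, "mutate" would be an LLM rewriting
# code and "fitness" a benchmark run; the outer loop is unchanged.

random.seed(0)

def fitness(x):
    # Hypothetical objective to maximize; the optimum is at x = 3.
    return -(x - 3.0) ** 2

def mutate(x):
    # Stand-in mutation operator: small random perturbation.
    return x + random.gauss(0, 0.5)

def evolve(pop, generations=200):
    for _ in range(generations):
        children = [mutate(p) for p in pop]
        # Elitist selection: parents compete with children,
        # so the best candidate found so far is never lost.
        pop = sorted(pop + children, key=fitness, reverse=True)[:len(pop)]
    return pop[0]

best = evolve([random.uniform(-10, 10) for _ in range(8)])
```

The skeptical point is that nothing in this loop is self-referential: the objective is fixed and external, so success here is evidence of effective search, not of a system improving its own ability to improve.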

Agent methodology and parallel LLM usage

  • Commenters highlight the interesting use of many short-lived “agents” in parallel, each exploring variants with an explicit reasoning step rather than pure low-level hill climbing.
  • This is contrasted with typical “one long-lived agent” patterns; some see fan-out/fan-in task graphs as a more natural fit for LLMs, though merging results is costly and lossy.
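The fan-out/fan-in pattern the commenters describe can be sketched with a thread pool. Each "agent" here is a placeholder function scoring one idea; in the original work it would be a short-lived LLM session that reasons about a variant, rewrites the kernel, and benchmarks it. All names and the scoring rule are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

# Fan-out/fan-in task graph: spawn many independent short-lived
# explorers, then merge at a single fan-in point. The merge here is a
# trivial argmax; the comments note that merging richer artifacts
# (e.g., combining insights from several variants) is costly and lossy.

def explore_variant(idea):
    # Stand-in for one agent: reason about a single optimization idea,
    # produce an artifact, and score it. Scoring rule is a placeholder.
    score = len(idea)
    return score, f"kernel using {idea}"

ideas = ["vectorize loads", "swap loop order", "use shared memory tiles"]

with ThreadPoolExecutor(max_workers=len(ideas)) as pool:
    results = list(pool.map(explore_variant, ideas))   # fan-out

best_score, best_kernel = max(results)                 # fan-in
```

Because each exploration is independent, this parallelizes trivially, unlike a single long-lived agent whose steps must be sequential over one growing context.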

LLMs, reasoning, and “understanding” (meta-discussion)

  • Extended debate over whether LLMs “reason” or “understand,” or merely approximate patterns well enough to pass tests.
  • Some argue behaviorally they meet practical notions of understanding; others insist that anthropomorphic language obscures real limits, especially under novel conditions or strict logical demands.