The FFT Strikes Back: An Efficient Alternative to Self-Attention

High-level idea and intuition

  • Core mechanism: take the token embeddings, apply an FFT along the sequence (token) dimension, multiply by a learned, input-dependent complex filter (via an MLP + bias), apply a complex activation (e.g. modReLU), then inverse FFT back.
  • This effectively performs a global convolution over the sequence by using the convolution theorem: convolutions in “token space” become elementwise multiplications in “frequency space.”
  • Several commenters find the mechanism conceptually elegant and simpler than many attention variants, even if the math looks intimidating.
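The pipeline described above can be sketched in a few lines of numpy. This is an illustrative reconstruction, not the paper's implementation: the filter here is a fixed random tensor, whereas the paper derives it from the input via an MLP plus a bias, and all names (`spectral_mix`, `w_filter`, `b_act`) are made up for the sketch.

```python
import numpy as np

def modrelu(z, b):
    """modReLU: ReLU on the magnitude with a bias b, phase preserved."""
    mag = np.abs(z)
    return np.maximum(mag + b, 0.0) * np.exp(1j * np.angle(z))

def spectral_mix(x, w_filter, b_act):
    """FFT along the sequence axis, elementwise complex filter,
    complex activation, inverse FFT back to token space."""
    X = np.fft.fft(x, axis=0)            # mix along the token dimension
    X = X * w_filter                     # learned spectral filter (stand-in here)
    X = modrelu(X, b_act)                # complex-domain nonlinearity
    return np.fft.ifft(X, axis=0).real  # return to token space, keep real part

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))                                   # (seq_len, d_model)
w = rng.standard_normal((8, 4)) + 1j * rng.standard_normal((8, 4))
y = spectral_mix(x, w, b_act=-0.1)
print(y.shape)  # (8, 4)
```

Every token's output depends on every input token through the transform, which is what makes this a global mixer despite containing no pairwise attention scores.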

Relation to self-attention and other architectures

  • It is not strictly equivalent to self-attention; it trades off exact pairwise interactions for a global spectral mixing that may capture many of the same long-range relationships.
  • Viewed as another “token mixer,” akin to convolutions or Fourier Neural Operators, rather than a drop-in conceptual match for attention.
  • Comparisons drawn to FNet (fixed Fourier mixing) and Hyena / Fourier Neural Operators, with this work adding data-dependent, learnable spectral filters and nonlinearities in the complex domain.
  • Some discussion of Mamba: different paradigm (state-space / recurrent-like) with O(n) training, serving different use cases.
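The convolution-theorem framing behind the "global convolution" view is easy to check numerically: elementwise multiplication in frequency space is exactly circular convolution in token space.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 16
x = rng.standard_normal(n)   # a 1-D "sequence"
h = rng.standard_normal(n)   # a global, length-n filter

# Frequency-domain route: transform, multiply elementwise, transform back.
spectral = np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real

# Token-space route: explicit circular convolution.
direct = np.array(
    [sum(x[j] * h[(i - j) % n] for j in range(n)) for i in range(n)]
)

assert np.allclose(spectral, direct)
```

The wrap-around indexing (`(i - j) % n`) is the catch: a plain FFT implements *circular* convolution, which is one reason causal masking in the frequency domain (discussed below) is a nontrivial question.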

Complexity, efficiency, and hardware considerations

  • FFT-based mixing gives O(n log n) time vs O(n²) for standard attention, at least asymptotically; FlashAttention reduces attention's memory footprint from O(n²) to O(n) but leaves the O(n²) time complexity intact.
  • Real-world efficiency depends heavily on hardware: matrix multiplication is extremely optimized on TPUs/GPUs, while FFT support can be weaker; prior FNet results showed FFT slower than matmul on TPUs for shorter sequences.
  • Complex numbers are handled as pairs of real tensors on GPUs; numerical stability is generally considered acceptable, since the (orthonormally scaled) FFT is unitary and preserves norms.
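Both points in the last bullet can be demonstrated directly: with the orthonormal convention the FFT preserves the L2 norm (so repeated forward/inverse transforms neither blow up nor shrink activations), and a complex tensor can be stored losslessly as a pair of real tensors, as GPU frameworks commonly do. The stacking layout below is one common choice, not a claim about any particular framework's internals.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(1024)

# Unitarity: the orthonormally scaled FFT preserves the L2 norm (Parseval).
X = np.fft.fft(x, norm="ortho")
assert np.isclose(np.linalg.norm(x), np.linalg.norm(X))

# Complex-as-pairs: store (real, imag) as a trailing axis of a real array.
pair = np.stack([X.real, X.imag], axis=-1)   # shape (1024, 2), real dtype
recovered = pair[..., 0] + 1j * pair[..., 1]
assert np.allclose(recovered, X)
```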

Questions, skepticism, and benchmarks

  • Concerns about how to handle causal masking and positional information in the frequency domain; details are unclear or absent to some readers.
  • For language, several are skeptical: text isn’t obviously “periodic,” and smoothing via spectral mixing might miss sharp, discrete effects (like a single “not” flipping meaning).
  • Reported results beat the authors’ own baselines on LRA (Long Range Arena) but are far from current SOTA (e.g. S5), raising suspicions of weak baselines / “sandbagging.”

Prior work and literature recycling

  • Multiple comments note substantial prior art: FNet, Hyena, adaptive Fourier neural operators, FFT-based token mixers, etc.
  • Some frustration that similar ideas from years ago are being rediscovered without thorough literature integration, though others see value in revisiting older ideas with modern baselines and tooling.

Broader perspectives and open questions

  • Interest in variants: wavelets, learned transforms, finite-group Fourier transforms, or Walsh–Hadamard transforms.
  • Speculation that FFT-based mixers could enable ultra-long contexts if they integrate cleanly with existing inference engines and masking schemes.
  • Overall: enthusiasm for the elegance and potential scaling benefits, tempered by doubts about practical gains for large LLMs and incomplete empirical comparisons.
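Of the variants mentioned, the Walsh–Hadamard transform is the simplest to sketch: it mixes tokens globally in O(n log n) using only real additions and subtractions, sidestepping complex arithmetic entirely. A minimal in-place implementation (for power-of-two lengths; the function name is illustrative):

```python
import numpy as np

def walsh_hadamard(x):
    """Fast Walsh–Hadamard transform of a length-2^k vector.
    O(n log n) butterfly passes, real arithmetic only."""
    x = np.asarray(x, dtype=float).copy()
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b        # butterfly: sums
            x[i + h:i + 2 * h] = a - b  # butterfly: differences
        h *= 2
    return x

# The transform is self-inverse up to a factor of n.
v = np.array([1.0, 2.0, 3.0, 4.0])
assert np.allclose(walsh_hadamard(walsh_hadamard(v)), 4 * v)
```

Whether such real-valued transforms retain the expressiveness of learned complex spectral filters is exactly the kind of open question raised in the thread.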