Were RNNs all we needed?
RNN variant and main technical idea
- Paper studies simplified GRU/LSTM-style RNNs (“minGRU”/“minLSTM”) whose gates depend only on the current input, enabling a parallel scan over sequences during training (a minimal sketch follows this list).
- The hidden state is not removed; rather, the gate and candidate computations no longer read the previous hidden state, which keeps the recurrence linear in the previous state and therefore parallelizable.
- Implementation tricks include operating in log-space for numerical stability.
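A minimal sketch of that recurrence, assuming the structure described above (PyTorch, with illustrative layer names and sizes; this is not the authors' code):

```python
# Sketch of a minGRU-style layer: gates and candidate state depend only on the
# current input, so the step h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde_t is a
# linear recurrence h_t = a_t * h_{t-1} + b_t that needs no sequential loop.
import torch
import torch.nn as nn

class MinGRU(nn.Module):
    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.to_z = nn.Linear(d_in, d_hidden)   # update gate from the current input only
        self.to_h = nn.Linear(d_in, d_hidden)   # candidate state from the current input only

    def forward(self, x: torch.Tensor, h0: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_in), h0: (batch, d_hidden)
        z = torch.sigmoid(self.to_z(x))         # update gate in (0, 1)
        h_tilde = self.to_h(x)                  # candidate hidden state
        a = 1.0 - z                             # per-step decay coefficient
        b = z * h_tilde                         # per-step input contribution

        # Closed form of h_t = a_t * h_{t-1} + b_t via prefix products:
        #   A_t = prod_{k<=t} a_k,   h_t = A_t * h0 + A_t * sum_{k<=t} b_k / A_k
        # Both cumprod and cumsum run over the whole sequence at once; the
        # division by A is the numerically fragile part that the log-space
        # trick mentioned above is meant to stabilize.
        A = torch.cumprod(a, dim=1)
        h = A * (h0.unsqueeze(1) + torch.cumsum(b / A, dim=1))
        return h                                # (batch, seq_len, d_hidden)

# Usage sketch:
# layer = MinGRU(d_in=64, d_hidden=128)
# h = layer(torch.randn(8, 512, 64), h0=torch.zeros(8, 128))
```

Because neither `a` nor `b` depends on earlier hidden states, training can use cumulative ops (or an associative scan) instead of a timestep-by-timestep loop, which is where the reported training speedups come from.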
Comparison to Transformers, SSMs, and Mamba-like models
- Experiments show small RNNs can train much faster than transformers (e.g., large speedups for length-512 sequences) with competitive accuracy at that scale.
- Several note this echoes earlier work on parallelizable RNNs (QRNN, SRU) and more recent state-space models / Mamba.
- Some argue the results should be framed as “competitive performance, confirmed only at small scale”; prior SSM/Mamba-style models have sometimes degraded at larger model sizes or longer contexts.
Context length, memory, and recall
- A central dispute: transformers can re-attend to all past tokens, while RNNs compress everything into a fixed-size hidden state (see the back-of-envelope memory sketch after this list).
- Critics doubt RNNs can match transformer-style recall on tasks like translating long documents or multi-turn chat.
- Counterarguments:
  - Hidden state can be enlarged, and high-dimensional spaces can store rich summaries.
  - Multiple passes, explicit “note buffers,” or hybrid attention + RNN architectures could mitigate recall limits.
  - Transformers themselves have a finite context window, so their effective state is also bounded in practice.
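A back-of-envelope illustration of why the recall question comes up (the shapes below are assumptions for illustration, not figures from the paper or the thread): at generation time a transformer's KV cache grows with every token it may later re-attend to, while a recurrent model carries a state whose size is independent of how much it has read.

```python
# Rough memory accounting, illustrative shapes only.
def kv_cache_floats(n_tokens: int, n_layers: int, n_heads: int, d_head: int) -> int:
    # One key and one value vector per token, per head, per layer.
    return 2 * n_tokens * n_layers * n_heads * d_head

def rnn_state_floats(n_layers: int, d_hidden: int) -> int:
    # Fixed-size hidden state per layer, independent of tokens seen so far.
    return n_layers * d_hidden

print(kv_cache_floats(n_tokens=100_000, n_layers=32, n_heads=32, d_head=128))  # grows linearly with context
print(rnn_state_floats(n_layers=32, d_hidden=4096))                            # constant
```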
Expressiveness vs efficiency and “curve fitting”
- Many emphasize that multiple architectures (RNNs, transformers, SSMs, big MLPs) are universal or near-universal approximators; architecture choice is mostly about data, compute, and training efficiency.
- Others argue architecture and inductive bias matter greatly in practice, analogous to algorithmic complexity (e.g., “bogosort vs quicksort”).
- Faster convergence and lower resource use are seen as the key wins, even if different architectures eventually reach similar performance given enough data and compute.
Training stability, long horizons, and scaling
- Discussion revisits vanishing/exploding gradients in RNNs; gated units (LSTM/GRU) alleviate the problem but don't fully solve it at very long contexts (the standard Jacobian-product argument is sketched after this list).
- Some expect recurrent or neuromorphic-like designs to be essential for truly long-horizon, efficient intelligence; others think very large context windows and better tooling may suffice.
- Several note that real breakthroughs may require new training objectives or optimizers beyond backprop.
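For reference, the standard backprop-through-time argument behind the first point: in a plain RNN with $h_k = \phi(W h_{k-1} + U x_k)$, the gradient reaching an early timestep is a product of per-step Jacobians,

$$
\frac{\partial h_T}{\partial h_t} \;=\; \prod_{k=t+1}^{T} \operatorname{diag}\!\big(\phi'(W h_{k-1} + U x_k)\big)\, W,
$$

whose norm tends to shrink or grow roughly geometrically in $T - t$. Gated units add near-identity update paths that slow this decay but do not remove it, which is why very long contexts remain difficult.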
Meta: citations, review, and hype
- One contributor highlights that very similar architectures were published years ago, and laments weak citation practices.
- There is skepticism about conference peer review quality and a desire for clearer “preprint” labeling.
- Overall tone: cautiously optimistic about RNNs-as-competitive-alternative, but unconvinced they are “all we need” without larger-scale evidence.