Were RNNs All We Needed?

RNN variant and main technical idea

  • The paper studies simplified GRU- and LSTM-style RNNs ("minGRU", "minLSTM") whose gates depend only on the current input, which makes the recurrence linear in the hidden state and enables a parallel scan over the sequence during training.
  • The hidden state is not removed; only its role in the gates is constrained, precisely to keep the recurrence linear and parallelizable.
  • Implementation tricks include evaluating the scan in log-space for numerical stability (see the sketch after this list).
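
As a concrete illustration, here is a minimal sketch of the minGRU recurrence, assuming PyTorch and using hypothetical layer names ("to_z", "to_h"). It is not the authors' reference implementation, which evaluates the same scan in log-space:

```python
import torch
import torch.nn as nn

class MinGRU(nn.Module):
    """minGRU-style cell: the gate z_t depends only on the current input,
    so h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde_t is linear in h
    and can be computed for all t at once with a scan."""

    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.to_z = nn.Linear(d_in, d_hidden)  # gate, from input only
        self.to_h = nn.Linear(d_in, d_hidden)  # candidate state, from input only

    def forward(self, x: torch.Tensor, h0: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_in), h0: (batch, d_hidden)
        z = torch.sigmoid(self.to_z(x))        # gates in (0, 1)
        a = 1.0 - z                            # per-step decay coefficients
        b = z * self.to_h(x)                   # per-step input contributions
        # Unrolling h_t = a_t * h_{t-1} + b_t gives the closed form
        #   h_t = A_t * (h0 + sum_{j<=t} b_j / A_j),  A_t = prod_{k<=t} a_k,
        # so the whole sequence reduces to cumulative products and sums.
        A = torch.cumprod(a, dim=1)
        h = A * (h0.unsqueeze(1) + torch.cumsum(b / A, dim=1))
        return h                               # hidden states, (batch, time, d_hidden)
```

The division by A underflows on long sequences, which is exactly why the paper works in log-space (a logcumsumexp-style scan) instead; on short inputs this closed form agrees with the sequential loop.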

Comparison to Transformers, SSMs, and Mamba-like models

  • Reported experiments show these small RNNs training much faster than transformers (e.g., large speedups on length-512 sequences) while reaching competitive accuracy at that scale.
  • Several commenters note that this echoes earlier work on parallelizable RNNs (QRNN, SRU) as well as more recent state-space models and Mamba.
  • Some argue the results should be phrased as "competitive performance confirmed only at small scale," since past SSM/Mamba-style models have sometimes degraded at larger sizes or longer contexts.

Context length, memory, and recall

  • A central dispute: transformers can re-attend to all past tokens, while RNNs compress everything into a fixed-size hidden state.
  • Critics doubt RNNs can match transformer-style recall on tasks like translating long documents or multi-turn chat.
  • Counterarguments:
    • Hidden state can be enlarged, and high-dimensional spaces can store rich summaries.
    • Multiple passes, explicit “note buffers,” or hybrid attention + RNN architectures could mitigate recall limits.
    • Transformers themselves have finite, effectively bounded state (a rough size comparison follows this list).
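
To put rough numbers on the "bounded state" point, a back-of-the-envelope comparison; the model dimensions are hypothetical, chosen to resemble a mid-sized GPT-style transformer, and none of these figures come from the paper or the thread:

```python
# Transformer KV cache grows linearly with context; an RNN hidden state does not.
n_layers, n_heads, d_head = 32, 32, 128    # hypothetical model shape
d_model = n_heads * d_head                 # 4096
context = 128_000                          # tokens kept attendable
bytes_per = 2                              # fp16

kv_cache = 2 * n_layers * context * d_model * bytes_per  # keys + values
rnn_state = n_layers * d_model * bytes_per               # one state vector per layer

print(f"KV cache : {kv_cache / 1e9:.1f} GB")   # ~67.1 GB, grows with context
print(f"RNN state: {rnn_state / 1e6:.2f} MB")  # ~0.26 MB, constant in context
```

The asymmetry cuts both ways: the RNN's constant-size state is what critics mean by lossy compression, and the transformer's linear growth is what counterarguments mean by "effectively bounded" in practice.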

Expressiveness vs efficiency and “curve fitting”

  • Many emphasize that multiple architectures (RNNs, transformers, SSMs, big MLPs) are universal or near-universal approximators; architecture choice is mostly about data, compute, and training efficiency.
  • Others argue architecture and inductive bias matter greatly in practice, analogous to algorithmic complexity (e.g., “bogosort vs quicksort”).
  • Faster convergence and lower resource use are seen as key wins, even if asymptotic performance converges.

Training stability, long horizons, and scaling

  • Discussion revisits vanishing/exploding gradients in RNNs; gated units (LSTM/GRU) alleviate the problem but do not fully solve it at very long contexts (a toy calculation follows this list).
  • Some expect recurrent or neuromorphic-like designs to be essential for truly long-horizon, efficient intelligence; others think very large context windows and better tooling may suffice.
  • Several note that real breakthroughs may require new training objectives or optimizers beyond backprop.
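
A toy calculation makes the gradient point concrete: the gradient of h_T with respect to h_0 is a product of per-step Jacobian factors, which for a gated linear recurrence like minGRU is (1 - z_t) per coordinate. The gate value is held fixed here purely for clarity:

```python
# Geometric decay of gradients over a horizon of T steps.
for z in (0.5, 0.1, 0.01):        # gate value, held fixed (hypothetical)
    for T in (10, 100, 1000):     # horizon length
        grad = (1.0 - z) ** T     # |dh_T/dh_0| along one coordinate
        print(f"z={z:<5} T={T:<5} grad={grad:.3e}")
```

With z = 0.5 the signal drops below 1e-9 after about 30 steps; only gates saturated near zero (z ≈ 0.01) carry gradient across 1000 steps. This is why gating helps at long contexts but does not fully solve the problem.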

Meta: citations, review, and hype

  • One contributor highlights that very similar architectures were published years ago and laments weak citation practices.
  • There is skepticism about conference peer review quality and a desire for clearer “preprint” labeling.
  • Overall tone: cautiously optimistic about RNNs-as-competitive-alternative, but unconvinced they are “all we need” without larger-scale evidence.