Were RNNs all we needed?
RNN variant and main technical idea
- Paper studies simplified GRU/LSTM-style RNNs (“minGRU”/“minLSTM”) whose gates depend only on the current input, enabling a parallel scan over sequences during training (a minimal sketch follows this list).
- The hidden state is not removed; rather, the gate and candidate computations no longer read the previous hidden state, which keeps the recurrence linear in the previous state and therefore parallelizable.
- Implementation tricks include operating in log-space for numerical stability.
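A minimal sketch of that recurrence, assuming the structure described above (PyTorch, with illustrative layer names and sizes; this is not the authors' code):

```python
# Sketch of a minGRU-style layer: gates and candidate state depend only on the
# current input, so the step h_t = (1 - z_t) * h_{t-1} + z_t * h_tilde_t is a
# linear recurrence h_t = a_t * h_{t-1} + b_t that needs no sequential loop.
import torch
import torch.nn as nn

class MinGRU(nn.Module):
    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.to_z = nn.Linear(d_in, d_hidden)   # update gate from the current input only
        self.to_h = nn.Linear(d_in, d_hidden)   # candidate state from the current input only

    def forward(self, x: torch.Tensor, h0: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_in), h0: (batch, d_hidden)
        z = torch.sigmoid(self.to_z(x))         # update gate in (0, 1)
        h_tilde = self.to_h(x)                  # candidate hidden state
        a = 1.0 - z                             # per-step decay coefficient
        b = z * h_tilde                         # per-step input contribution

        # Closed form of h_t = a_t * h_{t-1} + b_t via prefix products:
        #   A_t = prod_{k<=t} a_k,   h_t = A_t * h0 + A_t * sum_{k<=t} b_k / A_k
        # Both cumprod and cumsum run over the whole sequence at once; the
        # division by A is the numerically fragile part that the log-space
        # trick mentioned above is meant to stabilize.
        A = torch.cumprod(a, dim=1)
        h = A * (h0.unsqueeze(1) + torch.cumsum(b / A, dim=1))
        return h                                # (batch, seq_len, d_hidden)

# Usage sketch:
# layer = MinGRU(d_in=64, d_hidden=128)
# h = layer(torch.randn(8, 512, 64), h0=torch.zeros(8, 128))
```

Because neither `a` nor `b` depends on earlier hidden states, training can use cumulative ops (or an associative scan) instead of a timestep-by-timestep loop, which is where the reported training speedups come from.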
Comparison to Transformers, SSMs, and Mamba-like models
- Experiments show small RNNs can train much faster than transformers (e.g., large speedups for length-512 sequences) with competitive accuracy at that scale.
- Several note this echoes earlier work on parallelizable RNNs (QRNN, SRU) and more recent state-space models / Mamba.
- Some argue the results should be framed as “competitive performance, confirmed only at small scale”; prior SSM/Mamba-style models have sometimes degraded at larger model sizes or longer contexts.
Context length, memory, and recall
- A central dispute: transformers can re-attend to all past tokens, while RNNs compress everything into a fixed-size hidden state (see the back-of-envelope memory sketch after this list).
- Critics doubt RNNs can match transformer-style recall on tasks like translating long documents or multi-turn chat.
- Counterarguments:
  - Hidden state can be enlarged, and high-dimensional spaces can store rich summaries.
  - Multiple passes, explicit “note buffers,” or hybrid attention + RNN architectures could mitigate recall limits.
  - Transformers themselves have a finite context window, so their effective state is also bounded in practice.
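A back-of-envelope illustration of why the recall question comes up (the shapes below are assumptions for illustration, not figures from the paper or the thread): at generation time a transformer's KV cache grows with every token it may later re-attend to, while a recurrent model carries a state whose size is independent of how much it has read.

```python
# Rough memory accounting, illustrative shapes only.
def kv_cache_floats(n_tokens: int, n_layers: int, n_heads: int, d_head: int) -> int:
    # One key and one value vector per token, per head, per layer.
    return 2 * n_tokens * n_layers * n_heads * d_head

def rnn_state_floats(n_layers: int, d_hidden: int) -> int:
    # Fixed-size hidden state per layer, independent of tokens seen so far.
    return n_layers * d_hidden

print(kv_cache_floats(n_tokens=100_000, n_layers=32, n_heads=32, d_head=128))  # grows linearly with context
print(rnn_state_floats(n_layers=32, d_hidden=4096))                            # constant
```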
Expressiveness vs efficiency and “curve fitting”
- Many emphasize that multiple architectures (RNNs, transformers, SSMs, big MLPs) are universal or near-universal approximators; architecture choice is mostly about data, compute, and training efficiency.
- Others argue architecture and inductive bias matter greatly in practice, analogous to algorithmic complexity (e.g., “bogosort vs quicksort”).
- Faster convergence and lower resource use are seen as the key wins, even if different architectures eventually reach similar performance given enough data and compute.
Training stability, long horizons, and scaling
- Discussion revisits vanishing/exploding gradients in RNNs; gated units (LSTM/GRU) alleviate the problem but don't fully solve it at very long contexts (the standard Jacobian-product argument is sketched after this list).
- Some expect recurrent or neuromorphic-like designs to be essential for truly long-horizon, efficient intelligence; others think very large context windows and better tooling may suffice.
- Several note that real breakthroughs may require new training objectives or optimizers beyond backprop.
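For reference, the standard backprop-through-time argument behind the first point: in a plain RNN with $h_k = \phi(W h_{k-1} + U x_k)$, the gradient reaching an early timestep is a product of per-step Jacobians,

$$
\frac{\partial h_T}{\partial h_t} \;=\; \prod_{k=t+1}^{T} \operatorname{diag}\!\big(\phi'(W h_{k-1} + U x_k)\big)\, W,
$$

whose norm tends to shrink or grow roughly geometrically in $T - t$. Gated units add near-identity update paths that slow this decay but do not remove it, which is why very long contexts remain difficult.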
Meta: citations, review, and hype
- One contributor highlights that very similar architectures were published years ago, and laments weak citation practices.
- There is skepticism about conference peer review quality and a desire for clearer “preprint” labeling.
- Overall tone: cautiously optimistic about RNNs-as-competitive-alternative, but unconvinced they are “all we need” without larger-scale evidence.