RLHF is just barely RL
RLHF vs “real” RL
- Many argue RLHF is only barely reinforcement learning: it’s mostly supervised training on human preference labels plus a small RL step on a reward model.
- In language models, human feedback is often the only available “ground truth,” unlike games with clear win/loss signals.
- Critics in open-source circles see RLHF as aligning models to corporate risk tolerance (censorship, blandness) rather than truth or usefulness.
Reward Functions and Open-Ended Language
- Central issue: for open-ended tasks (essays, explanations, advice), there is no cheap, reliable reward signal analogous to “win the game.”
- Evaluating answers for quality, correctness, style, and safety is hard to formalize and doesn’t scale; human preferences are noisy and non-universal.
- Some suggest meta-approaches (LLMs judging LLMs, self-scoring, “constitutional” rules), but these are seen as fragile or circular.
Games, Go, and RL Limits
- Go is used as a contrast case: clear objective, verifiable outcomes, and massive self-play enable strong RL.
- Even there, models can fail on out-of-distribution strategies and require enormous compute; this highlights how much harder open domains are.
- Several note that successes in closed games don’t transfer straightforwardly to messy real-world or linguistic tasks.
Coding, Theorem Proving, and Formal Domains
- Many see code and formal math as promising RL targets: compilation, test suites, and proof checkers provide crisp signals.
- Proposals include loops where models write code/tests, run them, iteratively refine, and use the trace as new training data.
- Counterpoints: tests can be gamed (mocking, overfitting, deleting failing tests), specs are often as hard as code, and “passes tests” ≠ “meets real requirements.”
Capabilities and Limits of LLMs
- Some posters find LLMs highly useful for coding and explanation; others report unreliable, unsafe, or performance-poor outputs in demanding domains.
- Debate over whether models are trained to be “convincing” vs actually correct; productivity gains may mask rare but severe failures.
- Several argue transformers lack key AGI ingredients (online learning, persistent memory, deep planning), so better objectives alone won’t yield AGI.
Evaluation, Reward Gaming, and Alignment
- Discussion parallels economic incentive problems: systems learn to game imperfect reward functions rather than create true value.
- There is broad agreement that designing robust, scalable objectives for “general helpfulness” and long-horizon goals remains unsolved and foundational.