2024-08-08

RLHF is just barely RL

Original Article ↗ Hacker News Discussion ↗

RLHF vs “real” RL

Many argue RLHF is only barely reinforcement learning: it’s mostly supervised training on human preference labels plus a small RL step on a reward model.
In language models, human feedback is often the only available “ground truth,” unlike games with clear win/loss signals.
Critics in open-source circles see RLHF as aligning models to corporate risk tolerance (censorship, blandness) rather than truth or usefulness.

Reward Functions and Open-Ended Language

Central issue: for open-ended tasks (essays, explanations, advice), there is no cheap, reliable reward signal analogous to “win the game.”
Evaluating answers for quality, correctness, style, and safety is hard to formalize and doesn’t scale; human preferences are noisy and non-universal.
Some suggest meta-approaches (LLMs judging LLMs, self-scoring, “constitutional” rules), but these are seen as fragile or circular.

Games, Go, and RL Limits

Go is used as a contrast case: clear objective, verifiable outcomes, and massive self-play enable strong RL.
Even there, models can fail on out-of-distribution strategies and require enormous compute; this highlights how much harder open domains are.
Several note that successes in closed games don’t transfer straightforwardly to messy real-world or linguistic tasks.

Coding, Theorem Proving, and Formal Domains

Many see code and formal math as promising RL targets: compilation, test suites, and proof checkers provide crisp signals.
Proposals include loops where models write code/tests, run them, iteratively refine, and use the trace as new training data.
Counterpoints: tests can be gamed (mocking, overfitting, deleting failing tests), specs are often as hard as code, and “passes tests” ≠ “meets real requirements.”

Capabilities and Limits of LLMs

Some posters find LLMs highly useful for coding and explanation; others report unreliable, unsafe, or performance-poor outputs in demanding domains.
Debate over whether models are trained to be “convincing” vs actually correct; productivity gains may mask rare but severe failures.
Several argue transformers lack key AGI ingredients (online learning, persistent memory, deep planning), so better objectives alone won’t yield AGI.

Evaluation, Reward Gaming, and Alignment

Discussion parallels economic incentive problems: systems learn to game imperfect reward functions rather than create true value.
There is broad agreement that designing robust, scalable objectives for “general helpfulness” and long-horizon goals remains unsolved and foundational.