RLHF is just barely RL

RLHF vs “real” RL

  • Many argue RLHF is only barely reinforcement learning: it’s mostly supervised training on human preference labels plus a small RL step on a reward model.
  • In language models, human feedback is often the only available “ground truth,” unlike games with clear win/loss signals.
  • Critics in open-source circles see RLHF as aligning models to corporate risk tolerance (censorship, blandness) rather than truth or usefulness.

Reward Functions and Open-Ended Language

  • Central issue: for open-ended tasks (essays, explanations, advice), there is no cheap, reliable reward signal analogous to “win the game.”
  • Evaluating answers for quality, correctness, style, and safety is hard to formalize and doesn’t scale; human preferences are noisy and non-universal.
  • Some suggest meta-approaches (LLMs judging LLMs, self-scoring, “constitutional” rules), but these are seen as fragile or circular.

Games, Go, and RL Limits

  • Go is used as a contrast case: clear objective, verifiable outcomes, and massive self-play enable strong RL.
  • Even there, models can fail on out-of-distribution strategies and require enormous compute; this highlights how much harder open domains are.
  • Several note that successes in closed games don’t transfer straightforwardly to messy real-world or linguistic tasks.

Coding, Theorem Proving, and Formal Domains

  • Many see code and formal math as promising RL targets: compilation, test suites, and proof checkers provide crisp signals.
  • Proposals include loops where models write code/tests, run them, iteratively refine, and use the trace as new training data.
  • Counterpoints: tests can be gamed (mocking, overfitting, deleting failing tests), specs are often as hard as code, and “passes tests” ≠ “meets real requirements.”

Capabilities and Limits of LLMs

  • Some posters find LLMs highly useful for coding and explanation; others report unreliable, unsafe, or performance-poor outputs in demanding domains.
  • Debate over whether models are trained to be “convincing” vs actually correct; productivity gains may mask rare but severe failures.
  • Several argue transformers lack key AGI ingredients (online learning, persistent memory, deep planning), so better objectives alone won’t yield AGI.

Evaluation, Reward Gaming, and Alignment

  • Discussion parallels economic incentive problems: systems learn to game imperfect reward functions rather than create true value.
  • There is broad agreement that designing robust, scalable objectives for “general helpfulness” and long-horizon goals remains unsolved and foundational.