Ladder: Self-improving LLMs through recursive problem decomposition

Performance claims and benchmarks

  • Discussion centers on Ladder and its Test-Time Reinforcement Learning (TTRL), which boost a small distilled 7B model to ~90% on the MIT Integration Bee qualifier and take a 3B Llama from ~1% to 82% on undergraduate integration problems.
  • Some note that with a verifier in the loop, a raw solve rate is hard to interpret unless it is compared against brute-force random generation under the same compute budget.
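The brute-force comparison the commenters ask for can be sketched directly: with a verifier in the loop, any generator, even a random one, earns some solve rate at a fixed sampling budget, and that is the relevant baseline. The toy `verifier` and `random_guesser` below are illustrative stand-ins, not anything from the paper:

```python
import random

def verifier(problem, answer):
    # Toy verifier: a "problem" is an integer target, and an answer is
    # correct if it equals the target. Stands in for the symbolic/numeric
    # checker used to grade integration answers.
    return answer == problem

def solve_rate(generator, problems, budget_per_problem, seed=0):
    """Fraction of problems solved when each problem gets a fixed
    sampling budget and a verifier filters candidate answers."""
    rng = random.Random(seed)
    solved = 0
    for p in problems:
        for _ in range(budget_per_problem):
            if verifier(p, generator(p, rng)):
                solved += 1
                break
    return solved / len(problems)

# Brute-force baseline: guess uniformly from a 100-element answer space.
def random_guesser(problem, rng):
    return rng.randrange(100)

problems = list(range(100))
baseline = solve_rate(random_guesser, problems, budget_per_problem=50)
# A model's solve rate is only informative relative to this baseline
# measured at the same per-problem budget.
```

With 50 uniform guesses over 100 candidates, the baseline already lands around 1 − 0.99⁵⁰ ≈ 40%, which is the commenters' point: verifier-filtered sampling is not free to beat.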

RL, curriculum learning, and “self-improvement”

  • Multiple comments unpack reinforcement learning as reward-based optimization on task outcomes, contrasting it with earlier “RL on token prediction.”
  • Curriculum learning is described as training on easier examples first, then harder ones; Ladder is seen as an automated, task-specific curriculum for math.
  • Test-time RL is framed as blurring the line between training and inference: models refine themselves on related problems during inference, akin to humans mulling over and decomposing tasks.
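The curriculum idea in these bullets reduces to ordering examples by difficulty and revealing them in stages. A minimal sketch, in which the difficulty score and the `model_update` callback are hypothetical placeholders (Ladder instead generates progressively simpler variants of each problem):

```python
def make_curriculum(examples, difficulty):
    """Order training examples from easiest to hardest — the basic
    curriculum-learning schedule described above."""
    return sorted(examples, key=difficulty)

def train_with_curriculum(model_update, examples, difficulty, stages=3):
    """Reveal the curriculum in stages: train on the easiest slice
    first, then progressively include harder examples."""
    ordered = make_curriculum(examples, difficulty)
    for stage in range(1, stages + 1):
        cutoff = len(ordered) * stage // stages
        for ex in ordered[:cutoff]:
            model_update(ex)

# Toy usage: string length stands in for a real difficulty signal,
# e.g. nesting depth of a generated integrand.
problems = ["x", "x**2 * sin(x)", "exp(x) * cos(x) / (1 + x**2)"]
curriculum = make_curriculum(problems, difficulty=len)
```

Note that early examples are revisited at each stage; other schedules (strict partitions, difficulty-weighted sampling) are equally common variants.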

Symbolic integration vs learned reasoning

  • Commenters point out that rule-based systems like Rubi already solve symbolic integrals very well.
  • Debate over whether LLMs should simply memorize such rule sets or learn more general strategies that transfer across domains.
  • Others argue models likely have the rules in training data but struggle to reliably recall and apply them, motivating specialized synthetic curricula like Ladder.

Methodological concerns and fairness

  • Some see persona prompts and recursive decomposition as “prompt engineering in a loop,” questioning how much true learning occurs at test time for a nominally stateless model.
  • Others reply that context itself is the state; memory/tool use and context compression strategies are discussed.
  • One criticism: using numerical integrators to check “simplified” problems risks effectively training on test cases if the simplification is minimal.
  • Another point: providing explicit integration operations in-context may give Ladder an advantage over models evaluated without such scaffolding.
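The numerical-verification step under criticism can be sketched as follows: check a candidate antiderivative F against a quadrature estimate of the integral of f on an interval. Simpson's rule and the tolerance here are illustrative choices, not the paper's exact procedure:

```python
import math

def simpson(f, a, b, n=1000):
    """Composite Simpson's rule estimate of the definite
    integral of f over [a, b]; n must be even."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def verify_antiderivative(f, F, a, b, tol=1e-6):
    """Numeric check that F' = f: by the fundamental theorem of
    calculus, F(b) - F(a) should match the quadrature estimate."""
    return abs((F(b) - F(a)) - simpson(f, a, b)) < tol

# A correct answer passes, a wrong one fails:
ok = verify_antiderivative(math.cos, math.sin, 0.0, 1.0)          # True
bad = verify_antiderivative(math.cos, lambda x: x * x, 0.0, 1.0)  # False
```

The criticism above is then easy to state: if a "simplified" variant is nearly identical to the test integrand, running it through this verifier during test-time training amounts to grading the model on (a thin disguise of) the test case itself.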

Compute, accessibility, and broader implications

  • Test-time RL is praised as a way to “spend compute” productively, analogous to AlphaZero-style search, with interest in distilling the gains back into smaller models.
  • Some report that similar ideas were internally developed and kept proprietary, viewing current disclosures as “cashing out” now that open-source baselines (e.g., DeepSeek/Qwen) are strong.
  • Costs for such RL/fine-tuning are seen as reachable for small labs and ambitious hobbyists, especially on smaller models.
  • The thread also veers into general excitement about rapid AI progress, comparisons to scaling disappointments (e.g., GPT‑4.5 expectations), and recurring fears about eventual superintelligence.