Ladder: Self-improving LLMs through recursive problem decomposition

Performance claims and benchmarks

  • Discussion centers on Ladder and its Test-Time Reinforcement Learning (TTRL), which boost a small distilled 7B model to ~90% on the MIT Integration Bee qualifier and take a 3B Llama from ~1% to 82% on undergraduate integration problems.
  • Some note that with a verifier in the loop, a raw solve rate is hard to interpret unless it is compared against brute-force random generation under the same compute budget.
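The brute-force comparison the commenters ask for can be sketched directly: with a verifier in the loop, any generator, even a random one, earns some solve rate at a fixed sampling budget, and that is the relevant baseline. The toy `verifier` and `random_guesser` below are illustrative stand-ins, not anything from the paper:

```python
import random

def verifier(problem, answer):
    # Toy verifier: a "problem" is an integer target, and an answer is
    # correct if it equals the target. Stands in for the symbolic/numeric
    # checker used to grade integration answers.
    return answer == problem

def solve_rate(generator, problems, budget_per_problem, seed=0):
    """Fraction of problems solved when each problem gets a fixed
    sampling budget and a verifier filters candidate answers."""
    rng = random.Random(seed)
    solved = 0
    for p in problems:
        for _ in range(budget_per_problem):
            if verifier(p, generator(p, rng)):
                solved += 1
                break
    return solved / len(problems)

# Brute-force baseline: guess uniformly from a 100-element answer space.
def random_guesser(problem, rng):
    return rng.randrange(100)

problems = list(range(100))
baseline = solve_rate(random_guesser, problems, budget_per_problem=50)
# A model's solve rate is only informative relative to this baseline
# measured at the same per-problem budget.
```

With 50 uniform guesses over 100 candidates, the baseline already lands around 1 − 0.99⁵⁰ ≈ 40%, which is the commenters' point: verifier-filtered sampling is not free to beat.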

RL, curriculum learning, and “self-improvement”

  • Multiple comments unpack reinforcement learning as reward-based optimization on task outcomes, contrasting it with earlier “RL on token prediction.”
  • Curriculum learning is described as training on easier examples first, then harder ones; Ladder is seen as an automated, task-specific curriculum for math.
  • Test-time RL is framed as blurring the line between training and inference: models refine themselves on related problems during inference, akin to humans mulling over and decomposing tasks.
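The curriculum idea in these bullets reduces to ordering examples by difficulty and revealing them in stages. A minimal sketch, in which the difficulty score and the `model_update` callback are hypothetical placeholders (Ladder instead generates progressively simpler variants of each problem):

```python
def make_curriculum(examples, difficulty):
    """Order training examples from easiest to hardest — the basic
    curriculum-learning schedule described above."""
    return sorted(examples, key=difficulty)

def train_with_curriculum(model_update, examples, difficulty, stages=3):
    """Reveal the curriculum in stages: train on the easiest slice
    first, then progressively include harder examples."""
    ordered = make_curriculum(examples, difficulty)
    for stage in range(1, stages + 1):
        cutoff = len(ordered) * stage // stages
        for ex in ordered[:cutoff]:
            model_update(ex)

# Toy usage: string length stands in for a real difficulty signal,
# e.g. nesting depth of a generated integrand.
problems = ["x", "x**2 * sin(x)", "exp(x) * cos(x) / (1 + x**2)"]
curriculum = make_curriculum(problems, difficulty=len)
```

Note that early examples are revisited at each stage; other schedules (strict partitions, difficulty-weighted sampling) are equally common variants.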

Symbolic integration vs learned reasoning

  • Commenters point out that rule-based systems like Rubi already solve symbolic integrals very well.
  • Debate over whether LLMs should simply memorize such rule sets or learn more general strategies that transfer across domains.
  • Others argue models likely have the rules in training data but struggle to reliably recall and apply them, motivating specialized synthetic curricula like Ladder.

Methodological concerns and fairness

  • Some see persona prompts and recursive decomposition as “prompt engineering in a loop,” questioning how much true learning occurs at test time for a nominally stateless model.
  • Others reply that context itself is the state; memory/tool use and context compression strategies are discussed.
  • One criticism: using numerical integrators to check “simplified” problems risks effectively training on test cases if the simplification is minimal.
  • Another point: providing explicit integration operations in-context may give Ladder an advantage over models evaluated without such scaffolding.
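The numerical-verification step under criticism can be sketched as follows: check a candidate antiderivative F against a quadrature estimate of the integral of f on an interval. Simpson's rule and the tolerance here are illustrative choices, not the paper's exact procedure:

```python
import math

def simpson(f, a, b, n=1000):
    """Composite Simpson's rule estimate of the definite
    integral of f over [a, b]; n must be even."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def verify_antiderivative(f, F, a, b, tol=1e-6):
    """Numeric check that F' = f: by the fundamental theorem of
    calculus, F(b) - F(a) should match the quadrature estimate."""
    return abs((F(b) - F(a)) - simpson(f, a, b)) < tol

# A correct answer passes, a wrong one fails:
ok = verify_antiderivative(math.cos, math.sin, 0.0, 1.0)          # True
bad = verify_antiderivative(math.cos, lambda x: x * x, 0.0, 1.0)  # False
```

The criticism above is then easy to state: if a "simplified" variant is nearly identical to the test integrand, running it through this verifier during test-time training amounts to grading the model on (a thin disguise of) the test case itself.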

Compute, accessibility, and broader implications

  • Test-time RL is praised as a way to “spend compute” productively, analogous to AlphaZero-style search, with interest in distilling the gains back into smaller models.
  • Some report that similar ideas were internally developed and kept proprietary, viewing current disclosures as “cashing out” now that open-source baselines (e.g., DeepSeek/Qwen) are strong.
  • Costs for such RL/fine-tuning are seen as reachable for small labs and ambitious hobbyists, especially on smaller models.
  • The thread also veers into general excitement about rapid AI progress, comparisons to scaling disappointments (e.g., GPT‑4.5 expectations), and recurring fears about eventual superintelligence.