LADDER: Self-improving LLMs through recursive problem decomposition
Performance claims and benchmarks
- Discussion centers on LADDER and its Test-Time Reinforcement Learning (TTRL), which together boost a small distilled 7B model to ~90% on the MIT Integration Bee qualifying exam and take a 3B Llama from ~1% to 82% on undergraduate integration problems.
- Some note that with a verifier in the loop, raw solve rate is less impressive unless compared against brute-force random generation under the same compute budget.
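The compute-matched baseline that comment asks for can be sketched as verifier-filtered best-of-n sampling. This is a minimal illustration, not the paper's setup; all names and the toy exact-match verifier are assumptions:

```python
def best_of_n_solve(problem, sample_fn, verify, n):
    """Baseline policy: draw up to n independent samples and accept the
    first one the verifier passes. A trained model's solve rate should
    beat this at an equal sampling budget to demonstrate real learning
    rather than verifier-assisted luck."""
    for _ in range(n):
        candidate = sample_fn(problem)
        if verify(problem, candidate):
            return candidate
    return None  # budget exhausted without a verified solution

# Toy demo with a hypothetical exact-match "verifier":
verify = lambda problem, ans: ans == problem["solution"]
problem = {"solution": 7}
hit = best_of_n_solve(problem, lambda p: 7, verify, n=1)   # → 7
miss = best_of_n_solve(problem, lambda p: 0, verify, n=5)  # → None
```

The point of the baseline: with a perfect verifier, even a weak sampler eventually "solves" problems, so raw accuracy numbers only mean something relative to this curve.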
RL, curriculum learning, and “self-improvement”
- Multiple comments unpack reinforcement learning as reward-based optimization over task outcomes, contrasting it with plain next-token prediction in pretraining.
- Curriculum learning is described as training on easier examples first, then harder ones; Ladder is seen as an automated, task-specific curriculum for math.
- Test-time RL is framed as blurring the line between training and inference: models refine themselves on related problems during inference, akin to humans mulling over and decomposing tasks.
Symbolic integration vs learned reasoning
- Commenters remind that rule-based systems like Rubi (the Rule-based Integrator) already solve symbolic integrals very well.
- Debate over whether LLMs should just memorize such rule sets vs learning more general strategies that transfer across domains.
- Others argue models likely have the rules in training data but struggle to reliably recall and apply them, motivating specialized synthetic curricula like Ladder.
Methodological concerns and fairness
- Some see persona prompts and recursive decomposition as “prompt engineering in a loop,” questioning how much true learning occurs at test time for a nominally stateless model.
- Others reply that context itself is the state; memory/tool use and context compression strategies are discussed.
- One criticism: using numerical integrators to check “simplified” problems risks effectively training on test cases if the simplification is minimal.
- Another point: providing explicit integration operations in-context may give Ladder an advantage over models evaluated without such scaffolding.
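The verification step these criticisms target can be sketched as a numerical check of a candidate antiderivative: compare F(b) − F(a) against a quadrature of f on [a, b]. The quadrature scheme (midpoint rule), interval, and tolerance below are assumptions, not the paper's exact setup:

```python
import math

def numeric_check(candidate_F, f, a=-1.0, b=1.0, n=200, tol=1e-4):
    """Accept candidate_F as an antiderivative of f iff
    F(b) - F(a) matches a midpoint-rule quadrature of f on [a, b].
    This is the style of verifier under discussion: cheap, no symbolic
    reasoning, and blind to how the candidate was produced."""
    h = (b - a) / n
    quad = sum(f(a + (i + 0.5) * h) for i in range(n)) * h
    return abs((candidate_F(b) - candidate_F(a)) - quad) < tol

ok = numeric_check(math.sin, math.cos)          # sin' = cos → True
bad = numeric_check(lambda x: x**2, math.cos)   # wrong candidate → False
```

Because the checker only compares numbers, it cannot tell a genuinely simplified variant from a near-copy of the test integral — which is precisely the leakage concern raised above.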
Compute, accessibility, and broader implications
- Test-time RL is praised as a way to “spend compute” productively, analogous to AlphaZero-style search, with interest in distilling the gains back into smaller models.
- Some report that similar ideas were internally developed and kept proprietary, viewing current disclosures as “cashing out” now that open-source baselines (e.g., DeepSeek/Qwen) are strong.
- Costs for such RL/fine-tuning are seen as reachable for small labs and ambitious hobbyists, especially on smaller models.
- The thread also veers into general excitement about rapid AI progress, comparisons to scaling disappointments (e.g., GPT‑4.5 expectations), and recurring fears about eventual superintelligence.