Leanstral: Open-source agent for trustworthy coding and formal proof engineering
Model performance and benchmarking
- Many are excited about a specialized agent for Lean / proof engineering, but several note that it still underperforms frontier models like Claude Opus on the reported benchmark.
- Some argue Opus’s higher cost is justified for mission‑critical correctness; others say Leanstral’s cost/performance looks better once you consider pass@2 / pass@3 and the ability to auto‑check proofs.
- Comparisons to Haiku/Sonnet are seen as somewhat confusing; a few suggest comparisons to older code-focused models like Codex would be more meaningful.
- One criticism: marketing claims that performance “scales linearly” with more passes do not match the chart, where other models appear to scale more cleanly.
Passes, pass@k, and multi-model strategies
- “Passes” are clarified as multiple independent attempts at the same problem; pass@k counts a problem as solved if any of the k attempts produces a correct answer.
- This is viewed as particularly appropriate for Lean because correctness can be automatically validated.
- Some discuss “LLM alloys”: rotating different models across passes, a strategy reported elsewhere to improve scores when the models have complementary strengths.
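The pass@k metric discussed above is usually estimated with the unbiased formula popularized by the Codex paper: given n sampled attempts of which c passed, pass@k = 1 − C(n−c, k)/C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    attempts drawn (without replacement) from n samples, c of which
    passed, is correct."""
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 attempts with 3 correct: pass@1 ≈ 0.3, and pass@2 / pass@3
# rise as more independent tries are allowed.
for k in (1, 2, 3):
    print(k, pass_at_k(10, 3, k))
```

Because Lean can auto-check each attempt, pass@k here reflects a realizable workflow (keep the first proof that type-checks), not just a benchmark convention.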
Trustworthy coding, vibe coding, and TDD
- Strong split between “trustworthy”/spec‑driven workflows and “vibe coding” (loosely-guided code generation).
- Several developers dislike vibe coding, comparing it to building disposable, fragile systems; others say they use agents as careful assistants, not as autonomous coders.
- Many connect Leanstral to TDD / property‑based testing: agents generating tests/specs, then code to satisfy them. Some warn that letting models generate both code and tests can still hide bugs.
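The property-based angle can be made concrete with a small stdlib-only sketch (function names here are illustrative, not from any discussed tool): a human writes the property, and generated code is checked against random inputs rather than hand-picked cases, which makes it harder for a model to satisfy the tests without satisfying the spec.

```python
import random

def is_sorted(xs):
    """True if xs is in non-decreasing order."""
    return all(a <= b for a, b in zip(xs, xs[1:]))

def check_sort_property(sort_fn, trials=1000, seed=0):
    """Property test for a hypothetical generated sort_fn: the output
    must be ordered AND a permutation of the input."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-100, 100) for _ in range(rng.randint(0, 20))]
        out = sort_fn(list(xs))
        assert is_sorted(out), f"not ordered: {out}"
        assert sorted(out) == sorted(xs), f"not a permutation of {xs}"
    return True

print(check_sort_property(sorted))  # the reference implementation passes
```

The caveat in the thread still applies: if the same model writes both `sort_fn` and the property, a shared misunderstanding of the spec can slip through both.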
Formal methods and Lean’s role
- Lean is framed as a way to offload proof search to an agent while humans focus on writing specifications.
- Supporters argue specs are much smaller and easier to review than implementations; Lean’s small trusted kernel then checks validity.
- Skeptics note that understanding whether a spec/theorem is the “right” one is still hard, and that proofs can be huge and opaque.
- There’s discussion of using Lean (or TLA+ etc.) as targeted tools for tricky state/consistency bugs rather than verifying entire systems.
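The spec-vs-proof asymmetry argued above can be illustrated with a minimal Lean 4 sketch (the theorem name is illustrative; the closing lemma `List.reverse_reverse` is from the standard library):

```lean
-- The specification is one reviewable line: reversing a list twice
-- is the identity. An agent can search for the proof; Lean's small
-- trusted kernel then checks whatever it finds.
theorem reverse_twice {α : Type} (l : List α) :
    l.reverse.reverse = l := by
  exact List.reverse_reverse l
```

Even when the proof an agent produces is long and opaque, only the statement and the kernel need human trust; the skeptics' point is that deciding the statement is the *right* spec remains a human problem.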
Mistral models, openness, and ecosystem
- Mixed views on Mistral’s overall competitiveness: some find the models behind top US offerings, others happily use them (especially smaller/open ones) for daily tasks and local deployment.
- Leanstral’s weights are Apache‑2.0–licensed; some call this genuinely open source, others insist this is better described as “open weights” because the full training process isn’t open.
- Debate over practicality of running a 120B‑scale model locally: possible on high‑end Macs or servers with quantization, but often slower and less convenient than using hosted SOTA models.
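The local-deployment debate largely reduces to weight-memory arithmetic. A back-of-envelope sketch (weights only; KV cache and activations add more):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory needed to hold model weights, in decimal GB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

# A 120B-parameter model:
print(weight_memory_gb(120, 16))  # fp16: 240.0 GB -> server territory
print(weight_memory_gb(120, 4))   # 4-bit quantized: 60.0 GB -> fits a
                                  # high-end Mac's unified memory
```

This is why the thread lands where it does: 4-bit quantization makes a 120B model *loadable* on top-tier consumer hardware, but memory bandwidth still makes it slower than hosted inference.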
European positioning and regulation
- Several see Mistral as part of a European push for AI independence, but point out that if it runs largely on US cloud providers, the sovereignty benefits are limited.
- The EU AI Act is criticized by some as prematurely strict and harmful to competitiveness; others say it mostly targets large players and is reasonable for traditional ML, though poorly adapted to foundation models.