Leanstral: Open-source agent for trustworthy coding and formal proof engineering
Model performance and benchmarking
- Many are excited about a specialized agent for Lean / proof engineering, but several note that it still underperforms frontier models like Claude Opus on the reported benchmark.
- Some argue Opus’s higher cost is justified for mission‑critical correctness; others say Leanstral’s cost/performance looks better once you consider pass@2 / pass@3 and the ability to auto‑check proofs.
- Comparisons to Haiku/Sonnet are seen as somewhat confusing; a few suggest comparisons to older code-focused models like Codex would be more meaningful.
- One criticism: marketing claims that performance “scales linearly” with more passes do not match the chart, where other models appear to scale more cleanly.
Passes, pass@k, and multi-model strategies
- “Passes” are clarified as multiple independent attempts at the same problem; pass@k counts a problem as solved if any of the k attempts produces a correct answer.
- This is viewed as particularly appropriate for Lean because correctness can be automatically validated.
- Some discuss “LLM alloys”: rotating different models across passes, a strategy reported elsewhere to improve scores when the models have complementary strengths.
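The pass@k metric discussed above is usually estimated with the unbiased formula popularized by the Codex paper: given n sampled attempts of which c passed, pass@k = 1 − C(n−c, k)/C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    attempts drawn (without replacement) from n samples, c of which
    passed, is correct."""
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 attempts with 3 correct: pass@1 ≈ 0.3, and pass@2 / pass@3
# rise as more independent tries are allowed.
for k in (1, 2, 3):
    print(k, pass_at_k(10, 3, k))
```

Because Lean can auto-check each attempt, pass@k here reflects a realizable workflow (keep the first proof that type-checks), not just a benchmark convention.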
Trustworthy coding, vibe coding, and TDD
- Strong split between “trustworthy”/spec‑driven workflows and “vibe coding” (loosely-guided code generation).
- Several developers dislike vibe coding, comparing it to building disposable, fragile systems; others say they use agents as careful assistants, not as autonomous coders.
- Many connect Leanstral to TDD / property‑based testing: agents generating tests/specs, then code to satisfy them. Some warn that letting models generate both code and tests can still hide bugs.
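The property-based angle can be made concrete with a small stdlib-only sketch (function names here are illustrative, not from any discussed tool): a human writes the property, and generated code is checked against random inputs rather than hand-picked cases, which makes it harder for a model to satisfy the tests without satisfying the spec.

```python
import random

def is_sorted(xs):
    """True if xs is in non-decreasing order."""
    return all(a <= b for a, b in zip(xs, xs[1:]))

def check_sort_property(sort_fn, trials=1000, seed=0):
    """Property test for a hypothetical generated sort_fn: the output
    must be ordered AND a permutation of the input."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-100, 100) for _ in range(rng.randint(0, 20))]
        out = sort_fn(list(xs))
        assert is_sorted(out), f"not ordered: {out}"
        assert sorted(out) == sorted(xs), f"not a permutation of {xs}"
    return True

print(check_sort_property(sorted))  # the reference implementation passes
```

The caveat in the thread still applies: if the same model writes both `sort_fn` and the property, a shared misunderstanding of the spec can slip through both.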
Formal methods and Lean’s role
- Lean is framed as a way to offload proof search to an agent while humans focus on writing specifications.
- Supporters argue specs are much smaller and easier to review than implementations; Lean’s small trusted kernel then checks validity.
- Skeptics note that understanding whether a spec/theorem is the “right” one is still hard, and that proofs can be huge and opaque.
- There’s discussion of using Lean (or TLA+ etc.) as targeted tools for tricky state/consistency bugs rather than verifying entire systems.
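The spec-vs-proof asymmetry argued above can be illustrated with a minimal Lean 4 sketch (the theorem name is illustrative; the closing lemma `List.reverse_reverse` is from the standard library):

```lean
-- The specification is one reviewable line: reversing a list twice
-- is the identity. An agent can search for the proof; Lean's small
-- trusted kernel then checks whatever it finds.
theorem reverse_twice {α : Type} (l : List α) :
    l.reverse.reverse = l := by
  exact List.reverse_reverse l
```

Even when the proof an agent produces is long and opaque, only the statement and the kernel need human trust; the skeptics' point is that deciding the statement is the *right* spec remains a human problem.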
Mistral models, openness, and ecosystem
- Mixed views on Mistral’s overall competitiveness: some find the models behind top US offerings, others happily use them (especially smaller/open ones) for daily tasks and local deployment.
- Leanstral’s weights are Apache‑2.0–licensed; some call this genuinely open source, others insist this is better described as “open weights” because the full training process isn’t open.
- Debate over practicality of running a 120B‑scale model locally: possible on high‑end Macs or servers with quantization, but often slower and less convenient than using hosted SOTA models.
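The local-deployment debate largely reduces to weight-memory arithmetic. A back-of-envelope sketch (weights only; KV cache and activations add more):

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory needed to hold model weights, in decimal GB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

# A 120B-parameter model:
print(weight_memory_gb(120, 16))  # fp16: 240.0 GB -> server territory
print(weight_memory_gb(120, 4))   # 4-bit quantized: 60.0 GB -> fits a
                                  # high-end Mac's unified memory
```

This is why the thread lands where it does: 4-bit quantization makes a 120B model *loadable* on top-tier consumer hardware, but memory bandwidth still makes it slower than hosted inference.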
European positioning and regulation
- Several see Mistral as part of a European push for AI independence, but point out that if it runs largely on US cloud providers, the sovereignty benefits are limited.
- The EU AI Act is criticized by some as prematurely strict and harmful to competitiveness; others say it mostly targets large players and is reasonable for traditional ML, though poorly adapted to foundation models.