ProofOfThought: LLM-based reasoning using Z3 theorem proving
Using formal methods for policies and compliance
- Some teams report prototyping similar ideas with Lean: LLMs convert business or compliance policies (from docs/wikis) into formal logic, which is then re-checked by a solver as a kind of “process linter” whenever the documents change (see the sketch after this list).
- This is seen as promising for domains with tight legal/financial compliance requirements, but engineers still have to manually review the auto-formalized specs.
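As a rough illustration of the “process linter” idea, the sketch below hand-encodes two hypothetical compliance rules as Z3 constraints and re-checks facts extracted from the current process document. In the prototypes described above an LLM (or Lean, in the Lean-based variants) would produce the formalization; the rule and fact names here are invented for the example.

```python
# Minimal "process linter" sketch: hypothetical compliance rules encoded as
# Z3 constraints, re-checked against facts from the current process document.
from z3 import Bool, Implies, Not, Solver, sat

# Hypothetical propositions an LLM might extract from a policy wiki.
handles_pii       = Bool("handles_pii")        # process touches personal data
has_dpa_signed    = Bool("has_dpa_signed")     # a data-processing agreement exists
stores_outside_eu = Bool("stores_outside_eu")  # data leaves the EU

policy = [
    Implies(handles_pii, has_dpa_signed),         # PII requires a signed DPA
    Implies(handles_pii, Not(stores_outside_eu)), # PII must stay in the EU
]

# Facts auto-extracted from the *current* version of the process document.
process_facts = [handles_pii, stores_outside_eu]

s = Solver()
s.add(policy + process_facts)
if s.check() == sat:
    print("process is consistent with policy")
else:
    print("policy violation detected")  # flag the conflict for human review
```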
Structured outputs vs custom DSL + Z3
- Several commenters criticize the project for parsing raw LLM text instead of using modern structured-output APIs and constrained decoding, which can enforce JSON schemas and reduce hallucinations.
- Others note that older APIs only enforced JSON structure, not complex DSL grammars; designing constraints for a rich custom DSL was non-trivial when the project began.
- There are reports of occasional JSON/structured-output failures even with schemas, suggesting validation and retries are still needed.
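A minimal sketch of the validate-and-retry pattern these comments point to: the model’s structured output is re-validated locally against a JSON schema, and the request is retried with error feedback when validation fails. The schema, the `call_llm` callable, and the retry policy are illustrative placeholders, not the project’s actual interface.

```python
# Validate-and-retry sketch: even with structured-output APIs, re-check the
# returned JSON locally and retry on failure.
import json
from jsonschema import validate, ValidationError

SPEC_SCHEMA = {
    "type": "object",
    "properties": {
        "variables":   {"type": "array", "items": {"type": "string"}},
        "constraints": {"type": "array", "items": {"type": "string"}},
        "query":       {"type": "string"},
    },
    "required": ["variables", "constraints", "query"],
}

def get_formal_spec(prompt, call_llm, max_retries=3):
    """Ask the model for a machine-readable spec; retry on malformed output."""
    for _ in range(max_retries):
        raw = call_llm(prompt)  # hypothetical LLM call returning text
        try:
            spec = json.loads(raw)
            validate(instance=spec, schema=SPEC_SCHEMA)
            return spec
        except (json.JSONDecodeError, ValidationError) as err:
            # Feed the error back so the model can repair its own output.
            prompt = f"{prompt}\n\nPrevious output was invalid ({err}); return valid JSON only."
    raise RuntimeError("no schema-valid output after retries")
```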
Autoformalization gap & verification limits
- A core concern: the LLM may generate incorrect logical models or inject unsound facts; the solver then only “proves” whatever the model says.
- The paper reportedly shows high false-positive rates on logic benchmarks, highlighting this “autoformalization gap.”
- Follow-up work measures consistency between text reasoning and SMT programs, and proposes uncertainty quantification / selective verification to reduce risk, but skeptics argue this doesn’t solve the fundamental “crap in, crap out” issue.
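A toy version of the selective-verification idea: produce an answer only when the informal (text) reasoning and the auto-formalized SMT program agree, and abstain otherwise. This is a simplification of what the follow-up work proposes; it assumes the SMT-LIB program encodes the question so that satisfiability corresponds to a “yes” answer.

```python
# Selective verification sketch: abstain when text reasoning and the SMT
# encoding disagree, since disagreement often signals an autoformalization error.
from z3 import Solver, parse_smt2_string, sat, unsat

def solver_answer(smt2_program: str):
    """Map the SMT encoding's satisfiability to a yes/no answer (None if unusable)."""
    s = Solver()
    try:
        s.add(parse_smt2_string(smt2_program))
    except Exception:
        return None                # unparsable formalization -> no formal verdict
    result = s.check()
    if result == sat:
        return True                # satisfiable -> "yes" under the assumed convention
    if result == unsat:
        return False
    return None                    # solver returned unknown

def selective_verify(text_answer: bool, smt2_program: str):
    """Answer only when informal and formal reasoning agree; otherwise abstain."""
    formal_answer = solver_answer(smt2_program)
    if formal_answer is None or formal_answer != text_answer:
        return "abstain"           # disagreement flags a likely formalization error
    return text_answer             # agreement raises, but does not guarantee, confidence
```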
Neurosymbolic systems and LLM-as-judge
- Many see hybrid neurosymbolic systems (LLM + logic/prover/CAS) as the way forward: LLMs propose plans or formalizations and symbolic engines check them (a propose-check-repair sketch follows this list).
- Some advocate LLMs or agent ensembles as judges/critics, while others argue that LLM-judging-LLM inherits biases and eventually caps performance, requiring human or deterministic oracles for high-stakes tasks.
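A sketch of the propose-check-repair loop the neurosymbolic camp describes, with Z3 rather than another LLM as the critic: the model drafts an SMT-LIB formalization, the solver checks it, and parse errors or counterexamples are fed back as deterministic critique. `propose_smt2` is a hypothetical LLM wrapper, and the unsat-means-proved convention is an assumption of this example, not a description of the project’s pipeline.

```python
# Propose-check-repair sketch: LLM proposes, Z3 checks, solver output is the critique.
from z3 import Solver, parse_smt2_string, sat, unsat

def prove_with_feedback(claim: str, propose_smt2, max_rounds: int = 3):
    feedback = ""
    for _ in range(max_rounds):
        smt2 = propose_smt2(claim, feedback)          # hypothetical LLM call
        s = Solver()
        try:
            # Convention assumed here: the encoding asserts the *negation* of
            # the claim, so `unsat` means the claim is proved.
            s.add(parse_smt2_string(smt2))
        except Exception as err:
            feedback = f"SMT-LIB parse error: {err}"  # deterministic, non-LLM critique
            continue
        verdict = s.check()
        if verdict == unsat:
            return "proved"
        if verdict == sat:
            feedback = f"counterexample: {s.model()}" # model witnesses the failed proof
        else:
            feedback = "solver returned unknown; simplify the encoding"
    return "unresolved"
```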
Do LLMs ‘think’ or reason?
- The project title (“ProofOfThought”) triggers a long philosophical dispute about whether computers/LLMs can “think” or “reason” versus merely emulate reasoning statistically.
- One side insists computation (even domino cascades) cannot meet any reasonable definition of thought; others counter that cognition is substrate-independent and that insisting it be uniquely human is circular.
Interpretability and practicality
- Commenters ask why we can’t simply log all neural activations; replies stress the sheer dimensionality and that such traces are uninterpretable to humans, though there is ongoing work in mechanistic interpretability.
- Practical issues raised: sparse documentation of the DSL in the repo, potential solver latency for real-time applications, and examples where the generated SMT model for a simple puzzle is shallow and uninformative.
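To make the “shallow and uninformative” complaint concrete, here is an invented contrast on a toy age-ordering puzzle: one encoding simply asserts the conclusion (so the solver’s “sat” proves nothing), while the other encodes the premises and lets Z3 derive the conclusion by refuting its negation. Neither snippet is taken from the project’s actual output.

```python
# Shallow vs. substantive encodings of "Alice is older than Bob, Bob is older
# than Carol; is Alice older than Carol?"
from z3 import Bool, Ints, Not, Solver, unsat

# Shallow: the "model" just restates the conclusion, so checking it proves nothing.
shallow = Solver()
answer = Bool("alice_older_than_carol")
shallow.add(answer)                 # conclusion asserted as a bare fact
print(shallow.check())              # sat -- trivially, and uninformatively

# Substantive: encode the premises and show the negated conclusion is unsatisfiable.
alice, bob, carol = Ints("alice bob carol")
deep = Solver()
deep.add(alice > bob, bob > carol)  # premises: Alice > Bob > Carol (ages)
deep.add(Not(alice > carol))        # assert the negation of the conclusion
print(deep.check() == unsat)        # True: the conclusion follows from the premises
```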