ProofOfThought: LLM-based reasoning using Z3 theorem proving

Using formal methods for policies and compliance

  • Some teams report prototyping similar ideas with Lean: converting business or compliance policies (from docs/wikis) into formal logic via LLMs, then re-running the checks as a kind of “process linter” whenever the documents change (a minimal sketch of the policy-as-constraints idea follows this list).
  • This is seen as promising for domains with tight legal/financial compliance requirements, but engineers still need to manually review the auto-formalized specs.
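As a rough illustration of the policy-as-constraints pattern, the sketch below encodes an invented transaction-approval rule directly in Z3’s Python bindings (z3-solver) rather than Lean; the variable names and the 10,000 threshold are assumptions made up for this example, not anything from the project or the discussion.

```python
# A minimal sketch of "policy as constraints" using z3-solver; the policy,
# variable names, and threshold are invented for illustration.
from z3 import Bool, Int, Implies, Solver, sat

# Hypothetical policy: any transaction over 10,000 requires manager approval
# and a completed KYC check.
amount = Int("amount")
manager_approved = Bool("manager_approved")
kyc_complete = Bool("kyc_complete")

policy_approval = Implies(amount > 10_000, manager_approved)
policy_kyc = Implies(amount > 10_000, kyc_complete)

def check_scenario(**facts):
    """Return True if the concrete scenario is consistent with the policy."""
    s = Solver()
    s.add(policy_approval, policy_kyc)
    s.add(amount == facts["amount"],
          manager_approved == facts["manager_approved"],
          kyc_complete == facts["kyc_complete"])
    return s.check() == sat

# A 12,000 transaction without manager approval violates the policy.
print(check_scenario(amount=12_000, manager_approved=False, kyc_complete=True))  # False
print(check_scenario(amount=12_000, manager_approved=True,  kyc_complete=True))  # True
```

Re-running such checks against recorded scenarios whenever the policy document changes is the “process linter” idea in miniature; the hard part, per the thread, is trusting the LLM-generated formalization in the first place.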

Structured outputs vs custom DSL + Z3

  • Several commenters criticize the project for parsing raw LLM text instead of using modern structured-output APIs and constrained decoding, which can enforce JSON schemas and reduce hallucinations.
  • Others note that older APIs only enforced JSON structure, not complex DSL grammars; designing constraints for a rich custom DSL was non-trivial when the project began.
  • There are reports of occasional JSON/structured-output failures even with schemas, so validation and retries are still needed (a validate-and-retry sketch follows this list).
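A minimal version of that validate-and-retry pattern might look like the following; call_llm is a hypothetical stand-in for whatever chat API is in use (it returns a canned string here so the snippet runs), and the three top-level keys are an invented schema, not the project’s actual DSL.

```python
# Validate-and-retry loop for structured LLM output; call_llm and the schema
# keys are illustrative assumptions.
import json

SCHEMA_KEYS = {"variables", "constraints", "query"}  # hypothetical top-level keys

def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would call an LLM API, ideally with a
    # JSON-schema / constrained-decoding option where available.
    return '{"variables": ["x"], "constraints": ["x > 0"], "query": "x == 1"}'

def parse_with_retries(prompt: str, max_attempts: int = 3) -> dict:
    last_error = None
    for attempt in range(max_attempts):
        raw = call_llm(prompt if attempt == 0
                       else f"{prompt}\n\nPrevious output was invalid: {last_error}. "
                            f"Return only valid JSON with keys {sorted(SCHEMA_KEYS)}.")
        try:
            obj = json.loads(raw)                # structural validity
            missing = SCHEMA_KEYS - obj.keys()   # schema-level validity
            if missing:
                raise ValueError(f"missing keys: {missing}")
            return obj
        except (json.JSONDecodeError, ValueError) as e:
            last_error = e
    raise RuntimeError(f"no valid structured output after {max_attempts} attempts: {last_error}")

print(parse_with_retries("Formalize: every x greater than zero ..."))
```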

Autoformalization gap & verification limits

  • A core concern: the LLM may generate incorrect logical models or inject unsound facts; the solver then only “proves” whatever the model says.
  • The paper reportedly shows high false-positive rates on logic benchmarks, highlighting this “autoformalization gap.”
  • Follow-up work measures consistency between text reasoning and SMT programs and proposes uncertainty quantification / selective verification to reduce risk, but skeptics argue this doesn’t solve the fundamental “crap in, crap out” issue (illustrated in the sketch after this list).
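The “crap in, crap out” concern can be made concrete with a toy Z3 example: once an unsound premise is asserted, the solver dutifully certifies whatever follows from it. The predicates below are invented for illustration, not taken from the project’s benchmarks.

```python
# Illustration of the "autoformalization gap": if the LLM injects an unsound
# premise, Z3 will certify conclusions that follow from it.
from z3 import Bool, Implies, Solver, Not, unsat

penguin = Bool("is_penguin")
bird = Bool("is_bird")
flies = Bool("flies")

s = Solver()
s.add(Implies(penguin, bird))   # sound: penguins are birds
s.add(Implies(bird, flies))     # unsound premise an LLM might produce
s.add(penguin)

# "Proof" that penguins fly: the negated conclusion is unsatisfiable.
s.push()
s.add(Not(flies))
print(s.check() == unsat)       # True -- the solver certifies the bad model
s.pop()
```

Z3 is doing exactly its job here; the error lives entirely in the autoformalization step, which is why consistency checks and human review of the generated model keep coming up in the thread.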

Neurosymbolic systems and LLM-as-judge

  • Many see hybrid neurosymbolic systems (LLM + logic engine/prover/CAS) as the way forward: LLMs propose plans or formalizations, and symbolic engines check them (a propose-and-check sketch follows this list).
  • Some advocate LLMs or agent ensembles as judges/critics, while others argue that LLM-judging-LLM inherits biases and eventually caps performance, requiring human or deterministic oracles for high-stakes tasks.
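A skeletal propose-and-check loop of the kind commenters describe might look like this; propose_formalization is a hypothetical stand-in for an LLM call (returning a canned SMT-LIB string so the snippet runs), and the retry/feedback policy is an assumption, not the project’s actual pipeline.

```python
# Propose-and-check loop: the LLM drafts an SMT-LIB formalization, Z3 acts as
# the deterministic checker, and failures are fed back as critique.
from z3 import Solver, parse_smt2_string, sat

def propose_formalization(task: str, feedback: str = "") -> str:
    # Placeholder for an LLM call that returns SMT-LIB text.
    return """
    (declare-const x Int)
    (assert (> x 3))
    (assert (< x 7))
    (assert (= (mod x 2) 0))
    """

def solve_with_checker(task: str, max_rounds: int = 3):
    feedback = ""
    for _ in range(max_rounds):
        smt_text = propose_formalization(task, feedback)
        try:
            assertions = parse_smt2_string(smt_text)   # reject ill-formed SMT-LIB
        except Exception as e:
            feedback = f"SMT-LIB parse error: {e}"
            continue
        s = Solver()
        s.add(assertions)
        result = s.check()
        if result == sat:
            return s.model()                           # symbolic engine accepts
        feedback = f"Solver returned {result}; revise the formalization."
    return None

print(solve_with_checker("Find an even integer strictly between 3 and 7"))
```

The design choice the thread debates is what sits in the checker seat: a deterministic engine like Z3 gives hard guarantees about the formal model, whereas an LLM judge only gives another statistical opinion.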

Do LLMs ‘think’ or reason?

  • The project title (“ProofOfThought”) triggers a long philosophical dispute about whether computers/LLMs can “think” or “reason” versus merely emulate reasoning statistically.
  • One side insists computation (even domino cascades) cannot meet any reasonable definition of thought; others counter that cognition is substrate-independent and that insisting it be uniquely human is circular.

Interpretability and practicality

  • Commenters ask why we can’t simply log all neural activations; replies stress the sheer dimensionality of such traces and that they are uninterpretable to humans, though mechanistic-interpretability research is chipping away at the problem (a back-of-envelope estimate follows this list).
  • Practical issues raised: sparse documentation of the DSL in the repo, potential solver latency for real-time applications, and examples where the generated SMT model for a simple puzzle is shallow and uninformative.
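For the activation-logging question, a back-of-envelope estimate shows the scale involved; the layer count, hidden size, and context length below are illustrative assumptions, not any particular model’s real configuration.

```python
# Rough estimate of why "just log all activations" is impractical; all figures
# are illustrative assumptions.
layers = 80            # assumed transformer depth
hidden = 8192          # assumed hidden dimension
seq_len = 4096         # assumed context length
bytes_per_value = 2    # fp16

# Residual-stream activations alone (ignoring attention maps and MLP
# intermediates, which add several more multiples):
values = layers * hidden * seq_len
print(f"{values:,} values per forward pass")              # ~2.7 billion
print(f"{values * bytes_per_value / 1e9:.1f} GB per prompt at fp16")
```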