ProofOfThought: LLM-based reasoning using Z3 theorem proving

Using formal methods for policies and compliance

  • Some teams report prototyping similar ideas with Lean: converting business or compliance policies (from docs/wikis) into formal logic via LLMs, then re-running the checks as a kind of “process linter” whenever the documents change (a minimal sketch of the policy-as-constraints idea follows this list).
  • This is seen as promising for domains with tight legal/financial compliance requirements, but engineers still need to manually review the auto-formalized specs.
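As a rough illustration of the policy-as-constraints pattern, the sketch below encodes an invented transaction-approval rule directly in Z3’s Python bindings (z3-solver) rather than Lean; the variable names and the 10,000 threshold are assumptions made up for this example, not anything from the project or the discussion.

```python
# A minimal sketch of "policy as constraints" using z3-solver; the policy,
# variable names, and threshold are invented for illustration.
from z3 import Bool, Int, Implies, Solver, sat

# Hypothetical policy: any transaction over 10,000 requires manager approval
# and a completed KYC check.
amount = Int("amount")
manager_approved = Bool("manager_approved")
kyc_complete = Bool("kyc_complete")

policy_approval = Implies(amount > 10_000, manager_approved)
policy_kyc = Implies(amount > 10_000, kyc_complete)

def check_scenario(**facts):
    """Return True if the concrete scenario is consistent with the policy."""
    s = Solver()
    s.add(policy_approval, policy_kyc)
    s.add(amount == facts["amount"],
          manager_approved == facts["manager_approved"],
          kyc_complete == facts["kyc_complete"])
    return s.check() == sat

# A 12,000 transaction without manager approval violates the policy.
print(check_scenario(amount=12_000, manager_approved=False, kyc_complete=True))  # False
print(check_scenario(amount=12_000, manager_approved=True,  kyc_complete=True))  # True
```

Re-running such checks against recorded scenarios whenever the policy document changes is the “process linter” idea in miniature; the hard part, per the thread, is trusting the LLM-generated formalization in the first place.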

Structured outputs vs custom DSL + Z3

  • Several commenters criticize the project for parsing raw LLM text instead of using modern structured-output APIs and constrained decoding, which can enforce JSON schemas and reduce hallucinations.
  • Others note that older APIs only enforced JSON structure, not complex DSL grammars; designing constraints for a rich custom DSL was non-trivial when the project began.
  • There are reports of occasional JSON/structured-output failures even with schemas, so validation and retries are still needed (a validate-and-retry sketch follows this list).
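A minimal version of that validate-and-retry pattern might look like the following; call_llm is a hypothetical stand-in for whatever chat API is in use (it returns a canned string here so the snippet runs), and the three top-level keys are an invented schema, not the project’s actual DSL.

```python
# Validate-and-retry loop for structured LLM output; call_llm and the schema
# keys are illustrative assumptions.
import json

SCHEMA_KEYS = {"variables", "constraints", "query"}  # hypothetical top-level keys

def call_llm(prompt: str) -> str:
    # Placeholder: a real implementation would call an LLM API, ideally with a
    # JSON-schema / constrained-decoding option where available.
    return '{"variables": ["x"], "constraints": ["x > 0"], "query": "x == 1"}'

def parse_with_retries(prompt: str, max_attempts: int = 3) -> dict:
    last_error = None
    for attempt in range(max_attempts):
        raw = call_llm(prompt if attempt == 0
                       else f"{prompt}\n\nPrevious output was invalid: {last_error}. "
                            f"Return only valid JSON with keys {sorted(SCHEMA_KEYS)}.")
        try:
            obj = json.loads(raw)                # structural validity
            missing = SCHEMA_KEYS - obj.keys()   # schema-level validity
            if missing:
                raise ValueError(f"missing keys: {missing}")
            return obj
        except (json.JSONDecodeError, ValueError) as e:
            last_error = e
    raise RuntimeError(f"no valid structured output after {max_attempts} attempts: {last_error}")

print(parse_with_retries("Formalize: every x greater than zero ..."))
```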

Autoformalization gap & verification limits

  • A core concern: the LLM may generate incorrect logical models or inject unsound facts; the solver then only “proves” whatever the model says.
  • The paper reportedly shows high false-positive rates on logic benchmarks, highlighting this “autoformalization gap.”
  • Follow-up work measures consistency between text reasoning and SMT programs and proposes uncertainty quantification / selective verification to reduce risk, but skeptics argue this doesn’t solve the fundamental “crap in, crap out” issue (illustrated in the sketch after this list).
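The “crap in, crap out” concern can be made concrete with a toy Z3 example: once an unsound premise is asserted, the solver dutifully certifies whatever follows from it. The predicates below are invented for illustration, not taken from the project’s benchmarks.

```python
# Illustration of the "autoformalization gap": if the LLM injects an unsound
# premise, Z3 will certify conclusions that follow from it.
from z3 import Bool, Implies, Solver, Not, unsat

penguin = Bool("is_penguin")
bird = Bool("is_bird")
flies = Bool("flies")

s = Solver()
s.add(Implies(penguin, bird))   # sound: penguins are birds
s.add(Implies(bird, flies))     # unsound premise an LLM might produce
s.add(penguin)

# "Proof" that penguins fly: the negated conclusion is unsatisfiable.
s.push()
s.add(Not(flies))
print(s.check() == unsat)       # True -- the solver certifies the bad model
s.pop()
```

Z3 is doing exactly its job here; the error lives entirely in the autoformalization step, which is why consistency checks and human review of the generated model keep coming up in the thread.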

Neurosymbolic systems and LLM-as-judge

  • Many see hybrid neurosymbolic systems (LLM + logic engine/prover/CAS) as the way forward: LLMs propose plans or formalizations, and symbolic engines check them (a propose-and-check sketch follows this list).
  • Some advocate LLMs or agent ensembles as judges/critics, while others argue that LLM-judging-LLM inherits biases and eventually caps performance, requiring human or deterministic oracles for high-stakes tasks.
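A skeletal propose-and-check loop of the kind commenters describe might look like this; propose_formalization is a hypothetical stand-in for an LLM call (returning a canned SMT-LIB string so the snippet runs), and the retry/feedback policy is an assumption, not the project’s actual pipeline.

```python
# Propose-and-check loop: the LLM drafts an SMT-LIB formalization, Z3 acts as
# the deterministic checker, and failures are fed back as critique.
from z3 import Solver, parse_smt2_string, sat

def propose_formalization(task: str, feedback: str = "") -> str:
    # Placeholder for an LLM call that returns SMT-LIB text.
    return """
    (declare-const x Int)
    (assert (> x 3))
    (assert (< x 7))
    (assert (= (mod x 2) 0))
    """

def solve_with_checker(task: str, max_rounds: int = 3):
    feedback = ""
    for _ in range(max_rounds):
        smt_text = propose_formalization(task, feedback)
        try:
            assertions = parse_smt2_string(smt_text)   # reject ill-formed SMT-LIB
        except Exception as e:
            feedback = f"SMT-LIB parse error: {e}"
            continue
        s = Solver()
        s.add(assertions)
        result = s.check()
        if result == sat:
            return s.model()                           # symbolic engine accepts
        feedback = f"Solver returned {result}; revise the formalization."
    return None

print(solve_with_checker("Find an even integer strictly between 3 and 7"))
```

The design choice the thread debates is what sits in the checker seat: a deterministic engine like Z3 gives hard guarantees about the formal model, whereas an LLM judge only gives another statistical opinion.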

Do LLMs ‘think’ or reason?

  • The project title (“ProofOfThought”) triggers a long philosophical dispute about whether computers/LLMs can “think” or “reason” versus merely emulate reasoning statistically.
  • One side insists computation (even domino cascades) cannot meet any reasonable definition of thought; others counter that cognition is substrate-independent and that insisting it be uniquely human is circular.

Interpretability and practicality

  • Commenters ask why we can’t simply log all neural activations; replies stress the sheer dimensionality of such traces and that they are uninterpretable to humans, though mechanistic-interpretability research is chipping away at the problem (a back-of-envelope estimate follows this list).
  • Practical issues raised: sparse documentation of the DSL in the repo, potential solver latency for real-time applications, and examples where the generated SMT model for a simple puzzle is shallow and uninformative.
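For the activation-logging question, a back-of-envelope estimate shows the scale involved; the layer count, hidden size, and context length below are illustrative assumptions, not any particular model’s real configuration.

```python
# Rough estimate of why "just log all activations" is impractical; all figures
# are illustrative assumptions.
layers = 80            # assumed transformer depth
hidden = 8192          # assumed hidden dimension
seq_len = 4096         # assumed context length
bytes_per_value = 2    # fp16

# Residual-stream activations alone (ignoring attention maps and MLP
# intermediates, which add several more multiples):
values = layers * hidden * seq_len
print(f"{values:,} values per forward pass")              # ~2.7 billion
print(f"{values * bytes_per_value / 1e9:.1f} GB per prompt at fp16")
```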