Legal models hallucinate in 1 out of 6 (or more) benchmarking queries
Reliability of Legal AI Tools
- Many commenters argue a 1-in-6 hallucination rate is unacceptable in adversarial, high‑stakes domains like law, where wrong citations can tank cases or even send people to jail.
- Examples are raised of real lawyers sanctioned for filing briefs with AI‑invented cases.
- Some see legal research as one of the worst matches for LLMs because law changes frequently, jurisdictions differ, and precedent can be overruled; mislabeling or missing that context can be fatal to a legal argument.
Hallucinations vs “Mere Generation”
- One camp claims LLMs “hallucinate 100% of the time”: the same stochastic token‑generation process underlies both right and wrong answers; there is no internal truth criterion.
- Others push back that “hallucination” should mean factual fabrication, not all generation; conflating the terms is seen as rhetorical or propagandistic.
- There’s extended debate over whether human perception and memory are similarly “hallucinatory,” and whether that analogy is useful or misleading.
Comparison to Human Professionals
- Several ask: what’s the baseline error rate for human lawyers, doctors, and advisors? Some report substantial human error and bad advice.
- Others counter that human professionals are accountable (malpractice, sanctions), rarely fabricate entire citations, and can check their claims against reality; LLM users often have no recourse.
Workflows, RAG, and Mitigation
- Suggested safe pattern: LLMs do drafting/brainstorming; humans must audit every substantive output. Critics note this inverts the hoped‑for automation: machines “do creativity,” humans do tedious verification.
- Retrieval-augmented generation (RAG) and multi‑model cross‑checking are proposed to reduce hallucinations, but commenters say existing legal products integrate citators and retrieval poorly and still fail frequently.
- Some argue that careful engineering and unrushed product development can dramatically outperform current big‑vendor tools, but they offer few details.
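The audit and cross-checking ideas above can be sketched roughly as follows. This is a minimal illustration, not any vendor's method: the regex covers only a few U.S. reporter formats, `trusted_index` is a hypothetical stand-in for a real citator or case database, and the function names are invented for this example. Agreement between models is treated only as a weak signal, since two models can fabricate the same citation.

```python
import re

# Simplified, assumed pattern for citations such as "347 U.S. 483".
# Real citation formats are far more varied than this.
CITATION_RE = re.compile(r"\b\d{1,3} (?:U\.S\.|F\.2d|F\.3d|S\. Ct\.) \d{1,4}\b")


def extract_citations(text: str) -> list[str]:
    """Return citations found in text, deduplicated, in order of appearance."""
    return list(dict.fromkeys(CITATION_RE.findall(text)))


def audit_draft(draft: str, trusted_index: set[str]) -> tuple[list[str], list[str]]:
    """Split a draft's citations into (verified, flagged_for_human_review).

    trusted_index is a hypothetical set of known-good citations; in practice
    this lookup would go to a citator or an authoritative case database.
    """
    verified, flagged = [], []
    for cite in extract_citations(draft):
        (verified if cite in trusted_index else flagged).append(cite)
    return verified, flagged


def consensus_citations(drafts: list[str]) -> set[str]:
    """Citations that appear in every draft (e.g. one draft per model).

    Consensus is not verification; it only narrows what a human checks first.
    """
    sets = [set(extract_citations(d)) for d in drafts]
    return set.intersection(*sets) if sets else set()


if __name__ == "__main__":
    # "999 U.S. 999" is deliberately made up for this illustration.
    draft_a = "See Brown v. Board of Education, 347 U.S. 483, and Smith v. Jones, 999 U.S. 999."
    draft_b = "Brown v. Board of Education, 347 U.S. 483, controls here."
    trusted = {"347 U.S. 483"}

    verified, flagged = audit_draft(draft_a, trusted)
    print("verified:", verified)
    print("flagged for review:", flagged)
    print("consensus:", consensus_citations([draft_a, draft_b]))
```

Anything in the flagged list still needs a human to read the underlying source, which is the verification burden the critics in the thread describe.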
Impact on Professions and Skills
- Concern that firms will replace junior staff with AI, leading to skill atrophy and a future shortage of experienced professionals.
- Others note similar oversight structures already exist (e.g., assistants supervised by licensed experts), but question whether a 17–33% error rate (roughly 1 in 6 to 1 in 3) can ever be acceptable.
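To give a sense of why a 17–33% per-query rate worries commenters, the sketch below computes the chance of at least one hallucination across several queries. It assumes each query fails independently with the same probability, which is a simplification introduced here and not a claim made in the thread; the function name is hypothetical.

```python
def session_risk(p: float, n: int) -> float:
    """Probability of at least one hallucination in n queries.

    Assumes independent queries with identical per-query error rate p
    (an illustrative simplification, not a measured property).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    for p in (1 / 6, 1 / 3):
        for n in (1, 5, 10):
            print(f"per-query rate {p:.0%}, {n:2d} queries -> {session_risk(p, n):.0%}")
```

Under that independence assumption, a 1-in-6 rate already gives roughly a 60% chance of at least one hallucination in five queries, which is why per-output human review keeps coming up in the discussion.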
Hype, Limits, and Future Directions
- Strong skepticism that current LLM architecture is suitable for mission‑critical legal or medical advice; alternative systems (expert systems, structured databases, DSLs) are suggested.
- Others see LLMs as powerful linguistic components inside larger, more rigorous systems, and expect rapid improvement, though metrics and incentives remain unclear.