Legal models hallucinate in 1 out of 6 (or more) benchmarking queries

Reliability of Legal AI Tools

  • Many commenters argue a 1-in-6 hallucination rate is unacceptable in adversarial, high‑stakes domains like law, where wrong citations can tank cases or even send people to jail.
  • Commenters cite real cases of lawyers being sanctioned for filing briefs that relied on AI‑invented cases.
  • Some see legal research as one of the worst matches for LLMs because law changes frequently, jurisdictions differ, and precedent can be overruled; mislabeling or missing this context is fatal.

Hallucinations vs “Mere Generation”

  • One camp claims LLMs “hallucinate 100% of the time”: the same stochastic token‑generation process underlies both right and wrong answers; there is no internal truth criterion.
  • Others push back that “hallucination” should mean factual fabrication, not all generation; conflating the terms is seen as rhetorical or propagandistic.
  • There’s extended debate over whether human perception and memory are similarly “hallucinatory,” and whether that analogy is useful or misleading.

Comparison to Human Professionals

  • Several ask: what’s the baseline error rate for human lawyers, doctors, and advisors? Some report substantial human error and bad advice.
  • Others counter that human professionals are accountable (malpractice, sanctions), rarely fabricate entire citations, and can check their claims against the real world; LLM users often have no comparable recourse.

Workflows, RAG, and Mitigation

  • Suggested safe pattern: LLMs do drafting/brainstorming; humans must audit every substantive output. Critics note this inverts the hoped‑for automation: machines “do creativity,” humans do tedious verification.
  • RAG and multi‑model cross‑checking are proposed to reduce hallucinations, but commenters say existing legal products integrate citators and retrieval poorly and still fail frequently.
  • Some argue that carefully engineered products that are not rushed to market can dramatically outperform current big‑vendor tools, but they give few details.
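
The "humans must audit" pattern above can be partly supported by an automated pre-check that flags citations a trusted source cannot confirm. The following is a minimal sketch, not anything described in the thread: the regex covers only a few U.S. reporter formats, and `verified_citations` is a hypothetical stand-in for a lookup against a real citator or case-law database. Passing the check does not replace human review; it only surfaces likely fabrications earlier.

```python
import re

# Simplified pattern for citations such as "347 U.S. 483" or "999 F.3d 1234".
# Real citation formats are far more varied; this is illustrative only.
CITATION_RE = re.compile(r"\b\d{1,3} (?:U\.S\.|F\.[23]d) \d{1,4}\b")


def extract_citations(text):
    """Return every citation-like string found in the text, in order."""
    return CITATION_RE.findall(text)


def audit_draft(draft, verified_citations):
    """Split the citations in an LLM draft into confirmed and unconfirmed.

    `verified_citations` is a hypothetical set of citations already
    confirmed against a trusted source (an assumption of this sketch).
    """
    found = extract_citations(draft)
    unverified = [c for c in found if c not in verified_citations]
    return {
        "found": found,
        "unverified": unverified,
        # Block filing until a human has resolved every unverified citation.
        "blocked": bool(unverified),
    }


# Usage: the second citation is not in the trusted set, so the draft is blocked.
report = audit_draft(
    "See 347 U.S. 483 and 999 F.3d 1234.",
    verified_citations={"347 U.S. 483"},
)
print(report)
```

A check like this catches only citations that do not exist in the trusted set. It says nothing about whether a real case actually supports the proposition it is cited for, or whether it has been overruled, which is where commenters say current products still fail.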

Impact on Professions and Skills

  • Concern that firms will replace junior staff with AI, leading to skill atrophy and a future shortage of experienced professionals.
  • Others note similar oversight structures already exist (e.g., assistants supervised by licensed experts), but question whether a 17–33% error rate (roughly 1 in 6 to 1 in 3 queries) can ever be acceptable.
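
To see why a per-query error rate in that range worries commenters, it helps to compound it over a research session. The sketch below assumes each query fails independently with the same probability, which is a simplification introduced here and not a claim from the benchmark; the function name is illustrative.

```python
def p_at_least_one_error(per_query_rate, n_queries):
    """Probability that at least one of n independent queries is wrong.

    Assumes independent, identically distributed errors (a simplifying
    assumption of this sketch, not a property of any real tool).
    """
    return 1.0 - (1.0 - per_query_rate) ** n_queries


# Five queries at the low and high ends of the 17-33% range.
for rate in (0.17, 0.33):
    print(f"rate={rate:.2f}, 5 queries -> {p_at_least_one_error(rate, 5):.2f}")
```

Under these assumptions, a five-query session has roughly a 61% chance of containing at least one error at a 17% rate, and roughly 86% at a 33% rate.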

Hype, Limits, and Future Directions

  • Strong skepticism that current LLM architecture is suitable for mission‑critical legal or medical advice; alternative systems (expert systems, structured databases, DSLs) are suggested.
  • Others see LLMs as powerful linguistic components inside larger, more rigorous systems, and expect rapid improvement, though metrics and incentives remain unclear.