2024-05-31

Legal models hallucinate in 1 out of 6 (or more) benchmarking queries

Reliability of Legal AI Tools

Many commenters argue a 1-in-6 hallucination rate is unacceptable in adversarial, high‑stakes domains like law, where wrong citations can tank cases or even send people to jail.
Examples are raised of real lawyers sanctioned for filing briefs with AI‑invented cases.
Some see legal research as one of the worst matches for LLMs because law changes frequently, jurisdictions differ, and precedent can be overruled; mislabeling or missing that context can be fatal to a legal argument.

Hallucinations vs “Mere Generation”

One camp claims LLMs “hallucinate 100% of the time”: the same stochastic token‑generation process underlies both right and wrong answers; there is no internal truth criterion.
Others push back that “hallucination” should mean factual fabrication, not all generation; conflating the terms is seen as rhetorical or propagandistic.
There’s extended debate over whether human perception and memory are similarly “hallucinatory,” and whether that analogy is useful or misleading.

Comparison to Human Professionals

Several ask: what’s the baseline error rate for human lawyers, doctors, and advisors? Some report substantial human error and bad advice.
Others counter that human professionals are accountable (malpractice, sanctions), rarely fabricate entire citations, and can check their claims against the real world; LLM users often have no recourse.

Workflows, RAG, and Mitigation

Suggested safe pattern: LLMs do drafting/brainstorming; humans must audit every substantive output. Critics note this inverts the hoped‑for automation: machines “do creativity,” humans do tedious verification.
RAG and multi‑model cross‑checking are proposed to reduce hallucinations, but commenters say existing legal products integrate citators and retrieval poorly and still fail frequently.
Some argue that careful engineering and less hurried product development can dramatically outperform current big‑vendor tools, but they offer few details.
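
As a rough illustration of the "retrieve, then make a human audit" pattern discussed above, here is a minimal Python sketch. It is not taken from the thread or from any vendor's product. The names `KNOWN_CITATIONS`, `extract_citations`, and `audit_draft` are hypothetical, the in-memory set stands in for a real citator or case-law database, and the regex deliberately recognizes only citations of the form "<volume> U.S. <page>".

```python
import re

# Hypothetical stand-in for a trusted citator or case-law database.
# A real system would query an authoritative source instead.
KNOWN_CITATIONS = {
    "347 U.S. 483",  # Brown v. Board of Education
    "384 U.S. 436",  # Miranda v. Arizona
}

# Deliberately narrow pattern (an assumption for this sketch):
# matches only "<volume> U.S. <page>" citations.
CITATION_RE = re.compile(r"\b\d{1,3} U\.S\. \d{1,4}\b")


def extract_citations(text):
    """Return every citation-shaped string found in the draft, in order."""
    return CITATION_RE.findall(text)


def audit_draft(text, known=KNOWN_CITATIONS):
    """Split a draft's citations into verified and unverified lists.

    Any unverified citation marks the draft as needing human review;
    the check says nothing about whether a verified case is still good law.
    """
    citations = extract_citations(text)
    verified = [c for c in citations if c in known]
    unverified = [c for c in citations if c not in known]
    return {
        "verified": verified,
        "unverified": unverified,
        "needs_human_review": bool(unverified),
    }


# Example draft; the second citation is invented for this illustration.
draft = "See Brown, 347 U.S. 483, and also Smith v. Jones, 999 U.S. 999."
result = audit_draft(draft)
print(result)
```

A check like this only confirms that a cited case exists in the trusted source. It does not confirm that the case says what the draft claims, which is why commenters still insist on a human auditing every substantive output.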

Impact on Professions and Skills

Concern that firms will replace junior staff with AI, leading to skill atrophy and a future shortage of experienced professionals.
Others note similar oversight structures already exist (e.g., assistants supervised by licensed experts), but question whether a 17–33% error rate (roughly 1 in 6 to 1 in 3) can ever be acceptable.

Hype, Limits, and Future Directions

Strong skepticism that current LLM architecture is suitable for mission‑critical legal or medical advice; alternative systems (expert systems, structured databases, DSLs) are suggested.
Others see LLMs as powerful linguistic components inside larger, more rigorous systems, and expect rapid improvement, though metrics and incentives remain unclear.
