GPT-5 outperforms federal judges in legal reasoning experiment
What the paper is really measuring
- Several commenters note the paper itself defines “error” as departure from a formal reading of law, not from “justice.”
- The task was narrow: a technical choice-of-law question arising from a car-accident scenario, where the experiment stipulates a single legally "correct" jurisdiction.
- Many stress that this is clerical/legal analysis, not the core work of judges in hard, unsettled, or morally fraught cases.
Judgment, discretion, and fairness vs. consistency
- One side argues inconsistency is a feature: law is full of vague standards and impossible edge cases; humane outcomes require discretion.
- Others counter that inconsistency is where bias, corruption, and “noise” creep in, and that like cases should be treated alike.
- Example repeatedly cited: teen “sexting” cases where literal application of child-porn laws would label kids as predators; judges sometimes deliberately bend the law to avoid absurd, destructive results.
Arguments for using AI in the legal system
- As a second opinion or “AI clerk” to check legal reasoning, reduce bias/noise, and flag outlier rulings.
- As a first-pass or parallel system: the AI renders an initial decision subject to human review or appeal, potentially speeding adjudication and reducing pretrial harms such as prolonged detention.
- Possible role in public defense or administrative-style proceedings, where overworked humans currently do mechanistic work.
Arguments against AI judges
- Fairness ≠ consistency: LLMs are praised here for rigid formalism, which might amplify unjust statutes and remove mercy.
- Legitimacy: people want to feel they were heard by a human; the process is partly about public trust, not just correct rule application.
- Accountability and control questions: who trains, tunes, and owns the model; hidden biases in data and prompts; risk of political or corporate capture.
Methodological and result skepticism
- Suspicion of the "100% correct" result; some read it as a sign of a contrived benchmark or of training-data contamination.
- Point that real judges offload such technical questions to clerks; the comparison may be more “AI vs. clerks” than “AI vs. judges.”
- Several commenters think the HN title is misleading: the paper is about “silicon formalism,” not a clean “AI beats judges” story.