GPT-5 outperforms federal judges in legal reasoning experiment

What the paper is really measuring

  • Several commenters note the paper itself defines “error” as departure from a formal reading of law, not from “justice.”
  • The task was narrow: a technical choice-of-law question in a car accident scenario, where there is (for the experiment) a legally “correct” jurisdiction.
  • Many stress that this is clerical/legal analysis, not the core work of judges in hard, unsettled, or morally fraught cases.

Judgment, discretion, and fairness vs. consistency

  • One side argues inconsistency is a feature: law is full of vague standards and impossible edge cases; humane outcomes require discretion.
  • Others counter that inconsistency is where bias, corruption, and “noise” creep in, and that like cases should be treated alike.
  • Example repeatedly cited: teen “sexting” prosecutions, where literal application of child-pornography statutes would brand the teens themselves as predators; judges sometimes deliberately bend the law to avoid absurd, destructive outcomes.

Arguments for using AI in the legal system

  • As a second opinion or “AI clerk” to check legal reasoning, reduce bias/noise, and flag outlier rulings.
  • As a first-pass or parallel system: AI decision, then human review/appeal, potentially speeding justice and reducing pretrial harms.
  • Possible role in public defense or administrative-style proceedings, where overworked humans currently do mechanistic work.

Arguments against AI judges

  • Fairness ≠ consistency: the LLM is praised here precisely for rigid formalism, which would also faithfully enforce unjust statutes and leave no room for mercy.
  • Legitimacy: people want to feel they were heard by a human; the process is partly about public trust, not just correct rule application.
  • Accountability and control questions: who trains, tunes, and owns the model; hidden biases in data and prompts; risk of political or corporate capture.

Methodological and result skepticism

  • Suspicion of a “100% correct” result; some think this signals a contrived benchmark or possible training-data contamination.
  • Observation that real judges delegate such technical questions to clerks, so the comparison may really be “AI vs. clerks” rather than “AI vs. judges.”
  • Several commenters think the HN title is misleading: the paper is about “silicon formalism,” not a clean “AI beats judges” story.