Re-Evaluating GPT-4's Bar Exam Performance

Revised Bar Exam Performance

  • The paper argues that GPT-4’s bar exam scores were overstated:
    • Around 69th percentile overall vs all takers, rather than the roughly 90th percentile originally claimed.
    • Around 48th percentile on essays vs all takers.
    • Estimated ~62nd percentile vs first-time takers, ~42nd on essays.
    • Among those who passed, estimated ~48th percentile overall, ~15th on essays.
  • Several commenters note this is still “bar-passing territory,” but closer to the lower half of successful candidates.
  • Repeat takers heavily skew the all-takers statistics; first-time takers are a more representative comparison group for practicing lawyers.
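The percentile gaps above follow from a simple point: the same score ranks differently depending on the reference group. A minimal sketch of that effect follows. Every number in it is invented for illustration (none are real bar exam data), and the helper `percentile_rank` and the passing cutoff are assumptions, not the paper's method.

```python
from bisect import bisect_left


def percentile_rank(score, population):
    """Percent of `population` scoring strictly below `score`."""
    ordered = sorted(population)
    return 100.0 * bisect_left(ordered, score) / len(ordered)


# Hypothetical scores, invented purely for illustration (not real bar data).
first_timers = [255, 262, 268, 270, 275, 281, 288, 295, 301, 310]
repeaters = [230, 238, 244, 249, 252, 258, 263, 266, 271, 279]
all_takers = first_timers + repeaters

passing_cut = 266  # hypothetical passing score
passers = [s for s in all_takers if s >= passing_cut]

candidate = 276  # hypothetical score being ranked

for label, group in [
    ("all takers", all_takers),
    ("first-time takers", first_timers),
    ("those who passed", passers),
]:
    print(f"{label}: {percentile_rank(candidate, group):.1f}")
```

In this toy data the same score falls in rank as the comparison group narrows from all takers to first-time takers to passers, which is the shape of the revision the paper describes.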

Nature of the Bar Exam

  • Strong disagreement over difficulty:
    • Some licensed lawyers describe the exam as surprisingly simple, heavily memorization-based and formulaic.
    • Others (including non-lawyers who tried samples) found it hard, with non-intuitive rules and tricky questions.
  • The bar is seen as testing minimum competence and “black-letter law,” not full real-world legal skill.

Exams as AI Benchmarks

  • Many commenters question using human exams to evaluate AI:
    • Tests are proxies calibrated on correlations among human abilities.
    • High exam percentiles don’t imply good lawyering or general reasoning.
    • Some are bothered that the original GPT-4 bar claims lacked detailed “receipts” or methodology.
  • Passing an exam doesn’t equal performing real legal tasks; lawyering involves judgment, planning, ethics, client selection, and reputation.

LLMs in Legal Practice

  • Advocates: a domain-tuned legal LLM could be a powerful research aid, acting as a compressed (but lossy) index over case law and statutes.
  • Critics: hallucinations and fabricated case law make generative drafting dangerous; safest use is as a pointer to sources lawyers still read themselves.

Broader Views on LLM Capability and Hype

  • Supporters highlight:
    • Rapid progress (GPT-4 vs earlier models).
    • Usefulness as tutors and productivity aids when answers can be vetted.
  • Skeptics emphasize:
    • Frequent subtle errors, weak analysis, lack of true planning or “street smarts.”
    • Performance often “freshman-level” on topics experts know well.
    • Concern that hype overstates capabilities and fuels job anxiety, especially for “keyboard” work.
  • Several commenters note US-centric biases and jurisdictional mix-ups in legal answers.