Re-Evaluating GPT-4's Bar Exam Performance
Revised Bar Exam Performance
- The paper argues that GPT-4’s bar exam percentiles were overstated:
  - About the 69th percentile overall against all test takers, not the 90th or higher.
  - About the 48th percentile on essays against all test takers.
  - Against first-time takers, an estimated ~62nd percentile overall and ~42nd on essays.
  - Among those who passed, an estimated ~48th percentile overall and ~15th on essays.
- Several commenters note this is still “bar-passing territory,” but closer to the lower half of successful candidates.
- Repeat takers, who tend to score lower, heavily skew the all-takers statistics; a comparison against first-time takers is more representative of practicing lawyers.
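The way one fixed score maps to different percentiles depending on the reference group can be sketched in a few lines. Everything below is hypothetical: the scores, the cut score `PASS_MARK`, and the helper `percentile_rank` are invented for illustration and are not data from the paper or from any bar administration.

```python
from bisect import bisect_left


def percentile_rank(score, population):
    """Return the percentage of `population` scoring strictly below `score`."""
    ordered = sorted(population)
    return 100.0 * bisect_left(ordered, score) / len(ordered)


# Hypothetical scaled scores, made up for this sketch only.
first_timers = [255, 262, 268, 271, 275, 279, 283, 288, 294, 301]
repeaters = [238, 244, 249, 253, 257, 260, 264, 269, 272, 278]
all_takers = first_timers + repeaters

PASS_MARK = 270  # hypothetical cut score, not a real jurisdiction's
passers = [s for s in all_takers if s >= PASS_MARK]

candidate = 276  # the single score being ranked against each group

for label, group in [
    ("all takers", all_takers),
    ("first-time takers", first_timers),
    ("passers only", passers),
]:
    print(f"{label:>18}: {percentile_rank(candidate, group):.1f}")
```

With these invented numbers the same score lands at the 70th percentile among all takers, the 50th among first-time takers, and roughly the 33rd among those who passed. The ordering mirrors the pattern summarized above; the specific values carry no meaning.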
Nature of the Bar Exam
- Strong disagreement over difficulty:
  - Some licensed lawyers describe the exam as surprisingly simple, heavily memorization-based, and formulaic.
  - Others (including non-lawyers who tried sample questions) found it hard, with non-intuitive rules and tricky questions.
- The bar is seen as testing minimum competence and “black-letter law,” not full real-world legal skill.
Exams as AI Benchmarks
- Many commenters question using human exams to evaluate AI:
  - Tests are proxies, calibrated on how abilities correlate in humans.
  - High exam percentiles don’t imply good lawyering or general reasoning.
- Some are bothered that the original GPT-4 bar claims lacked detailed “receipts” or methodology.
- Passing an exam doesn’t equal performing real legal tasks; lawyering involves judgment, planning, ethics, client selection, and reputation.
LLMs in Legal Practice
- Advocates: a domain-tuned legal LLM could be a powerful research aid, acting as a compressed (but lossy) index over case law and statutes.
- Critics: hallucinations and fabricated case law make generative drafting dangerous; safest use is as a pointer to sources lawyers still read themselves.
Broader Views on LLM Capability and Hype
- Supporters highlight:
  - Rapid progress (GPT-4 vs earlier models).
  - Usefulness as tutors and productivity aids when answers can be vetted.
- Skeptics emphasize:
  - Frequent subtle errors, weak analysis, lack of true planning or “street smarts.”
  - Performance often “freshman-level” on topics experts know well.
  - Concern that hype overstates capabilities and fuels job anxiety, especially for “keyboard” work.
- Several commenters note US-centric biases and jurisdictional mix-ups in legal answers.