2024-06-01

Re-Evaluating GPT-4's Bar Exam Performance

Revised Bar Exam Performance

The paper argues GPT-4’s reported bar exam percentile was overstated:
- Around 69th percentile overall vs all takers, not the 90th+ originally claimed.
- Around 48th percentile on essays vs all takers.
- Estimated ~62nd percentile vs first-time takers, ~42nd on essays.
- Among those who passed, estimated ~48th percentile overall, ~15th on essays.
Several commenters note this is still “bar-passing territory,” but closer to the lower half of successful candidates.
Repeat takers, who tend to score lower, heavily skew the all-takers stats; first-time takers are more representative of practicing lawyers.
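
To illustrate why the same score lands at different percentiles depending on the comparison group, here is a minimal Python sketch. Every number in it (the scores, the passing cutoff, and the model score) is invented for illustration and is not taken from the paper or from real bar exam data; `percentile_rank` is a hypothetical helper, and "percent of scores strictly below" is only one of several common percentile definitions.

```python
from bisect import bisect_left


def percentile_rank(score, reference_scores):
    """Percent of reference scores strictly below `score`."""
    ordered = sorted(reference_scores)
    return 100.0 * bisect_left(ordered, score) / len(ordered)


# Hypothetical scaled scores, invented purely for illustration.
first_timers = [250, 262, 270, 276, 281, 288, 295, 303]
repeaters = [228, 236, 243, 251, 258, 266]  # assumed to score lower on average
all_takers = first_timers + repeaters

passing_cutoff = 266  # hypothetical cutoff
passers = [s for s in all_takers if s >= passing_cutoff]

model_score = 280  # hypothetical model score

# Same score, three different reference populations.
vs_all = percentile_rank(model_score, all_takers)
vs_first = percentile_rank(model_score, first_timers)
vs_passers = percentile_rank(model_score, passers)

print(f"vs all takers:        {vs_all:.1f}")
print(f"vs first-time takers: {vs_first:.1f}")
print(f"vs passers only:      {vs_passers:.1f}")
```

With these made-up inputs the percentile falls as the reference group gets stronger (all takers, then first-time takers, then passers only), which is the same direction as the paper's revised estimates.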

Nature of the Bar Exam

Strong disagreement over difficulty:
- Some licensed lawyers describe the exam as surprisingly simple, heavily memorization-based and formulaic.
- Others (including non-lawyers who tried samples) found it hard, with non-intuitive rules and tricky questions.
The bar is seen as testing minimum competence and “black-letter law,” not full real-world legal skill.

Exams as AI Benchmarks

Many commenters question using human exams to evaluate AI:
- Tests are proxies calibrated on correlations among human abilities, which may not hold for a model.
- High exam percentiles don’t imply good lawyering or general reasoning.
- Some are bothered that original GPT-4 bar claims lacked detailed “receipts” or methodology.
Passing an exam doesn’t equal performing real legal tasks; lawyering involves judgment, planning, ethics, client selection, and reputation.

LLMs in Legal Practice

Advocates: a domain-tuned legal LLM could be a powerful research aid, acting as a compressed (but lossy) index over case law and statutes.
Critics: hallucinations and fabricated case law make generative drafting dangerous; safest use is as a pointer to sources lawyers still read themselves.

Broader Views on LLM Capability and Hype

Supporters highlight:
- Rapid progress (GPT-4 vs earlier models).
- Usefulness as tutors and productivity aids when answers can be vetted.
Skeptics emphasize:
- Frequent subtle errors, weak analysis, lack of true planning or “street smarts.”
- Performance often “freshman-level” on topics experts know well.
- Concern that hype overstates capabilities and fuels job anxiety, especially for “keyboard” work.
Several commenters note US-centric biases and jurisdictional mix-ups in legal answers.
