Results of "Humanity's Last Exam" benchmark published

Benchmark design and difficulty

  • Dataset: ~3,000 challenging questions across >100 subjects; public split on Hugging Face with a private held‑out test set.
  • Sample questions are considered extremely hard; many commenters say they can solve 1–3, and suspect very few humans could solve 5+ without preparation.
  • Some find the computer science questions comparatively easy (multiple choice, eliminable by reasoning), while math and some domain‑specific questions are much harder.
  • Several note many questions test narrow, obscure knowledge (e.g., detailed bird anatomy) more than general problem‑solving.

Scores, models, and calibration

  • Current top accuracies are under 10%; in text‑only evaluation, DeepSeek R1 appears to perform best, with OpenAI’s o1 at roughly 8.9%.
  • Discussion of “calibration error”: lower error is seen as positive because it means the model is less often confidently wrong.
  • Some question comparability because not all models have both multimodal and text‑only evaluations reported in the same way.
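To make the calibration point concrete, here is a minimal sketch of expected calibration error (ECE), one common way to quantify “confidently wrong” behavior; the benchmark’s exact calibration metric may be defined differently, so treat this as illustrative only:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by stated confidence, then average the gap
    between mean confidence and actual accuracy across bins,
    weighted by bin size."""
    assert len(confidences) == len(correct)
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Include confidence == 1.0 in the top bin.
        idx = [i for i, c in enumerate(confidences)
               if lo <= c < hi or (b == n_bins - 1 and c == 1.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(avg_conf - accuracy)
    return ece

# A model claiming 90% confidence but answering correctly only half
# the time is badly calibrated, regardless of its raw accuracy:
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0]))  # ≈ 0.4
```

Under this definition, a model that scores under 10% but states low confidence on the questions it misses would still earn a low calibration error, which is why commenters treat the two numbers as separate signals.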

Intelligence vs. knowledge and benchmark scope

  • Many argue the benchmark mostly measures knowledge and academic problem‑solving, not “general intelligence.”
  • Long subthread contrasts intelligence, knowledge, and wisdom; stresses that intelligence is about applying knowledge to new settings, which is harder to test.
  • Some defend the benchmark as a pragmatic tool: we should test what we want models to be able to do, not solve “intelligence” philosophically.
  • Others want benchmarks for spatial reasoning, theory of mind, agency, planning, social interaction, and real‑world tasks; captchas and simulations are mentioned.
  • ARC‑AGI and work on measuring intelligence (e.g., a separately linked paper) are cited as alternative or complementary efforts.

Overfitting and benchmark lifecycle

  • Multiple comments note that once public, benchmarks quickly become training data, reducing their value as progress indicators.
  • Private, black‑box test suites are proposed, but there is pushback that opaque scoring would be hard to trust.

Branding and marketing criticism

  • The name “Humanity’s Last Exam” is widely seen as grandiose, arrogant, and marketing‑driven rather than literal.
  • Some feel this continues a pattern of overhyping AI capabilities and existential stakes.

Contest and compensation issues

  • Several contributors describe a question‑submission contest with shifting deadlines and unclear payout criteria.
  • They allege expectations of higher rewards were undermined when the deadline was extended and selection tightened, and some feel misled or “conned.”
  • Suggestions include small‑claims actions or class‑action lawsuits; others criticize the broader labor practices of data/labeling platforms.