Results of "Humanity's Last Exam" benchmark published
Benchmark design and difficulty
- Dataset: ~3,000 challenging questions across >100 subjects; a public split is hosted on Hugging Face, with a private held‑out test set (see the loading sketch after this list).
- The sample questions are considered extremely hard; many commenters say they can solve only 1–3 of them, and suspect very few people could solve 5+ without targeted preparation.
- Some find the computer science questions comparatively easy (multiple choice, with options that can be eliminated by reasoning), while the math and some domain‑specific questions are much harder.
- Several note many questions test narrow, obscure knowledge (e.g., detailed bird anatomy) more than general problem‑solving.
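For readers who want to inspect the public questions themselves, here is a minimal sketch using the Hugging Face `datasets` library. The dataset identifier, split name, and column names below are assumptions for illustration; check the actual Hub page for the real values.

```python
from collections import Counter
from datasets import load_dataset

# Hypothetical dataset identifier and split name; substitute the real ones
# from the Hugging Face Hub page for the benchmark.
ds = load_dataset("cais/hle", split="test")

print(ds)                        # row count and column names
print(ds[0]["question"][:200])   # peek at the first question (column name assumed)

# Rough per-subject counts, assuming a "category"-style column exists.
subject_counts = Counter(ds["category"])
print(subject_counts.most_common(10))
```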
Scores, models, and calibration
- Current top accuracies are under 10%; DeepSeek R1 appears to lead the text‑only evaluations, with OpenAI’s o1 at around 8.9% text‑only.
- Discussion of “calibration error”: lower error is seen as a positive because it means the model is less confidently wrong (see the sketch after this list).
- Some question comparability because not all models have both multimodal and text‑only evaluations reported in the same way.
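The thread’s “calibration error” refers to the gap between a model’s stated confidence and its actual accuracy. Below is a minimal sketch of one common way to quantify it, expected calibration error (ECE), purely for illustration; the benchmark’s exact metric may differ.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Generic ECE sketch: bin answers by stated confidence and compare each
    bin's average confidence to its empirical accuracy.
    `confidences` are in [0, 1]; `correct` is 0/1 per question."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(confidences[mask].mean() - correct[mask].mean())
        ece += (mask.sum() / len(confidences)) * gap
    return ece

# Toy example: a model that is 90% confident but only 40% right is
# "confidently wrong" and gets a large calibration error (~0.5 here).
print(expected_calibration_error([0.9] * 10, [1, 0, 0, 1, 0, 0, 1, 0, 1, 0]))
```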
Intelligence vs. knowledge and benchmark scope
- Many argue the benchmark mostly measures knowledge and academic problem‑solving, not “general intelligence.”
- Long subthread contrasts intelligence, knowledge, and wisdom; stresses that intelligence is about applying knowledge to new settings, which is harder to test.
- Some defend the benchmark as a pragmatic tool: we should test what we want models to be able to do, not solve “intelligence” philosophically.
- Others want benchmarks for spatial reasoning, theory of mind, agency, planning, social interaction, and real‑world tasks; captchas and simulations are mentioned.
- ARC‑AGI and other work on measuring intelligence (a separate paper is linked) are cited as alternative or related efforts.
Overfitting and benchmark lifecycle
- Multiple comments note that once public, benchmarks quickly become training data, reducing their value as progress indicators.
- Private, black‑box test suites are proposed, but there is pushback that opaque scoring would be hard to trust.
Branding and marketing criticism
- The name “Humanity’s Last Exam” is widely seen as grandiose, arrogant, and marketing‑driven rather than literal.
- Some feel this continues a pattern of overhyping AI capabilities and existential stakes.
Contest and compensation issues
- Several contributors describe a question‑submission contest with shifting deadlines and unclear payout criteria.
- They allege expectations of higher rewards were undermined when the deadline was extended and selection tightened, and some feel misled or “conned.”
- Suggestions include small‑claims actions or class‑action lawsuits; others criticize the broader labor practices of data/labeling platforms.