Results of "Humanity's Last Exam" benchmark published

Benchmark design and difficulty

  • Dataset: ~3,000 challenging questions across >100 subjects; public split on Hugging Face with a private held‑out test set.
  • Sample questions are considered extremely hard; many commenters say they can solve 1–3, and suspect very few humans could solve 5+ without preparation.
  • Some find the computer science questions comparatively easy (multiple choice, eliminable by reasoning), while math and some domain‑specific questions are much harder.
  • Several note many questions test narrow, obscure knowledge (e.g., detailed bird anatomy) more than general problem‑solving.

Scores, models, and calibration

  • Current top accuracies are under 10%; in text‑only evaluation, DeepSeek R1 appears to perform best, with OpenAI’s o1 at roughly 8.9%.
  • Discussion of “calibration error”: lower error is seen as positive because it means the model is less often confidently wrong.
  • Some question comparability because not all models have both multimodal and text‑only evaluations reported in the same way.
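To make the calibration point concrete, here is a minimal sketch of expected calibration error (ECE), one common way to quantify “confidently wrong” behavior; the benchmark’s exact calibration metric may be defined differently, so treat this as illustrative only:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by stated confidence, then average the gap
    between mean confidence and actual accuracy across bins,
    weighted by bin size."""
    assert len(confidences) == len(correct)
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Include confidence == 1.0 in the top bin.
        idx = [i for i, c in enumerate(confidences)
               if lo <= c < hi or (b == n_bins - 1 and c == 1.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += (len(idx) / n) * abs(avg_conf - accuracy)
    return ece

# A model claiming 90% confidence but answering correctly only half
# the time is badly calibrated, regardless of its raw accuracy:
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0]))  # ≈ 0.4
```

Under this definition, a model that scores under 10% but states low confidence on the questions it misses would still earn a low calibration error, which is why commenters treat the two numbers as separate signals.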

Intelligence vs. knowledge and benchmark scope

  • Many argue the benchmark mostly measures knowledge and academic problem‑solving, not “general intelligence.”
  • Long subthread contrasts intelligence, knowledge, and wisdom; stresses that intelligence is about applying knowledge to new settings, which is harder to test.
  • Some defend the benchmark as a pragmatic tool: we should test what we want models to be able to do, not solve “intelligence” philosophically.
  • Others want benchmarks for spatial reasoning, theory of mind, agency, planning, social interaction, and real‑world tasks; captchas and simulations are mentioned.
  • ARC‑AGI and work on measuring intelligence (e.g., a separately linked paper) are cited as alternative or complementary efforts.

Overfitting and benchmark lifecycle

  • Multiple comments note that once public, benchmarks quickly become training data, reducing their value as progress indicators.
  • Private, black‑box test suites are proposed, but there is pushback that opaque scoring would be hard to trust.

Branding and marketing criticism

  • The name “Humanity’s Last Exam” is widely seen as grandiose, arrogant, and marketing‑driven rather than literal.
  • Some feel this continues a pattern of overhyping AI capabilities and existential stakes.

Contest and compensation issues

  • Several contributors describe a question‑submission contest with shifting deadlines and unclear payout criteria.
  • They allege expectations of higher rewards were undermined when the deadline was extended and selection tightened, and some feel misled or “conned.”
  • Suggestions include small‑claims actions or class‑action lawsuits; others criticize the broader labor practices of data/labeling platforms.