IQ test results for AI

Benchmark validity & overfitting

  • Many see this as “just another benchmark to overfit,” predicting vendors will tune specifically to these items for marketing claims (“a 170-IQ worker”) rather than for genuine capability.
  • Some note the presence of an “offline” test set with lower scores, but doubt it is truly outside training data or leak-free; the concern is that the benchmark measures dataset coverage more than reasoning.
  • Several argue that a single pass per question is insufficient: an LLM may land on the right answer via the wrong pattern; a meaningful score would need repeated sampling per item and inspection of reasoning traces (a minimal version is sketched below).
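
One way the repeated-sampling idea could look in code, assuming a hypothetical ask_model(prompt) wrapper around whatever API is under test and a sampling temperature high enough for answers to vary; the point is to report per-item pass rates and answer stability rather than one-shot hits:

    from collections import Counter

    def estimate_item_stats(ask_model, item, n_samples=20):
        # ask_model(prompt) -> answer string is a hypothetical stand-in for
        # the real API call; item is assumed to carry "prompt" and "answer".
        answers = Counter(ask_model(item["prompt"]) for _ in range(n_samples))
        majority, majority_count = answers.most_common(1)[0]
        return {
            "pass_rate": answers[item["answer"]] / n_samples,  # fraction correct
            "majority": majority,                              # consensus answer
            "stable": majority_count / n_samples >= 0.9,       # low answer variance?
        }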

IQ vs AI: category errors and time limits

  • Strong pushback on assigning human-style IQ scores to LLMs, since:
    • Human IQ scores are normed against an age-adjusted reference population and administered under strict time limits; models are given effectively unlimited, parallel compute (see the norming sketch after this list).
    • IQ in humans measures variation under those constraints; a machine with near-unlimited memory and speed breaks the test’s core assumptions.
  • Some argue this mainly shows that with enough compute a model can brute-force short, low-context puzzles—closer to spellchecking or chess memorization than general intelligence.
  • Others say it’s still useful as a relative AI-vs-AI benchmark, but misleading when mapped to human percentiles.
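
For context on the norming point: a deviation IQ is just a percentile on a human reference population mapped onto a normal curve with mean 100 and SD 15. A minimal illustration using Python’s statistics.NormalDist (the percentile below is only an example):

    from statistics import NormalDist

    def iq_from_percentile(p):
        # Deviation IQ: mean 100, SD 15 relative to a human norming sample.
        # The conversion is meaningless without that reference group, which
        # is the commenters' core objection to applying it to models.
        return 100 + 15 * NormalDist().inv_cdf(p)

    print(round(iq_from_percentile(0.9962)))  # ~140, roughly the top 0.4%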

What IQ actually measures (for humans)

  • One long subthread explains g, the general factor derived from the “positive manifold” (the finding that scores on diverse cognitive tests all correlate positively), and notes IQ’s test-retest stability, predictive power for education and work outcomes, and cross-test consistency (toy illustration after this list).
  • A very large debate erupts over genetics vs environment:
    • One side cites heritability estimates from twin and adoption studies (e.g., the classic twin-design estimate h² ≈ 2(r_MZ − r_DZ)) and holds that g is largely genetic.
    • The other stresses environmental variation: health, nutrition, education, socioeconomic status, test-practice effects, and the Flynn effect; this side argues The Bell Curve is politicized and out of date.
  • Multiple posters argue IQ is good at detecting deficits but much weaker at predicting outcomes within the normal-to-high range, and that it is often misused for group/race claims.
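
A toy illustration of how g falls out of the positive manifold: given any battery of positively inter-correlated subtests, the first principal component absorbs a large share of the variance and loads positively on every test. The correlation values here are invented for illustration, not real test data:

    import numpy as np

    # Invented positive manifold: every subtest correlates positively.
    R = np.array([
        [1.0, 0.6, 0.5, 0.4],   # vocabulary
        [0.6, 1.0, 0.5, 0.4],   # arithmetic
        [0.5, 0.5, 1.0, 0.5],   # matrix reasoning
        [0.4, 0.4, 0.5, 1.0],   # digit span
    ])

    eigvals, eigvecs = np.linalg.eigh(R)   # eigenvalues in ascending order
    g = eigvecs[:, -1]                     # first PC = crude g loadings
    g = g if g.sum() > 0 else -g           # eigenvector sign is arbitrary
    print(np.round(g, 2), f"{eigvals[-1] / eigvals.sum():.0%} of variance")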

Are LLMs “intelligent”?

  • Some note that models can score “140 IQ” yet fail simple tasks (e.g., counting the letters in “blueberry,” drawing clock faces), which to these posters demonstrates that IQ ≠ broad competence.
  • Others counter that humans with high IQ also fail basic tasks; the more relevant question is adaptability to novel variants of a task.
  • There is interest in AI-specific “g-like” benchmarks (e.g., ARC-AGI, time-horizon coding tests) instead of repurposed human IQ tests.

Political bias results

  • The site’s political quiz shows all major models clustering as left-libertarian/“liberal.”
  • Explanations offered:
    • Training data and RLHF favor broadly egalitarian, non-authoritarian positions.
    • The quiz itself is biased, framing issues as “humans vs corporations” (a toy scoring sketch follows this list).
  • Some see this as evidence of “massaged” ideology; others say it’s what you’d expect from models trained on mainstream scientific and media discourse.
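
For readers who have not taken one of these quizzes, a toy sketch of how two-axis compass scoring typically works; the questions and weights below are invented, and the bias objection above is exactly that such weights and framings can bake in a slant:

    # Hypothetical two-axis scoring in the style of political-compass quizzes.
    QUESTIONS = [
        # (text, economic weight, social weight); agreement adds the weights
        ("Large corporations need stricter regulation.", -1.0,  0.0),
        ("State surveillance keeps people safe.",         0.0, +1.0),
    ]

    def score(answers):
        # answers: floats in [-2, +2], strongly disagree .. strongly agree
        econ = sum(a * we for a, (_, we, _) in zip(answers, QUESTIONS))
        soc  = sum(a * ws for a, (_, _, ws) in zip(answers, QUESTIONS))
        return econ, soc    # (left..right, libertarian..authoritarian)

    print(score([+2.0, -1.0]))  # (-2.0, -1.0): the left-libertarian quadrant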

Implementation & other observations

  • Vision models perform much worse than verbal ones, in part because the verbal versions often name the pattern outright (e.g., describing items as “clocks” showing times), which effectively solves half the puzzle up front.
  • Several question benchmark contamination: many IQ test items and their answers are already online (a crude overlap probe is sketched below).
  • Some call the whole enterprise fun but “basically useless” for model selection; others like it as a rough, intuitive metric for non-experts.
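
A crude version of the contamination probe mentioned above, assuming you had access to at least a sample of the training corpus (for commercial models you generally do not, which is part of the complaint); it flags test items whose word n-grams appear verbatim in the corpus:

    def ngram_overlap(item_text, corpus_text, n=8):
        # Fraction of the item's word 8-grams found verbatim in the corpus
        # sample; high overlap suggests the item or its solution was seen
        # during training. A real check would scan an indexed corpus.
        words = item_text.split()
        grams = {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
        return sum(g in corpus_text for g in grams) / len(grams) if grams else 0.0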