IQ test results for AI
Benchmark validity & overfitting
- Many see this as "just another benchmark to overfit": they predict vendors will tune models to these specific items for marketing claims ("170 IQ worker") rather than for genuine capability.
- Some note that the site also maintains an "offline" test set on which scores are lower, but doubt it is truly outside the training data or leak-free; the worry is that the benchmark measures dataset coverage more than reasoning.
- Several argue that a single pass per question is insufficient: an LLM may answer correctly via the wrong pattern, so a meaningful result would need repeated sampling per item plus inspection of the reasoning traces (see the sketch below).
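A minimal sketch of that repeated-sampling protocol, assuming a hypothetical `ask_model(prompt)` wrapper that returns one sampled answer plus its reasoning trace (neither the function nor the item format comes from the discussion):

```python
import collections

def score_item(ask_model, prompt, correct_answer, n=10):
    """Sample the model n times on one test item (at nonzero temperature)
    and measure how consistently it reaches the right answer.

    ask_model(prompt) -> (answer, reasoning_trace) is a hypothetical wrapper.
    """
    answers, traces = [], []
    for _ in range(n):
        answer, trace = ask_model(prompt)
        answers.append(answer)
        traces.append(trace)

    counts = collections.Counter(answers)
    return {
        "pass_rate": answers.count(correct_answer) / n,  # fraction of samples correct
        "majority_answer": counts.most_common(1)[0][0],  # self-consistency vote
        "traces": traces,  # kept so a human can check *why* it answered
    }
```

A pass rate near 1.0 with coherent traces is much stronger evidence than a single lucky hit; a low pass rate on a "solved" item suggests pattern-matching noise rather than reasoning.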
IQ vs AI: category errors and timing
- Strong pushback on assigning human-style IQ scores to LLMs, since:
- Human IQ is normed, age-adjusted, and heavily time-limited (see the note after this list); models are given effectively unlimited time and parallel compute.
- IQ in humans measures variation under shared constraints; a machine with effectively unlimited memory and speed breaks those core assumptions.
- Some argue the results mainly show that, with enough compute, a model can brute-force short, low-context puzzles, which is closer to spellchecking or memorized chess lines than to general intelligence.
- Others say it's still useful as a relative AI-vs-AI benchmark, but misleading when mapped onto human percentiles.
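For context on the norming point above: a deviation IQ is defined relative to an age-matched reference population, conventionally rescaled to mean 100 and standard deviation 15 (as on the Wechsler scales), which is precisely the step that has no defined meaning for a model with no age cohort:

$$\mathrm{IQ} = 100 + 15\,\frac{x - \mu_{\text{age}}}{\sigma_{\text{age}}}$$

where $x$ is the raw test score and $\mu_{\text{age}}$, $\sigma_{\text{age}}$ are the mean and standard deviation of raw scores in the test-taker's age cohort.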
What IQ actually measures (for humans)
- One long subthread explains g, the general factor extracted from the "positive manifold" (the empirical finding that scores on diverse cognitive tests all correlate positively), and notes IQ's stability, its predictive power for education and work outcomes, and its cross-test consistency (see the sketch after this list).
- A very large debate erupts over genetics vs environment:
- One side cites heritability estimates, twin/adoption studies, and “g as largely genetic”.
- The other stresses environmental variation (health, nutrition, education, socioeconomics, test-practice effects) and the Flynn effect, and argues that The Bell Curve is politicized and out of date.
- Multiple posters argue IQ is good at detecting deficits but much weaker at predicting outcomes within the normal-to-high range, and that it is often misused to support group/race claims.
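To make the "positive manifold" point concrete, here is a toy simulation (synthetic data, not a real test battery) that generates correlated test scores from a shared latent factor and then recovers a g-like factor as the first eigenvector of the correlation matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy positive manifold: every simulated test score is partly driven by
# one shared latent factor g, partly by test-specific noise.
n_people, n_tests = 1000, 6
g = rng.normal(size=(n_people, 1))                   # latent general ability
loadings = rng.uniform(0.5, 0.9, size=(1, n_tests))  # how g-loaded each test is
noise = rng.normal(size=(n_people, n_tests))
scores = g @ loadings + noise * np.sqrt(1 - loadings**2)

# Every pairwise correlation comes out positive -- the "positive manifold".
R = np.corrcoef(scores, rowvar=False)
off_diag = R[~np.eye(n_tests, dtype=bool)]
print("min pairwise correlation:", round(float(off_diag.min()), 2))

# A g-like factor: the first eigenvector of the correlation matrix.
eigvals, _ = np.linalg.eigh(R)  # eigenvalues returned in ascending order
print("variance share of first factor:", round(float(eigvals[-1] / n_tests), 2))
```

The debate in the subthread is about what that shared factor means and where it comes from, not about whether the correlation structure exists.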
Are LLMs “intelligent”?
- Some note that models can score "140 IQ" yet fail trivial tasks (e.g., counting the letters in "blueberry", drawing clock faces), which they take as evidence that IQ scores ≠ broad competence.
- Others counter that humans with high IQ also fail basic tasks; the more relevant question is adaptability to novel variants of a task.
- There is interest in AI-specific “g-like” benchmarks (e.g., ARC-AGI, time-horizon coding tests) instead of repurposed human IQ tests.
Political bias results
- The site’s political quiz shows all major models clustering as left-libertarian/“liberal.”
- Explanations offered:
- Training data and RLHF favor broadly egalitarian, non-authoritarian positions.
- The quiz itself is biased (framing issues as “humans vs corporations”).
- Some see this as evidence of “massaged” ideology; others say it’s what you’d expect from models trained on mainstream scientific and media discourse.
Implementation & other observations
- Models score much lower on the vision version of the test than on the verbalized one, partly because the verbal transcription often names the pattern outright (e.g., describing the figures as clocks showing specific times), effectively solving half the puzzle up front.
- Several raise benchmark contamination: many standard IQ tests and their answer keys are already online (a crude overlap check is sketched at the end of this section).
- Some call the whole enterprise fun but “basically useless” for model selection; others like it as a rough, intuitive metric for non-experts.
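On the contamination point above, this is a crude check one could run against any inspectable corpus, in the spirit of the verbatim n-gram overlap filters described in LLM training reports; the function and the n=8 threshold are illustrative, not from the discussion:

```python
def ngram_contaminated(item_text: str, corpus_docs: list[str], n: int = 8) -> bool:
    """Flag a test item if any n consecutive tokens of it appear verbatim
    in a reference corpus. A stand-in for real decontamination, which is
    impossible to verify for closed models whose training data is private.
    """
    tokens = item_text.lower().split()
    item_ngrams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    if not item_ngrams:
        return False
    for doc in corpus_docs:
        dt = doc.lower().split()
        if any(tuple(dt[i:i + n]) in item_ngrams for i in range(len(dt) - n + 1)):
            return True
    return False
```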