IQ test results for AI
Benchmark validity & overfitting
- Many see this as "just another benchmark to overfit": they predict vendors will tune models to these specific items for marketing claims ("170 IQ worker") rather than for genuine capability.
- Some note that the site also maintains an "offline" test set on which scores are lower, but doubt it is truly outside the training data or leak-free; the worry is that the benchmark measures dataset coverage more than reasoning.
- Several argue that a single pass per question is insufficient: an LLM may answer correctly via the wrong pattern, so a meaningful result would need repeated sampling per item plus inspection of the reasoning traces (see the sketch below).
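A minimal sketch of that repeated-sampling protocol, assuming a hypothetical `ask_model(prompt)` wrapper that returns one sampled answer plus its reasoning trace (neither the function nor the item format comes from the discussion):

```python
import collections

def score_item(ask_model, prompt, correct_answer, n=10):
    """Sample the model n times on one test item (at nonzero temperature)
    and measure how consistently it reaches the right answer.

    ask_model(prompt) -> (answer, reasoning_trace) is a hypothetical wrapper.
    """
    answers, traces = [], []
    for _ in range(n):
        answer, trace = ask_model(prompt)
        answers.append(answer)
        traces.append(trace)

    counts = collections.Counter(answers)
    return {
        "pass_rate": answers.count(correct_answer) / n,  # fraction of samples correct
        "majority_answer": counts.most_common(1)[0][0],  # self-consistency vote
        "traces": traces,  # kept so a human can check *why* it answered
    }
```

A pass rate near 1.0 with coherent traces is much stronger evidence than a single lucky hit; a low pass rate on a "solved" item suggests pattern-matching noise rather than reasoning.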
IQ vs AI: category errors and timing
- Strong pushback on assigning human-style IQ scores to LLMs, since:
- Human IQ is normed, age-adjusted, and heavily time-limited (see the note after this list); models are given effectively unlimited time and parallel compute.
- IQ in humans measures variation under shared constraints; a machine with effectively unlimited memory and speed breaks those core assumptions.
- Some argue the results mainly show that, with enough compute, a model can brute-force short, low-context puzzles, which is closer to spellchecking or memorized chess lines than to general intelligence.
- Others say it's still useful as a relative AI-vs-AI benchmark, but misleading when mapped onto human percentiles.
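For context on the norming point above: a deviation IQ is defined relative to an age-matched reference population, conventionally rescaled to mean 100 and standard deviation 15 (as on the Wechsler scales), which is precisely the step that has no defined meaning for a model with no age cohort:

$$\mathrm{IQ} = 100 + 15\,\frac{x - \mu_{\text{age}}}{\sigma_{\text{age}}}$$

where $x$ is the raw test score and $\mu_{\text{age}}$, $\sigma_{\text{age}}$ are the mean and standard deviation of raw scores in the test-taker's age cohort.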
What IQ actually measures (for humans)
- One long subthread explains g, the general factor extracted from the "positive manifold" (the empirical finding that scores on diverse cognitive tests all correlate positively), and notes IQ's stability, its predictive power for education and work outcomes, and its cross-test consistency (see the sketch after this list).
- A very large debate erupts over genetics vs environment:
- One side cites heritability estimates, twin/adoption studies, and “g as largely genetic”.
- The other stresses environmental variation (health, nutrition, education, socioeconomics, test-practice effects) and the Flynn effect, and argues that The Bell Curve is politicized and out of date.
- Multiple posters argue IQ is good at detecting deficits but much weaker at predicting outcomes within the normal-to-high range, and that it is often misused to support group/race claims.
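To make the "positive manifold" point concrete, here is a toy simulation (synthetic data, not a real test battery) that generates correlated test scores from a shared latent factor and then recovers a g-like factor as the first eigenvector of the correlation matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy positive manifold: every simulated test score is partly driven by
# one shared latent factor g, partly by test-specific noise.
n_people, n_tests = 1000, 6
g = rng.normal(size=(n_people, 1))                   # latent general ability
loadings = rng.uniform(0.5, 0.9, size=(1, n_tests))  # how g-loaded each test is
noise = rng.normal(size=(n_people, n_tests))
scores = g @ loadings + noise * np.sqrt(1 - loadings**2)

# Every pairwise correlation comes out positive -- the "positive manifold".
R = np.corrcoef(scores, rowvar=False)
off_diag = R[~np.eye(n_tests, dtype=bool)]
print("min pairwise correlation:", round(float(off_diag.min()), 2))

# A g-like factor: the first eigenvector of the correlation matrix.
eigvals, _ = np.linalg.eigh(R)  # eigenvalues returned in ascending order
print("variance share of first factor:", round(float(eigvals[-1] / n_tests), 2))
```

The debate in the subthread is about what that shared factor means and where it comes from, not about whether the correlation structure exists.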
Are LLMs “intelligent”?
- Some note that models can score "140 IQ" yet fail trivial tasks (e.g., counting the letters in "blueberry", drawing clock faces), which they take as evidence that IQ scores ≠ broad competence.
- Others counter that humans with high IQ also fail basic tasks; the more relevant question is adaptability to novel variants of a task.
- There is interest in AI-specific “g-like” benchmarks (e.g., ARC-AGI, time-horizon coding tests) instead of repurposed human IQ tests.
Political bias results
- The site’s political quiz shows all major models clustering as left-libertarian/“liberal.”
- Explanations offered:
- Training data and RLHF favor broadly egalitarian, non-authoritarian positions.
- The quiz itself is biased (framing issues as “humans vs corporations”).
- Some see this as evidence of “massaged” ideology; others say it’s what you’d expect from models trained on mainstream scientific and media discourse.
Implementation & other observations
- Models score much lower on the vision version of the test than on the verbalized one, partly because the verbal transcription often names the pattern outright (e.g., describing the figures as clocks showing specific times), effectively solving half the puzzle up front.
- Several raise benchmark contamination: many standard IQ tests and their answer keys are already online (a crude overlap check is sketched at the end of this section).
- Some call the whole enterprise fun but “basically useless” for model selection; others like it as a rough, intuitive metric for non-experts.
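On the contamination point above, this is a crude check one could run against any inspectable corpus, in the spirit of the verbatim n-gram overlap filters described in LLM training reports; the function and the n=8 threshold are illustrative, not from the discussion:

```python
def ngram_contaminated(item_text: str, corpus_docs: list[str], n: int = 8) -> bool:
    """Flag a test item if any n consecutive tokens of it appear verbatim
    in a reference corpus. A stand-in for real decontamination, which is
    impossible to verify for closed models whose training data is private.
    """
    tokens = item_text.lower().split()
    item_ngrams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    if not item_ngrams:
        return False
    for doc in corpus_docs:
        dt = doc.lower().split()
        if any(tuple(dt[i:i + n]) in item_ngrams for i in range(len(dt) - n + 1)):
            return True
    return False
```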