FrontierMath was funded by OpenAI
Benchmark funding, access, and transparency
- FrontierMath, marketed as an independent, private math benchmark, was in fact funded by OpenAI via a contract that barred disclosure of its involvement until around the o3 launch.
- OpenAI had access to “a large fraction” or “most” of the problems and solutions, with only a holdout set claimed to be unseen. Later comments suggest this holdout set may not yet exist or may still be under construction.
- Many see this as a serious conflict of interest and a breach of trust with problem contributors, some of whom say they would have declined to participate had the funding been disclosed.
Claims of benchmark gaming and data contamination
- Several commenters believe o3’s 25% FrontierMath score is inflated by heavy or total contamination, possibly via:
- Direct training on the data (in breach of a verbal “no training” agreement).
- Using the dataset for validation/early stopping/hyperparameter tuning.
- Using it to guide synthetic data generation or curating adjacent training data.
- Others argue that outright training on the set would likely push accuracy well above 25%, and that the difficulty of memorizing long reasoning chains may limit how much overfitting is even possible.
- Some think the 25% figure is “roughly legit” but still compromised by the process; others say the benchmark should be discarded altogether.
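Direct contamination (the first channel above) is the kind labs typically screen for with n-gram overlap checks between benchmark items and the training corpus. A minimal sketch of the idea, with hypothetical helper names and toy data (real pipelines operate at corpus scale with deduplicated shingles, but the principle is the same):

```python
def ngrams(text: str, n: int = 8) -> set:
    """All n-token shingles of a whitespace-tokenized, lowercased string."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(problem: str, corpus_docs: list, n: int = 8) -> bool:
    """True if any n-gram of `problem` appears verbatim in any corpus doc."""
    probe = ngrams(problem, n)
    return any(probe & ngrams(doc, n) for doc in corpus_docs)

problem = "let p be the smallest prime such that p squared plus one is divisible by ten"
corpus = [
    "unrelated text about cooking pasta with plenty of salted water and olive oil",
    # a scraped copy of the problem in the training data would trigger the check:
    "forum post: let p be the smallest prime such that p squared plus one is divisible by ten",
]
print(is_contaminated(problem, corpus))      # True: verbatim 8-gram overlap
print(is_contaminated(problem, corpus[:1]))  # False: no shared 8-gram
```

Note that this only catches verbatim copying; the subtler channels above (validation use, guided synthetic data) leave no n-gram trace at all.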
Evaluation methodology and incentives
- Technical debates center on:
- The distinction between train/validation/test sets, and how repeated evaluation effectively turns a test set into a validation set.
- How even “no training” agreements can be sidestepped while still gaining a large advantage.
- Many argue that because the public cannot access FrontierMath, claims cannot be independently checked, making it easy to juice results without consequence.
- Others counter that if o3 underperforms once widely available, any discrepancy will be obvious to users.
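The test-set-becomes-validation-set point is easy to demonstrate: if you repeatedly score candidate models (checkpoints, hyperparameter settings) against a fixed “held-out” set and keep the best one, the reported number is inflated even though no model ever trains on the set. A toy simulation with chance-level models (all numbers and names here are illustrative, not OpenAI’s actual process):

```python
import random

N_TEST = 100       # size of the fixed "private" test set
N_FRESH = 100      # a genuinely unseen set of the same size
N_CANDIDATES = 50  # checkpoints / hyperparameter settings evaluated

def score(model_seed: int, dataset_seed: int, n: int) -> float:
    """Accuracy of a pure-chance model: each answer is a fair coin flip."""
    rng = random.Random(model_seed * 1000 + dataset_seed)
    return sum(rng.random() < 0.5 for _ in range(n)) / n

# Select whichever candidate happens to look best on the reused test set...
best = max(range(N_CANDIDATES), key=lambda m: score(m, dataset_seed=1, n=N_TEST))
reported = score(best, dataset_seed=1, n=N_TEST)  # the number that gets announced
honest = score(best, dataset_seed=2, n=N_FRESH)   # what fresh problems would show

print(f"best-of-{N_CANDIDATES} score on reused test set: {reported:.2f}")
print(f"same model on genuinely fresh data:       {honest:.2f}")
# Every model here is chance-level (true accuracy 0.50), yet selection
# on the reused test set typically pushes the reported score well above it.
```

The same selection effect applies to early stopping and to curating data mixtures that maximize the benchmark score, which is why “we never trained on it” does not by itself rule out a compromised number.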
Trust in OpenAI and the wider benchmark ecosystem
- Critics see this as part of a pattern of misleading marketing, dark UI patterns, and hype-driven benchmark use to impress investors.
- Defenders say OpenAI’s models generally match advertised quality and that large labs face strong reputational and technical scrutiny.
- Broader consensus: AI benchmarks are increasingly easy to game; future evaluations will likely be:
- Proprietary, internal to companies for their own use cases.
- Or run by independent third parties with strict blinding and accreditation.
Copyright and training data (tangent)
- A long side-thread debates whether training on copyrighted data is legal (fair use vs infringement), how LLM outputs relate to copying, and whether large AI firms benefit from de facto immunity compared to individuals.
- No agreement is reached; status is described as legally unsettled and highly dependent on ongoing cases.