2026-06-23

Will It Mythos?

Benchmark design and scope

The corpus consists of real bugs that Mythos previously found; other models are tested on the same files.
During benchmarking, models are asked to audit a file (with optional repo context) without being told where or what the bug is.
A stronger “judge” model, given the bug location and description, scores whether contestants found and explained the right issue.
Cost caps (e.g., $100 per model) significantly affect results; GPT‑5.5‑Pro completed only 4 of 9 cases before hitting the limit.
A minimal harness with basic tools (read/grep) performed as well as or better than richer “agent” setups, while consuming fewer tokens.
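
To illustrate how a per-model cost cap can truncate a run, here is a minimal sketch. It is not the benchmark's actual harness: the names `CaseResult`, `run_with_cost_cap`, and `fake_run_case`, the flat $30 per-case cost, and the rule that a case already started is allowed to finish are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    case_id: str
    cost_usd: float   # spend attributed to this case
    found_bug: bool   # verdict a judge would assign (hypothetical here)


def run_with_cost_cap(case_ids, run_case, cap_usd=100.0):
    """Run cases in order and stop starting new ones once spend reaches the cap.

    A case that has already started is allowed to finish, so total spend
    can overshoot the cap by up to one case's cost (an assumption of this sketch).
    """
    results = []
    spent = 0.0
    for case_id in case_ids:
        if spent >= cap_usd:
            break  # budget exhausted: remaining cases are never attempted
        result = run_case(case_id)
        spent += result.cost_usd
        results.append(result)
    return results, spent


def fake_run_case(case_id):
    # Hypothetical stand-in for a real model call: every case costs $30.
    return CaseResult(case_id=case_id, cost_usd=30.0, found_bug=False)


cases = [f"case-{i}" for i in range(1, 10)]  # nine cases, as in the benchmark
results, spent = run_with_cost_cap(cases, fake_run_case, cap_usd=100.0)
print(f"completed {len(results)}/{len(cases)} cases, spent ${spent:.2f}")
```

Under these made-up costs an expensive model attempts only the first few cases, so its score is computed over a smaller and possibly unrepresentative subset.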

Relative model performance

No public model matched Mythos’ implied 9/9 performance; top non‑Mythos models typically found 4/9 bugs.
GPT‑5.5‑Pro ranked first by percentage only because it completed fewer cases; it is also considered unrealistically expensive for broad audits.
A follow‑on analysis using Wilson scores together with time and cost suggests DeepSeek‑V4 and MiMo v2.5 Pro are the best value among tested models.
Replication runs suggest Gemma 4 31B (dense) is exceptionally strong for its size, sometimes finding 6/9 bugs and rivaling larger models.
Cheap Chinese/open models (DeepSeek, MiMo, Qwen) are seen as genuinely competitive, not just “benchmaxxed”.
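
The Wilson score analysis mentioned above can be sketched as follows. This is the standard Wilson score interval for a binomial proportion, not necessarily the exact computation used in the thread. The function name `wilson_interval` and the 95% level (z = 1.96) are assumptions, and the "3 of 4" row is a hypothetical count chosen only to show the effect of a short run.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Return the (lower, upper) Wilson score interval for successes/trials."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials ** 2)) / denom
    return center - half_width, center + half_width


# 4/9 and 6/9 are counts reported in the thread; 3/4 is hypothetical.
for label, found, attempted in [
    ("4 of 9", 4, 9),
    ("6 of 9", 6, 9),
    ("3 of 4 (hypothetical)", 3, 4),
]:
    low, high = wilson_interval(found, attempted)
    print(f"{label}: raw {found / attempted:.2f}, interval [{low:.2f}, {high:.2f}]")
```

With only nine cases the intervals are wide and overlap heavily, so small differences in raw hit counts should be read with caution, and a run cut short by a cost cap is less certain still.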

Mythos/Fable vs other models

Many report Fable/Mythos as a noticeable step above Opus and others, especially in:
- Security auditing, reverse‑engineering, and finding subtle bugs.
- Spatial reasoning and complex math/geometry (e.g., 6DOF, computational geometry).
- Autonomously driving large code changes or meta‑applications.
Some users see smaller or task‑specific models (e.g., Codex, GPT‑5.5) outperform Fable on narrow, highly optimized workloads.
Others felt Fable/Mythos were overhyped or only marginally better, especially when accounting for high token use.

Guardrails, safety, and “nerfing”

Debate over whether Mythos is just a standard model with safety filters off vs a distinct fine‑tune plus specialized harness and persistence.
Some observe that certain Google offerings (Gemini via Antigravity) now resist security tasks, while Gemma 4 remains strong at bug‑finding.
Widespread perception that older frontier models (e.g., Opus 4.6, o4‑mini) have degraded over time; proposed mechanisms include quantization, reduced reasoning budgets, KV‑cache compression, and MoE expert reduction.
Skeptics compare “nerfing” claims to audiophile placebo; others cite performance trackers suggesting systematic pre‑release drift.

Usage patterns and broader implications

Several participants describe Fable as better at persistent, goal‑directed work but also more “agentic,” sometimes over‑autonomously editing code.
Some prefer chatting with models in a human‑like style, claiming better results and reduced cognitive mode‑switching; others worry this blurs human–machine boundaries and risks “AI psychosis”.
Key concern about Mythos: enabling non‑experts to find and weaponize zero‑days, though defenders can also use LLMs to discover and patch vulnerabilities.
