Will It Mythos?
Benchmark design and scope
- The corpus consists of real bugs that Mythos previously found; other models are tested on the same files.
- During benchmarking, models are asked to audit a file (with optional repo context) without being told where or what the bug is.
- A stronger “judge” model, given the bug location and description, scores whether contestants found and explained the right issue.
- Cost caps (e.g., $100 per model) significantly affect results; GPT‑5.5‑Pro completed only 4/9 cases before hitting the limit.
- A minimal harness with basic tools (read/grep) performed as well as or better than richer “agent” setups, while consuming fewer tokens.
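The setup described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: `Case`, `Result`, `run_benchmark`, and the `audit`/`judge` callables are hypothetical names, and the rule that the cap is checked before each case (so a case already started is allowed to finish) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Case:
    file_path: str        # file the contestant is asked to audit
    bug_location: str     # known only to the judge
    bug_description: str  # known only to the judge


@dataclass
class Result:
    attempted: int = 0
    found: int = 0
    spent_usd: float = 0.0


# Hypothetical interfaces: the contestant sees only the file path and
# returns (report_text, cost_usd); the judge sees the case details.
AuditFn = Callable[[str], Tuple[str, float]]
JudgeFn = Callable[[Case, str], bool]


def run_benchmark(cases: List[Case], audit: AuditFn, judge: JudgeFn,
                  cost_cap_usd: float = 100.0) -> Result:
    """Run cases in order until the cumulative spend reaches the cap."""
    result = Result()
    for case in cases:
        if result.spent_usd >= cost_cap_usd:
            break  # remaining cases are never attempted
        report, cost = audit(case.file_path)
        result.spent_usd += cost
        result.attempted += 1
        if judge(case, report):
            result.found += 1
    return result


if __name__ == "__main__":
    # Stub contestant and judge, for illustration only.
    cases = [Case(f"file_{i}.c", "line 1", "example bug") for i in range(9)]
    expensive = run_benchmark(cases, lambda path: ("report", 30.0),
                              lambda case, report: True)
    # With $30 per case and a $100 cap, only 4 of the 9 cases are attempted.
    print(expensive.attempted, expensive.found, expensive.spent_usd)
```

The point of the sketch is that an expensive model is scored on fewer cases, so its found/attempted percentage is not directly comparable to that of a model that ran all nine.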
Relative model performance
- No public model matched Mythos’ implied 9/9 performance; top non‑Mythos models typically found 4/9 bugs.
- GPT‑5.5‑Pro topped the ranking by percentage only because it ran fewer cases; it is also considered unrealistically expensive for broad audits.
- Follow‑on analysis using Wilson score intervals and time/cost suggests DeepSeek‑V4 and MiMo v2.5 Pro are the best value among tested models.
- Replication runs suggest Gemma 4 31B (dense) is exceptionally strong for its size, sometimes finding 6/9 bugs and rivaling larger models.
- Cheap Chinese/open models (DeepSeek, MiMo, Qwen) are seen as genuinely competitive, not just “benchmaxxed”.
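To illustrate why a Wilson score is a fairer ranking signal than raw percentage when models complete different numbers of cases, here is a minimal sketch of the standard Wilson score interval for a binomial proportion. The function name `wilson_interval` and the 95% level (z = 1.96) are assumptions; the thread does not say which confidence level or which bound the follow‑on analysis used, and the counts below are illustrative, not reported results.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int,
                    z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / trials + z2 / (4 * trials * trials)
    ) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Illustrative counts only: (bugs found, cases attempted).
    for found, attempted in [(1, 1), (4, 9), (9, 9)]:
        low, high = wilson_interval(found, attempted)
        print(f"{found}/{attempted}: lower={low:.3f} upper={high:.3f}")
```

Ranking by the lower bound penalizes small samples: a perfect 1/1 has a lower bound of roughly 0.21, while a perfect 9/9 has a lower bound of roughly 0.70, even though both are 100% by raw percentage.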
Mythos/Fable vs other models
- Many report Fable/Mythos as a noticeable step above Opus and others, especially in:
  - Security auditing, reverse‑engineering, and finding subtle bugs.
  - Spatial reasoning and complex math/geometry (e.g., 6DOF, computational geometry).
  - Autonomously driving large code changes or meta‑applications.
- Some users see smaller or task‑specific models (e.g., Codex, GPT‑5.5) outperform Fable on narrow, highly optimized workloads.
- Others feel Fable/Mythos are overhyped or only marginally better, especially when accounting for high token use.
Guardrails, safety, and “nerfing”
- Debate over whether Mythos is just a standard model with safety filters off vs a distinct fine‑tune plus specialized harness and persistence.
- Some observe that certain Google offerings (Gemini via Antigravity) now resist security tasks, while Gemma 4 remains strong at bug‑finding.
- Widespread perception that older frontier models (e.g., Opus 4.6, o4‑mini) have degraded over time; proposed mechanisms include quantization, reduced reasoning budgets, KV‑cache compression, and MoE expert reduction.
- Skeptics compare “nerfing” claims to audiophile placebo; others cite performance trackers suggesting systematic pre‑release drift.
Usage patterns and broader implications
- Several participants describe Fable as better at persistent, goal‑directed work but also more “agentic,” sometimes over‑autonomously editing code.
- Some prefer chatting with models in a human‑like style, claiming better results and reduced cognitive mode‑switching; others worry this blurs human–machine boundaries and risks “AI psychosis”.
- Key concern about Mythos: enabling non‑experts to find and weaponize zero‑days, though defenders can also use LLMs to discover and patch vulnerabilities.