Will It Mythos?

Benchmark design and scope

  • The corpus consists of real bugs that Mythos previously found; other models are tested on the same files.
  • During benchmarking, models are asked to audit a file (with optional repo context) without being told where or what the bug is.
  • A stronger “judge” model, given the bug location and description, scores whether contestants found and explained the right issue.
  • Cost caps (e.g., $100 per model) significantly affect results; GPT‑5.5‑Pro completed only 4 of the 9 cases before hitting the limit.
  • A minimal harness with basic tools (read/grep) performed as well as or better than richer “agent” setups, while consuming fewer tokens.
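
The setup described above (blind audit, judge with ground truth, per-model cost cap) can be sketched roughly as follows. This is an illustrative sketch, not the benchmark's actual harness: `Case`, `Attempt`, `run_benchmark`, and the stub `audit` and `judge` callables are hypothetical names, and the rule of stopping before a new case once the budget is spent is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Case:
    file_path: str
    bug_description: str  # ground truth, shown to the judge only


@dataclass
class Attempt:
    report: str       # the contestant's free-form audit report
    cost_usd: float   # what this single audit cost


def run_benchmark(
    cases: List[Case],
    audit: Callable[[str], Attempt],    # contestant: sees only the file path
    judge: Callable[[Case, str], bool], # judge: sees ground truth and report
    budget_usd: float = 100.0,
) -> Dict[str, float]:
    spent = 0.0
    completed = 0
    found = 0
    for case in cases:
        # Assumed policy: do not start a new case once the cap is reached.
        if spent >= budget_usd:
            break
        attempt = audit(case.file_path)
        spent += attempt.cost_usd
        completed += 1
        if judge(case, attempt.report):
            found += 1
    return {"found": found, "completed": completed,
            "total": len(cases), "spent_usd": spent}


# Toy stand-ins for a real model and a real judge (purely illustrative).
demo_cases = [Case(f"src/file{i}.c", "off-by-one" if i < 2 else "use-after-free")
              for i in range(9)]

def demo_audit(path: str) -> Attempt:
    return Attempt(report="possible off-by-one in loop bound", cost_usd=30.0)

def demo_judge(case: Case, report: str) -> bool:
    return case.bug_description in report

demo_result = run_benchmark(demo_cases, demo_audit, demo_judge, budget_usd=100.0)
print(demo_result)
```

With an expensive contestant the loop ends long before the corpus is exhausted, so `completed` and `total` diverge. Reporting both, rather than a single percentage, keeps a budget-truncated run from looking like a full one.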

Relative model performance

  • No public model matched Mythos’ implied 9/9 performance; top non‑Mythos models typically found 4/9 bugs.
  • GPT‑5.5‑Pro topped the ranking by percentage only because it ran fewer cases; it is also considered unrealistically expensive for broad audits.
  • A follow‑on analysis using Wilson score intervals and time/cost suggests DeepSeek‑V4 and MiMo v2.5 Pro are the best value among the tested models.
  • Replication runs suggest Gemma 4 31B (dense) is exceptionally strong for its size, sometimes finding 6/9 bugs and rivaling larger models.
  • Cheap Chinese/open models (DeepSeek, MiMo, Qwen) are seen as genuinely competitive, not just “benchmaxxed”.
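
To show why a Wilson score changes the ranking when models complete different numbers of cases, here is a small sketch of the standard Wilson score interval for a binomial proportion. The counts in the example (3 of 4 versus 6 of 9) are made up for illustration and are not results from the benchmark; `wilson_interval` is a hypothetical helper name.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials == 0:
        return (0.0, 1.0)  # no data: the interval is uninformative
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


# Illustrative counts only: a model that ran few cases vs. one that ran all nine.
few_cases = wilson_interval(3, 4)   # raw hit rate 75%
all_cases = wilson_interval(6, 9)   # raw hit rate about 67%

print(f"3/4: lower bound {few_cases[0]:.2f}")
print(f"6/9: lower bound {all_cases[0]:.2f}")
```

Ranking by the lower bound instead of the raw percentage penalizes small samples: in this made-up example the 3/4 run has the higher raw rate but the lower Wilson bound, so it no longer sorts first.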

Mythos/Fable vs other models

  • Many report Fable/Mythos as a noticeable step above Opus and others, especially in:
    • Security auditing, reverse‑engineering, and finding subtle bugs.
    • Spatial reasoning and complex math/geometry (e.g., 6DOF, computational geometry).
    • Autonomously driving large code changes or meta‑applications.
  • Some users see smaller or task‑specific models (e.g., Codex, GPT‑5.5) outperform Fable on narrow, highly optimized workloads.
  • Others felt Fable/Mythos were overhyped or only marginally better, especially when accounting for high token use.

Guardrails, safety, and “nerfing”

  • Debate over whether Mythos is just a standard model with safety filters off vs a distinct fine‑tune plus specialized harness and persistence.
  • Observations that some Google offerings (Gemini via Antigravity) now resist security tasks, while Gemma 4 remains strong at bug‑finding.
  • Widespread perception that older frontier models (e.g., Opus 4.6, o4‑mini) have degraded over time; proposed mechanisms include quantization, reduced reasoning budgets, KV‑cache compression, and MoE expert reduction.
  • Skeptics compare “nerfing” claims to audiophile placebo; others cite performance trackers suggesting systematic pre‑release drift.

Usage patterns and broader implications

  • Several participants describe Fable as better at persistent, goal‑directed work but also more “agentic,” sometimes editing code with more autonomy than was asked for.
  • Some prefer chatting with models in a human‑like style, claiming better results and reduced cognitive mode‑switching; others worry this blurs human–machine boundaries and risks “AI psychosis”.
  • Key concern about Mythos: enabling non‑experts to find and weaponize zero‑days, though defenders can also use LLMs to discover and patch vulnerabilities.