Small models also found the vulnerabilities that Mythos found

Methodology and Comparability

  • Many argue the AISLE test isn’t comparable to the Mythos evaluation: small models were given the exact vulnerable function plus contextual hints, not a whole unknown codebase.
  • Critics say that’s like being handed the right room and told “there might be something here,” whereas Mythos had to search an entire “continent” of code.
  • Others counter that Mythos itself used per-file agents with a harness, not “entire codebase in one prompt,” so isolating files or functions is conceptually similar.

Harness / System vs Model

  • One camp claims “the moat is the system”: the real value is in the pipeline that:
    • Breaks code into files/functions.
    • Classifies behavior (arithmetic, memory, etc.).
    • Asks targeted vulnerability questions.
    • Verifies via tools like ASan and reachability/taint analysis.
  • They argue small, cheap models can handle this when orchestrated well, especially for “shallow” bugs.
  • Others reply that deeper, cross-file or temporal bugs need large context windows, better attention, and stronger reasoning; harnesses can’t fully substitute model capability.
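The four-step pipeline described in the “moat is the system” argument can be sketched in miniature. This is a hypothetical illustration, not AISLE’s actual harness: the function names, the keyword-based classifier, and the question bank are all stand-ins for model calls and real verification tools (ASan, taint analysis).

```python
from dataclasses import dataclass

@dataclass
class Finding:
    function: str
    category: str
    question: str

def classify_behavior(code: str) -> str:
    """Crude stand-in for a model call that labels what the code does."""
    if any(tok in code for tok in ("malloc", "free", "memcpy", "strcpy")):
        return "memory"
    if any(tok in code for tok in ("<<", ">>", "*", "+")):
        return "arithmetic"
    return "other"

def targeted_question(category: str) -> str:
    """Pick a vulnerability question matched to the behavior class."""
    return {
        "memory": "Can any input cause an out-of-bounds access or use-after-free?",
        "arithmetic": "Can any input cause an integer overflow or truncation?",
    }.get(category, "Does this function mishandle any untrusted input?")

def audit(functions: dict[str, str]) -> list[Finding]:
    """Hypothetical harness loop: split -> classify -> ask -> (verify)."""
    findings = []
    for name, code in functions.items():
        cat = classify_behavior(code)
        # In a real system, the question and code go to a small model, and
        # candidate findings are then confirmed with sanitizers and
        # reachability/taint analysis before being reported.
        findings.append(Finding(name, cat, targeted_question(cat)))
    return findings
```

The point of the sketch is that each stage is cheap and narrow, which is exactly why proponents think small models suffice for the shallow-bug portion of the workload.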

False Positives and Validation

  • Many see AISLE’s evaluation as incomplete: it mostly tested known positives plus one trivial false-positive snippet, rather than running whole-codebase scans.
  • A key concern: the small models still flag the FreeBSD bug even after it has been patched, suggesting a very high false-positive risk.
  • Commenters emphasize that any realistic system must measure both recall and precision (e.g., F-score) and use verifiable oracles (crash, exploit success) to filter hallucinated bugs.

Mythos Capabilities and Hype

  • The original Mythos claims focus on autonomous exploit development, not just bug finding; exploit success is said to be orders of magnitude higher than with prior models.
  • Some see “too dangerous to release” and the $20k figure as calculated marketing and capacity management rather than purely safety-driven.
  • Others respond that even at that cost, Mythos-level auditing is cheaper than human experts and represents a genuine shift.

Security and Industry Implications

  • Consensus that AI-assisted bug finding lowers the cost of “nation-state-level” attention, threatening current “security by obscurity.”
  • Debate over whether attackers are actually bottlenecked by tooling or by human response times.
  • Widespread expectation that both attackers and defenders will industrialize these techniques; impact on traditional security vendors and code-scanning tools is seen as significant but still unclear.