Small models also found the vulnerabilities that Mythos found
Methodology and Comparability
- Many argue the AISLE test isn’t comparable to Mythos: small models were given the exact vulnerable function plus contextual hints, not a whole unknown codebase.
- Critics say that’s like being handed the right room and told “there might be something here,” whereas Mythos had to search an entire “continent” of code.
- Others counter that Mythos itself used per-file agents with a harness, not “entire codebase in one prompt,” so isolating files or functions is conceptually similar.
Harness / System vs Model
- One camp claims “the moat is the system”: the real value is in the pipeline that:
  - Breaks code into files/functions.
  - Classifies behavior (arithmetic, memory, etc.).
  - Asks targeted vulnerability questions.
  - Verifies via tools like ASan and reachability/taint analysis.
- They argue small, cheap models can handle this when orchestrated well, especially for “shallow” bugs.
- Others reply that deeper, cross-file or temporal bugs need large context windows, better attention, and stronger reasoning; harnesses can’t fully substitute model capability.
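The split/classify/ask/verify pipeline described above can be sketched in a few dozen lines. This is a hypothetical toy harness, not AISLE’s actual system: the regex-based splitter, the category patterns, and `ask_model` are all illustrative stand-ins (a real harness would use a proper parser and call an LLM).

```python
# Hypothetical sketch of a "moat is the system" harness: split code into
# functions, classify each one's behavior, then ask targeted questions.
# All names and patterns here are illustrative, not AISLE's implementation.
import re
from dataclasses import dataclass

@dataclass
class Finding:
    function: str
    category: str
    question: str
    answer: str

# Crude behavior classifier: route each function to targeted questions.
CATEGORY_PATTERNS = {
    "memory":     r"\b(malloc|free|memcpy|strcpy|buf)\b",
    "arithmetic": r"\b(size_t|len|count)\b.*[*+-]",
}

QUESTIONS = {
    "memory":     "Can any input make this write past an allocated buffer?",
    "arithmetic": "Can any input overflow an integer that reaches an allocation size?",
}

def split_into_functions(source: str) -> dict[str, str]:
    """Very naive C-style function splitter; a real harness would use a parser."""
    funcs = {}
    for m in re.finditer(r"(\w+)\s*\([^)]*\)\s*\{.*?\}", source, re.S):
        funcs[m.group(1)] = m.group(0)
    return funcs

def classify(body: str) -> list[str]:
    return [cat for cat, pat in CATEGORY_PATTERNS.items() if re.search(pat, body)]

def ask_model(question: str, code: str) -> str:
    # Placeholder for a call to a small, cheap LLM; here we just echo.
    return f"[model answer to: {question}]"

def audit(source: str) -> list[Finding]:
    findings = []
    for name, body in split_into_functions(source).items():
        for cat in classify(body):
            q = QUESTIONS[cat]
            findings.append(Finding(name, cat, q, ask_model(q, body)))
    return findings

code = """
void copy_name(char *buf, char *src) { strcpy(buf, src); }
"""
for f in audit(code):
    print(f.function, f.category)  # → copy_name memory
```

The point of the sketch is that the model only ever sees one function plus one narrow question, which is exactly why critics say the comparison to whole-codebase search is unfair, and why proponents say orchestration is where the value lives.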
False Positives and Validation
- Many see AISLE’s evaluation as incomplete: they mostly tested known positives and a trivial false-positive snippet rather than whole-codebase scans.
- A key concern: the small models still flagged the FreeBSD bug after it had been patched, suggesting a very high false-positive risk.
- Commenters emphasize that any realistic system must measure both recall and precision (e.g., F-score) and use verifiable oracles (crash, exploit success) to filter hallucinated bugs.
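The recall/precision point reduces to standard scoring of a scanner’s report against ground truth. A minimal sketch, using hypothetical file:line bug locations as the unit of comparison:

```python
# Score a scanner's reported bug locations against a known ground-truth set.
# The data below is made up purely for illustration.
def precision_recall_f1(reported: set[str], actual: set[str]) -> tuple[float, float, float]:
    tp = len(reported & actual)  # true positives: reported bugs that are real
    precision = tp / len(reported) if reported else 0.0   # how much of the report is real
    recall = tp / len(actual) if actual else 0.0          # how many real bugs were found
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# A scanner that flags 4 locations, of which 2 are real, and misses 2 real bugs:
reported = {"foo.c:12", "bar.c:88", "baz.c:3", "qux.c:41"}
actual = {"foo.c:12", "bar.c:88", "lib.c:7", "net.c:55"}
p, r, f = precision_recall_f1(reported, actual)
print(p, r, f)  # → 0.5 0.5 0.5
```

Testing only known positives measures recall alone; the precision term is exactly what a verifiable oracle (crash reproduction, exploit success) is meant to pin down before hallucinated bugs flood the report.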
Mythos Capabilities and Hype
- The original Mythos claims focus on autonomous exploit development, not just bug finding; its exploit success rate is said to be orders of magnitude higher than that of prior models.
- Some see “too dangerous to release” and the $20k figure as calculated marketing and capacity management rather than purely safety-driven.
- Others respond that even at that cost, Mythos-level auditing is cheaper than human experts and represents a genuine shift.
Security and Industry Implications
- Consensus that AI-assisted bug finding lowers the cost of “nation-state-level” attention, threatening current “security by obscurity.”
- Debate over whether attackers are actually bottlenecked by tooling or by human response times.
- Widespread expectation that both attackers and defenders will industrialize these techniques; impact on traditional security vendors and code-scanning tools is seen as significant but still unclear.