Small models also found the vulnerabilities that Mythos found

Methodology and Comparability

  • Many argue the AISLE test isn’t comparable to the Mythos evaluation: small models were given the exact vulnerable function plus contextual hints, not a whole unknown codebase.
  • Critics say that’s like being handed the right room and told “there might be something here,” whereas Mythos had to search an entire “continent” of code.
  • Others counter that Mythos itself used per-file agents with a harness, not “entire codebase in one prompt,” so isolating files or functions is conceptually similar.

Harness / System vs Model

  • One camp claims “the moat is the system”: the real value is in the pipeline that:
    • Breaks code into files/functions.
    • Classifies behavior (arithmetic, memory, etc.).
    • Asks targeted vulnerability questions.
    • Verifies via tools like ASan and reachability/taint analysis.
  • They argue small, cheap models can handle this when orchestrated well, especially for “shallow” bugs.
  • Others reply that deeper, cross-file or temporal bugs need large context windows, better attention, and stronger reasoning; harnesses can’t fully substitute model capability.
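The four-step pipeline described in the “moat is the system” argument can be sketched in miniature. This is a hypothetical illustration, not AISLE’s actual harness: the function names, the keyword-based classifier, and the question bank are all stand-ins for model calls and real verification tools (ASan, taint analysis).

```python
from dataclasses import dataclass

@dataclass
class Finding:
    function: str
    category: str
    question: str

def classify_behavior(code: str) -> str:
    """Crude stand-in for a model call that labels what the code does."""
    if any(tok in code for tok in ("malloc", "free", "memcpy", "strcpy")):
        return "memory"
    if any(tok in code for tok in ("<<", ">>", "*", "+")):
        return "arithmetic"
    return "other"

def targeted_question(category: str) -> str:
    """Pick a vulnerability question matched to the behavior class."""
    return {
        "memory": "Can any input cause an out-of-bounds access or use-after-free?",
        "arithmetic": "Can any input cause an integer overflow or truncation?",
    }.get(category, "Does this function mishandle any untrusted input?")

def audit(functions: dict[str, str]) -> list[Finding]:
    """Hypothetical harness loop: split -> classify -> ask -> (verify)."""
    findings = []
    for name, code in functions.items():
        cat = classify_behavior(code)
        # In a real system, the question and code go to a small model, and
        # candidate findings are then confirmed with sanitizers and
        # reachability/taint analysis before being reported.
        findings.append(Finding(name, cat, targeted_question(cat)))
    return findings
```

The point of the sketch is that each stage is cheap and narrow, which is exactly why proponents think small models suffice for the shallow-bug portion of the workload.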

False Positives and Validation

  • Many see AISLE’s evaluation as incomplete: it mostly tested known positives plus one trivial false-positive snippet, rather than running whole-codebase scans.
  • A key concern: the small models still flag the FreeBSD bug even after it has been patched, suggesting a very high false-positive risk.
  • Commenters emphasize that any realistic system must measure both recall and precision (e.g., F-score) and use verifiable oracles (crash, exploit success) to filter hallucinated bugs.

Mythos Capabilities and Hype

  • The original Mythos claims focus on autonomous exploit development, not just bug finding; exploit success is said to be orders of magnitude higher than with prior models.
  • Some see “too dangerous to release” and the $20k figure as calculated marketing and capacity management rather than purely safety-driven.
  • Others respond that even at that cost, Mythos-level auditing is cheaper than human experts and represents a genuine shift.

Security and Industry Implications

  • Consensus that AI-assisted bug finding lowers the cost of “nation-state-level” attention, threatening current “security by obscurity.”
  • Debate over whether attackers are actually bottlenecked by tooling or by human response times.
  • Widespread expectation that both attackers and defenders will industrialize these techniques; impact on traditional security vendors and code-scanning tools is seen as significant but still unclear.