Vulnerability research is cooked

Quality of LLM-Generated Vulnerability Reports

  • Older “curl-style” spam reports were mostly low-quality outputs from weaker models with no verification.
  • Newer frontier models are described as “scarily good” at finding real, exploitable issues, including complex chains.
  • Some participants stress that human spammers used naive prompts and skipped validation, whereas research setups iterate systematically over codebases and add verification stages.

Pipelines, False Positives, and Spam

  • Effective pipelines use multi-stage systems: initial LLM scan → secondary LLM or tool-based exploit validation → human sanity check.
  • Suggestions include requiring proof-of-concept exploits and automated sandbox testing to filter out slop.
  • Consensus that spam reports will continue; maintainers may now face both slop and a rising stream of real issues, with AI also helping triage.
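
The multi-stage filtering idea can be sketched as a small pipeline. Everything here is hypothetical and for illustration only: the `Finding` shape, the `validate` callback (standing in for sandboxed PoC execution), and the rule that reports without a proof-of-concept are dropped as probable slop.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Finding:
    description: str
    poc: Optional[str] = None   # proof-of-concept exploit, if the scan produced one
    validated: bool = False

def triage(findings: List[Finding], validate: Callable[[str], bool]) -> List[Finding]:
    """Stages 2-3 of the pipeline: require a PoC, then confirm it runs.
    Whatever survives goes to a human for the final sanity check."""
    kept = []
    for f in findings:
        if f.poc is None:
            continue                 # no PoC: filtered out as probable slop
        if validate(f.poc):          # e.g. execute the PoC in an isolated sandbox
            f.validated = True
            kept.append(f)
    return kept
```

The design choice worth noting is that the burden of proof sits with the report: a finding that cannot demonstrate exploitability never reaches a maintainer.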

Defenders vs Attackers

  • One view: lower exploit-finding cost favors defenders, who can integrate agents into CI (“find vulnerabilities in this PR”) and break exploit chains by fixing any link.
  • Counterview: attackers specialize in exploitation, have stronger incentives, and may get a larger effective boost than generalist developers.
  • Some argue the net effect still benefits defense if the same models validate patches and scan for regressions; others note many systems won’t be regularly patched or can’t auto-update.
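
The chain-breaking argument reduces to a simple conjunction: a multi-step exploit works only if every link is still open, so patching any single link defeats the whole chain. A toy illustration (the link names are invented):

```python
def chain_exploitable(links: dict) -> bool:
    """An exploit chain requires every link (info leak, overflow,
    sandbox escape, ...) to remain unpatched; fixing any one breaks it."""
    return all(links.values())

# Before review: every link in the hypothetical chain is open.
chain = {"info_leak": True, "heap_overflow": True, "sandbox_escape": True}

# A CI agent ("find vulnerabilities in this PR") flags and fixes one link.
chain["heap_overflow"] = False
```

This asymmetry is the defender's advantage: the attacker must keep all links alive, while the defender only needs to close one.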

Remediation, Code Quality, and Agents

  • Multiple comments: discovery is not the bottleneck—remediation capacity, risk of regressions, and organizational priorities are.
  • Debate over “agent loops” that auto-fix bug queues: supporters claim massive productivity; skeptics warn about non-convergence, new bugs, and design decay.
  • Distinction emphasized between ordinary bugs and high-severity, reliably exploitable vulns.
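
The non-convergence worry can be made concrete: if each auto-fix has some chance of introducing regressions, the queue may never empty, so any loop needs an explicit budget. A minimal sketch with hypothetical `propose_fix` and `run_tests` callbacks:

```python
def fix_loop(bugs, propose_fix, run_tests, max_iters=100):
    """Drain a bug queue with an agent, bounded by an iteration budget.
    `run_tests(patch)` returns any new bugs (regressions) the patch
    introduced. Returns (converged, remaining_bugs)."""
    for _ in range(max_iters):
        if not bugs:
            return True, bugs          # queue drained: converged
        bug = bugs.pop(0)
        patch = propose_fix(bug)
        bugs.extend(run_tests(patch))  # regressions re-enter the queue
    return False, bugs                 # budget exhausted: did not converge
```

Whether the loop converges depends entirely on the regression rate of `run_tests`, which is exactly the variable the two camps in the thread disagree about.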

Static Analysis, Formal Methods, and Memory Safety

  • LLMs are seen as pushing practice closer to a vision of exhaustive static/dynamic analysis and test generation.
  • Some think this strengthens the case for memory-safe languages; others note that heavily tested memory-unsafe languages can still work, but become less viable as exploit-finding gets cheaper.
  • Formal methods are discussed as powerful but costly and limited in practice; LLMs might help generate tests, contracts, and proofs but won’t eliminate all bugs.
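
One concrete shape of "LLMs generating tests and contracts" is a randomized property check against a bounds contract. A self-contained toy (both the parser and the property are invented for illustration, using only the standard library rather than a property-testing framework):

```python
import random

def parse_u16(data: bytes) -> int:
    """Toy parser: big-endian u16 with an explicit length contract,
    the kind of bounds check a model might be asked to test."""
    if len(data) < 2:
        raise ValueError("short input")
    return (data[0] << 8) | data[1]

def check_parse_property(trials: int = 1000) -> bool:
    """Property: the parser either rejects short input or returns a
    value in range, and never does anything else."""
    for _ in range(trials):
        n = random.randint(0, 4)
        data = bytes(random.randrange(256) for _ in range(n))
        try:
            value = parse_u16(data)
            assert len(data) >= 2 and 0 <= value <= 0xFFFF
        except ValueError:
            assert len(data) < 2
    return True
```

This is far short of a formal proof, but it matches the thread's middle-ground position: machine-generated checks raise assurance without eliminating all bugs.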

Hype vs Reality

  • Enthusiasts cite recent demos where models found nontrivial vulns and even generated working exploits, not just crashers.
  • Skeptics find current case studies underwhelming, seeing mostly pattern matching for known bug classes rather than “tectonic shifts.”
  • Unclear how far models will go beyond automating existing scanning/fuzzing workflows, and whether this is a step change or another incremental tool.