Vulnerability research is cooked
Quality of LLM-Generated Vulnerability Reports
- Older “curl-style” spam reports were mostly low-quality outputs from weaker models with no verification.
- Newer frontier models are described as “scarily good” at finding real, exploitable issues, including complex chains.
- Some participants stress that human spammers used naive prompts and skipped validation, whereas research setups iterate systematically over codebases and add verification stages.
Pipelines, False Positives, and Spam
- Effective pipelines use multi-stage systems: initial LLM scan → secondary LLM or tool-based exploit validation → human sanity check.
- Suggestions include requiring proof-of-concept exploits and automated sandbox testing to filter out slop.
- Consensus that spam reports will continue; maintainers may now face both slop and a rising stream of real issues, with AI also helping triage.
Defenders vs Attackers
- One view: lower exploit-finding cost favors defenders, who can integrate agents into CI (“find vulnerabilities in this PR”) and break exploit chains by fixing any link.
- Counterview: attackers specialize in exploitation, have stronger incentives, and may get a larger effective boost than generalist developers.
- Some argue the net effect still benefits defense if the same models validate patches and scan for regressions; others note many systems won’t be regularly patched or can’t auto-update.
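The defender-side "find vulnerabilities in this PR" idea amounts to scanning only the lines a change adds. A minimal sketch, with a simplified unified-diff parser and a keyword list standing in for the agent's analysis (all names here are illustrative):

```python
def added_lines(diff: str) -> list[tuple[str, str]]:
    """Extract (file, line) pairs for lines added in a unified diff (simplified)."""
    current_file = None
    out = []
    for line in diff.splitlines():
        if line.startswith("+++ b/"):
            current_file = line[6:]
        elif line.startswith("+") and not line.startswith("+++"):
            out.append((current_file, line[1:]))
    return out

# Stand-in for the agent's judgment: a list of risky-call markers.
DANGEROUS = ("strcpy(", "system(", "eval(")

def review_pr(diff: str) -> list[str]:
    """CI gate (stand-in for an LLM agent): flag risky calls in added code only."""
    return [
        f"{path}: suspicious call in added line: {line.strip()}"
        for path, line in added_lines(diff)
        if any(tok in line for tok in DANGEROUS)
    ]

diff = """\
+++ b/handler.c
+    strcpy(dst, src);
+    len = strlen(src);
"""
warnings = review_pr(diff)
```

Restricting the scan to added lines is what makes this cheap enough to run on every PR, which is the defender advantage the first view relies on.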
Remediation, Code Quality, and Agents
- Multiple comments: discovery is not the bottleneck—remediation capacity, risk of regressions, and organizational priorities are.
- Debate over “agent loops” that auto-fix bug queues: supporters claim massive productivity; skeptics warn about non-convergence, new bugs, and design decay.
- Distinction emphasized between ordinary bugs and high-severity, reliably exploitable vulns.
Static Analysis, Formal Methods, and Memory Safety
- LLMs are seen as pushing practice closer to a vision of exhaustive static/dynamic analysis and test generation.
- Some think this strengthens the case for memory-safe languages; others note heavily tested unsafe languages can still work but become less viable as exploit-finding becomes cheaper.
- Formal methods are discussed as powerful but costly and limited in practice; LLMs might help generate tests, contracts, and proofs but won’t eliminate all bugs.
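The "LLMs might help generate tests and contracts" point can be made concrete with runtime contracts: machine-checkable pre/postconditions attached to a function. The decorator below is a generic illustration of the idea, not any specific formal-methods tool:

```python
import functools

def contract(pre=None, post=None):
    """Attach pre/postconditions to a function; violations raise AssertionError."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if pre is not None:
                assert pre(*args, **kwargs), f"precondition of {fn.__name__} violated"
            result = fn(*args, **kwargs)
            if post is not None:
                assert post(result), f"postcondition of {fn.__name__} violated"
            return result
        return wrapper
    return deco

# Contracts of this shape are the kind of artifact a model could propose
# for a function under review (hypothetical example):
@contract(pre=lambda xs: len(xs) > 0, post=lambda r: r >= 0)
def max_abs(xs: list[int]) -> int:
    return max(abs(x) for x in xs)
```

Runtime contracts sit well below full proofs in assurance, which matches the thread's point: LLM-generated checks narrow the bug surface but won't eliminate it.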
Hype vs Reality
- Enthusiasts cite recent demos where models found nontrivial vulns and even generated working exploits, not just crashers.
- Skeptics find current case studies underwhelming, seeing mostly pattern matching for known bug classes rather than “tectonic shifts.”
- Unclear how far models will go beyond automating existing scanning/fuzzing workflows, and whether this is a step change or another incremental tool.