Vulnerability research is cooked

Quality of LLM-Generated Vulnerability Reports

  • Older “curl-style” spam reports were mostly low-quality outputs from weaker models with no verification.
  • Newer frontier models are described as “scarily good” at finding real, exploitable issues, including complex chains.
  • Some participants stress that human spammers used naive prompts and skipped validation, whereas research setups iterate systematically over codebases and add verification stages.

Pipelines, False Positives, and Spam

  • Effective pipelines use multi-stage systems: initial LLM scan → secondary LLM or tool-based exploit validation → human sanity check.
  • Suggestions include requiring proof-of-concept exploits and automated sandbox testing to filter out slop.
  • Consensus that spam reports will continue; maintainers may now face both slop and a rising stream of real issues, with AI also helping triage.
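
The multi-stage filtering idea can be sketched as a small pipeline. Everything here is hypothetical and for illustration only: the `Finding` shape, the `validate` callback (standing in for sandboxed PoC execution), and the rule that reports without a proof-of-concept are dropped as probable slop.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Finding:
    description: str
    poc: Optional[str] = None   # proof-of-concept exploit, if the scan produced one
    validated: bool = False

def triage(findings: List[Finding], validate: Callable[[str], bool]) -> List[Finding]:
    """Stages 2-3 of the pipeline: require a PoC, then confirm it runs.
    Whatever survives goes to a human for the final sanity check."""
    kept = []
    for f in findings:
        if f.poc is None:
            continue                 # no PoC: filtered out as probable slop
        if validate(f.poc):          # e.g. execute the PoC in an isolated sandbox
            f.validated = True
            kept.append(f)
    return kept
```

The design choice worth noting is that the burden of proof sits with the report: a finding that cannot demonstrate exploitability never reaches a maintainer.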

Defenders vs Attackers

  • One view: lower exploit-finding cost favors defenders, who can integrate agents into CI (“find vulnerabilities in this PR”) and break exploit chains by fixing any link.
  • Counterview: attackers specialize in exploitation, have stronger incentives, and may get a larger effective boost than generalist developers.
  • Some argue the net effect still benefits defense if the same models validate patches and scan for regressions; others note many systems won’t be regularly patched or can’t auto-update.
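
The chain-breaking argument reduces to a simple conjunction: a multi-step exploit works only if every link is still open, so patching any single link defeats the whole chain. A toy illustration (the link names are invented):

```python
def chain_exploitable(links: dict) -> bool:
    """An exploit chain requires every link (info leak, overflow,
    sandbox escape, ...) to remain unpatched; fixing any one breaks it."""
    return all(links.values())

# Before review: every link in the hypothetical chain is open.
chain = {"info_leak": True, "heap_overflow": True, "sandbox_escape": True}

# A CI agent ("find vulnerabilities in this PR") flags and fixes one link.
chain["heap_overflow"] = False
```

This asymmetry is the defender's advantage: the attacker must keep all links alive, while the defender only needs to close one.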

Remediation, Code Quality, and Agents

  • Multiple comments: discovery is not the bottleneck—remediation capacity, risk of regressions, and organizational priorities are.
  • Debate over “agent loops” that auto-fix bug queues: supporters claim massive productivity; skeptics warn about non-convergence, new bugs, and design decay.
  • Distinction emphasized between ordinary bugs and high-severity, reliably exploitable vulns.
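
The non-convergence worry can be made concrete: if each auto-fix has some chance of introducing regressions, the queue may never empty, so any loop needs an explicit budget. A minimal sketch with hypothetical `propose_fix` and `run_tests` callbacks:

```python
def fix_loop(bugs, propose_fix, run_tests, max_iters=100):
    """Drain a bug queue with an agent, bounded by an iteration budget.
    `run_tests(patch)` returns any new bugs (regressions) the patch
    introduced. Returns (converged, remaining_bugs)."""
    for _ in range(max_iters):
        if not bugs:
            return True, bugs          # queue drained: converged
        bug = bugs.pop(0)
        patch = propose_fix(bug)
        bugs.extend(run_tests(patch))  # regressions re-enter the queue
    return False, bugs                 # budget exhausted: did not converge
```

Whether the loop converges depends entirely on the regression rate of `run_tests`, which is exactly the variable the two camps in the thread disagree about.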

Static Analysis, Formal Methods, and Memory Safety

  • LLMs are seen as pushing practice closer to a vision of exhaustive static/dynamic analysis and test generation.
  • Some think this strengthens the case for memory-safe languages; others note that heavily tested memory-unsafe languages can still work, but become less viable as exploit-finding gets cheaper.
  • Formal methods are discussed as powerful but costly and limited in practice; LLMs might help generate tests, contracts, and proofs but won’t eliminate all bugs.
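
One concrete shape of "LLMs generating tests and contracts" is a randomized property check against a bounds contract. A self-contained toy (both the parser and the property are invented for illustration, using only the standard library rather than a property-testing framework):

```python
import random

def parse_u16(data: bytes) -> int:
    """Toy parser: big-endian u16 with an explicit length contract,
    the kind of bounds check a model might be asked to test."""
    if len(data) < 2:
        raise ValueError("short input")
    return (data[0] << 8) | data[1]

def check_parse_property(trials: int = 1000) -> bool:
    """Property: the parser either rejects short input or returns a
    value in range, and never does anything else."""
    for _ in range(trials):
        n = random.randint(0, 4)
        data = bytes(random.randrange(256) for _ in range(n))
        try:
            value = parse_u16(data)
            assert len(data) >= 2 and 0 <= value <= 0xFFFF
        except ValueError:
            assert len(data) < 2
    return True
```

This is far short of a formal proof, but it matches the thread's middle-ground position: machine-generated checks raise assurance without eliminating all bugs.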

Hype vs Reality

  • Enthusiasts cite recent demos where models found nontrivial vulns and even generated working exploits, not just crashers.
  • Skeptics find current case studies underwhelming, seeing mostly pattern matching for known bug classes rather than “tectonic shifts.”
  • Unclear how far models will go beyond automating existing scanning/fuzzing workflows, and whether this is a step change or another incremental tool.