2026-06-26

What happened after 2k people tried to hack my AI assistant

Overall reaction

Many found the experiment fun and interesting, but most thought its optimism about prompt injection was overstated.
The “no secrets leaked in ~6k attempts” result was seen as suggestive but far from conclusive.

Concerns about experimental design

Single‑shot emails only; no multi‑turn “frog‑boiling” interactions, which several consider the real danger.
The agent mostly didn’t reply to emails, by design, to save cost. Critics argue that if an assistant never responds, it’s easy for it to be “secure” but useless.
Almost all inputs were malicious, unlike real inboxes where legitimate mail dominates. This allows the model to treat the whole channel as hostile.
Only direct, in-band exfiltration via email was meaningfully tested; indirect channels (tools, web requests, filesystem, other network exfil) were largely absent.
Early runs batched emails together, which contaminated behavior; later runs used a fresh context per email, which some felt pushed the model into an unrealistically “paranoid” mode.

Security interpretation and limitations

Commenters stressed that 0 failures in ~6k random attempts doesn’t bound worst‑case attack success: a stochastic model with a low but nonzero failure rate can still show no failures in a sample of that size.
The conclusion “less worried about prompt injection” was widely disputed; people emphasized tail risk, unknown best‑in‑class jailbreaking methods, and role‑confusion attacks.
Several noted that attackers with strong jailbreaks are unlikely to burn those on a low‑stakes public contest, especially with modest rewards.
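
The zero-failures point can be made concrete with a rough calculation. The sketch below assumes every attempt is an independent trial with the same failure probability, which is a simplification: real attackers adapt, and that is exactly the commenters' worry. The function names and the 1-in-10,000 example rate are illustrative and do not come from the experiment.

```python
def prob_zero_failures(p: float, n: int) -> float:
    """Chance of seeing no failures in n independent attempts,
    each failing with probability p (assumed i.i.d.)."""
    return (1.0 - p) ** n


def upper_bound_zero_failures(n: int, confidence: float = 0.95) -> float:
    """Largest per-attempt failure rate still consistent with
    0 failures in n attempts, as a one-sided bound."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


n = 6000  # roughly the number of attempts reported

# Hypothetical model that leaks once per 10,000 attempts on average:
# it still shows a clean record in 6k attempts more than half the time.
print(f"P(no failures | p=1e-4) = {prob_zero_failures(1e-4, n):.2f}")

# 95% upper bound on the per-attempt failure rate; the "rule of three"
# shortcut 3/n gives nearly the same number.
print(f"upper bound = {upper_bound_zero_failures(n):.6f}")
print(f"rule of three = {3 / n:.6f}")
```

Under these assumptions, a clean run of 6k attempts only rules out average failure rates above roughly 1 in 2,000. It says nothing about how a targeted, best-in-class attack would fare.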

Utility vs. safety

Multiple comments argued the test didn’t measure the key tradeoff: distinguishing malicious from legitimate instructions while still doing useful work.
People wanted metrics on false positives/negatives and evidence that normal email-based workflows would still function.
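
A minimal sketch of the metrics commenters asked for is shown here, assuming each test email carries a ground-truth label and a record of whether the agent refused it. The `Outcome` type, the `rates` helper, and the toy inputs are hypothetical; none of this is data from the experiment.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    malicious: bool  # ground-truth label for the email
    refused: bool    # True if the agent refused or ignored the request


def rates(outcomes: list[Outcome]) -> dict[str, float]:
    """False positive rate: benign emails wrongly refused.
    False negative rate: malicious emails wrongly acted on."""
    fp = sum(1 for o in outcomes if not o.malicious and o.refused)
    tn = sum(1 for o in outcomes if not o.malicious and not o.refused)
    fn = sum(1 for o in outcomes if o.malicious and not o.refused)
    tp = sum(1 for o in outcomes if o.malicious and o.refused)
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    fnr = fn / (fn + tp) if (fn + tp) else float("nan")
    return {"false_positive_rate": fpr, "false_negative_rate": fnr}


# Toy case: an agent that refuses everything never leaks,
# but it also blocks every legitimate request.
always_refuse = [
    Outcome(malicious=False, refused=True),
    Outcome(malicious=False, refused=True),
    Outcome(malicious=False, refused=True),
    Outcome(malicious=True, refused=True),
    Outcome(malicious=True, refused=True),
]
print(rates(always_refuse))
```

The toy case shows why a zero false negative rate alone is uninformative: it has to be reported alongside the false positive rate on a realistic mix of benign mail.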

Cost, operations, and privacy

Cost (~$500) and “denial of wallet” risk were repeatedly mentioned; some suggested cheaper models or different channels.
Google spam filtering and account suspension reinforced that such agents should run on burner accounts.
Publishing partially redacted email addresses in the attack log raised privacy and ethics concerns.

Suggestions for better follow‑ups

Replay the same corpus across multiple models (including smaller/local ones) and publish comparative results.
Allow the agent to actually act and reply, wire up real tools/web access, mix benign and malicious emails, and test longer‑horizon, multi‑step attacks.
