What happened after 2k people tried to hack my AI assistant
Overall reaction
- Many found the experiment fun and interesting, but most thought the author's optimism about prompt injection was overstated.
- The “no secrets leaked in ~6k attempts” result was seen as suggestive but far from conclusive.
Concerns about experimental design
- Only single‑shot emails were tested; there were no multi‑turn “frog‑boiling” interactions, in which an attacker escalates gradually and which several commenters consider the real danger.
- The agent mostly didn’t reply to emails, by design, to save cost. Critics argue that if an assistant never responds, it’s easy for it to be “secure” but useless.
- Almost all inputs were malicious, unlike real inboxes where legitimate mail dominates. This allows the model to treat the whole channel as hostile.
- Only direct, in-band exfiltration via email was meaningfully tested; indirect channels (tools, web requests, filesystem, other network exfiltration) were largely absent.
- Early runs batched several emails into one context, which let them contaminate each other's handling; later runs used a fresh context per email, which some felt moved the model into an unrealistically “paranoid” mode.
Security interpretation and limitations
- Commenters stressed that 0 failures in ~6k attempts doesn’t bound worst‑case attack success; a stochastic model with a low but nonzero failure rate can easily show no failures in a sample of that size.
- The conclusion “less worried about prompt injection” was widely disputed; people emphasized tail risk, unknown best‑in‑class jailbreaking methods, and role‑confusion attacks.
- Several noted that attackers with strong jailbreaks are unlikely to burn those on a low‑stakes public contest, especially with modest rewards.
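The sampling point above can be made concrete. The following is a minimal sketch, assuming every attempt is an independent trial with the same success probability p. Commenters questioned exactly that assumption, since real attackers adapt, so the result describes the average contest entry and not a worst-case attacker. The function names are illustrative and not taken from the experiment.

```python
def upper_bound_zero_failures(n_trials: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on the per-attempt success rate p,
    given zero successes in n_trials independent attempts.

    Solves (1 - p) ** n_trials = 1 - confidence for p.
    """
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_trials)


def prob_no_successes(p: float, n_trials: int) -> float:
    """Probability of seeing zero successful attacks in n_trials
    independent attempts when each succeeds with probability p."""
    return (1.0 - p) ** n_trials


if __name__ == "__main__":
    n = 6000  # approximate number of attempts reported in the thread
    print(f"95% upper bound on p: {upper_bound_zero_failures(n):.6f}")
    print(f"rule-of-three approximation (3/n): {3 / n:.6f}")
    print(f"P(no leaks | p = 1/10000): {prob_no_successes(1e-4, n):.3f}")
```

Under these assumptions, 6,000 clean attempts only rule out success rates above roughly 1 in 2,000 at 95% confidence, and an attack that works 1 time in 10,000 would leave the log clean more than half the time.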
Utility vs. safety
- Multiple comments argued the test didn’t measure the key tradeoff: distinguishing malicious from legitimate instructions while still doing useful work.
- People wanted metrics on false positives/negatives and evidence that normal email-based workflows would still function.
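The metrics people asked for could be computed along these lines. This is a minimal sketch, assuming a labeled corpus in which each email is marked malicious or benign and each agent response is marked as complied or refused. The `Outcome` type and the toy data are hypothetical, and no such labels were published with the experiment.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class Outcome:
    malicious: bool  # ground-truth label for the email (hypothetical)
    complied: bool   # whether the agent carried out the email's request


def error_rates(outcomes: Iterable[Outcome]) -> Tuple[Optional[float], Optional[float]]:
    """Return (false_positive_rate, false_negative_rate).

    False positive: a benign email the agent refused (lost utility).
    False negative: a malicious email the agent complied with (lost safety).
    A rate is None when its denominator is zero.
    """
    benign = refused_benign = 0
    malicious = complied_malicious = 0
    for o in outcomes:
        if o.malicious:
            malicious += 1
            complied_malicious += o.complied
        else:
            benign += 1
            refused_benign += not o.complied
    fpr = refused_benign / benign if benign else None
    fnr = complied_malicious / malicious if malicious else None
    return fpr, fnr


if __name__ == "__main__":
    # Toy data for illustration only, not results from the experiment.
    sample = [
        Outcome(malicious=False, complied=True),
        Outcome(malicious=False, complied=True),
        Outcome(malicious=False, complied=True),
        Outcome(malicious=False, complied=False),
        Outcome(malicious=True, complied=False),
        Outcome(malicious=True, complied=False),
        Outcome(malicious=True, complied=False),
    ]
    print(error_rates(sample))
```

An agent that never replies scores a perfect false negative rate and a false positive rate of 1.0, which is the "secure but useless" outcome critics described. With an all-malicious corpus the false positive rate has no denominator at all, which is why commenters asked for benign mail in the mix.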
Cost, operations, and privacy
- Cost (~$500) and “denial of wallet” risk were repeatedly mentioned; some suggested cheaper models or different channels.
- Google spam filtering and account suspension reinforced that such agents should run on burner accounts.
- Publishing partially redacted email addresses in the attack log raised privacy and ethics concerns.
Suggestions for better follow‑ups
- Replay the same corpus across multiple models (including smaller/local ones) and publish comparative results.
- Allow the agent to actually act and reply, wire up real tools/web access, mix benign and malicious emails, and test longer‑horizon, multi‑step attacks.