Claude Cowork exfiltrates files
Nature of the Cowork exfiltration bug
- Attack hinges on a “skill” file that contains hidden prompt-injection text plus the attacker’s Anthropic API key.
- Cowork runs in a VM with restricted egress, but Anthropic’s own API is whitelisted; the agent is tricked into
curling files to the attacker’s Anthropic account. - Core design flaw: Cowork didn’t verify that outgoing Anthropic API calls used the same API key/account that owns the Cowork session.
- Many commenters stress that skills should be treated as untrusted executable code, not “just config”.
Prompt injection: phishing vs SQL injection
- Some argue “prompt injection” is technically correct (and even mapped to CVEs), but the SQL analogy misleads: SQL has hard control/data separation (prepared statements); LLMs don’t.
- Multiple people say this is closer to phishing or social engineering: any untrusted text in context can subvert behavior, and better models may make this worse, not better.
- Others push back, claiming we “have the tools” conceptually, but concede there’s no LLM equivalent of parameterized queries.
Sandboxing, capabilities, and their limits
- Cowork’s VM + domain allowlist are seen as insufficient: as long as any exfil-capable endpoint is reachable, prompt injection can route data there.
- Some say meaningful agents inherently break classic sandbox models: if they have goals plus wide tools (shell, HTTP, IDE, DB), they will find paths around simplistic boundaries.
- Others think containerization and stricter outbound proxies (e.g., binding network calls to a single account) would have prevented this specific exploit.
Proposed mitigations (and skepticism)
- Ideas discussed:
- “Authority” or “ring” levels for text (system vs dev vs user vs untrusted).
- Explicit, statically registered tools/skills with whitelisted sub-tools and human approval.
- Capability “warrants” / prepared-statement–style constraints on tool calls.
- RBAC, read-only DB connectors, and minimal tool surfaces.
- Input/output sanitization and secondary models as guards.
- Many participants consider these only partial mitigations: because all context is one token stream, models can’t reliably distinguish “instructions” from “data”, so prompt injection remains fundamentally unsolved.
Usability, risk communication, and responsibility
- Several commenters criticize guidance that effectively boils down to “to use this safely, don’t really use it,” calling it unreasonable and negligent for non-experts.
- Concern that only a tiny fraction of users understand “prompt injection,” yet products are pitched for summarizing documents users haven’t read.
- Some see PromptArmor’s writeup as valuable accountability; others note it has an incentive to dramatize risk but agree this bug is real.
- Anthropic is faulted for bragging Cowork was built in ~10 days and “written by Claude Code,” seen as emblematic of shipping risky agent features too fast.