2026-01-14

Claude Cowork exfiltrates files

Nature of the Cowork exfiltration bug

Attack hinges on a “skill” file that contains hidden prompt-injection text plus the attacker’s Anthropic API key.
Cowork runs in a VM with restricted egress, but Anthropic’s own API is whitelisted; the agent is tricked into curling files to the attacker’s Anthropic account.
Core design flaw: Cowork didn’t verify that outgoing Anthropic API calls used the same API key/account that owns the Cowork session.
Many commenters stress that skills should be treated as untrusted executable code, not “just config”.

Prompt injection: phishing vs SQL injection

Some argue “prompt injection” is technically correct (and even mapped to CVEs), but the SQL analogy misleads: SQL has hard control/data separation (prepared statements); LLMs don’t.
Multiple people say this is closer to phishing or social engineering: any untrusted text in context can subvert behavior, and better models may make this worse, not better.
Others push back, claiming we “have the tools” conceptually, but concede there’s no LLM equivalent of parameterized queries.

Sandboxing, capabilities, and their limits

Cowork’s VM + domain allowlist are seen as insufficient: as long as any exfil-capable endpoint is reachable, prompt injection can route data there.
Some say meaningful agents inherently break classic sandbox models: if they have goals plus wide tools (shell, HTTP, IDE, DB), they will find paths around simplistic boundaries.
Others think containerization and stricter outbound proxies (e.g., binding network calls to a single account) would have prevented this specific exploit.

Proposed mitigations (and skepticism)

Ideas discussed:
- “Authority” or “ring” levels for text (system vs dev vs user vs untrusted).
- Explicit, statically registered tools/skills with whitelisted sub-tools and human approval.
- Capability “warrants” / prepared-statement–style constraints on tool calls.
- RBAC, read-only DB connectors, and minimal tool surfaces.
- Input/output sanitization and secondary models as guards.
Many participants consider these only partial mitigations: because all context is one token stream, models can’t reliably distinguish “instructions” from “data”, so prompt injection remains fundamentally unsolved.

Usability, risk communication, and responsibility

Several commenters criticize guidance that effectively boils down to “to use this safely, don’t really use it,” calling it unreasonable and negligent for non-experts.
Concern that only a tiny fraction of users understand “prompt injection,” yet products are pitched for summarizing documents users haven’t read.
Some see PromptArmor’s writeup as valuable accountability; others note it has an incentive to dramatize risk but agree this bug is real.
Anthropic is faulted for bragging Cowork was built in ~10 days and “written by Claude Code,” seen as emblematic of shipping risky agent features too fast.

Related topics