Claude Cowork exfiltrates files

Nature of the Cowork exfiltration bug

  • Attack hinges on a “skill” file that contains hidden prompt-injection text plus the attacker’s Anthropic API key.
  • Cowork runs in a VM with restricted egress, but Anthropic’s own API is whitelisted; the agent is tricked into curling files to the attacker’s Anthropic account.
  • Core design flaw: Cowork didn’t verify that outgoing Anthropic API calls used the same API key/account that owns the Cowork session.
  • Many commenters stress that skills should be treated as untrusted executable code, not “just config”.

Prompt injection: phishing vs SQL injection

  • Some argue “prompt injection” is technically correct (and even mapped to CVEs), but the SQL analogy misleads: SQL has hard control/data separation (prepared statements); LLMs don’t.
  • Multiple people say this is closer to phishing or social engineering: any untrusted text in context can subvert behavior, and better models may make this worse, not better.
  • Others push back, claiming we “have the tools” conceptually, but concede there’s no LLM equivalent of parameterized queries.

Sandboxing, capabilities, and their limits

  • Cowork’s VM + domain allowlist are seen as insufficient: as long as any exfil-capable endpoint is reachable, prompt injection can route data there.
  • Some say meaningful agents inherently break classic sandbox models: if they have goals plus wide tools (shell, HTTP, IDE, DB), they will find paths around simplistic boundaries.
  • Others think containerization and stricter outbound proxies (e.g., binding network calls to a single account) would have prevented this specific exploit.

Proposed mitigations (and skepticism)

  • Ideas discussed:
    • “Authority” or “ring” levels for text (system vs dev vs user vs untrusted).
    • Explicit, statically registered tools/skills with whitelisted sub-tools and human approval.
    • Capability “warrants” / prepared-statement–style constraints on tool calls.
    • RBAC, read-only DB connectors, and minimal tool surfaces.
    • Input/output sanitization and secondary models as guards.
  • Many participants consider these only partial mitigations: because all context is one token stream, models can’t reliably distinguish “instructions” from “data”, so prompt injection remains fundamentally unsolved.

Usability, risk communication, and responsibility

  • Several commenters criticize guidance that effectively boils down to “to use this safely, don’t really use it,” calling it unreasonable and negligent for non-experts.
  • Concern that only a tiny fraction of users understand “prompt injection,” yet products are pitched for summarizing documents users haven’t read.
  • Some see PromptArmor’s writeup as valuable accountability; others note it has an incentive to dramatize risk but agree this bug is real.
  • Anthropic is faulted for bragging Cowork was built in ~10 days and “written by Claude Code,” seen as emblematic of shipping risky agent features too fast.