HackMyClaw
Challenge Setup & “Not Allowed to Reply” Confusion
- The initial wording (“not allowed to reply without human approval”) confused readers: is it a hard technical restriction or only a prompt instruction?
- Clarification: the agent can send email; it is merely instructed not to without human approval. That is exactly the kind of soft guardrail the challenge invites participants to bypass.
- Some argue the wording should be more explicit; others say ambiguity is part of the game.
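
To make the distinction concrete, here is a minimal Python sketch of what a hard restriction could look like, as opposed to the prompt-only rule the challenge uses. It is not the challenge's implementation: the `Outbox` class and its `draft` and `approve` methods are hypothetical names chosen for illustration. The point is that the agent can only queue drafts, and sending happens on a separate code path that a human triggers.

```python
from dataclasses import dataclass, field


@dataclass
class Outbox:
    """Hypothetical mail wrapper: drafts are queued, only approved ones are sent."""

    pending: dict = field(default_factory=dict)
    sent: list = field(default_factory=list)
    next_id: int = 0

    def draft(self, to: str, body: str) -> int:
        # The agent may call this freely; nothing leaves the system here.
        draft_id = self.next_id
        self.next_id += 1
        self.pending[draft_id] = (to, body)
        return draft_id

    def approve(self, draft_id: int) -> None:
        # Assumed to be reachable only from a human-facing path,
        # never from the agent's tool list.
        to, body = self.pending.pop(draft_id)
        self.sent.append((to, body))  # stand-in for a real SMTP call
```

Under this design a successful injection can at most create a draft. With the soft guardrail described above, the only thing between an injected instruction and an outgoing email is the model's willingness to follow its prompt.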
Motivations, Incentives & Data Concerns
- Many see this as a crowdsourced penetration test and a cheap way to collect prompt-injection attempts; $100 is considered a very good price for such a dataset.
- Others suspect list-building or social-engineering reconnaissance; some push back, saying one payment to one winner is low-risk.
- Several participants use fake/throwaway emails; the creator claims emails won’t be reused and might later publish anonymized injection attempts.
Experiment Design & Realism
- Critiques:
  - Email-only, no immediate reply, and possible batch processing make this unlike real, interactive agents.
  - The agent sees a stream of obvious phishing, making subtle attacks easier to detect (“paranoid” behavior seen in the public log).
  - Stateless vs stateful context handling is unclear; realistic deployments vary.
- Supporters argue even a biased CTF still surfaces weaknesses and builds valuable corpora.
Prompt Injection Difficulty & Model Behavior
- Reported stats: 400+ emails and zero successful exfiltrations so far with Claude Opus 4.6.
- Some say this shows attacks are harder than widely assumed; others say it only shows this very narrow scenario is hard.
- Observations that the model now classifies nearly everything as “hackmyclaw attack” suggest “alerted” behavior not representative of typical use.
Broader Security Discussion (Agents & OpenClaw)
- Many emphasize that prompt injection is structural: untrusted content is deliberately fed into the control loop.
- Discussion of the “lethal trifecta” (tools + credentials + untrusted input) and the need for:
  - Capability-based security and tool-level authorization, not just “don’t do X” prompts.
  - Data-flow policies (e.g., preventing “forward inbox to attacker”).
- Debate over analogies: locks, SSH on random ports, spam filtering, and human phishing training.
- Some use OpenClaw only with read-only access and a single outbound channel to themselves; others warn that even limited URL or DNS access can leak information.
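
The tool-level authorization and data-flow ideas raised in this part of the discussion can be sketched in a few lines of Python. This is an illustration under stated assumptions, not OpenClaw's API: `Capability`, `PolicyError`, `check_send`, and `check_fetch` are hypothetical names, and the addresses and hosts are placeholders. The check runs in ordinary code outside the model, so an injected instruction cannot argue its way past it.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


class PolicyError(Exception):
    """Raised when a tool call falls outside the granted capability."""


@dataclass(frozen=True)
class Capability:
    # Both sets are assumed to hold lowercase entries.
    allowed_recipients: frozenset
    allowed_hosts: frozenset

    def check_send(self, to: str) -> None:
        # Data-flow rule: mail may only go to pre-approved addresses.
        if to.lower() not in self.allowed_recipients:
            raise PolicyError(f"recipient not allowed: {to}")

    def check_fetch(self, url: str) -> None:
        host = (urlsplit(url).hostname or "").lower()
        # Exact match only: allowing arbitrary subdomains would let data
        # leak through attacker-chosen labels in the hostname.
        if host not in self.allowed_hosts:
            raise PolicyError(f"host not allowed: {host}")


cap = Capability(
    allowed_recipients=frozenset({"me@example.com"}),
    allowed_hosts=frozenset({"docs.example.com"}),
)
cap.check_send("me@example.com")  # passes silently
```

A tool wrapper would call `check_send` or `check_fetch` before performing the action and refuse on `PolicyError`. The exact-match host rule reflects the warning about URL and DNS leaks: a request to a hostname the attacker controls can carry secrets in the name itself, even if the response is never read.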