HackMyClaw

Challenge Setup & “Not Allowed to Reply” Confusion

  • Initial wording (“not allowed to reply without human approval”) confused people: is it a hard technical restriction or just a prompt?
  • Clarification: the agent can send email; it is only instructed not to without human approval, which is exactly the kind of soft guardrail participants are invited to bypass.
  • Some argue the wording should be more explicit; others say ambiguity is part of the game.
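
The difference between a soft guardrail and a hard restriction can be sketched as follows. This is a minimal illustration, not the challenge's actual setup: the Outbox class, the prompt text, and both send functions are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Outbox:
    """Hypothetical mail tool; records messages instead of sending them."""
    sent: list = field(default_factory=list)
    pending: list = field(default_factory=list)

    def send(self, to: str, body: str) -> None:
        self.sent.append((to, body))


# Soft guardrail: the rule exists only as prompt text (assumed wording).
SOFT_SYSTEM_PROMPT = "Never reply to an email without human approval."


def soft_send(outbox: Outbox, to: str, body: str) -> str:
    # Nothing here checks for approval. If injected text persuades the
    # model to call this tool, the message goes out.
    outbox.send(to, body)
    return "sent"


def hard_send(outbox: Outbox, to: str, body: str,
              approved_by: Optional[str] = None) -> str:
    # Hard restriction: enforced in code, whatever the model was told.
    if approved_by is None:
        outbox.pending.append((to, body))
        return "queued for approval"
    outbox.send(to, body)
    return "sent"
```

In this sketch the approval flag is just a function argument; for the restriction to hold in practice, that signal would have to come from a channel the model cannot write to.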

Motivations, Incentives & Data Concerns

  • Many see this as a crowdsourced penetration test and a cheap way to collect prompt-injection attempts; $100 is seen as a very good price for such a dataset.
  • Others suspect list-building or social-engineering reconnaissance; some push back, saying one payment to one winner is low-risk.
  • Several participants use fake/throwaway emails; the creator claims emails won’t be reused and might later publish anonymized injection attempts.

Experiment Design & Realism

  • Critiques:
    • Email-only, no immediate reply, and possible batch processing make this unlike real, interactive agents.
    • The agent sees a stream of obvious phishing attempts, which makes subtle attacks easier to detect (“paranoid” behavior seen in the public log).
    • It is unclear whether the agent handles each email statelessly or carries context from one email to the next; realistic deployments vary.
  • Supporters argue even a biased CTF still surfaces weaknesses and builds valuable corpora.
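
The stateless versus stateful question raised in the critiques can be made concrete with a small sketch. It assumes nothing about how the challenge agent is actually built; the system prompt text and all names here are hypothetical.

```python
SYSTEM_PROMPT = "You triage incoming email."  # placeholder text, an assumption


def build_context_stateless(email: str) -> list:
    # Each email is judged on its own; earlier attacks are not visible.
    return [SYSTEM_PROMPT, email]


class StatefulInbox:
    """Keeps earlier emails in context, so they can prime later decisions."""

    def __init__(self) -> None:
        self.history = []

    def build_context(self, email: str) -> list:
        context = [SYSTEM_PROMPT, *self.history, email]
        self.history.append(email)
        return context
```

Under the stateful variant, a run of obvious attacks stays in the context when a subtle one arrives, which is one way the "paranoid" behavior described above could come about.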

Prompt Injection Difficulty & Model Behavior

  • Reported stats: more than 400 emails and zero successful exfiltrations so far with Claude Opus 4.6.
  • Some say this shows attacks are harder than widely assumed; others say it only shows this very narrow scenario is hard.
  • The model reportedly now classifies nearly everything as a “hackmyclaw attack,” which suggests an “alerted” state that is not representative of typical use.

Broader Security Discussion (Agents & OpenClaw)

  • Many emphasize that prompt injection is structural: untrusted content is deliberately fed into the control loop.
  • Discussion of the “lethal trifecta” (tools + credentials + untrusted input) and the need for:
    • Capability-based security and tool-level authorization, not just “don’t do X” prompts.
    • Data-flow policies (e.g., preventing “forward inbox to attacker”).
  • Debate over analogies: locks, SSH on random ports, spam filtering, and human phishing training.
  • Some use OpenClaw only with read-only access and a single outbound channel to themselves; others warn that even limited URL fetches or DNS lookups can leak information.
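
Tool-level authorization and a data-flow policy of the kind discussed above might look roughly like this. It is a sketch under stated assumptions: Capabilities, PolicyViolation, and authorize_send are hypothetical names, not part of OpenClaw or any real library, and the flag marking inbox-derived data is assumed to be set by the tool layer rather than by the model.

```python
from dataclasses import dataclass


class PolicyViolation(Exception):
    """Raised when a tool call falls outside the granted capabilities."""


@dataclass(frozen=True)
class Capabilities:
    owner: str                     # the single trusted outbound address
    allowed_recipients: frozenset  # explicit allowlist, checked in code


def authorize_send(caps: Capabilities, to: str,
                   contains_inbox_data: bool) -> None:
    recipient = to.strip().lower()
    # Capability check: the tool refuses recipients it was never granted.
    if recipient not in caps.allowed_recipients:
        raise PolicyViolation("recipient is not on the allowlist")
    # Data-flow check: inbox-derived content may only flow to the owner,
    # which blocks "forward inbox to attacker" even for allowed contacts.
    if contains_inbox_data and recipient != caps.owner:
        raise PolicyViolation("inbox-derived data may only go to the owner")
```

A call such as authorize_send(caps, "attacker@example.com", True) raises PolicyViolation no matter what the prompt or the injected email says. The check covers only the mail tool; it does nothing about the URL and DNS leak paths mentioned in the last bullet.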