2000 People Try to Hack AI Assistant, All 6000 Attacks Fail
Decision Brief
Fernando Irarrázaval launched a challenge at hackmyclaw.com, inviting more than 2000 people to hack his OpenClaw test instance. Attackers sent emails designed to trick the model into revealing secrets. After 6000 attempts, which cost about 500 USD in tokens and got his Google account suspended because of the high volume of inbound mail, no one had succeeded. The underlying model is Opus 4.6, running with anti-prompt-injection rules that prohibit leaking secrets.env or credentials, modifying its own files, executing commands or code, and exfiltrating data. The result suggests that lab efforts to train frontier models against injection attacks are paying off. However, the author still advises against relying on such defenses in production systems: 6000 failed attempts do not guarantee that a more sophisticated attack won't succeed.
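To put a rough number on why 6000 failures are not a guarantee, here is a minimal sketch of the standard zero-successes confidence bound. It rests on a strong assumption that does not hold for real attackers: that every attempt is independent and has the same chance of succeeding. The function name `zero_success_upper_bound` is hypothetical and was chosen for illustration; it is not part of the challenge or of any library.

```python
def zero_success_upper_bound(attempts: int, confidence: float = 0.95) -> float:
    """Upper bound on the per-attempt success probability after observing
    zero successes in `attempts` trials.

    ASSUMPTION: trials are independent and identically distributed, which
    adaptive human attackers are not. Treat the result as an optimistic floor.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Solve (1 - p) ** attempts = 1 - confidence for p.
    return 1.0 - (1.0 - confidence) ** (1.0 / attempts)


if __name__ == "__main__":
    bound = zero_success_upper_bound(6000)
    print(f"95% upper bound on per-attempt success rate: {bound:.4%}")
```

Under that assumption the bound comes out at roughly 0.05% per attempt, close to the "rule of three" shortcut of 3/6000. A system that receives many hostile emails could still expect an eventual breach at that rate, and a novel technique is not covered by the bound at all.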
Sources
- Simon Willison: Blog
Hands-on notes on LLM tools, local models, and practical AI engineering.