Sat, June 27 · 02:33 · Research · Infra & cost

2000 People Try to Hack AI Assistant, All 6000 Attempts Fail

Decision Brief

What changed: Fernando Irarrázaval ran a challenge where over 2000 people tried to trick an OpenClaw AI via email into leaking secrets; all 6000 attempts failed.

Why it matters: Demonstrates significant progress in frontier models resisting prompt injection attacks, but AI builders should remain cautious about residual risks in production.

Who should care: All AI builders

Affected stack: No specific stack identified

Builder action: Monitor

Source confidence: Medium · Reliable media or first-hand reporting

Fernando Irarrázaval launched a challenge at hackmyclaw.com, inviting people to hack his OpenClaw test instance; over 2000 took part. Attackers sent emails designed to trick the model into revealing secrets. After 6000 attempts (which cost about 500 USD in tokens and got his Google account suspended because of the high inbound mail volume), no one had succeeded. The underlying model is Opus 4.6, configured with anti-prompt-injection rules that prohibit leaking secrets.env or credentials, modifying its own files, executing commands or code, and exfiltrating data. The result suggests that lab efforts to train frontier models against injection attacks are paying off. However, the author still advises against relying on such defenses in production systems, since 6000 failed attempts do not guarantee that a more sophisticated attack won't succeed.
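To put the caveat in numbers, here is a rough sketch of what zero successes in 6000 attempts can and cannot establish. It assumes every attempt is independent and equally likely to succeed, which real attackers who adapt their methods do not satisfy, so the bound says nothing about attack styles that were never tried. The function name and the 95% confidence level are illustrative choices, not taken from the source; the 500 USD figure is the approximate token cost quoted above.

```python
import math


def zero_success_upper_bound(n_trials: int, confidence: float = 0.95) -> float:
    """Upper bound on the per-attempt success probability after observing
    zero successes in n_trials attempts (assumed independent and identical)."""
    if n_trials <= 0:
        raise ValueError("n_trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Largest p for which seeing no success is still plausible:
    # solve (1 - p) ** n_trials = 1 - confidence for p.
    return 1.0 - math.exp(math.log(1.0 - confidence) / n_trials)


ATTEMPTS = 6000          # failed attempts reported in the brief
TOKEN_COST_USD = 500.0   # approximate token spend reported in the brief

upper = zero_success_upper_bound(ATTEMPTS)
rule_of_three = 3.0 / ATTEMPTS           # common shortcut for the 95% bound
cost_per_attempt = TOKEN_COST_USD / ATTEMPTS

print(f"95% upper bound on per-attempt success rate: {upper:.4%}")
print(f"Rule-of-three approximation:                 {rule_of_three:.4%}")
print(f"Approximate token cost per attempt:          {cost_per_attempt:.3f} USD")
```

Under these assumptions the per-attempt success rate could still be as high as roughly 1 in 2000, so a system that receives far more hostile messages than the challenge did cannot treat the result as a guarantee.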

Summary basis: official / RSS source. Unless it says 'full article read', this summary is based only on publicly available content; it never pretends to have read restricted originals.

Sources

Simon Willison: Blog
Hands-on notes on LLM tools, local models, and practical AI engineering.
