
GPT-5.3-Codex's 72% success rate in hacking contracts is a reminder that real-time defenses aren't optional
OpenAI just published EVMbench, a new benchmark measuring how well AI agents can detect, patch, and exploit smart contract vulnerabilities. The headline result: AI is dramatically better at attacking than defending. These are, in fact, the droids we feared.
OpenAI’s latest model, GPT-5.3-Codex, achieved a 72.2% success rate in “exploit mode,” meaning it was able to successfully carry out the attack it was attempting in roughly seven out of 10 scenarios. What matters even more than the exact number is the direction of travel. OpenAI reports that this result is more than double what comparable models could do just six months ago. In plain terms, the cost of producing a workable attack is dropping fast, and the speed at which attackers can iterate is climbing.
The defensive side of the ledger tells a very different story. The same models that cracked 72% of exploits could “detect” vulnerabilities in only about 40% of cases, and “patch” them in even fewer. Those steps remain harder for today’s AI systems to do reliably end to end, especially when the codebase is unfamiliar, the bug is subtle, or the fix requires judgment calls.
Attackers have always had an asymmetric advantage: they need to succeed once, while defenders need to win every time. EVMbench suggests AI is widening that gap. But smart contract audits are only the first line of defense. An end-to-end security posture also includes broad risk coverage beyond known bugs, cross-chain visibility, runtime monitoring, and automated response. The latest finding doesn’t change this. It reinforces it.
One of the most valuable outputs of a comprehensive audit process, whether conducted by humans or AI, is the set of invariants it surfaces. These are the fundamental conditions that must always hold true for a protocol to behave as intended.
These invariants shouldn't live only in a static audit report. They translate naturally into continuous monitoring rules that watch for violations the moment a contract goes live. But in an era where AI can iterate on exploits until funds are drained, simply knowing your invariants isn't enough. You need the ability to enforce them in real-time.
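To make this concrete, here is a minimal sketch of what turning an audit invariant into a runtime monitoring rule might look like. The vault, its state fields, and the solvency invariant are all hypothetical, purely for illustration; a production system would evaluate fresh on-chain state every block rather than in-memory objects.

```python
from dataclasses import dataclass

# Hypothetical example: a vault-style protocol whose audit surfaced the
# invariant "the vault must always hold at least as many underlying tokens
# as it has issued shares" (a 1:1 solvency floor).

@dataclass
class VaultState:
    total_shares: int        # shares issued to depositors
    underlying_balance: int  # tokens actually held by the contract

def solvency_invariant(state: VaultState) -> bool:
    """The condition that must always hold for the protocol to be sound."""
    return state.underlying_balance >= state.total_shares

def check_block(state: VaultState, alert) -> None:
    # Evaluate the rule against the latest state; fire a callback on violation.
    # A real deployment would pair the alert with an automated response,
    # such as pausing the contract.
    if not solvency_invariant(state):
        alert(f"Invariant violated: balance {state.underlying_balance} "
              f"< shares {state.total_shares}")

# A healthy state passes silently; a drained vault triggers the alert.
alerts = []
check_block(VaultState(total_shares=1_000, underlying_balance=1_200), alerts.append)
check_block(VaultState(total_shares=1_000, underlying_balance=300), alerts.append)
print(alerts)
```

The point of the sketch is the shape of the pipeline: the audit produces the predicate, and the monitoring layer evaluates it continuously and wires violations to an automated response.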
EVMbench evaluates agents in a controlled, sandboxed Anvil environment. While this is essential for benchmarking, it doesn't represent the full difficulty of real-world smart contract security.
A protocol's threat landscape evolves constantly: governance parameters change, liquidity shifts, and new integrations are added. Pre-deployment audits, even those powered by the latest LLMs, cannot fully anticipate how a contract will behave once it becomes composable with the broader, volatile DeFi ecosystem.
New attack vectors emerge from protocol interactions that didn't exist at audit time. This is where the gap between a "benchmark" and "production security" becomes visible.
Continuous monitoring and automated response are not just "extensions" of the audit lifecycle; they are the runtime requirement for any protocol securing significant TVL.
At Hypernative, we provide the proactive, "always-on" layer that protects $100B+ in assets where audits stop. Our platform is designed to catch the evolving risks that static benchmarks miss.
Relying on audits alone has always been a recipe for disaster. EVMbench is a useful reminder of why, and a sign of what's coming. As AI makes exploitation cheaper, faster, and more accessible, the gap between point-in-time security and real-time resilience stops being a best practice gap and becomes an existential one. The industry is already moving in the right direction. AI will accelerate that shift whether teams are ready or not.
Reach out for a demo of Hypernative’s solutions, and tune into Hypernative’s blog and our social channels to keep up with the latest on cybersecurity in Web3.
Secure everything you build, run and own in Web3 with Hypernative.
Website | X (Twitter) | LinkedIn