February 19, 2026

When AI Exploits Better Than Defends: What OpenAI's EVMbench Means for Web3 Security

GPT-5.3-Codex's 72% success rate in hacking contracts is a reminder that real-time defenses aren't optional

Hypernative

OpenAI just published EVMbench, a new benchmark measuring how well AI agents can detect, patch, and exploit smart contract vulnerabilities. The headline result: AI is dramatically better at attacking than defending. These are, in fact, the droids we feared.

OpenAI’s latest model, GPT-5.3-Codex, achieved a 72.2% success rate in “exploit mode,” meaning it successfully carried out the attack it was attempting in roughly seven out of 10 scenarios. What matters even more than the exact number is the direction of travel: OpenAI reports that this result is more than double what comparable models could do just six months ago. In plain terms, the cost of producing a workable attack is dropping fast, and the speed at which attackers can iterate is climbing.

The defensive side of the ledger tells a very different story. The same models that cracked 72% of exploits could only “detect” vulnerabilities in about 40% of cases, and “patch” them in even fewer. Those steps remain harder for today’s AI systems to do reliably end to end, especially when the codebase is unfamiliar, the bug is subtle, or the fix requires judgment calls.

Attackers have always had an asymmetric advantage: they need to succeed once, while defenders need to win every time. EVMbench suggests AI is widening that gap. But smart contract audits are only the first line of defense. An end-to-end security posture also includes broad risk coverage beyond known bugs, cross-chain visibility, runtime monitoring, and automated response. The latest finding doesn’t change this. It reinforces it.

From Static Reports to Living Invariants

One of the most valuable outputs of a comprehensive audit process, whether conducted by humans or AI, is the set of invariants it surfaces. These are the fundamental conditions that must always hold true for a protocol to behave as intended.

These invariants shouldn't live only in a static audit report. They translate naturally into continuous monitoring rules that watch for violations the moment a contract goes live. But in an era where AI can iterate on exploits until funds are drained, simply knowing your invariants isn't enough. You need the ability to enforce them in real time.
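
To make the idea concrete, here is a minimal sketch of an invariant turned into a monitoring rule, written in TypeScript with ethers.js. The vault contract, its address, and the simplified 1:1 solvency invariant are hypothetical, and a production system would do far more than log:

```typescript
import { ethers } from "ethers";

// Hypothetical vault whose core invariant is solvency: the assets it
// holds must always cover the shares it has issued (simplified to 1:1).
const VAULT_ABI = [
  "function totalAssets() view returns (uint256)",
  "function totalSupply() view returns (uint256)",
];

const provider = new ethers.JsonRpcProvider(process.env.RPC_URL);
// Placeholder address: substitute the deployed vault you care about.
const vault = new ethers.Contract(
  "0x0000000000000000000000000000000000000000",
  VAULT_ABI,
  provider
);

// Re-check the invariant on every new block, not just at audit time.
provider.on("block", async (blockNumber) => {
  const [assets, supply] = await Promise.all([
    vault.totalAssets(),
    vault.totalSupply(),
  ]);
  if (assets < supply) {
    // A production system would page on-call and trigger an automated
    // response here, not just log the violation.
    console.error(
      `Invariant violated at block ${blockNumber}: assets=${assets} < supply=${supply}`
    );
  }
});
```

The point is the shape: the same statement an auditor writes down ("assets must always cover supply") becomes executable, evaluated every block for the life of the contract.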

The Mainnet Reality: Beyond the Sandbox

EVMbench evaluates agents in a controlled, sandboxed Anvil environment. While this is essential for benchmarking, it doesn't represent the full difficulty of real-world smart contract security.
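
For readers who haven't used it, Anvil is Foundry's local node, and sandboxes like this are typically built by forking chain state at a fixed block so the agent attacks a frozen snapshot. A rough sketch of that setup, assuming anvil is installed and RPC_URL points at an archive node (the block number is a hypothetical placeholder):

```typescript
import { spawn } from "node:child_process";
import { ethers } from "ethers";

// Fork chain state at a fixed block into a local sandbox. Everything an
// agent does here is isolated and replayable, unlike production, where
// prices, liquidity, and integrations keep moving under the contract.
const anvil = spawn("anvil", [
  "--fork-url", process.env.RPC_URL!,  // archive-node RPC endpoint
  "--fork-block-number", "19000000",   // hypothetical frozen snapshot
]);

anvil.stdout.on("data", async (chunk: Buffer) => {
  if (!chunk.toString().includes("Listening on")) return;
  // Anvil serves JSON-RPC on 127.0.0.1:8545 by default.
  const provider = new ethers.JsonRpcProvider("http://127.0.0.1:8545");
  console.log("Sandbox ready at block", await provider.getBlockNumber());
});
```

That frozen snapshot is exactly what makes a benchmark reproducible, and exactly what a live chain is not.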

A protocol's threat landscape evolves constantly: governance parameters change, liquidity shifts, and new integrations are added. Pre-deployment audits, even those powered by the latest LLMs, cannot fully anticipate how a contract will behave once it becomes composable with the broader, volatile DeFi ecosystem.

New attack vectors emerge from protocol interactions that didn't exist at audit time. This is where the gap between a "benchmark" and "production security" becomes visible.

Security as a Runtime Requirement

Continuous monitoring and automated response are not just "extensions" of the audit lifecycle; they are the runtime requirement for any protocol securing significant TVL.

At Hypernative, we provide the proactive, "always-on" layer that protects $100B+ in assets where audits stop. Our platform is designed to catch the evolving risks that static benchmarks miss:

  • Beyond the 120-Vector Limit: EVMbench tests against a curated set of 120 vulnerabilities. In the real world, threats are infinite. Hypernative monitors 300+ risk vectors across security, financial, operational, and governance categories. We don't just look for "bugs"; we detect intent-based anomalies and zero-day exploits in real time.
  • True Cross-Chain Coverage: The OpenAI benchmark is currently limited to single-chain environments. However, today’s most devastating exploits often leverage cross-chain bridges and multi-chain dependencies. Hypernative provides comprehensive visibility across 70+ chains, ensuring that a vulnerability on one network doesn't become a backdoor to your entire ecosystem.
  • Runtime Intelligence vs. Static Snapshots: An audit is a picture of your code on the day it was written. Hypernative is a live feed of your code in the wild. Our Hypernative Platform and Guardian solutions use sophisticated ML models and mempool simulations to detect threats before they land on-chain, giving teams the "pre-crime" minutes needed to respond.
  • Automated Response at Block Speed: When an AI attacker can iterate on an exploit in seconds, manual intervention is too slow. We enable protocols to connect our high-fidelity alerts to automated playbooks, allowing them to pause contracts, move funds to cold storage, or rotate keys the moment an invariant is violated (see the sketch after this list).
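
Here is a minimal sketch of that pause-on-alert pattern, assuming an OpenZeppelin-style Pausable contract and a hypothetical Alert shape. It is not Hypernative's actual playbook API, and the environment variables (RPC_URL, GUARDIAN_KEY) are placeholders:

```typescript
import { ethers } from "ethers";

// Hypothetical alert shape; a real feed carries far richer context.
interface Alert {
  severity: "info" | "high" | "critical";
  contractAddress: string;
  reason: string;
}

// OpenZeppelin-style Pausable target; pause() is the only call we need.
const PAUSABLE_ABI = ["function pause()"];

const provider = new ethers.JsonRpcProvider(process.env.RPC_URL);
// The guardian key should hold only the pauser role, nothing else.
const guardian = new ethers.Wallet(process.env.GUARDIAN_KEY!, provider);

// Playbook: on a critical alert, pause first and investigate second.
// At block speed, waiting for a human in the loop means waiting too long.
async function onAlert(alert: Alert): Promise<void> {
  if (alert.severity !== "critical") return;
  const target = new ethers.Contract(
    alert.contractAddress,
    PAUSABLE_ABI,
    guardian
  );
  const tx = await target.pause();
  await tx.wait();
  console.log(`Paused ${alert.contractAddress}: ${alert.reason} (tx ${tx.hash})`);
}
```

Scoping the guardian key to the pauser role alone limits the blast radius if the automation itself is ever compromised.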

The Bottom Line

Relying on audits alone has always been a recipe for disaster. EVMbench is a useful reminder of why, and a sign of what's coming. As AI makes exploitation cheaper, faster, and more accessible, the gap between point-in-time security and real-time resilience stops being a matter of best practice and becomes an existential one. The industry is already moving in the right direction. AI will accelerate that shift whether teams are ready or not.

Reach out for a demo of Hypernative’s solutions, and tune into Hypernative’s blog and social channels to keep up with the latest on cybersecurity in Web3.

Secure everything you build, run and own in Web3 with Hypernative.

Website | X (Twitter) | LinkedIn
