
Solidity SWE-Bench: Evaluating AI on Blockchain Development

We evaluated AI coding agents on 200 real-world blockchain development tasks, a mini Solidity SWE-Bench, to understand how well today's models perform at building and maintaining protocols where mistakes cost millions.

Smart contract engineering represents a uniquely challenging domain for AI coding agents. Financial stakes are enormous: bugs translate directly into stolen or locked funds, as seen in the DAO hack, the Parity wallet freeze, and the Poly Network exploit. Immutability means that once deployed, contracts can't be patched without complex upgrade patterns. Gas constraints make every operation costly, so optimization is mandatory rather than optional. The adversarial environment includes MEV searchers, frontrunners, and attackers constantly probing for vulnerabilities. Composability requirements demand that contracts integrate cleanly with existing protocols and standards. If AI agents can't handle this high-stakes, precision-critical domain, their utility for production software engineering remains questionable. Conversely, success here demonstrates readiness for real-world deployment.

Solidity SWE-Bench consists of 200 GitHub issues from production smart contract repositories including OpenZeppelin Contracts, the standard library for secure development, Uniswap V3 with its automated market maker and concentrated liquidity, Aave Protocol for decentralized lending and borrowing, Compound Finance with algorithmic interest rates, and Gnosis Safe for multi-signature wallet infrastructure. Issues span the full engineering lifecycle: security fixes to patch reentrancy and overflow vulnerabilities account for 35% of tasks, gas optimization to reduce costs without changing behavior represents 25%, feature additions implementing new functionality while maintaining upgradeability make up 20%, refactoring to improve code organization comprises 15%, and test coverage for edge conditions fills out the remaining 5%.

Unlike general-purpose benchmarks, Solidity SWE-Bench enforces strict requirements: all tests must pass, including new tests added by the agent; security analysis with Slither and Mythril must show no regressions; solutions cannot increase gas costs by more than 5%; and code must follow the Solidity style guide and best practices. A task is only marked successful if it meets all criteria. Partial solutions are failures. This reflects the reality of production smart contract development, where "almost correct" can mean millions in losses.

Agent Performance

We evaluated five leading coding agents across the 200-task benchmark. OB-1 achieved a 28% solve rate, Cursor Composer reached 19%, Claude Code hit 16%, GPT-5 with GitHub Copilot scored 14%, and Aider managed 11%. These scores are dramatically lower than performance on general software engineering benchmarks, where top agents achieve 40-60% on SWE-Bench. The gap reveals systematic challenges with blockchain-specific development that general coding knowledge doesn't address.

Detailed analysis of failed attempts uncovers recurring patterns. Agents frequently produce functionally correct code that would be prohibitively expensive to deploy or execute: using string storage instead of bytes32 for fixed-length identifiers, performing redundant storage loads that could be cached in memory, employing inefficient loop patterns that could use unchecked arithmetic, and missing storage packing opportunities for struct fields. In one case study, an agent "fixed" a reentrancy vulnerability but increased gas costs by 180% through overly defensive checks. The solution was rejected despite being functionally secure.
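
To make these gas patterns concrete, here is a minimal, illustrative contrast between an inefficient and an optimized version of the same logic. The contract and variable names are hypothetical, and the sketch assumes Solidity ^0.8.x.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

// Inefficient version: the patterns flagged in failed attempts.
contract RewardsInefficient {
    string public poolId;                        // dynamic string for a fixed-length identifier
    mapping(address => uint256) public balances;
    address[] public users;

    function totalBalance() external view returns (uint256 sum) {
        // users.length is re-read from storage on every iteration.
        for (uint256 i = 0; i < users.length; i++) {
            sum += balances[users[i]];
        }
    }
}

// Optimized version: same behavior, lower gas.
contract RewardsOptimized {
    bytes32 public poolId;                       // fixed-length identifier fits one storage slot
    mapping(address => uint256) public balances;
    address[] public users;

    function totalBalance() external view returns (uint256 sum) {
        uint256 len = users.length;              // cache the storage read in memory
        for (uint256 i = 0; i < len; ) {
            sum += balances[users[i]];
            unchecked { ++i; }                   // i < len, so the increment cannot overflow
        }
    }
}
```

Each change is small on its own, but the benchmark's 5% gas ceiling means an agent has to apply this kind of optimization consistently across a patch.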

Many production contracts use proxy patterns for upgradeability, and agents often break them: modifying storage layouts in ways that lose compatibility with existing state, adding constructors to logic contracts (which are incompatible with proxies), removing or reordering state variables, or changing inheritance order, which also alters the storage layout.

While attempting to fix one issue, agents sometimes introduce new vulnerabilities: unchecked external calls that don't validate return values from token transfers, missing input validation that allows zero amounts or invalid addresses, reentrancy through state updates after external calls in violation of the checks-effects-interactions pattern, and integer overflow when using Solidity versions below 0.8.0 without SafeMath.
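
The last pattern, updating state only after the external call, is worth spelling out. A minimal sketch with hypothetical contract and function names, assuming Solidity ^0.8.x:

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

contract Vault {
    mapping(address => uint256) public balances;

    function deposit() external payable {
        balances[msg.sender] += msg.value;
    }

    // Vulnerable ordering: the external call happens before the balance update,
    // so a malicious receiver can re-enter withdrawUnsafe() and drain funds.
    function withdrawUnsafe(uint256 amount) external {
        require(balances[msg.sender] >= amount, "insufficient balance"); // checks
        (bool ok, ) = msg.sender.call{value: amount}("");                // interaction first (bad)
        require(ok, "transfer failed");
        balances[msg.sender] -= amount;                                  // effect too late
    }

    // Checks-effects-interactions: state is updated before the external call.
    function withdraw(uint256 amount) external {
        require(balances[msg.sender] >= amount, "insufficient balance"); // checks
        balances[msg.sender] -= amount;                                  // effects
        (bool ok, ) = msg.sender.call{value: amount}("");                // interactions
        require(ok, "transfer failed");
    }
}
```

Updating the balance before the external call means a re-entering caller sees the already reduced balance and fails the check; a reentrancy guard can be layered on top for defense in depth.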

Blockchain development has specialized requirements that general coding knowledge doesn't cover. Agents violate the ERC-20 and ERC-721 token standards, forget to emit events for state changes (which breaks indexing), create unsafe price feed interactions vulnerable to manipulation, and write code patterns exploitable by MEV searchers and frontrunners. These failures account for 17% of unsuccessful attempts and highlight the domain expertise gap between general software engineering and blockchain development.
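
Forgetting to emit events is the easiest of these failures to illustrate. Below is a minimal, hypothetical example of the pattern indexers expect: every externally visible state change emits an event, and inputs are validated before state is touched. Names and the fee cap are illustrative, assuming Solidity ^0.8.x.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

contract FeeConfig {
    event FeeUpdated(uint256 oldFeeBps, uint256 newFeeBps);

    address public immutable owner;
    uint256 public feeBps;

    constructor() {
        owner = msg.sender;
    }

    function setFee(uint256 newFeeBps) external {
        require(msg.sender == owner, "not owner");   // access control
        require(newFeeBps <= 1_000, "fee too high"); // input validation (max 10%)
        uint256 oldFeeBps = feeBps;
        feeBps = newFeeBps;
        emit FeeUpdated(oldFeeBps, newFeeBps);       // off-chain indexers depend on this event
    }
}
```

Omitting the FeeUpdated event would not change on-chain behavior, which is exactly why agents miss it, yet it silently breaks any indexer or frontend watching for fee changes.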

Where Agents Excel

Despite low overall success rates, agents excel at specific task types. Test generation achieves 67% success at creating comprehensive test suites, documentation writing hits 71% with NatSpec comments, simple refactoring such as extracting helper functions or renaming variables reaches 54%, and dependency updates that upgrade OpenZeppelin or Chainlink versions manage 48%. This suggests a division of labor where agents handle boilerplate while humans focus on security-critical logic.
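
NatSpec documentation, where agents did best, looks like the following hypothetical sketch: `@notice` tags for users, `@dev` tags for maintainers, and `@param`/`@return` tags for every argument and return value.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

/// @title Escrow (illustrative)
/// @notice Holds a buyer's payment until the seller delivers.
/// @dev NatSpec of this shape is the kind of documentation agents handled well.
contract Escrow {
    /// @notice Amount currently held for each buyer.
    mapping(address => uint256) public deposits;

    /// @notice Deposit ETH to be held in escrow.
    /// @dev Reverts if no value is sent.
    function deposit() external payable {
        require(msg.value > 0, "no value sent");
        deposits[msg.sender] += msg.value;
    }

    /// @notice Returns the amount held for `account`.
    /// @param account The buyer whose balance is queried.
    /// @return amount The ETH amount currently in escrow.
    function heldFor(address account) external view returns (uint256 amount) {
        return deposits[account];
    }
}
```

Documentation of this kind is low-risk, since a wrong comment doesn't move funds, which is part of why the success rate is so much higher than for security fixes.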

Comparing blockchain tasks to general SWE-Bench reveals lower solve rates (28% for Solidity versus 56% for general engineering among top agents), higher variance in performance across Solidity tasks, more catastrophic failures (wrong answers in Solidity often introduce exploits, while in web development they are usually just bugs), and longer debugging cycles, because blockchain tooling is less mature and error diagnosis is harder. The roughly twofold performance gap indicates that current agents lack blockchain-specific knowledge that can't be bridged through general coding ability alone.

Path Forward

Based on failure mode analysis and successful cases, we identify three critical improvements. First, agents need built-in security constraints: formal verification integration to auto-generate symbolic execution tests, training data weighted toward secure implementations of common vulnerability patterns, automated auditing that runs Slither and Mythril after every code change and parses the outputs to fix issues, and upgrade-safe refactoring that understands proxy patterns and storage layouts. Second, because a generalist agent will always struggle with the diversity of blockchain frameworks, protocol-specific fine-tuning creates DeFi specialists for AMM and lending code, NFT specialists with expertise in ERC-721 and marketplaces, and governance specialists who understand voting and timelock patterns. Our experiments show protocol-specific agents achieve two to three times higher solve rates in their domain than general models.
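
Returning to the upgrade-safety point above, here is a minimal, hypothetical sketch of what storage-layout-aware refactoring has to preserve when a logic contract sits behind a proxy. Contract names and variables are illustrative; production code would typically use OpenZeppelin's upgradeable contracts and Initializable helper rather than the hand-rolled flag shown here.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

contract StakingV1 {
    address public admin;                      // slot 0
    mapping(address => uint256) public staked; // slot 1
}

contract StakingV2 {
    // Existing variables keep their order and types, so slots 0 and 1 still
    // line up with the state already written by V1 behind the proxy.
    address public admin;                      // slot 0 (unchanged)
    mapping(address => uint256) public staked; // slot 1 (unchanged)

    // New state is appended after the existing layout, never inserted before it.
    uint256 public rewardRate;                 // slot 2 (new)

    // Logic contracts behind a proxy use an initializer instead of a constructor,
    // because constructor code runs against the logic contract's own storage.
    bool private initialized;

    function initialize(uint256 _rewardRate) external {
        require(!initialized, "already initialized");
        initialized = true;
        rewardRate = _rewardRate;
    }
}
```

Any refactor that reorders the existing declarations, changes their types, or adds a constructor to the logic contract would corrupt or orphan the state already stored behind the proxy, which is exactly the failure mode described earlier.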

Third, agents must understand blockchain state and execution: gas profiling to measure transaction costs before proposing solutions, mainnet forking to test against real protocol state rather than mocks, static analysis that integrates mutation testing and symbolic execution, and multi-chain awareness of EVM differences across layer-one blockchains and layer-two rollups.

Current AI capabilities suggest a pragmatic division of responsibilities. Agents generate test suites and documentation, automate gas optimization for non-critical code paths, scaffold boilerplate such as interfaces and deployment scripts, and flag potential security issues. Humans handle core protocol logic and economic mechanisms, security-critical functions for access control and fund transfers, upgrade strategy and migration planning, and final security audits and deployment decisions.
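
As an illustration of the mainnet-forking requirement above, here is a minimal test sketch assuming a Foundry setup with forge-std. The RPC alias, pinned block, and addresses are placeholders: "mainnet" must map to an RPC endpoint configured in foundry.toml, and the zero addresses must be replaced with a real token contract and a known holder before the balance assertion can pass.

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.20;

import {Test} from "forge-std/Test.sol";

interface IERC20 {
    function balanceOf(address account) external view returns (uint256);
}

// Illustrative fork test: runs assertions against real (forked) mainnet state
// instead of locally deployed mocks.
contract ForkTest is Test {
    // Placeholder addresses; substitute the protocol's real token and a known holder.
    address constant TOKEN = address(0);
    address constant HOLDER = address(0);

    function setUp() public {
        // Fork mainnet at a pinned block so the test is deterministic.
        // "mainnet" is an RPC alias expected to be configured in foundry.toml.
        vm.createSelectFork("mainnet", 19_000_000);
    }

    function test_readsLiveProtocolState() public {
        // Reads the forked chain's state rather than a mocked balance.
        // Fails until the placeholder addresses above are filled in.
        assertGt(IERC20(TOKEN).balanceOf(HOLDER), 0);
    }
}
```

Running an agent's patch inside a test like this surfaces integration failures that mocks tend to hide, such as a token that returns false instead of reverting.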

The 28% solve rate achieved by the top agent is simultaneously encouraging and concerning. It demonstrates AI can handle non-trivial blockchain tasks, but reliability remains far below production requirements. We believe reaching 80%+ solve rates is achievable within 12-18 months through targeted fine-tuning and tooling improvements. That threshold would make AI agents genuinely useful for production protocol development. Solidity SWE-Bench is now open source. Clone the repository, run the evaluation harness with your preferred agent, and submit results to our leaderboard for detailed failure analysis and improvement suggestions.
