AI agents are increasingly expected to execute financial transactions autonomously—but can they reliably convert "send 5 USDC to Alice" into valid blockchain calldata? We built the first systematic benchmark to find out.
Our benchmark tests leading language models on 50 natural language requests covering the most common blockchain operations: ETH transfers, ERC-20 transfers, token wrapping, and cross-chain bridging. Each request maps to a canonical transaction structure, enabling reproducible evaluation of agent infrastructure capabilities.
Key findings: Even state-of-the-art models like o3 and claude-4 achieve only ~30% fully correct transactions. Most failures occur in calldata encoding, where a single wrong byte can cause permanent fund loss. The reliability gap between current AI capabilities and the precision required for autonomous financial agents remains substantial.
Our agent scaffold provides systematic tooling that mirrors real-world blockchain development workflows. Models are equipped with a standardized toolkit that supports the complex multi-step process required for accurate transaction construction.
The systematic agent approach—web search → ABI lookup → RPC query → hex conversion—yields a 23-point improvement over raw text generation.
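The final "hex conversion" step of that pipeline can be sketched in pure Python. This is an illustrative sketch, not the benchmark's tooling: the recipient address is a placeholder, and `a9059cbb` is the well-known 4-byte selector for `transfer(address,uint256)`.

```python
def encode_erc20_transfer(recipient: str, amount: int) -> str:
    """Build ERC-20 transfer calldata: 4-byte selector + two 32-byte arguments."""
    selector = "a9059cbb"  # first 4 bytes of keccak256("transfer(address,uint256)")
    # ABI encoding left-pads each argument to 32 bytes (64 hex characters).
    addr_word = recipient.lower().removeprefix("0x").rjust(64, "0")
    amount_word = format(amount, "x").rjust(64, "0")
    return "0x" + selector + addr_word + amount_word

# "Send 5 USDC to Alice": USDC uses 6 decimals, so 5 USDC = 5_000_000 base units.
calldata = encode_erc20_transfer(
    "0x1111111111111111111111111111111111111111",  # placeholder recipient
    5 * 10**6,
)
```

A single wrong nibble anywhere in this 68-byte payload produces a different, possibly fund-losing transaction, which is why the data field dominates the failure modes.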
Dataset: 50 natural language requests covering four fundamental blockchain operations: ETH transfers, ERC-20 transfers, token wrapping, and cross-chain bridging.
Why these operations: These represent over 50% of Ethereum mainnet transaction volume. If agents can't reliably handle basic transfers, they're not ready for complex DeFi workflows.
Evaluation approach: Each request has a canonical transaction structure (to, value, data fields). We provide models with necessary tools—web search, Etherscan API access, RPC calls, Python execution—then measure their ability to generate broadcast-ready transaction JSON.
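Concretely, the canonical target for a request like "send 5 USDC to Alice" is a small JSON object over those three fields. The contract and recipient addresses below are illustrative placeholders, not real deployments:

```python
import json

# Canonical (to, value, data) structure for a hypothetical ERC-20 transfer.
canonical_tx = {
    # token contract being called (placeholder address)
    "to": "0x2222222222222222222222222222222222222222",
    # no ETH rides along with an ERC-20 transfer
    "value": "0x0",
    # transfer(address,uint256): selector + recipient + amount, each padded to 32 bytes
    "data": "0xa9059cbb"
    + "1111111111111111111111111111111111111111".rjust(64, "0")
    + format(5 * 10**6, "x").rjust(64, "0"),
}
print(json.dumps(canonical_tx, indent=2))
```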
All benchmark tests were conducted using our systematic agent scaffold rather than asking models to generate transactions from memory alone. We wanted to test AI performance with the same infrastructure available to human developers—making this a realistic assessment of current agent capabilities.
This infrastructure enables models to follow the same workflow as human developers: research contracts, retrieve ABIs, query blockchain state, and perform the cryptographic operations needed for transaction construction. The systematic approach provides a 23-point improvement over raw text generation.
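As one concrete step, "query blockchain state" typically means a JSON-RPC `eth_call`, for example reading a token's `decimals()` before encoding an amount. A minimal sketch, assuming a placeholder token address (`0x313ce567` is the standard `decimals()` selector):

```python
import json

def build_decimals_call(token: str) -> dict:
    """JSON-RPC payload for an eth_call of ERC-20 decimals() at the latest block."""
    return {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "eth_call",
        "params": [
            {"to": token, "data": "0x313ce567"},  # decimals() selector
            "latest",
        ],
    }

payload = json.dumps(build_decimals_call("0x2222222222222222222222222222222222222222"))
# POST this payload to any mainnet RPC endpoint; the returned 32-byte hex word
# decodes to the token's decimals (6 for USDC, 18 for most tokens).
```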
We grade each model on a 100-point rubric spanning the three transaction fields (to, value, data), with the data field weighted most heavily.
The data field represents the crux of the problem:
The data field gets the heaviest weighting because one wrong byte breaks everything. This is where models must discover the correct function signature, encode parameters properly, and perform complex cryptographic operations. It's the most technically demanding component and the primary barrier to reliable blockchain agent deployment.
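To make the weighting concrete, here is a toy scorer in the spirit of that rubric. The 20/20/60 split is an assumption for illustration, not the benchmark's published weights:

```python
# Hypothetical field weights: data dominates because one wrong byte
# invalidates the whole transaction. The exact split is assumed.
WEIGHTS = {"to": 20, "value": 20, "data": 60}

def score(predicted: dict, canonical: dict) -> int:
    """All-or-nothing credit per field, case-insensitive hex comparison."""
    return sum(
        weight
        for field, weight in WEIGHTS.items()
        if predicted.get(field, "").lower() == canonical[field].lower()
    )

# A prediction with correct `to` and `value` but a one-byte calldata error
# forfeits the entire data weight:
canonical = {"to": "0xAb", "value": "0x0", "data": "0xa9059cbb00"}
flawed = {"to": "0xab", "value": "0x0", "data": "0xa9059cbb01"}
```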
Our benchmark captures a snapshot of current capabilities, but the questions it raises point toward much deeper research territories:
Dynamic complexity: What happens when we move beyond static transaction templates to live chain-state tasks? "Send my entire stETH balance to Arbitrum" requires real-time balance queries, cross-chain bridge selection, and gas optimization—a different level of complexity entirely.
DeFi orchestration: Can AI agents handle multi-step DeFi workflows? DEX swaps with slippage protection, yield-vault deposits with optimal timing, NFT listings with dynamic pricing. These operations require market intuition beyond transaction construction.
Verification at scale: How do we build confidence in AI-generated transactions before they hit mainnet? We're exploring RL-verified rewards in simulation environments where success means the transaction executes without revert—but real deployment needs even stronger guarantees.
Perhaps most intriguingly: What does wallet infrastructure look like when designed for agents rather than humans? The current paradigm assumes human oversight at every step. But if agents become the primary users, how does the entire stack evolve?
Our benchmark reveals a fundamental reliability gap that raises fascinating questions about the future of wallet infrastructure. Even with state-of-the-art models and comprehensive tooling, only 1 in 3 transactions are constructed perfectly—far below the precision required for autonomous financial operations.
What happens when this gap closes? We're curious whether we're heading toward a human→agent→wallet paradigm where people never directly touch wallet interfaces. The 70% failure rate we measured isn't just a technical limitation—it's a window into what crypto adoption might look like when mediated by AI rather than optimized for direct human use.
But transaction construction is only half the equation. As we've explored in our work on AI wallet security, AI agents show promising capabilities at threat detection—catching exploit patterns that humans miss entirely. Just as Waymo reduced driving accidents by 51-90%, we're curious about a world where AI wallets become essential not just for convenience, but because they provide a safety "seatbelt" that traditional wallets lack.
This creates an intriguing compound effect: AI agents that can both execute transactions AND detect threats in real-time. While our benchmark shows current limitations in transaction construction, the promising early results in security suggest we might be approaching a world where AI-mediated crypto becomes a must-have rather than a nice-to-have.
We're curious: when will AI agents become the primary wallet users rather than humans? Just like self-driving cars, we might find ourselves having agents handle our transactions—not necessarily because we trust the technology, but because there's simply too much cognitive toil that we can't expect every human to be burdened with.