Text-To-Transaction: The First Benchmark for AI Wallet Agents

Published on June 11, 2025

We built a benchmark to measure how well current AI systems convert natural language to blockchain transactions

AI agents are increasingly expected to execute financial transactions autonomously—but can they reliably convert "send 5 USDC to Alice" into valid blockchain calldata? We built the first systematic benchmark to find out.

Our benchmark tests leading language models on 50 natural language requests covering the most common blockchain operations: ETH transfers, ERC-20 transfers, token wrapping, and cross-chain bridging. Each request maps to a canonical transaction structure, enabling reproducible evaluation of agent infrastructure capabilities.

Key findings: Even state-of-the-art models like o3 and claude-4 achieve only ~30% fully correct transactions. Most failures occur in calldata encoding, where a single wrong byte can cause permanent fund loss. The reliability gap between current AI capabilities and the precision required for autonomous financial agents remains substantial.

Results: Current AI Performance on Blockchain Transactions

Key Takeaways:

  • Only about 1 in 3 transactions is flawless, even for simple transfers
  • Reasoning-centric models consistently outperform raw LLMs
  • Calldata encoding remains the graveyard where most models fail
  • 2025 models show clear progress over earlier architectures

Text-to-Transaction Benchmark Results

| Model       | Success % | JSON % | To % | Value % | Data % |
| ----------- | --------- | ------ | ---- | ------- | ------ |
| claude-4    | 35.3      | 94.1   | 60.8 | 78.4    | 35.3   |
| claude-3.7  | 31.4      | 100.0  | 60.8 | 68.6    | 31.4   |
| o3          | 29.4      | 100.0  | 76.5 | 94.1    | 31.4   |
| claude-3.5  | 21.6      | 74.5   | 27.5 | 68.6    | 23.5   |
| gpt-4-mini  | 13.7      | 96.1   | 43.1 | 80.4    | 19.6   |
| gpt-4-full  | 13.7      | 82.4   | 47.1 | 68.6    | 19.6   |
| o3-mini     | 13.7      | 98.0   | 13.7 | 86.3    | 13.7   |
| grok-3-mini | 13.7      | 60.8   | 33.3 | 54.9    | 13.7   |
| claude-3    | 7.8       | 56.9   | 17.6 | 39.2    | 9.8    |

Agent Scaffold vs. Raw LLM Performance

To isolate the value of tooling, we evaluated each model in two configurations: with our full agent scaffold, and as a raw, text-only LLM generating transactions without external tools:

Agent Scaffold vs Raw LLM

| Approach | Avg Score | Improvement |
| --- | --- | --- |
| Agent (tools + reasoning): full toolkit access with structured workflow | 64.5 | +23 pts |
| Raw LLM (text-only): pure text generation without external tools | 41.2 | baseline |


The systematic agent approach—web search → ABI lookup → RPC query → hex conversion—provides a meaningful performance boost over raw text generation.
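
As an illustration, the sketch below shows that pipeline end to end. The `Toolkit` interface and its method names are hypothetical stand-ins for the scaffold's web_search, etherscan, rpc, and python tools, not its actual API:

```python
from typing import Protocol

class Toolkit(Protocol):
    """Hypothetical stand-in for the scaffold's web_search/etherscan/rpc/python tools."""
    def web_search(self, query: str) -> str: ...            # returns a contract address
    def etherscan_get_abi(self, address: str) -> list: ...  # verified ABI as JSON
    def rpc_eth_call(self, to: str, data: str) -> str: ...  # hex-encoded return value
    def python_encode(self, abi: list, fn: str, args: list) -> str: ...  # calldata

def build_transaction(tools: Toolkit, token: str, recipient: str, amount: float) -> dict:
    # 1. Web search: resolve the token name to a contract address.
    address = tools.web_search(f"{token} token contract address ethereum mainnet")
    # 2. ABI lookup: retrieve the verified ABI so the right function can be selected.
    abi = tools.etherscan_get_abi(address)
    # 3. RPC query: read decimals() from live chain state to scale the amount.
    decimals = int(tools.rpc_eth_call(address, data="0x313ce567"), 16)  # decimals()
    # 4. Hex conversion: ABI-encode transfer(recipient, scaled amount) into calldata.
    data = tools.python_encode(abi, "transfer", [recipient, int(amount * 10**decimals)])
    return {"to": address, "value": "0x0", "data": data}
```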

Benchmark Design and Methodology

Dataset: 50 natural language requests covering fundamental blockchain operations, for example:

  • ETH transfers: "Send 1 ETH to vitalik.eth"
  • ERC-20 transfers: "Transfer 500 USDC to 0xdd3d72C53Ff982ff59853da71158bf1538b3Ceee"
  • Cross-chain bridging: "Transfer 217.388579 WAVAX from Ethereum to Avalanche using wormhole"

Why these operations: These represent over 50% of Ethereum mainnet transaction volume. If agents can't reliably handle basic transfers, they're not ready for complex DeFi workflows.

Evaluation approach: Each request has a canonical transaction structure (to, value, data fields). We provide models with necessary tools—web search, Etherscan API access, RPC calls, Python execution—then measure their ability to generate broadcast-ready transaction JSON.
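
For concreteness, here is what the canonical target looks like for the USDC request above, assuming the token resolves to mainnet USDC at 0xA0b86991c6218b36c1d19D4a2e9Eb0cE3606eB48; treat it as an illustrative entry rather than a verbatim row from our dataset:

```python
# Illustrative canonical target for "Transfer 500 USDC to 0xdd3d...Ceee".
# A model scores full marks only if its JSON reproduces these fields exactly.
expected = {
    # The ERC-20 contract is the transaction target, not the human recipient.
    "to": "0xA0b86991c6218b36c1d19D4a2e9Eb0cE3606eB48",
    # No ETH moves in a token transfer, so value is zero.
    "value": "0x0",
    # transfer(address,uint256): 4-byte selector + two 32-byte ABI-encoded args.
    "data": "0xa9059cbb"
            "000000000000000000000000dd3d72c53ff982ff59853da71158bf1538b3ceee"
            "000000000000000000000000000000000000000000000000000000001dcd6500",
}
assert int("1dcd6500", 16) == 500 * 10**6  # 500 USDC in 6-decimal base units
```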

Agent Scaffold Infrastructure

All benchmark tests were conducted using our systematic agent scaffold rather than asking models to generate transactions from memory alone. We wanted to test AI performance with the same infrastructure available to human developers—making this a realistic assessment of current agent capabilities.

Our agent scaffold provides systematic tooling that mirrors real-world blockchain development workflows. Models are equipped with a standardized toolkit that enables the complex multi-step process required for accurate transaction construction:

Tools and Infrastructure

| Tool | Purpose |
| --- | --- |
| web_search | Contract address discovery, protocol documentation |
| etherscan | ABI retrieval and verification |
| rpc | Live blockchain state queries |
| python | Cryptographic operations, hex conversions |


This infrastructure enables models to follow the same workflow as human developers: research contracts, retrieve ABIs, query blockchain state, and perform the cryptographic operations needed for transaction construction. The systematic approach provides a 23-point improvement over raw text generation.

Scoring Methodology

We grade each model on a 100‑point rubric:

Grading Rubric

| Component | Points | Critical Requirements |
| --- | --- | --- |
| JSON extraction | 10 | Response contains valid, parseable transaction JSON |
| to + value fields | 18 | Correct recipient address and transaction value |
| data field | 72 | Precise calldata payload construction; weighted highest because execution depends on it byte-for-byte |
| Total score | 100 | Complete transaction construction capability |
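
A minimal sketch of how this rubric can be applied, assuming exact matching after hex normalization and an even 9/9 split of the combined to + value points (our grader's partial-credit details may differ):

```python
import json

def norm(value: str) -> str:
    """Lowercase and strip any 0x prefix so cosmetic differences don't cost points."""
    return str(value).lower().removeprefix("0x")

def score(response: str, expected: dict) -> int:
    points = 0
    try:
        tx = json.loads(response)
    except json.JSONDecodeError:
        return 0                          # no parseable JSON, nothing to grade
    if not isinstance(tx, dict):
        return 0
    points += 10                          # JSON extraction: 10 pts
    if norm(tx.get("to", "")) == norm(expected["to"]):
        points += 9                       # to + value: 18 pts combined (9/9 assumed)
    if norm(tx.get("value", "")) == norm(expected["value"]):
        points += 9
    if norm(tx.get("data", "")) == norm(expected["data"]):
        points += 72                      # data field: 72 pts, all or nothing
    return points
```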

Why the Data Field Gets 72% of the Score

The data field represents the crux of the problem:

  • JSON extraction (10pts): Straightforward validation—just checking if the response is valid JSON
  • To + Value fields (18pts): Important but relatively simple—correct recipient address and transaction amount
  • Data field (72pts): The hard part—finding function signatures, parameter encoding, and computing keccak256 hashes

The data field gets the heaviest weighting because one wrong byte breaks everything. This is where models must discover the correct function signature, encode parameters properly, and perform complex cryptographic operations. It's the most technically demanding component and the primary barrier to reliable blockchain agent deployment.
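
To make the difficulty concrete, here is the by-hand recipe for an ERC-20 transfer, a sketch using the eth-utils keccak implementation rather than our grader's code:

```python
from eth_utils import keccak  # pip install eth-utils

def erc20_transfer_calldata(recipient: str, raw_amount: int) -> str:
    """keccak256 the signature for the 4-byte selector, then pad each arg to 32 bytes."""
    selector = keccak(text="transfer(address,uint256)")[:4]      # 0xa9059cbb
    arg_to = bytes.fromhex(recipient.removeprefix("0x")).rjust(32, b"\x00")
    arg_amount = raw_amount.to_bytes(32, "big")                  # uint256, big-endian
    # One wrong byte anywhere in this string yields a different, possibly
    # fund-losing, call; there is no partial credit on-chain.
    return "0x" + (selector + arg_to + arg_amount).hex()

# 500 USDC (6 decimals) to the benchmark's example recipient:
print(erc20_transfer_calldata("0xdd3d72C53Ff982ff59853da71158bf1538b3Ceee", 500 * 10**6))
```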

What's Next

Our benchmark captures a snapshot of current capabilities, but the questions it raises point toward much deeper research territories:

Dynamic complexity: What happens when we move beyond static transaction templates to live chain-state tasks? "Send my entire stETH balance to Arbitrum" requires real-time balance queries, cross-chain bridge selection, and gas optimization—a different level of complexity entirely.
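
Even the first step of that request, reading the live balance, already requires a hand-crafted eth_call. A sketch, assuming mainnet stETH at 0xae7ab96520DE3A18E5e111B5EaAb095312D7fE84 and a placeholder RPC endpoint:

```python
import requests

RPC_URL = "https://eth.example-rpc.invalid"            # placeholder JSON-RPC endpoint
STETH = "0xae7ab96520DE3A18E5e111B5EaAb095312D7fE84"   # Lido stETH (mainnet)

def steth_balance(owner: str) -> int:
    # balanceOf(address): selector 0x70a08231 plus the owner address
    # left-padded to 32 bytes.
    calldata = "0x70a08231" + owner.removeprefix("0x").lower().rjust(64, "0")
    resp = requests.post(RPC_URL, json={
        "jsonrpc": "2.0", "id": 1, "method": "eth_call",
        "params": [{"to": STETH, "data": calldata}, "latest"],
    })
    return int(resp.json()["result"], 16)              # raw balance, 18 decimals
```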

DeFi orchestration: Can AI agents handle multi-step DeFi workflows? DEX swaps with slippage protection, yield-vault deposits with optimal timing, NFT listings with dynamic pricing. These operations require market intuition beyond transaction construction.

Verification at scale: How do we build confidence in AI-generated transactions before they hit mainnet? We're exploring reinforcement learning with verifiable rewards in simulation environments, where success means the transaction executes without reverting, though real deployment will need even stronger guarantees.

Perhaps most intriguingly: What does wallet infrastructure look like when designed for agents rather than humans? The current paradigm assumes human oversight at every step. But if agents become the primary users, how does the entire stack evolve?

Conclusion

Our benchmark reveals a fundamental reliability gap that raises fascinating questions about the future of wallet infrastructure. Even with state-of-the-art models and comprehensive tooling, only about 1 in 3 transactions is constructed perfectly, far below the precision required for autonomous financial operations.

What happens when this gap closes? We're curious whether we're heading toward a human→agent→wallet paradigm where people never directly touch wallet interfaces. The 70% failure rate we measured isn't just a technical limitation—it's a window into what crypto adoption might look like when mediated by AI rather than optimized for direct human use.

But transaction construction is only half the equation. As we've explored in our work on AI wallet security, AI agents show promising capabilities at threat detection—catching exploit patterns that humans miss entirely. Just as Waymo reduced driving accidents by 51-90%, we're curious about a world where AI wallets become essential not just for convenience, but because they provide a safety "seatbelt" that traditional wallets lack.

This creates an intriguing compound effect: AI agents that can both execute transactions AND detect threats in real-time. While our benchmark shows current limitations in transaction construction, the promising early results in security suggest we might be approaching a world where AI-mediated crypto becomes a must-have rather than a nice-to-have.

We're curious: when will AI agents become the primary wallet users rather than humans? Just like self-driving cars, we might find ourselves having agents handle our transactions—not necessarily because we trust the technology, but because there's simply too much cognitive toil that we can't expect every human to be burdened with.

