TxnBench: Evaluating LLMs on Text-to-Transaction

LLMs that can generate transactions from natural language intent could drastically change crypto UX, making it safer and easier for both humans and AI agents to interact with protocols. We built TxnBench to measure how close we are to that future.

Interacting with blockchain protocols requires understanding complex Application Binary Interfaces (ABIs), transaction structures, gas estimation, nonce management, and protocol-specific quirks. A simple action like "swap 1 ETH for USDC on Uniswap" demands precise encoding of token addresses, fee tiers, slippage parameters, and deadline timestamps. This complexity has three costs: a high barrier to entry, since new users must learn intricate technical details before even basic interactions; error-prone workflows, where manual transaction construction leads to costly mistakes such as wrong decimals or incorrect addresses; and limited accessibility for AI agents, which can't reliably interact with DeFi protocols without robust transaction generation.
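
To make "precise encoding" concrete, here is a minimal sketch that hand-encodes calldata for an ERC-20 `transfer(address,uint256)` call, a much simpler case than the Uniswap swap described above. The helper name and the placeholder recipient are illustrative assumptions and are not part of TxnBench; real tooling would normally use an ABI library rather than manual byte padding.

```python
# Sketch only: manual ABI encoding for ERC-20 transfer(address,uint256).
# 0xa9059cbb is the 4-byte selector for "transfer(address,uint256)".
TRANSFER_SELECTOR = bytes.fromhex("a9059cbb")


def encode_erc20_transfer(recipient: str, amount_base_units: int) -> str:
    """Return hex calldata: selector + 32-byte address word + 32-byte amount word."""
    addr_hex = recipient.lower().removeprefix("0x")
    if len(addr_hex) != 40:
        raise ValueError("address must be 20 bytes (40 hex characters)")
    if not 0 <= amount_base_units < 2**256:
        raise ValueError("amount must fit in uint256")
    addr_word = bytes.fromhex(addr_hex).rjust(32, b"\x00")  # left-pad to 32 bytes
    amount_word = amount_base_units.to_bytes(32, "big")     # big-endian uint256
    return "0x" + (TRANSFER_SELECTOR + addr_word + amount_word).hex()


# Placeholder recipient (hypothetical, not a real account); 50 units of a 6-decimal token.
calldata = encode_erc20_transfer("0x" + "11" * 20, 50 * 10**6)
```

Every field has a fixed width and position, so a wrong selector, a misplaced word, or an amount in the wrong units still produces well-formed hex that does something other than what the user asked for.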

TxnBench is a comprehensive benchmark for evaluating how well LLMs translate natural language instructions into correct, safe blockchain transactions. It consists of 500 real-world test cases spanning six categories: DEX operations (swaps and liquidity provision on Uniswap, Curve, and Balancer); lending protocol interactions (deposits and borrows on Aave and Compound); yield farming (staking and reward claims); NFT interactions (minting and marketplace operations); DAO governance (proposal creation and voting); and cross-chain bridges (asset transfers). Each test case is scored on four dimensions: correctness (the transaction executes the intended action), safety (slippage and approvals are set appropriately), efficiency (gas usage is optimized), and robustness (edge cases such as low liquidity or price volatility are handled).
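
As a rough illustration of what the safety dimension looks at, the sketch below flags a swap with no slippage floor, an effectively infinite deadline, or an unlimited approval. This is not TxnBench's rubric: the `SwapParams` fields, the 1% slippage ceiling, and the 30-minute deadline ceiling are all assumptions chosen for the example.

```python
from dataclasses import dataclass

# Assumed ceilings for illustration; the benchmark's real thresholds are not stated here.
MAX_SLIPPAGE_BPS = 100        # 1%
MAX_DEADLINE_SECONDS = 1800   # 30 minutes
UINT256_MAX = 2**256 - 1


@dataclass
class SwapParams:
    quoted_out: int       # expected output in base units
    min_out: int          # minimum acceptable output (the slippage floor)
    deadline: int         # unix timestamp after which the swap must fail
    approval_amount: int  # allowance granted to the router


def safety_issues(p: SwapParams, now: int) -> list[str]:
    """Return human-readable safety problems; an empty list means none were found."""
    issues = []
    if p.min_out <= 0:
        issues.append("no slippage protection")
    elif p.min_out * 10_000 < p.quoted_out * (10_000 - MAX_SLIPPAGE_BPS):
        issues.append("slippage tolerance above ceiling")
    if p.deadline <= now:
        issues.append("deadline already passed")
    elif p.deadline - now > MAX_DEADLINE_SECONDS:
        issues.append("deadline too far in the future")
    if p.approval_amount == UINT256_MAX:
        issues.append("unlimited approval")
    return issues


now = 1_700_000_000  # fixed timestamp so the example is deterministic
careful = SwapParams(4_000_000_000, 3_980_000_000, now + 600, 10**18)
careless = SwapParams(4_000_000_000, 0, UINT256_MAX, UINT256_MAX)
careful_issues = safety_issues(careful, now)
careless_issues = safety_issues(careless, now)
```

Integer basis-point arithmetic is used instead of floats so the comparison stays exact at token-sized magnitudes.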

We evaluate models using a zero-shot prompting approach. Given only a natural language instruction, relevant contract ABIs, and current blockchain state including token balances and allowances, the model must generate complete transaction calldata with contract addresses, properly encoded parameters, gas estimates, and safety parameters like slippage tolerance and deadlines. Transactions are then validated against a forked mainnet environment to ensure they would execute correctly without reverting. A transaction must score at least 80% on all four criteria to pass. Partial credit is not awarded because the bar is whether you would trust this transaction with real money.
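
The pass rule above can be sketched in a few lines. The dimension names and the 80% threshold come from the text; representing scores as fractions in a dictionary, and the function names, are assumptions for illustration rather than the harness's actual interface.

```python
DIMENSIONS = ("correctness", "safety", "efficiency", "robustness")
PASS_THRESHOLD = 0.80  # at least 80% on every dimension


def passes(scores: dict[str, float]) -> bool:
    """A transaction passes only if all four dimensions meet the threshold."""
    return all(scores[d] >= PASS_THRESHOLD for d in DIMENSIONS)


def pass_rate(results: list[dict[str, float]]) -> float:
    """Fraction of test cases that pass; there is no partial credit per case."""
    if not results:
        return 0.0
    return sum(passes(r) for r in results) / len(results)


strong = {"correctness": 0.95, "safety": 0.90, "efficiency": 0.85, "robustness": 0.80}
weak_safety = {"correctness": 1.00, "safety": 0.79, "efficiency": 1.00, "robustness": 1.00}
rate = pass_rate([strong, weak_safety])
```

Note that `weak_safety` fails despite perfect scores elsewhere: a single dimension below the bar fails the whole case, which is what "no partial credit" means in practice.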

Current Performance

We evaluated seven frontier models across TxnBench's 500 test cases. Among them, GPT-5 achieved a 42% pass rate, Claude Opus 4.1 reached 38%, Gemini Ultra scored 31%, and Llama 4 405B managed 27%. Analysis of failures reveals four systematic issues. Decimal precision errors account for 34% of failures: models confuse 6-decimal and 18-decimal tokens, producing drastically incorrect amounts. Missing safety checks represent 28%, with no slippage protection or infinite deadlines. Incorrect encoding causes 22%, through malformed calldata or wrong function signatures. Protocol misunderstanding contributes the remaining 16%, as when models use Uniswap V2 patterns for V3 or ignore pool fee tiers.
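
The decimal failure mode is easy to reproduce. The sketch below converts a human-readable amount into integer base units using a per-token decimals table; USDC uses 6 decimals while DAI and WETH use 18. The table and the `to_base_units` helper are illustrative assumptions; in practice the decimals value would be read from the token contract rather than hard-coded.

```python
from decimal import Decimal

# Hard-coded for the example; a real system would query each token's decimals().
TOKEN_DECIMALS = {"USDC": 6, "DAI": 18, "WETH": 18}


def to_base_units(amount: str, symbol: str) -> int:
    """Convert a decimal string such as "50" into the token's integer base units."""
    decimals = TOKEN_DECIMALS[symbol]
    scaled = Decimal(amount).scaleb(decimals)  # exact shift by 10**decimals
    if scaled != scaled.to_integral_value():
        raise ValueError(f"{amount} {symbol} has more than {decimals} decimal places")
    return int(scaled)


correct = to_base_units("50", "USDC")  # 50 USDC with 6 decimals
wrong = 50 * 10**18                     # the same 50 encoded as if USDC had 18 decimals
overshoot = wrong // correct            # how many times too large the wrong amount is
```

Here `correct` is 50,000,000 base units, and the 18-decimal mistake overshoots it by a factor of 10^12. Parsing the amount as a string with `Decimal` avoids the float rounding that would otherwise creep in at 18 decimal places.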

Models perform well on simple tasks: basic token swaps succeed roughly 75% of the time, ERC-20 approvals that grant spending permissions reach 82%, and read-only balance queries hit 91%. However, they struggle with multi-step workflows such as "borrow USDC using ETH collateral then swap half to DAI" (18% success), position management such as adjusting Uniswap V3 concentrated liquidity ranges (12%), and cross-protocol composition involving flashloan-swap-repay sequences (just 8%).

Why This Matters

Reliable text-to-transaction is the foundation for three transformative use cases. First, autonomous agents managing portfolios, executing arbitrage, or providing liquidity need robust transaction generation. Current 42% success rates mean agents fail more than half the time, which is unacceptable when real capital is at stake. Second, natural language wallets where users can say "send $50 to alice.eth" or "stake my ETH for the best yield" require near-perfect transaction generation since a single decimal error could mean sending $50,000 instead of $50. Third, intent-based architecture used by protocols like UniswapX and CoW Protocol allows users to specify what they want rather than how to get it, and text-to-transaction models can democratize intent expression.

Based on our analysis, we identify three high-leverage improvements. Current models are trained primarily on natural language and code, not blockchain-specific transaction patterns. A dedicated dataset of intent-transaction pairs would teach models decimal precision rules for different tokens, safe default values for slippage and deadlines, common protocol interaction patterns, and error recovery strategies. Models also currently generate transactions in isolation, with no feedback on success or failure. A simulate-then-adjust workflow closes that gap: the model generates an initial transaction, simulates it on forked mainnet, receives the revert reason if it fails, retries with corrected parameters, and repeats until the transaction succeeds. In our experiments, this boosted GPT-5's pass rate from 42% to 67%, demonstrating that models can self-correct when given execution feedback.
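
The simulate-then-adjust loop can be sketched as follows. Everything here is a hypothetical stand-in: `generate` represents the model call, `simulate` represents execution on a forked network, and the stub functions exist only to make the example runnable. The attempt cap is also an added assumption; the workflow described above simply repeats until success.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SimResult:
    success: bool
    revert_reason: Optional[str] = None


def generate_with_feedback(
    intent: str,
    generate: Callable[[str, list[str]], dict],
    simulate: Callable[[dict], SimResult],
    max_attempts: int = 3,  # assumed cap so the loop always terminates
) -> Optional[dict]:
    """Generate, simulate, and retry with accumulated revert reasons as feedback."""
    feedback: list[str] = []
    for _ in range(max_attempts):
        tx = generate(intent, feedback)
        result = simulate(tx)
        if result.success:
            return tx
        feedback.append(result.revert_reason or "unknown revert")
    return None  # give up rather than return an unverified transaction


# Stubs standing in for the model and the forked-network simulator.
def fake_generate(intent: str, feedback: list[str]) -> dict:
    # First attempt demands more output than the pool can give; later attempts relax it.
    return {"min_out": 4_100_000_000 if not feedback else 3_980_000_000}


def fake_simulate(tx: dict) -> SimResult:
    pool_output = 4_000_000_000  # pretend the swap would yield this much
    if tx["min_out"] > pool_output:
        return SimResult(False, "output below min_out")
    return SimResult(True)


tx = generate_with_feedback("swap 1 ETH for USDC", fake_generate, fake_simulate)
```

Passing the full list of revert reasons back on each retry, rather than only the latest one, lets the generator avoid repeating an earlier mistake.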

A single generalist model will always struggle with the diversity of DeFi protocols. Instead, we advocate for specialized agents: a DEX agent expert at swap routing and liquidity math, a lending agent that understands collateralization ratios and liquidation thresholds, and a governance agent that navigates proposal lifecycles and voting mechanics. Each specialist achieves 70-85% success in its domain, far exceeding generalist performance.

Open Dataset

TxnBench is publicly available at github.com/openblocklabs/txnbench. The release includes 500 test cases with reference transactions and scoring rubrics, an evaluation harness for automated testing against forked networks, baseline model results from GPT-5, Claude, Gemini, and Llama, and error analysis categorizing common failure patterns.

Crypto adoption hinges on accessibility. Today, using DeFi requires technical sophistication that excludes 99% of potential users. Text-to-transaction models won't solve every UX problem, but they remove a critical blocker. When users can express intent in natural language and trust that transactions will be constructed correctly and safely, blockchain applications become as accessible as web2 services.
