
OpenBlock reaches #1 on Terminal Bench with frontier agent OB-1

OB-1 achieved the #1 position on Terminal Bench with a 59.0% success rate, establishing a significant lead through a mixture-of-models workflow and persistent agentic memory.

Terminal Bench is the industry standard for evaluating autonomous coding agents on real-world software engineering tasks. Unlike synthetic benchmarks or cherry-picked demos, Terminal Bench requires agents to complete end-to-end workflows: understanding requirements, configuring development environments, editing code across multiple files, running tests, debugging failures, and validating results. The benchmark consists of 100 realistic coding challenges drawn from actual GitHub issues, spanning web development, data processing, DevOps automation, and infrastructure tasks. Success demands not just code generation, but persistent execution, contextual understanding, error recovery, and the ability to navigate complex codebases without human intervention.

OB-1's 59.0% on this benchmark reflects an approach fundamentally different from existing coding agents. Rather than relying on a single model, we built a mixture-of-models system that harnesses the complementary strengths of multiple frontier LLMs working in concert. Each problem-solving iteration cycles through three distinct models: GPT-5 handles rapid prototyping and creative problem decomposition, Claude Sonnet 4 brings careful code analysis and refactoring, and Claude Opus 4.1 acts as the final arbiter for completion decisions.

Think of it like a team of engineers reviewing a pull request. Multiple perspectives contribute their strengths, with a senior engineer making the final call. This mirrors how we actually use AI assistants in practice, trying different models until one makes meaningful progress. The approach boosts success by aggregating intelligence across frontier models with a strong guardrail at the end, though it introduces variance and higher cost. A task can only be marked complete if Claude Opus 4.1 gives explicit approval, preventing premature exits and ensuring quality.
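To make the workflow concrete, here is a minimal sketch of iteration-level model rotation with a single completion arbiter. The call_model helper, model identifiers, and trace structure are placeholders for illustration, not OB-1's actual interfaces.

```python
# Minimal sketch of iteration-level model rotation with a single arbiter.
# call_model() is a hypothetical stand-in for a real LLM call; the control
# flow, not the model logic, is what this illustrates.
from dataclasses import dataclass, field

WORKERS = ["gpt-5", "claude-sonnet-4"]   # alternate per iteration
ARBITER = "claude-opus-4.1"              # sole authority on completion

@dataclass
class Trace:
    steps: list = field(default_factory=list)

def call_model(model: str, prompt: str, trace: Trace) -> dict:
    """Placeholder: a real implementation would call the model's API."""
    return {"summary": f"{model}: {prompt[:40]}", "approved": False}

def solve(task: str, max_iters: int = 20) -> bool:
    trace = Trace()
    for i in range(max_iters):
        worker = WORKERS[i % len(WORKERS)]        # rotate worker models
        step = call_model(worker, f"advance task: {task}", trace)
        trace.steps.append(step)
        # Completion requires explicit approval from the arbiter model.
        verdict = call_model(ARBITER, f"is '{task}' truly done?", trace)
        if verdict["approved"]:
            return True
    return False                                  # never exit without approval
```

The key constraint is structural: worker models may propose and act, but only the arbiter can end the run.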

The Memory Problem

To maintain consistency across model switches and iterations, OB-1 employs a trace memory system that records terminal commands executed and their outputs, error messages and stack traces encountered, progress notes and debugging observations, todo items for next steps, and reflection blocks analyzing what worked or failed. This memory persists across retries and long-horizon tasks, allowing the agent to learn from mistakes and build on successful patterns. This is critical for the iterative, error-prone nature of real software engineering.
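As a rough illustration, the trace memory can be thought of as a structured log that is serialized back into each model's context. The field names below are assumptions for the sketch, not OB-1's actual schema.

```python
# Illustrative sketch of a trace-memory record. The memory is replayed into
# each model's context so progress survives model switches and retries.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TraceEntry:
    command: str                  # terminal command that was executed
    output: str                   # captured stdout/stderr
    error: Optional[str] = None   # error message or stack trace, if any

@dataclass
class TraceMemory:
    entries: list[TraceEntry] = field(default_factory=list)
    notes: list[str] = field(default_factory=list)        # progress / debugging observations
    todos: list[str] = field(default_factory=list)        # next steps
    reflections: list[str] = field(default_factory=list)  # what worked or failed

    def record(self, command: str, output: str, error: Optional[str] = None) -> None:
        self.entries.append(TraceEntry(command, output, error))

    def to_prompt(self) -> str:
        """Serialize the memory so the next model resumes with full context."""
        lines = [f"$ {e.command}\n{e.output}" + (f"\nERROR: {e.error}" if e.error else "")
                 for e in self.entries]
        lines += [f"NOTE: {n}" for n in self.notes]
        lines += [f"TODO: {t}" for t in self.todos]
        lines += [f"REFLECT: {r}" for r in self.reflections]
        return "\n".join(lines)
```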

The mixture-of-models approach improves peak solve rates and reduces local optima. When one model gets stuck, another brings fresh perspective. The ensemble aggregates intelligence across frontier models while maintaining quality control through a strict reviewer at the end. However, this comes at a cost: higher variance in outcomes and increased API expenses from running multiple models. We offset both challenges through trace memory to avoid repeating failed approaches, selective retries based on error analysis, and early stopping when Opus detects unrecoverable failures.
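A simplified sketch of that mitigation logic follows; the error classes and the arbiter signal are hypothetical stand-ins for an LLM-based review of the trace.

```python
# Sketch of the cost/variance controls: skip approaches that already failed,
# retry only error classes that plausibly benefit from a retry, and stop
# early when the arbiter judges the failure unrecoverable. All helpers and
# categories here are illustrative assumptions.

RETRYABLE = {"timeout", "flaky_test", "transient_network"}

def classify_error(error: str) -> str:
    """Placeholder triage; a real system would analyze the full trace."""
    if "timed out" in error:
        return "timeout"
    if "ConnectionError" in error:
        return "transient_network"
    return "logic_error"

def next_action(error: str, attempted: set[str], arbiter_unrecoverable: bool) -> str:
    """Decide whether to retry, switch approach, or stop."""
    if arbiter_unrecoverable:
        return "stop"                      # early stopping: don't burn more tokens
    kind = classify_error(error)
    if kind in attempted:
        return "switch_approach"           # trace memory: don't repeat a failed path
    attempted.add(kind)
    return "retry" if kind in RETRYABLE else "switch_approach"
```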

In practice, we alternate models at the iteration level, require a single arbiter for completion, keep a trace memory of terminal commands and outcomes, and use todos and reflection blocks for long-horizon control. This architecture allowed OB-1 to consistently make progress on long-horizon, error-prone tasks where most agents stall, validating our hypothesis that composition and memory beat raw model scale.

Beyond General Agents

AI will write most software. If your next developer can't rely on a coding agent in your stack, you stall. The default path is waiting for a general agent to eventually fit your needs, but that path is slow and uncertain. OB-1 takes a different route: a multi-model agent tuned to your repos and services, with memory of your traces and an evaluation harness that rewards what works in your environment. Reliability comes from specialization, not slogans.

Beyond the base Terminal Bench performance, OB-1's architecture enables repository-specific fine-tuning to learn your codebase's patterns and conventions, custom tool integration to connect to your internal APIs and services, memory persistence to recall previous solutions to similar problems in your stack, and team collaboration to share agent knowledge across your engineering organization. This isn't a distant vision. Teams are using these capabilities in production today.
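As one illustration of custom tool integration, an internal service can be exposed to the agent as a callable tool. The registry, decorator, tool name, and endpoint below are hypothetical, not OB-1's actual API.

```python
# Hypothetical sketch: registering an internal deploy endpoint as an agent
# tool. The registry, decorator, and URL are illustrative assumptions.
import json
import urllib.request

TOOLS: dict = {}

def register_tool(name: str):
    """Add a function to the registry the agent can call by name."""
    def wrap(fn):
        TOOLS[name] = fn
        return fn
    return wrap

@register_tool("deploy_staging")
def deploy_staging(service: str) -> str:
    """Roll a service out to staging via an internal API."""
    req = urllib.request.Request(
        "https://deploy.internal.example/api/v1/deploy",  # placeholder URL
        data=json.dumps({"service": service, "env": "staging"}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()
```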

What Comes Next

We're evolving OB-1 toward agents that reflect, learn, and improve across runs: infrastructure where experience compounds rather than resets with each session. Our roadmap includes reinforcement learning from execution traces so agents improve from their mistakes, multi-agent collaboration where specialist agents work together on complex tasks, interactive debugging for real-time collaboration between human and agent, and custom evaluation suites to benchmark performance on your specific use cases.

Teams can apply for early access via our waitlist. We're working with select partners to customize OB-1 for their specific technology stacks and workflows. We're also hiring across product and research to push the agent architecture and evaluation stack forward. If building the future of software development excites you, explore our open roles.

Utility first: we use OB-1 internally to accelerate our own engineering. Mission alignment: ecosystems need a reliable coding agent they can shape to their needs, not a one-size-fits-all tool. OB-1 shows a path to practical reliability today by letting you extend, tweak, and reconfigure the agent for your stack.
