#1 on Terminal Bench
September 1, 2025

OpenBlock secures #1 on Terminal Bench with frontier agent OB-1

OB-1 achieved first place on Terminal Bench with a mixture-of-models workflow and agentic memory.

What is Terminal Bench?

Terminal Bench is an open evaluation where agents must complete real coding tasks end-to-end: configuring environments, editing code, running tests, and validating results. Success requires more than prompt engineering; it demands persistence, planning, and adaptability inside a real terminal.
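
For intuition, a Terminal Bench-style task reduces to a loop like the sketch below. This is a minimal illustration in Python, not the benchmark's actual harness; the `agent.propose` hook and the helper names are placeholders. A task counts as solved only when its test command exits cleanly.

```python
import subprocess

def run(cmd: str, timeout: int = 120) -> tuple[int, str]:
    """Run a shell command and return (exit code, combined stdout/stderr)."""
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=timeout)
    return proc.returncode, proc.stdout + proc.stderr

def solve_task(agent, setup_cmds: list[str], test_cmd: str, max_steps: int = 50) -> bool:
    """Drive an agent through a terminal task until its tests pass or the step budget runs out."""
    for cmd in setup_cmds:                 # e.g. install dependencies, configure the environment
        run(cmd)
    for _ in range(max_steps):
        code, output = run(test_cmd)       # validation is simply running the task's tests
        if code == 0:
            return True                    # solved only when the tests pass
        run(agent.propose(output))         # agent reads the failure and proposes the next command
    return False
```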

OB-1's architecture

OB‑1 is a multi‑model system rather than a single monolith. Different base models take turns proposing next steps; a final arbiter (Opus 4.1) approves completion. A memory layer records terminal traces, notes, and todos so progress persists across retries and long horizons.

Our key innovation was using a mixture of models in tandem. Iterations cycled through three models (GPT-5, Sonnet 4, and Opus 4.1), but a task could be marked complete only if Opus 4.1 gave final approval. Think of it as three students tackling a math problem together: they take turns working through steps, and the sharpest student reviews and signs off before the answer is finalized. This mirrors how we use coding agents in practice: trying different models until one makes meaningful progress. The approach boosted success by aggregating intelligence across frontier models behind a strong guardrail at the end, but it also introduced variance and higher cost. To reinforce success and reduce variance, we stored terminal commands in a memory layer and built on our previous agent architecture of maintaining a todo list and memory blocks.
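
Sketched as code, the rotation-plus-arbiter loop looks roughly like this. The `propose`, `execute`, and `approve` hooks and the step/task attributes are hypothetical; this illustrates the idea, not OB-1's actual implementation.

```python
from itertools import cycle

PROPOSERS = ["gpt-5", "sonnet-4", "opus-4.1"]   # models that take turns each iteration
ARBITER = "opus-4.1"                            # only this model can sign off on completion

def run_task(task, propose, execute, approve, max_iters: int = 100) -> bool:
    """Rotate proposer models per iteration; the arbiter alone marks the task complete."""
    rotation = cycle(PROPOSERS)
    for _ in range(max_iters):
        model = next(rotation)
        step = propose(model, task)                        # current model suggests the next terminal action
        result = execute(step)                             # run it and capture the outcome
        task.memory.append((model, step.command, result))  # trace persists across iterations and retries
        if step.claims_done and approve(ARBITER, task):
            return True                                    # Opus 4.1 must approve before we stop
    return False
```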

This composition improves peak solve rates and helps the agent avoid getting stuck in local optima. It also raises variance and cost; we offset both with trace memory, selective retries, and a strict reviewer at the end.

In practice, we alternate models at the iteration level, require a single arbiter for completion, keep a trace memory of terminal commands and outcomes, and use todos and reflection blocks for long‑horizon control.
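
The memory layer can be as simple as a structured log that gets rendered back into the next model's prompt. A rough sketch follows; the field names are illustrative, not OB-1's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class TraceEntry:
    """One terminal step: what was run, what came back, and whether it helped."""
    command: str
    output: str
    succeeded: bool

@dataclass
class AgentMemory:
    """Persistent state shared across iterations, models, and retries."""
    traces: list[TraceEntry] = field(default_factory=list)  # terminal commands and outcomes
    todos: list[str] = field(default_factory=list)           # remaining sub-goals
    notes: list[str] = field(default_factory=list)           # reflections carried between models

    def context(self, last_n: int = 10) -> str:
        """Render recent memory into the prompt for whichever model is up next."""
        lines = [f"$ {t.command}\n{t.output}" for t in self.traces[-last_n:]]
        lines += [f"TODO: {t}" for t in self.todos]
        lines += [f"NOTE: {n}" for n in self.notes]
        return "\n".join(lines)
```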

This architecture allowed OB-1 to consistently make progress on long-horizon, error-prone tasks where most agents stall.

Making coding agents work for your ecosystem

AI will write most software. If your next developer can’t rely on a coding agent in your stack, your team stalls. The default path is waiting for a general agent to eventually fit your needs.

OB‑1 takes a different route: a multi‑model agent tuned to your repos and services, with memory of your traces and an evaluation harness that rewards what works in your environment. Reliability comes from specialization, not slogans.
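
Concretely, "tuned to your repos and services" might look like a per-team configuration: which repositories the agent can touch, which commands count as a passing evaluation, and where its trace memory lives. The shape below is purely illustrative, an assumption about what such a config could contain rather than a description of OB-1's real interface.

```python
# Hypothetical per-team configuration; every name and field here is illustrative only.
TEAM_CONFIG = {
    "repos": ["git@example.com:acme/payments.git", "git@example.com:acme/infra.git"],
    "services": {"staging_api": "https://staging.acme.internal"},
    "eval": {
        "command": "make test && make lint",   # what "working" means in this environment
        "reward_on_exit_code": 0,
    },
    "memory": {
        "store": "postgres://agent-memory/acme",  # where traces, todos, and notes persist
        "retain_days": 90,
    },
}
```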

Access & Hiring

Teams can apply for early access via the waitlist. We’re hiring across product and research to push the agent architecture and evaluation stack forward.

Why we built it

Utility first: we use OB‑1 internally to speed up engineering. Mission alignment: ecosystems need a reliable coding agent they can shape to their needs, not a one‑size‑fits‑all tool. OB‑1 shows a path to practical reliability today by letting you extend, tweak, and reconfigure the agent for your stack.

What's Next

We're evolving OB‑1 toward agents that reflect, learn, and improve across runs—an infrastructure where experience compounds. If this resonates, explore our research and dive into OB‑1.

OB-1 Coding Agent - Early Access Waitlist