GPT‑5.6 Sol sets a new state of the art on Terminal‑Bench 2.1, which tests complex command-line workflows requiring planning, iteration, and tool coordination.
Can agents build complete projects that deliver real value? We’re launching Terminal-Bench Challenges: 3 unsolved tasks that could make a real impact on the open source community if solved.
These tasks provide a testing ground for optimizations at both the model and harness level, with a continuous leaderboard for each task.
Terminal-Bench Challenges is inspired by previous projects exploring long-running agents including Carlini's C compiler and Cursor's browser.
Join the effort! If you have ideas for further challenges, come hang out in the tb-challenges Discord channel:
discord.com/invite/2Pe5uWG…
Introducing Terminal-Bench Challenges!
A new capability has emerged at the frontier: agents completing large-scale projects autonomously. To test this capability, we felt another flavor of benchmark was needed.
Terminal-Bench Challenges are long-horizon, token-intensive, single-task benchmarks. Today we are releasing our first 3 challenges.
📣 Announcing Terminal-Bench Science: benchmarking AI agents on real scientific workflows – now open for task contributions👇
tbench.ai/news/tb-scienc…
@AnthropicAI, @OpenAI, and @GoogleDeepMind use Terminal-Bench to evaluate AI on coding tasks. We're now extending it to
Thank you to @ekellbuch for leading TB2.1, @Zai_org for Terminal-Bench 2.0 Verified, which informed 11 of the 28 tasks we patched, and @SnorkelAI and @togethercompute for support
We're releasing Terminal-Bench 2.1 to patch 28 of the 89 tasks in Terminal-Bench 2.0
TB2.1 includes
• recalibrated limits
• fixed solutions
• realigned verifiers
Per-task breakdowns in 🧵
We'll continue to support TB2 and TB2.1 leaderboards (new submission process 🔜)
The Terminal-Bench community discovered multiple instances of cheating and reward hacking on the Terminal-Bench 2.0 leaderboard.
We're adding some new policies to keep it reliable:
• ATIF trajectories required for all passing trials
• Reward hacking results in reward 0 for the trial
• Cheating results in immediate leaderboard removal
Thanks to @davisbrownr, @adamlsteinl, and @NoCommas for flagging the recent occurrences!
Detailed blog post in comments ⬇️
We independently verified these claims and removed OpenBlocks from the Terminal-Bench 2.0 leaderboard.
Thank you @NoCommas for helping us keep leaderboard entries honest!
Recent leaderboard submissions are in huggingface.co/datasets/harbo… which makes it easy for the community to work together to detect cheating.