terminalbench @terminalbench

https://t.co/3jNN3bYpO5 Joined May 2025

Tweets: 15 · Followers: 340 · Following: 4 · Likes: 8

OpenAI @OpenAI

7 days ago

GPT‑5.6 Sol sets a new state of the art on Terminal‑Bench 2.1, which tests complex command-line workflows requiring planning, iteration, and tool coordination.

108 235 4K 1.9M 311

Can agents build complete projects that deliver real value? We’re launching Terminal Bench Challenges: 3 unsolved tasks which could make a real impact on the open source community if solved. These tasks provide a testing ground for optimizations both on the model and harness level on our continuous leaderboard for each task.

5 11 38 5K 12

terminalbench @terminalbench

2 weeks ago

Terminal-Bench Challenges is inspired by previous projects exploring long-running agents including Carlini's C compiler and Cursor's browser. Join the effort! If you have ideas for further challenges, come hang out in the tb-challenges discord channel discord.com/invite/2Pe5uWG…

0 0 2 209 0

terminalbench @terminalbench

2 weeks ago

Check out the full release blog for more details! tbench.ai/news/terminal-…

1 0 4 251 0

terminalbench @terminalbench

2 weeks ago

Introducing Terminal-Bench Challenges! A new capability has emerged at the frontier: agents completing large-scale projects autonomously. To test this capability, we felt another flavor of benchmark was needed. Terminal-Bench Challenges are long-horizon, token-intensive, single-task benchmarks. Today we are releasing our first 3 challenges.

3 12 44 5K 16

terminalbench @terminalbench

a month ago

Contribute to Terminal-Bench Science!

Steven Dillmann ✈️ ICML 2026 @StevenDillmann

a month ago

📣 Announcing Terminal-Bench Science: benchmarking AI agents on real scientific workflows – now open for task contributions👇 tbench.ai/news/tb-scienc… @AnthropicAI, @OpenAI, and @GoogleDeepMind use Terminal-Bench to evaluate AI on coding tasks. We're now extending it to

16 109 496 908K 266

1 0 6 442 2

terminalbench @terminalbench

2 months ago

Thank you to @ekellbuch for leading TB2.1, @Zai_org for Terminal-Bench 2.0 Verified, which informed 11 of the 28 tasks we patched, and @SnorkelAI and @togethercompute for support

0 2 17 904 0

terminalbench @terminalbench

2 months ago

tbench.ai/leaderboard/te…

2 0 6 672 0

terminalbench @terminalbench

2 months ago

We're releasing Terminal-Bench 2.1 to patch 28 of the 89 tasks in Terminal-Bench 2.0
TB2.1 includes
• recalibrated limits
• fixed solutions
• realigned verifiers
Per-task breakdowns in 🧵
We'll continue to support TB2 and TB2.1 leaderboards (new submission process 🔜)

2 12 53 15K 10

Alex Shaw @alexgshaw

2 months ago

The Terminal-Bench community discovered multiple instances of cheating and reward hacking on the Terminal-Bench 2.0 leaderboard. We're adding some new policies to keep it reliable:
• ATIF trajectories required for all passing trials
• Reward hacking results in reward 0 for the trial
• Cheating results in immediate leaderboard removal
Thanks to @davisbrownr, @adamlsteinl, and @NoCommas for flagging the recent occurrences! Detailed blog post in comments ⬇️

4 11 121 12K 29

Alex Shaw @alexgshaw

4 months ago

We independently verified these claims and removed OpenBlocks from the Terminal-Bench 2.0 leaderboard. Thank you @NoCommas for helping us keep leaderboard entries honest! Recent leaderboard submissions are in huggingface.co/datasets/harbo… which makes it easy for the community to work together to detect cheating.