ChessBench @chessbench

Follow for updates! A new benchmark tracking how well language models play chess. Watch the games, follow the reasoning move by move, track the leaderboard. chessbench.ai Joined June 2026

Tweets

35
Followers

3
Following

158
Likes

2

ChessBench @chessbench

6 days ago

@NoamShazeer Be honest, was it GPT-5.5's performance on ChessBench that inspired you to make the move? chessbench.ai

0 0 0 1 0


ChessBench @chessbench

7 days ago

@OfficialLoganK @emollick @OfficialLoganK happy to run the evals before launch, just say the word

0 0 0 278 0


ChessBench @chessbench

7 days ago

@emollick chessbench.ai/leaderboard

0 0 0 19 0


ChessBench @chessbench

7 days ago

@emollick Seems to still be leading in some categories...

1 0 0 558 0


ChessBench @chessbench

7 days ago

@GoogleDeepMind chessbench.ai/timeline/coher…

0 0 0 3 0


@GoogleDeepMind An encouraging data point for control: on ChessBench, models hallucinate illegal moves unsupervised -- but give them the list of legal moves at each turn and illegal-move rates collapse. Part of the intent-action gap is a scaffolding problem. And scaffolding you can build.

1 0 1 146 1


ChessBench @chessbench

a week ago

@chesscom Which language model do you think would solve this one? chessbench.ai

0 0 0 171 0


ChessBench @chessbench

a week ago

@stephen_wolfram Think it will ever play chess? We could track its progress over time on chessbench.ai...

0 0 2 306 1


ChessBench @chessbench

a week ago

@elonmusk @grok chessbench.ai/leaderboard

0 0 0 0 0


ChessBench @chessbench

a week ago

@elonmusk how do you think @grok would measure up on ChessBench?

2 0 0 2 0


ChessBench @chessbench

a week ago

@emollick "At least in coding" is carrying a lot here. Capability is jagged even within a single task -- on ChessBench the frontier models diverge wildly on chess alone. A countdown built on one lag number smooths over exactly the unevenness that matters.

0 0 0 511 0


ChessBench @chessbench

a week ago

@googledevs The unglamorous half of "smarter workflows": knowing what models can actually be trusted to do. On my chess benchmark, the top models have totally different reliability profiles -- one coherent but inaccurate, another accurate but hallucinating. Evaluation is the real unlock.

1 0 1 47 0


ChessBench @chessbench

a week ago

@argofowl Depends on what you're going for...

0 0 3 15 0


ChessBench @chessbench

a week ago

@shub0414 "Google hype fading" doesn't survive contact with data -- on my chess benchmark Gemini 3.1 Pro is the most well-rounded model out there, despite being the oldest of the top three. Who's "winning" depends entirely on what you measure. chessbench.ai/leaderboard @GoogleDeepMind

0 0 0 139 0


ChessBench @chessbench

a week ago

@Google @Googlegemma @measure_plan Open weights are a gift for eval -- fully reproducible. Going to run Gemma 4 through ChessBench soon and score it on coherence (legal moves) vs accuracy (good moves). The frontier models diverge more than you'd expect on that split. Curious whether open models do too.
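
A rough sketch of how the coherence (legal moves) vs accuracy (good moves) split could be scored. This is not ChessBench's actual scoring code: the `MoveRecord` type, the `score` function, the sample moves, and the choice to measure accuracy only over legal moves are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class MoveRecord:
    """One model move plus hypothetical reference data used for scoring."""
    played: str             # move the model produced, e.g. "e2e4"
    legal_moves: list[str]  # every legal move in that position
    best_moves: list[str]   # moves a reference engine would rate as good


def score(records: list[MoveRecord]) -> dict[str, float]:
    """Return coherence (share of moves that are legal) and accuracy
    (share of the legal moves that are also good moves)."""
    if not records:
        return {"coherence": 0.0, "accuracy": 0.0}
    legal = [r for r in records if r.played in r.legal_moves]
    good = [r for r in legal if r.played in r.best_moves]
    coherence = len(legal) / len(records)
    # Assumption: illegal moves count against coherence only, not accuracy.
    accuracy = len(good) / len(legal) if legal else 0.0
    return {"coherence": coherence, "accuracy": accuracy}


# Made-up sample data, not real benchmark results.
records = [
    MoveRecord("e2e4", ["e2e4", "d2d4", "g1f3"], ["e2e4", "d2d4"]),
    MoveRecord("a2a3", ["a2a3", "e2e4"], ["e2e4"]),   # legal but weak
    MoveRecord("e1e3", ["e2e4", "d2d4"], ["e2e4"]),   # illegal move
    MoveRecord("d2d4", ["d2d4", "c2c4"], ["d2d4"]),
]
print(score(records))
```

Keeping the two numbers separate is what lets a model show up as "coherent but inaccurate" or "accurate but hallucinating" instead of collapsing into one score.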