Jeff Da @_jeffda

Research Scientist @scale_ai. Research on Reinforcement Learning, Agents, Reasoning. Ex: @allen_ai jeffda.com Joined July 2017

Tweets

166
Followers

445
Following

861
Likes

913

Jeff Da @_jeffda

2 months ago

@yannis__he Let’s go 🔥

0 0 1 36 0

View Details

Jeff Da @_jeffda

3 months ago

@alexfabbri4 Congrats on the launch!

0 0 1 37 0

View Details

We are sharing an early preview of our ongoing SWE-1.6 training run. It significantly improves upon SWE-1.5 while being post-trained on the same pre-trained model - and it runs equally as fast at 950 tok/s. On SWE-Bench Pro it exceeds top open-source models. The preview model still exhibits some undesirable behaviors like overthinking and excessive self-verification, which we aim to improve. We are rolling out early access to a small subset of users in Windsurf.

64 111 1K 513K 304

View Details

Bing Liu @vbingliu

4 months ago

OpenAI is moving away from SWE-Bench Verified, citing challenges on underspecified tasks, misaligned tests, and contamination. We agree. These were exactly the motivations behind SWE-Bench Pro (arxiv.org/pdf/2509.16941). What we changed: → Underspecified tasks: structured, executable problem definitions → Contamination: strict curation + private / commercial codebases But this is just step one. Where we’re pushing frontier coding evals next: → Beyond unit tests: rubric-based evaluation (arxiv.org/pdf/2601.04171) → From static tasks to real-world agentic environments Modern coding systems are not solving isolated problems. They operate as agents over repos, tools, and long-horizon workflows. Our evals need to reflect that. SWE-Bench Pro is one step toward more realistic and reliable evaluation for coding agents. We’ll keep pushing the frontier.

2 3 39 3K 18

View Details

OpenAI Developers @OpenAIDevs

4 months ago

The standard for frontier coding evals is changing with model maturity. We now recommend reporting SWE-bench Pro and are sharing more detail on why we’re no longer reporting SWE-bench Verified as we work with the industry to establish stronger coding eval standards. SWE-bench Verified was a strong benchmark, but we’ve found evidence it is now saturated due to test-design issues and contamination from public repositories. openai.com/index/why-we-n…

95 129 1K 243K 394

View Details

Logan Kilpatrick @OfficialLoganK

4 months ago

Introducing Gemini 3.1 Pro, our new SOTA model across most reasoning, coding, and stem use cases!

553 574 7K 646K 706

View Details

MiniMax (official) @MiniMax_AI

4 months ago

Introducing M2.5, an open-source frontier model designed for real-world productivity. - SOTA performance at coding (SWE-Bench Verified 80.2%), search (BrowseComp 76.3%), agentic tool-calling (BFCL 76.8%) & office work. - Optimized for efficient execution, 37% faster at complex tasks. - At $1 per hour with 100 tps, infinite scaling of long-horizon agents now economically possible MiniMax Agent: agent.minimax.io API: platform.minimax.io CodingPlan: platform.minimax.io/subscribe/codi…

453 1K 9K 5.4M 8K

View Details

Noam Brown @polynoamial

5 months ago

GPT-5.3-Codex's much better token efficiency *AND* faster inference is the biggest story of this release. Folks at @OpenAI worked hard to improve this and it will only get better from here.

Sam Altman @sama

5 months ago

GPT-5.3-Codex is here! *Best coding performance (57% SWE-Bench Pro, 76% TerminalBench 2.0, 64% OSWorld). *Mid-task steerability and live updates during tasks. *Faster! Less than half the tokens of 5.2-Codex for same tasks, and >25% faster per token! *Good computer use.

2K 2K 19K 2.4M 2K

36 86 1K 156K 94

View Details

Sam Altman @sama

5 months ago

2K 2K 19K 2.4M 2K

View Details

Wenting Zhao @wzhao_nlp

5 months ago

This release is an emtional one for me because I had stayed up so much for it 🥹 It has been truly amazing to see this model becomes better bit by bit through every change we make, and we have come a long way. Since I did mid-training for this model, I wanted to share a little anecdote about this part. We really made this model with user experience as first-class consideration. We want people to actually use it, period. We took it so serious that we redid midtraining because we saw cases where models failed to follow instructions on out-of-distribution scaffolds. We decided straight-up that we would fix this in a fundamental way instead of surface-level patching. The resulting base model, which we also release, is thus a healthy base. We find that, compared to other base models, this one better learns new tasks. Try fine-tuning our base and lmk what you think 🥳 huggingface.co/Qwen/Qwen3-Cod…

Qwen @Alibaba_Qwen

5 months ago

🚀 Introducing Qwen3-Coder-Next, an open-weight LM built for coding agents & local development. What’s new: 🤖 Scaling agentic training: 800K verifiable tasks + executable envs 📈 Efficiency–Performance Tradeoff: achieves strong results on SWE-Bench Pro with 80B total params and

211 788 6K 1.5M 2K

56 83 1K 109K 310

View Details

Jeff Da @_jeffda

5 months ago

A strong and fast open-source coding model, and a tech report 😍

Qwen @Alibaba_Qwen

5 months ago

211 788 6K 1.5M 2K

0 0 3 132 0

View Details

MiniMax (official) @MiniMax_AI

5 months ago

#1 open source on SWE-Bench Pro. Ahead of Gemini 3 Flash. Level with Haiku 4.5. Thanks @scale_AI for the solid benchmark. Let's keep pushing forward 💪

Scale AI @scale_AI

5 months ago

JUST ADDED: @MiniMax_AI 2.1 just joined our SWE-Bench Pro leaderboard. Check out the updated rankings: scale.com/leaderboard/sw…

2 3 33 24K 4

8 11 231 21K 25

View Details

Jeff Da @_jeffda

6 months ago

Rubrics are effective verifiers for SWE-Agents!

Mohit Raghavendra @mohit_r9a

6 months ago

1 10 31 4K 6

0 0 1 176 0

View Details

Mohit Raghavendra @mohit_r9a

6 months ago

🚀New @scale_AI research: Verifiers for SWE Agents have traditionally used unit tests or simple, execution-free classifiers. But can we get verifiers that are more expressive, repository-grounded, and still execution-free at scoring time? We explore Agentic Rubrics to fill this gap 💡 Agentic Rubrics are repo-grounded, execution-free verifiers for SWE agents. We generate a checklist of concrete, codebase-specific criteria using an Agentic Harness, and then score patches against it. 🧑‍💻

1 10 31 4K 6

View Details

Yuxiang Wei @YuxiangWei9

6 months ago

Results: - self-improvement on SWE-bench Verified (+10.4) and Pro (+7.8) - better than the baseline RL using human issue data over the course of training

2 3 54 7K 8

View Details

Scale AI @scale_AI

6 months ago

New Scale research: Do AI models actually reason in ways humans can trust for real-world decisions? Introducing MoReBench, the first benchmark for procedural moral reasoning in LLMs, measuring not just what models decide, but how they reason through moral ambiguity.

7 14 48 12K 12

View Details

Jeff Da @_jeffda

6 months ago

@scale_AI Check out the paper and dataset: Paper: static.scale.com/uploads/674f4c… Github: github.com/scaleapi/mcp-a… Dataset: huggingface.co/datasets/Scale… Leaderboard: scale.com/leaderboard/mc…

0 0 0 77 0

View Details

Jeff Da @_jeffda

6 months ago

New open-source benchmark from @scale_AI: MCP-Atlas MCP-Atlas is a large-scale benchmark for evaluating tool-use competency using 36 real MCP servers and 220 tools. The benchmark was featured in recent model cards (GPT, Claude, Gemini), and now it's open-source!

Bing Liu @vbingliu

6 months ago

18 22 220 22K 154

1 0 3 246 1

View Details

Scale AI @scale_AI

6 months ago

We recently introduced MCP-Atlas, a benchmark for evaluating how well LLMs handle tool use via the Model Context Protocol. Even top models failed nearly half of realistic multi-tool tasks. Today, we’re open-sourcing the benchmark so you can measure performance yourself.

1 6 33 5K 10

View Details

Bing Liu @vbingliu

6 months ago

🚀 Today we’re open-sourcing MCP Atlas — a large-scale, real-server benchmark for agentic tool use, which has been used in the recent GPT-5.2, Claude Opus 4.5, and Gemini 3 Flash model releases! 🧠 Key insight: realistic agentic tool use is not a function-calling problem. It requires tool discovery, orchestration, and recovery in real environments. 🔧 MCP Atlas evaluates agents on real MCP servers (36 servers, 220 tools, 1K human-written tasks). Models must find the right tools, call them correctly, chain them together, and handle failures. 📉 What we found: • Agents fail more often at tool interaction than at reasoning • Performance drops sharply with real-world tool friction • Scaling models helps unevenly, robustness remains hard • Claims-based eval reveals how agents fail, not just if they finish Check it out! 📄 Paper: static.scale.com/uploads/674f4c… 🌍 Environment: github.com/scaleapi/mcp-a… 📂 Dataset: huggingface.co/datasets/Scale… 📊 Leaderboard: scale.com/leaderboard/mc… #AgenticAI #ToolUse #LLMEval #Benchmarks #MCP