WebAgentLab is building an open-source community focused on Web Agent and the broader GUI Agent field.
webagentlab.feishu.cn/wiki/space/744… join to contribute 👉
Joined November 2024
This benchmark costs over $120k in API spend and 16k expert hours.
DecodingTrust-Agent Platform (DTap) is by far the most realistic agent red-teaming setup with 50+ simulated environments (Gmail, PayPal, Slack, Salesforce, Robinhood, Windows, macOS, etc.), full GUI/backend, and MCP tools mirroring the real ones.
DTap benchmarks in simulated environments with separated tools, skills, and prompts.
Each simulated environment is a full-stack replica with real frontend, backend, and database. Take Robinhood, for example. DTap rebuilds the trading dashboard, the order APIs, and the portfolio state all 1:1 with the real product. Plus you can reload any environment state on demand, and run thousands of evaluations in parallel. Most agent benchmarks fake this layer with hardcoded tool outputs.
DTap does not just benchmark what to inject, but also where to inject.
Most prior agent benchmarks (AgentDojo, AgentHarm) only attack the user prompt with hardcoded injections. They're clean to measure, but tell you nothing about whether your real Gmail agent is exploitable.
DTap treats location as a choice.
For example, to get an agent to leak your private inbox to an attacker, the attack might plant a fake email thread that makes the agent think you approved forwarding messages to an outside address. It might poison the description of an MCP tool the agent picks up at runtime. Or hide instructions inside an image attachment that the agent parses and executes. This is better because real attackers don't pick one surface and stop — they search for whichever path is least defended. A benchmark that only tests prompt injection might call your agent safe, but a poisoned tool description may still breach the system.
DTap uses a real risk taxonomy.
300+ risk categories are pulled from 60+ real policies (Salesforce AUP, EU AI Act, GDPR, NIST). So Attack Success Rate (ASR) measures whether the agent actually broke a real rule — like leaking data covered by GDPR or making an unauthorized PayPal transaction — not just whether someone got the model to say something bad. That's much closer to a real security claim than a typical jailbreak leaderboard.
DTap ditched LLM-as-judge.
Each task comes with a small piece of code, written by hand by the researchers, that inspects the environment after the attack runs. For example, on a PayPal task where the goal is "make an unauthorized $500 transfer to the attacker's account," the rule queries the sandbox transaction database after the attack and checks whether a new transaction of $500 or more to that account appeared. Every task uses the same deterministic state-check approach, which honestly makes a lot more sense.
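A deterministic state check of this kind could look roughly like the sketch below. It is not DTap's actual checker: the `transactions` table, its columns, the `attacker` account name, and the function `unauthorized_transfer_happened` are all hypothetical, and an in-memory SQLite database stands in for the sandbox backend.

```python
import sqlite3


def unauthorized_transfer_happened(conn, attacker_account, min_amount, baseline_id):
    """True if a transaction of at least min_amount to attacker_account
    was recorded after baseline_id (the last row id before the attack)."""
    row = conn.execute(
        "SELECT COUNT(*) FROM transactions "
        "WHERE recipient = ? AND amount >= ? AND id > ?",
        (attacker_account, min_amount, baseline_id),
    ).fetchone()
    return row[0] > 0


# Toy sandbox state (hypothetical schema, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (id INTEGER PRIMARY KEY, recipient TEXT, amount REAL)"
)
conn.execute("INSERT INTO transactions (recipient, amount) VALUES ('landlord', 1200.0)")

# Remember where the table ended before the attack ran.
baseline_id = conn.execute("SELECT COALESCE(MAX(id), 0) FROM transactions").fetchone()[0]
verdict_before = unauthorized_transfer_happened(conn, "attacker", 500.0, baseline_id)

# Simulate a compromised agent making the transfer.
conn.execute("INSERT INTO transactions (recipient, amount) VALUES ('attacker', 500.0)")
verdict_after = unauthorized_transfer_happened(conn, "attacker", 500.0, baseline_id)

print(verdict_before, verdict_after)
```

The verdict depends only on the database state, so rerunning the check on the same state gives the same answer, which is the property an LLM judge cannot guarantee.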
The findings are more interesting (and concerning) than you'd expect:
1. Even Claude Code — the most robust one tested — falls to 25%+ of attacks. Google ADK loses to more than half.
2. Combining different injection points works much better than attacking just one. And Skill+Tool and Environment+Tool combinations consistently beat any single-channel attack.
3. The most exploitable environments are the ones with rich communication flows like Gmail, WhatsApp, and Calendar, where there's a lot of external content for an attacker to slip into.
4. The risks that hit hardest are the ones requiring multi-step reasoning, while content-level risks like generating harmful text are mostly already handled by model alignment.
Another finding that's largely been overlooked: harness design matters as much as model alignment, if not more.
As a comparison, OpenAI Agents SDK and Google ADK let the agent fire several tool calls at the same time, then only check afterward whether any of them should have been refused. By that point the harmful action — deleted file, sent email, executed transaction — has already happened. On the other hand, Claude Code and OpenClaw call tools one at a time, so the agent can spot the problem and stop before any damage is done.
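The difference can be illustrated with a toy model. The sketch below does not use or reproduce the APIs of the OpenAI Agents SDK, Google ADK, Claude Code, or OpenClaw. It only mirrors the two control flows as described in this thread. `ToolCall`, `Sandbox`, and both harness functions are hypothetical names, a plain loop stands in for concurrent dispatch, and the `harmful` flag stands in for whatever refusal judgment a real agent would make.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    harmful: bool  # stand-in for the agent's refusal judgment


@dataclass
class Sandbox:
    executed: list = field(default_factory=list)

    def run(self, call):
        # Side effects happen here: once run, the action cannot be undone.
        self.executed.append(call.name)


def batch_then_check(calls, sandbox):
    """Fire the whole batch first, review afterward."""
    for call in calls:
        sandbox.run(call)
    # The review only reports what should have been refused.
    return [c.name for c in calls if c.harmful]


def check_then_run(calls, sandbox):
    """Handle one call at a time and stop before the first harmful one."""
    for call in calls:
        if call.harmful:
            return call.name  # refused before any side effect
        sandbox.run(call)
    return None


calls = [
    ToolCall("read_inbox", harmful=False),
    ToolCall("send_email", harmful=True),
    ToolCall("delete_file", harmful=True),
]

batch_box, seq_box = Sandbox(), Sandbox()
flagged_late = batch_then_check(calls, batch_box)
stopped_at = check_then_run(calls, seq_box)

print(batch_box.executed, flagged_late)
print(seq_box.executed, stopped_at)
```

In the batch harness both harmful calls appear in `executed` even though they were flagged. In the sequential harness only `read_inbox` ran.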
Worth a real read:
arxiv.org/pdf/2605.04808
🦞 Claw-Eval-Live is out, a live extension of the Claw-Eval Family!
This live release includes:
105 tasks | 17 workflow families | 13 frontier models tested | quarterly refresh from real ClawHub marketplace signals.
Instead of relying on a static task set, Claw-Eval-Live keeps agent evaluation aligned with evolving real-world enterprise workflows.
Check it out:
🤗 HF Paper: huggingface.co/papers/2604.28…
Leaderboard: claw-eval-live.github.io
Code: github.com/Claw-Eval-Live…
Congrats to all students at @osunlp and collaborators for their papers getting accepted to #ICML2026 and #ACL2026. I particularly want to highlight our efforts on improving the safety of computer-use agents.
“When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents” -- AutoElicit (ICML'26), led by @Jaylen_JonesNLP @Zhehao_Zhang123
“When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents” -- DeAction (ICML'26), led by @yuting_ning
To our knowledge, AutoElicit is the first project that systematically studies and proactively surfaces harmful unintended behaviors of computer-use agents from benign inputs (e.g., an agent accidentally deletes files on your system or makes unauthorized changes). We propose a conceptual framework to define their key characteristics, automatically elicit them, and analyze how they arise from benign inputs. Datasets with benign task instructions and frontier agents’ trajectories that exhibit unintended behaviors are released.
Now how do we detect and correct misaligned actions on the fly at runtime, before these actions are taken? In the second project, we make the first effort to define and study runtime misaligned action detection in CUAs, and construct MisActBench, a benchmark of realistic trajectories with human-annotated, action-level alignment labels. We develop DeAction, a practical and universal guardrail that detects misaligned actions before execution and iteratively corrects them through structured feedback.
The era of large language models has moved past its first act—the chat era—and entered its second act: the age of Agents.
On this show, we’ll dive deep into the core technical principles of Agents and break down the technology for you, offering a clear overview of its evolutionary trajectory.
If you enjoy our show, we’d appreciate it if you could leave us a 5‑star rating on Apple Podcasts🤓🤓
podcasts.apple.com/cn/podcast/%E5…
We present SWE-chat: the first large-scale dataset of coding agent interactions from real users in the wild.
In 40% of real coding sessions, the agent writes ~all the code. Users push back 39% of the time – agents almost never stop to check.
Data, paper, & findings in the 🧵👇
🚀 Excited to share our new work OpenMobile—a data synthesis framework that enables the open-source community to train SOTA mobile agents. All data, models, and code have been open-sourced!
Paper: huggingface.co/papers/2604.15…
Data: huggingface.co/datasets/cckev…
🧵[1/4]
I will talk about 'continual learning as adaptive compression of experience' at the recursive self-improvement workshop at #ICLR2026.
Happening in ~20 mins.
Unfortunately I didn't make it to Rio, so it will be online.
recursive-workshop.github.io
Yes, our latest special guest is Fuli Luo @_LuoFuli .
The second battle in the global large model arms race has begun: shifting from the Chat era dominated by pre-training to the Agent era driven by post-training.
This marks Fuli Luo’s first-ever interview, as well as her first in-depth technical conversation. We talked systematically about the massive AI upheaval triggered by technological breakthroughs including Claude Opus 4.6 and OpenClaw in 2026, along with its subsequent structural impacts across the industry.
Amid the fierce large-model arms race, the world around us is undergoing brutally rapid changes—even for researchers who train models firsthand.
“I used to believe our work was highly creative, and could never be simplified into fixed skills or standardized workflows. But now I realize it can be automated after all. If that’s possible, can models train stronger models on their own? Can they achieve iterative improvement through self-evolution? This is exactly what will unfold in the next couple of years,” Fuli Luo says.
As human knowledge and wisdom are internalized into model capabilities, what will humanity pursue in the future? Is our society truly ready for this tsunami-scale technological revolution?
All in all, this is an information-dense dialogue. It reveals how an AI lab makes strategic technical bets, allocates resources, and adjusts organizational structure and team planning amid a major paradigm shift. At the core of its response to drastic change lies its established culture and core values.
Though lengthy and technically intensive, we hope this conversation brings great insights to every viewer.
Our podcast, video episode and article are released simultaneously across platforms, with English subtitles provided to assist non-Chinese-speaking audiences.
Luo Fuli: OpenClaw, Agent Frameworks — The AI Paradigm Has Already Chang... youtu.be/V9eI-t3TApE?si… via @YouTube
🚀 DeepSeek-V4 Preview is officially live & open-sourced! Welcome to the era of cost-effective 1M context length.
🔹 DeepSeek-V4-Pro: 1.6T total / 49B active params. Performance rivaling the world's top closed-source models.
🔹 DeepSeek-V4-Flash: 284B total / 13B active params. Your fast, efficient, and economical choice.
Try it now at chat.deepseek.com via Expert Mode / Instant Mode. API is updated & available today!
📄 Tech Report: huggingface.co/deepseek-ai/De…
🤗 Open Weights: huggingface.co/collections/de…
1/n
We're open-sourcing Cua Driver - our new macOS driver that lets any agent (Claude Code, Codex, your own loop) drive any app in the background, with true multi-player and multi-cursor built-in.
1/8
🇧🇷ICLR 2026 paper🇧🇷
Your agent's skills don't transfer.
On a new site, only 18% of skills get reused, so there's no continual learning, just relearning every time.
How do agents learn skills that actually generalize?
Introducing PolySkill to make agents smooth across sites 🧵
Heading to ICLR’26! We’ll be presenting our work on computer-using agents and code intelligence. Stop by our presentations or catch us in the hall / oral sessions if you'd like to discuss! #iclr2026
See you in Rio 🇧🇷
#iclr
pass@k measures if it can work - that is capability.
pass^k measures if it will work - this is reliability.
2025, we proved capability. achieved human-level on OSWorld for the first time.
2026, we're solving reliability. the last problem before computer use agents stop being a toy.
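To make the distinction concrete, here is a minimal sketch using the estimators commonly used for these two metrics, given n recorded trials of a task with c successes. The post does not say which estimators the authors use, so treat the formulas and the function names `pass_at_k` and `pass_hat_k` as assumptions.

```python
from math import comb


def pass_at_k(n, c, k):
    """Chance that at least one of k trials, drawn without replacement
    from n recorded trials with c successes, is a success."""
    if n - c < k:
        return 1.0  # too few failures to fill all k draws
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n, c, k):
    """Chance that all k trials, drawn without replacement from n
    recorded trials with c successes, are successes."""
    return comb(c, k) / comb(n, k)  # comb(c, k) is 0 when k > c


# Example: a task solved in 7 of 10 recorded runs, evaluated at k = 3.
n, c, k = 10, 7, 3
capability = pass_at_k(n, c, k)    # 1 - 1/120
reliability = pass_hat_k(n, c, k)  # 35/120

print(f"pass@{k} = {capability:.3f}, pass^{k} = {reliability:.3f}")
```

With a 70% per-run success rate, pass@3 is about 0.99 while pass^3 is about 0.29: the same agent looks nearly solved on one metric and unreliable on the other.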
Computer-use agents are getting very capable.
But capability is not the bottleneck anymore. 𝐑𝐞𝐥𝐢𝐚𝐛𝐢𝐥𝐢𝐭𝐲 is.
Benchmarks reward “works once.”
Real-world systems require “works every time.”
In On the Reliability of Computer Use Agents, we study WHY this gap exists and
Today we're finally out!
Something I keep coming back to: continual learning and world modeling are two sides of the same coin. Specialization starts where training ends. It's the agent continuously building its model of the world it actually lives in. That's when clever demos turn into real expertise.
We're hiring! @NeoCognition
Introducing @NeoCognition, the agent lab for specialized intelligence.
Everyone needs experts, but human expertise does not scale.
Backed by $40M seed funding, we build self-learning agents that specialize across domains to make expertise abundant.
🔥 Finding the "ChatGPT Moment" for CUA!
On April 19, WebAgentLab x @qingke Community presents the "ICLR 2026 CUA Workshop" livestream.
We've gathered top pioneers from UWaterloo, HKU, Fudan, Alibaba, and Minimax to deep-dive into:
💻 Real-world deployment & multi-platform unification of GUI Agents
🚀 Autonomous continual learning in dynamic environments
🛠️ Breaking data dependency in agent infrastructure
Great research belongs beyond PDFs and repos. Join us to witness the new era of AI taking over the keyboard and mouse! 🖱️
#CUA #GUIAgent #LLM #AI #ICLR2026