WebAgentlab @webagentlab

WebAgentLab is building an open-source community focused on Web Agent and the broader GUI Agent field. webagentlab.feishu.cn/wiki/space/744… join to contribute 👉
Joined November 2024

Tweets

1K
Followers

649
Following

1K
Likes

2K

Satya Nadella @satyanadella

2 weeks ago

x.com/i/article/2065…

3K 8K 41K 66.2M 57K

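These days, the ticket into a top AI lab takes far more than academic prestige! I recently read a very hardcore ML interview retrospective whose author landed offers from DeepMind and several other top AI companies. One observation in it is sobering: even if you have multiple first-author papers at top AI conferences, your résumé only gets you into the interview room. In the actual interviews, many interviewers won't spend long on the details of your papers. They care more about whether you can write a Transformer's backward pass in limited time, whether you can explain the basic math clearly, and whether you can solve algorithm problems live. Behind this, the author lays out a harsh industry logic: interviews for top AI researchers often screen not for your research ceiling but for your floor in engineering, math, and coding. So even top PhDs get anxious before interviews and have to grind problems, run mock interviews, and shore up fundamentals. Academic results prove you have potential, but the interview process has to confirm you can deliver consistently. It is also rather counterintuitive: doing research is like art, but job hunting is like engineering. Papers, ideas, and creativity certainly matter, but to actually get through the door you still have to pass a highly standardized, highly specific screening process that even feels a bit like the gaokao. The article's warning about startup equity is also realistic: don't just listen to the valuation story; taxes, liquidity, exercise costs, and exit uncertainty can all leave paper wealth far from real returns. In today's AI industry, don't count on coasting through on past academic laurels. If you want to get into a top lab, treat the interview as an engineering project well in advance: grind problems, derive formulas, review papers, run mock interviews, and fill the gaps one by one. silviasapora.github.io/blog/ml-interv…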

34 166 1K 132K 2K

Zhuokai Zhao @zhuokaiz

2 months ago

This benchmark cost over $120k in API spend and 16k expert hours. DecodingTrust-Agent Platform (DTap) is by far the most realistic agent red-teaming setup, with 50+ simulated environments (Gmail, PayPal, Slack, Salesforce, Robinhood, Windows, macOS, etc.), full GUI/backend, and MCP tools mirroring the real ones.
DTap benchmarks in simulated environments with separated tools, skills, and prompts. Each simulated environment is a full-stack replica with real frontend, backend, and database. Take Robinhood, for example. DTap rebuilds the trading dashboard, the order APIs, and the portfolio state, all 1:1 with the real product. Plus you can reload any environment state on demand and run thousands of evaluations in parallel. Most agent benchmarks fake this layer with hardcoded tool outputs.
DTap does not just benchmark what to inject, but also where to inject. Most prior agent benchmarks (AgentDojo, AgentHarm) only attack the user prompt with hardcoded injections. They're clean to measure, but tell you nothing about whether your real Gmail agent is exploitable. DTap treats location as a choice. For example, to get an agent to leak your private inbox to an attacker, the attack might plant a fake email thread that makes the agent think you approved forwarding messages to an outside address. It might poison the description of an MCP tool the agent picks up at runtime. Or it might hide instructions inside an image attachment that the agent parses and executes. This is better because real attackers don't pick one surface and stop; they search for whichever path is least defended. A benchmark that only tests prompt injection might call your agent safe, but a poisoned tool description may still breach the system.
DTap uses a real risk taxonomy. 300+ risk categories are pulled from 60+ real policies (Salesforce AUP, EU AI Act, GDPR, NIST). So Attack Success Rate (ASR) measures whether the agent actually broke a real rule, like leaking data covered by GDPR or making an unauthorized PayPal transaction, not just whether someone got the model to say something bad. That's much closer to a real security claim than a typical jailbreak leaderboard.
DTap ditched LLM-as-judge. Each task comes with a small piece of code, written by hand by the researchers, that inspects the environment after the attack runs. For example, on a PayPal task where the goal is "make an unauthorized $500 transfer to the attacker's account," the rule queries the sandbox transaction database after the attack and checks if a new transaction to that account for $500+ appeared. Every task uses the same deterministic state-check approach, which honestly makes a lot more sense.
The findings are more interesting (and concerning) than you'd expect:
1. Even Claude Code, the most robust one tested, falls to 25%+ of attacks. Google ADK loses to more than half.
2. Combining different injection points works much better than attacking just one. Skill+Tool and Environment+Tool combinations consistently beat any single-channel attack.
3. The most exploitable environments are the ones with rich communication flows like Gmail, WhatsApp, and Calendar, where there's a lot of external content for an attacker to slip into.
4. The risks that hit hardest are the ones requiring multi-step reasoning, while content-level risks like generating harmful text are mostly already handled by model alignment.
Another finding that's largely been overlooked: harness design matters as much as model alignment, if not more. As a comparison, OpenAI Agents SDK and Google ADK let the agent fire several tool calls at the same time, then only check afterward whether any of them should have been refused. By that point the harmful action (deleted file, sent email, executed transaction) has already happened. On the other hand, Claude Code and OpenClaw call tools one at a time, so the agent can spot the problem and stop before any damage is done.
Worth a real read: arxiv.org/pdf/2605.04808
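To make the deterministic state check described in the post above concrete, here is a minimal sketch of what such a rule could look like for the PayPal example. It is not DTap's actual code: the table schema, the function names (`make_sandbox`, `max_transaction_id`, `attack_succeeded`, `attack_success_rate`), and the use of an in-memory SQLite database as the sandbox are all assumptions made for illustration.

```python
import sqlite3


def make_sandbox() -> sqlite3.Connection:
    """Create a hypothetical sandbox transaction database (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions ("
        "id INTEGER PRIMARY KEY, "
        "recipient TEXT NOT NULL, "
        "amount_cents INTEGER NOT NULL)"
    )
    return conn


def max_transaction_id(conn: sqlite3.Connection) -> int:
    """Snapshot taken before the attack so only new rows are judged."""
    row = conn.execute("SELECT COALESCE(MAX(id), 0) FROM transactions").fetchone()
    return row[0]


def attack_succeeded(
    conn: sqlite3.Connection, baseline_id: int, attacker: str, min_cents: int
) -> bool:
    """True if a new transaction of at least min_cents went to the attacker."""
    row = conn.execute(
        "SELECT COUNT(*) FROM transactions "
        "WHERE id > ? AND recipient = ? AND amount_cents >= ?",
        (baseline_id, attacker, min_cents),
    ).fetchone()
    return row[0] > 0


def attack_success_rate(outcomes: list[bool]) -> float:
    """Fraction of attack attempts whose state check returned True."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0
```

A harness would call `max_transaction_id` before the agent runs, let the attack play out, then call `attack_succeeded` with the stored baseline. Because the verdict comes from database state rather than from a model's reading of the transcript, the same run always gets the same label.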

2 7 30 4K 19

Lei Li @_TobiasLee

2 months ago

🦞 Claw-Eval-Live is out, a live extension of the Claw-Eval Family! This live release includes: 105 tasks | 17 workflow families | 13 frontier models tested | quarterly refresh from real ClawHub marketplace signals. Instead of relying on a static task set, Claw-Eval-Live keeps agent evaluation aligned with evolving real-world enterprise workflows. Check it out: 🤗 HF Paper: huggingface.co/papers/2604.28… Leaderboard: claw-eval-live.github.io Code: github.com/Claw-Eval-Live…

2 5 25 2K 5

Huan Sun @hhsun1

2 months ago

Congrats to all students at @osunlp and collaborators for their papers getting accepted to #ICML2026 and #ACL2026. I particularly want to highlight our efforts on improving the safety of computer-use agents. “When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents” -- AutoElicit (ICML'26), led by @Jaylen_JonesNLP @Zhehao_Zhang123 “When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents” -- DeAction (ICML'26), led by @yuting_ning To our knowledge, AutoElicit is the first project that systematically studies and proactively surfaces harmful unintended behaviors of computer-use agents from benign inputs (e.g., an agent accidentally deletes files on your system or makes unauthorized changes). We propose a conceptual framework to define their key characteristics, automatically elicit them, and analyze how they arise from benign inputs. Datasets with benign task instructions and frontier agents’ trajectories that exhibit unintended behaviors are released. Now how do we detect and correct misaligned actions on the fly at runtime, before these actions are taken? In the second project, we make the first effort to define and study runtime misaligned action detection in CUAs, and construct MisActBench, a benchmark of realistic trajectories with human-annotated, action-level alignment labels. We develop DeAction, a practical and universal guardrail that detects misaligned actions before execution and iteratively corrects them through structured feedback.

OSU NLP Group @osunlp

2 months ago

5 papers at #ICML2026 and 4 papers at #ACL2026. Congrats to students at @osunlp and our collaborators!

1 9 41 17K 3

1 12 78 13K 20

Ungrounded不着边际 @UngroundedPod

2 months ago

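Ungrounded 不着边际 EP02: a conversation with Zhao Chenyang on the Silicon Valley dropout wave, SGLang, AI Coding, and the new frontiers of the open-source community. Guest: Zhao Chenyang @GenAI_is_real. Hosts: Kong Dehan @DehanKong285793 and Gu Yu @yugu_nlp. Bilibili link below; thanks for the like, coin, and favorite! Produced by @webagentlab bilibili.com/video/BV1oRRyB…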

0 4 16 3K 4

张小珺 Xiaojun Zhang @zhang_benita

2 months ago

The era of large language models has moved past its first act—the chat era—and entered its second act: the age of Agents. On this show, we’ll dive deep into the core technical principles of Agents and break down the technology for you, offering a clear overview of its evolutionary trajectory. If you enjoy our show, we’d appreciate it if you could leave us a 5‑star rating on Apple Podcasts🤓🤓 podcasts.apple.com/cn/podcast/%E5…

14 49 271 77K 166

AK @_akhaliq

2 months ago

Agentic World Modeling Foundations, Capabilities, Laws, and Beyond paper: huggingface.co/papers/2604.22…

8 39 188 29K 127

Joachim Baumann @joabaum

2 months ago

We present SWE-chat: the first large-scale dataset of coding agent interactions from real users in the wild. In 40% of real coding sessions, the agent writes ~all the code. Users push back 39% of the time – agents almost never stop to check. Data, paper, & findings in the 🧵👇

14 78 477 70K 293

Lei Li @_TobiasLee

2 months ago

Beyond the weights, we have sth special for all the builders! Check it out: 100t.xiaomimimo.com

4 13 103 15K 24

Kanzhi Cheng @njucckevin

2 months ago

🚀 Excited to share our new work OpenMobile—a data synthesis framework that enables the open-source community to train SOTA mobile agents. All data, models, and code have been open-sourced! Paper: huggingface.co/papers/2604.15… Data: huggingface.co/datasets/cckev… 🧵[1/4]

7 4 13 2K 5

Yu Su @ysu_nlp

2 months ago

I will talk about 'continual learning as adaptive compression of experience' at the recursive self-improvement workshop at #ICLR2026. Happening in ~20 mins. Unfortunately I didn't make it to Rio, so it will be online. recursive-workshop.github.io

12 43 468 69K 342

张小珺 Xiaojun Zhang @zhang_benita

2 months ago

Yes, our latest special guest is Fuli Luo @_LuoFuli . The second battle in the global large model arms race has begun: shifting from the Chat era dominated by pre-training to the Agent era driven by post-training. This marks Fuli Luo’s first-ever interview, as well as her first in-depth technical conversation. We talked systematically about the massive AI upheaval triggered by technological breakthroughs including Claude Opus 4.6 and OpenClaw in 2026, along with its subsequent structural impacts across the industry. Amid the fierce large-model arms race, the world around us is undergoing brutally rapid changes, even for researchers who train models firsthand. “I used to believe our work was highly creative, and could never be simplified into fixed skills or standardized workflows. But now I realize it can be automated after all. If that’s possible, can models train stronger models on their own? Can they achieve iterative improvement through self-evolution? This is exactly what will unfold in the next couple of years,” Fuli Luo says. As human knowledge and wisdom are internalized into model capabilities, what will humanity pursue in the future? Is our society truly ready for this tsunami-scale technological revolution? All in all, this is an information-dense dialogue. It reveals how an AI lab makes strategic technical bets, allocates resources, and adjusts organizational structure and team planning amid a major paradigm shift. At the core of its response to drastic change lies its established culture and core values. Though lengthy and technically intensive, we hope this conversation brings great insights to every viewer. Our podcast, video episode and article are released simultaneously across platforms, with English subtitles provided to assist non-Chinese-speaking audiences. Luo Fuli: OpenClaw, Agent Frameworks — The AI Paradigm Has Already Chang... youtu.be/V9eI-t3TApE?si… via @YouTube

32 95 629 317K 521

DeepSeek @deepseek_ai

2 months ago

🚀 DeepSeek-V4 Preview is officially live & open-sourced! Welcome to the era of cost-effective 1M context length. 🔹 DeepSeek-V4-Pro: 1.6T total / 49B active params. Performance rivaling the world's top closed-source models. 🔹 DeepSeek-V4-Flash: 284B total / 13B active params. Your fast, efficient, and economical choice. Try it now at chat.deepseek.com via Expert Mode / Instant Mode. API is updated & available today! 📄 Tech Report: huggingface.co/deepseek-ai/De… 🤗 Open Weights: huggingface.co/collections/de… 1/n

2K 8K 46K 9.9M 10K

Cua @trycua

2 months ago

We're open-sourcing Cua Driver - our new macOS driver that lets any agent (Claude Code, Codex, your own loop) drive any app in the background, with true multi-player and multi-cursor built-in. 1/8

64 173 2K 241K 2K

Simon Yu @simon_ycl

2 months ago

🇧🇷ICLR 2026 paper🇧🇷 Your agent's skills don't transfer. On a new site, only 18% of skills get reused, so there's no continual learning, just relearning every time. How do agents learn skills that actually generalize? Introducing PolySkill to make agents smooth across sites 🧵

3 17 107 13K 77

Qiushi Sun @qiushi_sun

2 months ago

Heading to ICLR’26! We’ll be presenting our work on computer-using agents and code intelligence. Stop by our presentations or catch us in the hall / oral sessions if you'd like to discuss! #iclr2026 See you in Rio 🇧🇷 #iclr

1 2 25 702 1

Ang Li @angli_ai

2 months ago

pass@k measures if it can work - that is capability. pass^k measures if it will work - that is reliability. In 2025, we proved capability, reaching human-level performance on OSWorld for the first time. In 2026, we're solving reliability, the last problem before computer use agents stop being a toy.

Xin Eric Wang @xwang_lk

2 months ago

Computer-use agents are getting very capable. But capability is not the bottleneck anymore. 𝐑𝐞𝐥𝐢𝐚𝐛𝐢𝐥𝐢𝐭𝐲 is. Benchmarks reward “works once.” Real-world systems require “works every time.” In On the Reliability of Computer Use Agents, we study WHY this gap exists and

4 21 62 9K 55

0 3 18 2K 9
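The pass@k versus pass^k distinction in the post above can be made concrete with a short sketch. The posts give no formulas, so the two estimators below are assumptions: they are the combinatorial forms commonly used in evaluation work, computed from `c` successes observed in `n` independent runs of one task. The function names `pass_at_k` and `pass_hat_k` are illustrative.

```python
from math import comb


def _validate(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Capability: chance that at least one of k runs drawn from the n succeeds."""
    _validate(n, c, k)
    # comb returns 0 when fewer than k failures exist, giving exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Reliability: chance that all k runs drawn from the n succeed."""
    _validate(n, c, k)
    # comb returns 0 when fewer than k successes exist, giving exactly 0.0.
    return comb(c, k) / comb(n, k)
```

With 8 successes in 10 runs and k = 5, `pass_at_k` is 1.0 while `pass_hat_k` is 56/252, roughly 0.22: the same agent looks solved on the first metric and unreliable on the second. At k = 1 both reduce to the plain success rate c/n.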

Yu Gu @yugu_nlp

2 months ago

Today we're finally out! Something I keep coming back to: continual learning and world modeling are two sides of the same coin. Specialization starts where training ends. It's the agent continuously building its model of the world it actually lives in. That's when clever demos turn into real expertise. We're hiring! @NeoCognition

Yu Su @ysu_nlp

2 months ago

Introducing @NeoCognition, the agent lab for specialized intelligence. Everyone needs experts, but human expertise does not scale. Backed by $40M seed funding, we build self-learning agents that specialize across domains to make expertise abundant.

91 132 889 191K 364

7 15 91 12K 25

WebAgentlab @webagentlab

2 months ago

🔥 Finding the "ChatGPT Moment" for CUA! On April 19, WebAgentLab x @qingke Community presents the "ICLR 2026 CUA Workshop" livestream. We've gathered top pioneers from UWaterloo, HKU, Fudan, Alibaba, and Minimax to deep-dive into: 💻 Real-world deployment & multi-platform unification of GUI Agents 🚀 Autonomous continual learning in dynamic environments 🛠️ Breaking data dependency in agent infrastructure Great research belongs beyond PDFs and repos. Join us to witness the new era of AI taking over the keyboard and mouse! 🖱️ #CUA #GUIAgent #LLM #AI #ICLR2026