Karl Pertsch @KarlPertsch

Robot Foundation Models @physical_int kpertsch.github.io Joined July 2015

Tweets: 458 · Followers: 4K · Following: 279 · Likes: 657

Karl Pertsch @KarlPertsch

a week ago

Over the last few weeks we have seen some evidence of benchmark manipulation on RoboArena, in part flagged by members of the robotics community. We take this very seriously and have taken steps to protect the integrity of the benchmark and ensure fair, unbiased evaluations. Most importantly, going forward only organizations without an active policy on the leaderboard will be permitted to submit evaluations. We have applied these changes retroactively across the period of suspicious benchmark activity, to make sure all evals on the leaderboard can be trusted. A big thank you to all the RoboArena evaluators who volunteer their time to provide unbiased robot evaluations, and to the community members who helped audit the evaluation results and flagged potential wrongdoing.

Pranav Atreya @pranav_atreya

2 weeks ago

We have observed evidence of benchmark hacking on RoboArena since April. We have taken steps to prevent this in the future, and we have rolled back evals in accordance with these steps to retain the integrity of the benchmark. Read more about our changes here: robo-arena.github.io.

Karl Pertsch @KarlPertsch

2 months ago

These choices are actually a bit orthogonal. Long vs. short episodes mostly depends on your teleop success rate on long episodes: if you frequently make mistakes in teleop, it's better to chunk a long task into shorter episodes. Re one task for everything vs. subtask conditioning: the latter can potentially give you more generalization (see e.g. pi07 results), but it requires training a high-level policy that picks subtasks at inference time, which can introduce one more source of errors. So I would suggest starting with simple one-task-for-all as a baseline, and only afterwards trying the HL/LL design!

Yajat Yadav @YajatYadav314

2 months ago

Excited to be in Rio this week to present RETAIN (w/ @zhiyuan_zhou_ , @ajwagenmaker, @KarlPertsch, and @svlevine) at #ICLR2026! 🇧🇷 Saturday 10:30 AM – 1:00 PM at P3-#1208. Project Page: retain.yajatyadav.com x.com/zhiyuan_zhou_/….

Paul Zhou @zhiyuan_zhou_

6 months ago

Do you ever find finetuning VLA overfits to the target task, to the point where generalist ability is lost and even minor deviations beyond the SFT data break the policy? We found an extremely simple solution: directly merge the base and finetuned policy in weight space 🤯 👇🧵
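As a rough illustration of what merging two policies "in weight space" can look like, here is a minimal sketch that assumes the merge is a plain linear interpolation between matching parameters. The tweet does not spell out the exact rule, so the `merge_weights` helper, the `alpha` coefficient, and the dict-of-lists checkpoint format are all hypothetical stand-ins, not the RETAIN implementation.

```python
from typing import Dict, List

# Hypothetical stand-in for a model checkpoint: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def merge_weights(base: Weights, finetuned: Weights, alpha: float = 0.5) -> Weights:
    """Linearly interpolate two checkpoints with identical parameter names and sizes.

    alpha = 0.0 returns the base weights, alpha = 1.0 the finetuned weights.
    The interpolation rule itself is an assumption made for this sketch.
    """
    if base.keys() != finetuned.keys():
        raise ValueError("checkpoints must share the same parameter names")
    merged: Weights = {}
    for name, base_values in base.items():
        tuned_values = finetuned[name]
        if len(base_values) != len(tuned_values):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Element-wise convex combination of the two parameter vectors.
        merged[name] = [
            (1.0 - alpha) * b + alpha * t
            for b, t in zip(base_values, tuned_values)
        ]
    return merged


if __name__ == "__main__":
    base = {"layer.weight": [0.0, 2.0]}
    finetuned = {"layer.weight": [1.0, 4.0]}
    print(merge_weights(base, finetuned, alpha=0.5))
```

In this sketch, a smaller `alpha` keeps the merged policy closer to the generalist base, while a larger one keeps it closer to the finetuned policy.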

Karl Pertsch @KarlPertsch

2 months ago

Happy to share some new results! π0.7 comes with memory, and algorithmic advances to pull out more performance and generalization from diverse training data! Check it out!

Physical Intelligence @physical_int

2 months ago

Our newest model, π0.7, has some interesting emergent capabilities: it can control a new robot to fold shirts for which we had no shirt folding data, figure out how to use an appliance with language-based coaching, and perform a wide range of dexterous tasks all in one model!

Physical Intelligence @physical_int

3 months ago

We developed an RL method for fine-tuning our models for precise tasks in just a few hours or even minutes. Instead of training the whole model, we add an “RL token” output to π-0.6, our latest model, which is used by a tiny actor and critic to learn quickly with RL.

Paul Zhou @zhiyuan_zhou_

4 months ago

late but: RETAIN will be presented at #ICLR2026 in Rio! The code is also out at github.com/yajatyadav/RET…, though all you really need is this one line

Karl Pertsch @KarlPertsch

4 months ago

Jup, tho off-the-shelf VLMs today are often not well suited as HL policies for more complex tasks (many papers have shown this: they struggle with fine-grained interaction understanding, failures, etc.), and robot fine-tuned models so far need to be taught to remember explicitly. Agree tho that in the future this will hopefully be bridged.

Danny Driess @DannyDriess

4 months ago

Many real-world tasks require memory to be successful. Yet, most robots don’t have any form of memory. Today, we are going to change that. We developed a system called MEM that introduces memory into VLAs on multiple scales

Physical Intelligence @physical_int

4 months ago

We’ve developed a memory system for our models that provides both short-term visual memory and long-term semantic memory. Our approach allows us to train robots to perform long and complex tasks, like cleaning up a kitchen or preparing a grilled cheese sandwich from scratch 👇

Marcel Torné @marceltornev

4 months ago

We equipped PI policies with memory! And taught our robots to do long-horizon real world tasks such as preparing the items for a recipe, cooking a grilled cheese and cleaning the kitchen!

Karl Pertsch @KarlPertsch

4 months ago

This was one of the longest-running research projects at pi — adding memory to your models stretches all parts of your infra and needs innovation on the whole stack. The project started as @HomerWalke's internship project with @DannyDriess, but had lots of help from countless people at pi to get over the finish line. Special shoutout to @marceltornev who worked tirelessly to teach our models the long-horizon behaviors you saw in the videos above! For more details, check out our blog & paper: pi.website/research/memory

Karl Pertsch @KarlPertsch

4 months ago

Finally, many prior works have reported that policies get *worse* on dexterous tasks when adding memory (because of spurious correlations, causal confusion etc). We find that by equipping pi06 with MEM and training it on our most diverse data mix, we can match pi06 performance on tasks that do not require memory (while clearly outperforming on memory tasks). This is IMO one of the biggest results here: we have a recipe for adding memory to VLAs without significant tradeoffs, both in terms of latency and performance!

Karl Pertsch @KarlPertsch

4 months ago

This one has been a long time coming: today we’re introducing MEM, an approach for giving VLAs short-term and long-term memory. Memory is such an obvious capability, but adding it isn’t easy (most VLAs today are memory-less). A short thread on challenges, solutions, and the new capabilities MEM unlocks for us.

Karl Pertsch @KarlPertsch

4 months ago

Use DROID data and PolaRiS sim evals to test your ideas on strong generalist policies! Congrats to the CoVer-VLA team!

Jacky Kwok @jackyk02

4 months ago

🧵(6) DROID Eval: CoVer-VLA achieves 14% gains in task progress and 9% in success rate on the challenging red-team PolaRiS benchmark. In the pan cleaning task, π₀.₅ shows incorrect intent, grasping the pan handle. In contrast, CoVer-VLA correctly uses the sponge to scrub the pan.

Karl Pertsch @KarlPertsch

4 months ago

Very exciting to see first steps of our models doing useful things in the world! Thanks to Ultra and Weave for being great partners in these deployments!

Physical Intelligence @physical_int

4 months ago

General-purpose AI models are behind some of the most exciting applications we now can't live without. We envision that an analogous “physical intelligence layer” built with models like π0.6 will similarly spur a new wave of applications for the physical world. We’ve recently