Evaluating a robot foundation model is one of the most demanding closed-loop problems in robotics. Before you can trust a policy to move a real robot, you have to test it thousands of times, each across thousands of starting conditions, pairing GPU-heavy inference with GPU-heavy physics simulation step by step. At that scale, evaluation quickly becomes a complex infrastructure problem.
In this session, Ian Jordan will walk through the problem. Register: na2.hubs.ly/H06cfBN0
A great example of the importance of disaggregation in RL. From the paper:
⚪️ LLM generation alternates between prefill and decode
🔵 Prefill is compute bound
🔵 H800s are compute optimized
🔵 Doing prefill on H800s cuts rollout time by 47%
🔴 Decode is bandwidth bound
🔴 H20s are bandwidth optimized
🔴 Doing decode on H20s cuts rollout time by 21-51%.
On top of all that, the prefill to decode ratio depends on task characteristics (e.g., many-turn tasks that require lots of context compaction are prefill heavy).
And prefill / decode is just for inference. There are many other components with different hardware requirements.
Quoting the paper:
⚪️ Environments are stateful, CPU-bound processes whose latency is heavy-tailed due to host contention, large variance in interaction turns, and environment failures.
⚪️ Reward workers are stateless and exhibit persistently low utilization—dropping to as little as 7.4% on dedicated GPUs— yet require elastic scaling when trajectories complete.
⚪️ Training demands high-end GPUs with fast interconnects.
No single hardware type satisfies all stages.
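A rough way to see why prefill and decode want different hardware is a roofline-style check. The sketch below is a simplified model, not taken from the paper: it counts only dense weight matmuls (2 FLOPs per weight per token, weights read once per forward pass) and ignores attention and KV-cache traffic. The hardware numbers are hypothetical placeholders, not H800 or H20 specifications.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte of weights read in one forward pass (simplified).

    Each weight does one multiply-add (2 FLOPs) per token in the pass,
    and is read from memory once per pass regardless of token count.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def bound(tokens_per_pass: int, peak_flops: float, mem_bandwidth: float,
          bytes_per_param: float = 2.0) -> str:
    """Classify a pass as compute- or bandwidth-bound against the roofline ridge."""
    ridge = peak_flops / mem_bandwidth  # FLOPs per byte where the two limits meet
    intensity = arithmetic_intensity(tokens_per_pass, bytes_per_param)
    return "compute" if intensity >= ridge else "bandwidth"


# Hypothetical accelerator: 1e15 FLOP/s peak, 3e12 bytes/s memory bandwidth.
PEAK_FLOPS = 1e15
MEM_BANDWIDTH = 3e12

# Prefill pushes a whole prompt through in one pass; decode pushes one
# new token per sequence, so the pass size equals the batch size.
print("prefill, 4096-token prompt:", bound(4096, PEAK_FLOPS, MEM_BANDWIDTH))
print("decode, batch of 32:", bound(32, PEAK_FLOPS, MEM_BANDWIDTH))
```

Under these assumptions, prefill lands on the compute side of the ridge and decode on the bandwidth side, which is the intuition behind sending each phase to a different GPU type.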
Today we are excited to announce, in partnership with the GKE team at Google Cloud (@googlecloud), a major milestone in Ray Serve LLM’s production serving capability. Ray Serve LLM now matches high-performance, Rust-based routing frameworks such as vllm-router (@vllm_project) in
Ray (@raydistributed) and @vllm_project are helping developers understand when prefill-decode disaggregation can improve throughput and reduce compute costs, and when aggregated serving remains the better choice.
Read more: anyscale.com/blog/ray-vllm-…
The @PyTorch Foundation is lucky to host these amazing open source projects @raydistributed and @vllm_project
Seeing the community come together to make something even more powerful is amazing!
Great work! Amazing to see Ray Serve LLM and @vllm_project are ever closer together! When done right, @raydistributed is ever flexible, extensible, and highly performant.
Built something with Ray worth sharing? Ray Summit 2026 puts it in front of the people working on the same problems: maintainers, infra leads, applied researchers.
Foundation models, physical AI, RL, distributed inference, platform engineering, continuous learning systems.
⏰ Two days left, CFP closes June 20: na2.hubs.ly/H069Zbp0
Ray Serve LLM now offers 4.4x higher request throughput on prefill-heavy workloads, and 24.8x higher request throughput on decode-heavy workloads!
🚀Three major optimizations:
- Direct streaming, bypassing an intermediate Ray Serve deployment on the response path with a new,
This is a major architectural change to Ray Serve LLM that significantly improves performance on the streaming path. Ray Serve is known for its ergonomics and for how easy it is to set up and scale compared to pure k8s. Now it is also as performant as the SOTA.
Super excited about the launch of these new performance optimizations built on Ray and vLLM. This is a major milestone for the next-generation open-source AI infrastructure stack.
Ray (@raydistributed) Serve LLM and @vllm_project enable high-performance distributed inference at scale. Awesome to see Foundation-hosted projects working together to advance the open source AI stack.
Learn more: anyscale.com/blog/high-perf…
Ray Serve LLM now offers 4.4x higher request throughput on prefill-heavy workloads, and 24.8x higher request throughput on decode-heavy workloads!
🚀Three major optimizations:
- Direct streaming, bypassing an intermediate Ray Serve deployment on the response path with a new, control plane-only endpoint picker
- A new, Ray V2 executor backend in vLLM, enabling optimizations such as async scheduling
- HAProxy ingress, for ingress request routing at the speed of C
All available in Ray 2.56. This is awesome work with @googlecloud and @vllm_project!
Huge milestone from the @anyscalecompute + @googlecloud GKE teams 🎊
Ray Serve LLM provides up to 4.4x higher throughput on prefill-heavy workloads and 24x on decode-heavy workloads than previous versions.
Three optimizations made this possible on the Ray Serve LLM + vLLM stack:
⭐️Direct streaming with a control-plane-only endpoint picker
⭐️ A new vLLM Ray V2 executor backend
⭐️HAProxy ingress for routing at the speed of C
Ray's primitives for fault tolerance, observability, and portability across K8s and VMs are a great foundation as inference deployments get more complex.
Congrats to the team! Try the new Ray V2 executor today in vLLM with --distributed-executor-backend ray.
Today we are excited to announce, in partnership with the GKE team at Google Cloud (@googlecloud), a major milestone in Ray Serve LLM’s production serving capability. Ray Serve LLM now matches high-performance, Rust-based routing frameworks such as vllm-router (@vllm_project) in benchmarks across a variety of workloads and deployment patterns.
In Ray 2.56, we see up to 4x higher request throughput on prefill-heavy workloads, and 24x higher request throughput on decode-heavy workloads 🎉
Ray Serve LLM hits a major milestone: up to 4.4x higher throughput on prefill-heavy & 24.8x on decode-heavy workloads vs. baseline, now matching Rust-based vllm-router while keeping Ray's fault tolerance & portability.
How we did it in partnership with @Google: na2.hubs.ly/H069hh-0
Our Anyscale on Azure webinar is now available on demand.
Daniel Arrizza (Anyscale) and Paul Yu (@Microsoft) on running production AI inside your own Azure tenant, plus a live build-train-serve demo.
Watch now 👉 na2.hubs.ly/H0699F80
Some intuition about PD disaggregation from the blog
- PD doesn't speed up prefill and can actually hurt TTFT
- PD's real benefit is flat, stable TPOT under load
- TPOT savings compound over output sequence length
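To make the three points above concrete, here is a minimal latency sketch. It assumes the common approximation that request latency is TTFT plus one TPOT per additional output token. All timing numbers are hypothetical and chosen only to illustrate the shape of the trade-off; they are not results from the blog.

```python
# Hypothetical per-request timings under load, in seconds.
AGG = {"ttft_s": 0.30, "tpot_s": 0.050}  # aggregated serving
PD = {"ttft_s": 0.40, "tpot_s": 0.030}   # prefill-decode disaggregated


def request_latency(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """End-to-end latency: first token, then one TPOT per remaining token."""
    return ttft_s + max(output_tokens - 1, 0) * tpot_s


def pd_savings(output_tokens: int) -> float:
    """Seconds saved by PD versus aggregated; negative means PD is slower."""
    return (request_latency(**AGG, output_tokens=output_tokens)
            - request_latency(**PD, output_tokens=output_tokens))


if __name__ == "__main__":
    for n in (1, 6, 100, 1000):
        print(f"{n:>5} output tokens: PD saves {pd_savings(n):+.2f}s")
```

With these made-up numbers, PD loses on very short outputs because of its higher TTFT, roughly breaks even after a handful of tokens, and pulls ahead linearly as the output gets longer.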
The optimal P:D ratio depends in particular on input lengths, output lengths, and cache hits. Meaningful optimizations are possible, but tuning can be sensitive.
Benchmarks performed with @raydistributed + @vllm_project on AMD MI325X.
anyscale.com/blog/ray-vllm-…
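As a back-of-the-envelope illustration of why the P:D ratio moves with input length, output length, and cache hits, the sketch below splits a fixed replica pool in proportion to the work each phase does per request. This is a steady-state simplification, not the method from the blog; the function name `pd_replica_split` and the per-replica throughput figures are hypothetical, and real tuning would rely on measured benchmarks.

```python
def pd_replica_split(input_len: int, output_len: int, cache_hit_rate: float,
                     prefill_tok_per_s: float, decode_tok_per_s: float,
                     total_replicas: int) -> tuple[int, int]:
    """Return (prefill_replicas, decode_replicas) balancing per-request work.

    Assumes cached prompt tokens cost nothing to prefill and that each
    replica sustains a fixed token throughput for its phase.
    """
    if total_replicas < 2:
        raise ValueError("need at least one prefill and one decode replica")
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be in [0, 1]")

    # Replica-seconds each phase spends on one request.
    prefill_time = input_len * (1.0 - cache_hit_rate) / prefill_tok_per_s
    decode_time = output_len / decode_tok_per_s
    total_time = prefill_time + decode_time
    if total_time <= 0:
        raise ValueError("request does no work")

    share = prefill_time / total_time
    # Keep at least one replica on each side.
    n_prefill = min(max(round(share * total_replicas), 1), total_replicas - 1)
    return n_prefill, total_replicas - n_prefill


# Hypothetical workload: 8000-token prompts, 200-token outputs, 8 replicas.
print(pd_replica_split(8000, 200, 0.5, 20_000, 1_000, 8))
# Same workload with a much higher prefix-cache hit rate.
print(pd_replica_split(8000, 200, 0.9, 20_000, 1_000, 8))
```

In this toy model, raising the cache hit rate or lengthening outputs shifts replicas from prefill to decode, which is one reason a ratio tuned for one workload can be a poor fit for another.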