ray @raydistributed

A distributed compute framework for scaling AI workloads. Created and developed by @anyscalecompute. docs.ray.io Joined August 2019

Tweets: 2K
Followers: 11K
Following: 2
Likes: 2K

Anyscale @anyscalecompute

24 hours ago

Evaluating a robot foundation model is one of the most demanding closed-loop problems in robotics. Before you can trust a policy to move a real robot, you have to test it thousands of times across thousands of starting conditions, pairing GPU-heavy inference with GPU-heavy physics simulation step by step, at a scale that quickly becomes a complex infrastructure problem. In this session, Ian Jordan will walk through this problem. Register: na2.hubs.ly/H06cfBN0

0 1 3 462 0

Robert Nishihara @robertnishihara

4 days ago

A great example of the importance of disaggregation in RL. From the paper:
⚪️ LLM generation alternates between prefill and decode
🔵 Prefill is compute bound
🔵 H800s are compute optimized
🔵 Doing prefill on H800s cuts rollout time by 47%
🔴 Decode is bandwidth bound
🔴 H20s are bandwidth optimized
🔴 Doing decode on H20s cuts rollout time by 21-51%
On top of all that, the prefill to decode ratio depends on task characteristics (e.g., many-turn tasks that require lots of context compaction are prefill heavy). And prefill / decode is just for inference. There are many other components with different hardware requirements. Quoting the paper:
⚪️ Environments are stateful, CPU-bound processes whose latency is heavy-tailed due to host contention, large variance in interaction turns, and environment failures.
⚪️ Reward workers are stateless and exhibit persistently low utilization, dropping to as little as 7.4% on dedicated GPUs, yet require elastic scaling when trajectories complete.
⚪️ Training demands high-end GPUs with fast interconnects.
No single hardware type satisfies all stages.

ray @raydistributed

4 days ago

RollArt is an impressive example of disaggregation in large-scale RL. cse.ust.hk/~weiwa/papers/…

0 4 23 10K 24

4 14 86 11K 71
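
A rough roofline-style sketch of why prefill tends to be compute bound and decode bandwidth bound, as the thread above describes. Everything numeric here is an assumption for illustration and not taken from the paper: the "about 2 FLOPs per parameter per token" rule of thumb, the simplification that weights are read from memory once per forward pass (KV-cache and activation traffic are ignored), and the hypothetical accelerator figures.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """Approximate FLOPs per byte of weight traffic for one forward pass.

    Assumes ~2 FLOPs per parameter per token and that every weight is read
    from memory once per pass, so the parameter count cancels out.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def bound(tokens_per_pass: int, peak_flops: float, mem_bandwidth: float) -> str:
    """Classify a pass as compute- or bandwidth-bound on a given accelerator."""
    ridge = peak_flops / mem_bandwidth  # FLOPs/byte where the two limits meet
    if arithmetic_intensity(tokens_per_pass) >= ridge:
        return "compute"
    return "bandwidth"


# Hypothetical accelerator (not a real H800 or H20 spec):
# 1e15 FLOP/s peak compute, 4e12 bytes/s memory bandwidth.
PEAK_FLOPS, MEM_BW = 1e15, 4e12

# Prefill processes the whole prompt in one pass; decode processes one new
# token per sequence, so a pass only covers the current batch of sequences.
prefill = bound(tokens_per_pass=4096, peak_flops=PEAK_FLOPS, mem_bandwidth=MEM_BW)
decode = bound(tokens_per_pass=8, peak_flops=PEAK_FLOPS, mem_bandwidth=MEM_BW)
print(f"prefill is {prefill} bound, decode is {decode} bound")
```

Under these assumptions the two phases land on opposite sides of the ridge point, which is the intuition behind placing each phase on hardware optimized for its own bottleneck.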

Xinyu Zhang @xinyzng

6 days ago

@raydistributed ❤️ @googlecloud

Seiji Eicher @seiji_________

6 days ago

1 7 51 22K 21

0 3 7 584 1

PyTorch @PyTorch

6 days ago

Ray (@raydistributed) and @vllm_project are helping developers understand when prefill-decode disaggregation can improve throughput and reduce compute costs, and when aggregated serving remains the better choice. Read more: anyscale.com/blog/ray-vllm-…

Anyscale @anyscalecompute

a week ago

Save 67% with prefill-decode disaggregation using Ray + vLLM on AMD GPUs. anyscale.com/blog/ray-vllm-…

1 5 15 19K 5

2 7 47 13K 7

Mark Collier 柯理怀 @sparkycollier

6 days ago

The @PyTorch Foundation is lucky to host these amazing open source projects, @raydistributed and @vllm_project. Seeing the community come together to make something even more powerful is amazing!

Simon Mo @simon_mo_

6 days ago

Great work! Amazing to see Ray Serve LLM and @vllm_project are ever closer together! When done right, @raydistributed is ever flexible, extensible, and highly performant.

1 7 21 4K 1

3 3 16 2K 0

Anyscale @anyscalecompute

6 days ago

Built something with Ray worth sharing? Ray Summit 2026 puts it in front of the people working on the same problems: maintainers, infra leads, applied researchers. Foundation models, physical AI, RL, distributed inference, platform engineering, continuous learning systems. ⏰ Two days left, CFP closes June 20: na2.hubs.ly/H069Zbp0

0 1 5 286 0

Robert Nishihara @robertnishihara

6 days ago

Ray + vLLM is faster now

ray @raydistributed

6 days ago

0 4 29 18K 14

1 5 45 6K 9

kourosh hakhamaneshi @CyrusHakha

6 days ago

This is a major architectural change to Ray Serve LLM that significantly improves performance on the streaming path. Ray Serve is known for its ergonomics and for how easy it is to set up and scale compared to pure k8s. Now it is also as performant as the SOTA.

Seiji Eicher @seiji_________

6 days ago

1 7 51 22K 21

0 2 12 444 1

Ion Stoica @istoica05

6 days ago

Super excited about the launch of these new performance optimizations built on Ray and vLLM. This is a major milestone for the next-generation open-source AI infrastructure stack.

Seiji Eicher @seiji_________

6 days ago

1 7 51 22K 21

4 6 54 5K 13

PyTorch @PyTorch

6 days ago

Ray (@raydistributed) Serve LLM and @vllm_project enable high-performance distributed inference at scale. Awesome to see Foundation-hosted projects working together to advance the open source AI stack. Learn more: anyscale.com/blog/high-perf…

ray @raydistributed

6 days ago

0 4 29 18K 14

1 10 52 11K 15

Simon Mo @simon_mo_

6 days ago

Great work! Amazing to see Ray Serve LLM and @vllm_project are ever closer together! When done right, @raydistributed is ever flexible, extensible, and highly performant.

Seiji Eicher @seiji_________

6 days ago

1 7 51 22K 21

1 7 21 4K 1

ray @raydistributed

6 days ago

Ray Serve LLM now offers 4.4x higher request throughput on prefill-heavy workloads, and 24.8x higher request throughput on decode-heavy workloads! 🚀
Three major optimizations:
- Direct streaming, bypassing an intermediate Ray Serve deployment on the response path with a new, control plane-only endpoint picker
- A new Ray V2 executor backend in vLLM, enabling optimizations such as async scheduling
- HAProxy ingress, for ingress request routing at the speed of C
All available in Ray 2.56. This is awesome work with @googlecloud and @vllm_project!

Seiji Eicher @seiji_________

6 days ago

1 7 51 22K 21

0 4 29 18K 14

vLLM @vllm_project

6 days ago

Huge milestone from the @anyscalecompute + @googlecloud GKE teams 🎊 Ray Serve LLM provides up to 4.4x higher throughput on prefill-heavy workloads and 24x on decode-heavy workloads than previous versions. Three optimizations made this possible on the Ray Serve LLM + vLLM stack:
⭐️ Direct streaming with a control-plane-only endpoint picker
⭐️ A new vLLM Ray V2 executor backend
⭐️ HAProxy ingress for routing at the speed of C
Ray's primitives for fault tolerance, observability, and portability across K8s and VMs are a great foundation as inference deployments get more complex. Congrats to the team! Try the new Ray V2 executor today in vLLM with --distributed-executor-backend ray.

Seiji Eicher @seiji_________

6 days ago

1 7 51 22K 21

4 23 94 10K 29

Seiji Eicher @seiji_________

6 days ago

Today we are excited to announce, in partnership with the GKE team at Google Cloud (@googlecloud), a major milestone in Ray Serve LLM’s production serving capability. Ray Serve LLM now matches high-performance, Rust-based routing frameworks such as vllm-router (@vllm_project) in benchmarks across a variety of workloads and deployment patterns. In Ray 2.56, we see up to 4x higher request throughput on prefill-heavy workloads, and 24x higher request throughput on decode-heavy workloads 🎉

1 7 51 22K 21

Anyscale @anyscalecompute

6 days ago

Ray Serve LLM hits a major milestone: up to 4.4x higher throughput on prefill-heavy & 24.8x on decode-heavy workloads vs. baseline, now matching Rust-based vllm-router while keeping Ray's fault tolerance & portability. How we did it in partnership with @Google: na2.hubs.ly/H069hh-0

1 11 33 3K 8

Anyscale @anyscalecompute

7 days ago

Our Anyscale on Azure webinar is now available on demand. Daniel Arrizza (Anyscale) and Paul Yu (@Microsoft) on running production AI inside your own Azure tenant, plus a live build-train-serve demo. Watch now 👉 na2.hubs.ly/H0699F80

0 1 5 343 0

Robert Nishihara @robertnishihara

a week ago

Some intuition about PD disaggregation from the blog:
- PD doesn't speed up prefill and can actually hurt TTFT
- PD's real benefit is flat, stable TPOT under load
- TPOT savings compound over output sequence length
The optimal P:D ratio is dependent in particular on input lengths, output lengths, and cache hits. Meaningful optimizations are possible, but tuning can be sensitive. Benchmarks performed with @raydistributed + @vllm_project on AMD MI325X. anyscale.com/blog/ray-vllm-…
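
The point that TPOT savings compound over output length can be made concrete with a small latency model. This is only a sketch: it uses the common approximation that end-to-end latency is TTFT plus one TPOT per remaining output token, and the TTFT/TPOT values below are hypothetical, not results from the blog's benchmarks.

```python
def e2e_latency(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """End-to-end latency: time to first token, then one TPOT per later token."""
    return ttft_s + tpot_s * (output_tokens - 1)


# Hypothetical per-request numbers under load (illustrative only).
# PD is given a worse TTFT and a better TPOT, matching the intuition above.
AGGREGATED = {"ttft_s": 0.40, "tpot_s": 0.050}
DISAGGREGATED = {"ttft_s": 0.50, "tpot_s": 0.030}


def pd_saving(output_tokens: int) -> float:
    """Seconds saved by PD disaggregation for a given output length."""
    return (e2e_latency(output_tokens=output_tokens, **AGGREGATED)
            - e2e_latency(output_tokens=output_tokens, **DISAGGREGATED))


for n in (2, 6, 101, 1001):
    print(f"{n:>5} output tokens: PD saves {pd_saving(n):+.2f} s")
```

With these made-up numbers, short outputs come out slower with PD because of the TTFT penalty, while the per-token saving dominates as outputs grow longer.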