David Wang @_dcw02

gpu perf @modal davidwa.ng Joined August 2024

Tweets: 105
Followers: 464
Following: 299
Likes: 1K

David Wang @_dcw02

9 hours ago

@finn_fergus yea we have some ideas for this while keeping chain :) mostly tree drafting also complicates the engine implementation especially with overlap scheduling etc

0 0 0 24 0

As always, it’s a lot more nuanced than this. In spec decoding, the predominant cost is target verification, the chain vs tree tradeoff is well known, and at production concurrencies you’re often better off training a better draft model than doing tree drafting.

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) @teortaxesTex

24 hours ago

I want to bring your attention to JetSpec because it looks strictly smarter and stronger than previous speculative decoding and block diffusion approaches (yes, again). Avg 1000 t/s single stream with Qwen-8B on B200. Basically, you can better utilize compute at any batch size.

4 7 140 26K 86

2 0 16 1K 5

David Wang @_dcw02

22 hours ago

vLLM has even recently removed tree attention support github.com/vllm-project/v…

0 0 4 97 0

Modal @modal

3 days ago

It is not too late to _actually_ own your inference. Introducing: Modal Auto Endpoints.

20 59 441 193K 130

David Wang @_dcw02

7 days ago

@_EldarKurtic @jvmncs @charles_irl we’ve been collaborating with the zlab for a while and this was written before dflash training was well supported in open source

0 0 3 49 0

David Wang @_dcw02

a week ago

@dysondunbar @charles_irl @Alibaba_Qwen I played too much pokemon champions and forgot. It’s training and will be released later.

0 0 4 237 0

David Wang @_dcw02

a week ago

@hsu_byron @charles_irl @Alibaba_Qwen I would do it for k2.6 and k2.7 in exchange for a kimi labubu

1 0 1 80 0

Charles 🎉 Frye @charles_irl

a week ago

Speculation Is All You Need. In this blog post, we announce the co-release (w/ Z Lab) of six more state-of-the-art DFlash speculators for @Alibaba_Qwen 3.x. Over 1k output tps for 3.5 122B-A10B on a B200. Read the blog for why we're all-in on spec dec. modal.com/blog/spec-is-a…

35 100 697 186K 595

Jian Chen @jianchen1799

a week ago

Try them out! 🚀 Really enjoyed the amazing collaboration with the Modal team. The retrained DFlash draft models now support longer contexts, making them a better fit for agentic workloads.

Charles 🎉 Frye @charles_irl

a week ago

35 100 697 186K 595

1 6 12 5K 6

David Wang @_dcw02

a week ago

@tenderizzation i miss culver's

1 0 3 78 0

LMSYS Org @lmsysorg

2 weeks ago

🚀 New blog: The next generation of speculative decoding: DFlash and Spec V2

DFlash + Spec V2 hit >4.3X baseline throughput for LLM inference, now the default speculative decoding engine in SGLang!

Together with @modal and z-lab.ai, our jointly-released DFlash drafter for Qwen 3.5 397B-A17B beats both baseline and native MTP in every setting we benchmarked:

1️⃣ >4.3X baseline & 1.5X native MTP throughput (concurrency 1, HumanEval, 8xB200)
2️⃣ Block diffusion drafter: a full token block in one forward pass
3️⃣ KV injection: target-model features fed into every draft layer’s KV cache for higher acceptance
4️⃣ Spec V2 overlap scheduler: +33% end-to-end

Read the code, deploy a DFlash server, and start experimenting!

14 77 444 124K 275

Zhijian Liu @zhijianliu_

2 weeks ago

🚀 DFlash now runs on SGLang's new default speculative-decoding engine, Spec V2. ⚡️ Hitting >4.3× baseline throughput (1.5× over native MTP) on Qwen 3.5 397B-A17B. Same quality, more speed! ⭐ github.com/z-lab/dflash

9 28 232 17K 73

David Wang @_dcw02

2 weeks ago

couldn't have asked for better people to do this with. @sgl_project @modal and the z-lab team @jianchen1799 @yesheng_liang @zhijianliu_ 💚

0 0 11 278 0

David Wang @_dcw02

2 weeks ago

9+ accept lengths on coding workloads
generic drafter btw
qwen 397b 4x faster
repro btw
dflash go brrr

Modal @modal

2 weeks ago

We worked with @lmsysorg and z-lab.ai to
- integrate DFlash spec into @sgl_project
- make it faster with overlap
- train a DFlash drafter for @Alibaba_Qwen 397B-A17B

The result: up to 4.3x greater throughput over baseline and 1.5x over native MTP.

8 29 235 41K 73

2 6 46 10K 8

David Wang @_dcw02

2 weeks ago

@charles_irl @modal @_gongy @saatwiknagpal @DevenNavani @be4ncurd @emilyhanyf @hsubbaraj @racerfunction @luiscape @jonobelotti_IO @mma12261 @teenychairs machine god go brrr

1 0 2 181 0

Zhijian Liu @zhijianliu_

3 weeks ago

Great to see DFlash is being used in MiMo!

Xiaomi MiMo @XiaomiMiMo

3 weeks ago

🚀 1,000+ TOKENS/S ON A 1T MODEL! 🚀 We are thrilled to release Xiaomi MiMo-V2.5-Pro-UltraSpeed in collaboration with @TileRT_AI , breaking the 1,000 tokens/s output speed on a 1 Trillion parameter model for the FIRST TIME! Not wafer-scale integration like Cerebras. Not pure

153 298 2K 394K 841

5 13 86 20K 7

Jian Chen @jianchen1799

3 weeks ago

Amazing to see DFlash helping push Xiaomi MiMo to 1,000+ tokens/s on a 1T model. Happy to see our work supercharging one of the best open-source LLMs! 🚀