@finn_fergus yea, we have some ideas for this while keeping chain :) Mostly, tree drafting also complicates the engine implementation, especially with overlap scheduling etc.
As always, it’s a lot more nuanced than this. In spec decoding, the predominant cost is target verification. The chain vs tree tradeoff is well known, and at production concurrencies you’re often better off training a better draft model than doing tree drafting.
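For readers new to the chain setup, here is a minimal sketch of greedy chain verification, assuming a toy target model. The names `verify_chain` and `toy_target` are hypothetical and not taken from any engine. A real engine scores every draft position in one batched target forward pass; the per-position calls below are only for readability.

```python
from typing import Callable, List


def verify_chain(
    prefix: List[int],
    draft: List[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Greedy chain verification (illustrative sketch).

    Keeps the longest prefix of `draft` that matches the target's greedy
    choice, then appends exactly one token chosen by the target: a
    correction on the first mismatch, or a bonus token if all drafts match.
    """
    context = list(prefix)
    accepted: List[int] = []
    for tok in draft:
        expected = target_next(context)
        if tok != expected:
            # First mismatch: emit the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        context.append(tok)
    # Every draft token matched, so the target contributes a bonus token.
    accepted.append(target_next(context))
    return accepted


def toy_target(context: List[int]) -> int:
    """Hypothetical stand-in for a target model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


# The draft agrees on the first two tokens, then diverges.
partial = verify_chain([1, 2, 3], [4, 5, 9], toy_target)
# The draft agrees everywhere, so one bonus token is added.
full = verify_chain([1, 2, 3], [4, 5, 6], toy_target)
```

Each verification step yields at least one token, which is why a drafter with higher acceptance pays off directly: the target cost per step is roughly fixed, and only the number of tokens kept changes.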
I want to bring your attention to JetSpec because it looks strictly smarter and stronger than previous speculative decoding and block diffusion approaches (yes, again).
Avg 1000 t/s single stream with Qwen-8B on B200. Basically, you can better utilize compute at any batch size.
@_EldarKurtic @jvmncs @charles_irl we’ve been collaborating with the zlab for a while and this was written before dflash training was well supported in open source
Speculation Is All You Need.
In this blog post, we announce the co-release (w/ Z Lab) of six more state-of-the-art DFlash speculators for @Alibaba_Qwen 3.x.
Over 1k output tps for 3.5 122B-A10B on a B200.
Read the blog for why we're all-in on spec dec.
modal.com/blog/spec-is-a…
Try them out! 🚀 Really enjoyed the amazing collaboration with the Modal team. The retrained DFlash draft models now support longer contexts, making them a better fit for agentic workloads.
🚀 New blog: The next generation of speculative decoding: DFlash and Spec V2
DFlash + Spec V2 hit >4.3X baseline throughput for LLM inference, and Spec V2 is now the default speculative decoding engine in SGLang! Together with @modal and z-lab.ai, our jointly-released DFlash drafter for Qwen 3.5 397B-A17B beats both baseline and native MTP in every setting we benchmarked:
1️⃣ >4.3X baseline & 1.5X native MTP throughput (concurrency 1, HumanEval, 8xB200)
2️⃣ Block diffusion drafter: a full token block in one forward pass
3️⃣ KV injection: target-model features fed into every draft layer’s KV cache for higher acceptance
4️⃣ Spec V2 overlap scheduler: +33% end-to-end
Read the code, deploy a DFlash server, and start experimenting!
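To see why drafting a full block in one forward pass matters, here is a back-of-the-envelope cost model. It is a sketch under loud assumptions: each draft token is accepted independently with probability `alpha`, one draft forward pass costs `draft_cost` relative to one target verification pass, and both drafters share the same `alpha`. The function names and all numbers are illustrative, not measurements from the post.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step with k draft tokens.

    Standard geometric-series result under the i.i.d. acceptance
    assumption: 1 + alpha + ... + alpha**k.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def speedup(alpha: float, k: int, draft_cost: float, draft_passes: int) -> float:
    """Estimated speedup over plain autoregressive decoding.

    One step costs one target pass (normalized to 1) plus `draft_passes`
    draft passes. An autoregressive drafter needs k passes for k tokens;
    a block drafter is modeled here as needing a single pass.
    """
    step_cost = 1.0 + draft_passes * draft_cost
    return expected_tokens(alpha, k) / step_cost


# Illustrative inputs only (assumed, not benchmarked).
alpha, k, draft_cost = 0.8, 8, 0.05
autoregressive = speedup(alpha, k, draft_cost, draft_passes=k)
block = speedup(alpha, k, draft_cost, draft_passes=1)
print(f"autoregressive drafter: {autoregressive:.2f}x, block drafter: {block:.2f}x")
```

In this toy model the block drafter wins purely because drafting cost stops growing with `k`. Real gains also depend on acceptance rate, batch size, and scheduling overlap, none of which the model captures.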
🚀 DFlash now runs on SGLang's new default speculative-decoding engine, Spec V2.
⚡️ Hitting >4.3× baseline throughput (1.5× over native MTP) on Qwen 3.5 397B-A17B. Same quality, more speed!
⭐ github.com/z-lab/dflash
We worked with @lmsysorg and z-lab.ai to
- integrate DFlash spec into @sgl_project
- make it faster with overlap
- train a DFlash drafter for @Alibaba_Qwen 397B-A17B
The result: up to 4.3x greater throughput over baseline and 1.5x over native MTP.
🚀 1,000+ TOKENS/S ON A 1T MODEL! 🚀
We are thrilled to release Xiaomi MiMo-V2.5-Pro-UltraSpeed in collaboration with @TileRT_AI, breaking the 1,000 tokens/s output speed barrier on a 1-trillion-parameter model for the FIRST TIME!
Not wafer-scale integration like Cerebras. Not pure
Amazing to see DFlash helping push Xiaomi MiMo to 1,000+ tokens/s on a 1T model.
Happy to see our work supercharging one of the best open-source LLMs! 🚀