All 23 lectures for my CMU Advanced NLP course are now on YouTube. The slides and 20 code examples are also publicly available.
- YouTube: youtube.com/playlist?list=…
- Course Page (Slides, Schedule): cmu-l3.github.io/anlp-spring202…
- Code: github.com/cmu-l3/anlp-sp…
The lectures are grouped into 7 themes: fundamentals, architectures, learning & inference, modeling, evaluation, RL & agents, and scaling & efficiency.
Check them out if you want an introduction to or a refresher on the fundamentals of LLMs, want the key ideas from recent NLP research, or are just curious to learn more.
@prateek_0041 I have been on the same journey. Writing notes after reading has helped me a lot. Here is one of the introductory articles I have written, on CUDA kernels:
x.com/unmesh_padalka…
Fuck your paid courses. Master GPU engineering for AI systems.
From foundational books and CUDA/ROCm programming to low-level optimization, Nsight tools, multi-GPU orchestration, distributed training, and AI acceleration techniques.
An excellent reference for embedded GPU work or large-scale AI infrastructure. The curated collection covers:
- CUDA & ROCm programming
- Kernel optimization & performance tools
- Multi-GPU systems & distributed training
- Architecture deep dives, Triton, CUTLASS, and more
A goldmine for anyone working on high-performance AI infrastructure, kernel development, or systems-level GPU work.
- github.com/goabiaryan/awe…
Claude Fable 5 [max] wrote the first genuine (and fastest) megakernel ever submitted to KernelBench-Mega.
It was tested on Kimi-Linear W4A16 batch-1 decode for the RTX PRO 6000 Blackwell. Every prior model "won" it with a multi-kernel Triton pipeline that fails our single-fused-kernel authenticity gate:
> Opus 4.8: 14.4x
> GLM-5.2: 11.1x
> GPT-5.5: 4.3x
> Sonnet 5: 4.0x
Fable shipped 18.7x over reference, and torch.profiler shows exactly ONE cooperative kernel launch per decoded token. Int4 dequant (nibbles unpacked in-register, never materialized), conv+SiLU, KDA gated-delta state, MLA absorbed-latent attention with online softmax, MoE router + top-8 experts, RMSNorms, and even the KV cache append all run inside one launch, staged by 14 grid barriers. We overwrote its input buffers mid-audit to prove it recomputes on live data. It does.
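For readers unfamiliar with the "online softmax" mentioned above, here is a minimal pure-Python sketch of the general technique: a softmax-weighted sum computed in a single pass by rescaling the running accumulator whenever a new maximum score appears. It is not the submitted kernel, and the function names are hypothetical. It assumes finite scores.

```python
import math

def online_softmax_attention(scores, values):
    """One-pass softmax-weighted sum of `values` (a list of equal-length vectors)."""
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for s, v in zip(scores, values):
        new_max = max(running_max, s)
        # Rescale everything accumulated under the old maximum.
        correction = math.exp(running_max - new_max)
        weight = math.exp(s - new_max)
        denom = denom * correction + weight
        acc = [a * correction + weight * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]

def two_pass_attention(scores, values):
    """Reference version: find the max first, then normalize."""
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    total = sum(w)
    dim = len(values[0])
    return [sum(wi * v[d] for wi, v in zip(w, values)) / total for d in range(dim)]
```

The one-pass form never needs the full score vector in memory, which is what lets attention be fused into a streaming loop over the KV cache.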
The advantage grows with context: 17.8x at 2k, 18.9x at 8k, 19.5x at 16k. Longer context means a bigger KV cache and more attention work per token, which is usually where a decode kernel bleeds. Keeping everything in one launch amortizes the fixed barrier overhead, and the int4 GEMV stays bandwidth-bound, so the gap over the reference widens instead of closing.
It spent 64% of the session in silence: timing the baseline, microbenchmarking grid barriers, and deriving a ~29x bytes/token roofline. Then it wrote the whole kernel once, hit 14.4x on the first benchmark, and spent the last hour deleting barriers and making int4 dequant free (one LOP3 + HSUB2/HMUL2). It measured the one regression it tried (finer split-K) and reverted it instead of rationalizing.
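One well-known way to get int4 dequant down to a logic op plus a subtract and a multiply is the fp16 "magic number" trick, emulated below in plain Python. Whether the submitted kernel uses exactly this form is an assumption. The zero point of 8, the low-nibble-first packing, and all function names are also assumptions made for illustration. On a GPU the same steps operate on two packed halves per register.

```python
import struct

MAGIC = 0x6400  # fp16 bit pattern of 1024.0; at this magnitude one mantissa step equals 1.0

def half_from_bits(bits):
    """Interpret a 16-bit integer as an IEEE 754 half-precision float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def unpack_nibbles(byte):
    """Split one packed byte into (low, high) 4-bit codes. Low-first order is an assumption."""
    return byte & 0x0F, (byte >> 4) & 0x0F

def dequant(code, scale, zero_point=8):
    # Step 1 (a single LOP3 on GPU): mask the code and OR it into the mantissa
    # of 1024.0, giving the half value 1024 + code with no int-to-float convert.
    as_half = half_from_bits(MAGIC | (code & 0x0F))
    # Step 2 (HSUB2 on GPU): remove the 1024 bias together with the zero point.
    centered = as_half - (1024.0 + zero_point)
    # Step 3 (HMUL2 on GPU): apply the quantization scale.
    return centered * scale

low, high = unpack_nibbles(0xF0)
print(dequant(low, 0.5), dequant(high, 0.5))
```

The point of the trick is that the weights never pass through an integer-to-float conversion, so the unpacking cost hides behind the memory loads of a bandwidth-bound GEMV.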
kernelbench.com/mega
A CMU PhD who built the kernels NVIDIA now ships in TensorRT-LLM explained fast attention in 68 minutes, better than $1200 GPU programming courses.
Pick the attention pattern -> generate a fused CUDA kernel -> drop it into vLLM/SGLang -> same GPU, way more tokens per second.
That loop is why FlashInfer now powers inference at NVIDIA, vLLM, SGLang, and half the serving stacks you use.
FlashInfer + Triton + JIT-compiled kernels + paged-KV attention - that's the stack.
Here's part 1 (of 5) of my short course on efficient LLM inference that I taught at Columbia University. Slides are heavily updated from two weeks ago.
youtube.com/watch?v=3ggYI8…
This Fall at CMU we're teaching a new course on AI Agents!
The goal is that you learn how to create a scaffold, build evals, and train an agentic LLM using RL.
We'll try to balance theory and practice, and introduce modern frameworks and best practices.
Train your own DSpark more efficiently than DeepSpec with Speculators!
We’ve already scaled it up to GLM-5.2, and you don’t need TBs of storage. A basic online training example is here: github.com/vllm-project/s…
And it's not locked to DeepSeek's checkpoints. 🧩
The Speculators library (github.com/vllm-project/s…) lets you train and package DSpark draft models in a standard, HF-compatible format that vLLM loads directly. Already validated on Qwen3-8B and GLM-5.2. Run it on vLLM nightly now:
Six offline RL distillation losses. One base model. The exact same math rollouts.
Do they actually learn the same thing?
Turns out most "new" losses (RFT, DFT, offline GRPO) write nearly the same direction in weight space as plain SFT. Only DPO learns something genuinely different: its update is near-orthogonal, it lands in its own loss basin, and it rewires what the network computes.
Reward-weighting changes the step size. DPO changes the direction.
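To make "step size versus direction" concrete, here is a toy sketch that compares weight updates by cosine similarity. The vectors are made-up numbers, not results from the paper, and the helper names are hypothetical. The paper's actual analysis may use different measurements.

```python
import math

def update_direction(before, after):
    """Weight delta produced by fine-tuning, as a flat vector."""
    return [a - b for a, b in zip(after, before)]

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

base = [0.0, 0.0, 0.0]
sft = [1.0, 2.0, 0.0]               # toy "plain SFT" weights
reward_weighted = [0.5, 1.0, 0.0]   # same direction as SFT, half the step
different = [-2.0, 1.0, 0.0]        # toy update orthogonal to SFT

sft_delta = update_direction(base, sft)
print(cosine(sft_delta, update_direction(base, reward_weighted)))
print(cosine(sft_delta, update_direction(base, different)))
```

A loss that only rescales the SFT gradient changes the length of the delta but leaves the cosine near 1, while a genuinely different objective shows up as a cosine near 0.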
Accepted at @icmlconf (MechInterp workshop)
paper + interactive companion 👇
huggingface.co/spaces/AlexWor…
This is very close to how the text-albumentations library works.
It takes a passage source as input and generates task-oriented data from it. The variance comes from the input docs. The quality comes from the local/remote LM.
The Synthetic Data Playbook HF article also talks about domain data distribution + task variance for good synthetic data.
This new Autodata paper adds more review mechanics with a weak/strong resolver and an external judge, which is actually a super cool idea to maintain good dataset quality.
New blog post on harness optimization. We hit Sonnet 4.6 performance with a 7x cost improvement.
Fable 5 was the first frontier model release that was evaluated on legal tasks. It scored only 13%, its worst result across all benchmarks evaluated.
@Harvey released this benchmark, called Legal Agent Benchmark (LAB), just a month prior. It contains a set of realistic legal matters. Each task gives the agent a closed workspace of documents (contracts, emails, spreadsheets, slide decks) and asks for a concrete deliverable: a diligence memo, an issue list, a redline, a draft. An LLM judge grades the deliverable against a long rubric containing, on average, 61 distinct binary criteria.
Many frontier models such as Gemini 3.1 Pro don't get above a 0% all-pass rate (all rubric criteria passed). With automatic harness optimization, we pushed DeepSeek V4 Pro from 0% to 5% all-pass rate, achieving parity with Sonnet 4.6 for 1/7 of the price.
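A small sketch of why all-pass rate is such a strict metric, assuming it is the fraction of tasks on which every rubric criterion passes. The function names and the toy results are hypothetical, not benchmark data.

```python
def all_pass_rate(results):
    """Fraction of tasks where every binary rubric criterion passed.

    `results` is a list of tasks, each a list of booleans (one per criterion).
    """
    if not results:
        return 0.0
    return sum(all(task) for task in results) / len(results)

def criterion_pass_rate(results):
    """Fraction of individual criteria passed, pooled over all tasks."""
    total = sum(len(task) for task in results)
    return sum(sum(task) for task in results) / total

# Toy data: three tasks with 61 criteria each.
toy = [
    [True] * 61,             # perfect deliverable
    [True] * 60 + [False],   # a single miss fails the whole task
    [False] * 61,
]
print(all_pass_rate(toy), criterion_pass_rate(toy))
```

With dozens of criteria per task, a deliverable that is almost entirely correct still counts as a failure, which is consistent with strong models sitting at 0%.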
Read the blog post for the details: huggingface.co/spaces/joelnik…
Sorting which financial docs are worth an analyst's time is surprisingly hard for frontier LLMs. With an expert-labeled dataset and on-policy distillation, Bridgewater fine-tuned a model to do it reliably and cheaply.
thinkingmachines.ai/news/learning-…
Our MOPD from MiMo-V2-Flash has been widely adopted in modern post-training pipelines.
Now the paper is out with more details & comparison.
Check it out: arxiv.org/abs/2606.30406
Stanford dropped their latest course on Parallel Programming, GPU, and CUDA.
24 hours, 19 lessons.
this is one of the hottest skills that AI labs are looking for. it covers:
> GPU architecture and CUDA
> performance optimization
> multi-core processors and architectures
watch here: youtube.com/playlist?list=…
DSpark from @deepseek_ai ingeniously integrates many speculative decoding ideas to achieve 1.5x to 5x higher throughput in a real production system
Let's understand it with 10 ideas, starting from the very basics 🧵
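As a starting point for "the very basics", here is a toy sketch of greedy draft-and-verify speculative decoding. It illustrates the general idea only. DSpark's actual design is not described here, and the models below are hypothetical stand-in functions that map a token list to the next token.

```python
def speculative_step(target_next, draft_next, prefix, k):
    """Draft k tokens cheaply, keep the part the target agrees with."""
    draft, ctx = [], list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        draft.append(token)
        ctx.append(token)

    accepted, ctx = [], list(prefix)
    for token in draft:
        # A real system scores all k positions in one batched target pass.
        expected = target_next(ctx)
        if expected != token:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(token)
        ctx.append(token)
    accepted.append(target_next(ctx))  # all drafts accepted: one bonus token
    return accepted

def generate(target_next, draft_next, prefix, n, k):
    out = list(prefix)
    while len(out) - len(prefix) < n:
        out.extend(speculative_step(target_next, draft_next, out, k))
    return out[: len(prefix) + n]

def toy_target(ctx):
    return (ctx[-1] + 1) % 10

def toy_draft(ctx):
    # Agrees with the target except after token 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10

print(generate(toy_target, toy_draft, [0], 12, 3))
```

Each step emits between 1 and k + 1 tokens for one target verification pass, and with greedy decoding the output matches what the target model would have produced on its own.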