thom✨ @gpuwaster

highly performative computing thom.gg 🇫🇷/🇨🇭 Joined December 2017

Tweets

1K
Followers

256
Following

314
Likes

590

thom✨ @gpuwaster

15 hours ago

54-55/100 of GPU Grind

trying different optimizations on the fp16 gemm kernel:
- switching from manually loading smem values into the register mma fragments to using ldmatrix. this works well for A, but i have to use ldmatrix.trans for B since it's row-major in memory, and there must be something i'm doing wrong because it's killing the performance. i find the docs very short on this part: they simply say .trans loads the matrix in column-major format, but not which lanes access which part of the matrix or how they communicate. i guess it's causing huge bank conflicts, which would explain the drop in performance
- increasing the number of buffering stages from 2 (simple double buffering) to 3, 4, or 6 stages, but none of these increases performance; at best it stays (roughly) the same (with 4 stages)

kinda hitting a wall with the blind optimizations here, so i'm going to use another neocloud that lets me profile the kernels and make educated choices instead. it wasn't wasted time though, because i learnt / practiced how to program these, even if it didn't increase performance
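a rough way to sanity-check the bank-conflict guess above, as a Python sketch rather than a measurement. it assumes 32 shared-memory banks of 4 bytes each and models one 8x8 fp16 ldmatrix tile as eight 16-byte row reads. the row strides (64 and 72 halves) and all function names are hypothetical, not taken from the kernel.

```python
BANKS = 32        # assumed number of shared-memory banks
BANK_BYTES = 4    # assumed width of one bank
ROW_BYTES = 16    # one row of an 8x8 fp16 tile: 8 halves * 2 bytes


def banks_for_row(byte_addr):
    """Banks touched by one 16-byte row read starting at byte_addr."""
    first = (byte_addr // BANK_BYTES) % BANKS
    return [(first + i) % BANKS for i in range(ROW_BYTES // BANK_BYTES)]


def conflict_degree(row_addrs):
    """Worst-case number of row reads that land on the same bank."""
    hits = [0] * BANKS
    for addr in row_addrs:
        for bank in banks_for_row(addr):
            hits[bank] += 1
    return max(hits)


def tile_row_addrs(stride_halves):
    """Byte addresses of the 8 rows of a tile in a row-major fp16 buffer."""
    return [row * stride_halves * 2 for row in range(8)]


if __name__ == "__main__":
    for stride in (64, 72):  # hypothetical: unpadded vs padded by 8 halves
        print(stride, conflict_degree(tile_row_addrs(stride)))
```

under this simplified model, an unpadded 64-half row stride puts all eight rows on the same four banks, while padding each row by 8 halves spreads them over all 32 banks. whether that is what actually hurts the ldmatrix.trans path can only be confirmed with a profiler.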

thom✨ @gpuwaster

3 days ago

0 0 12 988 5

1 1 10 523 4


thom✨ @gpuwaster

15 hours ago

@aleks_sharik one day maybe

0 0 1 9 0


thom✨ @gpuwaster

16 hours ago

life if you could run NCU on modal

1 0 1 36 0


thom✨ @gpuwaster

2 days ago

@Norapom04 is it breaking the NDA to say that the company made you sign an NDA?

2 0 4 2K 0


thom✨ @gpuwaster

3 days ago

53/100 of GPU Grind

still on the fp16 gemm kernel: switched from the m16n8k8 mma instruction to m16n8k16 and tuned the tile sizes a bit to get to 85 TFLOPS (was at 50 TFLOPS yesterday). i'm pretty sure there's a big issue with memory layout that is holding me back. i wish i could profile, but i have no access to performance counters so i can't run ncu... tomorrow i'll try to fix my shared-mem layout and replace the manual fragment loading with calls to ldmatrix! focusing in the heat is not easy though, i've got to buy a fan 🥵
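for reference, a small Python sketch of the arithmetic behind those numbers: how many FLOPs one mma of a given shape performs, and how a TFLOPS figure follows from a GEMM size and a runtime. the 4096x4096x4096 problem size in the usage lines is a hypothetical example, the tweet does not say which sizes were benchmarked.

```python
def mma_flops(m, n, k):
    """FLOPs of one m x n x k multiply-accumulate: one mul and one add per term."""
    return 2 * m * n * k


def gemm_tflops(m, n, k, seconds):
    """Achieved TFLOPS for an m x n x k GEMM that took `seconds` to run."""
    return 2 * m * n * k / seconds / 1e12


if __name__ == "__main__":
    print(mma_flops(16, 8, 8))    # work per m16n8k8 instruction
    print(mma_flops(16, 8, 16))   # work per m16n8k16 instruction
    # hypothetical problem size and timing, for illustration only
    print(gemm_tflops(4096, 4096, 4096, 1.6e-3))
```

m16n8k16 does twice the work of m16n8k8 per instruction, so the same tile needs half as many mma issues. that alone does not explain the jump from 50 to 85 TFLOPS, since the tile sizes were tuned at the same time.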

thom✨ @gpuwaster

4 days ago

1 4 31 3K 10

0 0 12 988 5


thom✨ @gpuwaster

4 days ago

@kathrynwu1 whats the building with French writing?

0 0 0 130 0


thom✨ @gpuwaster

4 days ago

@aleks_sharik @AMD the idea is beautiful so i believe

0 0 1 8 0


thom✨ @gpuwaster

4 days ago

52/100 of GPU Grind

working on the fp16 gemm kernel today, switching from the m8n8k4 mma shape (a legacy one from the volta architecture) to the m16n8k8 one, and fixing a few bugs. i looked into the ldmatrix instruction: usually i just manually load the fragments into registers by computing the row/col with the formulas from the docs, but ldmatrix makes the code much easier to read. it requires the B matrix to be stored in column-major though, so i can't use it for now; maybe i should transpose B as i load it from GMEM to SMEM.

i got to 50 TFLOPS and thought i cooked, like i was matching cublas perf, but then i realized i had accidentally disabled tensor cores on the cublas call 💀

anyways i still have ideas to reach cublas performance, such as going from double buffering to 3 or 4 stages of buffering so that i can continuously feed the tensor cores, and probably swizzling or something for the bank conflicts

the plot looks terrible but i'm actually getting closer 🫣
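one common form of the swizzling idea mentioned above is an XOR swizzle on 16-byte chunks. the Python sketch below only models the addressing and is not the kernel's actual layout. it assumes 32 banks of 4 bytes and a hypothetical tile whose rows are 64 halves (128 bytes, so 8 chunks of 16 bytes) with no padding; all names are made up for illustration.

```python
BANKS = 32           # assumed number of shared-memory banks
BANK_BYTES = 4       # assumed width of one bank
CHUNK_BYTES = 16     # 8 halves, the unit one ldmatrix row read fetches
CHUNKS_PER_ROW = 8   # hypothetical row of 64 halves = 128 bytes


def swizzled_chunk(row, chunk):
    """XOR the chunk index with the row index so each row is shifted differently."""
    return chunk ^ (row % CHUNKS_PER_ROW)


def smem_byte_offset(row, chunk, swizzle):
    """Byte offset of a logical (row, chunk) in the shared-memory tile."""
    c = swizzled_chunk(row, chunk) if swizzle else chunk
    return (row * CHUNKS_PER_ROW + c) * CHUNK_BYTES


def max_bank_hits(chunk, swizzle):
    """Worst-case bank reuse when 8 consecutive rows read the same logical chunk."""
    hits = [0] * BANKS
    for row in range(8):
        first = (smem_byte_offset(row, chunk, swizzle) // BANK_BYTES) % BANKS
        for i in range(CHUNK_BYTES // BANK_BYTES):
            hits[(first + i) % BANKS] += 1
    return max(hits)


if __name__ == "__main__":
    print([max_bank_hits(c, swizzle=False) for c in range(CHUNKS_PER_ROW)])
    print([max_bank_hits(c, swizzle=True) for c in range(CHUNKS_PER_ROW)])
```

in this model the XOR keeps every row a permutation of its own chunks, so no extra shared memory is needed (unlike padding), and the eight rows of a column of chunks end up on different banks.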

thom✨ @gpuwaster

5 days ago

1 7 73 7K 57

1 4 31 3K 10


thom✨ @gpuwaster

4 days ago

@Leik0w0 @_arohan_ tbh i think theyre too busy to take the time to get into blackwell-specific things… but you could give it a shot, im sure they love challenges though

0 0 1 61 0


thom✨ @gpuwaster

4 days ago

@Leik0w0 we had the goat of QR factorization right in front of us

1 0 1 23 0


thom✨ @gpuwaster

4 days ago

@Leik0w0 lmaooo well same as you, honestly i was really convinced i'd never have to touch a QR factorization again

1 0 1 29 0


thom✨ @gpuwaster

4 days ago

@Leik0w0 @m_sirovatka he doesnt know what a warm chocolatine feels like

1 0 1 22 0


thom✨ @gpuwaster

5 days ago

51/100 of GPU Grind

reading a bit about the different LLM inference optimization strategies today. it's not something i was particularly familiar with, so it's good to make more sense of all these topics you see everywhere, such as prefill-decode disaggregation, kv caches, speculative decoding etc.

great resources i found:
- youtube.com/watch?v=eMlx5f…
- developer.nvidia.com/blog/mastering…
- huggingface.co/blog/not-lain/…
- youtube.com/watch?v=9tvJ_G…

thom✨ @gpuwaster

a week ago

0 0 11 6K 4

1 7 73 7K 57


thom✨ @gpuwaster

a week ago

50/100 of GPU Grind

investigating what could be wrong in my hgemm kernel from yesterday, i realized at some point that the mma instruction i'm using (m8n8k4 for fp16) is a specific edge case in which each warp computes 4 mma operations (instead of one). it's a legacy variant that was made for Volta, i guess it was optimized for that back then. the documentation is a bit light on that part, i think. since it's (kinda) the same instruction as the other mma ones, it's even more confusing that it doesn't work the same way, and i had to dig through the forums to get a better idea.

however, the docs specifically say one shouldn't use this variant on any architecture other than sm_70, so i'm gonna obey and pick another one

thom✨ @gpuwaster

a week ago

0 0 2 949 1

0 0 11 6K 4


thom✨ @gpuwaster

a week ago

49/100 of GPU Grind

unlucky, modal was down when i got home from work, but i got time to make a little progress on the hgemm kernel: fixed errors and got it to work. however it's bad lmao, i'm getting poor accuracy AND poor performance compared to cuBLAS (like 20x slower 🫠). for now i only translated a DGEMM kernel i had into an HGEMM one, so i can probably make it much faster just by looking into the different mma shapes and tiling parameters; it should be a starting point.

for the accuracy however i'm kinda clueless for now. i could accumulate in fp32 but it'd be slower, and i'm comparing to cuBLAS accumulating in fp16, so there's most probably another way
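to make the accumulation question concrete, here is a small Python sketch comparing a running sum that is rounded to fp16 after every add with one that is rounded to fp32. it is a simplification: it accumulates strictly in order, which is not how tensor cores or cuBLAS order their additions, and the all-ones input vectors are a hypothetical worst case, not data from the kernel.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot(a, b, round_acc):
    """Dot product of fp16 inputs; the running sum is rounded by round_acc."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_fp16(x) * to_fp16(y)  # product formed in double precision
        acc = round_acc(acc + prod)     # accumulator rounded after every add
    return acc


if __name__ == "__main__":
    ones = [1.0] * 4096  # hypothetical input, chosen to expose the effect
    print(dot(ones, ones, to_fp16))
    print(dot(ones, ones, to_fp32))
```

with fp16's 11-bit significand the sequential sum stops growing at 2048, because 2048 + 1 rounds back to 2048, while the fp32 accumulator reaches 4096. the order of additions matters as much as the accumulator type, which may be part of why an fp16-accumulating library can still be more accurate than a naive kernel.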

thom✨ @gpuwaster

a week ago

0 0 2 282 2

0 0 2 949 1


thom✨ @gpuwaster

a week ago

modal being down is like 9/11 for gpu poors

0 0 0 68 0


thom✨ @gpuwaster

a week ago

48/100 of GPU Grind

started working on an ampere implementation of the fp16 gemm kernel, getting to play with all the __half and __half2 APIs: how to deal with those packed types and pass them to the mma instruction expecting f16x2, for example. i still need to do some debugging before i can get a proper measurement, but i'm learning a lot about these apis.

it's not as straightforward as DGEMM though, because you have to take into account both the complexity of writing a good gemm in itself and the complexity of dealing with low-precision dtypes
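as an illustration of what "packed" means here, the Python sketch below packs two fp16 values into one 32-bit word the way an f16x2 operand is laid out, with the first element in the low 16 bits. it only models the bit layout on the host. the function names are hypothetical and are not CUDA APIs.

```python
import struct


def pack_half2(lo, hi):
    """Pack two floats as fp16 into one 32-bit word, `lo` in the low 16 bits."""
    return struct.unpack("<I", struct.pack("<ee", lo, hi))[0]


def unpack_half2(word):
    """Split a 32-bit word back into its (low, high) fp16 values."""
    return struct.unpack("<ee", struct.pack("<I", word))


if __name__ == "__main__":
    word = pack_half2(1.0, 2.0)
    print(hex(word))
    print(unpack_half2(word))
```

fp16 1.0 is 0x3C00 and 2.0 is 0x4000, so the packed word is 0x40003C00. in a kernel the same idea shows up when a register holding two halves is passed to mma as a single 32-bit operand.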

thom✨ @gpuwaster

2 weeks ago

47/100 of GPU Grind following stanford cs149 with lecture 3, covering cpu multithreading to hide stalls and maximise core utilization, the example of Intel Kaby-Lake cpu with superscalar core in which multiple instructions can run per clock cycle. Also covering heterogeneous