54-55/100 of GPU Grind
trying different optimizations on the fp16 gemm kernel:
- switching from manually loading smem values into registers mma fragments to using ldmatrix, which works well for A, but i have to use ldmatrix.trans for B since it's row major in memory, and there must be something i'm doing wrong cause it's killing the performance. i find the docs to be very short for this part, it simply says .trans loads the matrix in column-major format, but not really which lanes are accessing which part of the matrix and how they communicate. i guess it's causing huge bank conflicts that would explain the decrease in performance
- increasing the number of buffering stages, from 2 (simple double buffering) to 3,4, or 6 stages, but none of these is increasing performance, at best it stays (roughly) the same (with 4 stages).
kinda hitting a wall with the blind optimizations here, i'm going to use another neocloud that allows me to profile the kernels so i can make educated choices instead.
it wasn't wasted time though because i learnt / practiced how to program these, even if it didn't increase performance
53/100 of GPU Grind
still on the fp16 gemm kernel, switched from the m16n8k8 mma instruction to m16n8k16, and tuned the tile sizes a bit, to get to 85TFLOPS, (was at 50TFLOPS yesterday). i'm pretty sure there's a big issue with memory layout that is holding me back that much, i
53/100 of GPU Grind
still on the fp16 gemm kernel, switched from the m16n8k8 mma instruction to m16n8k16, and tuned the tile sizes a bit, to get to 85TFLOPS, (was at 50TFLOPS yesterday). i'm pretty sure there's a big issue with memory layout that is holding me back that much, i wish i could profile but i'm performance-counters-less i can't run ncu... trying to fix my shared-mem layout and replace manual fragments loading with calls to ldmatrix tomorrow!
focusing with the heat is not easy though i got to buy a fan 🥵
52/100 of GPU Grind
working on the fp16 gemm kernel today, switching from the m8n8k4 mma shape (that was a legacy one from volta architecture) to the m16n8k8 one, and fixing a few bugs. i looked into the ldmatrix instruction, usually i just manually load the fragments into
52/100 of GPU Grind
working on the fp16 gemm kernel today, switching from the m8n8k4 mma shape (that was a legacy one from volta architecture) to the m16n8k8 one, and fixing a few bugs. i looked into the ldmatrix instruction, usually i just manually load the fragments into register by computing the row/col with the formulas from the docs, but this makes it much easier to read. it requires the B matrix to be stored in column-major though so i can't use it for now, maybe i should transpose B as i load it from GMEM to SMEM.
i got to 50 TFLOPS, and i thought i cooked like i was matching cublas perf but i realized i accidentally disabled tensor cores on the cublas call 💀 anyways i still have ideas to reach cublas performance, such as going from double buffering to 3 or 4 stages of buffering so that i can continuously feed the tensor cores, and probably swizzling or something for the bank conflicts
the plot looks terrible but i'm actually getting closer 🫣
51/100 of GPU Grind
reading a bit about the different LLM inference optimizations strategies today, it’s not something i was particularly familiar with so its good to make more sense of all these topic you see everywhere such as prefill-decode disaggregation, kv caches,
@Leik0w0@_arohan_ tbh i think theyre too busy to take the time to get into blackwell-specific things… but you could give it a shot, im sure they love challenges though
50/100 of GPU Grind
investigating what could be wrong in my hgemm kernel from yesterday, i realized at some point the mma instruction i'm using (m8n8k4 for fp16) is a specific edge case in which each warp computes 4 mma (instead of one). it's some legacy variant that was made
50/100 of GPU Grind
investigating what could be wrong in my hgemm kernel from yesterday, i realized at some point the mma instruction i'm using (m8n8k4 for fp16) is a specific edge case in which each warp computes 4 mma (instead of one). it's some legacy variant that was made for Volta, i guess this was optimized back then. the documentation is a bit light on that part i think especially since it's (kinda) the same instruction as other mma ones it's even more confusing that it doesn't work the same, i had to dig through the forums to get a better idea
however the docs specifically says one shouldn't be using this variant on any other architecture than sm_70 so i'm gonna obey and pick another one
49/100 of GPU Grind
unlucky modal was down when i got home from work but i got time to make a little progress on the hgemm kernel, fixed errors and got it to work ; however it's bad lmao i'm getting poor accuracy AND poor performance compared to cuBLAS (like 20x slower 🫠).
i
49/100 of GPU Grind
unlucky modal was down when i got home from work but i got time to make a little progress on the hgemm kernel, fixed errors and got it to work ; however it's bad lmao i'm getting poor accuracy AND poor performance compared to cuBLAS (like 20x slower 🫠).
i only translated a DGEMM kernel i had into a HGEMM one though for now, so i can probably get it to be much faster just by looking into the different mma shapes and tiling parameters, it should be a starting point
for the accuracy however i'm kinda clueless for now, i can decide to accumulate in fp32 but it'll be slower and i'm comparing to cuBLAS accumulating in fp16 so there's most probably another way
48/100 of GPU Grind
started working on a ampere implementation for the fp16 gemm kernel, getting to play with all the __half and __half2 APIs, how to deal with those packed type and pass them to the mma instruction expecting f16x2 for example, i still need to do some debugging
48/100 of GPU Grind
started working on a ampere implementation for the fp16 gemm kernel, getting to play with all the __half and __half2 APIs, how to deal with those packed type and pass them to the mma instruction expecting f16x2 for example, i still need to do some debugging before i can get a proper measurement but i'm learning a lot about these apis
it's not as straightforward as DGEMM though because you have to take into consideration the complexity of writing a good gemm in itself and the complexity of dealing with low precision dtypes
47/100 of GPU Grind
following stanford cs149 with lecture 3, covering cpu multithreading to hide stalls and maximise core utilization, the example of Intel Kaby-Lake cpu with superscalar core in which multiple instructions can run per clock cycle. Also covering heterogeneous
129 Followers 1K FollowingI made GPT read contracts at a Fortune 100. Escaped to tinker with ML plumbing, one failed compile at a time. Founder: https://t.co/j5kD3SfN42
3K Followers 1K FollowingA Monday warrier with a mean, mean stride.
I eat glass and look into the abyss.
I don't do TL;DR.
Foldb, WILD, other software.
874 Followers 181 Followingd-Matrix delivers high-performance AI inference with digital in-memory compute and ultra-high-bandwidth architecture for modern data centers.
5K Followers 4K FollowingDell + AMD Exclusive AI GPU Cloud
.
3 years early choosing AMD
.
Fully automated platform
.
Need capital to continue to innovate
.
[email protected]
3K Followers 162 FollowingLead Technical Talent Sourcer for @microsoftAI Team EMEA - building the teams that are building the next phase of humanist superintelligence
3K Followers 460 FollowingBuilding @unum_cloud since 2015 · Investing @aal_vc · author of USearch, StringZilla, NumKong - some of the world's most widely used open-source infra
20K Followers 2K FollowingGTM for @nebiustf @nebiusai // ex @Scaleway // from silicon to token, inference and anything in between. Views are my own - not financial advice
2K Followers 562 FollowingSenior Research Fellow @EPCCed, University of Edinburgh. Interested in novel architectures, HPC, FPGAs, RISC-V, programming language design and LLVM & MLIR.