Overall, the chapter shows that dynamic parallelism is mainly useful for workloads where the amount of work changes as the program runs. For more regular workloads, the normal host-launched execution model is usually the better option.
The chapter uses adaptive mesh refinement, Bezier curves, and quadtrees as examples, then covers synchronization between parent and child kernels, streams, memory visibility, launch overhead, and when dynamic parallelism actually makes sense.
Summarizing chapter 21 of PMPP.
This chapter introduces CUDA Dynamic Parallelism, where GPU kernels can launch other kernels directly instead of relying on the CPU to launch every piece of work.
I feel like the idea discussed here isn't entirely new, since overlapping data movement with computation is already commonly used in single-GPU training to keep the GPU busy.
The chapter then introduces the CUDA features that make this possible:
1. CUDA streams for concurrent execution
2. Asynchronous memory copies
3. Pinned memory to enable DMA transfers
4. CUDA-aware MPI
The common idea is to reduce idle time by overlapping work wherever possible.
Summarizing chapter 20 of PMPP.
This chapter introduces CUDA streams using a distributed 3D stencil computation as the running example. Unlike the previous chapters, which focus on optimizing kernels on a single GPU, this one looks at scaling across multiple nodes using MPI.
from choosing the right algorithm to decomposing the problem and optimizing the implementation. The chapter also revisits many examples from earlier chapters to show that performance isn't just about writing faster kernels but also about making better algorithmic and design choices.
Summarizing chapter 19 of PMPP.
This chapter takes a step back from CUDA programming and focuses on computational thinking for parallel programming. Instead of introducing new optimization approaches, it discusses how to approach parallel problems -
From there, the implementation is gradually optimized using techniques like constant memory, memory coalescing, thread coarsening, and finally cutoff-based methods (binning) to reduce the amount of unnecessary computation.
The gather approach instead assigns each thread to a grid point and computes contributions from all charges. This increases the amount of work per thread, but avoids atomics entirely, which makes it more suitable for parallel execution.
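To make the gather idea concrete, here is a minimal sketch in plain Python, not the book's CUDA code. The function names `potential_gather` and `potential_scatter` are made up for illustration, the Coulomb constant is dropped, and the nested loops over grid points stand in for GPU threads. It assumes no charge sits exactly on a grid point.

```python
import math

def potential_gather(charges, nx, ny, spacing, z=0.0):
    """Gather: each (i, j) iteration stands in for one thread owning one grid point."""
    grid = [[0.0] * nx for _ in range(ny)]
    for j in range(ny):
        for i in range(nx):
            x, y = i * spacing, j * spacing
            total = 0.0
            for cx, cy, cz, q in charges:
                r = math.sqrt((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2)
                total += q / r  # assumes r > 0 (no charge on a grid point)
            grid[j][i] = total  # one writer per output, so no atomics are needed
    return grid

def potential_scatter(charges, nx, ny, spacing, z=0.0):
    """Scatter: each outer iteration stands in for one thread owning one charge."""
    grid = [[0.0] * nx for _ in range(ny)]
    for cx, cy, cz, q in charges:
        for j in range(ny):
            for i in range(nx):
                x, y = i * spacing, j * spacing
                r = math.sqrt((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2)
                # Many "threads" update the same cell; on a GPU this would
                # have to be an atomic add.
                grid[j][i] += q / r
    return grid
```

Both functions produce the same grid. The difference is who owns the output: in the gather version every cell has a single writer, which is why the atomics disappear.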
Summarizing chapter 18 of PMPP.
This chapter focuses on electrostatic potential map computation as an example of optimizing a real GPU application, and it brings together many of the optimization techniques introduced earlier.
I liked how the chapter showed optimization as an iterative process. They sped up the expensive MRI reconstruction kernel, only to find that a completely different part of the pipeline had become the bottleneck.
The chapter starts with a straightforward implementation and progressively improves it through better parallelization strategies, memory optimizations, and GPU-specific hardware features.
Summarizing chapter 17 of PMPP.
This chapter is a case study on optimizing an iterative MRI reconstruction algorithm for GPUs. Unlike the previous chapters that introduced individual parallel patterns, this one brings many of those ideas together in a real application.
The chapter then ends by showing how convolutions can be reformulated as matrix multiplications (GEMM), allowing them to leverage highly optimized GPU libraries. This idea forms the foundation of libraries such as cuDNN.
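A small sketch of that reformulation, in plain Python rather than CUDA. The helper names `im2col`, `matmul`, and `conv2d_gemm` are my own, and I'm assuming a single input channel, stride 1, no padding, and the CNN-style convolution that does not flip the filter. Each output position's patch is unrolled into a row, and all filters are then applied in one matrix multiplication.

```python
def im2col(image, k):
    """Unroll every k x k patch of a 2D image into one row of a matrix."""
    h, w = len(image), len(image[0])
    out_h, out_w = h - k + 1, w - k + 1
    rows = []
    for r in range(out_h):
        for c in range(out_w):
            rows.append([image[r + dr][c + dc]
                         for dr in range(k) for dc in range(k)])
    return rows, out_h, out_w

def matmul(a, b):
    """Naive matrix multiply; stands in for an optimized GEMM routine."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def conv2d_gemm(image, filters):
    """Apply several k x k filters to one image via im2col followed by GEMM."""
    k = len(filters[0])
    patches, out_h, out_w = im2col(image, k)
    # Weight matrix of shape (k*k) x num_filters: one flattened filter per column.
    weights = [[f[dr][dc] for f in filters]
               for dr in range(k) for dc in range(k)]
    prod = matmul(patches, weights)  # shape (out_h*out_w) x num_filters
    # Reshape each column of the product back into a 2D output feature map.
    return [[[prod[r * out_w + c][m] for c in range(out_w)]
             for r in range(out_h)]
            for m in range(len(filters))]
```

The unrolled patch matrix duplicates input values, so the trade is extra memory in exchange for a single large, regular matrix multiplication.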
The chapter explains why convolutions are such a good fit for GPUs: they expose massive amounts of parallelism across output pixels, feature maps, and minibatches. It then shows how convolution kernels can be mapped onto CUDA and how performance can be improved.
Summarizing chapter 16 of PMPP.
This chapter focuses on deep learning, specifically CNNs, as a case study for massively parallel computing. Since I already have a background in machine learning, the neural network concepts were mostly familiar.