light @reprompting

learning from first principles

Joined May 2025

Tweets: 183
Followers: 167
Following: 75
Likes: 119

light @reprompting

10 hours ago

my favorite thing about backpropagation is just how simple it is. it is literally just applying the chain rule for derivatives.

sharpeye @sharpeye_wnl

18 hours ago

simple backpropagation by hand on 2 layers. had fun revising it

22 11 379 12K 149

0 0 2 115 2
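
To make the chain-rule point concrete, here is a minimal sketch of backpropagation by hand on a tiny 2-layer network with scalar weights. The tanh hidden unit, the squared-error loss, and every name below are assumptions chosen for illustration; none of them come from the tweets. A finite-difference helper is included only as a sanity check on the hand-derived gradients.

```python
import math

def forward(params, x, target):
    """Two-layer scalar network: h = tanh(w1*x + b1), y = w2*h + b2."""
    w1, b1, w2, b2 = params
    h = math.tanh(w1 * x + b1)
    y = w2 * h + b2
    loss = 0.5 * (y - target) ** 2
    return loss, (h, y)

def backward(params, x, target):
    """Gradients of the loss w.r.t. [w1, b1, w2, b2] via the chain rule."""
    w1, b1, w2, b2 = params
    _, (h, y) = forward(params, x, target)
    dy = y - target            # dL/dy
    dw2 = dy * h               # dL/dw2 = dL/dy * dy/dw2
    db2 = dy                   # dL/db2 = dL/dy * 1
    dh = dy * w2               # dL/dh  = dL/dy * dy/dh
    dz = dh * (1.0 - h * h)    # tanh'(z) = 1 - tanh(z)^2
    dw1 = dz * x               # dL/dw1 = dL/dz * dz/dw1
    db1 = dz                   # dL/db1 = dL/dz * 1
    return [dw1, db1, dw2, db2]

def numerical_grad(params, x, target, eps=1e-6):
    """Central finite differences, used only to cross-check backward()."""
    grads = []
    for i in range(len(params)):
        plus = list(params)
        minus = list(params)
        plus[i] += eps
        minus[i] -= eps
        diff = forward(plus, x, target)[0] - forward(minus, x, target)[0]
        grads.append(diff / (2 * eps))
    return grads
```

Each line in `backward` multiplies the gradient flowing in from the layer above by one local derivative, which is all the chain rule asks for.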

Overall, the chapter shows that dynamic parallelism is mainly useful for workloads where the amount of work changes as the program runs. For more regular workloads, the normal host-launched execution model is usually the better option.

0 0 0 17 0

light @reprompting

18 hours ago

The chapter uses adaptive mesh refinement, Bezier curves and quadtrees as examples and then discusses things like synchronization between parent and child kernels, streams, memory visibility, launch overhead and when dynamic parallelism actually makes sense.

1 0 0 26 0

light @reprompting

18 hours ago

Summarizing chapter 21 of PMPP. This chapter introduces CUDA Dynamic Parallelism, where GPU kernels can launch other kernels directly instead of relying on the CPU to launch every piece of work.

1 1 19 466 7

light @reprompting

2 days ago

I feel like the idea discussed here isn't entirely new, since overlapping data movement with computation is already commonly used in single-GPU training to keep the GPU busy.

0 1 1 45 0

light @reprompting

2 days ago

The chapter then introduces the CUDA features that make this possible:
1. CUDA streams for concurrent execution
2. Asynchronous memory copies
3. Pinned memory to enable DMA transfers
4. CUDA-aware MPI
The common idea is to reduce idle time by overlapping work wherever possible.

1 1 2 58 1

light @reprompting

2 days ago

Summarizing chapter 20 of PMPP. This chapter introduces CUDA streams using a distributed 3D stencil computation as the running example. Unlike the previous chapters, which focus on optimizing kernels on a single GPU, this one looks at scaling across multiple nodes using MPI.

1 2 23 1K 13

light @reprompting

3 days ago

from choosing the right algorithm to decomposing the problem and optimizing the implementation. The chapter also revisits many examples from earlier chapters to show that performance isn't just about writing faster kernels but also about making better algorithmic and design choices.

0 0 1 76 1

light @reprompting

3 days ago

Summarizing chapter 19 of PMPP. This chapter takes a step back from CUDA programming and focuses on computational thinking for parallel programming. Instead of introducing new optimization approaches, it discusses how to approach parallel problems -

1 1 27 1K 19

light @reprompting

4 days ago

From there, the implementation is gradually optimized using techniques like constant memory, memory coalescing, thread coarsening, and finally cutoff-based methods (binning) to reduce the amount of unnecessary computation.

0 0 0 40 0

light @reprompting

4 days ago

The gather approach instead assigns each thread to a grid point and computes contributions from all charges. This increases the amount of work per thread, but avoids atomics entirely, which makes it more suitable for parallel execution.

1 0 0 53 0
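
The gather idea can be sketched sequentially in plain Python, with the loop over grid points standing in for one GPU thread per point. This is an illustrative sketch rather than the book's kernel. The function names, the `(x, y, z, q)` charge tuples, and the single 2D slice at height `z` are all assumptions, and physical constants are omitted. A scatter version is included only for contrast.

```python
import math

def _distance(x, y, z, charge):
    """Euclidean distance from a grid point to a charge (cx, cy, cz, q)."""
    cx, cy, cz, _ = charge
    return math.sqrt((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2)

def potential_gather(charges, nx, ny, spacing, z=0.0):
    """Gather: each grid point (one 'thread') sums over all charges."""
    grid = [[0.0] * nx for _ in range(ny)]
    for j in range(ny):
        for i in range(nx):
            x, y = i * spacing, j * spacing
            total = 0.0
            for charge in charges:
                total += charge[3] / _distance(x, y, z, charge)
            grid[j][i] = total  # one private write per point, no atomics
    return grid

def potential_scatter(charges, nx, ny, spacing, z=0.0):
    """Scatter: each charge (one 'thread') updates every grid point."""
    grid = [[0.0] * nx for _ in range(ny)]
    for charge in charges:
        for j in range(ny):
            for i in range(nx):
                x, y = i * spacing, j * spacing
                # On a GPU, many threads would hit the same grid[j][i],
                # so this update would need an atomic add.
                grid[j][i] += charge[3] / _distance(x, y, z, charge)
    return grid
```

Both versions compute the same map. The difference is which loop becomes the parallel one, and therefore whether two threads ever write to the same output element. The sketch assumes no charge sits exactly on a grid point, since that would divide by zero.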

light @reprompting

4 days ago

Summarizing chapter 18 of PMPP. This chapter focuses on electrostatic potential map computation as an example of optimizing a real GPU application, and it brings together many of the optimization techniques introduced earlier.

1 1 16 559 10

light @reprompting

4 days ago

CUDA SpMV using CSR format. Storing only non-zeros feels simple but irregular memory access is where things got messy.

0 2 19 510 7
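
For reference, a minimal sequential sketch of CSR SpMV in plain Python. It is not the CUDA kernel from the tweet. The function names are assumptions, and the outer loop over rows stands in for the common one-thread-per-row mapping.

```python
def dense_to_csr(matrix):
    """Convert a dense row-major matrix to CSR (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in matrix:
        for col, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(col)
        row_ptr.append(len(values))  # where the next row starts in values
    return values, col_idx, row_ptr

def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for A stored in CSR form."""
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    for row in range(num_rows):  # one "thread" per row
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # x is read through col_idx, so the access pattern depends on
            # the sparsity structure rather than on the thread index.
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y
```

The indirect read `x[col_idx[k]]` and the varying row lengths are where the irregularity comes from: neighbouring rows touch unrelated parts of `x` and do different amounts of work.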

light @reprompting

4 days ago

I liked how the chapter showed optimization as an iterative process. They sped up the expensive MRI reconstruction kernel, only to find that a completely different part of the pipeline had become the bottleneck.

0 0 1 42 1

light @reprompting

4 days ago

The chapter starts with a straightforward implementation and progressively improves it through better parallelization strategies, memory optimizations, and GPU-specific hardware features.

1 0 2 55 1

light @reprompting

4 days ago

Summarizing chapter 17 of PMPP. This chapter is a case study on optimizing an iterative MRI reconstruction algorithm for GPUs. Unlike the previous chapters that introduced individual parallel patterns, this one brings many of those ideas together in a real application.

1 3 16 376 3

light @reprompting

5 days ago

The chapter then ends by showing how convolutions can be reformulated as matrix multiplications (GEMM), allowing them to leverage highly optimized GPU libraries. This idea forms the foundation of libraries such as cuDNN.

0 0 2 48 1
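
The convolution-to-GEMM reformulation can be sketched in plain Python. This is a toy version under stated assumptions (stride 1, no padding, square filters, nested lists instead of tensors), and the names `im2col`, `matmul`, `conv_as_gemm`, and `conv_direct` are illustrative, not the book's or cuDNN's API.

```python
def im2col(inp, k):
    """Unroll a C x H x W input into a (C*k*k) x (H_out*W_out) matrix."""
    h_out = len(inp[0]) - k + 1
    w_out = len(inp[0][0]) - k + 1
    rows = []
    for c in range(len(inp)):
        for p in range(k):
            for q in range(k):
                # One row per (channel, filter offset); one column per output pixel.
                rows.append([inp[c][i + p][j + q]
                             for i in range(h_out) for j in range(w_out)])
    return rows, h_out, w_out

def matmul(a, b):
    """Plain dense matrix multiply, standing in for an optimized GEMM."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def conv_as_gemm(inp, filters):
    """Convolution layer (cross-correlation) computed as one matrix product."""
    k = len(filters[0][0])
    x_unroll, h_out, w_out = im2col(inp, k)
    # Flatten each M x C x k x k filter into one row of length C*k*k.
    w_mat = [[f[c][p][q] for c in range(len(f)) for p in range(k) for q in range(k)]
             for f in filters]
    flat = matmul(w_mat, x_unroll)
    # Reshape each output row back into an H_out x W_out feature map.
    return [[row[i * w_out:(i + 1) * w_out] for i in range(h_out)] for row in flat]

def conv_direct(inp, filters):
    """Reference nested-loop convolution, used to check conv_as_gemm."""
    k = len(filters[0][0])
    h_out = len(inp[0]) - k + 1
    w_out = len(inp[0][0]) - k + 1
    return [[[sum(f[c][p][q] * inp[c][i + p][j + q]
                  for c in range(len(inp)) for p in range(k) for q in range(k))
              for j in range(w_out)] for i in range(h_out)] for f in filters]
```

The trade-off is visible in `im2col`: each input pixel is copied up to `k*k` times, which costs memory but turns the whole layer into a single large matrix product.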

light @reprompting

5 days ago

The chapter goes on to explain why convolutions are such a good fit for GPUs (they expose massive amounts of parallelism across output pixels, feature maps, and minibatches). It then shows how convolution kernels can be mapped onto CUDA and how performance can be improved.

1 0 4 64 1

light @reprompting

5 days ago

Summarizing chapter 16 of PMPP. This chapter focuses on deep learning, specifically CNNs, as a case study for massively parallel computing. Since I already have a background in machine learning, the neural network concepts were mostly familiar.