Introduction to GPU programming
A GPU is a processor with its own dedicated memory

To use it, you have to move data from CPU memory to GPU memory, run the computation there, and move the results back


But why?
Why move data back and forth from CPU to GPU to CPU?
What’s the benefit?

Element-wise add of arrays a and b, resulting in array c

4 cycles, 4 elements: throughput = 1 element per cycle
Element-wise add of arrays a and b, resulting in array c

1 cycle, 4 elements: throughput = 4 elements per cycle
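The element-wise add above can be sketched in plain C++. This is a minimal illustration, assuming `float` arrays of equal length; the function name `add_elementwise` is hypothetical. The loop handles one element per iteration, which corresponds to the one-element-per-cycle picture. Whether a compiler turns it into vector instructions is not guaranteed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: c[i] = a[i] + b[i], one element per loop iteration.
std::vector<float> add_elementwise(const std::vector<float>& a,
                                   const std::vector<float>& b)
{
    assert(a.size() == b.size());  // both inputs must have the same length
    std::vector<float> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = a[i] + b[i];
    }
    return c;
}
```

Every iteration is independent of the others, which is what allows hardware with 4 lanes to compute all 4 sums at the same time.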
It takes time to
So is it faster to use the CPU or the GPU?



32 operations
1024 lanes
Utilization: \[ 32 / 1024 = 1 / 32 \approx 3\% \]
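The utilization figure is simply the ratio of useful operations to available lanes. A small sketch, assuming the number of operations does not exceed the number of lanes; the helper name `utilization` is hypothetical.

```cpp
#include <cassert>

// Hypothetical helper: fraction of lanes that do useful work.
// Assumes operations <= lanes, so the result lies between 0 and 1.
double utilization(int operations, int lanes)
{
    assert(lanes > 0);
    return static_cast<double>(operations) / lanes;
}
```

For the numbers above, `utilization(32, 1024)` evaluates to 0.03125, or about 3%.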


32 vector units

A vector unit, 16 lanes wide

32 vector units, executing different instructions

Hand-written SIMD for CPUs
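#include <immintrin.h>  // AVX intrinsics: __m256, _mm256_loadu_ps, _mm256_mul_ps

// Multiply 8 floats by another 8 floats
// on the CPU, using SIMD.
// offsetRegs selects which group of 8 floats to process,
// counted in whole 256-bit registers.
template<int offsetRegs>
inline __m256 mul8(const float* p1, const float* p2)
{
    // Element offset: each 256-bit register holds 8 floats.
    constexpr int lanes = offsetRegs * 8;
    // Unaligned loads of 8 consecutive floats from each input.
    const __m256 a = _mm256_loadu_ps(p1 + lanes);
    const __m256 b = _mm256_loadu_ps(p2 + lanes);
    // Lane-wise product of the two registers.
    return _mm256_mul_ps(a, b);
}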
Is this realistic?

No
Tens or hundreds of simple processors (this model has 8)

Image: LUMI consortium
GPU
Our responsibility: how many threads, in what configuration
Device’s responsibility: launch enough threads and call the supplied code for each
Thread hierarchy: each level is the building block of the next level
A single thread is the smallest unit

A block of threads can be 1D, 2D or 3D
A grid of blocks can be 1D, 2D or 3D
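The division of responsibility and the hierarchy above can be imitated on the CPU for the 1D case. This is only a sketch: `launch_1d`, `global_index` and `fill_with_global_indices` are hypothetical names, not CUDA or HIP API, and the loops run serially whereas a real device runs many threads in parallel.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a 1D launch: the "device" calls `kernel`
// once per thread, passing the block index, the thread index within
// the block, and the block size.
template <typename Kernel>
void launch_1d(int num_blocks, int threads_per_block, Kernel kernel)
{
    for (int block = 0; block < num_blocks; ++block) {
        for (int thread = 0; thread < threads_per_block; ++thread) {
            kernel(block, thread, threads_per_block);
        }
    }
}

// Global index of a thread in a 1D grid of 1D blocks.
int global_index(int block, int thread, int threads_per_block)
{
    return block * threads_per_block + thread;
}

// Example use: each emulated thread writes its own global index.
std::vector<int> fill_with_global_indices(int num_blocks, int threads_per_block)
{
    std::vector<int> out(num_blocks * threads_per_block, -1);
    launch_1d(num_blocks, threads_per_block,
              [&out](int block, int thread, int block_size) {
                  const int i = global_index(block, thread, block_size);
                  out[i] = i;
              });
    return out;
}
```

The caller only chooses the grid and block sizes and supplies the per-thread code. The launcher decides when each thread runs.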







The SM/CU breaks blocks of threads into warps (Nvidia) or wavefronts (AMD)
A 1D block of 256x1 threads gets partitioned into the following warps/wavefronts:
| warp/wavefront ID | thread ID (Nvidia) | thread ID (AMD) |
|---|---|---|
| w0 | 0-31 | 0-63 |
| w1 | 32-63 | 64-127 |
| w2 | 64-95 | 128-191 |
| w3 | 96-127 | 192-255 |
| w4 | 128-159 | - |
| w5 | 160-191 | - |
| w6 | 192-223 | - |
| w7 | 224-255 | - |
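The table follows from integer division of the thread ID by the warp/wavefront size. A minimal sketch, assuming a 1D block and the sizes used in the table (32 for Nvidia, 64 for AMD); the helper names `warp_id` and `warps_per_block` are hypothetical.

```cpp
#include <cassert>

// Hypothetical helper: which warp/wavefront a thread of a 1D block lands in.
int warp_id(int thread_id, int warp_size)
{
    assert(thread_id >= 0 && warp_size > 0);
    return thread_id / warp_size;
}

// Number of warps/wavefronts needed to cover a block, rounded up.
int warps_per_block(int threads_per_block, int warp_size)
{
    assert(threads_per_block > 0 && warp_size > 0);
    return (threads_per_block + warp_size - 1) / warp_size;
}
```

For a block of 256 threads this gives 8 warps of 32 threads or 4 wavefronts of 64 threads, matching the rows above.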



What is the smallest number of blocks in a grid that is enough to utilize all compute units?
\((a)~ 1\quad (b)~ 32\quad (c)~ 8\)
Assume there are 32 blocks. What is the minimum number of threads per block that is enough to utilize all of the hardware?
\((a)~ 64\quad (b)~ 256\quad (c)~ 32\)
Based on the terminology, what brand of GPU is it?