# **GPU programming using CUDA**

#### \\

Data analytics in the era of large-scale machine learning

#### //

Giannis Koutsou,

Computation-based Science and Technology Research Center,

The Cyprus Institute

#### Lecture part

- Review of GPU architecture
- Review of GPU programming and CUDA
- Some details of our training system, "Cyclone"

#### Lecture part

- Review of GPU architecture
- Review of GPU programming and CUDA
- Some details of our training system, "Cyclone"

### Hands-on - Practical examples on GPUs

Covering:

- GPU performance vs CPU performance
- Memory coalescing on GPUs
- Shared memory
- What is a warp and should you care

#### Lecture part

- Review of GPU architecture
- Review of GPU programming and CUDA
- Some details of our training system, "Cyclone"

### Hands-on - Practical examples on GPUs

Covering:

- GPU performance vs CPU performance
- Memory coalescing on GPUs
- Shared memory
- What is a warp and should you care

#### Assumptions about this hands-on session

- Some familiarity with programming in C or C++
- Familiarity with some common tasks in C, e.g. array allocation, pointers, etc.
- Can edit text files on a remote server, e.g. text-based (emacs or vim) or <u>VS Code Remote</u>
- Hopefully this will be useful to capture high-level GPU features that affect performance

#### Lecture part

- Review of GPU architecture
- Review of GPU programming and CUDA
- Some details of our training system, "Cyclone"

#### Hands-on - Practical examples on GPUs

Covering:

- GPU performance vs CPU performance
- Memory coalescing on GPUs
- Shared memory
- What is a warp and should you care

#### Assumptions about this hands-on session

- Some familiarity with programming in C or C++
- Familiarity with some common tasks in C, e.g. array allocation, pointers, etc.
- Can edit text files on a remote server, e.g. text-based (emacs or vim) or <u>VS Code Remote</u>
- Hopefully this will be useful to capture high-level GPU features that affect performance

Repo with slides (PDF) and exercises and URL for direct access to slides:

github.com/g-koutsou/NCC-training-202305

https://sds402.online/training

### **GPU architecture**

#### At a very high level:



#### CPU

- Few heavy cores
- Large memory
- Moderate BW to memory
- Optimized for serial execution

#### GPU

- Many light "cores"
- Smaller memory
- High BW to memory
- Optimized for parallel execution



Some numbers from the GPU partition of our Cyclone cluster

NVIDIA V100 Volta GPUs

- 80 Streaming Multiprocessors (SM) per GPU
- 64 "cores" per SM
- GPU memory: 32 GBytes
- Memory bandwidth: 900~GB/s
- Peak performance: 7.8 Tflop/s (double precision)



Some numbers from the GPU partition of our Cyclone cluster

NVIDIA V100 Volta GPUs

- 80 Streaming Multiprocessors (SM) per GPU
- 64 "cores" per SM
- GPU memory: 32 GBytes
- Memory bandwidth: 900~GB/s
- Peak performance: 7.8 Tflop/s (double precision)

We will come back to these numbers during the hands-on



#### "Offload" model of programming

- CPU starts program (runs main())
- CPU copies data to GPU memory (over e.g. PCIe, ~32 GB/s)
- CPU dispatches "kernels" for execution on GPU
  - Kernels read/write to GPU memory (~900 GB/s)
  - Kernels run on GPU threads (thousands) which share *fast* memory [O(10) times faster compared to GPU memory]
- Kernel completes; CPU copies data back from GPU (over e.g. PCIe, ~32 GB/s)



#### GPU memory model (NVIDIA model)

- GPU threads: *slow* access to global, constant, and texture memory
- Each thread has registers (fast) and local memory (slow)
- Threads are grouped into *blocks*; Threads within the same block: *shared memory* (fast)
- Shared memory is limited. E.g. 96 KB per block for V100



GPU memory model (NVIDIA model); some numbers for context

- Threads per block: 1024 (max)
- Register memory (per block): 64 KB
- Shared memory (per block): 96 KB
- Also, max. 255 registers per thread



#### GPU memory model (NVIDIA model)

- Assumptions about execution order
  - Threads within the same block can be assumed to run concurrently
  - No assumption about the order by which blocks are executed

# **CUDA programming model**

#### NVIDIA programming framework for NVIDIA GPUs

- Compute Unified Device Architecture
- C-like programming language for writing CUDA Kernels
  - Includes C/C++ and Fortran variants
  - Compiler for C/C++: nvcc
- Functions for transferring data to/from GPUs, starting kernels, etc.
- Some higher-level functionality also available (linear algebra, random number generations, etc.)
- Concepts generalizable to other accelerator programming frameworks (OpenCL, OpenACC, HiP, etc.)

#### Nomenclature

- "Host" is the CPU
- "Device" is the GPU

#### Allocate memory on GPU

#### err = cudaMalloc(&d\_ptr, size);

- Call from host (CPU)
- Allocate size bytes of memory on GPU and store the starting address in d\_ptr
- d\_ptr is a variable that holds an address to GPU memory i.e. a "device pointer"
- If err != cudaSuccess then something went wrong

#### Free GPU memory

cudaFree(d\_ptr);

#### Nomenclature

- "Host" is the CPU
- "Device" is the GPU

#### Copy data to GPU

cudaMemcpy(d\_ptr, ptr, size, cudaMemcpyHostToDevice);

- Call from *host* (CPU)
- Copy data on host pointed to by ptr to device at address pointed to by d\_ptr
- Device memory should have been allocated using cudaMalloc() to obtain d\_ptr

#### Copy data from GPU

cudaMemcpy(ptr, d\_ptr, size, cudaMemcpyDeviceToHost);

- Call from host (CPU)
- Copy data on device pointed to by d\_ptr to host at address pointed to by ptr
- Host memory should have been allocated using e.g. malloc() to obtain ptr

#### Declare a CUDA kernel

Example:



#### Call a CUDA kernel

• Call from host. Example:

func<<<nblck, nthr>>>(n, a, x);

- nthr: number of threads per block; can be scalar or a dim3 type
- nblck: number of blocks; can be scalar or a dim3 type
- Example of dim3 type:

dim3 nthr(1024, 8, 8); /\* No. of threads in (x, y, z) \*/

#### Call a CUDA kernel

• Call from host. Example:

func<<<nblck, nthr>>>(n, a, x);

- nthr: number of threads per block; can be scalar or a dim3 type
- nblck: number of blocks; can be scalar or a dim3 type
- Example of dim3 type:

dim3 nthr(1024, 8, 8); /\* No. of threads in (x, y, z) \*/

#### Thread coordinates within kernel

Example:



```
dim3 blcks( 4, 3, bz);
dim3 thrds(16, 8, tz);
func<<<blcks, thrds>>>(...);
```

|               | <br>             |              |
|---------------|------------------|--------------|
| ************* | <br>************ | ************ |
| ************* | <br>************ | ************ |
| ************* | <br>************ | *******      |
|               | <br>             | ***********  |
| ***********   | <br>             | *******      |
| ***********   | <br>**********   | ******       |
|               | <br>             |              |
|               | <br>             | ******       |
|               |                  |              |
|               |                  |              |
|               |                  |              |
|               |                  |              |
|               | <br>             |              |
|               | <br>             |              |
|               | <br>             |              |
|               |                  |              |
|               |                  |              |
|               |                  |              |
|               |                  |              |
|               |                  |              |
|               |                  |              |
|               | <br>             |              |
|               |                  |              |
|               | <br>             |              |

```
dim3 blcks( 4, 3, bz);
dim3 thrds(16, 8, tz);
func<<<blcks, thrds>>>(...);
```



```
dim3 blcks( 4, 3, bz);
dim3 thrds(16, 8, tz);
func<<<blcks, thrds>>>(...);
```



```
dim3 blcks( 4, 3, bz);
dim3 thrds(16, 8, tz);
func<<<blcks, thrds>>>(...);
```



```
dim3 blcks( 4, 3, bz);
dim3 thrds(16, 8, tz);
func<<<blcks, thrds>>>(...);
```



```
dim3 blcks( 4, 3, bz);
dim3 thrds(16, 8, tz);
func<<<blcks, thrds>>>(...);
```



```
dim3 blcks( 4, 3, bz);
dim3 thrds(16, 8, tz);
func<<<blcks, thrds>>>(...);
```



Threads, blocks, grids

```
dim3 blcks( 4, 3, bz);
dim3 thrds(16, 8, tz);
func<<<blcks, thrds>>>(...);
```



Variables available within kernel

- threadIdx.{x,y,z}
- blockIdx.{x,y,z}

- blockDim.{x,y,z}
- gridDim.{x,y,z}

Warp-up: port a simple code to GPU and investigate performance

Sources: /nvme/scratch/da23/gpus/

• ex01/axpy.cu implements a so-called "axpy" operation (a-times-x-plus-y):

 $y_i \leftarrow a \cdot x_i + y_i, \ i = 0, \dots, n-1$ 

with  $\alpha$  scalar and y and x vectors of length n.

#### Warp-up: port a simple code to GPU and investigate performance

Sources:/nvme/scratch/da23/gpus/

• ex01/axpy.cu implements a so-called "axpy" operation (a-times-x-plus-y):

 $y_i \gets a \cdot x_i + y_i, \ i = 0, \dots, n-1$ 

with  $\alpha$  scalar and y and x vectors of length n.

• Currently Implements this on the **CPU**. Uses OpenMP to multi-thread over the 20 cores (per socket) of the Cyclone cluster nodes

#### Warp-up: port a simple code to GPU and investigate performance

Sources: /nvme/scratch/da23/gpus/

• ex01/axpy.cu implements a so-called "axpy" operation (a-times-x-plus-y):

 $y_i \leftarrow a \cdot x_i + y_i, i = 0, \dots, n-1$ 

with  $\alpha$  scalar and y and x vectors of length n.

- Currently Implements this on the **CPU**. Uses OpenMP to multi-thread over the 20 cores (per socket) of the Cyclone cluster nodes
- We will proceed step-by-step, to port this simple application to the GPU using CUDA

#### Warp-up: port a simple code to GPU and investigate performance

Sources:/nvme/scratch/da23/gpus/

• ex01/axpy.cu implements a so-called "axpy" operation (a-times-x-plus-y):

 $y_i \leftarrow a \cdot x_i + y_i, i = 0, \dots, n-1$ 

with  $\alpha$  scalar and y and x vectors of length n.

- Currently Implements this on the **CPU**. Uses OpenMP to multi-thread over the 20 cores (per socket) of the Cyclone cluster nodes
- We will proceed step-by-step, to port this simple application to the GPU using CUDA

This will cover:

• Allocation of memory on the GPU;

#### Warp-up: port a simple code to GPU and investigate performance

Sources:/nvme/scratch/da23/gpus/

• ex01/axpy.cu implements a so-called "axpy" operation (a-times-x-plus-y):

 $y_i \leftarrow a \cdot x_i + y_i, i = 0, \dots, n-1$ 

with  $\alpha$  scalar and y and x vectors of length n.

- Currently Implements this on the **CPU**. Uses OpenMP to multi-thread over the 20 cores (per socket) of the Cyclone cluster nodes
- We will proceed step-by-step, to port this simple application to the GPU using CUDA

This will cover:

- Allocation of memory on the GPU;
- Transferring memory to/from GPU;

#### Warp-up: port a simple code to GPU and investigate performance

Sources:/nvme/scratch/da23/gpus/

• ex01/axpy.cu implements a so-called "axpy" operation (a-times-x-plus-y):

 $y_i \leftarrow a \cdot x_i + y_i, i = 0, \dots, n-1$ 

with  $\alpha$  scalar and y and x vectors of length n.

- Currently Implements this on the **CPU**. Uses OpenMP to multi-thread over the 20 cores (per socket) of the Cyclone cluster nodes
- We will proceed step-by-step, to port this simple application to the GPU using CUDA

This will cover:

- Allocation of memory on the GPU;
- Transferring memory to/from GPU;
- Invoking kernels;

#### Warp-up: port a simple code to GPU and investigate performance

Sources:/nvme/scratch/da23/gpus/

• ex01/axpy.cu implements a so-called "axpy" operation (a-times-x-plus-y):

 $y_i \leftarrow a \cdot x_i + y_i, i = 0, \dots, n-1$ 

with  $\alpha$  scalar and y and x vectors of length n.

- Currently Implements this on the **CPU**. Uses OpenMP to multi-thread over the 20 cores (per socket) of the Cyclone cluster nodes
- We will proceed step-by-step, to port this simple application to the GPU using CUDA

This will cover:

- Allocation of memory on the GPU;
- Transferring memory to/from GPU;
- Invoking kernels;
- Placement of threads and memory access

#### File: ex01/axpy.cu

- Contains the C program we will begin with: axpy.cu
- Even though the file extension is .cu, the program contains no CUDA. Only OpenMP
- Allocates four arrays: x0[n], x1[n], y0[n], and y1[n], with n read from the command line
- x0 and y0 are initialized to random numbers
- x1 and y1 are initialized to x0 and y0 respectively
- The program:
  - performs y0[:] = a x 0[:] + y0[:] in the first part marked with A:
  - performs y1[:] = a\*x1[:] + y1[:] in the second part marked with B:
  - reports the timing for part A and for B
  - reports the difference between y0 and y1

#### File: ex01/axpy.cu

- Contains the C program we will begin with: axpy.cu
- Even though the file extension is .cu, the program contains no CUDA. Only OpenMP
- Allocates four arrays: x0[n], x1[n], y0[n], and y1[n], with n read from the command line
- x0 and y0 are initialized to random numbers
- x1 and y1 are initialized to x0 and y0 respectively
- The program:
  - performs y0[:] = a x 0[:] + y0[:] in the first part marked with A:
  - performs y1[:] = a\*x1[:] + y1[:] in the second part marked with B:
  - reports the timing for part A and for B
  - reports the difference between y0 and y1

Take some time to inspect axpy.cu before we compile and run

• Copy first exercise from this training's shared space:

```
[user@front02 ~]$ cp -r /nvme/scratch/da23/gpus/ex01 .
[user@front02 ~]$ cd ex01/
[user@front02 ex01]$ ls -1
axpy.cu
```

• Copy first exercise from this training's shared space:

```
[user@front02 ~]$ cp -r /nvme/scratch/da23/gpus/ex01 .
[user@front02 ~]$ cd ex01/
[user@front02 ex01]$ ls -1
axpy.cu
```

• Compile with nvcc including OpenMP:

[user@front02 ex01]\$ module load gompi
[user@front02 ex01]\$ module load CUDA
[user@front02 ex01]\$ nvcc -03 -arch sm\_70 -Xcompiler -fopenmp -o axpy axpy.cu

• -Xcompiler -fopenmp: tells nvcc to pass -fopenmp to the underlying C compiler (here gcc)

• Copy first exercise from this training's shared space:

```
[user@front02 ~]$ cp -r /nvme/scratch/da23/gpus/ex01 .
[user@front02 ~]$ cd ex01/
[user@front02 ex01]$ ls -1
axpy.cu
```

• Compile with nvcc including OpenMP:

```
[user@front02 ex01]$ module load gompi
[user@front02 ex01]$ module load CUDA
[user@front02 ex01]$ nvcc -03 -arch sm_70 -Xcompiler -fopenmp -o axpy axpy.cu
```

- -Xcompiler -fopenmp: tells nvcc to pass -fopenmp to the underlying C compiler (here gcc)
- Run on the CPUs of a GPU node
- Use srun to run interactively, e.g.:

```
[user@front02 ex01] export OMP_PROC_BIND="close"
[user@front02 ex01] export OMP_PLACES="cores"
[user@front02 ex01] export OMP_NUM_THREADS=20
[user@front02 ex01] srun -n 1 --cpus-per-task=20 -p gpu -A da23 --gres=gpu:1 --reservation=data-analytics ./a
CPU: nthr = 20 t0 = 0.0089 sec P = 15.024 Gflop/s B = 90.142 GB/s
CPU: nthr = 20 t0 = 0.0086 sec P = 15.667 Gflop/s B = 94.000 GB/s
Diff = 0.000000e+00
```

• Copy first exercise from this training's shared space:

```
[user@front02 ~]$ cp -r /nvme/scratch/da23/gpus/ex01 .
[user@front02 ~]$ cd ex01/
[user@front02 ex01]$ ls -1
axpy.cu
```

• Compile with nvcc including OpenMP:

```
[user@front02 ex01]$ module load gompi
[user@front02 ex01]$ module load CUDA
[user@front02 ex01]$ nvcc -03 -arch sm_70 -Xcompiler -fopenmp -o axpy axpy.cu
```

- -Xcompiler -fopenmp: tells nvcc to pass -fopenmp to the underlying C compiler (here gcc)
- Run on the CPUs of a GPU node
- Use srun to run interactively, e.g.:

```
[user@front02 ex01] export OMP_PROC_BIND="close"
[user@front02 ex01] export OMP_PLACES="cores"
[user@front02 ex01] export OMP_NUM_THREADS=20
[user@front02 ex01] srun -n 1 --cpus-per-task=20 -p gpu -A da23 --gres=gpu:1 --reservation=data-analytics ./a
CPU: nthr = 20 t0 = 0.0089 sec P = 15.024 Gflop/s B = 90.142 GB/s
CPU: nthr = 20 t0 = 0.0086 sec P = 15.667 Gflop/s B = 94.000 GB/s
Diff = 0.000000e+00
```

• Compare ~90 GB/s achieved vs ~130 GB/s peak memory bandwidth

Use a GPU to replace part **B** of the calculation

- Edits outside of main():
  - 1. Add the cuda\_runtime.h header file
  - 2. Add the GPU axpy kernel, naming it gpu\_axpy()
  - 3. Add a function similar to ualloc() that allocates memory on the GPU and checks whether an error occured
- Edits within main():
  - 1. Allocate arrays on GPU
  - 2. Copy x1[:] and y1[:] to GPU
  - 3. Call gpu\_axpy()
  - 4. Copy y1[:] from GPU

#### Edits outside of main() 1/3

• Add the cuda\_runtime.h header file on line 5:

#include <cuda\_runtime.h>

#### Edits outside of main() 2/3

• Add the GPU axpy kernel, naming it gpu\_axpy(), after the CPU axpy, around line 64:

```
/***
* Do y <- a*x + y on the GPU
***/
global__ void
gpu_axpy(int n, float a, float *x, float *y)
{
   for(int i=0; i<n; i++)
      y[i] = a*x[i] + y[i];
   return;
}</pre>
```

#### Edits outside of main() 3/3

• At around line 30 add a function similar to ualloc() that allocates memory on the GPU and checks whether an error occurred

```
/***
 * Allocate memory on GPU; print error if not successful
 ***/
void *
gpu_alloc(size_t size)
{
    void *ptr;
    cudaError_t err = cudaMalloc(&ptr, size);
    if(err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc() returned %d; quitting...\n", err);
        exit(-2);
    }
    return ptr;
}
```

#### Edits within main() 1/4

• Allocate arrays on GPU, within B part. Free arrays before closing B part:

Edits within main() 2/4

Copy x1[:] and y1[:] to GPU

cudaMemcpy(d\_x, x1, sizeof(float)\*n, cudaMemcpyHostToDevice); cudaMemcpy(d\_y, y1, sizeof(float)\*n, cudaMemcpyHostToDevice);

#### Edits within main() 3/4

• Call gpu\_axpy(). For the moment use 1 thread and 1 block. Replace axpy(n, a, x, y) of part B with:

```
double t0 = stop_watch(0);
gpu_axpy<<<1, 1>>>(n, a, d_x, d_y);
t0 = stop_watch(t0);
```

Note we need to pass the *device pointers* since it is these pointers that point to the memory allocated on the GPU

#### Edits within main() 4/4

• Copy y1[:] from GPU:

/\* Copy y1 back from GPU \*/
cudaMemcpy(y1, d\_y, sizeof(float)\*n, cudaMemcpyDeviceToHost);

• Also change:

printf(" CPU: nthr = %4d ...)

to:

printf(" GPU: ...);

and remove OpenMP parallel region.

#### Compile and run

• Compile as before:

[user@front02 ex01]\$ nvcc -arch sm\_70 -03 -Xcompiler -fopenmp -o axpy axpy.cu

• Run as before (I'm assuming OMP\_BIND, OMP\_PLACES, and OMP\_NUM\_THREADS were set before):

This performance is infeasible. What's going on?

#### Edits within main() 3/4

• The problem is here:

```
double t0 = stop_watch(0);
gpu_axpy<<<1, 1>>>(n, a, d_x, d_y);
t0 = stop_watch(t0);
```

- CUDA kernels return immediately; the kernel is still being executed on the device when stop\_watch(t0) is called. We are not timing the kernel execution time, but the time it takes to dispatch the kernel to the GPU.
- Correct this by adding cudaDeviceSynchronize(); after the CUDA kernel, which blocks until all running CUDA kernels are complete:

```
double t0 = stop_watch(0);
gpu_axpy<<<1, 1>>>(n, a, d_x, d_y);
cudaDeviceSynchronize();
t0 = stop_watch(t0);
```

• Compile and run again:

• Compile and run again:

• This performance is of course extremely poor;

• Compile and run again:

- This performance is of course extremely poor;
- We're using only one GPU thread in the kernel

#### Use more threads

• In this step, we will use 512 GPU threads. First, change the call to the GPU kernel:

```
double t0 = stop_watch(0);
gpu_axpy<<<1, 512>>>(n, a, d_x, d_y);
cudaDeviceSynchronize();
t0 = stop_watch(t0);
```

#### Use more threads

• In this step, we will use 512 GPU threads. First, change the call to the GPU kernel:

```
double t0 = stop_watch(0);
gpu_axpy<<<1, 512>>>(n, a, d_x, d_y);
cudaDeviceSynchronize();
t0 = stop_watch(t0);
```

• Then we need to change the kernel. We need in each GPU thread to calculate which elements it will operate on:

```
/***
* Do y <- a*x + y on the GPU
***/
__global__ void
gpu_axpy(int n, float a, float *x, float *y)
{
    int ithr = threadIdx.x;
    int nthr = blockDim.x;
    int lt = n/nthr;
    for(int i=ithr*lt; i<(ithr+1)*lt; i++)
      y[i] = a*x[i] + y[i];
    return;
}</pre>
```

• With the above, each thread operated on n/nthr contiguous elements

• Compile and run again:

• Better than before, but still very poor performance. Can we do better?

#### **Optimized GPU memory access**

Always keep in mind that on GPUs, it is more optimal if contiguous threads access contiguous memory locations



This represents the order by which elements are accessed currently

- The same thread accesses continuous elements
- Very common approach on CPUs
- On GPUs, this results in so-called *bank conflicts*
- Suboptimal!

#### **Optimized GPU memory access**

Always keep in mind that on GPUs, it is more optimal if contiguous threads access contiguous memory locations



This represents an optimal data access pattern

- Different threads accesses continuous elements
- Each thread is served by a different memory bank

#### **Optimized GPU memory access**

Always keep in mind that on GPUs, it is more optimal if contiguous threads access contiguous memory locations

In our example:

```
/***
* Do y <- a*x + y on the GPU
***/
__global___void
gpu_axpy(int n, float a, float *x, float *y)
{
    int ithr = threadIdx.x;
    int nthr = blockDim.x;
    for(int i=0; i<n; i+=nthr)
       y[i+ithr] = a*x[i+ithr] + y[i+ithr];
    return;
}</pre>
```

• Compile and run:

#### **Blocks and threads**

Now let's use blocks. Let's use as many blocks and threads as we can

- Upper limit of 1024 threads
- Upper limit of  $2^{31} 1$  blocks

```
double t0 = stop_watch(0);
int nthr = 1024;
gpu_axpy<<<n/nthr, nthr>>>(n, a, d_x, d_y);
cudaDeviceSynchronize();
t0 = stop_watch(t0);
```

```
/***
* Do y <- a*x + y on the GPU
***/
__global__ void
gpu_axpy(int n, float a, float *x, float *y)
{
    int ithr = threadIdx.x;
    int nthr = blockDim.x;
    int iblk = blockIdx.x;
    int idx = ithr + iblk*nthr;
    y[idx] = a*x[idx] + y[idx];
    return;
}</pre>
```

#### **Blocks and threads**

• Compile and run:

```
[user@front02 ex01]$ srun -n 1 --cpus-per-task=20 -p gpu -A da23 --gres=gpu:1 --reservation=data-analytics .,
CPU: nthr = 20 t0 = 0.0088 sec P = 15.188 Gflop/s B = 91.129 GB/s
GPU: t0 = 0.0011 sec P = 119.930 Gflop/s B = 719.578 GB/s
Diff = 1.021564e-15
```

• ~720 GB/s is ~80% of peak bandwidth (which is 900 GB/s)

#### **Blocks and threads**

• Compile and run:

- ~720 GB/s is ~80% of peak bandwidth (which is 900 GB/s)
- Try varying the number of threads per block. E.g. with 512 threads I got ~730 GB/s.

### **CUDA, another example**

Exercise: rotate and shift an array of (x, y) coordinates

- ex02/rot.cu calls, as before, the same kernel twice
- Operation is  $\vec{\nu}_i = U\vec{r}_i + \vec{s}_i$
- Where:

$$\mathbf{U} = \begin{pmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{pmatrix}$$

### **CUDA, another example**

Exercise: rotate and shift an array of (x, y) coordinates

- ex02/rot.cu calls, as before, the same kernel twice
- Operation is  $\vec{\nu}_i = U\vec{r}_i + \vec{s}_i$
- Where:

$$\mathbf{U} = \begin{pmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{pmatrix}$$

• Equivalently:

$$\begin{split} \nu_{i,x} &= \cos(\theta)r_{i,x} - \sin(\theta)r_{i,y} + s_{i,x} \\ \nu_{i,y} &= \sin(\theta)r_{i,x} + \cos(\theta)r_{i,y} + s_{i,y} \end{split}$$

$$\begin{split} \nu_{i,x} &= \cos(\theta) r_{i,x} - \sin(\theta) r_{i,y} + s_{i,x} \\ \nu_{i,y} &= \sin(\theta) r_{i,x} + \cos(\theta) r_{i,y} + s_{i,y} \end{split}$$















#### TODO, for a first version

- Implement a CUDA version for the second call
- Each GPU thread operating on one point (i)
- Pass number of GPU threads to use as command line argument, i.e. using argv[]

#### TODO, for a first version

- Implement a CUDA version for the second call
- Each GPU thread operating on one point (i)
- Pass number of GPU threads to use as command line argument, i.e. using argv[]

#### Example:

```
[user@front02 ex02] export OMP_PROC_BIND="close"
[user@front02 ex02] export OMP_PLACES="cores"
[user@front02 ex02] export OMP_NUM_THREADS=20
[user@front02 ex02] srun -n 1 --cpus-per-task=20 -p gpu -A da23 --gres=gpu:1 --reservation=data-analytics ./r
CPU: nthr = 20 t0 = 0.0610 sec P = 17.608 Gflop/s B = 52.823 GB/s
GPU: nthr = 32 t0 = 0.0069 sec P = 155.683 Gflop/s B = 467.049 GB/s
Diff = 1.115821e-15
```

### TODO, for a first version

- Implement a CUDA version for the second call
- Each GPU thread operating on one point (i)
- Pass number of GPU threads to use as command line argument, i.e. using argv[]

#### **Example:**

```
[user@front02 ex02] export OMP_PROC_BIND="close"
[user@front02 ex02] export OMP PLACES="cores"
[user@front02 ex02] export OMP NUM THREADS=20
[user@front02 ex02] srun -n 1 --cpus-per-task=20 -p gpu -A da23 --gres=gpu:1 --reservation=data-analytics ./
CPU: nthr = 20 t0 = 0.0610 sec P = 17.608 Gflop/s B = 52.823 GB/s
GPU: nthr = 32 t0 = 0.0069 sec
                                     P = 155.683 Gflop/s B = 467.049 GB/s
Diff = 1.115821e-15
* Do r <- U*r on the GPU using CUDA
***/
global void
qpu rotate(int n, coords *r out, float theta, coords *r in, coorsd *s)
 float ct = cos(theta);
 float st = sin(theta);
 int ithr = threadIdx.x:
 int nthr = blockDim.x;
 int iblk = blockIdx.x;
 int idx = (ithr + iblk*nthr);
 r_out[idx].x = ct*r_in[idx].x - st*r_in[idx].y + s[idx].x;
 r out[idx].y = st*r in[idx].x + ct*r in[idx].y + s[idx].y;
 return:
```

The optimal number of threads is not necessarily the maximum

• If we allow the number of threads to be a command line argument, we can easily scan for it

```
[user@font01 ex02]$ for((th=4; th<=1024; th*=2))</pre>
> do
> srun -n 1 --cpus-per-task=20 -p gpu -A da23 --gres=gpu:1 --reservation=data-analytics ./rot $th $((1024*10))
> done 2>&1|grep GPU
 GPU: nthr = 4
                    t0 = 0.0514 sec
                                       P = 20.902 Gflop/s
                                                             B = 62.707 \text{ GB/s}
                                       P = 40.389 \text{ Gflop/s}
 GPU: nthr = 8
                    t0 = 0.0266 sec
                                                             B = 121.168 \text{ GB/s}
 GPU: nthr = 16
                    t0 = 0.0134 sec
                                       P = 80.393 Gflop/s
                                                             B = 241.178 \text{ GB/s}
 GPU: nthr = 32 t0 = 0.0067 sec
                                       P = 159.498 Gflop/s B = 478.495 GB/s
 <u>GPU: nthr = 64</u> t0 = 0.0049 sec
                                       P = 220.171 Gflop/s B = 660.513 GB/s
 GPU: nthr = 128 t0 = 0.0049 sec
                                       P = 217.839 Gflop/s B = 653.516 GB/s
 GPU: nthr = 256
                    t0 = 0.0049 sec
                                       P = 219.217 Gflop/s
                                                             B = 657.652 \text{ GB/s}
 GPU: nthr = 512
                                       P = 222.818 Gflop/s
                    t0 = 0.0048 sec
                                                             B = 668.454 \text{ GB/s}
                                       P = 221.852 Gflop/s
 GPU: nthr = 1024
                                                             B = 665.557 \text{ GB/s}
                    t0 = 0.0048 sec
```

The optimal number of threads is not necessarily the maximum

• If we allow the number of threads to be a command line argument, we can easily scan for it

```
[user@font01 ex02]$ for((th=4; th<=1024; th*=2))</pre>
> do
> srun -n 1 --cpus-per-task=20 -p gpu -A da23 --gres=gpu:1 --reservation=data-analytics ./rot $th $((1024*10))
> done 2>\&1|grep GPU
 GPU: nthr = 4
                    t0 = 0.0514 sec
                                      P = 20.902 Gflop/s
                                                            B = 62.707 \text{ GB/s}
                                      P = 40.389 \text{ Gflop/s}
 GPU: nthr = 8 	 t0 = 0.0266 sec
                                                            B = 121.168 \text{ GB/s}
 GPU: nthr = 16 t0 = 0.0134 sec
                                      P = 80.393 Gflop/s
                                                            B = 241.178 \text{ GB/s}
 GPU: nthr = 32 t0 = 0.0067 sec
                                      P = 159.498 Gflop/s B = 478.495 GB/s
 <u>GPU: nthr = 64</u> t0 = 0.0049 sec
                                      P = 220.171 Gflop/s B = 660.513 GB/s
 GPU: nthr = 128 t0 = 0.0049 sec
                                      P = 217.839 Gflop/s B = 653.516 GB/s
 GPU: nthr = 256 t0 = 0.0049 sec
                                      P = 219.217 Gflop/s B = 657.652 GB/s
 GPU: nthr = 512
                                      P = 222.818 Gflop/s
                    t0 = 0.0048 sec
                                                            B = 668.454 \text{ GB/s}
 GPU: nthr = 1024
                                      P = 221.852 Gflop/s
                                                            B = 665.557 \text{ GB/s}
                    t0 = 0.0048 sec
```

• Tops at ~670 GBytes/s or ~75%. Can we do better?

## Optimizations

- Note the loading of elements of r[] from global memory ⇒ two continuous component per thread
- Optimization opportunity: use one thread per component

## Optimizations

- Note the loading of elements of r[] from global memory  $\Rightarrow$  two continuous component per thread
- Optimization opportunity: use one thread per component

 $\Rightarrow$  Instead of each thread operating on one point, consider each thread operating on a single component of the coordinate

## Optimizations

- Note the loading of elements of r[] from global memory  $\Rightarrow$  two continuous component per thread
- Optimization opportunity: use one thread per component

 $\Rightarrow$  Instead of each thread operating on one point, consider each thread operating on a single component of the coordinate

- In other words, have:
  - even threads computing the x coordinate part of v[:]
  - odd threads computing the y coordinate of v[:]

## Optimizations

- Note the loading of elements of r[] from global memory  $\Rightarrow$  two continuous component per thread
- Optimization opportunity: use one thread per component

 $\Rightarrow$  Instead of each thread operating on one point, consider each thread operating on a single component of the coordinate

- In other words, have:
  - $\circ~$  even threads computing the x coordinate part of v[:]
  - odd threads computing the y coordinate of v[:]
- This will demonstrate the use of *shared memory*, i.e. fast memory which all threads in a single block can access

## Optimizations

- Note the loading of elements of r[] from global memory  $\Rightarrow$  two continuous component per thread
- Optimization opportunity: use one thread per component

 $\Rightarrow$  Instead of each thread operating on one point, consider each thread operating on a single component of the coordinate

- In other words, have:
  - $\circ~$  even threads computing the x coordinate part of v[:]
  - odd threads computing the y coordinate of v[:]
- This will demonstrate the use of *shared memory*, i.e. fast memory which all threads in a single block can access
- Shared memory is declared with the shared attribute, i.e.:

\_shared\_\_ float arr[SIZE];

## Optimizations

- Note the loading of elements of r[] from global memory  $\Rightarrow$  two continuous component per thread
- Optimization opportunity: use one thread per component

 $\Rightarrow$  Instead of each thread operating on one point, consider each thread operating on a single component of the coordinate

- In other words, have:
  - even threads computing the x coordinate part of v[:]
  - odd threads computing the y coordinate of v[:]
- This will demonstrate the use of *shared memory*, i.e. fast memory which all threads in a single block can access
- Shared memory is declared with the shared attribute, i.e.:

#### \_shared\_\_ float arr[SIZE];

• Note that here SIZE must be known at compile time

## Optimizations

- Note the loading of elements of r[] from global memory  $\Rightarrow$  two continuous component per thread
- Optimization opportunity: use one thread per component

 $\Rightarrow$  Instead of each thread operating on one point, consider each thread operating on a single component of the coordinate

- In other words, have:
  - even threads computing the x coordinate part of v[:]
  - odd threads computing the y coordinate of v[:]
- This will demonstrate the use of *shared memory*, i.e. fast memory which all threads in a single block can access
- Shared memory is declared with the shared attribute, i.e.:

#### \_shared\_\_ float arr[SIZE];

- Note that here SIZE must be known at compile time
- Alternatively, we can have dynamic allocation of shared memory (relatively recent CUDA feature)

## Optimizations

• Below is how we would like to organize this calculation:

| 12=2*1                                                                            |
|-----------------------------------------------------------------------------------|
| (x coord. of elem. i + 0) thread = 0; v[i2+0] = r[i2+0]*ct - r[i2+1]*st + s[i2+0] |
| (y coord. of elem. i + 0) thread = 1; v[i2+1] = r[i2+1]*ct + r[i2+0]*st + s[i2+1] |
| (x coord. of elem. i + 1) thread = 2; v[i2+2] = r[i2+2]*ct - r[i2+3]*st + s[i2+2] |
| (y coord. of elem. i + 1) thread = 3; v[i2+3] = r[i2+3]*ct + r[i2+2]*st + s[i2+3] |
| (x coord. of elem. i + 2) thread = 4; v[i2+4] = r[i2+4]*ct - r[i2+5]*st + s[i2+4] |
| (y coord. of elem. i + 2) thread = 5; v[i2+5] = r[i2+5]*ct + r[i2+4]*st + s[i2+5] |
| (x coord. of elem. i + 3) thread = 6; v[i2+6] = r[i2+6]*ct - r[i2+7]*st + s[i2+6] |
| (y coord. of elem. i + 3) thread = 7; v[i2+7] = r[i2+7]*ct + r[i2+6]*st + s[i2+7] |
|                                                                                   |

- Notice that odd threads and even threads carry out different operations
- But on a GPU, it is important for performance to have all threads in a kernel execute the **same** operations
- In other words, try to avoid as much as possible constructs like:

if(ithr % 2 == 0){ ... };

### Optimizations

• First define a macro at the beginning of the file:

#define MAX\_THR 1024

• Then, when invoking the kernel, change the call to use twice the number of blocks:

gpu\_rotate<<<2\*n/n\_gpu\_thr, n\_gpu\_thr>>>(n, d\_v, theta, d\_r, d\_s);

### Optimizations

• In the kernel, declare a shared array, to be used to store the elements of r[]:

\_\_\_shared\_\_\_**float** rr[MAX\_THR];

## Optimizations

• In the kernel, declare a shared array, to be used to store the elements of r[]:

#### \_shared\_\_ float rr[MAX\_THR];

- We need a shared array for r[], because different threads will need to access the same elements. In particular, whether odd or even, each thread needs to access both x and y components of x[]
- By reading r[] into rr[] once, we avoid each thread having to read elements of r[] twice from global memory, which is slow

## Optimizations

• In the kernel, declare a shared array, to be used to store the elements of r[]:

#### \_shared\_\_ float rr[MAX\_THR];

- We need a shared array for r[], because different threads will need to access the same elements. In particular, whether odd or even, each thread needs to access both x and y components of x[]
- By reading r[] into rr[] once, we avoid each thread having to read elements of r[] twice from global memory, which is slow
- Read the elements of r[] corresponding to this block into rr[]:

int idx = iblk\*nthr + ithr; rr[ithr] = r[idx];

## Optimizations

• In the kernel, declare a shared array, to be used to store the elements of r[]:

#### \_shared\_\_ float rr[MAX\_THR];

- We need a shared array for r[], because different threads will need to access the same elements. In particular, whether odd or even, each thread needs to access both x and y components of x[]
- By reading r[] into rr[] once, we avoid each thread having to read elements of r[] twice from global memory, which is slow
- Read the elements of r[] corresponding to this block into rr[]:

```
int idx = iblk*nthr + ithr;
rr[ithr] = r[idx];
```

This way, the loading is done parallel: each thread reads in one component of r[]

### Optimizations

• Now insert the following, which only achieves the operation partially:

float rs = s[idx] + ct\*rr[ithr];

## Optimizations

• Now insert the following, which only achieves the operation partially:

#### float rs = s[idx] + ct\*rr[ithr];

• The operation is still incomplete; what we have achieved with the above is:

$$u_x \leftarrow \cos(\theta) r_x + s_x$$
 $u_y \leftarrow \cos(\theta) r_y + s_y$ 

### Optimizations

• Now insert the following, which only achieves the operation partially:

#### float rs = s[idx] + ct\*rr[ithr];

• The operation is still incomplete; what we have achieved with the above is:

$$v_x \leftarrow \cos(\theta) r_x + s_x$$
  
 $v_y \leftarrow \cos(\theta) r_y + s_y$ 

we are missing:

 $v_x \leftarrow v_x - \sin(\theta)r_y$  $v_y \leftarrow v_y + \sin(\theta)r_x$ 

## Optimizations

• We are missing:

 $\begin{aligned} \nu_x \leftarrow \nu_x - \sin(\theta) r_y \\ \nu_y \leftarrow \nu_y + \sin(\theta) r_x \end{aligned}$ 

## Optimizations

• We are missing:

 $v_x \leftarrow v_x - \sin(\theta)r_y$  $v_y \leftarrow v_y + \sin(\theta)r_x$ 

• Consider the following:

int sw = 1 - 2\*(ithr & 1);

## Optimizations

• We are missing:

$$v_x \leftarrow v_x - \sin(\theta)r_y$$
  
 $v_y \leftarrow v_y + \sin(\theta)r_x$ 

• Consider the following:

#### int sw = 1 - 2\*(ithr & 1);

- & is a bitwise "and" operation, meaning ithr & 1 will evaluate to:
- 0 if ithr is even
- 1 if ithr is odd

## Optimizations

• We are missing:

 $v_x \leftarrow v_x - \sin(\theta)r_y$  $v_y \leftarrow v_y + \sin(\theta)r_x$ 

• Consider the following:

int sw = 1 - 2\*(ithr & 1);

- & is a bitwise "and" operation, meaning ithr & 1 will evaluate to:
- 0 if ithr is even
- 1 if ithr is odd

sw = 1 - 2\*(ithr & 1) therefore yields:

ithr = 0, 1, 2, 3, ... sw = 1, -1, 1, -1, ...

## Optimizations

• We are missing:

 $v_x \leftarrow v_x - \sin(\theta)r_y$  $v_y \leftarrow v_y + \sin(\theta)r_x$ 

### Optimizations

• We are missing:

$$v_x \leftarrow v_x - \sin(\theta) r_y$$
$$v_y \leftarrow v_y + \sin(\theta) r_x$$

• Consider:

rs = rs - sw\*st\*rr[ithr+sw];

## Optimizations

• We are missing:

$$v_x \leftarrow v_x - \sin(\theta)r_y$$
  
 $v_y \leftarrow v_y + \sin(\theta)r_x$ 

• Consider:

rs = rs - sw\*st\*rr[ithr+sw];

• Then read back into out []:

out[idx] = rs;