Parallel & Distributed Computing

What Is CUDA, and How Do You Write Code for It?

Danish Ali, Abdullah Azam, Ameer Hamza Bajwa, Muhammad Qasim

Gift University, Gujranwala

The Basics

Definition

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA that lets software use the GPU for general-purpose processing.

The Concept

Unlocks the power of the graphics card for math, science, and AI, not just gaming.

The Analogy

Think of your CPU as a brilliant Math Professor (smart, but works alone).

Think of CUDA as hiring 1,000 Students (less experienced, but they work together to finish the job much faster).

Why Parallelism Matters

CPU (The Manager)

Few powerful cores. Optimized for serial processing (doing one thing at a time very quickly).

Like a Race Car

Extremely fast for one person, but can't move 50 people at once.

GPU (The Workforce)

Thousands of smaller cores. Optimized for parallel processing (doing many things at once).

Like a City Bus

Slower top speed than a race car, but transports 50 people simultaneously.

Threads, Blocks, and Grids

Understanding how CUDA organizes work using a Construction Site analogy.

Thread

"The Worker"

One worker laying a single brick. The smallest unit of execution.

Block

"The Team"

A group of workers building one wall together. Threads in the same block can share fast on-chip memory and synchronize with one another.

Grid

"The Site"

The entire construction site. A collection of all blocks working on the full problem.
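
Inside a kernel, CUDA exposes this hierarchy through built-in variables. A minimal sketch to make it concrete (the kernel name whoAmI and the 2-block, 4-thread launch are illustrative choices, not from the slides):

#include <stdio.h>

// Each thread reports its place in the hierarchy:
// threadIdx.x = worker number inside the team (block)
// blockIdx.x  = team number on the site (grid)
// blockDim.x  = workers per team; gridDim.x = teams on the site
__global__ void whoAmI() {
  printf("Block %d of %d, thread %d of %d\n",
         blockIdx.x, gridDim.x, threadIdx.x, blockDim.x);
}

int main() {
  whoAmI<<<2, 4>>>();      // a grid of 2 blocks, 4 threads each
  cudaDeviceSynchronize(); // wait so the device printf output appears
  return 0;
}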

Host vs. Device

Host (CPU)

System RAM. Where your main program starts.

Device (GPU)

Video RAM (VRAM). Where the heavy lifting happens.

The Bottleneck: Data Travel

The GPU cannot access CPU memory directly, so data must be copied across the PCIe bus, and that transfer is often the slowest part of the whole job.

"Processing on the GPU is instant, but getting data there is like shipping a package. You want to ship a full truckload (large data), not just one envelope at a time."

The CUDA Workflow

Think of it like a Chef's Workflow.

1. Allocate: Get the bowls ready (Reserve GPU memory) 🥣

2. Copy: Pour the ingredients into the bowls (Send data CPU → GPU) 🥛

3. Launch: Turn on the mixer (Execute the kernel on the GPU) ⚙️

4. Copy Back: Pour the finished batter back into the pan (Send results GPU → CPU) 🎂

5. Free: Wash the bowls (Free GPU memory) 🧼
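
In code, the chef's five steps map onto five CUDA runtime calls. This outline (the pointer and variable names follow the later slides) is filled in step by step below and assembled into a complete program at the end:

// 1. Allocate:  cudaMalloc(&d_a, size);
// 2. Copy:      cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
// 3. Launch:    addArrays<<<blocks, threads>>>(d_a, d_b, d_c, n);
// 4. Copy Back: cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);
// 5. Free:      cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);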

The GPU Function (__global__)

__global__ void addArrays(int *a, int *b, int *c, int n) {
  // Calculate this thread's unique global ID
  int i = blockIdx.x * blockDim.x + threadIdx.x;

  // Guard: threads past the end of the array do nothing
  if (i < n) {
    c[i] = a[i] + b[i];
  }
}

Translation

__global__

"Hey Compiler, this function is special. It runs on the GPU and is called from the CPU."

blockIdx.x * blockDim.x

"Which team (Block) am I in, and how big is that team?"

+ threadIdx.x

"Which worker number am I inside my team?"

i = Global ID

Calculating 'i' gives every thread a unique ID badge so it knows exactly which number in the array to process.
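
A quick worked example: with blockDim.x = 256, the thread with blockIdx.x = 2 and threadIdx.x = 10 computes i = 2 × 256 + 10 = 522, so it alone handles element 522 of the arrays.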

Step 1: cudaMalloc

Just as malloc() reserves memory in standard C, cudaMalloc reserves memory on the GPU.

int *d_a, *d_b, *d_c; // 'd' stands for Device
int size = n * sizeof(int);

cudaMalloc(&d_a, size);
cudaMalloc(&d_b, size);
cudaMalloc(&d_c, size);
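
cudaMalloc returns a cudaError_t status, so each allocation can be checked. A minimal sketch (the checking style is our addition, not part of the slides):

cudaError_t err = cudaMalloc(&d_a, size);
if (err != cudaSuccess) {
  fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
  return 1;
}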

Step 2: cudaMemcpy

We move the numbers from system RAM to the graphics card's VRAM.

// From Host (CPU) to Device (GPU)
cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
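
Step 4 of the workflow is the same call with the direction flag reversed, exactly as the complete program at the end does:

// From Device (GPU) back to Host (CPU)
cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);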

Step 3: The <<< >>> Syntax

The triple angle brackets set the launch configuration: how many blocks to create, and how many threads per block.

int threadsPerBlock = 256;
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;

addArrays<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);
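
A quick worked example: for n = 1000 and 256 threads per block, blocksPerGrid = (1000 + 255) / 256 = 4 in integer division. That launches 4 × 256 = 1024 threads, and the if (i < n) guard inside the kernel makes the 24 extra threads do nothing.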

Complete Source Code (main.cu)

#include <stdio.h>

// Kernel: each thread adds one pair of array elements
__global__ void add(int *a, int *b, int *c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x; // unique global ID
  if (i < n) c[i] = a[i] + b[i];                 // bounds guard
}

int main() {
  int n = 10;
  int size = n * sizeof(int);

  // Host arrays ('h' for Host) and device pointers ('d' for Device)
  int h_a[10] = {1,2,3,4,5,6,7,8,9,10}, h_b[10] = {1,1,1,1,1,1,1,1,1,1}, h_c[10];
  int *d_a, *d_b, *d_c;

  // Step 1: Allocate GPU memory
  cudaMalloc(&d_a, size); cudaMalloc(&d_b, size); cudaMalloc(&d_c, size);

  // Step 2: Copy inputs CPU -> GPU
  cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

  // Step 3: Launch the kernel (1 block of n threads)
  add<<<1, n>>>(d_a, d_b, d_c, n);

  // Step 4: Copy results GPU -> CPU (this call also waits for the kernel)
  cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);
  for (int i = 0; i < n; i++) printf("%d + %d = %d\n", h_a[i], h_b[i], h_c[i]);

  // Step 5: Free GPU memory
  cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
  return 0;
}
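
To build and run, assuming NVIDIA's CUDA toolkit (nvcc) is installed:

nvcc main.cu -o main
./main

The program prints the ten sums, from 1 + 1 = 2 up to 10 + 1 = 11.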

Conclusion & Questions

Summary

CUDA unlocks the massive power of GPUs for everyday tasks.

Contact:

Danish Ali, Abdullah Azam

Ameer Hamza Bajwa, Muhammad Qasim

Thank You for your time!