Some resources for quickly getting up and running with CUDA programming


browse  log 



You can also use your local clone with git send-email.


CUDA C++ provides convenient abstractions for leveraging the GPU to perform massively parallel computations. For the most part, CUDA C++ should look pretty familiar to you if you know C or C++.

We'll go over a few of the language extensions that should help you to get started.

If you want a more in-depth introduction, I encourage you to take a look at:

#Embarassingly parallel problems

From wikipedia:

In parallel computing, an embarrassingly parallel workload or problem is one where little or no effort is needed to separate the problem into a number of parallel tasks.

Embarassingly parallel problems are a great place to get started with GPU programming as you can get absurd speed-ups without much thought.

#Unified memory programming

Programming guide link: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#unified-memory-programming

As of CUDA 6.0 (released in 2014), CUDA provides convenient abstractions for "sharing" memory between the host (CPU) and device (GPU).

Oversimplified explanations:

  • cudaMallocManaged: given a pointer to a pointer and a size, allocates shared memory of the specified size and stores the start of the memory region at the passed in pointer.
  • cudaDeviceSynchronize: wait for the GPU to write results into shared memory.
  • cudaFree: frees the previously allocated memory.


int *x;
cudaMallocManaged(&x, sizeof(int));

// Do some GPU stuff using x.

// x is now available to the CPU.

std::cout << *x << "\n";

#Function execution space specifiers: __global__ and __device__

Programming guide link: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#function-execution-space-specifiers

Marking a function as __global__ causes it to be executed on the device (GPU) and callable from the host (CPU) (or the device on modern GPUs).

Marking a function as __device__ causes it to be executed on the device (GPU) and makes it callable from the device.

Example code:

#include <iostream>

void really_add_1(int *x) {
  *x = *x + 1;

void add1(int *x) {

int main(int argc, char** argv) {
  int *x;
  cudaMallocManaged(&x, sizeof(int));
  *x = 2;

  // We'll get to this funky syntax in the next section!
  add1<<<1, 1>>>(x);

  std::cout << *x << "\n";
  return 0;

#Execution configuration

When executing a __global__ function, you provide an "execution configuration" to the function between <<<...>>>.

An execution configuration consists of two components:

  1. number of blocks: the number (or dimensions) of thread blocks to run;
  2. block size: the number (or dimensions) of threads to run per thread block.

The block size must be a multiple of the number of threads in a single warp which is 32 for Nvidia GPUs. A commonly chosen block size is 256.

You can also use two and three-dimensional block sizes. These are functionally equivalent to specifying the count, but provide convenience when dealing with arrays in multiple dimensions. You can use the dim3 type which is able to represent 1-D, 2-D and 3-D coordinates (any coordinates you don't set are treated as 1).

This looks something like this:

// A single block of N * N * 1 threads.
int numBlocks = 1;
dim3 threadsPerBlock(N, N);
GpuCall<<<numBlocks, threadsPerBlock>>>(A, B, C);

Within your kernel, you can use the following built-in variables:

  • blockIdx - the coordinates/indices of the block of the current thread;
  • blockDim - dimensions of a thread block;
  • threadIdx - the coordinates/indices of the current thread.

Here is an example of what a matrix addition might look like:

// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
float C[N][N])
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];

int main()
    // Kernel invocation
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);

#And more!

This is enough to get your feet wet solving embarassingly parallel problems but there is a ton more to explore. Check out the programming guide for more features to try out!