CUDA
January 3, 2024
20 min read

CUDA Programming on Rented GPUs: Getting Started Guide

Learn CUDA programming fundamentals and develop GPU-accelerated applications on cloud GPU instances. Perfect for researchers and developers new to parallel computing.

What You'll Learn:

  • CUDA programming basics
  • Setting up development environment
  • Memory management techniques
  • Parallel algorithm design
  • Performance optimization
  • Debugging CUDA applications
  • Cloud development workflow
  • Real-world examples
  • Best practices and patterns
  • Scientific computing applications

What is CUDA Programming?

CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform that allows developers to use GPU cores for general-purpose computing. Instead of just graphics, you can harness thousands of GPU cores for mathematical computations, scientific simulations, and data processing.

Why Use CUDA?
  • Massive parallelism (thousands of cores)
  • Significant speedup for suitable problems
  • Mature ecosystem and libraries
  • Integration with popular languages
  • Extensive documentation and community
  • Industry standard for GPU computing
Common Applications
  • Scientific simulations
  • Machine learning and AI
  • Image and signal processing
  • Financial modeling
  • Cryptography and mining
  • Computational fluid dynamics

Setting Up Your Development Environment

Before writing CUDA code, you need to set up the development environment on your rented GPU instance.

Step 1: Verify GPU and Driver Installation
# Check if NVIDIA driver is installed
nvidia-smi

# Check CUDA version
nvcc --version

# Verify GPU compute capability
nvidia-smi --query-gpu=compute_cap --format=csv
Step 2: Install CUDA Toolkit
# Ubuntu/Debian installation
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda

# Add CUDA to PATH
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
Step 3: Install Development Tools
# Install essential development tools
sudo apt-get install build-essential

# Install CMake for project management
sudo apt-get install cmake

# Install Git for version control
sudo apt-get install git

# Optional: Install Visual Studio Code for remote development
wget -qO- https://packages.microsoft.com/keys/microsoft.asc | gpg --dearmor > packages.microsoft.gpg
sudo install -o root -g root -m 644 packages.microsoft.gpg /etc/apt/trusted.gpg.d/
sudo sh -c 'echo "deb [arch=amd64,arm64,armhf signed-by=/etc/apt/trusted.gpg.d/packages.microsoft.gpg] https://packages.microsoft.com/repos/code stable main" > /etc/apt/sources.list.d/vscode.list'
sudo apt-get update
sudo apt-get install code

Your First CUDA Program

Let's start with a simple "Hello World" program to verify everything is working correctly.

Hello World CUDA Program
// hello_cuda.cu
#include <stdio.h>
#include <cuda_runtime.h>

// CUDA kernel function (runs on GPU)
__global__ void hello_kernel() {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Hello from GPU thread %d!\n", idx);
}

int main() {
    // Print device information
    int device_count;
    cudaGetDeviceCount(&device_count);
    printf("Found %d CUDA devices\n", device_count);
    
    if (device_count > 0) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);
        printf("Device 0: %s\n", prop.name);
        printf("Compute Capability: %d.%d\n", prop.major, prop.minor);
        printf("Multiprocessors: %d\n", prop.multiProcessorCount);
        printf("Global Memory: %.2f GB\n", prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    
    // Launch kernel with 1 block of 8 threads
    hello_kernel<<<1, 8>>>();
    
    // Wait for GPU to finish
    cudaDeviceSynchronize();
    
    printf("Hello from CPU!\n");
    return 0;
}
Compile and Run
# Compile the CUDA program
nvcc -o hello_cuda hello_cuda.cu

# Run the program
./hello_cuda

CUDA Programming Concepts

Understanding these fundamental concepts is crucial for effective CUDA programming.

Thread Hierarchy

Threads

Individual execution units. Each thread executes the same kernel function but can work on different data.

Blocks

Groups of threads that can cooperate and share memory. Threads within a block can synchronize.

Grid

Collection of blocks. The entire grid executes the same kernel function.
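To make the hierarchy concrete, here is a minimal, self-contained sketch (the file and kernel names are illustrative): each thread combines its block and thread IDs into a unique global index, and the host launches a grid of blocks sized to cover the data.

// thread_hierarchy_demo.cu -- illustrative sketch of grid/block/thread indexing
#include <stdio.h>
#include <cuda_runtime.h>

// Every thread runs this kernel; blockIdx/blockDim/threadIdx give it a unique index
__global__ void index_demo(int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global index in the grid
    if (idx < n) {
        printf("block %d, thread %d -> element %d\n", blockIdx.x, threadIdx.x, idx);
    }
}

int main() {
    int n = 10;
    int threads_per_block = 4;                                     // threads per block
    int blocks = (n + threads_per_block - 1) / threads_per_block;  // blocks in the grid
    index_demo<<<blocks, threads_per_block>>>(n);                  // launch the grid
    cudaDeviceSynchronize();                                       // wait for GPU printf output
    return 0;
}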

Memory Hierarchy

Global Memory

Large, slow memory accessible by all threads. Main storage for data.

Shared Memory

Fast memory shared among threads in the same block. Limited size but very fast.

Registers

Fastest memory, private to each thread. Automatically managed by compiler.
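The kernel fragment below is a minimal sketch (illustrative names, and it assumes blocks of at most 256 threads) showing where each level appears: the pointer arguments refer to global memory, the tile lives in shared memory, and per-thread scalars such as idx are kept in registers.

__global__ void memory_levels_demo(const float *input, float *output, int n) {
    // Shared memory: one tile per block, visible to all threads in that block
    __shared__ float tile[256];

    // Registers: per-thread scalars like idx live in registers
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // Global memory: reads go through the pointer arguments
    tile[threadIdx.x] = (idx < n) ? input[idx] : 0.0f;
    __syncthreads();  // every thread in the block now sees the filled tile

    if (idx < n) {
        // Global memory again: write the result back
        output[idx] = tile[threadIdx.x] * 2.0f;
    }
}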

Practical Example: Vector Addition

Let's implement a more practical example that demonstrates memory management and parallel computation.

Vector Addition Implementation
// vector_add.cu
#include <stdio.h>
#include <stdlib.h>   // malloc / free
#include <math.h>     // fabsf
#include <cuda_runtime.h>
#include <time.h>

// CUDA kernel for vector addition
__global__ void vector_add(float *a, float *b, float *c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

// CPU version for comparison
void vector_add_cpu(float *a, float *b, float *c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int N = 1000000;  // Vector size
    const int bytes = N * sizeof(float);
    
    // Allocate host memory
    float *h_a = (float*)malloc(bytes);
    float *h_b = (float*)malloc(bytes);
    float *h_c = (float*)malloc(bytes);
    float *h_c_cpu = (float*)malloc(bytes);
    
    // Initialize vectors
    for (int i = 0; i < N; i++) {
        h_a[i] = (float)i;
        h_b[i] = (float)(i * 2);
    }
    
    // Allocate device memory
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    
    // Copy data to device
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
    
    // Launch kernel
    int threads_per_block = 256;
    int blocks = (N + threads_per_block - 1) / threads_per_block;
    
    // Time GPU execution (clock() is coarse; the matrix example later uses CUDA events)
    clock_t start_gpu = clock();
    vector_add<<<blocks, threads_per_block>>>(d_a, d_b, d_c, N);
    cudaDeviceSynchronize();
    clock_t end_gpu = clock();
    
    // Copy result back to host
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    
    // Time CPU execution
    clock_t start_cpu = clock();
    vector_add_cpu(h_a, h_b, h_c_cpu, N);
    clock_t end_cpu = clock();
    
    // Verify results
    bool correct = true;
    for (int i = 0; i < N; i++) {
        if (fabsf(h_c[i] - h_c_cpu[i]) > 1e-5f) {
            correct = false;
            break;
        }
    }
    
    printf("Vector addition of %d elements\n", N);
    printf("GPU time: %.2f ms\n", ((double)(end_gpu - start_gpu) / CLOCKS_PER_SEC) * 1000);
    printf("CPU time: %.2f ms\n", ((double)(end_cpu - start_cpu) / CLOCKS_PER_SEC) * 1000);
    printf("Results match: %s\n", correct ? "Yes" : "No");
    
    // Cleanup
    free(h_a); free(h_b); free(h_c); free(h_c_cpu);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    
    return 0;
}
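As with the hello-world example, the program compiles directly with nvcc (the -O2 flag is optional):

# Compile the vector addition example
nvcc -O2 -o vector_add vector_add.cu

# Run the program
./vector_add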

Performance Optimization Tips

Getting good performance from CUDA requires understanding and applying optimization techniques.

Memory Optimization
  • Use coalesced memory access patterns
  • Minimize data transfers between CPU and GPU
  • Use shared memory for frequently accessed data
  • Consider memory alignment
  • Use pinned memory for faster transfers
  • Overlap computation with memory transfers (see the sketch after this list)
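As a rough sketch of the last two points, the fragment below uses pinned host memory and a CUDA stream so the transfer and kernel can overlap with other CPU work; d_data, my_kernel, bytes, n, blocks, and threads are placeholders rather than names from the examples above.

// Pinned (page-locked) host memory enables faster, asynchronous transfers
float *h_data = NULL;
cudaMallocHost(&h_data, bytes);              // instead of malloc()

cudaStream_t stream;
cudaStreamCreate(&stream);

// Asynchronous copy: returns immediately and executes in the stream
cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);

// A kernel launched in the same stream starts after the copy finishes
my_kernel<<<blocks, threads, 0, stream>>>(d_data, n);

// ... independent CPU work can run here while the GPU is busy ...

cudaStreamSynchronize(stream);               // wait for the copy and the kernel
cudaStreamDestroy(stream);
cudaFreeHost(h_data);
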
Execution Optimization
  • Choose optimal block and grid sizes (see the occupancy sketch after this list)
  • Maximize occupancy
  • Avoid thread divergence
  • Use appropriate data types
  • Minimize register usage when needed
  • Profile and measure performance
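For the block-size question, the CUDA runtime can suggest a launch configuration. Here is a minimal sketch using the occupancy API with the vector_add kernel from the earlier example:

int min_grid_size = 0, block_size = 0;

// Ask the runtime for a block size that maximizes occupancy for this kernel
cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, vector_add, 0, 0);

// Size the grid to cover all N elements with the suggested block size
int grid_size = (N + block_size - 1) / block_size;
vector_add<<<grid_size, block_size>>>(d_a, d_b, d_c, N);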

Debugging CUDA Applications

Debugging GPU code can be challenging. Here are essential techniques and tools for CUDA development.

Error Checking
// Always check CUDA errors
#define CUDA_CHECK(call) \
    do { \
        cudaError_t error = call; \
        if (error != cudaSuccess) { \
            printf("CUDA error at %s:%d - %s\n", __FILE__, __LINE__, \
                   cudaGetErrorString(error)); \
            exit(1); \
        } \
    } while(0)

// Usage example
CUDA_CHECK(cudaMalloc(&d_data, size));
CUDA_CHECK(cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice));

// Check kernel launch
kernel<<<blocks, threads>>>(args);
CUDA_CHECK(cudaGetLastError());
CUDA_CHECK(cudaDeviceSynchronize());
Debugging Tools

CUDA-GDB

GPU debugger for stepping through kernel code

# Compile with debug info
nvcc -g -G -o program program.cu

# Debug with cuda-gdb
cuda-gdb ./program

Nsight Systems

System-wide performance profiler

# Profile application
nsys profile --output=report ./program

# View results
nsys-ui report.nsys-rep

Advanced Example: Matrix Multiplication

A more complex example demonstrating shared memory usage and optimization techniques.

Optimized Matrix Multiplication
// matrix_mult.cu
#include <stdio.h>
#include <stdlib.h>   // rand, malloc / free
#include <cuda_runtime.h>

#define TILE_SIZE 16

// Optimized matrix multiplication using shared memory
__global__ void matrix_mult_shared(float *A, float *B, float *C, int N) {
    __shared__ float tile_A[TILE_SIZE][TILE_SIZE];
    __shared__ float tile_B[TILE_SIZE][TILE_SIZE];
    
    int row = blockIdx.y * TILE_SIZE + threadIdx.y;
    int col = blockIdx.x * TILE_SIZE + threadIdx.x;
    
    float sum = 0.0f;
    
    // Loop over tiles
    for (int t = 0; t < (N + TILE_SIZE - 1) / TILE_SIZE; t++) {
        // Load tiles into shared memory
        if (row < N && t * TILE_SIZE + threadIdx.x < N)
            tile_A[threadIdx.y][threadIdx.x] = A[row * N + t * TILE_SIZE + threadIdx.x];
        else
            tile_A[threadIdx.y][threadIdx.x] = 0.0f;
            
        if (col < N && t * TILE_SIZE + threadIdx.y < N)
            tile_B[threadIdx.y][threadIdx.x] = B[(t * TILE_SIZE + threadIdx.y) * N + col];
        else
            tile_B[threadIdx.y][threadIdx.x] = 0.0f;
        
        __syncthreads();
        
        // Compute partial sum
        for (int k = 0; k < TILE_SIZE; k++) {
            sum += tile_A[threadIdx.y][k] * tile_B[k][threadIdx.x];
        }
        
        __syncthreads();
    }
    
    // Write result
    if (row < N && col < N) {
        C[row * N + col] = sum;
    }
}

int main() {
    const int N = 1024;
    const int bytes = N * N * sizeof(float);
    
    // Allocate and initialize matrices
    float *h_A = (float*)malloc(bytes);
    float *h_B = (float*)malloc(bytes);
    float *h_C = (float*)malloc(bytes);
    
    // Initialize with random values
    for (int i = 0; i < N * N; i++) {
        h_A[i] = rand() / (float)RAND_MAX;
        h_B[i] = rand() / (float)RAND_MAX;
    }
    
    // Allocate device memory
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);
    
    // Copy to device
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);
    
    // Launch kernel
    dim3 block(TILE_SIZE, TILE_SIZE);
    dim3 grid((N + TILE_SIZE - 1) / TILE_SIZE, (N + TILE_SIZE - 1) / TILE_SIZE);
    
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    
    cudaEventRecord(start);
    matrix_mult_shared<<<grid, block>>>(d_A, d_B, d_C, N);
    cudaEventRecord(stop);
    
    cudaEventSynchronize(stop);
    
    float milliseconds = 0;
    cudaEventElapsedTime(&milliseconds, start, stop);
    
    // Copy result back
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
    
    printf("Matrix multiplication (%dx%d)\n", N, N);
    printf("GPU time: %.2f ms\n", milliseconds);
    printf("Performance: %.2f GFLOPS\n", 
           (2.0f * N * N * N) / (milliseconds * 1e6));
    
    // Cleanup
    free(h_A); free(h_B); free(h_C);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    
    return 0;
}
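As before, the example compiles directly with nvcc (the -O3 flag is optional):

# Compile the matrix multiplication example
nvcc -O3 -o matrix_mult matrix_mult.cu

# Run the program
./matrix_mult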

Best Practices for Cloud Development

Developing CUDA applications on rented GPU instances requires some additional considerations.

Development Workflow
  • Use version control (Git) for code management
  • Develop incrementally with small test cases
  • Use remote development tools (VS Code Remote)
  • Create automated build and test scripts
  • Document your code and algorithms
  • Use containerization for reproducibility (see the Docker example after this list)
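For the containerization point, NVIDIA publishes CUDA development images on Docker Hub. Assuming Docker and the NVIDIA Container Toolkit are installed, a quick sanity check might look like this (the image tag is an example; pick one matching your driver and toolkit version):

# Verify nvcc inside a CUDA development container
docker run --rm --gpus all nvidia/cuda:12.2.0-devel-ubuntu22.04 nvcc --version

# Verify the GPU is visible from inside the container
docker run --rm --gpus all nvidia/cuda:12.2.0-devel-ubuntu22.04 nvidia-smi
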
Cost Optimization
  • Develop and debug on smaller datasets first
  • Use CPU for algorithm development
  • Profile before optimizing
  • Batch multiple experiments
  • Stop instances when not in use
  • Monitor GPU utilization (see the commands after this list)
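For utilization monitoring, nvidia-smi can report usage continuously; for example:

# Refresh GPU utilization, memory, and power once per second
watch -n 1 nvidia-smi

# Or log utilization and memory use over time in CSV form
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1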

Common Pitfalls and Solutions

Memory Management Issues

Problem: Memory leaks and out-of-memory errors

Solution:

  • Always pair cudaMalloc with cudaFree
  • Use RAII patterns or smart pointers (see the sketch after this list)
  • Check available memory before allocation
  • Use memory profiling tools
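One way to apply the RAII suggestion is a small wrapper that frees device memory automatically when it goes out of scope. This is a minimal sketch (not a production allocator) and assumes the CUDA_CHECK macro from the error-checking section is in scope:

// Minimal RAII wrapper for a device allocation (sketch)
struct DeviceBuffer {
    void  *ptr   = nullptr;
    size_t bytes = 0;

    explicit DeviceBuffer(size_t n) : bytes(n) {
        CUDA_CHECK(cudaMalloc(&ptr, bytes));
    }
    ~DeviceBuffer() { cudaFree(ptr); }   // released even on early return

    // Non-copyable: copying would lead to a double free of the same pointer
    DeviceBuffer(const DeviceBuffer&) = delete;
    DeviceBuffer& operator=(const DeviceBuffer&) = delete;
};

// Usage: the allocation is freed automatically when d_buf goes out of scope
DeviceBuffer d_buf(N * sizeof(float));
cudaMemcpy(d_buf.ptr, h_a, d_buf.bytes, cudaMemcpyHostToDevice);
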
Performance Issues

Problem: Poor GPU utilization and slow performance

Solution:

  • Profile with Nsight tools
  • Optimize memory access patterns
  • Choose appropriate block sizes
  • Minimize CPU-GPU data transfers

Next Steps and Resources

Continue your CUDA learning journey with these advanced topics and resources.

Advanced Topics to Explore
  • CUDA Streams and concurrency
  • Multi-GPU programming
  • CUDA libraries (cuBLAS, cuFFT, etc.)
  • Unified Memory
  • Dynamic parallelism
  • Cooperative groups
  • Tensor cores programming
  • CUDA-aware MPI
  • Performance optimization
  • Real-world applications
Recommended Learning Resources
  • NVIDIA CUDA Programming Guide (official documentation)
  • CUDA by Example by Sanders and Kandrot
  • Professional CUDA C Programming by Cheng, Grossman, and McKercher
  • NVIDIA Developer Blog and tutorials
  • CUDA samples and SDK examples
  • Online courses on parallel programming

Ready to Start CUDA Development?

Get access to high-performance GPUs and start developing your CUDA applications today.