GPU Computing

Guide to using GPUs on the Caltech HPC cluster.

Available GPUs

GPU Model      Count       Memory    Best For
NVIDIA H100    16+         80 GB     LLMs, large models
NVIDIA H200    Available   141 GB    Memory-intensive AI
NVIDIA V100    8           32 GB     Deep learning
NVIDIA P100    200         16 GB     General GPU computing
NVIDIA L40s    Available   48 GB     Inference, visualization

Requesting GPUs

Important

Every GPU job must include both:

  • #SBATCH --partition=gpu — GPU nodes only live on this partition.

  • #SBATCH --gres=gpu:<type>:<count> — a typed gres. Bare --gres=gpu:N (no type) is rejected by the scheduler.

Valid type tokens (the second field of the gres):

GPU model    gres type token
P100         p100
V100         v100
H100         h100
H200         nvidia_h200 (note the nvidia_ prefix)
L40s         nvidia_l40s (note the nvidia_ prefix)
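The prefix on the H200 and L40s tokens is easy to forget. As a minimal sketch, the mapping can be encoded in a small helper that builds the typed gres value. The name `gres_for` and the `GRES_TOKENS` dictionary are hypothetical; the tokens themselves are copied from the table above.

```python
# Hypothetical helper: map a GPU model name to the typed gres value
# that the scheduler expects. Tokens are copied from the table above.
GRES_TOKENS = {
    "P100": "p100",
    "V100": "v100",
    "H100": "h100",
    "H200": "nvidia_h200",
    "L40S": "nvidia_l40s",
}


def gres_for(model: str, count: int = 1) -> str:
    """Return a typed gres value such as 'gpu:p100:1'."""
    if count < 1:
        raise ValueError("count must be at least 1")
    try:
        token = GRES_TOKENS[model.upper()]
    except KeyError:
        raise ValueError(f"unknown GPU model: {model!r}") from None
    return f"gpu:{token}:{count}"


# Print a directive that can be pasted into a job script.
print(f"#SBATCH --gres={gres_for('H200', 2)}")
```

Lookup is case-insensitive, so `gres_for("l40s")` and `gres_for("L40s")` give the same value.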

Single GPU

#SBATCH --partition=gpu
#SBATCH --gres=gpu:p100:1

Multiple GPUs

#SBATCH --partition=gpu
#SBATCH --gres=gpu:p100:4

Specific GPU Type

#SBATCH --partition=gpu
#SBATCH --gres=gpu:h100:1

Basic GPU Job

#!/bin/bash
#SBATCH --job-name=gpu_test
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --gres=gpu:p100:1
#SBATCH --time=01:00:00

module load cuda/12.0

# Verify GPU is available
nvidia-smi

# Run your GPU program
./my_cuda_program

CUDA Programming

Load CUDA

module avail cuda
module load cuda/12.0

Compile CUDA Code

nvcc -o my_program my_program.cu

Check GPU in Code

Python:

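import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU count: {torch.cuda.device_count()}")
# get_device_name raises an error when no GPU is visible, so guard it
if torch.cuda.is_available():
    print(f"GPU name: {torch.cuda.get_device_name(0)}")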

Command line:

nvidia-smi

Deep Learning Frameworks

PyTorch

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:p100:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=04:00:00

source ~/miniconda3/etc/profile.d/conda.sh
conda activate pytorch

python train.py

Multi-GPU with PyTorch DDP:

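import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize distributed training (launch with torchrun, which sets LOCAL_RANK)
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# `model` is your existing torch.nn.Module; move it to this process's GPU first
model = DDP(model.to(local_rank), device_ids=[local_rank])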

TensorFlow

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:p100:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=04:00:00

module load cuda/11.8
source ~/miniconda3/etc/profile.d/conda.sh
conda activate tensorflow

python train.py

JAX

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:p100:4
#SBATCH --cpus-per-task=32
#SBATCH --mem=128G
#SBATCH --time=12:00:00

source ~/miniconda3/etc/profile.d/conda.sh
conda activate jax

# JAX automatically detects all GPUs
python train_jax.py

Using Containers for Deep Learning

Pull NGC Container

module load apptainer
apptainer pull docker://nvcr.io/nvidia/pytorch:24.01-py3

Multi-Node GPU Training

For training across multiple nodes:

#!/bin/bash
#SBATCH --job-name=distributed
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=128G
#SBATCH --gres=gpu:p100:4
#SBATCH --time=24:00:00

module load cuda/12.0

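# Get master node address (first node in the allocation)
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# Run one launcher per node. The single quotes defer expansion of
# $SLURM_NODEID until each task starts, so every node gets its own rank.
# torchrun replaces the deprecated torch.distributed.launch.
srun bash -c 'torchrun \
    --nproc_per_node=4 \
    --nnodes=2 \
    --node_rank="$SLURM_NODEID" \
    --master_addr="$MASTER_ADDR" \
    --master_port=29500 \
    train_distributed.py'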
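The job above starts 2 nodes with 4 processes each. As a rough sketch of how those processes are numbered, assuming the usual contiguous layout in which each node's ranks follow the previous node's, the global rank can be derived from the node rank and the local rank. The function name `global_rank` is hypothetical and used only for illustration.

```python
# Sketch of the rank arithmetic for the 2-node, 4-GPU-per-node job above.
# Assumes a contiguous layout: node 0 holds ranks 0-3, node 1 holds ranks 4-7.
NNODES = 2
NPROC_PER_NODE = 4


def global_rank(node_rank: int, nproc_per_node: int, local_rank: int) -> int:
    """Global rank of a process, given its node and its index on that node."""
    return node_rank * nproc_per_node + local_rank


world_size = NNODES * NPROC_PER_NODE
print(f"world size: {world_size}")
for node in range(NNODES):
    ranks = [global_rank(node, NPROC_PER_NODE, lr) for lr in range(NPROC_PER_NODE)]
    print(f"node {node}: global ranks {ranks}")
```

Inside a training script the launcher normally provides these values through environment variables, so this arithmetic is mainly useful for checking logs and debugging.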

GPU Memory Management

Check Memory Usage

nvidia-smi
watch -n 1 nvidia-smi  # Update every second
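For logging memory use from inside a job, `nvidia-smi` can also emit machine-readable CSV. The sketch below parses that output; `parse_memory` and `query_memory` are hypothetical helper names, and the sample text is made up for illustration rather than captured from the cluster.

```python
import subprocess

# nvidia-smi query that prints one CSV row per GPU, values in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(csv_text):
    """Parse rows such as '0, 1024, 16384' into (index, used_mib, total_mib)."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        rows.append((int(index), int(used), int(total)))
    return rows


def query_memory():
    """Run nvidia-smi on the current node and parse its output."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory(out)


# Illustrative sample only; call query_memory() on a GPU node for real values.
sample = "0, 1024, 16384\n1, 0, 16384\n"
for index, used, total in parse_memory(sample):
    print(f"GPU {index}: {used}/{total} MiB ({100 * used / total:.1f}% used)")
```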

Reducing Memory Usage

PyTorch:

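import torch

# Use mixed precision (model, input, target, loss_fn, optimizer are your own objects)
scaler = torch.cuda.amp.GradScaler()

with torch.autocast(device_type="cuda", dtype=torch.float16):
    output = model(input)
    loss = loss_fn(output, target)

# Scale the loss so that FP16 gradients do not underflow
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

# Gradient checkpointing (Hugging Face Transformers models only;
# plain PyTorch modules can use torch.utils.checkpoint instead)
model.gradient_checkpointing_enable()

# Release cached, unused blocks; memory held by live tensors is not freed
torch.cuda.empty_cache()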

TensorFlow:

import tensorflow as tf

# Allocate GPU memory on demand instead of reserving it all at startup
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

Best Practices

Tip

Match CPUs to GPUs: Request 8-16 CPUs per GPU for data loading.

  1. Use appropriate GPU type:

    • H100/H200: Large language models, big batch training

    • V100/P100: Standard deep learning, molecular dynamics

    • L40s: Inference, smaller models

  2. Request appropriate host memory (--mem):

    • 32 GB minimum for deep learning

    • 64-128 GB for large models

  3. Use containers: NGC containers are optimized and tested

  4. Enable mixed precision: FP16/BF16 training can cut activation memory by up to about half, although FP32 master weights and optimizer state limit the overall saving

  5. Profile your code:

    nsys profile python train.py
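To make the GPU-type guidance in item 1 concrete, the sketch below estimates the memory needed for model weights alone and lists the GPU types whose memory can hold them. The memory figures are taken from the Available GPUs table; the function names and the 80% headroom factor are assumptions for illustration. The estimate ignores activations, gradients, and optimizer state, so real training needs considerably more.

```python
# GPU memory in GB, taken from the Available GPUs table.
GPU_MEMORY_GB = {"P100": 16, "V100": 32, "L40s": 48, "H100": 80, "H200": 141}


def weight_memory_gb(n_params: int, bytes_per_param: int) -> float:
    """Memory for the weights only (2 bytes for FP16/BF16, 4 for FP32)."""
    return n_params * bytes_per_param / 1e9


def gpus_that_fit(n_params: int, bytes_per_param: int, headroom: float = 0.8):
    """GPU types whose memory, scaled by an assumed headroom, holds the weights."""
    need = weight_memory_gb(n_params, bytes_per_param)
    return [name for name, gb in GPU_MEMORY_GB.items() if need <= gb * headroom]


n_params = 7_000_000_000  # example: a 7-billion-parameter model
for label, nbytes in (("BF16", 2), ("FP32", 4)):
    need = weight_memory_gb(n_params, nbytes)
    print(f"{label}: {need:.0f} GB of weights, fits on {gpus_that_fit(n_params, nbytes)}")
```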
    

Troubleshooting

“CUDA out of memory”

  • Reduce batch size

  • Enable gradient checkpointing

  • Use mixed precision training

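  • Clear the cache between batches (torch.cuda.empty_cache()); this releases cached blocks but does not free memory held by live tensors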
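Reducing the batch size can be automated with a retry loop. The sketch below halves the batch size whenever a step fails with an out-of-memory error. The names `run_with_backoff` and `fake_step` are hypothetical, and `fake_step` merely simulates the failure. In PyTorch the real error is a `RuntimeError` subclass whose message contains "out of memory", which is what the check relies on.

```python
def run_with_backoff(step, batch_size, min_batch_size=1):
    """Call step(batch_size), halving batch_size after each out-of-memory error."""
    while batch_size >= min_batch_size:
        try:
            return batch_size, step(batch_size)
        except RuntimeError as err:
            # Re-raise anything that is not an out-of-memory failure.
            if "out of memory" not in str(err).lower():
                raise
            batch_size //= 2
    raise RuntimeError("out of memory even at the minimum batch size")


def fake_step(batch_size):
    """Stand-in for a training step: pretends batches above 16 do not fit."""
    if batch_size > 16:
        raise RuntimeError("CUDA out of memory (simulated)")
    return f"ok with batch_size={batch_size}"


print(run_with_backoff(fake_step, 64))
```

In a real training loop it is usually worth calling `torch.cuda.empty_cache()` before retrying and scaling the learning rate to match the smaller batch.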

GPU not detected

# Check if GPU is allocated
nvidia-smi

# Check CUDA installation
nvcc --version

# Check PyTorch CUDA
python -c "import torch; print(torch.cuda.is_available())"
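Another quick check is the `CUDA_VISIBLE_DEVICES` environment variable, which Slurm typically sets to the GPUs allocated to the job. This is a minimal sketch that assumes the cluster's gres configuration sets the variable; `visible_gpu_ids` is a hypothetical helper name.

```python
import os


def visible_gpu_ids(env=None):
    """Return the device IDs listed in CUDA_VISIBLE_DEVICES, or [] if unset or empty."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    return [item.strip() for item in raw.split(",") if item.strip()]


ids = visible_gpu_ids()
if ids:
    print(f"{len(ids)} GPU(s) assigned to this process: {ids}")
else:
    print("CUDA_VISIBLE_DEVICES is unset or empty; check the --gres request")
```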

Slow GPU performance

  • Ensure data loading isn’t bottlenecked (increase num_workers)

  • Check if model is actually on GPU (.to('cuda'))

  • Profile to find bottlenecks
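When raising num_workers, one reasonable starting point is the number of CPUs that Slurm allocated to the task. The sketch below reads `SLURM_CPUS_PER_TASK`, which Slurm sets when `--cpus-per-task` is given, and keeps one core for the main process. The helper name `dataloader_workers` and the one-core reserve are assumptions, not cluster policy.

```python
import os


def dataloader_workers(env=None, reserve=1):
    """Suggest a DataLoader num_workers value from the job's CPU allocation."""
    env = os.environ if env is None else env
    cpus = int(env.get("SLURM_CPUS_PER_TASK", "1"))
    # Keep `reserve` cores for the main training process, but never go below 1.
    return max(1, cpus - reserve)


print(f"suggested num_workers: {dataloader_workers()}")
```

The result can be passed as `num_workers` to a PyTorch `DataLoader`; profiling should confirm whether more or fewer workers help for a given dataset.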

See Also