GPU Computing

Guide to using GPUs on the Caltech HPC cluster.

Available GPUs

GPU Model	Count	Memory	Best For
NVIDIA H100	16+	80 GB	LLMs, large models
NVIDIA H200	Available	141 GB	Memory-intensive AI
NVIDIA V100	8	32 GB	Deep learning
NVIDIA P100	200	16 GB	General GPU computing
NVIDIA L40s	Available	48 GB	Inference, visualization

Requesting GPUs

Important

Every GPU job must include both:

#SBATCH --partition=gpu — GPU nodes only live on this partition.
#SBATCH --gres=gpu:<type>:<count> — a typed gres. Bare --gres=gpu:N (no type) is rejected by the scheduler.

Valid type tokens (the second field of the gres):

GPU model	gres type token
P100	`p100`
V100	`v100`
H100	`h100`
H200	`nvidia_h200` (note the `nvidia_` prefix)
L40s	`nvidia_l40s` (note the `nvidia_` prefix)
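The mapping from GPU model to gres token is easy to get wrong for the two prefixed types. A minimal sketch of a lookup that builds the directive is shown here; `GRES_TOKENS` and `gres_directive` are hypothetical names introduced for illustration, not part of any cluster tooling.

```python
# Type tokens copied from the table above.
GRES_TOKENS = {
    "P100": "p100",
    "V100": "v100",
    "H100": "h100",
    "H200": "nvidia_h200",
    "L40s": "nvidia_l40s",
}


def gres_directive(model: str, count: int = 1) -> str:
    """Hypothetical helper: return the #SBATCH --gres line for a GPU model."""
    lookup = {name.lower(): token for name, token in GRES_TOKENS.items()}
    key = model.strip().lower()
    if key not in lookup:
        raise ValueError(
            f"unknown GPU model {model!r}; expected one of {sorted(GRES_TOKENS)}"
        )
    if count < 1:
        raise ValueError("count must be at least 1")
    return f"#SBATCH --gres=gpu:{lookup[key]}:{count}"


print(gres_directive("H200", 2))  # prints: #SBATCH --gres=gpu:nvidia_h200:2
```

The printed line still has to be paired with `#SBATCH --partition=gpu` in the job script.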

Single GPU

#SBATCH --partition=gpu
#SBATCH --gres=gpu:p100:1

Multiple GPUs

#SBATCH --partition=gpu
#SBATCH --gres=gpu:p100:4

Specific GPU Type

#SBATCH --partition=gpu
#SBATCH --gres=gpu:h100:1

Basic GPU Job

#!/bin/bash
#SBATCH --job-name=gpu_test
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --gres=gpu:p100:1
#SBATCH --time=01:00:00

module load cuda/12.0

# Verify GPU is available
nvidia-smi

# Run your GPU program
./my_cuda_program

CUDA Programming

Load CUDA

module avail cuda
module load cuda/12.0

Compile CUDA Code

nvcc -o my_program my_program.cu

Check GPU in Code

Python:

import torch

print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU count: {torch.cuda.device_count()}")
# get_device_name raises if no GPU is visible, so guard the call
if torch.cuda.is_available():
    print(f"GPU name: {torch.cuda.get_device_name(0)}")

Command line:

nvidia-smi
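A framework-independent check is also possible. The sketch below assumes that Slurm exports `CUDA_VISIBLE_DEVICES` for the devices granted through `--gres`, which is its usual behavior but has not been confirmed for this cluster. `visible_gpu_ids` is a hypothetical helper name.

```python
import os
from typing import List, Mapping


def visible_gpu_ids(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the device identifiers listed in CUDA_VISIBLE_DEVICES, if any."""
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    return [item.strip() for item in raw.split(",") if item.strip()]


if __name__ == "__main__":
    ids = visible_gpu_ids()
    if ids:
        print(f"{len(ids)} GPU(s) visible to this job: {', '.join(ids)}")
    else:
        # Assumption: inside a GPU job, Slurm would have set the variable.
        print("No GPUs listed in CUDA_VISIBLE_DEVICES; check --partition and --gres")
```

Outside a Slurm job the variable is normally unset, so an empty result there does not mean the machine has no GPU.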

Deep Learning Frameworks

PyTorch

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:p100:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=04:00:00

source ~/miniconda3/etc/profile.d/conda.sh
conda activate pytorch

python train.py

Multi-GPU with PyTorch DDP:

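import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Initialize distributed training (the PyTorch launcher sets LOCAL_RANK per process)
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# `model` is your existing torch.nn.Module; move it to this process's GPU first
model = DDP(model.to(local_rank), device_ids=[local_rank])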

TensorFlow

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:p100:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=04:00:00

module load cuda/11.8
source ~/miniconda3/etc/profile.d/conda.sh
conda activate tensorflow

python train.py

JAX

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:p100:4
#SBATCH --cpus-per-task=32
#SBATCH --mem=128G
#SBATCH --time=12:00:00

source ~/miniconda3/etc/profile.d/conda.sh
conda activate jax

# JAX automatically detects all GPUs
python train_jax.py

Using Containers for Deep Learning

NGC Containers (Recommended)

NVIDIA NGC provides optimized containers:

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:p100:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=04:00:00

module load apptainer

apptainer exec --nv \
    /path/to/pytorch_24.01.sif \
    python train.py

See NVIDIA NGC Guide for setup.

Pull NGC Container

module load apptainer
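# Name the image explicitly; by default it would be saved as pytorch_24.01-py3.sif
apptainer pull pytorch_24.01.sif docker://nvcr.io/nvidia/pytorch:24.01-py3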

Multi-Node GPU Training

For training across multiple nodes:

#!/bin/bash
#SBATCH --job-name=distributed
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=128G
#SBATCH --gres=gpu:p100:4
#SBATCH --time=24:00:00

module load cuda/12.0

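# Get master node address
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# srun starts one torchrun launcher per node; each spawns 4 workers (one per GPU).
# torchrun replaces the deprecated torch.distributed.launch. The c10d rendezvous
# assigns node ranks itself, which avoids passing $SLURM_NODEID (it would be
# expanded once in the batch shell and be 0 on every node).
srun torchrun \
    --nnodes=2 \
    --nproc_per_node=4 \
    --rdzv_id="$SLURM_JOB_ID" \
    --rdzv_backend=c10d \
    --rdzv_endpoint="$MASTER_ADDR:29500" \
    train_distributed.py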
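Inside `train_distributed.py`, each worker needs to know its global rank, its rank on the node, and the total number of workers. The sketch below reads the variables that PyTorch's launcher exports (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`) and falls back to Slurm's per-task variables when the script is started directly by `srun`. `DistInfo` and `read_dist_info` are hypothetical names used only for illustration.

```python
import os
from typing import Mapping, NamedTuple


class DistInfo(NamedTuple):
    rank: int        # global index of this process
    local_rank: int  # index of this process on its node (used to pick a GPU)
    world_size: int  # total number of processes


def read_dist_info(env: Mapping[str, str] = os.environ) -> DistInfo:
    """Hypothetical helper: resolve rank info from launcher or Slurm variables."""
    if "RANK" in env and "WORLD_SIZE" in env:
        # Exported by the PyTorch launcher for every worker it spawns.
        return DistInfo(
            int(env["RANK"]),
            int(env.get("LOCAL_RANK", 0)),
            int(env["WORLD_SIZE"]),
        )
    if "SLURM_PROCID" in env:
        # Exported by srun when each Slurm task is one training process.
        return DistInfo(
            int(env["SLURM_PROCID"]),
            int(env.get("SLURM_LOCALID", 0)),
            int(env.get("SLURM_NTASKS", 1)),
        )
    # Neither launcher is present: assume a single local process.
    return DistInfo(0, 0, 1)
```

With the job above (2 nodes, 4 workers each) the world size would be 8, and `local_rank` selects which of the node's four GPUs a worker uses.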

GPU Memory Management

Check Memory Usage

nvidia-smi
watch -n 1 nvidia-smi  # Update every second
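For logging memory from inside a job, `nvidia-smi` can emit machine-readable output through its `--query-gpu` and `--format` options. The following is a small sketch that wraps that call; the function names are hypothetical, and it assumes every queried field comes back as a plain number.

```python
import subprocess
from typing import List, Tuple


def parse_memory_csv(text: str) -> List[Tuple[int, int]]:
    """Parse 'used, total' lines (values in MiB), one line per GPU."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def gpu_memory() -> List[Tuple[int, int]]:
    """Return (used_MiB, total_MiB) for each visible GPU via nvidia-smi."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_memory_csv(result.stdout)
```

Calling `gpu_memory()` only works on a node where `nvidia-smi` is installed, so run it from within a GPU job rather than on a login node.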

Reducing Memory Usage

PyTorch:

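# Use mixed precision; `model`, `criterion`, `optimizer`, `input`, and `target`
# come from your training loop. Newer PyTorch releases also expose these as
# torch.amp.autocast("cuda") and torch.amp.GradScaler("cuda").
import torch
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()

with autocast():
    output = model(input)
    loss = criterion(output, target)

# Scale the loss so small FP16 gradients do not underflow
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

# Gradient checkpointing (this method is provided by Hugging Face Transformers
# models; for plain PyTorch modules, wrap blocks with torch.utils.checkpoint)
model.gradient_checkpointing_enable()

# Release cached, unused blocks (does not free tensors that are still referenced)
torch.cuda.empty_cache()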

TensorFlow:

import tensorflow as tf

# Allocate GPU memory on demand instead of reserving it all at startup
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

Best Practices

Tip

Match CPUs to GPUs: Request 8-16 CPUs per GPU for data loading.

Use an appropriate GPU type:
- H100/H200: Large language models, big-batch training
- V100/P100: Standard deep learning, molecular dynamics
- L40s: Inference, smaller models

Request appropriate memory (host RAM, set with --mem):
- 32 GB minimum for deep learning
- 64-128 GB for large models

Use containers: NGC containers are optimized and tested.

Enable mixed precision: FP16/BF16 values take 2 bytes instead of the 4 used by FP32, which roughly halves activation memory and often allows a larger batch or model on the same GPU.

Profile your code:
```
nsys profile python train.py
```
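The GPU-type and mixed-precision tips both come down to bytes per value. The sketch below is a rough lower bound that counts model weights only; gradients, optimizer state, and activations add to it. It compares the weight footprint against the memory sizes in the "Available GPUs" table. The 7-billion-parameter model is a hypothetical example, and the function names are illustrative.

```python
# Bytes per element for common training dtypes.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2}

# GPU memory in GB, copied from the "Available GPUs" table.
GPU_MEMORY_GB = {"P100": 16, "V100": 32, "L40s": 48, "H100": 80, "H200": 141}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def gpus_that_fit(n_params: float, dtype: str) -> list:
    """GPU models whose memory exceeds the weight footprint."""
    need = weight_memory_gb(n_params, dtype)
    return [name for name, gb in GPU_MEMORY_GB.items() if gb > need]


# Hypothetical 7-billion-parameter model
print(weight_memory_gb(7e9, "fp32"))  # 28.0
print(weight_memory_gb(7e9, "bf16"))  # 14.0
print(gpus_that_fit(7e9, "fp32"))     # ['V100', 'L40s', 'H100', 'H200']
```

Because training needs several times the weight footprint, a GPU that only just passes this check is better suited to inference than to training.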

Troubleshooting

“CUDA out of memory”

Reduce batch size
Enable gradient checkpointing
Use mixed precision training
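Call torch.cuda.empty_cache() if memory is fragmented (it releases cached blocks but does not free tensors that are still referenced)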
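Reducing the batch size can be automated. The following is a minimal sketch, assuming the training step signals an out-of-memory condition with a `RuntimeError` whose message contains "out of memory" (PyTorch's CUDA OOM error behaves this way). `largest_batch_that_fits` and `try_step` are hypothetical names.

```python
from typing import Callable


def largest_batch_that_fits(
    try_step: Callable[[int], None], start: int, minimum: int = 1
) -> int:
    """Halve the batch size until try_step(batch_size) stops raising OOM."""
    batch_size = start
    while batch_size >= minimum:
        try:
            try_step(batch_size)
            return batch_size
        except RuntimeError as err:
            # Re-raise anything that is not an out-of-memory error.
            if "out of memory" not in str(err).lower():
                raise
            batch_size //= 2
    raise RuntimeError(f"no batch size >= {minimum} fits in GPU memory")
```

Here `try_step` would run one forward and backward pass at the given batch size. In a real PyTorch loop it is worth releasing the failed attempt's tensors before retrying.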

GPU not detected

# Check if GPU is allocated
nvidia-smi

# Check CUDA installation
nvcc --version

# Check PyTorch CUDA
python -c "import torch; print(torch.cuda.is_available())"

Slow GPU performance

Ensure data loading isn’t bottlenecked (increase num_workers)
Check if model is actually on GPU (.to('cuda'))
Profile to find bottlenecks
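For the data-loading item, the worker count can be tied to the job's CPU allocation instead of being hard-coded. The sketch below assumes the job set `--cpus-per-task`, in which case Slurm exports `SLURM_CPUS_PER_TASK`. `suggest_num_workers` is a hypothetical helper, and reserving one CPU for the main process is a design choice rather than a rule.

```python
import os
from typing import Mapping


def suggest_num_workers(env: Mapping[str, str] = os.environ, reserve: int = 1) -> int:
    """Hypothetical helper: allocated CPUs minus a reserve for the main process."""
    cpus = int(env.get("SLURM_CPUS_PER_TASK", "1"))
    # Zero workers means the main process loads the data itself.
    return max(cpus - reserve, 0)
```

With PyTorch, the result would be passed as `DataLoader(dataset, num_workers=suggest_num_workers())`. With `--cpus-per-task=8` that gives 7 workers.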