GPU Computing
Guide to using GPUs on the Caltech HPC cluster.
Available GPUs
| GPU Model | Count | Memory | Best For |
|---|---|---|---|
| NVIDIA H100 | 16+ | 80 GB | LLMs, large models |
| NVIDIA H200 | Available | 141 GB | Memory-intensive AI |
| NVIDIA V100 | 8 | 32 GB | Deep learning |
| NVIDIA P100 | 200 | 16 GB | General GPU computing |
| NVIDIA L40s | Available | 48 GB | Inference, visualization |
Requesting GPUs
Important
Every GPU job must include both of the following:
#SBATCH --partition=gpu (GPU nodes exist only on this partition)
#SBATCH --gres=gpu:<type>:<count> (a typed gres; a bare --gres=gpu:N with no type is rejected by the scheduler)
Valid type tokens (the second field of the gres):
| GPU model | gres type token |
|---|---|
| P100 | p100 |
| V100 | |
| H100 | h100 |
| H200 | |
| L40s | |
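The two requirements above can be checked before a script is submitted. The following is a minimal sketch, not part of the cluster's tooling: the function name `check_gpu_directives` is hypothetical, and it checks only the shape of the directives (partition present, gres typed), not whether the type token is valid on this cluster.

```python
import re

# Matches a typed gres such as "gpu:p100:1"; a bare "gpu:4" has no type field and does not match.
TYPED_GRES = re.compile(r"^gpu:(?P<type>[A-Za-z][A-Za-z0-9_]*):(?P<count>[1-9][0-9]*)$")


def check_gpu_directives(script_text):
    """Return a list of problems found in the #SBATCH lines of a job script."""
    partition = None
    gres = None
    for line in script_text.splitlines():
        line = line.strip()
        if not line.startswith("#SBATCH"):
            continue
        option = line[len("#SBATCH"):].strip()
        if option.startswith("--partition="):
            partition = option.split("=", 1)[1]
        elif option.startswith("--gres="):
            gres = option.split("=", 1)[1]

    problems = []
    if partition != "gpu":
        problems.append("missing --partition=gpu")
    if gres is None:
        problems.append("missing --gres=gpu:<type>:<count>")
    elif not TYPED_GRES.match(gres):
        problems.append(f"gres '{gres}' is not typed (expected gpu:<type>:<count>)")
    return problems


good = "#!/bin/bash\n#SBATCH --partition=gpu\n#SBATCH --gres=gpu:p100:1\n"
bad = "#!/bin/bash\n#SBATCH --partition=gpu\n#SBATCH --gres=gpu:4\n"
print(check_gpu_directives(good))
print(check_gpu_directives(bad))
```

The sketch only recognizes the long `--option=value` form used throughout this guide; short options such as `-p gpu` would need extra handling.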
Single GPU
#SBATCH --partition=gpu
#SBATCH --gres=gpu:p100:1
Multiple GPUs
#SBATCH --partition=gpu
#SBATCH --gres=gpu:p100:4
Specific GPU Type
#SBATCH --partition=gpu
#SBATCH --gres=gpu:h100:1
Basic GPU Job
#!/bin/bash
#SBATCH --job-name=gpu_test
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --gres=gpu:p100:1
#SBATCH --time=01:00:00
module load cuda/12.0
# Verify GPU is available
nvidia-smi
# Run your GPU program
./my_cuda_program
CUDA Programming
Load CUDA
module avail cuda
module load cuda/12.0
Compile CUDA Code
nvcc -o my_program my_program.cu
Check GPU in Code
Python:
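import torch
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU count: {torch.cuda.device_count()}")
# get_device_name raises an error when no GPU is visible, so guard the call
if torch.cuda.is_available():
    print(f"GPU name: {torch.cuda.get_device_name(0)}")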
Command line:
nvidia-smi
Deep Learning Frameworks
PyTorch
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:p100:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=04:00:00
source ~/miniconda3/etc/profile.d/conda.sh
conda activate pytorch
python train.py
Multi-GPU with PyTorch DDP:
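import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
# Initialize distributed training (the launcher is expected to set RANK, WORLD_SIZE and LOCAL_RANK)
dist.init_process_group("nccl")
# Bind this process to its own GPU before wrapping the model
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
model = DDP(model.to(local_rank), device_ids=[local_rank])  # model: your existing torch.nn.Module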
TensorFlow
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:p100:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=04:00:00
module load cuda/11.8
source ~/miniconda3/etc/profile.d/conda.sh
conda activate tensorflow
python train.py
JAX
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:p100:4
#SBATCH --cpus-per-task=32
#SBATCH --mem=128G
#SBATCH --time=12:00:00
source ~/miniconda3/etc/profile.d/conda.sh
conda activate jax
# JAX automatically detects all GPUs
python train_jax.py
Using Containers for Deep Learning
NGC Containers (Recommended)
NVIDIA NGC provides optimized containers:
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --gres=gpu:p100:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=04:00:00
module load apptainer
apptainer exec --nv \
/path/to/pytorch_24.01.sif \
python train.py
See NVIDIA NGC Guide for setup.
Pull NGC Container
module load apptainer
apptainer pull docker://nvcr.io/nvidia/pytorch:24.01-py3
Multi-Node GPU Training
For training across multiple nodes:
#!/bin/bash
#SBATCH --job-name=distributed
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=128G
#SBATCH --gres=gpu:p100:4
#SBATCH --time=24:00:00
module load cuda/12.0
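# Get master node address and export it so the srun tasks inherit it
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
# Single quotes defer expansion of $SLURM_NODEID to each srun task;
# unquoted, the batch shell would expand it once and every node would get the same rank.
# Newer PyTorch releases recommend torchrun in place of torch.distributed.launch.
srun bash -c 'python -m torch.distributed.launch \
--nproc_per_node=4 \
--nnodes=2 \
--node_rank=$SLURM_NODEID \
--master_addr=$MASTER_ADDR \
--master_port=29500 \
train_distributed.py'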
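For orientation, the launcher in this job starts `--nproc_per_node` processes on each node and gives each process a global rank derived from its node rank and its local rank. The arithmetic can be sketched as follows; `world_size` and `global_rank` are illustrative helper names, not PyTorch functions.

```python
def world_size(nnodes, nproc_per_node):
    """Total number of training processes across all nodes."""
    return nnodes * nproc_per_node


def global_rank(node_rank, local_rank, nproc_per_node):
    """Global rank of the process with index local_rank on node node_rank."""
    if not 0 <= local_rank < nproc_per_node:
        raise ValueError("local_rank must be in [0, nproc_per_node)")
    return node_rank * nproc_per_node + local_rank


# Values from the job script above: 2 nodes, 4 processes (one per GPU) per node.
for node in range(2):
    ranks = [global_rank(node, local, 4) for local in range(4)]
    print(f"node {node}: global ranks {ranks}")
print("world size:", world_size(2, 4))
```

If two nodes are launched with the same node rank, their processes claim the same global ranks, so the job typically hangs or fails at startup. This is why each node must receive its own `--node_rank`.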
GPU Memory Management
Check Memory Usage
nvidia-smi
watch -n 1 nvidia-smi # Update every second
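To log memory usage from inside a job rather than watching it interactively, the `nvidia-smi` query mode can be parsed in a few lines. This is a sketch under stated assumptions: the helper names are hypothetical, and the sample row is illustrative rather than real cluster output.

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=index,name,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_memory(csv_text):
    """Parse nvidia-smi CSV rows into dicts; memory values are in MiB."""
    gpus = []
    for row in csv_text.strip().splitlines():
        index, name, used, total = [field.strip() for field in row.split(",")]
        gpus.append({"index": int(index), "name": name,
                     "used_mib": int(used), "total_mib": int(total)})
    return gpus


def query_gpu_memory():
    """Run nvidia-smi on the current node and return the parsed rows."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


# Illustrative sample row (not real cluster output), so the parser can be tried anywhere.
sample = "0, Tesla P100-PCIE-16GB, 1024, 16384\n"
for gpu in parse_gpu_memory(sample):
    fraction = gpu["used_mib"] / gpu["total_mib"]
    print(f"GPU {gpu['index']} ({gpu['name']}): {fraction:.0%} of memory in use")
```

On a GPU node, calling `query_gpu_memory()` returns one entry per GPU visible to the job.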
Reducing Memory Usage
PyTorch:
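import torch
from torch.cuda.amp import autocast, GradScaler
# Use mixed precision
scaler = GradScaler()
with autocast():
    output = model(input)
# Gradient checkpointing (this method exists on Hugging Face Transformers models;
# for a plain nn.Module, use torch.utils.checkpoint instead)
model.gradient_checkpointing_enable()
# Release cached, unused memory blocks (does not free tensors that are still referenced)
torch.cuda.empty_cache()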
TensorFlow:
import tensorflow as tf
# Allocate GPU memory on demand instead of reserving it all at startup
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
Best Practices
Tip
Match CPUs to GPUs: Request 8-16 CPUs per GPU for data loading.
Use appropriate GPU type:
H100/H200: Large language models, big batch training
V100/P100: Standard deep learning, molecular dynamics
L40s: Inference, smaller models
Request appropriate host memory (--mem):
32 GB minimum for deep learning
64-128 GB for large models
Use containers: NGC containers are optimized and tested
Enable mixed precision: FP16/BF16 values take half the bytes of FP32, which reduces activation memory and lets larger batches or models fit on the same GPU
Profile your code:
nsys profile python train.py
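To make the GPU-type and mixed-precision guidance above concrete, here is a rough sizing sketch. It counts model weights only and ignores activations, gradients and optimizer state, which usually add substantially to the footprint during training, so treat the result as a lower bound. The helper names and the 7-billion-parameter example are hypothetical; the GPU memory sizes are the ones listed under Available GPUs.

```python
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "bf16": 2}

# GPU memory sizes in GB, as listed under Available GPUs.
GPU_MEMORY_GB = {"P100": 16, "V100": 32, "L40s": 48, "H100": 80, "H200": 141}


def weight_memory_gb(num_parameters, dtype):
    """Memory needed to hold the model weights alone, in GB (10**9 bytes)."""
    return num_parameters * BYTES_PER_ELEMENT[dtype] / 1e9


def gpus_that_fit(num_parameters, dtype):
    """GPU models whose memory exceeds the weight footprint (necessary, not sufficient)."""
    needed = weight_memory_gb(num_parameters, dtype)
    return [gpu for gpu, size in GPU_MEMORY_GB.items() if size > needed]


params = 7_000_000_000  # hypothetical 7-billion-parameter model
for dtype in ("fp32", "bf16"):
    print(dtype, weight_memory_gb(params, dtype), "GB ->", gpus_that_fit(params, dtype))
```

Halving the bytes per element halves the weight footprint, which is the effect the mixed-precision tip relies on.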
Troubleshooting
“CUDA out of memory”
Reduce batch size
Enable gradient checkpointing
Use mixed precision training
Clear cache between batches
GPU not detected
# Check if GPU is allocated
nvidia-smi
# Check CUDA installation
nvcc --version
# Check PyTorch CUDA
python -c "import torch; print(torch.cuda.is_available())"
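A further check is whether the scheduler exposed any GPU to the job at all. Slurm commonly exports `CUDA_VISIBLE_DEVICES` for jobs that were granted a GPU gres; that this cluster does so is an assumption here, and `visible_gpu_ids` is a hypothetical helper name.

```python
import os


def visible_gpu_ids(environ=None):
    """Return the device IDs listed in CUDA_VISIBLE_DEVICES, or None if it is unset."""
    environ = os.environ if environ is None else environ
    value = environ.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    return [item.strip() for item in value.split(",") if item.strip()]


ids = visible_gpu_ids()
if ids is None:
    print("CUDA_VISIBLE_DEVICES is unset: the job may not have requested a GPU")
elif not ids:
    print("CUDA_VISIBLE_DEVICES is empty: no GPU is visible to this process")
else:
    print(f"{len(ids)} GPU(s) visible: {', '.join(ids)}")
```

If the variable is unset or empty inside a job, recheck the `--partition` and `--gres` lines of the job script before debugging the framework.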
Slow GPU performance
Ensure data loading isn’t bottlenecked (increase num_workers)
Check if model is actually on GPU (.to('cuda'))
Profile to find bottlenecks
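One way to choose `num_workers` is to derive it from the CPUs the job actually received, rather than hard-coding it. The sketch below reads `SLURM_CPUS_PER_TASK`, which Slurm sets when `--cpus-per-task` is given; the helper name and the choice to reserve one CPU for the main process are assumptions, not cluster policy.

```python
import os


def dataloader_workers(environ=None, reserve=1, default=1):
    """Pick a DataLoader worker count from the CPUs Slurm granted to this task."""
    environ = os.environ if environ is None else environ
    cpus = environ.get("SLURM_CPUS_PER_TASK")
    if cpus is None:
        return default  # not inside a Slurm job that set --cpus-per-task
    # Leave `reserve` CPUs for the main training process.
    return max(1, int(cpus) - reserve)


print("num_workers =", dataloader_workers())
```

The result can then be passed as the `num_workers` argument of `torch.utils.data.DataLoader`.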