AI & Machine Learning
Guide to running AI/ML workloads on the Caltech HPC cluster.
Quick Start for ML Researchers
Tip
New to the cluster? Complete the Quick Start Guide first.
Recommended Setup
Use H100 GPUs for training large models
Install Conda in your group directory (not home)
Use containers for complex dependencies
Store data on /resnick/scratch during training
Environment Setup
Option 1: Conda (Recommended)
Install Miniconda in your group directory:
cd /resnick/groups/<your-group>/$USER
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p ./miniconda3
source miniconda3/etc/profile.d/conda.sh
Create a PyTorch environment:
conda create -n torch python=3.11
conda activate torch
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
Create a TensorFlow environment:
conda create -n tf python=3.11
conda activate tf
pip install "tensorflow[and-cuda]"
Option 2: NGC Containers
Pre-built, optimized containers from NVIDIA:
module load apptainer
# Pull PyTorch container
apptainer pull docker://nvcr.io/nvidia/pytorch:24.01-py3
# Run interactively
srun --partition=gpu --gres=gpu:p100:1 --time=1:00:00 --pty \
apptainer shell --nv pytorch_24.01-py3.sif
Common ML Tasks
Training a Model
#!/bin/bash
#SBATCH --job-name=train
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --gres=gpu:h100:1
#SBATCH --time=24:00:00
#SBATCH --mail-user=<your-email>
#SBATCH --mail-type=END,FAIL
source /resnick/groups/mygroup/$USER/miniconda3/etc/profile.d/conda.sh
conda activate torch
# Let the CUDA caching allocator grow segments to reduce memory fragmentation
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
python train.py \
--data /resnick/scratch/$USER/dataset \
--output /resnick/groups/mygroup/$USER/models \
--epochs 100
Hyperparameter Tuning
Use job arrays for parallel experiments:
#!/bin/bash
#SBATCH --job-name=hparam
#SBATCH --partition=gpu
#SBATCH --array=0-19
#SBATCH --nodes=1
#SBATCH --gres=gpu:p100:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=04:00:00
source /resnick/groups/mygroup/$USER/miniconda3/etc/profile.d/conda.sh
conda activate torch
# Define hyperparameter grid
LRS=(0.001 0.0001 0.00001 0.000001)
BATCH_SIZES=(16 32 64 128 256)
# Calculate indices
LR_IDX=$((SLURM_ARRAY_TASK_ID / 5))
BS_IDX=$((SLURM_ARRAY_TASK_ID % 5))
python train.py \
--lr ${LRS[$LR_IDX]} \
--batch-size ${BATCH_SIZES[$BS_IDX]} \
--output results/exp_${SLURM_ARRAY_TASK_ID}
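The index arithmetic above can also live in Python, which makes it easy to check that every array task maps to a distinct grid point. The following is a minimal sketch, not part of the cluster tooling: `grid_point` is a hypothetical helper, and it assumes the same 4 x 5 grid and `--array=0-19` range as the script.

```python
import os

# Same grid as the shell script: 4 learning rates x 5 batch sizes = 20 tasks
LRS = [1e-3, 1e-4, 1e-5, 1e-6]
BATCH_SIZES = [16, 32, 64, 128, 256]


def grid_point(task_id):
    """Map a Slurm array task ID to a (learning rate, batch size) pair."""
    if not 0 <= task_id < len(LRS) * len(BATCH_SIZES):
        raise ValueError(f"task_id {task_id} is outside the grid")
    # Quotient selects the learning rate, remainder selects the batch size
    lr_idx, bs_idx = divmod(task_id, len(BATCH_SIZES))
    return LRS[lr_idx], BATCH_SIZES[bs_idx]


if __name__ == "__main__":
    # Slurm sets SLURM_ARRAY_TASK_ID inside array jobs; default to 0 elsewhere
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    lr, batch_size = grid_point(task_id)
    print(f"task {task_id}: lr={lr} batch_size={batch_size}")
```

If the grid changes size, the `--array` range has to change with it; the range check in `grid_point` catches a mismatch early instead of silently reusing a configuration.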
Inference / Batch Prediction
#!/bin/bash
#SBATCH --job-name=inference
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:p100:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=02:00:00
source /resnick/groups/mygroup/$USER/miniconda3/etc/profile.d/conda.sh
conda activate torch
python inference.py \
--model /resnick/groups/mygroup/models/best.pt \
--input /resnick/scratch/$USER/test_data \
--output /resnick/scratch/$USER/predictions
Large Language Models
Running Hugging Face Models
#!/bin/bash
#SBATCH --job-name=llm
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:h100:1
#SBATCH --cpus-per-task=16
#SBATCH --mem=128G
#SBATCH --time=08:00:00
source /resnick/groups/mygroup/$USER/miniconda3/etc/profile.d/conda.sh
conda activate torch
# Cache models in group directory
export HF_HOME=/resnick/groups/mygroup/$USER/huggingface
python -c "
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
'meta-llama/Llama-2-7b-hf',
torch_dtype=torch.float16,
device_map='auto'
)
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')
# Your inference code here
"
Fine-tuning with LoRA
Memory-efficient fine-tuning:
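from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
# Load the base model with 8-bit weights (requires the bitsandbytes package)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)
# Prepare the quantized model for training before attaching adapters
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update
    lora_alpha=32,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.1,
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
# Report how many parameters are trainable
model.print_trainable_parameters()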
Distributed Training
Multi-GPU (Single Node)
PyTorch DataParallel (simple):
model = torch.nn.DataParallel(model)
PyTorch DDP (better performance):
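import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun for each process
torch.cuda.set_device(local_rank)           # bind this process to its own GPU
model = DDP(model.to(local_rank), device_ids=[local_rank])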
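Each DDP process needs to know its position in the job. When processes are launched with `torchrun`, that information arrives through the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables. The sketch below reads them with the standard library only; `DistEnv` and `read_dist_env` are hypothetical names, and the single-process defaults are an assumption that lets the same script run without a launcher.

```python
import os
from typing import Mapping, NamedTuple


class DistEnv(NamedTuple):
    rank: int        # global index of this process across all nodes
    local_rank: int  # index of this process on its node (selects the GPU)
    world_size: int  # total number of processes in the job


def read_dist_env(environ: Mapping[str, str] = os.environ) -> DistEnv:
    """Read the rank variables torchrun exports, defaulting to one process."""
    return DistEnv(
        rank=int(environ.get("RANK", "0")),
        local_rank=int(environ.get("LOCAL_RANK", "0")),
        world_size=int(environ.get("WORLD_SIZE", "1")),
    )


if __name__ == "__main__":
    env = read_dist_env()
    print(f"rank {env.rank} of {env.world_size}, local GPU {env.local_rank}")
```

A common use is to restrict logging and checkpoint writes to `rank == 0` so that multiple processes do not write the same files.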
Multi-Node Training
#!/bin/bash
#SBATCH --job-name=distributed
#SBATCH --partition=gpu
#SBATCH --nodes=4
# One task per node: torchrun itself spawns one worker per GPU
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=256G
#SBATCH --gres=gpu:p100:4
#SBATCH --time=48:00:00
module load cuda/12.0
source /resnick/groups/mygroup/$USER/miniconda3/etc/profile.d/conda.sh
conda activate torch
# Use torchrun for distributed training
srun torchrun \
--nnodes=$SLURM_NNODES \
--nproc_per_node=4 \
--rdzv_id=$SLURM_JOB_ID \
--rdzv_backend=c10d \
--rdzv_endpoint=$(hostname):29500 \
train_distributed.py
Data Management
Efficient Data Loading
from torch.utils.data import DataLoader
dataloader = DataLoader(
dataset,
batch_size=32,
shuffle=True,
num_workers=8, # Match cpus-per-task
pin_memory=True, # Faster GPU transfer
prefetch_factor=2,
persistent_workers=True
)
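Hard-coding `num_workers` makes it easy to drift out of sync with the `--cpus-per-task` value in the job script. One option is to derive it from the `SLURM_CPUS_PER_TASK` variable that Slurm exports when `--cpus-per-task` is set. The sketch below is illustrative: `dataloader_workers` is a hypothetical helper, and the fallback of 4 workers outside Slurm is an arbitrary assumption.

```python
import os


def dataloader_workers(environ=os.environ, reserve=0, fallback=4):
    """Pick num_workers from SLURM_CPUS_PER_TASK.

    `reserve` holds back cores for the main training process if desired.
    """
    cpus = environ.get("SLURM_CPUS_PER_TASK")
    if cpus is None:
        return fallback  # not inside a Slurm allocation with --cpus-per-task
    return max(1, int(cpus) - reserve)


if __name__ == "__main__":
    print(f"num_workers={dataloader_workers()}")
```

The result can be passed directly as `num_workers=dataloader_workers()` when constructing the `DataLoader`.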
Storage Recommendations
| Data Type | Location | Reason |
|---|---|---|
| Raw datasets | | Persistent, shared |
| Training data | | Fast I/O |
| Checkpoints | | Persistent |
| Logs/metrics | | Small files |
| Model cache | | Shared, persistent |
Experiment Tracking
Weights & Biases
pip install wandb
wandb login # One-time setup
import wandb
wandb.init(project="my-project")
wandb.config.learning_rate = 0.001
# Log metrics
wandb.log({"loss": loss, "accuracy": acc})
TensorBoard
#!/bin/bash
#SBATCH --job-name=tensorboard
#SBATCH --nodes=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --time=08:00:00
module load python3
tensorboard --logdir=/resnick/groups/mygroup/runs --port=6006
Access via Open OnDemand or SSH tunnel.
Best Practices
Start small: Test on a single GPU before scaling
Use mixed precision: torch.cuda.amp or tf.keras.mixed_precision
Profile your code: Identify bottlenecks before scaling
Checkpoint frequently: Save model state every N epochs
Log everything: Track hyperparameters, metrics, and system stats
Version your code: Use git tags for reproducibility
Document dependencies: Export your environment with conda env export > environment.yml
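For the checkpointing practice in particular, two details matter on a shared cluster: saving on a fixed schedule, and never leaving a half-written file if the job is killed mid-save. The sketch below shows one way to do both. `should_checkpoint` and `save_atomically` are hypothetical helpers, epochs are assumed to be counted from 1, and JSON stands in for the real serializer (a training script would typically write a binary file with its framework's save function instead).

```python
import json
import os
import tempfile


def should_checkpoint(epoch, every, total_epochs):
    """Save every `every` epochs and always after the final epoch."""
    return epoch % every == 0 or epoch == total_epochs


def save_atomically(state, path):
    """Write to a temp file in the target directory, then rename over `path`.

    If the job dies during the write, the previous checkpoint stays intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)  # stand-in for the framework's save call
        os.replace(tmp_path, path)  # atomic rename on the same filesystem
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial temp file
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        target = os.path.join(workdir, "checkpoint.json")
        for epoch in range(1, 11):
            if should_checkpoint(epoch, every=4, total_epochs=10):
                save_atomically({"epoch": epoch}, target)
                print(f"saved checkpoint at epoch {epoch}")
```

The temp file is created in the same directory as the target because a rename is only atomic within a single filesystem.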
Common Issues
Out of Memory
Reduce batch size
Use gradient accumulation
Enable mixed precision (FP16/BF16)
Use gradient checkpointing
Try DeepSpeed ZeRO
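Gradient accumulation works because the gradient of a mean loss over a large batch equals the average of the gradients over equal-sized micro-batches. The sketch below checks that identity in plain Python on a toy one-parameter model. It is a numerical illustration rather than training code; `mse_grad` and `accumulated_grad` are hypothetical helpers and the data points are made up.

```python
def mse_grad(w, xs, ys):
    """Gradient w.r.t. w of the mean squared error for y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch):
    """Average gradients over equal-sized micro-batches, one at a time."""
    if len(xs) % micro_batch != 0:
        raise ValueError("batch size must be a multiple of micro_batch")
    steps = len(xs) // micro_batch
    total = 0.0
    for i in range(0, len(xs), micro_batch):
        # Dividing by `steps` mirrors scaling the loss by 1 / accumulation_steps
        total += mse_grad(w, xs[i:i + micro_batch], ys[i:i + micro_batch]) / steps
    return total


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
    full = mse_grad(0.5, xs, ys)
    accumulated = accumulated_grad(0.5, xs, ys, micro_batch=2)
    print(f"full batch: {full:.6f}  accumulated: {accumulated:.6f}")
```

In a real training loop the same effect comes from dividing each micro-batch loss by the number of accumulation steps and calling the optimizer step only once per accumulated batch, so only one micro-batch of activations is held in GPU memory at a time.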
Slow Training
Increase num_workers in DataLoader
Use faster storage (/resnick/scratch)
Profile to find bottlenecks
Check GPU utilization with nvidia-smi
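GPU utilization can also be sampled from inside a job rather than read by eye. The sketch below queries `nvidia-smi` in CSV mode and parses the result. `parse_gpu_stats` and `read_gpu_stats` are hypothetical helpers, the sample output is made up, and the 50% threshold used to flag a possibly input-bound GPU is an arbitrary assumption, not a cluster guideline.

```python
import subprocess

# CSV query mode prints one line per GPU, e.g. "97, 40213"
QUERY = ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"]


def parse_gpu_stats(text):
    """Parse CSV lines of 'utilization %, memory MiB' into a list of dicts."""
    stats = []
    for line in text.strip().splitlines():
        util, mem = (field.strip() for field in line.split(","))
        stats.append({"util_pct": int(util), "mem_used_mib": int(mem)})
    return stats


def read_gpu_stats():
    """Run nvidia-smi on the current node and return one entry per GPU."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(out.stdout)


if __name__ == "__main__":
    sample = "97, 40213\n3, 512\n"  # made-up output for two GPUs
    for i, gpu in enumerate(parse_gpu_stats(sample)):
        flag = "  <- possibly input-bound" if gpu["util_pct"] < 50 else ""
        print(f"GPU {i}: {gpu['util_pct']}% util, {gpu['mem_used_mib']} MiB{flag}")
```

On a GPU node, calling `read_gpu_stats()` periodically during training gives a quick signal of whether the data pipeline is keeping the GPU busy.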
NCCL Errors (Multi-GPU)
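# Print detailed NCCL logs to help locate the failing rank or transport
export NCCL_DEBUG=INFO
# 0 keeps InfiniBand enabled (the default); set to 1 to test without it
export NCCL_IB_DISABLE=0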