AI & Machine Learning

Guide to running AI/ML workloads on the Caltech HPC cluster.

Quick Start for ML Researchers

Tip

New to the cluster? Complete the Quick Start Guide first.

Recommended Setup

Use H100 GPUs for training large models
Install Conda in your group directory (not home)
Use containers for complex dependencies
Store data on /resnick/scratch during training

Environment Setup

Option 1: Conda (Recommended)

Install Miniconda in your group directory:

cd /resnick/groups/<your-group>/$USER
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p ./miniconda3
source miniconda3/etc/profile.d/conda.sh

Create a PyTorch environment:

conda create -n torch python=3.11
conda activate torch
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia

Create a TensorFlow environment:

conda create -n tf python=3.11
conda activate tf
pip install "tensorflow[and-cuda]"

Option 2: NGC Containers

Pre-built, optimized containers from NVIDIA:

module load apptainer

# Pull PyTorch container
apptainer pull docker://nvcr.io/nvidia/pytorch:24.01-py3

# Run interactively
srun --partition=gpu --gres=gpu:p100:1 --time=1:00:00 --pty \
    apptainer shell --nv pytorch_24.01-py3.sif

Common ML Tasks

Training a Model

#!/bin/bash
#SBATCH --job-name=train
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --gres=gpu:h100:1
#SBATCH --time=24:00:00
#SBATCH --mail-user=<your-email>
#SBATCH --mail-type=END,FAIL

source /resnick/groups/mygroup/$USER/miniconda3/etc/profile.d/conda.sh
conda activate torch

# Reduce CUDA memory fragmentation with expandable allocator segments
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

python train.py \
    --data /resnick/scratch/$USER/dataset \
    --output /resnick/groups/mygroup/$USER/models \
    --epochs 100

Hyperparameter Tuning

Use job arrays for parallel experiments:

#!/bin/bash
#SBATCH --job-name=hparam
#SBATCH --partition=gpu
#SBATCH --array=0-19
#SBATCH --nodes=1
#SBATCH --gres=gpu:p100:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=04:00:00

source /resnick/groups/mygroup/$USER/miniconda3/etc/profile.d/conda.sh
conda activate torch

# Define hyperparameter grid
LRS=(0.001 0.0001 0.00001 0.000001)
BATCH_SIZES=(16 32 64 128 256)

# Calculate indices
LR_IDX=$((SLURM_ARRAY_TASK_ID / 5))
BS_IDX=$((SLURM_ARRAY_TASK_ID % 5))

python train.py \
    --lr ${LRS[$LR_IDX]} \
    --batch-size ${BATCH_SIZES[$BS_IDX]} \
    --output results/exp_${SLURM_ARRAY_TASK_ID}
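
The index arithmetic above can also live inside the training script, which keeps the grid in one place. A minimal Python sketch, assuming the same 4 x 5 grid; the function name `grid_point` is hypothetical, and `SLURM_ARRAY_TASK_ID` is the variable Slurm sets for each array task:

```python
import os

# Same grid as the batch script above: 4 learning rates x 5 batch sizes = 20 tasks
LRS = [0.001, 0.0001, 0.00001, 0.000001]
BATCH_SIZES = [16, 32, 64, 128, 256]


def grid_point(task_id: int) -> tuple[float, int]:
    """Map an array task ID (0-19) to one (learning rate, batch size) pair."""
    if not 0 <= task_id < len(LRS) * len(BATCH_SIZES):
        raise ValueError(f"task_id {task_id} is outside the grid")
    # Quotient selects the learning rate, remainder selects the batch size
    lr_idx, bs_idx = divmod(task_id, len(BATCH_SIZES))
    return LRS[lr_idx], BATCH_SIZES[bs_idx]


if __name__ == "__main__":
    # Default to task 0 so the script also runs outside a job array
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    lr, batch_size = grid_point(task_id)
    print(f"task {task_id}: lr={lr}, batch_size={batch_size}")
```

Raising on an out-of-range ID catches a mismatch between `--array` and the grid size before any GPU time is spent.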

Inference / Batch Prediction

#!/bin/bash
#SBATCH --job-name=inference
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:p100:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=02:00:00

source /resnick/groups/mygroup/$USER/miniconda3/etc/profile.d/conda.sh
conda activate torch

python inference.py \
    --model /resnick/groups/mygroup/$USER/models/best.pt \
    --input /resnick/scratch/$USER/test_data \
    --output /resnick/scratch/$USER/predictions

Large Language Models

Running Hugging Face Models

#!/bin/bash
#SBATCH --job-name=llm
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:h100:1
#SBATCH --cpus-per-task=16
#SBATCH --mem=128G
#SBATCH --time=08:00:00

source /resnick/groups/mygroup/$USER/miniconda3/etc/profile.d/conda.sh
conda activate torch

# Cache models in group directory
export HF_HOME=/resnick/groups/mygroup/$USER/huggingface

python -c "
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Llama-2-7b-hf',
    torch_dtype=torch.float16,
    device_map='auto'
)
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')

# Your inference code here
"

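Whether a model fits on one GPU depends mostly on its parameter count and precision. A rough back-of-the-envelope sketch that counts weights only (activations, the KV cache, and optimizer state add more); the helper name is hypothetical and the 7-billion parameter count is a round figure used for illustration:

```python
def weight_memory_gib(n_params: float, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


if __name__ == "__main__":
    n_params = 7e9  # Round figure for a "7B" model (assumption for illustration)
    for name, nbytes in [("float32", 4), ("float16/bfloat16", 2), ("int8", 1)]:
        print(f"{name:>17}: {weight_memory_gib(n_params, nbytes):.1f} GiB")
```

Halving the precision halves the weight footprint, which is why the script above loads the model in `float16`.
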
Fine-tuning with LoRA

Memory-efficient fine-tuning:

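from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the base model in 8-bit (requires the bitsandbytes package)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                  # Rank of the low-rank update
    lora_alpha=32,                         # Scaling factor
    target_modules=["q_proj", "v_proj"],   # Attention projections to adapt
    lora_dropout=0.1,
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()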
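
To see why LoRA is memory-efficient, count the parameters it trains. Each adapted linear layer with input size d_in and output size d_out gains two low-rank matrices holding r * (d_in + d_out) parameters in total. The sketch below assumes Llama-2-7B's published shape (hidden size 4096, 32 layers) with square `q_proj` and `v_proj` projections; the function name is hypothetical:

```python
def lora_trainable_params(r: int, d_in: int, d_out: int,
                          modules_per_layer: int, n_layers: int) -> int:
    """Parameters added by LoRA: A is (r x d_in) and B is (d_out x r) per module."""
    per_module = r * (d_in + d_out)
    return per_module * modules_per_layer * n_layers


if __name__ == "__main__":
    # r=16 on q_proj and v_proj, matching the LoraConfig above
    n = lora_trainable_params(r=16, d_in=4096, d_out=4096,
                              modules_per_layer=2, n_layers=32)
    print(f"trainable LoRA parameters: {n:,}")  # 8,388,608
```

That is roughly 0.1% of the base model's approximately 7 billion weights, so gradients and optimizer state stay small.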

Distributed Training

Multi-GPU (Single Node)

PyTorch DataParallel (simple):

model = torch.nn.DataParallel(model)

PyTorch DDP (better performance):

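import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # Set by torchrun for each process
torch.cuda.set_device(local_rank)
model = DDP(model.to(local_rank), device_ids=[local_rank])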

Multi-Node Training

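#!/bin/bash
#SBATCH --job-name=distributed
#SBATCH --partition=gpu
#SBATCH --nodes=4
# One torchrun launcher per node; torchrun spawns the 4 GPU workers itself
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=256G
#SBATCH --gres=gpu:p100:4
#SBATCH --time=48:00:00

module load cuda/12.0
source /resnick/groups/mygroup/$USER/miniconda3/etc/profile.d/conda.sh
conda activate torch

# Rendezvous on the first node of the allocation
HEAD_NODE=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# Use torchrun for distributed training
srun torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=4 \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$HEAD_NODE:29500 \
    train_distributed.py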
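
Inside `train_distributed.py`, each worker learns its place in the job from environment variables that torchrun sets. A minimal sketch of reading them; the `DistEnv` dataclass and `read_dist_env` helper are hypothetical names, and the defaults let the same script run as a single process outside torchrun:

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DistEnv:
    rank: int         # Global rank across all nodes
    local_rank: int   # Rank within this node; use it to pick the GPU
    world_size: int   # Total number of worker processes


def read_dist_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Read the RANK, LOCAL_RANK, and WORLD_SIZE variables set by torchrun."""
    return DistEnv(
        rank=int(env.get("RANK", "0")),
        local_rank=int(env.get("LOCAL_RANK", "0")),
        world_size=int(env.get("WORLD_SIZE", "1")),
    )


if __name__ == "__main__":
    dist_env = read_dist_env()
    # With 4 nodes and 4 processes per node, world_size would be 16
    print(f"rank {dist_env.rank} of {dist_env.world_size}, "
          f"local GPU {dist_env.local_rank}")
```

Restricting log output and checkpoint writes to `rank == 0` avoids every worker writing the same file.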

Data Management

Efficient Data Loading

from torch.utils.data import DataLoader

dataloader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=8,        # Match cpus-per-task
    pin_memory=True,      # Faster GPU transfer
    prefetch_factor=2,
    persistent_workers=True
)
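
Hard-coding `num_workers` is easy to get out of sync with the job script. One option is to derive it from `SLURM_CPUS_PER_TASK`, which Slurm sets when `--cpus-per-task` is given. The helper below is a hypothetical sketch that only builds keyword arguments; it drops `prefetch_factor` and `persistent_workers` when there are no worker processes, since PyTorch rejects them in that case:

```python
import os
from typing import Any, Mapping


def loader_kwargs(env: Mapping[str, str] = os.environ,
                  reserve: int = 0) -> dict[str, Any]:
    """Build DataLoader keyword arguments from the job's CPU allocation.

    `reserve` CPUs are held back for the main training process.
    """
    cpus = int(env.get("SLURM_CPUS_PER_TASK", os.cpu_count() or 1))
    workers = max(cpus - reserve, 0)
    kwargs: dict[str, Any] = {"num_workers": workers, "pin_memory": True}
    if workers > 0:
        # These two options are only valid with worker processes
        kwargs["prefetch_factor"] = 2
        kwargs["persistent_workers"] = True
    return kwargs


if __name__ == "__main__":
    print(loader_kwargs())
    # Usage: DataLoader(dataset, batch_size=32, shuffle=True, **loader_kwargs())
```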

Storage Recommendations

Data Type	Location	Reason
Raw datasets	`/resnick/groups/`	Persistent, shared
Training data	`/resnick/scratch/`	Fast I/O
Checkpoints	`/resnick/groups/`	Persistent
Logs/metrics	`/home/` or `/resnick/groups/`	Small files
Model cache	`/resnick/groups/`	Shared, persistent
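
Following the table above, a common pattern is to keep the raw dataset in the group directory and copy it to scratch before training. A minimal sketch; the function name and the example paths are hypothetical placeholders, and the marker file is just one way to detect a finished copy:

```python
import os
import shutil
from pathlib import Path


def stage_to_scratch(src: Path, dst: Path) -> Path:
    """Copy a dataset directory to scratch once; later calls reuse the copy."""
    marker = dst / ".staged"
    if marker.exists():
        return dst  # A previous job already finished the copy
    if dst.exists():
        shutil.rmtree(dst)  # Remove a partial copy left by an interrupted job
    shutil.copytree(src, dst)
    marker.touch()  # Written last, so it only exists after a complete copy
    return dst


if __name__ == "__main__":
    user = os.environ.get("USER", "user")
    # Placeholder paths: substitute your own group and dataset names
    src = Path(f"/resnick/groups/mygroup/{user}/dataset")
    dst = Path(f"/resnick/scratch/{user}/dataset")
    if src.is_dir():
        print("dataset staged at", stage_to_scratch(src, dst))
```

If several jobs may start at the same time, it is safer to stage the data in a separate job first, since this sketch does no locking.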

Experiment Tracking

Weights & Biases

pip install wandb
wandb login  # One-time setup

import wandb

wandb.init(project="my-project")
wandb.config.learning_rate = 0.001

# Log metrics
wandb.log({"loss": loss, "accuracy": acc})

TensorBoard

#!/bin/bash
#SBATCH --job-name=tensorboard
#SBATCH --nodes=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --time=08:00:00

module load python3
tensorboard --logdir=/resnick/groups/mygroup/runs --port=6006

Access the TensorBoard UI on port 6006 through Open OnDemand or an SSH tunnel.

Best Practices

Start small: Test on a single GPU before scaling
Use mixed precision: torch.amp (torch.cuda.amp in older PyTorch releases) or tf.keras.mixed_precision
Profile your code: Identify bottlenecks before scaling
Checkpoint frequently: Save model state every N epochs
Log everything: Track hyperparameters, metrics, and system stats
Version your code: Use git tags for reproducibility
Document dependencies: Export the environment with conda env export > environment.yml
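
For the checkpointing advice above, a job that is killed mid-write should not leave a corrupt checkpoint behind. A minimal sketch of an atomic save and resume; it uses `pickle` so that it runs anywhere, whereas a PyTorch script would typically call `torch.save` and `torch.load` at the marked lines. The function names and the state layout are hypothetical:

```python
import os
import pickle
import tempfile
from pathlib import Path
from typing import Any, Optional


def save_checkpoint(state: dict[str, Any], path: Path) -> None:
    """Write to a temporary file, then rename, so `path` is never half-written."""
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(path.suffix + ".tmp")
    with open(tmp, "wb") as f:
        pickle.dump(state, f)  # In PyTorch: torch.save(state, f)
    os.replace(tmp, path)  # Atomic when both paths are on the same filesystem


def load_checkpoint(path: Path) -> Optional[dict[str, Any]]:
    """Return the saved state, or None when starting from scratch."""
    if not path.exists():
        return None
    with open(path, "rb") as f:
        return pickle.load(f)  # In PyTorch: torch.load(f)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ckpt = Path(tmp) / "last.pkl"
        save_checkpoint({"epoch": 2}, ckpt)  # Pretend a job stopped after epoch 2
        state = load_checkpoint(ckpt) or {"epoch": 0}
        print("resuming from epoch", state["epoch"])  # resuming from epoch 2
```

In a real run the checkpoint path would point at the group directory, in line with the storage recommendations.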

Common Issues

Out of Memory

Reduce batch size
Use gradient accumulation
Enable mixed precision (FP16/BF16)
Use gradient checkpointing
Try DeepSpeed ZeRO
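
Gradient accumulation lets a smaller per-GPU batch reproduce a larger effective batch: the effective batch size is the per-GPU batch times the accumulation steps times the number of GPUs. A small sketch of that arithmetic, with hypothetical names:

```python
import math


def accumulation_steps(target_batch: int, per_gpu_batch: int, n_gpus: int = 1) -> int:
    """Steps to accumulate so per_gpu_batch * steps * n_gpus reaches target_batch."""
    if min(target_batch, per_gpu_batch, n_gpus) < 1:
        raise ValueError("all arguments must be positive")
    return math.ceil(target_batch / (per_gpu_batch * n_gpus))


if __name__ == "__main__":
    # Example: a batch of 256 does not fit, but 32 samples per GPU does
    steps = accumulation_steps(target_batch=256, per_gpu_batch=32, n_gpus=1)
    print(f"accumulate {steps} steps of 32 samples")  # accumulate 8 steps of 32 samples
```

In the training loop, divide each loss by the step count and call `optimizer.step()` only once every `steps` iterations.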

Slow Training

Increase num_workers in DataLoader
Use faster storage (/resnick/scratch)
Profile to find bottlenecks
Check GPU utilization with nvidia-smi
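
To check utilization from inside a job rather than by eye, `nvidia-smi` can emit CSV through its `--query-gpu` and `--format` options. The sketch below parses that output; the function names are hypothetical, and the sample string in the demo is made-up data, not output from the cluster:

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_stats(csv_text: str) -> list[dict[str, int]]:
    """Parse one CSV line per GPU into a dict of integer fields."""
    keys = ["index", "util_pct", "mem_used_mib", "mem_total_mib"]
    stats = []
    for line in csv_text.strip().splitlines():
        values = [int(field.strip()) for field in line.split(",")]
        stats.append(dict(zip(keys, values)))
    return stats


def gpu_stats() -> list[dict[str, int]]:
    """Run nvidia-smi on the current node and parse its output."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_stats(out)


if __name__ == "__main__":
    sample = "0, 97, 61234, 81559\n1, 3, 512, 81559\n"  # Made-up sample, not real output
    for gpu in parse_gpu_stats(sample):
        flag = "  <- mostly idle" if gpu["util_pct"] < 50 else ""
        print(f"GPU {gpu['index']}: {gpu['util_pct']}% busy{flag}")
```

Low GPU utilization alongside busy CPU cores often points to the data loader as the bottleneck, though profiling is the way to confirm it.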

NCCL Errors (Multi-GPU)

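Enable NCCL debug logging to see which rank or transport fails:

export NCCL_DEBUG=INFO      # Log NCCL initialization and error details
export NCCL_IB_DISABLE=0    # 0 leaves the InfiniBand transport enabled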