AI & Machine Learning

Guide to running AI/ML workloads on the Caltech HPC cluster.

Quick Start for ML Researchers

Tip

New to the cluster? Complete the Quick Start Guide first.

Environment Setup

Option 2: NGC Containers

Pre-built, optimized containers from NVIDIA:

module load apptainer

# Pull PyTorch container
apptainer pull docker://nvcr.io/nvidia/pytorch:24.01-py3

# Run interactively
srun --partition=gpu --gres=gpu:p100:1 --time=1:00:00 --pty \
    apptainer shell --nv pytorch_24.01-py3.sif

Common ML Tasks

Training a Model

#!/bin/bash
#SBATCH --job-name=train
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --gres=gpu:h100:1
#SBATCH --time=24:00:00
#SBATCH --mail-user=<your-email-address>
#SBATCH --mail-type=END,FAIL

source /resnick/groups/mygroup/$USER/miniconda3/etc/profile.d/conda.sh
conda activate torch

# Let the CUDA caching allocator grow existing segments, which reduces memory fragmentation
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

python train.py \
    --data /resnick/scratch/$USER/dataset \
    --output /resnick/groups/mygroup/$USER/models \
    --epochs 100

Hyperparameter Tuning

Use job arrays for parallel experiments:

#!/bin/bash
#SBATCH --job-name=hparam
#SBATCH --partition=gpu
#SBATCH --array=0-19
#SBATCH --nodes=1
#SBATCH --gres=gpu:p100:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=04:00:00

source ~/miniconda3/etc/profile.d/conda.sh
conda activate torch

# Define hyperparameter grid
LRS=(0.001 0.0001 0.00001 0.000001)
BATCH_SIZES=(16 32 64 128 256)

# Calculate indices
LR_IDX=$((SLURM_ARRAY_TASK_ID / 5))
BS_IDX=$((SLURM_ARRAY_TASK_ID % 5))

python train.py \
    --lr ${LRS[$LR_IDX]} \
    --batch-size ${BATCH_SIZES[$BS_IDX]} \
    --output results/exp_${SLURM_ARRAY_TASK_ID}
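
The index arithmetic above can also live inside the training script, which keeps the grid definition in one place. The following is a minimal sketch, assuming the same 4 x 5 grid as the job script; `grid_point` is a hypothetical helper name, not part of any library.

```python
import os

# Same grid as the job script above: 4 learning rates x 5 batch sizes = 20 tasks.
LRS = [0.001, 0.0001, 0.00001, 0.000001]
BATCH_SIZES = [16, 32, 64, 128, 256]


def grid_point(task_id):
    """Map a SLURM array task ID (0-19) to a (learning_rate, batch_size) pair."""
    n_points = len(LRS) * len(BATCH_SIZES)
    if not 0 <= task_id < n_points:
        raise ValueError(f"task_id must be in [0, {n_points - 1}], got {task_id}")
    # divmod mirrors the shell arithmetic: the quotient picks the learning rate,
    # the remainder picks the batch size.
    lr_idx, bs_idx = divmod(task_id, len(BATCH_SIZES))
    return LRS[lr_idx], BATCH_SIZES[bs_idx]


# SLURM sets SLURM_ARRAY_TASK_ID for each array task; default to 0 for local testing.
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
lr, batch_size = grid_point(task_id)
print(f"task {task_id}: lr={lr}, batch_size={batch_size}")
```

The range check makes a mismatch between `--array` and the grid size fail loudly instead of silently indexing the wrong entry.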

Inference / Batch Prediction

#!/bin/bash
#SBATCH --job-name=inference
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:p100:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --time=02:00:00

source ~/miniconda3/etc/profile.d/conda.sh
conda activate torch

python inference.py \
    --model /resnick/groups/mygroup/models/best.pt \
    --input /resnick/scratch/$USER/test_data \
    --output /resnick/scratch/$USER/predictions

Large Language Models

Running Hugging Face Models

#!/bin/bash
#SBATCH --job-name=llm
#SBATCH --partition=gpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:h100:1
#SBATCH --cpus-per-task=16
#SBATCH --mem=128G
#SBATCH --time=08:00:00

source ~/miniconda3/etc/profile.d/conda.sh
conda activate torch

# Cache models in group directory
export HF_HOME=/resnick/groups/mygroup/$USER/huggingface

python -c "
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Llama-2-7b-hf',
    torch_dtype=torch.float16,
    device_map='auto'
)
tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')

# Your inference code here
"

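Before requesting a GPU, it can help to estimate how much memory the weights alone will need at a given precision. The sketch below is back-of-the-envelope arithmetic only; it assumes a "7B" model has about 7e9 parameters (the exact count differs slightly), and `weight_memory_gib` is a hypothetical helper name.

```python
def weight_memory_gib(n_params, bytes_per_param):
    """Approximate memory for model weights only, in GiB."""
    return n_params * bytes_per_param / 2**30


# Assumption: roughly 7e9 parameters for a "7B" model.
N_PARAMS = 7e9

for name, nbytes in [("float32", 4), ("float16", 2), ("int8", 1)]:
    print(f"{name}: {weight_memory_gib(N_PARAMS, nbytes):.1f} GiB")
```

Activations, the KV cache, and (for training) gradients and optimizer state come on top of this figure, so treat it as a lower bound.
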
Fine-tuning with LoRA

Memory-efficient fine-tuning:

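from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the base model in 8-bit (requires the bitsandbytes package)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.1,
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()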
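
To see why this is memory-efficient, count the parameters LoRA actually trains. For a weight matrix of shape d_out x d_in, LoRA adds two low-rank factors with r * (d_in + d_out) parameters in total. The sketch below assumes the published Llama-2-7b dimensions (hidden size 4096, 32 decoder layers, square `q_proj` and `v_proj` matrices); these values are assumptions for illustration and are not read from the model.

```python
def lora_param_count(d_in, d_out, r):
    """LoRA adds A (r x d_in) and B (d_out x r) for each adapted weight matrix."""
    return r * (d_in + d_out)


# Assumed Llama-2-7b dimensions (not read from the checkpoint).
HIDDEN = 4096
N_LAYERS = 32
R = 16                  # matches r=16 in the LoraConfig above
TARGETS_PER_LAYER = 2   # q_proj and v_proj

per_module = lora_param_count(HIDDEN, HIDDEN, R)
total = per_module * TARGETS_PER_LAYER * N_LAYERS
print(f"trainable LoRA parameters: {total:,}")
```

Under these assumptions the adapters hold about 8.4 million parameters, a small fraction of the roughly 7 billion in the frozen base model.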

Distributed Training

Multi-GPU (Single Node)

PyTorch DataParallel (simple):

model = torch.nn.DataParallel(model)

PyTorch DDP (better performance):

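import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets LOCAL_RANK for each process it launches
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.init_process_group("nccl")
model = DDP(model.to(local_rank), device_ids=[local_rank])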
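
The `local_rank` value used above normally comes from environment variables that `torchrun` exports to every worker (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`). A minimal sketch of reading them in one place follows; `DistEnv` and `read_dist_env` are hypothetical names, and the defaults are an assumption that lets the same script run as a single process.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistEnv:
    rank: int        # global rank across all nodes
    local_rank: int  # rank within this node; selects the GPU
    world_size: int  # total number of worker processes


def read_dist_env(environ=os.environ):
    """Read the variables torchrun exports; fall back to a single-process setup."""
    return DistEnv(
        rank=int(environ.get("RANK", "0")),
        local_rank=int(environ.get("LOCAL_RANK", "0")),
        world_size=int(environ.get("WORLD_SIZE", "1")),
    )


env = read_dist_env()
print(f"rank {env.rank} of {env.world_size}, local GPU index {env.local_rank}")
```

Restricting logging and checkpoint writes to `env.rank == 0` is a common way to avoid every worker writing the same files.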

Multi-Node Training

#!/bin/bash
#SBATCH --job-name=distributed
#SBATCH --partition=gpu
#SBATCH --nodes=4
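#SBATCH --ntasks-per-node=1      # one torchrun launcher per node
#SBATCH --cpus-per-task=32       # shared by the 4 workers torchrun spawns
#SBATCH --mem=256G
#SBATCH --gres=gpu:p100:4
#SBATCH --time=48:00:00

module load cuda/12.0

# srun starts one torchrun per node; each torchrun spawns one worker per GPU.
# $(hostname) expands on the first node of the allocation, which hosts the rendezvous.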
srun torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=4 \
    --rdzv_id=$SLURM_JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$(hostname):29500 \
    train_distributed.py
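
When scaling out, keep in mind that each DDP worker draws its own batch, so the effective batch size grows with the number of workers. The arithmetic for the job above (4 nodes, `--nproc_per_node=4`) can be sketched as follows; the per-GPU batch size of 32 is an assumption borrowed from the DataLoader example on this page.

```python
def world_size(nnodes, nproc_per_node):
    """Total number of worker processes (one per GPU)."""
    return nnodes * nproc_per_node


def global_batch_size(per_gpu_batch, nnodes, nproc_per_node):
    """Samples consumed per optimizer step across all DDP workers."""
    return per_gpu_batch * world_size(nnodes, nproc_per_node)


NNODES = 4           # --nodes=4
NPROC_PER_NODE = 4   # --nproc_per_node=4
PER_GPU_BATCH = 32   # assumption: batch_size passed to each worker's DataLoader

print("world size:", world_size(NNODES, NPROC_PER_NODE))
print("global batch size:", global_batch_size(PER_GPU_BATCH, NNODES, NPROC_PER_NODE))
```

A larger global batch often calls for retuning the learning rate, so it is worth checking this number whenever the node count changes.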

Data Management

Efficient Data Loading

from torch.utils.data import DataLoader

dataloader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=8,        # Match cpus-per-task
    pin_memory=True,      # Faster GPU transfer
    prefetch_factor=2,
    persistent_workers=True
)
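
Rather than hard-coding `num_workers`, the value can be derived from the job's CPU allocation. SLURM exports `SLURM_CPUS_PER_TASK` when `--cpus-per-task` is set. The sketch below is one possible approach; `dataloader_workers` is a hypothetical helper name.

```python
import os


def dataloader_workers(environ=os.environ):
    """Pick a DataLoader worker count from the SLURM CPU allocation."""
    cpus = environ.get("SLURM_CPUS_PER_TASK")
    if cpus is not None:
        n = int(cpus)
    else:
        # Outside SLURM (for example, on a laptop), fall back to the visible CPU count.
        n = os.cpu_count() or 1
    # persistent_workers=True requires at least one worker.
    return max(n, 1)


print("num_workers =", dataloader_workers())
```

Pass the result as `num_workers=dataloader_workers()` so the script adapts when the `#SBATCH --cpus-per-task` value changes.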

Storage Recommendations

Data Type       Location            Reason
Raw datasets    /resnick/groups/    Persistent, shared
Training data   /resnick/scratch/   Fast I/O
Checkpoints     /resnick/groups/    Persistent
Logs/metrics    /home/ or groups    Small files
Model cache     /resnick/groups/    Shared, persistent

Experiment Tracking

Weights & Biases

pip install wandb
wandb login  # One-time setup

Then, in your training script:

import wandb

wandb.init(project="my-project")
wandb.config.learning_rate = 0.001

# Log metrics
wandb.log({"loss": loss, "accuracy": acc})

TensorBoard

#!/bin/bash
#SBATCH --job-name=tensorboard
#SBATCH --nodes=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --time=08:00:00

module load python3
tensorboard --logdir=/resnick/groups/mygroup/runs --port=6006

Access the TensorBoard UI through Open OnDemand or an SSH tunnel to port 6006 on the compute node running the job.

Best Practices

  1. Start small: Test on a single GPU before scaling

  2. Use mixed precision: torch.amp (torch.cuda.amp in older PyTorch releases) or tf.keras.mixed_precision

  3. Profile your code: Identify bottlenecks before scaling

  4. Checkpoint frequently: Save model state every N epochs

  5. Log everything: Track hyperparameters, metrics, and system stats

  6. Version your code: Use git tags for reproducibility

  7. Document dependencies: Export your environment with conda env export > environment.yml
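
As an illustration of item 4, the sketch below saves a checkpoint every N epochs and writes it atomically, so a job that hits its time limit mid-write does not leave a truncated file behind. It uses `pickle` only to stay self-contained; for real models, `torch.save` would take its place. `should_checkpoint` and `save_atomic` are hypothetical helper names.

```python
import os
import pickle
import tempfile


def should_checkpoint(epoch, every_n):
    """True on epochs every_n, 2 * every_n, ... (epochs counted from 1)."""
    return epoch % every_n == 0


def save_atomic(state, path):
    """Write to a temp file in the target directory, then rename it into place.

    os.replace swaps the file in a single step on POSIX filesystems when both
    paths are in the same directory, so the previous checkpoint stays intact
    until the new one is fully written.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)  # for real models: torch.save(state, f)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)  # clean up the partial temp file
        raise


# Demo with a stand-in training loop; the state dict contents are illustrative.
with tempfile.TemporaryDirectory() as workdir:
    ckpt_path = os.path.join(workdir, "checkpoint.pkl")
    for epoch in range(1, 11):
        state = {"epoch": epoch, "weights": [0.1 * epoch]}
        if should_checkpoint(epoch, every_n=5):
            save_atomic(state, ckpt_path)
    with open(ckpt_path, "rb") as f:
        print("last saved epoch:", pickle.load(f)["epoch"])
```

Storing the epoch number alongside the weights lets a resubmitted job resume from where the last one stopped.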

Common Issues

Out of Memory

  • Reduce batch size

  • Use gradient accumulation

  • Enable mixed precision (FP16/BF16)

  • Use gradient checkpointing

  • Try DeepSpeed ZeRO
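
Gradient accumulation trades time for memory: several small micro-batches are processed one after another, and their gradients are summed before a single optimizer step. The framework-free sketch below checks the underlying arithmetic on a toy 1-D linear model, assuming equal-sized micro-batches; `mse_grad` and `accumulated_grad` are hypothetical helper names.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean squared error for the 1-D model y_hat = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def accumulated_grad(w, xs, ys, micro_batch):
    """Sum micro-batch gradients, each scaled by 1 / number of accumulation steps."""
    steps, remainder = divmod(len(xs), micro_batch)
    if remainder:
        raise ValueError("batch size must be a multiple of micro_batch")
    total = 0.0
    for i in range(steps):
        lo, hi = i * micro_batch, (i + 1) * micro_batch
        # Dividing by steps mirrors the usual `loss / accumulation_steps` scaling.
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / steps
    return total


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
w = 0.5

full = mse_grad(w, xs, ys)                           # one batch of 8
accum = accumulated_grad(w, xs, ys, micro_batch=2)   # 4 micro-batches of 2
print("full-batch gradient: ", full)
print("accumulated gradient:", accum)
```

The two gradients agree up to floating-point rounding, which is why a batch of 8 can be replaced by 4 micro-batches of 2 when the full batch does not fit in GPU memory.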

Slow Training

  • Increase num_workers in DataLoader

  • Use faster storage (/resnick/scratch)

  • Profile to find bottlenecks

  • Check GPU utilization with nvidia-smi

NCCL Errors (Multi-GPU)

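export NCCL_DEBUG=INFO      # print detailed NCCL setup and error logs
export NCCL_IB_DISABLE=0    # keep InfiniBand enabled (1 disables it and falls back to IP sockets)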

See Also