Caltech Resnick High Performance Computing Center
General
About
Mission
Research community
Getting help
Infrastructure
Sponsorship
Citing the Center
People
IMSS Leadership
IMSS Technical Leadership
IMSS Technical Staff
Contact
Services
Getting started
Performance & optimization
Software & environment support
Data transfer & storage
Interactive & visualization computing
Get in touch
Support
Getting Help
Self-service documentation
Consultations
System Status
Location
Rates
Cost Calculator
Compute Pricing
Compute Unit Definitions
Storage Pricing
Questions
Resources
Cluster Overview
Login Nodes
Standard Login
Visualization Login
Compute Nodes
GPU Nodes
Contact
Software
System Software
User Software
Pre-installed Software
Installation Methods
Module System
Spack
Anaconda/Conda
Singularity/Apptainer Containers
Software Guides
Abaqus
Known Issue
Solution
Example Job Script
Getting Help
MATLAB
Loading MATLAB
Running MATLAB on the Cluster
Parallel Computing
Tips
Jupyter Notebook
Launching Jupyter
Adding Conda Environments as Kernels
Accessing Group Directories
Julia Support
Configuration
Troubleshooting
cryoSPARC
Installation
Requesting Access
Connecting
Management Commands
Job Submission
RStudio
Launching RStudio
Package Installation
Example: Installing tidyverse
Troubleshooting
NVIDIA NGC
Account Setup
Configuration
Pulling Containers
Running Containers
NGC CLI Tool
Example SLURM Script
Popular NGC Containers
Relion using SBGrid
Setup
SBGrid Preferences
SLURM Submission Template
Verification
Troubleshooting
AlphaFold
Prerequisites
Setup
Job Submission
Output
Additional Options
Tips
VSCode
Prerequisites
Setup
First Connection
Advanced: Direct Compute Node Connection
Troubleshooting
Available Software Modules
Categories
Requesting New Software
See Also
Institutional Licenses
Request New Software
Open OnDemand
Features
Getting Started
Interactive Desktop
Troubleshooting
PATH Conflicts
Browser Issues
File Upload Failures
“Request Header Too Long” Error
Home Directory Quota
Support
System Status
Current Status
Active Announcements
Check the Cluster Yourself
Job queues
Node availability
Storage health
Job efficiency after a run
Notifications
Mailing List
Reporting an Issue
Training
Self-Study Resources
Internal
External (recommended)
Announcements
Citing the Resnick High Performance Computing Center
Recommended acknowledgement
BibTeX
CITATION.cff
Let us know when you publish
Questions
Policies
Acceptable use
Data handling
Export control
Security
Questions
Getting Started
Quick Start Guide
1. Get an account
2. Connect via SSH
Recommended SSH config
3. Move some data over
4. Find and load software
5. Submit your first job
6. Cheat sheet
7. Where to put your files
Next steps
Stuck?
Getting Started
Quick Links
First Steps
Logging In
Next Steps
Account Information
Getting an Account
For New Groups
For Existing Groups
Multi-Factor Authentication
Self-Registration
Supported Methods
Eligibility Certification
HPC End-User Agreement
Country Group D:5
Running Jobs
SLURM Commands
Job Submission
sbatch
salloc
srun
Resource Request Parameters
Environment Variables
Queue Management
squeue
scancel
scontrol
Usage Reporting
sreport
sacct
Account Management
Task Launching
MPI Jobs
Example Batch Script
Example Job Scripts
Basic Examples
Serial Job
Multi-threaded (OpenMP)
MPI (Multi-node)
GPU Jobs
Single GPU
Multi-GPU
Specific GPU Type
Python & Conda
Conda Environment
Jupyter Batch
Job Arrays
Parameter Sweep
Limit Concurrent Jobs
Applications
MATLAB
R
GROMACS
AlphaFold
Job Dependencies
Sequential Pipeline
Fan-out, Fan-in
Email Notifications
Generate a Custom Script
See Also
Best Practices
Resource Requests
Request What You Need
Match CPUs to Parallelism
Add Time Buffer
Job Arrays
Storage
Use the Right Location
Monitor Your Quota
Scratch Warning
I/O Performance
Code Efficiency
Profile First
Set Thread Counts
Checkpointing
Environment Management
Monitoring
Good Citizenship
Pre-Submit Checklist
Questions?
GPU Computing
Available GPUs
Requesting GPUs
Single GPU
Multiple GPUs
Specific GPU Type
Basic GPU Job
CUDA Programming
Load CUDA
Compile CUDA Code
Check GPU in Code
Deep Learning Frameworks
PyTorch
TensorFlow
JAX
Using Containers for Deep Learning
NGC Containers (Recommended)
Pull NGC Container
Multi-Node GPU Training
GPU Memory Management
Check Memory Usage
Reducing Memory Usage
Best Practices
Troubleshooting
“CUDA out of memory”
GPU not detected
Slow GPU performance
See Also
AI & Machine Learning
Quick Start for ML Researchers
Recommended Setup
Environment Setup
Option 1: Conda (Recommended)
Option 2: NGC Containers
Common ML Tasks
Training a Model
Hyperparameter Tuning
Inference / Batch Prediction
Large Language Models
Running Hugging Face Models
Fine-tuning with LoRA
Distributed Training
Multi-GPU (Single Node)
Multi-Node Training
Data Management
Efficient Data Loading
Storage Recommendations
Experiment Tracking
Weights & Biases
TensorBoard
Best Practices
Common Issues
Out of Memory
Slow Training
NCCL Errors (Multi-GPU)
See Also
Data & Software
Transferring Files
Quick Reference
Open OnDemand
Network Mounting
SCP/SFTP
Command Line
Graphical Clients
rsync
SSHFS (macOS)
Installation
Usage
Globus
HPC Endpoint Details
Setup
Cloud Storage
Amazon S3
Google Cloud Storage
Best Practices
Getting Help
Storage
Overview
User Home Directories
Group Storage
Scratch Space
Snapshots
Checking Your Quota
Data Protection
Storage Guides
Backups
Recommended Tool: Duplicity
Setup Guide
Verification
Alternative: Rclone
Rclone Backups to AWS
AWS IAM Setup
Rclone Configuration
Usage
Automated Backups
Configuration Location
Important Notes
Software and Modules
Module Commands
List Available Software
Load a Package
View Package Details
Remove All Loaded Modules
Software Installation Options
Community Installation
Local/Group Installation
Container Technology
Package Management
Spack
Anaconda/Conda
Python Virtual Environments (venv)
Software Guides
Abaqus
Known Issue
Solution
Example Job Script
Getting Help
MATLAB
Loading MATLAB
Running MATLAB on the Cluster
Parallel Computing
Tips
Jupyter Notebook
Launching Jupyter
Adding Conda Environments as Kernels
Accessing Group Directories
Julia Support
Configuration
Troubleshooting
cryoSPARC
Installation
Requesting Access
Connecting
Management Commands
Job Submission
RStudio
Launching RStudio
Package Installation
Example: Installing tidyverse
Troubleshooting
NVIDIA NGC
Account Setup
Configuration
Pulling Containers
Running Containers
NGC CLI Tool
Example SLURM Script
Popular NGC Containers
Relion using SBGrid
Setup
SBGrid Preferences
SLURM Submission Template
Verification
Troubleshooting
AlphaFold
Prerequisites
Setup
Job Submission
Output
Additional Options
Tips
VSCode
Prerequisites
Setup
First Connection
Advanced: Direct Compute Node Connection
Troubleshooting
Available Software Modules
Categories
Requesting New Software
See Also
Troubleshooting
Troubleshooting
By Symptom
Job Won’t Start
Job Fails Immediately
Out of Memory
SSH/Connection Issues
GPU Not Working
Software/Module Issues
Storage Issues
Quick Diagnostics
Error Messages
Still Stuck?
Common Problems
Authentication Issues
Password Not Working
Connection Refused
Network & Connectivity
Frequent SSH Disconnections During Idle Periods
Alternative: Mosh (Stateless SSH)
Computation Issues
Requested Cores Not Being Used
Nested SRUNs Hanging on GPU Nodes
Out of Memory on Login Nodes
Access & Display Issues
Home Directory Not Found in Open OnDemand
Python/Conda Segfaults from WSL
X11 Forwarding GLX Errors
Still Having Issues?
Frequently Asked Questions
Account & Access
Job Management
Software & Containers
Troubleshooting
Quick Answers
How do I log in to the cluster?
How do external collaborators get an account?
How should I acknowledge research on the cluster?
Why won’t my job start?
How do I get job information via email?
How do I modify my bash environment?
How do I compress unused data?
How are priority and fairshare set up?
Using the debug QOS
I have a deadline and need my job to run now!
I need to run longer than 7 days
Dependencies and pipelines
How do I checkpoint before my job hits its walltime?
How do I check my group’s compute usage?
Detailed Guides
Reusing your SSH login
One-time setup
Connecting
Troubleshooting
See also
Cluster IP Space
Login Nodes
Head Nodes
Whitelisting from a remote firewall
Job Submission Limits
Maximum Concurrent Jobs
Why This Limit Exists
Workarounds
Cluster Reservations
Non-Investor Reservations
Investor Reservations
Requesting a Reservation
MATLAB Distributed Compute Engine
Access MATLAB
Configuration (MATLAB 2019a)
Configuration (MATLAB R2020+)
Using the Parallel Pool
Containers on the HPC Cluster
Why Containers?
Loading Apptainer
Basic Operations
Binding Directories
GPU Support
Example SLURM Script
Building Containers
Popular Container Registries
UCX Error: No Space Left on Device
Error Message
Solution
Why This Happens
Reference
Research Network
Purpose
Network Access Points
Mounting Instructions
Getting Access
Glossary
A
B
C
D
E
F
G
H
I
J
L
M
N
O
P
Q
S
T
V
W
Acronym Reference
Index