Caltech Resnick High Performance Computing Center

General

  • About
    • Mission
    • Research community
    • Getting help
    • Infrastructure
    • Sponsorship
    • Citing the Center
  • People
    • IMSS Leadership
    • IMSS Technical Leadership
    • IMSS Technical Staff
    • Contact
  • Services
    • Getting started
    • Performance & optimization
    • Software & environment support
    • Data transfer & storage
    • Interactive & visualization computing
    • Get in touch
  • Support
    • Getting Help
    • Self-service documentation
    • Consultations
    • System Status
    • Location
  • Rates
    • Cost Calculator
    • Compute Pricing
      • Compute Unit Definitions
    • Storage Pricing
    • Questions
  • Resources
    • Cluster Overview
    • Login Nodes
      • Standard Login
      • Visualization Login
    • Compute Nodes
    • GPU Nodes
    • Contact
  • Software
    • System Software
    • User Software
    • Pre-installed Software
    • Installation Methods
      • Module System
      • Spack
      • Anaconda/Conda
      • Singularity/Apptainer Containers
    • Software Guides
      • Abaqus
        • Known Issue
        • Solution
        • Example Job Script
        • Getting Help
      • MATLAB
        • Loading MATLAB
        • Running MATLAB on the Cluster
        • Parallel Computing
        • Tips
      • Jupyter Notebook
        • Launching Jupyter
        • Adding Conda Environments as Kernels
        • Accessing Group Directories
        • Julia Support
        • Configuration
        • Troubleshooting
      • cryoSPARC
        • Installation
        • Requesting Access
        • Connecting
        • Management Commands
        • Job Submission
      • RStudio
        • Launching RStudio
        • Package Installation
        • Example: Installing tidyverse
        • Troubleshooting
      • NVIDIA NGC
        • Account Setup
        • Configuration
        • Pulling Containers
        • Running Containers
        • NGC CLI Tool
        • Example SLURM Script
        • Popular NGC Containers
      • Relion using SBGrid
        • Setup
        • SBGrid Preferences
        • SLURM Submission Template
        • Verification
        • Troubleshooting
      • AlphaFold
        • Prerequisites
        • Setup
        • Job Submission
        • Output
        • Additional Options
        • Tips
      • VSCode
        • Prerequisites
        • Setup
        • First Connection
        • Advanced: Direct Compute Node Connection
        • Troubleshooting
      • Available Software Modules
        • Categories
        • Requesting New Software
        • See Also
    • Institutional Licenses
    • Request New Software
  • Open OnDemand
    • Features
    • Getting Started
    • Interactive Desktop
    • Troubleshooting
      • PATH Conflicts
      • Browser Issues
      • File Upload Failures
      • “Request Header Too Long” Error
      • Home Directory Quota
    • Support
  • System Status
    • Current Status
    • Active Announcements
    • Check the Cluster Yourself
      • Job queues
      • Node availability
      • Storage health
      • Job efficiency after a run
    • Notifications
      • Mailing List
    • Reporting an Issue
  • Training
    • Self-Study Resources
      • Internal
      • External (recommended)
    • Announcements
  • Citing the Resnick High Performance Computing Center
    • Recommended acknowledgement
    • BibTeX
    • CITATION.cff
    • Let us know when you publish
    • Questions
  • Policies
    • Acceptable use
    • Data handling
    • Export control
    • Security
    • Questions

Getting Started

  • Quick Start Guide
    • 1. Get an account
    • 2. Connect via SSH
      • Recommended SSH config
    • 3. Move some data over
    • 4. Find and load software
    • 5. Submit your first job
    • 6. Cheat sheet
    • 7. Where to put your files
    • Next steps
    • Stuck?
  • Getting Started
    • Quick Links
    • First Steps
    • Logging In
    • Next Steps
  • Account Information
    • Getting an Account
      • For New Groups
      • For Existing Groups
    • Multi-Factor Authentication
      • Self-Registration
      • Supported Methods
    • Eligibility Certification
    • HPC End-User Agreement
      • Country Group D:5

Running Jobs

  • SLURM Commands
    • Job Submission
      • sbatch
      • salloc
      • srun
    • Resource Request Parameters
    • Environment Variables
    • Queue Management
      • squeue
      • scancel
      • scontrol
    • Usage Reporting
      • sreport
      • sacct
    • Account Management
    • Task Launching
      • MPI Jobs
    • Example Batch Script
  • Example Job Scripts
    • Basic Examples
      • Serial Job
      • Multi-threaded (OpenMP)
      • MPI (Multi-node)
    • GPU Jobs
      • Single GPU
      • Multi-GPU
      • Specific GPU Type
    • Python & Conda
      • Conda Environment
      • Jupyter Batch
    • Job Arrays
      • Parameter Sweep
      • Limit Concurrent Jobs
    • Applications
      • MATLAB
      • R
      • GROMACS
      • AlphaFold
    • Job Dependencies
      • Sequential Pipeline
      • Fan-out, Fan-in
    • Email Notifications
    • Generate a Custom Script
    • See Also
  • Best Practices
    • Resource Requests
      • Request What You Need
      • Match CPUs to Parallelism
      • Add Time Buffer
    • Job Arrays
    • Storage
      • Use the Right Location
      • Monitor Your Quota
      • Scratch Warning
    • I/O Performance
    • Code Efficiency
      • Profile First
      • Set Thread Counts
    • Checkpointing
    • Environment Management
    • Monitoring
    • Good Citizenship
    • Pre-Submit Checklist
    • Questions?
  • GPU Computing
    • Available GPUs
    • Requesting GPUs
      • Single GPU
      • Multiple GPUs
      • Specific GPU Type
    • Basic GPU Job
    • CUDA Programming
      • Load CUDA
      • Compile CUDA Code
      • Check GPU in Code
    • Deep Learning Frameworks
      • PyTorch
      • TensorFlow
      • JAX
    • Using Containers for Deep Learning
      • NGC Containers (Recommended)
      • Pull NGC Container
    • Multi-Node GPU Training
    • GPU Memory Management
      • Check Memory Usage
      • Reducing Memory Usage
    • Best Practices
    • Troubleshooting
      • “CUDA out of memory”
      • GPU not detected
      • Slow GPU performance
    • See Also
  • AI & Machine Learning
    • Quick Start for ML Researchers
      • Recommended Setup
    • Environment Setup
      • Option 1: Conda (Recommended)
      • Option 2: NGC Containers
    • Common ML Tasks
      • Training a Model
      • Hyperparameter Tuning
      • Inference / Batch Prediction
    • Large Language Models
      • Running Hugging Face Models
      • Fine-tuning with LoRA
    • Distributed Training
      • Multi-GPU (Single Node)
      • Multi-Node Training
    • Data Management
      • Efficient Data Loading
      • Storage Recommendations
    • Experiment Tracking
      • Weights & Biases
      • TensorBoard
    • Best Practices
    • Common Issues
      • Out of Memory
      • Slow Training
      • NCCL Errors (Multi-GPU)
    • See Also

Data & Software

  • Transferring Files
    • Quick Reference
    • Open OnDemand
    • Network Mounting
    • SCP/SFTP
      • Command Line
      • Graphical Clients
    • rsync
    • SSHFS (macOS)
      • Installation
      • Usage
    • Globus
      • HPC Endpoint Details
      • Setup
    • Cloud Storage
      • Amazon S3
      • Google Cloud Storage
    • Best Practices
    • Getting Help
  • Storage
    • Overview
    • User Home Directories
    • Group Storage
    • Scratch Space
    • Snapshots
    • Checking Your Quota
    • Data Protection
    • Storage Guides
      • Backups
        • Recommended Tool: Duplicity
        • Setup Guide
        • Verification
        • Alternative: Rclone
      • Rclone Backups to AWS
        • AWS IAM Setup
        • Rclone Configuration
        • Usage
        • Automated Backups
        • Configuration Location
        • Important Notes
  • Software and Modules
    • Module Commands
      • List Available Software
      • Load a Package
      • View Package Details
      • Remove All Loaded Modules
    • Software Installation Options
      • Community Installation
      • Local/Group Installation
      • Container Technology
    • Package Management
      • Spack
      • Anaconda/Conda
      • Python Virtual Environments (venv)
    • Software Guides

Troubleshooting

  • Troubleshooting
    • By Symptom
      • Job Won’t Start
      • Job Fails Immediately
      • Out of Memory
      • SSH/Connection Issues
      • GPU Not Working
      • Software/Module Issues
      • Storage Issues
    • Quick Diagnostics
    • Error Messages
    • Still Stuck?
  • Common Problems
    • Authentication Issues
      • Password Not Working
      • Connection Refused
    • Network & Connectivity
      • Frequent SSH Disconnections During Idle Periods
      • Alternative: Mosh (Stateless SSH)
    • Computation Issues
      • Requested Cores Not Being Used
      • Nested SRUNs Hanging on GPU Nodes
      • Out of Memory on Login Nodes
    • Access & Display Issues
      • Home Directory Not Found in Open OnDemand
      • Python/Conda Segfaults from WSL
      • X11 Forwarding GLX Errors
    • Still Having Issues?
  • Frequently Asked Questions
    • Account & Access
    • Job Management
    • Software & Containers
    • Troubleshooting
    • Quick Answers
      • How do I log in to the cluster?
      • How do external collaborators get an account?
      • How should I acknowledge research on the cluster?
      • Why won’t my job start?
      • How do I get job information via email?
      • How do I modify my bash environment?
      • How do I compress unused data?
      • How are priority and fairshare set up?
      • Using the debug QOS
      • I have a deadline and need my job to run now!
      • I need to run longer than 7 days
      • Dependencies and pipelines
      • How do I checkpoint before my job hits its walltime?
      • How do I check my group’s compute usage?
    • Detailed Guides
      • Reusing your SSH login
        • One-time setup
        • Connecting
        • Troubleshooting
        • See also
      • Cluster IP Space
        • Login Nodes
        • Head Nodes
        • Whitelisting from a remote firewall
      • Job Submission Limits
        • Maximum Concurrent Jobs
        • Why This Limit Exists
        • Workarounds
      • Cluster Reservations
        • Non-Investor Reservations
        • Investor Reservations
        • Requesting a Reservation
      • MATLAB Distributed Compute Engine
        • Access MATLAB
        • Configuration (MATLAB 2019a)
        • Configuration (MATLAB R2020+)
        • Using the Parallel Pool
      • Containers on the HPC Cluster
        • Why Containers?
        • Loading Apptainer
        • Basic Operations
        • Binding Directories
        • GPU Support
        • Example SLURM Script
        • Building Containers
        • Popular Container Registries
      • UCX Error: No Space Left on Device
        • Error Message
        • Solution
        • Why This Happens

Reference

  • Research Network
    • Purpose
    • Network Access Points
    • Mounting Instructions
    • Getting Access
  • Glossary
    • A
    • B
    • C
    • D
    • E
    • F
    • G
    • H
    • I
    • J
    • L
    • M
    • N
    • O
    • P
    • Q
    • S
    • T
    • V
    • W
    • Acronym Reference

© Copyright 2026, California Institute of Technology. Last updated on Jun 08, 2026.
