Troubleshooting
Quick links to common issues by category.
Tip
Include your username, job IDs, and error messages when contacting support.
By Symptom
Job Won’t Start
Check queue:
squeue -u $USER
Check job details:
scontrol show job JOBID
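When a job stays pending, the scheduler's stated reason is usually the fastest clue. The sketch below wraps `squeue` in a small helper that prints only pending jobs with their reason and estimated start time. The function name `why_pending` is hypothetical, and the column widths are arbitrary choices.

```shell
# Hypothetical helper: list pending jobs with the scheduler's reason.
# Format fields: %i = job ID, %P = partition, %r = reason, %S = expected start time.
why_pending() {
    if ! command -v squeue >/dev/null 2>&1; then
        echo "squeue not found; run this on a cluster login node" >&2
        return 1
    fi
    squeue -u "${1:-$USER}" -t PENDING -o "%.12i %.10P %.20r %.20S"
}
```

Run `why_pending` for your own jobs, or `why_pending otheruser` for someone else's. Reasons such as `Resources` or `Priority` normally mean the job is simply waiting its turn.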
Job Fails Immediately
Check output file:
cat slurm-JOBID.out
Common causes:
Missing module load
Wrong file paths
Insufficient memory
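If the output file does not make the cause obvious, the Slurm accounting record may help tell these cases apart. This is a minimal sketch, assuming job accounting is enabled on the cluster. The function name `job_summary` is hypothetical, and the field list is one reasonable selection.

```shell
# Hypothetical helper: show how a finished job ended.
# State and ExitCode show how the job terminated;
# MaxRSS next to ReqMem hints at whether memory was the problem.
job_summary() {
    if [ -z "$1" ]; then
        echo "usage: job_summary JOBID" >&2
        return 2
    fi
    sacct -j "$1" --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS,ReqMem
}
```

A `State` of `OUT_OF_MEMORY` points to the memory request, while `FAILED` with a nonzero `ExitCode` more often indicates a missing module or a wrong path in the script itself.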
Out of Memory
On login nodes: A cgroup limit caps memory at 8 GB. Run memory-intensive work on compute nodes.
In jobs: Request more with --mem= or --mem-per-cpu=
GPU memory: See GPU troubleshooting
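For jobs, the memory request goes in the batch script header. The following is a sketch of such a script. The job name, CPU count, time limit, and memory values are illustrative assumptions, not site recommendations.

```shell
#!/bin/bash
#SBATCH --job-name=mem-example   # illustrative name
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4        # illustrative value
#SBATCH --time=01:00:00          # illustrative value
#SBATCH --mem=16G                # total memory per node (illustrative value)
# Alternative to --mem (the two are mutually exclusive):
##SBATCH --mem-per-cpu=4G        # memory per allocated CPU

# Slurm exports the per-node request (in MB) inside the job.
echo "Memory requested per node (MB): ${SLURM_MEM_PER_NODE:-not set}"
```

Comparing the request with the `MaxRSS` reported by `sacct` or the summary from `seff` after a run helps pick a value that is large enough without over-requesting.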
SSH/Connection Issues
Connection refused — VPN required off-campus
GPU Not Working
Software/Module Issues
Module not found: check module avail
Storage Issues
Check quota:
hpcquota
Scratch purge policy: 14-day cleanup
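To see which scratch files might be affected by the cleanup, `find` can list files by age. This is a sketch under two assumptions: that the purge is keyed on modification time (the site policy may use access time instead), and that you pass your own scratch path, which is not specified here. The function name `purge_candidates` is hypothetical.

```shell
# Hypothetical helper: list files at least N days old (default 14).
# Usage: purge_candidates DIRECTORY [DAYS]
# find's -mtime +13 matches files whose age in whole days is 14 or more,
# hence the "days - 1" below.
# Assumption: age is measured by modification time.
purge_candidates() {
    dir="$1"
    days="${2:-14}"
    find "$dir" -type f -mtime +"$((days - 1))" -print
}
```

Calling it with a smaller number, for example `purge_candidates /path/to/scratch 10`, gives advance warning of files that are approaching the limit.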
Quick Diagnostics
# Check your jobs
squeue -u $USER
# Job details
scontrol show job JOBID
# Job efficiency (after completion)
seff JOBID
# Storage quota
hpcquota
# GPU status (on GPU node)
nvidia-smi
# Module status
module list
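Several of these checks can be bundled into one file to attach to a support request, alongside your username, job IDs, and error messages. The sketch below is one way to do that. The function name `collect_diagnostics` and the default file name `hpc-report.txt` are hypothetical, and the Slurm commands are skipped when they are not available on the current machine.

```shell
# Hypothetical helper: write basic diagnostics to a text file.
# Usage: collect_diagnostics [JOBID] [REPORT_FILE]
collect_diagnostics() {
    jobid="$1"
    report="${2:-hpc-report.txt}"
    user="${USER:-$(id -un)}"
    {
        echo "User: $user"
        echo "Host: $(uname -n)"
        echo "Date: $(date)"
        echo "Job ID: ${jobid:-none given}"
        # Only query the scheduler if Slurm tools are installed here.
        if command -v squeue >/dev/null 2>&1; then
            echo "--- squeue ---"
            squeue -u "$user" || true
            if [ -n "$jobid" ]; then
                echo "--- scontrol show job ---"
                scontrol show job "$jobid" || true
            fi
        fi
    } > "$report" 2>&1
    echo "Wrote $report"
}
```

For example, `collect_diagnostics 12345` writes `hpc-report.txt` in the current directory. Review the file before sending it, since it contains your username and job details.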
Error Messages
| Error | Likely Cause | Solution |
|---|---|---|
| | Invalid request | Check resource syntax |
| | OOM | Increase --mem |
| | GPU OOM | Reduce batch size |
| | Off-campus | Connect to VPN |
| | Wrong key/password | Check credentials |
| | Quota full | Check hpcquota |
Still Stuck?
Caltech Help System — submit a ticket
help-hpc@caltech.edu — email support