Troubleshooting

Quick links to common issues by category.

Tip

Include your username, job IDs, and error messages when contacting support.

By Symptom

Job Won’t Start

Check queue: squeue -u $USER
Check job details: scontrol show job JOBID
Why won’t my job start?
Job Limits
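
Slurm records a reason code for each pending job, and that code usually explains why the job has not started. The sketch below turns a few common codes into short hints. The function names `explain_reason` and `pending_report` are illustrative, and the hints are general guidance rather than site policy.

```shell
# Translate a Slurm pending-reason code into a short hint.
# Only a handful of common codes are covered; anything else falls through.
explain_reason() {
    case "$1" in
        Priority)         echo "Higher-priority jobs are ahead of yours in the queue." ;;
        Resources)        echo "The requested resources are not free yet." ;;
        Dependency)       echo "The job is waiting for another job to finish." ;;
        ReqNodeNotAvail*) echo "A requested node is down, drained, or reserved." ;;
        *Limit)           echo "A job, user, or account limit has been reached; see Job Limits." ;;
        *)                echo "Run: scontrol show job JOBID" ;;
    esac
}

# Print one line per pending job: job ID, reason code, hint.
# squeue options: -h omits the header, -t PENDING filters by state,
# and the format string prints %i (job ID) and %r (reason).
pending_report() {
    squeue -u "$USER" -h -t PENDING -o "%i %r" |
    while read -r jobid reason; do
        printf '%s\t%s\t%s\n' "$jobid" "$reason" "$(explain_reason "$reason")"
    done
}
```

Run `pending_report` on a login node after loading the functions into your shell. For an estimated start time, `squeue --start -j JOBID` may also help.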

Job Fails Immediately

Check output file: cat slurm-JOBID.out
Common causes:
- Missing module load
- Wrong file paths
- Insufficient memory
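
The three causes above tend to leave recognizable messages in the output file. As a rough sketch, the hypothetical helper `diagnose_output` below searches a file for typical shell and Slurm error strings. The patterns are assumptions about what commonly appears and may need adjusting for your software.

```shell
# Scan a job output file for signs of the three common causes.
# Prints one hint per cause found; prints nothing if no pattern matches.
diagnose_output() {
    local file="${1:-}"
    if [ ! -r "$file" ]; then
        echo "cannot read: $file" >&2
        return 1
    fi
    # A missing module usually shows up as an unknown command.
    if grep -q "command not found" "$file"; then
        echo "missing module: a command was not found; check your 'module load' lines"
    fi
    # A wrong path usually shows up as a missing file or directory.
    if grep -q "No such file or directory" "$file"; then
        echo "wrong path: a file or directory was not found; check paths in the script"
    fi
    # Slurm reports host memory problems as oom-kill events or an exceeded limit.
    if grep -q -e "oom[-_]kill" -e "Exceeded job memory limit" "$file"; then
        echo "memory: the job ran out of memory; request more with --mem="
    fi
    return 0
}
```

For example, `diagnose_output slurm-JOBID.out` checks the default output file of a job.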
Common Problems

Out of Memory

On login nodes: memory is capped at 8 GB by a cgroup limit. Run memory-intensive work on compute nodes.
In jobs: Request more with --mem= or --mem-per-cpu=
GPU memory: See GPU troubleshooting
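
When a job is killed for memory, one way to pick a new request is to start from the peak usage of a previous run and add some headroom. The sketch below does that arithmetic. The helper name `suggest_mem_mb` and the 20% headroom are assumptions, not a site recommendation.

```shell
# Suggest a --mem value in MB from a job's peak memory use in MB.
# Assumption: 20% headroom over the observed peak is enough.
suggest_mem_mb() {
    local peak_mb="$1"
    # Integer arithmetic: multiply by 1.2 and round up.
    echo $(( (peak_mb * 12 + 9) / 10 ))
}

# One way to find the peak of a finished job (MaxRSS is reported per step):
#   sacct -j JOBID --format=JobID,MaxRSS,ReqMem,State --units=M
# Then request the suggested amount in the batch script, for example:
#   #SBATCH --mem=12000M
```

A peak of 10000 MB gives `suggest_mem_mb 10000`, which prints `12000`.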

SSH/Connection Issues

SSH disconnecting
Connection refused — VPN required off-campus
SSH key setup

GPU Not Working

Software/Module Issues

Storage Issues

Check quota: hpcquota
Scratch purge policy — 14-day cleanup
Storage locations
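
To see which scratch files might fall under the 14-day cleanup, you can list files that have not changed recently. This is a sketch only: the helper name `purge_candidates` is hypothetical, and it assumes the purge is based on modification time, while the actual policy may use access time or other criteria.

```shell
# List files under DIR that were last modified more than DAYS days ago.
# DAYS defaults to 14 to match the cleanup window.
# Assumption: modification time is what the purge looks at.
purge_candidates() {
    local dir="$1" days="${2:-14}"
    find "$dir" -type f -mtime +"$days" -print
}

# Usage (replace the path with your own scratch directory):
#   purge_candidates /path/to/your/scratch
```

Copy anything you still need to a permanent storage location before it is removed.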

Quick Diagnostics

# Check your jobs
squeue -u $USER

# Job details
scontrol show job JOBID

# Job efficiency (after completion)
seff JOBID

# Storage quota
hpcquota

# GPU status (on GPU node)
nvidia-smi

# Module status
module list
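
The commands above can be gathered into one report to attach to a support request. The following is a minimal sketch with hypothetical helper names (`run_if_present`, `collect_diagnostics`). It skips any command that is not available on the current node, such as `nvidia-smi` outside GPU nodes.

```shell
# Run a command only if it exists here; otherwise note that it was skipped.
run_if_present() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "== $*"
        "$@" 2>&1 || echo "(command exited with an error)"
    else
        echo "== $1: not available here, skipped"
    fi
}

# Print the username, the job ID, and the output of each diagnostic command.
collect_diagnostics() {
    local jobid="${1:-}"
    if [ -z "$jobid" ]; then
        echo "usage: collect_diagnostics JOBID" >&2
        return 1
    fi
    local user="${USER:-$(id -un)}"
    echo "user: $user"
    echo "job: $jobid"
    run_if_present squeue -u "$user"
    run_if_present scontrol show job "$jobid"
    run_if_present seff "$jobid"
    run_if_present hpcquota
    run_if_present nvidia-smi
    run_if_present module list
}
```

For example, `collect_diagnostics JOBID > report.txt` saves the report to a file; add the exact error message you saw before sending it.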

Error Messages

Error	Likely Cause	Solution
`sbatch: error: Batch job submission failed`	Invalid resource request	Check the resource options in your `sbatch` command or script
`slurmstepd: error: Exceeded job memory limit`	Job ran out of memory (OOM)	Increase `--mem`
`CUDA out of memory`	GPU ran out of memory	Reduce batch size
`Connection refused`	Off-campus	Connect to VPN
`Permission denied`	Wrong key/password	Check credentials
`No space left on device`	Quota full	Check `hpcquota`
