Troubleshooting

Quick links to common issues by category.

Tip

Include your username, job IDs, and error messages when contacting support.

By Symptom

Job Won’t Start

Job Fails Immediately

  • Check output file: cat slurm-JOBID.out

  • Common causes:

    • Missing module load

    • Wrong file paths

    • Insufficient memory
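
Two of the causes above (a missing module load and wrong file paths) can be caught before a job is submitted. The following is a minimal sketch, not a site-provided tool: the function name `preflight` and its messages are hypothetical, and it assumes that a command missing from `PATH` means the providing module was not loaded.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper; not part of Slurm or any site tooling.
# Usage: preflight COMMAND PATH [PATH...]
preflight() {
    local cmd="$1"; shift
    local status=0

    # A command missing from PATH often means its module was not loaded.
    if ! command -v "$cmd" >/dev/null 2>&1; then
        echo "missing command: $cmd (forgot 'module load'?)"
        status=1
    fi

    # Every input path the job reads should exist before submission.
    local path
    for path in "$@"; do
        if [ ! -e "$path" ]; then
            echo "missing path: $path"
            status=1
        fi
    done

    return "$status"
}
```

Calling it near the top of a batch script, for example `preflight python input.dat || exit 1`, makes the job stop with a readable message instead of failing later.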

Common Problems

Out of Memory

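  • On login nodes: A cgroup limit caps memory at 8 GB. Run memory-intensive work on compute nodes.

  • In jobs: Request more with --mem= (memory per node) or --mem-per-cpu= (memory per allocated CPU). The two options are mutually exclusive.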

  • GPU memory: See GPU troubleshooting
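
When a job has run out of memory, one way to pick a new `--mem=` value is to start from the peak usage of a similar completed job (for example, the memory figure that `seff JOBID` reports) and add some headroom. The sketch below is only an illustration: the function name `suggest_mem` is hypothetical, and the 20% headroom is an assumed rule of thumb, not a site policy.

```shell
#!/usr/bin/env bash
# Hypothetical helper: suggest a --mem value from a job's peak memory use.
# The 20% headroom is an assumed rule of thumb, not a site policy.
# Usage: suggest_mem PEAK_MB
suggest_mem() {
    local peak_mb="$1"
    # Integer arithmetic: add 20% and round up to a whole megabyte.
    echo "$(( (peak_mb * 120 + 99) / 100 ))M"
}

# Example: a job that peaked at 6500 MB
suggest_mem 6500   # prints 7800M
```

The result can then go into the batch script as `#SBATCH --mem=7800M`.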

SSH/Connection Issues

GPU Not Working

Software/Module Issues

Storage Issues

Quick Diagnostics

# Check your jobs
squeue -u $USER

# Job details
scontrol show job JOBID

# Job efficiency (after completion)
seff JOBID

# Storage quota
hpcquota

# GPU status (on GPU node)
nvidia-smi

# Module status
module list

Error Messages

Error                                           Likely Cause         Solution
sbatch: error: Batch job submission failed      Invalid request      Check resource syntax
slurmstepd: error: Exceeded job memory limit    OOM                  Increase --mem
CUDA out of memory                              GPU OOM              Reduce batch size
Connection refused                              Off-campus           Connect to VPN
Permission denied                               Wrong key/password   Check credentials
No space left on device                         Quota full           Check hpcquota
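
The lookup in the table above can be automated for a job's output file. This is a minimal sketch under stated assumptions: `suggest_fix` is a hypothetical name, the error strings and suggestions are copied from the table, and matching is a plain substring search that may miss differently worded messages.

```shell
#!/usr/bin/env bash
# Hypothetical helper: match known error strings in a log file and print
# the suggestion from the table above. Not a site-provided tool.
# Usage: suggest_fix LOGFILE
suggest_fix() {
    local log="$1"
    local found=1
    local pattern hint

    # Each entry is "error text|suggestion", taken from the table.
    while IFS='|' read -r pattern hint; do
        # -F treats the pattern as a fixed string, not a regex.
        if grep -qF -- "$pattern" "$log"; then
            echo "$pattern -> $hint"
            found=0
        fi
    done <<'EOF'
Batch job submission failed|Check resource syntax
Exceeded job memory limit|Increase --mem
CUDA out of memory|Reduce batch size
Connection refused|Connect to VPN
Permission denied|Check credentials
No space left on device|Check hpcquota
EOF

    # Exit status 0 if at least one known error was found, 1 otherwise.
    return "$found"
}
```

For example, `suggest_fix slurm-JOBID.out` prints one line per recognized error and returns a non-zero status when none is found.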

Still Stuck?