Best Practices

Tips for effective cluster use.

Resource Requests

Over-requesting hurts you: longer queue times, higher charges, fewer resources for everyone.

Tip: Start small, check usage with seff JOBID, then adjust.

Requesting 32 CPUs for a single-threaded script wastes 31 CPUs.
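The waste can be put as a number. The sketch below computes CPU efficiency as CPU time used divided by allocated CPUs times elapsed time, which is comparable to the figure an efficiency report shows. The function name and the example values are illustrative assumptions, not part of any cluster tool.

```python
def cpu_efficiency(cpu_seconds_used, allocated_cpus, elapsed_seconds):
    """Fraction of the allocated CPU time that the job actually used."""
    if allocated_cpus <= 0 or elapsed_seconds <= 0:
        raise ValueError("allocated_cpus and elapsed_seconds must be positive")
    return cpu_seconds_used / (allocated_cpus * elapsed_seconds)


if __name__ == "__main__":
    # Assumed example: a single-threaded script that ran for one hour on 32 CPUs
    # keeps one core busy, so it uses 3600 CPU-seconds out of 32 * 3600.
    efficiency = cpu_efficiency(3600, 32, 3600)
    print(f"CPU efficiency: {efficiency:.1%}")
```

An efficiency far below 100% suggests requesting fewer CPUs next time.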

Add ~20% to your estimated runtime. Jobs that hit the time limit are killed and lose any progress that has not been saved.
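As a rough sketch of the padding arithmetic, the helper below turns an estimate in minutes into a time string in the days-hours:minutes:seconds form that Slurm's --time option accepts. The function name and the integer-percent margin are assumptions made for illustration.

```python
def padded_time_limit(estimated_minutes, margin_percent=20):
    """Return 'D-HH:MM:SS' for the estimate plus a safety margin, rounded up."""
    if estimated_minutes <= 0:
        raise ValueError("estimated_minutes must be positive")
    # Integer ceiling division avoids floating-point rounding surprises.
    total_minutes = -(-estimated_minutes * (100 + margin_percent) // 100)
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:00"


if __name__ == "__main__":
    # Assumed example: a job expected to run for about 5 hours.
    print(f"#SBATCH --time={padded_time_limit(300)}")
```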

For parameter sweeps or batch processing:

#!/bin/bash
#SBATCH --array=1-100
python process.py --input "data_${SLURM_ARRAY_TASK_ID}.csv"

Much cleaner than 100 separate submissions.
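For a parameter sweep there may be no numbered input files, in which case the script itself can map the array task ID onto a parameter combination. The following is a minimal sketch of that idea; the grid values and the names LEARNING_RATES, BATCH_SIZES, and params_for_task are hypothetical and should be replaced with your own.

```python
import itertools
import os

# Hypothetical sweep grid (3 x 2 = 6 combinations, so submit with --array=1-6).
LEARNING_RATES = [0.001, 0.01, 0.1]
BATCH_SIZES = [32, 64]


def params_for_task(task_id):
    """Map a 1-based array task ID onto one (learning_rate, batch_size) pair."""
    grid = list(itertools.product(LEARNING_RATES, BATCH_SIZES))
    if not 1 <= task_id <= len(grid):
        raise ValueError(f"task ID {task_id} is outside 1-{len(grid)}")
    return grid[task_id - 1]


if __name__ == "__main__":
    # Slurm sets SLURM_ARRAY_TASK_ID inside each array task;
    # the default of 1 is only there so the script also runs locally.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "1"))
    learning_rate, batch_size = params_for_task(task_id)
    print(f"task {task_id}: learning_rate={learning_rate}, batch_size={batch_size}")
```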

Check your storage quota:

hpcquota

If your home directory is over quota, list its largest top-level files and directories (hidden caches included) with:

du -sh ~/.[!.]* ~/* 2>/dev/null | sort -rh | head -20

Files on scratch are deleted after 14 days without access. Copy important results to group storage.
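To see which files may be close to deletion, one option is to list files by last access time. This sketch assumes the purge is based on the access timestamp (st_atime) as the filesystem records it, which may not match exactly how the cluster measures access. The function name is an illustrative assumption.

```python
import os
import time
from pathlib import Path


def files_not_accessed_since(root, days, now=None):
    """Return regular files under root whose last access is older than `days` days."""
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    stale = []
    for path in Path(root).rglob("*"):
        try:
            if path.is_symlink() or not path.is_file():
                continue
            if path.stat().st_atime < cutoff:
                stale.append(path)
        except OSError:
            # Skip files that vanish or cannot be read while scanning.
            continue
    return sorted(stale)


if __name__ == "__main__":
    # Assumed example: report files in the current directory untouched for 10+ days,
    # leaving a few days of warning before a 14-day purge.
    for path in files_not_accessed_since(os.getcwd(), 10):
        print(path)
```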

Profile a Python script to see where the time goes, sorted by cumulative time:

python -m cProfile -s cumtime script.py

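Match library thread counts to the CPUs requested with --cpus-per-task. Slurm sets SLURM_CPUS_PER_TASK only when that option is given, so fall back to 1:

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export MKL_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}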
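The same allocation can be read from inside a Python program, for example to size a worker pool. This is a minimal sketch; the helper name allocated_cpus and the fallback of 1 are assumptions for illustration.

```python
import os


def allocated_cpus(default=1):
    """Return the CPUs per task granted by Slurm, or `default` if unset or invalid."""
    value = os.environ.get("SLURM_CPUS_PER_TASK", "")
    if value.isdigit() and int(value) > 0:
        return int(value)
    return default


if __name__ == "__main__":
    print(f"Using {allocated_cpus()} worker(s)")
```

The result can be passed to something like multiprocessing.Pool(processes=allocated_cpus()). Note that os.cpu_count() reports every CPU on the node rather than the job's allocation.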

Save progress periodically for long jobs:

if epoch % 10 == 0:
    torch.save(checkpoint, f'checkpoint_{epoch}.pt')
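Checkpoints only help if a resubmitted job picks up where the last one stopped. The sketch below finds the newest file following the checkpoint_<epoch>.pt naming used above, comparing epochs numerically so that epoch 100 ranks above epoch 20. The helper name latest_checkpoint is an assumption, and how the loaded state is restored depends on what your checkpoint contains.

```python
import re
from pathlib import Path

CHECKPOINT_NAME = re.compile(r"checkpoint_(\d+)\.pt")


def latest_checkpoint(directory="."):
    """Return (epoch, path) for the highest-numbered checkpoint, or None if there is none."""
    found = []
    for path in Path(directory).iterdir():
        match = CHECKPOINT_NAME.fullmatch(path.name)
        if match and path.is_file():
            found.append((int(match.group(1)), path))
    if not found:
        return None
    return max(found, key=lambda item: item[0])


if __name__ == "__main__":
    result = latest_checkpoint()
    if result is None:
        print("No checkpoint found, starting from epoch 0")
    else:
        epoch, path = result
        # With PyTorch you would load the saved state here, e.g. torch.load(path),
        # and continue training from the following epoch.
        print(f"Resuming after epoch {epoch} from {path}")
```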

Pin versions for reproducibility:

conda env export > environment.yml
pip freeze > requirements.txt

Load specific module versions:

module load python3/3.10.12  # Good
module load python3          # Risky

Monitor your jobs:

squeue -u $USER              # Job status
scontrol show job JOBID      # Details
seff JOBID                   # Efficiency report