Best Practices
Tips for effective cluster use.
Resource Requests
Request What You Need
Over-requesting hurts you: longer queue times, higher charges, fewer resources for everyone.
Tip
Start with a small request, check actual usage with seff JOBID once the job finishes, then adjust.
Match CPUs to Parallelism
Requesting 32 CPUs for a single-threaded script wastes 31 CPUs.
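To put a number on that waste, CPU efficiency can be estimated as CPU time used divided by allocated core-time. The sketch below is illustrative only: the function name and arguments are hypothetical, and the efficiency figure that seff reports may be computed somewhat differently.

```python
def cpu_efficiency(cpu_seconds_used: float, cores_allocated: int, wall_seconds: float) -> float:
    """Fraction of allocated core-time that did useful work (0.0 to 1.0)."""
    if cores_allocated <= 0 or wall_seconds <= 0:
        raise ValueError("cores_allocated and wall_seconds must be positive")
    return cpu_seconds_used / (cores_allocated * wall_seconds)


# A single-threaded script keeps one core busy for an hour on a 32-CPU allocation.
single_threaded = cpu_efficiency(cpu_seconds_used=3600, cores_allocated=32, wall_seconds=3600)
print(f"{single_threaded:.1%}")  # 3.1%
```

An efficiency this low suggests requesting fewer CPUs or enabling parallelism in the code.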
Add Time Buffer
Add ~20% to your estimated runtime. Jobs killed at the time limit lose any progress that has not been saved.
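The buffer arithmetic can be sketched as a small helper that turns a runtime estimate into a value for Slurm's --time option in days-hours:minutes:seconds form. The function name and its parameters are hypothetical, not part of any cluster tooling.

```python
def slurm_time_limit(estimated_minutes: int, buffer_percent: int = 20) -> str:
    """Return a D-HH:MM:SS time limit with a safety buffer added to the estimate."""
    # Ceiling division on integers, so the buffer never rounds down.
    total_minutes = -(-estimated_minutes * (100 + buffer_percent) // 100)
    days, remainder = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:00"


print(slurm_time_limit(300))  # 5-hour estimate -> 0-06:00:00
```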
Job Arrays
For parameter sweeps or batch processing:
#SBATCH --array=1-100
python process.py --input data_${SLURM_ARRAY_TASK_ID}.csv
Much cleaner than 100 separate submissions.
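For parameter sweeps, each array task can also map its task ID to one parameter combination instead of to an input file. A minimal sketch, assuming a hypothetical two-parameter grid (the names and values are illustrative) submitted with --array=1-6:

```python
import itertools
import os

# Hypothetical sweep: these parameter names and values are illustrative only.
LEARNING_RATES = [0.1, 0.01, 0.001]
BATCH_SIZES = [32, 64]


def params_for_task(task_id: int, first_id: int = 1):
    """Map an array task ID to one (learning_rate, batch_size) pair."""
    grid = list(itertools.product(LEARNING_RATES, BATCH_SIZES))
    index = task_id - first_id
    if not 0 <= index < len(grid):
        raise IndexError(f"task ID {task_id} is outside the {len(grid)}-point grid")
    return grid[index]


# Slurm sets SLURM_ARRAY_TASK_ID for each array task; default to 1 for local testing.
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "1"))
learning_rate, batch_size = params_for_task(task_id)
print(f"task {task_id}: lr={learning_rate}, batch_size={batch_size}")
```

Raising an error for out-of-range IDs makes a mismatch between the array range and the grid size fail loudly rather than silently reuse parameters.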
Storage
Use the Right Location
| What | Where | Why |
|---|---|---|
| Code, configs | | Small files |
| Project data | | Persistent, large quota |
| Temp files | | Fast, auto-cleaned |
Monitor Your Quota
hpcquota
Scratch Warning
Files on scratch are deleted after 14 days without access. Copy important results to group storage.
I/O Performance
Avoid many small files; use fewer, larger files or a container format such as HDF5
For I/O-heavy jobs, copy data to scratch first, then copy results back
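The copy-in, compute, copy-out pattern from the list above can be sketched as follows. The function name, the result.out file name, and the scratch_root argument are hypothetical; substitute your cluster's actual scratch location.

```python
import shutil
import tempfile
from pathlib import Path


def run_on_scratch(input_file: Path, results_dir: Path, scratch_root: Path, work) -> Path:
    """Copy input to scratch, run work(local_input, local_output) there, copy the result back."""
    job_dir = Path(tempfile.mkdtemp(dir=scratch_root))  # private working directory on scratch
    try:
        local_input = Path(shutil.copy2(input_file, job_dir))
        local_output = job_dir / "result.out"
        work(local_input, local_output)  # all heavy I/O happens on scratch
        results_dir.mkdir(parents=True, exist_ok=True)
        return Path(shutil.copy2(local_output, results_dir))
    finally:
        shutil.rmtree(job_dir, ignore_errors=True)  # remove the scratch copy either way
```

Here work is any callable that reads the local input and writes the local output. The finally block removes the scratch directory even if the work fails, which also helps keep scratch tidy.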
Code Efficiency
Profile First
python -m cProfile -s cumtime script.py
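To profile only part of a program rather than the whole script, the same profiler can be driven from code. A minimal sketch using the standard-library cProfile and pstats modules; slow_sum is a hypothetical stand-in for real work.

```python
import cProfile
import io
import pstats


def slow_sum(n: int) -> int:
    """Hypothetical hot spot standing in for real work."""
    return sum(i * i for i in range(n))


profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(200_000)
profiler.disable()

# Sort by cumulative time, matching -s cumtime, and show the top five entries.
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumtime").print_stats(5)
print(report.getvalue())
```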
Set Thread Counts
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export MKL_NUM_THREADS=$SLURM_CPUS_PER_TASK
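The same settings can be applied from inside a Python script, which is useful when the script is also run outside a batch job. A sketch, assuming a hypothetical helper name; SLURM_CPUS_PER_TASK is only set when --cpus-per-task is requested, so the code falls back to a default.

```python
import os


def configure_threads(default: int = 1) -> int:
    """Set OMP/MKL thread counts from the Slurm allocation and return the value used."""
    # Fall back to `default` when the job did not request --cpus-per-task.
    n_threads = int(os.environ.get("SLURM_CPUS_PER_TASK", default))
    for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS"):
        os.environ[var] = str(n_threads)
    return n_threads


# Call this before importing numerical libraries, which typically read
# these variables when they first load.
n_threads = configure_threads()
```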
Checkpointing
Save progress periodically for long jobs:
if epoch % 10 == 0:
    torch.save(checkpoint, f'checkpoint_{epoch}.pt')
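Saving is only half of checkpointing: a resubmitted job also needs to resume from the last saved state. The sketch below shows one way to do both, swapping in the standard-library pickle module for torch.save so that it runs without PyTorch. The file name, state layout, and function names are hypothetical.

```python
import os
import pickle

CHECKPOINT_PATH = "checkpoint.pkl"  # hypothetical file name


def save_checkpoint(state: dict, path: str = CHECKPOINT_PATH) -> None:
    """Write to a temporary file, then rename, so a kill mid-write cannot corrupt the checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)  # atomic when both paths are on the same filesystem


def load_checkpoint(path: str = CHECKPOINT_PATH) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"epoch": 0, "total": 0}


def train(num_epochs: int, path: str = CHECKPOINT_PATH) -> dict:
    """Resume from the last checkpoint and save every 10 epochs."""
    state = load_checkpoint(path)
    for epoch in range(state["epoch"] + 1, num_epochs + 1):
        state["total"] += epoch  # stand-in for one epoch of real work
        state["epoch"] = epoch
        if epoch % 10 == 0:
            save_checkpoint(state, path)
    return state
```

If a job is killed at epoch 25, rerunning train picks up from the epoch-20 checkpoint, so at most 10 epochs of work are repeated.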
Environment Management
Pin versions for reproducibility:
conda env export > environment.yml
pip freeze > requirements.txt
Load specific module versions:
module load python3/3.10.12 # Good: pinned version
module load python3 # Risky: the default version can change
Monitoring
squeue -u $USER # Job status
scontrol show job JOBID # Details
seff JOBID # Efficiency report
Good Citizenship
Don’t run heavy work on login nodes
Clean up scratch and old outputs
Cancel unneeded jobs
Pre-Submit Checklist
Tested on small scale?
Appropriate resource requests?
Using scratch for temp files?
Checkpoints for long runs?
Correct paths and modules?