Best Practices

Tips for effective cluster use.

Resource Requests

Request What You Need

Over-requesting hurts you and everyone else: your jobs wait longer in the queue, you are charged more, and fewer resources remain for other users.

Tip

Start small, check usage with seff JOBID, then adjust.
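To make "then adjust" concrete, the sketch below reproduces the arithmetic behind a CPU efficiency figure of the kind seff reports: CPU time actually used, divided by allocated cores times elapsed wall time. The function name and the sample numbers are illustrative assumptions, not output from a real job.

```python
def cpu_efficiency(cpu_seconds_used, cores_allocated, wall_seconds):
    """Return the percentage of allocated core-time that did real work."""
    core_walltime = cores_allocated * wall_seconds
    if core_walltime <= 0:
        raise ValueError("cores_allocated and wall_seconds must be positive")
    return 100.0 * cpu_seconds_used / core_walltime


# Hypothetical job A: 4 CPUs for 2 hours, 7 CPU-hours used.
print(f"job A: {cpu_efficiency(7 * 3600, 4, 2 * 3600):.1f}%")
# Hypothetical job B: 8 CPUs for 1 hour, only 1 CPU-hour used.
print(f"job B: {cpu_efficiency(1 * 3600, 8, 1 * 3600):.1f}%")
```

A low percentage like job B's suggests requesting fewer CPUs next time; a high one like job A's suggests the request was about right.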

Match CPUs to Parallelism

Requesting 32 CPUs for a single-threaded script wastes 31 CPUs.
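One way to keep the request and the code in step is to have the script read its worker count from the allocation instead of hard-coding it. The sketch below assumes a Python job that uses process-based parallelism; `allocated_cpus`, `square`, and `run_parallel` are hypothetical names chosen for illustration.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def allocated_cpus(env=None):
    """Number of CPUs Slurm gave each task, defaulting to 1."""
    env = os.environ if env is None else env
    # SLURM_CPUS_PER_TASK is only set when --cpus-per-task was requested.
    return max(1, int(env.get("SLURM_CPUS_PER_TASK", "1")))


def square(x):
    # Stand-in for real CPU-bound work.
    return x * x


def run_parallel(items, env=None):
    """Spread the work over exactly as many processes as were allocated."""
    with ProcessPoolExecutor(max_workers=allocated_cpus(env)) as pool:
        return list(pool.map(square, items))
```

Inside a job submitted with `--cpus-per-task=8`, `run_parallel(range(100))` would start eight worker processes; outside a job it falls back to one.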

Add Time Buffer

Add ~20% to your estimated runtime. Jobs killed at the time limit lose all progress that has not been saved to disk.
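The buffer arithmetic can be sketched as a small helper that turns an estimate in seconds into a `days-hours:minutes:seconds` string, one of the formats Slurm accepts for `--time`. The function name and the 10-hour estimate are assumptions made for the example.

```python
def padded_time_limit(estimated_seconds, buffer_percent=20):
    """Return estimated runtime plus a buffer as a D-HH:MM:SS string."""
    if estimated_seconds <= 0:
        raise ValueError("estimated_seconds must be positive")
    # Integer ceiling division avoids floating-point rounding surprises.
    total = (estimated_seconds * (100 + buffer_percent) + 99) // 100
    days, rest = divmod(total, 86400)
    hours, rest = divmod(rest, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"


# Hypothetical estimate: the job took about 10 hours in a test run.
print("#SBATCH --time=" + padded_time_limit(10 * 3600))
```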

Job Arrays

For parameter sweeps or batch processing:

#SBATCH --array=1-100
# Each task gets its own index in SLURM_ARRAY_TASK_ID (1 to 100 here)
python process.py --input "data_${SLURM_ARRAY_TASK_ID}.csv"

Much cleaner than 100 separate submissions.
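For a parameter sweep, the array index usually needs to be mapped to a parameter combination rather than a file name. The following is a minimal sketch of that mapping, assuming the script is written in Python; the grid contents and the function name `params_for_task` are hypothetical.

```python
import itertools
import os


def params_for_task(task_id, grid):
    """Map a 1-based array task ID to one combination from the grid."""
    names = list(grid)
    combos = list(itertools.product(*(grid[name] for name in names)))
    if not 1 <= task_id <= len(combos):
        raise IndexError(f"task_id {task_id} outside 1-{len(combos)}")
    return dict(zip(names, combos[task_id - 1]))


# Hypothetical sweep: 2 learning rates x 2 batch sizes = 4 array tasks,
# so the matching directive would be: #SBATCH --array=1-4
GRID = {"learning_rate": [0.1, 0.01], "batch_size": [32, 64]}

if __name__ == "__main__":
    # Slurm sets SLURM_ARRAY_TASK_ID for each array task; default to 1
    # so the script can also be tried outside a job.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "1"))
    print(params_for_task(task_id, GRID))
```

Keeping the grid in the script means the array range and the number of combinations have to be kept in sync by hand; the bounds check makes a mismatch fail loudly instead of silently reusing parameters.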

Storage

Use the Right Location

What            Where               Why
Code, configs   /home               Small files
Project data    /resnick/groups     Persistent, large quota
Temp files      /resnick/scratch    Fast, auto-cleaned

Monitor Your Quota

hpcquota

Scratch Warning

Files on scratch are deleted after 14 days without access. Copy important results to group storage.
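To see which files are approaching the cutoff, a script can compare each file's last-access time with the current time. The sketch below assumes the purge is driven by the access timestamp (`st_atime`), which is a guess based on the wording "without access"; some filesystems update that timestamp lazily, so treat the result as approximate. The function name is hypothetical.

```python
import os
import time


def files_at_risk(root, max_idle_days=14, now=None):
    """Return paths under root whose last access is older than max_idle_days."""
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                atime = os.stat(path).st_atime
            except OSError:
                continue  # File vanished or is unreadable; skip it.
            if atime < cutoff:
                stale.append(path)
    return sorted(stale)
```

Running it on your scratch directory with a smaller `max_idle_days`, such as 10, gives a few days of warning before the 14-day limit is reached.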

I/O Performance

  • Avoid many small files; combine them into fewer large files or use HDF5

  • For I/O-heavy jobs, copy data to scratch first, then copy results back
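As one illustration of the first point, many small files can be packed into a single archive and read back individually. This sketch uses a plain tar archive from the Python standard library rather than HDF5, purely because it needs no extra packages; the function names are hypothetical.

```python
import os
import tarfile


def bundle_directory(src_dir, archive_path):
    """Pack every file under src_dir into one uncompressed tar archive.

    archive_path should live outside src_dir so the archive does not
    try to include itself.
    """
    count = 0
    with tarfile.open(archive_path, "w") as tar:
        for dirpath, _dirnames, filenames in os.walk(src_dir):
            for name in sorted(filenames):
                full = os.path.join(dirpath, name)
                # Store paths relative to src_dir so the archive is relocatable.
                tar.add(full, arcname=os.path.relpath(full, src_dir))
                count += 1
    return count


def read_member(archive_path, member_name):
    """Read one file back out of the archive without unpacking the rest."""
    with tarfile.open(archive_path, "r") as tar:
        with tar.extractfile(member_name) as handle:
            return handle.read()
```

The filesystem then tracks one large file instead of thousands of small ones, which is usually what matters for metadata-heavy workloads.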

Code Efficiency

Profile First

python -m cProfile -s cumtime script.py
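When only one function is of interest, the same profiler can be driven from inside the script. The sketch below wraps a single call with `cProfile` and formats the result with `pstats`, sorted by cumulative time like the command above; `slow_sum` and `profile_call` are hypothetical names standing in for real code.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    # Deliberately naive stand-in for the real hot spot.
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args, top=5):
    """Run func under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args)
    finally:
        profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


if __name__ == "__main__":
    _, report = profile_call(slow_sum, 100_000)
    print(report)
```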

Set Thread Counts

# SLURM_CPUS_PER_TASK is only set when --cpus-per-task is requested; fall back to 1
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
export MKL_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

Checkpointing

Save progress periodically for long jobs:

# Inside the training loop: save every 10th epoch
if epoch % 10 == 0:
    torch.save(checkpoint, f'checkpoint_{epoch}.pt')
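Saving is only half of checkpointing; the job also has to resume from the last saved state when it is resubmitted. The following is a minimal sketch of that save-and-resume loop. It uses a JSON file and a running total as stand-ins for `torch.save` and real training state, and all function names are hypothetical. The checkpoint is written to a temporary file and then renamed, so a job killed mid-write does not leave a corrupt checkpoint behind.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then atomically rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"epoch": 0, "total": 0}
    with open(path) as handle:
        return json.load(handle)


def train(path, num_epochs, every=10, stop_after=None):
    """Run (or resume) a toy training loop with periodic checkpoints."""
    state = load_checkpoint(path)
    for epoch in range(state["epoch"] + 1, num_epochs + 1):
        state["total"] += epoch  # Stand-in for one epoch of real work.
        state["epoch"] = epoch
        if epoch % every == 0:
            save_checkpoint(path, state)
        if stop_after is not None and epoch == stop_after:
            return state  # Simulates the job being killed at this point.
    save_checkpoint(path, state)
    return state
```

If the first run stops at epoch 25, the second run picks up from the epoch 20 checkpoint and repeats only epochs 21 onward, so at most `every` epochs of work are lost.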

Environment Management

Pin versions for reproducibility:

conda env export > environment.yml
pip freeze > requirements.txt

Load specific module versions:

module load python3/3.10.12  # Good
module load python3          # Risky
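Pinned files describe the intended environment; it can also help to record what a job actually ran with. As a rough sketch, assuming a Python job, the snippet below collects the interpreter version and installed package versions using the standard library's `importlib.metadata`. The function names and the output layout are hypothetical.

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot():
    """Collect the interpreter version and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        if name:  # Skip broken installs that lack a name.
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "executable": sys.executable,
        "packages": dict(sorted(packages.items())),
    }


def write_snapshot(path):
    """Save the snapshot as JSON next to the job's other outputs."""
    with open(path, "w") as handle:
        json.dump(environment_snapshot(), handle, indent=2)
```

Calling `write_snapshot` at the start of a job leaves a per-run record that can be compared against `environment.yml` or `requirements.txt` if results later differ.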

Monitoring

squeue -u $USER              # Job status
scontrol show job JOBID      # Details
seff JOBID                   # Efficiency report

Good Citizenship

  • Don’t run heavy work on login nodes

  • Clean up scratch and old outputs

  • Cancel unneeded jobs

Pre-Submit Checklist

  • Tested on small scale?

  • Appropriate resource requests?

  • Using scratch for temp files?

  • Checkpoints for long runs?

  • Correct paths and modules?

Questions?

See Common Problems, or email help-hpc@caltech.edu.