Frequently Asked Questions

Account & Access

Job Management

Software & Containers

Troubleshooting


Quick Answers

How do I log in to the cluster?

ssh username@login.hpc.caltech.edu

How do external collaborators get an account?

Ask your PI to request access on your behalf through help-hpc@caltech.edu.

How should I acknowledge research on the cluster?

Add the following to all resulting publications and presentations:

The computations presented here were conducted in the Resnick High Performance Computing Center, a facility supported by Resnick Sustainability Institute at the California Institute of Technology.

Why won’t my job start?

Common reasons:

  1. Resources unavailable - Requested resources exceed availability

  2. Queue depth - Many jobs ahead of yours

  3. Fairshare - Your group has used significant resources recently

Check with squeue -u $USER, whose NODELIST(REASON) column shows why a pending job is waiting, and scontrol show job JOBID for the full job record.
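
To see only the pending jobs and the scheduler's stated reason for each, a small wrapper can help. This is a minimal sketch: pending_reasons is a hypothetical helper name, and it assumes only the standard squeue format specifiers %i (job ID), %j (job name) and %r (reason).

```shell
# pending_reasons: list your pending jobs with the reason each is waiting.
# (Hypothetical helper name; relies only on standard squeue options.)
pending_reasons() {
    # -h drops the header, -t PENDING keeps only queued jobs,
    # %i = job ID, %j = job name, %r = reason the job is waiting
    squeue -u "${USER:-$(id -un)}" -h -t PENDING -o '%i %j %r'
}
```

Run pending_reasons on a login node. A reason of Priority means other jobs are ahead of yours; Resources means the job is next in line but is waiting for nodes to free up.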

How do I get job information via email?

Add to your SLURM script:

#SBATCH --mail-user=<your-email-address>
#SBATCH --mail-type=BEGIN,END,FAIL

How do I modify my bash environment?

Edit ~/.bashrc for interactive non-login shells or ~/.bash_profile for login shells. An SSH session starts a login shell.
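
As an illustration, lines like the following could go in ~/.bashrc. This is a sketch only: the ~/bin directory, the path_prepend helper, and the myjobs alias are hypothetical examples, not cluster defaults.

```shell
# Prepend a directory to PATH only if it is not already there,
# so re-sourcing ~/.bashrc does not keep growing PATH.
path_prepend() {
    case ":$PATH:" in
        *":$1:"*) ;;                # already present: do nothing
        *) PATH="$1:$PATH" ;;       # otherwise put it first
    esac
}

path_prepend "$HOME/bin"            # hypothetical personal script directory
export PATH

# Hypothetical convenience alias for listing your own jobs
alias myjobs='squeue -u "$USER"'
```

Run source ~/.bashrc to apply the changes to the current session.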

How do I compress unused data?

tar -czvf archive.tar.gz directory/
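
Before deleting the original directory, it is worth confirming that the archive is readable. The following sketch walks through the full round trip in a throwaway location; the file and directory names under the temporary directory are illustrative.

```shell
# Work in a scratch location so the example does not touch real data
workdir=$(mktemp -d)
mkdir -p "$workdir/directory"
echo "sample" > "$workdir/directory/results.txt"

# Create the archive (-C changes directory first so stored paths stay relative)
tar -czf "$workdir/archive.tar.gz" -C "$workdir" directory

# List the contents without extracting; a non-zero exit status
# indicates a damaged archive
tar -tzf "$workdir/archive.tar.gz" > "$workdir/listing.txt"

# Extract into a separate directory, as you would when restoring the data
mkdir -p "$workdir/restore"
tar -xzf "$workdir/archive.tar.gz" -C "$workdir/restore"
```

Only remove the source directory once the listing and a test extraction both succeed.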

How are priority and fairshare set up?

Priority is calculated based on:

  • Fairshare - Historical group usage

  • Job age - Time in queue

  • Job size - Smaller jobs may start sooner
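
Slurm provides two commands that expose these factors for your own jobs. The sketch below assumes the site uses Slurm's multifactor priority plugin, which this page does not state explicitly; the wrapper names my_priority and my_fairshare are hypothetical.

```shell
# Per-factor priority breakdown (age, fairshare, job size, ...) for your
# pending jobs. sprio reports only jobs that are still waiting.
my_priority() {
    sprio -l -u "${USER:-$(id -un)}"
}

# Fairshare standing of your user and account associations
my_fairshare() {
    sshare -u "${USER:-$(id -un)}"
}
```

Comparing the columns of my_priority across two pending jobs shows which factor is holding one of them back.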

Using the debug QOS

For quick tests (up to 30 minutes):

#SBATCH --qos=debug
#SBATCH --time=00:30:00

I have a deadline and need my job to run now!

Contact help-hpc@caltech.edu to discuss options. See Reservations.

I need to run longer than 7 days

Contact help-hpc@caltech.edu to discuss extended walltime options.

Dependencies and pipelines

Use SLURM job dependencies:

# Submit job that waits for job 12345
sbatch --dependency=afterok:12345 next_job.sh
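
For a multi-stage pipeline, the job ID of each stage can be captured at submission time instead of being typed by hand. A minimal sketch, in which submit_pipeline and the three script names are hypothetical placeholders:

```shell
# submit_pipeline: submit three stages so that each one starts only if the
# previous stage finished successfully. Script names are placeholders.
submit_pipeline() {
    local prep_id run_id post_id
    # --parsable makes sbatch print just the job ID
    # (followed by ";cluster" on multi-cluster setups, stripped below)
    prep_id=$(sbatch --parsable preprocess.sh) || return 1
    run_id=$(sbatch --parsable --dependency=afterok:"${prep_id%%;*}" simulate.sh) || return 1
    post_id=$(sbatch --parsable --dependency=afterok:"${run_id%%;*}" postprocess.sh) || return 1
    echo "submitted: $prep_id -> $run_id -> $post_id"
}
```

afterok releases the dependent job only if the earlier job exits successfully. Use afterany instead if the next stage should run regardless of the earlier job's exit status.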

How do I checkpoint before my job hits its walltime?

Ask SLURM to send a signal a fixed number of seconds before the walltime, then trap it in your script to save state:

#SBATCH --signal=B:SIGTERM@120   # send SIGTERM 120s before the time limit

trap 'echo "saving checkpoint..."; ./save_state.sh; exit 0' SIGTERM
./long_running_program &
wait

The B: prefix sends the signal to the batch script itself rather than the job steps.
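
The trap logic can be exercised without Slurm by sending the signal by hand. In this self-contained sketch, a sleep stands in for the long-running program and a marker file stands in for ./save_state.sh; both substitutions are assumptions made for the demonstration.

```shell
# Write a miniature job script to a scratch directory
demo=$(mktemp -d)
cat > "$demo/job.sh" <<'EOF'
#!/bin/bash
# On SIGTERM: record a "checkpoint", stop the background program, exit cleanly
trap 'echo "checkpoint" > "$(dirname "$0")/state.txt"; kill "$pid" 2>/dev/null; exit 0' SIGTERM
sleep 60 &          # stand-in for ./long_running_program
pid=$!
wait "$pid"         # wait returns early when the trapped signal arrives
EOF

# Start the script, give it a moment to install the trap, then signal it
bash "$demo/job.sh" &
job_pid=$!
sleep 1
kill -TERM "$job_pid"
wait "$job_pid"
status=$?
echo "exit status: $status, state: $(cat "$demo/state.txt")"
```

Running the program in the background and calling wait is what makes this work: bash defers trap handlers while a foreground command is running, but wait is interrupted as soon as the signal arrives.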

How do I check my group’s compute usage?

# Your own recent jobs
sacct -u $USER --starttime=2026-01-01 -o JobID,Elapsed,AllocCPUS,State

# Group-level usage over a period
sreport cluster AccountUtilizationByUser start=2026-01-01
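
To condense the sacct output into a single figure, the elapsed time and core count of each job can be multiplied and summed. A minimal sketch, where cpu_hours is a hypothetical helper that uses only standard sacct options and fields:

```shell
# cpu_hours: total CPU-hours for your jobs since a given date (YYYY-MM-DD).
cpu_hours() {
    # -X: one line per job (no steps), -n: no header, -P: '|'-separated output
    # ElapsedRaw is the run time in seconds, AllocCPUS the allocated cores
    sacct -u "${USER:-$(id -un)}" --starttime="$1" -X -n -P -o ElapsedRaw,AllocCPUS |
        awk -F'|' '{ total += $1 * $2 } END { printf "%.1f\n", total / 3600 }'
}
```

For example, cpu_hours 2026-01-01 prints the CPU-hours you have accumulated since the start of 2026. This counts allocated cores for the elapsed time, which may differ from how the site bills usage.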

Detailed Guides