Frequently Asked Questions

Account & Access

How do I set up SSH keys for passwordless login?
Cluster IP Space - IPs to whitelist on remote firewalls

Job Management

Software & Containers

Troubleshooting

UCX Error: No space left on device

Quick Answers

How do external collaborators get an account?

Ask your PI to request access for you by emailing help-hpc@caltech.edu.

How should I acknowledge research on the cluster?

Add the following to all resulting publications and presentations:

The computations presented here were conducted in the Resnick High Performance Computing Center, a facility supported by Resnick Sustainability Institute at the California Institute of Technology.

Why won’t my job start?

Common reasons:

Resources unavailable - Requested resources exceed availability
Queue depth - Many jobs ahead of yours
Fairshare - Your group has used significant resources recently

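Check with squeue -u $USER, which lists your jobs and their state, and scontrol show job JOBID, which shows a job's details, including the Reason it is still pending.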
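The reason code SLURM records for a pending job usually points to one of the causes above. The sketch below lists your pending jobs with their reason codes and adds a plain-language hint. The explain_reason helper is hypothetical and covers only a few common codes, and the squeue call is skipped on machines where SLURM is not installed.

```shell
# Translate a SLURM pending-reason code into a plain-language hint.
# (Hypothetical helper; only a few common codes are covered.)
explain_reason() {
    case "$1" in
        Resources)  echo "waiting for the requested resources to become free" ;;
        Priority)   echo "higher-priority jobs are ahead in the queue" ;;
        Dependency) echo "waiting for a job it depends on to finish" ;;
        *)          echo "see the Reason field in 'scontrol show job' output" ;;
    esac
}

# List your pending jobs with their reason codes (only where squeue exists).
if command -v squeue >/dev/null 2>&1; then
    squeue -u "${USER:-$(id -un)}" -t PENDING -h -o "%i %r" |
    while read -r job_id reason; do
        echo "$job_id: $reason ($(explain_reason "$reason"))"
    done
fi
```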

How do I get job information via email?

Add to your SLURM script:

#SBATCH --mail-user=<your-email-address>
#SBATCH --mail-type=BEGIN,END,FAIL

How do I modify my bash environment?

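Edit ~/.bashrc for interactive non-login shells or ~/.bash_profile for login shells. Many setups source ~/.bashrc from ~/.bash_profile so that both kinds of shell share the same settings.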
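As an illustration, lines like the following could go in ~/.bashrc. The ~/bin directory and the myjobs function are hypothetical examples, not cluster defaults.

```shell
# Example additions to ~/.bashrc (all names here are illustrative)

# Put personal scripts ahead of system commands on the search path
export PATH="$HOME/bin:$PATH"

# Shortcut for listing your own jobs
myjobs() {
    squeue -u "$USER"
}
```

Changes apply to new shells; run source ~/.bashrc to pick them up in the current one.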

How do I compress unused data?

tar -czvf archive.tar.gz directory/
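Before deleting the original directory, it is worth confirming that the archive is readable. The sketch below runs the full cycle on throwaway data in a temporary directory; the old_results and run1.txt names are made up for the example.

```shell
# Work in a scratch directory so nothing real is touched
workdir=$(mktemp -d)
mkdir -p "$workdir/old_results"
echo "sample data" > "$workdir/old_results/run1.txt"

# Create the archive; -C makes the stored paths relative to $workdir
tar -czf "$workdir/old_results.tar.gz" -C "$workdir" old_results

# Remove the original only if the archive can be listed without errors
if tar -tzf "$workdir/old_results.tar.gz" > /dev/null; then
    rm -r "$workdir/old_results"
fi

# Later, restore the data from the archive
tar -xzf "$workdir/old_results.tar.gz" -C "$workdir"
cat "$workdir/old_results/run1.txt"
```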

How are priority and fairshare set up?

Priority is calculated based on:

Fairshare - Historical group usage
Job age - Time in queue
Job size - Smaller jobs may start sooner
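Assuming the cluster uses SLURM's multifactor priority plugin, factors like these are each normalized to a value between 0 and 1 and combined as a weighted sum. The sketch below shows only the arithmetic: the weights are made up, the cluster's actual weights are not stated here, and the job_priority helper is hypothetical. Where SLURM is installed, sprio reports the real per-factor breakdown for your pending jobs.

```shell
# Weighted-sum sketch of a multifactor priority.
# Arguments: fairshare, age, and job-size factors, each between 0 and 1.
job_priority() {
    awk -v fs="$1" -v age="$2" -v size="$3" 'BEGIN {
        # Illustrative weights only, not this cluster configuration
        w_fs = 10000; w_age = 1000; w_size = 500
        printf "%d\n", w_fs * fs + w_age * age + w_size * size
    }'
}

# 10000*0.5 + 1000*0.25 + 500*0.5 = 5500
job_priority 0.5 0.25 0.5

# Real per-factor values for your pending jobs (only where sprio exists)
if command -v sprio >/dev/null 2>&1; then
    sprio -u "${USER:-$(id -un)}"
fi
```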

Using the debug QOS

For quick tests (up to 30 minutes):

#SBATCH --qos=debug
#SBATCH --time=00:30:00

I have a deadline and need my job to run now!

Contact help-hpc@caltech.edu to discuss options. See Reservations.

I need to run longer than 7 days

Contact help-hpc@caltech.edu to discuss extended walltime options.

Dependencies and pipelines

Use SLURM job dependencies:

# Submit job that waits for job 12345
sbatch --dependency=afterok:12345 next_job.sh
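For a multi-stage pipeline, the job ID can be captured at submission time instead of being typed by hand. The sketch below uses sbatch --parsable, which prints just the job ID (optionally followed by ";cluster"). The submit_chain function and the script names in the usage comment are hypothetical.

```shell
# Submit scripts in order, each one waiting for the previous job to succeed.
submit_chain() {
    local prev_id="" job_id script
    for script in "$@"; do
        if [ -z "$prev_id" ]; then
            job_id=$(sbatch --parsable "$script") || return 1
        else
            job_id=$(sbatch --parsable --dependency=afterok:"$prev_id" "$script") || return 1
        fi
        job_id=${job_id%%;*}    # drop a trailing ";cluster" if present
        echo "submitted $script as job $job_id"
        prev_id=$job_id
    done
}

# Usage on the cluster (script names are placeholders):
#   submit_chain preprocess.sh analyze.sh summarize.sh
```

With afterok, a later stage starts only if the stage before it exits successfully.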

How do I checkpoint before my job hits its walltime?

Ask SLURM to send a signal a fixed number of seconds before the walltime, then trap it in your script to save state:

#SBATCH --signal=B:SIGTERM@120   # send SIGTERM 120s before the time limit

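# Save state and exit cleanly when the warning signal arrives
trap 'echo "saving checkpoint..."; ./save_state.sh; exit 0' SIGTERM
./long_running_program &   # run in the background so the shell can act on the signal
wait                       # returns early when the trapped signal arrives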

The B: prefix sends the signal to the batch script itself rather than the job steps.
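The trap-and-wait pattern can be tried on any machine before using it in a real job. The sketch below is a stand-alone demonstration that needs no SLURM: sleep stands in for the long-running program, a background subshell stands in for SLURM's early warning signal, and save_state is a hypothetical placeholder for your own checkpoint logic.

```shell
#!/bin/bash
# Local demonstration of the trap-and-wait pattern; it does not need SLURM.

checkpoint_file=$(mktemp)
main_pid=${BASHPID:-$$}       # PID of the shell that installs the trap

# Placeholder for real checkpoint logic
save_state() {
    echo "checkpoint saved" > "$checkpoint_file"
}

# On SIGTERM: save state, then stop the worker
trap 'save_state; kill "$worker_pid" 2>/dev/null' TERM

sleep 30 &                    # stand-in for the long-running program
worker_pid=$!

# Stand-in for SLURM's early warning: signal this shell after 1 second
( sleep 1; kill -TERM "$main_pid" ) &

# wait returns as soon as the trapped signal arrives
wait "$worker_pid" || true

cat "$checkpoint_file"
```

Running the program in the background matters: bash defers a trap until the current foreground command finishes, whereas wait is interrupted right away.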

How do I check my group’s compute usage?

# Your own recent jobs
sacct -u $USER --starttime=2026-01-01 -o JobID,Elapsed,AllocCPUS,State

# Group-level usage over a period
sreport cluster AccountUtilizationByUser start=2026-01-01
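To turn the sacct output into a single number, elapsed seconds can be multiplied by allocated CPUs and summed. The sketch below is a rough estimate, not the cluster's official accounting; the cpu_hours helper is hypothetical, and the sacct call is skipped where SLURM is not installed.

```shell
# Sum CPU-hours from "JobID|ElapsedRaw|AllocCPUS" lines read on stdin.
# ElapsedRaw is the job's elapsed time in seconds.
cpu_hours() {
    awk -F'|' '{ total += $2 * $3 } END { printf "%.1f\n", total / 3600 }'
}

# -X: one line per job allocation, -n: no header, -P: pipe-separated fields
if command -v sacct >/dev/null 2>&1; then
    sacct -u "${USER:-$(id -un)}" -X -n -P --starttime=2026-01-01 \
          -o JobID,ElapsedRaw,AllocCPUS | cpu_hours
fi
```

For example, a one-hour job on 4 CPUs plus a half-hour job on 2 CPUs comes to 5.0 CPU-hours.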