Glossary

Common HPC terms and their definitions.


A

Allocation

A grant of computing resources (CPU hours, storage) to a research group.

Apptainer

Container runtime for HPC (formerly Singularity). Allows running containerized applications without root privileges.

B

Batch Job

A job submitted to the scheduler to run without user interaction. Contrast with interactive job.

C

Cluster

A collection of interconnected computers (nodes) that work together as a single system.

Compute Node

A server in the cluster dedicated to running user jobs. Users cannot log in directly; jobs are assigned by the scheduler.

Container

A lightweight, standalone package containing code and dependencies. See Apptainer, Docker.

Core

A single processing unit within a CPU. Modern CPUs have multiple cores (e.g., 64 cores per CPU).

CPU (Central Processing Unit)

The main processor in a computer. HPC nodes typically have 1-2 CPUs with many cores each.

D

Docker

Popular container platform. Docker images can be converted to Apptainer format for HPC use.

E

Environment Module

System for managing software packages. Users load modules to access specific software versions.

F

Fairshare

Scheduling policy that balances resource allocation based on historical usage. Groups that recently used many resources get lower priority.
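As an illustration of how usage lowers priority, the sketch below uses the formula from Slurm's classic fair-share algorithm, F = 2^(-U/S/d). Whether this cluster uses that algorithm is not stated here, and the function and parameter names are hypothetical.

```python
def fairshare_factor(usage_fraction: float, share_fraction: float,
                     dampening: float = 1.0) -> float:
    """Return a priority factor in (0, 1]; 1.0 means no recent usage.

    usage_fraction: the group's share of recent cluster usage (0..1).
    share_fraction: the share of the cluster the group is entitled to (0..1).
    dampening:      optional damping factor (1.0 means no damping).
    """
    if share_fraction <= 0 or dampening <= 0:
        raise ValueError("share_fraction and dampening must be positive")
    if usage_fraction < 0:
        raise ValueError("usage_fraction must not be negative")
    return 2.0 ** (-(usage_fraction / share_fraction) / dampening)


# A hypothetical group entitled to 10% of the cluster.
light = fairshare_factor(0.05, 0.10)  # used half of its share
even = fairshare_factor(0.10, 0.10)   # used exactly its share
heavy = fairshare_factor(0.30, 0.10)  # used three times its share

print(f"light={light:.3f} even={even:.3f} heavy={heavy:.3f}")
```

The factor falls as recent usage rises relative to the group's share, which is the behavior the definition describes.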

FLOPS

Floating-point Operations Per Second. Measure of computing performance.
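A theoretical peak figure is usually estimated by multiplying out the hardware counts. The sketch below shows that arithmetic; the node counts, clock speed, and operations per cycle are made-up values, not a description of any particular cluster.

```python
def peak_flops(nodes: int, sockets_per_node: int, cores_per_socket: int,
               clock_hz: float, flops_per_cycle: int) -> float:
    """Theoretical peak FLOPS: every core issuing its maximum every cycle."""
    return (nodes * sockets_per_node * cores_per_socket
            * clock_hz * flops_per_cycle)


# Hypothetical system: 10 nodes, 2 CPUs per node, 32 cores per CPU,
# 2.5 GHz clock, 16 floating-point operations per core per cycle.
peak = peak_flops(nodes=10, sockets_per_node=2, cores_per_socket=32,
                  clock_hz=2.5e9, flops_per_cycle=16)

print(f"{peak / 1e12:.1f} TFLOPS")  # prints "25.6 TFLOPS"
```

Real applications typically reach only a fraction of this peak.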

G

Globus

High-performance file transfer service for moving large datasets between systems.

GPU (Graphics Processing Unit)

Specialized processor for parallel computations. Essential for deep learning and certain scientific simulations.

H

Head Node

Server that manages the cluster, runs the scheduler, and routes network traffic.

HPC (High Performance Computing)

Using supercomputers and clusters to solve complex computational problems.

HTC (High Throughput Computing)

Running many independent jobs, optimizing for total work completed rather than individual job speed.

I

InfiniBand

High-speed, low-latency network interconnect used between cluster nodes. Typically offers lower latency than standard Ethernet.

Interactive Job

A job in which the user interacts directly with the compute node via a terminal. Contrast with batch job.

J

Job

A unit of work submitted to the cluster scheduler.

Job Array

A collection of similar jobs submitted as a single entity, each with a unique index.
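SLURM exposes each array member's index in the SLURM_ARRAY_TASK_ID environment variable. A minimal sketch of using it to pick one input per member follows; the file names and the pick_input helper are hypothetical, and the array is assumed to have been submitted with indices 0 to 2 (for example, sbatch --array=0-2).

```python
import os


def pick_input(inputs, env=None):
    """Return the input that belongs to this array member."""
    env = os.environ if env is None else env
    index = int(env["SLURM_ARRAY_TASK_ID"])
    return inputs[index]


# Hypothetical input files, one per array index.
inputs = ["sample_a.dat", "sample_b.dat", "sample_c.dat"]

# Outside a real job the variable is unset, so it is simulated here.
chosen = pick_input(inputs, env={"SLURM_ARRAY_TASK_ID": "1"})
print(chosen)  # prints "sample_b.dat"
```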

Job Script

A shell script containing SLURM directives and commands to execute.
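To show the shape of such a script, the sketch below renders one as text. The #SBATCH options used are standard sbatch options; the partition name, resource values, program name, and the render_job_script helper are all hypothetical.

```python
def render_job_script(job_name, partition, walltime, ntasks,
                      cpus_per_task, mem, command):
    """Return the text of a minimal batch job script."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --time={walltime}",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --cpus-per-task={cpus_per_task}",
        f"#SBATCH --mem={mem}",
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


script = render_job_script(
    job_name="demo",
    partition="compute",          # hypothetical partition name
    walltime="01:00:00",          # one hour
    ntasks=1,
    cpus_per_task=4,
    mem="8G",
    command="srun ./my_program",  # hypothetical program
)
print(script)
```

The directives are comment lines that the scheduler reads at submission time; the commands after them run on the allocated resources.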

L

Login Node

Server where users connect via SSH. Used for file editing, job submission, and light tasks. Not for heavy computation.

M

MFA (Multi-Factor Authentication)

Security mechanism that requires more than one verification method (e.g., password plus the Duo app).

Module

See Environment Module.

MPI (Message Passing Interface)

Standard for message-passing parallel programming. Processes, which may run on one node or across many, use MPI to communicate with each other.

N

Node

A single computer/server in the cluster.

NUMA (Non-Uniform Memory Access)

Memory architecture where access time depends on memory location relative to the processor.

O

OOD (Open OnDemand)

Web-based interface for accessing HPC resources via browser.

OpenMP

API for shared-memory parallel programming. Programs use OpenMP for multi-threaded parallelism within a single node.

P

Parallel Computing

Running computations simultaneously across multiple cores, GPUs, or nodes.

Partition

A logical grouping of nodes in SLURM (e.g., gpu partition for GPU nodes).

PI (Principal Investigator)

Faculty member leading a research group. PIs authorize group membership and manage allocations.

PTA

Project-Task-Award. Caltech’s accounting code for charging compute usage.

Q

QOS (Quality of Service)

SLURM setting that modifies job priority or limits. Example: debug QOS for quick test jobs.

Queue

List of jobs waiting to run. Jobs are selected based on priority and resource availability.

Quota

Storage limit (e.g., 50 GB home directory quota).
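A rough way to compare a directory's size against a quota is sketched below. It sums apparent file sizes, which can differ from what the storage system actually charges, and it assumes the 50 GB example means binary gigabytes; both helper names are hypothetical.

```python
import os
import tempfile

QUOTA_BYTES = 50 * 1024**3  # the 50 GB example, assuming binary gigabytes


def directory_usage_bytes(root):
    """Sum the apparent sizes of regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return total


def quota_fraction(used_bytes, quota_bytes=QUOTA_BYTES):
    """Return usage as a fraction of the quota."""
    return used_bytes / quota_bytes


# Demonstration on a throwaway directory rather than a real home directory.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "data.bin"), "wb") as fh:
        fh.write(b"\0" * 4096)
    used = directory_usage_bytes(demo_dir)

print(f"{used} bytes used, {quota_fraction(used):.6%} of quota")
```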

S

SBATCH

SLURM command (sbatch) that submits a batch job script to the scheduler.

Scheduler

Software that manages the job queue and allocates resources to jobs. Caltech uses SLURM.

Scratch

Temporary high-performance storage for active computations. Files are automatically deleted after 14 days.
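The sketch below lists files that have gone unmodified for longer than the purge window. It assumes age is measured by modification time; the definition above does not say which timestamp the purge policy uses, and files_at_risk is a hypothetical helper.

```python
import os
import tempfile
import time

PURGE_AGE_DAYS = 14  # purge window from the definition above


def files_at_risk(root, max_age_days=PURGE_AGE_DAYS, now=None):
    """Return paths under root not modified within max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(path) < cutoff:
                    stale.append(path)
            except OSError:
                continue  # file vanished while scanning
    return sorted(stale)


# Demonstration on a throwaway directory with one fresh and one old file.
with tempfile.TemporaryDirectory() as demo_dir:
    fresh = os.path.join(demo_dir, "fresh.dat")
    old = os.path.join(demo_dir, "old.dat")
    for path in (fresh, old):
        with open(path, "w") as fh:
            fh.write("data\n")
    twenty_days_ago = time.time() - 20 * 86400
    os.utime(old, (twenty_days_ago, twenty_days_ago))  # backdate the file
    stale_names = [os.path.basename(p) for p in files_at_risk(demo_dir)]

print(stale_names)  # prints "['old.dat']"
```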

SIF (Singularity Image Format)

Container image format used by Apptainer/Singularity.

SLURM (Simple Linux Utility for Resource Management)

The job scheduler and resource manager used on the cluster.

SRUN

SLURM command (srun) that launches parallel tasks within a job.

SSH (Secure Shell)

Protocol for secure remote login and command execution.

T

Task

A single process in a SLURM job. A job can have multiple tasks.
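When a job's tasks are launched with srun, each task can read its rank from SLURM_PROCID and the task count from SLURM_NTASKS. A minimal sketch of dividing work between tasks this way follows; the my_share helper and the work items are hypothetical.

```python
import os


def my_share(items, env=None):
    """Return the items this task should process (round-robin split)."""
    env = os.environ if env is None else env
    rank = int(env.get("SLURM_PROCID", "0"))    # this task's index
    ntasks = int(env.get("SLURM_NTASKS", "1"))  # total tasks in the job
    return items[rank::ntasks]


items = list(range(10))

# Outside a real job the variables are unset, so they are simulated here.
share = my_share(items, env={"SLURM_PROCID": "1", "SLURM_NTASKS": "4"})
print(share)  # prints "[1, 5, 9]"
```

With both variables unset the function falls back to a single task and returns every item.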

Thread

A lightweight execution unit within a process. Multi-threaded programs run several threads within one process, typically one thread per core, with all threads sharing the process's memory.
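A common pattern is to match the number of threads to the cores SLURM granted to the task, which it reports in SLURM_CPUS_PER_TASK when --cpus-per-task is requested. The sketch below sizes a thread pool that way; thread_count and square are hypothetical names. In standard CPython builds, threads doing pure-Python CPU-bound work are serialized by the global interpreter lock, so this shows the pattern rather than a speedup.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def thread_count(env=None):
    """Return the number of threads to use, defaulting to 1."""
    env = os.environ if env is None else env
    return int(env.get("SLURM_CPUS_PER_TASK", "1"))


def square(x):
    return x * x


# Outside a real job the variable is unset, so it is simulated here.
n_threads = thread_count({"SLURM_CPUS_PER_TASK": "4"})

# All worker threads live in this one process and share its memory.
with ThreadPoolExecutor(max_workers=n_threads) as pool:
    results = list(pool.map(square, range(8)))

print(n_threads, results)
```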

V

VAST

High-performance storage system. Current primary storage at Caltech HPC.

VPN (Virtual Private Network)

Secure connection to campus network from off-campus locations.

W

Walltime

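Elapsed real (wall-clock) time of a job. In practice, the term usually means the maximum runtime requested for a job; jobs that exceed this limit are terminated by the scheduler.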
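SLURM accepts time limits in several formats: minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes, and days-hours:minutes:seconds. The sketch below converts such a string to seconds; walltime_to_seconds is a hypothetical helper, not part of SLURM.

```python
def walltime_to_seconds(spec: str) -> int:
    """Convert a SLURM-style time limit string to seconds."""
    days = 0
    if "-" in spec:
        # Forms with a day count: days-hours[:minutes[:seconds]]
        day_part, rest = spec.split("-", 1)
        days = int(day_part)
        parts = [int(p) for p in rest.split(":")]
        if len(parts) > 3:
            raise ValueError(f"unrecognized time format: {spec!r}")
        parts += [0] * (3 - len(parts))
        hours, minutes, seconds = parts
    else:
        parts = [int(p) for p in spec.split(":")]
        if len(parts) == 1:    # minutes
            hours, minutes, seconds = 0, parts[0], 0
        elif len(parts) == 2:  # minutes:seconds
            hours, minutes, seconds = 0, parts[0], parts[1]
        elif len(parts) == 3:  # hours:minutes:seconds
            hours, minutes, seconds = parts
        else:
            raise ValueError(f"unrecognized time format: {spec!r}")
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


print(walltime_to_seconds("30"))          # 30 minutes -> 1800
print(walltime_to_seconds("02:30:00"))    # 2.5 hours -> 9000
print(walltime_to_seconds("1-12:00:00"))  # 1.5 days -> 129600
```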

Worker Node

See Compute Node.


Acronym Reference

| Acronym | Meaning |
|---------|---------|
| CPU | Central Processing Unit |
| GPU | Graphics Processing Unit |
| HPC | High Performance Computing |
| HTC | High Throughput Computing |
| MFA | Multi-Factor Authentication |
| MPI | Message Passing Interface |
| OOD | Open OnDemand |
| PI | Principal Investigator |
| QOS | Quality of Service |
| SIF | Singularity Image Format |
| SLURM | Simple Linux Utility for Resource Management |
| SSH | Secure Shell |
| VPN | Virtual Private Network |