UCX Error: No Space Left on Device
Error Message
ib_verbs UCX ERROR No space left on device
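To confirm that a failed job hit this particular error, one option is to search the job's output file for the message. The sketch below assumes the output went to a file such as Slurm's default slurm-<jobid>.out. The function name has_ucx_nospace_error is hypothetical, not part of any tool.

```shell
#!/bin/bash
# Hypothetical helper: succeeds (exit status 0) if the given job output
# file contains the UCX error text, and fails otherwise.
has_ucx_nospace_error() {
    # -q: no output, exit status only; -F: treat the pattern as literal text
    grep -qF "UCX ERROR No space left on device" "$1"
}

# Example call (the file name is a placeholder for your own job's output):
#   has_ucx_nospace_error slurm-12345.out && echo "UCX error found"
```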
Solution
Load the modules listed below, recompile your code against them, and set the environment variable shown before running.
Required Modules
module load cuda/11.2
module load openmpi/4.1.0_cuda-11.2
Environment Variable
Set this variable before running your application:
export OMPI_MCA_mpi_cuda_support=0
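The variable has to be exported, not merely assigned, for child processes to see it. A small guard in the job script can catch a missing export before the application starts. This is a sketch only; the function name require_cuda_support_disabled is hypothetical.

```shell
#!/bin/bash
# Hypothetical guard: fail early unless OMPI_MCA_mpi_cuda_support is
# exported with the value 0. printenv only sees exported variables, so a
# plain shell assignment without "export" is reported as missing.
require_cuda_support_disabled() {
    if [ "$(printenv OMPI_MCA_mpi_cuda_support)" != "0" ]; then
        echo "OMPI_MCA_mpi_cuda_support is not exported as 0" >&2
        return 1
    fi
}
```

In a job script this could be called just before the launch line, for example: require_cuda_support_disabled || exit 1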
Complete Example
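#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --gres=gpu:p100:1
#SBATCH --time=4:00:00

# Load the same modules the code was compiled with
module load cuda/11.2
module load openmpi/4.1.0_cuda-11.2

# Disable CUDA support in Open MPI to avoid the UCX error
export OMPI_MCA_mpi_cuda_support=0

# Launch the application
srun ./myprogram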
Why This Happens
This error occurs due to compatibility issues between the InfiniBand verbs layer, the UCX communication library, and CUDA.
Setting OMPI_MCA_mpi_cuda_support=0 disables CUDA support in MPI, which resolves the "No space left on device" errors raised during parallel computing operations.