UCX Error: No Space Left on Device

Error Message

ib_verbs UCX ERROR No space left on device

Solution

Load the modules listed below, recompile your code against them, and set the environment variable shown before running it.

Required Modules

module load cuda/11.2
module load openmpi/4.1.0_cuda-11.2

Environment Variable

Set this before running your application:

export OMPI_MCA_mpi_cuda_support=0
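
To confirm the setting is in place before the application launches, a job script could include a small check like the one below. This is a minimal sketch: the helper name cuda_support_disabled is hypothetical and is not part of Open MPI or Slurm.

```shell
#!/bin/bash
# Hypothetical helper (not an Open MPI or Slurm command):
# succeeds only if the variable from this page is set to 0.
cuda_support_disabled() {
    [ "${OMPI_MCA_mpi_cuda_support:-}" = "0" ]
}

export OMPI_MCA_mpi_cuda_support=0

if cuda_support_disabled; then
    echo "OMPI_MCA_mpi_cuda_support=0 is set"
else
    echo "OMPI_MCA_mpi_cuda_support is not set to 0" >&2
fi
```

Placing the check just before the srun line makes a missing export visible in the job's output rather than only as a UCX error later.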

Complete Example

#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --partition=gpu
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --gres=gpu:p100:1
#SBATCH --time=4:00:00

module load cuda/11.2
module load openmpi/4.1.0_cuda-11.2

export OMPI_MCA_mpi_cuda_support=0

srun ./myprogram

Why This Happens

This error occurs due to compatibility issues between:

  • InfiniBand verbs layer

  • UCX communications library

  • CUDA

Setting OMPI_MCA_mpi_cuda_support=0 disables CUDA support in Open MPI, which resolves the "No space left on device" error during MPI runs.
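
To see whether the loaded Open MPI build was compiled with CUDA support at all, the ompi_info tool can be queried. The following is a sketch only: the wrapper function report_cuda_build is hypothetical, and it assumes ompi_info is on the PATH once the openmpi module is loaded.

```shell
#!/bin/bash
# Hypothetical wrapper: prints the CUDA build flag reported by ompi_info,
# or a short message if the tool or the flag is unavailable.
report_cuda_build() {
    if command -v ompi_info >/dev/null 2>&1; then
        ompi_info --parsable --all 2>/dev/null \
            | grep "mpi_built_with_cuda_support:value" \
            || echo "no CUDA build flag reported by ompi_info"
    else
        echo "ompi_info not found; load the openmpi module first"
    fi
}

report_cuda_build
```

A value of true on the reported line would indicate a CUDA-aware build, which is the case the environment variable above is meant to address.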