NAMD3, GPU and slurm: FATAL ERROR: CUDA error cudaFree

From: Thérèse Malliavin (therese.malliavin_at_univ-lorraine.fr)
Date: Sat Nov 05 2022 - 09:46:16 CDT

Dear NAMD Netters,

I got errors and a crash when launching NAMD on GPUs using slurm.

The crash message was:
FATAL ERROR: CUDA error cudaFree((void *)(*pp)) in file src/CudaUtils.C, function reallocate_device_T, line 142
on Pe 16 (r12i2n6 device 1 pci 0:1c:0): invalid argument
FATAL ERROR: CUDA error cudaGetLastError() in file src/CudaTileListKernel.cu, function buildTileLists, line 1030
on Pe 16 (r12i2n6 device 1 pci 0:1c:0): invalid argument

The slurm script was:
#!/bin/bash
#SBATCH --nodes=1 # Number of Nodes
#SBATCH --ntasks-per-node=1 # Number of MPI tasks per node
#SBATCH --cpus-per-task=40 # Number of OpenMP threads
#SBATCH --hint=nomultithread # Disable hyperthreading
#SBATCH --gres=gpu:4 # Allocate 4 GPUs per node
#SBATCH --job-name=dyna # Job name
#SBATCH --output=%x-%j.out # Output file %x is the jobname, %j the jobid
#SBATCH --error=%x-%j.err # Error file
#SBATCH --time=20:00:00 # Expected runtime HH:MM:SS (max 20h)
#SBATCH --qos=qos_gpu-t3 # Uncomment for job requiring less than 20h (only one node)
module purge
module load namd/3.0-a3
simulation=./prod120.conf
log=./prod120.log
namd3 +p40 +idlepoll $simulation >& $log

It seems to me that NAMD did not find the required number of GPUs when it started.
Am I right?
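
If that is indeed the problem, would it be the right approach to check GPU visibility inside the job and to bind NAMD explicitly to the allocated devices? Below is a minimal sketch of the launch step, assuming the multicore-CUDA build of namd3 (which accepts the +devices option) and that slurm exports CUDA_VISIBLE_DEVICES for the allocation:

echo "CUDA_VISIBLE_DEVICES = ${CUDA_VISIBLE_DEVICES:-unset}"   # GPUs granted by slurm
nvidia-smi --query-gpu=index,name --format=csv                 # GPUs actually visible on the node
simulation=./prod120.conf
log=./prod120.log
# bind NAMD explicitly to the four allocated GPUs instead of relying on autodetection
namd3 +p40 +idlepoll +devices 0,1,2,3 $simulation >& $log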

The computer system is an HPE SGI 8600 supercomputer; the nodes each have four GPUs and
Intel Cascade Lake 6248 processors.

Thus, I have the following questions: is it possible to use NAMD3 on GPUs with slurm?
Has anybody experienced similar problems?
How should slurm be configured to avoid such crashes?

Best regards,
Therese Malliavin
Laboratoire de Physique et Chimie Theorique
Université de Lorraine, France
