Re: NAMD3, GPU and slurm: FATAL ERROR: CUDA error cudaFree

From: Josh Vermaas (vermaasj_at_msu.edu)
Date: Sat Nov 05 2022 - 16:09:43 CDT

Hi Therese,

Below is what we use as our "standard" when doing something simple,
using CUDASOAIntegrate.

#!/bin/bash
#SBATCH --gres=gpu:4
#SBATCH --gres-flags=enforce-binding #Useful if you are on hardware with NVLINKs that aren't between all nodes. You want the CPUs to be directly attached to the GPUs.
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G #This is actually important! We've had memory allocation issues when only asking for a handful of nodes.
#SBATCH --time=4:0:0

cd $SLURM_SUBMIT_DIR
#Other modules are loaded automatically by the NAMD module.
module use /mnt/home/vermaasj/modules
module load NAMD/3.0a9-gpu
srun namd3 ++ppn 4 run.namd
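
For concreteness, here is a minimal sketch of what a run.namd like the one
above might look like for the GPU-resident code path. The structure,
coordinate, parameter, and .xsc file names and the step count are
placeholders, not our actual inputs:

structure          prod.psf            ;# placeholder structure file
coordinates        prod.pdb            ;# placeholder coordinate file
paraTypeCharmm     on
parameters         par_all36_prot.prm  ;# placeholder parameter file
extendedSystem     prod.xsc            ;# placeholder periodic cell
temperature        300
timestep           2.0
rigidBonds         all
exclude            scaled1-4
1-4scaling         1.0
cutoff             12.0
switching          on
switchdist         10.0
pairlistdist       14.0
PME                on
PMEGridSpacing     1.0
langevin           on
langevinTemp       300
langevinDamping    1.0
CUDASOAIntegrate   on                  ;# turns on the GPU-resident integrator
outputName         run_out
outputEnergies     5000
dcdfreq            5000
run                500000              ;# placeholder step count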

So I know that NAMD can run with slurm just fine. What is the provenance
of your NAMD binary? 3.0a3 is at this point fairly old, and if memory
serves, CANNOT use more than 1 CPU per GPU in CUDASOAIntegrate mode.
Personally, I wouldn't trust anything older than a9 for production if I
were to start a simulation today, and a12 does have a (small)
performance improvement if you throw more CPUs at the problem.
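
If it helps, here is a hedged variant of the tail of my submission script
(a sketch, not a tested recipe for your machine): it keeps ++ppn in sync
with whatever --cpus-per-task you request, and prints the GPUs the job
actually sees, which is a quick way to check whether the allocation matches
what NAMD expects.

# Quick sanity check: which GPUs are visible inside the allocation?
srun nvidia-smi -L
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"

# Keep the NAMD thread count tied to the Slurm request
srun namd3 ++ppn ${SLURM_CPUS_PER_TASK} run.namd > run.log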

-Josh

On 11/5/22 10:46, Thérèse Malliavin wrote:
> Dear NAMD Netters,
>
>  I got the following errors and a crash when launching NAMD on GPUs
> using slurm:
>
> The crash message was:
> FATAL ERROR: CUDA error cudaFree((void *)(*pp)) in file
> src/CudaUtils.C, function reallocate_device_T, line 142
>  on Pe 16 (r12i2n6 device 1 pci 0:1c:0): invalid argument
> FATAL ERROR: CUDA error cudaGetLastError() in file
> src/CudaTileListKernel.cu, function buildTileLists, line 1030
>  on Pe 16 (r12i2n6 device 1 pci 0:1c:0): invalid argument
>
> The slurm script was:
> #!/bin/bash
> #SBATCH --nodes=1               # Number of Nodes
> #SBATCH --ntasks-per-node=1     # Number of MPI tasks per node
> #SBATCH --cpus-per-task=40      # Number of OpenMP threads
> #SBATCH --hint=nomultithread    # Disable hyperthreading
> #SBATCH --gres=gpu:4            # Allocate 4 GPUs per node
> #SBATCH --job-name=dyna         # Job name
> #SBATCH --output=%x-%j.out      # Output file; %x is the jobname, %j the jobid
> #SBATCH --error=%x-%j.err       # Error file
> #SBATCH --time=20:00:00         # Expected runtime HH:MM:SS (max 20h)
> #SBATCH --qos=qos_gpu-t3        # Uncomment for job requiring less than 20h (only one node)
> module purge
> module load namd/3.0-a3
> simulation=./prod120.conf
> log=./prod120.log
> namd3 +p40 +idlepoll $simulation >& $log
>
> It seems to me that NAMD did not find the required number of GPUs
> when it started.
> Am I right?
>
> The computer system is an HPE SGI 8600 supercomputer, and the nodes
> are quad-GPU nodes with Intel Cascade Lake 6248 processors.
>
> Thus, I have the following questions: is it possible to use NAMD3 on
> GPUs with slurm?
> Did somebody experience similar problems?
> How should slurm be configured to avoid such crashes?
>
> Best regards,
> Therese Malliavin
> Laboratoire de Physique et Chimie Théorique
> Université de Lorraine, France
>
