From: Lorenzo Casalino (lcasalino_at_ucsd.edu)
Date: Fri Feb 19 2021 - 21:24:19 CST
I am trying to use the multiGPU version of NAMD3 (NAMD_3.0alpha7_Linux-x86_64-multicore-CUDA-MultiGPU-SingleNode) to run plain MD on 2 GPUs on a single node on a local cluster using the following command:
namd3 +p 2 +setcpuaffinity +idlepoll +devices 0,1 input.conf > input.log
I added the following keywords to my configuration file:
- CUDASOAintegrate on
- margin 4
From the log file, it looks like the 2 GPUs are seen and activated:
Info: Built with CUDA version 10010
Pe 1 physical rank 1 binding to CUDA device 1 on tscc-gpu-5-0.sdsc.edu: 'GeForce RTX 3090' Mem: 24268MB Rev: 8.6 PCI: 0:24:0
Pe 0 physical rank 0 binding to CUDA device 0 on tscc-gpu-5-0.sdsc.edu: 'GeForce RTX 3090' Mem: 24268MB Rev: 8.6 PCI: 0:1:0
The startup phase finishes smoothly, and then, when the actual MD simulation starts, the following error is generated:
Info: Finished startup at 34.7205 s, 0 MB of memory in use
TCL: Running for 100000 steps
FATAL ERROR: CUDA error cub::DeviceSelect::If(d_temp_storage, temp_storage_bytes, hgi, hgi, d_nHG, natoms, notZero(), stream) in file src/SequencerCUDAKernel.cu, function buildRattleLists, line 4461
on Pe 1 (tscc-gpu-5-0.sdsc.edu device 1 pci 0:24:0): invalid device function
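A quick side check on the log above: the sketch below (Python, not part of NAMD) decodes the "Built with CUDA version" number and the "Rev:" field so the two can be compared. The function names and the MIN_TOOLKIT table are hypothetical helpers written for this illustration. The single table entry is an assumption based on NVIDIA's CUDA release notes, which list compute capability 8.6 as supported starting with CUDA 11.1. It only flags a possible mismatch and does not prove the cause of the error.

```python
import re

# Two lines copied from the log above.
LOG = """Info: Built with CUDA version 10010
Pe 1 physical rank 1 binding to CUDA device 1 on tscc-gpu-5-0.sdsc.edu: 'GeForce RTX 3090' Mem: 24268MB Rev: 8.6 PCI: 0:24:0
"""

# Assumption (from NVIDIA's CUDA release notes, not from NAMD):
# first toolkit release that targets each compute capability.
MIN_TOOLKIT = {(8, 6): (11, 1)}


def parse_cuda_build(log):
    """Return the CUDA toolkit version as (major, minor), or None if absent.

    The log prints CUDA_VERSION, encoded as major * 1000 + minor * 10.
    """
    m = re.search(r"Built with CUDA version (\d+)", log)
    if m is None:
        return None
    v = int(m.group(1))
    return (v // 1000, (v % 1000) // 10)


def parse_device_revs(log):
    """Return {device index: (major, minor) compute capability}."""
    revs = {}
    pattern = r"binding to CUDA device (\d+) .*?Rev: (\d+)\.(\d+)"
    for m in re.finditer(pattern, log):
        revs[int(m.group(1))] = (int(m.group(2)), int(m.group(3)))
    return revs


def possibly_unsupported(build, revs):
    """List devices whose capability needs a newer toolkit than the build."""
    return sorted(
        dev for dev, rev in revs.items()
        if rev in MIN_TOOLKIT and build < MIN_TOOLKIT[rev]
    )


build = parse_cuda_build(LOG)
revs = parse_device_revs(LOG)
print("CUDA build:", build)
print("device revisions:", revs)
print("devices to double-check:", possibly_unsupported(build, revs))
```

If a device shows up in the last list, the binary may simply contain no kernels compiled for that GPU architecture, which is a common reason for an "invalid device function" error and would be unrelated to NVLink.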
Each node of the cluster has 32 CPUs and 8 GPUs (GeForce RTX 3090).
Note that the GPUs are NOT connected by NVLink.
Finally, this is the PBS argument I use to add the GPUs to the environment: #PBS -l nodes=1:ppn=8:gpus=2:gpu3090
I have not been able to resolve this error. Is it possible that the multiGPU version cannot be used without NVLink?
Any help or advice on this issue would be greatly appreciated.