Re: NAMD3 multiGPU: invalid device function error

From: David Hardy (dhardy_at_ks.uiuc.edu)
Date: Mon Feb 22 2021 - 17:56:32 CST

Hi Lorenzo,

Sorry, I have to follow up on my previous email.

I had thought that multi-GPU alpha 7 required NVLink. It happens to be recommended, but not required. I am not sure why you are seeing an error. If you are willing to share your data set with us, we could try to reproduce the issue locally. Also, it would be good to have the full log file from your failed run.

Best regards,
Dave

> On Feb 22, 2021, at 4:26 PM, David Hardy <dhardy_at_ks.uiuc.edu> wrote:
>
> Hi Lorenzo,
>
> The version of NAMD that you are running does require NVLink for multi-GPU support (see http://www.ks.uiuc.edu/Research/namd/alpha/3.0alpha/). Our most recent improvements to multi-GPU support no longer require NVLink; however, scaling performance suffers without it. The next release (alpha 9) will include this new multi-GPU support. Until we get new builds posted, you will need to build from the GitLab "devel" branch to try it out. You can get access to the NAMD GitLab repo by following the posted directions (https://gitlab.com/tcbgUIUC/namd).
>
> Best regards,
> Dave
>
> --
> David J. Hardy, Ph.D.
> Beckman Institute
> University of Illinois at Urbana-Champaign
> 405 N. Mathews Ave., Urbana, IL 61801
> dhardy_at_ks.uiuc.edu, http://www.ks.uiuc.edu/~dhardy/
>
>> On Feb 19, 2021, at 9:24 PM, Lorenzo Casalino <lcasalino_at_ucsd.edu> wrote:
>>
>> Hello,
>>
>> I am trying to use the multiGPU version of NAMD3 (NAMD_3.0alpha7_Linux-x86_64-multicore-CUDA-MultiGPU-SingleNode) to run plain MD on 2 GPUs on a single node on a local cluster using the following command:
>>
>> namd3 +p 2 +setcpuaffinity +idlepoll +devices 0,1 input.conf > input.log
>>
>> I added the following keywords to my configuration file:
>> - CUDASOAintegrate on
>> - margin 4
>>
>> From the log file, it looks like the 2 GPUs are seen and activated:
>> Info: Built with CUDA version 10010
>> Pe 1 physical rank 1 binding to CUDA device 1 on tscc-gpu-5-0.sdsc.edu: 'GeForce RTX 3090' Mem: 24268MB Rev: 8.6 PCI: 0:24:0
>> Pe 0 physical rank 0 binding to CUDA device 0 on tscc-gpu-5-0.sdsc.edu: 'GeForce RTX 3090' Mem: 24268MB Rev: 8.6 PCI: 0:1:0
>>
>> The startup phase finishes smoothly, and then, when the actual MD simulation starts, the following error is generated:
>>
>> Info: Finished startup at 34.7205 s, 0 MB of memory in use
>>
>> TCL: Running for 100000 steps
>> FATAL ERROR: CUDA error cub::DeviceSelect::If(d_temp_storage, temp_storage_bytes, hgi, hgi, d_nHG, natoms, notZero(), stream) in file src/SequencerCUDAKernel.cu, function buildRattleLists, line 4461
>> on Pe 1 (tscc-gpu-5-0.sdsc.edu device 1 pci 0:24:0): invalid device function
>>
>> A single node of the cluster has 32 CPUs and 8 GPUs (GeForce RTX 3090).
>> Note that the GPUs are NOT connected by NVLink.
>> Finally, this is the PBS argument I use to add the GPUs to the environment: #PBS -l nodes=1:ppn=8:gpus=2:gpu3090
>>
>> I was not able to work this error out. Is it possible that without NVLink I cannot use the multiGPU version?
>> Any help or advice on this issue would be greatly appreciated.
>>
>> Thank you.
>>
>> Best regards,
>> Lorenzo
>
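
One thing that can be checked directly from the quoted log, offered only as a possible lead and not as a diagnosis confirmed anywhere in this thread: the binary reports "Built with CUDA version 10010" (CUDA 10.1), while both devices report "Rev: 8.6". CUDA 10.x toolkits cannot generate native code for compute capability 8.x, and "invalid device function" is the error CUDA typically returns when no compatible kernel image is available for the device. The Python sketch below extracts both values from a NAMD log and flags devices newer than the build toolkit natively supports. The function names and the small MAX_CC_BY_TOOLKIT table are illustrative assumptions (the table is deliberately partial), and a binary that embeds PTX can still run on newer GPUs through JIT compilation, so a flagged device is a hint rather than proof.

```python
import re

# Partial lookup table, included here as an assumption for illustration:
# (minimum toolkit version) -> highest compute capability it can target natively.
MAX_CC_BY_TOOLKIT = [
    ((10, 0), (7, 5)),  # CUDA 10.x: up to Turing
    ((11, 0), (8, 0)),  # CUDA 11.0: adds A100-class Ampere
    ((11, 1), (8, 6)),  # CUDA 11.1: adds GeForce RTX 30-series Ampere
]

# Log excerpt copied from the message quoted above.
SAMPLE_LOG = """\
Info: Built with CUDA version 10010
Pe 1 physical rank 1 binding to CUDA device 1 on tscc-gpu-5-0.sdsc.edu: 'GeForce RTX 3090' Mem: 24268MB Rev: 8.6 PCI: 0:24:0
Pe 0 physical rank 0 binding to CUDA device 0 on tscc-gpu-5-0.sdsc.edu: 'GeForce RTX 3090' Mem: 24268MB Rev: 8.6 PCI: 0:1:0
"""


def toolkit_version(log_text):
    """Return (major, minor) of the CUDA toolkit named in the log, or None."""
    match = re.search(r"Built with CUDA version (\d+)", log_text)
    if match is None:
        return None
    code = int(match.group(1))  # encoded as major * 1000 + minor * 10
    return (code // 1000, (code % 1000) // 10)


def device_capabilities(log_text):
    """Return the sorted, de-duplicated compute capabilities listed as 'Rev: X.Y'."""
    found = re.findall(r"Rev: (\d+)\.(\d+)", log_text)
    return sorted({(int(major), int(minor)) for major, minor in found})


def max_supported_cc(toolkit):
    """Look up the highest natively supported capability; None if not in the table."""
    best = None
    for min_toolkit, cc in MAX_CC_BY_TOOLKIT:
        if toolkit >= min_toolkit:
            best = cc
    return best


def unsupported_devices(log_text):
    """List compute capabilities in the log that exceed the build toolkit's limit."""
    toolkit = toolkit_version(log_text)
    if toolkit is None:
        return []
    limit = max_supported_cc(toolkit)
    if limit is None:
        return []  # toolkit older than the table covers; make no claim
    return [cc for cc in device_capabilities(log_text) if cc > limit]


if __name__ == "__main__":
    print("toolkit:", toolkit_version(SAMPLE_LOG))
    print("devices possibly unsupported by this build:", unsupported_devices(SAMPLE_LOG))
```

If the list comes back non-empty, trying a NAMD binary built against a newer CUDA toolkit would be a reasonable experiment before looking at NVLink or the data set.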

This archive was generated by hypermail 2.1.6 : Fri Dec 31 2021 - 23:17:10 CST