Re: How to run on multi-node environment

From: Luis Cebamanos (luiceur_at_gmail.com)
Date: Wed Dec 22 2021 - 09:33:16 CST

Ah! That makes sense! If I create the nodelist as expected, I get this for
each of the PEs:

FATAL ERROR: CUDA error cudaMallocHost(&p, size) in file
src/ComputePmeCUDAMgr.C, function alloc_, line 54
  on Pe 8 (g424 device 0 pci 0:60:0): out of memory
FATAL ERROR: CUDA error cudaHostAlloc(pp, sizeofT*(*curlen), flag) in
file src/CudaUtils.C, function reallocate_host_T, line 164
  on Pe 9 (g424 device 1 pci 0:61:0): out of memory
FATAL ERROR: CUDA error cudaHostAlloc(pp, sizeofT*(*curlen), flag) in
file src/CudaUtils.C, function reallocate_host_T, line 164
  on Pe 9 (g424 device 1 pci 0:61:0): out of memory
FATAL ERROR: CUDA error cudaHostAlloc(pp, sizeofT*(*curlen), flag) in
file src/CudaUtils.C, function reallocate_host_T, line 164
  on Pe 54 (g425 device 2 pci 0:88:0): out of memory

Is there a reason for this?
L

On 22/12/2021 14:56, Vermaas, Josh wrote:
> That nodelist looks funky to me. I'm betting NAMD expects the hosts to be one per line, only sees 1 line (as in the user guide https://www.ks.uiuc.edu/Research/namd/2.14/ug/node103.html), and assumes that you only have 1 node. It then nicely generates the configuration for a single node and goes about its merry way... until it realizes that you've assigned 8 tasks to 4 GPUs, and warns you that it doesn't like sharing.
>
> -Josh
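
[Editor's note: the one-host-per-line layout Josh describes would look like the sketch below. Host names are taken from the nodelist quoted further down; the per-host placement of ++cpus and ++shell follows the user-guide example linked above.]

```
group main
host andraton11 ++cpus 40 ++shell ssh
host andraton12 ++cpus 40 ++shell ssh
```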
>
> On 12/22/21, 9:46 AM, "owner-namd-l_at_ks.uiuc.edu on behalf of Luis Cebamanos" <owner-namd-l_at_ks.uiuc.edu on behalf of luiceur_at_gmail.com> wrote:
>
> Hello all,
>
> Trying to run on a multinode/multi-GPU environment (namd built with
> Charm-verbs, cuda SMP and Intel). Each node with 4 GPUs, 40 CPUs:
>
>     charmrun ++nodelist nodeListFile.txt ++p 72 ++ppn 9 namd2 +devices
>     0,1,2,3 +isomalloc_sync +setcpuaffinity +idlepoll +pemap
>     1-9,11-19,21-29,31-39 +commap 0,10,20,30 stmv.namd
>
> where my nodeListFile.txt looks like:
>
> group main
> host andraton11 host andraton12 ++cpus 40 ++shell ssh
>
> I am getting the following error:
>
>     FATAL ERROR: Number of devices (4) is not a multiple of number of
>     processes (8). Sharing devices between processes is inefficient.
>     Specify +ignoresharing (each process uses all visible devices) if
>     not all devices are visible to each process, otherwise adjust
>     number of processes to evenly divide number of devices, specify
>     subset of devices with +devices argument (e.g., +devices 0,2), or
>     multiply list shared devices (e.g., +devices 0,1,2,0).
>
>
> If not using +ignoresharing, how should I run this correctly?
>
> Regards,
>
>
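
[Editor's note: the device-count arithmetic behind that FATAL ERROR can be sketched as below. This is my illustration of the mismatch, not NAMD code; the numbers come from the charmrun line quoted above (++p 72, ++ppn 9, 4 GPUs per node, 2 hosts in the nodelist).]

```python
# Worker processes launched by charmrun: total PEs divided by PEs per process.
total_pes = 72          # ++p 72
ppn = 9                 # ++ppn 9
devices_per_node = 4    # +devices 0,1,2,3

processes = total_pes // ppn  # 8 processes in total

# If NAMD parses the nodelist as a single host, all 8 processes land on
# one node and must share its 4 GPUs; spread over 2 nodes, each node
# gets 4 processes, one per GPU.
for nodes in (1, 2):
    per_node = processes // nodes
    sharing = per_node > devices_per_node
    print(f"{nodes} node(s): {per_node} processes/node ->",
          "GPU sharing" if sharing else "one GPU per process")
```

With one node the check fails exactly as in the error above (8 processes on 4 devices); with the two hosts recognized, the same command divides evenly.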

This archive was generated by hypermail 2.1.6 : Fri Dec 31 2021 - 23:17:12 CST