Re: How to run on multi-node environment

From: Giacomo Fiorin (giacomo.fiorin_at_gmail.com)
Date: Wed Dec 22 2021 - 11:03:32 CST

What GPU models are these, though? For a 4-GPU run of STMV I get between 1
and 1.5 GB of memory usage per GPU, more or less (depends on NAMD 2.x vs.
3.0 alpha).

It's been quite a while since Nvidia put out GPUs with less than 2 GB of
memory.
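
If you are not sure, nvidia-smi on each node will report the model and
total memory of every card, e.g. something like:

    nvidia-smi --query-gpu=name,memory.total --format=csv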

Giacomo

On Wed, Dec 22, 2021 at 11:39 AM Axel Kohlmeyer <akohlmey_at_gmail.com> wrote:

> As the error message states, you are obviously running out of memory on
> the corresponding GPUs.
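> One quick way to confirm is to watch per-GPU memory while the job
> starts up, e.g. something like:
>
>     nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 1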
>
> On Wed, Dec 22, 2021 at 10:34 AM Luis Cebamanos <luiceur_at_gmail.com> wrote:
>
>> Ah! That makes sense! If I create the nodelist as expected, I get this
>> for each of the PEs:
>>
>> FATAL ERROR: CUDA error cudaMallocHost(&p, size) in file
>> src/ComputePmeCUDAMgr.C, function alloc_, line 54
>> on Pe 8 (g424 device 0 pci 0:60:0): out of memory
>> FATAL ERROR: CUDA error cudaHostAlloc(pp, sizeofT*(*curlen), flag) in
>> file src/CudaUtils.C, function reallocate_host_T, line 164
>> on Pe 9 (g424 device 1 pci 0:61:0): out of memory
>> FATAL ERROR: CUDA error cudaHostAlloc(pp, sizeofT*(*curlen), flag) in
>> file src/CudaUtils.C, function reallocate_host_T, line 164
>> on Pe 9 (g424 device 1 pci 0:61:0): out of memory
>> FATAL ERROR: CUDA error cudaHostAlloc(pp, sizeofT*(*curlen), flag) in
>> file src/CudaUtils.C, function reallocate_host_T, line 164
>> on Pe 54 (g425 device 2 pci 0:88:0): out of memory
>>
>> Is there a reason for this?
>> L
>>
>> On 22/12/2021 14:56, Vermaas, Josh wrote:
>> > That nodelist looks funky to me. I'm betting NAMD expects the hosts
>> to be listed one per line (as in the user guide
>> https://www.ks.uiuc.edu/Research/namd/2.14/ug/node103.html), sees only
>> one host line, and assumes that you only have 1 node. It then happily
>> builds the configuration you asked for on a single node and goes about
>> its merry way... until it realizes that you've assigned 8 tasks to 4
>> GPUs and warns you that it doesn't like sharing.
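>> >
>> > For reference, a nodelist in the format charmrun expects (one host
>> > per line, per the user guide; hostnames here are just the ones from
>> > your post) would look something like:
>> >
>> >     group main
>> >     host andraton11 ++cpus 40 ++shell ssh
>> >     host andraton12 ++cpus 40 ++shell ssh
>> >
>> > With both nodes recognized, the same command line (++p 72 ++ppn 9)
>> > gives 8 processes, 4 per node, so each process gets its own GPU.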
>> >
>> > -Josh
>> >
>> > On 12/22/21, 9:46 AM, "owner-namd-l_at_ks.uiuc.edu on behalf of Luis
>> Cebamanos" <luiceur_at_gmail.com> wrote:
>> >
>> > Hello all,
>> >
>> > Trying to run on a multinode/multi-GPU environment (namd built with
>> > Charm-verbs, cuda SMP and Intel). Each node with 4 GPUs, 40 CPUs:
>> >
>> > charmrun ++nodelist nodeListFile.txt ++p 72 ++ppn 9 namd2 +devices
>> > 0,1,2,3 +isomalloc_sync +setcpuaffinity +idlepoll +pemap
>> > 1-9,11-19,21-29,31-39 +commap 0,10,20,30 stmv.namd
>> >
>> > where my nodeListFile.txt looks like:
>> >
>> > group main
>> > host andraton11 host andraton12 ++cpus 40 ++shell ssh
>> >
>> > I am getting the following error:
>> >
>> > FATAL ERROR: Number of devices (4) is not a multiple of number of
>> > processes (8). Sharing devices between processes is inefficient.
>> > Specify +ignoresharing (each process uses all visible devices) if not
>> > all devices are visible to each process, otherwise adjust number of
>> > processes to evenly divide number of devices, specify subset of
>> > devices with +devices argument (e.g., +devices 0,2), or multiply list
>> > shared devices (e.g., +devices 0,1,2,0).
>> >
>> >
>> > If not using +ignoresharing, how should I run this correctly?
>> >
>> > Regards,
>> >
>> >
>>
>>
>
> --
> Dr. Axel Kohlmeyer  akohlmey_at_gmail.com  http://goo.gl/1wk0
> College of Science & Technology, Temple University, Philadelphia PA, USA
> International Centre for Theoretical Physics, Trieste, Italy.
>
