Re: How to run on multi-node environment

From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Wed Dec 22 2021 - 11:28:05 CST

On second thought, it could be the memory/address space on the host that
is lacking (note the cudaHostAlloc() call!).
There could be a number of reasons for that, including other processes
hogging resources.
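For what it's worth, here is a minimal standalone sketch (plain CUDA
runtime API, not NAMD code; the 256 MiB chunk size is arbitrary) that
probes how much pinned host memory cudaHostAlloc() will hand out. If it
fails well below the physical RAM of the node, the limit is on the host
side rather than on the GPU:

  /* pinned_probe.cu -- standalone sketch, not part of NAMD.
   * Repeatedly allocates pinned (page-locked) host memory in 256 MiB
   * chunks until cudaHostAlloc() fails, then reports how much it got. */
  #include <cstdio>
  #include <vector>
  #include <cuda_runtime.h>

  int main() {
      const size_t chunk = 256UL << 20;   /* 256 MiB per allocation */
      std::vector<void*> blocks;
      size_t total = 0;

      for (;;) {
          void *p = nullptr;
          cudaError_t err = cudaHostAlloc(&p, chunk, cudaHostAllocDefault);
          if (err != cudaSuccess) {
              std::printf("cudaHostAlloc failed after %zu MiB: %s\n",
                          total >> 20, cudaGetErrorString(err));
              break;
          }
          blocks.push_back(p);
          total += chunk;
      }

      for (void *p : blocks) cudaFreeHost(p);   /* release pinned memory */
      return 0;
  }

(Compile with nvcc, e.g. "nvcc -o pinned_probe pinned_probe.cu".)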

On Wed, Dec 22, 2021 at 12:04 PM Giacomo Fiorin <giacomo.fiorin_at_gmail.com>
wrote:

> What GPU models are these, though? For a 4-GPU run of STMV I get
> between 1 and 1.5 GB of memory usage per GPU, more or less (depends on NAMD
> 2.x vs. 3.0 alpha).
>
> It's been quite a while since Nvidia put out GPUs with less than 2 GB of
> memory.
>
> Giacomo
>
> On Wed, Dec 22, 2021 at 11:39 AM Axel Kohlmeyer <akohlmey_at_gmail.com>
> wrote:
>
>> As the error message states, you are obviously running out of memory on
>> the corresponding GPUs.
>>
>> On Wed, Dec 22, 2021 at 10:34 AM Luis Cebamanos <luiceur_at_gmail.com>
>> wrote:
>>
>>> Ah! That makes sense! If I create the nodelist as expected, I get this
>>> for each of the PEs:
>>>
>>> FATAL ERROR: CUDA error cudaMallocHost(&p, size) in file
>>> src/ComputePmeCUDAMgr.C, function alloc_, line 54
>>> on Pe 8 (g424 device 0 pci 0:60:0): out of memory
>>> FATAL ERROR: CUDA error cudaHostAlloc(pp, sizeofT*(*curlen), flag) in
>>> file src/CudaUtils.C, function reallocate_host_T, line 164
>>> on Pe 9 (g424 device 1 pci 0:61:0): out of memory
>>> FATAL ERROR: CUDA error cudaHostAlloc(pp, sizeofT*(*curlen), flag) in
>>> file src/CudaUtils.C, function reallocate_host_T, line 164
>>> on Pe 9 (g424 device 1 pci 0:61:0): out of memory
>>> FATAL ERROR: CUDA error cudaHostAlloc(pp, sizeofT*(*curlen), flag) in
>>> file src/CudaUtils.C, function reallocate_host_T, line 164
>>> on Pe 54 (g425 device 2 pci 0:88:0): out of memory
>>>
>>> Is there a reason for this?
>>> L
>>>
>>> On 22/12/2021 14:56, Vermaas, Josh wrote:
>>> > That nodelist looks funky to me. I'm betting NAMD expects the hosts to
>>> > be listed one per line (as in the user guide,
>>> > https://www.ks.uiuc.edu/Research/namd/2.14/ug/node103.html), only sees
>>> > one line, and assumes that you only have 1 node. It then happily
>>> > generates the configuration you requested, but on a single node, and
>>> > goes about its merry way... until it realizes that you've assigned 8
>>> > tasks (++p 72 with ++ppn 9 gives 8 processes) to that node's 4 GPUs,
>>> > and warns you that it doesn't like sharing.
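>>> > For reference, with one host per line as the user guide shows, the
>>> > same nodelist would look something like this (a sketch that keeps the
>>> > per-host options from the original file; their exact placement is a
>>> > guess):
>>> >
>>> > group main
>>> > host andraton11 ++cpus 40 ++shell ssh
>>> > host andraton12 ++cpus 40 ++shell ssh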
>>> >
>>> > -Josh
>>> >
>>> > On 12/22/21, 9:46 AM, "owner-namd-l_at_ks.uiuc.edu on behalf of Luis
>>> Cebamanos" <owner-namd-l_at_ks.uiuc.edu on behalf of luiceur_at_gmail.com>
>>> wrote:
>>> >
>>> > Hello all,
>>> >
>>> > Trying to run in a multi-node/multi-GPU environment (NAMD built with
>>> > Charm++ verbs, CUDA, SMP, and the Intel compiler). Each node has 4
>>> > GPUs and 40 CPUs:
>>> >
>>> > charmrun ++nodelist nodeListFile.txt ++p 72 ++ppn 9 namd2 \
>>> >   +devices 0,1,2,3 +isomalloc_sync +setcpuaffinity +idlepoll \
>>> >   +pemap 1-9,11-19,21-29,31-39 +commap 0,10,20,30 stmv.namd
>>> >
>>> > where my nodeListFile.txt looks like:
>>> >
>>> > group main
>>> > host andraton11 host andraton12 ++cpus 40 ++shell ssh
>>> >
>>> > I am getting the following error:
>>> >
>>> > FATAL ERROR: Number of devices (4) is not a multiple of number of
>>> > processes (8). Sharing devices between processes is inefficient.
>>> > Specify +ignoresharing (each process uses all visible devices) if not
>>> > all devices are visible to each process, otherwise adjust number of
>>> > processes to evenly divide number of devices, specify subset of
>>> > devices with +devices argument (e.g., +devices 0,2), or multiply list
>>> > shared devices (e.g., +devices 0,1,2,0).
>>> >
>>> >
>>> > If not using +ignoresharing, how should I run this correctly?
>>> >
>>> > Regards,
>>> >
>>> >
>>>
>>>
>>
>> --
>> Dr. Axel Kohlmeyer  akohlmey_at_gmail.com  http://goo.gl/1wk0
>> College of Science & Technology, Temple University, Philadelphia PA, USA
>> International Centre for Theoretical Physics, Trieste, Italy.
>>
>

-- 
Dr. Axel Kohlmeyer  akohlmey_at_gmail.com  http://goo.gl/1wk0
College of Science & Technology, Temple University, Philadelphia PA, USA
International Centre for Theoretical Physics, Trieste, Italy.

This archive was generated by hypermail 2.1.6 : Fri Dec 31 2021 - 23:17:12 CST