Re: Multinode NAMD CUDA GPU Selection

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Fri Aug 19 2011 - 00:22:56 CDT

Hi Axel,

Well, I was just wondering, because it works with up to 3 processes.

Example Hostfile:

host c35
host c33

If I now start a simulation with +p2 and +devices 1,0, it uses gpu:1 on c35
and gpu:0 on c33. It also works with +p3 and +devices 1,0,0, which uses gpu:0
and gpu:1 on c35 and gpu:0 on c33. But when I want to start more processes,
namd starts reading the +devices list from the beginning again on every
target node. So I thought I was just doing something wrong.

So setting +p4 and +devices 1,0,0,0 causes c33 to crash, because namd tries
to use gpu:1 instead of gpu:0. Really strange, isn't it? It is the same if I
set up the hostfile like this:

host C35
host C33
host C35
host C33

With 4 or more processes, namd starts reading the +devices parameter from
the beginning again, but not with 3. What do you think?
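
The two lookup schemes discussed in this thread can be compared with a small
model. This is only a sketch of the behavior as described in these messages,
not code taken from NAMD: the function name assign_devices and the per_node
switch are hypothetical, and the real selection logic may differ in detail. It
uses the 8-process nodelist and device string from the quoted message below.

```python
from collections import defaultdict

def assign_devices(hosts, devices, per_node=True):
    """Model which GPU id each PE would pick from a +devices list.

    hosts    -- host name of each PE, in PE order (hypothetical input)
    devices  -- GPU ids as given to +devices
    per_node -- True: index by the PE's rank on its own node (the behavior
                reported in this thread); False: index by the global PE id
                (the behavior asked for in this thread)
    """
    pes_seen_on_host = defaultdict(int)
    assignment = []
    for pe, host in enumerate(hosts):
        local_rank = pes_seen_on_host[host]
        pes_seen_on_host[host] += 1
        index = local_rank if per_node else pe
        assignment.append((pe, host, devices[index % len(devices)]))
    return assignment

hosts = ["c35"] * 4 + ["c33"] * 4
devices = [1, 1, 1, 1, 0, 0, 0, 0]

for label, per_node in (("per-node rank", True), ("global PE id", False)):
    print(label)
    for pe, host, gpu in assign_devices(hosts, devices, per_node):
        print(f"  PE {pe} on {host} -> gpu:{gpu}")
```

In this model, per-node indexing sends every PE to gpu:1, while global
indexing gives gpu:1 on c35 and gpu:0 on c33.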

We have 3 machines: 2 have 12 cores and 2 gpus each, and one is half of that,
with 6 cores and 1 gpu. So it can happen that someone submits a job using 1
gpu on a 12-core machine, allocated by sge, and then someone else starts a
12-core run that sge places on the 6-core machine and the free half of that
12-core machine. That is the case I need this for. We have only 5 gpus and
need to share them among users as well as possible, instead of occupying a
whole machine with a system that may be too small for it. I just want to
manage our compute capacity as well as I can.
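
For the shared-gpu case described above, a per-PE device string could be
generated roughly as follows. This is a hypothetical sketch, not the script
mentioned in the quoted message: build_device_string and the gpus_per_host
mapping are made-up names, and the result is only useful if each PE looks up
its gpu by its global PE id, which this thread reports is not what happens
across multiple nodes.

```python
from itertools import cycle

def build_device_string(hosts, gpus_per_host):
    """Build a comma-separated device list with one entry per PE.

    hosts         -- host name of each PE, in PE order (hypothetical input)
    gpus_per_host -- host name -> list of GPU ids the job may use there
    PEs on the same host take turns over that host's allowed GPUs.
    """
    pools = {host: cycle(ids) for host, ids in gpus_per_host.items()}
    return ",".join(str(next(pools[host])) for host in hosts)

# Assumed example: the job may only use gpu:1 on c35 and gpu:0 on c33.
hosts = ["c35", "c33", "c35", "c33"]
allowed = {"c35": [1], "c33": [0]}
print("+devices " + build_device_string(hosts, allowed))
```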

Thank you for your opinion and suggestion.
Norman Geist.

-----Original Message-----
From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf
Of Axel Kohlmeyer
Sent: Thursday, August 18, 2011 17:39
To: Norman Geist
Cc: Namd Mailing List
Subject: Re: namd-l: Multinode NAMD CUDA GPU Selection

norman,

On Thu, Aug 18, 2011 at 2:25 AM, Norman Geist
<norman.geist_at_uni-greifswald.de> wrote:
> Hi experts,
>
> yesterday I observed some, let's say, unfavorable behavior of namd cuda
> job spawning. I was testing multinode gpu runs when I found out that namd

the assumption that the nodes of a cluster are identical is a common and
very valid one, made by many parallel applications. support
for inhomogeneous machines would make things _much_ more
complicated for very little gain.

> reads the +devices parameter from the beginning on every node. Not per
> process, but per node, namd starts reading the devices string from the
> start, which makes it impossible to work with different nodes. Even if I
> have a nodelist like:

you can try using nvidia-smi to set the GPU that you don't want to use
for namd to "compute disable" mode. never tried it with namd myself
(or needed to do it).
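
A minimal sketch of that suggestion, assuming the -i (gpu index) and -c
(compute mode) options and the PROHIBITED mode name of current nvidia-smi
releases; older versions may use different flags, changing the mode normally
needs root, and the effect on namd is untested here. The helper name
compute_mode_command is hypothetical. It only builds the command line and
does not run it.

```python
def compute_mode_command(gpu_index, mode="PROHIBITED"):
    """Return the nvidia-smi argument list that sets a GPU's compute mode.

    PROHIBITED is intended to stop CUDA applications from creating
    contexts on that GPU; DEFAULT restores normal shared use.
    """
    allowed_modes = {"DEFAULT", "PROHIBITED", "EXCLUSIVE_PROCESS"}
    if mode not in allowed_modes:
        raise ValueError(f"unsupported compute mode: {mode}")
    return ["nvidia-smi", "-i", str(gpu_index), "-c", mode]

# Show the command that would hide gpu:1 from compute jobs on this node.
print(" ".join(compute_mode_command(1)))
```

To actually apply it, the returned list could be passed to subprocess.run on
the node in question.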

axel.

>
> host c35
> host c35
> host c35
> host c35
> host c33
> host c33
> host c33
> host c33
>
>
> and pass a device string like +devices 1,1,1,1,0,0,0,0, it will try to use
> gpu:1 for all processes. Is there any way to influence this without
> hacking the namd source? There has to be a way to handle such things.
> Maybe I have two nodes, one with a quadro and a tesla and the other with
> only one tesla, and I only want to use the tesla cards. Or I just don't
> want to use all gpus, because I want to run multiple jobs on one machine.
> I already have a script that finds the right gpu id for every node and
> generates such a device string. But for that to work, the PEs must
> determine which gpu to bind to by their real PE-ID, which works fine for
> one node, but not with multiple nodes. That would be better and not a big
> change to the current behavior.
>
> Please tell me your view.
>
> Thanks
>
> Norman Geist

-- 
Dr. Axel Kohlmeyer
akohlmey_at_gmail.com  http://goo.gl/1wk0
Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:57:36 CST