Re: NAMD 2.7b2 with CUDA and infiniband

From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Mon Dec 21 2009 - 02:12:08 CST

On Mon, Dec 21, 2009 at 12:03 AM, Jyh-Shyong <c00jsh00_at_nchc.org.tw> wrote:
> Hi,
>
> I tried NAMD 2.7b2 on our GPU cluster, but so far I have not been
> successful. Any hints or suggestions are appreciated.

> 1. I downloaded the binary NAMD_2.7b2_Linux-x86_64-ibverbs-CUDA and ran
> a test case with the command
>
> ./charmrun ++local  ++p 4 namd2 +idlepoll ./alanin.namd
> Charmrun> IBVERBS version of charmrun
> Charmrun: Bad initnode data length. Aborting

have you checked the permissions on the infiniband device?

also, infiniband communication buys you nothing on a single node:
all processes run on the same machine, so there is no network to cross.
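
a minimal sketch of the permissions check suggested above. it assumes the
ibverbs device nodes live under /dev/infiniband, which is the usual place on
Linux but is not confirmed for this cluster. the function name check_ib_perms
is made up for illustration and is not part of NAMD or charm++.

```shell
# check_ib_perms: hypothetical helper, not part of NAMD or charm++.
# Reports every entry in the given directory (default: /dev/infiniband,
# an assumed location) that the current user cannot both read and write.
# Returns 0 if all entries are accessible, 1 if any is not,
# and 2 if the directory does not exist.
check_ib_perms() {
    dir="${1:-/dev/infiniband}"
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir"
        return 2
    fi
    bad=0
    for dev in "$dir"/*; do
        [ -e "$dev" ] || continue    # skip the unexpanded glob of an empty directory
        if [ ! -r "$dev" ] || [ ! -w "$dev" ]; then
            echo "no read/write access: $dev"
            bad=1
        fi
    done
    return $bad
}
```

run it as the user that starts charmrun, e.g.
`check_ib_perms && echo "device permissions look fine"`. a plain
`ls -l` on the same directory shows the same information by eye.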

> 2. I tried again with command
>
> ./charmrun ++nodelist  ./hostlist  ++p 4 namd2 +idlepoll ./alanin.namd
>
> Here the file hostlist contains two lines:
>
> group main
>  host  gc16

> gc16 is the hostname of the computer I was using.  Here is the output of
> this command:
>
>
> ..
> Info:
> Info: Entering startup at 0.376303 s, 104.066 MB of memory in use
> Info: Startup phase 0 took 0.00472808 s, 104.066 MB of memory in use
> Info: Startup phase 1 took 0.00161982 s, 104.066 MB of memory in use
> Info: Startup phase 2 took 0.000169039 s, 104.066 MB of memory in use
> FATAL ERROR: CUDA-enabled NAMD requires more patches than processes.
> ------------- Processor 0 Exiting: Called CmiAbort ------------
> Reason: FATAL ERROR: CUDA-enabled NAMD requires more patches than processes.

here is the hint that NAMD gives you: your input is a tiny example
that is too small for a reasonable domain decomposition. it does not
split into more domains than the 4 processes you requested.

due to the way GPUs work, there is no speed gain for small
domains (patches in NAMD-speak). if you don't have on the order
of 10000 atoms per domain, the GPU will not be fully occupied.
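
the rule of thumb above can be turned into a quick back-of-the-envelope
check. this is only a sketch: it divides the atoms evenly over the processes
and does not compute NAMD's actual patch decomposition. since CUDA-enabled
NAMD wants more patches than processes, the average patch is smaller than
the number printed here, so the estimate is optimistic. the function names
and the atom counts in the usage note are made up for illustration.

```shell
# atoms_per_process: hypothetical helper, not a NAMD tool.
# Prints the integer number of atoms per process for an even split.
# This is an upper bound on the average atoms per patch whenever there
# are more patches than processes.
atoms_per_process() {
    natoms="$1"
    nprocs="$2"
    echo $(( natoms / nprocs ))
}

# enough_atoms: succeeds if the even split reaches the threshold.
# The default of 10000 is the rough figure quoted above, not a hard limit.
enough_atoms() {
    threshold="${3:-10000}"
    [ "$(atoms_per_process "$1" "$2")" -ge "$threshold" ]
}
```

for example, `enough_atoms 1000 4 || echo "too small to keep 4 GPUs busy"`
prints the warning, because 1000 atoms over 4 processes is only 250 atoms
each.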

[...]

> There are 4 Tesla C1070s on this node:
> chem_at_gc16:/work/chem/alanin> ls -l /dev/nvi*
> crw-rw-rw- 1 root video 195,   0 2009-10-13 10:56 /dev/nvidia0
> crw-rw-rw- 1 root video 195,   1 2009-10-13 10:56 /dev/nvidia1
> crw-rw-rw- 1 root video 195,   2 2009-10-13 10:56 /dev/nvidia2
> crw-rw-rw- 1 root video 195,   3 2009-10-13 10:56 /dev/nvidia3
> crw-rw-rw- 1 root video 195, 255 2009-10-13 10:56 /dev/nvidiactl
>
> I wonder whether something in my environment settings might be wrong, but I
> don't know what it is.
> I also downloaded the latest version of the source code and built the binary
> with the ibverbs option
> for charm, and I got the same result.

no surprise there.

cheers,
    axel.

> Regards
>
> Jyh-Shyong Ho, Ph.D.
> Research Scientist
> National Center for High Performance Computing
> Hsinchu, Taiwan, ROC

-- 
Dr. Axel Kohlmeyer    akohlmey_at_gmail.com
Institute for Computational Molecular Science
College of Science and Technology
Temple University, Philadelphia PA, USA.

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:53:37 CST