Re: problem with running namd through infiniband

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Wed May 29 2013 - 00:22:58 CDT

Hi Shubhra,

 

if you are sure that your IB fabric setup is fine (do other programs work,
do tools like ibping work?), you may be using an InfiniBand stack/driver
that is incompatible with the precompiled builds (not OFED?). You could try
to build NAMD yourself against a separate MPI (Open MPI, for instance). Or,
if you have IPoIB installed (check /sbin/ifconfig for interfaces called ib0
or similar), you can use those interfaces instead of the "eth" ones. To do
so, use the IP addresses that belong to the IB network interfaces, for
example in your nodelist. Also, when using IPoIB, set
/sys/class/net/ib0/mode to "connected" and /sys/class/net/ib0/mtu to
"65520", simply by using echo with a ">" redirect as root. Additionally,
even if you are not using a CUDA version, as long as you use Charm++, try
adding +idlepoll when calling namd to improve scaling.
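
A minimal sketch of the IPoIB settings mentioned above, assuming the
interface is called ib0 and the script runs as root. The function name
set_ipoib_connected and the SYSFS_NET variable are hypothetical; SYSFS_NET
only exists so the function can be tried against a scratch directory
instead of the real /sys/class/net.

```shell
#!/bin/sh
# Sketch: switch an IPoIB interface to connected mode with a large MTU.
# SYSFS_NET is a hypothetical override for dry runs; on a real node it
# defaults to /sys/class/net and the writes need root.

set_ipoib_connected() {
    iface="${1:-ib0}"
    base="${SYSFS_NET:-/sys/class/net}/$iface"

    if [ ! -d "$base" ]; then
        echo "no such interface: $iface" >&2
        return 1
    fi

    # Same as: echo connected > /sys/class/net/ib0/mode
    echo connected > "$base/mode" || return 1
    # Same as: echo 65520 > /sys/class/net/ib0/mtu
    echo 65520 > "$base/mtu" || return 1
}
```

On each node this would be called as "set_ipoib_connected ib0". The setting
is not persistent across reboots unless your distribution's network
configuration is changed as well.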

 

Norman Geist.

 

From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
Behalf Of Shubhra Ghosh Dastidar
Sent: Tuesday, 28 May 2013 09:15
To: NAMD
Subject: namd-l: problem with running namd through infiniband

 

I am trying to run namd through infiniband.

 

First I tried the multicore version, which runs smoothly on 32 cores when
restricted to a single node.

 

Then I tried the TCP version (which uses ethernet), which runs across
multiple nodes, e.g. 32 cores in total (16 cores from node-1 and 16 cores
from node-2).

 

Then I tried both the infiniband version and the infiniband-smp version.
If the job is restricted to the 32 cores on one node, they run smoothly.

But if it is asked to run across multiple nodes (i.e. communicating through
infiniband), I get an error. The last few lines are the following:

 

Charmrun> All clients connected.
Charmrun> IP tables sent.
Charmrun> node programs all connected
Charmrun> started all node programs in 3.995 seconds.
Charmrun: error on request socket--
Socket closed before recv.

 

 

Can anyone help?

 

The execution command which I am using is the following:

 

 ~/NAMD_2.9_Linux-x86_64-ibverbs/charmrun ++p 16 ++verbose ++remote-shell
ssh ++nodelist nodelist ~/NAMD_2.9_Linux-x86_64-ibverbs/namd2 namd-input
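
For reference, a sketch of how the nodelist for a command like this could
be pointed at IPoIB addresses and how +idlepoll would be added, following
the suggestions in the reply. The helper name write_nodelist and the
10.0.0.x addresses are placeholders, not values from this thread; use
whatever /sbin/ifconfig reports for ib0 on each node.

```shell
#!/bin/sh
# Sketch: build a charmrun nodelist from IPoIB addresses.
# write_nodelist is a hypothetical helper; the addresses passed to it are
# placeholders for the ib0 addresses of the compute nodes.

write_nodelist() {
    out="$1"
    shift
    {
        echo "group main"
        for addr in "$@"; do
            echo "host $addr"
        done
    } > "$out"
}

# Example with placeholder addresses:
#   write_nodelist nodelist 10.0.0.1 10.0.0.2
#   charmrun ++p 32 ++remote-shell ssh ++nodelist nodelist \
#       namd2 +idlepoll namd-input
```

The charmrun line is left as a comment because the paths to charmrun and
namd2 depend on which build is being used.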

 

(InfiniBand has been tested with another program, e.g. CHARMM-37, which
seems to be working fine.)

Regards

-- 
Dr. Shubhra Ghosh Dastidar
 

This archive was generated by hypermail 2.1.6 : Tue Dec 31 2013 - 23:23:16 CST