Re: problem with running namd through infiniband

From: Shubhra Ghosh Dastidar (sgducd_at_gmail.com)
Date: Wed May 29 2013 - 07:41:47 CDT

Hi Norman,

I think we still have a problem with the IB configuration. Although
ibstat, ibhosts, ibnetdiscover etc. all report OK, ibping is unable to
ping the LID of the nodes, not even the self LID, and ibv_rc_pingpong
is likewise unable to ping localhost. This is a bit confusing to me
since the other commands are working. As I am configuring IB for the
first time, I don't have much of a clue about the way out.
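
(For reference, the failing tests are roughly of this form; LID 1 is
just an example value taken from the ibstat output of the local port:

  ibping -S &                  # start the ibping responder on the node
  ibping -L 1                  # ping the local port by its LID -- fails

  ibv_rc_pingpong &            # ibverbs loop-back test, server side
  ibv_rc_pingpong localhost    # client side pointing at itself -- fails
)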

I would appreciate it if anyone could help with this matter.

Regards,
Shubhra

>
> On Wed, May 29, 2013 at 10:52 AM, Norman Geist <
> norman.geist_at_uni-greifswald.de> wrote:
>
>> Hi Shubhra,
>>
>> if you are sure that your IB fabric setup is fine (do other programs
>> work, do tools like ib_ping work?), you may be using an InfiniBand
>> stack/driver that is incompatible with the precompiled builds (not
>> OFED?). You could try to build NAMD yourself against a separate MPI
>> (OpenMPI, for instance). Or, if you have IPoIB installed (check
>> /sbin/ifconfig for interfaces called ib0 or similar), you can use
>> those interfaces instead of the "eth" ones; in that case, choose the
>> IP addresses that belong to the IB network interfaces. Also, when
>> using IPoIB, set /sys/class/net/ib0/mode to "connected" and the MTU
>> to "65520", simply by doing an echo with a ">" redirect as root.
>> Additionally, even if you are not using a CUDA version, as long as
>> you use charm++, try adding +idlepoll when calling namd to improve
>> scaling.
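>>
>> (In concrete terms, the settings above amount to something like the
>> following, run as root on every node; "ib0" is assumed to be the name
>> of the IPoIB interface shown by /sbin/ifconfig:
>>
>>   echo connected > /sys/class/net/ib0/mode
>>   echo 65520 > /sys/class/net/ib0/mtu
>>
>> and +idlepoll is simply appended to the namd2 arguments, e.g.
>>
>>   ~/NAMD_2.9_Linux-x86_64-ibverbs/charmrun ++p 16 ++nodelist nodelist \
>>     ~/NAMD_2.9_Linux-x86_64-ibverbs/namd2 +idlepoll namd-input
>> )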
>>
>> Norman Geist.
>>
>> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu]
>> On Behalf Of Shubhra Ghosh Dastidar
>> Sent: Tuesday, 28 May 2013 09:15
>> To: NAMD
>> Subject: namd-l: problem with running namd through infiniband
>>
>>
>> I am trying to run namd through infiniband.
>>
>> First I tried the multicore version, which runs smoothly on 32 cores
>> when restricted to a single node.
>>
>> Then I tried the TCP version (which uses ethernet), which runs across
>> multiple nodes, e.g. a total of 32 cores (16 cores from node-1 and 16
>> cores from node-2).
>>
>> Then I tried both the infiniband version and the infiniband-smp
>> version. If the job is restricted to the 32 cores of one node, they
>> run smoothly. But if it is asked to run across multiple nodes (i.e.
>> communicating through infiniband) then I get an error; the last few
>> lines of the output are the following:
>>
>> Charmrun> All clients connected.
>> Charmrun> IP tables sent.
>> Charmrun> node programs all connected
>> Charmrun> started all node programs in 3.995 seconds.
>> Charmrun: error on request socket--
>> Socket closed before recv.
>>
>> Can anyone help?
>>
>> The execution command which I am using is the following:
>>
>> ~/NAMD_2.9_Linux-x86_64-ibverbs/charmrun ++p 16 ++verbose
>> ++remote-shell ssh ++nodelist nodelist
>> ~/NAMD_2.9_Linux-x86_64-ibverbs/namd2 namd-input
>>
>> (infiniband has been tested with another program, e.g. CHARMM-37,
>> which seems to be working fine)
>>
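>> (The nodelist file here follows the standard charmrun format; e.g.,
>> for the two nodes mentioned above it looks something like:
>>
>>   group main
>>     host node-1
>>     host node-2
>> )
>>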
>> Regards
>>
>> --
>> Dr. Shubhra Ghosh Dastidar
>>

-- 
Dr. Shubhra Ghosh Dastidar
Assistant Professor
Centre of Excellence in Bioinformatics
Bose Institute
P-1/12 C.I.T. Scheme VII-M, Kolkata 700 054, India
Phone: +91-33-23554766, Ext. 332, Fax: +91-33-2355 3886
Web: http://www.boseinst.ernet.in/bic/fac/shubhra/

This archive was generated by hypermail 2.1.6 : Tue Dec 31 2013 - 23:23:16 CST