Re: Problem running NAMD 2.8 with ibverbs

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Thu Jan 12 2012 - 07:34:31 CST

Hi again,

Well, I looked at the files, and the only thing I noticed is the huge number
of dropped packets on the nodes' IB interface. Maybe one could clear the
counters (see ifconfig) and watch when the dropped packets occur.
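
If resetting the counters is not possible, sampling them before and after a
run shows the same thing. A minimal sketch, assuming a Linux node that exposes
the usual /sys/class/net/<iface>/statistics files and an IPoIB interface named
ib0. The function name and the SYSFS_NET override are made up for
illustration only:

```shell
# Print the rx/tx dropped-packet counters of one network interface.
# SYSFS_NET is a hypothetical override (handy for testing); by default the
# standard Linux sysfs location is used.
dropped_counters() {
    iface="$1"
    base="${SYSFS_NET:-/sys/class/net}/$iface/statistics"
    rx=$(cat "$base/rx_dropped") || return 1
    tx=$(cat "$base/tx_dropped") || return 1
    echo "$iface rx_dropped=$rx tx_dropped=$tx"
}

# Possible usage (ib0 is an assumed interface name):
#   dropped_counters ib0    # before starting the job
#   ... run the NAMD job ...
#   dropped_counters ib0    # afterwards, then compare the two lines
```

If the numbers only grow while the job is running, the drops are tied to the
job's traffic and not to background noise on the fabric.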

Have you tried running the job locally on a remote node, rather than
launching it from another node? Maybe charmrun tries to run the job on both
the PC you start the job from (which is possibly not on the IB network) AND
the remote node; that would surely end in a timeout. Try logging on to the
remote node and starting the job there locally.

By the way, if the precompiled ibverbs binary works, why do you want to
compile it yourself?

Another common problem is the permissions on the RDMA devices, but if the
precompiled binary works, everything seems to be right there.
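
For completeness, the device permissions can be checked quickly as the user
who runs the job. A rough sketch, assuming the device nodes live under
/dev/infiniband (the usual place with OFED); the function name is made up:

```shell
# Report whether the current user can read and write each device node in
# the given directory (default: /dev/infiniband, an assumed location).
check_rdma_access() {
    dir="${1:-/dev/infiniband}"
    for dev in "$dir"/*; do
        # An unexpanded glob means the directory is empty or missing.
        [ -e "$dev" ] || { echo "no devices under $dir"; return 1; }
        if [ -r "$dev" ] && [ -w "$dev" ]; then
            echo "ok: $dev"
        else
            echo "no access: $dev"
        fi
    done
}

# Possible usage on a compute node:
#   check_rdma_access
```

Any "no access" line would point to a permissions problem for that user.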

You could also check whether the shells you see spawning exit due to an
error; dmesg or cat /var/log/messages should show something in that case.
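
To avoid reading the whole log, the output can be filtered for likely
candidates. A small sketch; the keyword list is only a guess (ib_mthca is, as
far as I know, the kernel driver for InfiniHost III HCAs), and the function
name is made up:

```shell
# Keep only log lines that mention the InfiniBand stack or a crashed
# process. The pattern is an assumed starting point, not a complete list.
filter_ib_errors() {
    grep -iE 'mthca|ib_|segfault|out of memory'
}

# Possible usage on the node where the shells die:
#   dmesg | filter_ib_errors
#   filter_ib_errors < /var/log/messages    # may need root to read
```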

Let me know

Norman Geist.

> -----Original Message-----
> From: Moritz Schlarb [mailto:schlarbm_at_uni-mainz.de]
> Sent: Wednesday, 11 January 2012 11:21
> To: Norman Geist
> Cc: Namd Mailing List
> Subject: Re: namd-l: Problem running NAMD 2.8 with ibverbs
>
> Hello Norman,
>
> thank you for your help!
>
> I've attached the command outputs as textfiles for better readability.
> I hope you don't mind that I've tarred it and uploaded it here:
> https://fileshare.zdv.uni-mainz.de/846d3bf1-8b15-4ae5-92fd-2eaf1ae9f849_36d1bd67-45b2-45d4-9c5b-b600d1d28126.file
>
> Yes, we're using OFED-1.4 with the following HCAs:
> 07:00.0 InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx
> HCA] (rev 20)
>
> As you can see from the lsmod output, IPoIB is installed and working.
>
> By using /etc/hosts, all nodes (and the "login" nodes, too) have names
> and addresses for their ethernet cards (eth1 on the internal network
> e.g. node102 = 192.168.142.102) and names and addresses for their
> infiniband cards (fnode102 = 192.168.138.102).
> Pinging works (I can't test it using the ibtools because of insufficient
> permissions and my boss isn't around atm, but normal ping works, and
> every other MPI application, too).
>
> Are you able to find some error in there?
> I would think that it is just a compilation problem, especially since
> the precompiled binary works fine... (If there isn't a simple solution
> I think I'll have to stick with that either way.)
>
> Thank you for your help!
>
> Greetings,
> Moritz
>
> On 09.01.2012 07:32, Norman Geist wrote:
> > Hi Moritz,
> >
> > I would answer you in German, as I'm here at the University of
> > Greifswald, but we should keep the mailing list usable for an
> > international audience.
> >
> > Please send me the output of:
> >
> >> ifconfig -a
> >> cat /sys/class/net/ib0/mode
> >> cat /sys/class/net/ib0/mtu
> >
> > What InfiniBand hardware do you have (HCA, switch)?
> > How did you install it? (Have you used OFED?)
> >
> > Have you thought of installing the IPoIB driver? It is needed for
> > resolving IP addresses to the right InfiniBand addresses.
> > Have you tried the IB tools like ping etc.? Did they work?
> >
> > Let me know
> > Norman Geist.
> >

This archive was generated by hypermail 2.1.6 : Tue Dec 31 2013 - 23:21:34 CST