Re: Error: transport retry exceeded error

From: Alexandre A. Vakhrouchev (makaveli.lcf_at_gmail.com)
Date: Sat May 17 2008 - 00:03:03 CDT

I've got the targeted setup working on 32 processors after reducing
the first LDB period down to 20 and saving restart files every 10 ps
(compared to the 1 ps I used before). I'm not sure that was the actual
solution. Now I have to increase the number of processors because the
simulation takes too long, so I'll contact our sysadmins following
Axel's advice. We have a queueing system on our cluster, so I am not
overloading the whole system; the bad thing is that I still could not
get the result.
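
A minimal sketch (not part of the original mail) of what those two
changes might look like in a NAMD configuration file, assuming that
"first LDB period" refers to the firstLdbStep keyword and that a 2 fs
timestep is used; the step counts are illustrative only:

    firstLdbStep   20      ;# perform the first load-balancing step earlier
    restartfreq    5000    ;# write restart files every 10 ps at a 2 fs timestep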

2008/5/17, Axel Kohlmeyer <akohlmey_at_cmm.chem.upenn.edu>:
> On Sat, 17 May 2008, Alexandre A. Vakhrouchev wrote:
>
> AV> Thanks Axel!
> AV> I tried with pthread, got the same error. So will try with different
> AV> number of cores/nodes following your advice. BTW, what is the
>
> there are a number of things you can do. i was serious about
> getting in contact with the sysadmins, because if you or
> others are overloading the machine, then all kinds of problems
> can arise, so the problem _has_ to be addressed. i found that
> the low-level infiniband communication cannot handle a large
> number of cores and nodes well; it seems related to the need
> to keep DMA buffers around. NAMD is particularly demanding in
> that respect, since charm++ does a lot of probes and
> non-blocking sends. with openmpi i found that one can reduce
> the problem by telling the infiniband layer to re-use existing
> buffers and by increasing the number of retries before it
> gives up (which of course hurts performance, as it increases
> latencies...).
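
A hedged illustration (not part of the original mail): with OpenMPI's
openib BTL, buffer re-use and the retry behaviour can be tuned through
MCA parameters along the lines below. Parameter names and valid ranges
differ between OpenMPI releases, so check them with
"ompi_info --param btl openib" before relying on this:

    # re-use registered (pinned) buffers, allow the maximum number of
    # IB retransmit attempts (7) and a larger timeout exponent
    mpirun --mca mpi_leave_pinned 1 \
           --mca btl_openib_ib_retry_count 7 \
           --mca btl_openib_ib_timeout 23 \
           -np 64 namd2 run.namd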
>
> AV> difference between NAMD with pthread and without?
>
> the thread issue is not in namd but in the underlying charm++.
> charm++ does asynchronous communication (it somewhat emulates
> the one-sided communication of MPI-2 with MPI-1 calls).
> from a programming point of view it is very convenient to
> have a separate thread wait for MPI messages to arrive and
> queue them for processing, instead of polling or waiting in
> the compute task. charm++ can be compiled using native
> threads, e.g. pthreads, or its internal QuickThreads package.
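
Another sketch under assumptions (not from the original mail): the
charm++ build script accepts a "pthreads" option that selects POSIX
threads for its user-level threads instead of the bundled QuickThreads.
The architecture name below is an assumption; check the charm++ README
for the one matching your machine:

    ./build charm++ mpi-linux-amd64 -O             # default: QuickThreads
    ./build charm++ mpi-linux-amd64 pthreads -O    # use native POSIX threads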
>
> i don't think that there is threaded parallelization (yet),
> and even if there were i would not expect NAMD to take
> advantage of it. NAMD uses only a rather limited subset of
> what charm++ has to offer.
>
> cheers,
> axel.
>
>
> AV>
> AV> 2008/5/16, Axel Kohlmeyer <akohlmey_at_cmm.chem.upenn.edu>:
> AV> >
> AV> > no. you are overloading your infiniband fabric.
> AV> > sometimes using less cores/node helps. you should
> AV> > contact the sysadmins of the machine and tell them
> AV> > that they are in for some "fun". ;-)
> AV> >
> AV> > i've seen this happen on several large infiniband
> AV> > based clusters and it is not easy to work around.
> AV> >
> AV> > cheers,
> AV> > axel.

-- 
Best regards,
Dr. Alexander Vakhrushev
Institute of Applied Mechanics
Dep. of Mech. and Phys.-Chem.
of heterogeneous mediums
UB of Russian Academy of Sciences
34 T. Baramzinoy St.
Izhevsk, Russia 426067
