From: Andrew Emerson (a.emerson_at_cineca.it)
Date: Fri Feb 13 2009 - 03:36:30 CST
Dear Vlad
I noticed exactly the same problem on our opteron/infiniband cluster
(openmpi) with an even smaller system of about 150k atoms. I was told it
was probably a problem of openmpi. I don't know if this is true or not
but after some system upgrades I certainly don't get the problem anymore,
although admittedly I haven't tried with a larger system.
I think the idea of switching to openmpi seems a good one.
cheers
andy
Axel Kohlmeyer wrote:
> On Thu, 12 Feb 2009, Vlad Cojocaru wrote:
>
> VC> Hi Peter,
> VC>
> VC> my system has about 250K atoms. However, the simulation runs with no problem
> VC> on 512 cores. The segmentation fault only appears if I ask for 1024 cores
> VC> for the same job.
>
> vlad,
>
> this may be irritating, but it is still quite possible that you
> are running out of "available memory". the reason for that is that
> you are using infiniband and that you will most likely have to tune
> it to use a shared request queue rather than having
> explicit node-to-node RDMA buffers.
>
> most infiniband mpi packages have a lot of parameters to tune the
> behavior and they are usually tuned to work fast on a small cluster.
> infiniband does RDMA and for that it needs communication buffers
> that are "pinned", i.e. non-swappable. now by increasing the number
> of cores that you use, you are also increasing the number of these
> buffers (in the worst case quadratically) and thus there would be
> less memory left for the application. NAMD by default needs a
> system size dependent amount of memory on the node with rank 0,
> so that all fits very well with your observations.
>
> in case you are using OpenMPI (which i highly recommend)
> i can give you details on what you have to change to
> use the shared request queue (and actually run faster for
> larger jobs).
>
> cheers,
> axel.
>
>
> VC>
> VC> Vlad
> VC>
> VC> Peter Freddolino wrote:
> VC> > Hi Vlad,
> VC> > there is, in fact, not a 512 core limit in namd (we frequently run on
> VC> > more); both of the instances of the number 512 in the code that you
> VC> > mention are red herrings. PROCESSORMAX is never used, and numcpus is a
> VC> > character array used to *print* the number of processors being used (and
> VC> > thus the limit is a string 512 characters long). Segmentation faults on
> VC> > startup (or shortly thereafter) with large systems usually mean that
> VC> > node 0 is running out of memory. I'd recommend trying the steps at
> VC> > http://www.ks.uiuc.edu/Research/namd/wiki/index.cgi?NamdMemoryReduction
> VC> > to reduce your memory usage. Out of curiosity, how large is your system?
> VC> >
> VC> > Best,
> VC> > Peter
> VC> >
> VC> > Vlad Cojocaru wrote:
> VC> >
> VC> > > Dear NAMDers,
> VC> > >
> VC> > > I compiled on August 6th last year, the namd cvs code on an
> VC> > > opteron-based linux cluster with infiniband (intel compiler, mvapich).
> VC> > > I was running it all the time on 512 cores and everything worked fine.
> VC> > > Now, I have a much bigger system and I wanted to run on 1024 cores.
> VC> > > However, I started getting "Segmentation fault" errors (nothing else
> VC> > > in the error message) on jobs that I could run on 512 cores. I was
> VC> > > puzzled by this as it didn't make sense and with some help I actually
> VC> > > discovered that the code was compiled to run on maximum 512 cores.
> VC> > >
> VC> > > (defined in LdbCoordinator.h (PROCESSORMAX = 512) and in main.C (char
> VC> > > numcpus[512]))
> VC> > >
> VC> > > Since I haven't changed anything in the code before compiling, I
> VC> > > assume this was built in the cvs code . What I found even more strange
> VC> > > is the error message. Instead of having something like "The maximum
> VC> > > number of cpus (512) was exceeded" I get something like "Segmentation
> VC> > > fault" (which doesn't tell much)
> VC> > >
> VC> > > Now, my questions are:
> VC> > > 1. Why does the code define a maximum number of cpus since namd is
> VC> > > meant to be run on large parallel machines?
> VC> > > 2. Is that the case with the newest cvs code as well ?
> VC> > > 3. If I want to compile for running on more cpus, what do I have to
> VC> > > modify in the cvs code ?
> VC> > > 4. If this definition of max no of cpus is kept, is it possible to add
> VC> > > a relevant error message when trying to run on more cpus ?
> VC> > >
> VC> > > Thanks a lot
> VC> > >
> VC> > > Best wishes
> VC> > > vlad
> VC> > >
> VC> > >
> VC> >
> VC> >
> VC>
> VC>
>
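To make Axel's point about pinned buffers concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from any MPI implementation: the function name, the 8 ranks per node, the 8 buffers per peer, the 16 KiB buffer size and the 1024-buffer shared pool are all assumed values chosen only for illustration. It compares a scheme where every rank keeps dedicated receive buffers for every other rank against one where each rank draws from a fixed shared pool.

```python
def pinned_bytes_per_node(total_ranks, ranks_per_node, buffers_per_peer,
                          buffer_bytes, shared_pool_buffers=None):
    """Estimate pinned (non-swappable) buffer memory on one node.

    All parameters are hypothetical inputs, not real MPI tunables.
    If shared_pool_buffers is None, each local rank keeps
    buffers_per_peer buffers for every other rank in the job.
    Otherwise each local rank uses a fixed pool of that many buffers.
    """
    if shared_pool_buffers is None:
        # Dedicated buffers: grows with the number of peers of each rank.
        per_rank = (total_ranks - 1) * buffers_per_peer * buffer_bytes
    else:
        # Shared pool: independent of the job size.
        per_rank = shared_pool_buffers * buffer_bytes
    return ranks_per_node * per_rank


if __name__ == "__main__":
    MIB = 2 ** 20
    for ranks in (512, 1024):
        dedicated = pinned_bytes_per_node(ranks, 8, 8, 16 * 1024)
        shared = pinned_bytes_per_node(ranks, 8, 8, 16 * 1024,
                                       shared_pool_buffers=1024)
        print(f"{ranks} ranks: dedicated {dedicated / MIB:.0f} MiB/node, "
              f"shared pool {shared / MIB:.0f} MiB/node")
```

Under these assumed numbers the dedicated scheme roughly doubles its per-node footprint when going from 512 to 1024 ranks (and, since the node count doubles too, the job-wide total roughly quadruples), while the shared pool stays constant. The real figures depend entirely on the MPI library and its settings.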
--
Dr Andrew Emerson
CINECA (High Performance Systems)
via Magnanelli, 6/3
40033 Casalecchio di Reno (BO)-ITALY
tel: +39-051-6171653, fax: +39-051-6137273
e-mail: a.emerson_at_cineca.it