Re: namd cvs compilation with a maximum number of cores to run on

From: Axel Kohlmeyer (akohlmey_at_cmm.chem.upenn.edu)
Date: Thu Feb 12 2009 - 10:35:05 CST

On Thu, 12 Feb 2009, Vlad Cojocaru wrote:

VC> Hi Peter,
VC>
VC> my system has about 250K atoms. However, the simulation runs with no problem
VC> on 512 cores. The segmentation fault only appears if I ask for 1024 cores
VC> for the same job.

vlad,

this may be irritating, but it is still quite possible that you
are running out of "available memory". the reason is that
you are using infiniband, and you will most likely have to tune
it to use a shared receive queue rather than
explicit node-to-node RDMA buffers.

most infiniband mpi packages have a lot of parameters to tune their
behavior, and they are usually tuned to work fast on a small cluster.
infiniband does RDMA and for that it needs communication buffers
that are "pinned", i.e. non-swappable. now by increasing the number
of cores that you use, you are also increasing the number of these
buffers (in the worst case quadratically) and thus there is
less memory left for the application. NAMD by default needs a
system size dependent amount of memory on the node with rank 0,
so that all fits very well with your observations.
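
to make the scaling argument concrete, here is a rough back-of-the-envelope
sketch. every number in it (8 ranks per node, 8 buffers per peer, 16 KiB per
buffer, a pool of 256 buffers) is a made-up assumption for illustration, not
a measured or default value of any MPI package, and the function names are
hypothetical. it only compares how pinned memory grows when each rank keeps
dedicated buffers for every peer versus drawing from one shared pool.

```python
def per_peer_pinned_bytes(n_ranks, ranks_per_node, bufs_per_peer, buf_bytes):
    """Pinned memory on ONE node if every local rank keeps dedicated
    buffers for every other rank in the job (assumed worst case)."""
    peers = n_ranks - 1
    return ranks_per_node * peers * bufs_per_peer * buf_bytes


def per_peer_total_bytes(n_ranks, bufs_per_peer, buf_bytes):
    """Pinned memory summed over the WHOLE job; grows as n_ranks**2."""
    return n_ranks * (n_ranks - 1) * bufs_per_peer * buf_bytes


def shared_pool_pinned_bytes(ranks_per_node, bufs_in_pool, buf_bytes):
    """Pinned memory on one node if each local rank uses one shared
    pool of buffers, independent of the number of peers."""
    return ranks_per_node * bufs_in_pool * buf_bytes


if __name__ == "__main__":
    # All values below are illustrative assumptions.
    ranks_per_node, bufs_per_peer, buf_bytes, pool = 8, 8, 16 * 1024, 256
    for n_ranks in (512, 1024):
        node = per_peer_pinned_bytes(n_ranks, ranks_per_node,
                                     bufs_per_peer, buf_bytes)
        total = per_peer_total_bytes(n_ranks, bufs_per_peer, buf_bytes)
        shared = shared_pool_pinned_bytes(ranks_per_node, pool, buf_bytes)
        print(f"{n_ranks} ranks: per-peer {node / 2**20:.0f} MiB/node, "
              f"{total / 2**30:.1f} GiB job-wide, "
              f"shared pool {shared / 2**20:.0f} MiB/node")
```

under these assumptions, doubling the rank count roughly doubles the pinned
memory per node and quadruples it job-wide, while the shared pool stays
constant. that per-node growth comes on top of whatever NAMD itself needs on
the node hosting rank 0.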

in case you are using OpenMPI (which i highly recommend),
i can give you details on what you have to change to
use the shared receive queue (and actually run faster for
larger jobs).

cheers,
   axel.

VC>
VC> Vlad
VC>
VC> Peter Freddolino wrote:
VC> > Hi Vlad,
VC> > there is, in fact, not a 512 core limit in namd (we frequently run on
VC> > more); both of the instances of the number 512 in the code that you
VC> > mention are red herrings. PROCESSORMAX is never used, and numcpus is a
VC> > character array used to *print* the number of processors being used (and
VC> > thus the limit is a string 512 characters long). Segmentation faults on
VC> > startup (or shortly thereafter) with large systems usually mean that
VC> > node 0 is running out of memory. I'd recommend trying the steps at
VC> > http://www.ks.uiuc.edu/Research/namd/wiki/index.cgi?NamdMemoryReduction
VC> > to reduce your memory usage. Out of curiosity, how large is your system?
VC> >
VC> > Best,
VC> > Peter
VC> >
VC> > Vlad Cojocaru wrote:
VC> >
VC> > > Dear NAMDers,
VC> > >
VC> > > On August 6th last year, I compiled the namd cvs code on an
VC> > > opteron-based linux cluster with infiniband (intel compiler, mvapich).
VC> > > I was running it all the time on 512 cores and everything worked fine.
VC> > > Now, I have a much bigger system and I wanted to run on 1024 cores.
VC> > > However, I started getting "Segmentation fault" errors (nothing else
VC> > > in the error message) on jobs that I could run on 512 cores. I was
VC> > > puzzled by this as it didn't make sense and with some help I actually
VC> > > discovered that the code was compiled to run on maximum 512 cores.
VC> > >
VC> > > (defined in LdbCoordinator.h (PROCESSORMAX = 512) and in main.C (char
VC> > > numcpus[512]))
VC> > >
VC> > > Since I haven't changed anything in the code before compiling, I
VC> > > assume this was built into the cvs code. What I found even more strange
VC> > > is the error message. Instead of having something like "The maximum
VC> > > number of cpus (512) was exceeded" I get something like "Segmentation
VC> > > fault" (which doesn't tell much).
VC> > >
VC> > > Now, my questions are:
VC> > > 1. Why does the code define a maximum number of cpus since namd is
VC> > > meant to be run on large parallel machines?
VC> > > 2. Is that the case with the newest cvs code as well?
VC> > > 3. If I want to compile for running on more cpus, what do I have to
VC> > > modify in the cvs code ?
VC> > > 4. If this definition of max no of cpus is kept, is it possible to add
VC> > > a relevant error message when trying to run on more cpus ?
VC> > >
VC> > > Thanks a lot
VC> > >
VC> > > Best wishes
VC> > > vlad
VC> > >
VC> > >
VC> >
VC> >
VC>
VC>

-- 
=======================================================================
Axel Kohlmeyer   akohlmey_at_cmm.chem.upenn.edu   http://www.cmm.upenn.edu
   Center for Molecular Modeling   --   University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.
