From: Peter Freddolino (petefred_at_ks.uiuc.edu)
Date: Thu Feb 12 2009 - 09:41:27 CST
Running on more processors also increases the amount of work that has to
be done on node 0, and can push it over the edge. If you want to test
this, try running a small system on 1024 cores, or your large system on
513, or change the 512s in the code to 511 and run the large system on 512.
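
To illustrate the point made about numcpus in the quoted reply below, here is
a minimal C++ sketch of why a 512-byte character buffer limits the length of
the printed string and not the number of processors. The function name
format_cpu_count is hypothetical and this is not NAMD's actual code; it only
assumes a buffer declared with size 512, as the thread describes.

```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-in for the buffer discussed in the thread;
// this is an illustration, not code taken from NAMD's main.C.
std::string format_cpu_count(int ncpus) {
    // 512 bytes of storage: enough for a string of up to 511 characters
    // plus the terminating NUL. It says nothing about how large ncpus may be.
    char numcpus[512];

    // snprintf never writes more than sizeof(numcpus) bytes, so even a
    // very large processor count cannot overflow the buffer.
    std::snprintf(numcpus, sizeof(numcpus), "%d", ncpus);

    return std::string(numcpus);
}
```

Under these assumptions, a call such as format_cpu_count(1024) uses only four
characters of the buffer, so asking for 1024 cores is nowhere near the string
limit.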
Vlad Cojocaru wrote:
> Hi Peter,
> my system has about 250K atoms. However, the simulation runs with no
> problem on 512 cores. The segmentation fault only appears if I ask for
> 1024 cores for the same job.
> Peter Freddolino wrote:
>> Hi Vlad,
>> there is, in fact, not a 512 core limit in namd (we frequently run on
>> more); both of the instances of the number 512 in the code that you
>> mention are red herrings. PROCESSORMAX is never used, and numcpus is a
>> character array used to *print* the number of processors being used (and
>> thus the limit is a string 512 characters long). Segmentation faults on
>> startup (or shortly thereafter) with large systems usually mean that
>> node 0 is running out of memory. I'd recommend trying the steps at
>> to reduce your memory usage. Out of curiosity, how large is your system?
>> Vlad Cojocaru wrote:
>>> Dear NAMDers,
>>> On August 6th last year, I compiled the namd cvs code on an
>>> opteron-based linux cluster with infiniband (intel compiler, mvapich).
>>> I was running it all the time on 512 cores and everything worked fine.
>>> Now, I have a much bigger system and I wanted to run on 1024 cores.
>>> However, I started getting "Segmentation fault" errors (nothing else
>>> in the error message) on jobs that I could run on 512 cores. I was
>>> puzzled by this as it didn't make sense and with some help I actually
>>> discovered that the code was compiled to run on a maximum of 512 cores
>>> (defined in LdbCoordinator.h (PROCESSORMAX = 512) and in main.C (char
>>> numcpus[512])).
>>> Since I haven't changed anything in the code before compiling, I
>>> assume this was built into the cvs code. What I found even more strange
>>> is the error message. Instead of having something like "The maximum
>>> number of cpus (512) was exceeded" I get something like "Segmentation
>>> fault" (which doesn't tell much).
>>> Now, my questions are:
>>> 1. Why does the code define a maximum number of cpus since namd is
>>> meant to be run on large parallel machines?
>>> 2. Is that the case with the newest cvs code as well?
>>> 3. If I want to compile for running on more cpus, what do I have to
>>> modify in the cvs code?
>>> 4. If this definition of max no of cpus is kept, is it possible to add
>>> a relevant error message when trying to run on more cpus?
>>> Thanks a lot
>>> Best wishes