Re: namd crash: Signal: segmentation violation

From: Leandro Martínez (leandromartinez98_at_gmail.com)
Date: Thu May 24 2007 - 08:37:06 CDT

Hi Gengbin,
The error in the charm++ checkpoint test cannot be that, unless it is
trying to access some odd "work directory". The cluster is
actually "working", except for the fact that the simulations crash
after some hours. So all work directories are certainly mounted.

I have run the test you suggested using the memory-paranoid
compilation, which crashes before the first time step when running
with two processors on the master machine (++local +p2).

Using a single processor per instance (charmrun ++local +p1) and
running two simulations at the same time, both simulations start well.
I'm running them now to see if they will crash at some point, but I
bet they won't.
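
Roughly, the two runs were started like this (a sketch; sim1.namd and
sim2.namd stand in for my two actual input files):

./charmrun ++local +p1 ./namd2 sim1.namd >& sim1.log &
./charmrun ++local +p1 ./namd2 sim2.namd >& sim2.log &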

Leandro.

Recalling the error of the memory-paranoid binary when running
on a single node with two processors:
Info: Finished startup with 104684 kB of memory in use.
------------- Processor 0 Exiting: Caught Signal ------------
Signal: segmentation violation
Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
Stack Traceback:
  [0] /lib/libc.so.6 [0x2b04c4fed5c0]
  [1] _ZN10Controller9threadRunEPS_+0 [0x5d8980]
Fatal error on PE 0> segmentation violation
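
The mangled symbol in the traceback can be decoded with c++filt, which at
least tells which routine was active when it died:

echo '_ZN10Controller9threadRunEPS_' | c++filt

which should print something like Controller::threadRun(Controller*).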

On 5/23/07, Gengbin Zheng <gzheng_at_ks.uiuc.edu> wrote:
>
> Leandro,
>
> Just a wild idea: what happens if you run two instances of NAMD on apoa1
> on a single node? That is, run two copies of NAMD at the same time, each
> instance using only one core (+p1).
>
> By the way, the error in the charm++ checkpoint test was due to "not being
> able to open a file on a compute node for writing". It may be that the work
> directory was not mounted on all compute nodes.
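>
> A quick way to check that is just to try a remote write on each node,
> e.g. (node01, node02, and the path are placeholders for your actual
> node names and work directory):
>
> for h in node01 node02; do ssh $h "touch /path/to/workdir/.write_test"; done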
>
> Gengbin
>
> Leandro Martínez wrote:
>
> > Hi Brian,
> > Well, I have compiled with -g and -lefence, and the binaries indeed
> > crash with a lot of messages from which I cannot get anything. Here
> > are the logs:
> >
> > Using:
> > ./charmrun ++local +p2 ++verbose ./namd2 teste-gentoo.namd >&
> > namd-efence-local.log
> > Result:
> > http://limes.iqm.unicamp.br/~lmartinez/namd-efence-local.log
> >
> > Using:
> > ./charmrun +p4 ++nodelist ./nodelist ++remote-shell ssh ++verbose
> > ./namd2 teste-gentoo.namd >& namd-efence-2cpus.log
> > Result:
> > http://limes.iqm.unicamp.br/~lmartinez/namd-efence-2cpus.log
> >
> > Probably efence is not working properly there. I don't know.
> >
> > The specifications of the machines with which we are having
> > trouble are:
> >
> > Motherboard: ASUS M2NPV-MX
> > - NVIDIA GeForce 6150 + nForce 430
> > - Dual-channel DDR2 800/667/533
> > - PCI Express architecture
> > - Integrated GeForce6 GPU
> > - NVIDIA Gb LAN controller on-board
> > Chipset: NVIDIA
> > Processor: AMD dual-core Athlon64 5200+ (socket AM2, 940 pins)
> >
> > Leandro.
> >
> > On 5/22/07, Brian Bennion <bennion1_at_llnl.gov> wrote:
> >
> >> At 01:18 PM 5/22/2007, you wrote:
> >> >Hi Brian,
> >> >The charm++ is the CVS version, since the one provided with
> >> >namd cannot be compiled with GCC 4.0 (an error occurs,
> >> >something I have already discussed with Gengbin before, see below).
> >> >
> >> >Just to clarify things so that I can test with a better
> >> >understanding of what I will be doing:
> >>
> >> The MPI version of namd should be built on an MPI-aware charm++ layer.
> >> Starting it with mpirun or charmrun will still
> >> cause problems if the namd binary is built on faulty MPI libraries.
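> >>
> >> In other words, the charm++ layer itself has to be built for MPI before
> >> namd is configured on top of it, e.g. something like (a sketch; the
> >> mpi-linux-amd64 target name and any extra options depend on the charm
> >> version and your MPI installation):
> >>
> >> ./build charm++ mpi-linux-amd64
> >>
> >> rather than the net-linux-amd64 build that is meant for charmrun.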
> >>
> >>
> >> >1. Doesn't the fact that I get an error with mpirun for the same test
> >> >diminish the possibility of a charm++ problem? Is
> >> >there any relation between running the test with "mpirun" and the
> >> >compiled charm++ or its tests?
> >> >
> >> >2. The "-lefence" option I should add to the linking of both
> >> >charm++ and namd2? The same thing for -g?
> >> -g should go into the line that generates the
> >> object code. Find -O3 or -O2 and replace it with -g
> >> for both charm and namd. It will make the binary
> >> bigger and cause it to run much slower. It may
> >> actually make the problem go away, which is really a pain in the neck.
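> >>
> >> Concretely, something along these lines (just a sketch; the exact file
> >> and variable names vary between charm and namd versions):
> >>
> >> ./build charm++ net-linux-amd64 -g -O0
> >>
> >> and in the NAMD arch file, for example,
> >>
> >> CXXOPTS = -g      # instead of -O3
> >> COPTS = -g        # same for the C compiler flags
> >>
> >> with -lefence appended to the final namd2 link command.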
> >>
> >>
> >> >What do you think about that corrupted velocity? I think
> >> >that kind of corrupted data may be the whole problem, producing
> >> >strange error messages whenever it occurs.
> >> What is corrupting the data? The message passing
> >> between cores and multiple nodes is touching
> >> memory that is either physically corrupted or whose
> >> bits have been changed by another process. The
> >> efence link should rule out other code touching/rewriting memory that is in use.
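> >>
> >> For instance (a made-up snippet, not namd code), with -lefence an
> >> off-by-one write such as
> >>
> >> #include <cstdlib>
> >> int main() {
> >>   int *v = (int *) std::malloc(4 * sizeof(int));
> >>   v[4] = 0;   /* one element past the end */
> >> }
> >>
> >> should segfault right at the bad store instead of silently corrupting
> >> whatever happens to sit next to the array in memory.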
> >>
> >> Can you repeat the type of motherboard and
> >> chipset you have on the machines that can't seem to run namd?
> >>
> >> Brian
> >>
> >> >Thanks,
> >> >Leandro.
> >> >
> >> >Note: these are the charm++ compilation errors, which are solved in the
> >> >CVS version (I think this is not relevant at this point):
> >> >
> >> >
> >> >>Hi Leandro,
> >> >>I believe this compilation problem has been fixed in recent versions
> >> >>of charm. Please update your charm from CVS, or download a nightly
> >> >>build package from the charm web site.
> >> >>Gengbin
> >> >
> >> >Leandro Martínez wrote:
> >> >
> >> >>Dear charm++ developers,
> >> >>I am trying to compile charm++ on Opteron and AMD64 machines with no
> >> >>success. I am doing this because I am trying to run NAMD on an AMD64
> >> >>(dual-core) cluster, also with no success: it seems that the load
> >> >>balancer eventually hangs (the simulation does not crash, but I get a
> >> >>single job running forever, producing nothing, right during a load
> >> >>balancing step). This problem was reported by Jim Phillips and is,
> >> >>apparently, an issue related to charm on these machines:
> >> >>
> >> >>http://www.ks.uiuc.edu/Research/namd/wiki/index.cgi?NamdOnAMD64
> >> >>
> >> >>The first question is: is the problem really known? Does the solution
> >> >>above really solve the issue? We have been running namd on an
> >> >>Opteron cluster for a while with no problems; the problems appeared on
> >> >>the new AMD64 dual-core machines we bought.
> >> >>
> >> >>Anyway, I tried to compile charm++ on my machines with the option
> >> >>above changed, with no success.
> >> >>
> >> >> From my Opteron machine (running Fedora 5.0), I get the following
> >> >>when running ./build charm++ net-linux-amd64:
> >> >>
> >> >>if [ -d charmrun ] ; then ( cd charmrun ; make OPTS='' ) ; fi
> >> >>make[3]: Entering directory
> >> >>`/home/lmartinez/temp/instalacoes/charm-5.9/net-linux-amd64/tmp/charmrun'
> >>
> >> >>make[3]: Nothing to be done for `all'.
> >> >>make[3]: Leaving directory
> >> >>`/home/lmartinez/temp/instalacoes/charm-5.9/net-linux-amd64/tmp/charmrun'
> >>
> >> >>if [ -f charmrun ] ; then ( cp charmrun ../bin ) ; fi
> >> >>make[2]: Leaving directory `/home/lmartinez/temp/instalacoes/charm-
> >> >>5.9/net-linux-amd64/tmp'
> >> >>cd libs/ck-libs/multicast && make
> >> >>make[2]: Entering directory
> >> >>`/home/lmartinez/temp/instalacoes/charm-5.9/net-linux-amd64/tmp/libs/ck-libs/multicast'
> >>
> >> >>make[2]: Nothing to be done for `all'.
> >> >>make[2]: Leaving directory
> >> >>`/home/lmartinez/temp/instalacoes/charm-5.9/net-linux-amd64/tmp/libs/ck-libs/multicast'
> >>
> >> >>../bin/charmc -c -I. ComlibManager.C
> >> >>MsgPacker.h:86: error: extra qualification 'MsgPacker::' on member
> >> >>'MsgPacker'
> >> >>Fatal Error by charmc in directory
> >> >>/home/lmartinez/temp/instalacoes/charm-5.9/net-linux-amd64/tmp
> >> >> Command g++ -fPIC -m64 -I../bin/../include -D__CHARMC__=1 -I. -c
> >> >>ComlibManager.C -o ComlibManager.o returned error code 1
> >> >>charmc exiting...
> >> >>make[1]: *** [ComlibManager.o] Error 1
> >> >>make[1]: Leaving directory
> >> >>`/home/lmartinez/temp/instalacoes/charm-5.9/net-linux-amd64/tmp'
> >> >>make: *** [charm++] Error 2
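> >> >>
> >> >>For what it's worth, this "extra qualification" error is newer GCC
> >> >>rejecting in-class member declarations that repeat the class name.
> >> >>A tiny illustration with a made-up class, not the actual charm code:
> >> >>
> >> >>class Packer {
> >> >>  // Packer::Packer(int n);   // rejected: extra qualification 'Packer::' on member 'Packer'
> >> >>  Packer(int n);              // accepted: no class-name qualification inside the class body
> >> >>};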
> >> >>
> >> >> From my AMD64 (Fedora 6.0) machine, I get:
> >> >>
> >> >>make[2]: Entering directory
> >> >>`/home/lmartinez/charm-5.9/net-linux-amd64/tmp'
> >> >>make[2]: `QuickThreads/libqt.a' is up to date.
> >> >>make[2]: Leaving directory `/home/lmartinez/charm-
> >> >>5.9/net-linux-amd64/tmp'
> >> >>make converse-target
> >> >>make[2]: Entering directory
> >> >>`/home/lmartinez/charm-5.9/net-linux-amd64/tmp'
> >> >>../bin/charmc -c -I. traceCore.C
> >> >>traceCore.h:20: error: previous declaration of 'int Cpv__traceCoreOn_
> >> >>[2]' with 'C++' linkage
> >> >>traceCoreAPI.h:8: error: conflicts with new declaration with 'C'
> >> linkage
> >> >>Fatal Error by charmc in directory
> >> >>/home/lmartinez/charm-5.9/net-linux-amd64/tmp Command g++ -fPIC -m64
> >> >>-I../bin/../include -D__CHARMC__=1 -I. -c traceCore.C -o traceCore.o
> >> >>returned error code 1
> >> >>charmc exiting...
> >> >>make[2]: *** [traceCore.o] Error 1
> >> >>make[2]: Leaving directory
> >> `/home/lmartinez/charm-5.9/net-linux-amd64/tmp'
> >> >>make[1]: *** [converse] Error 2
> >> >>make[1]: Leaving directory
> >> `/home/lmartinez/charm-5.9/net-linux-amd64/tmp'
> >> >>make: *** [charm++] Error 2
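> >> >>
> >> >>This second error is the usual C vs C++ linkage clash: the same symbol
> >> >>is declared once with C++ linkage and once inside extern "C". A minimal
> >> >>illustration, not the actual charm headers:
> >> >>
> >> >>extern int traceOn[2];                    // C++ linkage, as in traceCore.h
> >> >>extern "C" { extern int traceOn[2]; }     // C linkage: conflicts with the declaration above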
> >>
>
>
