Re: namd crash: Signal: segmentation violation

From: Leandro Martínez (leandromartinez98_at_gmail.com)
Date: Wed May 23 2007 - 13:00:28 CDT

Hi Brian,
Well, I have compiled with -g and -lefence, and the binaries indeed
crash, producing a lot of messages from which I cannot extract anything
useful. Here are the logs:

Using:
./charmrun ++local +p2 ++verbose ./namd2 teste-gentoo.namd >&
namd-efence-local.log
Result:
http://limes.iqm.unicamp.br/~lmartinez/namd-efence-local.log

Using:
 ./charmrun +p4 ++nodelist ./nodelist ++remote-shell ssh ++verbose
./namd2 teste-gentoo.namd >& namd-efence-2cpus.log
Result:
http://limes.iqm.unicamp.br/~lmartinez/namd-efence-2cpus.log

Probably efence is not working properly there. I don't know.
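
(Side note: one quick way to check whether efence really made it into the
binary, assuming it was linked against the shared library and that the path
below matches your installation, would be something like:

  # confirm namd2 picked up Electric Fence at link time (shared lib case)
  ldd ./namd2 | grep efence

  # or preload it at run time without relinking, for a local run
  LD_PRELOAD=/usr/lib/libefence.so.0 ./charmrun ++local +p2 ./namd2 teste-gentoo.namd

For runs spread over remote nodes via ssh, the LD_PRELOAD setting would not
necessarily propagate, so the linked-in -lefence is the safer option there.)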

The specifications of the machines with which we are having
trouble are:

Motherboard: ASUS M2NPV-MX
- NVIDIA GeForce 6150 + nForce 430
- Dual-channel DDR2 800/667/533
- PCI Express architecture
- Integrated GeForce6 GPU
- NVIDIA Gb LAN controller on-board
Chipset: NVidia
Processor: AMD dual-core Athlon64 5200+ (socket AM2, 940 pins)

Leandro.

On 5/22/07, Brian Bennion <bennion1_at_llnl.gov> wrote:
> At 01:18 PM 5/22/2007, you wrote:
> >Hi Brian,
> >The charm++ is the CVS version, since the one provided with
> >namd cannot be compiled with GCC 4.0 (an error occurs,
> >something I have already discussed with Gengbin before; see below).
> >
> >Just to clarify things, so that I can test with more
> >knowledge of what I will be doing:
>
> The MPI version of NAMD should be built on an MPI-aware charm++ layer.
> Starting it with mpirun or charmrun will still
> cause problems if the namd binary is built on faulty MPI libraries.
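
(For reference, the usual recipe is roughly the following; the exact
charm++ target and NAMD arch names depend on the versions involved, so
treat this as a sketch rather than the exact commands:

  # build an MPI-aware charm++ first
  ./build charm++ mpi-linux-amd64

  # then configure and build NAMD on top of that charm tree
  ./config Linux-amd64-MPI
  cd Linux-amd64-MPI && make
)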
>
>
> >1. Doesn't the fact that I get an error with mpirun for the same test
> >diminish the possibility of a charm++ problem? Is
> >there any relation between my mpirun run of the test and the
> >compiled charm++ or its tests?
> >
> >2. Should I add the "-lefence" option to the linking of both
> >charm++ and namd2? And the same for -g?
> -g should go into the line that generates the
> object code. Find -O3 or -O2 and replace it with -g
> for both charm and namd. It will make the binary
> bigger and cause it to run much slower. It may
> actually make the problem go away, which is really a pain in the neck.
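
(Concretely, something like the following; the arch file name is just an
example and may differ in your tree:

  # charm++: the build script passes trailing flags to the compiler,
  # so a debug build can be made without editing the makefiles
  ./build charm++ net-linux-amd64 -g -O0

  # NAMD: in the arch file used by ./config (e.g. arch/Linux-amd64-g++.arch),
  # replace -O3/-O2 with -g in the compiler options, and add -lefence to the
  # link line to pull in Electric Fence
)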
>
>
> >What do you think about that corrupted velocity? I think
> >that kind of corrupted data may be the whole problem, producing
> >strange error messages whenever it occurs.
> What is corrupting the data? The message passing
> between cores and multiple nodes is touching
> memory that is corrupted either physically or whose
> bits have been changed by another process. The
> efence link should rule out other code touching/rewriting used memory.
>
> Can you repeat the type of motherboard and
> chipset you have on the machines that can't seem to run namd?
>
> Brian
>
> >Thanks,
> >Leandro.
> >
> >Note: the charm++ compilation errors, which are solved in the
> >CVS version (I think this is not relevant at this point):
> >
> >
> >>Hi Leandro,
> >>I believe this compilation problem has been fixed in recent versions
> >>of charm. Please update your charm from CVS, or download a nightly build
> >>package from the charm web site.
> >>Gengbin
> >
> >Leandro Martínez wrote:
> >
> >>Dear charm++ developers,
> >>I am trying to compile charm++ on Opteron and Amd64 machines with no
> >>success. I am doing this because I am trying to run NAMD on an Amd64
> >>(dual-core) cluster, also without success: it seems that the load
> >>balancer eventually hangs (the simulation does not crash, but I get a
> >>single job running forever, without any production, during a load
> >>balancing step). This problem was reported by Jim Phillips and is,
> >>apparently, an issue related to the charm program on these machines:
> >>
> >>http://www.ks.uiuc.edu/Research/namd/wiki/index.cgi?NamdOnAMD64
> >>
> >>The first question is: is the problem really known, and does the
> >>solution above really solve the issue? We have been running namd on an
> >>Opteron cluster for a while with no problems; the problems appeared on
> >>the new Amd64 dual-core machines we bought.
> >>
> >>Anyway, I tried to compile charm++ on my machines with the option
> >>above changed, with no success.
> >>
> >>From my Opteron machine (running Fedora 5.0), running
> >>./build charm++ net-linux-amd64 I get:
> >>
> >>if [ -d charmrun ] ; then ( cd charmrun ; make OPTS='' ) ; fi
> >>make[3]: Entering directory
> >>`/home/lmartinez/temp/instalacoes/charm-5.9/net-linux-amd64/tmp/charmrun'
> >>make[3]: Nothing to be done for `all'.
> >>make[3]: Leaving directory
> >>`/home/lmartinez/temp/instalacoes/charm-5.9/net-linux-amd64/tmp/charmrun'
> >>if [ -f charmrun ] ; then ( cp charmrun ../bin ) ; fi
> >>make[2]: Leaving directory `/home/lmartinez/temp/instalacoes/charm-
> >>5.9/net-linux-amd64/tmp'
> >>cd libs/ck-libs/multicast && make
> >>make[2]: Entering directory
> >>`/home/lmartinez/temp/instalacoes/charm-5.9/net-linux-amd64/tmp/libs/ck-libs/multicast'
> >>make[2]: Nothing to be done for `all'.
> >>make[2]: Leaving directory
> >>`/home/lmartinez/temp/instalacoes/charm-5.9/net-linux-amd64/tmp/libs/ck-libs/multicast'
> >>../bin/charmc -c -I. ComlibManager.C
> >>MsgPacker.h:86: error: extra qualification 'MsgPacker::' on member
> >>'MsgPacker'
> >>Fatal Error by charmc in directory
> >>/home/lmartinez/temp/instalacoes/charm-5.9/net-linux-amd64/tmp
> >> Command g++ -fPIC -m64 -I../bin/../include -D__CHARMC__=1 -I. -c
> >>ComlibManager.C -o ComlibManager.o returned error code 1
> >>charmc exiting...
> >>make[1]: *** [ComlibManager.o] Error 1
> >>make[1]: Leaving directory
> >>`/home/lmartinez/temp/instalacoes/charm-5.9/net-linux-amd64/tmp'
> >>make: *** [charm++] Error 2
> >>
> >>From my Amd64 machine (Fedora 6.0), I get:
> >>
> >>make[2]: Entering directory
> >>`/home/lmartinez/charm-5.9/net-linux-amd64/tmp'
> >>make[2]: `QuickThreads/libqt.a' is up to date.
> >>make[2]: Leaving directory `/home/lmartinez/charm-
> >>5.9/net-linux-amd64/tmp'
> >>make converse-target
> >>make[2]: Entering directory
> >>`/home/lmartinez/charm-5.9/net-linux-amd64/tmp'
> >>../bin/charmc -c -I. traceCore.C
> >>traceCore.h:20: error: previous declaration of 'int Cpv__traceCoreOn_
> >>[2]' with 'C++' linkage
> >>traceCoreAPI.h:8: error: conflicts with new declaration with 'C' linkage
> >>Fatal Error by charmc in directory
> >>/home/lmartinez/charm-5.9/net-linux-amd64/tmp Command g++ -fPIC -m64
> >>-I../bin/../include -D__CHARMC__=1 -I. -c traceCore.C -o traceCore.o
> >>returned error code 1
> >>charmc exiting...
> >>make[2]: *** [traceCore.o] Error 1
> >>make[2]: Leaving directory `/home/lmartinez/charm-5.9/net-linux-amd64/tmp'
> >>make[1]: *** [converse] Error 2
> >>make[1]: Leaving directory `/home/lmartinez/charm-5.9/net-linux-amd64/tmp'
> >>make: *** [charm++] Error 2
>
