Re: namd crash: Signal: segmentation violation

From: Gengbin Zheng (gzheng_at_ks.uiuc.edu)
Date: Wed May 23 2007 - 14:53:23 CDT

Leandro,

 Just a wild idea: what happens if you run two instances of NAMD on
apoa1 on a single node? That is, run two copies of NAMD at the same
time, with each instance using only one core (+p1).
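
 For example, something along these lines might do (apoa1.namd is the
standard ApoA1 benchmark config; adjust the paths to your own setup):

   ./charmrun ++local +p1 ./namd2 apoa1.namd > run1.log &
   ./charmrun ++local +p1 ./namd2 apoa1.namd > run2.log &

 If both single-core copies finish cleanly, that points more at the
communication between processes than at the cores themselves.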

 Btw, the error in the charm++ checkpoint test was due to "not being
able to open a file on a compute node for writing". It may be that the
working directory was not mounted on all compute nodes.
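
 A quick way to check is to try to create a file in the working
directory from each compute node, e.g. (the node names below are only
an example; use the ones from your nodelist):

   for n in node1 node2; do
     ssh $n "cd /path/to/workdir && touch .write_test && rm .write_test" \
       || echo "$n: cannot write to the work directory"
   done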

Gengbin

Leandro Martínez wrote:

> Hi Brian,
> Well, I have compiled with -g and -lefence, and the binaries indeed
> crash with a lot of messages from which I cannot extract anything
> useful. Here are the logs:
>
> Using:
> ./charmrun ++local +p2 ++verbose ./namd2 teste-gentoo.namd >&
> namd-efence-local.log
> Result:
> http://limes.iqm.unicamp.br/~lmartinez/namd-efence-local.log
>
> Using:
> ./charmrun +p4 ++nodelist ./nodelist ++remote-shell ssh ++verbose
> ./namd2 teste-gentoo.namd >& namd-efence-2cpus.log
> Result:
> http://limes.iqm.unicamp.br/~lmartinez/namd-efence-2cpus.log
>
> Probably efence is not working properly there. I don't know.
>
> The specifications of the machines for which we are having
> trouble are:
>
> Motherboard: ASUS M2NPV-MX
> - NVIDIA GeForce 6150 + nForce 430
> - Dual-channel DDR2 800/667/533
> - PCI Express architecture
> - Integrated GeForce6 GPU
> - NVIDIA Gb LAN controller on-board
> Chipset: NVidia
> Processor: AMD dual-core Athlon64 5200+ (socket AM2, 940 pins)
>
> Leandro.
>
>
>
>
>
>
> On 5/22/07, Brian Bennion <bennion1_at_llnl.gov> wrote:
>
>> At 01:18 PM 5/22/2007, you wrote:
>> >Hi Brian,
>> >The charm++ is the CVS version, since the one provided with
>> >namd cannot be compiled with GCC 4.0 (an error occurs, something I
>> >have already discussed with Gengbin before; see below).
>> >
>> >Just to clarify things, so that I can test with more knowledge of
>> >what I will be doing:
>>
>> The MPI version of namd should be built on an MPI-aware charm++
>> layer. Starting it with mpirun or charmrun will still cause problems
>> if the namd binary is built on faulty MPI libraries.
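>>
>> In other words, if you plan to launch with mpirun, charm++ itself
>> should be built for MPI, roughly
>>
>>   ./build charm++ mpi-linux-amd64
>>
>> and namd2 built on top of that layer, whereas a net-linux-amd64
>> charm++ build is meant to be started with charmrun.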
>>
>>
>> >1. Doesn't the fact that I get an error with mpirun for the same
>> >test diminish the possibility of a charm++ problem? Is there any
>> >relation between my running the test with mpirun and the compiled
>> >charm++ or its tests?
>> >
>> >2. Should I add the "-lefence" option to the linking of both
>> >charm++ and namd2? And the same for -g?
>> -g should go into the line that generates the object code. Find -O3
>> or -O2 and replace it with -g, for both charm and namd. It will make
>> the binary bigger and cause it to run much slower. It may actually
>> make the problem go away, which is really a pain in the neck.
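>>
>> For example (exact file and variable names may differ between
>> versions of charm and namd):
>>
>>   # charm++: extra flags given to the build script should be passed
>>   # through to the compiler
>>   ./build charm++ net-linux-amd64 -g -O0
>>
>>   # namd: in the arch file (e.g. arch/Linux-amd64-g++.arch) change
>>   #   CXXOPTS = -O3 ...   to   CXXOPTS = -g
>>   # do the same for COPTS, and add -lefence to the final link line
>>   # that produces namd2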
>>
>>
>> >What do you think about that corrupted velocity? I think that kind
>> >of corrupted data may be the whole problem, producing strange error
>> >messages whenever it occurs.
>> What is corrupting the data? The message passing between cores and
>> between nodes is touching memory that is corrupted either physically
>> or because the bits have been changed by another process. The efence
>> link should rule out other code touching/rewriting memory that is in
>> use.
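>>
>> One quick sanity check that the efence link actually took: if it was
>> linked dynamically,
>>
>>   ldd ./namd2 | grep -i efence
>>
>> should list libefence, and an efence-protected binary normally prints
>> an "Electric Fence" banner when it starts up.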
>>
>> Can you repeat the type of motherboard and
>> chipset you have on the machines that can't seem to run namd?
>>
>> Brian
>>
>>
>>
>>
>>
>> >Thanks,
>> >Leandro.
>> >
>> >Note: the charm++ compilation errors, which are solved in the
>> >CVS version (I think this is not relevant at this point):
>> >
>> >
>> >>Hi Leandro,
>> >>I believe this compilation problem has been fixed in recent
>> >>versions of charm. Please update your charm from CVS, or download a
>> >>nightly build package from the charm web site.
>> >>Gengbin
>> >
>> >Leandro Martínez wrote:
>> >
>> >>Dear charm++ developers,
>> >>I am trying to compile charm++ on Opteron and Amd64 machines with no
>> >>success. I am doing this because I am trying to run NAMD on an Amd64
>> >>(dual-core) cluster, also with no success: it seems that the load
>> >>balancer eventually hangs (the simulation does not crash, but I get a
>> >>single job running forever without producing anything, right during a
>> >>load-balancing step). This problem was reported by Jim Phillips and
>> >>is, apparently, an issue related to charm on these machines:
>> >>
>> >>http://www.ks.uiuc.edu/Research/namd/wiki/index.cgi?NamdOnAMD64
>> >>
>> >>The first question is: is the problem really known, and does the
>> >>solution above really solve the issue? We have been running namd on
>> >>an Opteron cluster for a while with no problems; the problems
>> >>appeared on the new Amd64 dual-core machines we bought.
>> >>
>> >>Anyway, I tried to compile charm++ on my machines with the option
>> >>above changed, with no success.
>> >>
>> >>From my Opteron machine (running Fedora 5.0), running
>> >>./build charm++ net-linux-amd64, I get:
>> >>
>> >>if [ -d charmrun ] ; then ( cd charmrun ; make OPTS='' ) ; fi
>> >>make[3]: Entering directory
>> >>`/home/lmartinez/temp/instalacoes/charm-5.9/net-linux-amd64/tmp/charmrun'
>> >>make[3]: Nothing to be done for `all'.
>> >>make[3]: Leaving directory
>> >>`/home/lmartinez/temp/instalacoes/charm-5.9/net-linux-amd64/tmp/charmrun'
>> >>if [ -f charmrun ] ; then ( cp charmrun ../bin ) ; fi
>> >>make[2]: Leaving directory `/home/lmartinez/temp/instalacoes/charm-
>> >>5.9/net-linux-amd64/tmp'
>> >>cd libs/ck-libs/multicast && make
>> >>make[2]: Entering directory
>> >>`/home/lmartinez/temp/instalacoes/charm-5.9/net-linux-amd64/tmp/libs/ck-libs/multicast'
>> >>make[2]: Nothing to be done for `all'.
>> >>make[2]: Leaving directory
>> >>`/home/lmartinez/temp/instalacoes/charm-5.9/net-linux-amd64/tmp/libs/ck-libs/multicast'
>> >>../bin/charmc -c -I. ComlibManager.C
>> >>MsgPacker.h:86: error: extra qualification 'MsgPacker::' on member
>> >>'MsgPacker'
>> >>Fatal Error by charmc in directory
>> >>/home/lmartinez/temp/instalacoes/charm-5.9/net-linux-amd64/tmp
>> >> Command g++ -fPIC -m64 -I../bin/../include -D__CHARMC__=1 -I. -c
>> >>ComlibManager.C -o ComlibManager.o returned error code 1
>> >>charmc exiting...
>> >>make[1]: *** [ComlibManager.o] Error 1
>> >>make[1]: Leaving directory
>> >>`/home/lmartinez/temp/instalacoes/charm-5.9/net-linux-amd64/tmp'
>> >>make: *** [charm++] Error 2
>> >>
>> >>From my Amd64 (Fedora 6.0) machine, I get:
>> >>
>> >>make[2]: Entering directory
>> >>`/home/lmartinez/charm-5.9/net-linux-amd64/tmp'
>> >>make[2]: `QuickThreads/libqt.a' is up to date.
>> >>make[2]: Leaving directory `/home/lmartinez/charm-
>> >>5.9/net-linux-amd64/tmp'
>> >>make converse-target
>> >>make[2]: Entering directory
>> >>`/home/lmartinez/charm-5.9/net-linux-amd64/tmp'
>> >>../bin/charmc -c -I. traceCore.C
>> >>traceCore.h:20: error: previous declaration of 'int Cpv__traceCoreOn_
>> >>[2]' with 'C++' linkage
>> >>traceCoreAPI.h:8: error: conflicts with new declaration with 'C' linkage
>> >>Fatal Error by charmc in directory
>> >>/home/lmartinez/charm-5.9/net-linux-amd64/tmp Command g++ -fPIC -m64
>> >>-I../bin/../include -D__CHARMC__=1 -I. -c traceCore.C -o traceCore.o
>> >>returned error code 1
>> >>charmc exiting...
>> >>make[2]: *** [traceCore.o] Error 1
>> >>make[2]: Leaving directory `/home/lmartinez/charm-5.9/net-linux-amd64/tmp'
>> >>make[1]: *** [converse] Error 2
>> >>make[1]: Leaving directory `/home/lmartinez/charm-5.9/net-linux-amd64/tmp'
>> >>make: *** [charm++] Error 2
>>
