Re: Illegal instruction signal at startup. (with net-rs6k smp)

From: Hansang Bae (baeh_at_ecn.purdue.edu)
Date: Tue Apr 13 2004 - 22:00:48 CDT

Hi.

I can't get the namd2 SMP version (AIX) to run on a node with multiple
threads, even with charmrun. Since I want it to run as a single process
with multiple threads, I tried

"charmrun +p2 ++ppn 2 ++local namd2 alanin.namd"

but it crashes with the following output:

Info: *****************************
Info: Entering startup phase 0 with 3604 kB of memory in use.
Info: Entering startup phase 1 with 3604 kB of memory in use.
------------- Processor 1 Exiting: Caught Signal ------------
Signal: illegal instruction
Suggestion: Check for calls to uninitialized function pointers.
req_handle_abort called
Fatal error on PE 1> illegal instruction

I think this command is still doing the same thing as "namd2 +p2 ..".
(Just like standalone mode, it sometimes runs successfully.)

Am I missing something important?

FYI: In the standalone mode with "namd2 +p2 ..", +p2 is not ignored. It
tries to use 2 processors.

Thanks,
Hansang Bae
1285 EE Building, Mail Box #58
West Lafayette, IN 47907-1285
(H) 765-496-4729
(L) 765-494-3550 (EE 347)

On Fri, 9 Apr 2004, Gengbin Zheng wrote:

>
> I am a little confused. :-)
> In any case, does it run with charmrun for a parallel job?
> or does it run with simply: ./namd2 alanin?
>
> standalone mode works for IBM SP too if you compile the net version.
> The +p2 should be ignored in standalone mode since you are running
> sequentially.
> You can run:
> ./charmrun +p2 ./namd alanin ++local
> to start a parallel job of 2 processes on your local machine. (It makes
> more sense if you are running on an SMP node.)
> You can even do this since you compiled an SMP version:
> ./charmrun +p8 ./namd2 alanin ++ppn 4 ++local
> if you have a 4-way SMP, which fires two processes with 4 smp threads
> each.
>
> Gengbin
>
> On Fri, 9 Apr 2004, Hansang Bae wrote:
>
> > I see. I think I didn't notice standalone mode without charmrun only works
> > for Solaris and Windows because I have been running it on a Sun
> > workstation so far.
> >
> > Thank you very much.
> >
> > Thanks,
> > Hansang Bae
> >
> > On Fri, 9 Apr 2004, Brian Bennion wrote:
> >
> > > Hi,
> > >
> > > Sorry to butt in, but doesn't the +p2 argument require charmrun to be
> > > loading namd2?
> > >
> > > ie
> > >
> > > charmrun ++local namd2 +p2 alanin.namd
> > >
> > >
> > > Brian
> > >
> > >
> > > On Fri, 9 Apr 2004, Hansang Bae wrote:
> > >
> > > > I tried your fix but it didn't work.
> > > > Actually, I narrowed down the place where it crashes.
> > > >
> > > > I'm running namd with command line:
> > > > namd2 +p2 alanin.namd
> > > >
> > > > > The error occurs in the second thread when it tries to execute
> > > > > (h->hdlr)(msg,h->userPtr); (line 938 in convcore.c)
> > > > >
> > > > > where both h->hdlr and h->userPtr are null. (h->hdlr is the crucial one, I think.)
> > > >
> > > > Do you have any idea?
> > > >
> > > > Thanks,
> > > > Hansang Bae
> > > >
> > > > On Thu, 8 Apr 2004, Gengbin Zheng wrote:
> > > >
> > > > >
> > > > > Hi Hansang,
> > > > >
> > > > > > It seems that there is some problem with the new built-in GNU malloc of
> > > > > > Charm++. Please check whether this fixes it:
> > > > >
> > > > > edit charm/net-rs6k-smp/tmp/conv-mach-smp.h, add this:
> > > > >
> > > > > #undef CMK_MALLOC_USE_GNU_MALLOC
> > > > > #undef CMK_MALLOC_USE_OS_BUILTIN
> > > > > #define CMK_MALLOC_USE_OS_BUILTIN 1
> > > > >
> > > > > Do a clean make (make clean, and make charm++ OPTS=-g)
> > > > > and re-link namd2.
> > > > >
> > > > > Please let me know if this works or not,
> > > > >
> > > > > Gengbin
> > > > >
> > > > > On Thu, 8 Apr 2004, Gengbin Zheng wrote:
> > > > >
> > > > > >
> > > > > > I see. Could you send me your command line options to get this crash?
> > > > > > I supposed this is alanin.
> > > > > >
> > > > > > Gengbin
> > > > > >
> > > > > >
> > > > > > On Thu, 8 Apr 2004, Hansang Bae wrote:
> > > > > >
> > > > > > > > Of course, I compiled this version with the -g option. The other versions,
> > > > > > > > net-rs6k and mpi-sp, do not have any problem. I'm using tcl-8.4.4 and
> > > > > > > > fftw-2.1.5.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Hansang Bae
> > > > > > > 1285 EE Building, Mail Box #58
> > > > > > > West Lafayette, IN 47907-1285
> > > > > > > (H) 765-496-4729
> > > > > > > (L) 765-494-3550 (EE 347)
> > > > > > >
> > > > > > > On Thu, 8 Apr 2004, Gengbin Zheng wrote:
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > It is a little hard to tell what is wrong here. I would suggest building
> > > > > > > > > your own binary (there may be a binary or library incompatibility problem).
> > > > > > > > For more options, you can try net-rs6k (without smp) or MPI version
> > > > > > > > like mpi-sp|IBM-SP.
> > > > > > > >
> > > > > > > > Gengbin
> > > > > > > >
> > > > > > > > On Tue, 6 Apr 2004, Hansang Bae wrote:
> > > > > > > >
> > > > > > > > > > I have a problem running the AIX-RS6000-SMP version with multiple threads.
> > > > > > > > > > It crashes with an illegal instruction exception during the startup phase.
> > > > > > > > > > The strange thing is that sometimes this doesn't happen.
> > > > > > > > >
> > > > > > > > > Here is "some" information from dbx log.
> > > > > > > > >
> > > > > > > > > ...
> > > > > > > > > Info: ****************************
> > > > > > > > > Info: STRUCTURE SUMMARY:
> > > > > > > > > Info: 66 ATOMS
> > > > > > > > > Info: 65 BONDS
> > > > > > > > > Info: 96 ANGLES
> > > > > > > > > Info: 31 DIHEDRALS
> > > > > > > > > Info: 32 IMPROPERS
> > > > > > > > > Info: 0 EXCLUSIONS
> > > > > > > > > Info: 195 DEGREES OF FREEDOM
> > > > > > > > > Info: 55 HYDROGEN GROUPS
> > > > > > > > > Info: TOTAL MASS = 783.886 amu
> > > > > > > > > Info: TOTAL CHARGE = 8.19564e-08 e
> > > > > > > > > Info: *****************************
> > > > > > > > > [20] stopped in suspend() at line 153 in file "BackEnd.cc" ($t1)
> > > > > > > > > 153 CsdScheduler(-1);
> > > > > > > > > (dbx) s
> > > > > > > > > Info: Entering startup phase 0 with 3804 kB of memory in use.
> > > > > > > > > Info: Entering startup phase 1 with 3804 kB of memory in use.
> > > > > > > > >
> > > > > > > > > Illegal instruction in . at 0x0 ($t2)
> > > > > > > > > 0x00000000 00000000 Invalid opcode.
> > > > > > > > > (dbx) where
> > > > > > > > > warning: could not locate trace table from starting address 0x0
> > > > > > > > > CmiHandleMessage(0x305d0a08) at 0x10011b38
> > > > > > > > > CsdScheduleForever() at 0x10012be4
> > > > > > > > > CsdScheduler(0xffffffff) at 0x10012d0c
> > > > > > > > > > slave_init(int,char**)(argc = 3, argv = 0x3027b6d8), line 94 in "BackEnd.cc"
> > > > > > > > > ConverseRunPE(0x0) at 0x1000c96c
> > > > > > > > > call_startfn(0x1) at 0x1000b810
> > > > > > > > > _pthread_body(??) at 0xd004b3fc
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Hansang Bae
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> > > *****************************************************************
> > > **Brian Bennion, Ph.D. **
> > > **Computational and Systems Biology Division **
> > > **Biology and Biotechnology Research Program **
> > > **Lawrence Livermore National Laboratory **
> > > **P.O. Box 808, L-448 bennion1_at_llnl.gov **
> > > **7000 East Avenue phone: (925) 422-5722 **
> > > **Livermore, CA 94550 fax: (925) 424-6605 **
> > > *****************************************************************
> > >
> > >
> >
>
>
