Re: NAMD 2.5 crashes unpredictably on a 8 way SMP (Linux x86)

From: Niraj kumar (niraj17_at_gmail.com)
Date: Tue May 10 2005 - 04:48:59 CDT

On 5/10/05, Gengbin Zheng <gzheng_at_ks.uiuc.edu> wrote:
>
> This may be a memory allocator issue: Charm++'s gnu malloc library may
> conflict with MPICH's.
> Try modifying the link command line for namd2 in namd2/Makefile: add
> "-memory os" and relink namd2.

Hi Gengbin,

Thanks for your help.
I passed the "-memory os" option to charmc by building Charm++ with
this command:
./build charm++ mpi-linux -O -DCMK_OPTIMIZE=1 -memory os

Everything compiled fine, and I started running repeated tests to see
whether it still crashes.
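
A repeated-run test like this can be scripted. The following is only a minimal sketch: the function name run_until_fail is made up for illustration, and the launcher, process count and config file in the commented usage line are placeholders, not the command actually used here. It also assumes the launcher returns a non-zero exit status when a process dies with SIGSEGV, which may not hold for every MPI launcher.

```shell
#!/bin/sh
# Run a command up to N times and stop at the first non-zero exit status.
# Usage: run_until_fail N command [args...]
run_until_fail() {
    max=$1
    shift
    i=1
    while [ "$i" -le "$max" ]; do
        "$@"                      # run the command under test
        status=$?
        if [ "$status" -ne 0 ]; then
            echo "run $i failed with exit status $status"
            return "$status"
        fi
        i=$((i + 1))
    done
    echo "all $max runs completed"
    return 0
}

# Hypothetical invocation (placeholders, adjust to the real job):
# run_until_fail 20 mpirun -np 8 ./namd2 test.namd
```

Stopping at the first failure keeps the core file and logs of the crashing run from being overwritten by later runs.
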
I got this crash again on the 7th (or 8th) run:

#0 0x08210971 in chunk_free ()
(gdb) where
#0 0x08210971 in chunk_free ()
#1 0x0821089c in mm_free ()
#2 0x08211dae in free ()
#3 0x082851ad in MPID_SHMEM_Eagern_unxrecv_start ()
#4 0x082782c8 in MPID_IrecvContig ()
#5 0x0827a2a4 in MPID_IrecvDatatype ()
#6 0x0827a185 in MPID_RecvDatatype ()
#7 0x08260d3e in PMPI_Recv ()
#8 0x082509b8 in PumpMsgs ()
#9 0x08250bac in CmiGetNonLocal ()
#10 0x0825204b in CsdNextMessage ()
#11 0x08252108 in CsdScheduleForever ()
#12 0x082520b0 in CsdScheduler ()
#13 0x080c2372 in BackEnd::init ()
#14 0x080bf431 in main ()

This is the same location where it was crashing earlier, although the
crashes now seem to be slightly less frequent.
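
Frames #0 and #1 above (chunk_free, mm_free) carry the same names as the memory-gnu.c frames in the first trace of the original report, which suggests that the Charm++ gnu malloc may still be linked into the namd2 binary. A rough way to check this is sketched below. It assumes nm from binutils is available and that the binary is not stripped; the function name and the ./namd2 path are illustrative only.

```shell
#!/bin/sh
# Read `nm` output on stdin and succeed if the Charm++ gnu-malloc entry
# points seen in the backtraces (mm_malloc, mm_free) are defined as text
# symbols. Each nm line looks like: "<address> <type> <name>".
has_charm_gnu_malloc() {
    grep -E -q ' [Tt] (mm_malloc|mm_free)$'
}

# Hypothetical usage against the relinked binary:
# if nm ./namd2 | has_charm_gnu_malloc; then
#     echo "Charm++ gnu malloc symbols are still present"
# else
#     echo "no Charm++ gnu malloc symbols found"
# fi
```

If the symbols are still present, the "-memory os" setting may not have reached the final namd2 link step, which is where the suggestion quoted above asked for it to be added.
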

Any ideas?

Regards
Niraj

>
> Gengbin
>
> Niraj kumar wrote:
>
> >Hi,
> >
> >(I sent this report to namd_at_ks.uiuc.edu earlier but got no response,
> >so I am resending it; hopefully somebody can help me this time.)
> >
> >I am seeing NAMD 2.5 crash on an 8-way SMP machine (Linux x86).
> >The crash doesn't happen every time, but it shows up often over repeated
> >runs. There are two stack traces (see below), and every crash results in
> >one of them. The program receives a SIGSEGV signal.
> >
> >From the trace, it looks like a subtle memory management issue.
> >The crash location is in Charm++ code.
> >
> >NAMD was compiled using MPICH with the shared-memory device.
> >
> >Stack trace 1:
> >----------------------------------------------------------
> >(gdb) where
> >#0 0x0825986c in chunk_free (ar_ptr=0x8409e20, p=0x993c840) at
> >memory-gnu.c:3268
> >#1 0x082596d5 in mm_free (mem=0x993c848) at memory-gnu.c:3191
> >#2 0x0825b890 in free (mem=0x993c848) at memory.c:203
> >#3 0x082eb956 in MPID_SHMEM_Eagern_unxrecv_start ()
> >#4 0x082deae4 in MPID_IrecvContig ()
> >#5 0x082e0a98 in MPID_IrecvDatatype ()
> >#6 0x082e0979 in MPID_RecvDatatype ()
> >#7 0x082c76e2 in PMPI_Recv ()
> >#8 0x082b252d in PumpMsgs () at machine.c:418
> >#9 0x082b2794 in CmiNotifyIdle () at machine.c:628
> >#10 0x082b5c7e in call_cblist_keep (l=0x8c2f010) at conv-conds.c:142
> >#11 0x082b6696 in CcdRaiseCondition (condnum=2) at conv-conds.c:417
> >#12 0x082b4021 in CsdStillIdle () at convcore.c:918
> >#13 0x082b424e in CsdScheduleForever () at convcore.c:1029
> >#14 0x082b4194 in CsdScheduler (maxmsgs=-1) at convcore.c:990
> >#15 0x080f0fa8 in slave_init (argc=2, argv=0xbfffef44) at src/BackEnd.C:94
> >#16 0x080f1011 in BackEnd::init (argc=2, argv=0xbfffef44) at src/BackEnd.C:103
> >#17 0x080ed8f5 in main (argc=2, argv=0xbfffef44) at src/mainfunc.C:34
> >----------------------------------------------------------
> >Stack trace 2:
> >----------------------------------------------------------
> >#0 0x08267d46 in _int_malloc (av=0x845e240, bytes=10444) at memory-gnu.c:3886
> >3886 bck->fd = unsorted_chunks(av);
> >(gdb) where
> >#0 0x08267d46 in _int_malloc (av=0x845e240, bytes=10444) at memory-gnu.c:3886
> >#1 0x08266db1 in mm_malloc (bytes=10444) at memory-gnu.c:3306
> >#2 0x08269a46 in malloc (size=10444) at memory.c:207
> >#3 0x08269c4e in malloc_nomigrate (size=10444) at memory.c:276
> >#4 0x082d5d25 in CmiAlloc (size=10436) at convcore.c:1625
> >#5 0x082d3383 in PumpMsgs () at machine.c:421
> >#6 0x082d35e6 in CmiGetNonLocal () at machine.c:624
> >#7 0x082d503b in CsdNextMessage (s=0xbffff600) at convcore.c:1016
> >#8 0x082d5118 in CsdScheduleForever () at convcore.c:1078
> >#9 0x082d50b6 in CsdScheduler (maxmsgs=-1) at convcore.c:1044
> >#10 0x080fd634 in slave_init (argc=2, argv=0xbffff844) at src/BackEnd.C:94
> >#11 0x080fd69d in BackEnd::init (argc=2, argv=0xbffff844) at src/BackEnd.C:103
> >#12 0x080f9f81 in main (argc=2, argv=0xbffff844) at src/mainfunc.C:34
> >----------------------------------------------------------
> >
> >Any hints on this?
> >If you need any more info, please let me know.
> >Thanks in advance.
> >
> >
>

-- 
-----------------------------------------------------------------
Please visit my webpage at http://nirajkumar.net
