Re: NAMD 2.5 crashes unpredictably on an 8-way SMP (Linux x86)

From: Gengbin Zheng (gzheng_at_ks.uiuc.edu)
Date: Tue May 10 2005 - 01:45:25 CDT

This may be a memory allocator issue: Charm++'s GNU malloc library
conflicts with MPICH's.
Try modifying namd2/Makefile: in the link command line for namd2, add
"-memory os", then relink namd2.
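
As a rough sketch of that change (not the actual rule from the NAMD 2.5
Makefile): the target layout and the variable names CHARMC, OBJS and LIBS
below are assumptions for illustration, and your Makefile will list
different objects and options. The only part taken from the advice above
is the added "-memory os" on the link line.

```make
# Hypothetical link rule for namd2. Only "-memory os" is the suggested
# change; everything else stands in for whatever your Makefile already has.
# "-memory os" asks charmc to link against the operating system's malloc
# instead of Charm++'s bundled GNU malloc.
namd2: $(OBJS) $(LIBS)
	$(CHARMC) -language charm++ \
	    -memory os \
	    $(OBJS) $(LIBS) \
	    -o namd2
```

After editing, remove the old namd2 binary and run make again so that the
link step is actually redone.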

Gengbin

Niraj kumar wrote:

>Hi,
>
>(I sent this report to namd_at_ks.uiuc.edu earlier but got no response,
>so I am resending it; hopefully somebody can help me this time.)
>
>I am seeing NAMD 2.5 crash on an 8-way SMP machine (Linux x86).
>The crash doesn't happen every time, but it shows up often over
>repeated runs. There are two stack traces (see below); every crash
>results in one of them. The program receives a SIGSEGV signal.
>
>From the trace, it looks like a subtle issue related to memory
>management. The crash location is in Charm++ code.
>
>NAMD was compiled with MPICH using its shared-memory device.
>
>Stack trace 1:
>----------------------------------------------------------
>(gdb) where
>#0 0x0825986c in chunk_free (ar_ptr=0x8409e20, p=0x993c840) at memory-gnu.c:3268
>#1 0x082596d5 in mm_free (mem=0x993c848) at memory-gnu.c:3191
>#2 0x0825b890 in free (mem=0x993c848) at memory.c:203
>#3 0x082eb956 in MPID_SHMEM_Eagern_unxrecv_start ()
>#4 0x082deae4 in MPID_IrecvContig ()
>#5 0x082e0a98 in MPID_IrecvDatatype ()
>#6 0x082e0979 in MPID_RecvDatatype ()
>#7 0x082c76e2 in PMPI_Recv ()
>#8 0x082b252d in PumpMsgs () at machine.c:418
>#9 0x082b2794 in CmiNotifyIdle () at machine.c:628
>#10 0x082b5c7e in call_cblist_keep (l=0x8c2f010) at conv-conds.c:142
>#11 0x082b6696 in CcdRaiseCondition (condnum=2) at conv-conds.c:417
>#12 0x082b4021 in CsdStillIdle () at convcore.c:918
>#13 0x082b424e in CsdScheduleForever () at convcore.c:1029
>#14 0x082b4194 in CsdScheduler (maxmsgs=-1) at convcore.c:990
>#15 0x080f0fa8 in slave_init (argc=2, argv=0xbfffef44) at src/BackEnd.C:94
>#16 0x080f1011 in BackEnd::init (argc=2, argv=0xbfffef44) at src/BackEnd.C:103
>#17 0x080ed8f5 in main (argc=2, argv=0xbfffef44) at src/mainfunc.C:34
>-------------------------------------------------------------------------------------------------------------------------
>Stack trace 2:
>-------------------------------------------------------------------------------------------------------------------------
>#0 0x08267d46 in _int_malloc (av=0x845e240, bytes=10444) at memory-gnu.c:3886
>3886 bck->fd = unsorted_chunks(av);
>(gdb) where
>#0 0x08267d46 in _int_malloc (av=0x845e240, bytes=10444) at memory-gnu.c:3886
>#1 0x08266db1 in mm_malloc (bytes=10444) at memory-gnu.c:3306
>#2 0x08269a46 in malloc (size=10444) at memory.c:207
>#3 0x08269c4e in malloc_nomigrate (size=10444) at memory.c:276
>#4 0x082d5d25 in CmiAlloc (size=10436) at convcore.c:1625
>#5 0x082d3383 in PumpMsgs () at machine.c:421
>#6 0x082d35e6 in CmiGetNonLocal () at machine.c:624
>#7 0x082d503b in CsdNextMessage (s=0xbffff600) at convcore.c:1016
>#8 0x082d5118 in CsdScheduleForever () at convcore.c:1078
>#9 0x082d50b6 in CsdScheduler (maxmsgs=-1) at convcore.c:1044
>#10 0x080fd634 in slave_init (argc=2, argv=0xbffff844) at src/BackEnd.C:94
>#11 0x080fd69d in BackEnd::init (argc=2, argv=0xbffff844) at src/BackEnd.C:103
>#12 0x080f9f81 in main (argc=2, argv=0xbffff844) at src/mainfunc.C:34
>----------------------------------------------------------------------------------------------------------------------------------
>
>Any hints on this?
>If you need any more info, please let me know.
>Thanks in advance.
>
>
