NAMD 2.5 crashes unpredictably on an 8-way SMP (Linux x86)

From: Niraj kumar (niraj17_at_gmail.com)
Date: Mon May 09 2005 - 06:03:19 CDT

Hi,

(I sent this report to namd_at_ks.uiuc.edu earlier but got no response,
so I am resending it. Hopefully somebody can help me this time.)

I am seeing NAMD 2.5 crash on an 8-way SMP machine (Linux x86).
The crash doesn't happen on every run, but over repeated runs it shows up
often. There are two stack traces (see below), and every crash
results in one of them. The program receives a SIGSEGV signal.

From the traces, it looks like a subtle issue related to memory
management. The crash location is in the Charm++ code.
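
For anyone trying to narrow down this kind of crash: a fault inside
chunk_free or _int_malloc often means the allocator's bookkeeping was
overwritten earlier, somewhere else in the program. The sketch below is
only an illustration of one way to catch a buffer overrun closer to its
source. The names guarded_malloc, guarded_check, guarded_free and
GUARD_WORD are hypothetical and are not part of NAMD, Charm++ or MPICH.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical guard value written just past the end of each user buffer. */
#define GUARD_WORD 0xDEADBEEFu

/* Allocate `size` bytes plus a size header in front and a guard word behind.
 * Note: the returned pointer is only aligned as strictly as size_t. */
void *guarded_malloc(size_t size)
{
    unsigned char *raw = malloc(sizeof(size_t) + size + sizeof(uint32_t));
    uint32_t guard = GUARD_WORD;

    if (raw == NULL)
        return NULL;
    memcpy(raw, &size, sizeof(size_t));                          /* header */
    memcpy(raw + sizeof(size_t) + size, &guard, sizeof(guard));  /* guard  */
    return raw + sizeof(size_t);
}

/* Return 1 if the guard word after the buffer is intact, 0 if overwritten. */
int guarded_check(const void *mem)
{
    const unsigned char *raw = (const unsigned char *)mem - sizeof(size_t);
    size_t size;
    uint32_t guard;

    memcpy(&size, raw, sizeof(size_t));
    memcpy(&guard, raw + sizeof(size_t) + size, sizeof(guard));
    return guard == GUARD_WORD;
}

/* Check the guard, then release the block; returns the check result. */
int guarded_free(void *mem)
{
    int ok = guarded_check(mem);

    free((unsigned char *)mem - sizeof(size_t));
    return ok;
}
```

Calling guarded_check on suspect buffers at intervals (or only in
guarded_free) reports an overrun when it is first noticed, rather than at
a later, unrelated malloc or free. It only detects writes past the end of
a buffer, not other kinds of heap damage.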

NAMD was compiled with MPICH, using its shared-memory device.

Stack trace 1:
----------------------------------------------------------
(gdb) where
#0 0x0825986c in chunk_free (ar_ptr=0x8409e20, p=0x993c840) at memory-gnu.c:3268
#1 0x082596d5 in mm_free (mem=0x993c848) at memory-gnu.c:3191
#2 0x0825b890 in free (mem=0x993c848) at memory.c:203
#3 0x082eb956 in MPID_SHMEM_Eagern_unxrecv_start ()
#4 0x082deae4 in MPID_IrecvContig ()
#5 0x082e0a98 in MPID_IrecvDatatype ()
#6 0x082e0979 in MPID_RecvDatatype ()
#7 0x082c76e2 in PMPI_Recv ()
#8 0x082b252d in PumpMsgs () at machine.c:418
#9 0x082b2794 in CmiNotifyIdle () at machine.c:628
#10 0x082b5c7e in call_cblist_keep (l=0x8c2f010) at conv-conds.c:142
#11 0x082b6696 in CcdRaiseCondition (condnum=2) at conv-conds.c:417
#12 0x082b4021 in CsdStillIdle () at convcore.c:918
#13 0x082b424e in CsdScheduleForever () at convcore.c:1029
#14 0x082b4194 in CsdScheduler (maxmsgs=-1) at convcore.c:990
#15 0x080f0fa8 in slave_init (argc=2, argv=0xbfffef44) at src/BackEnd.C:94
#16 0x080f1011 in BackEnd::init (argc=2, argv=0xbfffef44) at src/BackEnd.C:103
#17 0x080ed8f5 in main (argc=2, argv=0xbfffef44) at src/mainfunc.C:34
----------------------------------------------------------

Stack trace 2:
----------------------------------------------------------
#0 0x08267d46 in _int_malloc (av=0x845e240, bytes=10444) at memory-gnu.c:3886
3886 bck->fd = unsorted_chunks(av);
(gdb) where
#0 0x08267d46 in _int_malloc (av=0x845e240, bytes=10444) at memory-gnu.c:3886
#1 0x08266db1 in mm_malloc (bytes=10444) at memory-gnu.c:3306
#2 0x08269a46 in malloc (size=10444) at memory.c:207
#3 0x08269c4e in malloc_nomigrate (size=10444) at memory.c:276
#4 0x082d5d25 in CmiAlloc (size=10436) at convcore.c:1625
#5 0x082d3383 in PumpMsgs () at machine.c:421
#6 0x082d35e6 in CmiGetNonLocal () at machine.c:624
#7 0x082d503b in CsdNextMessage (s=0xbffff600) at convcore.c:1016
#8 0x082d5118 in CsdScheduleForever () at convcore.c:1078
#9 0x082d50b6 in CsdScheduler (maxmsgs=-1) at convcore.c:1044
#10 0x080fd634 in slave_init (argc=2, argv=0xbffff844) at src/BackEnd.C:94
#11 0x080fd69d in BackEnd::init (argc=2, argv=0xbffff844) at src/BackEnd.C:103
#12 0x080f9f81 in main (argc=2, argv=0xbffff844) at src/mainfunc.C:34
----------------------------------------------------------

Any hints on this?
If you need any more information, please let me know.
Thanks in advance.
