Re: NAMD 2.5 crashes unpredictably on an 8-way SMP (Linux x86)

From: Gengbin Zheng (gzheng_at_ks.uiuc.edu)
Date: Tue May 10 2005 - 09:57:04 CDT

Hi Niraj,

 
 "-memory os" is not a charm build option. Please pass it to the namd2 link command in namd2/Makefile instead, then relink namd2:

namd2: $(INCDIR) $(DSTDIR) $(OBJS) $(LIBS)
        $(MAKEBUILDINFO)
        $(CHARMC) -verbose -ld++-option \
        "$(COPTI)$(CHARMINC) $(COPTI)$(INCDIR) $(COPTI)$(SRCDIR) $(CXXOPTS)" \
        -module NeighborLB -module commlib -language charm++ \
        $(BUILDINFO).o \
        $(OBJS) \
        $(DPMTALIB) \
        $(DPMELIB) \
        $(TCLLIB) \
        $(FFTLIB) \
        $(PLUGINLIB) \
        $(CHARMOPTS) \
        -lm -o namd2 -memory os
                     ^^^^^^^^^^^^

Gengbin
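
One way to check that the relink took effect is to look for the memory-gnu.c symbols that appear in the backtraces (mm_free, mm_malloc) in the new executable. This is a minimal sketch, assuming namd2 is not stripped and that nm is available; has_charm_malloc and check_binary are hypothetical helper names, not part of NAMD or Charm++:

```shell
#!/bin/sh
# Hypothetical helper: reads `nm` output on stdin and succeeds if the
# allocator entry points seen in the backtraces (mm_free, mm_malloc,
# both from Charm++'s memory-gnu.c) are present in the symbol list.
has_charm_malloc() {
    grep -E -q '[[:space:]](mm_free|mm_malloc)$'
}

# Hypothetical helper: $1 is the path to the namd2 executable.
# Assumes the binary still has its symbol table (not stripped).
check_binary() {
    if nm "$1" | has_charm_malloc; then
        echo "charm gnu malloc still linked in"
    else
        echo "no charm gnu malloc symbols found"
    fi
}

# Usage (after relinking):
#   check_binary ./namd2
```

If the symbols are still reported after adding "-memory os", the option probably did not reach the final namd2 link step.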

Niraj kumar wrote:

>On 5/10/05, Gengbin Zheng <gzheng_at_ks.uiuc.edu> wrote:
>
>
>>This may be a memory allocator issue: Charm++'s gnu malloc library can
>>conflict with MPICH's.
>>Try modifying namd2/Makefile: add "-memory os" to the link command line
>>for namd2, then relink namd2.
>>
>>
>
>Hi Gengbin,
>
>Thanks for your help.
>I passed the "-memory os" option to charmc by compiling charm with
>this command:
>./build charm++ mpi-linux -O -DCMK_OPTIMIZE=1 -memory os
>
>Everything compiled fine, and I started running repeated tests to see
>whether it still crashes.
>I got this crash again on the 7th (or 8th) run:
>
>#0 0x08210971 in chunk_free ()
>(gdb) where
>#0 0x08210971 in chunk_free ()
>#1 0x0821089c in mm_free ()
>#2 0x08211dae in free ()
>#3 0x082851ad in MPID_SHMEM_Eagern_unxrecv_start ()
>#4 0x082782c8 in MPID_IrecvContig ()
>#5 0x0827a2a4 in MPID_IrecvDatatype ()
>#6 0x0827a185 in MPID_RecvDatatype ()
>#7 0x08260d3e in PMPI_Recv ()
>#8 0x082509b8 in PumpMsgs ()
>#9 0x08250bac in CmiGetNonLocal ()
>#10 0x0825204b in CsdNextMessage ()
>#11 0x08252108 in CsdScheduleForever ()
>#12 0x082520b0 in CsdScheduler ()
>#13 0x080c2372 in BackEnd::init ()
>#14 0x080bf431 in main ()
>
>
>This is the same location where it was crashing earlier, although the
>crashes seem to be a little less frequent now.
>
>Any ideas?
>
>Regards
>Niraj
>
>
>
>
>>Gengbin
>>
>>Niraj kumar wrote:
>>
>>
>>
>>>Hi,
>>>
>>>(I sent this report to namd_at_ks.uiuc.edu earlier but got no response,
>>>so I am resending it. Hopefully somebody can help me this time.)
>>>
>>>I am seeing NAMD 2.5 crash on an 8-way SMP machine (Linux x86).
>>>The crash doesn't happen every time, but after repeated runs it shows up
>>>often. There are two stack traces (see below). Every crash
>>>results in one of these. The program receives a SIGSEGV signal.
>>>
>>>From the traces, it looks like a subtle issue related to memory
>>>management. The crash location is in charm++ code.
>>>
>>>NAMD was compiled against MPICH using the shared-memory device.
>>>
>>>Stack trace 1:
>>>----------------------------------------------------------
>>>(gdb) where
>>>#0 0x0825986c in chunk_free (ar_ptr=0x8409e20, p=0x993c840) at memory-gnu.c:3268
>>>#1 0x082596d5 in mm_free (mem=0x993c848) at memory-gnu.c:3191
>>>#2 0x0825b890 in free (mem=0x993c848) at memory.c:203
>>>#3 0x082eb956 in MPID_SHMEM_Eagern_unxrecv_start ()
>>>#4 0x082deae4 in MPID_IrecvContig ()
>>>#5 0x082e0a98 in MPID_IrecvDatatype ()
>>>#6 0x082e0979 in MPID_RecvDatatype ()
>>>#7 0x082c76e2 in PMPI_Recv ()
>>>#8 0x082b252d in PumpMsgs () at machine.c:418
>>>#9 0x082b2794 in CmiNotifyIdle () at machine.c:628
>>>#10 0x082b5c7e in call_cblist_keep (l=0x8c2f010) at conv-conds.c:142
>>>#11 0x082b6696 in CcdRaiseCondition (condnum=2) at conv-conds.c:417
>>>#12 0x082b4021 in CsdStillIdle () at convcore.c:918
>>>#13 0x082b424e in CsdScheduleForever () at convcore.c:1029
>>>#14 0x082b4194 in CsdScheduler (maxmsgs=-1) at convcore.c:990
>>>#15 0x080f0fa8 in slave_init (argc=2, argv=0xbfffef44) at src/BackEnd.C:94
>>>#16 0x080f1011 in BackEnd::init (argc=2, argv=0xbfffef44) at src/BackEnd.C:103
>>>#17 0x080ed8f5 in main (argc=2, argv=0xbfffef44) at src/mainfunc.C:34
>>>----------------------------------------------------------
>>>Stack trace 2:
>>>----------------------------------------------------------
>>>#0 0x08267d46 in _int_malloc (av=0x845e240, bytes=10444) at memory-gnu.c:3886
>>>3886 bck->fd = unsorted_chunks(av);
>>>(gdb) where
>>>#0 0x08267d46 in _int_malloc (av=0x845e240, bytes=10444) at memory-gnu.c:3886
>>>#1 0x08266db1 in mm_malloc (bytes=10444) at memory-gnu.c:3306
>>>#2 0x08269a46 in malloc (size=10444) at memory.c:207
>>>#3 0x08269c4e in malloc_nomigrate (size=10444) at memory.c:276
>>>#4 0x082d5d25 in CmiAlloc (size=10436) at convcore.c:1625
>>>#5 0x082d3383 in PumpMsgs () at machine.c:421
>>>#6 0x082d35e6 in CmiGetNonLocal () at machine.c:624
>>>#7 0x082d503b in CsdNextMessage (s=0xbffff600) at convcore.c:1016
>>>#8 0x082d5118 in CsdScheduleForever () at convcore.c:1078
>>>#9 0x082d50b6 in CsdScheduler (maxmsgs=-1) at convcore.c:1044
>>>#10 0x080fd634 in slave_init (argc=2, argv=0xbffff844) at src/BackEnd.C:94
>>>#11 0x080fd69d in BackEnd::init (argc=2, argv=0xbffff844) at src/BackEnd.C:103
>>>#12 0x080f9f81 in main (argc=2, argv=0xbffff844) at src/mainfunc.C:34
>>>----------------------------------------------------------
>>>
>>>Any hints on this?
>>>If you need any more info, please let me know.
>>>Thanks in advance.
>>>
>>>
>>>
>>>
>
>
>
>

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:40:44 CST