Re: Re: NAMD 2.5 crashes unpredictably on a 8 way SMP (Linux x86)

From: Gengbin Zheng (gzheng_at_ks.uiuc.edu)
Date: Wed May 11 2005 - 22:07:13 CDT

As an experiment, could you also add "-thread pthreads" in
the same place?
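
For example, assuming the namd2 link rule in namd2/Makefile still ends the way it did in my earlier message quoted below, the tail of the command would look roughly like this (a sketch only; the earlier lines of the rule stay unchanged):

```makefile
	$(CHARMOPTS) \
	-lm -o namd2 -memory os -thread pthreads
```

Then relink namd2 and repeat the same runs.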

Gengbin

Niraj kumar wrote:

>Hi Gengbin,
>
>I did as you suggested. I also verified that the new namd2 executable
>is indeed using the system's malloc:
>
>[niraj_at_x445 namd_tests]$ nm namd2.new | grep malloc@
> U malloc@@GLIBC_2.0
>[niraj_at_x445 namd_tests]$ nm namd2.old | grep "T malloc"
>08211d5f T malloc
>08211aed T malloc_get_state
>08211f56 T malloc_nomigrate
>08211eb6 T malloc_reentrant
>..............................
>
>But I am still getting the crash. I got a message like:
>*** glibc detected *** malloc(): memory corruption: 0x0af07310 ***
>
>and then the trace:
>
>#0 0xffffe410 in __kernel_vsyscall ()
>(gdb) where
>#0 0xffffe410 in __kernel_vsyscall ()
>#1 0x002ef955 in raise () from /lib/tls/libc.so.6
>#2 0x002f1319 in abort () from /lib/tls/libc.so.6
>#3 0x00322f9a in __libc_message () from /lib/tls/libc.so.6
>#4 0x0032a0c6 in _int_malloc () from /lib/tls/libc.so.6
>#5 0x0032bbd1 in malloc () from /lib/tls/libc.so.6
>#6 0x0820fb5b in malloc_nomigrate ()
>#7 0x08250417 in CmiAlloc ()
>#8 0x0824e2e1 in PumpMsgs ()
>#9 0x0824e4f0 in CmiGetNonLocal ()
>#10 0x0824f98f in CsdNextMessage ()
>#11 0x0824fa4c in CsdScheduleForever ()
>#12 0x0824f9f4 in CsdScheduler ()
>#13 0x080c2122 in BackEnd::init ()
>#14 0x080bf1e1 in main ()
>
>
>
>Regards
>Niraj
>
>On 5/10/05, Gengbin Zheng <gzheng_at_ks.uiuc.edu> wrote:
>
>
>>Hi Niraj,
>>
>> It is not a Charm++ build option; please pass it to the namd2 link command in namd2/Makefile:
>>
>>namd2: $(INCDIR) $(DSTDIR) $(OBJS) $(LIBS)
>> $(MAKEBUILDINFO)
>> $(CHARMC) -verbose -ld++-option \
>> "$(COPTI)$(CHARMINC) $(COPTI)$(INCDIR) $(COPTI)$(SRCDIR) $(CXXOPTS)" \
>> -module NeighborLB -module commlib -language charm++ \
>> $(BUILDINFO).o \
>> $(OBJS) \
>> $(DPMTALIB) \
>> $(DPMELIB) \
>> $(TCLLIB) \
>> $(FFTLIB) \
>> $(PLUGINLIB) \
>> $(CHARMOPTS) \
>> -lm -o namd2 -memory os
>> ^^^^^^^^^^^^
>>
>>Gengbin
>>
>>
>>Niraj kumar wrote:
>>
>>
>>
>>>On 5/10/05, Gengbin Zheng <gzheng_at_ks.uiuc.edu> wrote:
>>>
>>>
>>>
>>>
>>>>This may be a memory allocator issue: Charm++'s GNU malloc library may
>>>>conflict with MPICH's.
>>>>Try modifying the namd2 link command line in namd2/Makefile: add "-memory
>>>>os" and relink namd2.
>>>>
>>>>
>>>>
>>>>
>>>Hi Gengbin,
>>>
>>>Thanks for your help.
>>>I passed the "-memory os" option to charmc by using
>>>this command to compile Charm++:
>>>./build charm++ mpi-linux -O -DCMK_OPTIMIZE=1 -memory os
>>>
>>>Everything compiled fine and I started running repeated tests to see whether it
>>>crashes or not.
>>>I got this crash again in the 7th (or 8th) run:
>>>
>>>#0 0x08210971 in chunk_free ()
>>>(gdb) where
>>>#0 0x08210971 in chunk_free ()
>>>#1 0x0821089c in mm_free ()
>>>#2 0x08211dae in free ()
>>>#3 0x082851ad in MPID_SHMEM_Eagern_unxrecv_start ()
>>>#4 0x082782c8 in MPID_IrecvContig ()
>>>#5 0x0827a2a4 in MPID_IrecvDatatype ()
>>>#6 0x0827a185 in MPID_RecvDatatype ()
>>>#7 0x08260d3e in PMPI_Recv ()
>>>#8 0x082509b8 in PumpMsgs ()
>>>#9 0x08250bac in CmiGetNonLocal ()
>>>#10 0x0825204b in CsdNextMessage ()
>>>#11 0x08252108 in CsdScheduleForever ()
>>>#12 0x082520b0 in CsdScheduler ()
>>>#13 0x080c2372 in BackEnd::init ()
>>>#14 0x080bf431 in main ()
>>>
>>>
>>>This is the same location where it was crashing earlier, although
>>>I can probably say that the frequency of crashes has reduced a little.
>>>
>>>Any ideas?
>>>
>>>Regards
>>>Niraj
>>>
>>>
>>>
>>>
>>>
>>>
>>>>Gengbin
>>>>
>>>>Niraj kumar wrote:
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>Hi,
>>>>>
>>>>>(I had sent this report to namd_at_ks.uiuc.edu earlier, but got no response,
>>>>>so I am resending it... hopefully somebody can help me this time.)
>>>>>
>>>>>I am seeing NAMD 2.5 crash on an 8-way SMP machine (Linux x86).
>>>>>The crash doesn't happen every time, but after repeated runs it shows up
>>>>>often. There are two stack traces (see below). Every crash
>>>>>results in one of these. The program receives a SIGSEGV signal.
>>>>>
>>>>>From the trace, it looks like some subtle issue related to memory
>>>>>management. The crash location is in Charm++ code.
>>>>>
>>>>>NAMD was compiled using MPICH with the shared-memory device.
>>>>>
>>>>>Stack trace 1:
>>>>>----------------------------------------------------------
>>>>>(gdb) where
>>>>>#0 0x0825986c in chunk_free (ar_ptr=0x8409e20, p=0x993c840) at memory-gnu.c:3268
>>>>>#1 0x082596d5 in mm_free (mem=0x993c848) at memory-gnu.c:3191
>>>>>#2 0x0825b890 in free (mem=0x993c848) at memory.c:203
>>>>>#3 0x082eb956 in MPID_SHMEM_Eagern_unxrecv_start ()
>>>>>#4 0x082deae4 in MPID_IrecvContig ()
>>>>>#5 0x082e0a98 in MPID_IrecvDatatype ()
>>>>>#6 0x082e0979 in MPID_RecvDatatype ()
>>>>>#7 0x082c76e2 in PMPI_Recv ()
>>>>>#8 0x082b252d in PumpMsgs () at machine.c:418
>>>>>#9 0x082b2794 in CmiNotifyIdle () at machine.c:628
>>>>>#10 0x082b5c7e in call_cblist_keep (l=0x8c2f010) at conv-conds.c:142
>>>>>#11 0x082b6696 in CcdRaiseCondition (condnum=2) at conv-conds.c:417
>>>>>#12 0x082b4021 in CsdStillIdle () at convcore.c:918
>>>>>#13 0x082b424e in CsdScheduleForever () at convcore.c:1029
>>>>>#14 0x082b4194 in CsdScheduler (maxmsgs=-1) at convcore.c:990
>>>>>#15 0x080f0fa8 in slave_init (argc=2, argv=0xbfffef44) at src/BackEnd.C:94
>>>>>#16 0x080f1011 in BackEnd::init (argc=2, argv=0xbfffef44) at src/BackEnd.C:103
>>>>>#17 0x080ed8f5 in main (argc=2, argv=0xbfffef44) at src/mainfunc.C:34
>>>>>----------------------------------------------------------
>>>>>Stack trace 2:
>>>>>----------------------------------------------------------
>>>>>#0 0x08267d46 in _int_malloc (av=0x845e240, bytes=10444) at memory-gnu.c:3886
>>>>>3886 bck->fd = unsorted_chunks(av);
>>>>>(gdb) where
>>>>>#0 0x08267d46 in _int_malloc (av=0x845e240, bytes=10444) at memory-gnu.c:3886
>>>>>#1 0x08266db1 in mm_malloc (bytes=10444) at memory-gnu.c:3306
>>>>>#2 0x08269a46 in malloc (size=10444) at memory.c:207
>>>>>#3 0x08269c4e in malloc_nomigrate (size=10444) at memory.c:276
>>>>>#4 0x082d5d25 in CmiAlloc (size=10436) at convcore.c:1625
>>>>>#5 0x082d3383 in PumpMsgs () at machine.c:421
>>>>>#6 0x082d35e6 in CmiGetNonLocal () at machine.c:624
>>>>>#7 0x082d503b in CsdNextMessage (s=0xbffff600) at convcore.c:1016
>>>>>#8 0x082d5118 in CsdScheduleForever () at convcore.c:1078
>>>>>#9 0x082d50b6 in CsdScheduler (maxmsgs=-1) at convcore.c:1044
>>>>>#10 0x080fd634 in slave_init (argc=2, argv=0xbffff844) at src/BackEnd.C:94
>>>>>#11 0x080fd69d in BackEnd::init (argc=2, argv=0xbffff844) at src/BackEnd.C:103
>>>>>#12 0x080f9f81 in main (argc=2, argv=0xbffff844) at src/mainfunc.C:34
>>>>>----------------------------------------------------------
>>>>>
>>>>>Any hints on this?
>>>>>If you need any more info, please let me know.
>>>>>Thanks in advance.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>
>>>
>>>
>
>
>
>
