Re: Re: NAMD 2.5 crashes unpredictably on a 8 way SMP (Linux x86)

From: Niraj kumar (niraj17_at_gmail.com)
Date: Mon May 16 2005 - 05:14:31 CDT

On 5/12/05, Gengbin Zheng <gzheng_at_ks.uiuc.edu> wrote:
>
> As an experiment, could you also add "-thread pthreads" in
> the same place?

I tried this (I also had to add -lpthread to the linker options), but
it shows no improvement.
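
Because the crash is intermittent, "no improvement" can only be judged over
repeated runs. A minimal sketch of a harness that reruns a command and
records the runs killed by a signal follows. The function name
count_crashes and the placeholder command are assumptions for illustration,
not the actual NAMD invocation; when a job is started through an MPI
launcher, the launcher may report a plain non-zero exit code instead of the
signal, so this only covers the directly launched case.

```python
import signal
import subprocess
import sys


def count_crashes(cmd, runs=20):
    """Run cmd `runs` times and return [(run_index, signal_number), ...]
    for every run that was terminated by a signal."""
    crashes = []
    for i in range(1, runs + 1):
        proc = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        # On POSIX, a return code of -N means the child died from signal N
        # (for example SIGSEGV, or SIGABRT after a glibc corruption report).
        if proc.returncode < 0:
            crashes.append((i, -proc.returncode))
    return crashes


if __name__ == "__main__":
    # Hypothetical placeholder; substitute the real command line here.
    placeholder_cmd = [sys.executable, "-c", "pass"]
    total_runs = 3
    crashed = count_crashes(placeholder_cmd, runs=total_runs)
    for run_index, signum in crashed:
        print(f"run {run_index}: killed by {signal.Signals(signum).name}")
    print(f"{len(crashed)} crashes in {total_runs} runs")
```

Comparing the crash count before and after a relink, over the same number
of runs, gives a rough idea of whether a change made any difference.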

Any other ideas? BTW, have you ever heard of anything like
this?

Regards
Niraj
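
The malloc check quoted below (running nm on namd2.new and namd2.old) can
be scripted. A minimal sketch, assuming the default GNU nm output format
shown in that check; the names malloc_provider and nm_output_for are made
up for illustration, and the classification is only a rough guide.

```python
import subprocess


def malloc_provider(nm_output):
    """Classify where malloc comes from, given the text printed by `nm`.

    Returns "system" if malloc is an undefined symbol (type U, resolved
    from a shared library at run time), "bundled" if it is defined in the
    binary's text section (type T), "other" for any other symbol type,
    and "absent" if no malloc symbol is listed.
    """
    for line in nm_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        sym_type, name = fields[-2], fields[-1]
        # Drop a symbol version suffix such as "@@GLIBC_2.0".
        if name.split("@", 1)[0] != "malloc":
            continue
        if sym_type == "U":
            return "system"
        if sym_type == "T":
            return "bundled"
        return "other"
    return "absent"


def nm_output_for(path):
    """Run nm on a binary and return its output as text."""
    result = subprocess.run(
        ["nm", path], capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    new_style = "         U malloc@@GLIBC_2.0\n"
    old_style = "08211d5f T malloc\n08211aed T malloc_get_state\n"
    print(malloc_provider(new_style))
    print(malloc_provider(old_style))
```

With the two nm excerpts from the quoted check as input, the first sample
is classified as "system" and the second as "bundled".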

>
> Gengbin
>
> Niraj kumar wrote:
>
> >Hi Gengbin,
> >
> >I did as you suggested. I also verified that the new namd2 executable
> >is indeed using the system's malloc:
> >
> >[niraj_at_x445 namd_tests]$ nm namd2.new | grep malloc@
> > U malloc@@GLIBC_2.0
> >[niraj_at_x445 namd_tests]$ nm namd2.old | grep "T malloc"
> >08211d5f T malloc
> >08211aed T malloc_get_state
> >08211f56 T malloc_nomigrate
> >08211eb6 T malloc_reentrant
> >..............................
> >
> >But I am still getting the crash. I got a message like:
> >*** glibc detected *** malloc(): memory corruption: 0x0af07310 ***
> >
> >and then the trace:
> >
> >#0 0xffffe410 in __kernel_vsyscall ()
> >(gdb) where
> >#0 0xffffe410 in __kernel_vsyscall ()
> >#1 0x002ef955 in raise () from /lib/tls/libc.so.6
> >#2 0x002f1319 in abort () from /lib/tls/libc.so.6
> >#3 0x00322f9a in __libc_message () from /lib/tls/libc.so.6
> >#4 0x0032a0c6 in _int_malloc () from /lib/tls/libc.so.6
> >#5 0x0032bbd1 in malloc () from /lib/tls/libc.so.6
> >#6 0x0820fb5b in malloc_nomigrate ()
> >#7 0x08250417 in CmiAlloc ()
> >#8 0x0824e2e1 in PumpMsgs ()
> >#9 0x0824e4f0 in CmiGetNonLocal ()
> >#10 0x0824f98f in CsdNextMessage ()
> >#11 0x0824fa4c in CsdScheduleForever ()
> >#12 0x0824f9f4 in CsdScheduler ()
> >#13 0x080c2122 in BackEnd::init ()
> >#14 0x080bf1e1 in main ()
> >
> >
> >
> >Regards
> >Niraj
> >
> >On 5/10/05, Gengbin Zheng <gzheng_at_ks.uiuc.edu> wrote:
> >
> >
> >>Hi Niraj,
> >>
> >> It is not a charm build option; please pass it to the namd2 link command in namd2/Makefile:
> >>
> >>namd2: $(INCDIR) $(DSTDIR) $(OBJS) $(LIBS)
> >> $(MAKEBUILDINFO)
> >> $(CHARMC) -verbose -ld++-option \
> >> "$(COPTI)$(CHARMINC) $(COPTI)$(INCDIR) $(COPTI)$(SRCDIR) $(CXXOPTS)" \
> >> -module NeighborLB -module commlib -language charm++ \
> >> $(BUILDINFO).o \
> >> $(OBJS) \
> >> $(DPMTALIB) \
> >> $(DPMELIB) \
> >> $(TCLLIB) \
> >> $(FFTLIB) \
> >> $(PLUGINLIB) \
> >> $(CHARMOPTS) \
> >> -lm -o namd2 -memory os
> >> ^^^^^^^^^^^^
> >>
> >>Gengbin
> >>
> >>
> >>Niraj kumar wrote:
> >>
> >>
> >>
> >>>On 5/10/05, Gengbin Zheng <gzheng_at_ks.uiuc.edu> wrote:
> >>>
> >>>
> >>>
> >>>
> >>>>This may be a memory allocator issue: Charm++'s GNU malloc library
> >>>>may conflict with MPICH's.
> >>>>Try modifying the link command line for namd2 in namd2/Makefile:
> >>>>add "-memory os" and relink namd2.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>Hi Gengbin,
> >>>
> >>>Thanks for your help.
> >>>I passed the "-memory os" option to charmc by using
> >>>this command to compile charm:
> >>>./build charm++ mpi-linux -O -DCMK_OPTIMIZE=1 -memory os
> >>>
> >>>All compiled fine and I started doing repeated tests to see whether it
> >>>crashes or not.
> >>>I again got this crash in the 7th (or 8th) run:
> >>>
> >>>#0 0x08210971 in chunk_free ()
> >>>(gdb) where
> >>>#0 0x08210971 in chunk_free ()
> >>>#1 0x0821089c in mm_free ()
> >>>#2 0x08211dae in free ()
> >>>#3 0x082851ad in MPID_SHMEM_Eagern_unxrecv_start ()
> >>>#4 0x082782c8 in MPID_IrecvContig ()
> >>>#5 0x0827a2a4 in MPID_IrecvDatatype ()
> >>>#6 0x0827a185 in MPID_RecvDatatype ()
> >>>#7 0x08260d3e in PMPI_Recv ()
> >>>#8 0x082509b8 in PumpMsgs ()
> >>>#9 0x08250bac in CmiGetNonLocal ()
> >>>#10 0x0825204b in CsdNextMessage ()
> >>>#11 0x08252108 in CsdScheduleForever ()
> >>>#12 0x082520b0 in CsdScheduler ()
> >>>#13 0x080c2372 in BackEnd::init ()
> >>>#14 0x080bf431 in main ()
> >>>
> >>>
> >>>This is the same location where it was crashing earlier, although
> >>>I can probably say that the frequency of crashes has reduced a little.
> >>>
> >>>Any ideas?
> >>>
> >>>Regards
> >>>Niraj
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>>Gengbin
> >>>>
> >>>>Niraj kumar wrote:
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>>Hi,
> >>>>>
> >>>>>(I had sent this report to namd_at_ks.uiuc.edu earlier, but got no response,
> >>>>>so I am resending it; hopefully somebody can help me this time.)
> >>>>>
> >>>>>I am seeing NAMD 2.5 crash on an 8-way SMP machine (Linux x86).
> >>>>>The crash doesn't happen every time, but over repeated runs it shows up
> >>>>>often. There are two stack traces (see below). Every crash
> >>>>>results in one of these. The program receives a SIGSEGV signal.
> >>>>>
> >>>>>From the trace, it looks like some subtle issue related to memory
> >>>>>management. The crash location is in charm++ code.
> >>>>>
> >>>>>NAMD was compiled using MPICH with the shared-memory device.
> >>>>>
> >>>>>Stack trace 1:
> >>>>>----------------------------------------------------------
> >>>>>(gdb) where
> >>>>>#0 0x0825986c in chunk_free (ar_ptr=0x8409e20, p=0x993c840) at memory-gnu.c:3268
> >>>>>#1 0x082596d5 in mm_free (mem=0x993c848) at memory-gnu.c:3191
> >>>>>#2 0x0825b890 in free (mem=0x993c848) at memory.c:203
> >>>>>#3 0x082eb956 in MPID_SHMEM_Eagern_unxrecv_start ()
> >>>>>#4 0x082deae4 in MPID_IrecvContig ()
> >>>>>#5 0x082e0a98 in MPID_IrecvDatatype ()
> >>>>>#6 0x082e0979 in MPID_RecvDatatype ()
> >>>>>#7 0x082c76e2 in PMPI_Recv ()
> >>>>>#8 0x082b252d in PumpMsgs () at machine.c:418
> >>>>>#9 0x082b2794 in CmiNotifyIdle () at machine.c:628
> >>>>>#10 0x082b5c7e in call_cblist_keep (l=0x8c2f010) at conv-conds.c:142
> >>>>>#11 0x082b6696 in CcdRaiseCondition (condnum=2) at conv-conds.c:417
> >>>>>#12 0x082b4021 in CsdStillIdle () at convcore.c:918
> >>>>>#13 0x082b424e in CsdScheduleForever () at convcore.c:1029
> >>>>>#14 0x082b4194 in CsdScheduler (maxmsgs=-1) at convcore.c:990
> >>>>>#15 0x080f0fa8 in slave_init (argc=2, argv=0xbfffef44) at src/BackEnd.C:94
> >>>>>#16 0x080f1011 in BackEnd::init (argc=2, argv=0xbfffef44) at src/BackEnd.C:103
> >>>>>#17 0x080ed8f5 in main (argc=2, argv=0xbfffef44) at src/mainfunc.C:34
> >>>>>-------------------------------------------------------------------------------------------------------------------------
> >>>>>Stack trace 2:
> >>>>>-------------------------------------------------------------------------------------------------------------------------
> >>>>>#0 0x08267d46 in _int_malloc (av=0x845e240, bytes=10444) at memory-gnu.c:3886
> >>>>>3886 bck->fd = unsorted_chunks(av);
> >>>>>(gdb) where
> >>>>>#0 0x08267d46 in _int_malloc (av=0x845e240, bytes=10444) at memory-gnu.c:3886
> >>>>>#1 0x08266db1 in mm_malloc (bytes=10444) at memory-gnu.c:3306
> >>>>>#2 0x08269a46 in malloc (size=10444) at memory.c:207
> >>>>>#3 0x08269c4e in malloc_nomigrate (size=10444) at memory.c:276
> >>>>>#4 0x082d5d25 in CmiAlloc (size=10436) at convcore.c:1625
> >>>>>#5 0x082d3383 in PumpMsgs () at machine.c:421
> >>>>>#6 0x082d35e6 in CmiGetNonLocal () at machine.c:624
> >>>>>#7 0x082d503b in CsdNextMessage (s=0xbffff600) at convcore.c:1016
> >>>>>#8 0x082d5118 in CsdScheduleForever () at convcore.c:1078
> >>>>>#9 0x082d50b6 in CsdScheduler (maxmsgs=-1) at convcore.c:1044
> >>>>>#10 0x080fd634 in slave_init (argc=2, argv=0xbffff844) at src/BackEnd.C:94
> >>>>>#11 0x080fd69d in BackEnd::init (argc=2, argv=0xbffff844) at src/BackEnd.C:103
> >>>>>#12 0x080f9f81 in main (argc=2, argv=0xbffff844) at src/mainfunc.C:34
> >>>>>----------------------------------------------------------------------------------------------------------------------------------
> >>>>>
> >>>>>Any hints on this?
> >>>>>If you need any more info, please let me know.
> >>>>>Thanks in advance.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>
> >>>
> >>>
> >>>
> >
> >
> >
> >
>

-- 
-----------------------------------------------------------------
http://www.nirajkumar.net
