Re: Re: NAMD 2.5 crashes unpredictably on a 8 way SMP (Linux x86)

From: Niraj kumar (niraj17_at_gmail.com)
Date: Wed May 11 2005 - 01:40:56 CDT

Hi Gengbin,

I did as you suggested. I also verified that the new namd2 executable
is indeed using the system's malloc:

[niraj_at_x445 namd_tests]$ nm namd2.new | grep malloc@
         U malloc@@GLIBC_2.0
[niraj_at_x445 namd_tests]$ nm namd2.old | grep "T malloc"
08211d5f T malloc
08211aed T malloc_get_state
08211f56 T malloc_nomigrate
08211eb6 T malloc_reentrant
..............................
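
For anyone repeating this check, it can be scripted. The following is only a minimal sketch, assuming `nm` output in the format shown above (symbol type in the second-to-last column, symbol name in the last). `classify_malloc` is a hypothetical helper name, not part of NAMD or Charm++.

```shell
# Classify `nm` output read from stdin: does the binary define malloc itself
# (symbol type T, e.g. Charm++'s bundled allocator) or import it from libc
# (symbol type U)? Only the exact symbol "malloc" (optionally versioned,
# e.g. malloc@@GLIBC_2.0) is considered; malloc_nomigrate etc. are ignored.
classify_malloc() {
  awk '
    $NF ~ /^malloc(@|$)/ && $(NF-1) == "T" { defined = 1 }
    $NF ~ /^malloc(@|$)/ && $(NF-1) == "U" { imported = 1 }
    END {
      if (defined)       print "bundled"
      else if (imported) print "system"
      else               print "unknown"
    }'
}
```

Usage would be `nm namd2.new | classify_malloc`; for the two binaries above I would expect it to report the new one as using the system malloc and the old one as carrying its own.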

But I am still getting the crash. I got a message like:
*** glibc detected *** malloc(): memory corruption: 0x0af07310 ***

and then the trace:

#0 0xffffe410 in __kernel_vsyscall ()
(gdb) where
#0 0xffffe410 in __kernel_vsyscall ()
#1 0x002ef955 in raise () from /lib/tls/libc.so.6
#2 0x002f1319 in abort () from /lib/tls/libc.so.6
#3 0x00322f9a in __libc_message () from /lib/tls/libc.so.6
#4 0x0032a0c6 in _int_malloc () from /lib/tls/libc.so.6
#5 0x0032bbd1 in malloc () from /lib/tls/libc.so.6
#6 0x0820fb5b in malloc_nomigrate ()
#7 0x08250417 in CmiAlloc ()
#8 0x0824e2e1 in PumpMsgs ()
#9 0x0824e4f0 in CmiGetNonLocal ()
#10 0x0824f98f in CsdNextMessage ()
#11 0x0824fa4c in CsdScheduleForever ()
#12 0x0824f9f4 in CsdScheduler ()
#13 0x080c2122 in BackEnd::init ()
#14 0x080bf1e1 in main ()

Regards
Niraj

On 5/10/05, Gengbin Zheng <gzheng_at_ks.uiuc.edu> wrote:
>
> Hi Niraj,
>
> It is not a Charm++ build option; please pass it to the namd2 link command in namd2/Makefile:
>
> namd2: $(INCDIR) $(DSTDIR) $(OBJS) $(LIBS)
> $(MAKEBUILDINFO)
> $(CHARMC) -verbose -ld++-option \
> "$(COPTI)$(CHARMINC) $(COPTI)$(INCDIR) $(COPTI)$(SRCDIR) $(CXXOPTS)" \
> -module NeighborLB -module commlib -language charm++ \
> $(BUILDINFO).o \
> $(OBJS) \
> $(DPMTALIB) \
> $(DPMELIB) \
> $(TCLLIB) \
> $(FFTLIB) \
> $(PLUGINLIB) \
> $(CHARMOPTS) \
> -lm -o namd2 -memory os
> ^^^^^^^^^^^^
>
> Gengbin
>
>
> Niraj kumar wrote:
>
> >On 5/10/05, Gengbin Zheng <gzheng_at_ks.uiuc.edu> wrote:
> >
> >
> >>This may be a memory allocator issue: Charm++'s GNU malloc library
> >>conflicts with MPICH's.
> >>Try modifying the namd2 link command in namd2/Makefile: add "-memory os"
> >>and relink namd2.
> >>
> >>
> >
> >Hi Gengbin,
> >
> >Thanks for your help.
> >I passed the "-memory os" option to charmc by using
> >this command to compile charm:
> >./build charm++ mpi-linux -O -DCMK_OPTIMIZE=1 -memory os
> >
> >Everything compiled fine, and I started running repeated tests to see
> >whether it crashes.
> >I got this crash again on the 7th (or 8th) run:
> >
> >#0 0x08210971 in chunk_free ()
> >(gdb) where
> >#0 0x08210971 in chunk_free ()
> >#1 0x0821089c in mm_free ()
> >#2 0x08211dae in free ()
> >#3 0x082851ad in MPID_SHMEM_Eagern_unxrecv_start ()
> >#4 0x082782c8 in MPID_IrecvContig ()
> >#5 0x0827a2a4 in MPID_IrecvDatatype ()
> >#6 0x0827a185 in MPID_RecvDatatype ()
> >#7 0x08260d3e in PMPI_Recv ()
> >#8 0x082509b8 in PumpMsgs ()
> >#9 0x08250bac in CmiGetNonLocal ()
> >#10 0x0825204b in CsdNextMessage ()
> >#11 0x08252108 in CsdScheduleForever ()
> >#12 0x082520b0 in CsdScheduler ()
> >#13 0x080c2372 in BackEnd::init ()
> >#14 0x080bf431 in main ()
> >
> >
> >This is the same location where it was crashing earlier, although
> >the crashes seem to have become a little less frequent.
> >
> >Any ideas?
> >
> >Regards
> >Niraj
> >
> >
> >
> >
> >>Gengbin
> >>
> >>Niraj kumar wrote:
> >>
> >>
> >>
> >>>Hi,
> >>>
> >>>(I sent this report to namd_at_ks.uiuc.edu earlier but got no response,
> >>>so I am resending it; hopefully somebody can help me this time.)
> >>>
> >>>I am seeing NAMD 2.5 crash on an 8-way SMP machine (Linux x86).
> >>>The crash doesn't happen every time, but after repeated runs it shows up
> >>>often. There are two stack traces (see below). Every crash
> >>>results in one of these. The program receives a SIGSEGV signal.
> >>>
> >>>From the trace, it looks like some subtle issue related to memory
> >>>management. The crash location is in Charm++ code.
> >>>
> >>>NAMD was compiled against MPICH with the shared-memory device.
> >>>
> >>>Stack trace 1:
> >>>----------------------------------------------------------
> >>>(gdb) where
> >>>#0 0x0825986c in chunk_free (ar_ptr=0x8409e20, p=0x993c840) at memory-gnu.c:3268
> >>>#1 0x082596d5 in mm_free (mem=0x993c848) at memory-gnu.c:3191
> >>>#2 0x0825b890 in free (mem=0x993c848) at memory.c:203
> >>>#3 0x082eb956 in MPID_SHMEM_Eagern_unxrecv_start ()
> >>>#4 0x082deae4 in MPID_IrecvContig ()
> >>>#5 0x082e0a98 in MPID_IrecvDatatype ()
> >>>#6 0x082e0979 in MPID_RecvDatatype ()
> >>>#7 0x082c76e2 in PMPI_Recv ()
> >>>#8 0x082b252d in PumpMsgs () at machine.c:418
> >>>#9 0x082b2794 in CmiNotifyIdle () at machine.c:628
> >>>#10 0x082b5c7e in call_cblist_keep (l=0x8c2f010) at conv-conds.c:142
> >>>#11 0x082b6696 in CcdRaiseCondition (condnum=2) at conv-conds.c:417
> >>>#12 0x082b4021 in CsdStillIdle () at convcore.c:918
> >>>#13 0x082b424e in CsdScheduleForever () at convcore.c:1029
> >>>#14 0x082b4194 in CsdScheduler (maxmsgs=-1) at convcore.c:990
> >>>#15 0x080f0fa8 in slave_init (argc=2, argv=0xbfffef44) at src/BackEnd.C:94
> >>>#16 0x080f1011 in BackEnd::init (argc=2, argv=0xbfffef44) at src/BackEnd.C:103
> >>>#17 0x080ed8f5 in main (argc=2, argv=0xbfffef44) at src/mainfunc.C:34
> >>>-------------------------------------------------------------------------------------------------------------------------
> >>>Stack trace 2:
> >>>-------------------------------------------------------------------------------------------------------------------------
> >>>#0 0x08267d46 in _int_malloc (av=0x845e240, bytes=10444) at memory-gnu.c:3886
> >>>3886 bck->fd = unsorted_chunks(av);
> >>>(gdb) where
> >>>#0 0x08267d46 in _int_malloc (av=0x845e240, bytes=10444) at memory-gnu.c:3886
> >>>#1 0x08266db1 in mm_malloc (bytes=10444) at memory-gnu.c:3306
> >>>#2 0x08269a46 in malloc (size=10444) at memory.c:207
> >>>#3 0x08269c4e in malloc_nomigrate (size=10444) at memory.c:276
> >>>#4 0x082d5d25 in CmiAlloc (size=10436) at convcore.c:1625
> >>>#5 0x082d3383 in PumpMsgs () at machine.c:421
> >>>#6 0x082d35e6 in CmiGetNonLocal () at machine.c:624
> >>>#7 0x082d503b in CsdNextMessage (s=0xbffff600) at convcore.c:1016
> >>>#8 0x082d5118 in CsdScheduleForever () at convcore.c:1078
> >>>#9 0x082d50b6 in CsdScheduler (maxmsgs=-1) at convcore.c:1044
> >>>#10 0x080fd634 in slave_init (argc=2, argv=0xbffff844) at src/BackEnd.C:94
> >>>#11 0x080fd69d in BackEnd::init (argc=2, argv=0xbffff844) at src/BackEnd.C:103
> >>>#12 0x080f9f81 in main (argc=2, argv=0xbffff844) at src/mainfunc.C:34
> >>>----------------------------------------------------------------------------------------------------------------------------------
> >>>
> >>>Any hints on this?
> >>>If you need any more info, please let me know.
> >>>Thanks in advance.
> >>>
> >>>
> >>>
> >>>
> >
> >
> >
> >
>

-- 
-----------------------------------------------------------------
Please visit my webpage at http://nirajkumar.net
