Re: NAMD2.6b2: Segmentation fault

From: Jim Phillips (jim_at_ks.uiuc.edu)
Date: Tue Aug 29 2006 - 15:01:42 CDT

Do the 2.6b2 released binaries run your job in parallel successfully?

Can you run your binary in gdb to see where the crash happens?
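A quick sketch of how that could be done (assuming a single-process run so gdb can drive namd2 directly; the binary and config file names are just the ones from your error log):

```shell
# Sketch: run namd2 under gdb and get a backtrace at the segfault
# (single-process run assumed; paths taken from the error log)
gdb --args /ibrix/home/mfm42/opt/namd-IB/Linux-amd64-MPI/namd2 prod_sys.namd
# then at the (gdb) prompt:
#   run      # start the job and wait for the SIGSEGV
#   bt       # print the stack trace showing where it crashed
```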

I'm just amazed that there is no other output from the job.

-Jim

On Tue, 29 Aug 2006, Morad Alawneh wrote:

> Thanks for your suggestions.
>
> I have compiled NAMD 2.6b1 from scratch and it works without any problem.
>
> I have compiled NAMD 2.6b2 from scratch and it gives a segmentation fault.
>
> I have compiled NAMD 2.6b2 from scratch with the charm++ from NAMD 2.6b1,
> and it also gives a segmentation fault.
>
>
> By checking the NamdKnownBugs, I found the following:
>
>
> 2.6b2
>
> Parallel runs will often crash (segment fault) during startup phase 2
> when CMAP crossterms are present in the psf file. Fixed.
>
>
> According to that note it should have been fixed. So I downloaded the
> NAMD 2.6b2 source code today and followed your suggestions, but without
> any success yet.
>
> Here is what I got in the error log file:
>
> bash: line 1: 24937 Segmentation fault /usr/bin/env MPIRUN_MPD=0
> MPIRUN_HOST=m4a-3-21.local MPIRUN_PORT=52039
> MPIRUN_PROCESSES='m4a-3-21i:m4a-3-21i:m4a-3-21i:m4a-3-21i:m4a-3-20i:m4a-3-20i:m4a-3-20i:m4a-3-20i:m4a-3-19i:m4a-3-19i:m4a-3-19i:m4a-3-19i:m4a-3-18i:m4a-3-18i:m4a-3-18i:m4a-3-18i:m4a-3-17i:m4a-3-17i:m4a-3-17i:m4a-3-17i:m4a-3-16i:m4a-3-16i:m4a-3-16i:m4a-3-16i:m4a-3-15i:m4a-3-15i:m4a-3-15i:m4a-3-15i:m4a-3-14i:m4a-3-14i:m4a-3-14i:m4a-3-14i:'
> MPIRUN_RANK=6 MPIRUN_NPROCS=32 MPIRUN_ID=21872
> /ibrix/home/mfm42/opt/namd-IB/Linux-amd64-MPI/namd2 +strategy USE_GRID
> prod_sys.namd
> Terminating processes.
>
>
> Do you have other suggestions?
>
>
> Thanks
>
> /*Morad Alawneh*/
>
> *Department of Chemistry and Biochemistry*
>
> *C100 BNSN, BYU*
>
> *Provo, UT 84602*
>
>
>
> Jim Phillips wrote:
>>
>> There are many changes between the two versions. The first test is to
>> see if the difference is in NAMD or Charm++. NAMD 2.6b2 should work
>> with the version of Charm++ included in NAMD 2.6b1, so you might try
>> building that first to see if the problem goes away. I would also
>> rebuild 2.6b1 from scratch to see if there has been a change in your
>> compilers, etc.
>>
>> -Jim
>>
>>
>> On Tue, 29 Aug 2006, Morad Alawneh wrote:
>>
>>> Dear NAMD Developers,
>>>
>>>
>>> After a long time of debugging and testing our hardware, NAMD 2.6b1
>>> runs in parallel without any problem whereas NAMD 2.6b2 does not, even
>>> though both were installed with the same instructions. Both versions
>>> work in serial and in parallel (using a Gigabit Ethernet connection)
>>> without any problem.
>>>
>>> I did what Jim suggested in his previous email, but I still have the
>>> same problem.
>>>
>>> I have attached the instruction again with this email.
>>>
>>> I am wondering what changed between the two versions.
>>>
>>> Would you suggest any solution for this issue?
>>>
>>> Thanks
>>>
>>>
>>>
>>> /*Morad Alawneh*/
>>>
>>> *Department of Chemistry and Biochemistry*
>>>
>>> *C100 BNSN, BYU*
>>>
>>> *Provo, UT 84602*
>>>
>>>
>>>
>>> Jim Phillips wrote:
>>>>
>>>> I can't tell much from just a segfault. Does the charm++ megatest
>>>> work? Does NAMD run on one processor? Is there *any* output at all?
>>>>
>>>> My only comments looking at your build script are that on the charm
>>>> ./build line "-language charm++ -balance rand" shouldn't be needed and
>>>> may be harmful. Also, you shouldn't need "CHARMOPTS = -thread
>>>> pthreads -memory os" with the TopSpin MPI library. It looks like
>>>> you're following
>>>> http://www.ks.uiuc.edu/Research/namd/wiki/?NamdOnInfiniBand but using
>>>> the VMI build instructions. Also, please use the charm-5.9 source
>>>> distributed with the NAMD source code, since this is the stable tree.
>>>>
>>>> -Jim
>>>>
>>>>
>>>> On Mon, 21 Aug 2006, Morad Alawneh wrote:
>>>>
>>>>> Dear users,
>>>>>
>>>>> I have successfully installed NAMD 2.6b1 on my system (the
>>>>> installation instructions are attached to this email), and the
>>>>> program was working without any problem.
>>>>>
>>>>> I followed the same procedure to install NAMD 2.6b2, but after
>>>>> submitting a job I received the following message in the error log
>>>>> file:
>>>>>
>>>>> bash: line 1: 31904 Segmentation fault /usr/bin/env MPIRUN_MPD=0
>>>>> MPIRUN_HOST=m4a-7-11.local MPIRUN_PORT=40732
>>>>> MPIRUN_PROCESSES='m4a-7-11i:m4a-7-11i:m4a-7-11i:m4a-7-11i:m4a-7-10i:m4a-7-10i:m4a-7-10i:m4a-7-10i:m4a-7-9i:m4a-7-9i:m4a-7-9i:m4a-7-9i:m4a-7-8i:m4a-7-8i:m4a-7-8i:m4a-7-8i:m4a-7-7i:m4a-7-7i:m4a-7-7i:m4a-7-7i:m4a-7-6i:m4a-7-6i:m4a-7-6i:m4a-7-6i:m4a-7-5i:m4a-7-5i:m4a-7-5i:m4a-7-5i:m4a-7-4i:m4a-7-4i:m4a-7-4i:m4a-7-4i:m4a-6-24i:m4a-6-24i:m4a-6-24i:m4a-6-24i:m4a-6-23i:m4a-6-23i:m4a-6-23i:m4a-6-23i:m4a-6-22i:m4a-6-22i:m4a-6-22i:m4a-6-22i:m4a-6-21i:m4a-6-21i:m4a-6-21i:m4a-6-21i:m4a-6-20i:m4a-6-20i:m4a-6-20i:m4a-6-20i:m4a-6-19i:m4a-6-19i:m4a-6-19i:m4a-6-19i:m4a-6-18i:m4a-6-18i:m4a-6-18i:m4a-6-18i:m4a-6-17i:m4a-6-17i:m4a-6-17i:m4a-6-17i:m4a-6-16i:m4a-6-16i:m4a-6-16i:m4a-6-16i:m4a-6-15i:m4a-6-15i:m4a-6-15i:m4a-6-15i:m4a-6-14i:m4a-6-14i:m4a-6-14i:m4a-6-14i:m4a-6-13i:m4a-6-13i:m4a-6-13i:m4a-6-13i:m4a-6-12i:m4a-6-12i:m4a-6-12i:m4a-6-12i:m4a-6-11i:m4a-6-11i:m4a-6-11i:m4a-6-11i:m4a-6-10i:m4a-6-10i:m4a-6-10i:m4a-6-10i:m4a-6-9i:m4a-6-9i:m4a-6-9i:m4a-6-9i:m4a-6-8i:m4a-6-8i:m4a-6-8i:m4a-6-8i:m4a-6-7i:m4a-6-7i:m4a-6-7i:m4a-6-7i:m4a-6-6i:m4a-6-6i:m4a-6-6i:m4a-6-6i:m4a-6-5i:m4a-6-5i:m4a-6-5i:m4a-6-5i:m4a-6-4i:m4a-6-4i:m4a-6-4i:m4a-6-4i:m4a-6-3i:m4a-6-3i:m4a-6-3i:m4a-6-3i:m4a-6-2i:m4a-6-2i:m4a-6-2i:m4a-6-2i:m4a-6-1i:m4a-6-1i:m4a-6-1i:m4a-6-1i:'
>>>>>
>>>>>
>>>>> MPIRUN_RANK=16 MPIRUN_NPROCS=128 MPIRUN_ID=32469
>>>>> /ibrix/home/mfm42/opt/namd-IB/Linux-amd64-MPI/namd2 +strategy USE_GRID
>>>>> equil3_sys.namd
>>>>>
>>>>> Any suggestions for that kind of error will be appreciated.
>>>>>
>>>>>
>>>>> My system info:
>>>>>
>>>>> A Dell 1855 Linux cluster whose nodes are each equipped with four
>>>>> Intel Xeon EM64T processors (3.6 GHz) and 8 GB of memory. The nodes
>>>>> are connected with InfiniBand, a high-speed, low-latency copper
>>>>> interconnect.
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>>
>>>>>
>>>>> /*Morad Alawneh*/
>>>>>
>>>>> *Department of Chemistry and Biochemistry*
>>>>>
>>>>> *C100 BNSN, BYU*
>>>>>
>>>>> *Provo, UT 84602*
>>>>>
>>>>>
>>>
>

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:42:31 CST