Re: NAMD2.6b2: Segmentation fault

From: Jim Phillips (jim_at_ks.uiuc.edu)
Date: Tue Aug 29 2006 - 16:13:45 CDT

On Tue, 29 Aug 2006, Morad Alawneh wrote:

> I have tested the Linux-amd64-TCP binary in parallel, and it duplicated
> the whole job on each processor instead of using that number of
> processors for one job.

Don't use mpirun, use the included charmrun binary.
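
For example, from the directory with the released binaries (the processor
count here is just a placeholder for your job):

  ./charmrun ./namd2 +p8 prod_sys.namd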

> What do you mean by gdb?

It's a debugger. When it segfaults you can see where it was.
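
For example:

  gdb ./namd2
  (gdb) run prod_sys.namd

When it hits the segfault, type "bt" at the (gdb) prompt to get a stack
trace showing where it died.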

> Regarding the output file, it is the same for all runs: it reaches the
> last line below and crashes after that.
>
>
> Info: STRUCTURE SUMMARY:
> Info: 6376 ATOMS
> Info: 5030 BONDS
> Info: 5934 ANGLES
> Info: 6512 DIHEDRALS
> Info: 96 IMPROPERS
> Info: 30 CROSSTERMS
> Info: 0 EXCLUSIONS
> Info: 120 CONSTRAINTS
> Info: 5320 RIGID BONDS
> Info: 13808 DEGREES OF FREEDOM
> Info: 2352 HYDROGEN GROUPS
> Info: TOTAL MASS = 39362.3 amu
> Info: TOTAL CHARGE = 2.15322e-06 e
> Info: *****************************
> Info: Entering startup phase 0 with 193104 kB of memory in use.
> Info: Entering startup phase 1 with 193104 kB of memory in use.

OK, so relevant information would be "segfaults in startup phase 1".
This is almost certainly related to the known CMAP crash (2.6b1 doesn't
even recognize the CMAP terms in the psf file). It's fixed in CVS but not
in the 2.6b2 source code on the download site.

-Jim

> After adding the option -DSIMPLE_PAIRLIST to CXXOPTS and rebuilding, I
> tested different numbers of CPUs and got something weird:
>
> Number of nodes * CPUs per node : result
> 1 * 4 : OK
> 2 * 4 : OK
> 4 * 4 : OK
> 8 * 4 : FAILED
> 16 * 4 : FAILED
>
>
> Does that give any clue about the problem?
>
>
> Thanks again
>
> /*Morad Alawneh*/
>
> *Department of Chemistry and Biochemistry*
>
> *C100 BNSN, BYU*
>
> *Provo, UT 84602*
>
>
>
> Jim Phillips wrote:
>>
>> One other thing, and I doubt this is it, but try adding
>> -DSIMPLE_PAIRLIST to CXXOPTS and CXXNOALIASOPTS (if it's there) in the
>> .arch file. I've seen the Intel compilers choke on those loops before.
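>>
>> For example, in your Linux-amd64-MPI.arch it would end up looking
>> something like this (the other flags shown are just placeholders for
>> whatever is already there):
>>
>> CXXOPTS = -O3 -DSIMPLE_PAIRLIST
>> CXXNOALIASOPTS = -O3 -fno-alias -DSIMPLE_PAIRLIST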
>>
>> -Jim
>>
>>
>> On Tue, 29 Aug 2006, Jim Phillips wrote:
>>
>>>
>>> Do the 2.6b2 released binaries run your job in parallel successfully?
>>>
>>> Can you run your binary in gdb to see where the crash happens?
>>>
>>> I'm just amazed that there is no other output from the job.
>>>
>>> -Jim
>>>
>>>
>>> On Tue, 29 Aug 2006, Morad Alawneh wrote:
>>>
>>>> Thanks for your suggestions.
>>>>
>>>> I compiled NAMD 2.6b1 from scratch and it works without any problem.
>>>>
>>>> I compiled NAMD 2.6b2 from scratch and it gives the Segmentation
>>>> Fault.
>>>>
>>>> I compiled NAMD 2.6b2 from scratch with the charm++ from NAMD 2.6b1,
>>>> and it also gives the Segmentation Fault.
>>>>
>>>>
>>>> By checking the NamdKnownBugs, I found the following:
>>>>
>>>>
>>>> 2.6b2
>>>>
>>>> Parallel runs will often crash (segment fault) during startup phase 2
>>>> when CMAP crossterms are present in the psf file. Fixed.
>>>>
>>>>
>>>> According to that note it should have been fixed, so I downloaded the
>>>> NAMD 2.6b2 source code today and followed your suggestions, but
>>>> without any success yet.
>>>>
>>>> Here what I got in the error log file:
>>>>
>>>> bash: line 1: 24937 Segmentation fault /usr/bin/env MPIRUN_MPD=0
>>>> MPIRUN_HOST=m4a-3-21.local MPIRUN_PORT=52039
>>>> MPIRUN_PROCESSES='m4a-3-21i:m4a-3-21i:m4a-3-21i:m4a-3-21i:m4a-3-20i:m4a-3-20i:m4a-3-20i:m4a-3-20i:m4a-3-19i:m4a-3-19i:m4a-3-19i:m4a-3-19i:m4a-3-18i:m4a-3-18i:m4a-3-18i:m4a-3-18i:m4a-3-17i:m4a-3-17i:m4a-3-17i:m4a-3-17i:m4a-3-16i:m4a-3-16i:m4a-3-16i:m4a-3-16i:m4a-3-15i:m4a-3-15i:m4a-3-15i:m4a-3-15i:m4a-3-14i:m4a-3-14i:m4a-3-14i:m4a-3-14i:'
>>>> MPIRUN_RANK=6 MPIRUN_NPROCS=32 MPIRUN_ID=21872
>>>> /ibrix/home/mfm42/opt/namd-IB/Linux-amd64-MPI/namd2 +strategy USE_GRID
>>>> prod_sys.namd
>>>> Terminating processes.
>>>>
>>>>
>>>> Do you have other suggestions?
>>>>
>>>>
>>>> Thanks
>>>>
>>>> /*Morad Alawneh*/
>>>>
>>>> *Department of Chemistry and Biochemistry*
>>>>
>>>> *C100 BNSN, BYU*
>>>>
>>>> *Provo, UT 84602*
>>>>
>>>>
>>>>
>>>> Jim Phillips wrote:
>>>>>
>>>>> There are many changes between the two versions. The first test is to
>>>>> see if the difference is in NAMD or Charm++. NAMD 2.6b2 should work
>>>>> with the version of Charm++ included in NAMD 2.6b1, so you might try
>>>>> building that first to see if the problem goes away. I would also
>>>>> rebuild 2.6b1 from scratch to see if there has been a change in your
>>>>> compilers, etc.
>>>>>
>>>>> -Jim
>>>>>
>>>>>
>>>>> On Tue, 29 Aug 2006, Morad Alawneh wrote:
>>>>>
>>>>>> Dear NAMD Developers,
>>>>>>
>>>>>>
>>>>>> After a long time of debugging and testing our hardware, NAMD2.6b1
>>>>>> runs in parallel without any problem whereas NAMD2.6b2 does not,
>>>>>> even though both were installed with the same instructions. Both
>>>>>> versions work in serial and in parallel (over a Gigabit Ethernet
>>>>>> connection) without any problem.
>>>>>>
>>>>>> I did what Jim suggested in his previous email, but I still have the
>>>>>> same problem.
>>>>>>
>>>>>> I have attached the instruction again with this email.
>>>>>>
>>>>>> I am wondering whether anything changed between the two versions.
>>>>>>
>>>>>> Would you suggest any solution for this issue?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>
>>>>>> /*Morad Alawneh*/
>>>>>>
>>>>>> *Department of Chemistry and Biochemistry*
>>>>>>
>>>>>> *C100 BNSN, BYU*
>>>>>>
>>>>>> *Provo, UT 84602*
>>>>>>
>>>>>>
>>>>>>
>>>>>> Jim Phillips wrote:
>>>>>>>
>>>>>>> I can't tell much from just a segfault. Does the charm++ megatest
>>>>>>> work? Does NAMD run on one processor? Is there *any* output at all?
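>>>>>>>
>>>>>>> For megatest, something like this (the exact path inside the charm
>>>>>>> tree may differ by version):
>>>>>>>
>>>>>>> cd tests/charm++/megatest
>>>>>>> make pgm
>>>>>>> ./charmrun ./pgm +p4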
>>>>>>>
>>>>>>> My only comments looking at your build script are that on the charm
>>>>>>> ./build line "-language charm++ -balance rand" shouldn't be
>>>>>>> needed and
>>>>>>> may be harmful. Also, you shouldn't need "CHARMOPTS = -thread
>>>>>>> pthreads -memory os" with the TopSpin MPI library. It looks like
>>>>>>> you're following
>>>>>>> http://www.ks.uiuc.edu/Research/namd/wiki/?NamdOnInfiniBand but
>>>>>>> using
>>>>>>> the VMI build instructions. Also, please use the charm-5.9 source
>>>>>>> distributed with the NAMD source code, since this is the stable
>>>>>>> tree.
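>>>>>>>
>>>>>>> That is, a plain build line along these lines (the target name is a
>>>>>>> guess for your MPI/amd64 setup), with no "-language charm++
>>>>>>> -balance rand":
>>>>>>>
>>>>>>> ./build charm++ mpi-linux-amd64 -O -DCMK_OPTIMIZE=1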
>>>>>>>
>>>>>>> -Jim
>>>>>>>
>>>>>>>
>>>>>>> On Mon, 21 Aug 2006, Morad Alawneh wrote:
>>>>>>>
>>>>>>>> Dear users,
>>>>>>>>
>>>>>>>> I have successfully installed NAMD2.6b1 on my system (the
>>>>>>>> installation instructions are attached to this email), and the
>>>>>>>> program was working without any problem.
>>>>>>>>
>>>>>>>> I followed the same procedure for installing NAMD2.6b2, but after
>>>>>>>> submitting a job I received the following message in the error log
>>>>>>>> file:
>>>>>>>>
>>>>>>>> bash: line 1: 31904 Segmentation fault /usr/bin/env
>>>>>>>> MPIRUN_MPD=0
>>>>>>>> MPIRUN_HOST=m4a-7-11.local MPIRUN_PORT=40732
>>>>>>>> MPIRUN_PROCESSES='m4a-7-11i:m4a-7-11i:m4a-7-11i:m4a-7-11i:m4a-7-10i:m4a-7-10i:m4a-7-10i:m4a-7-10i:m4a-7-9i:m4a-7-9i:m4a-7-9i:m4a-7-9i:m4a-7-8i:m4a-7-8i:m4a-7-8i:m4a-7-8i:m4a-7-7i:m4a-7-7i:m4a-7-7i:m4a-7-7i:m4a-7-6i:m4a-7-6i:m4a-7-6i:m4a-7-6i:m4a-7-5i:m4a-7-5i:m4a-7-5i:m4a-7-5i:m4a-7-4i:m4a-7-4i:m4a-7-4i:m4a-7-4i:m4a-6-24i:m4a-6-24i:m4a-6-24i:m4a-6-24i:m4a-6-23i:m4a-6-23i:m4a-6-23i:m4a-6-23i:m4a-6-22i:m4a-6-22i:m4a-6-22i:m4a-6-22i:m4a-6-21i:m4a-6-21i:m4a-6-21i:m4a-6-21i:m4a-6-20i:m4a-6-20i:m4a-6-20i:m4a-6-20i:m4a-6-19i:m4a-6-19i:m4a-6-19i:m4a-6-19i:m4a-6-18i:m4a-6-18i:m4a-6-18i:m4a-6-18i:m4a-6-17i:m4a-6-17i:m4a-6-17i:m4a-6-17i:m4a-6-16i:m4a-6-16i:m4a-6-16i:m4a-6-16i:m4a-6-15i:m4a-6-15i:m4a-6-15i:m4a-6-15i:m4a-6-14i:m4a-6-14i:m4a-6-14i:m4a-6-14i:m4a-6-13i:m4a-6-13i:m4a-6-13i:m4a-6-13i:m4a-6-12i:m4a-6-12i:m4a-6-12i:m4a-6-12i:m4a-6-11i:m4a-6-11i:m4a-6-11i:m4a-6-11i:m4a-6-10i:m4a-6-10i:m4a-6-10i:m4a-6-10i:m4a-6-9i:m4a-6-9i:m4a-6-9i:m4a-6-9i:m4a-6-8i:m4a-6-8i:m4a-6-8i:m4a-6-8i:m4a-6-7i:m4a-6-7i:m4a-6-7i:m4a-6-7i:m4a-6-6i:m4a-6-6i:m4a-6-6i:m4a-6-6i:m4a-6-5i:m4a-6-5i:m4a-6-5i:m4a-6-5i:m4a-6-4i:m4a-6-4i:m4a-6-4i:m4a-6-4i:m4a-6-3i:m4a-6-3i:m4a-6-3i:m4a-6-3i:m4a-6-2i:m4a-6-2i:m4a-6-2i:m4a-6-2i:m4a-6-1i:m4a-6-1i:m4a-6-1i:m4a-6-1i:'
>>>>>>>> MPIRUN_RANK=16 MPIRUN_NPROCS=128 MPIRUN_ID=32469
>>>>>>>> /ibrix/home/mfm42/opt/namd-IB/Linux-amd64-MPI/namd2 +strategy
>>>>>>>> USE_GRID
>>>>>>>> equil3_sys.namd
>>>>>>>>
>>>>>>>> Any suggestions for that kind of error will be appreciated.
>>>>>>>>
>>>>>>>>
>>>>>>>> My system info:
>>>>>>>>
>>>>>>>> A Dell 1855 Linux cluster whose nodes are each equipped with four
>>>>>>>> Intel Xeon EM64T processors (3.6 GHz) and 8 GB of memory. The
>>>>>>>> nodes are connected with InfiniBand, a high-speed, low-latency
>>>>>>>> copper interconnect.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> /*Morad Alawneh*/
>>>>>>>>
>>>>>>>> *Department of Chemistry and Biochemistry*
>>>>>>>>
>>>>>>>> *C100 BNSN, BYU*
>>>>>>>>
>>>>>>>> *Provo, UT 84602*
>>>>>>>>
>>>>>>>>
>>>>>>
>>>
>
