Re: Segmentation violation from APOA1 simulation on AMD64

From: Cesar Luis Avila (cavila_at_fbqf.unt.edu.ar)
Date: Thu Aug 31 2006 - 10:42:41 CDT

I have finally solved the problem. As you suggested, it was related
to memory and/or the PCI bus.
I ran memtest86+ (http://www.memtest.org/), which reported errors in test #5:

*Test 5 [Block move, 64 moves]*

    This test stresses memory by using block move (movsl) instructions
    and is based on Robert Redelmeier's burnBX test. Memory is
    initialized with shifting patterns that are inverted every 8 bytes.
    Then 4mb blocks of memory are moved around using the movsl
    instruction. After the moves are completed the data patterns are
    checked. Because the data is checked only after the memory moves are
    completed it is not possible to know where the error occurred. The
    addresses reported are only for where the bad pattern was found.
    Since the moves are constrained to a 8mb segment of memory the
    failing address will always be less than 8mb away from the reported
    address. Errors from this test are not used to calculate BadRAM
    patterns.
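
The idea behind this test can be illustrated with a small user-space sketch. This is not memtest86+ code and it cannot replace it, since Python gives no control over which physical memory is touched. The function names, the rotate-left pattern, and the swap-two-halves move scheme are assumptions made for illustration; only the 4 MB block size, the 8 MB segment, the 64 moves, and the "invert every 8 bytes" rule come from the description above.

```python
BLOCK = 4 * 1024 * 1024  # 4 MB blocks, as in the test description


def make_pattern(size, start=0x01):
    """Fill a buffer with a shifting byte pattern that is inverted every 8 bytes."""
    assert size % 16 == 0, "size must be a multiple of 16 bytes"
    buf = bytearray(size)
    word = start
    for i in range(0, size, 16):
        buf[i:i + 8] = bytes([word]) * 8               # pattern
        buf[i + 8:i + 16] = bytes([word ^ 0xFF]) * 8   # inverted pattern
        word = ((word << 1) | (word >> 7)) & 0xFF      # shift (rotate left by one bit)
    return buf


def swap_halves(mem, block):
    """Move the two blocks of a 2*block segment past each other."""
    tmp = bytes(mem[:block])
    mem[:block] = mem[block:]
    mem[block:] = tmp


def find_mismatches(mem, expected):
    """Return the offsets where the data differs from the expected pattern."""
    if mem == expected:
        return []
    return [i for i, (a, b) in enumerate(zip(mem, expected)) if a != b]


def block_move_test(block=BLOCK, moves=64):
    """Run the moves first, then check the pattern only once at the end."""
    expected = make_pattern(2 * block)
    mem = bytearray(expected)
    for _ in range(moves):
        swap_halves(mem, block)
    if moves % 2:
        # An odd number of swaps leaves the halves exchanged.
        swap_halves(expected, block)
    return find_mismatches(mem, expected)


if __name__ == "__main__":
    bad = block_move_test()
    print("mismatching offsets:", len(bad))
```

Because the check runs only after all moves, a mismatch tells you where the bad data ended up, not which move corrupted it. That is the same limitation the description notes for the reported addresses.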

After updating the BIOS, the errors disappeared from memtest, and NAMD
ran all night without crashing.
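
For long overnight runs it can help to check the log automatically instead of reading it by hand. The following is a minimal sketch, assuming the log has one `ENERGY:` record per line with the step number as its first field, and using only the warning and error strings that appear in the logs quoted below. The function name `scan_namd_log` and the marker list are hypothetical, not part of NAMD.

```python
import re

# Step number is the first field of each ENERGY record.
ENERGY_RE = re.compile(r"^ENERGY:\s+(\d+)")

# Messages seen in the failing runs quoted in this thread.
PROBLEM_MARKERS = (
    "Warning: Not all atoms have unique coordinates.",
    "Signal: segmentation violation",
    "Fatal error on PE",
)


def scan_namd_log(lines):
    """Return (last ENERGY step seen, [(step, message), ...]) for a NAMD log."""
    last_step = None
    problems = []
    for line in lines:
        match = ENERGY_RE.match(line)
        if match:
            last_step = int(match.group(1))
        for marker in PROBLEM_MARKERS:
            if marker in line:
                # Record the last completed energy step next to the message.
                problems.append((last_step, line.strip()))
                break
    return last_step, problems
```

With the log file from the launch command in this thread, something like `with open("apoa1.log") as f: last, problems = scan_namd_log(f)` would give the last reported step and any crash messages. An empty `problems` list only means none of these specific strings appeared.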

The problem is present on the MSI K8NGM2-FID (nForce 430 chipset) with BIOS
version 3.0 and was fixed by upgrading to BIOS v3.5.

Jim, thank you very much for your help.

Cesar

Jim Phillips wrote:
>
> Try cpuburn (http://pages.sbcglobal.net/redelm/) and netperf
> (http://www.netperf.org/netperf/NetperfPage.html), although I'm not
> sure how much netperf checks the results of its transmissions.
>
> setenv CONV_RSH "ssh -x" should take care of the console issue.
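
The `setenv` line above is csh syntax. When the job is launched from a script rather than an interactive shell, the same variable can be set in the child process environment. Here is a minimal sketch, assuming the charmrun command line from the original post further down; the helper name `charmrun_invocation` is hypothetical, and the function only builds the command and environment without running anything.

```python
import os


def charmrun_invocation(remote_shell="ssh -x", processors=6,
                        namd_binary="/usr/local/bin/namd2",
                        config="apoa1.namd"):
    """Build the charmrun command and an environment with CONV_RSH set."""
    env = dict(os.environ)
    env["CONV_RSH"] = remote_shell  # same effect as: setenv CONV_RSH "ssh -x"
    cmd = ["charmrun", f"+p{processors}", namd_binary, config]
    return cmd, env


# Example launch (commented out so that importing this file runs nothing):
# import subprocess
# cmd, env = charmrun_invocation()
# with open("apoa1.log", "w") as log:
#     subprocess.run(cmd, env=env, stdout=log, stderr=subprocess.STDOUT)
```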
>
> -Jim
>
>
> On Tue, 29 Aug 2006, Cesar Luis Avila wrote:
>
>> Dear Jim,
>> thanks for your reply. The problem appears at random points in
>> long simulations. I ran the simulation once again, and this time
>> I got:
>>
>> ENERGY: 33000 22367.3984 21797.7332 5673.4186
>> 195.9020 -350322.0171 28291.9734 0.0000
>> 0.0000 49528.9333 -222466.6582 180.1726 -222046.4247
>> -222038.8650 180.0486 -1644.6361 -1672.2297
>> 921491.4634 -1663.1019 -1662.5577
>>
>> WRITING COORDINATES TO DCD FILE AT STEP 33000
>> ------------- Processor 2 Exiting: Caught Signal ------------
>> Signal: segmentation violation
>> Suggestion: Try running with '++debug', or linking with '-memory
>> paranoid'.
>> Fatal error on PE 2> segmentation violation
>>
>> I have two specific questions:
>> 1- Do you know of any Linux tool to test memory or the PCI bus for
>> data corruption?
>> 2- How do I tell charm++ to use "ssh -x" instead of rsh when logging
>> in to the nodes? That way I will be able to redirect all the xterm
>> consoles for each node to my desktop.
>>
>> Regards
>> Cesar
>>
>>
>> Jim Phillips wrote:
>>> Hi,
>>>
>>> Thanks for all of the detail, it's very useful.
>>>
>>> The "Warning: Not all atoms have unique coordinates." message that
>>> happens right before the crash means that you're getting extra
>>> exclusions. This used to happen when two atoms were very close, but
>>> in the current code it really should be impossible. Does this
>>> always happen?
>>>
>>> Try running with "++debug-no-pause" to open an xterm for each node
>>> and try to figure out where the crash is happening. It's possible
>>> that you're getting data corruption on your PCI bus or in memory.
>>>
>>> -Jim
>>>
>>>
>>> On Mon, 28 Aug 2006, Cesar Luis Avila wrote:
>>>
>>>> Dear all,
>>>> I am running the APOA1 benchmark to test the NAMD_2.6b2_Linux-amd64-TCP
>>>> binaries. The only modification I have made to the configuration
>>>> file is to increase the number of steps from 100 to 200000 (I wanted
>>>> to test stability over long simulations). I am using the following
>>>> command to launch the job:
>>>> nohup charmrun +p6 /usr/local/bin/namd2 apoa1.namd > apoa1.log &
>>>>
>>>> with the nodelist
>>>> group main
>>>> host node0
>>>> host node1
>>>> host node2
>>>> host node3
>>>> host node4
>>>> host node5
>>>>
>>>> Each node has an AMD Athlon64 dual-core processor with 1 GB of
>>>> RAM. They are connected through Gigabit Ethernet. The kernel is
>>>> 2.6.16.20 SMP.
>>>>
>>>> I receive the following error (extracted from the log file). Sorry for
>>>> the long message, but I want to give as much information as possible.
>>>>
>>>> Charm++: scheduler running in netpoll mode.
>>>> Info: NAMD 2.6b2 for Linux-amd64-TCP
>>>> Info:
>>>> Info: Please visit http://www.ks.uiuc.edu/Research/namd/
>>>> Info: and send feedback or bug reports to namd_at_ks.uiuc.edu
>>>> Info:
>>>> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
>>>> Info: in all publications reporting results obtained with NAMD.
>>>> Info:
>>>> Info: Based on Charm++/Converse 50900 for
>>>> net-linux-amd64-tcp-iccstatic
>>>> Info: Built Thu Aug 17 16:19:22 CDT 2006 by jim on belfast.ks.uiuc.edu
>>>> Info: Sending usage information to NAMD developers via UDP. Sent
>>>> data is:
>>>> Info: 1 NAMD 2.6b2 Linux-amd64-TCP 6 cluster cesar
>>>> Info: Running on 6 processors.
>>>> Info: 7608 kB of memory in use.
>>>> Info: Memory usage based on mallinfo
>>>> Info: Configuration file is apoa1.namd
>>>> TCL: Suspending until startup complete.
>>>> Info: SIMULATION PARAMETERS:
>>>> Info: TIMESTEP 1
>>>> Info: NUMBER OF STEPS 200000
>>>> Info: STEPS PER CYCLE 20
>>>> Info: PERIODIC CELL BASIS 1 108.861 0 0
>>>> Info: PERIODIC CELL BASIS 2 0 108.861 0
>>>> Info: PERIODIC CELL BASIS 3 0 0 77.758
>>>> Info: PERIODIC CELL CENTER 0 0 0
>>>> Info: LOAD BALANCE STRATEGY Other
>>>> Info: LDB PERIOD 4000 steps
>>>> Info: FIRST LDB TIMESTEP 100
>>>> Info: LDB BACKGROUND SCALING 1
>>>> Info: HOM BACKGROUND SCALING 1
>>>> Info: PME BACKGROUND SCALING 1
>>>> Info: MAX SELF PARTITIONS 50
>>>> Info: MAX PAIR PARTITIONS 20
>>>> Info: SELF PARTITION ATOMS 125
>>>> Info: PAIR PARTITION ATOMS 200
>>>> Info: PAIR2 PARTITION ATOMS 400
>>>> Info: MIN ATOMS PER PATCH 100
>>>> Info: INITIAL TEMPERATURE 300
>>>> Info: CENTER OF MASS MOVING? NO
>>>> Info: DIELECTRIC 1
>>>> Info: EXCLUDE SCALED ONE-FOUR
>>>> Info: 1-4 SCALE FACTOR 1
>>>> Info: DCD FILENAME apoa1
>>>> Info: DCD FREQUENCY 1000
>>>> Info: DCD FIRST STEP 1000
>>>> Info: DCD FILE WILL CONTAIN UNIT CELL DATA
>>>> Info: NO EXTENDED SYSTEM TRAJECTORY OUTPUT
>>>> Info: NO VELOCITY DCD OUTPUT
>>>> Info: OUTPUT FILENAME apoa1-out
>>>> Info: BINARY OUTPUT FILES WILL BE USED
>>>> Info: NO RESTART FILE
>>>> Info: SWITCHING ACTIVE
>>>> Info: SWITCHING ON 10
>>>> Info: SWITCHING OFF 12
>>>> Info: PAIRLIST DISTANCE 13.5
>>>> Info: PAIRLIST SHRINK RATE 0.01
>>>> Info: PAIRLIST GROW RATE 0.01
>>>> Info: PAIRLIST TRIGGER 0.3
>>>> Info: PAIRLISTS PER CYCLE 2
>>>> Info: PAIRLISTS ENABLED
>>>> Info: MARGIN 0
>>>> Info: HYDROGEN GROUP CUTOFF 2.5
>>>> Info: PATCH DIMENSION 16
>>>> Info: ENERGY OUTPUT STEPS 500
>>>> Info: CROSSTERM ENERGY INCLUDED IN DIHEDRAL
>>>> Info: TIMING OUTPUT STEPS 500
>>>> Info: PARTICLE MESH EWALD (PME) ACTIVE
>>>> Info: PME TOLERANCE 1e-06
>>>> Info: PME EWALD COEFFICIENT 0.257952
>>>> Info: PME INTERPOLATION ORDER 4
>>>> Info: PME GRID DIMENSIONS 108 108 80
>>>> Info: Attempting to read FFTW data from
>>>> FFTW_NAMD_2.6b2_Linux-amd64-TCP.txt
>>>> Info: Optimizing 6 FFT steps. 1... 2... 3... 4... 5... 6... Done.
>>>> Info: Writing FFTW data to FFTW_NAMD_2.6b2_Linux-amd64-TCP.txt
>>>> Info: FULL ELECTROSTATIC EVALUATION FREQUENCY 4
>>>> Info: USING VERLET I (r-RESPA) MTS SCHEME.
>>>> Info: C1 SPLITTING OF LONG RANGE ELECTROSTATICS
>>>> Info: PLACING ATOMS IN PATCHES BY HYDROGEN GROUPS
>>>> Info: RANDOM NUMBER SEED 74269
>>>> Info: USE HYDROGEN BONDS? NO
>>>> Info: COORDINATE PDB apoa1.pdb
>>>> Info: STRUCTURE FILE apoa1.psf
>>>> Info: PARAMETER file: XPLOR format! (default)
>>>> Info: PARAMETERS par_all22_prot_lipid.xplor
>>>> Info: PARAMETERS par_all22_popc.xplor
>>>> Info: USING ARITHMETIC MEAN TO COMBINE L-J SIGMA PARAMETERS
>>>> Info: SUMMARY OF PARAMETERS:
>>>> Info: 177 BONDS
>>>> Info: 435 ANGLES
>>>> Info: 446 DIHEDRAL
>>>> Info: 45 IMPROPER
>>>> Info: 0 CROSSTERM
>>>> Info: 83 VDW
>>>> Info: 6 VDW_PAIRS
>>>> Command = zcat apoa1.psf.Z
>>>> Filename.Z = apoa1.psf.Z
>>>> Command = gzip -d -c apoa1.psf.gz
>>>> Filename.gz = apoa1.psf.gz
>>>> Command = zcat apoa1.pdb.Z
>>>> Filename.Z = apoa1.pdb.Z
>>>> Command = gzip -d -c apoa1.pdb.gz
>>>> Filename.gz = apoa1.pdb.gz
>>>> Info: ****************************
>>>> Info: STRUCTURE SUMMARY:
>>>> Info: 92224 ATOMS
>>>> Info: 70660 BONDS
>>>> Info: 74136 ANGLES
>>>> Info: 74130 DIHEDRALS
>>>> Info: 1402 IMPROPERS
>>>> Info: 0 CROSSTERMS
>>>> Info: 0 EXCLUSIONS
>>>> Info: 1568 DIHEDRALS WITH MULTIPLE PERIODICITY (BASED ON PSF FILE)
>>>> Info: 276669 DEGREES OF FREEDOM
>>>> Info: 32992 HYDROGEN GROUPS
>>>> Info: TOTAL MASS = 553785 amu
>>>> Info: TOTAL CHARGE = -14 e
>>>> Info: *****************************
>>>> Info: Entering startup phase 0 with 38804 kB of memory in use.
>>>> Info: Entering startup phase 1 with 38804 kB of memory in use.
>>>> Info: Entering startup phase 2 with 67284 kB of memory in use.
>>>> Info: Entering startup phase 3 with 68008 kB of memory in use.
>>>> Info: PATCH GRID IS 6 (PERIODIC) BY 6 (PERIODIC) BY 4 (PERIODIC)
>>>> Info: REMOVING COM VELOCITY 0.00117959 0.0289175 0.0202933
>>>> Info: LARGEST PATCH (56) HAS 718 ATOMS
>>>> Info: CREATING 10904 COMPUTE OBJECTS
>>>> Info: Entering startup phase 4 with 81640 kB of memory in use.
>>>> Info: PME using 6 and 6 processors for FFT and reciprocal sum.
>>>> Info: PME GRID LOCATIONS: 0 1 2 3 4 5
>>>> Info: PME TRANS LOCATIONS: 0 1 2 3 4 5
>>>> Info: Optimizing 4 FFT steps. 1... 2... 3... 4... Done.
>>>> Info: Entering startup phase 5 with 82888 kB of memory in use.
>>>> Info: Entering startup phase 6 with 82888 kB of memory in use.
>>>> Measuring processor speeds... Done.
>>>> Info: Entering startup phase 7 with 82888 kB of memory in use.
>>>> Info: CREATING 10904 COMPUTE OBJECTS
>>>> Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625
>>>> Info: NONBONDED TABLE SIZE: 769 POINTS
>>>> Info: Entering startup phase 8 with 82028 kB of memory in use.
>>>> Info: Finished startup with 82028 kB of memory in use.
>>>> ETITLE: TS BOND ANGLE DIHED IMPRP
>>>> ELECT VDW BOUNDARY MISC KINETIC
>>>> TOTAL TEMP TOTAL2 TOTAL3 TEMPAVG
>>>> PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG
>>>>
>>>> ENERGY: 0 12352.4560 14603.1480 4549.0710
>>>> 48.1064 -362073.7646 24229.5343 0.0000 0.0000
>>>> 82752.2738 -223539.1752 301.0300 -223524.8477
>>>> -223524.8477 301.0300 -2568.6725 -3582.2555 921491.4634
>>>> -2568.6725 -3582.2555
>>>>
>>>> Info: Initial time: 6 CPUs 0.507152 s/step 5.86982 days/ns 124788
>>>> kB memory
>>>> LDB: LOAD: AVG 48.5736 MAX 51.2269 MSGS: TOTAL 180 MAXC 30 MAXP
>>>> 3 None
>>>> Info: Adjusted background load on 4 nodes.
>>>> LDB: LOAD: AVG 48.6532 MAX 48.7132 MSGS: TOTAL 180 MAXC 30 MAXP
>>>> 3 Alg7
>>>> LDB: LOAD: AVG 48.6532 MAX 48.7132 MSGS: TOTAL 180 MAXC 30 MAXP
>>>> 3 Alg7
>>>> Info: Initial time: 6 CPUs 0.512891 s/step 5.93624 days/ns 126852
>>>> kB memory
>>>> LDB: LOAD: AVG 49.6683 MAX 49.915 MSGS: TOTAL 180 MAXC 30 MAXP 3
>>>> None
>>>> LDB: LOAD: AVG 49.6683 MAX 49.915 MSGS: TOTAL 180 MAXC 30 MAXP 3
>>>> Refine
>>>> Info: Initial time: 6 CPUs 0.499921 s/step 5.78612 days/ns 126852
>>>> kB memory
>>>> LDB: LOAD: AVG 49.4578 MAX 49.6803 MSGS: TOTAL 180 MAXC 30 MAXP
>>>> 3 None
>>>> LDB: LOAD: AVG 49.4578 MAX 49.6803 MSGS: TOTAL 180 MAXC 30 MAXP 3
>>>> Refine
>>>> Info: Benchmark time: 6 CPUs 0.497711 s/step 5.76054 days/ns 126852
>>>> kB memory
>>>> Info: Benchmark time: 6 CPUs 0.507661 s/step 5.8757 days/ns 126852
>>>> kB memory
>>>> TIMING: 500 CPU: 248.252, 0.494231/step Wall: 255.61,
>>>> 0.508663/step, 28.1884 hours remaining, 126852 kB of memory in use.
>>>> ENERGY: 500 20974.8939 19756.6574 5724.4523
>>>> 179.8271 -337741.4164 23251.1002 0.0000 0.0000
>>>> 45359.0766 -222495.4089 165.0039 -222135.7455
>>>> -222061.0907 161.6475 -3197.5168 -2425.4141 921491.4634
>>>> -2273.6744 -2277.3151
>>>>
>>>> Info: Benchmark time: 6 CPUs 0.556213 s/step 6.43765 days/ns 126852
>>>> kB memory
>>>> TIMING: 1000 CPU: 527.001, 0.557499/step Wall: 547.336,
>>>> 0.583452/step, 32.2519 hours remaining, 128680 kB of memory in use.
>>>> ENERGY: 1000 20703.0380 20129.4440 5682.2714
>>>> 181.3573 -339456.8976 24454.7618 0.0000 0.0000
>>>> 45859.0820 -222446.9431 166.8227 -222070.6715
>>>> -222057.9902 164.8544 -2204.3503 -2191.2056 921491.4634
>>>> -2249.8572 -2250.0385
>>>>
>>>> OPENING COORDINATE DCD FILE
>>>> WRITING COORDINATES TO DCD FILE AT STEP 1000
>>>> TIMING: 1500 CPU: 813.843, 0.573684/step Wall: 849.107,
>>>> 0.603543/step, 33.2787 hours remaining, 132720 kB of memory in use.
>>>>
>>>> .........
>>>>
>>>>
>>>> ENERGY: 25000 22195.1935 21609.1224 5636.4686
>>>> 206.3381 -350226.9598 28694.7582 0.0000
>>>> 0.0000 49427.1863 -222457.8927 179.8025
>>>> -222043.6668 -222038.7477 179.2539 -1388.6522
>>>> -1473.5057 921491.4634 -1610.7261 -1610.3108
>>>>
>>>> WRITING COORDINATES TO DCD FILE AT STEP 25000
>>>> TIMING: 25500 CPU: 12821.8, 0.555515/step Wall: 13217.1,
>>>> 0.569695/step, 27.6144 hours remaining, 139268 kB of memory in use.
>>>> ENERGY: 25500 22413.6842 21687.7772 5669.0633
>>>> 204.7580 -350072.0865 28452.2791 0.0000
>>>> 0.0000 49195.1010 -222449.4237 178.9583
>>>> -222031.9681 -222038.4793 179.4686 -1751.5360
>>>> -1722.6946 921491.4634 -1553.1121 -1552.6865
>>>>
>>>> Warning: Not all atoms have unique coordinates.
>>>> ------------- Processor 2 Exiting: Caught Signal ------------
>>>> Signal: segmentation violation
>>>> Suggestion: Try running with '++debug', or linking with '-memory
>>>> paranoid'.
>>>> Fatal error on PE 2> segmentation violation
>>>>
>>>>
>>>
>>
