Re: Segmentation violation from APOA1 simulation on AMD64

From: Jim Phillips (jim_at_ks.uiuc.edu)
Date: Tue Aug 29 2006 - 14:51:41 CDT

Hi,

Thanks for all of the detail; it's very useful.

The "Warning: Not all atoms have unique coordinates." message that appears
right before the crash means that you're getting extra exclusions. This
used to happen when two atoms were very close, but in the current code it
really should be impossible. Does the crash happen on every run?
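
One way to check whether the failure is repeatable is to compare the logs
from several runs. The Python sketch below is not part of NAMD; it only
assumes the log line formats visible in the quoted output further down
("ENERGY: <step> ...", the warning text, and "Fatal error on PE <n>> ...").
The function name summarize_log and the returned keys are hypothetical,
chosen for illustration.

```python
import re
import sys

# Exact warning text taken from the quoted log.
WARNING = "Warning: Not all atoms have unique coordinates."
# An ENERGY line, optionally preceded by an email quote marker ("> ").
ENERGY_RE = re.compile(r"^(?:>\s*)?ENERGY:\s+(\d+)")
# The Charm++ fatal error line, e.g. "Fatal error on PE 2> segmentation violation".
FATAL_RE = re.compile(r"Fatal error on PE (\d+)>\s*(.*)")


def summarize_log(lines):
    """Report the last ENERGY step, where the warning appeared, and the failing PE."""
    last_step = None       # most recent step printed on an ENERGY line
    warned_after = None    # last ENERGY step seen before the first warning
    warned = False
    fatal_pe = None
    fatal_msg = None
    for line in lines:
        m = ENERGY_RE.match(line)
        if m:
            last_step = int(m.group(1))
            continue
        if WARNING in line and not warned:
            warned = True
            warned_after = last_step
            continue
        m = FATAL_RE.search(line)
        if m:
            fatal_pe = int(m.group(1))
            fatal_msg = m.group(2).strip()
    return {
        "last_energy_step": last_step,
        "warning_seen": warned,
        "warning_after_step": warned_after,
        "fatal_pe": fatal_pe,
        "fatal_message": fatal_msg,
    }


if __name__ == "__main__":
    # Usage (assumed): python summarize_log.py run1.log run2.log ...
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            print(path, summarize_log(handle))
```

If the warning and the failing processor show up at a different step or on
a different PE in each run, that would be consistent with a hardware
problem rather than a deterministic bug, though a log comparison alone
cannot prove either.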

Try running with "++debug-no-pause" to open an xterm for each node and try
to figure out where the crash is happening. It's possible that you're
getting data corruption on your PCI bus or in memory.

-Jim

On Mon, 28 Aug 2006, Cesar Luis Avila wrote:

> Dear all,
> I am running the APOA1 benchmark to test the NAMD_2.6b2_Linux-amd64-TCP
> binaries. The only modification I have made to the configuration file is to
> increase the number of steps from 100 to 200000 (I wanted to test the
> stability over long simulations).
> I am using the following command to launch the job:
> nohup charmrun +p6 /usr/local/bin/namd2 apoa1.namd > apoa1.log &
>
> with the nodelist
> group main
> host node0
> host node1
> host node2
> host node3
> host node4
> host node5
>
> Each node has an AMD Athlon64 Dual-Core processor with 1 GB of RAM. They are
> connected through Gigabit Ethernet. Kernel is 2.6.16.20 SMP.
>
> I receive the following error (extracted from log file). Sorry for the large
> message but I want to give as much information as possible.
>
> Charm++: scheduler running in netpoll mode.
> Info: NAMD 2.6b2 for Linux-amd64-TCP
> Info:
> Info: Please visit http://www.ks.uiuc.edu/Research/namd/
> Info: and send feedback or bug reports to namd_at_ks.uiuc.edu
> Info:
> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
> Info: in all publications reporting results obtained with NAMD.
> Info:
> Info: Based on Charm++/Converse 50900 for net-linux-amd64-tcp-iccstatic
> Info: Built Thu Aug 17 16:19:22 CDT 2006 by jim on belfast.ks.uiuc.edu
> Info: Sending usage information to NAMD developers via UDP. Sent data is:
> Info: 1 NAMD 2.6b2 Linux-amd64-TCP 6 cluster cesar
> Info: Running on 6 processors.
> Info: 7608 kB of memory in use.
> Info: Memory usage based on mallinfo
> Info: Configuration file is apoa1.namd
> TCL: Suspending until startup complete.
> Info: SIMULATION PARAMETERS:
> Info: TIMESTEP 1
> Info: NUMBER OF STEPS 200000
> Info: STEPS PER CYCLE 20
> Info: PERIODIC CELL BASIS 1 108.861 0 0
> Info: PERIODIC CELL BASIS 2 0 108.861 0
> Info: PERIODIC CELL BASIS 3 0 0 77.758
> Info: PERIODIC CELL CENTER 0 0 0
> Info: LOAD BALANCE STRATEGY Other
> Info: LDB PERIOD 4000 steps
> Info: FIRST LDB TIMESTEP 100
> Info: LDB BACKGROUND SCALING 1
> Info: HOM BACKGROUND SCALING 1
> Info: PME BACKGROUND SCALING 1
> Info: MAX SELF PARTITIONS 50
> Info: MAX PAIR PARTITIONS 20
> Info: SELF PARTITION ATOMS 125
> Info: PAIR PARTITION ATOMS 200
> Info: PAIR2 PARTITION ATOMS 400
> Info: MIN ATOMS PER PATCH 100
> Info: INITIAL TEMPERATURE 300
> Info: CENTER OF MASS MOVING? NO
> Info: DIELECTRIC 1
> Info: EXCLUDE SCALED ONE-FOUR
> Info: 1-4 SCALE FACTOR 1
> Info: DCD FILENAME apoa1
> Info: DCD FREQUENCY 1000
> Info: DCD FIRST STEP 1000
> Info: DCD FILE WILL CONTAIN UNIT CELL DATA
> Info: NO EXTENDED SYSTEM TRAJECTORY OUTPUT
> Info: NO VELOCITY DCD OUTPUT
> Info: OUTPUT FILENAME apoa1-out
> Info: BINARY OUTPUT FILES WILL BE USED
> Info: NO RESTART FILE
> Info: SWITCHING ACTIVE
> Info: SWITCHING ON 10
> Info: SWITCHING OFF 12
> Info: PAIRLIST DISTANCE 13.5
> Info: PAIRLIST SHRINK RATE 0.01
> Info: PAIRLIST GROW RATE 0.01
> Info: PAIRLIST TRIGGER 0.3
> Info: PAIRLISTS PER CYCLE 2
> Info: PAIRLISTS ENABLED
> Info: MARGIN 0
> Info: HYDROGEN GROUP CUTOFF 2.5
> Info: PATCH DIMENSION 16
> Info: ENERGY OUTPUT STEPS 500
> Info: CROSSTERM ENERGY INCLUDED IN DIHEDRAL
> Info: TIMING OUTPUT STEPS 500
> Info: PARTICLE MESH EWALD (PME) ACTIVE
> Info: PME TOLERANCE 1e-06
> Info: PME EWALD COEFFICIENT 0.257952
> Info: PME INTERPOLATION ORDER 4
> Info: PME GRID DIMENSIONS 108 108 80
> Info: Attempting to read FFTW data from FFTW_NAMD_2.6b2_Linux-amd64-TCP.txt
> Info: Optimizing 6 FFT steps. 1... 2... 3... 4... 5... 6... Done.
> Info: Writing FFTW data to FFTW_NAMD_2.6b2_Linux-amd64-TCP.txt
> Info: FULL ELECTROSTATIC EVALUATION FREQUENCY 4
> Info: USING VERLET I (r-RESPA) MTS SCHEME.
> Info: C1 SPLITTING OF LONG RANGE ELECTROSTATICS
> Info: PLACING ATOMS IN PATCHES BY HYDROGEN GROUPS
> Info: RANDOM NUMBER SEED 74269
> Info: USE HYDROGEN BONDS? NO
> Info: COORDINATE PDB apoa1.pdb
> Info: STRUCTURE FILE apoa1.psf
> Info: PARAMETER file: XPLOR format! (default)
> Info: PARAMETERS par_all22_prot_lipid.xplor
> Info: PARAMETERS par_all22_popc.xplor
> Info: USING ARITHMETIC MEAN TO COMBINE L-J SIGMA PARAMETERS
> Info: SUMMARY OF PARAMETERS:
> Info: 177 BONDS
> Info: 435 ANGLES
> Info: 446 DIHEDRAL
> Info: 45 IMPROPER
> Info: 0 CROSSTERM
> Info: 83 VDW
> Info: 6 VDW_PAIRS
> Command = zcat apoa1.psf.Z
> Filename.Z = apoa1.psf.Z
> Command = gzip -d -c apoa1.psf.gz
> Filename.gz = apoa1.psf.gz
> Command = zcat apoa1.pdb.Z
> Filename.Z = apoa1.pdb.Z
> Command = gzip -d -c apoa1.pdb.gz
> Filename.gz = apoa1.pdb.gz
> Info: ****************************
> Info: STRUCTURE SUMMARY:
> Info: 92224 ATOMS
> Info: 70660 BONDS
> Info: 74136 ANGLES
> Info: 74130 DIHEDRALS
> Info: 1402 IMPROPERS
> Info: 0 CROSSTERMS
> Info: 0 EXCLUSIONS
> Info: 1568 DIHEDRALS WITH MULTIPLE PERIODICITY (BASED ON PSF FILE)
> Info: 276669 DEGREES OF FREEDOM
> Info: 32992 HYDROGEN GROUPS
> Info: TOTAL MASS = 553785 amu
> Info: TOTAL CHARGE = -14 e
> Info: *****************************
> Info: Entering startup phase 0 with 38804 kB of memory in use.
> Info: Entering startup phase 1 with 38804 kB of memory in use.
> Info: Entering startup phase 2 with 67284 kB of memory in use.
> Info: Entering startup phase 3 with 68008 kB of memory in use.
> Info: PATCH GRID IS 6 (PERIODIC) BY 6 (PERIODIC) BY 4 (PERIODIC)
> Info: REMOVING COM VELOCITY 0.00117959 0.0289175 0.0202933
> Info: LARGEST PATCH (56) HAS 718 ATOMS
> Info: CREATING 10904 COMPUTE OBJECTS
> Info: Entering startup phase 4 with 81640 kB of memory in use.
> Info: PME using 6 and 6 processors for FFT and reciprocal sum.
> Info: PME GRID LOCATIONS: 0 1 2 3 4 5
> Info: PME TRANS LOCATIONS: 0 1 2 3 4 5
> Info: Optimizing 4 FFT steps. 1... 2... 3... 4... Done.
> Info: Entering startup phase 5 with 82888 kB of memory in use.
> Info: Entering startup phase 6 with 82888 kB of memory in use.
> Measuring processor speeds... Done.
> Info: Entering startup phase 7 with 82888 kB of memory in use.
> Info: CREATING 10904 COMPUTE OBJECTS
> Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625
> Info: NONBONDED TABLE SIZE: 769 POINTS
> Info: Entering startup phase 8 with 82028 kB of memory in use.
> Info: Finished startup with 82028 kB of memory in use.
> ETITLE: TS BOND ANGLE DIHED IMPRP
> ELECT VDW BOUNDARY MISC KINETIC
> TOTAL TEMP TOTAL2 TOTAL3 TEMPAVG
> PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG
>
> ENERGY: 0 12352.4560 14603.1480 4549.0710 48.1064
> -362073.7646 24229.5343 0.0000 0.0000 82752.2738
> -223539.1752 301.0300 -223524.8477 -223524.8477 301.0300
> -2568.6725 -3582.2555 921491.4634 -2568.6725 -3582.2555
>
> Info: Initial time: 6 CPUs 0.507152 s/step 5.86982 days/ns 124788 kB memory
> LDB: LOAD: AVG 48.5736 MAX 51.2269 MSGS: TOTAL 180 MAXC 30 MAXP 3 None
> Info: Adjusted background load on 4 nodes.
> LDB: LOAD: AVG 48.6532 MAX 48.7132 MSGS: TOTAL 180 MAXC 30 MAXP 3 Alg7
> LDB: LOAD: AVG 48.6532 MAX 48.7132 MSGS: TOTAL 180 MAXC 30 MAXP 3 Alg7
> Info: Initial time: 6 CPUs 0.512891 s/step 5.93624 days/ns 126852 kB memory
> LDB: LOAD: AVG 49.6683 MAX 49.915 MSGS: TOTAL 180 MAXC 30 MAXP 3 None
> LDB: LOAD: AVG 49.6683 MAX 49.915 MSGS: TOTAL 180 MAXC 30 MAXP 3 Refine
> Info: Initial time: 6 CPUs 0.499921 s/step 5.78612 days/ns 126852 kB memory
> LDB: LOAD: AVG 49.4578 MAX 49.6803 MSGS: TOTAL 180 MAXC 30 MAXP 3 None
> LDB: LOAD: AVG 49.4578 MAX 49.6803 MSGS: TOTAL 180 MAXC 30 MAXP 3 Refine
> Info: Benchmark time: 6 CPUs 0.497711 s/step 5.76054 days/ns 126852 kB memory
> Info: Benchmark time: 6 CPUs 0.507661 s/step 5.8757 days/ns 126852 kB memory
> TIMING: 500 CPU: 248.252, 0.494231/step Wall: 255.61, 0.508663/step,
> 28.1884 hours remaining, 126852 kB of memory in use.
> ENERGY: 500 20974.8939 19756.6574 5724.4523 179.8271
> -337741.4164 23251.1002 0.0000 0.0000 45359.0766
> -222495.4089 165.0039 -222135.7455 -222061.0907 161.6475
> -3197.5168 -2425.4141 921491.4634 -2273.6744 -2277.3151
>
> Info: Benchmark time: 6 CPUs 0.556213 s/step 6.43765 days/ns 126852 kB memory
> TIMING: 1000 CPU: 527.001, 0.557499/step Wall: 547.336, 0.583452/step,
> 32.2519 hours remaining, 128680 kB of memory in use.
> ENERGY: 1000 20703.0380 20129.4440 5682.2714 181.3573
> -339456.8976 24454.7618 0.0000 0.0000 45859.0820
> -222446.9431 166.8227 -222070.6715 -222057.9902 164.8544
> -2204.3503 -2191.2056 921491.4634 -2249.8572 -2250.0385
>
> OPENING COORDINATE DCD FILE
> WRITING COORDINATES TO DCD FILE AT STEP 1000
> TIMING: 1500 CPU: 813.843, 0.573684/step Wall: 849.107, 0.603543/step,
> 33.2787 hours remaining, 132720 kB of memory in use.
>
> .........
>
>
> ENERGY: 25000 22195.1935 21609.1224 5636.4686 206.3381
> -350226.9598 28694.7582 0.0000 0.0000 49427.1863
> -222457.8927 179.8025 -222043.6668 -222038.7477 179.2539
> -1388.6522 -1473.5057 921491.4634 -1610.7261 -1610.3108
>
> WRITING COORDINATES TO DCD FILE AT STEP 25000
> TIMING: 25500 CPU: 12821.8, 0.555515/step Wall: 13217.1, 0.569695/step,
> 27.6144 hours remaining, 139268 kB of memory in use.
> ENERGY: 25500 22413.6842 21687.7772 5669.0633 204.7580
> -350072.0865 28452.2791 0.0000 0.0000 49195.1010
> -222449.4237 178.9583 -222031.9681 -222038.4793 179.4686
> -1751.5360 -1722.6946 921491.4634 -1553.1121 -1552.6865
>
> Warning: Not all atoms have unique coordinates.
> ------------- Processor 2 Exiting: Caught Signal ------------
> Signal: segmentation violation
> Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
> Fatal error on PE 2> segmentation violation
>
>
