Re: 4 sequential jobs work on laptops, but only first one works on supercomputer

From: JT (jtibbitt_at_odu.edu)
Date: Mon Apr 13 2009 - 22:02:54 CDT

Ok, I found another thread (from 2 years ago) of people reporting
similar 'segmentation violation' problems when running on their
cluster. It seemed to be a major problem that they never quite
fixed. Does anyone know if NAMD 2.7b1 addressed any issues like this?

http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l/5770.html

On Apr 13, 2009, at 10:59 PM, JT wrote:

> Here is a little more information about the crash occurring during
> minimization. Running in single-processor mode creates some
> sort of memory leak. It writes an enormous amount of junk to the
> NAMD output that looks like:
>
> Info: Finished startup with 27120 kB of memory in use.
> TCL: Minimizing for 2000 steps
> ETITLE: TS BOND ANGLE
> DIHED IMPRP
> ELECT VDW BOUNDARY MISC
> KINETIC
> TOTAL TEMP TOTAL2 TOTAL3
> TEMPAVG
> PRESSURE GPRESSURE VOLUME PRESSAVG
> GPRESSAVG
>
> ENERGY: 0 491.9037 99999999.9999 76.3226
> 8.5673
> -428179.8422 1346.5543 0.0000 0.0000
> 0.0000
> 99999999.9999 0.0000 99999999.9999 99999999.9999
> 0.0000
> 99999999.9999 99999999.9999 272212.3858 99999999.9999
> 99999999.9999
>
> INITIAL STEP: 1e-06
> GRADIENT TOLERANCE: nan
> BRACKET: 0 nan nan nan nan
> RESTARTING CONJUGATE GRADIENT ALGORITHM
> INITIAL STEP: 5e-07
> GRADIENT TOLERANCE: nan
> BRACKET: 0 nan nan nan nan
> RESTARTING CONJUGATE GRADIENT ALGORITHM
> INITIAL STEP: 2.5e-07
> GRADIENT TOLERANCE: nan
> BRACKET: 0 nan nan nan nan
> NEW SEARCH DIRECTION
> .
> .
> .
> This continues on for more than 4 gigabytes (until I noticed what
> was happening and killed the job).
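
One way to avoid filling the disk in a case like this is to scan the log
while it is being written and stop at the first sign of trouble. Below is
a minimal Python sketch, not part of NAMD. The function name
first_bad_line and the choice of what counts as "bad" (a nan token
anywhere, or the value 99999999.9999 on an ENERGY line, both as seen in
the output above) are assumptions for illustration only.

```python
def first_bad_line(lines, sentinel="99999999.9999"):
    """Return (line_number, text) for the first suspicious log line, or None.

    A line is treated as suspicious (an assumption, based on the pasted log)
    when it contains a bare "nan" token, or when it is an ENERGY line that
    contains the sentinel value seen in the crashed run.
    """
    for number, line in enumerate(lines, start=1):
        tokens = line.split()
        if not tokens:
            continue
        # Minimizer diagnostics such as "GRADIENT TOLERANCE: nan"
        if any(token.lower() == "nan" for token in tokens):
            return number, line.rstrip("\n")
        # Energy output where at least one field shows the sentinel value
        if tokens[0] == "ENERGY:" and sentinel in tokens[1:]:
            return number, line.rstrip("\n")
    return None
```

Because the function accepts any iterable of lines, it can be given an
open file object and will return as soon as the first bad line is read,
without loading a multi-gigabyte log into memory.
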
>
>
>
> Running the same job using parallel execution causes the job
> to crash. The cluster log file states something about a
> segmentation violation:
>
> -catch_rsh /opt/gridengine/default/spool/zorka-0-34/active_jobs/6711.1/pe_hostfile
> zorka-0-34.local:4
> Warning: no access to tty (Bad file descriptor).
> Thus no job control in this shell.
> *** RUN MY JOB ***
> Charmrun> charmrun started...
> Charmrun> using /tmp/6711.1.q2x2/namd2.nodelist as nodesfile
> Charmrun> rsh (zorka-0-34:0) started
> Charmrun> rsh (zorka-0-34:1) started
> Charmrun> rsh (zorka-0-34:2) started
> Charmrun> rsh (zorka-0-34:3) started
> Charmrun> node programs all started
> Warning: Permanently added 'zorka-0-34' (RSA) to the list of known hosts.
> Warning: Permanently added 'zorka-0-34' (RSA) to the list of known hosts.
> Charmrun> node programs all connected
> ------------- Processor 2 Exiting: Caught Signal ------------
> Signal: segmentation violation
> Suggestion: Try running with '++debug', or linking with '-memory paranoid'.
> Fatal error on PE 2> segmentation violation
>
>
> Anyone know what is going on? Why does the first minimization work,
> but the second one does not?
> Jeff
>
>
> On Apr 13, 2009, at 7:00 PM, JT wrote:
>
>> Hey NAMD community,
>> A little trouble here. There are four sequential jobs to be run on a
>> solvated polymer:
>> 1. minimize surrounding solvent, polymer fixed
>> 2. minimize polymer, solvent fixed
>> 3. minimize both
>> 4. equilibrate
>>
>> The first job uses a solvated polymer built with VMD. Each
>> successive job then takes the output files from the previous job
>> (.coor, .xsc, ...) as its starting point. These jobs work fine on
>> two different laptop platforms (OSX and XP), but when I try the
>> exact same input files on a supercomputer, only the first
>> minimization works. Two different supercomputers have been tried,
>> with serial and parallel runs on both, and the same thing happens
>> each time.
>>
>> Info about one of the clusters: forty compute nodes, each with two
>> 3-gigahertz dual-core Intel processors and 8 gigabytes of memory.
>> When it crashes, it's like NAMD thinks the system is exploding or
>> something (high energies, nan written everywhere, ...). Output from
>> the crash is pasted in below, along with links to the other input
>> and output files. BTW, if it is better to just attach files, or not
>> to attach them (unless asked), please let me know.
>>
>> I also tried running the minimizations on my laptop, then using the
>> supercomputers only for the equilibration. Similar errors occur.
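
Since each job starts from the previous job's output, one check worth
making before resubmitting is whether the restart coordinates themselves
are sane on the machine that will read them. The sketch below assumes the
.coor file is in NAMD's binary format (a 32-bit atom count followed by
three 64-bit doubles per atom) and that it may have been written in
either byte order. The function name count_nonfinite_coords is
hypothetical, and if the files were written as PDB text instead, this
check does not apply.

```python
import math
import struct


def count_nonfinite_coords(data):
    """Return (atom_count, nonfinite_count) for binary coordinate bytes.

    Assumed layout: one 32-bit integer atom count, then 3 * atom_count
    64-bit floats (x, y, z per atom). Both byte orders are tried, and the
    one whose atom count matches the total size is used.
    """
    if len(data) < 4:
        raise ValueError("too short to hold an atom count")
    for order in ("<", ">"):  # little-endian first, then big-endian
        (natoms,) = struct.unpack(order + "i", data[:4])
        if natoms > 0 and len(data) == 4 + 24 * natoms:
            coords = struct.unpack(f"{order}{3 * natoms}d", data[4:])
            bad = sum(1 for value in coords if not math.isfinite(value))
            return natoms, bad
    raise ValueError("size does not match the assumed layout in either byte order")
```

A call would look like count_nonfinite_coords(open("min1.coor",
"rb").read()), where min1.coor is a placeholder name. A nonzero second
value means the coordinates were already broken before the next job
started; a ValueError suggests the file is not in the assumed format.
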
>>
>> Thank you for looking,
>> Jeff Tibbitt
>>
>>
>> Input files:
>> www.drclawslaboratory.com/clawup/namd_min1.tcl
>> www.drclawslaboratory.com/clawup/namd_min2.tcl
>> www.drclawslaboratory.com/clawup/namd_min3.tcl
>> www.drclawslaboratory.com/clawup/namd_equi.tcl
>>
>>
>> Output files (run on OSX laptop):
>> www.drclawslaboratory.com/clawup/namd_min1.out
>> www.drclawslaboratory.com/clawup/namd_min2.out
>> www.drclawslaboratory.com/clawup/namd_min3.out
>> www.drclawslaboratory.com/clawup/namd_equi.out
>>
>>
>> Output files (run on supercomputer):
>> www.drclawslaboratory.com/clawup/namd_min1_WORKED.out
>> www.drclawslaboratory.com/clawup/namd_min2_CRASHED.out
>>
>>
>> Queue submission script files (for running on supercomputer):
>> www.drclawslaboratory.com/clawup/s2_sp
>> www.drclawslaboratory.com/clawup/s2_mp
>>
>>
>> Output from minimization that crashed on supercomputer
>> (namd_min2_CRASHED.out):
>> TCL: Minimizing for 2000 steps
>> ETITLE: TS BOND ANGLE DIHED
>> IMPRP ELECT VDW BOUNDARY
>> MISC KINETIC TOTAL TEMP
>> TOTAL2 TOTAL3 TEMPAVG
>> PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG
>>
>> ENERGY: 0 491.9037 99999999.9999 76.3226
>> 8.5673 -428179.8422 1346.5543 0.0000
>> 0.0000 0.0000 99999999.9999 0.0000
>> 99999999.9999 99999999.9999 0.0000 99999999.9999
>> 99999999.9999 272212.3858 99999999.9999 99999999.9999
>>
>> INITIAL STEP: 1e-06
>> GRADIENT TOLERANCE: nan
>> BRACKET: 0 nan nan nan nan
>> RESTARTING CONJUGATE GRADIENT ALGORITHM
>> INITIAL STEP: 5e-07
>

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:52:36 CST