Re: NAMD breaks with wrong timestep

From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Fri Oct 21 2016 - 10:54:00 CDT

On Fri, Oct 21, 2016 at 11:41 AM, Brian Radak <bradak_at_anl.gov> wrote:
> I think you are hitting the limit of 32-bit signed integers somewhere
> (2147483647). The code is not always in the habit of using unsigned
> integers where applicable, probably because the step isn't really used for
> anything other than checking the output frequency.

signed vs. unsigned only buys you a factor of 2. that usually does not
help much, and you lose the ability to detect overflows.
using unsigned integers has a lot of issues regardless. outside of
counts that have to be able to count bytes for the full address space
range (e.g. size_t), it is generally better to avoid unsigned integers
and switch to explicit 64-bit integers instead. blindly converting
signed integers to unsigned ones is solving the wrong problem (and
creating new ones in the process).
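
to make the arithmetic concrete, here is a minimal python sketch using
only the two numbers from the quoted log (start step 2100000000, run
length 50000000). the helper wrap_int32 is a hypothetical name, and
two's-complement wraparound is an assumption about what happens inside
the code; the sketch does not claim to know where in NAMD the 32-bit
value lives.

```python
INT32_MAX = 2**31 - 1    # 2147483647, largest signed 32-bit value
UINT32_MAX = 2**32 - 1   # 4294967295, largest unsigned 32-bit value
INT64_MAX = 2**63 - 1    # largest signed 64-bit value


def wrap_int32(n):
    """Reinterpret n modulo 2**32 as a signed two's-complement 32-bit value.

    Hypothetical helper for illustration; assumes wraparound behaviour.
    """
    return (n + 2**31) % 2**32 - 2**31


start_step = 2_100_000_000   # TS column of the ENERGY line in the log
run_steps = 50_000_000       # "TCL: Running for 50000000 steps"
final_step = start_step + run_steps

print(final_step)                  # 2150000000
print(final_step > INT32_MAX)      # True: does not fit in signed 32 bits
print(wrap_int32(final_step))      # -2144967296, the step shown in the log
print(UINT32_MAX // INT32_MAX)     # 2: unsigned only doubles the range
print(INT64_MAX // final_step)     # headroom with a signed 64-bit counter
```

the wrapped value matches the negative step number in the log, and the
negative sign is exactly what makes the overflow easy to spot. with an
unsigned counter the same run would have continued silently until step
4294967295.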

> It might not be satisfying, but you can probably solve this by using the
> "firstTimestep" command to reset the count.

in the case of the (regular) dcd file format (which is derived from
fortran unformatted output with signed 32-bit integers), resetting the
count this way is the reasonable option to follow.
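
a rough python sketch of why the reset helps: a signed 32-bit field
simply cannot hold the final step of this run, while a count restarted
at 0 leaves room for many more chunks. the function names are
hypothetical, the little-endian byte order ("<i") is an assumption, and
this does not reproduce the actual dcd header layout.

```python
import struct

INT32_MAX = 2**31 - 1  # 2147483647


def fits_signed_int32(step):
    """Return True if step can be packed into a signed 32-bit field."""
    try:
        struct.pack("<i", step)  # "<i": little-endian signed 32-bit (assumed)
        return True
    except struct.error:
        return False


def chunks_before_overflow(first_step, chunk_steps):
    """Number of whole chunks that end at or below INT32_MAX."""
    return (INT32_MAX - first_step) // chunk_steps


chunk = 50_000_000  # run length from the quoted log

print(fits_signed_int32(2_100_000_000 + chunk))      # False
print(chunks_before_overflow(2_100_000_000, chunk))  # 0
print(chunks_before_overflow(0, chunk))              # 42
```

so continuing from step 2100000000 leaves no room for even one more
chunk of this length, while a count reset to 0 allows 42 of them before
the same limit is reached again.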

axel.

>
>
> HTH,
>
> Brian
>
>
> On 10/21/2016 08:23 AM, Götz, Alexander wrote:
>
> Hello everybody,
>
>
> I am currently having some trouble with NAMD 2.10 and my simulation
> systems. The systems (I have three nearly identical membrane systems) have
> all run for 2.1 µs in 21 chunks of 100 ns (all-atom CHARMM36, 2 fs
> integration timestep). Everything worked perfectly fine until step 22.
> Whenever I try to start step 22 for any of my systems, I get the following
> error in the NAMD output:
>
>
> TCL: Running for 50000000 steps
> ETITLE: TS BOND ANGLE DIHED IMPRP
> ELECT VDW BOUNDARY MISC KINETIC
> TOTAL TEMP POTENTIAL TOTAL3 TEMPAVG
> PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG
>
> ENERGY: 2100000000 1866.2220 9802.6548 6896.3270
> 78.1218 -103131.8273 3410.1396 0.0000 0.0000
> 26331.7058 -54746.6563 301.5845 -81078.3620 -54571.7237
> 301.5845 -263.9492 -261.2014 396913.4892 -263.9492
> -261.2014
>
> OPENING EXTENDED SYSTEM TRAJECTORY FILE
> WRITING EXTENDED SYSTEM TO OUTPUT FILE AT STEP -2144967296
> CLOSING EXTENDED SYSTEM TRAJECTORY FILE
> WRITING COORDINATES TO OUTPUT FILE AT STEP -2144967296
> COORDINATE DCD FILE <path removed by the author> WAS NOT CREATED
> The last position output (seq=-2) takes 0.006 seconds, 1399.262 MB of memory
> in use
> WRITING VELOCITIES TO OUTPUT FILE AT STEP -2144967296
> The last velocity output (seq=-2) takes 0.004 seconds, 1400.191 MB of memory
> in use
> ====================================================
>
> WallClock: 4.380243 CPUTime: 4.380243 Memory: 1400.191406 MB
> [Partition 0][Node 0] End of program
>
> I am quite confused about this, because I changed nothing in my NAMD
> configuration files except for the file numbering of the restart and output
> files, and these are fine (checked by 3 different people). To me the
> problem seems to be related to the generation of the DCD file. On the
> cluster side, the file system (IBM GPFS) should be fine, because other
> jobs with the same configuration are working and there has not been any
> maintenance that could be related to the observed problems. In addition,
> step 21 of one of the systems worked properly while step 22 of the other
> two systems failed at the same time. Looks a little bit like 22 is a magic
> number?
>
>
> Furthermore, the negative step number in the output, which does not match
> the run steps, is also quite mysterious to me. I hope somebody has a
> tip or a solution for me, because I have checked nearly everything that
> has come to my mind so far.
>
>
> Best Regards
>
>
> Alex
>
>
> --------------------------------------------------------
> Alexander Götz, M.Sc.
> Technische Universität München // Fakultät für Physik
> Lehrstuhl für Bioelektronik E.14
> Maximus-von-Imhof Forum 4 (room P059)
> 85350 Freising, Germany
> T: +49 8161 71-3540
>
> Please consider the environment before printing this email
>
>
>
>
>
> --
> Brian Radak
> Postdoctoral Appointee
> Leadership Computing Facility
> Argonne National Laboratory
>
> 9700 South Cass Avenue, Bldg. 240
> Argonne, IL 60439-4854
> (630) 252-8643
> brian.radak_at_anl.gov

-- 
Dr. Axel Kohlmeyer  akohlmey_at_gmail.com  http://goo.gl/1wk0
College of Science & Technology, Temple University, Philadelphia PA, USA
International Centre for Theoretical Physics, Trieste. Italy.
