Re: NAMD breaks with wrong timestep

From: Brian Radak (bradak_at_anl.gov)
Date: Fri Oct 21 2016 - 15:20:19 CDT

As usual, I have missed something subtle - thanks Axel.

It sounds like the consensus is that the "firstTimestep" solution is the
best and fastest option.
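
For the record, the negative step in Alex's log is consistent with 32-bit wraparound. Below is a minimal Python sketch of that arithmetic, assuming the step counter behaves like a two's-complement 32-bit signed integer. `wrap_int32` and `needs_reset` are hypothetical helpers written for illustration, not NAMD functions; the numbers come from the quoted log.

```python
INT32_MAX = 2**31 - 1  # 2147483647, largest 32-bit signed value


def wrap_int32(value):
    """Emulate two's-complement wraparound of a 32-bit signed integer."""
    return (value + 2**31) % 2**32 - 2**31


def needs_reset(first_step, run_steps):
    """True if first_step + run_steps no longer fits in a signed 32-bit int."""
    return first_step + run_steps > INT32_MAX


# Numbers taken from the log: the run starts at step 2100000000
# and is asked to run for 50000000 steps.
first_step = 2_100_000_000
run_steps = 50_000_000

final_step = wrap_int32(first_step + run_steps)
print(final_step)                          # -2144967296, as in the log
print(needs_reset(first_step, run_steps))  # True
print(needs_reset(0, run_steps))           # False once the count is reset
```

The last line is the reason resetting the count with "firstTimestep" should work: starting again from a small value keeps the final step well below the limit.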

Going forward, is this a problem that needs to be addressed? If we do not
switch to 64-bit ints, is there a more sensible behavior that can replace
the current one? Should the step roll over to zero? Wouldn't that have to
be done in such a way that the output frequencies are still respected? Or
should there simply be a limit on the number of steps a user is permitted
to reach?
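
To make the output-frequency concern concrete: if output is triggered by a check like `step % freq == 0` (an assumption about the logic, not a quote from the NAMD source), then rolling over to zero only preserves the output spacing when the rollover point is a multiple of every output frequency. A rough Python sketch, where `output_steps` and `spacings` are hypothetical helpers and the frequency of 5000 is chosen purely for illustration:

```python
def output_steps(start, n_steps, freq, rollover):
    """Return elapsed-step indices at which `counter % freq == 0` fires.

    The counter starts at `start` and is reset to zero when it reaches
    `rollover`. Indices count steps elapsed since `start`.
    """
    fired = []
    counter = start
    for elapsed in range(n_steps + 1):
        if counter % freq == 0:
            fired.append(elapsed)
        counter += 1
        if counter == rollover:
            counter = 0
    return fired


def spacings(fired):
    """Gaps between consecutive outputs."""
    return [b - a for a, b in zip(fired, fired[1:])]


freq = 5000

# Rollover at a multiple of freq: every gap stays at 5000 steps.
aligned = output_steps(start=2_000_000_000 - 10_000, n_steps=20_000,
                       freq=freq, rollover=2_000_000_000)
print(spacings(aligned))     # [5000, 5000, 5000, 5000]

# Rollover at 2**31, which is not a multiple of 5000: one gap is shortened.
misaligned = output_steps(start=2**31 - 10_000, n_steps=20_000,
                          freq=freq, rollover=2**31)
print(spacings(misaligned))  # [5000, 3648, 5000, 5000]
```

So a rollover at the natural 32-bit boundary would shift the output phase unless the rollover point were chosen with the output frequencies in mind.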

Brian

On 10/21/2016 10:54 AM, Axel Kohlmeyer wrote:
> On Fri, Oct 21, 2016 at 11:41 AM, Brian Radak <bradak_at_anl.gov> wrote:
>> I think you are hitting the limit of 32-bit signed integers somewhere
>> (2147483647). The code does not consistently use unsigned integers where
>> applicable, probably because the step isn't really used for anything other
>> than checking the output frequency.
> signed vs. unsigned only buys you a factor of 2. that usually does not
> help much, and you lose the ability to detect overflows.
> using unsigned integers has a lot of issues regardless. outside of
> counts that have to be able to count bytes for the full address space
> range (e.g. size_t), it is generally better to avoid unsigned integers
> and switch to explicit 64-bit integers instead. generally, blindly
> converting signed integers to unsigned ones is solving the wrong
> problem (and creating new ones in the process).
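
For scale, here is a small Python sketch of how far each integer width can count. The 2 fs timestep is the value quoted in the original post; `LIMITS` and `simulated_time_us` are hypothetical names used only for this illustration.

```python
FS_PER_US = 1e9  # femtoseconds per microsecond

# Largest step number representable by each integer type.
LIMITS = {
    "int32": 2**31 - 1,
    "uint32": 2**32 - 1,
    "int64": 2**63 - 1,
}


def simulated_time_us(max_step, timestep_fs=2.0):
    """Simulated time in microseconds reachable before the counter overflows."""
    return max_step * timestep_fs / FS_PER_US


for name, max_step in LIMITS.items():
    print(f"{name:>6}: {max_step} steps, "
          f"{simulated_time_us(max_step):.3g} us at 2 fs")
```

Going unsigned only doubles the reachable range, whereas a 64-bit counter moves the limit far beyond any practical run length.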
>
>> It might not be satisfying, but you can probably solve this by using the
>> "firstTimestep" command to reset the count.
> in the case of the (regular) dcd file format (which is derived from
> fortran unformatted output with signed 32-bit integers) the latter is
> the reasonable option to follow.
>
> axel.
>
>>
>> HTH,
>>
>> Brian
>>
>>
>> On 10/21/2016 08:23 AM, Götz, Alexander wrote:
>>
>> Hello everybody,
>>
>>
>> I am currently having trouble with NAMD 2.10 and my simulation systems. The
>> systems (three nearly identical membrane systems) have all run for 2.1 µs
>> in 21 chunks of 100 ns (all-atom CHARMM36, 2 fs integration timestep).
>> Everything worked perfectly fine until chunk 22. Whenever I try to start
>> chunk 22 for any of my systems, I get the following error in the NAMD output:
>>
>>
>> TCL: Running for 50000000 steps
>> ETITLE: TS BOND ANGLE DIHED IMPRP
>> ELECT VDW BOUNDARY MISC KINETIC
>> TOTAL TEMP POTENTIAL TOTAL3 TEMPAVG
>> PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG
>>
>> ENERGY: 2100000000 1866.2220 9802.6548 6896.3270
>> 78.1218 -103131.8273 3410.1396 0.0000 0.0000
>> 26331.7058 -54746.6563 301.5845 -81078.3620 -54571.7237
>> 301.5845 -263.9492 -261.2014 396913.4892 -263.9492
>> -261.2014
>>
>> OPENING EXTENDED SYSTEM TRAJECTORY FILE
>> WRITING EXTENDED SYSTEM TO OUTPUT FILE AT STEP -2144967296
>> CLOSING EXTENDED SYSTEM TRAJECTORY FILE
>> WRITING COORDINATES TO OUTPUT FILE AT STEP -2144967296
>> COORDINATE DCD FILE <path removed by the author> WAS NOT CREATED
>> The last position output (seq=-2) takes 0.006 seconds, 1399.262 MB of memory
>> in use
>> WRITING VELOCITIES TO OUTPUT FILE AT STEP -2144967296
>> The last velocity output (seq=-2) takes 0.004 seconds, 1400.191 MB of memory
>> in use
>> ====================================================
>>
>> WallClock: 4.380243 CPUTime: 4.380243 Memory: 1400.191406 MB
>> [Partition 0][Node 0] End of program
>>
>> I am quite confused about this, because I changed nothing in my NAMD
>> configuration files except the numbering of the restart and output
>> files, and these are fine (checked by three different people). To me the
>> problem seems to be related to the generation of the DCD file. On the cluster
>> side, the file system (IBM GPFS) should be fine, because other
>> jobs with the same configuration are working and there has not been any
>> maintenance that could be related to the observed problems. In addition,
>> chunk 21 of one of the systems worked properly while chunk 22 of the other two
>> systems failed at the same time. It looks a bit as if 22 is a magic
>> number?
>>
>>
>> Furthermore, the negative step number in the output, which is not in line
>> with the run steps, is also quite mysterious to me. I hope somebody has a
>> tip or a solution for me, because I have checked nearly everything that has
>> come to mind so far.
>>
>>
>> Best Regards
>>
>>
>> Alex
>>
>>
>> --------------------------------------------------------
>> Alexander Götz, M.Sc.
>> Technische Universität München // Fakultät für Physik
>> Lehrstuhl für Bioelektronik E.14
>> Maximus-von-Imhof Forum 4 (room P059)
>> 85350 Freising, Germany
>> T: +49 8161 71-3540
>>
>>
>>
>>
>>
>>
>> --
>> Brian Radak
>> Postdoctoral Appointee
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>> 9700 South Cass Avenue, Bldg. 240
>> Argonne, IL 60439-4854
>> (630) 252-8643
>> brian.radak_at_anl.gov
>
>

-- 
Brian Radak
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
9700 South Cass Avenue, Bldg. 240
Argonne, IL 60439-4854
(630) 252-8643
brian.radak_at_anl.gov
