Re: Overflow in LJcorrection?

From: Jason Swails (jason.swails_at_gmail.com)
Date: Thu Dec 03 2015 - 16:15:15 CST

Ah, well the code I highlighted is identical to the problematic code that
was in OpenMM, so it wouldn't surprise me if that was problematic here as
well. We casted to double to get around the problem. I think if you just
use unsigned long you're pushing the problem a little farther down the road
(but still well within the # of atoms that namd is routinely applied to).
All it does is increase by a factor of 2 the largest integer you can store,
before it loops and becomes small again. long long is probably the way to
go if you really wanted to avoid double.

On Thu, Dec 3, 2015 at 5:10 PM, Brian Radak <bradak_at_anl.gov> wrote:

> The error actually happens well before that, since LJAvgA and LJAvgB are
> first accumulated as sumOfAs, sumOfBs, and count.
>
> "count" and the N^2 terms are now "unsigned long int"'s. These seems to
> fix everything.
>
> This might be too kludgy for some, but why not just hardcode to use the A
> and B coefficients of a single type (e.g. TIP3P) if the number of atoms of
> that kind is so much larger than all others? I had heard that GROMACS did
> something like this and just considered it a "solvent dispersion
> correction." That might be more stable long term than an arms race in terms
> of memory footprint and hardly distinguishable in practice...
>
> Brian
>
>
> On 12/03/2015 04:04 PM, Jason Swails wrote:
>
> The tail correction is calculated as part of setup (it appears) in lines
> 9127 to 9133 in Molecule.C:
>
> if (simParams->switchingActive) {
> tail_corr_ener = tail_corr_virial =
> (16*numAtoms*numAtoms*PI*(-105*LJAvgB*rcut5*rswitch5 + LJAvgA*(3*rcut4 +
> 9*rcut3*rswitch + 11*rcut2*rswitch2 + 9*rcut*rswitch3 + 3*rswitch4)))/
> (315*rcut5*rswitch5*((rcut + rswitch)*(rcut + rswitch)*(rcut + rswitch)));
> }
> else {
> tail_corr_virial = -4 * numAtoms * numAtoms * PI * (3 * LJAvgB *
> rcut3 * rcut3 - 2 * LJAvgA) / (9 * rcut3 * rcut3 * rcut3);
> tail_corr_ener = 2 * numAtoms * numAtoms * PI * (LJAvgA - 3 *
> LJAvgB * rcut3 * rcut3) / (9 * rcut3 * rcut3 * rcut3);
> }
>
> The problem here is that 16 and -4 are both integers, as is numAtoms. So
> when the switching function is off, it starts computing
> 16*numAtoms*numAtoms*PI -- since multiplication is L->R associative,
> 16*numAtoms*numAtoms all gets stored in an integer, which overflows for
> large particle counts. Or when switching is off, -4*numAtoms*numAtoms,
> which also eventually overflows (it takes well less than 100K atoms for
> this to happen).
>
> If you moved PI to the *beginning* of the expression, or changed 16, 4,
> and 2 to doubles via 16.0., -4.0l, and 2.0l, then this problem should
> probably disappear. The way I fixed this in OpenMM was to explicitly cast
> numAtoms to double in this expression. But that's probably not necessary
> since if the leading factor is double the ints should get promoted
> automatically.
>
> Brian -- can you try that?
>
> All the best,
> Jason
>
> On Thu, Dec 3, 2015 at 4:55 PM, Peter Freddolino <petefred_at_umich.edu>
> wrote:
>
>> Hi Brian,
>> Out of curiosity, is this dependent on having FEP or related settings on,
>> or no?
>> Thanks,
>> Peter
>>
>> > On Dec 3, 2015, at 4:45 PM, Brian Radak < <bradak_at_anl.gov>
>> bradak_at_anl.gov> wrote:
>> >
>> > I've been debugging/testing alchemical corrections to the LJcorrection
>> when I stumbled upon this weird behavior:
>> >
>> > For largish systems (I just threw and asp in a big box, 163737 atoms),
>> the average A and B coefficients comes back negative. This seems to go away
>> (sometimes?) if I zero out the hydrogen LJ epsilon parameter in the
>> parameter file. Some strategic print statements seem to suggest that the
>> "sumOfAs" and "sumOfBs" counters go negative when the next increment is
>> above ~1e12?
>> >
>> > Here is output for vdw type 60 (OT from TIP3P) and the counters for
>> sumOfAs and sumOfBs when it first goes negative.
>> > The increments should be (54469-1)*54469*A/B
>> >
>> > pair : 60/60 : A = 581924 : B = 595.015 | counts : 54469/54469
>> > 5.0349e+13-->-7.22533e+14
>> > 5.86584e+10-->-7.3161e+11
>> >
>> > The second column should be: 2.229940435e+14 and 1.8239593e+12
>> >
>> > Does this look like overflow? I can't imagine a double precision value
>> having trouble here, although the atom counts are 32 bit signed integers
>> (or whatever the compiler understands "int" to be).
>>
>>
>>
>
>
> --
> Jason M. Swails
> BioMaPS,
> Rutgers University
> Postdoctoral Researcher
>
>
> --
> Brian Radak
> Theta Early Science Program Postdoctoral Appointee
> Leadership Computing Facility
> Argonne National Laboratory
>
> 9700 South Cass Avenue
> Building 240, 1.D.16
> Lemont, IL 60439-4871
> Tel: 630/252-8643
> email: bradak_at_anl.gov
>

-- 
Jason M. Swails
BioMaPS,
Rutgers University
Postdoctoral Researcher

This archive was generated by hypermail 2.1.6 : Thu Dec 31 2015 - 23:22:17 CST