Re: Overflow in LJcorrection?

From: Brian Radak (bradak_at_anl.gov)
Date: Thu Dec 03 2015 - 16:15:59 CST

Next message: Jason Swails: "Re: Overflow in LJcorrection?"
Previous message: Jason Swails: "Re: Overflow in LJcorrection?"
In reply to: Brian Radak: "Re: Overflow in LJcorrection?"
Next in thread: Jason Swails: "Re: Overflow in LJcorrection?"
Reply: Jason Swails: "Re: Overflow in LJcorrection?"
Reply: Peter Freddolino: "Re: Overflow in LJcorrection?"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]

Scratch that fix. Things blow up again if I include non-zero TIP3P
hydrogens, even with "unsigned long long int"

On 12/03/2015 04:10 PM, Brian Radak wrote:
> The error actually happens well before that, since LJAvgA and LJAvgB
> are first accumulated as sumOfAs, sumOfBs, and count.
>
> "count" and the N^2 terms are now "unsigned long int"'s. These seems
> to fix everything.
>
> This might be too kludgy for some, but why not just hardcode to use
> the A and B coefficients of a single type (e.g. TIP3P) if the number
> of atoms of that kind is so much larger than all others? I had heard
> that GROMACS did something like this and just considered it a "solvent
> dispersion correction." That might be more stable long term than an
> arms race in terms of memory footprint and hardly distinguishable in
> practice...
>
> Brian
>
> On 12/03/2015 04:04 PM, Jason Swails wrote:
>> The tail correction is calculated as part of setup (it appears) in
>> lines 9127 to 9133 in Molecule.C:
>>
>> if (simParams->switchingActive) {
>> tail_corr_ener = tail_corr_virial =
>> (16*numAtoms*numAtoms*PI*(-105*LJAvgB*rcut5*rswitch5 +
>> LJAvgA*(3*rcut4 + 9*rcut3*rswitch + 11*rcut2*rswitch2 +
>> 9*rcut*rswitch3 + 3*rswitch4)))/ (315*rcut5*rswitch5*((rcut +
>> rswitch)*(rcut + rswitch)*(rcut + rswitch)));
>> }
>> else {
>> tail_corr_virial = -4 * numAtoms * numAtoms * PI * (3 *
>> LJAvgB * rcut3 * rcut3 - 2 * LJAvgA) / (9 * rcut3 * rcut3 * rcut3);
>> tail_corr_ener = 2 * numAtoms * numAtoms * PI * (LJAvgA - 3 *
>> LJAvgB * rcut3 * rcut3) / (9 * rcut3 * rcut3 * rcut3);
>> }
>>
>> The problem here is that 16 and -4 are both integers, as is
>> numAtoms. So when the switching function is off, it starts computing
>> 16*numAtoms*numAtoms*PI -- since multiplication is L->R associative,
>> 16*numAtoms*numAtoms all gets stored in an integer, which overflows
>> for large particle counts. Or when switching is off,
>> -4*numAtoms*numAtoms, which also eventually overflows (it takes well
>> less than 100K atoms for this to happen).
>>
>> If you moved PI to the *beginning* of the expression, or changed 16,
>> 4, and 2 to doubles via 16.0., -4.0l, and 2.0l, then this problem
>> should probably disappear. The way I fixed this in OpenMM was to
>> explicitly cast numAtoms to double in this expression. But that's
>> probably not necessary since if the leading factor is double the ints
>> should get promoted automatically.
>>
>> Brian -- can you try that?
>>
>> All the best,
>> Jason
>>
>> On Thu, Dec 3, 2015 at 4:55 PM, Peter Freddolino <petefred_at_umich.edu
>> <mailto:petefred_at_umich.edu>> wrote:
>>
>> Hi Brian,
>> Out of curiosity, is this dependent on having FEP or related
>> settings on, or no?
>> Thanks,
>> Peter
>>
>> > On Dec 3, 2015, at 4:45 PM, Brian Radak <bradak_at_anl.gov> wrote:
>> >
>> > I've been debugging/testing alchemical corrections to the
>> LJcorrection when I stumbled upon this weird behavior:
>> >
>> > For largish systems (I just threw and asp in a big box, 163737
>> atoms), the average A and B coefficients comes back negative.
>> This seems to go away (sometimes?) if I zero out the hydrogen LJ
>> epsilon parameter in the parameter file. Some strategic print
>> statements seem to suggest that the "sumOfAs" and "sumOfBs"
>> counters go negative when the next increment is above ~1e12?
>> >
>> > Here is output for vdw type 60 (OT from TIP3P) and the counters
>> for sumOfAs and sumOfBs when it first goes negative.
>> > The increments should be (54469-1)*54469*A/B
>> >
>> > pair : 60/60 : A = 581924 : B = 595.015 | counts : 54469/54469
>> > 5.0349e+13-->-7.22533e+14
>> > 5.86584e+10-->-7.3161e+11
>> >
>> > The second column should be: 2.229940435e+14 and 1.8239593e+12
>> >
>> > Does this look like overflow? I can't imagine a double
>> precision value having trouble here, although the atom counts are
>> 32 bit signed integers (or whatever the compiler understands
>> "int" to be).
>>
>>
>>
>>
>>
>> --
>> Jason M. Swails
>> BioMaPS,
>> Rutgers University
>> Postdoctoral Researcher
>
> --
> Brian Radak
> Theta Early Science Program Postdoctoral Appointee
> Leadership Computing Facility
> Argonne National Laboratory
>
> 9700 South Cass Avenue
> Building 240, 1.D.16
> Lemont, IL 60439-4871
> Tel: 630/252-8643
> email: bradak_at_anl.gov

-- 
Brian Radak
Theta Early Science Program Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
9700 South Cass Avenue
Building 240, 1.D.16
Lemont, IL 60439-4871
Tel: 630/252-8643
email: bradak_at_anl.gov

Next message: Jason Swails: "Re: Overflow in LJcorrection?"
Previous message: Jason Swails: "Re: Overflow in LJcorrection?"
In reply to: Brian Radak: "Re: Overflow in LJcorrection?"
Next in thread: Jason Swails: "Re: Overflow in LJcorrection?"
Reply: Jason Swails: "Re: Overflow in LJcorrection?"
Reply: Peter Freddolino: "Re: Overflow in LJcorrection?"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]

This archive was generated by hypermail 2.1.6 : Tue Dec 27 2016 - 23:21:36 CST