Re: Overflow in LJcorrection?

From: Jim Phillips (jim_at_ks.uiuc.edu)
Date: Mon Dec 07 2015 - 16:35:27 CST

Thanks to Peter and Brian this issue should be completely fixed in the
next nightly build (Tuesday, December 8).

Jim

On Sat, 5 Dec 2015, Peter Freddolino wrote:

> Hi Brian,
> Can you send me the test system that you’re using for this? I cannot seem to reproduce the negative LJAvg sums even for a large system.
>
> The integer overflow pointed out by Jason is definitely happening, and I now have a patch ready for that, but I want to figure out this other issue that Brian is seeing.
> Thanks,
> Peter
>
>> On Dec 3, 2015, at 5:15 PM, Brian Radak <bradak_at_anl.gov> wrote:
>>
>> Scratch that fix. Things blow up again if I include non-zero TIP3P hydrogens, even with "unsigned long long int"
>>
>> On 12/03/2015 04:10 PM, Brian Radak wrote:
>>> The error actually happens well before that, since LJAvgA and LJAvgB are first accumulated as sumOfAs, sumOfBs, and count.
>>>
>>> "count" and the N^2 terms are now "unsigned long int"'s. These seems to fix everything.
>>>
>>> This might be too kludgy for some, but why not just hardcode to use the A and B coefficients of a single type (e.g. TIP3P) if the number of atoms of that kind is so much larger than all others? I had heard that GROMACS did something like this and just considered it a "solvent dispersion correction." That might be more stable long term than an arms race in terms of memory footprint and hardly distinguishable in practice...
>>>
>>> Brian
>>>
>>> On 12/03/2015 04:04 PM, Jason Swails wrote:
>>>> The tail correction is calculated as part of setup (it appears) in lines 9127 to 9133 in Molecule.C:
>>>>
>>>> if (simParams->switchingActive) {
>>>> tail_corr_ener = tail_corr_virial = (16*numAtoms*numAtoms*PI*(-105*LJAvgB*rcut5*rswitch5 + LJAvgA*(3*rcut4 + 9*rcut3*rswitch + 11*rcut2*rswitch2 + 9*rcut*rswitch3 + 3*rswitch4)))/ (315*rcut5*rswitch5*((rcut + rswitch)*(rcut + rswitch)*(rcut + rswitch)));
>>>> }
>>>> else {
>>>> tail_corr_virial = -4 * numAtoms * numAtoms * PI * (3 * LJAvgB * rcut3 * rcut3 - 2 * LJAvgA) / (9 * rcut3 * rcut3 * rcut3);
>>>> tail_corr_ener = 2 * numAtoms * numAtoms * PI * (LJAvgA - 3 * LJAvgB * rcut3 * rcut3) / (9 * rcut3 * rcut3 * rcut3);
>>>> }
>>>>
>>>> The problem here is that 16 and -4 are both integers, as is numAtoms. So when the switching function is off, it starts computing 16*numAtoms*numAtoms*PI -- since multiplication is L->R associative, 16*numAtoms*numAtoms all gets stored in an integer, which overflows for large particle counts. Or when switching is off, -4*numAtoms*numAtoms, which also eventually overflows (it takes well less than 100K atoms for this to happen).
>>>>
>>>> If you moved PI to the *beginning* of the expression, or changed 16, 4, and 2 to doubles via 16.0., -4.0l, and 2.0l, then this problem should probably disappear. The way I fixed this in OpenMM was to explicitly cast numAtoms to double in this expression. But that's probably not necessary since if the leading factor is double the ints should get promoted automatically.
>>>>
>>>> Brian -- can you try that?
>>>>
>>>> All the best,
>>>> Jason
>>>>
>>>> On Thu, Dec 3, 2015 at 4:55 PM, Peter Freddolino <petefred_at_umich.edu> wrote:
>>>> Hi Brian,
>>>> Out of curiosity, is this dependent on having FEP or related settings on, or no?
>>>> Thanks,
>>>> Peter
>>>>
>>>>> On Dec 3, 2015, at 4:45 PM, Brian Radak <bradak_at_anl.gov> wrote:
>>>>>
>>>>> I've been debugging/testing alchemical corrections to the LJcorrection when I stumbled upon this weird behavior:
>>>>>
>>>>> For largish systems (I just threw and asp in a big box, 163737 atoms), the average A and B coefficients comes back negative. This seems to go away (sometimes?) if I zero out the hydrogen LJ epsilon parameter in the parameter file. Some strategic print statements seem to suggest that the "sumOfAs" and "sumOfBs" counters go negative when the next increment is above ~1e12?
>>>>>
>>>>> Here is output for vdw type 60 (OT from TIP3P) and the counters for sumOfAs and sumOfBs when it first goes negative.
>>>>> The increments should be (54469-1)*54469*A/B
>>>>>
>>>>> pair : 60/60 : A = 581924 : B = 595.015 | counts : 54469/54469
>>>>> 5.0349e+13-->-7.22533e+14
>>>>> 5.86584e+10-->-7.3161e+11
>>>>>
>>>>> The second column should be: 2.229940435e+14 and 1.8239593e+12
>>>>>
>>>>> Does this look like overflow? I can't imagine a double precision value having trouble here, although the atom counts are 32 bit signed integers (or whatever the compiler understands "int" to be).
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Jason M. Swails
>>>> BioMaPS,
>>>> Rutgers University
>>>> Postdoctoral Researcher
>>>
>>> --
>>> Brian Radak
>>> Theta Early Science Program Postdoctoral Appointee
>>> Leadership Computing Facility
>>> Argonne National Laboratory
>>>
>>> 9700 South Cass Avenue
>>> Building 240, 1.D.16
>>> Lemont, IL 60439-4871
>>> Tel: 630/252-8643
>>> email: bradak_at_anl.gov
>>
>> --
>> Brian Radak
>> Theta Early Science Program Postdoctoral Appointee
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>> 9700 South Cass Avenue
>> Building 240, 1.D.16
>> Lemont, IL 60439-4871
>> Tel: 630/252-8643
>> email: bradak_at_anl.gov
>
>

This archive was generated by hypermail 2.1.6 : Tue Dec 27 2016 - 23:21:37 CST