Re: Periodic cell too small with GPUs, not with pure CPUs

From: JC Gumbart (gumbart_at_physics.gatech.edu)
Date: Sat Aug 26 2017 - 08:11:13 CDT

Thanks! That’s very helpful. Even my smallest systems are at least two patches in every dimension.

Best,
JC

> On Aug 25, 2017, at 9:34 PM, David Hardy <dhardy_at_ks.uiuc.edu> wrote:
>
> Hi JC,
>
> The issue was with the CUDA nonbonded compute kernels introduced in NAMD 2.12. When a system is small enough that a patch is its own periodic image along some dimension, the kernel previously looked purely at atom indexing to determine "self-interactions," and thus summed the electrostatic and van der Waals interactions incorrectly.
>
> As far as I can tell, there should have been no problem if your system had at least two patches along each dimension. Also, since the bug produced bad forces, it generally created significant instabilities during simulation, resulting in extremely large pressure values, missing bonds and exclusions due to atoms moving too fast, NaNs, and segfaults.
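>
> As a quick sanity check (the exact wording may differ between NAMD versions), the patch grid is printed near the top of the NAMD log, so you can confirm that every dimension has at least two patches by looking for a line like:
>
>     Info: PATCH GRID IS 4 BY 4 BY 3 (PERIODIC)
>
> Any dimension reported as 1 would have been exposed to the bug described above.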
>
> Best regards,
> Dave
>
> --
> David J. Hardy, Ph.D.
> Theoretical and Computational Biophysics
> Beckman Institute, University of Illinois
> dhardy_at_ks.uiuc.edu
> http://www.ks.uiuc.edu/~dhardy/
>
>
>> On Aug 25, 2017, at 5:09 PM, JC Gumbart <gumbart_at_physics.gatech.edu> wrote:
>>
>> Hi Dave,
>>
>> Can you elaborate on the nature of this bug and what effects you think it had? I’ve been running some NPT membrane simulations with NAMD 2.12 CUDA in which the area is a little bigger than I would expect. I wonder if this could be related?
>>
>> Thanks!
>> JC
>>
>>> On Aug 25, 2017, at 2:31 PM, David Hardy <dhardy_at_ks.uiuc.edu> wrote:
>>>
>>> Dear Francesco,
>>>
>>> I've found and fixed a long-standing bug in the newer CUDA kernels that appears to have been affecting your simulations. I merged the source patch into NAMD yesterday, so you can now get a fixed nightly build of NAMD from http://www.ks.uiuc.edu/Development/Download/download.cgi?PackageName=NAMD that should allow you to run with the fast kernels (i.e., the default, "useCUDA2 on").
>>>
>>> Best regards,
>>> Dave
>>>
>>> --
>>> David J. Hardy, Ph.D.
>>> Theoretical and Computational Biophysics
>>> Beckman Institute, University of Illinois
>>> dhardy_at_ks.uiuc.edu
>>> http://www.ks.uiuc.edu/~dhardy/
>>>> On Aug 4, 2017, at 6:44 AM, Francesco Pietra <chiendarret_at_gmail.com> wrote:
>>>>
>>>> Dear David;
>>>>
>>>> I am answering late (I had problems with the cluster); however, with "useCUDA2 off" added to the NAMD config file, it worked well. Thank you
>>>> francesco
>>>>
>>>> On Tue, Jul 18, 2017 at 6:05 PM, David Hardy <dhardy_at_ks.uiuc.edu> wrote:
>>>> Dear Francesco,
>>>>
>>>> You could try disabling the newer CUDA kernels with "useCUDA2 off" in your config file.
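>>>>
>>>> For example (a minimal illustration only; the rest of your config file stays unchanged):
>>>>
>>>>     # fall back to the older CUDA nonbonded kernels (NAMD 2.12)
>>>>     useCUDA2    off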
>>>>
>>>> Others have also reported issues with NPT simulation using the new CUDA kernels. I am looking into it.
>>>>
>>>> Best regards,
>>>> Dave
>>>>
>>>>
>>>>> On Jul 17, 2017, at 9:09 AM, Francesco Pietra <chiendarret_at_gmail.com> wrote:
>>>>>
>>>>> Dear Dave:
>>>>> I got the impression that problems with such modest GPUs as my GTX680 are exacerbated with version 2.12 of NAMD, or at least with the nightly build that I mentioned in my previous mail.
>>>>>
>>>>> I am now at a larger box for the same system that I mentioned (just to allow the protein to tumble). Minimization on a 4-core desktop gave no problems. Gradual heating on the GTX680 box led to chaining of the TIP3P waters. Again, no problem with the desktop, at its very low speed (rigidBonds water, ts = 1.0 fs, margin 0).
>>>>>
>>>>> NPT (based on the heating output) could not be run on the GTX680 box, even when going to ts = 0.1 fs and a larger margin.
>>>>>
>>>>> Again, no problems on the desktop (the NextScale cluster is undergoing tuning of NAMD 2.12 with KNL; performance is presently very poor).
>>>>>
>>>>> I had no similar problems with previous versions of NAMD on the same GTX680 box.
>>>>>
>>>>> francesco
>>>>>
>>>>> On Wed, Jul 5, 2017 at 6:02 PM, David Hardy <dhardy_at_ks.uiuc.edu> wrote:
>>>>> Dear Francesco,
>>>>>
>>>>> This issue shouldn't have anything to do with the "wrapall" option, since this just affects the output of atom coordinates to the DCD file.
>>>>>
>>>>> The "periodic cell became too small" error occurs when using a barostat that shrinks the periodic cell to the point where the patch size along a dimension becomes smaller than the extended cutoff distance, which is why increasing "margin" is helping.
>>>>>
>>>>> Overall, the discrepancy between running on GPUs vs. CPUs is most likely due to differences in the calculation of the virial, which then affects the barostat. The virial appears to be, numerically speaking, not a well-conditioned quantity to compute, so the use of single precision (GPUs) versus double precision (CPUs) in its calculation is probably at the root of your issue.
>>>>>
>>>>> In terms of using NAMD on GPUs, increasing the "margin," as Josh recommends, is probably your best course of action. Increasing the margin to 10 seems to me to be really extreme, as this will impede performance. Ask yourself: Is your periodic cell really expected to shrink as much as (minimum number of patches along a given dimension)*margin? Maybe first try setting it more modestly to 1 or 2 to see if you can then successfully run.
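>>>>>
>>>>> For example, a modest setting in the config file (the value is only illustrative; margin is in Angstroms) would look like:
>>>>>
>>>>>     # extra padding on the patch size so the cell can shrink under the barostat
>>>>>     margin    2
>>>>>
>>>>> A larger margin enlarges the patches and the pairlist work, which is why very large values such as 10 hurt performance.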
>>>>>
>>>>> Best regards,
>>>>> Dave
>>>>>
>>>>> --
>>>>> David J. Hardy, Ph.D.
>>>>> Theoretical and Computational Biophysics
>>>>> Beckman Institute, University of Illinois
>>>>> dhardy_at_ks.uiuc.edu
>>>>> http://www.ks.uiuc.edu/~dhardy/
>>>>>
>>>>>
>>>>>> On Jul 5, 2017, at 9:39 AM, Joshua . <joshua.timmons1_at_gmail.com> wrote:
>>>>>>
>>>>>> Hello Francesco,
>>>>>>
>>>>>> I am new to NAMD, but I had a similar problem "periodic cell became too small" and resolved it by setting margin to 10 (while keeping wrapAll on). Documentation from 2.6 says to leave it alone unless trying to optimize performance, but it solved the issue for me immediately: http://www.ks.uiuc.edu/Research/namd/2.6/olddocs/ug/node26.html
>>>>>>
>>>>>> Josh
>>>>>>
>>>>>> On Wed, Jul 5, 2017 at 3:09 AM, Francesco Pietra <chiendarret_at_gmail.com> wrote:
>>>>>> Hello:
>>>>>> I am at an unbiased MD with a large protein containing organic ligands in a TIP3P box that (with "wrapall on") gave no trouble on a NextScale cluster on 264 pure CPUs over a 58.2 ns simulation, ts = 1.0 fs.
>>>>>>
>>>>>> On trying to continue the simulation on my workstation with a couple of GTX680s, I am facing an immediate "periodic cell became too small" error under either "wrapall on" or "wrapall no" (I have used this hardware successfully for MD up to this case). NAMD_CVS_2017-05-25_Linux-x86_64_multicore-CUDA.
>>>>>>
>>>>>> In contrast, the simulation continues without problems (albeit very slowly) on pure CPUs on a desktop, with either "wrapall no" (which was the reason for continuing the simulation, in order to safely measure the distances between the centers of mass of protein and ligands) or "wrapall on".
>>>>>>
>>>>>> francesco pietra
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>

This archive was generated by hypermail 2.1.6 : Mon Dec 31 2018 - 23:20:32 CST