AW: NAMD-2.12 CUDA2 and PMECUDA problems

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Fri Sep 01 2017 - 03:23:28 CDT

The simulation again crashed later with a segfault and no backtrace.

 

From: Norman Geist [mailto:norman.geist_at_uni-greifswald.de]
Sent: Thursday, 31 August 2017 11:12
To: David Hardy <dhardy_at_ks.uiuc.edu>
Subject: Re: namd-l: NAMD-2.12 CUDA2 and PMECUDA problems

 

Input and logfile appended.

It seems you are right: the initialization error disappears when using only one GPU. I will report whether this also fixes the later crash, which still occurred when setting "useCUDA2 no".

Just btw, some benchmark numbers:
Speed with 1x4 threads and 1 GPU: 0.144923 days/ns (still to be confirmed over long runs)
Same with "useCUDA2 no": 0.130746 days/ns
With 2x4 threads and 2 GPUs: 0.117877 days/ns (the case that often crashed later)
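
For reference, those runs are launched roughly like this (just a sketch with example file names and device IDs; the exact flags depend on whether it is the multicore-CUDA or an smp-CUDA build):

    # 1 process x 4 worker threads on GPU 0 (multicore-CUDA build)
    namd2 +p4 +devices 0 equil.namd > run_1gpu.log

    # 2 processes x 4 worker threads, one GPU per process (smp build, via charmrun)
    charmrun ++local +p8 ++ppn 4 namd2 +devices 0,1 equil.namd > run_2gpu.log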

Thanks
Norman

On Wednesday, 30 August 2017 at 18:05, David Hardy wrote:

Dear Norman,

 

Please send me your NAMD config file and also the log file and backtrace produced by the initialization error.

Setting "useCUDA2 off" should be using the older short-range nonbonded CUDA kernels. Maybe also try setting "PMEOffload off" to see if that eliminates the simulation crashes. An earlier note I saw from Jim said that the old PMEOffload only works with a single GPU per node, something about how the new kernels ignore +devices and grab every GPU available.

 

Thanks,

Dave

 

--
David J. Hardy, Ph.D.
Theoretical and Computational Biophysics
Beckman Institute, University of Illinois
dhardy_at_ks.uiuc.edu
http://www.ks.uiuc.edu/~dhardy/
 
 
On Aug 30, 2017, at 3:47 AM, Norman Geist <norman.geist_at_uni-greifswald.de> wrote:
 
I rule out a hardware problem, since this initialization problem in particular
happens on all 6 of our GPU nodes and on another cluster we have access to,
which contains 10 GPU nodes. I should also mention that we use a 4 fs timestep
with hydrogen mass repartitioning, applied through the parmed utility by
modifying the Amber parm7 file. That is no explanation for the initialization
error, though it might be for the stability issues; in any case, it works
fine with 2.10 and 2.11.
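For completeness, the repartitioning was done with a parmed script along these
lines (file names are placeholders), fed to the parmed interpreter:

    parm system.parm7
    HMassRepartition 3.024
    outparm system_hmr.parm7

HMassRepartition moves mass from the heavy atoms onto their bonded hydrogens
(water is left alone unless you ask for it), and the resulting
system_hmr.parm7 is what the 4 fs runs use.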
Thanks so far
-----Original Message-----
From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf Of Nicholas M Glykos
Sent: Wednesday, 30 August 2017 10:05
To: Norman Geist <norman.geist_at_uni-greifswald.de>
Cc: namd-l_at_ks.uiuc.edu; glykos_at_mbg.duth.gr
Subject: Re: AW: namd-l: NAMD-2.12 CUDA2 and PMECUDA problems
Yes, it is the nightly build. It's weird that I already get such a backtrace
during CUDA initialization and nobody else seems to have encountered it.
I also get similar errors for GBIS with 2.11, where the CUDA acceleration
for implicit solvent was changed.
If I disable useCUDA2, some of the systems run for a while, but most of
them crash later, e.g. with a segfault or due to instability. Sometimes
lots of margin warnings also appear in between. There must still be a bug
somewhere in the new CUDA kernels.
Yes, it is weird. Being a pessimist, I usually connect weirdness with
hardware issues, but you could be right that this is indeed a software
problem. For the record, I have used the new CUDA kernels on machines with
a Xeon E5-2660v3 plus 2 x K40 without stability problems. Ditto for
workstations with an i7-6800 + GTX 1070. Good luck with it; I'm out of my
depth here.
--
           Nicholas M. Glykos, Department of Molecular Biology
    and Genetics, Democritus University of Thrace, University Campus,
 Dragana, 68100 Alexandroupolis, Greece, Tel/Fax (office) +302551030620,
   Ext.77620, Tel (lab) +302551030615, http://utopia.duth.gr/glykos/
 
 
