From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Fri Sep 01 2017 - 03:23:28 CDT
The simulation crashed again later with a segfault and no backtrace.
From: Norman Geist [mailto:norman.geist_at_uni-greifswald.de]
Sent: Thursday, 31 August 2017 11:12
To: David Hardy <dhardy_at_ks.uiuc.edu>
Subject: Re: namd-l: NAMD-2.12 CUDA2 and PMECUDA problems
Input and logfile appended.
It seems you are right: the initialization error disappears when using only one GPU. I will report whether this also fixes the later crash, which still occurred when setting "useCUDA2 no".
Just btw:
Speed with 1x4 Threads and 1 GPU : 0.144923 days/ns (to be observed for long runs)
Same with "useCUDA2 no" : 0.130746 days/ns
With 2x4 Threads and 2 GPU : 0.117877 days/ns (the case that often crashed later)
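For readers who prefer ns/day, the figures above can be converted with a few lines of Python. This is only a sketch of the arithmetic: the labels and the helper names `ns_per_day` and `speedup` are made up for illustration, and the numbers are the three values quoted above.

```python
# Benchmark figures quoted above, in days per nanosecond (lower is faster).
# The dictionary keys are illustrative labels, not NAMD output.
benchmarks = {
    "1x4 threads, 1 GPU": 0.144923,
    "1x4 threads, 1 GPU, useCUDA2 no": 0.130746,
    "2x4 threads, 2 GPUs": 0.117877,
}


def ns_per_day(days_per_ns: float) -> float:
    """Convert days/ns into the reciprocal unit, ns/day."""
    return 1.0 / days_per_ns


def speedup(baseline_days_per_ns: float, other_days_per_ns: float) -> float:
    """Return how many times faster `other` is than `baseline`."""
    return baseline_days_per_ns / other_days_per_ns


baseline = benchmarks["1x4 threads, 1 GPU"]
for label, days in benchmarks.items():
    print(f"{label:35s} {ns_per_day(days):6.2f} ns/day  "
          f"{speedup(baseline, days):5.2f}x vs. first entry")
```

A speedup above 1.0 means the run is faster than the first entry (1x4 threads, 1 GPU, new kernels).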
Thanks
Norman
On Wednesday, 30-08-2017 at 18:05, David Hardy wrote:
Dear Norman,
Please send me your NAMD config file and also the log file and backtrace produced by the initialization error.
Setting "useCUDA2 off" should make NAMD use the older short-range nonbonded CUDA kernels. Maybe also try setting "PMEOffload off" to see if that eliminates the simulation crashes. An earlier note I saw from Jim said that the old PMEOffload only works with a single GPU per node, something about how the new kernels ignore +devices and grab every GPU available.
Thanks,
Dave
--
David J. Hardy, Ph.D.
Theoretical and Computational Biophysics
Beckman Institute, University of Illinois
dhardy_at_ks.uiuc.edu
http://www.ks.uiuc.edu/~dhardy/

On Aug 30, 2017, at 3:47 AM, Norman Geist <norman.geist_at_uni-greifswald.de> wrote:

I rule out a hardware problem, since this initialization problem in particular happens on all 6 of our GPU nodes and on another cluster we have access to, which contains 10 GPU nodes.

I should also mention that we use a 4 fs timestep with the hydrogen mass repartitioning method, applied through the ParmEd utility by modifying the Amber parm7 file. That does not explain the initialization error, though it might explain the stability issues. Still, it works fine with 2.10 and 2.11.

Thanks so far

-----Original Message-----
From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf Of Nicholas M Glykos
Sent: Wednesday, 30 August 2017 10:05
To: Norman Geist <norman.geist_at_uni-greifswald.de>
Cc: namd-l_at_ks.uiuc.edu; glykos_at_mbg.duth.gr
Subject: Re: AW: namd-l: NAMD-2.12 CUDA2 and PMECUDA problems

Yes, it is the nightly build. It's weird that I already get such a backtrace during the CUDA initialization and nobody else seems to have encountered the same. I also get similar errors for GBIS with 2.11, where CUDA acceleration has been changed for implicit solvent. If I disable useCUDA2, some of the systems run for a while, but most of them crash later, e.g. with a segfault or through instability. Sometimes lots of margin warnings also occur in between. There must still be a bug somewhere in the new CUDA kernels.

Yes, it is weird. Being a pessimist, I usually connect weirdness with hardware issues, but you could be right that this is indeed a software problem.

For the record, I have used the new CUDA kernels on machines with Xeon E5-2660v3 plus 2 x K40 without stability problems. Ditto for workstations with i7-6800 + GTX 1070. Good luck with it, I'm out of my depth here.

--
Nicholas M. Glykos, Department of Molecular Biology and Genetics, Democritus University of Thrace, University Campus, Dragana, 68100 Alexandroupolis, Greece, Tel/Fax (office) +302551030620, Ext.77620, Tel (lab) +302551030615, http://utopia.duth.gr/glykos/