RE: Unexplained segmentation faults in NAMD 2.9 using CUDA and GBIS

From: Tristan Croll (tristan.croll_at_qut.edu.au)
Date: Thu Nov 29 2012 - 17:42:53 CST

I guess that's a strong possibility - my system contains just over 35k atoms. On the other hand, it crashes the same way on machines with one or two GPUs (I've just double-checked, and the ones on the cluster are actually M2090 blades) - does each GPU have to hold the full system in memory?

On the other other hand, after spending most of yesterday adjusting some of the more obviously problematic regions with AutoIMD, it ran overnight for a little over half a nanosecond (substantially longer than previously) but still crashed. This leaves me scratching my head.

- Tristan

From: Aron Broom [mailto:broomsday_at_gmail.com]
Sent: Friday, 30 November 2012 9:01 AM
To: Tristan Croll
Cc: namd-l_at_ks.uiuc.edu
Subject: Re: namd-l: Unexplained segmentation faults in NAMD 2.9 using CUDA and GBIS

Is there a chance it could be due to memory size? I've run a number of CUDA GBIS simulations with NAMD 2.9 on C2070s without any problems, but my system is a single small protein domain (~2k atoms). If I recall correctly from the AMBER website (not sure how well this carries over to NAMD), implicit solvent simulations take a fair amount of memory. I think the C2050 has 3 GB and the C2070s 6 GB?
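One quick way to test that (just a sketch - I'm assuming nvidia-smi is installed with the driver on those nodes, which it usually is) would be to watch device memory while the job runs, e.g.:

    nvidia-smi -q -d MEMORY -l 5

That prints total/used/free memory per GPU every 5 seconds; if the used figure climbs toward the card's capacity right before the segfault, memory would be the prime suspect.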

~~Aron
On Wed, Nov 28, 2012 at 6:40 AM, Tristan Croll <tristan.croll_at_qut.edu.au> wrote:
Hi all,

As per the subject line, I've been getting segmentation faults at seemingly random intervals when running implicit solvent simulations in the CUDA version of NAMD 2.9. Unlike most crashes, this one produces no error message beyond "Segmentation Fault". Possibly related, I've also had a number of simulations crash during energy minimisation because polling of the CUDA cards timed out.

Relevant specs: I've seen the problem on two different machines running different flavours of Linux. One is an 8-core Xeon (Nehalem) workstation with a single Tesla C2050; the other is a blade on our cluster (16-core Sandy Bridge Xeon with two C2070s).

The simulation itself is of a rather large glycoprotein (with glycans using the new force field parameters from the MacKerell lab). There are some fairly clear misfoldings in two domains (crystallisation artefacts or threading errors), which makes me suspect that the problem may be an energy term going out of range and being mishandled. On the other hand, continuing from the restart files after a crash (without reinitialising velocities) usually *doesn't* reproduce the crash.

The one thing I can say with some confidence is that the trouble seems specific to the combination of GBIS and CUDA: explicit solvent works fine (but is a poor choice for the TMD simulations I want to run), as does GBIS in the non-CUDA build of NAMD (but that is agonisingly slow for a system this size). I'd go to multi-node simulations, but the recent upgrade of our cluster seems to have broken its compatibility with the ibverbs NAMD build (the guys in charge of the cluster are working on that).
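For what it's worth, the implicit-solvent settings are just the standard GBIS keywords; a rough sketch (illustrative values only, not copied from my actual input file) looks like this:

    # Implicit solvent (GBIS) section - values are illustrative
    GBIS                on
    solventDielectric   78.5
    ionConcentration    0.3
    alphaCutoff         14.0

    # Cutoffs of the size typically used with GBIS
    switching           on
    switchdist          15.0
    cutoff              16.0
    pairlistdist        18.0

    # Launched with the CUDA build along the lines of:
    #   namd2 +p8 +idlepoll myjob.conf

Exact numbers aside, that's the shape of it.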

Sorry to give you such a vague list of symptoms, but hopefully something in there will help.

Cheers,

Tristan

--
Aron Broom M.Sc
PhD Student
Department of Chemistry
University of Waterloo
