Unexplained segmentation faults in NAMD 2.9 using CUDA and GBIS

From: Tristan Croll (tristan.croll_at_qut.edu.au)
Date: Wed Nov 28 2012 - 05:40:18 CST

Hi all,

As per the subject line, I've been getting segmentation faults at seemingly random intervals when running implicit solvent simulations with the CUDA build of NAMD 2.9. Unlike most crashes, this one produces no error message other than "Segmentation fault". Possibly related: I've also had a number of cases where simulations crashed during energy minimisation because polling of the CUDA cards timed out.

Relevant specs: I've seen the problem on two different machines running different flavours of Linux. One is an 8-core Xeon (Nehalem) workstation with a single Tesla C2050; the other is a blade on our cluster with a 16-core Sandy Bridge Xeon and two C2070s.

The simulation itself is of a rather large glycoprotein (glycans using the new forcefield parameters from the MacKerell lab). There are some fairly clear misfoldings in two domains (crystallisation artefacts or threading errors), which makes me suspect that the problem may be an energy term going out of range and being mishandled. On the other hand, continuing from the restart files after a crash (without reinitialising velocities) usually *doesn't* replicate the crash.

The one thing I can say with confidence is that the combination of GBIS and CUDA is the problem: explicit solvent works fine (but is a poor choice for the TMD simulations I want to run), as does GBIS in the non-CUDA build of NAMD (but that is agonisingly slow for a system this size). I'd run multi-node simulations instead, but a recent upgrade of our cluster seems to have broken its compatibility with the ibverbs NAMD build (the people in charge of the cluster are working on that).
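In case it helps with reproducing this, here is a stripped-down sketch of the kind of configuration that triggers it. The file names and exact parameter values are placeholders rather than my actual input, but the GBIS-related options are the ones I'm using:

    # Minimal GBIS + CUDA configuration sketch (placeholder file names and values)
    structure          glycoprotein.psf
    coordinates        glycoprotein.pdb
    paraTypeCharmm     on
    parameters         par_all36_prot.prm
    # new MacKerell carbohydrate parameters
    parameters         par_all36_carb.prm
    temperature        310

    # Implicit solvent (GBIS)
    GBIS               on
    alphaCutoff        14.0
    ionConcentration   0.15
    solventDielectric  78.5
    sasa               on

    # Nonbonded settings (larger cutoffs, as recommended for GBIS)
    exclude            scaled1-4
    1-4scaling         1.0
    cutoff             16.0
    switching          on
    switchdist         15.0
    pairlistdist       18.0

    # Integrator and thermostat
    timestep           1.0
    rigidBonds         all
    langevin           on
    langevinTemp       310
    langevinDamping    5

    # Output
    outputName         run01
    dcdfreq            5000
    restartfreq        5000

    minimize           5000
    reinitvels         310
    run                1000000

    # Launched on the CUDA build roughly as:
    #   namd2 +p8 +idlepoll run.namd > run01.log

The polling timeouts show up during the minimisation phase, while the bare segmentation faults tend to come later, well into the dynamics.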

Sorry to give you such a vague list of symptoms, but hopefully something in there will help.

Cheers,

Tristan
