Re: Unexplained segmentation faults in NAMD 2.9 using CUDA and GBIS

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Fri Nov 30 2012 - 00:32:14 CST

Hi,

 

From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf
Of Aron Broom
Sent: Friday, November 30, 2012 00:01
To: Tristan Croll
Cc: namd-l_at_ks.uiuc.edu
Subject: Re: namd-l: Unexplained segmentation faults in NAMD 2.9 using CUDA
and GBIS

 

Is there a chance it could be due to memory size? I've run a number of CUDA
GBIS simulations with NAMD 2.9 on C2070s without any problems. But my
system is a single small protein domain (~2k atoms). If I recall from the
AMBER website (not sure how this correlates with NAMD), implicit solvent
simulations take a fair amount of memory. I think the C2070s have 3GB?

Good hint, Aron (they have 6 GB). But I can't imagine that running out of
VRAM would cause a segfault. You can check it with nvidia-smi anyway, and
also check the volatile ECC error counts if you have ECC enabled. However,
it's also possible that your OS kills the job because the machine is running
out of local memory; sometimes that shows up as a segfault too. If you have
no monitoring system installed, you can check the memory usage with top or
vmstat.
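
If you want to watch it live, a quick Python sketch like the one below can
log both numbers while the job runs (just an illustration; it assumes a
driver whose nvidia-smi supports the --query-gpu fields, and it reads
/proc/meminfo for the host side):

#!/usr/bin/env python
# Sketch: log GPU memory (via nvidia-smi) and free host memory while NAMD
# runs, to see whether VRAM or host RAM collapses right before the segfault.
# Assumes nvidia-smi supports --query-gpu (field names vary between drivers).
import subprocess
import time

def gpu_memory():
    # One "used, total" line per GPU.
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader"])
    return out.decode().strip().replace("\n", " | ")

def host_memfree_kb():
    # The same number that top/vmstat report as free memory.
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemFree:"):
                return line.split()[1]
    return "unknown"

if __name__ == "__main__":
    while True:
        print("GPU (used, total): %s | host MemFree: %s kB"
              % (gpu_memory(), host_memfree_kb()))
        time.sleep(10)

Run it in a second terminal next to the NAMD job and see whether either
number drops to nothing just before the crash.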

Also, isn't there a stack trace printed that could point to the code segment
that produced the segfault?
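
To tell a real segfault from an OOM kill after the fact, something like the
following can grep the kernel log for you (again only a sketch; it assumes
the binary is called namd2 and that your kernel writes the usual "segfault
at" and oom-killer lines to dmesg, which varies by distro):

#!/usr/bin/env python
# Sketch: scan the kernel log to distinguish a genuine segfault in namd2
# from an OOM kill, which can look identical from the shell.
# Assumes the binary is named "namd2" and that dmesg is readable by the
# user; the exact log wording varies with kernel and distro.
import subprocess

def kernel_log():
    return subprocess.check_output(["dmesg"]).decode(errors="replace")

def classify(log, binary="namd2"):
    hits = []
    for line in log.splitlines():
        if binary in line and "segfault at" in line:
            hits.append(("segfault", line.strip()))
        elif "oom-killer" in line.lower() or "Out of memory" in line:
            hits.append(("oom-kill", line.strip()))
    return hits

if __name__ == "__main__":
    hits = classify(kernel_log())
    if not hits:
        print("No namd2 segfault or OOM-killer lines found in dmesg.")
    for kind, line in hits:
        print("%-8s %s" % (kind, line))

If the oom-killer shows up there, it's the host memory and not the GPU or
the GBIS code that is to blame.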

~Norman

~Aron

On Wed, Nov 28, 2012 at 6:40 AM, Tristan Croll <tristan.croll_at_qut.edu.au>
wrote:

Hi all,

 

As per the subject line, I've been getting segmentation faults at seemingly
random intervals when running implicit solvent simulations in the CUDA
version of NAMD 2.9. Unlike most crash situations, this one doesn't throw
up any error message other than "Segmentation Fault". Possibly related,
I've also had a number of cases of simulations crashing during energy
minimisation due to polling of the CUDA cards timing out.

 

Relevant specs: I've seen the problem on two different machines, running
different flavours of Linux. One is an 8-core Xeon (Nehalem) workstation
with a single Tesla C2050, the other is a blade on our cluster (16-core
Sandy Bridge Xeon with two C2070s).

 

The simulation itself is of a rather large glycoprotein (with glycans using
the new force field parameters from the MacKerell lab). There are some fairly
clear misfoldings in two domains (crystallisation artefacts or threading
errors), which make me suspect that the problem may be an energy term going
out of range and being mishandled. On the other hand, continuing from the
restart files after a crash (without reinitialising velocities) usually
*doesn't* replicate the crash.

 

The one thing I can clearly say is that it definitely seems to be the
combination of GBIS and CUDA that is the problem - explicit solvent works
fine (but is a poor choice for the TMD simulations I want to run), as does
GBIS in the non-CUDA version of NAMD (but it's agonisingly slow for the
system I'm simulating). I'd go with multi-node simulations, but the recent
upgrade of our cluster seems to have broken its compatibility with the
ibverbs NAMD build (the guys in charge of the cluster are working on that).

 

Sorry to give you such a vague list of symptoms, but hopefully something in
there will help.

 

Cheers,

 

Tristan

-- 
Aron Broom M.Sc
PhD Student
Department of Chemistry
University of Waterloo
