RE: BUG: NAMD-2.10 CUDA REMD segfault with TCL exec randomly

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Wed Mar 04 2015 - 04:40:07 CST

Here's the backtrace I got; it seems that libcuda isn't thread-safe. The
fact that Tcl's exec uses fork() is a real problem in many places.

This could perhaps be circumvented by adding a command to NAMD's Tcl
interface for calling external tools without exec and fork(). Since NAMD
throws a FATAL error on Tcl errors anyway, the forking exec gains nothing
here.

 

Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00002ae8a2893602 in ?? () from /lib64/libc.so.6
(gdb) bt
#0 0x00002ae8a2893602 in ?? () from /lib64/libc.so.6
#1 0x00002ae8a7c7ef81 in ?? () from /usr/lib64/libcuda.so
#2 0x00002ae8a75f2b5a in ?? () from /usr/lib64/libcuda.so
#3 0x00002ae8a7c7f588 in ?? () from /usr/lib64/libcuda.so
#4 0x00002ae8a17ab1f3 in start_thread () from /lib64/libpthread.so.0
#5 0x00002ae8a29041ad in clone () from /lib64/libc.so.6

 

Norman Geist.

 

From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf
Of Norman Geist
Sent: Wednesday, March 04, 2015 8:48 AM
To: namd-l_at_ks.uiuc.edu
Subject: namd-l: BUG: NAMD-2.10 CUDA REMD segfault with TCL exec randomly

 

Hey,

 

I want to report a problem: calling "exec" from a NAMD job script causes a
segmentation fault after a random time (number of steps). So far I have
only observed it in CUDA+REMD runs; the same system runs fine with the CPU
version. I have already run into two cases where calling "exec" caused a
segfault after some time. The first was a call to "date" to measure the
time per run for REMD. The other was calling VMD to do some measurements
during a REMD run.
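
For the timing case, one possible way to avoid "exec" altogether is to use
Tcl's built-in clock command, which does not spawn a child process. The
following is only a minimal sketch under some assumptions: the helper name
timed_eval is hypothetical, "clock milliseconds" requires Tcl 8.5 or newer
("clock seconds" could be used on older interpreters), and "after 50" is
just a stand-in for a NAMD run segment so that the snippet also works in a
plain tclsh. It does not help with the VMD case, which still needs an
external process.

```tcl
# Hypothetical helper: evaluate a script in the caller's scope and return
# the elapsed wall-clock time in seconds. "clock milliseconds" is a Tcl
# built-in (Tcl 8.5+), so no fork() is involved.
proc timed_eval {script} {
    set t0 [clock milliseconds]
    uplevel 1 $script
    return [expr {([clock milliseconds] - $t0) / 1000.0}]
}

# Stand-in for a run segment; in a NAMD job script the body would be
# something like {run 1000} (assumption, not taken from the original code).
set elapsed [timed_eval {after 50}]
puts [format "segment took %.3f s" $elapsed]
```

In a REMD script the returned value could be accumulated per replica and
printed with the usual output, instead of parsing the output of "date".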

 

If there is interest in solving this, I can supply the problematic code and
the core file/backtrace.

 

Norman Geist
