From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Wed Mar 04 2015 - 04:40:07 CST
Here's the backtrace I got; it seems that libcuda isn't fork-safe. The fact
that Tcl's exec uses fork() is a real issue in many places.
This could perhaps be circumvented by adding a command to NAMD's Tcl
interface that calls external tools without exec and fork(). Since NAMD
throws a FATAL for Tcl errors anyway, the forking exec doesn't change
anything here.
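To illustrate, here is a rough sketch of what such a command could look like
(not actual NAMD code; the command name "spawn" and the registration hook are
made up). It registers a Tcl command backed by posix_spawnp(), which on glibc
avoids duplicating the CUDA-initialized parent process the way Tcl's
fork()+exec does:

    /* Hypothetical sketch only -- "spawn" and registerSpawn() are made-up
     * names, not part of NAMD. The command runs an external program via
     * posix_spawnp() and returns its exit status to Tcl. */
    #include <tcl.h>
    #include <spawn.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    extern char **environ;

    static int spawnCmd(ClientData cd, Tcl_Interp *interp,
                        int objc, Tcl_Obj *const objv[]) {
      if (objc < 2) {
        Tcl_WrongNumArgs(interp, 1, objv, "command ?arg ...?");
        return TCL_ERROR;
      }
      /* Build a NULL-terminated argv from the Tcl arguments. */
      char **argv = malloc(objc * sizeof(char *));
      for (int i = 1; i < objc; i++) argv[i - 1] = Tcl_GetString(objv[i]);
      argv[objc - 1] = NULL;

      pid_t pid;
      int rc = posix_spawnp(&pid, argv[0], NULL, NULL, argv, environ);
      free(argv);
      int status;
      if (rc != 0 || waitpid(pid, &status, 0) < 0) {
        Tcl_SetObjResult(interp, Tcl_NewStringObj("spawn failed", -1));
        return TCL_ERROR;
      }
      Tcl_SetObjResult(interp,
          Tcl_NewIntObj(WIFEXITED(status) ? WEXITSTATUS(status) : -1));
      return TCL_OK;
    }

    /* Hypothetical registration hook, wherever NAMD sets up its interpreter: */
    void registerSpawn(Tcl_Interp *interp) {
      Tcl_CreateObjCommand(interp, "spawn", spawnCmd, NULL, NULL);
    }

A jobscript could then call e.g. "spawn date" instead of "exec date". Whether
posix_spawn() really sidesteps the libcuda crash would of course need testing.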
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00002ae8a2893602 in ?? () from /lib64/libc.so.6
(gdb) bt
#0 0x00002ae8a2893602 in ?? () from /lib64/libc.so.6
#1 0x00002ae8a7c7ef81 in ?? () from /usr/lib64/libcuda.so
#2 0x00002ae8a75f2b5a in ?? () from /usr/lib64/libcuda.so
#3 0x00002ae8a7c7f588 in ?? () from /usr/lib64/libcuda.so
#4 0x00002ae8a17ab1f3 in start_thread () from /lib64/libpthread.so.0
#5 0x00002ae8a29041ad in clone () from /lib64/libc.so.6
Norman Geist.
From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf
Of Norman Geist
Sent: Wednesday, March 04, 2015 8:48 AM
To: namd-l_at_ks.uiuc.edu
Subject: namd-l: BUG: NAMD-2.10 CUDA REMD segfault with TCL exec randomly
Hey,
I want to report a weird problem with segmentation faults occurring after a
random time (number of steps) on "exec" from a NAMD jobscript, so far
observed only for CUDA+REMD runs. The same system runs fine with the CPU
version. I have already run into two cases where calling "exec" caused a
segfault after some time: the first was a call to "date" to measure the time
per run for REMD; another was calling VMD to do some measurements during a
REMD run.
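For reference, the suspected mechanism can be exercised with a small
standalone program. This is only a hypothetical sketch (it mimics what Tcl's
exec does after the CUDA context has been created); whether it actually
crashes will depend on driver version and timing:

    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int main(void) {
      cudaFree(0);                       /* force creation of the CUDA context */
      for (int i = 0; i < 10000; i++) {  /* hammer fork()+exec, as Tcl's exec does */
        pid_t pid = fork();
        if (pid == 0) {
          execlp("date", "date", (char *) NULL);
          _exit(127);                    /* exec failed in the child */
        }
        int status;
        waitpid(pid, &status, 0);
      }
      printf("survived\n");
      return 0;
    }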
If there's interest in solving this, I can supply the problematic code and
the core/backtrace.
Norman Geist