NAMD2.9 Release and Multi-GPU Minimization Errors

From: c jepson (jepson.c_at_gmail.com)
Date: Thu Jul 19 2012 - 15:15:32 CDT

Hello,

A polling bug with multiple CUDA GPUs, previously reported against the beta,
seems to persist in the release. Here is the stderr output of a 10,000-step
minimization I attempted to perform:

"pc@pc-desktop:~/Desktop/NAMD29CUDA/minimization$ ../charmrun ++local +p8
./namd2 +idlepoll +devices 0,1 test_minimization.namd > output.log
------------- Processor 2 Exiting: Called CmiAbort ------------
Reason: FATAL ERROR: CUDA error in cuda_check_remote_progress on Pe 2
(pc-desktop device 0): the launch timed out and was terminated

Charm++ fatal error:
FATAL ERROR: CUDA error in cuda_check_remote_progress on Pe 2 (pc-desktop
device 0): the launch timed out and was terminated

Aborted (core dumped)"

This usually happens about 1,000 steps into the minimization.

The minimization runs fine with just one GPU instead of both. I am using
NVIDIA driver 295.49, two GTX 570s, an i7-2600K, CUDA Toolkit 4.2.9, and the
NAMD 2.9 Linux 64-bit multicore build with CUDA. The minimization uses AMBER
force fields, and I'm using the included libcudart.so. The exact same
minimization also works fine in NAMD 2.8 with both GPUs, although NAMD 2.9
with one GPU is faster than NAMD 2.8 with two GPUs.
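
For comparison, the working and failing launches should differ only in the
+devices list. A minimal sketch, assuming device 0 is the card used for the
single-GPU run (that device index and the helper name namd_cmd are
illustrative assumptions, not taken from the log above). It only prints the
command lines and does not launch NAMD:

```shell
#!/bin/sh
# Hypothetical helper: print the charmrun command line for a given
# comma-separated CUDA device list. Nothing is executed here.
namd_cmd() {
  devices="$1"
  printf '%s\n' "../charmrun ++local +p8 ./namd2 +idlepoll +devices ${devices} test_minimization.namd"
}

namd_cmd 0      # single GPU (assumed device 0): completes
namd_cmd 0,1    # both GPUs: aborts with the launch timeout shown above
```

Redirecting each run to its own log file would make the two outputs easier
to compare side by side.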

Thanks for your help,
C Jepson
