RE: NAMD hanging on GPU

From: Sadhu, Shubho (NIH/NCI) [F] (sadhusj_at_mail.nih.gov)
Date: Tue Jul 28 2009 - 09:17:48 CDT

I ran a few more tests and got access to another machine; the results are mixed.

To keep things straight, I've now tested NAMD with CUDA on three machines: 1) a 32-bit quad-core with a Quadro FX 3700, 2) a 64-bit quad-core with GeForce 8800 GTXs, and 3) a 64-bit dual quad-core with two Tesla C870s and a Quadro NVS 290. The Quadro cards are connected to GUIs.

1) On the first machine, the RNA test case (~31K atoms) runs fine without hanging or crashing. I've tested up to 300,000 steps (10.3 hours).
2) On the second machine, the RNA test case hangs within 10,000 steps. The machine was rebooted to see if it would help. Now it seems to last slightly longer (still within 10,000 steps though), and it sometimes crashes with "FATAL ERROR: CUDA error dev_nonbonded: unspecified launch failure" (stack traceback is below).
3) On the third machine, the test case works on the Teslas, but when the Quadro is added the simulation hangs before 2000 steps.

The hang is reproducible on more than one machine (2 and 3), which suggests a bug in the software or CUDA rather than a fault in a specific machine.

Any help is appreciated.

Shubho

Stack traceback for machine 2 ([0] to [14] are identical for different crashes):
Info: Initial time: 1 CPUs 0.104865 s/step 0.606858 days/ns 59.1389 MB memory
FATAL ERROR: CUDA error dev_nonbonded: unspecified launch failure
[0] Stack Traceback:
  [0] CmiAbort+0x71 [0x97abad]
  [1] _Z8NAMD_diePKc+0x65 [0x4f9875]
  [2] _Z13cuda_errcheckPKc+0x45 [0x64bd59]
  [3] _Z21cuda_nonbonded_forces6float3S_S_fiiii+0x148 [0x8c7088]
  [4] _ZN20ComputeNonbondedCUDA15recvYieldDeviceEi+0x2ab [0x647913]
  [5] _ZN18CkIndex_ComputeMgr32_call_recvYieldDevice_marshall18EPvP10ComputeMgr+0x7b [0x569423]
  [6] CkDeliverMessageFree+0x25 [0x91060f]
  [7] _Z15_processHandlerPvP11CkCoreState+0x5d2 [0x90fca0]
  [8] CmiHandleMessage+0x2a [0x97bb66]
  [9] CsdScheduleForever+0x5b [0x97bc77]
  [10] CsdScheduler+0x1c [0x97b952]
  [11] _ZN7BackEnd7suspendEv+0xb [0x506593]
  [12] _ZN9ScriptTcl3runEPc+0x1a3 [0x848c99]
  [13] main+0x259 [0x4fe19d]
  [14] __libc_start_main+0xf4 [0x2b7be4b1eb54]
  [15] _ZNSt8ios_base4InitD1Ev+0x41 [0x4f8e29]
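
For readers comparing timings across the three machines, the "Initial time" line above relates s/step to days/ns by simple arithmetic. The sketch below is only an illustration: the function name `days_per_ns` and the 2 fs timestep are assumptions (the timestep is not stated in this thread, though 2 fs is consistent with the two figures on that log line).

```python
SECONDS_PER_DAY = 86_400.0
FS_PER_NS = 1_000_000.0


def days_per_ns(seconds_per_step, timestep_fs=2.0):
    """Wall-clock days needed to simulate 1 ns at the given cost per step.

    timestep_fs is an ASSUMPTION (2 fs); the thread does not state it.
    """
    steps_per_ns = FS_PER_NS / timestep_fs
    return seconds_per_step * steps_per_ns / SECONDS_PER_DAY


# s/step value taken from the "Info: Initial time" line in the log above.
print(f"{days_per_ns(0.104865):.6f} days/ns")
```

Under the 2 fs assumption this reproduces the 0.606858 days/ns reported on the same log line; a different timestep would scale the result proportionally.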
________________________________________
From: owner-namd-l_at_ks.uiuc.edu [owner-namd-l_at_ks.uiuc.edu] On Behalf Of Sadhu, Shubho (NIH/NCI) [F]
Sent: Friday, July 17, 2009 9:00 AM
To: namd-l_at_ks.uiuc.edu
Subject: namd-l: NAMD hanging on GPU

Hi everyone,
I'm trying to get NAMD running on two CUDA-enabled machines, one with a GeForce 8800 GTX, the other with a Quadro FX 3700. NAMD works fine on the Quadro, but I'm getting mixed results with the GeForce. NAMD hangs within 10 minutes for the ApoA1 case and for another 32K-atom solvated RNA system. It doesn't hang at the same spot every time; for the RNA, it's usually before step 500, but it once reached step 9500. However, alanin works (I've tried up to 2,000,000 steps).
I tried "twoAwayX yes", but that didn't help. I also tried turning off PME, temperature regulation, pressure regulation, and RATTLE, but no luck.
Technical details: The machine with the GeForce is 64-bit; the Quadro machine is 32-bit. I'm using a CVS version of NAMD that I downloaded on June 23. I tried compiling with g++ and icc, but both builds hang. The NVIDIA driver version on both machines is 180.22.
Any help is appreciated.

Shubho
