NAMD 2.7b2 with CUDA and InfiniBand

From: Jyh-Shyong (c00jsh00_at_nchc.org.tw)
Date: Sun Dec 20 2009 - 23:03:37 CST

Hi,

I tried NAMD 2.7b2 on our GPU cluster, but so far I have not been
successful. Any hints or suggestions would be appreciated.

1. I downloaded the binary NAMD_2.7b2_Linux-x86_64-ibverbs-CUDA and ran
a test case with the command

./charmrun ++local ++p 4 namd2 +idlepoll ./alanin.namd
Charmrun> IBVERBS version of charmrun
Charmrun: Bad initnode data length. Aborting

2. I tried again with the command

./charmrun ++nodelist ./hostlist ++p 4 namd2 +idlepoll ./alanin.namd

Here the file hostlist contains two lines:

group main
  host gc16
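
A nodelist of this shape can also be generated rather than typed by hand. The following is only a minimal sketch: the helper name write_hostlist is hypothetical, and it assumes that the name reported by `uname -n` is the one charmrun should use, which may not hold on every cluster.

```shell
# write_hostlist FILE [HOST]: write a one-host charmrun nodelist to FILE.
# HOST defaults to the name reported by `uname -n` (an assumption; the
# name charmrun needs may differ on a given cluster).
write_hostlist() {
  file="$1"
  host="${2:-$(uname -n)}"
  printf 'group main\n  host %s\n' "$host" > "$file"
}

# Write ./hostlist for the current machine and show it.
write_hostlist ./hostlist
cat ./hostlist
```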

gc16 is the hostname of the computer I was using. Here is the output of
this command:

..
Info:
Info: Entering startup at 0.376303 s, 104.066 MB of memory in use
Info: Startup phase 0 took 0.00472808 s, 104.066 MB of memory in use
Info: Startup phase 1 took 0.00161982 s, 104.066 MB of memory in use
Info: Startup phase 2 took 0.000169039 s, 104.066 MB of memory in use
FATAL ERROR: CUDA-enabled NAMD requires more patches than processes.
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: FATAL ERROR: CUDA-enabled NAMD requires more patches than processes.

[0] Stack Traceback:
  [0] CmiAbort+0x5f [0x9f6257]
  [1] _Z8NAMD_diePKc+0x62 [0x50ad52]
  [2] _ZN11WorkDistrib12patchMapInitEv+0x8e6 [0x8ecf46]
  [3] _ZN4Node7startupEv+0xd4d [0x8497fb]
  [4] _ZN12CkIndex_Node18_call_startup_voidEPvP4Node+0x12 [0x848aaa]
  [5] CkDeliverMessageFree+0x21 [0x96a2d5]
  [6] _Z15_processHandlerPvP11CkCoreState+0x4ba [0x96994a]
  [7] CsdScheduleForever+0xa5 [0x9f706d]
  [8] CsdScheduler+0x1c [0x9f6c6e]
  [9] _ZN7BackEnd7suspendEv+0xb [0x5138cd]
  [10] _ZN9ScriptTcl9initcheckEv+0x80 [0x8aeaaa]
  [11] _ZN9ScriptTcl3runEPc+0xb5 [0x8aac23]
  [12] _Z18after_backend_initiPPc+0x22b [0x50f56b]
  [13] main+0x3a [0x50f30a]
  [14] __libc_start_main+0xe6 [0x7f17e1e61586]
  [15] _ZNSt8ios_base4InitD1Ev+0x72 [0x50a6ca]
Fatal error on PE 0> FATAL ERROR: CUDA-enabled NAMD requires more patches than processes.

There are 4 Tesla C1070s on this node:
chem_at_gc16:/work/chem/alanin> ls -l /dev/nvi*
crw-rw-rw- 1 root video 195, 0 2009-10-13 10:56 /dev/nvidia0
crw-rw-rw- 1 root video 195, 1 2009-10-13 10:56 /dev/nvidia1
crw-rw-rw- 1 root video 195, 2 2009-10-13 10:56 /dev/nvidia2
crw-rw-rw- 1 root video 195, 3 2009-10-13 10:56 /dev/nvidia3
crw-rw-rw- 1 root video 195, 255 2009-10-13 10:56 /dev/nvidiactl
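
The same listing can be reduced to a device count, which is easier to compare against the ++p value. This is only a sketch: the helper name count_gpu_nodes is hypothetical, and it assumes that each numbered /dev/nvidiaN node corresponds to one GPU (the nvidiactl control node is not counted).

```shell
# count_gpu_nodes [DIR]: count entries named nvidia<digit>... in DIR
# (default /dev). Assumes one numbered node per GPU; nvidiactl is skipped
# because it does not match the nvidia[0-9]* pattern.
count_gpu_nodes() {
  dir="${1:-/dev}"
  n=0
  for f in "$dir"/nvidia[0-9]*; do
    # When nothing matches, the unexpanded pattern fails the -e test.
    if [ -e "$f" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
}

count_gpu_nodes
```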

I suspect that something in my environment settings is wrong, but I
don't know what it is.
I also downloaded the latest version of the source code and built the
binary with the ibverbs option for charm, and I got the same result.

Regards

Jyh-Shyong Ho, Ph.D.
Research Scientist
National Center for High Performance Computing
Hsinchu, Taiwan, ROC
