Issue attaching to nodes

From: Bryan Holland (bholla01_at_uoguelph.ca)
Date: Thu Feb 25 2010 - 10:24:13 CST

Hi there,

When attempting to run:

charmrun ++verbose ++nodelist nodelist ++nodegroup n01 +p4 namd2 +idlepoll /test/apoa1/apoa1.namd > /test/apoa1.out

I get the following output:

Charmrun> charmrun started...
Charmrun> using nodelist as nodesfile
Charmrun> remote shell (n01ib0.cluster.net:0) started
Charmrun> remote shell (n01ib0.cluster.net:1) started
Charmrun> remote shell (n01ib0.cluster.net:2) started
Charmrun> remote shell (n01ib0.cluster.net:3) started
Charmrun> node programs all started
Charmrun> error 0 attaching to node:

I've seen this problem a few times in the archives but haven't found a way to fix it. I'm attempting to run the ibverbs-CUDA binary from the NAMD website on a small, recently built GPU cluster running openSUSE 11.2. The InfiniBand network is working fine, and rsh is set up and working (e.g. "rsh n01ib0 pwd" does not require a password).

Other details:
- using OFED 1.5.1
- CUDA 2.3 (which works fine with the SDK)
- NVIDIA driver: 190.53
- processors: Xeon 5520s
- GPUs: GeForce GTX 295s
- I have the directory that contains libcudart.so.2 in /etc/ld.so.conf before other CUDA ld directories, as stipulated in the release notes.
- using NFS to export the master's /home directory, which is where the NAMD binaries are, so all nodes can see them
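
One way to check the library setup described above on a compute node, under the same non-interactive rsh environment that charmrun uses, is sketched below. This is only a sketch and may not point at the cause of the error. The helper name missing_libs and the NAMD_BIN placeholder are assumptions made for illustration; neither is part of NAMD or charmrun.

```shell
#!/bin/sh
# Hypothetical helper: print the names of shared libraries that ldd
# reports as "not found". Reads ldd output on stdin and prints nothing
# if every library resolves.
missing_libs() {
    awk '/not found/ { print $1 }'
}

# Hypothetical usage. NAMD_BIN is a placeholder for wherever namd2
# lives on the NFS-exported /home directory:
#   rsh n01ib0 ldd "$NAMD_BIN" | missing_libs
```

If this prints libcudart.so.2 (or anything else) when run through rsh on a node, the dynamic linker on that node is not finding the library, even if the same check passes on the master.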

I'm stumped, so any help would be greatly appreciated. If you need any more details, let me know.

Cheers,
Bryan
