NAMD hangs at "Load balancer assumes all CPUs are same." when run on an InfiniBand cluster

From: yjcoshc (yjcoshc_at_gmail.com)
Date: Mon May 14 2018 - 22:08:46 CDT

Dear NAMD users,

I am trying to set up an InfiniBand cluster (just two nodes for now) to run
NAMD. If I run NAMD with:

/home/data/NAMD_Git-2018-05-03_Source/Linux-x86_64-g++/charmrun ++p 92 \
    ++ppn 46 ++remote-shell ssh ++verbose \
    /home/data/NAMD_Git-2018-05-03_Source/Linux-x86_64-g++/namd2 \
    +ignoresharing ++nodelist ./machinefile +devices 0,1 ++verbose \
    ./apoa1.namd > chc.log &

then the log shows that NAMD hangs at:

CharmLB> Load balancer assumes all CPUs are same.

However, if I use just one node (local or remote), NAMD works properly.

Any ideas?

The full log is attached below:

Charmrun remote shell(10.10.10.17.1)> remote responding...
Charmrun remote shell(10.10.10.15.0)> remote responding...
Charmrun remote shell(10.10.10.17.1)> starting node-program...
Charmrun remote shell(10.10.10.15.0)> starting node-program...
Charmrun remote shell(10.10.10.17.1)> remote shell phase successful.
Charmrun remote shell(10.10.10.15.0)> remote shell phase successful.
Charmrun> scalable start enabled.
Charmrun> charmrun started...
Charmrun> using ./machinefile as nodesfile
Charmrun> added host "10.10.10.15", IP:10.10.10.15
Charmrun> added host "10.10.10.17", IP:10.10.10.17
Charmrun> Charmrun = 10.10.10.17, port = 36593
Charmrun> IBVERBS version of charmrun
Charmrun> Sending "0 10.10.10.17 36593 10677 0" to client 0.
Charmrun> find the node program "/home/data/NAMD_Git-2018-05-03_Source/Linux-x86_64-g++/namd2" at "/home/data/apoa1" for 0.
Charmrun> Starting ssh 10.10.10.15 -l root -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash -f
Charmrun> remote shell (10.10.10.15:0) started
Charmrun> Sending "1 10.10.10.17 36593 10677 0" to client 1.
Charmrun> find the node program "/home/data/NAMD_Git-2018-05-03_Source/Linux-x86_64-g++/namd2" at "/home/data/apoa1" for 1.
Charmrun> Starting ssh 10.10.10.17 -l root -o KbdInteractiveAuthentication=no -o PasswordAuthentication=no -o NoHostAuthenticationForLocalhost=yes /bin/bash -f
Charmrun> remote shell (10.10.10.17:1) started
Charmrun> node programs all started
Charmrun> Waiting for 0-th client to connect.
Charmrun> Waiting for 1-th client to connect.
Charmrun> All clients connected.
Charmrun> IP tables sent.
Charmrun> node programs all connected
Charmrun> started all node programs in 1.245 seconds.
Charm++> Running in SMP mode: numNodes 2,  46 worker threads per process
Charm++> The comm. thread only receives messages, while work threads send messages
Charm++> Using recursive bisection (scheme 3) for topology aware partitions
Converse/Charm++ Commit ID: v6.8.2-685-gddf5c291d
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
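
As a quick sanity check on the launch parameters, the process layout in the log ("numNodes 2, 46 worker threads per process") can be compared with the ++p and ++ppn values from the command. The sketch below assumes the usual reading of these charmrun options in SMP mode (++p is the total number of worker threads, ++ppn is the number of worker threads per process); the variable names are only illustrative.

```shell
#!/bin/sh
# Values copied from the charmrun command line in this post.
p=92     # ++p: total worker threads requested (assumed meaning)
ppn=46   # ++ppn: worker threads per process (assumed meaning)

# Number of processes charmrun is expected to launch.
procs=$((p / ppn))

# Warn if ++p is not a whole multiple of ++ppn.
if [ $((p % ppn)) -ne 0 ]; then
    echo "warning: ++p is not a multiple of ++ppn"
fi

echo "processes: $procs, worker threads per process: $ppn"
```

With these values the result is two processes of 46 worker threads each, which agrees with the "Running in SMP mode" line above, so the hang does not appear to come from a mismatch between the requested and the launched layout.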

Haochuan Chen
