SMP NAMD reports threads greater than physical cores, even when distributed to other nodes

From: Tom Coles (tcoles_at_mit.edu)
Date: Thu Dec 17 2015 - 10:26:20 CST

I am trying to run NAMD in SMP mode with ibverbs. I have tried versions 2.10 and 2.11b2, but both always report that the total number of threads is greater than the number of physical cores, even though I am asking for the threads to be placed on different nodes. This happens even when the node from which I launch the job is not in the nodelist and does not receive any PE threads.

The following message is printed:
Warning: the number of SMP threads (32) is greater than the number of physical cores (8), so threads will sleep while idling. Use +CmiSpinOnIdle or +CmiSleepOnIdle to control this directly.

I have four nodes with 8 cores per node, and I know that I need to leave one core per node free for the communication thread. Each node has 8 full cores, no HT.

The command line is:
charmrun namd2 +p28 ++ppn 7 ++nodelist mynodelist ++verbose namdInput
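
For reference, the thread counts implied by this command line can be tallied as below. This is only a sketch of the arithmetic, assuming one communication thread per SMP process in addition to its ++ppn worker threads; the function name smp_layout is made up for illustration and is not part of NAMD or Charm++. Under that assumption the total matches the 32 in the warning, which suggests the warning may be comparing the job-wide thread count against the core count of a single node.

```python
def smp_layout(total_pes, ppn, comm_threads_per_process=1):
    """Tally processes and threads for a +p<total_pes> ++ppn <ppn> launch.

    Assumption: each SMP process runs `ppn` worker (PE) threads plus
    `comm_threads_per_process` communication threads.
    """
    if total_pes % ppn != 0:
        raise ValueError("+p must be a multiple of ++ppn")
    processes = total_pes // ppn
    threads_per_process = ppn + comm_threads_per_process
    return {
        "processes": processes,
        "threads_per_process": threads_per_process,
        "total_threads": processes * threads_per_process,
    }


layout = smp_layout(total_pes=28, ppn=7)
print(layout)  # total_threads is 32: 28 worker threads + 4 communication threads
```

With four 8-core nodes, this gives 8 threads per process, so one process per node should not oversubscribe any node.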

The mynodelist file contains each node listed only once:
group main
host node001 ++shell ssh ++cpus 8
host node002 ++shell ssh ++cpus 8
host node003 ++shell ssh ++cpus 8
host node004 ++shell ssh ++cpus 8

The verbose output confirms that charmrun has connected to all four nodes, and I have logged in to each of them with ssh and used ps to confirm that namd2 is running in each case.
Charmrun> adding client 0: "node001", IP:127.0.0.1
Charmrun> adding client 1: "node001", IP:127.0.0.1
Charmrun> adding client 2: "node001", IP:127.0.0.1
Charmrun> adding client 3: "node001", IP:127.0.0.1
Charmrun> adding client 4: "node001", IP:127.0.0.1
Charmrun> adding client 5: "node001", IP:127.0.0.1
Charmrun> adding client 6: "node001", IP:127.0.0.1
Charmrun> adding client 7: "node002", IP:172.16.0.2
Charmrun> adding client 8: "node002", IP:172.16.0.2
Charmrun> adding client 9: "node002", IP:172.16.0.2
Charmrun> adding client 10: "node002", IP:172.16.0.2
Charmrun> adding client 11: "node002", IP:172.16.0.2
Charmrun> adding client 12: "node002", IP:172.16.0.2
Charmrun> adding client 13: "node002", IP:172.16.0.2
Charmrun> adding client 14: "node003", IP:172.16.0.3
Charmrun> adding client 15: "node003", IP:172.16.0.3
Charmrun> adding client 16: "node003", IP:172.16.0.3
Charmrun> adding client 17: "node003", IP:172.16.0.3
Charmrun> adding client 18: "node003", IP:172.16.0.3
Charmrun> adding client 19: "node003", IP:172.16.0.3
Charmrun> adding client 20: "node003", IP:172.16.0.3
Charmrun> adding client 21: "node004", IP:172.16.0.4
Charmrun> adding client 22: "node004", IP:172.16.0.4
Charmrun> adding client 23: "node004", IP:172.16.0.4
Charmrun> adding client 24: "node004", IP:172.16.0.4
Charmrun> adding client 25: "node004", IP:172.16.0.4
Charmrun> adding client 26: "node004", IP:172.16.0.4
Charmrun> adding client 27: "node004", IP:172.16.0.4
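
The per-host distribution in this output can be checked mechanically rather than by eye. The following is a minimal sketch that counts "adding client" lines per host; the regular expression is based only on the line format shown above, and the helper name clients_per_host is hypothetical. The sample text is a shortened excerpt of the log.

```python
import re
from collections import Counter

# Matches lines such as: Charmrun> adding client 7: "node002", IP:172.16.0.2
CLIENT_LINE = re.compile(r'^Charmrun> adding client (\d+): "([^"]+)", IP:(\S+)$')


def clients_per_host(log_text):
    """Count how many clients charmrun reports adding for each host name."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CLIENT_LINE.match(line.strip())
        if match:
            counts[match.group(2)] += 1
    return counts


sample = """\
Charmrun> adding client 0: "node001", IP:127.0.0.1
Charmrun> adding client 1: "node001", IP:127.0.0.1
Charmrun> adding client 7: "node002", IP:172.16.0.2
"""

for host, count in sorted(clients_per_host(sample).items()):
    print(host, count)
```

Run against the full listing, this should report 7 clients for each of the four hosts, which is one client per PE thread rather than one per process.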

Please can you let me know if I am doing something wrong? I am also concerned that 28 clients are added - is it correct that it needs to add one client per thread (rather than per process) like this?

I wonder if there might be a bug: when I ran the command from a fifth node (not in the nodelist), the same message was printed, even though no threads are assigned to that node. I have confirmed with top that nothing is actually running there; there is no significant activity from namd2.

Thanks for any help,
Tom Coles
