Re: Always 24-way SMP?

From: Norman Geist
Date: Wed Feb 27 2013 - 01:53:27 CST

Hi again Andrew,


what you see is the well known memory bandwidth bottleneck. Some things that
can help:


1. If this is an Intel Xeon CPU: these CPUs usually scale the memory
speed down the more memory modules are plugged in per channel. Typically one
module per channel runs at 1333 MT/s, two at 1066 MT/s and three at 800
MT/s, or similar (MT = MegaTransfers; 8 bytes * MT/s gives the memory
bandwidth per channel in bytes per second).

2. The OS can behave badly in multicore/multiprocessor environments.
Usually it does a nice job, but sometimes distributing/binding the processes
to cores manually, or pinning them so they don't hop around, can bring
improvements.

3. Try different ratios between shared-memory and distributed-memory
processes. Sometimes starting one process per physical processor, each using
all of its cores via SMP (or other ratios), can make a big difference.
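As a minimal sketch of the manual binding in point 2 (the core numbers 0-5 and the dual six-core layout are assumptions about your topology, and "sleep 2" stands in for the real charmrun/namd2 launch):

```shell
#!/bin/sh
# Bind a run to cores 0-5 (one socket on a typical dual six-core node;
# verify the numbering on your machine first, e.g. via /proc/cpuinfo).
taskset -c 0-5 sleep 2 &   # "sleep 2" stands in for the real launch
PID=$!
taskset -p "$PID"          # print the affinity mask actually applied
wait "$PID"
```

The same taskset prefix works in front of any launcher command, as the affinity mask is inherited by child processes.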


But the real bottom line is that this is a hardware limitation, not the
program's fault. I built a little model to pre-check CPU architectures
before buying them, for exactly this reason: I was sick of hardware, bought
by people with no idea, that doesn't provide the properties the programs we
use need. I have also met people who claimed there are no hardware
properties to think about regarding specific software when buying new
computers; they just said you can change the program instead. But you can
hardly change the nature of a given problem, and rewriting a stable and
well-tested code to better fit another hardware will produce more bugs than
real use. In any case, NAMD usually doesn't need that much memory bandwidth;
CPMD or VASP are much more dependent on it.


To better understand what's going on, you could run a benchmark while
pinning the processes to specific cores. You can find out which core belongs
to which physical processor by looking at the output of "cat /proc/cpuinfo"
and noting which core id belongs to which physical id: the same physical id
means the same CPU socket. Then you can start NAMD with taskset to bind the
NAMD processes, using two different schemes, from 1 up to 12/16 cores. The
first is multiple-socket avoidance, the second is shared-socket avoidance:
the first means filling up one processor first, the second means
distributing the processes between the sockets. This usually shows that
shared-socket avoidance gives better performance than multiple-socket
avoidance, because the processes don't have to share the same memory
bandwidth; since QPI/HyperTransport, every socket has its own memory
controller. You will also see the number of cores per CPU at which the
memory bandwidth becomes saturated.
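The core-id-to-physical-id mapping can be pulled out of /proc/cpuinfo with a small awk one-liner, for example (the taskset lines in the comments use illustrative core numbers, not a verified NAMD command line):

```shell
#!/bin/sh
# List which logical cpu sits on which socket: the same "physical id"
# means the same socket. Output lines look like "cpu0 -> socket0 core0".
awk -F: '
  /^processor/   { cpu  = $2 }
  /^physical id/ { phys = $2 }
  /^core id/     { printf "cpu%d -> socket%d core%d\n", cpu, phys, $2 }
' /proc/cpuinfo

# With that mapping, the two pinning schemes look roughly like
# (illustrative core numbers):
#   taskset -c 0,1 charmrun +p2 namd2 run.conf   # fill one socket first
#   taskset -c 0,6 charmrun +p2 namd2 run.conf   # spread across sockets
```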


Good luck


Norman Geist.


From: [] On behalf of Andrew Pearson
Sent: Tuesday, February 26, 2013 15:15
To: Norman Geist
Cc: Namd Mailing List
Subject: Re: namd-l: Always 24-way SMP?


Hi Norman

OK, the first thing to note is that I have several 12-core nodes and several
16-core nodes in my cluster. I restricted my initial runs to the 12-core
nodes in order to keep things consistent. You were correct when you said
that Charm++ said I was using a 24-way SMP node because HT was enabled. Now
that HT is disabled, Charm++ sees a 12-way SMP.

Later when I did further tests, the 12-core nodes were occupied by another
user and so I switched to the 16-core nodes. Now, Charm++ sees a 16-way SMP
node (because I've now disabled HT). I would imagine that if I had used the
16-core nodes previously, Charm++ would have seen 32-way SMP.

I performed a scaling test on a single 16-core node, first with HT, and then
without. The HT result shows linear scaling until approximately 8
processors, and by 12 processors the departure from linearity is
significant. The non-HT result shows the same initial linear scaling, but it
continues all the way to 16 processors. At 12 processors the speedup is 9.5
and at 16 processors the speedup is 12.3.

I admit to not being an expert in the specific numerical method that NAMD
uses to solve the problem, but I imagine that it involves a lot of
communication, and that the resulting speedup will not be ideal. Is this
correct, or should I be expecting almost-16x speedup for 16 processors?
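For what it's worth, those figures correspond to parallel efficiencies (speedup divided by core count) of about 79% and 77%, which is easy to check:

```shell
# Parallel efficiency = speedup / number of cores, for the figures above.
awk 'BEGIN {
  printf "12 cores: speedup 9.5  -> efficiency %.0f%%\n", 100 * 9.5  / 12
  printf "16 cores: speedup 12.3 -> efficiency %.0f%%\n", 100 * 12.3 / 16
}'
# -> 12 cores: speedup 9.5  -> efficiency 79%
# -> 16 cores: speedup 12.3 -> efficiency 77%
```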

I think this explains everything. I'll send you my /proc/cpuinfo if you
really want to see it or you think there are still-unanswered questions.






On Tue, Feb 26, 2013 at 1:27 AM, Norman Geist
<> wrote:

Hi Andrew,


nice to hear that, so far. But I'm still confused about:


1. Charm++ saying it's a 24-way SMP node.

2. The speedup being 12.

3. You saying it's a 16-core node.


Could you post the output of "cat /proc/cpuinfo", so we can make sure we
fully understand what's going on?


Norman Geist.


From: [] On behalf of Andrew Pearson
Sent: Monday, February 25, 2013 18:50

To: Norman Geist
Cc: Namd Mailing List
Subject: Re: namd-l: Always 24-way SMP?


Hello again Norman

Yes, this was exactly the problem. I disabled hyperthreading on a compute
node and performed my scaling test again, and this time the results were
perfect. The speedup is now linear, and I get 12.3x for a 16-core run on a
single 16-core node. Thank you for your advice and for pointing out this
problem -- this would have affected many of our users, not just the NAMD
users.



On Mon, Feb 25, 2013 at 10:06 AM, Norman Geist
<> wrote:



what kind of CPU are you using on this node? What you describe reminds me of
hyper-threading. Could it be that your machine has only 12 physical cores,
and the rest are the hyper-threading "logical" cores? If so, it's no wonder
that NAMD can't get any benefit out of the virtual cores (actually only a
second instruction schedule per physical core), which are meant to fill up
gaps in the CPU schedule when multitasking, since tasks also produce wait
times, for example with disk IO. As NAMD is a highly optimized code that
doesn't leave many such gaps, the maximum speedup of 12 is reasonable.
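A quick way to check whether hyper-threading is active on a node (the sysfs SMT node only exists on newer kernels):

```shell
#!/bin/sh
# "Thread(s) per core: 2" means hyper-threading is on, so half of the
# logical CPUs are HT siblings rather than physical cores.
lscpu | grep -E 'Socket|per core|per socket'
# Newer kernels also report the SMT state directly (1 = active):
cat /sys/devices/system/cpu/smt/active 2>/dev/null || true
```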

So I think you have two six-core CPUs on your node. Please let us know.


Furthermore, I have never observed problems with the precompiled NAMD
builds, and most of what I've read about them concerned InfiniBand and OFED
stuff. Also, those problems were about successfully starting NAMD, not about
bad parallel scaling.


Norman Geist.


From: Andrew Pearson []
Sent: Monday, February 25, 2013 13:28
To: Norman Geist
Cc: Namd Mailing List
Subject: Re: namd-l: Always 24-way SMP?


Hi Norman


Thanks for the response. I didn't phrase my question well - I know I'm
experiencing scaling problems, and I'm trying to determine whether
precompiled namd binaries are known to cause problems. I ask this since many
people seem to say that you should compile namd yourself to save headaches.


Your explanation about charm++ displaying information about the number of
cores makes sense. I'll bet that's what's happening.


My scaling problem is that for a given system (27 patches, 50000 atoms) I
get perfect speedup until nprocs = 12 and then the speedup line goes almost
flat. This occurs for runs performed on a single 16 core node.



On Monday, February 25, 2013, Norman Geist wrote:

Hi Andrew,


it's a bad idea to ask someone else whether you have scaling problems; you
should know whether you do or not. The information in the output file just
comes from the Charm++ startup and is simply information about the
underlying hardware. It doesn't mean NAMD runs in SMP mode; it just tells
you it's a multiprocessor/multicore node. Watch the output carefully and you
will see, IMHO, that it uses the right number of CPUs (for example in the
Benchmark lines). So what kind of scaling problems do you have? Don't you
get the expected speedup?


Norman Geist.


From: [] On behalf of Andrew Pearson
Sent: Friday, February 22, 2013 19:30
Subject: namd-l: Always 24-way SMP?


I'm investigating scaling problems with NAMD. I'm running precompiled
linux-64-tcp binaries on a linux cluster with 12-core nodes using "charmrun
+p $NPROCS ++mpiexec".

I know scaling problems have been covered, but I can't find the answer to my
specific question. No matter how many cores I use or how many nodes they
are spread over, at the top of stdout charm++ always reports "Running on #
unique compute nodes (24-way SMP)". It gets # correct, but it's always
24-way SMP. Is this supposed to be this way? If so, why?

Everyone seems to say that you should recompile NAMD with your own MPI
library, but I don't seem to have problems running NAMD jobs to completion
with charmrun + OpenMPI built with intel (except for the scaling). Could
using the precompiled binaries result in scaling problems?

Thank you.



This archive was generated by hypermail 2.1.6 : Tue Dec 31 2013 - 23:22:59 CST