Re: NAMD benchmark results for dual-Nehalem, was Re: questions regarding new single node box for NAMD: re components, CUDA, Fermi

From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Thu Dec 10 2009 - 00:52:52 CST

hi biff,

very interesting results. looks like with every new generation of
CPUs performance predictions become more complicated.
it speaks well for the implementation of the non-bonded forces
in NAMD that it can translate an increase in clock rate into real
performance, without being affected that much by memory
bandwidth issues.

[...]

> The bottom lines are that
> (1) performance is strictly proportional to CPU clock rate between 2.0 and
> 3.33 GHz at all "+p" values.  Apparently the  architecture improvements in
> Nehalem have fixed earlier memory bottlenecks.
> (2) NAMD scaling efficiency drops to about 60% on going from +p4 to +p8 and
> then holds fairly steady to +p16 (see more detailed steps at the very end of
> the message) -- puzzling that the drop is this early.  Here are the raw
> values for the default apoa1 benchmark in seconds/step:

now, this second finding is particularly strange. your system has only 8 cores,
and the remaining "cores" that make up the total of 16 are virtual, due to
hyper-threading being enabled. unlike the hyper-threading in pentium-4 type
processors, hyper-threading on nehalem does give some performance benefit (the
processor can overlap data reads and computations from different processes),
but i found the benefit to be 10% at most on our dual quad-core nehalem
nodes running linux.
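
for reference, a minimal sketch of how the scaling efficiency between two
"+p" values can be computed from seconds/step timings. the function name and
the timings are made up for illustration; they are not the numbers measured
in this thread.

```python
def parallel_efficiency(t_ref, p_ref, t_p, p):
    """Efficiency of a run on p processes relative to a reference run on p_ref.

    t_ref and t_p are wall-clock times per step (seconds/step).
    A value of 1.0 means perfect scaling from p_ref to p processes.
    """
    return (t_ref * p_ref) / (t_p * p)


# hypothetical seconds/step values, NOT the measured apoa1 results
timings = {4: 0.60, 8: 0.50}

eff = parallel_efficiency(timings[4], 4, timings[8], 8)
print(f"+p4 -> +p8 efficiency: {eff:.0%}")
```

note that beyond +p8 on an 8-core machine the extra processes land on
hyper-threads, so an "ideal" of 16x is not a fair reference there.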

i can only speculate where this is coming from, and my first guess would be
the process scheduling in the windows kernel. it has been shown to be poorly
adapted to the workloads in scientific computing, and even with the ineffective
p4 hyper-threading, windows machines would generally run more smoothly when
hyper-threading was enabled. this would also explain the strange drop in
performance: the scheduler could be filling the cores in the wrong order, i.e.
first all real and virtual cores on one CPU and then the real and virtual ones
on the second CPU.
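
to illustrate the guess above, here is a toy model of two placement orders on
a 2-socket, 4-cores-per-socket machine with 2 hardware threads per core. this
is only a sketch of the hypothesis: the function names are hypothetical, and
it says nothing about what the windows scheduler actually does.

```python
SOCKETS = 2
CORES_PER_SOCKET = 4
THREADS_PER_CORE = 2


def socket_first_order():
    """Fill real cores, then virtual cores, of socket 0 before touching socket 1."""
    return [(s, c, t)
            for s in range(SOCKETS)
            for t in range(THREADS_PER_CORE)
            for c in range(CORES_PER_SOCKET)]


def spread_order():
    """Use one hardware thread on every physical core before any second thread."""
    return [(s, c, t)
            for t in range(THREADS_PER_CORE)
            for c in range(CORES_PER_SOCKET)
            for s in range(SOCKETS)]


def physical_cores_used(order, nprocs):
    """Count distinct (socket, core) pairs occupied by the first nprocs slots."""
    return len({(s, c) for s, c, _ in order[:nprocs]})


for p in (4, 8, 16):
    print(p,
          physical_cores_used(socket_first_order(), p),
          physical_cores_used(spread_order(), p))
```

in this model the socket-first order still has only 4 physical cores busy at
+p8, while the spread order has all 8 busy, which is the kind of difference
that could show up as a drop in efficiency between +p4 and +p8.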

i would suggest rebooting the machine with hyper-threading disabled and
rerunning the test in order to confirm or refute this guess.

another performance-related item worth checking is the impact
of processor affinity.
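
as a rough sketch of what pinning looks like: on linux, python exposes the
affinity calls directly via os.sched_setaffinity, which is not available on
windows. there, an affinity bitmask (as built by the helper below) is what
tools such as "start /affinity" expect. the helper names are hypothetical, and
which logical CPU numbers map to which physical cores is machine-specific and
has to be checked first.

```python
import os


def affinity_mask(cpus):
    """Build a bitmask with one bit set per logical CPU number in cpus."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


def pin_to_cpus(cpus):
    """Pin the calling process to the given logical CPUs (linux only).

    Returns the affinity set reported by the kernel afterwards.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("os.sched_setaffinity is not available here")
    os.sched_setaffinity(0, set(cpus))
    return os.sched_getaffinity(0)


# assumption: even-numbered logical CPUs are distinct physical cores
print(hex(affinity_mask([0, 2, 4, 6])))
```

running the benchmark once with processes pinned to distinct physical cores
and once unpinned would show how much the placement matters.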

cheers,
    axel.

-- 
Dr. Axel Kohlmeyer    akohlmey_at_gmail.com
Institute for Computational Molecular Science
College of Science and Technology
Temple University, Philadelphia PA, USA.
