Re: NAMD benchmark results for dual-Nehalem, was Re: questions regarding new single node box for NAMD: re components, CUDA, Fermi

From: Biff Forbush (biff.forbush_at_yale.edu)
Date: Thu Dec 10 2009 - 19:47:58 CST

Hi Axel,

    Thanks for your comments; you're right on about the hyperthreading,
and it makes perfect sense. The data below show that:

1) The decrease in scaling is indeed from scheduling both real and
virtual cores. With hyperthreading off, scaling stays near 0.95 from
2 to 8 real cores. My guess from watching Task Manager is that the
Windows scheduler assigns processes more or less at random, and with
hyperthreading on it sometimes puts work on both the real and virtual
halves of a core even when idle physical cores are still available --
this fits at least qualitatively with the early scaling dropoff between
+p4 and +p7. More tests fooling with processor affinity could nail this
down (a rough sketch of such a test follows the numbered points below),
but they don't seem worth the time.

2) All that being said, with the system running flat out, hyperthreading
gives about a 16% advantage, comparing the +p8 HT-off and +p16 HT-on
benchmark results (see the quick check after the table). I expect a
similar result under Linux. But all bets are off once GPUs are added
(or maybe you want to place a bet).

[3) Not shown below: I also ran the benchmark on a 2.8 GHz i7-860
system. At 2.8 GHz, the data are almost identical to the numbers for
the first 8 hyperthreaded "cores" of the dual Xeons at the same clock
speed. Unlike the Xeons, the i7 did show some performance dropoff with
processor speed, about 16% below a linear extrapolation from 2.0 GHz to
2.8 GHz, suggesting a memory bandwidth issue -- the Lynnfield i7 has a
weaker memory subsystem.]
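
(For anyone who does want to poke at the affinity angle: below is a
rough, untested sketch of one way to pin running NAMD processes to one
logical CPU per physical core, so the scheduler can't put two worker
threads on the same hyperthreaded core. It assumes the third-party
psutil Python package, a worker process named "namd2", and that the
logical CPUs are enumerated with hyperthread siblings adjacent (0,1 on
the first core, 2,3 on the second, and so on), all of which would need
checking on a given box.)

import psutil

def pin_to_physical_cores(pid, siblings_adjacent=True):
    """Restrict one process to a single logical CPU per physical core."""
    proc = psutil.Process(pid)
    n_logical = psutil.cpu_count(logical=True)
    if siblings_adjacent:
        # siblings numbered 0/1, 2/3, ...: keep every second logical CPU
        mask = list(range(0, n_logical, 2))
    else:
        # siblings numbered as a second block: keep the first half
        mask = list(range(n_logical // 2))
    proc.cpu_affinity(mask)
    return mask

# pin every namd2 process found on the machine
for p in psutil.process_iter(["name"]):
    if p.info["name"] and "namd2" in p.info["name"].lower():
        print(p.pid, "->", pin_to_physical_cores(p.pid))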

3.3 GHz dual Nehalem (default apoa1 benchmark)

        Hyperthreading on (prev. data)        Hyperthreading off
  +p    s/step   scaling                +p    s/step   scaling
   1    1.643    1.000                   1    1.637    1.00
   2    0.842    0.975                   2    0.836    0.97
   3    0.570    0.962                   3    0.575    0.94
   4    0.428    0.959                   4    0.420    0.97
   5    0.394    0.834                   5    0.340    0.96
   6    0.402    0.681                   6    0.289    0.94
   7    0.369    0.636                   7    0.241    0.96
   8    0.336    0.611                   8    0.212    0.96
   9    0.303    0.603
  10    0.273    0.601
  11    0.251    0.594
  12    0.227    0.604
  13    0.212    0.595
  14    0.201    0.583
  15    0.188    0.584
  16    0.180    0.569
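
(For reference, the scaling column above is just t(+p1) / (p * t(+p)).
The little Python snippet below is not what produced the table -- it is
just a quick check of the key numbers, assuming Python is at hand.)

# recompute parallel efficiency and the hyperthreading comparison
# from the seconds/step values in the table above
t1_ht_on  = 1.643   # +p1,  hyperthreading on
t16_ht_on = 0.180   # +p16, hyperthreading on
t1_ht_off = 1.637   # +p1,  hyperthreading off
t8_ht_off = 0.212   # +p8,  hyperthreading off

def scaling(t1, p, tp):
    # ideal time t1/p divided by the measured time tp
    return t1 / (p * tp)

print("scaling at +p16, HT on :", round(scaling(t1_ht_on, 16, t16_ht_on), 3))  # ~0.57
print("scaling at +p8,  HT off:", round(scaling(t1_ht_off, 8, t8_ht_off), 3))  # ~0.97
print("HT speedup factor      :", round(t8_ht_off / t16_ht_on, 3))             # ~1.18

Depending on whether one quotes the gain in throughput or the reduction
in time per step, that last ratio works out to roughly 15-18%, consistent
with the ~16% figure above.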

Regards,
Biff

> hi biff,
>
> very interesting results. it looks like performance predictions get
> more complicated with every new generation of CPUs.
> it speaks well for the implementation of the non-bonded forces
> in NAMD that it can translate an increase in clock rate into real
> performance without being affected that much by memory
> bandwidth issues.
>
> [...]
>
>
>> The bottom lines are that
>> (1) performance is strictly proportional to CPU clock rate between 2.0 and
>> 3.33 GHz at all "+p" values. Apparently the architecture improvements in
>> Nehalem have fixed earlier memory bottlenecks.
>> (2) NAMD scaling efficiency drops to about 60% on going from +p4 to +p8 and
>> then holds fairly steady to +p16 (see more detailed steps at the very end of
>> the message) -- puzzling that the drop is this early. Here are the raw
>> values for the default apoa1 benchmark in seconds/step:
>>
>
> now, this second finding is particularly strange. your system has only
> 8 real cores; the remaining "cores" that make up the total of 16 are
> virtual, from the activated hyper-threading. unlike the hyper-threading
> in pentium-4 type processors, there is some performance benefit from
> using hyperthreading (the processor can overlap data reads and
> computations from different processes), but i found the benefit to be
> at most about 10% on our 2x quad-core nehalem nodes running linux.
>
> i can only speculate where this is coming from, and my first guess
> would be the process scheduling in the windows kernel. it has proven
> to be not well adapted to the workloads of scientific computing, and
> even with the ineffective p4 hyperthreading, windows machines would
> generally run more smoothly with hyperthreading enabled. this would
> also explain the strange drop in performance: the scheduler could be
> filling the cores in the wrong order, i.e. first all real and virtual
> cores on one CPU and then the real and virtual ones on the second CPU.
>
> i would suggest rebooting the machine with hyperthreading disabled and
> rerunning the test in order to confirm or refute this assertion.
>
> another performance-related item to check out would be the impact
> of processor affinity.
>
> cheers,
> axel.
>
>
>
