[cluster-l] Single- vs. Dual- vs. Quad-core CPUs
Jay A. Kreibich
jay at kreibi.ch
Wed Mar 28 15:07:35 CDT 2007
On Wed, Mar 28, 2007 at 01:57:02PM -0500, Nils Oberg scratched on the wall:
> I don't understand why the Xeon performs better than the Opteron on
> one core, but worse than the Opteron on 4 cores. I tried a different
> CFD code and the same pattern emerged. Why might this be happening?
This type of performance analysis is mostly smoke and mirrors. Really
understanding what is going on requires a lot of nitty gritty
analysis at a very low level. In the bigger picture, this is why you
run these kinds of tests and benchmarks _with your own software_.
You find what works best and don't worry too much about why.
That said, some intelligent guesses can be made. The big thing to
remember about cores is that the current generation of
chips/sockets/motherboards has only so much bandwidth in and out of
the chip. While you can add more cores, the processor bus design
usually doesn't scale to match. So if you have a very memory-intensive
application (as most scientific computing stuff is), extra
cores don't always buy you much. Putting four cores behind one
processor bus will only drive up contention on the bus. Things are
even worse if the cores share upper-layer on-chip caches.
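To put rough numbers on the contention argument, here is a toy model in Python. It assumes every core wants a fixed amount of memory bandwidth and that all cores share one bus with a fixed ceiling. The function name and the bandwidth figures are made up for illustration; they are not measurements of any real Xeon or Opteron.

```python
def bandwidth_bound_speedup(cores, demand_per_core, bus_bandwidth):
    """Idealized speedup for `cores` cores sharing one memory bus.

    demand_per_core and bus_bandwidth use the same (arbitrary) units,
    e.g. GB/s. This ignores caches, latency, and everything else.
    """
    if cores * demand_per_core <= bus_bandwidth:
        # The bus keeps up, so scaling is linear.
        return float(cores)
    # The bus is saturated: total throughput is capped at the bus
    # bandwidth, however many cores are fighting over it.
    return bus_bandwidth / demand_per_core


# Hypothetical numbers: each core wants 3 units, the bus supplies 6.
for n in (1, 2, 4):
    print(n, "cores ->", bandwidth_bound_speedup(n, 3.0, 6.0))
```

With these made-up figures the second core still pays off, but cores three and four add nothing. Real hardware degrades more gradually, but the shape is the same.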
So my guess is that you're hitting a bus contention issue, and the
different hardware is showing this in different ways. The fact that
the AMDs are dual core, while the Xeon is quad core, can be very
significant, even if the total number of cores is the same. Also,
when running "on four cores" on something like the dual-quad Xeon,
it can make a big difference if that's "four cores on one chip" or
"two cores on each of two chips." You might sneak two cores onto one
processor bus without too many issues. You're not going to get away
with that with four-core chips.
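If you want to test the "one chip vs. two chips" question directly, you can pin the job to a chosen set of cores. The sketch below is Linux-only and assumes the kernel exposes the usual sysfs topology files; the helper names (read_topology, pick_cpus, pin_current_process) are mine, not part of any library. Note that it works on logical CPUs, so SMT siblings would show up as separate entries.

```python
import os


def read_topology(sysfs_root="/sys/devices/system/cpu"):
    """Map socket id -> sorted list of logical CPU ids (Linux sysfs).

    Returns an empty dict if the topology files are not available.
    """
    sockets = {}
    try:
        entries = os.listdir(sysfs_root)
    except OSError:
        return sockets
    for name in entries:
        # Only directories named cpu0, cpu1, ... describe a CPU.
        if not (name.startswith("cpu") and name[3:].isdigit()):
            continue
        path = os.path.join(sysfs_root, name, "topology",
                            "physical_package_id")
        try:
            with open(path) as f:
                socket_id = int(f.read().strip())
        except (OSError, ValueError):
            continue
        sockets.setdefault(socket_id, []).append(int(name[3:]))
    return {sid: sorted(cpus) for sid, cpus in sockets.items()}


def pick_cpus(topology, sockets, cores_per_socket):
    """Pick `cores_per_socket` CPUs from each of the first `sockets` sockets."""
    socket_ids = sorted(topology)[:sockets]
    if len(socket_ids) < sockets:
        raise ValueError("not enough sockets in topology")
    chosen = set()
    for sid in socket_ids:
        cpus = topology[sid][:cores_per_socket]
        if len(cpus) < cores_per_socket:
            raise ValueError("socket %d has too few CPUs" % sid)
        chosen.update(cpus)
    return chosen


def pin_current_process(cpus):
    """Restrict this process (and children it forks later) to `cpus`."""
    os.sched_setaffinity(0, cpus)
```

On a dual-socket quad-core box, pick_cpus(read_topology(), 1, 4) gives you "four cores on one chip" and pick_cpus(read_topology(), 2, 2) gives you "two cores on each of two chips". Call pin_current_process() with the result before launching the benchmark and compare the two runs.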
> I was getting better than linear speedup results for one of our
> programs. Is this possible? Here are some results:
Sure. Different cache usage, perhaps leading to less memory thrashing
as the problem is spread out. The OS is going to require some amount
of overhead, but you only pay that once, regardless of the number of
cores you're using (this is assuming the unused cores are turned off,
not simply left unutilized by the program). This is especially true
of something like networking, which can put a huge number of
interrupts on a processor, but usually on only one of them. This leads
to massive context-switching costs when everything runs on only one core.
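For reference, "better than linear" just means the parallel efficiency comes out above 1.0. A minimal sketch of the arithmetic follows; the timings in the example are invented, not the numbers from the original post.

```python
def speedup(t_one_core, t_n_cores):
    """Speedup of the n-core run relative to the one-core run."""
    return t_one_core / t_n_cores


def efficiency(t_one_core, t_n_cores, n_cores):
    """Speedup per core; 1.0 is perfectly linear scaling."""
    return speedup(t_one_core, t_n_cores) / n_cores


def is_superlinear(t_one_core, t_n_cores, n_cores):
    """True when the n-core run beats perfect linear scaling."""
    return efficiency(t_one_core, t_n_cores, n_cores) > 1.0


# Hypothetical timings: 100 s on one core, 20 s on four cores.
print(speedup(100.0, 20.0))            # 5x on 4 cores
print(efficiency(100.0, 20.0, 4))      # above 1.0, so superlinear
print(is_superlinear(100.0, 20.0, 4))
```

An efficiency above 1.0 usually says the one-core baseline was paying some cost (cache misses, interrupt handling) that the multi-core runs avoid, not that the extra cores are doing anything magical.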
There are a million different reasons we could come up with, but to be
honest, it's all a lot of guessing without a pretty in-depth analysis.
Additionally, wall-clock time is not a great way to do performance
tests. In the end, it is usually what matters, but it isn't going to
answer very many questions of this nature.
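As a small first step beyond wall-clock numbers, you can at least record CPU time next to elapsed time. Here is a sketch in Python using two standard-library clocks; the measure() helper is just an illustration, not a replacement for a real profiler.

```python
import time


def measure(fn, *args, **kwargs):
    """Run fn and return (wall_seconds, cpu_seconds, result).

    wall_seconds is elapsed real time; cpu_seconds is user + system
    CPU time charged to this process (child processes not included).
    """
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = fn(*args, **kwargs)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return wall_seconds, cpu_seconds, result


# Sleeping burns wall-clock time but almost no CPU time.
wall, cpu, _ = measure(time.sleep, 0.05)
print("wall %.3f s, cpu %.3f s" % (wall, cpu))
```

A large gap between the two says the process spent its time waiting (I/O, sleeping, or descheduled). Keep in mind that cycles stalled on memory still count as CPU time, so separating bus contention from real work takes hardware performance counters, not just a second clock.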
-j
--
Jay A. Kreibich < J A Y @ K R E I B I.C H >
"'People who live in bamboo houses should not throw pandas.' Jesus said that."
- "The Ninja", www.AskANinja.com, "Special Delivery 10: Pop!Tech 2006"