[cluster-l] Single- vs. Dual- vs. Quad-core CPUs

Jay A. Kreibich jay at kreibi.ch
Wed Mar 28 15:07:35 CDT 2007


On Wed, Mar 28, 2007 at 01:57:02PM -0500, Nils Oberg scratched on the wall:

> I don't understand why the Xeon performs better than the Opteron on 
> one core, but worse than the Opteron on 4 cores.  I tried a different 
> CFD code and the same pattern emerged.  Why might this be happening?

  This type of performance analysis is mostly smoke and mirrors.  Really
  understanding what is going on requires a lot of nitty-gritty
  analysis at a very low level.  In the bigger picture, this is why you
  run these kinds of tests and benchmarks _with your own software_.
  You find what works best and don't worry too much about why.


  That said, some intelligent guesses can be made.  The big thing to
  remember about cores is that the current generation of
  chips/sockets/motherboards has only so much bandwidth in and out of
  the chip.  You can add more cores, but the processor bus design
  usually isn't scaled up to match.  So if you have a very
  memory-intensive application (as most scientific computing stuff
  is), extra cores don't always buy you much.  Putting four cores
  behind one processor bus will only drive up contention on the bus.
  Things are even worse if the cores share upper-layer on-chip caches.
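
  To make the bus argument concrete, here is a toy model, not a
  measurement.  It assumes the compute part of a job divides perfectly
  across cores while all memory traffic goes over one shared bus of
  fixed bandwidth.  The function names and all of the numbers are
  made up for illustration; real hardware is messier.

```python
def modeled_time(cores, compute_s, traffic_gb, bus_gb_per_s):
    """Toy run time: compute divides across cores, memory traffic does not."""
    compute = compute_s / cores          # assumed perfectly parallel
    memory = traffic_gb / bus_gb_per_s   # one shared bus, fixed bandwidth
    return max(compute, memory)          # the slower of the two sets the pace


def modeled_speedup(cores, compute_s, traffic_gb, bus_gb_per_s):
    """Speedup relative to the same toy model on a single core."""
    base = modeled_time(1, compute_s, traffic_gb, bus_gb_per_s)
    return base / modeled_time(cores, compute_s, traffic_gb, bus_gb_per_s)


if __name__ == "__main__":
    # Hypothetical job: 100 s of compute, 400 GB of traffic, 8 GB/s bus.
    for n in (1, 2, 4):
        print(n, modeled_speedup(n, 100.0, 400.0, 8.0))
```

  With these made-up numbers the job is compute-bound on one core and
  bus-bound from two cores on, so going from two to four cores behind
  the same bus buys nothing in the model.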

  So my guess is that you're hitting a bus contention issue, and the
  different hardware is showing this in different ways.  The fact that
  the AMDs are dual core, while the Xeon is quad core, can be very
  significant-- even if the total number of cores is the same.  Also,
  when running "on four cores" for something like the dual-quad Xeon,
  it can make a big difference if that's "four cores on one chip" or
  "two cores on each of two chips."  You might sneak two cores onto one
  processor bus without too many issues.  You're not going to get away
  with that with four-core chips.
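
  If you want to test the "one chip vs. two chips" question directly,
  you can pin the job to specific cores.  What follows is a rough
  sketch.  The core-to-socket map is hypothetical (check your own
  machine's layout), pick_cores() and pin_current_process() are names
  I made up, and os.sched_setaffinity() is only available on Linux.

```python
import os


def pick_cores(socket_of, count, spread):
    """Choose `count` core IDs, spread across sockets or packed onto one."""
    by_socket = {}
    for core, sock in sorted(socket_of.items()):
        by_socket.setdefault(sock, []).append(core)
    groups = [by_socket[s] for s in sorted(by_socket)]
    if spread:
        # Round-robin over sockets so the load lands on every bus.
        longest = max(len(g) for g in groups)
        order = [g[i] for i in range(longest) for g in groups if i < len(g)]
    else:
        # Fill one socket completely before touching the next.
        order = [c for g in groups for c in g]
    return order[:count]


def pin_current_process(cores):
    """Pin this process to `cores` (Linux only; skipped elsewhere)."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, set(cores))


if __name__ == "__main__":
    # Hypothetical dual-socket, quad-core layout: core ID -> socket ID.
    topo = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1, 7: 1}
    print("packed:", pick_cores(topo, 4, spread=False))
    print("spread:", pick_cores(topo, 4, spread=True))
```

  Run the same four-process job once with each placement and compare.
  If the spread run is clearly faster, bus contention is a good
  suspect.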

> I was getting better than linear speedup results for one of our 
> programs.  Is this possible?  Here are some results:

  Sure.  Different cache usage, perhaps leading to less memory thrashing
  as the problem is spread out.  The OS is going to require some amount
  of overhead, but you only pay that once, regardless of the number of
  cores you're using (this is assuming the unused cores are turned off,
  not simply left unutilized by the program).  This is especially true
  of something like networking, which can put a huge number of
  interrupts on a system, but usually on only one processor.  This
  leads to massive context-switching costs when using only one core.
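
  The cache effect can be sketched with another toy model.  It
  assumes each core has its own private cache, that the working set
  splits evenly across cores, and that data which fits in cache costs
  less per MB to process than data that spills to main memory.  All
  sizes and costs here are invented, and the private-cache assumption
  does not hold for chips where cores share a cache.

```python
def modeled_time(cores, working_set_mb, cache_mb_per_core,
                 cached_cost, spilled_cost):
    """Toy per-core run time for an evenly split working set."""
    per_core = working_set_mb / cores
    in_cache = min(per_core, cache_mb_per_core)   # part that fits in cache
    spilled = per_core - in_cache                 # part that goes to memory
    return in_cache * cached_cost + spilled * spilled_cost


def modeled_speedup(cores, working_set_mb, cache_mb_per_core,
                    cached_cost, spilled_cost):
    """Speedup relative to the same toy model on a single core."""
    args = (working_set_mb, cache_mb_per_core, cached_cost, spilled_cost)
    return modeled_time(1, *args) / modeled_time(cores, *args)


if __name__ == "__main__":
    # Hypothetical: 16 MB working set, 4 MB cache per core,
    # spilled data costs twice as much per MB as cached data.
    for n in (1, 2, 4):
        print(n, modeled_speedup(n, 16.0, 4.0, 1.0, 2.0))
```

  In this model the whole working set fits in cache once it is split
  four ways, so the four-core speedup comes out well above 4.  That is
  the basic shape of a better-than-linear result.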

  There are a million different reasons we could come up with, but to be
  honest, it's all a lot of guessing without a pretty in-depth analysis.

  Additionally, wall-clock speed is not a great way to do performance
  tests.  In the end, it is usually what matters, but it isn't going to
  answer very many questions of this nature.
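
  As a small first step beyond wall-clock time, you can record CPU
  time next to it.  This sketch uses Python's time.perf_counter() for
  wall-clock time and time.process_time() for CPU time (user plus
  system) of the current process; measure() is a hypothetical helper
  name.  It is nowhere near a real profiler, but a large gap between
  the two numbers tells you the code spent its time waiting rather
  than computing.

```python
import time


def measure(fn, *args, **kwargs):
    """Run fn and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()    # wall-clock timer
    cpu_start = time.process_time()     # CPU time of this process only
    result = fn(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu


if __name__ == "__main__":
    # Sleeping burns wall-clock time but almost no CPU time.
    _, wall, cpu = measure(time.sleep, 0.2)
    print("sleep: wall=%.3f s  cpu=%.3f s" % (wall, cpu))

    # A busy loop burns both.
    _, wall, cpu = measure(sum, range(2000000))
    print("sum:   wall=%.3f s  cpu=%.3f s" % (wall, cpu))
```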

   -j

-- 
Jay A. Kreibich < J A Y  @  K R E I B I.C H >

"'People who live in bamboo houses should not throw pandas.' Jesus said that."
   - "The Ninja", www.AskANinja.com, "Special Delivery 10: Pop!Tech 2006"

