From: Brunner, Robert Kraemer (
Date: Fri May 09 2014 - 12:03:53 CDT

On May 9, 2014, at 11:45 AM, Kenno Vanommeslaeghe <> wrote:

> I'm not convinced this is true. The shared FPU on an AMD Bulldozer module is 256 bits wide, and a single thread can only saturate it through relatively intensive use of AVX instructions. Under more realistic workloads, it acts as two 128-bit FPUs. Last time we benchmarked, we could actually make NAMD run substantially faster by using all the logical cores, though the speedup was significantly lower than the one we saw when comparing the same number of cores on a machine with twice as many modules (frequency scaling might also play a role there). The same could not be said of our Intel benchmarks, where the speedup from using all the virtual cores was nearly negligible. In fairness, it should be noted that Intel *also* has these wide FPUs (and wider in more recent iterations) that are shared between threads, so we ascribed the difference to more aggressive frequency scaling on Intel's part.

Our experience with NAMD on Blue Waters (which uses Bulldozer processors) is that using all the logical cores is usually slightly faster than running only one thread per FP unit, but the difference is not huge. Problem size is undoubtedly a factor; at some point communication starts to dominate and the difference in FP performance doesn't matter.
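
For anyone who wants to repeat the comparison, here is a minimal Python sketch of the two thread placements and the resulting speedup ratio. It is an illustration only: the function names are hypothetical, and it assumes that logical cores 2*m and 2*m + 1 share the FP unit of module m, which should be checked against the actual topology of the machine.

```python
def cores_for_run(num_modules, one_per_module):
    """Return the logical core IDs to bind threads to.

    ASSUMPTION (not from any vendor documentation): logical cores
    2*m and 2*m + 1 share the FP unit of module m. Verify the real
    core numbering on the target system before using this.
    """
    if num_modules < 1:
        raise ValueError("num_modules must be at least 1")
    if one_per_module:
        # One thread per shared FP unit: take the first core of each module.
        return [2 * m for m in range(num_modules)]
    # All logical cores: both cores of every module.
    return list(range(2 * num_modules))


def speedup(time_one_per_module, time_all_cores):
    """Ratio of wall-clock times; a value above 1 means using all
    logical cores was faster than one thread per FP unit."""
    if time_all_cores <= 0:
        raise ValueError("time_all_cores must be positive")
    return time_one_per_module / time_all_cores
```

On a hypothetical 16-module node, cores_for_run(16, True) yields 16 core IDs and cores_for_run(16, False) yields 32. Timing the same NAMD input under both placements and passing the two wall-clock times to speedup() gives the ratio in question; repeating this at several node counts would show where communication starts to dominate.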


Robert Brunner
Blue Waters Science and Engineering Applications Support
National Center for Supercomputing Applications
4006F NCSA Building, MC-257
1205 W Clark St
Urbana, IL 61801
