From: Robert Brunner (rbrunner_at_illinois.edu)
Date: Wed Nov 26 2008 - 15:21:58 CST
There are two issues here: the sequential performance and the parallel
scaling. Compiler issues could certainly lead to poor performance
relative to your AMD cluster, but shouldn't affect the performance on
multiple processors relative to single-core speed. That's more a
consequence of how fast inter-processor communication is, and whether
there is enough work per processor to amortize the communication cost.
It looks like your +p1 time on the Mac would be roughly 124 min/ps,
and around 112 min/ps on your AMD cluster. I'm just scaling your 8-
core results on AMD and your 2-core results on the Mac, so those are
probably pretty far off, but they indicate roughly-equivalent
sequential performance. I'd try the 1-core test on each system to
check for differences there.
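
As a rough sketch of the arithmetic behind those estimates: the 14 min/ps figure for 8 AMD cores comes from the quoted message, while the 62 min/ps figure for +p2 on the Mac is only inferred here from the 124 min/ps estimate. The function names are made up for illustration, and perfect linear scaling is assumed, which is exactly why the numbers are only rough.

```python
def ideal_single_core_time(time_min_per_ps, cores):
    """Estimate the +p1 time by assuming perfect (linear) scaling."""
    return time_min_per_ps * cores


def parallel_efficiency(t1, tn, cores):
    """Fraction of the ideal speedup achieved: (t1 / tn) / cores."""
    return (t1 / tn) / cores


# 14 min/ps on 8 AMD cores is taken from the thread.
amd_p1 = ideal_single_core_time(14.0, 8)
# 62 min/ps on 2 Mac cores is an inferred value (124 / 2), not a reported one.
mac_p1 = ideal_single_core_time(62.0, 2)

print(f"AMD estimated +p1: {amd_p1:.0f} min/ps")
print(f"Mac estimated +p1: {mac_p1:.0f} min/ps")
```

Once a real +p1 measurement is available, parallel_efficiency(t1, tn, cores) gives a quick check of how far each multi-core run falls short of ideal; a value of 1.0 means perfect scaling.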
The bigger problem is the poor scaling. You should be able to scale
pretty well to 8 cores, if you have enough atoms per processor. How
many atoms are in your system? If you have more than 100 or so atoms
per core, you should be okay.
Check for system resources being used by other processes. On your 8-core
run, how much memory is being used by each NAMD process? How much
is free? Are there any other background processes using the CPU a lot?
The command-line utility top can tell you this, or look at the
Activity Monitor application.
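
If you would rather script the check than watch top, something like the following might do. It is a minimal sketch that assumes a ps command accepting "-A -o pid,pcpu,rss,comm" (true of the stock ps on Mac OS X and Linux, as far as I know) and that RSS is reported in kilobytes. The function names are made up for illustration, and it only covers per-process CPU and memory, not free memory.

```python
import subprocess


def parse_ps_output(text):
    """Parse 'ps -o pid,pcpu,rss,comm' output into a list of dicts."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header line
        parts = line.split(None, 3)  # the command may contain spaces
        if len(parts) < 4:
            continue
        pid, pcpu, rss, comm = parts
        rows.append({
            "pid": int(pid),
            "cpu_percent": float(pcpu),
            "rss_mb": int(rss) / 1024.0,  # assumes ps reports RSS in KB
            "command": comm,
        })
    return rows


def busiest_processes(limit=10):
    """Return the processes using the most CPU, highest first."""
    result = subprocess.run(
        ["ps", "-A", "-o", "pid,pcpu,rss,comm"],
        capture_output=True, text=True, check=True,
    )
    rows = parse_ps_output(result.stdout)
    return sorted(rows, key=lambda r: r["cpu_percent"], reverse=True)[:limit]
```

Calling busiest_processes() during the 8-core run should show the NAMD processes near the top; anything else with a large CPU share is competing with them.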
You can't control the number of patches that finely, since it's a
geometric factor derived from your cutoff and margin. You can change
the margin and perhaps increase or decrease the dimension of the patch
grid, but that will change the number of patches by more than 1 (e.g.,
to 6x5x5, maybe). That shouldn't matter though, because the work being
parallelized has more to do with the number of short-range
interactions than with the number of patches.
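
To illustrate why the patch count moves in jumps, here is a deliberately simplified model, not NAMD's actual decomposition code. It assumes each patch must be at least (cutoff + margin) wide along every axis; the box dimensions, cutoff, and margin values are hypothetical.

```python
import math


def patch_grid(cell_dims, cutoff, margin):
    """Patches per axis under a simplified rule: every patch must be
    at least (cutoff + margin) wide, with at least one patch per axis."""
    min_size = cutoff + margin
    return tuple(max(1, math.floor(d / min_size)) for d in cell_dims)


def patch_count(grid):
    """Total number of patches in an (nx, ny, nz) grid."""
    nx, ny, nz = grid
    return nx * ny * nz


cell = (80.0, 75.0, 70.0)  # hypothetical box dimensions in Angstroms
for margin in (0.0, 1.0):
    grid = patch_grid(cell, cutoff=12.0, margin=margin)
    print(margin, grid, patch_count(grid))
```

In this toy case, raising the margin from 0 to 1 drops the grid from 6x6x5 to 6x5x5, so the count falls by a whole layer of 30 patches rather than by one.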
Just some ideas.
On Nov 25, 2008, at 3:28 PM, Christopher Hartshorn wrote:
> I have not received any replies as of yet. So, thank you for the
> response. As far as benchmarks, I have done the following:
> All 8 cores of the cluster nodes=14min/ps
> On 8 core mac with:
> The fastest is the +p4 option which I am not sure what that means
> since I thought that the option designated the number of core/
> processors to be utilized. I see the obvious trend from the +p2
> option where +p4 is 2x faster, but I would expect anything >+p4 to
> be faster (maybe not double, but definitely faster). Also, the AMD
> cluster is still so much faster that it must be that they (the
> Mac Intel binaries) are compiled for totally different systems, but
> I would be surprised if the performance gain in compiling for 64-bit vs
> 32-bit would be any more than 15% (not quite the 2X difference that
> there is between the two). Finally, can I compile my own using the
> latest XCode 3.x.x and the latest Intel Fortran compiler for Mac
> 11.x.x plus the source of the latest Charm++ build and the latest
> NAMD build? I really am hoping to optimize the two MacPros I have
> because even a savings of 5min/ps will save me weeks of time on this
> Thank you,