Re: benchmarking on Cray XT4

From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Tue Mar 16 2010 - 10:35:56 CDT

On Tue, 2010-03-16 at 14:32 +0000, Hannes Loeffler wrote:

> You've guessed right, Philip, it's Hector. I'm quite disappointed with
> the results because it makes running jobs there quite expensive
> obviously. I would be very much interested in your data too.
>
> Axel, it is namd 2.6. The program has been compiled by the Hector
> people if I am not mistaken.

then you should inquire whether it was compiled with g++ or pgCC,
and then there is the question of tuning memory accesses and
related settings.
my personal experience is limited to the cray xt3 in pittsburgh
(so far; i'm about to get access to the jaguar xt4/xt5 RSN),
and there were a lot of MPI tweaks that helped performance.
in particular, -small_pages is a big win for all c and c++
codes that allocate lots of small memory areas.

http://www.ks.uiuc.edu/Research/namd/wiki/index.cgi?NamdOnCrayXT3
it might be useful to check the documentation of your machine
to see how much of that page still applies.
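for illustration only, a minimal sketch of what such a launch looked
like on the catamount-era xt3 (yod was the launcher there; the job
size, binary path, and input file names below are placeholders, not
taken from this thread; on CNL systems the launcher is aprun and the
equivalent option may differ, so check your site docs):

```shell
#!/bin/sh
# hypothetical PBS job sketch for a Cray XT3 running Catamount.
# 64 is a placeholder job size; namd2 and benchmark.namd are placeholders.
#PBS -l size=64
cd $PBS_O_WORKDIR
# yod launches the parallel job; -small_pages disables large-page
# allocation, which helped C/C++ codes making many small allocations.
yod -small_pages -sz 64 ./namd2 benchmark.namd > benchmark.log
```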

cheers,
   axel.

> Here are my benchmarking results.
>
> # system: protein/membrane solvated with 465399 atoms total
> # no. steps: 10,000
> # machine: hector
> # program/force field: namd2.6/CHARMM
> #
> # no. cores vs. CPUTime
> #cores  npepn=1   npepn=2   npepn=4
>     8  15659.57  15923.37  16544.06
>    16   8061.80   8205.97   8570.84
>    32   4122.70   4181.28   4405.94
>    64   2007.51   2043.67   2189.68
>   128   1096.88   1142.68   1202.05
>   256    617.09    674.21    789.80
>   512    370.91    380.18    432.75
>  1024    247.69    258.52    270.41
>  2048    220.12    227.54    292.43
>
>
> Thanks to all who answered,
> Hannes.
>
>
> On Tue, 16 Mar 2010 12:59:49 +0000
> Philip Peartree <philpac_at_gmail.com> wrote:
>
> > Hi Hannes
> >
> > I found a similar situation on the XT4. My understanding is that the
> > SeaStar interconnect is shared across the cores of a processor, so
> > 2048 tasks on 2048 processors are faster than 2048 tasks on 512
> > processors. In my experience, asking for 2048 PEs with 4 tasks per
> > node will put only 512 procs in use, which is slower; but in most
> > job accounting schemes users are billed per proc, so it is
> > beneficial to use as few procs as possible.
> >
> > Could I enquire which system you are working on; is it Hector? I
> > can supply some data on this if you like.
> >
> > Philip Peartree
> > University of Manchester
> >
> > Sent from my iPhone
> >
> > On 16 Mar 2010, at 10:50, Hannes Loeffler <Hannes.Loeffler_at_stfc.ac.uk>
> > wrote:
> >
> > > Hi,
> > >
> > > I am currently running some benchmarks on a Cray XT4 with quad-core
> > > processors. Users can choose how many tasks to run per processor.
> > > What I find is that a single task/processor outperforms a two
> > > task/processor run which itself is faster than a four task/processor
> > > run. I see this behaviour for processor counts from 8 to 2048.
> > > Now, I do understand that there may be performance hits when certain
> > > resources are shared, but I would still have expected a different
> > > outcome. Can anyone comment on my findings? Is that the
> > > performance that I have to expect from namd on this architecture?
> > >
> > > Cheers,
> > > Hannes.
> > >
>
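for what it's worth, the single-task column in the table above already
quantifies the scaling loss. a quick sketch (the timings are copied
from the npepn=1 column of hannes' table; the choice of 8 cores as the
baseline is mine, since that is the smallest run reported):

```python
# speedup and parallel efficiency from the npepn=1 CPUTime column,
# relative to the 8-core run (the smallest job in the table).
times = {8: 15659.57, 64: 2007.51, 512: 370.91, 2048: 220.12}
base_cores = 8
base_time = times[base_cores]

for cores in sorted(times):
    speedup = base_time / times[cores]          # how much faster than 8 cores
    efficiency = speedup / (cores / base_cores) # fraction of ideal scaling
    print(f"{cores:5d} cores: speedup {speedup:6.2f}x, efficiency {efficiency:5.1%}")
```

by this measure the efficiency is still near ideal at 64 cores but
drops below 30% at 2048, consistent with the flattening visible in
the raw timings.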

-- 
Dr. Axel Kohlmeyer  akohlmey_at_gmail.com
http://sites.google.com/site/akohlmey/
Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:53:54 CST