Re: AMD Multicore + CUDA Benchmarks, are they ok?

From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Thu May 05 2011 - 08:47:56 CDT

2011/5/5 Nicholas M Glykos <glykos_at_mbg.duth.gr>:
>
> Hi Dave,
>
>> 12 CPU core + 4 GPU : 0.34 days/ns
>
> For the same test (apoa1), we get 0.29 days/ns from an i7 (four cores)
> plus a single GTX295 card², so I would suspect that there is room for
> improvement with your hardware. FYI, we perform the cuda run with

hmmm... there are some more factors to consider:

GPU performance in classical MD with "simple" potentials
is mostly bound by memory bandwidth and not so much
by GPU clock speed (or the number of cores).

- the intel i7 has about 50% more memory bandwidth than
  one "die" of the amd. here is the back-of-the-envelope estimate:
  the 12-core amd chips are essentially two 6-core chips with two
  memory channels each, while the single intel chip has three
  (see the arithmetic sketch after this list)

- the tesla cards are wired so that two cards share one PCIe
  bridge. that limits the communication bandwidth per GPU, but
  the same is true for the GTX 295

- in (high-end) geforce cards, the memory is often clocked at a
  higher speed than in the "professional" tesla or quadro cards.

- if the tesla has ECC memory enabled (one of the main
  advantages for large installations and thus one of the reasons
  supercomputing centers strongly prefer to buy those), that
  may have an additional impact on GPU-internal memory bandwidth
  and thus performance. the first cuda sketch after this list
  shows how to check this.

- the S-type teslas can come with two different types of PCIe
  bridge adapters, either 8x or 16x PCIe, and similarly you need
  to check whether the slots they are placed in are actually
  both 16x PCIe and have full concurrent bandwidth. the second
  cuda sketch after this list measures this.

- memory/processor affinity can have an impact, too. with GPUs,
  however, this is particularly difficult, since you first have to find
  out to which processor's memory controller the southbridge is
  attached, which can be complicated on a 4-way machine.
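
to put numbers on the bandwidth point from the first item, here is a
minimal back-of-the-envelope calculation. it assumes DDR3-1333 on both
platforms, which is an assumption; the actual memory clocks depend on
the specific machines:

    /* peak bandwidth = channels x 8 bytes/transfer x transfers/s */
    #include <stdio.h>
    int main(void)
    {
        const double mts = 1333e6;  /* DDR3-1333: 1333 MT/s */
        printf("amd die : %.1f GB/s\n", 2 * 8 * mts / 1e9); /* ~21.3 GB/s */
        printf("intel i7: %.1f GB/s\n", 3 * 8 * mts / 1e9); /* ~32.0 GB/s, i.e. ~50% more */
        return 0;
    }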
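
the ECC state mentioned above can be checked programmatically; this is
a small sketch using the CUDA runtime (nvidia-smi -q reports the same
information):

    #include <stdio.h>
    #include <cuda_runtime.h>
    int main(void)
    {
        int n = 0;
        cudaGetDeviceCount(&n);
        for (int i = 0; i < n; ++i) {
            cudaDeviceProp p;
            cudaGetDeviceProperties(&p, i);
            printf("GPU %d: %s, ECC %s, PCI bus/dev %02x:%02x\n",
                   i, p.name, p.ECCEnabled ? "on" : "off",
                   p.pciBusID, p.pciDeviceID);
        }
        return 0;
    }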
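
and to verify what a slot actually delivers, time pinned host-to-device
copies. this is a minimal sketch (the bandwidthTest example shipped with
the NVIDIA SDK does the same more thoroughly); on a full 16x PCIe-v2
slot you should see on the order of 5-6 GB/s in practice, and roughly
half that on an 8x slot:

    #include <stdio.h>
    #include <cuda_runtime.h>
    int main(void)
    {
        const size_t bytes = 64 << 20;  /* 64 MB test buffer */
        const int reps = 10;
        float *h, *d, ms;
        cudaMallocHost((void **)&h, bytes); /* pinned memory, needed for full PCIe rate */
        cudaMalloc((void **)&d, bytes);
        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);
        cudaEventRecord(t0, 0);
        for (int i = 0; i < reps; ++i)
            cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(t1, 0);
        cudaEventSynchronize(t1);
        cudaEventElapsedTime(&ms, t0, t1);  /* elapsed time in milliseconds */
        printf("host->device: %.2f GB/s\n",
               (double)reps * bytes / (ms * 1e-3) / 1e9);
        cudaFree(d);
        cudaFreeHost(h);
        return 0;
    }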

that being said, i believe that at the current level of pricing, the
cost/performance ratio for running classical MD on a 4-way opteron
8-core or 12-core machine is highly competitive with that of a 2-way
machine with 4 tesla C2050s in full 16x-PCIe-v2 slots. after having
tested the 12-core cpus extensively (we currently have six of those
monsters), i have started to like them a lot, but even when running
CPU code they seem to be a little memory-bandwidth starved, so going
for 4-way 8-core at, say, a 2.5GHz clock is probably currently the
most cost-efficient way to provide compute capacity(!) for classical
MD and particularly NAMD. mind you, those manycore nodes don't scale
as well over QDR infiniband as 2-way 6-core intel nodes, so if you are
after running as fast as possible, no matter what, then the latter
seems to be the better choice. the nice thing about the 4-way
8/12-core machines is that you need less memory overall, which
significantly cuts down costs.

hope that helps.

cheers,
    axel.

> something like         /usr/local/namd_cuda/charmrun ++local \
> /usr/local/namd_cuda/namd2 +p4 +idlepoll +noAnytimeMigration \
> +setcpuaffinity +LBSameCpus something.conf
>
> Nicholas
>
>
> ² http://norma.mbg.duth.gr/index.php?id=about:benchmarks:namdv27cuda
>
>
> --
>
>
>          Dr Nicholas M. Glykos, Department of Molecular Biology
>     and Genetics, Democritus University of Thrace, University Campus,
>  Dragana, 68100 Alexandroupolis, Greece, Tel/Fax (office) +302551030620,
>    Ext.77620, Tel (lab) +302551030615, http://utopia.duth.gr/~glykos/
>

-- 
Dr. Axel Kohlmeyer
akohlmey_at_gmail.com  http://goo.gl/1wk0
Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.
