Re: NAMD2.9 single-node benchmarks, 0-2 Kepler GPU's.

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Thu May 02 2013 - 01:13:08 CDT

Hi,

> -----Original Message-----
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> Behalf Of Aron Broom
> Sent: Wednesday, May 1, 2013 07:00
> To: Aaron Cann
> Cc: namd-l_at_ks.uiuc.edu
> Subject: Re: namd-l: NAMD2.9 single-node benchmarks, 0-2 Kepler GPU's.
>
> Thanks for sharing the benchmarks!
>
> As general remarks/answers to your questions:
>
> 1) NAMD is currently limited considerably by GPU bandwidth, so you
> might
> see some big improvements if you get the PCI 3.0 working. This is not
> the
> case for AMBER.

I think so too. While smaller systems require low-latency PCIe communication,
the bigger ones need the bandwidth. Both should be better with the 3.0
version.
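
A quick way to check what link the cards actually negotiate is lspci, for
example (you may need root to see the capability section, and the exact
wording can differ between lspci versions):

lspci -vv -d 10de: | grep -i 'lnksta:'

This prints the current speed (2.5/5/8 GT/s for PCIe 1.0/2.0/3.0) and the
width (e.g. x16) for the NVIDIA devices.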
 
>
> 2) I wouldn't do any hyper-threading, I've never seen it help and often
> seen it hurt, which is not shocking given the intensity of the task.
> Certainly if you want better performance more CPUs will help,
> particularly
> if you get the PCI working.

Agreed. If you want to keep HT enabled, perhaps to leave some spare resources
to the OS while NAMD is waiting, you can do so. But in that case you should
bind your NAMD processes to distinct physical cores instead of letting the OS
do it; I have often seen a nice gain from that. To find out which logical
processor corresponds to which physical core, you can use "cat
/proc/cpuinfo". Entries with the same physical id belong to the same socket
(in your case all should show 0). Then look at the core id: with HT there
will always be two logical processors sharing the same core id, and those two
are the same physical core. Why do you need to know that? Because the OS can
number the cores differently:

Processor: 0 1 2 3 4 5 6 7
CoreID : 0 0 1 1 2 2 3 3

or

Processor: 0 1 2 3 4 5 6 7
CoreID : 0 1 2 3 0 1 2 3
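
If you don't want to read through the whole cpuinfo output, a one-liner like
this (just an example) prints only the relevant fields:

grep -E 'processor|physical id|core id' /proc/cpuinfo

The processor lines are the logical CPU numbers the OS uses; matching them up
by core id tells you which two of them share one physical core.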

To restrict NAMD to non-shared physical cores, you can use the Linux tool
"taskset"; an example would look like:

charmrun +p4 taskset -c 0,2,4,6 namd2 +idlepoll apoa1.in >> bench.out

Keep in mind that a poor automatic distribution of the processes by the OS
can influence a benchmark heavily. This is also important if you have
multiple CPU sockets and want to benchmark the scaling per socket and across
sockets.
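
Depending on your Charm++/NAMD build, the binding can also be done through
the Charm++ runtime instead of taskset; a sketch (the exact affinity options
can differ between builds, so check the notes for your version) would be:

charmrun +p4 namd2 +setcpuaffinity +pemap 0,2,4,6 +idlepoll apoa1.in >> bench.out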

>
> 3) NAMD can run using AMBER inputs, there is a section in the manual
> about
> how to get equivalent behavior if you really want to compare.
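
Just as a rough sketch of what this looks like (the file names here are made
up, and the User's Guide lists the exact keywords and recommended values),
the relevant part of a NAMD config reading AMBER files is roughly:

amber           on
parmfile        dhfr.prmtop
ambercoor       dhfr.inpcrd
readexclusions  yes
exclude         scaled1-4
1-4scaling      0.833333
scnb            2.0
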
>
> 4) For implicit solvent or small explicit systems ( < 10k atoms ) AMBER
> will dominate NAMD for performance. By the time your systems are >
> 100k
> the difference will be less extreme.
>
> 5) The GPU enhanced AMBER is missing a lot of the AMBER functionality.
> You
> should make sure it can actually do what you want as it could be years
> in
> coming, NAMD does not have this problem.
>
> 6) I think currently for all the MD codes out there if you have
> multiple
> GPUs the best use is to run multiple simulations.

This only holds if the MD code doesn't use the CPU intensively for
computation at all, as with ACEMD or AMBER. For all codes that use the
offloading method, you can expect losses due to the shared PCIe bus.
Additionally, if you haven't got one CPU socket per job in your machine,
expect a loss from shared memory access, too. This is often a reason why
people are disappointed with the scaling of a processor, but QM codes suffer
much more.
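
If you do run independent simulations on the two GPUs, give each job its own
GPU and its own set of cores. With a multicore CUDA build that could look
roughly like this (device and core numbers are only placeholders for this
box):

taskset -c 0,1 namd2 +p2 +idlepoll +devices 0 sim1.in > sim1.log &
taskset -c 2,3 namd2 +p2 +idlepoll +devices 1 sim2.in > sim2.log &

With a charmrun-based build the same idea works by setting
CUDA_VISIBLE_DEVICES per job.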

>
> 7) If you want performance and function, and have even casual
> programming
> abilities, check out OpenMM.
>
>
> On Tue, Apr 30, 2013 at 10:17 PM, Aaron Cann <aaron_at_canncentral.net>
> wrote:
>
> > Hello all, I thought I'd write some of my experiences setting up a basic
> > NAMD 2.9 GPU workstation. Lots of benchmarks, and some conclusions and a
> > few questions for the illuminati.
> >
> > SETUP: System is an Intel LGA 2011 system with two 4 GB Nvidia GTX 670s
> > in the x16 slots. They're running at PCIe 2.0 thanks to the strange
> > snafus with Sandy Bridge-E CPUs at 3.0 speeds. CPU is a 4-core
> > hyperthreaded i7, 3.6 GHz. Running Ubuntu 13.04, NAMD 2.9, either with or
> > without CUDA, 64 bit, latest NVIDIA drivers. Displays were hanging off
> > the GPUs, not doing anything during the runs. Switching to console mode
> > didn't change anything. Deliberately loading the GPU with a large VMD
> > rotation slowed runs down.
> >
> > Note that I cite thread numbers: up to 8 threads on 4 cores. More than 4
> > threads = "fake" (hyperthreaded) extra CPUs.
> >
> > Standard NAMD benchmarks except outputEnergies=600. DHFR was adapted from
> > the AMBER benchmark by Charles Brooks, 2 fs timestep.
> >
> > STMV benchmark.
> >
> > Ns/day. T = # threads (may be 2x # of cores).
> >
> > T    1 GPU   2 GPU
> > 1    0.099   0.100
> > 2    0.151   0.193
> > 3    0.175   0.220
> > 4    0.182   0.282
> > 8    0.186   0.282
> >
> > Thoughts--
> >
> > STMV is a large dataset. Two threads get 94% of the horsepower out of 1
> > GPU, and moving from 2T/1G -> 4T/2G gives pretty good scaling with this
> > dataset (94% of doubled output). This dataset looks largely GPU bound,
> > although a six-core CPU would still have been slightly better. Adding a
> > 3rd GPU here (on the existing four-core CPU) would be an inefficient use
> > of the 3rd GPU.
> >
> >
> > APOA1
> >
> > Ns/day
> >
> > T    0 GPU   1 GPU   2 GPU
> > 1    0.12    1.10    1.10
> > 2    0.24    1.94    2.18
> > 3    0.33    2.21    2.79
> > 4    0.30    2.31    3.10
> > 8    0.31    2.28    3.70
> >
> > Moving from 1 thread on 1 GPU to 2T/2G again has excellent scaling, 99%
> > of doubled output, although most of the increase was from the second
> > core, not the GPU. Getting to 96% of peak output of 1 GPU required 3
> > threads, not two. Moving from 2 threads/1 GPU to 4/2 gave only an 80%
> > speedup, suggesting communications was becoming an issue instead of GPU
> > horsepower.
