Re: questions regarding new single node box for NAMD: re components, CUDA, Fermi

From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Mon Oct 05 2009 - 02:01:47 CDT

On Sun, 2009-10-04 at 21:37 -0400, Biff Forbush wrote:
> Thanks Richard and Axel for your helpful comments. I am proceeding with
> (2x) GTX-295 and (2x) W5590's.
>
> Richard Owczarzy wrote:
> > I would watch for power supplies and power requirements. The latest Xeon
> > W5590 is using more power than Xeon X5570 and some motherboards and
> > systems cannot handle it.
> >
> You made me think a lot about this -- overall power is a lot for one
> box, and with 4x gpus may be too much even for a 1.5kW supply. That is
> sobering from the point of view of cooling. This prompted me to
> consider watercooling -- this is usually the province of the
> overclocking fringe (which clearly has more time for plumbing than I do)
> but can easily reduce gpu temps by 30°C and CPU a lot as well
> (presumably increasing reliability), all with less noise but a lot of
> hassle -- I'll stick with air unless there is a problem. As to the
> motherboard, all I can do is hope that Tyan does what it claims.
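
The power concern above can be checked with back-of-the-envelope arithmetic. The sketch below is only illustrative: the per-component wattages, the "rest of system" figure, and the 80% load margin are assumptions, not measurements, and should be replaced with datasheet or measured values.

```python
# Nominal per-component power draw in watts. These are ASSUMED figures for
# illustration only; check vendor datasheets or measure at the wall.
ASSUMED_WATTS = {
    "GTX 295 card": 289,    # assumed board power per dual-GPU card
    "Xeon W5590": 130,      # assumed TDP per CPU
    "rest of system": 150,  # hypothetical: mainboard, RAM, disks, fans
}

def power_budget(counts, psu_watts, max_load_fraction=0.8):
    """Return (total_watts, allowed_watts, fits) for a component count map.

    max_load_fraction is a hypothetical safety margin on the supply rating.
    """
    total = sum(ASSUMED_WATTS[name] * n for name, n in counts.items())
    allowed = psu_watts * max_load_fraction
    return total, allowed, total <= allowed

two_cards = {"GTX 295 card": 2, "Xeon W5590": 2, "rest of system": 1}
four_cards = {"GTX 295 card": 4, "Xeon W5590": 2, "rest of system": 1}

for label, counts in (("2x GTX 295", two_cards), ("4x GTX 295", four_cards)):
    total, allowed, fits = power_budget(counts, psu_watts=1500)
    verdict = "ok" if fits else "too much"
    print(f"{label}: {total} W of {allowed:.0f} W allowed -> {verdict}")
```

Under these assumed numbers two cards leave headroom on a 1.5 kW supply while four cards do not; different wattage assumptions will shift that boundary.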

there is a lot of "hope" in your statements. i have learned
to be more paranoid over time and don't go with theoretical
possibilities.

FWIW, at GTC there was a vendor demoing a machine with 8x C1060
cards (having 3 power supplies, IIRC).

>
> Axel Kohlmeyer wrote:
> > but keep in mind that extreme hardware always carries the risk of
> > being extremely sensitive to crashing and overloading components.
> >
> I had hoped that if clocked at spec, and cooled well, the top-end
> processors were not really extreme. I am about to see if Intel delivers
> on this.
> > the key components for good GPU performance are the i/o bandwidth
> > of the mainboard and memory bandwidth of the CPU. at the high end,
> > cpu performance is always limited by the memory bandwidth, so the
> > higher you go with the clock rate, the less you get out of it unless
> > you are treating problems small enough to fit entirely into the
> > CPU cache. so rather than squeezing the last bit of performance
> > out of clock rate, you may also consider how the memory subsystem
> > can be optimized.
> >
> Thanks for pointing this out, I wondered about this. This would
> argue for the 2.66 GHz part, the lowest cpu speed to have the 1333MHz
> memory clock. As I understand it, the memory controller is on the CPU
> in the Nehalem, so there's not much else that can be done other than
> populating all three channels per CPU.
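
Why populating all three channels matters can be seen from the theoretical peak bandwidth per socket. This is a rough sketch: it assumes 64-bit (8-byte) DDR3 channels and ignores all real-world losses, so sustained bandwidth will be lower. The 1066 rate is included only as an assumed slower setting for comparison.

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth in GB/s (10^9 bytes/s) for one socket.

    Assumes each channel moves bytes_per_transfer bytes per transfer
    (8 bytes for a 64-bit DDR3 channel).
    """
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9

for rate in (1066, 1333):      # 1066 is an assumed slower rate for comparison
    for channels in (1, 2, 3):
        bw = peak_bandwidth_gb_s(channels, rate)
        print(f"{channels} channel(s) at {rate} MT/s: {bw:.1f} GB/s peak")
```
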
> >> Dual Xeon W5590 (3.3 GHz, calculating that the incremental system
> >> cost/Hz is actually nearly a constant, so you linearly get what you pay
> >> for).
> >>
> >
> > this is a very optimistic assumption. i would expect more like an
> > exponential increase of the price over performance at the high end.
> >
> Actually, this is retail cost, not an assumption. Today's prices for
> Nehalem Xeon on Newegg are:
> GHz:  1.86   2.0  2.13  2.26   2.4  2.56  2.66   2.8  2.93   3.2  3.33
> $$$:   200   240   270   385   540   780   979  1204  1419  1659  1669
> delta$/deltaGHz, incremental jump:
>      (107)   286   231   885  1107  1563  2100  1786  1462   959   -69
> system cost, $/cpuGHz (assuming dual cpu, and $3600 non-CPU costs)
>       2258  2140  2038  2022  2033  2102  2180  2250  2280  2249  2156
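
The two derived rows above can be recomputed from the quoted clock rates and prices. The sketch below uses the stated formulas as understood here (price step over clock step, and total system price over CPU clock); the function names are made up for illustration. The recomputed values need not match the quoted rows digit for digit, since the post may have used slightly different inputs or rounding.

```python
# Clock rates (GHz) and retail prices (USD) exactly as quoted in the post.
ghz = [1.86, 2.0, 2.13, 2.26, 2.4, 2.56, 2.66, 2.8, 2.93, 3.2, 3.33]
usd = [200, 240, 270, 385, 540, 780, 979, 1204, 1419, 1659, 1669]

NON_CPU_USD = 3600  # non-CPU system cost as stated in the post
CPUS = 2            # dual-CPU system as stated in the post

def incremental_cost(ghz, usd):
    """Price step divided by clock step between neighbouring parts ($/GHz)."""
    return [(usd[i] - usd[i - 1]) / (ghz[i] - ghz[i - 1])
            for i in range(1, len(ghz))]

def system_cost_per_ghz(ghz, usd, non_cpu=NON_CPU_USD, cpus=CPUS):
    """Whole-system price divided by CPU clock rate ($/GHz)."""
    return [(non_cpu + cpus * price) / clock for clock, price in zip(ghz, usd)]

steps = incremental_cost(ghz, usd)
system = system_cost_per_ghz(ghz, usd)

for i, (clock, cost) in enumerate(zip(ghz, system)):
    step = f"{steps[i - 1]:8.0f}" if i > 0 else "       -"
    print(f"{clock:5.2f} GHz  step {step} $/GHz  system {cost:6.0f} $/GHz")
```

Note that this measures cost per GHz only; it says nothing about delivered performance per dollar.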

please note that i didn't refer to cost/ghz but to the cost/performance ratio.
the higher the clock rate, the less you get out of it due to the
severe imbalance between the cpu and the main memory performance.
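
The point about clock rate versus memory performance can be illustrated with a simple roofline-style estimate: attainable throughput is the smaller of the compute peak and what the memory system can feed. This is a generic textbook model, not something taken from this thread, and every number below (flops per cycle, bandwidth, flops per byte) is a hypothetical placeholder.

```python
def attainable_gflops(clock_ghz, flops_per_cycle, bandwidth_gb_s, flops_per_byte):
    """Roofline-style estimate: min(compute peak, memory-fed rate) in GFLOP/s."""
    compute_peak = clock_ghz * flops_per_cycle
    memory_fed = bandwidth_gb_s * flops_per_byte
    return min(compute_peak, memory_fed)

FLOPS_PER_CYCLE = 16   # hypothetical per-socket figure
BANDWIDTH_GB_S = 32    # hypothetical per-socket memory bandwidth

# flops_per_byte: work done per byte fetched from main memory (hypothetical).
for label, intensity in (("memory-bound", 0.5), ("cache-friendly", 8.0)):
    slow = attainable_gflops(2.66, FLOPS_PER_CYCLE, BANDWIDTH_GB_S, intensity)
    fast = attainable_gflops(3.33, FLOPS_PER_CYCLE, BANDWIDTH_GB_S, intensity)
    print(f"{label}: 2.66 GHz -> {slow:.1f}, 3.33 GHz -> {fast:.1f} GFLOP/s")
```

In this model the faster clock only pays off for the cache-friendly case; for the memory-bound case both clock rates land on the same ceiling.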

> The least cost-effective increases are actually in the midrange, not
> considering the memory speed and cache jumps at 2.26 and 2.66.
> > i don't know any details about specific mainboards, but
> > if you want a compute machine, you should make sure that you
> > have an additional graphics chip connected somewhere that
> > you can use for (textmode) output and that this does not
> > bring down the performance on any of the other PCIe busses.
> > ideally, it would be something from a different vendor that
> > is routed independently.
> >
> There is the vanilla VGA, I assume I can use that if it turns out to
> be practical to run NAMD all-out on two CPUs and all the gpus.
> >> I assume that with this week's announcement of Fermi,...
> > i would not hold my breath.
> good advice! I am still interested to know if there is a likelihood
> that NAMD will progress with gpus, and whether Fermi makes that more likely.

the GPU acceleration of NAMD will of course improve over time,
partly through improvements in the drivers and partly through
improvements in the code itself. the features that the fermi
architecture offers will help with that. how much it helps
depends on how low the overhead of launching GPU kernels can be
made and on how large a system is being run. just as parallel
performance at the high end is limited more by communication
than by CPU performance, for smaller systems GPU runs will be
limited by the cost of launching GPU kernels rather than by raw
GPU performance. for example, instead of your example config
(2x GTX 295) you may get faster execution with 4x GTX 260 due to
not having to share PCIe busses (and faster still with 4x GTX 275
or 4x GTX 285).
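
The launch-overhead argument can be sketched with a toy timing model: the compute work per step shrinks as GPUs are added, but a fixed per-step launch cost does not. This is a deliberately crude illustration, not a model of NAMD itself; the work sizes, launch count, and overhead per launch are all hypothetical.

```python
def step_time_us(work_us, n_gpus, launches_per_step, launch_overhead_us):
    """Time per step: work split evenly across GPUs plus fixed launch cost."""
    return work_us / n_gpus + launches_per_step * launch_overhead_us

def speedup(work_us, n_gpus, launches_per_step=5, launch_overhead_us=10.0):
    """Speedup of n_gpus over one GPU under the toy model above."""
    one = step_time_us(work_us, 1, launches_per_step, launch_overhead_us)
    many = step_time_us(work_us, n_gpus, launches_per_step, launch_overhead_us)
    return one / many

# Hypothetical per-step GPU work in microseconds for a small and a large system.
for label, work_us in (("small system", 500.0), ("large system", 100000.0)):
    print(f"{label}: speedup on 4 GPUs = {speedup(work_us, 4):.2f}x")
```

With these placeholder numbers the large system scales almost ideally while the small one falls well short, which is the qualitative behaviour described above.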

axel.

> > it is not quite obvious to me how a machine like the one you describe
> > will hold up to the massive memory and bus bandwidth demands. with
> > the next-gen hardware those demands can only go up. how much performance
> > you will see and whether 4 GTX-295, 4 GTX-285, or 4 C1060 are the better
> > solution depends a lot on what you are going to do with the machine.
> > NAMD for example is able to schedule GPU kernels from multiple CPU tasks
> > into the same GPU, and since not all compute kernels are ported to CUDA,
> > they will stay idle for some time. with the GTX-295 you will be able
> > to fit more GPUs into one case, but at the same time those GPUs will
> > be slower (a GTX-295 is effectively two GTX-260 glued together and
> > sharing a PCIe slot through a bridge) than the fastest single GPU
> > in a GTX-285.
> >
> > ultimately, there is nothing but running realistic benchmarks that
> > can tell you whether you will get what you are looking for or not.
> >
> >
> I'll try the two GTX-295s, either on one CPU or one on each CPU, see
> how that goes.
>
> Thanks again,
> Biff
>

-- 
Dr. Axel Kohlmeyer  akohlmey_at_gmail.com 
Institute for Computational Molecular Science
College of Science and Technology
Temple University, Philadelphia PA, USA.
