Re: questions regarding new single node box for NAMD: re components, CUDA, Fermi

From: Biff Forbush (biff.forbush_at_yale.edu)
Date: Sun Oct 04 2009 - 20:37:43 CDT

Thanks Richard and Axel for your helpful comments. I am proceeding with
(2x) GTX-295 and (2x) W5590's.

Richard Owczarzy wrote:
> I would watch for power supplies and power requirements. The latest Xeon
> W5590 is using more power than Xeon X5570 and some motherboards and
> systems cannot handle it.
>
    You made me think a lot about this: overall power is a lot for one
box, and with 4x GPUs it may be too much even for a 1.5 kW supply. That
is also sobering from the point of view of cooling. This prompted me to
consider water cooling. That is usually the province of the
overclocking fringe (which clearly has more time for plumbing than I
do), but it can easily reduce GPU temperatures by 30 °C and CPU
temperatures a lot as well (presumably increasing reliability), all with
less noise but a lot of hassle. I'll stick with air unless there is a
problem. As for the motherboard, all I can do is hope that Tyan does
what it claims.
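The power-supply concern can be made concrete with a back-of-envelope budget. This is a minimal sketch only: the per-component wattages (130 W per W5590, 289 W per GTX-295 card, 150 W for everything else) and the 80% loading margin are assumptions, not figures from this thread, and they are nominal maxima rather than measured draw.

```python
# Rough power budget for a dual-CPU, multi-GPU workstation.
# All wattages are ASSUMED nominal figures, not measurements.
CPU_TDP_W = 130    # assumed TDP of one Xeon W5590
GPU_CARD_W = 289   # assumed maximum board power of one GTX-295 card
OTHER_W = 150      # hypothetical allowance: board, DIMMs, disks, fans


def system_load_w(cpus: int, gpu_cards: int) -> int:
    """Estimated worst-case DC load in watts."""
    return cpus * CPU_TDP_W + gpu_cards * GPU_CARD_W + OTHER_W


def fits(psu_w: int, load_w: int, max_fraction: float = 0.8) -> bool:
    """True if the load stays at or below max_fraction of the PSU rating.

    The 80% default is a hypothetical rule-of-thumb margin.
    """
    return load_w <= psu_w * max_fraction


for cards in (2, 4):
    load = system_load_w(cpus=2, gpu_cards=cards)
    print(f"2 CPUs + {cards} GTX-295 cards: ~{load} W, "
          f"within 80% of 1.5 kW: {fits(1500, load)}")
```

With these assumed numbers, two cards (four GPUs) come to roughly 1 kW, while four cards would exceed not only the 80% margin but the full 1.5 kW rating.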

Axel Kohlmeyer wrote:
> but keep in mind that extreme hardware always carries the risk of
> being extremely sensitive to crashing and overloading components.
>
    I had hoped that, if clocked at spec and cooled well, the top-end
processors would not really be extreme. I am about to see whether Intel
delivers on this.
> the key components for good GPU performance are the i/o bandwidth
> of the mainboard and memory bandwidth of the CPU. at the high end,
> cpu performance is always limited by the memory bandwidth, so the
> higher you go with the clock rate, the less you get out of it unless
> you are treating problems small enough to fit entirely into the
> CPU cache. so rather than squeezing the last bit of performance
> out of clock rate, you may also consider how the memory subsystem
> can be optimized.
>
    Thanks for pointing this out; I had wondered about this. This would
argue for the 2.66 GHz part, the lowest CPU speed to have the 1333 MHz
memory clock. As I understand it, the memory controller is on the CPU
in Nehalem, so there is not much else that can be done other than
populating all three memory channels per CPU.
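Why the 1333 MHz memory clock and full channel population matter can be shown with the usual peak-bandwidth arithmetic: transfer rate × 8 bytes per 64-bit DDR3 channel × number of channels. This sketch assumes DDR3-1066 as the slower comparison speed and ignores real-world efficiency, so the results are upper bounds, not sustained figures.

```python
# Peak theoretical DRAM bandwidth per Nehalem socket.
# Assumes 64-bit (8-byte) DDR3 channels; sustained bandwidth is lower.
BYTES_PER_TRANSFER = 8


def peak_bandwidth_gb_s(mega_transfers_per_s: float, channels: int) -> float:
    """Peak bandwidth of one socket in GB/s (10**9 bytes per second)."""
    return mega_transfers_per_s * 1e6 * BYTES_PER_TRANSFER * channels / 1e9


for speed in (1066, 1333):
    for channels in (2, 3):
        bw = peak_bandwidth_gb_s(speed, channels)
        print(f"DDR3-{speed}, {channels} channels: {bw:.1f} GB/s per socket")
```

Three channels of DDR3-1333 give about 32 GB/s per socket versus about 21 GB/s with only two populated, so leaving a channel empty gives up roughly a third of the peak.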
>> Dual Xeon W5590 (3.3 GHz, calculating that the incremental system
>> cost/Hz is actually nearly a constant, so you linearly get what you pay
>> for).
>>
>
> this is a very optimistic assumption. i would expect more like an
> exponential increase of the price over performance at the high end.
>
    Actually, this is retail cost, not an assumption. Today's prices for
Nehalem Xeons on Newegg are:
GHz:                      1.86   2.0  2.13  2.26   2.4  2.56  2.66   2.8  2.93   3.2  3.33
Price ($):                 200   240   270   385   540   780   979  1204  1419  1659  1669
Incr. delta$/deltaGHz:   (107)   286   231   885  1107  1563  2100  1786  1462   959   -69
System $/cpuGHz (*):      2258  2140  2038  2022  2033  2102  2180  2250  2280  2249  2156
(*) assuming dual CPUs and $3600 of non-CPU costs

    The least cost-effective increases are actually in the midrange,
leaving aside the memory-speed and cache jumps at 2.26 and 2.66 GHz.
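The incremental column can be recomputed from the listed prices, which makes it easy to redo the comparison with current prices. A minimal sketch: it reproduces the first four posted increments (286, 231, 885, 1107), but several later posted figures differ from what the listed prices give, so treat it as a way to check the method rather than a reproduction of the table.

```python
# Incremental cost per GHz between adjacent Nehalem Xeon clock steps,
# recomputed from the Newegg price list quoted above.
ghz = [1.86, 2.0, 2.13, 2.26, 2.4, 2.56, 2.66, 2.8, 2.93, 3.2, 3.33]
price = [200, 240, 270, 385, 540, 780, 979, 1204, 1419, 1659, 1669]


def incremental_cost_per_ghz(ghz, price):
    """Return (delta $) / (delta GHz) for each step up from the previous part."""
    return [
        (price[i] - price[i - 1]) / (ghz[i] - ghz[i - 1])
        for i in range(1, len(ghz))
    ]


for clock, step in zip(ghz[1:], incremental_cost_per_ghz(ghz, price)):
    print(f"step up to {clock:.2f} GHz: ${step:,.0f} per extra GHz")
```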
> i don't know any details about specific mainboards, but
> if you want a compute machine, you should make sure that you
> have an additional graphics chip connected somewhere that
> you can use for (textmode) output and that this does not
> bring down the performance on any of the other PCIe busses.
> ideally, it would be something from a different vendor that
> is routed independently.
>
    There is the vanilla VGA; I assume I can use that if it turns out to
be practical to run NAMD all-out on both CPUs and all the GPUs.
>> I assume that with this week's announcement of Fermi,...
> i would not hold my breath.
    Good advice! I am still interested to know whether NAMD's GPU support
is likely to keep progressing, and whether Fermi makes that more likely.

> it is not quite obvious to me how a machine like the one you describe
> will hold up to the massive memory and bus bandwidth demands. with
> the next-gen hardware those demands can only go up. how much performance
> you will see and whether 4 GTX-295, 4 GTX-285, or 4 C1060 are the better
> solution depends a lot on what you are going to do with the machine.
> NAMD for example is able to schedule GPU kernels from multiple CPU tasks
> into the same GPU, and since not all compute kernels are ported to CUDA,
> the GPUs will stay idle for some of the time. with the GTX-295 you will be able
> to fit more GPUs into one case, but at the same time those GPUs will
> be slower (a GTX-295 is effectively two GTX-260s glued together and
> sharing a PCIe slot through a bridge) than the fastest single GPU
> in a GTX-285.
>
> ultimately, there is nothing but running realistic benchmarks that
> can tell you whether you will get what you are looking for or not.
>
>
    I'll try the two GTX-295s, either both on one CPU or one on each CPU,
and see how that goes.
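Axel's point about the GTX-295's two GPUs sharing one slot can be put in numbers. A minimal sketch, assuming each card sits in a PCIe 2.0 x16 slot (5 GT/s per lane, 8b/10b encoding) and that both GPUs on a GTX-295 transfer at the same time (the worst case); it ignores protocol overhead beyond line encoding, so real throughput is lower.

```python
# Peak per-direction PCIe bandwidth, assuming PCIe 2.0 signalling:
# 5 GT/s per lane with 8b/10b encoding (8 data bits per 10 line bits).
GT_PER_S_PER_LANE = 5.0
ENCODING_EFFICIENCY = 8 / 10


def pcie2_gb_s(lanes: int) -> float:
    """Peak one-direction bandwidth in GB/s for a PCIe 2.0 link."""
    return GT_PER_S_PER_LANE * ENCODING_EFFICIENCY * lanes / 8  # bits -> bytes


def per_gpu_gb_s(lanes: int, gpus_sharing: int) -> float:
    """Share of one link per GPU if all GPUs behind it transfer at once."""
    return pcie2_gb_s(lanes) / gpus_sharing


print(f"GTX-285 (1 GPU on x16): {per_gpu_gb_s(16, 1):.1f} GB/s")
print(f"GTX-295 (2 GPUs share x16): {per_gpu_gb_s(16, 2):.1f} GB/s each")
```

Under these assumptions each GPU on a GTX-295 sees at most half the slot bandwidth of a single-GPU card when both are busy, which is one more reason the two placements are worth benchmarking.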

    Thanks again,
Biff

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:53:20 CST