Re: CUDA-NAMD hangs -- check the Northbridge temp!

From: Biff Forbush (biff.forbush_at_yale.edu)
Date: Wed Jan 06 2010 - 18:12:08 CST

Hi Paul,
   You are right about the CPU/GPU ratio (to a first approximation, and
the same applies to CPU speed); stats to follow soon. The idea was to
have the fastest possible CPU system in any case, on the off chance that
all of the beta aspects of CUDA and NAMD-CUDA made use of the GPUs
impractical or limited.
   Regards,
Biff

Paul Rigor (uci-ics) wrote:
> By the way, your system seems overkill! Remember that for optimal GPU
> performance, you'll need one NAMD process per GPU device.
> Otherwise, multiple CPU processes will be contending for GPU time
> and, depending on the system you are simulating, the NAMD-CUDA processes
> might be spending most of their time polling for when a GPU device is
> free. Fermi-based GPUs will have a better way of 'sharing' the device.
>
> But I'm definitely looking forward to some future stats!!
>
> On Wed, Jan 6, 2010 at 7:36 AM, Pu Tian <tianpu_at_mail.nih.gov> wrote:
>
> Hi Biff,
>
> Thanks for sharing. That's very helpful information for anyone
> (including me) who is considering using NAMD/GPU.
>
> Best,
>
> Pu
>
>
> On Jan 5, 2010, at 10:51 PM, Biff Forbush wrote:
>
> Hi Namd & VMD gpu users,
>
> In getting a Nehalem-gtx295 system up-and-running I have experienced
> frequent (you could say regular) freezes in NAMD when multiple CPUs and
> GPUs are in use. In reviewing recent discussions, I see I am not the
> first with apparent "GPU overheating problems". But in this case, both
> CPU and GPU core temps were generally in the upper 50s and low 60s (C)
> -- very warm, but shouldn't be too much for these chips. [NVIDIA has
> the GPU fans running at 40-50% -- turning them up to 100% with nvclock
> lowered the GPU temps 3-5 degrees but did not prevent the hangups].
> After swearing for a while at the usual (software) suspects, I stuck my
> hand in the case to check the two X58 (Tylersburg Northbridge)
> heatsinks...
>
> ... almost burned myself -- the heatsinks were 88°C under NAMD load
> and 78°C at "idle" (no X, no NAMD) as checked with an IR thermometer.
> Sure enough, directing a cool air gun at the heatsinks dropped the
> heatsink temps to under 50°C (without significantly affecting CPU or GPU
> temps) and COMPLETELY solved the NAMD freezeup problem.
>
> Moral of the story: Check the Northbridge temp, not just the CPU
> and GPU. Apparently this particular board is terribly underdesigned in
> this regard, but I suspect the problem is more general. [This board has
> low-profile X58 heatsinks (aka egg cookers), no fans, and no room for
> much more, since the X58s are underneath two of the double PCIEx16
> slots... it should be possible to mount a small fan to blow horizontally
> over these, else liquid cooling is needed].
>
> Board: Tyan S7025, dual Xeon Nehalem (3.33GHz, Scythe coolers), dual
> X58s, two GeForce GTX 295 (BFG), one Master Heat Gun (heater off).
> Benchmarks to follow soon.
>
> [It remains a mystery to me why the X58s are running so hot at "idle"].
>
> Regards,
> Biff
>
>
>
>
>
> --
> Paul Rigor
> Pre-doctoral BIT Fellow and Graduate Student
> Institute for Genomics and Bioinformatics
> Donald Bren School of Information and Computer Sciences
> University of California, Irvine
> http://www.ics.uci.edu/~prigor
