Re: CUDA-NAMD hangs -- check the Northbridge temp!

From: Paul Rigor (uci-ics) (prigor_at_ics.uci.edu)
Date: Wed Jan 06 2010 - 17:12:54 CST

By the way, your system seems like overkill! Remember that for optimal GPU
performance you'll want one NAMD process per GPU device. If you run more
than that, multiple CPU processes will contend for GPU time and, depending
on the system you are simulating, the NAMD-CUDA processes may spend most of
their time polling for a GPU device to become free. Fermi-based GPUs will
have a better way of 'sharing' the device.
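
To make the "one process per GPU" point concrete, here is a rough sketch of
how the launch line could be built. It assumes the two GTX 295 cards show up
as four CUDA devices numbered 0-3 (each GTX 295 carries two GPUs), and that
your build accepts the +devices and +idlepoll options described in the NAMD
CUDA release notes; the helper name and the config file name are made up for
illustration, so check the flags against your own NAMD version.

```python
def namd_cuda_command(gpu_ids, config, namd2="namd2", charmrun="charmrun"):
    """Build a charmrun command line that starts one NAMD process per GPU.

    gpu_ids : list of CUDA device ids (assumed numbering, e.g. [0, 1, 2, 3])
    config  : path to the NAMD configuration file (hypothetical name below)
    """
    if not gpu_ids:
        raise ValueError("need at least one GPU device id")
    devices = ",".join(str(g) for g in gpu_ids)
    return [
        charmrun, "++local",
        f"+p{len(gpu_ids)}",      # process count equals GPU count
        namd2,
        "+idlepoll",              # poll the GPU instead of sleeping while idle
        "+devices", devices,      # bind the processes to these devices
        config,
    ]


cmd = namd_cuda_command([0, 1, 2, 3], "sim.namd")
print(" ".join(cmd))
```

Keeping the process count tied to the length of the device list means the
extra CPU cores simply stay idle rather than queueing up for a GPU.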

But I'm definitely looking forward to some future stats!!

On Wed, Jan 6, 2010 at 7:36 AM, Pu Tian <tianpu_at_mail.nih.gov> wrote:

> Hi Biff,
>
> Thanks for sharing. That's very helpful information for anyone (including
> me) who is considering using NAMD/GPU.
>
> Best,
>
> Pu
>
>
> On Jan 5, 2010, at 10:51 PM, Biff Forbush wrote:
>
>> Hi NAMD & VMD GPU users,
>>
>> In getting a Nehalem-gtx295 system up-and-running I have experienced
>> frequent (you could say regular) freezes in NAMD when multiple CPUs and
>> GPUs are in use. In reviewing recent discussions, I see I am not the
>> first with apparent "GPU overheating problems". But in this case, both
>> CPU and GPU core temps were generally in the upper 50's and low 60's (C)
>> -- very warm, but shouldn't be too much for these chips. [NVIDIA has the
>> GPU fans running at 40-50% -- turning them up to 100% with nvclock
>> lowered the GPU temps 3-5 degrees but did not prevent the hangups].
>> After swearing for a while at the usual (software) suspects, I stuck my
>> hand in the case to check the two X58 (Tylersburg Northbridge)
>> heatsinks...
>>
>> ... almost burned myself -- the heatsinks were 88°C under NAMD load
>> and 78°C at "idle" (no X, no NAMD) as checked with an IR thermometer.
>> Sure enough, directing a cool air gun at the heatsinks dropped the
>> heatsink temps to under 50°C (without significantly affecting CPU or GPU
>> temps) and COMPLETELY solved the NAMD freezeup problem.
>>
>> Moral of the story: Check the Northbridge temp, not just the CPU
>> and GPU. Apparently this particular board is terribly underdesigned in
>> this regard, but I suspect the problem is more general. [This board has
>> low-profile X58 heatsinks (aka egg cookers), no fans, and no room for
>> much more, since the X58s are underneath two of the double PCIEx16
>> slots... it should be possible to mount a small fan to blow horizontally
>> over these, else liquid is needed].
>>
>> Board Tyan S7025, dual Xeon Nehalem (3.33GHz, Scythe coolers), dual
>> X58's, two Geforce gtx295 (BFG), one Master Heat Gun (heater off).
>> Benchmarks to follow soon.
>>
>> [It remains a mystery to me why the X58s are running so hot at "idle"].
>>
>> Regards,
>> Biff
>>
>>
>
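
Biff's moral (check the Northbridge, not just the CPU and GPU) could be
turned into a small pre-run check along these lines. This is only a sketch:
the readings are typed in by hand (Biff used an IR thermometer, and the
chipset may not expose a sensor at all), the component names are arbitrary,
and the limits are placeholder values I picked for illustration, not vendor
specifications.

```python
def overheated(readings_c, limits_c):
    """Return {component: temperature} for every reading above its limit.

    Raises ValueError if a component that has a limit was not measured, so a
    forgotten chipset reading is reported instead of silently passing.
    """
    missing = sorted(set(limits_c) - set(readings_c))
    if missing:
        raise ValueError("no reading for: " + ", ".join(missing))
    return {
        name: temp
        for name, temp in readings_c.items()
        if name in limits_c and temp > limits_c[name]
    }


# Placeholder limits in degrees C (assumptions, not from any datasheet).
limits = {"cpu": 80.0, "gpu": 90.0, "northbridge": 70.0}

# Roughly the situation Biff reported under NAMD load.
readings = {"cpu": 60.0, "gpu": 62.0, "northbridge": 88.0}

print(overheated(readings, limits))
```

With those numbers only the Northbridge is flagged, which is the case that
CPU and GPU monitoring alone would have missed.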

-- 
Paul Rigor
Pre-doctoral BIT Fellow and Graduate Student
Institute for Genomics and Bioinformatics
Donald Bren School of Information and Computer Sciences
University of California, Irvine
http://www.ics.uci.edu/~prigor
