CUDA-NAMD hangs -- check the Northbridge temp!

From: Biff Forbush (biff.forbush_at_yale.edu)
Date: Tue Jan 05 2010 - 21:51:51 CST

Hi NAMD & VMD GPU users,

    In getting a Nehalem/GTX 295 system up and running I have experienced
frequent (you could say regular) freezes in NAMD when multiple CPUs and
GPUs are in use. In reviewing recent discussions, I see I am not the
first with apparent "GPU overheating problems". But in this case, both
CPU and GPU core temps were generally in the upper 50s and low 60s (°C)
-- very warm, but that shouldn't be too much for these chips. [NVIDIA has
the GPU fans running at 40-50% -- turning them up to 100% with nvclock
lowered the GPU temps by 3-5 degrees but did not prevent the hangs.]
After swearing for a while at the usual (software) suspects, I stuck my
hand in the case to check the two X58 (Tylersburg Northbridge) heatsinks...

     ... almost burned myself -- the heatsinks were 88 °C under NAMD load
and 78 °C at "idle" (no X, no NAMD) as checked with an IR thermometer.
Sure enough, directing a cool air gun at the heatsinks dropped the
heatsink temps to under 50 °C (without significantly affecting CPU or GPU
temps) and COMPLETELY solved the NAMD freezeup problem.

    Moral of the story: Check the Northbridge temp, not just the CPU
and GPU. Apparently this particular board is terribly underdesigned in
this regard, but I suspect the problem is more general. [This board has
low-profile X58 heatsinks (aka egg cookers), no fans, and no room for
much more, since the X58s are underneath two of the double PCIe x16
slots... it should be possible to mount a small fan to blow horizontally
over these; otherwise liquid cooling is needed.]
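
    For a software-side check to go with the IR thermometer, here is a
minimal Python sketch that lists every temperature the Linux kernel
exposes under /sys/class/hwmon and flags the hot ones. Assumptions: a
Linux box with the relevant sensor drivers loaded; the function names
and the 75 °C limit are only illustrative (picked to sit between the
"under 50" and "78-88" heatsink readings above). Many boards do not
expose a chipset sensor at all, so an empty or cool-looking list does
not prove the Northbridge is fine.

```python
import glob
import os


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {"hwmonN:chip/label": degrees C} for each temp*_input under root."""
    temps = {}
    pattern = os.path.join(root, "hwmon*", "temp*_input")
    for path in sorted(glob.glob(pattern)):
        chip_dir = os.path.dirname(path)
        # Chip name, e.g. "coretemp"; fall back to the directory name.
        try:
            with open(os.path.join(chip_dir, "name")) as f:
                chip = f.read().strip()
        except OSError:
            chip = os.path.basename(chip_dir)
        # Optional human-readable label; fall back to the file name.
        label_path = path[: -len("_input")] + "_label"
        try:
            with open(label_path) as f:
                label = f.read().strip()
        except OSError:
            label = os.path.basename(path)
        # The hwmon sysfs interface reports millidegrees Celsius.
        try:
            with open(path) as f:
                millideg = int(f.read().strip())
        except (OSError, ValueError):
            continue  # sensor unreadable or not an integer; skip it
        key = f"{os.path.basename(chip_dir)}:{chip}/{label}"
        temps[key] = millideg / 1000.0
    return temps


def too_hot(temps, limit_c=75.0):
    """Return the subset of readings at or above limit_c (assumed threshold)."""
    return {name: deg for name, deg in temps.items() if deg >= limit_c}


if __name__ == "__main__":
    readings = read_hwmon_temps()
    hot = too_hot(readings)
    for name, deg in sorted(readings.items()):
        flag = "  <-- HOT" if name in hot else ""
        print(f"{name}: {deg:.1f} C{flag}")
```

    Run it once at idle and once under NAMD load and compare. If nothing
that looks like the chipset shows up, an IR thermometer on the heatsink
is still the way to go.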

    Board: Tyan S7025, dual Xeon Nehalem (3.33GHz, Scythe coolers), dual
X58s, two GeForce GTX 295 (BFG), one Master Heat Gun (heater off).
Benchmarks to follow soon.

    [It remains a mystery to me why the X58s are running so hot at "idle"].

Regards,
Biff
