Selecting GPU for namd2

From: Francesco Pietra (chiendarret_at_gmail.com)
Date: Mon May 06 2013 - 02:27:57 CDT

Hello:
I am running namd2 (CUDA build) on a board with two GTX 680 cards.

nvidia-smi
driver v. 304.48
0 GTX 680 Bus-Id 0000:02:00.0 mem-usage 4% 89MB/2047MB
1 GTX 680 Bus-Id 0000:03:00.0 mem-usage 5% 93MB/2047MB

nvidia-smi -L
GPU0 UID 600f64d0-2996-8e71-dca8-8d66f139f772
GPU1 UID 704bb625-95a7-8779-cfdc-14a90e6581fc

At some stage, the simulation crashed with "unspecified launch failure" on
GPU0 / pe2. The input files were the same as for previous successful
simulations. The simulation was launched with the command

charmrun $NAMD_HOME/bin/namd2 filename.conf +idlepoll +p6 2>&1 | tee filename.log

I removed the cards and read their labels:

The one on PCIEX16-1 reads, on the left (on a superimposed label):
C416383447
N1996
and, on the right (on a superimposed label):
N680GTX-PM202GD6
S/N-602-V282-015B1204050 555
EAN 4 711072 257415
UPC-A 8 16909 09568 5
HDMI

The other one, on PCIEX16-2 reads, on the left (directly on the board, no
superimposed label):
S/N 0421312026933
GTX 680
and, on the right (on a superimposed label):
N680GTX-PM2D2GD5
S/N 912-V801-1233B1204006883 (I confirm, S/N is given twice with different
numbers)
EAN 4 719072 260156
UPC-A 8 16909 09645 3

Clearly, none of these numbers relates to the UIDs reported by
"nvidia-smi -L", which will be no surprise to hardware experts.

I then swapped the cards between the two PCIEX16 slots. The same launch
failure occurred at some stage of the simulation, now on GPU1 / pe4.

As far as I can tell, the GTX 680 with S/N-602-V282-015B1204050 555 is faulty.
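
The reasoning behind the swap test can be sketched as follows. This is only
an illustration, and it assumes (the nvidia-smi output above does not confirm
it) that GPU0 is the card in slot PCIEX16-1 and GPU1 the card in slot
PCIEX16-2. The helpers card_at and swap_test are hypothetical names, not part
of NAMD or nvidia-smi.

```shell
#!/bin/sh
# Sketch of the swap test: does the failure follow the card or the slot?
# Assumption: GPU index 0 = slot PCIEX16-1, GPU index 1 = slot PCIEX16-2.

# card_at INDEX CARD_IN_SLOT1 CARD_IN_SLOT2: print the card at a GPU index.
card_at() {
    case "$1" in
        0) printf '%s\n' "$2" ;;
        1) printf '%s\n' "$3" ;;
        *) return 1 ;;
    esac
}

# swap_test FAIL_BEFORE FAIL_AFTER CARD_A CARD_B
# CARD_A starts in slot 1 and CARD_B in slot 2; the cards are then exchanged.
swap_test() {
    before=$(card_at "$1" "$3" "$4") || return 1
    after=$(card_at "$2" "$4" "$3") || return 1
    if [ "$before" = "$after" ]; then
        printf 'failure follows card %s\n' "$before"
    else
        printf 'failure stays with the slot, not one card\n'
    fi
}

# Values from this post: failure on GPU0 before the swap, on GPU1 after it.
swap_test 0 1 "S/N-602-V282-015B1204050 555" "S/N 0421312026933"
```

If the failing index had stayed the same after the swap, the slot (or
something else on the board) would be the more likely suspect rather than
one particular card.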
xxxxxxxxxxxxxxxxxxxxxxxxxxx

Now, with the cards in the swapped arrangement, I have tried to launch
the same simulation on the good GTX 680 at GPU0 (S/N 0421312026933 and S/N
912-V801-1233B1204006883; I confirm, S/N is given twice with different
numbers) with the commands

# nvidia-smi -L
# nvidia-smi -pm

$ charmrun $NAMD_HOME/bin/namd2 filename.conf +idlepoll +p6 +devices 0 2>&1 | tee filename.log
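
A minimal sketch of how the launch line above could be assembled for a chosen
set of GPUs. It assumes that +devices takes a comma-separated list of CUDA
device indices (for example 0 or 0,1). build_namd_cmd is a hypothetical
helper, not part of NAMD, and it only prints the command instead of running it.

```shell
#!/bin/sh
# build_namd_cmd NAMD_BINARY CONF NPROCS DEVICES
# Prints a charmrun launch line in the same argument order used in this post.
# DEVICES is assumed to be a comma-separated list of CUDA device indices.
build_namd_cmd() {
    namd_bin=$1   # path to the namd2 binary
    conf=$2       # NAMD configuration file
    nprocs=$3     # value for +p
    devices=$4    # value for +devices
    printf 'charmrun %s %s +idlepoll +p%s +devices %s\n' \
        "$namd_bin" "$conf" "$nprocs" "$devices"
}

# Single quotes keep $NAMD_HOME unexpanded, so the printed line can be
# pasted into a shell where that variable is set.
build_namd_cmd '$NAMD_HOME/bin/namd2' filename.conf 6 0
```

Passing 0 restricts the run to the first device, as in the command above;
passing 0,1 would name both cards.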

The simulation is running.

"nvidia-smi" reports
GPU 0 6% mem (117MB/2047MB)
GPU 1 0% mem (7MB/2047MB)

while "nvidia-smi -q TEMPERATURE" reports
GPU 0000:02:00.0
70 C

GPU 0000:03:00.0
27 C

I wonder whether the commands nvidia-smi -L and nvidia-smi -pm are needed
at all. Also, it seems to me that the card with S/N-602-V282-015B1204050 555
has to be replaced.

Thanks for any advice.

francesco pietra
