question about '+devices'

From: ukulililixl (ukulililixl_at_gmail.com)
Date: Wed Oct 08 2014 - 01:06:50 CDT

Hello everybody!
I'm trying to run NAMD_2.9_CUDA on my cluster, and I am confused by a
serious error.
Our cluster has 4 nodes; each node has 2 Intel Xeon CPUs and 2 NVIDIA Tesla
K40m GPUs. Here are the details of my NAMD build:

version 2.9
compiled by gcc
ibverbs_smp
CUDA 6.5

I want to run NAMD on GPU 1 only, using '+devices 1', so I run:

charmrun ++p10 ++ppn 10 ++local namd2 +idlepoll +devices 1 <path_to_apoa1>

to simulate apoa1 on a single node with a single GPU. However, the
simulation aborts after about 1000 steps, and the system shows the
following messages:

NVRM: Xid (PCI:0000:81:00): 13, Graphics SM Warp Exception on (PC 1, TPC
2): Out Of Range Address
NVRM: Xid (PCI:0000:81:00): 13, Graphics Exception: ESR 0x50d648=0x26000e
0x50d50=0x0 0x50d644=0x13eff2 0x50d64c=0x7f
NVRM: Xid (PCI:0000:81:00): 13, Graphics Exception: ChID 0008, Class
0000a1c0, offset 00001b0c, Data 00000000
NVRM: Xid (PCI:0000:81:00): 13, Graphics Exception: Graphics SM Warp
Exception on (GPC 1, TPC 1):Illegal Instruction Encoding
pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Fatal),
type=Transaction Layer, id=0010(Receiver ID)
pcieport 0000:00:02.0: device [8086:2f04] error
status/mask=00000020/00000000
pcieport 0000:00:02.0: [ 5] Unknown Error Bit (First)
NVRM: Xid (PCI:0000:81:00): 13, Graphics SM Global Exception on (GPC 1,
TPC1): Physical Multiple Warp Errors
NVRM: Xid (PCI:0000:81:00): 13, Graphics Exception: ESR 0x50ce48=0x3f0009
0x50c50=0x4 0x50ce44=0x1fffe 0x50ce4c=0xf
NVRM: Xid (PCI:0000:81:00): 13, Graphics Exception: ChID 0000, Class
000000, offset 00000000, Data 00000000
BUG: soft lockup - CPU#36 stuck for 67s! [namd2:4932]

The error does not occur without the '+devices 1' parameter, so I suspect
that it is caused by '+devices 1'.
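
In case it helps with reading the messages quoted above, here is a minimal
Python sketch that groups the NVRM Xid lines by PCI address and Xid code.
The function name summarize_xids and the regular expression are my own
assumptions, based only on the message format shown in this post; they are
not part of NAMD or the NVIDIA driver.

```python
import re
from collections import Counter

# Assumed line format, taken only from the messages quoted in this post:
#   NVRM: Xid (PCI:0000:81:00): 13, ...
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")


def summarize_xids(log_text):
    """Count Xid events per (PCI address, Xid code) pair."""
    counts = Counter()
    for match in XID_RE.finditer(log_text):
        pci_address = match.group(1)
        xid_code = int(match.group(2))
        counts[(pci_address, xid_code)] += 1
    return counts


if __name__ == "__main__":
    # Two of the messages from this post, used as sample input.
    sample = """\
NVRM: Xid (PCI:0000:81:00): 13, Graphics SM Warp Exception on (PC 1, TPC
2): Out Of Range Address
NVRM: Xid (PCI:0000:81:00): 13, Graphics Exception: ChID 0008, Class
0000a1c0, offset 00001b0c, Data 00000000
"""
    for (pci_address, xid_code), n in summarize_xids(sample).items():
        print(f"PCI {pci_address}: Xid {xid_code} seen {n} time(s)")
```

Comparing the PCI address it reports with the bus IDs listed by nvidia-smi
should show which physical GPU the messages refer to.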

Could anyone help me?

Lee

This archive was generated by hypermail 2.1.6 : Thu Dec 31 2015 - 23:21:17 CST