Re: question about '+devices'

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Wed Oct 08 2014 - 02:13:08 CDT

Hi Lee,

 

From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On behalf of ukulililixl
Sent: Wednesday, 8 October 2014 08:07
To: namd-l_at_ks.uiuc.edu
Subject: namd-l: question about '+devices'

 

Hello everybody!

I'm trying to run NAMD_2.9_CUDA on my cluster, and I am confused by a serious error.

Our cluster has 4 nodes; each node has 2 Intel Xeon CPUs and 2 Nvidia Tesla K40m GPUs. Here are the details of my NAMD build:

 

version 2.9

compiled by gcc

ibverbs_smp

CUDA 6.5

 

I want to run NAMD with only GPU 1, with '+devices 1'. So I run:

 

charmrun ++p10 ++ppn 10 ++local namd2 +idlepoll +devices 1 <path_to_apoa1>

 

to simulate apoa1 on only one node and one GPU. However, the simulation aborts after about 1000 steps, and the system shows the following messages:

 

NVRM: Xid (PCI:0000:81:00): 13, Graphics SM Warp Exception on (PC 1, TPC 2): Out Of Range Address

NVRM: Xid (PCI:0000:81:00): 13, Graphics Exception: ESR 0x50d648=0x26000e 0x50d50=0x0 0x50d644=0x13eff2 0x50d64c=0x7f

NVRM: Xid (PCI:0000:81:00): 13, Graphics Exception: ChID 0008, Class 0000a1c0, offset 00001b0c, Data 00000000

NVRM: Xid (PCI:0000:81:00): 13, Graphics Exception: Graphics SM Warp Exception on (GPC 1, TPC 1):Illegal Instruction Encoding

pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0010(Receiver ID)

pcieport 0000:00:02.0: device [8086:2f04] error status/mask=00000020/00000000

pcieport 0000:00:02.0: [ 5] Unknown Error Bit (First)

NVRM: Xid (PCI:0000:81:00): 13, Graphics SM Global Exception on (GPC 1, TPC1): Physical Multiple Warp Errors

NVRM: Xid (PCI:0000:81:00): 13, Graphics Exception: ESR 0x50ce48=0x3f0009 0x50c50=0x4 0x50ce44=0x1fffe 0x50ce4c=0xf

NVRM: Xid (PCI:0000:81:00): 13, Graphics Exception: ChID 0000, Class 000000, offset 00000000, Data 00000000

BUG: soft lockup - CPU#36 stuck for 67s! [namd2:4932]

 

But the error doesn't occur without the parameter '+devices 1', so I think it may be caused by '+devices 1'.

 

Is this reproducible? To me it looks more like a transmission or memory error on the PCIe stack, which would point to a hardware problem. Check whether "nvidia-smi -q -g 1" reports recent ECC errors. You might also want to check whether mcelog is active: try "cat /var/log/mcelog" to look for hardware errors. Also, since you have K40m cards, which I guess are passively cooled, you need to make sure that the cooling of your case is sufficient; otherwise you will see exactly the kind of failure you are seeing now.
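
To judge whether the problem is reproducible and tied to one card or PCIe port, it can help to count the driver and bus errors per device across runs. Below is a minimal sketch in Python, not part of NAMD or the NVIDIA tools: the function name summarize_kernel_log and the SAMPLE text are made up for illustration, and the two regular expressions only assume the line layout seen in the messages quoted above. In practice the input text would come from the kernel log (for example the output of dmesg).

```python
import re
from collections import Counter

# Matches NVIDIA driver lines such as:
#   NVRM: Xid (PCI:0000:81:00): 13, Graphics Exception: ...
XID_RE = re.compile(r"NVRM: Xid \((?P<bus>PCI:[0-9a-fA-F:.]+)\): (?P<code>\d+),")

# Matches kernel PCIe port reports such as:
#   pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Fatal), ...
PCIE_RE = re.compile(
    r"pcieport (?P<port>\S+): PCIe Bus Error: severity=(?P<sev>[^,]+),"
)


def summarize_kernel_log(text):
    """Count Xid events per (bus, code) and PCIe bus errors per (port, severity).

    Hypothetical helper for illustration; it only understands the two
    line formats shown in the comments above.
    """
    xids = Counter()
    pcie = Counter()
    for line in text.splitlines():
        m = XID_RE.search(line)
        if m:
            xids[(m.group("bus"), int(m.group("code")))] += 1
            continue
        m = PCIE_RE.search(line)
        if m:
            pcie[(m.group("port"), m.group("sev").strip())] += 1
    return xids, pcie


# Three lines copied from the messages quoted in this thread.
SAMPLE = """\
NVRM: Xid (PCI:0000:81:00): 13, Graphics SM Warp Exception on (PC 1, TPC 2): Out Of Range Address
pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0010(Receiver ID)
NVRM: Xid (PCI:0000:81:00): 13, Graphics Exception: ChID 0000, Class 000000, offset 00000000, Data 00000000
"""

if __name__ == "__main__":
    xids, pcie = summarize_kernel_log(SAMPLE)
    for (bus, code), n in sorted(xids.items()):
        print(f"{bus} Xid {code}: {n} event(s)")
    for (port, sev), n in sorted(pcie.items()):
        print(f"{port} PCIe bus error, {sev}: {n} event(s)")
```

If the same bus ID and port keep showing up over several runs, with and without '+devices 1', that would fit the hardware explanation better than a problem with the option itself.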

 

Norman Geist

 

Could anyone help me?

 

Lee


This archive was generated by hypermail 2.1.6 : Thu Dec 31 2015 - 23:21:17 CST