From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Wed Oct 08 2014 - 02:13:08 CDT
From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf Of ukulililixl
Sent: Wednesday, October 8, 2014 08:07
Subject: namd-l: question about '+devices'
I'm trying to run NAMD_2.9_CUDA on my cluster, and I am confused by a serious error.
Our cluster has 4 nodes; each node has 2 Intel Xeon CPUs and 2 Nvidia Tesla K40m GPUs. Here are the details of my NAMD build:
compiled with gcc
I want to run NAMD with only GPU 1, with '+devices 1'. So I run:
charmrun ++p10 ++ppn 10 ++local namd2 +idlepoll +devices 1 <path_to_apoa1>
to simulate apoa1 on only one node and one GPU. However, the simulation aborts after about 1000 steps, and the system shows the following messages:
NVRM: Xid (PCI:0000:81:00): 13, Graphics SM Warp Exception on (PC 1, TPC 2): Out Of Range Address
NVRM: Xid (PCI:0000:81:00): 13, Graphics Exception: ESR 0x50d648=0x26000e 0x50d50=0x0 0x50d644=0x13eff2 0x50d64c=0x7f
NVRM: Xid (PCI:0000:81:00): 13, Graphics Exception: ChID 0008, Class 0000a1c0, offset 00001b0c, Data 00000000
NVRM: Xid (PCI:0000:81:00): 13, Graphics Exception: Graphics SM Warp Exception on (GPC 1, TPC 1):Illegal Instruction Encoding
pcieport 0000:00:02.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0010(Receiver ID)
pcieport 0000:00:02.0: device [8086:2f04] error status/mask=00000020/00000000
pcieport 0000:00:02.0: [ 5] Unknown Error Bit (First)
NVRM: Xid (PCI:0000:81:00): 13, Graphics SM Global Exception on (GPC 1, TPC1): Physical Multiple Warp Errors
NVRM: Xid (PCI:0000:81:00): 13, Graphics Exception: ESR 0x50ce48=0x3f0009 0x50c50=0x4 0x50ce44=0x1fffe 0x50ce4c=0xf
NVRM: Xid (PCI:0000:81:00): 13, Graphics Exception: ChID 0000, Class 000000, offset 00000000, Data 00000000
BUG: soft lockup - CPU#36 stuck for 67s! [namd2:4932]
But the error doesn't occur without the '+devices 1' parameter, so I think it may be caused by '+devices 1'.
Is this reproducible? To me it looks more like a transmission or memory error on the PCIe stack, which would indicate a hardware problem. Check whether "nvidia-smi -q -g 1" reports recent ECC errors. You might also want to check whether mcelog is active: try "cat /var/log/mcelog" to look for hardware errors. Also, since you have K40m cards, which I guess are passively cooled, you need to make sure that the cooling of your case is sufficient; otherwise you will see exactly the kind of failure you are describing.
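The checks suggested above could be collected into a small script along the following lines. This is only a sketch under some assumptions: the function name summarize_xid and the "check" argument are made up for illustration, the GPU index 1 is taken from the '+devices 1' in the question, and "-i" is assumed to be the GPU selector on the installed nvidia-smi (older releases used "-g", as quoted above). Reading dmesg and /var/log/mcelog may require root.

```shell
#!/bin/sh
# Hypothetical helper: count NVRM Xid events per PCI address and Xid code
# in kernel log text read from stdin.
# Prints one line per (address, code) pair, e.g. "0000:81:00 xid=13 count=2".
summarize_xid() {
    sed -n 's/.*NVRM: Xid (PCI:\([0-9a-fA-F:.]*\)): \([0-9]*\),.*/\1 \2/p' |
        sort | uniq -c |
        awk '{ printf "%s xid=%s count=%s\n", $2, $3, $1 }'
}

# Run the live hardware checks only when called with the argument "check".
if [ "${1:-}" = "check" ]; then
    # Xid events recorded by the NVIDIA kernel driver since boot
    dmesg | summarize_xid

    # ECC error counters and temperatures for GPU 1
    # (assumption: "-i" selects the GPU on this driver version)
    nvidia-smi -q -i 1 -d ECC,TEMPERATURE

    # Machine-check events, if mcelog is logging to this file
    if [ -r /var/log/mcelog ]; then
        cat /var/log/mcelog
    fi
fi
```

Saved under a name of your choice and run with the single argument "check" right after a failed run, this shows whether the Xid events keep hitting the same PCI address and whether the ECC counters or temperatures of that card stand out.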
Could anyone help me?
This archive was generated by hypermail 2.1.6 : Wed Dec 31 2014 - 23:22:55 CST