ibverbs crash

From: David A. Horita (dhorita_at_wfubmc.edu)
Date: Tue Nov 03 2009 - 14:17:17 CST

Hi,
I compiled the 2009/10/30 CVS version with ibverbs, following the recent note from Bjoern Olausson. The compile seems fine for charm 6.1.3 using intel 11.0.84, and the apoa1 benchmarks run fine. However, somewhere after 10,000 steps in an MD run, I get:

========

[10] Stack Traceback:
  [0] /wfurc4/horitaGrp/namd2-ibverbs [0x9ca769]
  [1] /wfurc4/horitaGrp/namd2-ibverbs [0x9cd26e]
  [2] /wfurc4/horitaGrp/namd2-ibverbs [0x9c707b]
  [3] /wfurc4/horitaGrp/namd2-ibverbs [0x9cd1d2]
  [4] /wfurc4/horitaGrp/namd2-ibverbs [0x9d256f]
  [5] /wfurc4/horitaGrp/namd2-ibverbs [0x9d24a0]
  [6] CldHandler+0x76 [0x4243b6]
  [7] /wfurc4/horitaGrp/namd2-ibverbs [0x9ce9fa]
  [8] /wfurc4/horitaGrp/namd2-ibverbs [0x9ce934]
[11] Stack Traceback:
  [0] /wfurc4/horitaGrp/namd2-ibverbs [0x9ca769]
  [1] /wfurc4/horitaGrp/namd2-ibverbs [0x9cd26e]
  [2] /wfurc4/horitaGrp/namd2-ibverbs [0x9c707b]
  [3] /wfurc4/horitaGrp/namd2-ibverbs [0x9cd1d2]
  [4] /wfurc4/horitaGrp/namd2-ibverbs [0x9d256f]
  [5] /wfurc4/horitaGrp/namd2-ibverbs [0x9d24a0]
  [6] CldHandler+0x76 [0x4243b6]
  [7] /wfurc4/horitaGrp/namd2-ibverbs [0x9ce9fa]
  [8] /wfurc4/horitaGrp/namd2-ibverbs [0x9ce934]
  [9] /wfurc4/horitaGrp/namd2-ibverbs [0x9ce8dc]
  [10] /wfurc4/horitaGrp/namd2-ibverbs [0x42f95d]
  [11] /wfurc4/horitaGrp/namd2-ibverbs [0x429129]
  [12] __libc_start_main+0xdb [0x34fe41c40b]
  [13] _ZNSt8ios_base4InitD1Ev+0x4a [0x4242aa]

=======

in the namd log file, and:

========
[12] Assertion "key != ((void *)0)" failed in file machine-ibverbs.c line 2370.
------------- Processor 12 Exiting: Called CmiAbort ------------
Reason:
[11] Assertion "key != ((void *)0)" failed in file machine-ibverbs.c line 2370.
------------- Processor 11 Exiting: Called CmiAbort ------------
Reason:
[15] Assertion "key != ((void *)0)" failed in file machine-ibverbs.c line 2370.
------------- Processor 15 Exiting: Called CmiAbort ------------
Reason:
[13] Assertion "key != ((void *)0)" failed in file machine-ibverbs.c line 2370.
------------- Processor 13 Exiting: Called CmiAbort ------------
Reason:
[10] Assertion "key != ((void *)0)" failed in file machine-ibverbs.c line 2370.
------------- Processor 10 Exiting: Called CmiAbort ------------
Reason:
[9] Assertion "key != ((void *)0)" failed in file machine-ibverbs.c line 2370.
------------- Processor 9 Exiting: Called CmiAbort ------------
Reason:
Fatal error on PE 10>
=======

in the pbs log file.
 
I had been writing restarts every 10,000 steps, so I'm not sure exactly where it crashes. In an SMD run, it made it to 10,600 steps (writing every 200). Any ideas? Is this our hardware, a memory leak, or a bad compile of charm or namd? If it's hardware, any suggestions as to where to point our sysadmin?

Thanks,
David
