charmrun error: Work completion error in sendCq

From: Pang, Yui Tik (andrewpang_at_gatech.edu)
Date: Sat Oct 19 2019 - 11:10:07 CDT

Dear all,

I get an error from charmrun with the precompiled NAMD 2.13 ibverbs and verbs builds. The error persists even with a self-compiled build of charm-6.8.2/verbs-linux-x86_64-ifort-iccstatic. The error is pasted below:

[0] wc[0] status 9 wc[i].opcode 0
mlx5: login-hive1.pace.gatech.edu: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000001 00000000 00000000 00000000
00000000 00008a12 0a001e80 0036b1d2
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: Work completion error in sendCq
[0] Stack Traceback:
  [0:0] [0x6176e3]
  [0:1] [0x617736]
  [0:2] [0x613a78]
  [0:3] [0x61383e]
  [0:4] [0x61c881]
  [0:5] [0x61ead9]
  [0:6] [0x61315f]
  [0:7] [0x617857]
  [0:8] [0x625f28]
  [0:9] [0x626d93]
  [0:10] [0x621671]
  [0:11] [0x621ac9]
  [0:12] [0x6219a0]
  [0:13] [0x6174b6]
  [0:14] [0x617337]
  [0:15] [0x4e2a6b]
  [0:16] __libc_start_main+0xf5 [0x7ffff6d753d5]
  [0:17] [0x408ba9]
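
For reference, the first line of the log reports a numeric work-completion status. The sketch below decodes that number into a symbolic name. It assumes the standard ordering of the ibv_wc_status enum in libibverbs' <infiniband/verbs.h>; the table and the helper name decode_wc_line are written here for illustration only and are not part of charm++ or NAMD, so please check them against the header installed on your system. Under that assumption, status 9 would correspond to IBV_WC_REM_INV_REQ_ERR (remote invalid request error).

```python
import re

# Symbolic names in the assumed order of the ibv_wc_status enum from
# libibverbs' <infiniband/verbs.h>. Verify against your local header.
IBV_WC_STATUS = [
    "IBV_WC_SUCCESS",
    "IBV_WC_LOC_LEN_ERR",
    "IBV_WC_LOC_QP_OP_ERR",
    "IBV_WC_LOC_EEC_OP_ERR",
    "IBV_WC_LOC_PROT_ERR",
    "IBV_WC_WR_FLUSH_ERR",
    "IBV_WC_MW_BIND_ERR",
    "IBV_WC_BAD_RESP_ERR",
    "IBV_WC_LOC_ACCESS_ERR",
    "IBV_WC_REM_INV_REQ_ERR",
    "IBV_WC_REM_ACCESS_ERR",
    "IBV_WC_REM_OP_ERR",
    "IBV_WC_RETRY_EXC_ERR",
    "IBV_WC_RNR_RETRY_EXC_ERR",
    "IBV_WC_LOC_RDD_VIOL_ERR",
    "IBV_WC_REM_INV_RD_REQ_ERR",
    "IBV_WC_REM_ABORT_ERR",
    "IBV_WC_INV_EECN_ERR",
    "IBV_WC_INV_EEC_STATE_ERR",
    "IBV_WC_FATAL_ERR",
    "IBV_WC_RESP_TIMEOUT_ERR",
    "IBV_WC_GENERAL_ERR",
]

# Matches lines of the form seen in the log above:
#   [0] wc[0] status 9 wc[i].opcode 0
LOG_RE = re.compile(r"wc\[(\d+)\] status (\d+) wc\[i\]\.opcode (\d+)")


def decode_wc_line(line):
    """Hypothetical helper: return the decoded fields, or None if no match."""
    match = LOG_RE.search(line)
    if match is None:
        return None
    index, status, opcode = (int(group) for group in match.groups())
    if 0 <= status < len(IBV_WC_STATUS):
        name = IBV_WC_STATUS[status]
    else:
        name = "UNKNOWN"
    # The opcode is kept as a raw number; it is not decoded here.
    return {"index": index, "status": status, "status_name": name, "opcode": opcode}


if __name__ == "__main__":
    print(decode_wc_line("[0] wc[0] status 9 wc[i].opcode 0"))
```

The name only narrows down the class of failure; it does not by itself identify the cause on a given cluster.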

Our cluster uses Mellanox InfiniBand and RHEL 7, if that information helps. Any help would be appreciated!

Thank you!

Best,
Andrew Pang
