RE: charmrun error: Work completion error in sendCq

From: Pang, Yui Tik (andrewpang_at_gatech.edu)
Date: Sat Oct 19 2019 - 16:26:33 CDT

Thanks for your help! In my case, I am pretty sure it has nothing to do with the system size, because I get the same error just by running the charm++ megatest (charm6.8.2/verbs-linux-x86_64-ifort-iccstatic/tests/charm++/megatest) with charmrun ++p 4 ./pgm. We are testing it on a single CPU node. It is a brand new cluster, so everything is new, and we are installing NAMD on it for the first time. Other versions of NAMD (we tried net and smp) work fine, except the performance isn't as good. We really want to try the ibverbs version but keep running into that sendCq error. Thanks!

Best,
Andrew Pang

From: Vermaas, Joshua (Joshua.Vermaas_at_nrel.gov)
Sent: Saturday, October 19, 2019 16:55
To: Pang, Yui Tik (andrewpang_at_gatech.edu); namd-l_at_ks.uiuc.edu
Subject: RE: charmrun error: Work completion error in sendCq

Oh hey! I thought I did something wrong and was starting to dig into that error message myself. I've narrowed it down to something related to the system being large, possibly having to do with the exclusion lists being transferred, since the simulation dies somewhere in phase 1 of the setup (check your logs; for me, it gets past phase 0 and dies in phase 1). My system is only 3M particles or so, but because the bonds are all out of order, the exclusion lists are much larger than they would be for a typical system. Does this system work on a single GPU node? Also, have there been any recent updates to your software? I can dig up more of my own notes on Monday.

-Josh

On 2019-10-19 10:20:11-06:00 owner-namd-l_at_ks.uiuc.edu wrote:
Dear all,
I get an error from charmrun with both the precompiled NAMD 2.13 ibverbs and verbs versions. The error persists even with a self-compiled charm-6.8.2/verbs-linux-x86_64-ifort-iccstatic. The error is pasted below:
[0] wc[0] status 9 wc[i].opcode 0
mlx5: login-hive1.pace.gatech.edu: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000001 00000000 00000000 00000000
00000000 00008a12 0a001e80 0036b1d2
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: Work completion error in sendCq
[0] Stack Traceback:
  [0:0] [0x6176e3]
  [0:1] [0x617736]
  [0:2] [0x613a78]
  [0:3] [0x61383e]
  [0:4] [0x61c881]
  [0:5] [0x61ead9]
  [0:6] [0x61315f]
  [0:7] [0x617857]
  [0:8] [0x625f28]
  [0:9] [0x626d93]
  [0:10] [0x621671]
  [0:11] [0x621ac9]
  [0:12] [0x6219a0]
  [0:13] [0x6174b6]
  [0:14] [0x617337]
  [0:15] [0x4e2a6b]
  [0:16] __libc_start_main+0xf5 [0x7ffff6d753d5]
  [0:17] [0x408ba9]
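For reference, the "status 9" printed above is the ibverbs work-completion status code; counting through enum ibv_wc_status in <infiniband/verbs.h>, 9 should correspond to IBV_WC_REM_INV_REQ_ERR (remote invalid request error). Below is a minimal sketch for decoding such a code with the standard libibverbs helper ibv_wc_status_str(); the file name decode_wc.c and the build line are only illustrative:

/* decode_wc.c (illustrative name): map the numeric ibverbs work-completion
 * status printed by charmrun ("wc[0] status 9") to its text description
 * using the standard libibverbs helper ibv_wc_status_str().
 * Build (assumed): gcc decode_wc.c -o decode_wc -libverbs
 */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(int argc, char **argv)
{
    /* Default to 9, the status value in the dump above. */
    int status = (argc > 1) ? atoi(argv[1]) : 9;

    printf("wc status %d = %s\n", status,
           ibv_wc_status_str((enum ibv_wc_status) status));
    return 0;
}

Run with no arguments, it prints the description for the status 9 seen here.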
Our cluster uses Mellanox InfiniBand and RHEL 7, if that information helps. Any help will be appreciated!
Thank you!
Best,
Andrew Pang
