RE: charmrun error: Work completion error in sendCq

From: Vermaas, Joshua (Joshua.Vermaas_at_nrel.gov)
Date: Sat Oct 19 2019 - 15:55:11 CDT

Next message: Pang, Yui Tik: "RE: charmrun error: Work completion error in sendCq"
Previous message: Pang, Yui Tik: "charmrun error: Work completion error in sendCq"
In reply to: Pang, Yui Tik: "charmrun error: Work completion error in sendCq"
Next in thread: Pang, Yui Tik: "RE: charmrun error: Work completion error in sendCq"
Reply: Pang, Yui Tik: "RE: charmrun error: Work completion error in sendCq"

Oh hey! I thought I had done something wrong and was starting to dig into that error message myself. I've narrowed it down to something related to the system being large, possibly to do with the exclusion lists being transferred, since the simulation dies somewhere in phase 1 of the setup. Check your logs: for me, it gets past phase 0 and dies in phase 1. My system is only 3M particles or so, but because the bonds are all out of order, the exclusion lists are much larger than they would be for a typical system. Does this system work on a single GPU node? Also, have there been any recent updates to your software? I can dig up more of my own notes on Monday.
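
If it helps with the log check, here is a rough Python sketch that reports the last startup phase a log says it finished. It assumes the log contains lines of the form "Info: Startup phase N took ...", which is how I remember NAMD printing them; the function name and the sample text are made up for illustration, so adjust the pattern if your log looks different.

```python
import re

# Assumed format of a NAMD startup line, for example:
#   Info: Startup phase 0 took 0.01 s, 100 MB of memory in use
# Adjust the pattern if your build prints something different.
PHASE_RE = re.compile(r"^Info: Startup phase (\d+) took")


def last_completed_phase(log_text):
    """Return the last startup phase reported as finished, or None."""
    last = None
    for line in log_text.splitlines():
        match = PHASE_RE.match(line)
        if match:
            last = int(match.group(1))
    return last


# Hypothetical log excerpt: phase 0 finishes, then the run aborts.
sample_log = """Info: Startup phase 0 took 0.01 s, 100 MB of memory in use
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: Work completion error in sendCq
"""
print(last_completed_phase(sample_log))
```

If this returns 0, the run finished phase 0 and died somewhere in phase 1, which is what I see.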

-Josh

On 2019-10-19 10:20:11-06:00 owner-namd-l_at_ks.uiuc.edu wrote:

Dear all,
I get an error from charmrun with the precompiled NAMD 2.13 ibverbs and verbs builds. The error persists even with a self-compiled version using charm-6.8.2/verbs-linux-x86_64-ifort-iccstatic. The error is pasted below:
[0] wc[0] status 9 wc[i].opcode 0
mlx5: login-hive1.pace.gatech.edu: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000001 00000000 00000000 00000000
00000000 00008a12 0a001e80 0036b1d2
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: Work completion error in sendCq
[0] Stack Traceback:
  [0:0] [0x6176e3]
  [0:1] [0x617736]
  [0:2] [0x613a78]
  [0:3] [0x61383e]
  [0:4] [0x61c881]
  [0:5] [0x61ead9]
  [0:6] [0x61315f]
  [0:7] [0x617857]
  [0:8] [0x625f28]
  [0:9] [0x626d93]
  [0:10] [0x621671]
  [0:11] [0x621ac9]
  [0:12] [0x6219a0]
  [0:13] [0x6174b6]
  [0:14] [0x617337]
  [0:15] [0x4e2a6b]
  [0:16] __libc_start_main+0xf5 [0x7ffff6d753d5]
  [0:17] [0x408ba9]
Our cluster uses MLX InfiniBand and RHEL 7, if that information helps. Any help will be appreciated!
Thank you!
Best,
Andrew Pang
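
For reference, the "status 9" in the first line of the error output can be mapped to a symbolic name. The Python sketch below assumes the number printed is the raw `status` field of a libibverbs work completion, and the table reproduces the start of `enum ibv_wc_status` as recalled from `infiniband/verbs.h`; check it against the header on your own system before relying on it. The function name is made up for illustration.

```python
import re

# First entries of enum ibv_wc_status, as recalled from infiniband/verbs.h.
# This is an assumption to verify against the header installed on your cluster.
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    1: "IBV_WC_LOC_LEN_ERR",
    2: "IBV_WC_LOC_QP_OP_ERR",
    3: "IBV_WC_LOC_EEC_OP_ERR",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
    6: "IBV_WC_MW_BIND_ERR",
    7: "IBV_WC_BAD_RESP_ERR",
    8: "IBV_WC_LOC_ACCESS_ERR",
    9: "IBV_WC_REM_INV_REQ_ERR",
    10: "IBV_WC_REM_ACCESS_ERR",
    11: "IBV_WC_REM_OP_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}

# Matches the "wc[0] status 9" fragment of the line quoted in this message.
WC_RE = re.compile(r"wc\[\d+\] status (\d+)")


def decode_wc_status(line):
    """Return (code, name) for a 'wc[N] status M' line, or None if absent."""
    match = WC_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    return code, IBV_WC_STATUS.get(code, "unknown")


print(decode_wc_status("[0] wc[0] status 9 wc[i].opcode 0"))
```

Under those assumptions, status 9 is a remote invalid request error, meaning the receiving side rejected the request it was sent.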


This archive was generated by hypermail 2.1.6 : Tue Dec 31 2019 - 23:20:59 CST