RE: RE: charmrun error: Work completion error in sendCq

From: Vermaas, Joshua (Joshua.Vermaas_at_nrel.gov)
Date: Thu Oct 24 2019 - 15:40:47 CDT

Oh, that would explain it too. How do we tell Charm++ to use UCX? MPI builds are a no-no on GPU machines, from what I gather.

-Josh

On 2019-10-24 14:30:43-06:00 Julio Maia wrote:

Hi Andrew,
Can you try to rebuild Charm++ using UCX or MPI instead of verbs? We recently discovered that the Verbs layer of our runtime system is broken for modern InfiniBand machines.
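For example, something along these lines (a sketch only: it assumes a Charm++ release recent enough to include the UCX machine layer, which 6.8.2 may not be, and it mirrors the ifort/iccstatic options of your current verbs build, so the exact target and options may need adjusting for your cluster):

  # UCX build (needs a UCX installation on the system)
  ./build charm++ ucx-linux-x86_64 ifort iccstatic --with-production

  # or an MPI build
  ./build charm++ mpi-linux-x86_64 --with-production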
Please let us know whether that fixes your problem.
Thanks,

On Sat, Oct 19, 2019 at 4:31 PM Pang, Yui Tik <andrewpang_at_gatech.edu> wrote:
Thanks for your help! In my case, I am pretty sure it has nothing to do with the system size, because I get the same error just by running the charmrun megatest (charm6.8.2/verbs-linux-x86_64-ifort-iccstatic/tests/charm++/megatest) with charmrun ++p 4 ./pgm. We are testing on a single CPU node. It is a brand-new cluster, so everything is new, and we are installing NAMD on it for the first time. Other versions of NAMD (I tried net and smp) work fine, except the performance isn't as good. We really want to try the ibverbs version but keep running into that sendCq error. Thanks!
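For reference, the same quick check can be rerun against any alternative Charm++ build. A rough sketch, where <charm-build-dir> stands in for whichever build directory is being tested:

  cd <charm-build-dir>/tests/charm++/megatest
  make
  ./charmrun ++p 4 ./pgm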

Best,
Andrew Pang

From: Vermaas, Joshua <Joshua.Vermaas_at_nrel.gov>
Sent: Saturday, October 19, 2019 16:55
To: Pang, Yui Tik <andrewpang_at_gatech.edu>; namd-l_at_ks.uiuc.edu
Subject: RE: charmrun error: Work completion error in sendCq

Oh hey! I thought I did something wrong and was starting to dig into that error message myself. I've narrowed it down to something related to the system being large, possibly having to do with the exclusion lists being transferred, since the simulation dies somewhere in phase 1 of the setup (check your logs; for me, it gets past phase 0 and dies in phase 1). My system is only about 3M particles, but because the bonds are all out of order, the exclusion lists are much larger than they would be for a typical system. Does this system work on a single GPU node? Also, have there been any recent updates to your software? I can dig up more of my own notes on Monday.
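A quick way to see how far setup gets is to grep the run output for the startup phase lines NAMD prints (a sketch, assuming the output was redirected to a file, here called run.log):

  # NAMD prints one line per phase, e.g. "Info: Startup phase 0 took ...";
  # the last phase reported before the crash is where it died.
  grep -i "startup phase" run.log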

-Josh

On 2019-10-19 10:20:11-06:00 owner-namd-l_at_ks.uiuc.edu wrote:
Dear all,
I get an error from charmrun with both the precompiled NAMD 2.13 ibverbs and verbs versions. The error persists even with a self-compiled charm-6.8.2/verbs-linux-x86_64-ifort-iccstatic. The error is pasted below:
[0] wc[0] status 9 wc[i].opcode 0
mlx5: login-hive1.pace.gatech.edu: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000001 00000000 00000000 00000000
00000000 00008a12 0a001e80 0036b1d2
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: Work completion error in sendCq
[0] Stack Traceback:
  [0:0] [0x6176e3]
  [0:1] [0x617736]
  [0:2] [0x613a78]
  [0:3] [0x61383e]
  [0:4] [0x61c881]
  [0:5] [0x61ead9]
  [0:6] [0x61315f]
  [0:7] [0x617857]
  [0:8] [0x625f28]
  [0:9] [0x626d93]
  [0:10] [0x621671]
  [0:11] [0x621ac9]
  [0:12] [0x6219a0]
  [0:13] [0x6174b6]
  [0:14] [0x617337]
  [0:15] [0x4e2a6b]
  [0:16] __libc_start_main+0xf5 [0x7ffff6d753d5]
  [0:17] [0x408ba9]
Our cluster uses Mellanox InfiniBand and RHEL 7, if that information helps. Any help would be appreciated!
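I can also run the standard InfiniBand diagnostics and share the output if that would help, e.g. (assuming the usual infiniband-diags / libibverbs-utils packages are installed):

  ibstat          # adapter state, link layer, and rate
  ibv_devinfo     # device info as seen through libibverbs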
Thank you!
Best,
Andrew Pang

--
JULIO MAIA
Research Programmer
Beckman Institute for Advanced Science and Technology
Vice Chancellor Research Institutes
University of Illinois at Urbana-Champaign
405 N. Mathews Avenue | M/C 251
Urbana, IL 61801
217-244-1928 | jmaia_at_ks.uiuc.edu
beckman.illinois.edu
