Re: RE: charmrun error: Work completion error in sendCq

From: Julio Maia (jmaia_at_ks.uiuc.edu)
Date: Thu Oct 24 2019 - 16:05:50 CDT

Hey Josh,

You need to recompile Charm++ for UCX. Run something like:
./build charm++ ucx-linux-x86_64 -j8 --with-production
You can also use Charm++'s interactive build script. Just run ./build on
its own, and you can specify interconnects, compilers, and the like by
following the steps.
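
To check which network layer an existing Charm++ build uses, one option is
to look at the build directory name. The sketch below assumes the directory
name starts with the layer name, as in the two names that appear in this
thread (verbs-linux-x86_64-ifort-iccstatic and ucx-linux-x86_64). The helper
charm_network_layer is a hypothetical name for illustration, not part of
Charm++ or NAMD.

```python
def charm_network_layer(build_dir):
    """Return the leading component of a Charm++ build directory name.

    Assumption: the directory is named <layer>-<os>-<arch>[-options],
    e.g. "verbs-linux-x86_64-ifort-iccstatic".
    """
    # Keep only the last path component, then take the text before the first "-".
    name = build_dir.rstrip("/").split("/")[-1]
    return name.split("-", 1)[0]


for path in ("charm-6.8.2/verbs-linux-x86_64-ifort-iccstatic", "ucx-linux-x86_64"):
    layer = charm_network_layer(path)
    if layer == "verbs":
        note = "verbs layer: consider rebuilding with UCX or MPI"
    else:
        note = "not the verbs layer"
    print(f"{path}: {layer} ({note})")
```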

Thanks,

On Thu, Oct 24, 2019 at 3:40 PM Vermaas, Joshua <Joshua.Vermaas_at_nrel.gov>
wrote:

> Oh, that would explain it too. How do we tell Charm++ to use UCX? From
> what I gather, MPI builds are a no-no on GPU machines.
>
> -Josh
>
>
>
> On 2019-10-24 14:30:43-06:00 Julio Maia wrote:
>
> Hi Andrew,
> Can you try to rebuild Charm++ using UCX or MPI instead of verbs? We
> recently discovered that the verbs layer of our runtime system is broken
> on modern InfiniBand machines.
> Please let us know whether that fixes your problem.
> Thanks,
>
> On Sat, Oct 19, 2019 at 4:31 PM Pang, Yui Tik <andrewpang_at_gatech.edu>
> wrote:
>
>> Thanks for your help! In my case, I am pretty sure that it has nothing to
>> do with the system size, because I get the same error just by running the
>> charmrun megatest
>> (charm6.8.2/verbs-linux-x86_64-ifort-iccstatic/tests/charm++/megatest)
>> (charmrun ++p 4 ./pgm). We are testing it on a single CPU node. It is a
>> brand new cluster, so everything is new, and we are installing NAMD on it
>> for the first time. Other versions of NAMD (we tried net and smp) work
>> fine, except that the performance isn’t as good. We really want to try out
>> the ibverbs version but run into that sendCq error. Thanks!
>>
>>
>> Best,
>>
>> Andrew Pang
>>
>>
>> *From: *Vermaas, Joshua <Joshua.Vermaas_at_nrel.gov>
>> *Sent: *Saturday, October 19, 2019 16:55
>> *To: *Pang, Yui Tik <andrewpang_at_gatech.edu>; namd-l_at_ks.uiuc.edu
>> *Subject: *RE: charmrun error: Work completion error in sendCq
>>
>>
>> Oh hey! I thought I did something wrong and was starting to dig into that
>> error message myself. I've narrowed it down to something related to the
>> system being large, possibly having to do with the exclusion lists being
>> transferred, since the simulation dies somewhere in phase 1 of the setup
>> (check your logs; for me, it gets past phase 0 and dies in phase 1). My
>> system is only 3M particles or so, but because the bonds are all out of
>> order, the exclusion lists are much larger than they would be for a typical
>> system. Does this system work on a single GPU node? Also, have there been
>> any recent updates to your software? I can dig up more of my own notes on
>> Monday.
>>
>> -Josh
>>
>>
>>
>>
>> On 2019-10-19 10:20:11-06:00 owner-namd-l_at_ks.uiuc.edu wrote:
>>
>> Dear all,
>>
>> I get an error from charmrun with the precompiled NAMD 2.13 ibverbs and
>> verbs versions. The error persists even with a self-compiled
>> charm-6.8.2/verbs-linux-x86_64-ifort-iccstatic build. The error is pasted
>> below:
>>
>> [0] wc[0] status 9 wc[i].opcode 0
>> mlx5: login-hive1.pace.gatech.edu: got completion with error:
>> 00000000 00000000 00000000 00000000
>> 00000000 00000000 00000000 00000000
>> 00000001 00000000 00000000 00000000
>> 00000000 00008a12 0a001e80 0036b1d2
>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>> Reason: Work completion error in sendCq
>> [0] Stack Traceback:
>> [0:0] [0x6176e3]
>> [0:1] [0x617736]
>> [0:2] [0x613a78]
>> [0:3] [0x61383e]
>> [0:4] [0x61c881]
>> [0:5] [0x61ead9]
>> [0:6] [0x61315f]
>> [0:7] [0x617857]
>> [0:8] [0x625f28]
>> [0:9] [0x626d93]
>> [0:10] [0x621671]
>> [0:11] [0x621ac9]
>> [0:12] [0x6219a0]
>> [0:13] [0x6174b6]
>> [0:14] [0x617337]
>> [0:15] [0x4e2a6b]
>> [0:16] __libc_start_main+0xf5 [0x7ffff6d753d5]
>> [0:17] [0x408ba9]
>>
>> Our cluster uses Mellanox InfiniBand and RHEL 7, if that information
>> helps. Any help will be appreciated!
>>
>> Thank you!
>>
>> Best,
>>
>> Andrew Pang
>>
>
> --
> *JULIO MAIA*
> *Research Programmer*
>
> Beckman Institute for Advanced Science and Technology
> Vice Chancellor Research Institutes
> University of Illinois at Urbana-Champaign
> 405 N. Mathews Avenue | M/C 251
> Urbana, IL 61801
> 217-244-1928 | jmaia_at_ks.uiuc.edu
> beckman.illinois.edu
>
> <http://illinois.edu/>
>
> *Under the Illinois Freedom of Information Act any written communication
> to or from university employees regarding university business is a public
> record and may be subject to public disclosure. *
>
>
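
For anyone reading the quoted error log: the "status 9" on the first line
can be decoded, assuming the verbs layer prints the raw status field of the
libibverbs work completion (struct ibv_wc). Under that assumption, the
values below follow the order of enum ibv_wc_status in <infiniband/verbs.h>.
The function name decode_wc_status and the regular expression are
illustrative only and are not part of Charm++ or NAMD.

```python
import re

# Names of enum ibv_wc_status from libibverbs, in declaration order,
# so that the tuple index equals the numeric status value.
IBV_WC_STATUS = (
    "IBV_WC_SUCCESS",
    "IBV_WC_LOC_LEN_ERR",
    "IBV_WC_LOC_QP_OP_ERR",
    "IBV_WC_LOC_EEC_OP_ERR",
    "IBV_WC_LOC_PROT_ERR",
    "IBV_WC_WR_FLUSH_ERR",
    "IBV_WC_MW_BIND_ERR",
    "IBV_WC_BAD_RESP_ERR",
    "IBV_WC_LOC_ACCESS_ERR",
    "IBV_WC_REM_INV_REQ_ERR",
    "IBV_WC_REM_ACCESS_ERR",
    "IBV_WC_REM_OP_ERR",
    "IBV_WC_RETRY_EXC_ERR",
    "IBV_WC_RNR_RETRY_EXC_ERR",
    "IBV_WC_LOC_RDD_VIOL_ERR",
    "IBV_WC_REM_INV_RD_REQ_ERR",
    "IBV_WC_REM_ABORT_ERR",
    "IBV_WC_INV_EECN_ERR",
    "IBV_WC_INV_EEC_STATE_ERR",
    "IBV_WC_FATAL_ERR",
    "IBV_WC_RESP_TIMEOUT_ERR",
    "IBV_WC_GENERAL_ERR",
)

# Matches lines shaped like "[0] wc[0] status 9 wc[i].opcode 0".
WC_LINE = re.compile(r"wc\[\d+\]\s+status\s+(\d+)")


def decode_wc_status(log_text):
    """Return the ibv_wc_status name for every 'wc[N] status M' in the log."""
    names = []
    for match in WC_LINE.finditer(log_text):
        code = int(match.group(1))
        if code < len(IBV_WC_STATUS):
            names.append(IBV_WC_STATUS[code])
        else:
            names.append(f"unknown ({code})")
    return names


print(decode_wc_status("[0] wc[0] status 9 wc[i].opcode 0"))
```

With the line from the log above, this reports IBV_WC_REM_INV_REQ_ERR
(remote invalid request). That names the failing completion but does not by
itself say why the verbs layer triggers it.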

