Re: Re: The HIP version of NAMD gets wrong results when computing on more than one node

From: Josh Vermaas (joshua.vermaas_at_gmail.com)
Date: Fri Jul 03 2020 - 23:17:09 CDT

Oh lovely. Looks like I broke PME communication. :D Thanks for letting me
know!

-Josh

On Fri, Jul 3, 2020 at 10:04 PM 张驭洲 <zhangyuzhou15_at_mails.ucas.edu.cn> wrote:

> Hi Josh,
>
>
> There's something new about the problem. I tested the verbs-linux-x86_64-smp backend
> of charm-6.10.1. The behavior is the same: on one node the results are
> correct, and on 2 nodes they go wrong. However, if I disable the GPU PME
> computation using these parameters in the apoa1.namd file:
>
> bondedCUDA 255
>
> usePMECUDA off
>
> PMEoffload off
>
>
> the 2-node results become correct. Then I went back to test the
> ucx-linux-x86_64-ompipmix-smp backend and observed the same behavior: the
> 2-node results are wrong, but become correct when GPU PME is turned off.
> So the problem may be caused by the GPU PME code of NAMD.
>
>
> I hope these tests help when you revise the code.
>
>
> Sincerely,
>
> Zhang
>
>
>
> -----Original Message-----
> *From:* "Josh Vermaas" <joshua.vermaas_at_gmail.com>
> *Sent:* 2020-07-03 23:33:14 (Friday)
> *To:* "张驭洲" <zhangyuzhou15_at_mails.ucas.edu.cn>
> *Cc:* "NAMD list" <namd-l_at_ks.uiuc.edu>
> *Subject:* Re: namd-l: The HIP version of NAMD gets wrong results when
> computing on more than one node
>
> The test hardware I have access to is administered by others, and they
> push forward ROCM versions immediately. Even if I wanted to, I don't have a
> machine available with 3.3. :(
>
> -Josh
>
> On Fri, Jul 3, 2020 at 5:36 AM 张驭洲 <zhangyuzhou15_at_mails.ucas.edu.cn>
> wrote:
>
>> Hi Josh,
>>
>>
>> My tests were done with ROCm 3.3. I would like to know whether you are
>> going straight to ROCm 3.5 or will test the netlrts backend on ROCm 3.3.
>>
>>
>> Sincerely,
>>
>> Zhang
>>
>>
>>
>> -----Original Message-----
>> *From:* "Josh Vermaas" <joshua.vermaas_at_gmail.com>
>> *Sent:* 2020-07-03 18:54:33 (Friday)
>> *To:* "NAMD list" <namd-l_at_ks.uiuc.edu>, "张驭洲" <
>> zhangyuzhou15_at_mails.ucas.edu.cn>
>> *Cc:*
>> *Subject:* Re: namd-l: The HIP version of NAMD gets wrong results when
>> computing on more than one node
>>
>> Hi Zhang,
>>
>> The configurations I tested before getting distracted by COVID research
>> were multicore builds, and netlrts builds that split a single node
>> (networking wasn't working properly on our test setups). This was also in
>> the era of ROCM 3.3, and now I see this morning that those old binaries
>> don't work with 3.5, so I'm still working to reproduce your result. Two
>> things I'd try in the interim:
>>
>> 1. Compile with clang. In my own testing, things work better when I use
>> clang (which really aliases to AMD's LLVM compiler) rather than gcc.
>> 2. Try the netlrts backend just as a sanity check. My own personal
>> experience with ucx is that it is far from bulletproof, and this would
>> help isolate whether it is a HIP-specific issue or a ucx issue.
>>
>> -Josh
>>
>> On Fri, Jul 3, 2020 at 3:59 AM 张驭洲 <zhangyuzhou15_at_mails.ucas.edu.cn>
>> wrote:
>>
>>> Hello,
>>>
>>>
>>> I noticed that there is a HIP version of NAMD in the NAMD gerrit
>>> repository. I tried it using the apoa1 and stmv benchmarks. The results
>>> on a single node with multiple GPUs seem right, but when using more than
>>> one node, the total energy keeps increasing, and sometimes the
>>> computation even crashes because atoms are moving too fast. I used the
>>> ucx-linux-x86_64-ompipmix-smp build of charm-6.10.1. Could anyone give
>>> me some hints about this problem?
>>>
>>>
>>> Sincerely,
>>>
>>> Zhang
>>>
>>

This archive was generated by hypermail 2.1.6 : Thu Dec 31 2020 - 23:17:13 CST