Re: The HIP version of NAMD gets wrong results when computing on more than one node

From: Josh Vermaas (joshua.vermaas_at_gmail.com)
Date: Fri Jul 03 2020 - 10:33:14 CDT

The test hardware I have access to is administered by others, and they move
to new ROCm versions as soon as they are released. Even if I wanted to, I
don't have a machine available with ROCm 3.3. :(

-Josh

On Fri, Jul 3, 2020 at 5:36 AM 张驭洲 <zhangyuzhou15_at_mails.ucas.edu.cn> wrote:

> Hi Josh,
>
>
> My tests were done with ROCm 3.3. Are you going straight to ROCm 3.5, or
> will you test the netlrts backend on ROCm 3.3?
>
>
> Sincerely,
>
> Zhang
>
>
>
> -----Original Message-----
> *From:* "Josh Vermaas" <joshua.vermaas_at_gmail.com>
> *Sent:* 2020-07-03 18:54:33 (Friday)
> *To:* "NAMD list" <namd-l_at_ks.uiuc.edu>, "张驭洲" <
> zhangyuzhou15_at_mails.ucas.edu.cn>
> *Cc:*
> *Subject:* Re: namd-l: The HIP version of NAMD gets wrong results when
> computing on more than one node
>
> Hi Zhang,
>
> The configurations I tested before getting distracted by COVID research
> were multicore builds and netlrts builds that split a single node
> (networking wasn't working properly on our test setups). This was also in
> the era of ROCm 3.3, and I see this morning that those old binaries
> don't work with 3.5, so I'm still working to reproduce your result. Two
> things I'd try in the interim:
>
> 1. Compile with clang. In my own testing, things work better when I use
> clang (which really aliases to AMD's LLVM compiler) instead of gcc.
> 2. Try the netlrts backend just as a sanity check. My own personal
> experience with ucx is that it is far from bulletproof, and it would help
> to isolate whether this is a HIP-specific issue or a ucx issue.
>
> -Josh
>
> On Fri, Jul 3, 2020 at 3:59 AM 张驭洲 <zhangyuzhou15_at_mails.ucas.edu.cn>
> wrote:
>
>> Hello,
>>
>>
>> I noticed that there is a HIP version of NAMD in the NAMD gerrit
>> repository. I tried it using the apoa1 and stmv benchmarks. The results on
>> a single node with multiple GPUs seem right, but when using more than one
>> node, the total energy keeps increasing, and sometimes the computation
>> even crashes because atoms are moving too fast. I used the
>> ucx-linux-x86_64-ompipmix-smp build of charm-6.10.1. Could anyone give
>> me some hints about this problem?
>>
>>
>> Sincerely,
>>
>> Zhang
>>
>
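
When comparing a single-node run against a multi-node run, it can help to put a number on "the total energy keeps increasing". The following is a minimal sketch, not part of the original messages. It assumes the standard NAMD log layout, where an ETITLE: line names the columns and each ENERGY: line carries the values in the same order. The function names are hypothetical.

```python
def total_energy_series(log_lines):
    """Return a list of (timestep, total_energy) pairs from NAMD log lines.

    Assumes the usual NAMD log layout: an "ETITLE:" line naming the
    columns, followed by "ENERGY:" lines with values in the same order.
    """
    total_col = None
    series = []
    for line in log_lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "ETITLE:":
            # Locate the TOTAL column by name rather than by fixed position.
            total_col = fields.index("TOTAL")
        elif fields[0] == "ENERGY:" and total_col is not None:
            series.append((int(fields[1]), float(fields[total_col])))
    return series


def energy_drift_per_step(series):
    """Least-squares slope of total energy versus timestep.

    The result is in the log's energy units per step (kcal/mol per step
    for NAMD). A value near zero suggests good energy conservation.
    """
    n = len(series)
    if n < 2:
        raise ValueError("need at least two ENERGY lines to estimate drift")
    mean_t = sum(t for t, _ in series) / n
    mean_e = sum(e for _, e in series) / n
    num = sum((t - mean_t) * (e - mean_e) for t, e in series)
    den = sum((t - mean_t) ** 2 for t, _ in series)
    if den == 0:
        raise ValueError("all ENERGY lines share the same timestep")
    return num / den
```

To use it, open each log file and pass the file object to total_energy_series, then compare the slopes from the single-node and multi-node runs. The comparison is most meaningful for constant-energy runs without a thermostat or barostat, since a thermostat can mask or cause changes in total energy.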
