Re: Re: The HIP version of NAMD gets wrong results when computing on more than one node

From: Josh Vermaas (joshua.vermaas_at_gmail.com)
Date: Fri Aug 28 2020 - 18:21:18 CDT

Hi Zhang,

Based on the ROCm 3.7 release notes, I got a new idea today for why this
behavior was occurring. Long story short, direct GPU to GPU communication
is not functioning properly on AMD hardware through HIP (
https://rocmdocs.amd.com/en/latest/Current_Release_Notes/Current-Release-Notes.html#issue-with-peer-to-peer-transfers).
The only place in the NAMD codebase that currently tries to exploit that
feature is the PME calculation. In my own tests today, I was able to work
around this by letting NAMD see only one device on each node (+devices 0),
and the PME energies and forces were then comparable to what is computed via
the CPU code path. Apologies that it has taken so long to get back to you; I
was distracted by the rather substantial performance regressions that have
appeared with ROCm versions 3.5 and up.
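
To be explicit about the workaround, something along these lines (the
launcher options, node list, and process/thread counts here are just
placeholders, not a recommendation; adapt them to your own run script):

  charmrun ++nodelist hostfile +p8 ./namd2 +ppn 4 +devices 0 apoa1/apoa1.namd

With +devices 0 every process only ever touches GPU 0, so the broken
peer-to-peer path in the PME code is never exercised.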

-Josh

On Wed, Jul 8, 2020 at 1:45 AM 张驭洲 <zhangyuzhou15_at_mails.ucas.edu.cn> wrote:

> Hi Josh,
>
>
> The compiler I used is GCC 7.3.1. I'm not sure whether your results are
> from a one-node or a multi-node run. I got curves similar to yours when
> running on one node, but the curves are totally different when running on
> two nodes. It's not about processes, but physical nodes. On one physical
> node I also start 4 processes, using 4 AMD GPUs, and the results are
> correct. But if I run on 2 nodes, even starting only 2 processes on each
> node, the results are wrong. What's more, using 2 nodes costs much more
> time than using one node. The low performance is not caused by the small
> size of the apoa1 case: when I turn off the GPU PME computation, I get
> correct results on 2 nodes and the performance is much higher than with
> GPU PME turned on. I also tested the clang compiler, and all the phenomena
> are the same as with GCC. I only recompiled NAMD with clang, not charm++;
> I failed to rebuild charm++ with clang, and I'm not sure whether that is
> because the UCX I used was built with GCC.
>
>
> I attached my results. If you can open the Excel file, you can click each
> figure and drag the data area from one column to another to check each
> curve. Note that there are 6 sheets in the Excel file, each containing the
> data from one test. If you cannot open it, please have a look at the three
> pictures, which show the total energy for one node, two nodes, and two
> nodes with GPU PME turned off, respectively. You can see that the two-node
> curve is completely wrong: the total energy keeps increasing with the time
> steps.
>
>
> I compiled the source code on one of the nodes on which I run the test
> cases, which has four AMD GPUs. I'm sure the amdgpu-target was set properly.
>
>
> Sincerely,
>
> Zhang
>
>
> -----Original Message-----
> *From:* "Josh Vermaas" <joshua.vermaas_at_gmail.com>
> *Sent:* 2020-07-07 05:33:26 (Tuesday)
> *To:* "张驭洲" <zhangyuzhou15_at_mails.ucas.edu.cn>
> *Cc:* "NAMD list" <namd-l_at_ks.uiuc.edu>
> *Subject:* Re: Re: namd-l: The HIP version of NAMD gets wrong results when
> computing on more than one node
>
> Ok, I've attached what I got when I ran it myself with the netlrts backend,
> compared against the multicore implementation (still with ROCm 3.3.0).
> Basically, I can reproduce that the electrostatic energy goes down, but
> that the total energy stays constant, which is what we'd expect for apoa1
> run without a thermostat. The energy is just sloshing around into different
> pots, which isn't what you describe, and I'm keen to figure out why that
> is. What compiler were you using? If it's not clang, does clang fix the
> problem? Also, since I've spent the day working on this problem: did you
> have an AMD GPU on the machine doing the compiling? I've noticed that if I
> don't compile on a compute node or set the amdgpu-target, environment
> variables are not set correctly, and seemingly baffling segfaults result.
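>
> In case it helps, one way to sanity-check the target on a compute node, as
> a rough sketch (gfx906 is just an example value for MI50/MI60-class cards;
> use whatever rocminfo reports for your hardware, and the exact place the
> flag goes in the NAMD build scripts may differ):
>
>   rocminfo | grep gfx                # prints the gfx target, e.g. gfx906
>   hipcc --amdgpu-target=gfx906 ...   # pass the same target when compiling the HIP code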
>
> -Josh
>
> On Sat, Jul 4, 2020 at 12:17 AM Josh Vermaas <joshua.vermaas_at_gmail.com>
> wrote:
>
>> Oh lovely. Looks like I broke PME communication. :D Thanks for letting me
>> know!
>>
>> -Josh
>>
>> On Fri, Jul 3, 2020 at 10:04 PM 张驭洲 <zhangyuzhou15_at_mails.ucas.edu.cn>
>> wrote:
>>
>>> Hi Josh,
>>>
>>>
>>> There's something new about the problem. I tested the verbs-linux-x86_64-smp
>>> backend of charm-6.10.1. The phenomena are the same: on one node the results
>>> are correct and on 2 nodes they go wrong. However, if I disable the GPU PME
>>> computation using these parameters in the apoa1.namd file:
>>>
>>> bondedCUDA 255
>>>
>>> usePMECUDA off
>>>
>>> PMEoffload off
>>>
>>>
>>> the results on 2 nodes become correct. Then I went back to test the
>>> ucx-linux-x86_64-ompipmix-smp backend and got the same phenomena: the results
>>> on 2 nodes are wrong, but become correct if I turn off the GPU PME
>>> computation. So the problem may be caused by NAMD's GPU PME code.
>>>
>>>
>>> I hope these tests can help when you revise the code.
>>>
>>>
>>> Sincerely,
>>>
>>> Zhang
>>>
>>>
>>>
>>> -----Original Message-----
>>> *From:* "Josh Vermaas" <joshua.vermaas_at_gmail.com>
>>> *Sent:* 2020-07-03 23:33:14 (Friday)
>>> *To:* "张驭洲" <zhangyuzhou15_at_mails.ucas.edu.cn>
>>> *Cc:* "NAMD list" <namd-l_at_ks.uiuc.edu>
>>> *Subject:* Re: namd-l: The HIP version of NAMD gets wrong results when
>>> computing on more than one node
>>>
>>> The test hardware I have access to is administered by others, and they
>>> roll ROCm versions forward immediately. Even if I wanted to, I don't have
>>> a machine available with 3.3. :(
>>>
>>> -Josh
>>>
>>> On Fri, Jul 3, 2020 at 5:36 AM 张驭洲 <zhangyuzhou15_at_mails.ucas.edu.cn>
>>> wrote:
>>>
>>>> Hi Josh,
>>>>
>>>>
>>>> My tests were done with ROCm 3.3. I want to know whether you are going
>>>> straight to ROCm 3.5 or testing with the netlrts backend on ROCm 3.3?
>>>>
>>>>
>>>> Sincerely,
>>>>
>>>> Zhang
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> *From:* "Josh Vermaas" <joshua.vermaas_at_gmail.com>
>>>> *Sent:* 2020-07-03 18:54:33 (Friday)
>>>> *To:* "NAMD list" <namd-l_at_ks.uiuc.edu>, "张驭洲" <
>>>> zhangyuzhou15_at_mails.ucas.edu.cn>
>>>> *Cc:*
>>>> *Subject:* Re: namd-l: The HIP version of NAMD gets wrong results when
>>>> computing on more than one node
>>>>
>>>> Hi Zhang,
>>>>
>>>> The configurations I tested before getting distracted by COVID research
>>>> were multicore builds and netlrts builds that split a single node
>>>> (networking wasn't working properly on our test setups). This was also in
>>>> the era of ROCm 3.3, and now I see this morning that those old binaries
>>>> don't work with 3.5, so I'm still working to reproduce your result. Two
>>>> things I'd try in the interim:
>>>>
>>>> 1. Compile with clang. In my own testing, things work better when I use
>>>> clang (which really aliases to AMD's LLVM compiler) over gcc.
>>>> 2. Try the netlrts backend just as a sanity check; a rough sketch of the
>>>> build commands is below. My own personal experience with ucx is that it
>>>> is far from bulletproof, and it would help to isolate whether it is a
>>>> HIP-specific issue or a ucx issue.
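>>>>
>>>> For concreteness, roughly the standard charm-6.10.1 build invocations for
>>>> those two suggestions, as a minimal sketch (site-specific options omitted;
>>>> adjust to your setup):
>>>>
>>>>   # netlrts SMP backend with the default gcc, taking ucx out of the picture
>>>>   ./build charm++ netlrts-linux-x86_64 smp --with-production
>>>>   # the same backend built with AMD's LLVM-based clang
>>>>   ./build charm++ netlrts-linux-x86_64 clang smp --with-production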
>>>>
>>>> -Josh
>>>>
>>>> On Fri, Jul 3, 2020 at 3:59 AM 张驭洲 <zhangyuzhou15_at_mails.ucas.edu.cn>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>>
>>>>> I noticed that there is a HIP version of NAMD in the gerrit repository
>>>>> of NAMD. I tried it using the apoa1 and stmv benchmarks. The results on a
>>>>> single node with multiple GPUs seem right, but when using more than one
>>>>> node, the total energy keeps increasing, and sometimes the computation
>>>>> even crashes because the atoms move too fast. I used the
>>>>> ucx-linux-x86_64-ompipmix-smp build of charm-6.10.1. Could anyone give me
>>>>> some hints about this problem?
>>>>>
>>>>>
>>>>> Sincerely,
>>>>>
>>>>> Zhang
>>>>>
>>>>

This archive was generated by hypermail 2.1.6 : Thu Dec 31 2020 - 23:17:14 CST