Re: Re: The HIP version of NAMD gets wrong results when computing on more than one node

From: 张驭洲 (zhangyuzhou15_at_mails.ucas.edu.cn)
Date: Fri Aug 28 2020 - 23:36:12 CDT

Hi Josh,

I tried letting NAMD see only one GPU on each node and got the right result, so the peer-to-peer copy does indeed look like the cause of the problem. But I would like to understand the behavior of that copy a bit further.

On a single node the multi-GPU result is correct, so does that peer-to-peer copy happen in this scenario? If it does, then since the multi-GPU result on two nodes is wrong, we can say that peer-to-peer copies between GPUs on different nodes are not functioning properly on AMD hardware through HIP. But then why does the result become correct when using only one GPU per node? Does the peer-to-peer copy happen in that case? Since the result is right, the peer-to-peer copy presumably does not happen. So the final inference is that the peer-to-peer copy only happens among GPUs within one node; when there is more than one node, some other mechanism handles the communication between GPUs on different nodes, and if the peer-to-peer copy is used in that scenario, something goes wrong.
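For reference, a minimal HIP sketch (not NAMD code, just assuming a standard ROCm/HIP install) that probes which device pairs a single process can reach via peer to peer would look like this:

#include <hip/hip_runtime.h>
#include <cstdio>

// Probe peer-to-peer access between every pair of GPUs visible to this process.
// A process can only enumerate the GPUs on its own node, so peer access is only
// ever defined within a node; data for GPUs on other nodes has to be staged
// through the host and the network layer instead.
int main() {
    int n = 0;
    if (hipGetDeviceCount(&n) != hipSuccess) return 1;
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int can = 0;
            hipDeviceCanAccessPeer(&can, i, j);
            std::printf("GPU %d -> GPU %d : peer access %s\n", i, j, can ? "yes" : "no");
        }
    }
    return 0;
}

This is consistent with the inference above: the peer-to-peer copy can only be attempted among GPUs within one node, and the failure must come from how that path interacts with the multi-node communication.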

I'm not sure about these inferences. I hope you can help me get a clearer picture of the problem.

Thank you!

Zhang

-----Original Message-----
From: "Josh Vermaas" <joshua.vermaas_at_gmail.com>
Sent: 2020-08-29 07:21:18 (Saturday)
To: "张驭洲" <zhangyuzhou15_at_mails.ucas.edu.cn>, "NAMD list" <namd-l_at_ks.uiuc.edu>
Cc:
Subject: Re: Re: namd-l: The HIP version of NAMD gets wrong results when computing on more than one node

Hi Zhang,

Based on the ROCm 3.7 release notes, I got a new idea today for why this behavior was occurring. Long story short, direct GPU-to-GPU communication is not functioning properly on AMD hardware through HIP (https://rocmdocs.amd.com/en/latest/Current_Release_Notes/Current-Release-Notes.html#issue-with-peer-to-peer-transfers). The only place in the NAMD codebase that currently tries to exploit that feature is the PME calculation. In my own tests today, I was able to correct for this by only letting NAMD see one device on each node (+devices 0), and then the PME energies and forces were comparable to what is computed via the CPU code path. Many apologies for taking so long to get back to you on this, but I was distracted by the rather substantial performance regressions that have been happening with ROCm versions 3.5 and up.
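For anyone who wants to apply the same workaround, a launch line for an smp build might look roughly like the following (the charmrun options are only illustrative; adjust them for your charm++ backend and job launcher):

charmrun ++nodelist ./nodelist +p8 ++ppn 4 namd2 +devices 0 apoa1.namd

The important part is just the +devices 0 argument, which restricts each NAMD process to a single GPU.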

-Josh

On Wed, Jul 8, 2020 at 1:45 AM 张驭洲 <zhangyuzhou15_at_mails.ucas.edu.cn> wrote:

Hi Josh,

The compiler I used is GCC 7.3.1. I'm not sure whether your results are from a one-node or a multi-node run. I got curves similar to yours when running on one node, but the curves are totally different when running on two nodes. The distinction is not about processes, but physical nodes: on one physical node I also start 4 processes using 4 AMD GPUs, and the results are correct; but if I run on 2 nodes, even starting only 2 processes on each node, the results are wrong. What's more, using 2 nodes costs much more time than using one node, and that low performance is not caused by the small size of the apoa1 case: when I turn off GPU PME computation, I get correct results on 2 nodes and the performance is much higher than with GPU PME turned on. I also tested the clang compiler, and all the phenomena are the same as with GCC. I only recompiled NAMD with clang, not charm++; I failed to rebuild charm++ with clang, and I'm not sure whether that is because the UCX I used was built with GCC.

I attached my results. If you can open the Excel file, you can click each figure and drag the data area from one column to another to check each curve; note that there are 6 sheets in the file, each containing data from one test. If you cannot open it, please have a look at the three pictures, which show the total energy for one node, two nodes, and two nodes with GPU PME turned off, respectively. You can see that the two-node curve is completely wrong: the total energy keeps increasing with the time steps.

I compiled the source code on one of the nodes that I run the test cases, which has four AMD GPUs. I'm sure the amdgpu-target was properly set.

Sincerely,

Zhang

-----Original Message-----
From: "Josh Vermaas" <joshua.vermaas_at_gmail.com>
Sent: 2020-07-07 05:33:26 (Tuesday)
To: "张驭洲" <zhangyuzhou15_at_mails.ucas.edu.cn>
Cc: "NAMD list" <namd-l_at_ks.uiuc.edu>
Subject: Re: Re: namd-l: The HIP version of NAMD gets wrong results when computing on more than one node

Ok, I've attached what I got when I ran it myself with the lrts backend, compared against the multicore implementation (still with ROCm 3.3.0). Basically, I can reproduce that the electrostatic energy goes down, but the total energy stays constant, which is what we'd expect for apoa1 run without a thermostat. The energy is just sloshing around into different pots, which isn't what you describe, and I'm keen to figure out why that is. What compiler were you using? If it's not clang, does clang fix the problem? Since I've spent the day working on this: did you have an AMD GPU on the machine doing the compiling? I've noticed that if I don't compile on a compute node or set the amdgpu-target, environment variables are not set correctly, and seemingly baffling segfaults result.
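For what it's worth, the target flag I mean is the one passed to the HIP compiler, roughly like this (the gfx ID is only an example and should match your hardware):

hipcc --amdgpu-target=gfx906 ...

If that target is missing and there is no GPU on the build machine for the toolchain to detect, the resulting binary can misbehave in exactly the way described above.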

-Josh

On Sat, Jul 4, 2020 at 12:17 AM Josh Vermaas <joshua.vermaas_at_gmail.com> wrote:

Oh lovely. Looks like I broke PME communication. :D Thanks for letting me know!

-Josh

On Fri, Jul 3, 2020 at 10:04 PM 张驭洲 <zhangyuzhou15_at_mails.ucas.edu.cn> wrote:

Hi Josh,

There's something new about the problem. I tested the verbs-linux-x86_64-smp backend of charm-6.10.1. The phenomena are the same: on one node the results are right, and on 2 nodes they go wrong. However, if I disable GPU PME computation using these parameters in the apoa1.namd file:

bondedCUDA 255

usePMECUDA off

PMEoffload off

the results on 2 nodes become correct. Then I went back to test the ucx-linux-x86_64-ompipmix-smp backend and got the same phenomena: the 2-node results are wrong, but become correct if GPU PME computation is turned off. So the problem may be caused by the PME GPU code of NAMD.

I hope these tests help when you revise the code.

Sincerely,

Zhang

-----Original Message-----
From: "Josh Vermaas" <joshua.vermaas_at_gmail.com>
Sent: 2020-07-03 23:33:14 (Friday)
To: "张驭洲" <zhangyuzhou15_at_mails.ucas.edu.cn>
Cc: "NAMD list" <namd-l_at_ks.uiuc.edu>
Subject: Re: namd-l: The HIP version of NAMD gets wrong results when computing on more than one node

The test hardware I have access to is administered by others, and they push forward ROCM versions immediately. Even if I wanted to, I don't have a machine available with 3.3. :(

-Josh

On Fri, Jul 3, 2020 at 5:36 AM 张驭洲 <zhangyuzhou15_at_mails.ucas.edu.cn> wrote:

Hi Josh,

My tests were done with ROCm 3.3. I would like to know whether you are going straight to ROCm 3.5, or whether you will also test with the netlrts backend on ROCm 3.3.

Sincerely,

Zhang

-----Original Message-----
From: "Josh Vermaas" <joshua.vermaas_at_gmail.com>
Sent: 2020-07-03 18:54:33 (Friday)
To: "NAMD list" <namd-l_at_ks.uiuc.edu>, "张驭洲" <zhangyuzhou15_at_mails.ucas.edu.cn>
Cc:
Subject: Re: namd-l: The HIP version of NAMD gets wrong results when computing on more than one node

Hi Zhang,

The configurations I tested before getting distracted by COVID research were multicore builds and netlrts builds that split a single node (networking wasn't working properly on our test setups). This was also in the era of ROCm 3.3, and I see this morning that those old binaries don't work with 3.5, so I'm still working to reproduce your result. Two things I'd try in the interim:

1. Compile with clang. In my own testing, things work better when I use clang (which really aliases to AMD's LLVM compiler) than with gcc.
2. Try the netlrts backend just as a sanity check (see the example below). My own personal experience with ucx is that it is far from bulletproof, and it would help to isolate whether this is a HIP-specific issue or a ucx issue.
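For reference, building the netlrts backend of charm++ and pointing NAMD at it would look roughly like this (treat these lines as a sketch rather than a recipe; the exact NAMD config options depend on your gerrit checkout):

./build charm++ netlrts-linux-x86_64 smp --with-production
./config Linux-x86_64-g++ --charm-arch netlrts-linux-x86_64-smp

plus whatever HIP-related options your current build already uses.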

-Josh

On Fri, Jul 3, 2020 at 3:59 AM 张驭洲 <zhangyuzhou15_at_mails.ucas.edu.cn> wrote:

Hello,

I noticed that there is a HIP version of NAMD in the NAMD gerrit repository. I tried it with the apoa1 and stmv benchmarks. The single-node, multi-GPU results seem right, but when using more than one node the total energy keeps increasing, and sometimes the computation even crashes because atoms are moving too fast. I used the ucx-linux-x86_64-ompipmix-smp build of charm-6.10.1. Could anyone give me some hints about this problem?

Sincerely,

Zhang
