Re: Re: The HIP version of NAMD gets wrong results when computing on more than one node

From: 张驭洲 (zhangyuzhou15_at_mails.ucas.edu.cn)
Date: Fri Jul 03 2020 - 23:03:54 CDT

Hi Josh,

There's something new about the problem. I tested the verbs-linux-x86_64-smp build of charm-6.10.1. The behavior is the same: on one node the results are correct, but on two nodes they go wrong. However, if I disable the PME GPU computation with these parameters in the apoa1.namd file:

bondedCUDA 255
usePMECUDA off
PMEoffload off

the results on two nodes become correct. Then I went back to test the ucx-linux-x86_64-ompipmix-smp build and saw the same behavior: the results on two nodes are wrong, but become correct if the PME GPU computation is turned off. So the problem may be caused by the PME GPU code of NAMD.
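For reference, the energy drift is easy to watch by pulling the TOTAL column out of the NAMD log. A minimal sketch (the log file name and energy values below are made up for illustration; on NAMD's "ENERGY:" lines, field 2 is the timestep and field 12 is TOTAL):

```shell
# Fake two-node log excerpt, standing in for a real NAMD output file.
cat > sample_2node.log <<'EOF'
ENERGY:     100  1 2 3 4 5 6 7 8 9  -223461.1  297.0
ENERGY:     200  1 2 3 4 5 6 7 8 9  -221050.7  305.2
EOF

# Print timestep and TOTAL energy; a steadily rising TOTAL signals the bug.
awk '/^ENERGY:/ {print $2, $12}' sample_2node.log
```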

I hope these tests help when you revise the code.

Sincerely,

Zhang

-----Original Message-----
From: "Josh Vermaas" <joshua.vermaas_at_gmail.com>
Sent: 2020-07-03 23:33:14 (Friday)
To: "张驭洲" <zhangyuzhou15_at_mails.ucas.edu.cn>
Cc: "NAMD list" <namd-l_at_ks.uiuc.edu>
Subject: Re: namd-l: The HIP version of NAMD gets wrong results when computing on more than one node

The test hardware I have access to is administered by others, and they move to new ROCm versions immediately. Even if I wanted to, I don't have a machine with 3.3 available. :(

-Josh

On Fri, Jul 3, 2020 at 5:36 AM 张驭洲 <zhangyuzhou15_at_mails.ucas.edu.cn> wrote:

Hi Josh,

My tests were done with ROCm 3.3. Are you going straight to ROCm 3.5, or will you also test the netlrts backend on ROCm 3.3?

Sincerely,

Zhang

-----Original Message-----
From: "Josh Vermaas" <joshua.vermaas_at_gmail.com>
Sent: 2020-07-03 18:54:33 (Friday)
To: "NAMD list" <namd-l_at_ks.uiuc.edu>, "张驭洲" <zhangyuzhou15_at_mails.ucas.edu.cn>
Cc:
Subject: Re: namd-l: The HIP version of NAMD gets wrong results when computing on more than one node

Hi Zhang,

The configurations I tested before getting distracted by COVID research were multicore builds, and netlrts builds that split a single node (networking wasn't working properly on our test setups). That was also in the era of ROCm 3.3, and this morning I see that those old binaries don't work with 3.5, so I'm still working to reproduce your result. Two things I'd try in the interim:

1. Compile with clang. In my own testing, things work better with clang (which really aliases to AMD's LLVM compiler) than with gcc.
2. Try the netlrts backend as a sanity check. My own personal experience with ucx is that it is far from bulletproof, and this would help isolate whether it is a HIP-specific issue or a ucx issue.
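In case it helps anyone following along, the corresponding Charm++ build lines might look like the following (a sketch assuming charm-6.10.1's usual ./build syntax; adjust the target triplet and options to your machine):

```shell
# Run inside the charm-6.10.1 source tree (hypothetical; verify options
# against your charm version's README).
# Sanity-check build with the netlrts backend:
./build charm++ netlrts-linux-x86_64 smp --with-production

# Same backend, but using clang (AMD's LLVM) instead of gcc:
./build charm++ netlrts-linux-x86_64 smp clang --with-production
```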

-Josh

On Fri, Jul 3, 2020 at 3:59 AM 张驭洲 <zhangyuzhou15_at_mails.ucas.edu.cn> wrote:

Hello,

I noticed that there is a HIP version of NAMD in the NAMD gerrit repository. I tried it with the apoa1 and stmv benchmarks. The results on a single node with multiple GPUs look right, but on more than one node the total energy keeps increasing, and the run sometimes even crashes because atoms are moving too fast. I used the ucx-linux-x86_64-ompipmix-smp build of charm-6.10.1. Could anyone give me some hints about this problem?

Sincerely,

Zhang

This archive was generated by hypermail 2.1.6 : Thu Dec 31 2020 - 23:17:13 CST