Re: The HIP version of NAMD gets wrong results when computing on more than one node

From: 张驭洲 (zhangyuzhou15_at_mails.ucas.edu.cn)
Date: Fri Jul 03 2020 - 06:36:12 CDT

Hi Josh,

My tests were done with ROCm 3.3. Are you going straight to ROCm 3.5, or will you test with the netlrts backend on ROCm 3.3?

Sincerely,

Zhang
 
 

-----Original Message-----
From: "Josh Vermaas" <joshua.vermaas_at_gmail.com>
Sent: 2020-07-03 18:54:33 (Friday)
To: "NAMD list" <namd-l_at_ks.uiuc.edu>, "张驭洲" <zhangyuzhou15_at_mails.ucas.edu.cn>
Cc:
Subject: Re: namd-l: The HIP version of NAMD gets wrong results when computing on more than one node

Hi Zhang,

The configurations I tested before getting distracted by COVID research were multicore builds and netlrts builds that split a single node (networking wasn't working properly on our test setups). This was also in the era of ROCm 3.3, and I see this morning that those old binaries don't work with 3.5, so I'm still working to reproduce your result. Two things I'd try in the interim:

1. Compile with clang. In my own testing, things work better when I use clang (which really aliases to AMD's LLVM compiler) over gcc.
2. Try the netlrts backend just as a sanity check. In my experience ucx is far from bulletproof, and this would help isolate whether it is a HIP-specific issue or a ucx issue.

-Josh

On Fri, Jul 3, 2020 at 3:59 AM 张驭洲 <zhangyuzhou15_at_mails.ucas.edu.cn> wrote:

Hello,

I noticed that there is a HIP version of NAMD in the NAMD gerrit repository. I tried it with the apoa1 and stmv benchmarks. The single-node, multi-GPU results seem right, but when using more than one node the total energy keeps increasing, and sometimes the computation even crashes because atoms are moving too fast. I used the ucx-linux-x86_64-ompipmix-smp build of charm-6.10.1. Could anyone give me some hints about this problem?

Sincerely,

Zhang
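
For anyone comparing runs across backends, the "total energy keeps increasing" symptom can be quantified from the logs. The following is only a minimal sketch, assuming the log contains the usual ETITLE: header line followed by ENERGY: lines with a TOTAL column; the function names total_energies and drift_per_step are hypothetical helpers, not part of NAMD.

```python
def total_energies(log_lines):
    """Collect (timestep, TOTAL) pairs from ETITLE:/ENERGY: lines.

    Assumption: the ETITLE: line names the columns of the ENERGY: lines,
    so the TOTAL column is looked up by name rather than hard-coded.
    """
    total_col = None
    series = []
    for line in log_lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "ETITLE:":
            # Exact match, so TOTAL3 is not picked up by mistake.
            total_col = fields.index("TOTAL")
        elif fields[0] == "ENERGY:" and total_col is not None:
            series.append((int(fields[1]), float(fields[total_col])))
    return series


def drift_per_step(series):
    """Least-squares slope of TOTAL against timestep (energy units per step)."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two ENERGY: lines")
    mean_t = sum(t for t, _ in series) / n
    mean_e = sum(e for _, e in series) / n
    num = sum((t - mean_t) * (e - mean_e) for t, e in series)
    den = sum((t - mean_t) ** 2 for t, _ in series)
    if den == 0:
        raise ValueError("all timesteps are identical")
    return num / den
```

Reading a single-node log and a multi-node log with total_energies(open(path)) and comparing the two slopes from drift_per_step gives one number per run; a clearly larger positive slope in the multi-node run would match the behaviour described above. The same comparison between a ucx build and a netlrts build would help separate a HIP issue from a ucx issue.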
