Re: Re: CPU vs GPU Question

From: vermaasj (vermaasj_at_msu.edu)
Date: Wed Dec 09 2020 - 18:23:32 CST

What do the Slurm arguments look like for your cluster, Kelly? I’ve found *significant* differences in NAMD 2.14 performance depending on which resources I’ve exposed to NAMD. In my estimation, it isn’t even a question: get thyself more GPU nodes, since their price/performance ratio is better than that of CPUs, especially if you use NAMD3. There is also a patch in Gerrit that lets NAMD 3 use multiple GPUs in one simulation; it might get you the performance you need on your current hardware.

Running multi-GPU setups can be a bit tricky, and I just happen to have spent most of the day testing different ones on the hardware I have available. Given a CUDA-SMP build of NAMD, the arguments below will provide *ok* performance.

#!/bin/bash
# 4 GPUs on one node, one task per GPU, 12 cores per task
#SBATCH --gres=gpu:4
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12
# +ppn 11 leaves one core per task free for the communication thread
srun namd2 +ppn 11 +ignoresharing run.namd

However, NAMD 2.14 is a lot faster (in my hands, almost 4x faster) when you help it out and assign a single GPU to each MPI-like task.

#!/bin/bash
#SBATCH --gres=gpu:4
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=12
# map_gpu pins each task on a node to its own GPU (task 0 -> GPU 0, etc.)
#SBATCH --gpu-bind=map_gpu:0,1,2,3
srun namd2 +ppn 11 +ignoresharing run.namd

So it might just be that the better investment is a bunch of GPU nodes run together over a fast interconnect. My anecdata: 8 V100s will buy you 3 ns/day for a 2.4M-atom system with 2 fs timesteps, and 6 V100 GPUs will get you 12 ns/day for STMV (about a million atoms), also with 2 fs timesteps. My suspicion is that you are running a multicore NAMD build on a single node, and I’m slowly starting to realize that this is nowhere near optimal for running NAMD on that hardware. For contrast, a multicore-style launch looks something like the sketch below.
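
This is just a sketch of the single-process multicore launch I mean (+p, +devices, and +setcpuaffinity are standard NAMD/Charm++ command-line options, but the core and GPU counts here are placeholders, not taken from your setup):

#!/bin/bash
# One process drives all 24 cores and shares all 4 GPUs; no per-task GPU binding.
namd2 +p24 +setcpuaffinity +devices 0,1,2,3 run.namd

A single process sharing every GPU among all worker threads is exactly the pattern that the one-task-per-GPU scripts above are designed to avoid.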

-Josh

From: owner-namd-l_at_ks.uiuc.edu
Date: Wednesday, December 9, 2020 at 3:43 PM
To: Gumbart, JC <gumbart_at_physics.gatech.edu>
Cc: namd-l_at_ks.uiuc.edu; Bennion, Brian <bennion1_at_llnl.gov>
Subject: Re: namd-l: Re: CPU vs GPU Question
I see. Yes, I am actually at UC San Diego and we have access to a good XSEDE allocation, or at least my boss is willing to pay for a couple of nodes that have 128 CPU cores per node. Before making the purchase, though, he wanted to ask whether that number of CPUs will really speed up our large simulation compared to the 4xGPU node we are currently using. We do need a continuous 2-µs trajectory for a cryo-EM project we are working on.

Dr. Kelly L. McGuire
PhD Biophysics
Department of Physiology and Developmental Biology
Brigham Young University
LSB 3050
Provo, UT 84602

________________________________
From: Gumbart, JC <gumbart_at_physics.gatech.edu>
Sent: Wednesday, December 9, 2020 3:31 PM
To: McGuire, Kelly <mcg05004_at_byui.edu>
Cc: namd-l_at_ks.uiuc.edu; Bennion, Brian <bennion1_at_llnl.gov>
Subject: Re: namd-l: Re: CPU vs GPU Question

It sounds like you’re asking for the impossible, at least without a good XSEDE or INCITE allocation (and even then, probably impossible given queue times). You need to revise the questions you’re asking to match the resources you have available.

I don’t know your project, and in any case, your advisor is better suited to make the call on whether aggregate time is good enough for your purposes or you need a 2-µs continuous trajectory.

Best,
JC

On Dec 9, 2020, at 5:27 PM, McGuire, Kelly <mcg05004_at_byui.edu> wrote:

Would four copies get me to 2 microseconds of total simulation time faster, or are four copies only for statistical purposes? I need a total of 2 microseconds for my simulation. Right now that will take me about 7.2 months with these four GPUs (2,000 ns at ~9 ns/day is roughly 220 days), and I need to have it done in about 1 month.

Dr. Kelly L. McGuire
PhD Biophysics
Department of Physiology and Developmental Biology
Brigham Young University
LSB 3050
Provo, UT 84602

________________________________
From: Gumbart, JC <gumbart_at_physics.gatech.edu>
Sent: Wednesday, December 9, 2020 3:19 PM
To: McGuire, Kelly <mcg05004_at_byui.edu>
Cc: namd-l_at_ks.uiuc.edu; Bennion, Brian <bennion1_at_llnl.gov>
Subject: Re: namd-l: Re: CPU vs GPU Question

I don’t have those cards available myself to compare against, but I wouldn’t be surprised if that was the best it could do. If you can switch to NAMD3 though (depending on your specific needs), you could run four copies of your system and probably get comparable performance for each.

Best,
JC

On Dec 9, 2020, at 5:15 PM, McGuire, Kelly <mcg05004_at_byui.edu> wrote:

Hi JC, I've been using NAMD2 with 4 GPUs. Would you expect that 4x2080ti's and 24 processors on one node would only be able to do about 9 ns/day for the 1.4 million atom system?

Dr. Kelly L. McGuire
PhD Biophysics
Department of Physiology and Developmental Biology
Brigham Young University
LSB 3050
Provo, UT 84602

________________________________
From: Gumbart, JC <gumbart_at_physics.gatech.edu>
Sent: Wednesday, December 9, 2020 1:56 PM
To: namd-l_at_ks.uiuc.edu; Bennion, Brian <bennion1_at_llnl.gov>
Cc: McGuire, Kelly <mcg05004_at_byui.edu>
Subject: Re: namd-l: Re: CPU vs GPU Question

For reference, for a 1.45-million atom system with NAMD3, I get 9 ns/day on a V100 with 4-fs time steps and HMR. I recall P100s being ~1/2 the speed, so you’re not too far off my expectation.

I haven’t tried over multiple GPUs or nodes, but I find it’s usually easier to just run multiple copies, one GPU each. Shorter runs but better statistics.
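
As a minimal sketch of what I mean (assuming a multicore NAMD3 build; the directory layout, thread count, and GPU IDs are hypothetical):

#!/bin/bash
# Launch four independent copies, each bound to its own GPU, then wait for all.
for gpu in 0 1 2 3; do
    namd3 +p4 +devices ${gpu} copy${gpu}/run.namd > copy${gpu}/run.log &
done
wait

Each copy runs entirely on one GPU, so there is no inter-GPU communication to slow things down.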

Best,
JC

On Dec 9, 2020, at 11:21 AM, Bennion, Brian <Bennion1_at_llnl.gov> wrote:

Hello Kelly,

I am not well versed in the workload distribution in namd3 (if anyone out there is willing to correct me, please do), but I will say that you would need 10 more nodes of the exact setup you are currently using to see the same throughput.

For AMBER, cross-node GPU communication is not recommended, at least for Amber18.

Brian
________________________________
From: McGuire, Kelly <mcg05004_at_byui.edu>
Sent: Tuesday, December 8, 2020 11:14 PM
To: Bennion, Brian <bennion1_at_llnl.gov>; namd-l_at_ks.uiuc.edu
Subject: Re: CPU vs GPU Question

Brian, a follow-up question: did you mean at least 10 more nodes of CPUs only, or can I use multiple nodes of CPUs and GPUs? Is it true that GPUs don't work well across nodes?

Dr. Kelly L. McGuire
PhD Biophysics
Department of Physiology and Developmental Biology
Brigham Young University
LSB 3050
Provo, UT 84602

________________________________
From: Bennion, Brian <bennion1_at_llnl.gov>
Sent: Wednesday, December 2, 2020 4:05 PM
To: namd-l_at_ks.uiuc.edu; McGuire, Kelly <mcg05004_at_byui.edu>
Subject: Re: CPU vs GPU Question

Hello,
You will need at least 10 more nodes to approach the throughput you are accustomed to seeing. That is where MPI and InfiniBand will play the key role in the calculations.
Your sysadmin will be able to tell you if/what MPI exists on the cluster.
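(You can often get a first hint yourself: on most clusters, commands along the lines of "module avail mpi" or "which mpirun" will show what is installed, though module names vary from site to site.)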

Brian

________________________________
From: owner-namd-l_at_ks.uiuc.edu on behalf of McGuire, Kelly <mcg05004_at_byui.edu>
Sent: Wednesday, December 2, 2020 2:51 PM
To: namd-l_at_ks.uiuc.edu
Subject: namd-l: CPU vs GPU Question

In all of my simulations so far, I have used one node with 4xP100 GPUs and 24 CPU cores. I usually get ~40 ns/day with a system between 75,000 and 150,000 atoms. I am now trying to do a simulation that is 1.4 million atoms. Currently getting ~4 ns/day.

What is the better approach to speeding up this simulation as the atom count scales: more GPUs on one node, or more CPUs spread across multiple nodes? Where does MPI come into play here?

Dr. Kelly L. McGuire
PhD Biophysics
Department of Physiology and Developmental Biology
Brigham Young University
LSB 3050
Provo, UT 84602
