Re: NAMD QM/MM multi nodes performance bad

From: Francesco Pietra (chiendarret_at_gmail.com)
Date: Tue Dec 15 2020 - 03:37:34 CST

Multinode is problematic. An alternative (the one I adopted) is a single node of
36 cores: one core for namd and the remaining ones for orca, with very finely tuned
flags (see my posts from about a year ago) and restarting the QM/MM run (see my
recent post) in order to get adequate trajectories. Good luck, francesco
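
For reference, a minimal sketch of such a single-node split (the qmConfigLine keywords
follow the NAMD QM/MM interface, with the doubled %% used in the tutorial files for ORCA
block keywords; paths, core counts and the method line are illustrative, not a tested recipe):

  # run NAMD itself on a single core of the 36-core node
  namd2 +p1 system_qmmm.conf > system_qmmm.log

  # QM/MM part of system_qmmm.conf: hand the remaining 35 cores to ORCA
  qmForces        on
  qmParamPDB      "system_qm.pdb"
  qmColumn        "beta"
  qmSoftware      "orca"
  qmExecPath      "/path/to/orca"
  qmBaseDir       "/dev/shm/qmmm"
  qmConfigLine    "! B3LYP def2-SVP EnGrad TightSCF"
  qmConfigLine    "%%PAL NPROCS 35 END"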

On Mon, Dec 14, 2020 at 8:24 PM James Kress <jimkress_58_at_kressworks.org>
wrote:

> Chunli,
>
>
>
> Why are you using 20 cores with namd but 60 with ORCA? Could this
> asymmetric utilization of cores be causing issues with core allocation and
> core swapping?
>
>
>
> Also, what communication method is being used between the nodes (btw you
> claim 4 nodes but only list 3 in your node file)? Ethernet is infamous for
> latency issues. Hopefully InfiniBand is being used.
>
>
>
> Are there any file system I/O issues?
>
>
>
> Also, ORCA scales quite well across my systems.
>
>
>
> Jim
>
>
>
> James Kress Ph.D., President
>
> The KressWorks® Institute
>
> An IRS Approved 501 (c)(3) Charitable, Nonprofit Corporation
>
> “Engineering The Cure” ©
>
> (248) 573-5499
>
>
>
> Learn More and Donate At:
>
> Website: http://www.kressworks.org
>
>
>
>
>
>
> From: owner-namd-l_at_ks.uiuc.edu <owner-namd-l_at_ks.uiuc.edu> On Behalf Of Josh Vermaas
> Sent: Monday, December 14, 2020 10:43 AM
> To: namd-l_at_ks.uiuc.edu; Chunli Yan <utchunliyan_at_gmail.com>; Josh Vermaas <joshua.vermaas_at_gmail.com>
> Subject: Re: namd-l: NAMD QM/MM multi nodes performance bad
>
>
>
> Hi Chunli,
>
> You clearly get a performance win by running ORCA standalone here. How do
> the slurm arguments compare? It could be that multiple nodes don't help
> you, since QM codes generally scale pretty poorly across multiple nodes.
> What you are showing here is that so long as NAMD can get the MM done in
> under a second, the NAMD part of the problem really doesn't matter.
>
> -Josh
>
> On 12/13/20 9:21 PM, Chunli Yan wrote:
>
> I took the input generated by NAMD and ran it with orca alone using 60
> cores; below is the timing:
>
>
>
> Timings for individual modules:
>
>
>
> Sum of individual times         ...   32.070 sec (=  0.535 min)
>
> GTO integral calculation        ...    4.805 sec (=  0.080 min)  15.0 %
> SCF iterations                  ...   19.318 sec (=  0.322 min)  60.2 %
> SCF Gradient evaluation         ...    7.947 sec (=  0.132 min)  24.8 %
>
> ****ORCA TERMINATED NORMALLY****
>
> TOTAL RUN TIME: 0 days 0 hours 0 minutes 39 seconds 917 msec
>
>
>
> With NAMD and ORCA combined (60 cores for orca):
>
>
>
> Timings for individual modules:
>
>
>
> Sum of individual times         ...   77.582 sec (=  1.293 min)
>
> GTO integral calculation        ...    5.404 sec (=  0.090 min)   7.0 %
> SCF iterations                  ...   67.242 sec (=  1.121 min)  86.7 %
> SCF Gradient evaluation         ...    4.937 sec (=  0.082 min)   6.4 %
>
> ****ORCA TERMINATED NORMALLY****
>
>
>
>
>
> Best,
>
>
>
> Chunli
>
>
>
>
>
>
>
>
>
> On Sun, Dec 13, 2020 at 10:53 PM Josh Vermaas <joshua.vermaas_at_gmail.com>
> wrote:
>
> Just a quick question: how fast is the QM part of the calculation? I don't
> know what your expectation is, but each timestep is taking over a minute.
> The vast majority of that is likely the QM, as I'm sure you will find that
> an MM-only system with a handful of cores will calculate a timestep in under
> a second. My advice is to figure out the QM half of the calculation and get
> it running optimally. Even then, your performance is going to be awful
> compared with pure MM calculations, since you are trying to evaluate a much
> harder energy function.
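>
> A rough way to see where the time goes is to compare the QM wall time per step
> with NAMD's step time, e.g. (file names are illustrative; the ORCA output sits
> under the qmBaseDir set in the NAMD config):
>
>   # QM wall time per step, from the ORCA output file
>   grep "TOTAL RUN TIME" /path/to/qmBaseDir/0/qmmm_0.input.TmpOut
>   # overall step time reported by NAMD
>   grep "s/step" output.smd1.log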
>
>
>
> Josh
>
> On Sun, Dec 13, 2020, 7:49 PM Chunli Yan <utchunliyan_at_gmail.com> wrote:
>
> Hello,
>
> NAMD QM/MM parallel runs across multiple nodes:
>
> I wrote a nodelist file into the directory where orca runs. Below is the
> job submission script:
>
>
>
> #!/bin/bash
> #SBATCH -A bip174
> #SBATCH -J test
> #SBATCH -N 4
> ##SBATCH --tasks-per-node=32
> ##SBATCH --cpus-per-task=1
> ##SBATCH --mem=0
> #SBATCH -t 48:00:00
>
> #module load openmpi/3.1.4
>
> export PATH="/ccs/home/chunli/namd-andes/openmpi-3.1.4/bin/:$PATH"
> export LD_LIBRARY_PATH="/ccs/home/chunli/namd-andes/openmpi-3.1.4/lib/:$LD_LIBRARY_PATH"
>
> # DIRECTORY TO RUN - $SLURM_SUBMIT_DIR is directory job was submitted from
> cd $SLURM_SUBMIT_DIR
>
> # Generate ORCA nodelist
> for n in `echo $SLURM_NODELIST | scontrol show hostnames`; do
>   echo "$n slots=20 max-slots=32" >> /gpfs/alpine/scratch/chunli/bip174/eABF/smd.qm.dft5/0/qmmm_0.nodes
> done
> sed -i '1d' /gpfs/alpine/scratch/chunli/bip174/eABF/smd.qm.dft5/0/qmmm_0.nodes
>
> cd /gpfs/alpine/scratch/chunli/bip174/eABF/run.smd.dft5
> /ccs/home/chunli/NAMD_2.14_Source/Linux-x86_64-g++/namd2 +p30 +isomalloc_sync decarboxylase.1.conf > output.smd1.log
>
>
>
> I also exclude the first node, where NAMD launches, to avoid competition
> between NAMD and ORCA (that is what the sed '1d' does).
>
> The nodelist is below:
>
>
>
> andes4 slots=20 max-slots=32
> andes6 slots=20 max-slots=32
> andes7 slots=20 max-slots=32
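>
> As a quick sanity check, a hostfile like this can be exercised with a trivial
> command through the same mpirun flags before the QM/MM run (illustrative):
>
>   mpirun -np 60 --hostfile /gpfs/alpine/scratch/chunli/bip174/eABF/smd.qm.dft5/0/qmmm_0.nodes \
>          --bind-to core -nooversubscribe hostname | sort | uniq -c
>
> which should report 20 processes on each of andes4, andes6 and andes7.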
>
>
>
> In order to use the host file for mpirun, I edited the runORCA.py:
>
> cmdline += orcaInFileName + " \"--hostfile /gpfs/alpine/scratch/chunli/bip174/eABF/smd.qm.dft5/0/qmmm_0.nodes --bind-to core -nooversubscribe \" " + " > " + orcaOutFileName
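>
> With that edit, the command the wrapper hands to the shell should end up
> looking roughly like the following (file names are illustrative; ORCA forwards
> the quoted string to its internal mpirun calls):
>
>   /path/to/orca qmmm_0.input "--hostfile /gpfs/alpine/scratch/chunli/bip174/eABF/smd.qm.dft5/0/qmmm_0.nodes --bind-to core -nooversubscribe " > qmmm_0.output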
>
>
>
> QM methods: B3LYP def2-SVP Grid4 EnGrad SlowConv TightSCF RIJCOSX D3BJ def2/J
>
>
>
> I request 4 nodes in total: 60 cores for ORCA and 20 for NAMD. But the
> performance is really bad for 48968 total atoms and 32 QM atoms. Below is
> the performance:
>
>
>
> Info: Initial time: 30 CPUs 75.0565 s/step 1737.42 days/ns 2285.66 MB memory
> Info: Initial time: 30 CPUs 81.1294 s/step 1877.99 days/ns 2286 MB memory
> Info: Initial time: 30 CPUs 87.776 s/step 2031.85 days/ns 2286 MB memory
>
>
>
> Can someone help me find out whether I did something wrong, or whether
> NAMD QM/MM can scale well across nodes? I checked the orca MPI processes on
> each node and found the CPU usage was only 50-70%.
>
>
>
> NAMD was compiled with smp and icc:
>
> ./build charm++ verbs-linux-x86_64 icc smp -with-production
>
> ./config Linux-x86_64-g++ --charm-arch verbs-linux-x86_64-icc-smp
>
>
>
> Thanks.
>
>
>
> Best,
>
> Chunli Yan
>
>
>
>
>
>

This archive was generated by hypermail 2.1.6 : Thu Dec 31 2020 - 23:17:15 CST