Re: NAMD QM/MM multi nodes performance bad

From: Josh Vermaas (vermaasj_at_msu.edu)
Date: Mon Dec 14 2020 - 09:43:03 CST

Hi Chunli,

You clearly get a performance win by running ORCA standalone here. How
do the Slurm arguments compare? It could be that running across multiple
nodes doesn't help you, since QM codes generally scale pretty poorly
across nodes. What you are showing here is that so long as NAMD can get
the MM done in under a second, the NAMD part of the problem really
doesn't matter.
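
To put numbers on that comparison, here is a rough sketch that computes the
per-module slowdown from the two ORCA timing reports quoted below. The
dictionary and function names are only for illustration; the times are copied
from your output.

```python
# Per-module ORCA wall times in seconds, copied from the two timing
# reports quoted in this thread.
standalone = {
    "GTO integral calculation": 4.805,
    "SCF iterations": 19.318,
    "SCF Gradient evaluation": 7.947,
}
under_namd = {
    "GTO integral calculation": 5.404,
    "SCF iterations": 67.242,
    "SCF Gradient evaluation": 4.937,
}


def slowdown(reference, candidate):
    """Return the candidate/reference time ratio per module and for the sum."""
    ratios = {name: candidate[name] / reference[name] for name in reference}
    ratios["Sum of individual times"] = (
        sum(candidate.values()) / sum(reference.values())
    )
    return ratios


ratios = slowdown(standalone, under_namd)
for name, ratio in ratios.items():
    print(f"{name:26s} {ratio:5.2f}x")
```

Nearly all of the extra time is in the SCF iterations, which take about 3.5
times longer when ORCA is launched through NAMD, while the integral and
gradient steps are roughly unchanged.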

-Josh

On 12/13/20 9:21 PM, Chunli Yan wrote:
>
> I took the input generated by NAMD and ran it with ORCA alone on 60
> cores. The timings are below:
>
>
> Timings for individual modules:
>
> Sum of individual times         ...       32.070 sec (=   0.535 min)
> GTO integral calculation        ...        4.805 sec (=   0.080 min)  15.0 %
> SCF iterations                  ...       19.318 sec (=   0.322 min)  60.2 %
> SCF Gradient evaluation         ...        7.947 sec (=   0.132 min)  24.8 %
>
>                              ****ORCA TERMINATED NORMALLY****
>
> TOTAL RUN TIME: 0 days 0 hours 0 minutes 39 seconds 917 msec
>
>
> With NAMD and ORCA combined (60 cores for ORCA):
>
> Timings for individual modules:
>
> Sum of individual times         ...       77.582 sec (=   1.293 min)
> GTO integral calculation        ...        5.404 sec (=   0.090 min)   7.0 %
> SCF iterations                  ...       67.242 sec (=   1.121 min)  86.7 %
> SCF Gradient evaluation         ...        4.937 sec (=   0.082 min)   6.4 %
>
>                              ****ORCA TERMINATED NORMALLY****
>
> Best,
>
> Chunli
>
> On Sun, Dec 13, 2020 at 10:53 PM Josh Vermaas
> <joshua.vermaas_at_gmail.com> wrote:
>
> Just a quick question: how fast is the QM part of the calculation?
> I don't know what your expectation is, but each timestep is taking
> over a minute. The vast majority of that is likely the QM, as I'm
> sure you will find that an MM-only system with a handful of cores
> will calculate a timestep in under a second. My advice is to
> figure out the QM half of the calculation, and get it running
> optimally. Even then, your performance is going to be awful
> compared with pure MM calculations, since you are trying to
> evaluate a much harder energy function.
>
> Josh
>
> On Sun, Dec 13, 2020, 7:49 PM Chunli Yan <utchunliyan_at_gmail.com>
> wrote:
>
> Hello,
> I am running NAMD QM/MM in parallel across multiple nodes.
> I wrote a nodelist file into the directory where ORCA runs.
> Below is the job submission script:
>
> #!/bin/bash
> #SBATCH -A bip174
> #SBATCH -J test
> #SBATCH -N 4
> ##SBATCH --tasks-per-node=32
> ##SBATCH --cpus-per-task=1
> ##SBATCH --mem=0
> #SBATCH -t 48:00:00
>
> #module load openmpi/3.1.4
>
> export PATH="/ccs/home/chunli/namd-andes/openmpi-3.1.4/bin/:$PATH"
> export LD_LIBRARY_PATH="/ccs/home/chunli/namd-andes/openmpi-3.1.4/lib/:$LD_LIBRARY_PATH"
>
> # DIRECTORY TO RUN - $SLURM_SUBMIT_DIR is directory job was submitted from
> cd $SLURM_SUBMIT_DIR
>
> # Generate ORCA nodelist
> for n in `echo $SLURM_NODELIST | scontrol show hostnames`; do
>     echo "$n slots=20 max-slots=32" >> /gpfs/alpine/scratch/chunli/bip174/eABF/smd.qm.dft5/0/qmmm_0.nodes
> done
> sed -i '1d' /gpfs/alpine/scratch/chunli/bip174/eABF/smd.qm.dft5/0/qmmm_0.nodes
>
> cd /gpfs/alpine/scratch/chunli/bip174/eABF/run.smd.dft5
> /ccs/home/chunli/NAMD_2.14_Source/Linux-x86_64-g++/namd2 +p30 +isomalloc_sync decarboxylase.1.conf > output.smd1.log
>
>
> I also exclude the first node where NAMD launches to avoid
> competition between NAMD and ORCA.
> The nodelist is below:
>
> andes4 slots=20 max-slots=32
> andes6 slots=20 max-slots=32
> andes7 slots=20 max-slots=32
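
For reference, the same nodelist logic can be sketched in Python. This is only
an illustration of the steps in the script above, not part of NAMD or ORCA:
the function name is made up, and the first hostname ("andes3") is an
assumption, since the thread does not say which node NAMD launched on.

```python
def orca_nodelist(hostnames, slots=20, max_slots=32, skip_first=True):
    """Build Open MPI hostfile lines, optionally skipping the NAMD node."""
    hosts = hostnames[1:] if skip_first else hostnames
    return [f"{host} slots={slots} max-slots={max_slots}" for host in hosts]


def write_nodelist(path, lines):
    """Overwrite (not append to) the hostfile so stale entries cannot pile up."""
    with open(path, "w") as handle:
        handle.write("\n".join(lines) + "\n")


# Hypothetical 4-node allocation; "andes3" stands in for the unnamed first node.
hosts = ["andes3", "andes4", "andes6", "andes7"]
lines = orca_nodelist(hosts)
total_slots = 20 * len(lines)
print("\n".join(lines))
```

Three remaining nodes at 20 slots each give the 60 cores intended for ORCA.
Note that the shell version appends with `>>`, so if the nodes file is left
over from an earlier submission, old entries would stay in it; overwriting
the file avoids that.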
>
>
> In order to use the host file for mpirun, I edited the runORCA.py:
>
> cmdline += orcaInFileName + " \"--hostfile /gpfs/alpine/scratch/chunli/bip174/eABF/smd.qm.dft5/0/qmmm_0.nodes --bind-to core -nooversubscribe \" " + " > " + orcaOutFileName
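
A slightly more defensive way to assemble that fragment is sketched below. It
follows the same pattern as the edited line (the mpirun options are passed to
ORCA as one quoted argument after the input file), but the function name and
the file names in the example are hypothetical, and this is not the actual
code in runORCA.py.

```python
import shlex


def build_orca_args(orca_in, orca_out, hostfile):
    """Compose the ORCA argument string with mpirun options as one argument."""
    mpirun_args = f"--hostfile {hostfile} --bind-to core -nooversubscribe"
    # shlex.quote wraps the option string in quotes so the shell hands it
    # to ORCA as a single argument, which ORCA then forwards to mpirun.
    return (
        f"{shlex.quote(orca_in)} {shlex.quote(mpirun_args)}"
        f" > {shlex.quote(orca_out)}"
    )


# Illustrative file names only; the real ones come from NAMD at run time.
args = build_orca_args("qmmm_0.input", "qmmm_0.out", "/tmp/qmmm_0.nodes")
print(args)
```

Using `shlex.quote` instead of hand-escaped `\"` keeps the quoting correct if
the hostfile path ever contains spaces or other shell-special characters.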
>
>
> QM methods: B3LYP def2-SVP Grid4 EnGrad SlowConv TightSCF
> RIJCOSX D3BJ def2/J
>
> I requested 4 nodes in total, with 60 cores for ORCA and 20 for
> NAMD, but the performance is really bad for a system of 48968
> total atoms with 32 QM atoms. The performance is shown below:
>
> Info: Initial time: 30 CPUs 75.0565 s/step 1737.42 days/ns 2285.66 MB memory
> Info: Initial time: 30 CPUs 81.1294 s/step 1877.99 days/ns 2286 MB memory
> Info: Initial time: 30 CPUs 87.776 s/step 2031.85 days/ns 2286 MB memory
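
As a sanity check on those figures, the days/ns column is just the s/step
column rescaled by the number of steps per nanosecond. The sketch below
assumes a 0.5 fs timestep; that value is not stated anywhere in the thread,
it is inferred because it reproduces the reported numbers.

```python
SECONDS_PER_DAY = 86400.0
FS_PER_NS = 1.0e6


def days_per_ns(seconds_per_step, timestep_fs):
    """Convert wall-clock seconds per MD step into days per simulated ns."""
    steps_per_ns = FS_PER_NS / timestep_fs
    return seconds_per_step * steps_per_ns / SECONDS_PER_DAY


# Timestep of 0.5 fs is an inference from the log lines, not a stated setting.
TIMESTEP_FS = 0.5
for seconds_per_step in (75.0565, 81.1294, 87.776):
    rate = days_per_ns(seconds_per_step, TIMESTEP_FS)
    print(f"{seconds_per_step:8.4f} s/step -> {rate:8.2f} days/ns")
```

Since the conversion is linear, any speedup in the per-step QM time carries
over directly to the days/ns figure.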
>
>
> Can someone help me find out whether I did something wrong, or
> whether NAMD QM/MM can scale well across nodes? I checked the
> ORCA MPI jobs on each node and found that CPU usage was only
> 50-70%.
>
> NAMD was compiled with SMP and icc:
> ./build charm++ verbs-linux-x86_64 icc smp -with-production
> ./config Linux-x86_64-g++ --charm-arch verbs-linux-x86_64-icc-smp
>
> Thanks.
>
> Best,
>
> *Chunli Yan*
>
>
