From: James Kress (jimkress_58_at_kressworks.org)
Date: Mon Dec 14 2020 - 13:21:33 CST
Chunli,
Why are you using 20 cores with NAMD but 60 with ORCA? Could this asymmetric utilization of cores be causing issues with core allocation and core swapping?
Also, what communication method is being used between the nodes (by the way, you request 4 nodes but only list 3 in your node file)? Ethernet is infamous for latency issues; hopefully InfiniBand is being used.
Are there any file system I/O issues?
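A few quick checks along those lines (just a sketch; run them inside the allocation, and the node-file path is a placeholder for wherever you write qmmm_0.nodes):
# Compare the node count Slurm actually allocated with what ends up in the ORCA node file
scontrol show hostnames "$SLURM_JOB_NODELIST" | wc -l
wc -l /path/to/qmmm_0.nodes
# See whether an InfiniBand port is active (requires the usual IB diagnostic tools)
ibstat | grep -i state
# Crude write test on the file system ORCA uses for scratch; oflag=direct may not be supported everywhere
time dd if=/dev/zero of=./iotest.tmp bs=1M count=256 oflag=direct; rm -f ./iotest.tmp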
Also, ORCA scales quite well across my systems.
Jim
James Kress Ph.D., President
The KressWorks® Institute
An IRS Approved 501 (c)(3) Charitable, Nonprofit Corporation
“Engineering The Cure” ©
(248) 573-5499
Learn More and Donate At:
Website: http://www.kressworks.org
Confidentiality Notice | This e-mail message, including any attachments, is for the sole use of the intended recipient(s) and may contain confidential or proprietary information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, immediately contact the sender by reply e-mail and destroy all copies of the original message.
From: owner-namd-l_at_ks.uiuc.edu <owner-namd-l_at_ks.uiuc.edu> On Behalf Of Josh Vermaas
Sent: Monday, December 14, 2020 10:43 AM
To: namd-l_at_ks.uiuc.edu; Chunli Yan <utchunliyan_at_gmail.com>; Josh Vermaas <joshua.vermaas_at_gmail.com>
Subject: Re: namd-l: NAMD QM/MM multi nodes performance bad
Hi Chunli,
You clearly get a performance win by running ORCA standalone here. How do the slurm arguments compare? It could be that multiple nodes don't help you, since QM codes generally scale pretty poorly across multiple nodes. What you are showing here is that, so long as NAMD can get the MM done in under a second, the NAMD part of the problem really doesn't matter.
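One way to make that comparison apples-to-apples is to launch the standalone ORCA job with the same hostfile and binding flags that the NAMD-driven run passes through, so the only remaining differences are NAMD's MM work and the launch pattern. A sketch (the ORCA path and the input/output names are placeholders; the quoted mpirun arguments are the ones from your runORCA.py edit):
# Standalone ORCA using the same hostfile and binding as the coupled run
/path/to/orca qmmm_0.input "--hostfile /path/to/qmmm_0.nodes --bind-to core -nooversubscribe" > standalone.out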
-Josh
On 12/13/20 9:21 PM, Chunli Yan wrote:
I took the input generated by NAMD and ran it with ORCA alone using 60 cores; below is the timing:
Timings for individual modules:
Sum of individual times ... 32.070 sec (= 0.535 min)
GTO integral calculation ... 4.805 sec (= 0.080 min) 15.0 %
SCF iterations ... 19.318 sec (= 0.322 min) 60.2 %
SCF Gradient evaluation ... 7.947 sec (= 0.132 min) 24.8 %
****ORCA TERMINATED NORMALLY****
TOTAL RUN TIME: 0 days 0 hours 0 minutes 39 seconds 917 msec
With NAMD and ORCA combined (60 cores for ORCA):
Timings for individual modules:
Sum of individual times ... 77.582 sec (= 1.293 min)
GTO integral calculation ... 5.404 sec (= 0.090 min) 7.0 %
SCF iterations ... 67.242 sec (= 1.121 min) 86.7 %
SCF Gradient evaluation ... 4.937 sec (= 0.082 min) 6.4 %
****ORCA TERMINATED NORMALLY****
Best,
Chunli
On Sun, Dec 13, 2020 at 10:53 PM Josh Vermaas <joshua.vermaas_at_gmail.com> wrote:
Just a quick question: how fast is the QM part of the calculation? I don't know what your expectation is, but each timestep is taking over a minute. The vast majority of that is likely the QM, as I'm sure you will find that an MM-only system with a handful of cores will calculate a timestep in under a second. My advice is to figure out the QM half of the calculation and get it running optimally. Even then, your performance is going to be awful compared with pure MM calculations, since you are trying to evaluate a much harder energy function.
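A quick way to separate the two halves (a sketch; it assumes your config uses the standard qmForces keyword and still runs cleanly with it switched off, and it reuses the namd2 path and config name from your script below):
# Time the MM-only part by disabling the QM/MM interface in a copy of the config
sed 's/qmForces on/qmForces off/' decarboxylase.1.conf > mm_only.conf
/ccs/home/chunli/NAMD_2.14_Source/Linux-x86_64-g++/namd2 +p30 mm_only.conf > mm_only.log
grep 'Initial time' mm_only.log
# Whatever remains of the ~75 s/step after subtracting the MM time is QM work plus ORCA launch overhead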
Josh
On Sun, Dec 13, 2020, 7:49 PM Chunli Yan <utchunliyan_at_gmail.com> wrote:
Hello,
NAMD QM/MM parallel runs across multiple nodes:
I write a nodelist file into the directory where ORCA runs. Below is the job submission script:
#!/bin/bash
#SBATCH -A bip174
#SBATCH -J test
#SBATCH -N 4
##SBATCH --tasks-per-node=32
##SBATCH --cpus-per-task=1
##SBATCH --mem=0
#SBATCH -t 48:00:00
#module load openmpi/3.1.4
export PATH="/ccs/home/chunli/namd-andes/openmpi-3.1.4/bin/:$PATH"
export LD_LIBRARY_PATH="/ccs/home/chunli/namd-andes/openmpi-3.1.4/lib/:$LD_LIBRARY_PATH"
# DIRECTORY TO RUN - $SLURM_SUBMIT_DIR is directory job was submitted from
cd $SLURM_SUBMIT_DIR
# Generate ORCA nodelist
for n in $(scontrol show hostnames "$SLURM_NODELIST"); do
echo "$n slots=20 max-slots=32" >> /gpfs/alpine/scratch/chunli/bip174/eABF/smd.qm.dft5/0/qmmm_0.nodes
done
sed -i '1d' /gpfs/alpine/scratch/chunli/bip174/eABF/smd.qm.dft5/0/qmmm_0.nodes
cd /gpfs/alpine/scratch/chunli/bip174/eABF/run.smd.dft5
/ccs/home/chunli/NAMD_2.14_Source/Linux-x86_64-g++/namd2 +p30 +isomalloc_sync decarboxylase.1.conf > output.smd1.log
I also exclude the first node, where NAMD launches, to avoid competition between NAMD and ORCA.
The nodelist is below:
andes4 slots=20 max-slots=32
andes6 slots=20 max-slots=32
andes7 slots=20 max-slots=32
In order to use the host file for mpirun, I edited runORCA.py as follows:
cmdline += orcaInFileName + " \"--hostfile /gpfs/alpine/scratch/chunli/bip174/eABF/smd.qm.dft5/0/qmmm_0.nodes --bind-to core -nooversubscribe \" " + " > " + orcaOutFileName
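With that edit, the command the interface assembles should end up looking roughly like the line below (a sketch: the ORCA binary path and the input/output file names are assumptions, only the quoted mpirun arguments are taken from the edited line above). ORCA forwards that quoted string to the mpirun calls it makes internally, which is why the hostfile and binding flags are passed this way instead of on an mpirun command line of my own:
/path/to/orca qmmm_0.input "--hostfile /gpfs/alpine/scratch/chunli/bip174/eABF/smd.qm.dft5/0/qmmm_0.nodes --bind-to core -nooversubscribe" > qmmm_0.output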
QM methods: B3LYP def2-SVP Grid4 EnGrad SlowConv TightSCF RIJCOSX D3BJ def2/J
I request 4 nodes in total, with 60 cores for ORCA and 20 for NAMD, but the performance is really bad for 48968 total atoms and 32 QM atoms. Below is the performance:
Info: Initial time: 30 CPUs 75.0565 s/step 1737.42 days/ns 2285.66 MB memory
Info: Initial time: 30 CPUs 81.1294 s/step 1877.99 days/ns 2286 MB memory
Info: Initial time: 30 CPUs 87.776 s/step 2031.85 days/ns 2286 MB memory
Can someone help me find out whether I did something wrong, or whether NAMD QM/MM can scale well across nodes? I checked the ORCA MPI processes on each node and found the CPU usage was only 50-70%.
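For reference, the per-node utilization and core binding can be checked with something along these lines (node names from the node file above; the commands are just one way to look at it):
# Snapshot of ORCA CPU usage on one of the compute nodes
ssh andes6 "top -b -n 1 | grep -i orca | head"
# Which cores one of the ORCA ranks is actually bound to
ssh andes6 "grep Cpus_allowed_list /proc/\$(pgrep -f orca | head -1)/status"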
NAMD was compiled with SMP support using icc:
./build charm++ verbs-linux-x86_64 icc smp -with-production
./config Linux-x86_64-g++ --charm-arch verbs-linux-x86_64-icc-smp
Thanks.
Best,
Chunli Yan