Re: NAMD QM/MM multi nodes performance bad

From: Chunli Yan (utchunliyan_at_gmail.com)
Date: Sun Dec 13 2020 - 22:21:52 CST

I took the input generated by NAMD and ran it with ORCA alone using 60 cores;
below is the timing:

Timings for individual modules:

Sum of individual times    ...  32.070 sec (= 0.535 min)
GTO integral calculation   ...   4.805 sec (= 0.080 min)  15.0 %
SCF iterations             ...  19.318 sec (= 0.322 min)  60.2 %
SCF Gradient evaluation    ...   7.947 sec (= 0.132 min)  24.8 %

                             ****ORCA TERMINATED NORMALLY****

TOTAL RUN TIME: 0 days 0 hours 0 minutes 39 seconds 917 msec
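
For reference, the standalone run was launched roughly as in the sketch below
(the ORCA install path is a placeholder and the qmmm_0.* file names are
assumed; ORCA passes the quoted string after the input file on to mpirun):

# rough sketch of the standalone 60-core ORCA benchmark (paths/names assumed)
ORCA_BIN=/path/to/orca/orca      # the full path is needed for a parallel ORCA run
cd /gpfs/alpine/scratch/chunli/bip174/eABF/smd.qm.dft5/0
$ORCA_BIN qmmm_0.input "--hostfile qmmm_0.nodes --bind-to core" > orca_standalone.out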

With NAMD and ORCA combined (60 cores for ORCA):

Timings for individual modules:

Sum of individual times    ...  77.582 sec (= 1.293 min)
GTO integral calculation   ...   5.404 sec (= 0.090 min)   7.0 %
SCF iterations             ...  67.242 sec (= 1.121 min)  86.7 %
SCF Gradient evaluation    ...   4.937 sec (= 0.082 min)   6.4 %

                             ****ORCA TERMINATED NORMALLY****

Best,

Chunli

On Sun, Dec 13, 2020 at 10:53 PM Josh Vermaas <joshua.vermaas_at_gmail.com>
wrote:

> Just a quick question: how fast is the QM part of the calculation? I don't
> know what your expectation is, but each timestep is taking over a minute.
> The vast majority of that is likely the QM, as I'm sure you will find that
> an MM-only system with a handful of cores will calculate a timestep in under
> a second. My advice is to focus on the QM half of the calculation and get it
> running optimally. Even then, your performance is going to be awful compared
> with pure MM calculations, since you are trying to evaluate a much harder
> energy function.
>
> Josh
>
> On Sun, Dec 13, 2020, 7:49 PM Chunli Yan <utchunliyan_at_gmail.com> wrote:
>
>> Hello,
>> NAMD QM/MM parallel runs across multiple nodes:
>> I wrote a nodelist file into the directory where ORCA runs. Below is the
>> job submission script (a condensed sketch of the hostfile-generation step
>> follows it):
>>
>>
>> #!/bin/bash
>> #SBATCH -A bip174
>> #SBATCH -J test
>> #SBATCH -N 4
>> ##SBATCH --tasks-per-node=32
>> ##SBATCH --cpus-per-task=1
>> ##SBATCH --mem=0
>> #SBATCH -t 48:00:00
>>
>> #module load openmpi/3.1.4
>>
>> export PATH="/ccs/home/chunli/namd-andes/openmpi-3.1.4/bin/:$PATH"
>> export LD_LIBRARY_PATH="/ccs/home/chunli/namd-andes/openmpi-3.1.4/lib/:$LD_LIBRARY_PATH"
>>
>> # DIRECTORY TO RUN - $SLURM_SUBMIT_DIR is directory job was submitted from
>> cd $SLURM_SUBMIT_DIR
>>
>> # Generate ORCA nodelist
>> for n in `echo $SLURM_NODELIST | scontrol show hostnames`; do
>>   echo "$n slots=20 max-slots=32" >> /gpfs/alpine/scratch/chunli/bip174/eABF/smd.qm.dft5/0/qmmm_0.nodes
>> done
>> sed -i '1d' /gpfs/alpine/scratch/chunli/bip174/eABF/smd.qm.dft5/0/qmmm_0.nodes
>>
>> cd /gpfs/alpine/scratch/chunli/bip174/eABF/run.smd.dft5
>> /ccs/home/chunli/NAMD_2.14_Source/Linux-x86_64-g++/namd2 +p30 +isomalloc_sync decarboxylase.1.conf > output.smd1.log
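>>
>> For reference, the nodelist-generation step above can be collapsed into a
>> single pipeline, as sketched below (same target file and slots values; it
>> skips the first node, where NAMD itself runs, in place of the separate sed
>> call, and truncates the file with > so entries from an earlier job are not
>> appended):
>>
>> # condensed sketch of the hostfile generation (writes the same file as above)
>> scontrol show hostnames $SLURM_NODELIST | tail -n +2 | \
>>   awk '{print $1" slots=20 max-slots=32"}' \
>>   > /gpfs/alpine/scratch/chunli/bip174/eABF/smd.qm.dft5/0/qmmm_0.nodes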
>>
>> I also exclude the first node, where NAMD launches, to avoid competition
>> between NAMD and ORCA. The nodelist is below:
>>
>> andes4 slots=20 max-slots=32
>> andes6 slots=20 max-slots=32
>> andes7 slots=20 max-slots=32
>>
>> In order to use the host file with mpirun, I edited runORCA.py:
>>
>> cmdline += orcaInFileName + " \"--hostfile /gpfs/alpine/scratch/chunli/bip174/eABF/smd.qm.dft5/0/qmmm_0.nodes --bind-to core -nooversubscribe \" " + " > " + orcaOutFileName
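>>
>> With that change, the command NAMD assembles for each QM step should expand
>> to something like the sketch below (the ORCA executable path and the in/out
>> file names are whatever runORCA.py substitutes for orcaInFileName and
>> orcaOutFileName; ORCA forwards the quoted string to mpirun):
>>
>> # sketch of the assembled ORCA call; <...> are placeholders
>> <orca> <orcaInFileName> "--hostfile /gpfs/alpine/scratch/chunli/bip174/eABF/smd.qm.dft5/0/qmmm_0.nodes --bind-to core -nooversubscribe " > <orcaOutFileName>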
>>
>> QM methods: B3LYP def2-SVP Grid4 EnGrad SlowConv TightSCF RIJCOSX D3BJ def2/J
>>
>> I request 4 nodes in total, with 60 cores for ORCA and 20 for NAMD. But the
>> performance is really bad for 48968 total atoms and 32 QM atoms. Below is
>> the performance:
>>
>> Info: Initial time: 30 CPUs 75.0565 s/step 1737.42 days/ns 2285.66 MB memory
>> Info: Initial time: 30 CPUs 81.1294 s/step 1877.99 days/ns 2286 MB memory
>> Info: Initial time: 30 CPUs 87.776 s/step 2031.85 days/ns 2286 MB memory
>>
>> Can someone help me find out whether I did something wrong, or whether
>> NAMD QM/MM can scale well across nodes? I checked the ORCA MPI processes
>> on each node and found the CPU usage was only 50-70%.
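>>
>> (For reference, the per-node spot check was along these lines; the command
>> below is a sketch rather than a transcript, using the node names from the
>> nodelist above:)
>>
>> for n in andes4 andes6 andes7; do
>>   ssh $n "top -bn1 | grep orca | head -5"   # per-process %CPU of the ORCA ranks
>> done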
>>
>> NAMD was compiled with SMP using icc:
>> ./build charm++ verbs-linux-x86_64 icc smp --with-production
>> ./config Linux-x86_64-g++ --charm-arch verbs-linux-x86_64-icc-smp
>>
>> Thanks.
>>
>> Best,
>>
>> Chunli Yan
>>
>>
>>
