Re: NAMD QM/MM multi nodes performance bad

From: Josh Vermaas (joshua.vermaas_at_gmail.com)
Date: Sun Dec 13 2020 - 21:53:31 CST

Just a quick question: how fast is the QM part of the calculation? I don't
know what your expectation is, but each timestep is taking over a minute.
The vast majority of that is likely the QM, as I'm sure you will find that
an MM-only system with a handful of cores will compute a timestep in under
a second. My advice is to isolate the QM half of the calculation and get it
running optimally on its own. Even then, your performance is going to be awful
compared with pure MM calculations, since you are evaluating a much
harder energy function.

Josh

On Sun, Dec 13, 2020, 7:49 PM Chunli Yan <utchunliyan_at_gmail.com> wrote:

> Hello,
> NAMD QM/MM parallel runs across multiple nodes:
> I wrote a nodelist file into the directory where ORCA runs. Below
> is the job submission script:
>
> #!/bin/bash
> #SBATCH -A bip174
> #SBATCH -J test
> #SBATCH -N 4
> ##SBATCH --tasks-per-node=32
> ##SBATCH --cpus-per-task=1
> ##SBATCH --mem=0
> #SBATCH -t 48:00:00
>
> #module load openmpi/3.1.4
>
> export PATH="/ccs/home/chunli/namd-andes/openmpi-3.1.4/bin/:$PATH"
> export LD_LIBRARY_PATH="/ccs/home/chunli/namd-andes/openmpi-3.1.4/lib/:$LD_LIBRARY_PATH"
>
> # DIRECTORY TO RUN - $SLURM_SUBMIT_DIR is directory job was submitted from
> cd $SLURM_SUBMIT_DIR
>
> # Generate ORCA nodelist
> for n in `echo $SLURM_NODELIST | scontrol show hostnames`; do
>   echo "$n slots=20 max-slots=32" >> /gpfs/alpine/scratch/chunli/bip174/eABF/smd.qm.dft5/0/qmmm_0.nodes
> done
> sed -i '1d' /gpfs/alpine/scratch/chunli/bip174/eABF/smd.qm.dft5/0/qmmm_0.nodes
>
> cd /gpfs/alpine/scratch/chunli/bip174/eABF/run.smd.dft5
> /ccs/home/chunli/NAMD_2.14_Source/Linux-x86_64-g++/namd2 +p30 +isomalloc_sync decarboxylase.1.conf > output.smd1.log
>
> I also exclude the first node, where NAMD is launched, from the nodelist to
> avoid competition between NAMD and ORCA.
> The resulting nodelist is below:
>
> andes4 slots=20 max-slots=32
> andes6 slots=20 max-slots=32
> andes7 slots=20 max-slots=32
>
> In order to use the hostfile with mpirun, I edited runORCA.py:
>
> cmdline += orcaInFileName + " \"--hostfile /gpfs/alpine/scratch/chunli/bip174/eABF/smd.qm.dft5/0/qmmm_0.nodes --bind-to core -nooversubscribe \" " + " > " + orcaOutFileName
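>
> With this change, the command that runORCA.py builds for each QM step looks
> roughly like the line below (a sketch; input/output file names are whatever
> the interface generates, e.g. qmmm_0.input, and /path/to/orca stands in for
> the ORCA binary). ORCA passes the quoted string straight through to its
> internal mpirun call:
>
> /path/to/orca qmmm_0.input "--hostfile /gpfs/alpine/scratch/chunli/bip174/eABF/smd.qm.dft5/0/qmmm_0.nodes --bind-to core -nooversubscribe " > qmmm_0.input.TmpOut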
>
> QM methods: B3LYP def2-SVP Grid4 EnGrad SlowConv TightSCF RIJCOSX D3BJ def2/J
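>
> In the NAMD config these are supplied through qmConfigLine; a sketch of the
> matching header, assuming the usual NAMD-ORCA interface syntax (the doubled
> %% is how the example configs escape ORCA's % blocks), with nprocs matching
> the 60 hostfile slots:
>
> qmConfigLine "! B3LYP def2-SVP def2/J RIJCOSX D3BJ Grid4 EnGrad SlowConv TightSCF"
> qmConfigLine "%%pal nprocs 60 end"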
>
> I request 4 nodes in total: 60 cores for ORCA and 20 for NAMD. But the
> performance is really bad for the 48968 total atoms and 32 QM atoms.
> Below is the performance:
>
> Info: Initial time: 30 CPUs 75.0565 s/step 1737.42 days/ns 2285.66 MB memory
> Info: Initial time: 30 CPUs 81.1294 s/step 1877.99 days/ns 2286 MB memory
> Info: Initial time: 30 CPUs 87.776 s/step 2031.85 days/ns 2286 MB memory
>
> Can someone help me find out whether I did something wrong, or whether NAMD
> QM/MM can scale well across nodes? I checked the ORCA MPI processes on each
> node and found the CPU usage was only 50-70%.
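>
> (One way to see per-rank CPU use and core placement, as a sketch assuming
> ssh access to the compute nodes and the host names from the nodelist above:)
>
> for h in andes4 andes6 andes7; do
>     ssh $h 'ps -eo pcpu,psr,comm --sort=-pcpu | grep -i orca | head'
> done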
>
> NAMD was compiled with SMP and icc:
> ./build charm++ verbs-linux-x86_64 icc smp --with-production
> ./config Linux-x86_64-g++ --charm-arch verbs-linux-x86_64-icc-smp
>
> Thanks.
>
> Best,
>
> Chunli Yan
>
>
>

This archive was generated by hypermail 2.1.6 : Fri Dec 31 2021 - 23:17:10 CST