From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Mon Sep 07 2020 - 14:09:51 CDT
Alex,
you are missing a couple of points.
the gain in utilization from hyperthreading is rarely more than 20%. it is
less for well optimized and vectorized kernels.
but there are several reasons why running all calculations as 112 tasks
will be slower.
- cache efficiency. running independent calculations means that there is no
sharing of data between threads and that also means that each task has
effectively only half the cache available. on modern CPUs the impact of
cache is huge. hyper-threading would be most effective in cases where the
tasks can share cached data, e.g. via thread parallelization or through
having less data per parallel task being processed.
- turbo boost. on modern CPUs the actual clock frequency used may be higher
than the nominal frequency, if the power budget and CPU core temperature
allow it. that depends on the individual hardware. if hyperthreading is
effective, it increases utilization and thus reduces the CPU clock toward
nominal values, so the benefit may be eliminated.
- memory bandwidth contention. this is essentially the same argument as for
cache efficiency. hyperthreading does not create new hardware, but aims to
use the CPU more efficiently. any operation other than what is happening
inside the CPU will be slowed down since you now have double the bandwidth
requirements. this will only be of limited impact, if your entire compute
kernel fits into the CPU cache (available per thread) and no data needs to
be communicated.
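
to put rough numbers on this, here is a minimal sketch under a deliberately simple model: running two jobs per core changes the per-core throughput by a factor (1 + gain). the function name batch_times and the model itself are illustrative assumptions, not measurements, and the negative gain is a hypothetical value standing in for the contention effects listed above.

```python
def batch_times(t_single, ht_gain):
    """Return (sequential, simultaneous) wall times for 2 jobs per core.

    t_single: time of one job when it has a physical core to itself.
    ht_gain:  fractional change in per-core throughput when two jobs share
              the core's two hardware threads (0.2 means 20% more work
              per unit time, a negative value means a net loss).
    """
    # Option 1: two successive batches, one job per core.
    sequential = 2.0 * t_single
    # Option 2: two jobs share each core, which processes the work of
    # two jobs at (1 + ht_gain) times the single-job throughput.
    simultaneous = 2.0 * t_single / (1.0 + ht_gain)
    return sequential, simultaneous


# Gains to compare: 1/3 (implied by a 1.5*T estimate), 0.2, 0.0,
# and a hypothetical net loss of 10%.
for gain in (1.0 / 3.0, 0.2, 0.0, -0.1):
    seq, sim = batch_times(1.0, gain)
    print(f"gain {gain:+.2f}: sequential {seq:.2f} T, simultaneous {sim:.2f} T")
```

in this model the 1.5*T estimate from the quoted message corresponds to a throughput gain of about 33%. a 20% gain gives about 1.67 T, zero gain gives 2 T, and any net loss makes the simultaneous run slower than two successive batches.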
even with hyper-threading disabled, people often argue that running twice
as many individual calculations on N cores will just run them at half the
speed. but that reasoning has not been valid for a long time. typically, for
the reasons listed above, using some mechanism that makes sure that
only one calculation runs per core is significantly more efficient than
starting all calculations at the same time and thus oversubscribing the
CPU. the slowdown in this case can be even more drastic than for using
hyper-threading, but the same effects are the main reason why hyperthreading
is rarely useful for parallel or concurrent HPC applications.
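
one possible mechanism of that kind, as a minimal sketch: a fixed-size worker pool that starts the next calculation only when a slot becomes free. the helper name run_limited and the placeholder commands are hypothetical. the sketch assumes each job is itself single-threaded and does not pin jobs to specific cores.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_limited(commands, max_concurrent):
    """Run each command as a subprocess, at most max_concurrent at a time.

    Returns the exit codes in the same order as the commands.
    """
    def run_one(cmd):
        # Each worker thread blocks here until its subprocess exits,
        # so the pool size caps the number of running calculations.
        return subprocess.run(cmd).returncode

    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(run_one, commands))


if __name__ == "__main__":
    # Placeholder jobs: replace with the real calculation commands.
    jobs = [[sys.executable, "-c", "pass"] for _ in range(8)]
    print(run_limited(jobs, max_concurrent=4))
```

for the case discussed here, that would mean passing all 112 commands with max_concurrent set to 56, the number of physical cores, so that the queue drains as cores become free instead of oversubscribing them.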
axel.
On Mon, Sep 7, 2020 at 2:38 PM Alex Balaeff <abalaeff_at_polarisqb.com> wrote:
> Thanks a lot for your comments Marcelo. Throwing in my 2 cents (in
> hope to be criticized if these are wrong cents :) : there are
> situations when using every thread makes sense.
>
> For example, say, I need to run 112 similar jobs on the CPU cores in
> question. And let's say the performance of 1 job per thread is 50%
> worse than that of 1 job per core.
>
> In that case, option 1 is to run two successive batches of 56 jobs
> each. If a job takes time T, my whole simulation takes 2T.
>
> Option 2 is to run all 112 jobs simultaneously. They will finish in
> 1.5*T -- still better than the 2T timing of option 1??
>
> Best,
>
> Alexander.
>
> On Mon, Sep 7, 2020 at 2:13 PM Marcelo C. R. Melo <melomcr_at_gmail.com>
> wrote:
> >
> > Hi Zhihong,
> >
> > The performance of a QM/MM simulation will (almost always) be determined
> by the performance of the QM calculation itself. In this case, you are
> using ORCA to run DFT using 4 CPU cores (by asking for "PAL4").
> >
> > In QM calculations, it is important to know what is the size of the QM
> region, that is, how many atoms are in the QM region? 10 atoms, 100 atoms?
> This will make a gigantic difference in performance.
> >
> > The best bet for you is to balance the number of cores dedicated to NAMD
> with the number of cores dedicated to ORCA, and absolutely never overlap
> the CPU cores for both.
> > Something else that has been discussed in this list extensively is the
> use of hyperthreading. In your example, since you have two 28-core CPUs,
> you should only allocate a total of 56 processes between NAMD and ORCA, no
> more than that. Using all the 112 threads will probably lead to terrible
> performance.
> >
> > I would suggest starting with 10 cores for NAMD and 46 for ORCA. (I am
> assuming based on your performance that you have many atoms in your QM
> region, which will benefit from more CPU cores).
> > You will need to use ORCA's long format for parallelism instead of using
> "PAL4", and I see you already have a line like that in your NAMD config
> file asking for 10 cores.
> > Try benchmarking the ratio of NAMD/ORCA CPU cores, and do not exceed 56
> (or maybe 54, to leave a couple of cores for the OS, since you are running
> on a workstation).
> >
> > Best,
> > Marcelo
> >
> > On Mon, 7 Sep 2020 at 04:42, 辛志宏 <xzhfood_at_njau.edu.cn> wrote:
> >>
> >> Dear all,
> >>
> >> I am running a QM/MM molecular dynamics simulation of an enzyme complex
> >> (298 amino acids, 1 ligand, and 90 thousand water molecules) using NAMD,
> >> but it is very slow: only 25 steps are completed per day (24 hours) in a
> >> minimization run (minimize 100, run 2000). I wonder whether there are
> >> issues with the parameters in the config file. Any suggestions to improve
> >> the speed of the QM/MM run would be much appreciated.
> >>
> >>
> >> The hardware of my computer (8173M workstation) is fine, with 384GB
> >> memory and two physical CPUs (28 cores per CPU, 112 threads in total). The
> >> command is as follows:
> >>
> >>
> >> charmrun ++local +p20 +isomalloc_sync namd2 YZZ-config.ORCA-1.namd | tee YZZ-config.ORCA-1.namd.log
> >>
> >>
> >> Thank you in advance.
> >>
> >>
> >> Zhihong Xin,
> >>
> >>
> >>
> >> The config file is as follows:
> >>
> >> ## Single QM region with MM water box
> >>
> >> structure       ionized.psf
> >>
> >> coordinates     ionized.pdb
> >>
> >> #Continuing a job from the restart files
> >>
> >> if {1} {
> >>
> >> set inputname      YZZ_equil_MM
> >>
> >> binCoordinates     $inputname.coor
> >>
> >> extendedSystem     $inputname.xsc
> >>
> >> }
> >>
> >> cellBasisVector1 64.945    0       0
> >>
> >> cellBasisVector2 0     65.353      0
> >>
> >> cellBasisVector3 0     0       67.919
> >>
> >> cellOrigin 55.318   57.874   55.561
> >>
> >> seed            7910881
> >>
> >> # Output Parameters
> >>
> >> binaryoutput no
> >>
> >> outputname YZZ-QM-min-out
> >>
> >> outputenergies 1
> >>
> >> outputtiming 1
> >>
> >> outputpressure 1
> >>
> >> binaryrestart yes
> >>
> >> dcdfile YZZ-QM-min-out.dcd
> >>
> >> dcdfreq 1
> >>
> >> XSTFreq 1
> >>
> >> restartfreq 100
> >>
> >> restartname YZZ-QM-min-out.restart
> >>
> >> # mobile atom selection:
> >>
> >> constraints          on
> >>
> >> consexp              2
> >>
> >> consref              YZZ-restraint.pdb
> >>
> >> conskfile            YZZ-restraint.pdb
> >>
> >> conskcol             B
> >>
> >> constraintScaling    2.0
> >>
> >> # PME Parameters
> >>
> >> PME on
> >>
> >> PMEGridspacing 1
> >>
> >> set temperature 300
> >>
> >> temperature $temperature
> >>
> >> # Thermostat Parameters
> >>
> >> langevin     on
> >>
> >> langevintemp        $temperature
> >>
> >> langevinHydrogen    on
> >>
> >> langevindamping     50
> >>
> >> # Barostat Parameters
> >>
> >> usegrouppressure        yes
> >>
> >> useflexiblecell         no
> >>
> >> useConstantArea         no
> >>
> >> langevinpiston         on
> >>
> >> langevinpistontarget    1.01325
> >>
> >> langevinpistonperiod    200
> >>
> >> langevinpistondecay     100
> >>
> >> langevinpistontemp      $temperature
> >>
> >> surfacetensiontarget    0.0
> >>
> >> strainrate              0. 0. 0.
> >>
> >> wrapAll         on
> >>
> >> wrapWater       on
> >>
> >> # Integrator Parameters
> >>
> >> timestep         0.5
> >>
> >> firstTimestep         0
> >>
> >> fullElectFrequency      1
> >>
> >> nonbondedfreq         1
> >>
> >> # Force Field Parameters
> >>
> >> paratypecharmm  on
> >>
> >> parameters ../CHARMpars/toppar_all36_carb_glycopeptide.str
> >>
> >> parameters      ../CHARMpars/toppar_water_ions_namd.str
> >>
> >> parameters ../CHARMpars/toppar_all36_na_nad_ppi_gdp_gtp.str
> >>
> >> parameters ../CHARMpars/par_all36_carb.prm
> >>
> >> parameters ../CHARMpars/par_all36_cgenff.prm
> >>
> >> parameters ../CHARMpars/par_all36_lipid.prm
> >>
> >> parameters ../CHARMpars/par_all36_na.prm
> >>
> >> parameters ../CHARMpars/par_all36_prot.prm
> >>
> >> parameters      ../common/DMP_ABD769.prm
> >>
> >> #printExclusions on
> >>
> >> exclude scaled1-4
> >>
> >> 1-4scaling 1.0
> >>
> >> rigidbonds none
> >>
> >> cutoff 12.0
> >>
> >> pairlistdist 14.0
> >>
> >> switching on
> >>
> >> switchdist 10.0
> >>
> >> stepspercycle   1
> >>
> >> # Turns ON or OFF the QM calculations
> >>
> >> qmForces        on
> >>
> >> qmParamPDB     "YZZ-namd-QM-0.pdb"
> >>
> >> qmColumn        "beta"
> >>
> >> qmBondColumn    "occ"
> >>
> >> #Link Atoms
> >>
> >> qmBondDist           on
> >>
> >> # Number of simultaneous QM simulations per node
> >>
> >> QMSimsPerNode   20
> >>
> >> QMElecEmbed on
> >>
> >> QMSwitching on
> >>
> >> QMSwitchingType shift
> >>
> >> QMPointChargeScheme none
> >>
> >> QMBondScheme "cs"
> >>
> >> #qmBaseDir  "/dev/shm/YZZ-NAMD_MIN"
> >>
> >> # Directory where QM calculations will be run.
> >>
> >> qmBaseDir  "/dev/shm/NAMD_Example1"
> >>
> >> ## ORCA
> >>
> >> qmConfigLine    "! B3LYP 6-31G Grid4 PAL4 EnGrad TightSCF"
> >>
> >> qmConfigLine    "%%output PrintLevel Mini Print\[ P_Mulliken \] 1 Print\[P_AtCharges_M\] 1 end"
> >>
> >> #qmConfigLine     "%%pal nprocs 10 end"
> >>
> >> # construction of ORCA's input file.
> >>
> >> qmMult          "1 2"
> >>
> >> qmCharge        "1 -1"
> >>
> >> qmSoftware      "orca"
> >>
> >> qmExecPath      "/home/xzhfood/software/orca_4_1_2_linux_x86-64_openmpi313/orca"
> >>
> >> QMOutStride     1
> >>
> >> QMPositionOutStride     1
> >>
> >> # Number of steps in the QM/MM simulation.
> >>
> >> minimize  100
> >>
> >> run 2000
> >>
> >>
>
>
> --
>  -----
>   Dr. Alexander Balaeff
>   Polaris Quantum Biotech
>   www.PolarisQB.com
>   (919)-270-5772
>
>
--
Dr. Axel Kohlmeyer akohlmey_at_gmail.com http://goo.gl/1wk0
College of Science & Technology, Temple University, Philadelphia PA, USA
International Centre for Theoretical Physics, Trieste, Italy.