Re: Compiling issue: selfcompiled NAMD2.14 multicore version factor ~2x slower

From: Giacomo Fiorin (giacomo.fiorin_at_gmail.com)
Date: Sat Aug 07 2021 - 15:40:42 CDT

Hello René, if Colvars is not active in either run (pre-compiled or
self-compiled), it is very unlikely that changes to its source files are
affecting performance. You could confirm this by making your own build of
an *unmodified* 2.14 source tree, which I would expect to behave the same
way.
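
For example (a rough sketch; the archive name and directory names below are
assumptions based on the standard source download):

  # unpack a clean copy of the 2.14 source next to your modified tree
  tar xzf NAMD_2.14_Source.tar.gz
  # confirm that the only differences are in the Colvars files
  diff -ru NAMD_2.14_Source/colvars NAMD_2.14_Source_modified/colvars
  # then build the clean tree with the same script and re-run the benchmark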

One thing to note is that most pre-compiled NAMD builds are made with the
Intel compiler; see, for example, the following lines printed when you
launch the "Linux-x86_64-multicore-CUDA" build:

Info: Based on Charm++/Converse 61002 for *multicore-linux-x86_64-iccstatic*
Info: Built Mon Aug 24 10:10:58 CDT 2020 by jim on belfast.ks.uiuc.edu

Jim Phillips, or one of the other core maintainers at UIUC, may be able to
comment further on what you could do on your end to reproduce the
optimizations of the pre-compiled build in your own build.
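
If you want to try reproducing that configuration yourself, the steps below
are a sketch only (they assume the Intel compiler is available on your
machine and omit whatever module loads your cluster needs):

  # build Charm++ with the same target reported by the pre-compiled binary
  ./build charm++ multicore-linux-x86_64 iccstatic -j16 --with-production
  # configure NAMD against that Charm++ build using the Intel arch file
  ./config Linux-x86_64-icc --charm-arch multicore-linux-x86_64-iccstatic \
      --with-tcl --with-python --with-fftw --with-cuda
  cd Linux-x86_64-icc
  make -j 12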

In general, I would also look into changing the number of CPU threads
assigned to the GPU in NAMD 2.x. You have fairly good performance to begin
with, consistent with such a small system. The CPU-GPU communication step
is one of the main factors limiting simulation speed, and it is definitely
affected by how many CPU threads communicate with the GPU.

For such a small system, fewer CPU threads per GPU would be more
appropriate. (Note that this applies to NAMD 2.x; NAMD 3.0 is entirely
different.)
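
For example, with the multicore-CUDA build you can scan the thread count
directly on the command line and compare the ns/day reported in the log (a
rough sketch; "run.namd" stands in for your actual configuration file):

  namd2 +p4 +setcpuaffinity +devices 0 run.namd
  namd2 +p8 +setcpuaffinity +devices 0 run.namd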

Giacomo

On Sat, Aug 7, 2021 at 3:18 PM René Hafner TUK <hamburge_at_physik.uni-kl.de>
wrote:

> Dear NAMD maintainers,
>
>
> I tried implementing a new colvar (which was successful), but wondered
> about the speed reduction it caused.
>
> So I compared plain MD simulations (without any colvars active) run with
> my self-compiled version against the precompiled binary from the website.
>
> The only code change is in the Colvars module files, and that code is not
> active in the following comparison.
>
>
> I obtain the following simulation speeds for a single standard run
> (cutoff etc.; membrane + water, 7k atoms):
>
> Precompiled: 300 ns/day (4 fs timestep, HMR)
>
> Self-compiled: 162 ns/day (4 fs timestep, HMR)
>
>
> This is not CUDA-version dependent, as the result is the same with both
> CUDA 11.3 and CUDA 10.1 (the latter was used for the precompiled binary).
>
>
> Any help is appreciated.
>
> Kind regards
>
> René
>
> I compiled it with the following settings:
>
> """
>
> # building Charm++
> module purge
> module load gcc/8.4
> ./build charm++ multicore-linux-x86_64 gcc -j16 --with-production
>
> ###
> module purge
> module load gcc/8.4
> module load nvidia/10.1
> ./config Linux-x86_64-g++ --charm-arch multicore-linux-x86_64-gcc \
>     --with-tcl --with-python --with-fftw --with-cuda --arch-suffix \
>     SelfCompiledNAMD214MultiCoreCUDA_cuda101_gcc_with_newcolvar_def
> cd Linux-x86_64-g++
> # append the line CXXOPTS=-lstdc++ -std=c++11 to Make.config
> ## without some CXXOPTS defined (e.g. as --with-debug would add), the
> ## build does not work
> echo "CXXOPTS=-lstdc++ -std=c++11" >> Make.config
>
> echo "showing Make.config"
> cat Make.config
> # then run it
> make -j 12 | tee \
>     make_log_SelfCompiledNAMD214MultiCoreCUDA_cuda101_gcc_with_newcolvar_def.txt
>
> """
>
>
> --
> Dipl.-Phys. René Hafner
> TU Kaiserslautern
> Germany
>
>
