Re: Replica exchange simulation with GPU Acceleration

From: Giacomo Fiorin (giacomo.fiorin_at_gmail.com)
Date: Fri Jan 26 2018 - 09:12:28 CST

In general the multicore version (i.e. SMP with no network) is the best
approach for CUDA, provided that the system is small enough. With nearly
everything offloaded to the GPUs in recent versions, the CPUs are mostly
idle, and adding more CPU cores only clogs up the motherboard bus.
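
For example, a minimal launch line for the multicore-CUDA build (the core
count and device list are illustrative only; Mike's message below shows the
same pattern with Slurm variables):

  namd2 +p8 +setcpuaffinity +devices 0 input.namd > output.log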

Running CUDA jobs in parallel, particularly with MPI, is a whole other
endeavor.

Souvik's setup is one that is difficult to run fast. You may consider
using the multicore version for multiple-replica metadynamics runs, which
can communicate between replicas through files and do not need MPI.
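
For illustration, a minimal Colvars sketch of what such a file-based
multiple-replica metadynamics setup can look like (the collective variable
name and registry path here are hypothetical):

  metadynamics {
    name                   mtd
    colvars                dist1         # hypothetical collective variable
    hillWeight             0.1
    newHillFrequency       1000
    multipleReplicas       on            # share hills via files, no MPI needed
    replicaID              rep1          # unique label for each replica
    replicasRegistry       /shared/replicas.registry.txt
    replicaUpdateFrequency 1000          # steps between reading others' hills
  }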

Giacomo

On Thu, Jan 25, 2018 at 2:40 PM, Renfro, Michael <Renfro_at_tntech.edu> wrote:

> I can’t speak for running replicas as such, but my usual way of running on
> a single node with GPUs is to use the multicore-CUDA NAMD build, and to run
> namd as:
>
> namd2 +setcpuaffinity +devices ${GPU_DEVICE_ORDINAL} +p${SLURM_NTASKS}
> ${INPUT} >& ${OUTPUT}
>
> Where ${GPU_DEVICE_ORDINAL} is “0”, “1”, or “0,1” depending on which GPUs I
> reserve; ${SLURM_NTASKS} is the number of cores needed; and ${INPUT} and
> ${OUTPUT} are the NAMD input file and the file recording standard output.
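>
> As a hypothetical example, wrapped in a Slurm batch script (file names and
> resource requests are illustrative only):
>
> #!/bin/bash
> #SBATCH --ntasks=8        # CPU cores; becomes ${SLURM_NTASKS}
> #SBATCH --gres=gpu:1      # reserves a GPU; Slurm sets ${GPU_DEVICE_ORDINAL}
> INPUT=equil.namd          # hypothetical NAMD input file
> OUTPUT=equil.log
> namd2 +setcpuaffinity +devices ${GPU_DEVICE_ORDINAL} +p${SLURM_NTASKS} \
>     ${INPUT} >& ${OUTPUT}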
>
> Using HECBioSim’s 3M atom benchmark model, a single K80 card (presented as
> 2 distinct GPUs) could keep 8 E5-2680v4 CPU cores busy. But 16 or 28 cores
> (the maximum on a single node of ours) were hardly any faster with 2 GPUs
> than 8 cores were.
>
> --
> Mike Renfro / HPC Systems Administrator, Information Technology Services
> 931 372-3601 / Tennessee Tech University
>
> > On Jan 25, 2018, at 12:59 PM, Souvik Sinha <souvik.sinha893_at_gmail.com>
> wrote:
> >
> > Thanks for your reply.
> > I was wondering why '+idlepoll' can't even put the GPUs to work, despite
> > the likelihood of poor performance.
> >
> > On 25 Jan 2018 19:53, "Giacomo Fiorin" <giacomo.fiorin_at_gmail.com> wrote:
> > Hi Souvik, this seems connected to the compilation options. Compiling
> > with MPI + SMP + CUDA used to give very poor performance, although I
> > haven't tried with the new CUDA kernels (2.12 and later).
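> >
> > For reference, a sketch of what that build configuration looks like
> > (architecture names are typical Linux x86_64 choices and may differ on
> > other machines):
> >
> >   # Charm++ with an MPI+SMP machine layer, then NAMD with CUDA
> >   ./build charm++ mpi-linux-x86_64 smp --with-production
> >   ./config Linux-x86_64-g++ --charm-arch mpi-linux-x86_64-smp --with-cuda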
> >
> > Giacomo
> >
> > On Thu, Jan 25, 2018 at 4:02 AM, Souvik Sinha <souvik.sinha893_at_gmail.com>
> wrote:
> > NAMD Users,
> >
> > I am trying to run replica-exchange ABF simulations on a machine with 32
> > cores and 2 Tesla K40 cards. I am using NAMD 2.12, compiled from source.
> >
> > From this earlier thread, http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l.2014-2015/2490.html,
> > I found that using "twoAwayX" or "idlepoll" might help put the GPUs to
> > work, but in my situation neither gets the GPUs working ("twoAwayX" does
> > speed up the jobs, though). The '+idlepoll' switch generally works fine
> > with CUDA builds of NAMD for non-replica jobs. From the aforesaid thread,
> > I gather that running 4 replicas on 32 CPU cores and 2 GPUs may not give
> > my simulations a big boost, but I just want to check whether it works at
> > all.
> >
> > The command I am running for the job is:
> >
> > mpirun -np 32 /home/sgd/program/NAMD_2.12_Source/Linux-x86_64-g++/namd2 \
> >     +idlepoll +replicas 4 $inputfile +stdout log/job0.%d.log
> >
> > My own understanding is not getting me very far here, so any advice would
> > be helpful.
> >
> > Thank you
> >
> > --
> > Souvik Sinha
> > Research Fellow
> > Bioinformatics Centre (SGD LAB)
> > Bose Institute
> >
> > Contact: 033 25693275
> >
> >
> >
> > --
> > Giacomo Fiorin
> > Associate Professor of Research, Temple University, Philadelphia, PA
> > Contractor, National Institutes of Health, Bethesda, MD
> > http://goo.gl/Q3TBQU
> > https://github.com/giacomofiorin
>
>
>

-- 
Giacomo Fiorin
Associate Professor of Research, Temple University, Philadelphia, PA
Contractor, National Institutes of Health, Bethesda, MD
http://goo.gl/Q3TBQU
https://github.com/giacomofiorin
