Re: Replica exchange simulation with GPU Acceleration

From: Souvik Sinha (souvik.sinha893_at_gmail.com)
Date: Fri Jan 26 2018 - 23:20:14 CST

Thanks for the reply. This earlier thread is really helpful. I will
definitely try your suggestion of building my NAMD so that it works with
replica jobs.
On 27 Jan 2018 03:16, "Jeff Comer" <jeffcomer_at_gmail.com> wrote:

Dear Souvik,

I routinely use GPUs for multiple-walker ABF with decent performance. I
have a workstation with 3 GPUs and usually use 3 replicas. I haven't tried
using more threads than GPUs. I posted on this mailing list about setting
it up:

http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l.2016-2017/1721.html
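
One way to sketch that kind of per-GPU setup (an illustration only, not
necessarily the setup described in that post; it assumes an MPI-CUDA NAMD
build and Open MPI, and the script and file names are placeholders) is to
give each walker its own MPI rank and pin each rank to one GPU through
CUDA_VISIBLE_DEVICES:

#!/bin/bash
# launch_walker.sh (placeholder name): pin each MPI rank to its own GPU
RANK=${OMPI_COMM_WORLD_LOCAL_RANK:-0}   # Open MPI local rank; other MPIs export a different variable
export CUDA_VISIBLE_DEVICES=$RANK       # rank 0 -> GPU 0, rank 1 -> GPU 1, ...
exec namd2 +idlepoll "$@"

# e.g. three walkers on three GPUs:
mpirun -np 3 ./launch_walker.sh +replicas 3 abf_walker.namd +stdout log/job0.%d.log

With -np equal to +replicas, each walker runs on one core and one GPU; the
two numbers should be scaled together, and never beyond the number of GPUs.
On the Colvars side, the walkers are coupled through the shared/sharedFreq
options of the abf block (see the Colvars manual for details).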

Jeff

------------------------------------------------------
Jeffrey Comer, PhD
Assistant Professor
Institute of Computational Comparative Medicine
Nanotechnology Innovation Center of Kansas State
Kansas State University
Office: P-213 Mosier Hall
Phone: 785-532-6311
Website: http://jeffcomer.us

On Fri, Jan 26, 2018 at 1:35 PM, Souvik Sinha <souvik.sinha893_at_gmail.com>
wrote:

> Ok, I get that. Thanks.
> On 27 Jan 2018 12:48 a.m., "Giacomo Fiorin" <giacomo.fiorin_at_gmail.com>
> wrote:
>
>> I'm not familiar with how the new CUDA code manages concurrency with the
>> GPU between different processes. Eventually, somebody at UIUC will provide
>> some info.
>>
>> For sure, sharing a GPU is much worse than you might expect: you don't
>> just halve its speed. Transferring data to/from the GPU is one of the
>> slowest operations, and the kernel shares time on the GPU between the two
>> processes in a manner that is completely unaware of the processes'
>> compute loops. You may well end up with interrupted loops on the GPUs,
>> thus losing much more than half.
>>
>> With NAMD being a performance-oriented code, there may very well be
>> instructions that prevent you from doing that, either explicitly or
>> implicitly as a result of the Charm++ scheduler.
>>
>> Giacomo
>>
>> On Fri, Jan 26, 2018 at 2:02 PM, Souvik Sinha <souvik.sinha893_at_gmail.com>
>> wrote:
>>
>>> OK, that sheds some light. I mentioned in my earlier post that I'm not
>>> expecting much of a boost from the GPUs for the replicas; I was just
>>> checking whether the multiple-walker scheme can use the GPUs at all. I
>>> get that launching more processes than there are GPUs is completely
>>> useless. Earlier, with the multicore-CUDA binary, single-process
>>> performance was greatly elevated by using the 2 GPUs.
>>>
>>> Just one question: is it because I launched 4 replicas over 2 GPUs that
>>> the GPUs were not used at all? I mean, if I launch 2 replicas over the 2
>>> GPUs, will that put the GPUs to work? Obviously I can check that myself
>>> and get back to you. Thanks again.
>>> On 27 Jan 2018 12:03 a.m., "Giacomo Fiorin" <giacomo.fiorin_at_gmail.com>
>>> wrote:
>>>
>>>> The two multiple-walker schemes use different code. I wrote the one
>>>> for metadynamics a few years back before NAMD had multiple-copy capability,
>>>> using the file system. Jeff Comer and others at UIUC wrote the one for
>>>> ABF, using the network: for this reason, its use is subject to the
>>>> constraints of Charm++, where the simultaneous use of MPI and CUDA has so
>>>> far been difficult.
>>>>
>>>> The network-based solution should be more scalable in large HPC
>>>> clusters, but for a small commodity cluster of single-node replicas it
>>>> should be OK.
>>>>
>>>> By the way, I just noticed that you are launching 4 copies of NAMD over
>>>> 2 GPUs. Don't do that: GPUs must be assigned exclusively to one process,
>>>> or their benefits go out the window.
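>>>>
>>>> A quick way to verify the assignment is nvidia-smi, which lists the
>>>> processes attached to each GPU; with one replica per GPU you should see
>>>> exactly one namd2 process under each device:
>>>>
>>>> nvidia-smi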
>>>>
>>>> Giacomo
>>>>
>>>> On Fri, Jan 26, 2018 at 1:24 PM, Souvik Sinha <
>>>> souvik.sinha893_at_gmail.com> wrote:
>>>>
>>>>> Thanks for the replies. I get that in the present scenario it is going
>>>>> to be hard to get the GPU resources for my replica runs, because of the
>>>>> difficulty of the parallelisation scheme for GPU usage under MPI
>>>>> execution.
>>>>>
>>>>> Is the multiple-walker ABF replica-exchange scheme implemented
>>>>> differently from metadynamics or the other NAMD replica-exchange
>>>>> strategies? I am just curious, because my understanding in this regard
>>>>> is not quite up to the mark.
>>>>> On 26 Jan 2018 20:43, "Giacomo Fiorin" <giacomo.fiorin_at_gmail.com>
>>>>> wrote:
>>>>>
>>>>>> In general the multicore version (i.e. SMP with no network) is the
>>>>>> best approach for CUDA, provided that the system is small enough. With
>>>>>> nearly everything offloaded to the GPUs in the recent version, the CPUs are
>>>>>> mostly idle, and adding more CPU cores only clogs up the motherboard bus.
>>>>>>
>>>>>> Running CUDA jobs in parallel, particularly with MPI, is a whole
>>>>>> other endeavor.
>>>>>>
>>>>>> In Souvik's case, it is a setup that is difficult to run fast. You
>>>>>> may consider using the multicore version for multiple-replica
>>>>>> metadynamics runs, which can communicate between replicas using files
>>>>>> and do not need MPI.
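>>>>>>
>>>>>> A minimal sketch of such a file-based multiple-walker setup on the
>>>>>> Colvars side (the colvar name, hill parameters and paths below are
>>>>>> placeholders; check the Colvars manual for the exact keywords) would
>>>>>> look roughly like this:
>>>>>>
>>>>>> metadynamics {
>>>>>>   colvars          dist          # a colvar defined elsewhere in the file
>>>>>>   hillWeight       0.1
>>>>>>   newHillFrequency 1000
>>>>>>   multipleReplicas on            # walkers exchange hills through files
>>>>>>   replicaID        walker1       # unique string for each copy
>>>>>>   replicasRegistry /shared/metad/replicas.registry.txt
>>>>>>   replicaUpdateFrequency 50000   # how often a walker reads the others' hills
>>>>>> }
>>>>>>
>>>>>> Each copy then runs as an independent multicore-CUDA job; the only
>>>>>> thing shared between them is the registry file on a common file system.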
>>>>>>
>>>>>> Giacomo
>>>>>>
>>>>>> On Thu, Jan 25, 2018 at 2:40 PM, Renfro, Michael <Renfro_at_tntech.edu>
>>>>>> wrote:
>>>>>>
>>>>>>> I can’t speak for running replicas as such, but my usual way of
>>>>>>> running on a single node with GPUs is to use the multicore-CUDA NAMD build,
>>>>>>> and to run namd as:
>>>>>>>
>>>>>>> namd2 +setcpuaffinity +devices ${GPU_DEVICE_ORDINAL}
>>>>>>> +p${SLURM_NTASKS} ${INPUT} >& ${OUTPUT}
>>>>>>>
>>>>>>> Where ${GPU_DEVICE_ORDINAL} is “0”, “1”, or “0,1” depending on which
>>>>>>> GPU I reserve; ${SLURM_NTASKS} is the number of cores needed, and ${INPUT}
>>>>>>> and ${OUTPUT} are the NAMD input file and the file to record standard
>>>>>>> output.
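>>>>>>>
>>>>>>> As a rough sketch, under Slurm those variables get filled in by a
>>>>>>> batch script along the lines of the following (the module name and
>>>>>>> gres syntax are site-specific assumptions):
>>>>>>>
>>>>>>> #!/bin/bash
>>>>>>> #SBATCH --nodes=1
>>>>>>> #SBATCH --ntasks=8
>>>>>>> #SBATCH --gres=gpu:1    # assumes the site's gres setup exports GPU_DEVICE_ORDINAL, as above
>>>>>>> module load namd/2.12-multicore-cuda    # placeholder module name
>>>>>>> namd2 +setcpuaffinity +devices ${GPU_DEVICE_ORDINAL} \
>>>>>>>   +p${SLURM_NTASKS} run.namd >& run.log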
>>>>>>>
>>>>>>> Using HECBioSim's 3M-atom benchmark model, a single K80 card
>>>>>>> (presented as 2 distinct GPUs) could keep 8 E5-2680v4 CPU cores busy.
>>>>>>> But 16 or 28 cores (the maximum on a single node of ours) was hardly
>>>>>>> any faster with 2 GPUs than 8 cores.
>>>>>>>
>>>>>>> --
>>>>>>> Mike Renfro / HPC Systems Administrator, Information Technology
>>>>>>> Services
>>>>>>> 931 372-3601 / Tennessee Tech University
>>>>>>>
>>>>>>> > On Jan 25, 2018, at 12:59 PM, Souvik Sinha <
>>>>>>> souvik.sinha893_at_gmail.com> wrote:
>>>>>>> >
>>>>>>> > Thanks for your reply.
>>>>>>> > I was wondering why '+idlepoll' cannot even get the GPUs to work,
>>>>>>> > even if the performance would likely be poor.
>>>>>>> >
>>>>>>> > On 25 Jan 2018 19:53, "Giacomo Fiorin" <giacomo.fiorin_at_gmail.com>
>>>>>>> wrote:
>>>>>>> > Hi Souvik, this seems connected to the compilation options.
>>>>>>> > Compiling with MPI + SMP + CUDA used to give very poor performance,
>>>>>>> > although I haven't tried with the new CUDA kernels (2.12 and later).
>>>>>>> >
>>>>>>> > Giacomo
>>>>>>> >
>>>>>>> > On Thu, Jan 25, 2018 at 4:02 AM, Souvik Sinha <
>>>>>>> souvik.sinha893_at_gmail.com> wrote:
>>>>>>> > NAMD Users,
>>>>>>> >
>>>>>>> > I am trying to run replica exchange ABF simulations on a machine
>>>>>>> > with 32 cores and 2 Tesla K40 cards. NAMD_2.12, compiled from
>>>>>>> > source, is what I am using.
>>>>>>> >
>>>>>>> > From this earlier thread,
>>>>>>> > http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l.2014-2015/2490.html,
>>>>>>> > I found out that using "twoAwayX" or "idlepoll" might help the GPUs
>>>>>>> > to work, but somehow in my situation it is not helping the GPUs to
>>>>>>> > work ("twoAwayX" is speeding up the jobs, though). The 'idlepoll'
>>>>>>> > switch generally works fine with CUDA builds of NAMD for
>>>>>>> > non-replica jobs. From the aforesaid thread, I gather that running
>>>>>>> > 4 replicas on 32 CPUs and 2 GPUs may not provide a big boost to my
>>>>>>> > simulations, but I just want to check whether it works or not.
>>>>>>> >
>>>>>>> > The command I am running for the job is:
>>>>>>> > mpirun -np 32 /home/sgd/program/NAMD_2.12_Source/Linux-x86_64-g++/namd2
>>>>>>> +idlepoll +replicas 4 $inputfile +stdout log/job0.%d.log
>>>>>>> >
>>>>>>> > My understanding is not helping me much, so any advice will be
>>>>>>> helpful.
>>>>>>> >
>>>>>>> > Thank you
>>>>>>> >
>>>>>>> > --
>>>>>>> > Souvik Sinha
>>>>>>> > Research Fellow
>>>>>>> > Bioinformatics Centre (SGD LAB)
>>>>>>> > Bose Institute
>>>>>>> >
>>>>>>> > Contact: 033 25693275
>>>>>>> >
>>>>>>> >
>>>>>>> >
>>>>>>> > --
>>>>>>> > Giacomo Fiorin
>>>>>>> > Associate Professor of Research, Temple University, Philadelphia,
>>>>>>> PA
>>>>>>> > Contractor, National Institutes of Health, Bethesda, MD
>>>>>>> > http://goo.gl/Q3TBQU
>>>>>>> > https://github.com/giacomofiorin
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Giacomo Fiorin
>>>>>> Associate Professor of Research, Temple University, Philadelphia, PA
>>>>>> Contractor, National Institutes of Health, Bethesda, MD
>>>>>> http://goo.gl/Q3TBQU
>>>>>> https://github.com/giacomofiorin
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Giacomo Fiorin
>>>> Associate Professor of Research, Temple University, Philadelphia, PA
>>>> Contractor, National Institutes of Health, Bethesda, MD
>>>> http://goo.gl/Q3TBQU
>>>> https://github.com/giacomofiorin
>>>>
>>>
>>
>>
>> --
>> Giacomo Fiorin
>> Associate Professor of Research, Temple University, Philadelphia, PA
>> Contractor, National Institutes of Health, Bethesda, MD
>> http://goo.gl/Q3TBQU
>> https://github.com/giacomofiorin
>>
>
