Re: Replica exchange simulation with GPU Acceleration

From: Souvik Sinha (souvik.sinha893_at_gmail.com)
Date: Fri Jan 26 2018 - 13:35:12 CST

Ok, I get that. Thanks.
On 27 Jan 2018 12:48 a.m., "Giacomo Fiorin" <giacomo.fiorin_at_gmail.com>
wrote:

> I'm not familiar with how the new CUDA code manages concurrency with the
> GPU between different processes. Eventually, somebody at UIUC will provide
> some info.
>
> For sure, sharing a GPU is much worse than you might expect: you wouldn't
> just divide its speed in half. Transferring data to/from the GPU is one of
> the slowest operations, and the kernel will try to share time on the GPU
> between two processes in a manner that is completely unaware of the
> processes' compute loops. You may well end up with interrupted loops on
> the GPUs, thus losing much more than half the speed.
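>
> As a quick sanity check (assuming the NVIDIA driver tools are installed),
> you can watch which namd2 processes are attached to each GPU while the
> job runs, for example:
>
>     nvidia-smi
>     # or, refreshing every 2 seconds:
>     watch -n 2 nvidia-smi
>
> If more than one namd2 PID shows up under the same GPU in the process
> table, the replicas are time-sharing that device.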
>
> With NAMD being a performance-oriented code, there may very well be
> instructions that prevent you from doing that, either explicitly or
> implicitly as a result of the Charm++ scheduler.
>
> Giacomo
>
> On Fri, Jan 26, 2018 at 2:02 PM, Souvik Sinha <souvik.sinha893_at_gmail.com>
> wrote:
>
>> Ok, that sheds some light. I mentioned in my earlier post that I'm not
>> expecting much of a boost from the GPUs for replicas; I was just checking
>> whether the multiple-walker scheme can use the GPUs at all. I get that
>> launching more processes than there are GPUs is counterproductive.
>> Earlier, with the multicore-CUDA binary, single-process performance was
>> greatly improved by using the 2 GPUs.
>>
>> Just one question: is it because I launched 4 replicas over 2 GPUs that
>> the GPUs were not used at all? I mean, if I launch 2 replicas over 2 GPUs,
>> will that put the GPUs to work? Obviously I can check that myself and get
>> back to you. Thanks again.
>> On 27 Jan 2018 12:03 a.m., "Giacomo Fiorin" <giacomo.fiorin_at_gmail.com>
>> wrote:
>>
>>> The two multiple-walker schemes use different code. I wrote the one for
>>> metadynamics a few years back before NAMD had multiple-copy capability,
>>> using the file system. Jeff Comer and others at UIUC wrote the one for
>>> ABF, using the network: for this reason, its use is subject to the
>>> constraints of Charm++, where the simultaneous use of MPI and CUDA has so
>>> far been difficult.
>>>
>>> The network-based solution should be more scalable in large HPC
>>> clusters, but for a small commodity cluster of single-node replicas it
>>> should be OK.
>>>
>>> By the way, I just noticed that you are launching 4 copies of NAMD over
>>> 2 GPUs. Don't do that: each GPU must be assigned exclusively to one
>>> process, or its benefits go out the window.
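>>>
>>> As a rough illustration (illustrative core counts and file names, and
>>> assuming a multicore-CUDA build on a single node), giving each namd2
>>> process exclusive use of one GPU would look something like:
>>>
>>>     CUDA_VISIBLE_DEVICES=0 namd2 +p16 rep0.namd >& rep0.log &
>>>     CUDA_VISIBLE_DEVICES=1 namd2 +p16 rep1.namd >& rep1.log &
>>>
>>> Each process then sees only one device, so the two runs never compete
>>> for the same GPU.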
>>>
>>> Giacomo
>>>
>>> On Fri, Jan 26, 2018 at 1:24 PM, Souvik Sinha <souvik.sinha893_at_gmail.com
>>> > wrote:
>>>
>>>> Thanks for the replies. I get that in the present scenario it is going
>>>> to be hard to get the GPU resources for my replica runs, because of the
>>>> difficulty of combining the GPU parallelisation scheme with MPI execution.
>>>>
>>>> Is the replica exchange scheme for multiple-walker ABF implemented
>>>> differently from the one for metadynamics or the other NAMD replica
>>>> exchange strategies? I am just curious, because my understanding in this
>>>> regard is not up to the mark.
>>>> On 26 Jan 2018 20:43, "Giacomo Fiorin" <giacomo.fiorin_at_gmail.com>
>>>> wrote:
>>>>
>>>>> In general the multicore version (i.e. SMP with no network) is the
>>>>> best approach for CUDA, provided that the system is small enough. With
>>>>> nearly everything offloaded to the GPUs in the recent version, the CPUs are
>>>>> mostly idle, and adding more CPU cores only clogs up the motherboard bus.
>>>>>
>>>>> Running CUDA jobs in parallel, particularly with MPI, is a whole other
>>>>> endeavor.
>>>>>
>>>>> In Souvik's case, the setup is difficult to run fast. You may consider
>>>>> using the multicore version for multiple-replica metadynamics runs,
>>>>> which can communicate between replicas using files and do not need
>>>>> MPI.
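>>>>>
>>>>> Roughly, a file-based multiple-walker metadynamics block in the Colvars
>>>>> config looks something like the sketch below (illustrative names and
>>>>> values; check the Colvars reference manual for the exact keywords):
>>>>>
>>>>>     metadynamics {
>>>>>         colvars                 myColvar      # hypothetical colvar name
>>>>>         hillWeight              0.1
>>>>>         multipleReplicas        on
>>>>>         replicaID               walker1       # unique per replica
>>>>>         replicasRegistry        /shared/path/replicas.registry.txt
>>>>>         replicaUpdateFrequency  1000          # steps between file syncs
>>>>>     }
>>>>>
>>>>> Each walker is then just an independent namd2 run (multicore-CUDA works
>>>>> fine), with all walkers pointing at the same registry file on a shared
>>>>> filesystem.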
>>>>>
>>>>> Giacomo
>>>>>
>>>>> On Thu, Jan 25, 2018 at 2:40 PM, Renfro, Michael <Renfro_at_tntech.edu>
>>>>> wrote:
>>>>>
>>>>>> I can’t speak for running replicas as such, but my usual way of
>>>>>> running on a single node with GPUs is to use the multicore-CUDA NAMD build,
>>>>>> and to run namd as:
>>>>>>
>>>>>> namd2 +setcpuaffinity +devices ${GPU_DEVICE_ORDINAL}
>>>>>> +p${SLURM_NTASKS} ${INPUT} >& ${OUTPUT}
>>>>>>
>>>>>> Where ${GPU_DEVICE_ORDINAL} is “0”, “1”, or “0,1” depending on which
>>>>>> GPU I reserve; ${SLURM_NTASKS} is the number of cores needed, and ${INPUT}
>>>>>> and ${OUTPUT} are the NAMD input file and the file to record standard
>>>>>> output.
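>>>>>>
>>>>>> For example, with both GPUs and 16 cores reserved, and hypothetical
>>>>>> file names, that expands to something like:
>>>>>>
>>>>>>     namd2 +setcpuaffinity +devices 0,1 +p16 myrun.namd >& myrun.log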
>>>>>>
>>>>>> Using HECBioSim's 3M-atom benchmark model, a single K80 card
>>>>>> (presented as 2 distinct GPUs) could keep 8 E5-2680v4 CPU cores busy. But
>>>>>> 16 or 28 cores (the maximum on a single node of ours) were hardly any
>>>>>> faster with the 2 GPUs than 8 cores were.
>>>>>>
>>>>>> --
>>>>>> Mike Renfro / HPC Systems Administrator, Information Technology
>>>>>> Services
>>>>>> 931 372-3601 / Tennessee Tech University
>>>>>>
>>>>>> > On Jan 25, 2018, at 12:59 PM, Souvik Sinha <
>>>>>> souvik.sinha893_at_gmail.com> wrote:
>>>>>> >
>>>>>> > Thanks for your reply.
>>>>>> > I was wondering why '+idlepoll' can't even get the GPUs to work at
>>>>>> > all, even if the performance would be poor.
>>>>>> >
>>>>>> > On 25 Jan 2018 19:53, "Giacomo Fiorin" <giacomo.fiorin_at_gmail.com>
>>>>>> wrote:
>>>>>> > Hi Souvik, this seems connected to the compilation options.
>>>>>> > Compiling with MPI + SMP + CUDA used to give very poor performance,
>>>>>> > although I haven't tried with the new CUDA kernels (2.12 and later).
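>>>>>> >
>>>>>> > (For reference, the combination I mean is a build configured roughly
>>>>>> > like the following, with the exact options depending on your Charm++
>>>>>> > and CUDA installation:
>>>>>> >
>>>>>> >     ./config Linux-x86_64-g++ --charm-arch mpi-linux-x86_64-smp --with-cuda
>>>>>> >
>>>>>> > i.e. an MPI-based SMP Charm++ plus the CUDA kernels.)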
>>>>>> >
>>>>>> > Giacomo
>>>>>> >
>>>>>> > On Thu, Jan 25, 2018 at 4:02 AM, Souvik Sinha <
>>>>>> souvik.sinha893_at_gmail.com> wrote:
>>>>>> > NAMD Users,
>>>>>> >
>>>>>> > I am trying to run replica-exchange ABF simulations on a machine
>>>>>> > with 32 cores and 2 Tesla K40 cards, using NAMD 2.12 compiled from
>>>>>> > source.
>>>>>> >
>>>>>> > From this earlier thread,
>>>>>> > http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l.2014-2015/2490.html,
>>>>>> > I found out that using "twoAwayX" or "+idlepoll" might help put the
>>>>>> > GPUs to work, but in my situation neither gets the GPUs going
>>>>>> > ("twoAwayX" does speed up the jobs, though). The "+idlepoll" switch
>>>>>> > generally works fine with CUDA builds of NAMD for non-replica jobs.
>>>>>> > From the aforesaid thread, I get that running 4 replicas on 32 CPUs
>>>>>> > and 2 GPUs may not provide a big boost to my simulations, but I just
>>>>>> > want to check whether it works at all.
>>>>>> >
>>>>>> > The command I am running for the job is:
>>>>>> > mpirun -np 32 /home/sgd/program/NAMD_2.12_Source/Linux-x86_64-g++/namd2
>>>>>> > +idlepoll +replicas 4 $inputfile +stdout log/job0.%d.log
>>>>>> >
>>>>>> > My understanding is not getting me very far, so any advice would be
>>>>>> > helpful.
>>>>>> >
>>>>>> > Thank you
>>>>>> >
>>>>>> > --
>>>>>> > Souvik Sinha
>>>>>> > Research Fellow
>>>>>> > Bioinformatics Centre (SGD LAB)
>>>>>> > Bose Institute
>>>>>> >
>>>>>> > Contact: 033 25693275
>>>>>> >
>>>>>> >
>>>>>> >
>>>>>> > --
>>>>>> > Giacomo Fiorin
>>>>>> > Associate Professor of Research, Temple University, Philadelphia, PA
>>>>>> > Contractor, National Institutes of Health, Bethesda, MD
>>>>>> > http://goo.gl/Q3TBQU
>>>>>> > https://github.com/giacomofiorin
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Giacomo Fiorin
>>>>> Associate Professor of Research, Temple University, Philadelphia, PA
>>>>> Contractor, National Institutes of Health, Bethesda, MD
>>>>> http://goo.gl/Q3TBQU
>>>>> https://github.com/giacomofiorin
>>>>>
>>>>
>>>
>>>
>>> --
>>> Giacomo Fiorin
>>> Associate Professor of Research, Temple University, Philadelphia, PA
>>> Contractor, National Institutes of Health, Bethesda, MD
>>> http://goo.gl/Q3TBQU
>>> https://github.com/giacomofiorin
>>>
>>
>
>
> --
> Giacomo Fiorin
> Associate Professor of Research, Temple University, Philadelphia, PA
> Contractor, National Institutes of Health, Bethesda, MD
> http://goo.gl/Q3TBQU
> https://github.com/giacomofiorin
>
