RE: replica exchange and GPU acceleration

From: Norman Geist
Date: Mon Jul 13 2015 - 00:51:11 CDT

Good that you got it running so far.


The error about “requires at least one patch per thread” simply means that your system is too small for that amount of computing resources. You can work around it by setting “twoawayx yes” in the config file, which artificially increases the number of patches (box slices) available for parallelization.


For small systems you should always check the impact of “twoawayx yes”. Usually it brings a two-fold speedup on GPUs.
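To make the patch argument concrete, here is a rough Python sketch of why halving the patch size along x can lift a small system over the "one patch per thread" limit. It is a deliberately simplified model and not NAMD's actual decomposition code: the function estimate_patches, the 30 Å box, and the 16 Å patch size are all hypothetical values chosen for illustration.

```python
import math

def estimate_patches(box, patch_size, two_away_x=False):
    """Rough patch-count estimate (simplified model, not NAMD's real logic).

    Assumes one patch per `patch_size` along each axis, and that
    "twoawayx yes" halves the patch size along x only.
    """
    size_x = patch_size / 2 if two_away_x else patch_size
    nx = max(1, math.floor(box[0] / size_x))
    ny = max(1, math.floor(box[1] / patch_size))
    nz = max(1, math.floor(box[2] / patch_size))
    return nx * ny * nz

box = (30.0, 30.0, 30.0)   # hypothetical small box, in Angstrom
patch_size = 16.0          # hypothetical minimum patch size, in Angstrom
threads_per_replica = 2    # e.g. 8 processes shared by 4 replicas

for two_away_x in (False, True):
    patches = estimate_patches(box, patch_size, two_away_x)
    enough = patches >= threads_per_replica
    print(f"twoawayx={two_away_x}: {patches} patch(es), "
          f"enough for {threads_per_replica} threads: {enough}")
```

Under these assumed numbers the box holds a single patch without the option and three with it, which is the difference between hitting the error and running with two threads per replica.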


This might also improve the desired 16core+16replica+4gpu case.


Also try passing +idlepoll to the namd2 binary, which can again give a two-fold speedup and never hurts.


Norman Geist.


From: [] On Behalf Of Mitchell Gleed
Sent: Monday, July 13, 2015 5:37 AM
To: NAMD list; Norman Geist
Subject: Re: namd-l: replica exchange and GPU acceleration


Sorry for the late reply, I've been utilizing the university supercomputer's GPU nodes for other simulations the past week and couldn't test this out until those simulations finished up.

Since the GPU nodes have 24 cores, I followed your suggestion to do 4 replicas with 8 processes since I can't do 16 replicas with 32 processes. With this setup, I started getting the error "CUDA-enabled NAMD requires at least one patch per thread" for the namd/lib/replica/example test case.

I thought maybe the error meant I could only use CUDA-enabled NAMD with a PME system, so I made a test case for a PME system by adapting the lib/replica/umbrella-2d case. I'm now able to get the GPUs to accelerate the replica exchange simulations, even with 1 replica : 1 GPU : 1 process. However, I've found the GPUs only help if there is one GPU per replica; when the number of replicas exceeds the number of GPUs, simulations run slower with the GPUs than without. I assume that might just be the way things have to be, but if there's anything else I can try to get my ideal case of 16 replicas : 16 procs : 4 GPUs to benefit from the GPUs, that'd be great.

Here are the benchmark results for the ~30k atom system I tested, in case anyone's interested:
replicas  procs  GPUs  days/ns
4         4      0     1.61468
4         4      4     0.669901
4         8      0     1.11726
4         8      4     0.445677
4         16     0     1.03864
16        16     0     1.87094
16        16     4     2.52038
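For anyone who wants the GPU effect as a ratio, the short Python sketch below divides the CPU-only time by the GPU time for each configuration that was run both ways. The numbers are copied from the benchmark above; the names results and gpu_speedup are just illustrative helpers.

```python
# (replicas, procs, gpus) -> days/ns, copied from the benchmark results
results = {
    (4, 4, 0): 1.61468,
    (4, 4, 4): 0.669901,
    (4, 8, 0): 1.11726,
    (4, 8, 4): 0.445677,
    (4, 16, 0): 1.03864,
    (16, 16, 0): 1.87094,
    (16, 16, 4): 2.52038,
}

def gpu_speedup(replicas, procs, gpus):
    """CPU-only days/ns divided by GPU days/ns; values above 1 mean the GPUs help."""
    return results[(replicas, procs, 0)] / results[(replicas, procs, gpus)]

for replicas, procs, gpus in sorted(results):
    if gpus == 0:
        continue  # CPU-only rows are the baselines
    ratio = gpu_speedup(replicas, procs, gpus)
    print(f"{replicas} replicas, {procs} procs, {gpus} GPUs: "
          f"{ratio:.2f}x relative to CPU-only")
```

This gives roughly 2.41x and 2.51x for the two 4-replica cases, and 0.74x (a slowdown) for 16 replicas sharing 4 GPUs.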

Thanks for your help, Norman.

This archive was generated by hypermail 2.1.6 : Tue Dec 27 2016 - 23:21:13 CST