From: Vermaas, Joshua (Joshua.Vermaas_at_nrel.gov)
Date: Mon Jul 31 2017 - 13:57:50 CDT
Hi Alexander,
On #1, my recollection is that NAMD is still CPU bound. For the apoa1
benchmark on my own personal desktop (Quadro M5000, E5-2687W) using a
CVS build that came after the last of Antti-Pekka's commits to the GPU
code, this is what I see in terms of performance based on the number of
cores I throw at it:
Cores, s/step
1, 0.055
2, 0.028
4, 0.015
6, 0.010   # performance stagnates here; probably GPU-bound at this point
8, 0.010
10, 0.010
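For what it's worth, you can read the crossover point straight out of those numbers by computing speedup and parallel efficiency. A quick back-of-the-envelope sketch (the timings are just the ones from the table above, nothing new):

```python
# Rough scaling analysis of the apoa1 timings above (Quadro M5000 + E5-2687W).
# Once efficiency collapses, extra cores are wasted: the GPU is the bottleneck.
timings = {1: 0.055, 2: 0.028, 4: 0.015, 6: 0.010, 8: 0.010, 10: 0.010}

base = timings[1]
for cores, s_per_step in sorted(timings.items()):
    speedup = base / s_per_step          # relative to the 1-core run
    efficiency = speedup / cores         # 1.0 = perfect scaling
    print(f"{cores:2d} cores: {speedup:4.1f}x speedup, {efficiency:5.1%} efficiency")
```

Efficiency stays above 90% through 6 cores and then falls off a cliff, which is exactly the "GPU-bound" wall in the table.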
To me, this suggests that each GPU needs between 4 and 8 CPU cores just to handle the work left on the CPU and keep the GPU from sitting idle; beyond that, throwing more cores at the problem doesn't help, since it is GPU-bound. On paper, the ratio you are proposing (4-ish cores per GPU) doesn't seem crazy to me, but unlike GROMACS, NAMD doesn't warn you if the balance between CPU work and GPU work is off, and my hardware != your proposed hardware, so just take this for what it is: a single datapoint. :)
Also, keep in mind that there are some parts of the NAMD code that
CANNOT be run on GPUs (like the alchemical stuff). Honestly, the EPYC
sounds like a great place to save money, since then you could feasibly
go down to 1-socket motherboards instead of 2-socket ones, in addition
to the CPU cost savings.
2. At a certain point, GROMACS has the same problem as NAMD: there are some things the CPU does on its own that limit the benefit of adding more GPUs. I've never done extensive testing to figure out where that point is, unfortunately. On my desktop, though, it's usually the GPU that sits idle, which would point to fewer GPUs in favor of more CPU threads. AMBER has the opposite problem: there you typically want a ton of GPUs per CPU.
3. I mean, technically the best efficiency always comes from using a
single node at a time, forgetting about any fast interconnects. However,
if you need an answer for a project on a tight deadline, I think you'll
be kicking yourself for not having the flexibility of just throwing more
processors at the problem until it is solved. Gigabit ethernet just
doesn't cut it in terms of latency.
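To put some rough numbers on that last point: at ~0.010 s/step, even a modest number of latency-bound messages per step eats a visible chunk of the step time over ethernet. The latencies and message count below are ballpark assumptions for illustration, not measurements:

```python
# Back-of-the-envelope: what interconnect latency does to a 10 ms timestep.
# Assumed ballpark latencies: gigabit ethernet ~50 us/message, Infiniband ~2 us.
step_time = 0.010           # s/step on a single node (from the apoa1 numbers above)
messages_per_step = 40      # assumed number of latency-bound exchanges per step

for name, latency in [("gigabit ethernet", 50e-6), ("Infiniband", 2e-6)]:
    overhead = messages_per_step * latency
    print(f"{name}: {overhead*1e3:.1f} ms latency overhead "
          f"({overhead / step_time:.0%} of a step)")
```

Under those assumptions, ethernet burns ~20% of every step on latency alone before any bandwidth limits kick in, while Infiniband's overhead is down in the noise.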
4. 16GB is probably already overkill, so long as you stay away from really big systems and don't expect to do analysis on the cluster.
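A quick sanity check on why: even with a deliberately generous per-atom memory budget, a 300,000-atom system is nowhere near 16GB. The ~1 KB/atom figure below is an assumed upper bound, not a measured NAMD number:

```python
# Very rough memory estimate for a classical MD run. The ~1 KB/atom figure is
# a deliberately generous assumption (coordinates, velocities, forces, pair
# lists, exclusion lists, force-field parameters), not a measured NAMD value.
atoms = 300_000
bytes_per_atom = 1024                     # assumed upper bound
total_gb = atoms * bytes_per_atom / 1024**3
print(f"{atoms} atoms -> roughly {total_gb:.2f} GB")  # well under 1 GB
```

Even padding that by an order of magnitude for PME grids and runtime overhead leaves you comfortably inside 16GB per node.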
-Josh
On 07/31/2017 04:43 AM, Vogel, Alexander wrote:
> Hello everybody,
>
> I'm currently highly involved in the planning of a new HPC cluster for MD simulations. The main applications are NAMD and GROMACS (sometimes in conjunction with PLUMED). Typical simulations are about 100,000 atoms, up to 300,000 atoms at most. We got a quote from a manufacturer, and I have a few questions regarding the details that probably can only be answered with some experience...and that's why I'm asking here:
>
> 1. The compute nodes contain 2x Intel Xeon Broadwell-EP E5-2680v4 (each 14 cores, 2.4GHz base clock) and 8x GTX 1080 Ti. That is very GPU-focused, and from what I read in the NAMD 2.13 release notes that might make sense, because almost everything seems to be offloaded to the GPU now. But I can't find any useful benchmarks. What do you think? Can 28 CPU cores fuel eight 1080 Ti GPUs? It is also rather likely that we will end up using AMD EPYC CPUs once they are out, so that we would have more cores, more PCIe lanes, and higher memory bandwidth in the end.
>
> 2. I know this is a NAMD mailing list, but if someone happens to know GROMACS well: the same question as above, just for GROMACS. There are even fewer recent benchmarks with GPUs. I found this one from Nvidia, which seems to suggest that it only scales well up to two GPUs: https://www.nvidia.com/object/gromacs-benchmarks.html
>
> 3. Currently the quote contains Infiniband. However, given the computational power of a single node, I could imagine that a simulation (which will not be excessive in size...we only plan to run simulations up to 300,000 atoms) would not scale well to two or more nodes. If that is the case, we could drop Infiniband and invest the money in more nodes. What do you think about this?
>
> 4. Currently the quote contains 64GB of RAM for each compute node. That seems very high to me, as in my experience MD simulations only take up a few GB at most for "reasonable" system sizes (we only plan to run simulations up to 300,000 atoms). Using 32GB instead could also save some money. What do you think about this?
>
> Any help would be highly appreciated,
>
> Alexander
>
This archive was generated by hypermail 2.1.6 : Mon Dec 31 2018 - 23:20:28 CST