Re: Fwd: About wall clock time

From: Renfro, Michael (Renfro_at_tntech.edu)
Date: Mon Oct 16 2017 - 08:18:27 CDT

Two things I’ve found that influence benchmarking:

- model size: smaller models don’t provide enough compute work before needing to communicate back across cores and nodes
- network interconnect: on a modern Xeon system, gigabit Ethernet is a bottleneck, at least on large models (possibly all models)

I benchmarked a relatively similar system starting in July (Dell 730 and 6320, Infiniband, K80 GPUs in the 730 nodes). Results are at [1]. If I wasn’t using an ibverbs-smp build of NAMD and was instead using the regular TCP version, 2 nodes gave slower run times than 1. 20k-atom models topped out at around five 28-core nodes, and 3M-atom models kept getting better run times even out to 34 28-core nodes.

A 73k-atom system certainly should show a consistent speedup across your 6 nodes, though. And a CUDA-enabled build showed a 3-5x speedup compared to a non-CUDA run in our tests, so 1-2 of your GPU nodes could run as fast as all your non-GPU nodes combined.
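Just to put numbers on what a "consistent speedup" would look like: below is a quick Python sketch (my own, not anything shipped with NAMD) that turns the wall-clock times from your table into speedup and parallel efficiency, taking the 1-node run as the baseline. With your times, speedup falls below 1 for every multi-node run, which is the opposite of what a healthy interconnect should give you.

    # Speedup and parallel efficiency from the wall-clock times in the
    # table quoted below; the 1-node run is taken as the baseline.
    wall_clock_s = {
        1: 27568.892578,
        2: 28083.976562,
        3: 30725.347656,
        4: 33117.160156,
        5: 35750.988281,
        6: 39922.492188,
    }

    baseline = wall_clock_s[1]
    for nodes, seconds in sorted(wall_clock_s.items()):
        speedup = baseline / seconds    # > 1.0 means faster than one node
        efficiency = speedup / nodes    # 1.0 would be perfect linear scaling
        print(f"{nodes} node(s): speedup {speedup:.2f}x, efficiency {efficiency:.2f}")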

So check your NAMD build features for ibverbs, and verify that your Infiniband fabric is working correctly. I used [2] for checking Infiniband, even though I’m not using Debian on my cluster.
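Not an official NAMD utility, just a sketch of the kind of check I mean: scan the top of a NAMD run log for the version/platform line, which normally names the network layer (e.g. ibverbs or ibverbs-smp), and flag builds that don’t mention ibverbs. The log filename and the exact wording of that line are assumptions on my part; adjust for your own job output.

    # Minimal sketch: look for the NAMD build/platform line in a run log
    # and report whether it mentions ibverbs. Log path is an assumption.
    import sys

    def check_namd_build(log_path: str) -> None:
        with open(log_path) as log:
            for line in log:
                # NAMD prints its version/platform near the top of the log,
                # e.g. "Info: NAMD 2.12 for Linux-x86_64-ibverbs-smp"
                if line.startswith("Info: NAMD") and " for " in line:
                    platform = line.strip()
                    print(platform)
                    if "ibverbs" in platform:
                        print("Build appears to use Infiniband verbs.")
                    else:
                        print("No 'ibverbs' in the platform string; "
                              "this may be a plain TCP/UDP (net) build.")
                    return
        print("No NAMD version line found; is this a NAMD log?")

    if __name__ == "__main__":
        check_namd_build(sys.argv[1] if len(sys.argv) > 1 else "namd_run.log")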

[1] https://its.tntech.edu/display/MON/HPC+Sample+Job%3A+NAMD
[2] https://pkg-ofed.alioth.debian.org/howto/infiniband-howto.html

-- 
Mike Renfro  / HPC Systems Administrator, Information Technology Services
931 372-3601 / Tennessee Tech University
> On Oct 16, 2017, at 1:20 AM, Rik Chakraborty <rik.chakraborty01_at_gmail.com> wrote:
> 
> Dear NAMD experts,
> 
> Recently, we have installed a new cluster and the configurations are following below,
> 
> 1. Master node with storage node- DELL PowerEdge R730xd Server
> 2. CPU only node- DELL PowerEdge R430 Server (6 nos.)
> 3. GPU node- DELL PowerEdge R730 Server (3 nos.)
> 4. 18 ports Infiniband Switch- Mellanox SX6015 
> 5. 24 ports Gigabit Ethernet switch- D-link make
> 
> We have run a NAMD job using this cluster to check the efficiency in time with an increasing number of CPU nodes. Each CPU node has 24 processors. The details of the given system and the outcomes are listed below:
> 
> 1. No. of atoms used: 73310
> 2. Total simulation time: 1 ns
> 3. Time step: 2 fs
> 
> No. of nodes    Wall Clock Time (s)
> 1               27568.892578
> 2               28083.976562
> 3               30725.347656
> 4               33117.160156
> 5               35750.988281
> 6               39922.492188
> 
> 
> As we can see, the wall clock time increases as the number of CPU nodes increases, which is not expected.
> 
> So, this is my kind request: please check this out and let me know what the problem might be.
> 
> Thanking you,
> 
> Rik Chakraborty
> Junior Research Fellow (Project)
> Dept. of Biological Sciences
> Indian Institute of Science Education and Research, Kolkata
> Mohanpur, Dist. Nadia
> Pin 721246
> West Bengal, India
> 
