Re: Fwd: About wall clock time

From: Giacomo Fiorin (giacomo.fiorin_at_gmail.com)
Date: Mon Oct 16 2017 - 12:16:29 CDT

Can you double-check that you are actually launching tasks on all requested
nodes? The fact that the time increases only slightly leads me to think that
you may be oversubscribing the first node, i.e. dividing the same work among
the same CPU cores but running more tasks on each core. In theory this should
make no difference, but the extra communication overhead will make things go a
bit slower.
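
For what it's worth, a minimal sketch of an explicit multi-node launch for a
plain (non-SMP) ibverbs build (host names and file names here are placeholders):

  # nodelist file: one "host" line per compute node
  group main
  host cpu-node-01
  host cpu-node-02
  host cpu-node-03
  host cpu-node-04
  host cpu-node-05
  host cpu-node-06

  # 6 nodes x 24 cores = 144 worker processes
  charmrun ++nodelist ./nodelist +p144 namd2 run.namd > run.log

If all the nodes are really in use, the startup section of the log should
report the matching number of physical nodes rather than just one.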

What is the NAMD build and how are you launching it?

On Mon, Oct 16, 2017 at 9:48 AM, Chitrak Gupta <chgupta_at_mix.wvu.edu> wrote:

> Hi Rik,
>
> Any specific reason why you are looking at the wall clock time and not the
> benchmark times in your log file? From what I understand, the benchmark times
> are a better measure of performance, since they are taken after startup and
> initial load balancing.
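>
> They are easy to pull out of the log, for example (the numbers here are only
> illustrative, and the file name is a placeholder):
>
>   grep "Benchmark time" run.log
>   Info: Benchmark time: 24 CPUs 0.0612 s/step 0.354 days/ns 480.2 MB memory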
>
>
> Chitrak.
>
> On Mon, Oct 16, 2017 at 9:18 AM, Renfro, Michael <Renfro_at_tntech.edu>
> wrote:
>
>> Two things I’ve found that influence benchmarking:
>>
>> - model size: smaller models don’t provide enough compute work before
>> needing to communicate back across cores and nodes
>> - network interconnect: on a modern Xeon system, gigabit Ethernet is a
>> bottleneck, at least on large models (possibly all models)
>>
>> I benchmarked a relatively similar system starting in July (Dell 730 and
>> 6320, Infiniband, K80 GPUs in the 730 nodes). Results are at [1]. When I
>> wasn’t using an ibverbs-smp build of NAMD and used the regular TCP version
>> instead, 2 nodes gave slower run times than 1. 20k-atom models topped out at
>> around 5 28-core nodes, while 3M-atom models kept getting better run times,
>> even out to 34 28-core nodes.
>>
>> A 73k-atom system certainly should show a consistent speedup across your 6
>> nodes, though. And in our tests a CUDA-enabled build showed a 3-5x speedup
>> over a non-CUDA run, so 1-2 of your GPU nodes could run as fast as all your
>> non-GPU nodes combined.
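>>
>> For reference (the core count and device IDs below are placeholders for your
>> hardware), a single-GPU-node run with a CUDA multicore build is typically
>> launched along the lines of:
>>
>>   namd2 +p24 +setcpuaffinity +devices 0,1 run.namd > run.log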
>>
>> So check your NAMD build features for ibverbs support, and verify that your
>> Infiniband fabric is working correctly. I used [2] for checking Infiniband,
>> even though I’m not running Debian on my cluster.
>>
>> [1] https://its.tntech.edu/display/MON/HPC+Sample+Job%3A+NAMD
>> [2] https://pkg-ofed.alioth.debian.org/howto/infiniband-howto.html
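>>
>> A few quick checks (the log file name here is a placeholder; ibstat and
>> ibv_devinfo come from the standard infiniband-diags and libibverbs tools):
>>
>>   grep "Info: NAMD" run.log   # the version line names the build platform, e.g. ibverbs or ibverbs-smp
>>   ibstat                      # port State should be Active, Physical state LinkUp
>>   ibv_devinfo                 # confirms the HCA is visible to libibverbs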
>>
>> --
>> Mike Renfro / HPC Systems Administrator, Information Technology Services
>> 931 372-3601 / Tennessee Tech University
>>
>> > On Oct 16, 2017, at 1:20 AM, Rik Chakraborty <
>> rik.chakraborty01_at_gmail.com> wrote:
>> >
>> > Dear NAMD experts,
>> >
>> > Recently we installed a new cluster with the following configuration:
>> >
>> > 1. Master node with storage: Dell PowerEdge R730xd server
>> > 2. CPU-only nodes: Dell PowerEdge R430 server (6 units)
>> > 3. GPU nodes: Dell PowerEdge R730 server (3 units)
>> > 4. 18-port Infiniband switch: Mellanox SX6015
>> > 5. 24-port Gigabit Ethernet switch: D-Link
>> >
>> > We ran a NAMD job on this cluster to check the time efficiency with an
>> > increasing number of CPU nodes. Each CPU node has 24 processors. The
>> > details of the system and the results are listed below:
>> >
>> > 1. No. of atoms used: 73310
>> > 2. Total simulation time: 1 ns
>> > 3. Time step: 2 fs
>> >
>> > No. of nodes    Wall Clock Time (s)
>> > 1               27568.892578
>> > 2               28083.976562
>> > 3               30725.347656
>> > 4               33117.160156
>> > 5               35750.988281
>> > 6               39922.492188
>> >
>> >
>> > As you can see, the wall clock time increases as the number of CPU nodes
>> > increases, which is not what we expected.
>> >
>> > Could you please look into this and let me know what the problem might be?
>> >
>> > Thanking you,
>> >
>> > Rik Chakraborty
>> > Junior Research Fellow (Project)
>> > Dept. of Biological Sciences
>> > Indian Institute of Science Education and Research, Kolkata
>> > Mohanpur, Dist. Nadia
>> > Pin 721246
>> > West Bengal, India
>> >
>> >
>> >
>> >
>>
>>
>>
>

-- 
Giacomo Fiorin
Associate Professor of Research, Temple University, Philadelphia, PA
Contractor, National Institutes of Health, Bethesda, MD
http://goo.gl/Q3TBQU
https://github.com/giacomofiorin
