RE: Is there any point in running NAMD over an ethernet-linked cluster?

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Thu Apr 16 2015 - 10:39:38 CDT

> -----Original Message-----
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> Behalf Of Douglas Houston
> Sent: Thursday, April 16, 2015 12:10 PM
> To: Michael Charlton
> Cc: NAMD list
> Subject: Re: namd-l: Is there any point in running NAMD over an
> ethernet-linked cluster?
>
> Hi Michael,

Hey,

>
> I can't recall off the top of my head what we did to resolve that
> particular issue (I do remember having to switch off firewalls),
> although the thread continues here:
>
> http://www.ks.uiuc.edu/Research//namd/mailing_list/namd-l.2014-2015/0772.html
>
> but I can tell you it was a waste of time anyway: even if you can get
> it to run, the speedup you see by running NAMD across multiple nodes
> is nonexistent if they are linked via a standard ethernet network
> using cheap consumer-grade switches. Others have reported the same to
> me.

This is a question of the ratio between computing power and network
bandwidth (and latency).
I couldn't tell what kind of ethernet hardware you guys are actually
talking about: Fast Ethernet (100 Mbit/s) or Gigabit Ethernet (1000 Mbit/s,
i.e. 1 Gbit/s)?
With 1 Gbit/s Ethernet you CAN get useful scaling, but, as already
mentioned, only if its relation to the computing power of the nodes allows
it.
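
To put rough numbers on that relation, here is a little back-of-the-envelope
sketch in Python (every figure in it is a made-up assumption for
illustration, not a measured NAMD value):

def step_time_estimate(atoms, flops_per_atom, node_gflops,
                       bytes_per_atom_on_wire, link_gbit_s,
                       latency_us, msgs_per_step):
    # Assumed cost model: time one node spends computing one MD step.
    compute_s = atoms * flops_per_atom / (node_gflops * 1e9)
    # Time to push the step's boundary data through the link, plus latency.
    comm_s = ((atoms * bytes_per_atom_on_wire * 8) / (link_gbit_s * 1e9)
              + msgs_per_step * latency_us * 1e-6)
    return compute_s, comm_s

# Illustrative example: 80,000 atoms, ~5000 flops/atom/step on a ~50 GFLOP/s
# node, ~10 bytes/atom/step over Gigabit Ethernet with ~50 us latency and
# ~20 messages per step.
compute_s, comm_s = step_time_estimate(80000, 5000, 50, 10, 1.0, 50, 20)
print("compute  ~%.1f ms/step" % (compute_s * 1e3))
print("comms    ~%.1f ms/step" % (comm_s * 1e3))
# Once the communication time approaches the compute time, adding more nodes
# on the same network buys you (almost) nothing.

The point: faster nodes (or GPUs) shrink the compute time per step while the
communication time stays the same, and that is exactly when Gigabit stops
being enough.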

>
> It's probably related to latency; NAMD appears to have strict network
> hardware requirements that are not actually published.

This is not true. NAMD actually doesn't have "special", "strict" or
unusually "high" requirements for the node interconnect, and it scales
quite well even on cheap commodity clusters with Gigabit Ethernet. What
needs to be understood here is parallel scaling in general; see Amdahl's
law.
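
And since I mention Amdahl's law, a minimal sketch of what it implies for a
cluster like this (the serial/communication fractions below are made-up
examples, nothing measured):

def amdahl_speedup(serial_fraction, nodes):
    # Amdahl's law: the non-parallelisable (or communication-bound) share
    # 'serial_fraction' caps the speedup at 1/serial_fraction.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / nodes)

for s in (0.05, 0.20):          # 5% vs. 20% serial/communication share
    for n in (2, 4, 8, 16):
        print("serial=%2.0f%%  nodes=%2d  speedup=%.2f"
              % (s * 100, n, amdahl_speedup(s, n)))
# With a 20% share the speedup can never exceed 5x, however many nodes you
# add; with 5% the ceiling is 20x. The interconnect decides which case you
# are in.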

>
> This is the case for a single simulation; things like REMD might work
> better.
>

Agreed, as the individual replicas each run on their own small piece of
the cluster and therefore need less network bandwidth.
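
Just to illustrate the partitioning idea: the sketch below pins each of a
set of independent replicas to its own pair of nodes, so almost no MD
traffic crosses the switch. Host names, core counts, config file names and
the exact charmrun/namd2 launch line are placeholders to adapt to your
cluster; NAMD's built-in multi-copy replica exchange does the equivalent
partitioning internally.

nodes = ["node%02d" % i for i in range(8)]   # hypothetical host names
cores_per_node = 8
nodes_per_replica = 2

for rep, start in enumerate(range(0, len(nodes), nodes_per_replica)):
    group = nodes[start:start + nodes_per_replica]
    nodelist = "nodelist.rep%d" % rep
    with open(nodelist, "w") as f:            # Charm++ nodelist file format
        f.write("group main\n")
        for host in group:
            f.write("host %s\n" % host)
    procs = cores_per_node * len(group)
    # Placeholder launch line; replica_<n>.conf is an assumed config name.
    print("charmrun +p%d ++nodelist %s namd2 replica_%d.conf"
          % (procs, nodelist, rep))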

> cheers,
> Doug
>
>
> Quoting Michael Charlton <michael.charlton_at_inhibox.com> on Thu, 16 Apr
> 2015 09:37:29 +0100:
>
> > Hi Douglas,
> > I have been following your problem with getting NAMD running in
> > parallel on
> > http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l.2013-2014/2687.html
> > It seems that I have an identical problem and I cannot see a final
> > resolution on this thread. Can you tell me if you managed to get it
> > working and what your solution was ?
> > (I am running on a Centos/Rocks cluster where my file space is
> > shared between all nodes and hence they are all trying to read the
> > same ssh keys which seems to cause the clash).
> >
> > Many thanks,
> > Michael Charlton, InhibOx
> >
> >
>
>
>
> Quoting Nicholas M Glykos <glykos_at_mbg.duth.gr> on Mon, 22 Sep 2014
> 11:40:59 +0300 (EEST):
>
> >
> >
> >> Thank you very much for that, it is supremely helpful. I am going to try
> >> to replicate the various benchmarking tests you describe in your link. To
> >> that end, I wonder if you would be able to supply me with the 60,000-atom
> >> ionized.psf and heat_out.coor files you used so that my steps match yours
> >> as closely as possible?
> >
> > I don't expect scaling problems to be protein-specific. We were getting
> > reasonable scaling with the ApoA1 benchmark distributed by NAMD
> > developers (see
> > http://norma.mbg.duth.gr/index.php?id=about:benchmarks:namdv28cudagtx460
> > for a more recent test with NAMD 2.8 + CUDA). The measurements stop at
> > four nodes because we only had four nodes with GPUs :-)
> >
> >
> >
> >> If not (you hint that your tests were done a long time ago), I wonder if
> >> you ever looked into total bandwidth usage in your tests? 64 minutes of
> >> simulation runtime across 2 of my nodes results in a total of 390GiB of
> >> data transferred between them (according to ifconfig) - this equates to
> >> about 100 MiB/sec. This is for my 80,000-atom system. Does this mean that
> >> my network bandwidth could indeed be saturating (100 MiB/sec being not far
> >> off 1Gbit/sec)? If this is true, it is not clear to me why the data
> >> transfer rate is so high in my case but not yours.
> >
> > As Axel said, latency is probably more important. Have you benchmarked
> > your network with a tool like NetPIPE? (see
> > http://norma.mbg.duth.gr/index.php?id=about:benchmarks:network for an
> > example, this was again back in 2009, so there may be much better tools
> > around these days).
> >
> >
> >
> >
> > --
> >
> >
> > Nicholas M. Glykos, Department of Molecular Biology and Genetics,
> > Democritus University of Thrace, University Campus, Dragana,
> > 68100 Alexandroupolis, Greece, Tel/Fax (office) +302551030620,
> > Ext.77620, Tel (lab) +302551030615, http://utopia.duth.gr/~glykos/
> >
> >
> >
>
>
>
>
> _____________________________________________________
> Dr. Douglas R. Houston
> Lecturer
> Institute of Structural and Molecular Biology
> Room 3.23, Michael Swann Building
> King's Buildings
> University of Edinburgh
> Edinburgh, EH9 3JR, UK
> Tel. 0131 650 7358
> http://tinyurl.com/douglasrhouston
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.

This archive was generated by hypermail 2.1.6 : Tue Dec 27 2016 - 23:21:04 CST