Re: Is there any point in running NAMD over an ethernet-linked cluster?

From: Douglas Houston (DouglasR.Houston_at_ed.ac.uk)
Date: Thu Apr 16 2015 - 05:10:12 CDT

Hi Michael,

I can't recall off the top of my head what we did to resolve that
particular issue (I do remember having to switch off firewalls),
although the thread continues here:

http://www.ks.uiuc.edu/Research//namd/mailing_list/namd-l.2014-2015/0772.html

but I can tell you it was a waste of time anyway: even if you do get it
to run, the speedup from running NAMD across multiple nodes is
essentially nonexistent when they are linked via a standard ethernet
network using cheap consumer-grade switches. Others have reported the
same to me.

It's probably related to latency; NAMD appears to have strict network
hardware requirements that are not actually published.

This is the case for a single simulation split across nodes; loosely
coupled jobs such as REMD, where each replica mostly runs independently
between exchanges, might fare better.
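
To put rough numbers on why latency bites, here is a back-of-the-envelope
sketch in Python; every constant in it is my assumption for illustration,
not a measured NAMD figure:

# Crude model: the compute time per step divides across nodes, but the
# per-step communication cost does not, so scaling stalls on a slow link.
# All constants below are assumptions for illustration, not measurements.

compute_per_step = 0.08    # s/step on one node for ~80k atoms (assumed)
latency          = 150e-6  # s per message through a consumer switch (assumed)
msgs_per_step    = 100     # messages exchanged per step (assumed)
mb_per_step      = 4.0     # MB of coordinates/forces per step (assumed)
bandwidth        = 110.0   # usable gigabit ethernet bandwidth in MB/s

comm_per_step = msgs_per_step * latency + mb_per_step / bandwidth

for nodes in (1, 2, 4, 8):
    step = compute_per_step / nodes + (comm_per_step if nodes > 1 else 0.0)
    print(f"{nodes} node(s): {step*1e3:5.1f} ms/step, "
          f"speedup {compute_per_step/step:.2f}x")

With numbers anything like these a second node actually slows the run
down, and piling on more nodes recovers very little; only a lower-latency
interconnect (or much less frequent communication) changes the picture.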

cheers,
Doug

Quoting Michael Charlton <michael.charlton_at_inhibox.com> on Thu, 16 Apr
2015 09:37:29 +0100:

> Hi Douglas,
> I have been following your problem with getting NAMD running in
> parallel on
> http://www.ks.uiuc.edu/Research/namd/mailing_list/namd-l.2013-2014/2687.html
> It seems that I have an identical problem and I cannot see a final
> resolution on this thread. Can you tell me if you managed to get it
> working and what your solution was ?
> (I am running on a Centos/Rocks cluster where my file space is
> shared between all nodes and hence they are all trying to read the
> same ssh keys which seems to cause the clash).
>
> Many thanks,
> Michael Charlton, InhibOx
>
>

Quoting Nicholas M Glykos <glykos_at_mbg.duth.gr> on Mon, 22 Sep 2014
11:40:59 +0300 (EEST):

>
>
>> Thank you very much for that; it is supremely helpful. I am going to try
>> to replicate the various benchmarking tests you describe in your link. To
>> that end, I wonder if you would be able to supply me with the 60,000-atom
>> ionized.psf and heat_out.coor files you used so that my steps match yours
>> as closely as possible?
>
> I don't expect scaling problems to be protein-specific. We were getting
> reasonable scaling with the ApoA1 benchmark distributed by the NAMD
> developers (see
> http://norma.mbg.duth.gr/index.php?id=about:benchmarks:namdv28cudagtx460
> for a more recent test with NAMD 2.8 + CUDA). The measurements stop at four
> nodes because we only had four nodes with GPUs :-)
>
>
>
>> If not (you hint that your tests were done a long time ago), I wonder if
>> you ever looked into total bandwidth usage in your tests? 64 minutes of
>> simulation runtime across 2 of my nodes results in a total of 390 GiB of
>> data transferred between them (according to ifconfig) - this equates to
>> about 100 MiB/sec. This is for my 80,000-atom system. Does this mean that
>> my network bandwidth could indeed be saturating (100 MiB/sec being not far
>> off 1 Gbit/sec)? If this is true, it is not clear to me why the data
>> transfer rate is so high in my case but not yours.
>
> As Axel said, latency is probably more important. Have you benchmarked
> your network with a tool like NetPIPE? (See
> http://norma.mbg.duth.gr/index.php?id=about:benchmarks:network for an
> example; this was again back in 2009, so there may be much better tools
> around these days.)
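>
> If NetPIPE isn't to hand, even a crude TCP ping-pong between two of your
> nodes gives an order-of-magnitude round-trip figure. A minimal sketch
> (plain Python sockets; the port number is an arbitrary placeholder):
>
> import socket, sys, time
>
> PORT, REPS, MSG = 5001, 1000, b"x" * 64
>
> def recv_exact(sock, n):
>     # keep reading until exactly n bytes have arrived
>     buf = b""
>     while len(buf) < n:
>         chunk = sock.recv(n - len(buf))
>         if not chunk:
>             raise ConnectionError("peer closed connection")
>         buf += chunk
>     return buf
>
> if sys.argv[1] == "server":            # run this on the first node
>     srv = socket.socket()
>     srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
>     srv.bind(("", PORT))
>     srv.listen(1)
>     conn, _ = srv.accept()
>     conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
>     for _ in range(REPS):
>         conn.sendall(recv_exact(conn, len(MSG)))   # echo each message back
> else:                                  # "client <server-hostname>" on the other
>     cli = socket.create_connection((sys.argv[2], PORT))
>     cli.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
>     t0 = time.perf_counter()
>     for _ in range(REPS):
>         cli.sendall(MSG)
>         recv_exact(cli, len(MSG))                  # wait for the echo
>     rtt = (time.perf_counter() - t0) / REPS
>     print(f"average round trip: {rtt * 1e6:.0f} microseconds")
>
> For comparison, the low-latency interconnects that tightly coupled codes
> scale well on sit in the single-digit microsecond range.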
>
>
>
>
> --
>
>
> Nicholas M. Glykos, Department of Molecular Biology
> and Genetics, Democritus University of Thrace, University Campus,
> Dragana, 68100 Alexandroupolis, Greece, Tel/Fax (office) +302551030620,
> Ext.77620, Tel (lab) +302551030615, http://utopia.duth.gr/~glykos/
>
>
>

_____________________________________________________
Dr. Douglas R. Houston
Lecturer
Institute of Structural and Molecular Biology
Room 3.23, Michael Swann Building
King's Buildings
University of Edinburgh
Edinburgh, EH9 3JR, UK
Tel. 0131 650 7358
http://tinyurl.com/douglasrhouston

-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
