Re: NAMD speed on MPICH2 Ubuntu 64 bit Cluster

From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Wed Apr 06 2011 - 16:22:48 CDT

On Wed, Apr 6, 2011 at 4:50 PM, Robert McCarrick
<rob.mccarrick_at_muohio.edu> wrote:

> Axel,
> Thanks so much for the reply. That's disappointing as this was just a
> $2,800 build to speed up some calculations for one of the labs here at
> Miami (this is not my area at all, I just happen to be good with Linux). In
> looking at the infiniband hardware, it would be about triple the cost
> of the cluster itself.
>

if you are good at linux stuff and not afraid to dig a little deeper,
you may try building Open-MX and running OpenMPI on top of it.

http://open-mx.gforge.inria.fr/

i never tried it myself, but it bypasses the TCP/IP layer and
thus avoids some of the performance problems.

another option for people on a budget is to obtain
"obsolete" hardware, e.g. old myrinet 2000 gear.
make sure that you get sufficient spare parts,
since the optical transceivers do age and will
occasionally break.

while it is no competition for 4x-QDR infiniband,
that kind of hardware will still put gigE with TCP/IP
to shame. i have been running a machine where we
transplanted a myrinet in this way and effectively
doubled the value of the gigE-based machine.

another option is to only buy infiniband cards (no switch)
and then connect only two machines. again, it might be
worth looking for leftover DDR hardware.

> What I will probably end up doing is writing my own little queuing script
> that will take advantage of the fact that each
>

why write your own queueing software? just use torque. i even use it on
my desktop when i have a large number of serial calculations to run.
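
for reference, a minimal torque job script for one of these 6-core nodes
could look something like the sketch below. the script name, binary
locations, and config file name are placeholders for your setup, not
anything specific to this cluster:

```shell
#!/bin/bash
# run_namd.sh -- minimal torque/PBS job script (all paths are placeholders)
#PBS -N namd_test
#PBS -l nodes=1:ppn=6
#PBS -j oe

# torque starts the job in $HOME; change to the directory
# the job was submitted from
cd "$PBS_O_WORKDIR"

# single-node run on all 6 cores of the node
./charmrun ++local +p6 ./namd2 my_config.namd > my_config.log
```

you would submit it with "qsub run_namd.sh" and watch the queue with
"qstat".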

> computer is pretty darn fast and the cluster could still run through a
> series of experiments distributing the individual jobs to run on one of the
> computers themselves using the 6 cores and the multicore optimized version
> of NAMD.
>

for running on a single node, you don't need the multi-core version.
just use the UDP version and run with ++local.
the multi-core version should specifically help in your case when
running across multiple nodes, as it reduces the contention for
the network: only one task per node will communicate.
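
to make the single-node case concrete, the invocation would look
something like this (the binary and config file names are placeholders;
++local and +p are standard charmrun options):

```shell
# single node, UDP build: keep all 6 worker processes on the local
# machine, so the network (and its latency) is not involved at all
./charmrun ++local +p6 ./namd2 my_config.namd
```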

> That way it will not have been a complete waste of time and money.
>

what is a waste depends as much on how you can use hardware
in a smart way as it does on having good hardware.
in HPC - and particularly if you are on a budget - it is often important
to purchase "balanced" hardware and also to decide up to which
degree you want to offset cost with personal effort.

you can build extremely cheap machines, but if they require a lot
of maintenance effort, then it may be cheaper to buy expensive
hardware. that sounds crazy at first, but if you really think about it,
then you'll see that sometimes money isn't everything.

i have "managing" a cluster that - while often having over 90% utilization
per month, requires effectively no maintenance, because we invested
a lot of effort in picking the most suitable hardware and setting it up
in a way, that the machine handles almost any kind of failure gracefully.
after from a few nodes that needed a little push, the whole machine has
been running for over a year. e.g.:

ssh node72 "cat /proc/uptime | awk '{print \$1/60/60/24, \$2/60/60/24}'"
384.909 24.9688

that is 25 days idle out of 385, i.e. about 6%,
with no maintenance effort.
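
the percentage is easy to double-check by feeding the two
/proc/uptime numbers from above back through awk:

```shell
# redo the arithmetic from the uptime output quoted above
# (first field: days of uptime, second field: days spent idle)
echo "384.909 24.9688" | awk '{printf "%.1f%% idle\n", $2/$1*100}'
```

which prints "6.5% idle".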

you won't get that with standard desktop hardware.

cheers,
     axel.

> Rob
>
>
> On Wed, 2011-04-06 at 13:06 -0400, Axel Kohlmeyer wrote:
>
>
>
> On Wed, Apr 6, 2011 at 12:19 PM, Robert McCarrick <
> rob.mccarrick_at_muohio.edu> wrote:
>
> Hi Everyone,
> Just to give more information on this. If I use the following command
> (with the TCP optimized version of NAMD for x86_64 Linux):
>
> ./charmrun namd2 +p6 <configuration_file>
>
> I get a time of 0.0290901 s/step and I get 6 processes running on the main
> computer with a system load of 1.32. If I use the following command:
>
> ./charmrun namd2 +p24 <configuration_file>
>
> I get a time of 0.0497453 s/step and I get 6 processes on each of the four
> computers, but the main computer on which I executed the
> command has a load of 0.53 and each of the other three computers have loads
> of about 0.01, indicating that they aren't really doing much of anything
> even though they have 6 namd processes running. I have a nodelist file and
> all of the computers can SSH to each other without a password. The
> directory in which the NAMD and configuration files are contained is
> mirrored on the other three computers via NFS (all of the user UIDs and GIDs
> and permissions are carrying over fine). I've been searching online and
> haven't found any way to address this. As mentioned in the previous email,
> I also compiled the mpi-Linux-x86_64 version and it doesn't seem to help the
> problem. Any help would be greatly appreciated.
>
>
>
> rob,
>
>
> TCP/IP networking doesn't give you great scaling, because of the high
> latencies. classical MD is quite sensitive to that, since you need to
> communicate multiple times in each time step and the computing effort
> for each step is rather small.
>
>
> now NAMD can do _some_ latency hiding, and thus does much better over
> TCP/IP than most other codes that i know. nevertheless, with 6 cores
> per node, you are really pushing the limit. you may benefit from the
> multi-core version that is now provided with version 2.8b1, as that
> will limit the communication to one task (instead of 6 tasks fighting
> for access to the network).
>
>
> if you really want good performance, you need to consider buying a fast
> low-latency interconnect. there are several of them with different
> properties and costs associated. the most popular currently seems to be
> infiniband, which seems to be a good match. i am seeing very good
> scaling behavior of NAMD (or rather charm++) using the IBVERBS library
> interface.
>
> cheers,
> axel.
>
>
> Thanks,
> Rob
>
>
>
> On Tue, 2011-04-05 at 14:51 -0400, Robert McCarrick wrote:
>
> Hi Everyone,
> I'm new to the cluster computer world. I've built a four-computer cluster,
> each with a 6-core AMD Phenom processor running Ubuntu Server 10.10 64 bit.
> I've tried both the TCP optimized version of NAMD and compiling from scratch
> with the mpi-Linux-x86_64 build of Charm. In all cases, I'm getting about a
> 4-fold reduction in calculation times when I run the job utilizing all four
> computers (i.e. going from +p6 to +p24 causes a big slowdown). This seems
> odd and I was wondering if anyone had any suggestions as to where I might
> have gone wrong.
> Rob
>
> --
> Robert M. McCarrick, Ph.D.
> EPR Instrumentation Specialist
> Department of Chemistry and Biochemistry
> Miami University
> 701 E. High Street
> 101 Hughes Laboratories
> Oxford, OH 45056
> 513.529.0507 CW Room
> 513.529.6829 Pulse Room
> 513.529.5715 fax
> rob.mccarrick_at_muohio.edu
> http://epr.muohio.edu
>
>
>
>
>
>
>

-- 
Dr. Axel Kohlmeyer
akohlmey_at_gmail.com  http://goo.gl/1wk0
Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.

This archive was generated by hypermail 2.1.6 : Mon Dec 31 2012 - 23:20:05 CST