Re: Altix Performance Tuning

From: Mark Abraham (Mark.Abraham_at_anu.edu.au)
Date: Mon Sep 18 2006 - 13:42:53 CDT

Alessandro Cembran wrote:
> Hi,
>
> I've been experiencing a problem running NAMD (with any of the versions
> 2.6b1, 2.6b2 and 2.6) on a 256-processor node of an Altix 3700 BX2
> machine (http://www.msi.umn.edu/altix/intro/).

"My" Altix 3700 Bx2 cluster
(http://nf.apac.edu.au/facilities/ac/hardware.php) is segmented into
partitions of 32 processors with shared memory, with NUMAlink4
interconnects between those partitions. It seems to work well for
parallel MD codes, but I've never run NAMD2 on it.

> What happens is that with systems of different sizes (either ~55,000 or
> ~190,000 atoms) and with different numbers of processors (8 or 40), the
> performance of my calculations is not reproducible at all. In
> particular, a job might run extremely fast (i.e., almost linear scaling)
> for hours or days and then all of a sudden slow down to 10%
> or even ~2% of the peak performance and never recover.
> I talked with the systems manager here and he said that this is related
> to the architecture of the machine, because many jobs are competing for
> the network resources.

You've described a single 256-processor shared-memory node, and the
website doesn't change that description. What network resources are
getting hit?

> In fact, I could track down that on some
> occasions the slowdown arose when another "massively parallel" NAMD job
> started on the same node, and both of them then ran very slowly.
> So, I was wondering whether there is anything that could be done to make
> better use of the Altix architecture. In particular, I was wondering if
> there is a way to reduce the message passing among the processors or
> tune it.

Since that's an obvious and common bottleneck, I'd expect it's already
optimized for a general case. I'd expect any Altix-specific improvement
is not worth someone's time, even if possible.

> Note: I always set the variable MPI_DSM_DISTRIBUTE.
> I also set MPI_MEMMAP_OFF=1 because my jobs crashed after running for
> a while, because they ran out of memory. The following is a quote
> from the system manager:
>
>> Another NAMD user ran into a problem with respect to the amount of
>> virtual memory that was being allocated to NAMD by the operating
>> system on the 256-processor Altix node. It turns out that the Altix
>> MPI is designed to put huge memory maps into memory that speed up
>> performance when running MPI jobs that share memory between separate
>> Altix partitions (a feature we do not use). When this other NAMD user
>> would attempt to run large NAMD jobs, they would segfault. If he set
>> the MPI_MEMMAP_OFF environment variable, his jobs no longer segfaulted.

The culprit appears to be either the practice of not partitioning, or
using an MPI version that is optimized for partitioning on a system that
is not partitioned. This should surely be a matter that SGI or the
sysadmin could rectify swiftly.
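
For anyone wanting to apply the two environment variables mentioned
above consistently from a job wrapper, here is a minimal sketch in
Python. It only builds the environment and the command line. The value
"1" for MPI_DSM_DISTRIBUTE is an assumption (the post says only that the
variable is set), and the launcher name, its -np flag, the namd2 binary
name and the file name run.namd are hypothetical placeholders that will
differ from site to site.

```python
import os

# Variables named in the thread. MPI_MEMMAP_OFF=1 is quoted from the post;
# the value "1" for MPI_DSM_DISTRIBUTE is an assumption, since the post
# only says that the variable is set.
ALTIX_MPI_SETTINGS = {
    "MPI_DSM_DISTRIBUTE": "1",
    "MPI_MEMMAP_OFF": "1",
}


def build_env(base=None):
    """Return a copy of the environment with the MPI settings applied."""
    env = dict(os.environ if base is None else base)
    env.update(ALTIX_MPI_SETTINGS)
    return env


def build_command(nprocs, config_file):
    """Build a launch command as an argument list.

    Hypothetical: the launcher, its flags and the binary name are
    placeholders, not taken from the thread.
    """
    return ["mpirun", "-np", str(nprocs), "namd2", config_file]


if __name__ == "__main__":
    env = build_env()
    cmd = build_command(8, "run.namd")
    print(" ".join(cmd))
    # To actually launch the job, something like:
    #   import subprocess
    #   subprocess.run(cmd, env=env, check=True)
```

Keeping the settings in one dictionary makes it easy to see, and to
change, exactly what each job was run with when comparing timings.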

Mark
