Re: namd cvs compilation with a maximum number of cores to run on

From: Axel Kohlmeyer (akohlmey_at_cmm.chem.upenn.edu)
Date: Fri Feb 13 2009 - 07:21:33 CST

On Fri, 13 Feb 2009, Andrew Emerson wrote:

AE> Dear Vlad
AE>
AE> I noticed exactly the same problem on our opteron/infiniband cluster
AE> (openmpi) with an even smaller system of about 150k atoms. I was told it was
AE> probably a problem of openmpi. I don't know if this is true or not, but after

vlad, andrew,

there are some older versions of OpenMPI that had trouble
with some RDMA-based communication, but i have not noticed
any problems since OpenMPI version 1.2.6.

.. and our applications are _more_ demanding in terms of
memory use and communication load than NAMD.

now for the extra bit of magic:

to use the shared request queue in openmpi, no recompile is
required (unless you have an old version of openmpi, that is).
all you need to do is add "--mca btl_openib_use_srq 1" to
your mpirun command line. i noticed that it not only made our
large jobs run without crashing, but also that jobs which didn't
crash ran a little bit faster. if this setting helps, you can
make it permanent by creating a file

$HOME/.openmpi/mca-params.conf

with the line:

 btl_openib_use_srq = 1
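
for illustration, a full command line could look something like the
one below (the namd2 binary name, core count, and input file are
just placeholders for whatever you actually run):

 mpirun --mca btl_openib_use_srq 1 -np 64 namd2 apoa1.namd > apoa1.log

and the config file can be created, e.g., from a bourne-type shell
with:

 mkdir -p $HOME/.openmpi
 echo 'btl_openib_use_srq = 1' >> $HOME/.openmpi/mca-params.conf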

while you are playing with mca parameters, you most likely want
to try out this one, too:

mpi_paffinity_alone = 1

this will tie the mpi processes to specific processors and
should result in a significant speedup as well (the command line
version of that is: --mca mpi_paffinity_alone 1).
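
as a sketch of how the two settings combine on one command line
(again, executable name, input file, and core count are placeholders):

 mpirun --mca btl_openib_use_srq 1 --mca mpi_paffinity_alone 1 \
        -np 64 namd2 apoa1.namd > apoa1.log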

in case you are locked into MVAPICH2, you may want to try
setting the environment variable MV2_SRQ_SIZE to 4000.
there are bound to be more options related to using SRQ, but
you'd have to look them up.
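
for illustration only (the launcher and the namd2 binary name are
placeholders; the exact syntax depends on your login shell):

 export MV2_SRQ_SIZE=4000     # bash/sh
 setenv MV2_SRQ_SIZE 4000     # csh/tcsh
 mpiexec -n 64 namd2 apoa1.namd > apoa1.log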

hope this helps,
    axel.

AE> some system upgrades I certainly don't get the problem anymore, although
AE> admittedly I haven't tried with a larger system.
AE>
AE> I think the idea of switching to openmpi seems a good one.

p.s.: it is hard for me to understand why people shy away from
openmpi. if installed and configured properly, it is a system
administrator's _and_ a user's dream:

i have locally just one installation of openmpi that can be used
on 4 different clusters with 3 different interconnects, without
having to build compiler-specific executables (all compiled with
gcc).
.. and the resulting executables, compiled and tested locally,
work without a hitch on many supercomputing center clusters that
have a compatible (i.e. 1.2.x) version of openmpi installed.

AE>
AE> cheers
AE> andy
AE>
AE>

-- 
=======================================================================
Axel Kohlmeyer   akohlmey_at_cmm.chem.upenn.edu   http://www.cmm.upenn.edu
   Center for Molecular Modeling   --   University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.
