From: Axel Kohlmeyer (akohlmey_at_cmm.chem.upenn.edu)
Date: Fri Feb 13 2009 - 07:21:33 CST
On Fri, 13 Feb 2009, Andrew Emerson wrote:
AE> Dear Vlad
AE> I noticed exactly the same problem on our opteron/infiniband cluster
AE> (openmpi) with an even smaller system of about 150k atoms. I was told it was
AE> probably a problem of openmpi. I don't know if this is true or not but after
There are some older versions of OpenMPI that had trouble
with some RDMA-based communication, but I have not noticed
any problems since OpenMPI version 1.2.6
.. and our applications are _more_ demanding in terms of
memory use and communication load than NAMD.
Now for the extra bit of magic:
to use the shared request queue (SRQ) in OpenMPI, no recompile is
required (unless you have an old version of OpenMPI, that is).
All you need to do is add "--mca btl_openib_use_srq 1" to
your mpirun command line. I noticed that it not only made our
large jobs run without crashing, but also that jobs which didn't
crash would run a little bit faster. If this setting helps,
you can make it permanent by creating the per-user MCA parameter
file $HOME/.openmpi/mca-params.conf with the line:
btl_openib_use_srq = 1
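A minimal sketch of making that setting permanent, assuming the standard per-user MCA parameter file location `$HOME/.openmpi/mca-params.conf` (sites may use a system-wide file instead):

```shell
# Persist the SRQ setting in the per-user Open MPI MCA parameter file.
# $HOME/.openmpi/mca-params.conf is the standard per-user location;
# check with your administrator if your site uses a system-wide file.
mkdir -p "$HOME/.openmpi"
echo "btl_openib_use_srq = 1" >> "$HOME/.openmpi/mca-params.conf"
```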
While you are playing with MCA parameters, you will most likely want
to try out this one, too:
mpi_paffinity_alone = 1
This ties the MPI processes to specific processors and
should result in a significant speedup as well (the command-line
version of that is: --mca mpi_paffinity_alone 1).
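Both MCA options can be combined on a single mpirun invocation; a sketch of such a job-script line, where the binary name `namd2`, the input file, and the process count are illustrative placeholders rather than values from this message:

```shell
# Combine the SRQ and processor-affinity MCA options on one mpirun line.
# namd2, apoa1.namd, and -np 64 are placeholders for your own job.
MCA_OPTS="--mca btl_openib_use_srq 1 --mca mpi_paffinity_alone 1"
echo mpirun -np 64 $MCA_OPTS namd2 apoa1.namd
```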
In case you are locked into MVAPICH2, you may want to try
setting the environment variable MV2_SRQ_SIZE to 4000.
There are more options related to using SRQ, but
you'd have to look them up.
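For the MVAPICH2 case, the variable is exported in the job environment before launching; a minimal sketch (the launcher and application names in the comment are placeholders, not from this message):

```shell
# MVAPICH2 reads tuning parameters from the environment;
# MV2_SRQ_SIZE sets the shared receive queue depth.
export MV2_SRQ_SIZE=4000
echo "MV2_SRQ_SIZE=$MV2_SRQ_SIZE"
# then launch as usual, e.g. something like: mpirun_rsh -np 64 ./namd2 input.namd
```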
Hope this helps,
AE> some system upgrades I certainly don't get the problem anymore, although
AE> admittedly I haven't tried with a larger system.
AE> I think the idea of switching to openmpi seems a good one.
P.S.: It is hard for me to understand why people shy away from
OpenMPI. If installed and configured properly, it is a system
administrator's _and_ a user's dream:
I have locally just one installation of OpenMPI that can be used
on 4 different clusters with 3 different interconnects, without
having to build compiler-specific executables (all compiled with ...)
.. and the resulting executables, compiled and tested locally,
work without a hitch on many supercomputing center clusters that
have a compatible (i.e. 1.2.x version) installation of OpenMPI.
--
=======================================================================
Axel Kohlmeyer   akohlmey_at_cmm.chem.upenn.edu   http://www.cmm.upenn.edu
Center for Molecular Modeling -- University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582, fax: 1-215-573-6233, office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.