Re: Error: transport retry exceeded error

From: Axel Kohlmeyer (akohlmey_at_cmm.chem.upenn.edu)
Date: Sat May 17 2008 - 15:37:12 CDT

On Sat, 17 May 2008, Alexandre A. Vakhrouchev wrote:

[...]

AV> 1ps I used before). I'm not shure that was the solution. Now I have to
AV> increase processors number because simulation is too long, so I'll
AV> contact our sysadmins following Axel's advice. We have queueing system
AV> at our cluster, so I do not overload the whole system, the bad thing
AV> is that I could not get the result)

alex,

the "overloading" i was mentioning is not coming from
direct overloading of the node, but from the fact that
you have 8 (eight!) MPI tasks on each node doing a lot
of polls and sends. this is putting a huge load on each
local infiniband card and the load is increasing the
more nodes you use for your job as the pieces of compute
work get smaller and thus the frequency of communication
calls higher.
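
as an illustration only (a minimal sketch assuming open mpi;
the flag names are assumptions and differ between MPI
libraries and versions), one way to lower the per-card load
is to spread the same number of tasks over more nodes, i.e.
to run fewer MPI tasks per node:

   # e.g. 32 MPI tasks spread over 8 nodes at 4 tasks per
   # node, instead of filling 4 nodes with 8 tasks each.
   # "your_app" is just a placeholder for your binary.
   mpirun -np 32 -npernode 4 your_app

with a queueing system you would normally request that layout
through the batch script, so your sysadmins can tell you how
the node/task mapping is controlled on your cluster.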

the message you are seeing is a timeout of the infiniband
host adapter, which was not able to transfer data to
some other host adapter.

this can basically have 3 reasons:

- your host adapter is defective or has a defective cable
  => your sysadmins should be able to test this.
  this is unlikely, since you report that the error is
  intermittent, but it could still be a "nearly defective"
  part or a loose cable(?).

- your job is overloading the host adapter (it has to send
  more messages than it can handle within a given time).
  => quite possible, particularly since you have dual quad-core
  nodes. this may also depend on the MPI library. some libraries
  allow you to change the defaults for how many times the
  low-level layer retries before it considers a communication
  failed, and you may also be able to change the retry delay
  time (see the example command after this list). both are
  usually set to small values, since that keeps the latencies
  down and thus gives the best benchmark numbers; however,
  different MPI implementations have different defaults as well.
  i've seen this kind of timeout particularly with jobs using
  more than 32 nodes, and the more often the more nodes were
  used.
  
- your infiniband switch is (over)loaded (other jobs are doing
  lots of high-bandwidth communication).
  => this can happen particularly on machines where the home
  or work file system runs over the same infiniband fabric
  as the MPI traffic (e.g. as a lustre filesystem). this way,
  high-i/o users (e.g. quantum chemists with large integral
  files) can affect other jobs _even_ though they are assigned
  to different nodes. if you combine this with the previous
  effect, you can explain the intermittent nature of the
  timeouts. a signature of high switch load is that the same
  (smaller) job may need varying amounts of time (up to 10%
  difference) to run across the same number of nodes.
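
to make the second point more concrete, here is a minimal
sketch of how such low-level limits can be raised with open
mpi's openib transport. the parameter names and value ranges
are written from memory, so please verify them on your machine
with 'ompi_info --param btl openib' before relying on them:

   # raise the infiniband retransmit timeout (an exponent,
   # roughly in the range 0-31) and the retry count (0-7)
   # for the openib BTL; "your_app" is just a placeholder.
   mpirun --mca btl_openib_ib_timeout 20 \
          --mca btl_openib_ib_retry_count 7 \
          -np 64 your_app

other MPI libraries (e.g. mvapich) expose similar knobs, often
through environment variables; the MPI documentation or your
sysadmins will have the details.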

hope that helps,
    axel.

--
=======================================================================
Axel Kohlmeyer   akohlmey_at_cmm.chem.upenn.edu   http://www.cmm.upenn.edu
   Center for Molecular Modeling   --   University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.
