Re: Problem running NAMD on ranger@TACC

From: Axel Kohlmeyer (akohlmey_at_cmm.chem.upenn.edu)
Date: Tue Feb 17 2009 - 08:43:49 CST

On Tue, 17 Feb 2009, shayan_at_msu.edu wrote:

SM>
SM>
SM> Hello NAMD-list,
SM>
SM> I am running NAMD on ranger@TACC on a system containing 2 million
SM> atoms using the script from ~tg455591/NAMD_scripts/runbatch. It used to
SM> run problem-free until recently, when it started exiting unexpectedly
SM> after a few steps with errors like:
SM>
SM> WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 244000
SM> LDB: TIME 7267.75 LOAD: AVG 1.35357 MAX 1.40105  PROXIES: TOTAL 1838 MAXPE 13 MAXPATCH 7 None 0.478391
SM> LDB: TIME 7267.76 LOAD: AVG 1.35357 MAX 1.38049  PROXIES: TOTAL 1838 MAXPE 13 MAXPATCH 7 Refine 0.478391

SM> Abort signaled by rank 136: [i139-411.ranger.tacc.utexas.edu:136]
SM> Got completion with error IBV_WC_WR_FLUSH_ERR, code=5, dest rank=152

this is an error message from the infiniband/MPI layer, not from NAMD.

SM> Exit code -3 signaled from i139-411.ranger.tacc.utexas.edu
SM> Killing remote processes...MPI process terminated unexpectedly
SM> DONE
SM> TACC: MPI job exited with code: 1
SM> TACC: Shutting down parallel environment.
SM> TACC: Shutdown complete. Exiting.
SM>
SM> At other times the error showed "Got completion with error IBV_WC_RETRY_EXC_ERR"
SM> or "Got completion with error IBV_WC_LOC_PROT_ERR".

these are infiniband-related errors as well.
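
to give the support staff something concrete, it can help to tally which error statuses and which nodes show up across the failed job logs. a minimal sketch, assuming the log lines look exactly like the ones quoted in this thread; the function name `summarize` and the `STATUS_HINTS` glosses are illustrative (the glosses paraphrase the libibverbs `ibv_wc_status` descriptions), not part of NAMD or the TACC tooling:

```python
import re
from collections import Counter

# Patterns modelled on the log lines quoted in this thread.
ERROR_RE = re.compile(r"Got completion with error (IBV_WC_\w+)")
EXIT_RE = re.compile(r"Exit code (-?\d+) signaled from (\S+)")

# Short glosses for the three statuses reported here (paraphrased from
# the libibverbs ibv_wc_status documentation; not site-specific advice).
STATUS_HINTS = {
    "IBV_WC_WR_FLUSH_ERR": "work request flushed after the queue pair entered the error state",
    "IBV_WC_RETRY_EXC_ERR": "transport retry counter exceeded, remote side did not respond",
    "IBV_WC_LOC_PROT_ERR": "local protection error on a registered memory region",
}


def summarize(log_text):
    """Return (errors, hosts): counts of IBV status names and of failing hosts."""
    errors, hosts = Counter(), Counter()
    for line in log_text.splitlines():
        m = ERROR_RE.search(line)
        if m:
            errors[m.group(1)] += 1
        m = EXIT_RE.search(line)
        if m:
            hosts[m.group(2)] += 1
    return errors, hosts


# Two lines taken from the first failure quoted above.
SAMPLE = """\
Got completion with error IBV_WC_WR_FLUSH_ERR, code=5, dest rank=152
Exit code -3 signaled from i139-411.ranger.tacc.utexas.edu
"""

errors, hosts = summarize(SAMPLE)
for name, count in errors.items():
    print(count, name, "-", STATUS_HINTS.get(name, "no gloss available"))
for host, count in hosts.items():
    print(count, host)
```

if the same host keeps appearing over many failed runs, that points at a bad node or link rather than at the application.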

SM> I have also received the following error a few times:
SM>
SM> WRITING COORDINATES TO DCD FILE AT STEP 796000
SM> LDB: TIME 22116.6 LOAD: AVG 1.18 MAX 1.23874  PROXIES: TOTAL 1966 MAXPE 13 MAXPATCH 7 None 0.44839
SM> LDB: TIME 22116.6 LOAD: AVG 1.18 MAX 1.20339  PROXIES: TOTAL 1968 MAXPE 13 MAXPATCH 7 Refine 0.44839
SM> MPI process terminated unexpectedly
SM> Exit code -5 signaled from i141-406.ranger.tacc.utexas.edu
SM> Killing remote processes...DONE
SM> TACC: MPI job exited with code: 1
SM> TACC: Shutting down parallel environment.
SM> TACC: Shutdown complete. Exiting.
SM>

SM> Although I am not sure whether this is a NAMD issue, any suggestions
SM> are greatly appreciated.

i would doubt that this is a NAMD issue. you could try running your
system on a different machine (w/o infiniband).

SM> I have also contacted the Ranger staff, but they haven't been able to
SM> help me out so far.

bug them some more. it is their job, they get paid for that.

one item to try out would be a different MPI implementation.

cheers,
   axel.

SM> Best Wishes,
SM> Shayantani
SM>
SM> Shayantani Mukherjee
SM> Department of Biochemistry and Molecular Biology
SM> Michigan State University
SM> East Lansing, Michigan
SM> E-mail: shayan_at_msu.edu
SM>
SM>

-- 
=======================================================================
Axel Kohlmeyer   akohlmey_at_cmm.chem.upenn.edu   http://www.cmm.upenn.edu
   Center for Molecular Modeling   --   University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.
