From: Axel Kohlmeyer (akohlmey_at_cmm.chem.upenn.edu)
Date: Tue Feb 17 2009 - 08:43:49 CST
On Tue, 17 Feb 2009, shayan_at_msu.edu wrote:
SM>
SM>
SM> Hello NAMD-list,
SM>
SM> I am running NAMD on ranger_at_TACC on a system containing 2 million atoms using the script from
SM>
SM> ~tg455591/NAMD_scripts/runbatch. It used to run problem-free until recently, when it started exiting unexpectedly after a few steps, giving errors like:
SM>
SM> WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 244000
SM> LDB: TIME 7267.75 LOAD: AVG 1.35357 MAX 1.40105 PROXIES: TOTAL 1838 MAXPE 13 MAXPATCH 7 None 0.478391
SM> LDB: TIME 7267.76 LOAD: AVG 1.35357 MAX 1.38049 PROXIES: TOTAL 1838 MAXPE 13 MAXPATCH 7 Refine 0.478391
SM> Abort signaled by rank 136: [i139-411.ranger.tacc.utexas.edu:136]
SM> Got completion with error IBV_WC_WR_FLUSH_ERR, code=5, dest rank=152
this is an error message from the infiniband/MPI layer, not from NAMD.
SM> Exit code -3 signaled from i139-411.ranger.tacc.utexas.edu
SM> Killing remote processes...MPI process terminated unexpectedly
SM> DONE
SM> TACC: MPI job exited with code: 1
SM> TACC: Shutting down parallel environment.
SM> TACC: Shutdown complete. Exiting.
SM>
SM> At other times the error showed "Got completion with error IBV_WC_RETRY_EXC_ERR"
SM> or " Got completion with error IBV_WC_LOC_PROT_ERR".
these are infiniband-related errors as well.
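to make the status names in those messages a bit less cryptic, here is a minimal sketch that pulls the name out of a log line and looks it up. the descriptions are paraphrased from the libibverbs documentation, only the three statuses seen in this thread are listed, and the names IBV_WC_STATUS and explain() are made up for illustration.

```python
import re

# Short meanings paraphrased from the libibverbs (ibv_poll_cq) documentation.
# Only the three statuses reported in this thread are included.
IBV_WC_STATUS = {
    "IBV_WC_LOC_PROT_ERR": "local protection error on the posted buffer",
    "IBV_WC_WR_FLUSH_ERR": "work request flushed, queue pair already in error state",
    "IBV_WC_RETRY_EXC_ERR": "transport retry counter exceeded, remote side not responding",
}

STATUS_RE = re.compile(r"Got completion with error (IBV_WC_\w+)")


def explain(line):
    """Return (status_name, meaning) for a log line, or None if no status is found."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    name = match.group(1)
    return name, IBV_WC_STATUS.get(name, "not in this table")


if __name__ == "__main__":
    print(explain("Got completion with error IBV_WC_WR_FLUSH_ERR, code=5, dest rank=152"))
```

note that a flush error is usually a follow-up symptom: the queue pair had already failed for some other reason, so the first error in the log is the more interesting one.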
SM> I have also received the following error a few times:
SM>
SM> WRITING COORDINATES TO DCD FILE AT STEP 796000
SM> LDB: TIME 22116.6 LOAD: AVG 1.18 MAX 1.23874 PROXIES: TOTAL 1966 MAXPE 13 MAXPATCH 7 None 0.44839
SM> LDB: TIME 22116.6 LOAD: AVG 1.18 MAX 1.20339 PROXIES: TOTAL 1968 MAXPE 13 MAXPATCH 7 Refine 0.44839
SM> MPI process terminated unexpectedly
SM> Exit code -5 signaled from i141-406.ranger.tacc.utexas.edu
SM> Killing remote processes...DONE
SM> TACC: MPI job exited with code: 1
SM> TACC: Shutting down parallel environment.
SM> TACC: Shutdown complete. Exiting.
SM>
SM> Although I am not sure whether this is a NAMD issue, any suggestions
SM> are greatly appreciated.
i doubt that this is a NAMD issue. to check, you could try running your
system on a different machine (w/o infiniband).
SM> I have also contacted the Ranger staff but they haven't been able to
SM> help me out so far.
bug them some more. it is their job, they get paid for that.
one item to try out would be a different MPI implementation.
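when you go back to the ranger staff, it helps to tell them which nodes keep showing up in the failures. a minimal sketch, assuming you keep the output of each failed job as text; the regular expressions are written against the two message formats quoted above, and failing_hosts() and tally() are hypothetical helper names.

```python
import re
from collections import Counter

# Patterns written against the messages quoted in this thread, e.g.
#   Abort signaled by rank 136: [i139-411.ranger.tacc.utexas.edu:136]
#   Exit code -3 signaled from i139-411.ranger.tacc.utexas.edu
ABORT_RE = re.compile(r"Abort signaled by rank (\d+): \[([\w.-]+):\d+\]")
EXIT_RE = re.compile(r"Exit code (-?\d+) signaled from ([\w.-]+)")


def failing_hosts(log_text):
    """Return the set of host names named in abort/exit messages of one job log."""
    hosts = set()
    for regex in (ABORT_RE, EXIT_RE):
        for match in regex.finditer(log_text):
            hosts.add(match.group(2))
    return hosts


def tally(logs):
    """Count in how many job logs each host appears as the failing node."""
    counts = Counter()
    for text in logs:
        counts.update(failing_hosts(text))
    return counts


if __name__ == "__main__":
    log1 = (
        "Abort signaled by rank 136: [i139-411.ranger.tacc.utexas.edu:136]\n"
        "Exit code -3 signaled from i139-411.ranger.tacc.utexas.edu\n"
    )
    log2 = "Exit code -5 signaled from i141-406.ranger.tacc.utexas.edu\n"
    for host, n in tally([log1, log2]).most_common():
        print(host, n)
```

if the same few nodes turn up again and again, that points to a hardware or fabric problem on those nodes rather than anything in your input.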
cheers,
axel.
SM> Best Wishes,
SM> Shayantani
SM>
SM> Shayantani Mukherjee
SM> Department of Biochemistry and Molecular Biology
SM> Michigan State University
SM> East Lansing, Michigan
SM> E-mail: shayan_at_msu.edu
SM>
SM>
--
=======================================================================
Axel Kohlmeyer   akohlmey_at_cmm.chem.upenn.edu   http://www.cmm.upenn.edu
   Center for Molecular Modeling   --   University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.
This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:52:22 CST