Input/output error

From: 王棽 (corarbor_at_163.com)
Date: Tue May 18 2010 - 01:55:20 CDT

Next message: Axel Kohlmeyer: "Re: Input/output error"
Previous message: Giacomo Fiorin: "Re: ABF in various simulations"
Next in thread: Axel Kohlmeyer: "Re: Input/output error"
Reply: Axel Kohlmeyer: "Re: Input/output error"

Dear NAMD users:
I am running NAMD on the Dawning5000A supercomputer (http://www.ssc.net.cn/en/resources.asp). However, I have found that my NAMD processes are fragile on this platform. They usually die with an input/output error while writing the *.restart.coor, *.restart.vel or *.restart.xsc files. An example of the standard output is below:

WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 4331000
WRITING COORDINATES TO DCD FILE AT STEP 4331000
WRITING COORDINATES TO RESTART FILE AT STEP 4331000
FATAL ERROR: Error on write to binary file coord.restart.coor: Input/output error
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: FATAL ERROR: Error on write to binary file coord.restart.coor: Input/output error

[0] Stack Traceback:
  [0] CmiAbort+0x2b [0x8b68f1]
  [1] _Z8NAMD_errPKc+0x84 [0x4d1444]
  [2] _ZN6Output17write_binary_fileEPciP6Vector+0xda [0x78d8ea]
  [3] _ZN6Output26output_restart_coordinatesEP6Vectorii+0x1d1 [0x78e001]
  [4] _ZN6Output10coordinateEiiP6VectorP11FloatVectorR7Lattice+0x1b2 [0x78ef92]
  [5] _ZN16CollectionMaster16receivePositionsEP16CollectVectorMsg+0x1f1 [0x4dd971]
  [6] CkDeliverMessageFree+0x38 [0x8583c0]
  [7] _Z15_processHandlerPvP11CkCoreState+0x982 [0x85dbbe]
  [8] CmiHandleMessage+0x27 [0x8b7f28]
  [9] CsdScheduleForever+0x64 [0x8b9a58]
  [10] CsdScheduler+0xd [0x8b9adb]
  [11] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x156 [0x7ceef6]
  [12] TclInvokeStringCommand+0x69 [0x2b662cc99ed9]
  [13] TclEvalObjvInternal+0xf8 [0x2b662cc9aeb8]
  [14] Tcl_EvalEx+0x166 [0x2b662cc9b366]
  [15] Tcl_FSEvalFile+0xec [0x2b662ccd85ec]
  [16] Tcl_EvalFile+0x27 [0x2b662ccd9792]
  [17] _ZN9ScriptTcl3runEPc+0x14 [0x7ceff4]
  [18] _Z18after_backend_initiPPc+0x223 [0x4d4133]
  [19] main+0x24 [0x4d4214]
  [20] __libc_start_main+0xf4 [0x2b662d6df184]
  [21] __gxx_personality_v0+0x139 [0x4d0b69]
[0] [MPI Abort by user] Aborting Program!
Abort signaled by rank 0: MPI Abort by user Aborting program !
Exit code -3 signaled from d544
Killing remote processes...MPI process terminated unexpectedly
DONE
Signal 15 received.

I contacted the engineers at the supercomputer center. They found that whenever this input/output error happened, a Lustre terminal connection had temporarily dropped and then reconnected. Such events are observed quite often in the communication between the compute nodes and the OSS nodes.
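
To make the failure mode concrete, here is a minimal Python sketch of the kind of tolerance that would help. It is not NAMD code, and as far as I can tell from the traceback NAMD aborts on the first failed write. The helper name write_with_retry and its parameters are hypothetical, and it rests on the assumption that a write failing with EIO during a short Lustre reconnect would succeed if repeated a few seconds later.

```python
import errno
import os
import time


def write_with_retry(path, data, attempts=5, delay=2.0, sleep=time.sleep):
    """Write bytes to `path`, retrying when the write fails with EIO.

    Hypothetical helper, not part of NAMD. The data goes to a temporary
    file first and is renamed into place only after a successful write,
    so an existing restart file is never left half-overwritten.
    Returns the number of attempts that were needed.
    """
    tmp_path = path + ".tmp"
    for attempt in range(1, attempts + 1):
        try:
            with open(tmp_path, "wb") as handle:
                handle.write(data)
                handle.flush()
                # Force the data out to the file system so that an I/O
                # error surfaces here and not at some later point.
                os.fsync(handle.fileno())
            # Replace the old file only once the new one is complete.
            os.replace(tmp_path, path)
            return attempt
        except OSError as err:
            # Retry only on EIO (assumed transient); re-raise any other
            # error, and give up after the last allowed attempt.
            if err.errno != errno.EIO or attempt == attempts:
                raise
            sleep(delay)
```

For example, write_with_retry("coord.restart.coor", payload) would try up to five times, two seconds apart, before raising the original error.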

Do you have any suggestion on this problem?
Cheers.
Shen.


This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:55:47 CST