Re: namd at ranger(tacc)

From: Peter Freddolino (petefred_at_ks.uiuc.edu)
Date: Thu Feb 24 2011 - 17:02:27 CST

Hi Sandor,
One problem that I have encountered recently is that the Lustre
filesystem can sometimes lag unacceptably when many jobs are rapidly
trying to create files in the same directory (this was the explanation
that I got from the TACC folks when I reported similar issues); in my
experience this can lead to crashes like the one you saw. Reducing your
output frequency or having different jobs write to different directories
can get rid of the problem; obviously the latter is preferable.
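
As a rough illustration of the "different directories" approach, here is a
minimal Python sketch that creates one output directory per job and builds
the matching NAMD config lines. The function names, the directory naming
scheme, the use of a JOB_ID environment variable, and the frequency values
are all assumptions for illustration, not something taken from the Ranger
scripts; only the NAMD keywords themselves (outputName, restartname,
dcdfile, restartfreq, dcdfreq) are standard.

```python
import os
from pathlib import Path


def job_output_dir(base, job_id=None):
    """Return a per-job output directory under `base`, creating it if needed.

    Assumption: the batch system exports the job number as JOB_ID.
    If it does not, fall back to the process ID so the name is still unique
    on this node.
    """
    job_id = job_id or os.environ.get("JOB_ID") or f"pid{os.getpid()}"
    out = Path(base) / f"job_{job_id}"
    out.mkdir(parents=True, exist_ok=True)
    return out


def namd_output_lines(out_dir, name, restart_freq=10000, dcd_freq=10000):
    """Build NAMD config lines that send all output files into `out_dir`.

    The default frequencies are placeholders; larger values mean fewer
    file creations and writes per job.
    """
    prefix = Path(out_dir) / name
    return [
        f"outputName {prefix}",
        f"restartname {prefix}.restart",
        f"dcdfile {prefix}.dcd",
        f"restartfreq {restart_freq}",
        f"dcdfreq {dcd_freq}",
    ]


if __name__ == "__main__":
    # Print lines that could be appended to a job's NAMD config file.
    out = job_output_dir("namd_output")
    print("\n".join(namd_output_lines(out, "runSMD")))
```

The printed lines would replace the corresponding output settings in each
job's config file, so that no two jobs create their restart and DCD files
in the same directory.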

Best,
Peter

On 02/24/2011 05:46 PM, Sándor Kovács wrote:
> Hi Lei,
>
> I too have just started running new NAMD jobs at Ranger using the
> scripts found at /share/home/00288/tg455591/NAMD_scripts/
> I have no trouble starting up and running these, but one did exit
> prematurely yesterday with the following error (parsed from the log file):
>
> SMD 2860000 34.4911 -32.609 37.017 -198.135 0 0
> WRITING COORDINATES TO DCD FILE AT STEP 2860000
> WRITING COORDINATES TO RESTART FILE AT STEP 2860000
> FATAL ERROR: Cannot open file
> 'OFMO_CsmABCL_UP_SOLV_runSMD512_1.restart.coor' in PDB::write.:
> Interrupted system call
> ------------- Processor 0 Exiting: Called CmiAbort ------------
> Reason: FATAL ERROR: Cannot open file
> 'OFMO_CsmABCL_UP_SOLV_runSMD512_1.restart.coor' in PDB::write.:
> Interrupted system call
>
> [0] Stack Traceback:
> [0:0] _Z8NAMD_errPKc+0xa3 [0x4e8f45]
> [0:1] _ZN3PDB5writeEPKcS1_+0x150 [0x97bb2a]
> [0:2] _ZN6Output10coordinateEiiP6VectorP11FloatVectorR7Lattice+0x2c6 [0x941bac]
> [0:3] _ZN24CkIndex_CollectionMaster39_call_receivePositions_CollectVectorMsgEPvP16CollectionMaster+0x141 [0x4fcfd7]
> [0:4] _Z15_processHandlerPvP11CkCoreState+0x55b [0xa4e743]
> [0:5] CsdScheduler+0x424 [0xb18288]
> [0:6] _ZN7BackEnd7suspendEv+0xb [0x4f5a27]
> [0:7] _ZN9ScriptTcl7Tcl_runEPvP10Tcl_InterpiPPc+0x11d [0x9af9ed]
> [0:8] TclInvokeStringCommand+0x91 [0xb50ed8]
> [0:9] /share/home/00288/tg455591/NAMD_2.7b3_Linux-x86_64-ibverbs-Ranger/namd2 [0xb86d28]
> [0:10] Tcl_EvalEx+0x176 [0xb8736b]
> [0:11] Tcl_EvalFile+0x134 [0xb7ed74]
> [0:12] _ZN9ScriptTcl3runEPc+0x13 [0x9ae861]
> [0:13] main+0x259 [0x4ed489]
> [0:14] __libc_start_main+0xdb [0x3a47a1c3fb]
> [0:15] _ZNSt8ios_base4InitD1Ev+0x42 [0x4e80aa]
> Fatal error on PE 0> FATAL ERROR: Cannot open file
> 'OFMO_CsmABCL_UP_SOLV_runSMD512_1.restart.coor' in PDB::write.:
> Interrupted system call
>
> I too would be grateful for any help in explaining these issues (so
> they can be avoided in future runs).
>
> Thanks,
> Sándor
>
>
> On Feb 24, 2011, at 12:56 PM, Lei Shi wrote:
>
>> Has anyone else run into problems launching new NAMD jobs on
>> Ranger (TACC) in the last two days, using the NAMD build described at
>> (http://www.ks.uiuc.edu/Research/namd/wiki/index.cgi?NamdAtTexas)?
>> My jobs fail quickly (the simulation system and qsub script have
>> been working for months). The error message is shown below, and it
>> does not tell me much:
>> ----------
>> TACC: Starting up job 1833721
>> TACC: Setting up parallel environment for MVAPICH ssh-based mpirun.
>> TACC: Setup complete. Running job script.
>> TACC: starting parallel tasks...
>>
>> Child exited abnormally!
>> Killing remote processes...DONE
>> TACC: MPI job exited with code: 1
>> TACC: Shutting down parallel environment.
>> TACC: Shutdown complete. Exiting.
>> ---------
>>
>> I suspect there have been some recent changes to the "parallel
>> environment", which I am not able to detect myself. Can the person(s)
>> in charge of tg455591 help (e.g., run some tests)?
>>
>> Many Thanks!
>> Lei
>>
>
