Re: Charmrun: error on request socket--

From: Jim Phillips (jim_at_ks.uiuc.edu)
Date: Thu Oct 28 2010 - 14:14:28 CDT

This is very strange. NAMD is using the rename() function to avoid
overwriting the previous output file, and the error returned is saying
that 2mrt_md_extend.restart.coor and 2mrt_md_extend.restart.coor.old are
not on the same filesystem. I have no idea how this could be the case.
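For reference, here is a minimal sketch of a rename-based backup that reports which errno came back. This is not NAMD's actual code; the function name backup_by_rename and the ".old" suffix handling are assumptions made for illustration. EXDEV is the errno that Linux prints as "Invalid cross-device link".

```python
import errno
import os


def backup_by_rename(path):
    """Rename path to path + '.old'.

    Returns None on success, otherwise the symbolic errno name
    (for example 'EXDEV' when source and target are on different
    filesystems, or 'ENOENT' when the source does not exist).
    """
    try:
        # On POSIX systems os.rename() calls rename(2) directly.
        os.rename(path, path + ".old")
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return None
```

Calling backup_by_rename("2mrt_md_extend.restart.coor") in the output directory should return None if the rename works there, and "EXDEV" if the same failure is reproduced.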

You can test the same operation in the shell via:
  ln 2mrt_md_extend.restart.coor 2mrt_md_extend.restart.coor.old

(This creates a hard link, not a symbolic link as in "ln -s".)
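The same check can be scripted. The sketch below compares the device IDs that stat() reports for two paths and then attempts a hard link, which is what "ln" does via link(); both link() and rename() fail with EXDEV when the two names are on different filesystems. The helper names same_device and try_hard_link are made up for this example.

```python
import os


def same_device(path_a, path_b):
    """Return True if both paths report the same device ID (st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def try_hard_link(src, dst):
    """Attempt the equivalent of `ln src dst`.

    Returns (True, "ok") on success, otherwise (False, <error text>),
    where the error text comes from os.strerror() for the errno raised.
    """
    try:
        os.link(src, dst)
    except OSError as exc:
        return False, os.strerror(exc.errno)
    return True, "ok"
```

For example, same_device("2mrt_md_extend.restart.coor", ".") compares the file with its directory. If try_hard_link() succeeds, remove the extra link afterwards so it does not get in the way of the next restart write.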

Are you using regular NFS or the new pNFS "Parallel NFS"? (Do you have
multiple file servers for this filesystem, or just multiple clients?)

-Jim

On Thu, 28 Oct 2010, Kwee Hong wrote:

> Hi all,
>
> I ran my simulation on a 14-node cluster and got this error message:
>
>
> WRITING COORDINATES TO DCD FILE AT STEP 1605500
> WRITING COORDINATES TO RESTART FILE AT STEP 1605500
> ERROR: Error on renaming file 2mrt_md_extend.restart.coor to
> 2mrt_md_extend.restart.coor.old: Invalid cross-device link
> FATAL ERROR: Unable to open binary file 2mrt_md_extend.restart.coor: File
> exists
> ------------- Processor 0 Exiting: Called CmiAbort ------------
> Reason: FATAL ERROR: Unable to open binary file 2mrt_md_extend.restart.coor:
> File exists
>
> After I posted the error to the mailing list, we solved it; it was due to
> the file's permissions. Some time later, a similar error occurred, this
> time with some extra output:
>
> ERROR: Error on renaming file ZN_wb_md.restart.coor to
> ZN_wb_md.restart.coor.old: Invalid cross-device link
> FATAL ERROR: Unable to open binary file ZN_wb_md.restart.coor: File exists
> ------------- Processor 0 Exiting: Called CmiAbort ------------
> Reason: FATAL ERROR: Unable to open binary file ZN_wb_md.restart.coor: File
> exists
>
> [0] Stack Traceback:
>   [0:0] CmiAbort+0x5c [0xb4521c]
>   [0:1] _Z8NAMD_errPKc+0x9d [0x520c99]
>   [0:2] _ZN6Output17write_binary_fileEPciP6Vector+0x17e [0x98619e]
>   [0:3] _ZN6Output26output_restart_coordinatesEP6Vectorii+0x1b5 [0x986003]
>   [0:4] _ZN6Output10coordinateEiiP6VectorP11FloatVectorR7Lattice+0x12b
> [0x985c57]
>   [0:5]
> _ZN24CkIndex_CollectionMaster39_call_receivePositions_CollectVectorMsgEPvP16CollectionMaster+0x18f
> [0x533603]
>   [0:6] CkDeliverMessageFree+0x21 [0xa863df]
> Charmrun: error on request socket--
> Socket closed before recv.
>
> This time I doubt the problem has to do with the file's permissions. We are
> using an NFS parallel file system on the cluster. We export the NFS share
> with the (rw,sync,no_subtree_check,no_root_squash) options.
>
>
> Is there any way to tackle this?
>
> Thanks
>
> Regards,
> Joyce
>
