Re: Charmrun: error on request socket--

From: Kwee Hong (joyssstan0202_at_gmail.com)
Date: Thu Oct 28 2010 - 22:05:27 CDT

On Fri, Oct 29, 2010 at 3:14 AM, Jim Phillips <jim_at_ks.uiuc.edu> wrote:

>
> This is very strange. NAMD is using the rename() function to avoid
> overwriting the previous output file, and the error returned is saying that
> 2mrt_md_extend.restart.coor and 2mrt_md_extend.restart.coor.old are not on
> the same filesystem. I have no idea how this could be the case.
>
> You can test the same operation in the shell via:
> ln 2mrt_md_extend.restart.coor 2mrt_md_extend.restart.coor.old
>
> (This creates a hard link, not a symbolic link as in "ln -s".)
>
> Are you using regular NFS or the new pNFS "Parallel NFS"? (Do you have
> multiple file servers for this filesystem, or just multiple clients?)
>

Hmm... We are using regular NFS with multiple clients...
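
In case it helps anyone reproduce the check outside NAMD, here is a minimal
Python sketch. It is not taken from NAMD, and the helper names same_device and
try_backup_rename are made up for illustration. It compares the device IDs that
stat() reports for two paths and attempts the same kind of rename that fails
above, returning the symbolic errno name. "Invalid cross-device link" is the
message text for EXDEV.

```python
import errno
import os


def same_device(path_a, path_b):
    """Return True if stat() reports the same device ID for both paths."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def try_backup_rename(src, dst):
    """Try to rename src to dst.

    Returns None on success, otherwise the symbolic errno name
    (for example "EXDEV" for "Invalid cross-device link").
    """
    try:
        os.rename(src, dst)
    except OSError as exc:
        # Map the numeric errno to its symbolic name where one is known.
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return None
```

On an affected node, try_backup_rename("2mrt_md_extend.restart.coor",
"2mrt_md_extend.restart.coor.old") would be expected to return "EXDEV" if the
rename is being treated as crossing filesystems. It really does rename the file
on success, so it is safer to try it on a copy.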

>
> -Jim
>
>
> On Thu, 28 Oct 2010, Kwee Hong wrote:
>
>> Hi all,
>>
>>
>> I ran my simulation on a 14-node cluster and got this error message:
>>
>>
>> WRITING COORDINATES TO DCD FILE AT STEP 1605500
>> WRITING COORDINATES TO RESTART FILE AT STEP 1605500
>> ERROR: Error on renaming file 2mrt_md_extend.restart.coor to 2mrt_md_extend.restart.coor.old: Invalid cross-device link
>> FATAL ERROR: Unable to open binary file 2mrt_md_extend.restart.coor: File exists
>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>> Reason: FATAL ERROR: Unable to open binary file 2mrt_md_extend.restart.coor: File exists
>>
>> After posting that error to the mailing list, we got it solved; it was
>> due to the file's permissions. Some time later, another similar error
>> occurred, this time with some extra output:
>>
>> ERROR: Error on renaming file ZN_wb_md.restart.coor to ZN_wb_md.restart.coor.old: Invalid cross-device link
>> FATAL ERROR: Unable to open binary file ZN_wb_md.restart.coor: File exists
>> ------------- Processor 0 Exiting: Called CmiAbort ------------
>> Reason: FATAL ERROR: Unable to open binary file ZN_wb_md.restart.coor: File exists
>>
>> [0] Stack Traceback:
>>   [0:0] CmiAbort+0x5c [0xb4521c]
>>   [0:1] _Z8NAMD_errPKc+0x9d [0x520c99]
>>   [0:2] _ZN6Output17write_binary_fileEPciP6Vector+0x17e [0x98619e]
>>   [0:3] _ZN6Output26output_restart_coordinatesEP6Vectorii+0x1b5 [0x986003]
>>   [0:4] _ZN6Output10coordinateEiiP6VectorP11FloatVectorR7Lattice+0x12b [0x985c57]
>>   [0:5] _ZN24CkIndex_CollectionMaster39_call_receivePositions_CollectVectorMsgEPvP16CollectionMaster+0x18f [0x533603]
>>   [0:6] CkDeliverMessageFree+0x21 [0xa863df]
>> Charmrun: error on request socket--
>> Socket closed before recv.
>>
>> This time I doubt the problem has to do with the file's permissions. We
>> are using an NFS parallel file system on the cluster. We export the NFS
>> share with the (rw,sync,no_subtree_check,no_root_squash) options.
>>
>>
>> Is there any way to tackle this?
>>
>> Thanks
>>
>> Regards,
>> Joyce
>>
>>

This archive was generated by hypermail 2.1.6 : Wed Feb 29 2012 - 15:54:42 CST