Re: NAMD crashes when writing restart file and dcdfreq is 1

From: Jim Phillips (jim_at_ks.uiuc.edu)
Date: Thu May 26 2011 - 09:12:02 CDT

Hi again,

Copying the list in case anyone else has ideas.

I removed that error message after I noticed it. I just wanted to be sure
you were running the right binary. What this tells me is that when the
crash happens the remove() call isn't returning an error but rename() is
failing because the destination still exists.
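
For anyone following along, the sequence being diagnosed can be sketched as below. This is a minimal Python illustration, not NAMD's actual code; the helper name `rotate_backup` and its return value are assumptions made for the example. It treats a missing backup as harmless and reports which of the two calls failed.

```python
import os


def rotate_backup(path):
    """Move `path` to `path + ".old"`, replacing any previous backup.

    Hypothetical helper for illustration only. Returns a list of error
    messages; an empty list means both steps succeeded.
    """
    messages = []
    backup = path + ".old"

    # Step 1: remove the previous backup. On the first output step there is
    # no backup yet, so a missing file is not reported as an error.
    try:
        os.remove(backup)
    except FileNotFoundError:
        pass
    except OSError as exc:
        messages.append(f"Error on removing file {backup}: {exc.strerror}")

    # Step 2: rename the current file to the backup name. On Windows,
    # os.rename() raises FileExistsError if the destination is still there.
    try:
        os.rename(path, backup)
    except OSError as exc:
        messages.append(
            f"Error on renaming file {path} to {backup}: {exc.strerror}"
        )

    return messages
```

If step 1 reports nothing and step 2 still fails with "File exists", the situation matches the one described above.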

Thanks for the links. I don't need transactions or atomic operations, I
just need the .old file to be gone from the perspective of the same
process when remove() returns so that rename() doesn't fail.
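
One possible workaround, sketched here in Python and untested on the affected machine, is to poll after the remove until the path is no longer visible before attempting the rename. The helper name `remove_and_wait` and the timeout and interval values are assumptions for illustration, not anything taken from NAMD.

```python
import os
import time


def remove_and_wait(path, timeout=1.0, interval=0.01):
    """Remove `path`, then poll until the same process no longer sees it.

    Hypothetical helper. Returns True once the path is gone, or False if it
    is still visible after `timeout` seconds.
    """
    try:
        os.remove(path)
    except FileNotFoundError:
        return True  # nothing to remove, so nothing to wait for

    deadline = time.monotonic() + timeout
    while os.path.exists(path):
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    return True
```

A caller would only go on to rename the restart file over the `.old` name when this returns True.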

Since you're only seeing this on one machine I'm going to call this a
Windows 7 RAID issue.

-Jim

On Thu, 26 May 2011, Ajasja Ljubetič wrote:

> Sorry, you were correct, I did get these errors (I just didn't look at
> the beginning of the log file).
>
> Perhaps the error could be renamed to a warning or info or something like
> that. An error message might leave new users wondering what they are doing
> wrong:)
>
> I don't know that much about Windows file systems myself. I do know that
> Windows 7 has support for atomic file operations (
> http://en.wikipedia.org/wiki/Transactional_NTFS) and that atomic file
> operations are difficult to achieve on older platforms (
> http://stackoverflow.com/questions/167414/is-an-atomic-file-rename-with-overwrite-possible-on-windows
> ).
>
> Thank you for your help and best regards,
> Ajasja
>
> ERROR: Error on removing file out/run/ampCRASH.restart.xsc.old: No such file
> or directory
> WRITING COORDINATES TO RESTART FILE AT STEP 1
> FINISHED WRITING RESTART COORDINATES
> The last position output (seq=1) takes 0.001 seconds, 15.090 MB of memory in
> use
> WRITING VELOCITIES TO RESTART FILE AT STEP 1
> FINISHED WRITING RESTART VELOCITIES
> The last velocity output (seq=1) takes 0.001 seconds, 15.133 MB of memory in
> use
> WRITING COORDINATES TO RESTART FILE AT STEP 2
> ERROR: Error on removing file out/run/ampCRASH.restart.coor.old: No such
> file or directory
> FINISHED WRITING RESTART COORDINATES
> The last position output (seq=2) takes 0.001 seconds, 15.160 MB of memory in
> use
> WRITING VELOCITIES TO RESTART FILE AT STEP 2
> ERROR: Error on removing file out/run/ampCRASH.restart.vel.old: No such file
> or directory
> FINISHED WRITING RESTART VELOCITIES
> The last velocity output (seq=2) takes 0.002 seconds, 15.160 MB of memory in
> use
> WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 3
>
> On Wed, May 25, 2011 at 15:06, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
>
>>
>> Immediately before the rename is a call to remove the backup file (on
>> Windows only, other OS don't need it). The new binary I sent you actually
>> checks the return code of this remove call and prints an error message if it
>> fails. That binary (it's since fixed) will also print the error if the
>> backup file doesn't exist, so if you removed it you would see an error
>> message on the first or second output step. If you can remove all of the
>> .old files and you don't see a "No such file or directory" error on the
>> first restart output then you're running a different binary.
>>
>> There is a single thread that does all of the file I/O and the calls are
>> all synchronous. All I can think is that there is some lag in the
>> filesystem that is visible to a single process, but I don't know enough
>> about Windows filesystems to know if that's the right explanation.
>>
>> -Jim
>>
>>
>>
>> On Wed, 25 May 2011, Ajasja Ljubetič wrote:
>>
>> No, the error is always the same
>>>
>>> WRITING VELOCITIES TO RESTART FILE AT STEP 2243
>>> ERROR: Error on renaming file out/run/ampCRASH.restart.vel to
>>> out/run/ampCRASH.restart.vel.old: File exists
>>> FATAL ERROR: Unable to open binary file out/run/ampCRASH.restart.vel: File
>>> exists
>>>
>>> Could it be that NAMD (or the system) has not finished writing the restart
>>> file from the previous step, while the current step already wants to rename
>>> the restart file? But it can't get access, since it has not yet been
>>> completely written to disk.
>>>
>>> Best regards,
>>> Ajasja
>>> On Tue, May 24, 2011 at 22:30, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
>>>
>>>
>>>> If you remove out/run/ampCRASH.restart.coor.old before you start the run,
>>>> do you at least get a non-fatal error message about "No such file or
>>>> directory" on the first restart?
>>>>
>>>>
>>>> -Jim
>>>>
>>>>
>>>> On Tue, 24 May 2011, Ajasja Ljubetič wrote:
>>>>
>>>> Update:
>>>>
>>>>>
>>>>> On my co-worker's computer the simulation crashed as well; it just took a
>>>>> bit longer:
>>>>>
>>>>> WRITING COORDINATES TO RESTART FILE AT STEP 64147
>>>>> ERROR: Error on renaming file out/run/ampCRASH.restart.coor to
>>>>> out/run/ampCRASH.restart.coor.old: File exists
>>>>> FATAL ERROR: Unable to open binary file out/run/ampCRASH.restart.coor:
>>>>> File
>>>>> exists
>>>>>
>>>>>
>>>>> On Tue, May 24, 2011 at 21:20, Ajasja Ljubetič <
>>>>> ajasja.ljubetic_at_gmail.com> wrote:
>>>>>
>>>>> Hmm, the error message is still the same
>>>>>
>>>>>>
>>>>>> Info: NAMD 2.8b3 for Win32-multicore
>>>>>> Info: Built Tue May 24 12:50:32 CDT 2011 by jcphill on honor
>>>>>>
>>>>>>
>>>>>> WRITING COORDINATES TO RESTART FILE AT STEP 1141
>>>>>> ERROR: Error on renaming file out/run/ampCRASH.restart.coor to
>>>>>> out/run/ampCRASH.restart.coor.old: File exists
>>>>>> FATAL ERROR: Unable to open binary file out/run/ampCRASH.restart.coor:
>>>>>> File
>>>>>> exists
>>>>>>
>>>>>> or
>>>>>>
>>>>>> WRITING VELOCITIES TO RESTART FILE AT STEP 4993
>>>>>> ERROR: Error on renaming file out/run/ampCRASH.restart.vel to
>>>>>> out/run/ampCRASH.restart.vel.old: File exists
>>>>>> FATAL ERROR: Unable to open binary file out/run/ampCRASH.restart.vel:
>>>>>> File
>>>>>> exists
>>>>>>
>>>>>> or
>>>>>>
>>>>>> WRITING COORDINATES TO RESTART FILE AT STEP 519
>>>>>> ERROR: Error on renaming file out/run/ampCRASH.restart.coor to
>>>>>> out/run/ampCRASH.restart.coor.old: File exists
>>>>>> FATAL ERROR: Unable to open binary file out/run/ampCRASH.restart.coor:
>>>>>> File
>>>>>> exists
>>>>>>
>>>>>> These are my hardware specs:
>>>>>> http://speccy.piriform.com/results/1xffAT0XakvsBisYiH5QC2n
>>>>>> (I'm running win7 64).
>>>>>> I can't reproduce the problem on our cluster nodes:
>>>>>> http://speccy.piriform.com/results/cBn73pnmNQnJbQywmIxm3Hu
>>>>>> (running winxp 32)
>>>>>> It also does not appear on my coworker's computer
>>>>>> http://speccy.piriform.com/results/qIlCdgNOzN4OJ6je1LGd3bH
>>>>>> (running win7 64 bit)
>>>>>>
>>>>>> My computer is the only one of these three that has RAID 1. Otherwise
>>>>>> I'm at a loss as to what could be causing the crash (but it's an
>>>>>> unimportant and obscure bug).
>>>>>>
>>>>>> Best regards,
>>>>>> Ajasja
>>>>>>
>>>>>> On Tue, May 24, 2011 at 20:39, Jim Phillips <jim_at_ks.uiuc.edu> wrote:
>>>>>>
>>>>>>
>>>>>> I can't reproduce this locally, but there was an unchecked remove call
>>>>>>> on Windows that was hiding the real error. Please try this binary and
>>>>>>> you should see a potentially useful "Error on removing file" message:
>>>>>>>
>>>>>>> -Jim
>>>>>>>
>>>>>>>
>>>>>>> On Tue, 24 May 2011, Ajasja Ljubetič wrote:
>>>>>>>
>>>>>>> Thank you, this fixes the problem!
>>>>>>>
>>>>>>>
>>>>>>>> There is another issue I noticed while playing around with the test
>>>>>>>> case. It's probably not very relevant for "real life" simulations,
>>>>>>>> but still:
>>>>>>>>
>>>>>>>> If I set the restart frequency below 100 steps (let's say to 1) I get
>>>>>>>> the following errors
>>>>>>>>
>>>>>>>> (restartfreq 1)
>>>>>>>> WRITING VELOCITIES TO RESTART FILE AT STEP 443
>>>>>>>> ERROR: Error on renaming file out/run/ampCRASH.restart.coor to
>>>>>>>> out/run/ampCRASH.restart.coor.old: File exists
>>>>>>>> FATAL ERROR: Unable to open binary file
>>>>>>>> out/run/ampCRASH.restart.coor:
>>>>>>>> File
>>>>>>>> exists
>>>>>>>>
>>>>>>>> OR
>>>>>>>>
>>>>>>>> (restartfreq 10)
>>>>>>>> WRITING VELOCITIES TO RESTART FILE AT STEP 18430
>>>>>>>> ERROR: Error on renaming file out/run/ampCRASH.restart.vel to
>>>>>>>> out/run/ampCRASH.restart.vel.old: File exists
>>>>>>>> FATAL ERROR: Unable to open binary file out/run/ampCRASH.restart.vel:
>>>>>>>> File
>>>>>>>> exists
>>>>>>>>
>>>>>>>>
>>>>>>>> The step at which this happens is random, but it crashes sooner if
>>>>>>>> restartfreq is smaller.
>>>>>>>> I tested this with 2.7 and with the 2.8b3 you have sent me. Both
>>>>>>>> crash in the same way.
>>>>>>>>
>>>>>>>> Thank you & best regards,
>>>>>>>> Ajasja
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>
