Re: Fwd: Re: [Checkpoint] BLCR and NAMD

From: Joseph Farran (jfarran_at_uci.edu)
Date: Fri Oct 25 2013 - 12:33:29 CDT

Hi.

Any feedback from the NAMD development group on this?

Joseph

On 10/23/2013 11:46 AM, Joseph Farran wrote:
> Greetings.
>
> We have been running NAMD successfully for many moons on our campus cluster.
>
> We recently added the BLCR (Berkeley Lab Checkpoint/Restart) checkpoint facility.
>
> I know that NAMD has its own restart files, but for our user base, using BLCR with NAMD
> would make things a lot easier.
>
> NAMD appears to BLCR checkpoint just fine, but fails on restart. Checking with the BLCR
> support group, they suspect that it may be a "bug" with a NAMD file descriptor leak (see
> email below).
>
> The error we get on NAMD startup with BLCR is:
>
> - Failed to open file '/proc/58743/task/58743/stat'
> - cr_restore_all_files [6446]: Unable to restore fd 3 (type=1,err=-2)
> - cr_rstrt_child [6446]: Unable to restore files! (err=-2)
> Restart failed: No such file or directory
>
>
> Anyone in the NAMD support staff able to verify if this is a bug and if it can be fixed?
>
> Thank you,
> Joseph A. Farran
> University of California, Irvine
> Office of Information Technology
> 209 Multipurpose Science & Technology
> Irvine, CA 92697-2225
>
>
>
> -------- Original Message --------
> Subject: Re: [Checkpoint] BLCR and NAMD
> Date: Sun, 13 Oct 2013 15:20:34 -0700
> From: Paul Hargrove <phhargrove_at_lbl.gov>
> To: Joseph Farran <jfarran_at_uci.edu>
> CC: checkpoint <checkpoint_at_lbl.gov>
>
>
>
> Joseph,
>
> I am fairly certain this *is* a BLCR limitation, because to the best of my recollection we don't do anything exceptional for the case that an application has a file open under /proc.
>
> In principle, it might be a "bug" in NAMD if this file is not open intentionally (a "file descriptor leak"). However, the inability to restore this open descriptor is still an unexpected/unintended limitation in BLCR. Since NAMD is a very real application, having it as a motivating case for fixing this limitation would be valuable.
>
> -Paul
>
>
> On Sun, Oct 13, 2013 at 3:09 PM, Joseph Farran <jfarran_at_uci.edu> wrote:
>
> Thanks again Paul.
>
> Let me check with NAMD folks before I open a bug report as it's probably not BLCR.
>
>
>
>
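
The restart error quoted above points at fd 3 being open on a file under /proc. One way to confirm which descriptors a running process holds under /proc is to read the symlinks in /proc/<pid>/fd. The following is only a rough diagnostic sketch, assuming a Linux /proc filesystem and permission to inspect the target process; the function name proc_backed_fds is hypothetical and is not part of NAMD or BLCR.

```python
import os


def proc_backed_fds(pid="self"):
    """Return {fd: target} for open descriptors of `pid` that point under /proc.

    Hypothetical helper for illustration; assumes Linux and read access
    to /proc/<pid>/fd.
    """
    fd_dir = f"/proc/{pid}/fd"
    found = {}
    for name in os.listdir(fd_dir):
        try:
            # Each entry is a symlink to whatever the descriptor refers to.
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        if target.startswith("/proc/"):
            found[int(name)] = target
    return found
```

Calling proc_backed_fds(pid) with the PID of the NAMD process before taking the checkpoint should show whether a descriptor such as /proc/<pid>/task/<tid>/stat is being held open. It only reports the open descriptor; it does not say whether the descriptor is intentional, and it does not change how BLCR handles it.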
