Fwd: Re: [Checkpoint] BLCR and NAMD

From: Joseph Farran (jfarran_at_uci.edu)
Date: Wed Oct 23 2013 - 13:46:00 CDT

Greetings.

We have been running NAMD successfully for many moons on our campus cluster.

We recently added the BLCR (Berkeley Lab Checkpoint/Restart) checkpoint facility.

I know that NAMD has its own restart files, but for our user base, using BLCR with NAMD
would make things a lot easier.

NAMD appears to checkpoint under BLCR just fine, but fails on restart. We checked with the
BLCR support group, and they suspect that it may be a "bug" in NAMD, namely a file
descriptor leak (see email below).

The error we get when restarting NAMD with BLCR is:

- Failed to open file '/proc/58743/task/58743/stat'
- cr_restore_all_files [6446]: Unable to restore fd 3 (type=1,err=-2)
- cr_rstrt_child [6446]: Unable to restore files! (err=-2)
Restart failed: No such file or directory
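
For anyone who wants to check a running process for this condition before taking a
checkpoint, here is a minimal sketch in Python. It is not part of NAMD or BLCR; the
function names are made up for illustration, and it assumes a Linux node where
/proc/<pid>/fd is readable (same user or root). It lists the open descriptors whose
targets live under /proc, which is the kind of descriptor the restart above could not
restore.

```python
import os


def is_under_proc(target):
    """Return True if a resolved descriptor target is a path under /proc."""
    return target == "/proc" or target.startswith("/proc/")


def proc_descriptors(pid="self"):
    """List (fd, target) pairs for descriptors of `pid` that point under /proc.

    Linux-only: relies on /proc/<pid>/fd, a directory of symlinks, one per
    open descriptor. When scanning "self", the directory handle opened by
    os.listdir() may itself show up in the result.
    """
    fd_dir = f"/proc/{pid}/fd"
    found = []
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed while we were scanning
        if is_under_proc(target):
            found.append((int(name), target))
    return sorted(found)
```

Calling proc_descriptors() with the PID of the running NAMD process and printing the
result should show whether a descriptor such as fd 3 points at a /proc/.../stat file.
Sockets and pipes show up as targets like "socket:[12345]" and are ignored here.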

Is anyone on the NAMD support staff able to verify whether this is a bug and whether it can be fixed?

Thank you,
Joseph A. Farran
University of California, Irvine
Office of Information Technology
209 Multipurpose Science & Technology
Irvine, CA 92697-2225

-------- Original Message --------
Subject: Re: [Checkpoint] BLCR and NAMD
Date: Sun, 13 Oct 2013 15:20:34 -0700
From: Paul Hargrove <phhargrove_at_lbl.gov>
To: Joseph Farran <jfarran_at_uci.edu>
CC: checkpoint <checkpoint_at_lbl.gov>

Joseph,

I am fairly certain this *is* a BLCR limitation, because to the best of my recollection we don't do anything exceptional for the case that an application has a file open under /proc.

In principle, it might be a "bug" in NAMD if this file is not open intentionally (a "file descriptor leak"). However, the inability to restore this open descriptor is still an unexpected/unintended limitation in BLCR. Since NAMD is a very real application, having it as a motivating case for fixing this limitation would be valuable.

-Paul

On Sun, Oct 13, 2013 at 3:09 PM, Joseph Farran <jfarran_at_uci.edu> wrote:

    Thanks again Paul.

    Let me check with the NAMD folks before I open a bug report, as it's probably not BLCR.

-- 
Paul H. Hargrove                          PHHargrove_at_lbl.gov
Future Technologies Group
Computer and Data Sciences Department     Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
