Re: BLCR and NAMD

From: Joseph Farran (jfarran_at_uci.edu)
Date: Thu Nov 07 2013 - 22:56:44 CST

Hi Phil.

Thank you for the details. Paul Hargrove was able to provide us with a patch to BLCR by which cr_restart will warn but keep running on missing files in /proc.

We have been using the patched BLCR version on our cluster with great success and NAMD is able to be check-pointed and resumed without issues.

Many thanks to Paul Hargrove, as NAMD is used extensively on our campus cluster. If you can fix this in the next release of NAMD, that would be great, as others may enjoy BLCR check-pointing with NAMD as well.

Cheers,
Joseph

On 11/7/2013 8:47 PM, Phil Miller wrote:
> Hi Joseph and Paul,
>
> I'm one of the Charm++ developers, and noticed your conversation in the NAMD mailing list archive about checkpointing NAMD with BLCR by happenstance.
>
> The open file descriptor under /proc is not a leak, in the sense that the Charm++ runtime system code that opens this file maintains a reference to it and will refer to it later if certain functions
> are called. However, it's probably not something that we should really be keeping open.
>
> We're using it to identify which core a running task is currently mapped to. In situations in which that's useful information to have, the threads should probably be getting pinned to particular
> cores at startup anyway (+setcpuaffinity and +pemap flags), and thus the results can be cached and the file closed. Without thread affinity, the OS can be moving threads arbitrarily anyway, so any
> use of that information is apt to be stale.
>
> I'll open up an issue in the Charm++ bug tracker about caching this information and closing the associated file descriptor.
>
> My suggestion for how BLCR ought to handle this is to look for file descriptors under /proc matching the checkpointing process's PID/TID, and somehow mark them to be replaced with the restarting
> process PID/TID. Perhaps just store the original PID/TID with the checkpoint (if you don't already), so that the restart procedure can compare against it at the point of need.
>
> Phil
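
For anyone following the thread, the cache-and-close idea Phil describes could look roughly like the sketch below. It is an illustration in Python, not Charm++ code: the names parse_processor_field and CoreLookup are made up, and it assumes the core is read from the "processor" field (field 39) of /proc/self/stat as documented in proc(5), which may not be the exact file the runtime opens. With affinity set, the value is read once and the file is closed; without it, the file is reopened on each query so no descriptor under /proc stays open across a checkpoint.

```python
def parse_processor_field(stat_text):
    """Return the 'processor' field (field 39) of a /proc/<pid>/stat line."""
    # Field 2 (comm) is parenthesized and may itself contain spaces or ')',
    # so split on the last ')' before tokenizing the remaining fields.
    rest = stat_text.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 39 sits at index 36.
    return int(rest[36])


class CoreLookup:
    """Hypothetical helper: report the current core without holding an fd."""

    def __init__(self, pinned, stat_path="/proc/self/stat"):
        self.pinned = pinned        # True if the thread is pinned to a core
        self.stat_path = stat_path  # assumed source of the core number
        self._cached = None

    def core(self):
        # A pinned thread cannot migrate, so the first answer stays valid.
        if self.pinned and self._cached is not None:
            return self._cached
        # Open, read, and close immediately; nothing is left open under /proc.
        with open(self.stat_path) as f:
            value = parse_processor_field(f.read())
        if self.pinned:
            self._cached = value
        return value
```

Calling CoreLookup(pinned=True).core() on a Linux machine returns the core number and leaves no open file behind. The unpinned case still works, but as noted in the quoted message, its result may already be stale by the time it is used.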

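The restart-side suggestion in the quoted message (store the original PID/TID with the checkpoint, then compare at restart) can be sketched the same way. This is only an illustration of the path rewrite in Python and is not how BLCR is implemented; remap_proc_path and its parameters are hypothetical, and it handles a single thread ID for brevity.

```python
import re

# Matches /proc/<pid>[/task/<tid>][/anything]; other /proc paths do not match.
_PROC_RE = re.compile(r"^/proc/(\d+)(?:/task/(\d+))?(/.*)?$")


def remap_proc_path(path, old_pid, old_tid, new_pid, new_tid):
    """Rewrite a saved /proc path so it points at the restarted process."""
    m = _PROC_RE.match(path)
    if m is None:
        return path  # e.g. /proc/cpuinfo or a non-/proc file: leave alone
    pid, tid, rest = int(m.group(1)), m.group(2), m.group(3) or ""
    if pid != old_pid:
        return path  # refers to some other process: leave alone
    if tid is None:
        return f"/proc/{new_pid}{rest}"
    if int(tid) != old_tid:
        return path  # a thread we have no mapping for: leave alone
    return f"/proc/{new_pid}/task/{new_tid}{rest}"
```

A real implementation would need a full old-TID to new-TID table for multithreaded jobs, and would have to decide what to do with paths it cannot map. Warning and continuing, as the patched cr_restart does for missing files, is one option.
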