BLCR and NAMD

From: Phil Miller (mille121_at_illinois.edu)
Date: Thu Nov 07 2013 - 22:47:14 CST

Hi Joseph and Paul,

I'm one of the Charm++ developers, and I happened to notice your
conversation in the NAMD mailing list archive about checkpointing NAMD
with BLCR.

The open file descriptor under /proc is not a leak, in the sense that the
Charm++ runtime system code that opens this file maintains a reference to
it and will refer to it later if certain functions are called. However,
it's probably not something that we should really be keeping open.

We're using it to identify which core a running task is currently mapped
to. In situations in which that's useful information to have, the threads
should probably be getting pinned to particular cores at startup anyway
(+setcpuaffinity and +pemap flags), and thus the results can be cached and
the file closed. Without thread affinity, the OS can be moving threads
arbitrarily anyway, so any use of that information is apt to be stale.

I'll open up an issue in the Charm++ bug tracker about caching this
information and closing the associated file descriptor.

My suggestion for how BLCR ought to handle this is to look for file
descriptors under /proc matching the checkpointing process's PID/TID, and
somehow mark them to be replaced with the restarting process's PID/TID.
Perhaps just store the original PID/TID with the checkpoint (if you don't
already), so that the restart procedure can compare against it at the point
of need.
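
As a rough illustration of that comparison at restart time, the sketch
below rewrites a saved /proc path when it names the original PID (and
optionally a TID). It is purely hypothetical: remap_proc_path and its
parameters are made up for this example and are not part of BLCR, and I'm
assuming paths of the form /proc/<pid>/... or /proc/<pid>/task/<tid>/...

```python
import re

# Matches /proc/<pid>[/task/<tid>][/rest]; anything else is left alone.
_PROC_RE = re.compile(r"/proc/(\d+)(?:/task/(\d+))?(/.*)?")


def remap_proc_path(path, old_pid, new_pid, tid_map=None):
    """Map a path recorded at checkpoint time onto the restarted process.

    tid_map is a hypothetical {old_tid: new_tid} table. The main thread's
    TID equals the PID, so that pair is always included.
    """
    mapping = {old_pid: new_pid}
    mapping.update(tid_map or {})

    m = _PROC_RE.fullmatch(path)
    if m is None or int(m.group(1)) != old_pid:
        return path  # not one of our own /proc entries

    tid, tail = m.group(2), m.group(3) or ""
    if tid is None:
        return f"/proc/{new_pid}{tail}"
    # A TID we have no mapping for is kept as recorded.
    new_tid = mapping.get(int(tid), int(tid))
    return f"/proc/{new_pid}/task/{new_tid}{tail}"
```

For example, with original PID 1234 restarted as 9000 and thread 1240
restarted as 9006, "/proc/1234/task/1240/stat" would be reopened as
"/proc/9000/task/9006/stat", while "/proc/meminfo" is left untouched.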

Phil
