From: Paul Hargrove (phhargrove_at_lbl.gov)
Date: Fri Nov 08 2013 - 01:54:48 CST
On Thu, Nov 7, 2013 at 8:47 PM, Phil Miller <mille121_at_illinois.edu> wrote:
> My suggestion for how BLCR ought to handle this is to look for file
> descriptors under /proc matching the checkpointing process's PID/TID, and
> somehow marking them to be replaced with the restarting process PID/TID.
> Perhaps just store the original PID/TID with the checkpoint (if you don't
> already), so that the restart procedure can compare against it at the point
> of need.
Phil,
BLCR actually restores the PGID/PID/TID so the values after restart are the
SAME as the ones at checkpoint time.
However, because of the way BLCR currently orders its restart operations,
files are reopened before PID/TID restoration is done. At the time files are
being reopened, the restarting process temporarily has whatever PID/TID fork
(or clone) just happened to allocate to it. Thus /proc/1234 (for example)
will exist at the end of the restart, but not at the point in time that BLCR
attempts to reopen /proc/1234/task/1234/stat. There are technical reasons,
which I won't go into here, why PID/TID restoration needs to be done "very
late" in the restart process, and they make completely swapping the order of
file and PID/TID restoration impractical.
However, I think BLCR could treat open /proc/<pid> files and directories
as a case distinct from others and defer their reopen until even later
than the restoration of PID/TID. Unlike the case of files (for which we
may need to checkpoint the CONTENTS) the state that needs to be "stashed"
to delay a /proc/<pid> file/dir is very small (path, mode and offset).
However, that is still a non-trivial change and not one I can make with a
simple patch.
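
To make the "stashed" state concrete, here is a minimal user-space sketch in
Python. It is only an illustration: BLCR does this work inside the kernel, in
C, and none of the names below (DeferredOpen, stash, reopen) exist in BLCR. I
am also assuming that "mode" maps onto the open(2) flags reported by F_GETFL.

```python
import fcntl
import os
from dataclasses import dataclass


@dataclass
class DeferredOpen:
    """Hypothetical record of what is needed to reopen a file later."""
    path: str    # path as it was at checkpoint time
    flags: int   # access mode and status flags (assumed meaning of "mode")
    offset: int  # file position at checkpoint time


def stash(fd: int, path: str) -> DeferredOpen:
    """Capture path, flags and offset for an open descriptor."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFL)
    offset = os.lseek(fd, 0, os.SEEK_CUR)  # query position without moving it
    return DeferredOpen(path, flags, offset)


def reopen(rec: DeferredOpen) -> int:
    """Reopen from a stashed record; meant to run after PID/TID restoration."""
    fd = os.open(rec.path, rec.flags)
    os.lseek(fd, rec.offset, os.SEEK_SET)
    return fd
```

The point of the sketch is that nothing about the file CONTENTS is saved, so
the record can sit untouched until the original PID/TID is back in place and
the recorded /proc/<pid> path resolves again.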
Your suggestion might be applied at the time files are opened in BLCR now,
without reordering any operations. However, I looked into that and there
is at least one non-obvious problem: the "renamed" path will be recorded
in the kernel as the path to the open file. If another checkpoint is taken
after a restart, the WRONG filename is recorded and the correspondence is
lost. Even without a subsequent checkpoint, the wrong path would show in
"ls -l /proc/<pid>/fd".
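
A rough Python sketch of the "rename" idea and of how the recorded path can
be inspected follows. Both function names are hypothetical, this is not BLCR
code, and for simplicity it assumes the PID and TID components to replace are
the same number (as in the /proc/1234/task/1234/stat example).

```python
import os


def rewrite_proc_path(path: str, old_id: int, new_id: int) -> str:
    """Swap the checkpoint-time PID/TID in a /proc path for a new one.

    Only /proc/<old_id>/... and /proc/<old_id>/task/<old_id>/... are
    rewritten; any other path is returned unchanged.
    """
    parts = path.split("/")  # ['', 'proc', '<pid>', 'task', '<tid>', ...]
    if len(parts) > 2 and parts[1] == "proc" and parts[2] == str(old_id):
        parts[2] = str(new_id)
        if len(parts) > 4 and parts[3] == "task" and parts[4] == str(old_id):
            parts[4] = str(new_id)
    return "/".join(parts)


def recorded_path(fd: int) -> str:
    """Return the path the kernel reports for an open descriptor.

    Reads the same symlink that "ls -l /proc/<pid>/fd" displays, so it
    only works on Linux with /proc mounted.
    """
    return os.readlink(f"/proc/self/fd/{fd}")
```

With these, rewrite_proc_path("/proc/1234/task/1234/stat", 1234, 5678) yields
the path under /proc/5678. A file opened via that rewritten path is then known
to the kernel by the rewritten name, which is the name recorded_path() (and a
later checkpoint) would see rather than the original one.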
-Paul
--
Paul H. Hargrove                          PHHargrove_at_lbl.gov
Future Technologies Group
Computer and Data Sciences Department     Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900