Re: how to properly end NAMD replica job on slurm batch system

From: Vermaas, Josh (vermaasj_at_msu.edu)
Date: Wed Mar 24 2021 - 09:40:29 CDT

Hi Rene,

Is this 2.13 or 2.14? I seem to recall that 2.13 (or maybe it was 2.12?) *didn’t* kill the other replicas when one replica received a termination signal, and so you might legitimately be running into an issue where there are zombie namd processes roaming around on slurm.

I typically do not do anything special to clean up after a job crashes, since it is supposed to take itself down cleanly.

-Josh

From: <owner-namd-l_at_ks.uiuc.edu> on behalf of René Hafner TUK <hamburge_at_physik.uni-kl.de>
Reply-To: "namd-l_at_ks.uiuc.edu" <namd-l_at_ks.uiuc.edu>, René Hafner TUK <hamburge_at_physik.uni-kl.de>
Date: Wednesday, March 24, 2021 at 9:22 AM
To: "namd-l_at_ks.uiuc.edu" <namd-l_at_ks.uiuc.edu>
Subject: namd-l: how to properly end NAMD replica job on slurm batch system


Dear NAMD Maintainers,



I work on a cluster with the SLURM batch system.

I am currently testing replica simulations and have run into the following issue: when the replica simulation ends with an error, or when I cancel the job via scancel (since I am only testing...), the node gets "closed" with the error "kill task failed". (It then takes intervention by the cluster admins to reopen/reboot the node, but that's local policy, I guess.)



Have you ever experienced this?

Is there a way to safely end the replica runs even when an error occurs?

Do I have to collect the process IDs and kill the replica runs myself before the submission script (containing the call to charmrun namd2...) ends?
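
For illustration, the kind of manual cleanup meant here could look like the following sketch. It is not a tested NAMD recipe: the function names stop_pid and run_with_cleanup and the variable GRACE_SECONDS are hypothetical, and the sketch only signals the launcher process started by the script. Whether that is enough to take down every replica on every node is an assumption that depends on the NAMD/charmrun build.

```shell
#!/bin/bash
# Sketch only: stop_pid, run_with_cleanup and GRACE_SECONDS are hypothetical names.

# Stop the process with the given PID: SIGTERM first, SIGKILL after a grace period.
stop_pid() {
    local pid
    local grace
    local i
    pid=$1
    grace=${GRACE_SECONDS:-5}
    kill -0 "$pid" 2>/dev/null || return 0    # process is already gone
    kill -TERM "$pid" 2>/dev/null
    for i in $(seq "$grace"); do
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
    done
    kill -KILL "$pid" 2>/dev/null             # still alive after the grace period
}

# Run a command in the background, remember its PID, and make sure it is
# stopped when the script exits or is cancelled.
run_with_cleanup() {
    "$@" &
    launcher_pid=$!
    trap 'stop_pid "$launcher_pid"' EXIT
    trap 'exit 143' TERM INT                  # turn a cancel signal into a normal exit
    wait "$launcher_pid"                      # returns the command's exit status
}

# In the submission script this would wrap the existing launcher call:
# run_with_cleanup charmrun namd2 ...
```

The EXIT trap runs on both a normal end and an error exit of the script, and the TERM/INT trap routes a cancellation through the same path, so the launcher gets a SIGTERM and, if it ignores that, a SIGKILL.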



Kind regards

René



--

Dipl.-Phys. René Hafner

TU Kaiserslautern

Germany

This archive was generated by hypermail 2.1.6 : Fri Dec 31 2021 - 23:17:11 CST