From: René Hafner TUK (hamburge_at_physik.uni-kl.de)
Date: Wed Mar 24 2021 - 12:17:49 CDT
I was able to "end" the job properly by killing the zombies via
pgrep "namd2" | xargs kill
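
The same cleanup could also be run automatically from the submission script. The following is only a sketch under stated assumptions, not a documented NAMD or Slurm feature: `kill_leftovers` is a hypothetical helper name, the process is assumed to be called exactly `namd2`, and the 5-second grace period is arbitrary. It only cleans up the node on which the batch script itself runs, and whether the batch script receives a signal on scancel may depend on the local Slurm setup.

```shell
#!/bin/bash
# Hypothetical helper (not part of NAMD or Slurm): kill processes named "$1"
# that belong to the current user. "$2" is an optional grace period in seconds.
kill_leftovers() {
    local name="$1" pids
    # -u limits matches to our own processes, -x requires an exact name match.
    pids=$(pgrep -u "$(id -u)" -x "$name") || return 0   # nothing left to kill
    # Ask politely first (SIGTERM) ...
    kill $pids 2>/dev/null
    sleep "${2:-5}"
    # ... then force (SIGKILL) whatever is still alive.
    pids=$(pgrep -u "$(id -u)" -x "$name") || return 0
    kill -9 $pids 2>/dev/null
    return 0
}

# Run the cleanup when the script exits; turn SIGTERM into a normal exit
# so that the EXIT trap also fires in that case.
trap 'kill_leftovers namd2' EXIT
trap 'exit 143' TERM

# charmrun namd2 ...   # the actual launch line of the job script goes here
```

Compared with the one-off `pgrep | xargs kill`, this restricts the kill to the user's own processes and escalates to SIGKILL only if SIGTERM was not enough.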
--
René

On 3/24/2021 3:57 PM, René Hafner TUK wrote:
>
> Hi Josh,
>
> I use NAMD 2.14.
>
> When I forced a crash with 2 replicas, both had an error/end message
> in the logfile,
>
> while with 4 replicas *at least one* replica logfile had no
> error/end message written at the end.
>
> Therefore I guess there is one zombie still left.
>
> I wanted to try this now with "top -b > file.txt" in my submission
> script after the line "charmrun namd2..." but need to wait until a
> proper node becomes available again.
>
> Kind regards
>
> René
>
> On 3/24/2021 3:40 PM, Vermaas, Josh wrote:
>>
>> Hi Rene,
>>
>> Is this 2.13 or 2.14? I seem to recall that 2.13 (or maybe it was
>> 2.12?) *didn't* kill the other replicas when one replica received a
>> termination signal, and so you might legitimately be running into an
>> issue where there are zombie namd processes roaming around on slurm.
>>
>> I typically do not do anything special to clean up after a job
>> crashes, since it is supposed to take itself down cleanly.
>>
>> -Josh
>>
>> From: <owner-namd-l_at_ks.uiuc.edu> on behalf of René Hafner TUK
>> <hamburge_at_physik.uni-kl.de>
>> Reply-To: "namd-l_at_ks.uiuc.edu" <namd-l_at_ks.uiuc.edu>, René Hafner
>> TUK <hamburge_at_physik.uni-kl.de>
>> Date: Wednesday, March 24, 2021 at 9:22 AM
>> To: "namd-l_at_ks.uiuc.edu" <namd-l_at_ks.uiuc.edu>
>> Subject: namd-l: how to properly end NAMD replica job on slurm
>> batch system
>>
>> Dear NAMD Maintainers,
>>
>> I work on a cluster with the SLURM batch system.
>>
>> I am currently testing replica simulations and
>>
>> experience the issue that when the replica simulation ends
>> with an error or I cancel the job via scancel (since I am only
>> testing...)
>>
>> the node gets "closed" with the error that "*kill task failed*".
>> (it then takes intervention by cluster admins to reopen/reboot the
>> node but that's local policy I guess)
>>
>> Have you ever experienced this?
>>
>> Is there a way to safely end the replica runs even when an error occurs?
>>
>> Do I have to collect process IDs to kill the replica runs myself
>> before the submission script (containing the call to charmrun
>> namd2... ) ends?
>>
>> Kind regards
>> René
>> --
>> --
>> Dipl.-Phys. René Hafner
>> TU Kaiserslautern
>> Germany
> --
> --
> Dipl.-Phys. René Hafner
> TU Kaiserslautern
> Germany

--
--
Dipl.-Phys. René Hafner
TU Kaiserslautern
Germany
This archive was generated by hypermail 2.1.6 : Fri Dec 31 2021 - 23:17:11 CST