Re: how to properly end NAMD replica job on slurm batch system

From: René Hafner TUK (hamburge_at_physik.uni-kl.de)
Date: Wed Mar 24 2021 - 12:17:49 CDT

I was able to "end" the job properly by killing the zombies via

pgrep "namd2" | xargs kill
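
A slightly more careful variant of the one-liner above, as a sketch only. It assumes bash and the procps tools (pgrep/pkill) are available on the compute node. The function name kill_leftovers and the 10 second default grace period are placeholders chosen for illustration, not something NAMD or SLURM provides. It only touches processes owned by the calling user, sends SIGTERM first, and falls back to SIGKILL for anything still alive after the grace period.

```shell
#!/bin/bash
# Hypothetical helper (the name and defaults are assumptions, not part of NAMD/SLURM).
# Usage: kill_leftovers NAME [GRACE_SECONDS]
# Note: on a shared node this also hits processes called NAME that belong
# to other jobs of the same user.
kill_leftovers() {
    local name="$1" grace="${2:-10}" uid waited=0
    uid="$(id -u)"

    # Ask matching processes of this user to terminate.
    # pkill exits non-zero when nothing matched; that is not an error here.
    pkill -TERM -u "$uid" -x "$name" || return 0

    # Give them up to $grace seconds to exit on their own.
    while pgrep -u "$uid" -x "$name" >/dev/null && [ "$waited" -lt "$grace" ]; do
        sleep 1
        waited=$((waited + 1))
    done

    # Force-kill whatever is still around.
    pkill -KILL -u "$uid" -x "$name" || true
    return 0
}
```

In a submission script this could be called as "kill_leftovers namd2" right after the charmrun line, or registered with "trap 'kill_leftovers namd2' EXIT" so that it also runs when the script exits after an error. Whether it also fires under scancel depends on how the site's SLURM setup delivers signals to the batch script, so treat that part as untested.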

--
René
On 3/24/2021 3:57 PM, René Hafner TUK wrote:
>
> Hi Josh,
>
>     I use NAMD 2.14.
>
>     When I ran 2 replicas and forced a crash, both had an error/end 
> message in the logfile,
>
>     while with 4 replicas I had *at least one* replica logfile with no 
> error/end message written at the end.
>
>     Therefore I guess there is one zombie still left.
>
>     I want to check this with "top -b > file.txt" in my submission 
> script after the line "charmrun namd2...", but I need to wait until a 
> suitable node becomes available again.
>
> Kind regards
>
> René
>
> On 3/24/2021 3:40 PM, Vermaas, Josh wrote:
>>
>> Hi Rene,
>>
>> Is this 2.13 or 2.14? I seem to recall that 2.13 (or maybe it was 
>> 2.12?) *didn’t* kill the other replicas when one replica received a 
>> termination signal, and so you might legitimately be running into an 
>> issue where there are zombie namd processes roaming around on slurm.
>>
>> I typically do not do anything special to clean up after a job 
>> crashes, since it is supposed to take itself down cleanly.
>>
>> -Josh
>>
>> *From: *<owner-namd-l_at_ks.uiuc.edu> on behalf of René Hafner TUK 
>> <hamburge_at_physik.uni-kl.de>
>> *Reply-To: *"namd-l_at_ks.uiuc.edu" <namd-l_at_ks.uiuc.edu>, René Hafner 
>> TUK <hamburge_at_physik.uni-kl.de>
>> *Date: *Wednesday, March 24, 2021 at 9:22 AM
>> *To: *"namd-l_at_ks.uiuc.edu" <namd-l_at_ks.uiuc.edu>
>> *Subject: *namd-l: how to properly end NAMD replica job on slurm 
>> batch system
>>
>> Dear NAMD Maintainers,
>>
>> I work on a cluster with the SLURM batch system.
>>
>> I am currently testing replica simulations and have run into the 
>> following issue: when the replica simulation ends with an error, or I 
>> cancel the job via scancel (since I am only testing...), the node gets 
>> "closed" with the error "*kill task failed*". (It then takes 
>> intervention by the cluster admins to reopen/reboot the node, but 
>> that's local policy, I guess.)
>>
>> Have you ever experienced this?
>>
>> Is there a way to safely end the replica runs even when an error occurs?
>>
>> Do I have to collect process IDs to kill the replica runs myself 
>> before the submission script (containing the call to charmrun 
>> namd2...) ends?
>>
>> Kind regards
>> René
>> -- 
>> --
>> Dipl.-Phys. René Hafner
>> TU Kaiserslautern
>> Germany
