From: Gerald Keller (gerald.keller_at_uni-wuerzburg.de)
Date: Thu Aug 13 2020 - 02:51:37 CDT
I hope I am not totally off topic here.
I am running several replicas of RAMD simulations on an HPC cluster. For this I group 10 replicas into one block. In the submission script a for loop starts the replicas one after the other.
Sometimes the wall time limit is reached and the block is canceled while, for example, 3 replicas are still left. Of these, 1 replica had already started but was canceled due to the wall time limit. So far I have resubmitted the remaining replicas, and the canceled replica was overwritten completely.
What I would like to do now is to resume the block from the last step of the canceled block.
I read something about job arrays and job dependencies, but I am not sure whether this would work with NAMD.
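To make the job dependency idea concrete, here is a minimal sketch that assumes the cluster uses Slurm, which is not stated above. The scheduler only chains the jobs and knows nothing about NAMD, so the submission script (the hypothetical `run_block.sh`) would itself have to detect and resume unfinished replicas. `--parsable` and `--dependency=afterany:<jobid>` are Slurm `sbatch` options; the function names are made up for illustration.

```python
import subprocess


def build_sbatch_command(script, after_job_id=None):
    """Build an sbatch command line (assumes Slurm; other schedulers differ)."""
    cmd = ["sbatch", "--parsable"]  # --parsable: print only the job ID
    if after_job_id is not None:
        # afterany: start once the previous job has ended, whatever its exit state
        cmd.append(f"--dependency=afterany:{after_job_id}")
    cmd.append(script)
    return cmd


def submit_chain(script, n_jobs):
    """Submit n_jobs copies of the same script, each waiting for the previous one."""
    job_ids = []
    previous = None
    for _ in range(n_jobs):
        result = subprocess.run(
            build_sbatch_command(script, previous),
            check=True, capture_output=True, text=True,
        )
        # With --parsable the output is "jobid" or "jobid;cluster"
        previous = result.stdout.strip().split(";")[0]
        job_ids.append(previous)
    return job_ids
```

Calling `submit_chain("run_block.sh", 3)` would queue three jobs that run one after the other, so a block cut off by the wall time limit is picked up by the next job in the chain.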
My other idea is to resume the canceled replica first and then start the 2 remaining replicas. The latter step is clear, but I am not really sure how to resume the canceled one.
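Deciding which replicas are finished, interrupted, or not yet started could be sketched as follows. The directory layout (`replica_01` ... `replica_10`), the log file name `ramd.log`, and the end-of-run marker are all assumptions for illustration; the marker in particular should be checked against a log from a run that completed normally.

```python
from pathlib import Path

# Assumption: a cleanly finished run leaves this text in the log.
# Verify against your own logs before relying on it.
FINISHED_MARKER = "End of program"


def classify_replica(replica_dir, log_name="ramd.log", marker=FINISHED_MARKER):
    """Return 'finished', 'interrupted', or 'not_started' for one replica."""
    log_path = Path(replica_dir) / log_name
    if not log_path.is_file():
        return "not_started"      # no log yet: the replica never ran
    text = log_path.read_text(errors="replace")
    # A log without the end marker means the run was cut off
    return "finished" if marker in text else "interrupted"


def plan_block(block_dir, n_replicas=10, dir_pattern="replica_{:02d}"):
    """Map every replica directory name in a block to its state."""
    block = Path(block_dir)
    return {
        dir_pattern.format(i): classify_replica(block / dir_pattern.format(i))
        for i in range(1, n_replicas + 1)
    }
```

A resubmitted job could then skip the `finished` replicas, resume the `interrupted` one from its restart files, and start the `not_started` ones from scratch.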
For this I would resume from the restart files, look in the log file for the last STEP, and set that value as firsttimestep in the configuration file. I am not sure whether the existing trajectory can be appended to, or whether I have to set a different DCD output name and merge the files later. In local tests the RAMD logging for the resumed simulations begins with STEP 0, so I would have to post-modify the logged steps.
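Looking up the step to resume from could be sketched as follows, assuming the NAMD log contains `ENERGY:` lines whose first number is the timestep and that the restart `.xsc` file stores the step as the first field of its data line. Restart files are only written every `restartfreq` steps, so the step in the `.xsc` file can be earlier than the last step in the log; in that case the restart step is probably the safer value for firsttimestep. The function names are hypothetical.

```python
import re

# Assumption: energy output lines look like "ENERGY:    1000   ...".
ENERGY_LINE = re.compile(r"^ENERGY:\s+(\d+)\s")


def last_logged_step(log_text):
    """Return the timestep of the last ENERGY: line, or None if there is none."""
    last = None
    for line in log_text.splitlines():
        match = ENERGY_LINE.match(line)
        if match:
            last = int(match.group(1))
    return last


def restart_step(xsc_text):
    """Return the step stored in a restart .xsc file, or None if not found.

    Assumption: the first non-comment line starts with the step number.
    """
    for line in xsc_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            return int(line.split()[0])
    return None
```

Comparing the two values shows how many steps after the last restart write were lost when the job was canceled.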
Are there any suggestions on how to manage this efficiently? In my opinion, resubmitting the job would be the easiest way, if that is possible. As a next step I would like to run relatively long plain MD simulations, and resubmission would be really useful for those as well.