Re: Running Parallel Jobs Simultaneously with MPIRUN

From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Wed Jul 18 2012 - 11:33:46 CDT

On Wed, Jul 18, 2012 at 5:36 PM, Rajan Vatassery <rajan_at_umn.edu> wrote:
> Dear List,
> I have several simulations that require mpirun which I need to run at
> the same time, using the same PBS script. As an example, I have to run
> two jobs in parallel with each other (to save time), that will provide
> some output which a third simulation will use as input:
>
> simulation 1 --
>                 \
>                  -----> simulation 3
>                 /
> simulation 2 --
>
> Up to this point, however, I'm not able to get two simulations to run at
> the same time from the same PBS script. There is no communication
> between simulation 1 and 2, and simulation 3 requires some data from
> both of 1 and 2.
> When I try to run something like this:
>
> mpirun -np 8 namd2 myjob1.conf > myjob1.log &
> mpirun -np 8 namd2 myjob2.conf > myjob2.log &

this cannot work as written. if you background both
calculations without adding a "wait" command, the
script runs straight to its end, the batch job finishes
immediately, and the simulations are killed. or worse.
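
to illustrate only the "wait" point, here is a minimal sketch. run_sim is
a hypothetical stand-in (a one second sleep) for the real mpirun lines, and
the variable names are made up for this example. it says nothing about how
two mpirun instances would share the nodes of one allocation, which is a
separate problem.

```shell
#!/bin/sh
# run_sim is a placeholder (assumption) for the real command, e.g.
#   mpirun -np 8 namd2 myjob1.conf > myjob1.log
run_sim() {
    sleep 1
}

# start both stand-in jobs in the background and remember their PIDs
run_sim 1 &
pid1=$!
run_sim 2 &
pid2=$!

# block until each one has finished; "wait <pid>" returns that job's
# exit status, so a failure in either job can be detected
wait "$pid1"
status1=$?
wait "$pid2"
status2=$?

if [ "$status1" -eq 0 ] && [ "$status2" -eq 0 ]; then
    echo "both finished, simulation 3 could start here"
else
    echo "at least one job failed" >&2
fi
```

without the two wait calls the script would reach its last line while both
background jobs are still running.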

> the jobs do not produce any NAMD-related output, and instead have this
> line in the log files:
>
> 8 total processes killed (some possibly by mpirun during cleanup)
>
> The error file has this entry for each process I'm trying to run:
>
> <start error file>
> mpirun: killing job...
>
> --------------------------------------------------------------------------
> mpirun noticed that process rank 0 with PID 13028 on node cl1n106 exited
> on signal 0 (Unknown signal 0).
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
>
> <end error file>
>
> If I submit the two jobs individually to the cluster, they will run
> without problems.
> I figured that this might be due to mpirun killing the processes that
> it thinks are orphaned or zombie processes, so I tried to add a "nohup"
> before the command. This allowed the two processes to run at the same

no. this is nonsense. you would defeat the restrictions
of the batch system resource management, and any
reasonably skilled sysadmin will be able to squash that.

> time, but they were only using one processor out of the 8 I had
> allocated (as evidenced by the extremely slow computation).
> I am using the following libraries: intel/12.1, ompi/intel,
> intel/11.1.072, namd/2.7-ompi. I noticed there is a conflict between
> intel/12.1 and intel/11.1.072 but presumably those conflicts should also
> exist when I submit the two jobs individually without incident.
> I have already asked the system admins on my cluster (MSI), but I
> believe that this is a NAMD-related issue. Any help is appreciated.

no, it isn't a NAMD issue.

this can be easily managed using PBS/Torque job dependencies.
have a close look at the "qsub" manpage. look for the documentation
of the -W flag. there should be a section about "depend=dependency_list".
this is the feature you need. you just submit job 1 and job 2 and take
note of the batch job ids. then you submit job 3, but in addition use
-W depend=afterok:<jobid1>,afterok:<jobid2>

that will make sure that job 3 launches only after job 1
and job 2 have both completed successfully. bingo!
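
a rough sketch of that workflow as a small submit script. the file names
job1.pbs, job2.pbs and job3.pbs and the helper build_depend are assumptions
made up for this example; the dependency string follows the form given
above. qsub prints the id of the submitted job on stdout, so it can be
captured with $(...). the submit part is skipped on machines without qsub.

```shell
#!/bin/sh
# build_depend (hypothetical helper): turn a list of job ids into the
# argument for "qsub -W", i.e. depend=afterok:<id1>,afterok:<id2>
build_depend() {
    dep_list=""
    for dep_id in "$@"; do
        # prepend a comma separator only when the list is not empty
        dep_list="${dep_list:+$dep_list,}afterok:$dep_id"
    done
    printf 'depend=%s\n' "$dep_list"
}

# only try to submit where a PBS/Torque qsub is actually available
if command -v qsub >/dev/null 2>&1; then
    # job1.pbs, job2.pbs, job3.pbs are placeholder script names
    job1=$(qsub job1.pbs)
    job2=$(qsub job2.pbs)
    # job 3 is held by the batch system until both jobs end successfully
    qsub -W "$(build_depend "$job1" "$job2")" job3.pbs
fi
```

jobs 1 and 2 are independent batch jobs here, so the scheduler is free to
run them at the same time if resources allow.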

cheers,
     axel.

> Thanks,
>
> Rajan
>
>
>

-- 
Dr. Axel Kohlmeyer  akohlmey_at_gmail.com  http://goo.gl/1wk0
International Centre for Theoretical Physics, Trieste. Italy.
