Re: Running Parallel Jobs Simultaneously with MPIRUN

From: Rajan Vatassery (rajan_at_umn.edu)
Date: Wed Jul 18 2012 - 14:28:07 CDT

This is not the question I asked. Please read more carefully. Twice in
the first paragraph I mentioned that I need to use the same PBS script
to submit the first two jobs. I am aware that I will need to use the
wait command to allow the first two jobs to complete. That isn't the
problem. The problem is that I cannot send even one job to background
without getting a failed calculation.

thanks,

Rajan

On Wed, 2012-07-18 at 18:33 +0200, Axel Kohlmeyer wrote:
> On Wed, Jul 18, 2012 at 5:36 PM, Rajan Vatassery <rajan_at_umn.edu> wrote:
> > Dear List,
> > I have several simulations that require mpirun which I need to run at
> > the same time, using the same PBS script. As an example, I have to run
> > two jobs in parallel with each other (to save time), that will provide
> > some output which a third simulation will use as input:
> >
> > simulation 1 --
> >                \
> >                 -----> simulation 3
> >                /
> > simulation 2 --
> >
> > Up to this point, however, I'm not able to get two simulations to run at
> > the same time from the same PBS script. There is no communication
> > between simulations 1 and 2, and simulation 3 requires some data from
> > both 1 and 2.
> > When I try to run something like this:
> >
> > mpirun -np 8 namd2 myjob1.conf > myjob1.log &
> > mpirun -np 8 namd2 myjob2.conf > myjob2.log &
>
> this cannot work. first of all, if you background both
> calculations without adding a "wait" command, the
> script will just progress and finish immediately and
> thus the simulation will be killed. or worse.
>
> > the jobs do not produce any NAMD-related output, and instead have this
> > line in the log files:
> >
> > 8 total processes killed (some possibly by mpirun during cleanup)
> >
> > The error file has this entry for each process I'm trying to run:
> >
> > <start error file>
> > mpirun: killing job...
> >
> > --------------------------------------------------------------------------
> > mpirun noticed that process rank 0 with PID 13028 on node cl1n106 exited
> > on signal 0 (Unknown signal 0).
> > --------------------------------------------------------------------------
> > mpirun: clean termination accomplished
> >
> > <end error file>
> >
> > If I submit the two jobs individually to the cluster, they will run
> > without problems.
> > I figured that this might be due to mpirun killing the processes that
> > it thinks are orphaned or zombie processes, so I tried to add a "nohup"
> > before the command. This allowed the two processes to run at the same
>
> no. this is nonsense. you would defeat the restrictions
> of the batch system resource management, and any
> reasonably skilled sysadmin will be able to squash that.
>
> > time, but they were only using one processor out of the 8 I had
> > allocated (as evidenced by the extremely slow computation).
> > I am using the following libraries: intel/12.1, ompi/intel,
> > intel/11.1.072, namd/2.7-ompi. I noticed there is a conflict between
> > intel/12.1 and intel/11.1.072 but presumably those conflicts should also
> > exist when I submit the two jobs individually without incident.
> > I have already asked the system admins on my cluster (MSI), but I
> > believe that this is a NAMD-related issue. Any help is appreciated.
>
> no it isn't a NAMD issue.
>
> this can be easily managed using PBS/Torque job dependencies.
> have a close look at the "qsub" manpage. look for the documentation
> of the -W flag. there should be a section about "depend=dependency_list".
> this is the feature you need. you just submit job 1 and job 2 and take
> note of the batch job ids. then you submit job 3, but in addition use
> -W depend=afterok:<jobid1>,afterok:<jobid2>
>
> that will make sure that your job 3 will only launch after job1
> and job2 have successfully completed. bingo!
>
> cheers,
> axel.
>
> > Thanks,
> >
> > Rajan
>
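
For reference, the background-plus-wait pattern discussed in this thread can be sketched roughly as below. This is only an illustration of the shell mechanics, under assumptions: the function name run_two_then_third is made up for this sketch, and myjob3.conf in the usage comment is a hypothetical third input. It does not explain or fix the mpirun failure reported above, and how the allocated cores should be divided between two concurrent mpirun calls is site-specific and not addressed here.

```shell
#!/bin/bash
# Sketch: run two independent commands at the same time, then run a third
# command only if both of the first two exited successfully.
# run_two_then_third is a hypothetical helper name, not part of PBS or NAMD.
run_two_then_third() {
    local first=$1 second=$2 third=$3
    local pid1 pid2 rc1=0 rc2=0

    # Start both commands in the background and remember their PIDs.
    bash -c "$first" &
    pid1=$!
    bash -c "$second" &
    pid2=$!

    # Block until each one finishes; without this the script would reach
    # its end while the background commands are still running.
    wait "$pid1" || rc1=$?
    wait "$pid2" || rc2=$?

    if [ "$rc1" -ne 0 ] || [ "$rc2" -ne 0 ]; then
        echo "upstream step failed (exit codes: $rc1, $rc2)" >&2
        return 1
    fi

    # Both inputs are ready, so the dependent step can start.
    bash -c "$third"
}

# Possible use inside a PBS script (myjob3.conf is a hypothetical name):
# run_two_then_third \
#     "mpirun -np 8 namd2 myjob1.conf > myjob1.log" \
#     "mpirun -np 8 namd2 myjob2.conf > myjob2.log" \
#     "mpirun -np 8 namd2 myjob3.conf > myjob3.log"
```

Waiting on each PID separately, rather than using a bare wait, makes it possible to check the exit status of each simulation before starting the third.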

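The job-dependency approach suggested in the quoted reply could look something like the following sketch. It assumes that qsub prints the new job's ID on standard output and that the three simulations live in separate PBS scripts; the script names job1.pbs, job2.pbs and job3.pbs and the helper names build_dependency and submit_pipeline are invented for this example. The dependency string follows the form given in the reply; check the qsub manpage on your cluster for the exact syntax it accepts.

```shell
#!/bin/bash
# Sketch: submit two independent jobs, then a third job that is held until
# both have completed successfully (PBS/Torque "afterok" dependency).

# Build the argument for qsub -W from a list of job IDs, in the form
# depend=afterok:<jobid1>,afterok:<jobid2>
build_dependency() {
    local dep="" id
    for id in "$@"; do
        dep="${dep:+$dep,}afterok:$id"
    done
    printf 'depend=%s\n' "$dep"
}

# Submit the first two scripts, capture their job IDs, and submit the third
# script with a dependency on both. Arguments are the three script names.
submit_pipeline() {
    local job1 job2
    job1=$(qsub "$1") || return 1
    job2=$(qsub "$2") || return 1
    qsub -W "$(build_dependency "$job1" "$job2")" "$3"
}

# Possible use from a login node (script names are hypothetical):
# submit_pipeline job1.pbs job2.pbs job3.pbs
```

With this layout each simulation gets its own allocation from the batch system, so nothing has to be sent to the background inside a single PBS script.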