Re: Running Parallel Jobs Simultaneously with MPIRUN

From: Aron Broom (broomsday_at_gmail.com)
Date: Wed Jul 18 2012 - 15:30:56 CDT

Since what Axel suggested would seem to be a good solution for the problem
in general, perhaps you should explain why it is that you need both jobs
submitted from the same script, as this seems to be the crux of the
problem. You said that there is no communication between Sims 1 and 2, so
I don't see any reason for them needing to be run at EXACTLY the same
time.
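
For reference, the dependency approach Axel describes could be sketched
roughly as below. This is only a sketch under assumptions: depend_arg is a
hypothetical helper (not part of PBS/Torque), the script names sim1.pbs,
sim2.pbs and sim3.pbs are placeholders, and it assumes qsub prints the new
job id on standard output.

```shell
#!/bin/sh
# Hypothetical helper: turn a list of batch job ids into the argument
# for qsub's -W flag, in the form Axel gave:
#   depend=afterok:<jobid1>,afterok:<jobid2>
depend_arg() {
    arg=""
    for id in "$@"; do
        if [ -z "$arg" ]; then
            arg="depend=afterok:$id"
        else
            arg="$arg,afterok:$id"
        fi
    done
    printf '%s\n' "$arg"
}

# Intended use on the cluster (placeholder script names, left commented
# out so the sketch can be read and run without a batch system):
#   JOB1=$(qsub sim1.pbs)
#   JOB2=$(qsub sim2.pbs)
#   qsub -W "$(depend_arg "$JOB1" "$JOB2")" sim3.pbs
```

With this, simulations 1 and 2 are submitted as separate jobs and the
scheduler holds simulation 3 until both have finished successfully, so
nothing has to be backgrounded inside a single script.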

~Aron

On Wed, Jul 18, 2012 at 3:28 PM, Rajan Vatassery <rajan_at_umn.edu> wrote:

> This is not the question I asked. Please read more carefully. Twice in
> the first paragraph I mentioned that I need to use the same PBS script
> to submit the first two jobs. I am aware that I will need to use the
> wait command to allow the first two jobs to complete. That isn't the
> problem. The problem is that I cannot send even one job to background
> without getting a failed calculation.
>
> thanks,
>
> Rajan
>
> On Wed, 2012-07-18 at 18:33 +0200, Axel Kohlmeyer wrote:
> > On Wed, Jul 18, 2012 at 5:36 PM, Rajan Vatassery <rajan_at_umn.edu> wrote:
> > > Dear List,
> > > I have several simulations that require mpirun which I need to run at
> > > the same time, using the same PBS script. As an example, I have to run
> > > two jobs in parallel with each other (to save time), that will provide
> > > some output which a third simulation will use as input:
> > >
> > > simulation 1 --
> > >                 \
> > >                  -----> simulation 3
> > >                 /
> > > simulation 2 --
> > >
> > > Up to this point, however, I'm not able to get two simulations to run at
> > > the same time from the same PBS script. There is no communication
> > > between simulation 1 and 2, and simulation 3 requires some data from
> > > both of 1 and 2.
> > > When I try to run something like this:
> > >
> > > mpirun -np 8 namd2 myjob1.conf > myjob1.log &
> > > mpirun -np 8 namd2 myjob2.conf > myjob2.log &
> >
> > this cannot work. first of all, if you background both
> > calculations without adding a "wait" command, the
> > script will just progress and finish immediately and
> > thus the simulation will be killed. or worse.
> >
> > > the jobs do not produce any NAMD-related output, and instead have this
> > > line in the log files:
> > >
> > > 8 total processes killed (some possibly by mpirun during cleanup)
> > >
> > > The error file has this entry for each process I'm trying to run:
> > >
> > > <start error file>
> > > mpirun: killing job...
> > >
> > > --------------------------------------------------------------------------
> > > mpirun noticed that process rank 0 with PID 13028 on node cl1n106 exited
> > > on signal 0 (Unknown signal 0).
> > > --------------------------------------------------------------------------
> > > mpirun: clean termination accomplished
> > >
> > > <end error file>
> > >
> > > If I submit the two jobs individually to the cluster, they will run
> > > without problems.
> > > I figured that this might be due to mpirun killing the processes that
> > > it thinks are orphaned or zombie processes, so I tried to add a "nohup"
> > > before the command. This allowed the two processes to run at the same
> >
> > no. this is nonsense. you would defeat the restrictions
> > of the batch system resource management, and any
> > reasonably skilled sysadmin will be able to squash that.
> >
> > > time, but they were only using one processor out of the 8 I had
> > > allocated (as evidenced by the extremely slow computation).
> > > I am using the following libraries: intel/12.1, ompi/intel,
> > > intel/11.1.072, namd/2.7-ompi. I noticed there is a conflict between
> > > intel/12.1 and intel/11.1.072 but presumably those conflicts should also
> > > exist when I submit the two jobs individually without incident.
> > > I have already asked the system admins on my cluster (MSI), but I
> > > believe that this is a NAMD-related issue. Any help is appreciated.
> >
> > no it isn't a NAMD issue.
> >
> > this can be easily managed using PBS/Torque job dependencies.
> > have a close look at the "qsub" manpage. look for the documentation
> > of the -W flag. there should be a section about "depend=dependency_list".
> > this is the feature you need. you just submit job 1 and job 2 and take
> > note of the batch job ids. then you submit job 3, but in addition use
> > -W depend=afterok:<jobid1>,afterok:<jobid2>
> >
> > that will make sure that your job 3 will only launch after job1
> > and job2 have successfully completed. bingo!
> >
> > cheers,
> > axel.
> >
> > > Thanks,
> > >
> > > Rajan

-- 
Aron Broom M.Sc
PhD Student
Department of Chemistry
University of Waterloo
