Re: Running Parallel Jobs Simultaneously with MPIRUN

From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Wed Jul 18 2012 - 15:47:40 CDT

On Wed, Jul 18, 2012 at 9:28 PM, Rajan Vatassery <rajan_at_umn.edu> wrote:
> This is not the question I asked. Please read more carefully. Twice in
> the first paragraph I mentioned that I need to use the same PBS script
> to submit the first two jobs. I am aware that I will need to use the
> wait command to allow the first two jobs to complete. That isn't the
> problem. The problem is that I cannot send even one job to the
> background without getting a failed calculation.

...and as i said in my first reply: "this cannot work."

that is it. full stop. it cannot work. forget it.
do what i suggested as an alternative.

axel.

>
> thanks,
>
> Rajan
>
> On Wed, 2012-07-18 at 18:33 +0200, Axel Kohlmeyer wrote:
>> On Wed, Jul 18, 2012 at 5:36 PM, Rajan Vatassery <rajan_at_umn.edu> wrote:
>> > Dear List,
>> > I have several simulations requiring mpirun that I need to run at
>> > the same time, using the same PBS script. As an example, I have to run
>> > two jobs in parallel with each other (to save time), which will produce
>> > output that a third simulation will use as input:
>> >
>> > simulation 1 --
>> >                \
>> >                 -----> simulation 3
>> >                /
>> > simulation 2 --
>> >
>> > So far, however, I have not been able to get two simulations to run
>> > at the same time from the same PBS script. There is no communication
>> > between simulations 1 and 2, and simulation 3 requires some data from
>> > both 1 and 2.
>> > When I try to run something like this:
>> >
>> > mpirun -np 8 namd2 myjob1.conf > myjob1.log &
>> > mpirun -np 8 namd2 myjob2.conf > myjob2.log &
>>
>> this cannot work. first of all, if you background both
>> calculations without adding a "wait" command, the
>> script will just run to the end and exit immediately,
>> and thus the simulations will be killed. or worse.
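>>
>> (for illustration, the minimal pattern that at least keeps the batch
>> script alive until both runs have exited, reusing the commands from
>> your example, would be:
>>
>>   mpirun -np 8 namd2 myjob1.conf > myjob1.log &
>>   mpirun -np 8 namd2 myjob2.conf > myjob2.log &
>>   wait    # block until both backgrounded mpirun processes exit
>>
>> note that this only addresses the premature exit of the script; it
>> does not explain why two mpirun instances fail when launched from
>> the same allocation.)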
>>
>> > the jobs do not produce any NAMD-related output, and instead have this
>> > line in the log files:
>> >
>> > 8 total processes killed (some possibly by mpirun during cleanup)
>> >
>> > The error file has this entry for each process I'm trying to run:
>> >
>> > <start error file>
>> > mpirun: killing job...
>> >
>> > --------------------------------------------------------------------------
>> > mpirun noticed that process rank 0 with PID 13028 on node cl1n106 exited
>> > on signal 0 (Unknown signal 0).
>> > --------------------------------------------------------------------------
>> > mpirun: clean termination accomplished
>> >
>> > <end error file>
>> >
>> > If I submit the two jobs individually to the cluster, they will run
>> > without problems.
>> > I figured that this might be due to mpirun killing the processes that
>> > it thinks are orphaned or zombie processes, so I tried to add a "nohup"
>> > before the command. This allowed the two processes to run at the same
>>
>> no. this is nonsense. you would be defeating the restrictions of
>> the batch system's resource management, and any reasonably skilled
>> sysadmin will be able to squash that.
>>
>> > time, but they were only using one processor out of the 8 I had
>> > allocated (as evidenced by the extremely slow computation).
>> > I am loading the following modules: intel/12.1, ompi/intel,
>> > intel/11.1.072, namd/2.7-ompi. I noticed there is a conflict between
>> > intel/12.1 and intel/11.1.072, but presumably that conflict would also
>> > be present when I submit the two jobs individually, which works
>> > without incident.
>> > I have already asked the system admins on my cluster (MSI), but I
>> > believe that this is a NAMD-related issue. Any help is appreciated.
>>
>> no it isn't a NAMD issue.
>>
>> this can be easily managed using PBS/Torque job dependencies.
>> have a close look at the "qsub" manpage. look for the documentation
>> of the -W flag. there should be a section about "depend=dependency_list".
>> this is the feature you need. you just submit job 1 and job 2 and take
>> note of the batch job ids. then you submit job 3, but in addition use
>> -W depend=afterok:<jobid1>,afterok:<jobid2>
>>
>> that will make sure that your job 3 will only launch after job1
>> and job2 have successfully completed. bingo!
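>>
>> for example, a minimal sketch of the submission sequence (the .pbs
>> script names here are placeholders; qsub prints the id of each
>> submitted job on stdout):
>>
>>   JOB1=$(qsub job1.pbs)
>>   JOB2=$(qsub job2.pbs)
>>   qsub -W depend=afterok:$JOB1,afterok:$JOB2 job3.pbs
>>
>> with "afterok", job 3 is released only if jobs 1 and 2 both exit
>> without errors; if either one fails, job 3 will never run.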
>>
>> cheers,
>> axel.
>>
>> > Thanks,
>> >
>> > Rajan

-- 
Dr. Axel Kohlmeyer  akohlmey_at_gmail.com  http://goo.gl/1wk0
International Centre for Theoretical Physics, Trieste. Italy.
