Running Parallel Jobs Simultaneously with MPIRUN

From: Rajan Vatassery (rajan_at_umn.edu)
Date: Wed Jul 18 2012 - 10:36:04 CDT

Dear List,
        I have several simulations that require mpirun, and I need to run them
at the same time from the same PBS script. For example, I have to run
two jobs in parallel (to save time); their output will then be used as
input by a third simulation:

simulation 1 --
               \
                -----> simulation 3
               /
simulation 2 --

So far, however, I have not been able to get two simulations to run at
the same time from the same PBS script. There is no communication
between simulations 1 and 2, and simulation 3 requires some data from
both of them.
        When I try to run something like this:

mpirun -np 8 namd2 myjob1.conf > myjob1.log &
mpirun -np 8 namd2 myjob2.conf > myjob2.log &

the jobs do not produce any NAMD-related output; instead, the log files
contain this line:

8 total processes killed (some possibly by mpirun during cleanup)

The error file has this entry for each process I'm trying to run:

<start error file>
mpirun: killing job...

--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 13028 on node cl1n106 exited
on signal 0 (Unknown signal 0).
--------------------------------------------------------------------------
mpirun: clean termination accomplished

<end error file>

If I submit the two jobs individually to the cluster, they run
without problems.
        I suspected that mpirun was killing processes that it took to be
orphaned or zombie processes, so I tried adding "nohup" before each
command. This allowed the two processes to run at the same time, but
they used only one processor out of the 8 I had allocated (as evidenced
by the extremely slow computation).
        I am using the following libraries: intel/12.1, ompi/intel,
intel/11.1.072, namd/2.7-ompi. I noticed a conflict between intel/12.1
and intel/11.1.072, but presumably that conflict also exists when I
submit the two jobs individually, which works without incident.
        I have already asked the system admins on my cluster (MSI), but I
believe that this is a NAMD-related issue. Any help is appreciated.
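
One possible reading of the "mpirun: killing job..." message, not
confirmed anywhere in this post: both mpirun commands are put in the
background with "&", so the PBS script may reach its end immediately, and
the batch system may then terminate everything the script started. Under
that assumption, a minimal sketch of a workaround is to hold the script
with the shell builtin "wait" until both background jobs have exited, and
only then start the third simulation. Here run_stage is a hypothetical
helper name (it is not part of NAMD, Open MPI, or PBS), and myjob3.conf is
an assumed file name for the third simulation.

```shell
#!/bin/bash
# Sketch only. run_stage is a hypothetical helper, not a NAMD/Open MPI/PBS tool.

# Start every command line given as an argument in the background, then
# block until all of them have exited. Returns non-zero if any one failed.
run_stage() {
    local pids=() status=0 cmd pid
    for cmd in "$@"; do
        bash -c "$cmd" &      # each command line runs in its own shell
        pids+=("$!")          # remember the PID of that background shell
    done
    for pid in "${pids[@]}"; do
        wait "$pid" || status=1
    done
    return "$status"
}

# Intended use inside the PBS script (myjob3.conf is an assumed name):
# run_stage "mpirun -np 8 namd2 myjob1.conf > myjob1.log" \
#           "mpirun -np 8 namd2 myjob2.conf > myjob2.log" \
#   && mpirun -np 8 namd2 myjob3.conf > myjob3.log
```

This addresses only the script exiting early. It does not by itself
decide which of the allocated processors each mpirun uses, so it may not
explain the single-processor behaviour seen with "nohup".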

Thanks,

Rajan
