Multi-node job errors

From: Anup Prasad (anup.prasad_at_monash.edu)
Date: Sat Jul 27 2019 - 02:34:00 CDT

Hi

I am running a parallel NAMD simulation on a Cray XC HPC facility at my
university, in a queue where jobs are restricted to a single node. However,
I would like to run the simulation on more processors. To that end, I have
been submitting test simulations to the "devel queue" on 1 and 2 nodes,
respectively. The single-node simulation runs without any issues, but the
simulation does not run when submitted to 2 nodes.

The following shell script was used for submission to a single node. This
results in a successful job run.

**************************************************************************************
                          shell script for single node
**************************************************************************************

## Queue it will run in
#PBS -N trial
#PBS -q small
#PBS -l select=1:ncpus=40:vntype=cray_compute
#PBS -l walltime=96:00:00
#PBS -l place=pack
#PBS -j oe

module load namd/2.12/intel-18.0.1

cd $PBS_O_WORKDIR

aprun -n 40 -N 40 /home/apps/namd/2.12/intel/18.0.1/CRAY-XC-intel/namd2 \
    prod1.conf > prod1.log

======================================================================================

I run into problems when I try to scale the simulation to 2 or 4 nodes on
the "regular queue". The single-node shell script was modified to submit a
two-node simulation to the "devel queue".

***************************************************************************************
                            shell script for two nodes
***************************************************************************************
## Queue it will run in
#PBS -N trial
#PBS -q devel
#PBS -l select=2:ncpus=40:vntype=cray_compute
#PBS -l walltime=00:30:00
#PBS -l place=pack
#PBS -j oe

module load namd/2.12/intel-18.0.1
module swap PrgEnv-cray PrgEnv-intel
module load rca
module load craype-hugepages8M
setenv HUGETLB_DEFAULT_PAGE_SIZE 8M
setenv HUGETLB_MORECORE no

cd $PBS_O_WORKDIR

aprun -n 8 -N 4 -d 10 /home/apps/namd/2.12/intel/18.0.1/CRAY-XC-intel/namd2 \
    +ppn 9 +pemap 1-9,11-19,21-29,31-39 +commap 0,10,20,30 prod1.conf \
    > prod1.log

=======================================================================================
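
As a sanity check on the two-node request, the aprun and PBS numbers can be
compared with a little shell arithmetic. This is only a sketch: the variable
names are illustrative, the values are copied from the two-node script, and
it assumes this namd2 is an SMP build in which +ppn counts worker threads
and each process also runs one communication thread (the one pinned by
+commap).

```shell
# Values copied from the two-node script; variable names are illustrative.
n=8        # aprun -n: total number of processes
N=4        # aprun -N: processes per node
d=10       # aprun -d: cores reserved per process
ppn=9      # namd2 +ppn: worker threads per process (assumes an SMP build)
select=2   # PBS select: number of chunks requested
ncpus=40   # PBS ncpus: cores requested per chunk

nodes=$(( n / N ))               # nodes that aprun will span
cores_per_node=$(( N * d ))      # cores that aprun reserves on each node
threads_per_proc=$(( ppn + 1 ))  # workers plus one communication thread

echo "nodes=$nodes (PBS select=$select)"
echo "cores_per_node=$cores_per_node (PBS ncpus=$ncpus)"
echo "threads_per_proc=$threads_per_proc (aprun -d $d)"
```

With these values the script reports 2 nodes, 40 cores per node, and 10
threads per 10-core process, so the counts in the aprun line appear to match
the PBS select statement.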

This results in the following error message:

***************************************************************************************
                               error output for two nodes
***************************************************************************************
Transient MPP reservation error on create.
=======================================================================================

I am unable to figure out why the job does not run beyond a single node.
Please help with this problem.

Thank you in advance

Kind regards
Anup
