intermittent non-execution of NAMD on tesla

From: P.-L. Chau (pc104_at_pasteur.fr)
Date: Mon Nov 21 2011 - 04:09:04 CST

I am having intermittent problems running NAMD 2.8 on tesla GPU nodes, and
I would like to ask for some advice.

On my local supercomputer, NAMD 2.8 has been loaded as a module onto the
tesla nodes, and I use the following script which submits the job:

#PBS -q tesla
#PBS -l gres=SHAREDGPU,nodes=7:ppn=8,mem=192000mb,walltime=24:00:00
#PBS -m ae
application="namd2.cuda"
options="nachr_popcwi_ae029.conf > nachr_popcwi_ae029.log 2> error.log"
workdir="/scratch/pc104/alt_equi"

. /etc/profile.d/modules.sh
module purge
module load nehalem/fftw/intel/3.2.2
module load default-impi
module load nehalem/namd/impi/2.8

export OMP_NUM_THREADS=1
np=$(cat "$PBS_NODEFILE" | wc -l)
ppn=$(uniq -c "$PBS_NODEFILE" | head --lines=1 | sed -e 's/^ *\([0-9]\+\) .*$/\1/g')
CMD="mpirun -tune -ppn $ppn -np $np $application $options"

cd $workdir
echo -e "Changed directory to `pwd`.\n"

JOBID=`echo $PBS_JOBID | sed -e 's/\..*$//'`

echo -e "JobID: $JOBID\n======"
echo "Time: `date`"
echo "Running on master node: `hostname`"
echo "Current directory: `pwd`"

numprocs=0
numnodes=0
if [ -r "$PBS_NODEFILE" ]; then
         #! Create a machine file as for InfiniPath MPI
         cat $PBS_NODEFILE | uniq > machine.file.$JOBID
         numprocs=$[`cat $PBS_NODEFILE | wc -l`]
         numnodes=$[`cat machine.file.$JOBID | wc -l`]
         echo -e "\nNodes allocated:\n================"
         echo `cat machine.file.$JOBID | sed -e 's/\..*$//g'`
fi

[ $numnodes -eq 0 -o $numprocs -eq 0 ] && { echo "numnodes=$numnodes, numprocs=$numprocs, exiting." ; exit 7 ; }
ppn=$[ $numprocs / $numnodes ]
[ $ppn -eq 0 ] && { echo "ppn=$ppn, exiting." ; exit 7 ; }

echo -e "\nnumprocs=$numprocs, numnodes=$numnodes, ppn=$ppn"

echo -e "\nExecuting command:\n==================\n$CMD\n"

eval $CMD

But this works only intermittently. When the NAMD/tesla combination
works, I get output in the log file which goes like this:

WARNING: there are no tuned data files appropriate for your configuration: device = shm-dapl, np = 48, ppn = 8
Charm++> Running on MPI version: 2.1 multi-thread support: MPI_THREAD_SINGLE (max supported: MPI_THREAD_SINGLE)
Warning> Randomization of stack pointer is turned on in kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try run with '+isomalloc_sync'.
Charm++> Running on 6 unique compute nodes (8-way SMP).
Charm++> cpu topology info is gathered in 0.002 seconds.
Info: NAMD 2.8 for Linux-x86_64-MPI-CUDA
Info:
Info: Please visit http://www.ks.uiuc.edu/Research/namd/
Info: for updates, documentation, and support information.
Info:
Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
Info: in all publications reporting results obtained with NAMD.
Info:
Info: Based on Charm++/Converse 60303 for mpi-linux-x86_64-ifort-mpicxx
Info: Built Wed Aug 3 15:10:20 BST 2011 by sjr20 on pinta02
Info: 1 NAMD 2.8 Linux-x86_64-MPI-CUDA 48 tesla20 pc104
Info: Running on 48 processors, 48 nodes, 6 physical nodes.
Info: CPU topology information available.
Info: Charm++/Converse parallel runtime startup completed at 0.00507498 s
Pe 38 sharing CUDA device 2 first 34 next 34
Pe 32 sharing CUDA device 0 first 32 next 36
[...]

followed by a whole load of output about binding to CUDA devices on
different processors. However, for some reason, this does not always work.
I then get just these lines:

WARNING: there are no tuned data files appropriate for your configuration: device = shm-dapl, np = 56, ppn = 8
Charm++> Running on MPI version: 2.1 multi-thread support: MPI_THREAD_SINGLE (max supported: MPI_THREAD_SINGLE)

In the submission script, I have specifically asked for any errors to be
written to error.log, but that file is empty. The supercomputer job log
does not show any errors, either. I have also tried adding
'+isomalloc_sync', but that only caused NAMD to crash.
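
One change I am considering for the end of the script (only a sketch, reusing
the script's own variables): drop the redirections from $options and apply
them at launch time instead, so that anything mpirun or Charm++ prints during
startup lands in the same log, and record the exit status so a silent failure
at least leaves a trace:

#! Sketch only: redirect at the eval rather than inside $options, and log
#! the launcher's exit status.
application="namd2.cuda"
options="nachr_popcwi_ae029.conf"
CMD="mpirun -tune -ppn $ppn -np $np $application $options"

eval $CMD > nachr_popcwi_ae029.log 2>&1
rc=$?
echo "mpirun exit status: $rc (`date`)"
[ $rc -ne 0 ] && echo "Run failed; see nachr_popcwi_ae029.log and the PBS stderr file."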

Could I ask if anybody has encountered these problems before? How should I
go about detecting the errors in the first place, and then correcting them?
The only idea I have had so far is the pre-flight GPU check sketched below,
but I do not know whether it looks in the right place. Thank you very much
indeed!
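
This is the sort of check I had in mind; it is only a sketch, and it assumes
passwordless ssh between the allocated nodes and that nvidia-smi is on the
default PATH on the tesla nodes:

#! Pre-flight check, run just before the eval: list the GPUs visible on every
#! allocated node so that a node with a missing or wedged device shows up in
#! the job output before NAMD is launched.
for node in `cat machine.file.$JOBID`; do
    echo "=== GPUs on $node ==="
    ssh -o BatchMode=yes $node nvidia-smi -L || echo "WARNING: could not query GPUs on $node"
done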

P-L Chau

email: pc104_at_pasteur.fr
Bioinformatique Structurale
CNRS URA 2185
Institut Pasteur
75724 Paris
France
