Problems running FEP on Lenovo NextScale KNL

From: Francesco Pietra (chiendarret_at_gmail.com)
Date: Wed Jul 26 2017 - 09:23:19 CDT

Hello:
I am asking for advice on running an FEP protein-ligand (Bound) simulation.
It runs correctly on my Linux-Intel box with NAMD 2.12 multicore, while it
halts with NAMD 2.12 KNL on a CINECA-codesigned Lenovo NextScale cluster
with Intel® Xeon Phi™ “Knights Landing” processors alongside Intel® Xeon®
E5-2600 v4 processors.

I tried on a single node, requesting 64 CPUs with either 256 MPI processes
or only 126 MPI processes. In both cases, while the .err file is silent, the
NAMD log shows the following error after updating the NAMD interface and
re-initializing colvars:

=======================================================
 colvars: The final output state file will be "frwd-01_0.colvars.state".

= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 58639 RUNNING AT r065c01s03-hfi.marconi.cineca.it
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

====================================================================
For comparison, on my desktop, at that stage, it continues normally:
colvars: The final output state file will be "frwd-01_0.colvars.state".
FEP: RESETTING FOR NEW FEP WINDOW LAMBDA SET TO 0 LAMBDA2 0.02
FEP: WINDOW TO HAVE 100000 STEPS OF EQUILIBRATION PRIOR TO FEP DATA COLLECTION.
FEP: USING CONSTANT TEMPERATURE OF 300 K FOR FEP CALCULATION
PRESSURE: 0 70.2699 -221.652 -54.6848 -221.652 -146.982 179.527 -54.6848 179.527 216.259
GPRESSURE: 0 92.593 -114.553 110.669 -161.111 -69.3013 92.2703 26.1698 176.706 99.3091
ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC TOTAL TEMP POTENTIAL TOTAL3 TEMPAVG PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG
FEPTITLE: TS BOND2 ELECT2 VDW2

ENERGY: 0 4963.7649 7814.6132 8443.0271 479.5443 -251991.4214
################
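(The alchemical section of frwd-01.namd is not shown here; for context, the FEP
window messages above would typically be produced by a setup along the following
lines. This is only a hypothetical sketch with made-up file names and step counts,
not my actual input:

alch                on
alchType            FEP
alchFile            frwd-01.fep          ;# hypothetical PDB marking the appearing/vanishing atoms
alchCol             B
alchOutFile         frwd-01.fepout
alchOutFreq         10
alchEquilSteps      100000               ;# matches "100000 STEPS OF EQUILIBRATION" in the log

colvars             on
colvarsConfig       restraints.colvars   ;# hypothetical colvars restraint file for the Bound leg

source fep.tcl                           ;# runFEP procedure distributed with NAMD
runFEP 0.0 1.0 0.02 500000               ;# 0.02-wide lambda windows, as in "LAMBDA2 0.02"; step count made up
)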
The batch job was configured as follows for 126 MPI processes:

#!/bin/bash
#PBS -l select=1:ncpus=64:mpiprocs=126:mem=86GB:mcdram=cache:numa=quadrant
#PBS -l walltime=00:10:00
#PBS -o frwd-01.out
#PBS -e frwd-01.err
#PBS -A <my_account>

# go to submission directory
cd $PBS_O_WORKDIR

# load namd
module load profile/knl
module load autoload namd/2.12_knl
module help namd/2.12_knl

# launch NAMD with 126 worker threads plus one communication thread on a single node

mpirun -perhost 1 -n 1 namd2 +ppn 126 frwd-01.namd +pemap 4-66+68 +commap 67 > frwd-01.namd.log

########################
or, for 256 MPI processes:
#PBS -l select=1:ncpus=64:mpiprocs=256:mem=86GB:mcdram=cache:numa=quadrant

mpirun -perhost 1 -n 1 namd2 +ppn 256 frwd-01.namd +pemap 0-63+64+128+192 > frwd-01.namd.log

###############
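(For reference, as I read the Charm++ +pemap syntax, the two mappings expand as
follows: "4-66+68" means the 63 logical CPUs 4-66 plus the same range shifted by
68, i.e. 63 x 2 = 126 PEs, matching +ppn 126, with the communication thread
placed on CPU 67 by +commap 67; "0-63+64+128+192" means CPUs 0-63 repeated at
offsets 64, 128 and 192, i.e. 64 x 4 = 256 PEs, matching +ppn 256.)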

Assuming that KNL is no hindrance to FEP, I hope to get a hint to pass on to
the operators of the cluster.

Thanks

francesco pietra
