Re: Problems running FEP on Lenovo NextScale KNL

From: Brian Radak (bradak_at_anl.gov)
Date: Wed Jul 26 2017 - 14:45:22 CDT

Hi Francesco,

We have regularly been performing FEP calculations on the KNL machine
here at the lab, although it is a Cray machine and thus we do not use MPI.
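
For reference, a single-node launch on our Cray KNL machine looks roughly
like the sketch below (aprun instead of mpirun); the input file name and
core counts are placeholders, not our production script:

# sketch: one namd2 process on one Cray KNL node, launched via ALPS/aprun
# -n 1 PE total, -N 1 PE per node, -d 64 threads per PE, -j 1 thread/core
aprun -n 1 -N 1 -d 64 -j 1 namd2 +ppn 64 +pemap 0-63 fep.namd > fep.log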

Do regular MD jobs complete as expected? The errors I encountered
varied between a segmentation fault and uninitialized memory such that
the nonbonded energies were essentially random.
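
One quick way to check is to comment out the alchemical keywords and rerun
the same system as plain MD; a minimal sketch (keyword names per the NAMD
user guide, lambda values taken from your log, the alchFile name hypothetical):

# in frwd-01.namd: disable FEP to test plain MD on the same build
# alch            on
# alchType        fep
# alchFile        frwd-01.fep      ;# hypothetical PDB marking perturbed atoms
# alchCol         B
# alchOutFile     frwd-01.fepout
# alchLambda      0.0
# alchLambda2     0.02
# alchEquilSteps  100000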

Brian

On 07/26/2017 10:27 AM, Francesco Pietra wrote:
> Hi Brian:
> No, the compilation was done by people at CINECA (the Italian
> supercomputing center); I don't know which Intel version they used.
> The log ends with:
>
>
> Intel(R) MPI Library troubleshooting guide:
> https://software.intel.com/node/561764
>
> People at CINECA advised me, a couple of weeks ago, that namd12 knl
> performance was poor. In fact, a job like the one I showed above ran
> (with a different command line) more slowly than on my desktop.
>
> Now they say they have compiled for multinode, but asked me to try the
> performance on a single node with the commands that I have shown.
>
> I'll pass your message on to them (assuming, then, that namd12-knl
> allows FEP calculations).
>
> Thanks
> francesco
>
> On Wed, Jul 26, 2017 at 5:04 PM, Brian Radak <bradak_at_anl.gov> wrote:
>
> I assume you did not compile your own NAMD on KNL? We've been
> having trouble with version 17 of the Intel compiler suite and have
> been falling back to version 16.
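>
> In case it helps, falling back looks roughly like the sketch below;
> the module name is a site-specific guess, not CINECA's actual setup:
>
> # sketch only: rebuild NAMD 2.12 for KNL against the Intel 16 toolchain;
> # assumes Charm++ was already rebuilt with the same compilers
> module swap intel intel/16.0
> ./config Linux-KNL-icc
> cd Linux-KNL-icc && make -j8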
>
> Brian
>
>
> On 07/26/2017 09:23 AM, Francesco Pietra wrote:
>> Hello:
>> I am asking for advice on running a FEP protein-ligand (Bound)
>> simulation. It runs correctly on my Linux-Intel box with namd12
>> multicore, while it halts with namd12 knl on a CINECA-codesigned
>> Lenovo NextScale cluster with the Intel® Xeon Phi™ “Knights Landing”
>> product family alongside the Intel® Xeon® processor E5-2600 v4
>> product family.
>>
>> I tried on a single node, selecting 64 CPUs with either 256 MPI
>> processes or only 126 MPI processes. In both cases, while the .err
>> file is silent, the namd log shows, after updating the NAMD
>> interface and re-initializing colvars, the error:
>>
>> =======================================================
>> colvars: The final output state file will be "frwd-01_0.colvars.state".
>>
>>
>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> = PID 58639 RUNNING AT r065c01s03-hfi.marconi.cineca.it
>> = EXIT CODE: 11
>> = CLEANING UP REMAINING PROCESSES
>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>
>> ====================================================================
>> For comparison, on my desktop, at that stage, it continues normally:
>> colvars: The final output state file will be "frwd-01_0.colvars.state".
>> FEP: RESETTING FOR NEW FEP WINDOW LAMBDA SET TO 0 LAMBDA2 0.02
>> FEP: WINDOW TO HAVE 100000 STEPS OF EQUILIBRATION PRIOR TO FEP DATA COLLECTION.
>> FEP: USING CONSTANT TEMPERATURE OF 300 K FOR FEP CALCULATION
>> PRESSURE: 0 70.2699 -221.652 -54.6848 -221.652 -146.982 179.527 -54.6848 179.527 216.259
>> GPRESSURE: 0 92.593 -114.553 110.669 -161.111 -69.3013 92.2703 26.1698 176.706 99.3091
>> ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC TOTAL TEMP POTENTIAL TOTAL3 TEMPAVG PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG
>> FEPTITLE: TS BOND2 ELECT2 VDW2
>>
>> ENERGY: 0 4963.7649 7814.6132 8443.0271 479.5443 -251991.4214
>> ################
>> The batch job was configured as follows for 126 processes:
>>
>> #!/bin/bash
>> #PBS -l select=1:ncpus=64:mpiprocs=126:mem=86GB:mcdram=cache:numa=quadrant
>> #PBS -l walltime=00:10:00
>> #PBS -o frwd-01.out
>> #PBS -e frwd-01.err
>> #PBS -A my_account
>>
>> # go to submission directory
>> cd $PBS_O_WORKDIR
>>
>> # load namd
>> module load profile/knl
>> module load autoload namd/2.12_knl
>> module help namd/2.12_knl
>>
>> # launch NAMD on a single KNL node (4*64=256 logical CPUs)
>>
>> mpirun -perhost 1 -n 1 namd2 +ppn 126 frwd-01.namd +pemap 4-66+68 +commap 67 > frwd-01.namd.log
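>>
>> For what it's worth, here is how I read that mapping (my assumption
>> about the Charm++ range+offset syntax, worth double-checking):
>>
>> # +pemap 4-66+68 -> logical CPUs 4-66 plus the same range offset
>> # by 68 (CPUs 72-134): 63 + 63 = 126 PEs
>> # +commap 67 -> communication thread pinned to CPU 67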
>>
>> ########################
>> or for 256:
>> #PBS -l select=1:ncpus=64:mpiprocs=256:mem=86GB:mcdram=cache:numa=quadrant
>>
>> mpirun -perhost 1 -n 1 namd2 +ppn 256 frwd-01.namd +pemap 0-63+64+128+192 > frwd-01.namd.log
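>>
>> Read the same way (again assuming the offset syntax replicates the
>> range):
>>
>> # +pemap 0-63+64+128+192 -> logical CPUs 0-63 replicated at offsets
>> # 64, 128, and 192: 256 PEs in total, with no CPU left over for a
>> # separate communication thread (no +commap here)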
>>
>> ###############
>>
>> Assuming that KNL is no hindrance to FEP, I hope to get a hint to
>> pass on to the operators at the cluster.
>>
>> Thanks
>>
>> francesco pietra
>
> --
> Brian Radak
> Postdoctoral Appointee
> Leadership Computing Facility
> Argonne National Laboratory
>
> 9700 South Cass Avenue, Bldg. 240
> Argonne, IL 60439-4854
> (630) 252-8643
> brian.radak_at_anl.gov
>
>

-- 
Brian Radak
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
9700 South Cass Avenue, Bldg. 240
Argonne, IL 60439-4854
(630) 252-8643
brian.radak_at_anl.gov

This archive was generated by hypermail 2.1.6 : Mon Dec 31 2018 - 23:20:27 CST