Re: Problems running FEP on Lenovo NextScale KNL

From: Francesco Pietra (chiendarret_at_gmail.com)
Date: Thu Jul 27 2017 - 01:27:24 CDT

Hi Brian:

> Do regular MD jobs complete as expected? The error that I encountered was
> variously a segmentation fault and/or uninitialized memory such that the
> nonbonded energies were essentially random.
>

I claimed more in my previous post than was really warranted. That is, I have
only checked (with my protein-ligand system; I did not try the tutorial files)
that all the FEP simulations (and the ABF ones, too), patterned on the Roux et
al. 2017 tutorial (I also used the Roux layout for the colvars), run on my
4-CPU desktop for some 10,000 steps, after which I killed them (the other
machines here are GPU machines, hence unusable for FEP). The only simulations
that ran to completion (FEP and ABF) were for the Unbound state, although I
have not yet analyzed them. The system is large, a two-chain protein, and the
organic ligand (not a peptide) is also large, with conformationally mobile
side chains, although it is well parameterized (dihedral and water-interaction
fitting); the system is stable for over 600 ns in classical MD with the
CHARMM36 FF. Therefore, I'll (hopefully) answer your question later, once a
complete run is possible on the cluster. I can add that RATTLE proved workable
only for the Unbound systems (FEP and ABF); otherwise I had to set "rigidbonds
water" and a 1.0 fs timestep, and to comment out "splitpatch hydrogen". With
RATTLE and a 2.0 fs timestep, the system crashed at step 0 because it was
unstable, not because of a segmentation fault.
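
In NAMD configuration terms, the fallback I describe above amounts to
something like the following (a minimal sketch, not my exact input file):

rigidBonds      water     ;# constrain water bonds only, instead of full RATTLE
timestep        1.0       ;# 1.0 fs timestep instead of 2.0 fs
#splitPatch     hydrogen  ;# commented out, leaving NAMD's default

whereas RATTLE with a 2.0 fs timestep remained usable for the Unbound systems.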

Do you know whether such simulations are being carried out anywhere on a
NextScale cluster, or in any case with an MPI build of NAMD 2.12? I am
extremely interested in the study I have planned (and obtained a grant for),
while the people at CINECA are untroubled by their lack of a real background
in MD.

Thanks for the exchange of information

francesco

On Wed, Jul 26, 2017 at 9:45 PM, Brian Radak <bradak_at_anl.gov> wrote:

> Hi Francesco,
>
> We have regularly been performing FEP calculations on the KNL machine here
> at the lab, although it is a Cray machine and thus we do not use MPI.
>
> Do regular MD jobs complete as expected? The error that I encountered was
> variously a segmentation fault and/or uninitialized memory such that the
> nonbonded energies were essentially random.
>
> Brian
>
> On 07/26/2017 10:27 AM, Francesco Pietra wrote:
>
> Hi Brian:
> No, the compilation was done by the people at CINECA (the Italian computing
> center); I don't know which Intel version they used. The log ends with
>
>
> Intel(R) MPI Library troubleshooting guide:
> https://software.intel.com/node/561764
>
> The people at CINECA advised me, a couple of weeks ago, that NAMD 2.12 KNL
> performance was poor. In fact, a job like the one I showed above ran (with
> a different command line) more slowly than on my desktop.
>
> Now they say they have compiled for multinode use, but they asked me to try
> the performance on a single node with the commands that I have shown.
>
> I'll pass your message on to them (assuming, then, that NAMD 2.12 KNL allows
> FEP calculations).
>
> Thanks
> francesco
>
> On Wed, Jul 26, 2017 at 5:04 PM, Brian Radak <bradak_at_anl.gov> wrote:
>
>> I assume you did not compile your own NAMD on KNL? We've been having
>> trouble with version 17 of the Intel compiler suite and been falling back
>> to version 16.
>>
>> Brian
>>
>> On 07/26/2017 09:23 AM, Francesco Pietra wrote:
>>
>> Hello:
>> I am asking for advice on running an FEP protein-ligand (Bound)
>> simulation. It runs correctly on my Linux-Intel box with NAMD 2.12 multicore,
>> while it halts with NAMD 2.12 KNL on a CINECA-codesigned Lenovo NextScale
>> cluster with the Intel® Xeon Phi™ "Knights Landing" product family alongside
>> the Intel® Xeon® processor E5-2600 v4 product family.
>>
>> I tried on a single node, selecting 64 CPUs with 256 MPI processes, or with
>> only 126 MPI processes. In both cases, while the .err file is silent, the
>> NAMD log shows, after updating the NAMD interface and re-initializing the
>> colvars, the error:
>>
>> =======================================================
>> colvars: The final output state file will be "frwd-01_0.colvars.state".
>>
>>
>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> = PID 58639 RUNNING AT r065c01s03-hfi.marconi.cineca.it
>> = EXIT CODE: 11
>> = CLEANING UP REMAINING PROCESSES
>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>
>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>> = PID 58639 RUNNING AT r065c01s03-hfi.marconi.cineca.it
>> = EXIT CODE: 11
>> = CLEANING UP REMAINING PROCESSES
>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>> ====================================================================
>> (Exit code 11 corresponds to a segmentation fault in one of the MPI ranks.)
>> For comparison, on my desktop, at that stage, it continues normally:
>> colvars: The final output state file will be "frwd-01_0.colvars.state".
>> FEP: RESETTING FOR NEW FEP WINDOW LAMBDA SET TO 0 LAMBDA2 0.02
>> FEP: WINDOW TO HAVE 100000 STEPS OF EQUILIBRATION PRIOR TO FEP DATA COLLECTION.
>> FEP: USING CONSTANT TEMPERATURE OF 300 K FOR FEP CALCULATION
>> PRESSURE: 0 70.2699 -221.652 -54.6848 -221.652 -146.982 179.527 -54.6848 179.527 216.259
>> GPRESSURE: 0 92.593 -114.553 110.669 -161.111 -69.3013 92.2703 26.1698 176.706 99.3091
>> ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC TOTAL TEMP POTENTIAL TOTAL3 TEMPAVG PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG
>> FEPTITLE: TS BOND2 ELECT2 VDW2
>>
>> ENERGY: 0 4963.7649 7814.6132 8443.0271 479.5443 -251991.4214
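>>
>> For reference, the FEP windowing shown above comes from alchemical settings
>> along these lines (a sketch in the spirit of the tutorial scripts, not my
>> exact input; file names, frequencies and step counts are placeholders):
>>
>> alch            on
>> alchType        FEP
>> alchFile        complex.fep       ;# placeholder PDB flagging the perturbed atoms
>> alchCol         B
>> alchOutFile     frwd-01.fepout
>> alchOutFreq     10
>> alchEquilSteps  100000            ;# the "100000 STEPS OF EQUILIBRATION" above
>>
>> source fep.tcl                    ;# helper script distributed with the FEP tutorial
>> runFEP 0.0 1.0 0.02 500000        ;# lambda 0 -> 1 in windows of 0.02, hence LAMBDA2 0.02
>>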
>> ################
>> The batch job was configured as follows for 126 MPI processes:
>>
>> #!/bin/bash
>> #PBS -l select=1:ncpus=64:mpiprocs=126:mem=86GB:mcdram=cache:numa=quadrant
>> #PBS -l walltime=00:10:00
>> #PBS -o frwd-01.out
>> #PBS -e frwd-01.err
>> #PBS -A my account
>>
>> # go to submission directory
>> cd $PBS_O_WORKDIR
>>
>> # load namd
>> module load profile/knl
>> module load autoload namd/2.12_knl
>> module help namd/2.12_knl
>>
>> # launch NAMD on a single KNL node
>>
>> mpirun -perhost 1 -n 1 namd2 +ppn 126 frwd-01.namd +pemap 4-66+68 +commap 67 > frwd-01.namd.log
>>
>> ########################
>> or for 256:
>> #PBS -l select=1:ncpus=64:mpiprocs=256:mem=86GB:mcdram=cache:numa=quadrant
>>
>> mpirun -perhost 1 -n 1 namd2 +ppn 256 frwd-01.namd +pemap 0-63+64+128+192 > frwd-01.namd.log
>>
>> ###############
>>
>> Assuming that KNL itself is no hindrance to FEP, I hope to get a hint that I
>> can pass on to the operators of the cluster.
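>>
>> For what it's worth, if a multicore (non-MPI) KNL build of NAMD 2.12 were
>> available, a single-node test could also be launched without mpirun, roughly
>> as follows (a sketch, untested on this cluster):
>>
>> namd2 +p 64 +setcpuaffinity frwd-01.namd > frwd-01.namd.log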
>>
>> Thanks
>>
>> francesco pietra
>>
>>
>> --
>> Brian Radak
>> Postdoctoral Appointee
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>> 9700 South Cass Avenue, Bldg. 240
>> Argonne, IL 60439-4854
>> (630) 252-8643
>> brian.radak_at_anl.gov
>>
>
>
> --
> Brian Radak
> Postdoctoral Appointee
> Leadership Computing Facility
> Argonne National Laboratory
>
> 9700 South Cass Avenue, Bldg. 240
> Argonne, IL 60439-4854
> (630) 252-8643
> brian.radak_at_anl.gov
>
