Re: Problems running FEP on Lenovo NextScale KNL

From: Brian Radak (bradak_at_anl.gov)
Date: Thu Jul 27 2017 - 15:55:21 CDT

I do not know of anyone running on NextScale.

The error messages that you are posting don't look like errors from NAMD
as such. Why is it not possible to post regular MD output? Even if it
gives no error message, that is fine; it would be one more thing that
we know. I do not know of a specific reason why FEP or ABF would be any
more incompatible with NextScale than more standard NAMD options.

On 07/27/2017 01:27 AM, Francesco Pietra wrote:
> Hi Brian:
>
> Do regular MD jobs complete as expected? The error that I
> encountered was variously a segmentation fault and/or
> uninitialized memory such that the nonbonded energies were
> essentially random.
>
>
> I said more in my post than was really warranted. That is, I have only
> checked (with my protein-ligand system; I did not try the tutorial
> files) that all FEP simulations (and ABF, too) patterned on the Roux et
> al. 2017 tutorial (I used the Roux layout for colvars as well) run on
> my 4-CPU desktop (the other machines here are GPU machines, and
> therefore unusable for FEP) for some 10,000 steps, after which I killed
> the simulation. The only complete simulations (FEP and ABF) were for
> the Unbound state, although I have not yet analyzed them. The system is
> large: a two-chain protein, and the organic ligand (not a peptide) is
> also large, with conformationally mobile side chains, although it is
> well parameterized, with dihedral and water-interaction fitting (the
> system is stable for over 600 ns in classical MD with the CHARMM36 FF).
> Therefore, I'll (hopefully) answer your question later, once a complete
> run is possible on the cluster. I can add that RATTLE proved possible
> only with the Unbound systems (FEP and ABF); otherwise I had to set
> "rigidbonds water" and ts = 1.0 fs, commenting out "splitpatch
> hydrogen". With RATTLE and ts = 2.0 fs, the system crashed at step 0
> because the system was unstable, not because of a segmentation fault.
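A sketch of the NAMD configuration fragment implied by the paragraph above,
using the standard keywords rigidBonds, timestep, and splitPatch; nothing
beyond the values stated in the message is taken from this thread:

   # Bound runs: constrain only water bonds and use a 1 fs timestep;
   # with RATTLE on all bonds and a 2 fs timestep the system crashed
   # at step 0 as unstable.
   rigidBonds    water
   timestep      1.0
   # splitPatch  hydrogen    (left commented out, as described above)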
>
> Do you know whether these simulations are being carried out somewhere
> on a NextScale cluster, or, at any rate, with an MPI build of NAMD
> 2.12? I am extremely interested in the study I have planned (and
> obtained a grant for), while the people at CINECA are untroubled by
> their lack of a real background in MD.
>
> Thanks for the exchange of information
>
> francesco
>
> On Wed, Jul 26, 2017 at 9:45 PM, Brian Radak <bradak_at_anl.gov> wrote:
>
> Hi Francesco,
>
> We have regularly been performing FEP calculations on the KNL
> machine here at the lab, although it is a Cray machine and thus we
> do not use MPI.
>
> Do regular MD jobs complete as expected? The error that I
> encountered was variously a segmentation fault and/or
> uninitialized memory such that the nonbonded energies were
> essentially random.
>
> Brian
>
>
> On 07/26/2017 10:27 AM, Francesco Pietra wrote:
>> Hi Brian:
>> No, the compilation was done by people at CINECA (the Italian
>> computing center); I don't know which Intel compiler version was used.
>> The log ends with
>>
>>
>> Intel(R) MPI Library troubleshooting guide:
>> https://software.intel.com/node/561764
>>
>> People at CINECA advised me, a couple of weeks ago, that the NAMD 2.12
>> KNL performance was poor. In fact, a job like the one I showed above
>> ran (with a different command line) more slowly than on my desktop.
>>
>> Now they say they have compiled for multinode use, but they asked me
>> to test the performance on a single node with the commands I have shown.
>>
>> I'll pass your message on to them (assuming that the NAMD 2.12 KNL
>> build allows FEP calculations).
>>
>> Thanks
>> francesco
>>
>> On Wed, Jul 26, 2017 at 5:04 PM, Brian Radak <bradak_at_anl.gov> wrote:
>>
>> I assume you did not compile your own NAMD on KNL? We've been having
>> trouble with version 17 of the Intel compiler suite and have been
>> falling back to version 16.
>>
>> Brian
>>
>>
>> On 07/26/2017 09:23 AM, Francesco Pietra wrote:
>>> Hello:
>>> I am asking for advice on running a FEP protein-ligand (Bound)
>>> simulation. It runs correctly on my Linux-Intel box with the NAMD
>>> 2.12 multicore build, while it halts with the NAMD 2.12 KNL build on
>>> a CINECA co-designed Lenovo NextScale cluster with Intel® Xeon Phi™
>>> "Knights Landing" processors alongside Intel® Xeon® E5-2600 v4
>>> processors.
>>>
>>> I tried on a single node, selecting 64 CPUs and either 256 MPI
>>> processes or only 126 MPI processes. In both cases, while the .err
>>> file is silent, the NAMD log shows, after updating the NAMD interface
>>> and re-initializing colvars, the error:
>>>
>>> =======================================================
>>> colvars: The final output state file will be
>>> "frwd-01_0.colvars.state".
>>>
>>>
>>> = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
>>> = PID 58639 RUNNING AT r065c01s03-hfi.marconi.cineca.it
>>> = EXIT CODE: 11
>>> = CLEANING UP REMAINING PROCESSES
>>> = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
>>>
>>> ====================================================================
>>> For comparison, on my desktop, at that stage, it continues normally:
>>> colvars: The final output state file will be
>>> "frwd-01_0.colvars.state".
>>> FEP: RESETTING FOR NEW FEP WINDOW LAMBDA SET TO 0 LAMBDA2 0.02
>>> FEP: WINDOW TO HAVE 100000 STEPS OF EQUILIBRATION PRIOR TO FEP DATA COLLECTION.
>>> FEP: USING CONSTANT TEMPERATURE OF 300 K FOR FEP CALCULATION
>>> PRESSURE: 0 70.2699 -221.652 -54.6848 -221.652 -146.982 179.527 -54.6848 179.527 216.259
>>> GPRESSURE: 0 92.593 -114.553 110.669 -161.111 -69.3013 92.2703 26.1698 176.706 99.3091
>>> ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC TOTAL TEMP POTENTIAL TOTAL3 TEMPAVG PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG
>>> FEPTITLE: TS BOND2 ELECT2 VDW2
>>>
>>> ENERGY: 0 4963.7649 7814.6132 8443.0271 479.5443 -251991.4214
>>> ################
>>> The batch job was configured as follows for 126 MPI processes:
>>>
>>> #!/bin/bash
>>> #PBS -l select=1:ncpus=64:mpiprocs=126:mem=86GB:mcdram=cache:numa=quadrant
>>> #PBS -l walltime=00:10:00
>>> #PBS -o frwd-01.out
>>> #PBS -e frwd-01.err
>>> #PBS -A my account
>>>
>>> # go to submission directory
>>> cd $PBS_O_WORKDIR
>>>
>>> # load namd
>>> module load profile/knl
>>> module load autoload namd/2.12_knl
>>> module help namd/2.12_knl
>>>
>>> # launch NAMD on a single KNL node (126 worker threads + 1 communication thread)
>>>
>>> mpirun -perhost 1 -n 1 namd2 +ppn 126 frwd-01.namd +pemap 4-66+68 +commap 67 > frwd-01.namd.log
>>>
>>> ########################
>>> or for 256:
>>> #PBS -l select=1:ncpus=64:mpiprocs=256:mem=86GB:mcdram=cache:numa=quadrant
>>>
>>> mpirun -perhost 1 -n 1 namd2 +ppn 256 frwd-01.namd +pemap 0-63+64+128+192 > frwd-01.namd.log
>>>
>>> ###############
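Since exit code 11 corresponds to a segmentation fault (SIGSEGV), a
scaled-down variant of the same launch line can help isolate the failure;
the reduced +ppn, +pemap, and +commap values below (and the test log name)
are illustrative assumptions, not settings taken from this thread:

   # one MPI rank, 16 worker threads pinned to cores 0-15, communication
   # thread on core 16; if this also dies with exit code 11, the crash is
   # unlikely to depend on the +ppn/+pemap layout
   mpirun -perhost 1 -n 1 namd2 +ppn 16 +pemap 0-15 +commap 16 frwd-01.namd > frwd-01-test.log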
>>>
>>> Assuming that KNL is no hindrance to FEP, I hope to get a hint that I
>>> can pass on to the operators of the cluster.
>>>
>>> Thanks
>>>
>>> francesco pietra
>>
>> --
>> Brian Radak
>> Postdoctoral Appointee
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>> 9700 South Cass Avenue, Bldg. 240
>> Argonne, IL 60439-4854
>> (630) 252-8643
>> brian.radak_at_anl.gov
>>
>>
>
> --
> Brian Radak
> Postdoctoral Appointee
> Leadership Computing Facility
> Argonne National Laboratory
>
> 9700 South Cass Avenue, Bldg. 240
> Argonne, IL 60439-4854
> (630) 252-8643
> brian.radak_at_anl.gov
>
>

-- 
Brian Radak
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory
9700 South Cass Avenue, Bldg. 240
Argonne, IL 60439-4854
(630) 252-8643
brian.radak_at_anl.gov
