Re: Performance on GPU

From: Anup Prasad (anup.prasad_at_monash.edu)
Date: Tue Nov 05 2019 - 23:40:27 CST

The starting output in the log:

Charm++: standalone mode (not using charmrun)
Charm++> Running in Multicore mode: 1 threads
Charm++> Using recursive bisection (scheme 3) for topology aware partitions
Converse/Charm++ Commit ID: v6.7.1-0-gbdf6a1b-namd-charm-6.7.1-build-2016-Nov-07-136676
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (36-way SMP).
Charm++> cpu topology info is gathered in 0.000 seconds.
Info: Built with CUDA version 8000
Did not find +devices i,j,k,... argument, using all
Pe 0 physical rank 0 binding to CUDA device 0 on physical node 0: 'Tesla P100-PCIE-16GB' Mem: 16276MB Rev: 6.0
Info: NAMD 2.12 for CRAY-XC-multicore-CUDA
Info:
Info: Please visit http://www.ks.uiuc.edu/Research/namd/
Info: for updates, documentation, and support information.
Info:
Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
Info: in all publications reporting results obtained with NAMD.
Info:
Info: Based on Charm++/Converse 60701 for multicore-linux64-gcc
Info: Built Wed Aug 8 15:27:39 IST 2018 by crayadm on clogin72
Info: Running on 1 processors, 1 nodes, 1 physical nodes.
Info: CPU topology information available.
Info: Charm++/Converse parallel runtime startup completed at 0.523414 s
CkLoopLib is used in SMP with a simple dynamic scheduling (converse-level notification) but not using node-level queue
Info: 110.766 MB of memory in use based on /proc/self/stat
Info: Configuration file is apoa1_npt_cuda.namd
Info: Working in the current directory /home/PolymerSimulationLab/souravray/systems/test/gpu_1/apoa1/gpu
TCL: Suspending until startup complete.
Warning: ALWAYS USE NON-ZERO MARGIN WITH CONSTANT PRESSURE!
Warning: CHANGING MARGIN FROM 0 to 0.48
Info: SIMULATION PARAMETERS:
Info: TIMESTEP 2
Info: NUMBER OF STEPS 10000
Info: STEPS PER CYCLE 20
Info: PERIODIC CELL BASIS 1 108.861 0 0
Info: PERIODIC CELL BASIS 2 0 108.861 0
Info: PERIODIC CELL BASIS 3 0 0 77.758
Info: PERIODIC CELL CENTER 0 0 0
Info: LOAD BALANCER None
Info: MIN ATOMS PER PATCH 40
Info: INITIAL TEMPERATURE 298
Info: CENTER OF MASS MOVING INITIALLY? NO
Info: DIELECTRIC 1
Info: EXCLUDE SCALED ONE-FOUR
Info: 1-4 ELECTROSTATICS SCALED BY 1
Info: MODIFIED 1-4 VDW PARAMETERS WILL BE USED
Info: NO DCD TRAJECTORY OUTPUT
Info: NO EXTENDED SYSTEM TRAJECTORY OUTPUT
Info: NO VELOCITY DCD OUTPUT
Info: NO FORCE DCD OUTPUT
Info: OUTPUT FILENAME apoa1-output
Info: BINARY OUTPUT FILES WILL BE USED
Info: NO RESTART FILE
Info: SWITCHING ACTIVE
Info: SWITCHING ON 10
Info: SWITCHING OFF 12
Info: PAIRLIST DISTANCE 13.5
Info: PAIRLIST SHRINK RATE 0.01
Info: PAIRLIST GROW RATE 0.01
Info: PAIRLIST TRIGGER 0.3
Info: PAIRLISTS PER CYCLE 2
Info: PAIRLISTS ENABLED
Info: MARGIN 0.48
Info: HYDROGEN GROUP CUTOFF 2.5
Info: PATCH DIMENSION 16.48
Info: ENERGY OUTPUT STEPS 500
Info: CROSSTERM ENERGY INCLUDED IN DIHEDRAL
Info: TIMING OUTPUT STEPS 500
Info: LANGEVIN DYNAMICS ACTIVE
Info: LANGEVIN TEMPERATURE 298
Info: LANGEVIN USING BBK INTEGRATOR
Info: LANGEVIN DAMPING COEFFICIENT IS 5 INVERSE PS
Info: LANGEVIN DYNAMICS NOT APPLIED TO HYDROGENS
Info: LANGEVIN PISTON PRESSURE CONTROL ACTIVE
Info: TARGET PRESSURE IS 1.01325 BAR
Info: OSCILLATION PERIOD IS 100 FS
Info: DECAY TIME IS 50 FS
Info: PISTON TEMPERATURE IS 298 K
Info: PRESSURE CONTROL IS GROUP-BASED
Info: INITIAL STRAIN RATE IS 0 0 0
Info: CELL FLUCTUATION IS ISOTROPIC
Info: PARTICLE MESH EWALD (PME) ACTIVE
Info: PME TOLERANCE 1e-06
Info: PME EWALD COEFFICIENT 0.257952
Info: PME INTERPOLATION ORDER 4
Info: PME GRID DIMENSIONS 108 108 80
Info: PME MAXIMUM GRID SPACING 1.5
Info: Attempting to read FFTW data from FFTW_NAMD_2.12_CRAY-XC-multicore-CUDA.txt
Info: Optimizing 6 FFT steps. 1... 2... 3... 4... 5... 6... Done.
Info: Writing FFTW data to FFTW_NAMD_2.12_CRAY-XC-multicore-CUDA.txt
Info: FULL ELECTROSTATIC EVALUATION FREQUENCY 2
Info: USING VERLET I (r-RESPA) MTS SCHEME.
Info: C1 SPLITTING OF LONG RANGE ELECTROSTATICS
Info: PLACING ATOMS IN PATCHES BY HYDROGEN GROUPS
Info: RIGID BONDS TO HYDROGEN : ALL
Info: ERROR TOLERANCE : 1e-08
Info: MAX ITERATIONS : 100
Info: RIGID WATER USING SETTLE ALGORITHM
Info: RANDOM NUMBER SEED 74269
Info: USE HYDROGEN BONDS? NO
Info: COORDINATE PDB apoa1.pdb
Info: STRUCTURE FILE apoa1.psf
Info: PARAMETER file: XPLOR format! (default)
Info: PARAMETERS par_all22_prot_lipid.xplor
Info: PARAMETERS par_all22_popc.xplor
Info: USING ARITHMETIC MEAN TO COMBINE L-J SIGMA PARAMETERS
Info: SUMMARY OF PARAMETERS:
Info: 177 BONDS
Info: 435 ANGLES
Info: 446 DIHEDRAL
Info: 45 IMPROPER
Info: 0 CROSSTERM
Info: 83 VDW
Info: 6 VDW_PAIRS
Info: 0 NBTHOLE_PAIRS
Info: TIME FOR READING PSF FILE: 0.870474
Info: Reading pdb file apoa1.pdb
Info: TIME FOR READING PDB FILE: 0.209462
Info:
Info: ****************************
Info: STRUCTURE SUMMARY:
Info: 92224 ATOMS
Info: 70660 BONDS
Info: 74136 ANGLES
Info: 74130 DIHEDRALS
Info: 1402 IMPROPERS
Info: 0 CROSSTERMS
Info: 0 EXCLUSIONS
Info: 1568 DIHEDRALS WITH MULTIPLE PERIODICITY (BASED ON PSF FILE)
Info: 80690 RIGID BONDS
Info: 195982 DEGREES OF FREEDOM
Info: 32992 HYDROGEN GROUPS
Info: 4 ATOMS IN LARGEST HYDROGEN GROUP
Info: 32992 MIGRATION GROUPS
Info: 4 ATOMS IN LARGEST MIGRATION GROUP
Info: TOTAL MASS = 553785 amu
Info: TOTAL CHARGE = -14 e
Info: MASS DENSITY = 0.997951 g/cm^3
Info: ATOM DENSITY = 0.100081 atoms/A^3
Info: *****************************
Info:
Info: Entering startup at 23.7553 s, 139.582 MB of memory in use
Info: Startup phase 0 took 5.50747e-05 s, 139.629 MB of memory in use
Info: ADDED 218698 IMPLICIT EXCLUSIONS
Info: Startup phase 1 took 0.0340278 s, 161.539 MB of memory in use
Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625
Info: NONBONDED TABLE SIZE: 769 POINTS
Info: INCONSISTENCY IN FAST TABLE ENERGY VS FORCE: 0.000325096 AT 11.9556
Info: INCONSISTENCY IN SCOR TABLE ENERGY VS FORCE: 0.000324844 AT 11.9556
Info: INCONSISTENCY IN VDWA TABLE ENERGY VS FORCE: 0.0040507 AT 0.251946
Info: INCONSISTENCY IN VDWB TABLE ENERGY VS FORCE: 0.00150189 AT 0.251946
Info: Startup phase 2 took 0.000448227 s, 161.648 MB of memory in use
Info: Startup phase 3 took 4.57764e-05 s, 161.711 MB of memory in use
Info: Startup phase 4 took 9.41753e-05 s, 161.711 MB of memory in use
Info: Startup phase 5 took 4.19617e-05 s, 161.711 MB of memory in use
Info: PATCH GRID IS 6 (PERIODIC) BY 6 (PERIODIC) BY 4 (PERIODIC)
Info: PATCH GRID IS 1-AWAY BY 1-AWAY BY 1-AWAY
Info: REMOVING COM VELOCITY 0.00117565 0.0288209 0.0202255
Info: LARGEST PATCH (56) HAS 718 ATOMS
Info: TORUS A SIZE 1 USING 0
Info: TORUS B SIZE 1 USING 0
Info: TORUS C SIZE 1 USING 0
Info: TORUS MINIMAL MESH SIZE IS 1 BY 1 BY 1
Info: Placed 100% of base nodes on same physical node as patch
Info: Startup phase 6 took 0.0221109 s, 177.691 MB of memory in use
Info: PME using 1 x 1 x 1 pencil grid for FFT and reciprocal sum.
Info: Startup phase 7 took 0.000143051 s, 177.98 MB of memory in use
Info: Startup phase 8 took 0.00597692 s, 179.91 MB of memory in use
Info: Startup phase 9 took 0.356868 s, 361.457 MB of memory in use
Info: CREATING 3031 COMPUTE OBJECTS
Info: Updated CUDA force table with 4096 elements.
Info: Updated CUDA LJ table with 83 x 83 elements.
Info: Found 318 unique exclusion lists needing 1060 bytes
Info: Startup phase 10 took 0.023706 s, 368.844 MB of memory in use
Info: Startup phase 11 took 5.6982e-05 s, 368.906 MB of memory in use
Info: Startup phase 12 took 0.000685215 s, 369.418 MB of memory in use
Info: Finished startup at 24.1996 s, 369.539 MB of memory in use
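
For reference, the lines "Charm++> Running in Multicore mode: 1 threads" and
"Info: Running on 1 processors, 1 nodes, 1 physical nodes" above indicate that
only a single worker thread was started, even though the job reserves 18 cores.
As a minimal sketch only (assuming the standard Charm++ +p thread-count
argument of the multicore-CUDA build, reusing the $EXEC path and +idlepoll flag
from the submission script quoted below, and the +devices argument mentioned in
the log above), a launch line that requests all 18 cores would look like:

  time aprun -n 1 -N 1 -d 18 $EXEC +p18 +idlepoll +devices 0 \
      apoa1_npt_cuda.namd > prod_gpu.log

The exact aprun options and the thread count appropriate for one P100 node may
differ; this only illustrates where the worker-thread count is passed to namd2.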

On Tue, 5 Nov 2019 at 23:03, Vermaas, Joshua <Joshua.Vermaas_at_nrel.gov>
wrote:

> What is the output in the log? Usually when I see weird performance, it's
> because NAMD didn't detect the hardware like you expected it to. The
> startup information at the top of the log will report how many processors
> are being used and how the GPUs are being assigned.
>
> -Josh
>
>
>
> On 2019-11-05 06:36:29-07:00 owner-namd-l_at_ks.uiuc.edu wrote:
>
> Dear NAMD community,
> I am using the NAMD platform for my MD simulations. I want to use the GPU
> nodes on the HPC facility here (CRAY XE) to run my simulations, for which I
> am trying to run the "apoa1" benchmark. I compared the simulation
> performance on our HPC facility with the published NAMD benchmark results,
> but got very poor performance. Based on the NAMD benchmarks for apoa1, I
> should be getting nearly 30 ns/day on the hardware we have here. However, I
> am able to get only around 3 ns/day for the same system. I am using the
> NAMD config files provided in the benchmark link below.
> NAMD benchmark link: https://www.ks.uiuc.edu/Research/namd/benchmarks/
> These are the specifications for the GPU nodes at my institute:
>
> HPC specifications
> Operating System -- Cray Linux Environment, Version 6.x
> Cray Programming Environment (CPE) -- Unlimited
> Intel Parallel Studio XE -- 5 Seats
> PGI Accelerator -- 2 Seats
> Workload Manager -- PBS Pro
>
> Compute Node (CPU+GPU Node)
> Processor -- 1X BDW 2.1 GHz 18C
> Accelerator -- 1X P100 16 GiB
> Memory Per Node -- 64 GB DDR4-2400 with Chipkill technology
>
>
> This is the shell script I use to submit jobs:
>
> ##############################################################################
> # submitting shell script
> ##############################################################################
>
> ## Queue it will run in
> #PBS -N gpu
> #PBS -q gpuq
> #PBS -l select=1:ncpus=18:accelerator=True:vntype=cray_compute
> #PBS -l walltime=00:30:00
> #PBS -l place=pack
> #PBS -j oe
>
> module load craype-broadwell
> module load craype-accel-nvidia60
> module load namd/2.12/gpu-8.0
>
> EXEC=/home/apps/namd/2.12/gpu/8.0/CRAY-XC.cuda.arch.multicore/namd2
>
> cd $PBS_O_WORKDIR
> time aprun -n 1 -N 1 -d 18 /home/apps/namd/2.12/gpu/8.0/CRAY-XC.cuda.arch.multicore/namd2 +idlepoll apoa1_npt_cuda.namd > prod_gpu.log
>
> ##############################################################################
>
> Please help with suggestions.
>
> Kind regards,
>
> Anup Kumar Prasad
> Ph.D. scholar, IITB-Monash Research Academy
> Indian Institute of Technology Bombay, INDIA
>
>
