RE: Performance on GPU

From: Vermaas, Joshua (Joshua.Vermaas_at_nrel.gov)
Date: Wed Nov 06 2019 - 09:00:41 CST

OK! This is a multicore build, so your runscript should look something like this:

time aprun -n 1 -N 1 -d 18 /home/apps/namd/2.12/gpu/8.0/CRAY-XC.cuda.arch.multicore/namd2 +p18 +idlepoll apoa1_npt_cuda.namd > prod_gpu.log

The parallelism in your binary doesn't come from MPI or multi-node communication; instead, it was compiled so that NAMD itself manages multiple threads across the processors of a single node. You can see this from:

Info: Running on 1 processors, 1 nodes, 1 physical nodes.

While you have 18 processors on the node, NAMD is being asked to use only 1 of them, which means your performance won't be great. The +p18 flag in the command above tells the multicore build to use all 18.
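A quick sanity check after rerunning with +p18: the "Running on" banner near the top of the log should now report all 18 processors. A minimal sketch (on the cluster you would grep the real prod_gpu.log from the aprun line above; the sample line here is just illustrative so the snippet is self-contained):

```shell
# Verify the thread count NAMD actually started with by grepping the
# startup banner. A sample line stands in for the real log file.
log=$(mktemp)
printf 'Info: Running on 18 processors, 1 nodes, 1 physical nodes.\n' > "$log"
grep -o 'Running on [0-9]* processors' "$log"   # prints: Running on 18 processors
rm -f "$log"
```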

-Josh

On 2019-11-05 22:40:44-07:00 Anup Prasad wrote:

The starting output in the log:
Charm++: standalone mode (not using charmrun)
Charm++> Running in Multicore mode: 1 threads
Charm++> Using recursive bisection (scheme 3) for topology aware partitions
Converse/Charm++ Commit ID: v6.7.1-0-gbdf6a1b-namd-charm-6.7.1-build-2016-Nov-07-136676
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (36-way SMP).
Charm++> cpu topology info is gathered in 0.000 seconds.
Info: Built with CUDA version 8000
Did not find +devices i,j,k,... argument, using all
Pe 0 physical rank 0 binding to CUDA device 0 on physical node 0: 'Tesla P100-PCIE-16GB' Mem: 16276MB Rev: 6.0
Info: NAMD 2.12 for CRAY-XC-multicore-CUDA
Info:
Info: Please visit http://www.ks.uiuc.edu/Research/namd/
Info: for updates, documentation, and support information.
Info:
Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
Info: in all publications reporting results obtained with NAMD.
Info:
Info: Based on Charm++/Converse 60701 for multicore-linux64-gcc
Info: Built Wed Aug 8 15:27:39 IST 2018 by crayadm on clogin72
Info: Running on 1 processors, 1 nodes, 1 physical nodes.
Info: CPU topology information available.
Info: Charm++/Converse parallel runtime startup completed at 0.523414 s
CkLoopLib is used in SMP with a simple dynamic scheduling (converse-level notification) but not using node-level queue
Info: 110.766 MB of memory in use based on /proc/self/stat
Info: Configuration file is apoa1_npt_cuda.namd
Info: Working in the current directory /home/PolymerSimulationLab/souravray/systems/test/gpu_1/apoa1/gpu
TCL: Suspending until startup complete.
Warning: ALWAYS USE NON-ZERO MARGIN WITH CONSTANT PRESSURE!
Warning: CHANGING MARGIN FROM 0 to 0.48
Info: SIMULATION PARAMETERS:
Info: TIMESTEP 2
Info: NUMBER OF STEPS 10000
Info: STEPS PER CYCLE 20
Info: PERIODIC CELL BASIS 1 108.861 0 0
Info: PERIODIC CELL BASIS 2 0 108.861 0
Info: PERIODIC CELL BASIS 3 0 0 77.758
Info: PERIODIC CELL CENTER 0 0 0
Info: LOAD BALANCER None
Info: MIN ATOMS PER PATCH 40
Info: INITIAL TEMPERATURE 298
Info: CENTER OF MASS MOVING INITIALLY? NO
Info: DIELECTRIC 1
Info: EXCLUDE SCALED ONE-FOUR
Info: 1-4 ELECTROSTATICS SCALED BY 1
Info: MODIFIED 1-4 VDW PARAMETERS WILL BE USED
Info: NO DCD TRAJECTORY OUTPUT
Info: NO EXTENDED SYSTEM TRAJECTORY OUTPUT
Info: NO VELOCITY DCD OUTPUT
Info: NO FORCE DCD OUTPUT
Info: OUTPUT FILENAME apoa1-output
Info: BINARY OUTPUT FILES WILL BE USED
Info: NO RESTART FILE
Info: SWITCHING ACTIVE
Info: SWITCHING ON 10
Info: SWITCHING OFF 12
Info: PAIRLIST DISTANCE 13.5
Info: PAIRLIST SHRINK RATE 0.01
Info: PAIRLIST GROW RATE 0.01
Info: PAIRLIST TRIGGER 0.3
Info: PAIRLISTS PER CYCLE 2
Info: PAIRLISTS ENABLED
Info: MARGIN 0.48
Info: HYDROGEN GROUP CUTOFF 2.5
Info: PATCH DIMENSION 16.48
Info: ENERGY OUTPUT STEPS 500
Info: CROSSTERM ENERGY INCLUDED IN DIHEDRAL
Info: TIMING OUTPUT STEPS 500
Info: LANGEVIN DYNAMICS ACTIVE
Info: LANGEVIN TEMPERATURE 298
Info: LANGEVIN USING BBK INTEGRATOR
Info: LANGEVIN DAMPING COEFFICIENT IS 5 INVERSE PS
Info: LANGEVIN DYNAMICS NOT APPLIED TO HYDROGENS
Info: LANGEVIN PISTON PRESSURE CONTROL ACTIVE
Info: TARGET PRESSURE IS 1.01325 BAR
Info: OSCILLATION PERIOD IS 100 FS
Info: DECAY TIME IS 50 FS
Info: PISTON TEMPERATURE IS 298 K
Info: PRESSURE CONTROL IS GROUP-BASED
Info: INITIAL STRAIN RATE IS 0 0 0
Info: CELL FLUCTUATION IS ISOTROPIC
Info: PARTICLE MESH EWALD (PME) ACTIVE
Info: PME TOLERANCE 1e-06
Info: PME EWALD COEFFICIENT 0.257952
Info: PME INTERPOLATION ORDER 4
Info: PME GRID DIMENSIONS 108 108 80
Info: PME MAXIMUM GRID SPACING 1.5
Info: Attempting to read FFTW data from FFTW_NAMD_2.12_CRAY-XC-multicore-CUDA.txt
Info: Optimizing 6 FFT steps. 1... 2... 3... 4... 5... 6... Done.
Info: Writing FFTW data to FFTW_NAMD_2.12_CRAY-XC-multicore-CUDA.txt
Info: FULL ELECTROSTATIC EVALUATION FREQUENCY 2
Info: USING VERLET I (r-RESPA) MTS SCHEME.
Info: C1 SPLITTING OF LONG RANGE ELECTROSTATICS
Info: PLACING ATOMS IN PATCHES BY HYDROGEN GROUPS
Info: RIGID BONDS TO HYDROGEN : ALL
Info: ERROR TOLERANCE : 1e-08
Info: MAX ITERATIONS : 100
Info: RIGID WATER USING SETTLE ALGORITHM
Info: RANDOM NUMBER SEED 74269
Info: USE HYDROGEN BONDS? NO
Info: COORDINATE PDB apoa1.pdb
Info: STRUCTURE FILE apoa1.psf
Info: PARAMETER file: XPLOR format! (default)
Info: PARAMETERS par_all22_prot_lipid.xplor
Info: PARAMETERS par_all22_popc.xplor
Info: USING ARITHMETIC MEAN TO COMBINE L-J SIGMA PARAMETERS
Info: SUMMARY OF PARAMETERS:
Info: 177 BONDS
Info: 435 ANGLES
Info: 446 DIHEDRAL
Info: 45 IMPROPER
Info: 0 CROSSTERM
Info: 83 VDW
Info: 6 VDW_PAIRS
Info: 0 NBTHOLE_PAIRS
Info: TIME FOR READING PSF FILE: 0.870474
Info: Reading pdb file apoa1.pdb
Info: TIME FOR READING PDB FILE: 0.209462
Info:
Info: ****************************
Info: STRUCTURE SUMMARY:
Info: 92224 ATOMS
Info: 70660 BONDS
Info: 74136 ANGLES
Info: 74130 DIHEDRALS
Info: 1402 IMPROPERS
Info: 0 CROSSTERMS
Info: 0 EXCLUSIONS
Info: 1568 DIHEDRALS WITH MULTIPLE PERIODICITY (BASED ON PSF FILE)
Info: 80690 RIGID BONDS
Info: 195982 DEGREES OF FREEDOM
Info: 32992 HYDROGEN GROUPS
Info: 4 ATOMS IN LARGEST HYDROGEN GROUP
Info: 32992 MIGRATION GROUPS
Info: 4 ATOMS IN LARGEST MIGRATION GROUP
Info: TOTAL MASS = 553785 amu
Info: TOTAL CHARGE = -14 e
Info: MASS DENSITY = 0.997951 g/cm^3
Info: ATOM DENSITY = 0.100081 atoms/A^3
Info: *****************************
Info:
Info: Entering startup at 23.7553 s, 139.582 MB of memory in use
Info: Startup phase 0 took 5.50747e-05 s, 139.629 MB of memory in use
Info: ADDED 218698 IMPLICIT EXCLUSIONS
Info: Startup phase 1 took 0.0340278 s, 161.539 MB of memory in use
Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625
Info: NONBONDED TABLE SIZE: 769 POINTS
Info: INCONSISTENCY IN FAST TABLE ENERGY VS FORCE: 0.000325096 AT 11.9556
Info: INCONSISTENCY IN SCOR TABLE ENERGY VS FORCE: 0.000324844 AT 11.9556
Info: INCONSISTENCY IN VDWA TABLE ENERGY VS FORCE: 0.0040507 AT 0.251946
Info: INCONSISTENCY IN VDWB TABLE ENERGY VS FORCE: 0.00150189 AT 0.251946
Info: Startup phase 2 took 0.000448227 s, 161.648 MB of memory in use
Info: Startup phase 3 took 4.57764e-05 s, 161.711 MB of memory in use
Info: Startup phase 4 took 9.41753e-05 s, 161.711 MB of memory in use
Info: Startup phase 5 took 4.19617e-05 s, 161.711 MB of memory in use
Info: PATCH GRID IS 6 (PERIODIC) BY 6 (PERIODIC) BY 4 (PERIODIC)
Info: PATCH GRID IS 1-AWAY BY 1-AWAY BY 1-AWAY
Info: REMOVING COM VELOCITY 0.00117565 0.0288209 0.0202255
Info: LARGEST PATCH (56) HAS 718 ATOMS
Info: TORUS A SIZE 1 USING 0
Info: TORUS B SIZE 1 USING 0
Info: TORUS C SIZE 1 USING 0
Info: TORUS MINIMAL MESH SIZE IS 1 BY 1 BY 1
Info: Placed 100% of base nodes on same physical node as patch
Info: Startup phase 6 took 0.0221109 s, 177.691 MB of memory in use
Info: PME using 1 x 1 x 1 pencil grid for FFT and reciprocal sum.
Info: Startup phase 7 took 0.000143051 s, 177.98 MB of memory in use
Info: Startup phase 8 took 0.00597692 s, 179.91 MB of memory in use
Info: Startup phase 9 took 0.356868 s, 361.457 MB of memory in use
Info: CREATING 3031 COMPUTE OBJECTS
Info: Updated CUDA force table with 4096 elements.
Info: Updated CUDA LJ table with 83 x 83 elements.
Info: Found 318 unique exclusion lists needing 1060 bytes
Info: Startup phase 10 took 0.023706 s, 368.844 MB of memory in use
Info: Startup phase 11 took 5.6982e-05 s, 368.906 MB of memory in use
Info: Startup phase 12 took 0.000685215 s, 369.418 MB of memory in use
Info: Finished startup at 24.1996 s, 369.539 MB of memory in use

On Tue, 5 Nov 2019 at 23:03, Vermaas, Joshua <Joshua.Vermaas_at_nrel.gov> wrote:
What is the output in the log? Usually when I see weird performance, it's because NAMD didn't detect the hardware the way you expected it to. The startup information at the top of the log will report how many processors are being used and how the GPUs are being assigned.

-Josh

On 2019-11-05 06:36:29-07:00 owner-namd-l_at_ks.uiuc.edu wrote:

Dear NAMD community,
I am using NAMD for my MD simulations, and I want to run them on the GPU nodes of the HPC facility here (a Cray XC), so I am testing with the "apoa1" benchmark. Comparing my results against the published NAMD benchmark numbers, the performance is very poor: based on the apoa1 benchmarks I should be getting nearly 30 ns/day on our hardware, but I am getting only around 3 ns/day for the same system. I am using the NAMD config files provided at the benchmark link below.
NAMD benchmark link: https://www.ks.uiuc.edu/Research/namd/benchmarks/
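For reference, NAMD reports its throughput in days/ns on the "Info: Benchmark time:" lines of the log, so inverting that figure gives ns/day. A minimal sketch of the conversion (the 0.327 below is an illustrative value, not taken from this run):

```shell
# Convert NAMD's days/ns figure (from "Info: Benchmark time:" log lines)
# to ns/day by inverting it. Substitute the value from your own log.
days_per_ns=0.327
awk -v d="$days_per_ns" 'BEGIN { printf "%.2f ns/day\n", 1/d }'   # prints: 3.06 ns/day
```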
These are the specifications for the GPU nodes at my institute,
HPC specifications

Operating System -- Cray Linux Environment Version - 6.x

Cray Programming Environment (CPE) -- Unlimited

Intel Parallel Studio XE -- 5 Seats

PGI Accelerator -- 2 Seats

Workload Manager -- PBS Pro

Compute Node - CPU+GPU Node

Processor -- 1X BDW 2.1 GHz 18C

Accelerator -- 1X P100 16 GiB

Memory Per Node -- 64 GB DDR4-2400 with Chipkill technology

This is the shell script I use to submit jobs:

##############################################################################

                                      submitting shell script

##############################################################################

## Queue it will run in
#PBS -N gpu
#PBS -q gpuq
#PBS -l select=1:ncpus=18:accelerator=True:vntype=cray_compute
#PBS -l walltime=00:30:00
#PBS -l place=pack
#PBS -j oe

module load craype-broadwell
module load craype-accel-nvidia60
module load namd/2.12/gpu-8.0
EXEC=/home/apps/namd/2.12/gpu/8.0/CRAY-XC.cuda.arch.multicore/namd2

cd $PBS_O_WORKDIR

time aprun -n 1 -N 1 -d 18 /home/apps/namd/2.12/gpu/8.0/CRAY-XC.cuda.arch.multicore/namd2 +idlepoll apoa1_npt_cuda.namd > prod_gpu.log

#################################################################################

Please help with suggestions.

Kind regards

Anup Kumar Prasad

Ph.D scholar, IITB-Monash Research Academy

Indian Institute of Technology Bombay, INDIA

This archive was generated by hypermail 2.1.6 : Tue Dec 31 2019 - 23:21:00 CST