Re: namd-cuda-intel vs. namd-intel

From: Roman Petrenko (rpetrenko_at_gmail.com)
Date: Tue May 12 2009 - 10:19:55 CDT

Axel, thanks for the reply. The problem is that the standard test (apoa1)
runs slower on the GPU than on the CPU. I guess our GPU is not powerful
enough for this system size (92,000 atoms). On the other hand,
CUDA-accelerated ion placement in VMD (running volmap
coulombpotential from the VMD website, 700,000 atoms) gave a 30-fold
speedup on our system. Yet the 6,000-atom system we are interested in
shows a mere 4x acceleration with namd-gpu. Why?
Note that removing SMD, PBC, or Langevin temperature control does not
greatly affect the calculation speed. I should also note that on the CPU
there are 810 compute objects, whereas on the GPU this number is about 140.
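
A rough sketch of how small the spatial decomposition is for this system, using only values that appear in the log below (PERIODIC CELL BASIS, PATCH DIMENSION, ATOMS). It assumes one patch per whole multiple of the patch dimension along each axis; that is a simplification rather than NAMD's actual code, but it reproduces the logged 1 x 1 x 7 grid. With only 7 patches there may simply be too little parallel work to keep a GPU busy, though that is a guess and not something the log proves.

```python
import math

# Values copied from the log: PERIODIC CELL BASIS 1-3, PATCH DIMENSION, ATOMS.
cell_lengths = (23.199, 23.019, 116.949)  # Angstrom
patch_dimension = 16.0                    # Angstrom
n_atoms = 5840

# Assumption (not NAMD's real implementation): the number of patches along
# each axis is floor(cell length / patch dimension), with a minimum of one.
patch_grid = tuple(
    max(1, math.floor(length / patch_dimension)) for length in cell_lengths
)
n_patches = math.prod(patch_grid)
atoms_per_patch = n_atoms / n_patches

print(patch_grid)              # (1, 1, 7)
print(n_patches)               # 7
print(round(atoms_per_patch))  # 834
```

The average of about 834 atoms per patch is consistent with the logged "LARGEST PATCH (1) HAS 860 ATOMS".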

Here is the NAMD log file for the CUDA run:
---------------------------------------------------------------------------------------------
Running command: /usr/local/namd/2.7b1/cuda/intel/bin/namd2 +idlepoll dir/protein.namd

Charm++: standalone mode (not using charmrun)
Charm warning> Randomization of stack pointer is turned on in Kernel, run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it. Thread migration may not work!
Did not find +devices i,j,k,... argument, defaulting to (pe + 1) % deviceCount
Pe 0 binding to CUDA device 0 on hostname: 'GeForce 9800 GT' Mem: 1023MB Rev: 1.1
Charm++> cpu topology info is being gathered.
Charm++> 1 unique compute nodes detected.
Info: NAMD 2.7b1 for Linux-x86_64
Info:
Info: Please visit http://www.ks.uiuc.edu/Research/namd/
Info: and send feedback or bug reports to namd_at_ks.uiuc.edu
Info:
Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
Info: in all publications reporting results obtained with NAMD.
Info:
Info: Based on Charm++/Converse 60102 for multicore-linux64-ifort-icc
Info: Built Tue May 5 23:55:12 EDT 2009 by admin on hostname
Info: 1 NAMD 2.7b1 Linux-x86_64 1 hostname user
Info: Running on 1 processors.
Info: Charm++/Converse parallel runtime startup completed at 0.860323 s
Info: 1.79207 MB of memory in use based on mallinfo
Info: Changed directory to /dir
Info: Configuration file is protein.namd
TCL: Suspending until startup complete.
Warning: The following variables were set in the
Warning: configuation file but were not needed
Warning: langevinTemp
Warning: langevinDamping
Warning: langevinHydrogen
Warning: fixedAtomsFile
Warning: fixedAtomsCol
Warning: SMDVel
Warning: SMDDir
Warning: SMDk
Warning: SMDFile
Warning: SMDOutputFreq
Info: EXTENDED SYSTEM FILE protein.restart.xsc
Info: SIMULATION PARAMETERS:
Info: TIMESTEP 2
Info: NUMBER OF STEPS 0
Info: STEPS PER CYCLE 10
Info: PERIODIC CELL BASIS 1 23.199 0 0
Info: PERIODIC CELL BASIS 2 0 23.019 0
Info: PERIODIC CELL BASIS 3 0 0 116.949
Info: PERIODIC CELL CENTER 0.432363 -0.430208 49.7623
Info: LOAD BALANCE STRATEGY New Load Balancers -- ASB
Info: LDB PERIOD 2000 steps
Info: FIRST LDB TIMESTEP 50
Info: LAST LDB TIMESTEP -1
Info: LDB BACKGROUND SCALING 1
Info: HOM BACKGROUND SCALING 1
Info: MAX SELF PARTITIONS 1
Info: MAX PAIR PARTITIONS 1
Info: SELF PARTITION ATOMS 154
Info: SELF2 PARTITION ATOMS 154
Info: PAIR PARTITION ATOMS 318
Info: PAIR2 PARTITION ATOMS 637
Info: MIN ATOMS PER PATCH 100
Info: INITIAL TEMPERATURE 355
Info: CENTER OF MASS MOVING INITIALLY? NO
Info: DIELECTRIC 1
Info: EXCLUDE SCALED ONE-FOUR
Info: 1-4 SCALE FACTOR 1
Info: DCD FILENAME protein-cuda.dcd
Info: DCD FREQUENCY 5000
Info: DCD FIRST STEP 5000
Info: DCD FILE WILL CONTAIN UNIT CELL DATA
Info: XST FILENAME protein-cuda.xst
Info: XST FREQUENCY 5000
Info: NO VELOCITY DCD OUTPUT
Info: OUTPUT FILENAME protein-cuda
Info: RESTART FILENAME protein-cuda.restart
Info: RESTART FREQUENCY 5000
Info: BINARY RESTART FILES WILL BE USED
Info: SWITCHING ACTIVE
Info: SWITCHING ON 10
Info: SWITCHING OFF 12
Info: PAIRLIST DISTANCE 13.5
Info: PAIRLIST SHRINK RATE 0.01
Info: PAIRLIST GROW RATE 0.01
Info: PAIRLIST TRIGGER 0.3
Info: PAIRLISTS PER CYCLE 2
Info: PAIRLISTS ENABLED
Info: MARGIN 0
Info: HYDROGEN GROUP CUTOFF 2.5
Info: PATCH DIMENSION 16
Info: ENERGY OUTPUT STEPS 5000
Info: CROSSTERM ENERGY INCLUDED IN DIHEDRAL
Info: TIMING OUTPUT STEPS 50000
Info: PRESSURE OUTPUT STEPS 5000
Info: USING VERLET I (r-RESPA) MTS SCHEME.
Info: C1 SPLITTING OF LONG RANGE ELECTROSTATICS
Info: PLACING ATOMS IN PATCHES BY HYDROGEN GROUPS
Info: RIGID BONDS TO HYDROGEN : ALL
Info: ERROR TOLERANCE : 1e-08
Info: MAX ITERATIONS : 100
Info: RIGID WATER USING SETTLE ALGORITHM
Info: RANDOM NUMBER SEED 1242108026
Info: USE HYDROGEN BONDS? NO
Info: COORDINATE PDB protein.pdb
Info: STRUCTURE FILE protein.psf
Info: PARAMETER file: CHARMM format!
Info: PARAMETERS par_all27_prot_lipid.inp
Info: USING ARITHMETIC MEAN TO COMBINE L-J SIGMA PARAMETERS
Info: BINARY COORDINATES protein.restart.coor
Info: SUMMARY OF PARAMETERS:
Info: 180 BONDS
Info: 447 ANGLES
Info: 566 DIHEDRAL
Info: 46 IMPROPER
Info: 6 CROSSTERM
Info: 119 VDW
Info: 0 VDW_PAIRS
Info: TIME FOR READING PSF FILE: 0.036222
Info: TIME FOR READING PDB FILE: 0.0176351
Info:
Info: Reading from binary file protein.restart.coor
Info: ****************************
Info: STRUCTURE SUMMARY:
Info: 5840 ATOMS
Info: 3999 BONDS
Info: 2379 ANGLES
Info: 761 DIHEDRALS
Info: 62 IMPROPERS
Info: 28 CROSSTERMS
Info: 0 EXCLUSIONS
Info: 5668 RIGID BONDS
Info: 11849 DEGREES OF FREEDOM
Info: 2015 HYDROGEN GROUPS
Info: TOTAL MASS = 35649.7 amu
Info: TOTAL CHARGE = 1.30385e-07 e
Info: *****************************
Info:
Info: Entering startup at 0.93382 s, 3.4761 MB of memory in use
Info: Startup phase 0 took 0.0008111 s, 3.47623 MB of memory in use
Info: Startup phase 1 took 0.00569701 s, 4.43723 MB of memory in use
Info: Startup phase 2 took 0.0011189 s, 4.48509 MB of memory in use
Info: PATCH GRID IS 1 (PERIODIC) BY 1 (PERIODIC) BY 7 (PERIODIC)
Info: PATCH GRID IS 1-AWAY BY 1-AWAY BY 1-AWAY
Info: REMOVING COM VELOCITY 0.029356 0.0596738 -0.0838868
Info: LARGEST PATCH (1) HAS 860 ATOMS
Info: CREATING 139 COMPUTE OBJECTS
Info: Startup phase 3 took 0.0047071 s, 5.13454 MB of memory in use
Info: Startup phase 4 took 0.000961065 s, 5.13443 MB of memory in use
Info: Startup phase 5 took 0.000222921 s, 5.13432 MB of memory in use
LDB: Measuring processor speeds ... Done.
Info: Startup phase 6 took 0.000232935 s, 5.13583 MB of memory in use
Info: CREATING 139 COMPUTE OBJECTS
Info: useSync: 1 useProxySync: 0
Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625
Info: NONBONDED TABLE SIZE: 769 POINTS
Info: ABSOLUTE IMPRECISION IN FAST TABLE ENERGY: 1.69407e-21 AT 11.9974
Info: RELATIVE IMPRECISION IN FAST TABLE ENERGY: 1.13046e-16 AT 11.9974
CUDA force table updated on pe 0
create ComputeNonbondedCUDA
Pe 0 found 71 unique exclusion lists needing 172 bytes
Info: Startup phase 7 took 0.0047431 s, 6.20007 MB of memory in use
Info: Startup phase 8 took 0.000862837 s, 7.20914 MB of memory in use
Info: Finished startup at 0.953177 s, 7.20914 MB of memory in use

TCL: Running for 10000 steps
Pe 0 has 7 local and 0 remote patches and 189 local and 0 remote computes.
allocating 3 MB of memory on GPU
CUDA EVENT TIMING: 0 0.247392 0.004416 0.004512 29.207968 0.124352 29.588640
CUDA TIMING: 34.158945 ms/step on node 0
ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC TOTAL TEMP POTENTIAL TOTAL3 TEMPAVG PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG

ENERGY: 0 72.9822 159.2849 47.3271 16.7981 -18890.0424 1543.1801 0.0000 0.0000 4128.2526 -12922.2173 350.6509 -17050.4699 -12910.9769 350.6509 -595.5245 -18197.5199 62452.8455 -595.5245 -18197.5199

OPENING EXTENDED SYSTEM TRAJECTORY FILE
LDB: ============= START OF LOAD BALANCING ============== 2.82501
LDB: ============== END OF LOAD BALANCING =============== 2.82514

Info: Initial time: 1 CPUs 0.0344173 s/step 0.199174 days/ns 9.11273 MB memory
LDB: ============= START OF LOAD BALANCING ============== 4.55477
LDB: ============== END OF LOAD BALANCING =============== 4.55488

Info: Initial time: 1 CPUs 0.0345514 s/step 0.19995 days/ns 9.11322 MB memory
LDB: ============= START OF LOAD BALANCING ============== 6.32215
LDB: ============== END OF LOAD BALANCING =============== 6.32223

Info: Initial time: 1 CPUs 0.0352922 s/step 0.204237 days/ns 9.17262 MB memory
LDB: ============= START OF LOAD BALANCING ============== 8.0031
LDB: ============== END OF LOAD BALANCING =============== 8.00317

Info: Benchmark time: 1 CPUs 0.0335934 s/step 0.194406 days/ns 9.17316 MB memory
Info: Benchmark time: 1 CPUs 0.0337577 s/step 0.195357 days/ns 9.17384 MB memory
Info: Benchmark time: 1 CPUs 0.0337366 s/step 0.195235 days/ns 9.17459 MB memory
..
..
..
====================================================

WallClock: 339.800354 CPUTime: 339.645233 Memory: 9.246353 MB
Program finished.

---------------------------------------------------------------------------------------------
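
For reference, the days/ns figures in the timing lines follow directly from the s/step values and the 2 fs timestep. The sketch below reproduces that arithmetic; `days_per_ns` is an illustrative helper name, not part of NAMD, and all numbers are copied from the log above.

```python
SECONDS_PER_DAY = 86400.0


def days_per_ns(seconds_per_step, timestep_fs):
    """Convert wall time per step into days per simulated nanosecond."""
    steps_per_ns = 1.0e6 / timestep_fs  # 1 ns = 1e6 fs
    return seconds_per_step * steps_per_ns / SECONDS_PER_DAY


# TIMESTEP 2 (fs) and the last "Benchmark time" line: 0.0337366 s/step.
benchmark = days_per_ns(0.0337366, 2.0)
print(f"{benchmark:.6f}")  # 0.195235

# Cross-check against the totals: 10000 steps in WallClock 339.800354 s.
average_s_per_step = 339.800354 / 10000
print(f"{average_s_per_step:.4f}")  # 0.0340
```

The run-wide average is slightly above the benchmark value, as expected, since the wall clock also covers startup and output. The same helper can be applied to a CPU-only log to compare the two builds on equal terms.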

On Tue, May 12, 2009 at 7:24 AM, Axel Kohlmeyer
<akohlmey_at_cmm.chem.upenn.edu> wrote:
> On Tue, 2009-05-12 at 01:32 -0400, Roman Petrenko wrote:
>> Dear developers,
>> we compared simulations with Intel-compiled NAMD 2.7b1 builds, one with CUDA
>> disabled and one with it enabled. An NVT simulation of a 30-residue peptide in
>> a water box with PME and SMD was used. The observed speedup of NAMD with
>> the GPU is just 4 times. Is this due to the incompleteness of the cuda-namd
>> project, or did we do something wrong?
>
> roman,
>
> the one thing that you did wrong for certain is not to provide any
> information about what hardware you are running on, what
> compilers/flags/libs you are using and most importantly access to your
> input, so that somebody can validate it. in general, it would be
> preferred to use one of the example inputs provided on the namd
> homepage, which people may already have some reference data for.
>
> there are CUDA capable GPUs out there, e.g. GeForce 8400 GS, that
> have very little speedup to offer compared to a GeForce GTX 285
> or a Tesla C1060.
>
> cheers,
>   axel.
>
> --
> =======================================================================
> Axel Kohlmeyer   akohlmey_at_cmm.chem.upenn.edu   http://www.cmm.upenn.edu
>   Center for Molecular Modeling   --   University of Pennsylvania
> Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
> tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
> =======================================================================
> If you make something idiot-proof, the universe creates a better idiot.
>
>

-- 
Roman Petrenko
Physics Department
University of Cincinnati
