Re: namd-cuda-intel vs. namd-intel

From: Axel Kohlmeyer (akohlmey_at_cmm.chem.upenn.edu)
Date: Tue May 12 2009 - 11:11:19 CDT

On Tue, 2009-05-12 at 11:19 -0400, Roman Petrenko wrote:

roman,

i'm about to compile a version of NAMD for CUDA myself today
and will thus hopefully have some more detailed comments later.
 
> Axel, thanks for the reply. The problem is that the standard test (apoa1)
> runs slower on the gpu than on the cpu. I guess we don't have a powerful
> enough gpu for this system size (92,000 atoms). On the other hand,

that depends on what settings you actually compare.

> cuda-accelerated ion placement from VMD (when running volmap
> coulombpotential from the vmd website, 700,000 atoms) resulted in a 30x

this is an ideal kind of problem for a GPU. you cannot
expect a similar speedup from NAMD. please consider amdahl's law
and the fact that only parts of the NAMD code can be efficiently
offloaded to a GPU.
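
as a back-of-the-envelope illustration, here is a minimal python sketch of
amdahl's law; the offloadable fraction and the 30x kernel speedup below are
made-up numbers for illustration, not measurements of your run:

# amdahl's law: overall speedup when a fraction p of the per-step work
# is offloaded and that part is accelerated by a factor s.
def amdahl_speedup(p, s):
    return 1.0 / ((1.0 - p) + p / s)

# hypothetical: even with a 30x faster non-bonded kernel, offloading
# only 80% of the step caps the whole-step speedup below 5x.
print(amdahl_speedup(0.80, 30.0))   # ~4.4
print(amdahl_speedup(0.95, 30.0))   # ~12.2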

> speedup on our system. Then, the 6,000-atom system we are interested in
> resulted in a mere 4x acceleration for namd-gpu. Why?

systems with small atom counts make it particularly hard to get
a good speedup from a GPU, as the time to launch the compute
kernel(s) and transfer the associated data to the GPU becomes
significant relative to the time spent on computing the non-bonded
loops. for the GPU-MD code that i do have experience with - HOOMD -
you need at least 10,000 atoms to see a massive speedup. for smaller
systems the GPU acceleration can drop rapidly and in some cases even
turn into a deceleration.
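
to make that concrete, here is a toy python model. every constant in it is
invented for illustration (a fixed per-step offload overhead for kernel
launches and host<->device transfers plus a per-atom work term); nothing
here is measured on any particular card:

# toy model: gpu step time = fixed offload overhead + per-atom work term;
# the cpu-only step has no such overhead. all constants are made up.
def step_time_cpu(n_atoms, cpu_cost_per_atom=6e-6):
    return cpu_cost_per_atom * n_atoms

def step_time_gpu(n_atoms, overhead=5e-3, gpu_cost_per_atom=2e-7):
    return overhead + gpu_cost_per_atom * n_atoms

# speedup collapses for small systems and saturates for large ones
for n in (1000, 6000, 92000, 700000):
    print(n, round(step_time_cpu(n) / step_time_gpu(n), 1))

with these made-up numbers the 1,000-atom case barely breaks even while the
largest one approaches the raw kernel speedup; where the crossover sits
depends entirely on the hardware.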

> Note, removing SMD, PBC or Langevin temperature control doesn't
> greatly affect the calculation speed. Also, I should note that on the cpu
> there are 810 compute objects, whereas on the gpu this number is 140.

see above. i'd expect that NAMD generates larger chunks of work for
the non-bonded interactions to make efficient use of the GPU.
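
as a rough illustration of why coarser chunks help, assume each offloaded
work unit carried a fixed cost of ~10 us (an invented number, not NAMD's
actual scheduling):

# fewer, larger work units mean the fixed per-offload cost (kernel launch,
# argument/data staging) is paid fewer times per step. 10 us is invented.
PER_OFFLOAD_COST = 10e-6   # seconds per work unit, illustrative only

def offload_overhead(n_work_units):
    return n_work_units * PER_OFFLOAD_COST

print(offload_overhead(810))   # 8.1 ms/step with the fine-grained cpu decomposition
print(offload_overhead(140))   # 1.4 ms/step with the coarser gpu decomposition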

cheers,
   axel.

>
> Here is the namd-log file when cuda is used
> ---------------------------------------------------------------------------------------------
> Running command: /usr/local/namd/2.7b1/cuda/intel/bin/namd2 +idlepoll
> dir/protein.namd
>
> Charm++: standalone mode (not using charmrun)
> Charm warning> Randomization of stack pointer is turned on in Kernel,
> run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable
> it. Thread migration may not work!
> Did not find +devices i,j,k,... argument, defaulting to (pe + 1) % deviceCount
> Pe 0 binding to CUDA device 0 on hostname: 'GeForce 9800 GT' Mem:
> 1023MB Rev: 1.1
> Charm++> cpu topology info is being gathered.
> Charm++> 1 unique compute nodes detected.
> Info: NAMD 2.7b1 for Linux-x86_64
> Info:
> Info: Please visit http://www.ks.uiuc.edu/Research/namd/
> Info: and send feedback or bug reports to namd_at_ks.uiuc.edu
> Info:
> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
> Info: in all publications reporting results obtained with NAMD.
> Info:
> Info: Based on Charm++/Converse 60102 for multicore-linux64-ifort-icc
> Info: Built Tue May 5 23:55:12 EDT 2009 by admin on hostname
> Info: 1 NAMD 2.7b1 Linux-x86_64 1 hostname user
> Info: Running on 1 processors.
> Info: Charm++/Converse parallel runtime startup completed at 0.860323 s
> Info: 1.79207 MB of memory in use based on mallinfo
> Info: Changed directory to /dir
> Info: Configuration file is protein.namd
> TCL: Suspending until startup complete.
> Warning: The following variables were set in the
> Warning: configuation file but were not needed
> Warning: langevinTemp
> Warning: langevinDamping
> Warning: langevinHydrogen
> Warning: fixedAtomsFile
> Warning: fixedAtomsCol
> Warning: SMDVel
> Warning: SMDDir
> Warning: SMDk
> Warning: SMDFile
> Warning: SMDOutputFreq
> Info: EXTENDED SYSTEM FILE protein.restart.xsc
> Info: SIMULATION PARAMETERS:
> Info: TIMESTEP 2
> Info: NUMBER OF STEPS 0
> Info: STEPS PER CYCLE 10
> Info: PERIODIC CELL BASIS 1 23.199 0 0
> Info: PERIODIC CELL BASIS 2 0 23.019 0
> Info: PERIODIC CELL BASIS 3 0 0 116.949
> Info: PERIODIC CELL CENTER 0.432363 -0.430208 49.7623
> Info: LOAD BALANCE STRATEGY New Load Balancers -- ASB
> Info: LDB PERIOD 2000 steps
> Info: FIRST LDB TIMESTEP 50
> Info: LAST LDB TIMESTEP -1
> Info: LDB BACKGROUND SCALING 1
> Info: HOM BACKGROUND SCALING 1
> Info: MAX SELF PARTITIONS 1
> Info: MAX PAIR PARTITIONS 1
> Info: SELF PARTITION ATOMS 154
> Info: SELF2 PARTITION ATOMS 154
> Info: PAIR PARTITION ATOMS 318
> Info: PAIR2 PARTITION ATOMS 637
> Info: MIN ATOMS PER PATCH 100
> Info: INITIAL TEMPERATURE 355
> Info: CENTER OF MASS MOVING INITIALLY? NO
> Info: DIELECTRIC 1
> Info: EXCLUDE SCALED ONE-FOUR
> Info: 1-4 SCALE FACTOR 1
> Info: DCD FILENAME protein-cuda.dcd
> Info: DCD FREQUENCY 5000
> Info: DCD FIRST STEP 5000
> Info: DCD FILE WILL CONTAIN UNIT CELL DATA
> Info: XST FILENAME protein-cuda.xst
> Info: XST FREQUENCY 5000
> Info: NO VELOCITY DCD OUTPUT
> Info: OUTPUT FILENAME protein-cuda
> Info: RESTART FILENAME protein-cuda.restart
> Info: RESTART FREQUENCY 5000
> Info: BINARY RESTART FILES WILL BE USED
> Info: SWITCHING ACTIVE
> Info: SWITCHING ON 10
> Info: SWITCHING OFF 12
> Info: PAIRLIST DISTANCE 13.5
> Info: PAIRLIST SHRINK RATE 0.01
> Info: PAIRLIST GROW RATE 0.01
> Info: PAIRLIST TRIGGER 0.3
> Info: PAIRLISTS PER CYCLE 2
> Info: PAIRLISTS ENABLED
> Info: MARGIN 0
> Info: HYDROGEN GROUP CUTOFF 2.5
> Info: PATCH DIMENSION 16
> Info: ENERGY OUTPUT STEPS 5000
> Info: CROSSTERM ENERGY INCLUDED IN DIHEDRAL
> Info: TIMING OUTPUT STEPS 50000
> Info: PRESSURE OUTPUT STEPS 5000
> Info: USING VERLET I (r-RESPA) MTS SCHEME.
> Info: C1 SPLITTING OF LONG RANGE ELECTROSTATICS
> Info: PLACING ATOMS IN PATCHES BY HYDROGEN GROUPS
> Info: RIGID BONDS TO HYDROGEN : ALL
> Info: ERROR TOLERANCE : 1e-08
> Info: MAX ITERATIONS : 100
> Info: RIGID WATER USING SETTLE ALGORITHM
> Info: RANDOM NUMBER SEED 1242108026
> Info: USE HYDROGEN BONDS? NO
> Info: COORDINATE PDB protein.pdb
> Info: STRUCTURE FILE protein.psf
> Info: PARAMETER file: CHARMM format!
> Info: PARAMETERS par_all27_prot_lipid.inp
> Info: USING ARITHMETIC MEAN TO COMBINE L-J SIGMA PARAMETERS
> Info: BINARY COORDINATES protein.restart.coor
> Info: SUMMARY OF PARAMETERS:
> Info: 180 BONDS
> Info: 447 ANGLES
> Info: 566 DIHEDRAL
> Info: 46 IMPROPER
> Info: 6 CROSSTERM
> Info: 119 VDW
> Info: 0 VDW_PAIRS
> Info: TIME FOR READING PSF FILE: 0.036222
> Info: TIME FOR READING PDB FILE: 0.0176351
> Info:
> Info: Reading from binary file protein.restart.coor
> Info: ****************************
> Info: STRUCTURE SUMMARY:
> Info: 5840 ATOMS
> Info: 3999 BONDS
> Info: 2379 ANGLES
> Info: 761 DIHEDRALS
> Info: 62 IMPROPERS
> Info: 28 CROSSTERMS
> Info: 0 EXCLUSIONS
> Info: 5668 RIGID BONDS
> Info: 11849 DEGREES OF FREEDOM
> Info: 2015 HYDROGEN GROUPS
> Info: TOTAL MASS = 35649.7 amu
> Info: TOTAL CHARGE = 1.30385e-07 e
> Info: *****************************
> Info:
> Info: Entering startup at 0.93382 s, 3.4761 MB of memory in use
> Info: Startup phase 0 took 0.0008111 s, 3.47623 MB of memory in use
> Info: Startup phase 1 took 0.00569701 s, 4.43723 MB of memory in use
> Info: Startup phase 2 took 0.0011189 s, 4.48509 MB of memory in use
> Info: PATCH GRID IS 1 (PERIODIC) BY 1 (PERIODIC) BY 7 (PERIODIC)
> Info: PATCH GRID IS 1-AWAY BY 1-AWAY BY 1-AWAY
> Info: REMOVING COM VELOCITY 0.029356 0.0596738 -0.0838868
> Info: LARGEST PATCH (1) HAS 860 ATOMS
> Info: CREATING 139 COMPUTE OBJECTS
> Info: Startup phase 3 took 0.0047071 s, 5.13454 MB of memory in use
> Info: Startup phase 4 took 0.000961065 s, 5.13443 MB of memory in use
> Info: Startup phase 5 took 0.000222921 s, 5.13432 MB of memory in use
> LDB: Measuring processor speeds ... Done.
> Info: Startup phase 6 took 0.000232935 s, 5.13583 MB of memory in use
> Info: CREATING 139 COMPUTE OBJECTS
> Info: useSync: 1 useProxySync: 0
> Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625
> Info: NONBONDED TABLE SIZE: 769 POINTS
> Info: ABSOLUTE IMPRECISION IN FAST TABLE ENERGY: 1.69407e-21 AT 11.9974
> Info: RELATIVE IMPRECISION IN FAST TABLE ENERGY: 1.13046e-16 AT 11.9974
> CUDA force table updated on pe 0
> create ComputeNonbondedCUDA
> Pe 0 found 71 unique exclusion lists needing 172 bytes
> Info: Startup phase 7 took 0.0047431 s, 6.20007 MB of memory in use
> Info: Startup phase 8 took 0.000862837 s, 7.20914 MB of memory in use
> Info: Finished startup at 0.953177 s, 7.20914 MB of memory in use
>
> TCL: Running for 10000 steps
> Pe 0 has 7 local and 0 remote patches and 189 local and 0 remote computes.
> allocating 3 MB of memory on GPU
> CUDA EVENT TIMING: 0 0.247392 0.004416 0.004512 29.207968 0.124352 29.588640
> CUDA TIMING: 34.158945 ms/step on node 0
> ETITLE: TS BOND ANGLE DIHED
> IMPRP ELECT VDW BOUNDARY MISC
> KINETIC TOTAL TEMP POTENTIAL
> TOTAL3 TEMPAVG PRESSURE GPRESSURE
> VOLUME PRESSAVG GPRESSAVG
>
> ENERGY: 0 72.9822 159.2849 47.3271
> 16.7981 -18890.0424 1543.1801 0.0000
> 0.0000 4128.2526 -12922.2173 350.6509
> -17050.4699 -12910.9769 350.6509 -595.5245
> -18197.5199 62452.8455 -595.5245 -18197.5199
>
> OPENING EXTENDED SYSTEM TRAJECTORY FILE
> LDB: ============= START OF LOAD BALANCING ============== 2.82501
> LDB: ============== END OF LOAD BALANCING =============== 2.82514
>
> Info: Initial time: 1 CPUs 0.0344173 s/step 0.199174 days/ns 9.11273 MB memory
> LDB: ============= START OF LOAD BALANCING ============== 4.55477
> LDB: ============== END OF LOAD BALANCING =============== 4.55488
>
> Info: Initial time: 1 CPUs 0.0345514 s/step 0.19995 days/ns 9.11322 MB memory
> LDB: ============= START OF LOAD BALANCING ============== 6.32215
> LDB: ============== END OF LOAD BALANCING =============== 6.32223
>
> Info: Initial time: 1 CPUs 0.0352922 s/step 0.204237 days/ns 9.17262 MB memory
> LDB: ============= START OF LOAD BALANCING ============== 8.0031
> LDB: ============== END OF LOAD BALANCING =============== 8.00317
>
> Info: Benchmark time: 1 CPUs 0.0335934 s/step 0.194406 days/ns 9.17316 MB memory
> Info: Benchmark time: 1 CPUs 0.0337577 s/step 0.195357 days/ns 9.17384 MB memory
> Info: Benchmark time: 1 CPUs 0.0337366 s/step 0.195235 days/ns 9.17459 MB memory
> ...
> ...
> ...
> ====================================================
>
> WallClock: 339.800354 CPUTime: 339.645233 Memory: 9.246353 MB
> Program finished.
>
> ---------------------------------------------------------------------------------------------
>
>
> On Tue, May 12, 2009 at 7:24 AM, Axel Kohlmeyer
> <akohlmey_at_cmm.chem.upenn.edu> wrote:
> > On Tue, 2009-05-12 at 01:32 -0400, Roman Petrenko wrote:
> >> Dear developers,
> >> we compared simulations of intel-compiled namd 2.7b1 builds with the cuda
> >> option disabled and enabled. An NVT simulation of a 30-residue peptide in
> >> a water box with PME and SMD was used. The observed speedup of namd with
> >> the GPU is just 4x. Is it due to incompleteness of the cuda-namd project,
> >> or did we do something wrong?
> >
> > roman,
> >
> > the one thing that you did wrong for certain is that you did not provide
> > any information about what hardware you are running on, what
> > compilers/flags/libs you are using, and most importantly access to your
> > input, so that somebody can validate it. in general, it would be
> > preferable to use one of the example inputs provided on the namd
> > homepage, which people may already have some reference data for.
> >
> > there are CUDA-capable GPUs out there, e.g. the GeForce 8400 GS, that
> > have very little speedup to offer compared to a GeForce GTX 285
> > or a Tesla C1060.
> >
> > cheers,
> > axel.
> >
> > --
> > =======================================================================
> > Axel Kohlmeyer akohlmey_at_cmm.chem.upenn.edu http://www.cmm.upenn.edu
> > Center for Molecular Modeling -- University of Pennsylvania
> > Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
> > tel: 1-215-898-1582, fax: 1-215-573-6233, office-tel: 1-215-898-5425
> > =======================================================================
> > If you make something idiot-proof, the universe creates a better idiot.
> >
> >
>
>

-- 
=======================================================================
Axel Kohlmeyer   akohlmey_at_cmm.chem.upenn.edu   http://www.cmm.upenn.edu
   Center for Molecular Modeling   --   University of Pennsylvania
Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
tel: 1-215-898-1582,  fax: 1-215-573-6233,  office-tel: 1-215-898-5425
=======================================================================
If you make something idiot-proof, the universe creates a better idiot.
