From: Roman Petrenko (rpetrenko_at_gmail.com)
Date: Wed May 13 2009 - 11:58:24 CDT
Axel,
thanks for confirming the 4x gpu speedup for a small system (6k atoms). It
is a solid starting point for us :) I ran a different system with 18k
atoms and got a 5.2x speedup, which shows the right trend.
Now the more important question is why we don't see _any_ speedup for the
classic apoa1 benchmark case, which was downloaded from the namd website
and run without any tinkering with the config file.
>ps.: roman, FYI, i'm getting similar speedups with a GTX 260
>card compared to what you reported, but this is with a very
>unusual input which seems to keep a lot of work on the CPU
>and thus restricts the speedup.
BTW, our card is:
Pe 0 binding to CUDA device 0 on hostname: 'GeForce 9800 GT' Mem:
1023MB Rev: 1.1
On Tue, May 12, 2009 at 12:11 PM, Axel Kohlmeyer
<akohlmey_at_cmm.chem.upenn.edu> wrote:
> On Tue, 2009-05-12 at 11:19 -0400, Roman Petrenko wrote:
>
> roman,
>
> i'm about to compile a version of NAMD for CUDA myself today
> and will thus hopefully have some more detailed comments later.
>
>> Axel, thanks for the reply. The problem is that the standard test (apoa1)
>> runs slower on the gpu than on the cpu. I guess we don't have a powerful
>> enough gpu for this system size (92,000 atoms). On the other hand,
>
> that depends on what settings you actually compare.
>
>> cuda-accelerated ion placement from VMD (when running volmap
>> coulombpotential from the vmd website, 700,000 atoms) resulted in a 30x
>
> this is an ideal kind of problem for a GPU. you cannot
> expect a similar speedup from NAMD. please consider amdahl's law
> and the fact that only parts of the NAMD code can be efficiently
> offloaded to a GPU.
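>
> as a back-of-the-envelope illustration (a python sketch of amdahl's
> law; the offloaded fraction and kernel speedup below are made-up
> numbers, not measurements from NAMD):
>
> # amdahl's law: if a fraction f of the per-step work is offloaded
> # and that part runs s times faster, the overall speedup is
> #   1 / ((1 - f) + f / s)
> def amdahl(f, s):
>     return 1.0 / ((1.0 - f) + f / s)
>
> # even with an infinitely fast GPU, offloading 80% of the work
> # caps the total speedup at 5x:
> print(amdahl(0.80, 1e9))   # ~5.0
> print(amdahl(0.80, 20.0))  # ~4.2, so a ~4x overall speedup is
>                            # plausible even with a fast GPU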
>
>> speedup on our system. Then, the 6,000-atom system we are interested in
>> showed a mere 4x acceleration with namd-gpu. Why?
>
> it is particularly hard to get a good speedup from a GPU for systems
> with small atom counts, as the time to transfer the compute kernel(s)
> and the associated data to the GPU becomes significant relative to
> the time spent computing the non-bonded loops. for the GPU-MD code
> that i do have experience with - HOOMD - you need at least 10,000
> atoms to see a massive speedup. for smaller systems the GPU
> acceleration can drop rapidly and in some cases even turn into a
> deceleration.
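>
> a toy model of this effect (python; the per-atom times and the fixed
> per-step overhead are invented constants for illustration, not NAMD
> measurements):
>
> # per-step time on the CPU scales with the atom count; the GPU is
> # much faster per atom but pays a fixed launch/transfer overhead
> # on every step.
> def cpu_time(n_atoms, t_atom_cpu=6e-6):              # seconds/step
>     return n_atoms * t_atom_cpu
>
> def gpu_time(n_atoms, t_atom_gpu=3e-7, overhead=5e-3):
>     return overhead + n_atoms * t_atom_gpu
>
> for n in (1000, 6000, 10000, 100000):
>     print(n, round(cpu_time(n) / gpu_time(n), 1))
> # 1000 atoms:   ~1.1x  (fixed overhead dominates)
> # 6000 atoms:   ~5.3x
> # 10000 atoms:  ~7.5x
> # 100000 atoms: ~17x   (approaching t_atom_cpu / t_atom_gpu = 20x)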
>
>> Note, removing SMD, PBC, or Langevin temperature control doesn't
>> greatly affect the calculation speed. Also, i should note that on the cpu
>> there are 810 compute objects, whereas on the gpu this number is 140.
>
> see above. i'd expect that NAMD generates larger chunks of work for
> the non-bonded interactions to make efficient use of the GPU.
>
> cheers,
> axel.
>
>>
>> Here is the namd-log file when cuda is used
>> ---------------------------------------------------------------------------------------------
>> Running command: /usr/local/namd/2.7b1/cuda/intel/bin/namd2 +idlepoll
>> dir/protein.namd
>>
>> Charm++: standalone mode (not using charmrun)
>> Charm warning> Randomization of stack pointer is turned on in Kernel,
>> run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable
>> it. Thread migration may not work!
>> Did not find +devices i,j,k,... argument, defaulting to (pe + 1) % deviceCount
>> Pe 0 binding to CUDA device 0 on hostname: 'GeForce 9800 GT' Mem:
>> 1023MB Rev: 1.1
>> Charm++> cpu topology info is being gathered.
>> Charm++> 1 unique compute nodes detected.
>> Info: NAMD 2.7b1 for Linux-x86_64
>> Info:
>> Info: Please visit http://www.ks.uiuc.edu/Research/namd/
>> Info: and send feedback or bug reports to namd_at_ks.uiuc.edu
>> Info:
>> Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
>> Info: in all publications reporting results obtained with NAMD.
>> Info:
>> Info: Based on Charm++/Converse 60102 for multicore-linux64-ifort-icc
>> Info: Built Tue May 5 23:55:12 EDT 2009 by admin on hostname
>> Info: 1 NAMD 2.7b1 Linux-x86_64 1 hostname user
>> Info: Running on 1 processors.
>> Info: Charm++/Converse parallel runtime startup completed at 0.860323 s
>> Info: 1.79207 MB of memory in use based on mallinfo
>> Info: Changed directory to /dir
>> Info: Configuration file is protein.namd
>> TCL: Suspending until startup complete.
>> Warning: The following variables were set in the
>> Warning: configuation file but were not needed
>> Warning: langevinTemp
>> Warning: langevinDamping
>> Warning: langevinHydrogen
>> Warning: fixedAtomsFile
>> Warning: fixedAtomsCol
>> Warning: SMDVel
>> Warning: SMDDir
>> Warning: SMDk
>> Warning: SMDFile
>> Warning: SMDOutputFreq
>> Info: EXTENDED SYSTEM FILE protein.restart.xsc
>> Info: SIMULATION PARAMETERS:
>> Info: TIMESTEP 2
>> Info: NUMBER OF STEPS 0
>> Info: STEPS PER CYCLE 10
>> Info: PERIODIC CELL BASIS 1 23.199 0 0
>> Info: PERIODIC CELL BASIS 2 0 23.019 0
>> Info: PERIODIC CELL BASIS 3 0 0 116.949
>> Info: PERIODIC CELL CENTER 0.432363 -0.430208 49.7623
>> Info: LOAD BALANCE STRATEGY New Load Balancers -- ASB
>> Info: LDB PERIOD 2000 steps
>> Info: FIRST LDB TIMESTEP 50
>> Info: LAST LDB TIMESTEP -1
>> Info: LDB BACKGROUND SCALING 1
>> Info: HOM BACKGROUND SCALING 1
>> Info: MAX SELF PARTITIONS 1
>> Info: MAX PAIR PARTITIONS 1
>> Info: SELF PARTITION ATOMS 154
>> Info: SELF2 PARTITION ATOMS 154
>> Info: PAIR PARTITION ATOMS 318
>> Info: PAIR2 PARTITION ATOMS 637
>> Info: MIN ATOMS PER PATCH 100
>> Info: INITIAL TEMPERATURE 355
>> Info: CENTER OF MASS MOVING INITIALLY? NO
>> Info: DIELECTRIC 1
>> Info: EXCLUDE SCALED ONE-FOUR
>> Info: 1-4 SCALE FACTOR 1
>> Info: DCD FILENAME protein-cuda.dcd
>> Info: DCD FREQUENCY 5000
>> Info: DCD FIRST STEP 5000
>> Info: DCD FILE WILL CONTAIN UNIT CELL DATA
>> Info: XST FILENAME protein-cuda.xst
>> Info: XST FREQUENCY 5000
>> Info: NO VELOCITY DCD OUTPUT
>> Info: OUTPUT FILENAME protein-cuda
>> Info: RESTART FILENAME protein-cuda.restart
>> Info: RESTART FREQUENCY 5000
>> Info: BINARY RESTART FILES WILL BE USED
>> Info: SWITCHING ACTIVE
>> Info: SWITCHING ON 10
>> Info: SWITCHING OFF 12
>> Info: PAIRLIST DISTANCE 13.5
>> Info: PAIRLIST SHRINK RATE 0.01
>> Info: PAIRLIST GROW RATE 0.01
>> Info: PAIRLIST TRIGGER 0.3
>> Info: PAIRLISTS PER CYCLE 2
>> Info: PAIRLISTS ENABLED
>> Info: MARGIN 0
>> Info: HYDROGEN GROUP CUTOFF 2.5
>> Info: PATCH DIMENSION 16
>> Info: ENERGY OUTPUT STEPS 5000
>> Info: CROSSTERM ENERGY INCLUDED IN DIHEDRAL
>> Info: TIMING OUTPUT STEPS 50000
>> Info: PRESSURE OUTPUT STEPS 5000
>> Info: USING VERLET I (r-RESPA) MTS SCHEME.
>> Info: C1 SPLITTING OF LONG RANGE ELECTROSTATICS
>> Info: PLACING ATOMS IN PATCHES BY HYDROGEN GROUPS
>> Info: RIGID BONDS TO HYDROGEN : ALL
>> Info: ERROR TOLERANCE : 1e-08
>> Info: MAX ITERATIONS : 100
>> Info: RIGID WATER USING SETTLE ALGORITHM
>> Info: RANDOM NUMBER SEED 1242108026
>> Info: USE HYDROGEN BONDS? NO
>> Info: COORDINATE PDB protein.pdb
>> Info: STRUCTURE FILE protein.psf
>> Info: PARAMETER file: CHARMM format!
>> Info: PARAMETERS par_all27_prot_lipid.inp
>> Info: USING ARITHMETIC MEAN TO COMBINE L-J SIGMA PARAMETERS
>> Info: BINARY COORDINATES protein.restart.coor
>> Info: SUMMARY OF PARAMETERS:
>> Info: 180 BONDS
>> Info: 447 ANGLES
>> Info: 566 DIHEDRAL
>> Info: 46 IMPROPER
>> Info: 6 CROSSTERM
>> Info: 119 VDW
>> Info: 0 VDW_PAIRS
>> Info: TIME FOR READING PSF FILE: 0.036222
>> Info: TIME FOR READING PDB FILE: 0.0176351
>> Info:
>> Info: Reading from binary file protein.restart.coor
>> Info: ****************************
>> Info: STRUCTURE SUMMARY:
>> Info: 5840 ATOMS
>> Info: 3999 BONDS
>> Info: 2379 ANGLES
>> Info: 761 DIHEDRALS
>> Info: 62 IMPROPERS
>> Info: 28 CROSSTERMS
>> Info: 0 EXCLUSIONS
>> Info: 5668 RIGID BONDS
>> Info: 11849 DEGREES OF FREEDOM
>> Info: 2015 HYDROGEN GROUPS
>> Info: TOTAL MASS = 35649.7 amu
>> Info: TOTAL CHARGE = 1.30385e-07 e
>> Info: *****************************
>> Info:
>> Info: Entering startup at 0.93382 s, 3.4761 MB of memory in use
>> Info: Startup phase 0 took 0.0008111 s, 3.47623 MB of memory in use
>> Info: Startup phase 1 took 0.00569701 s, 4.43723 MB of memory in use
>> Info: Startup phase 2 took 0.0011189 s, 4.48509 MB of memory in use
>> Info: PATCH GRID IS 1 (PERIODIC) BY 1 (PERIODIC) BY 7 (PERIODIC)
>> Info: PATCH GRID IS 1-AWAY BY 1-AWAY BY 1-AWAY
>> Info: REMOVING COM VELOCITY 0.029356 0.0596738 -0.0838868
>> Info: LARGEST PATCH (1) HAS 860 ATOMS
>> Info: CREATING 139 COMPUTE OBJECTS
>> Info: Startup phase 3 took 0.0047071 s, 5.13454 MB of memory in use
>> Info: Startup phase 4 took 0.000961065 s, 5.13443 MB of memory in use
>> Info: Startup phase 5 took 0.000222921 s, 5.13432 MB of memory in use
>> LDB: Measuring processor speeds ... Done.
>> Info: Startup phase 6 took 0.000232935 s, 5.13583 MB of memory in use
>> Info: CREATING 139 COMPUTE OBJECTS
>> Info: useSync: 1 useProxySync: 0
>> Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625
>> Info: NONBONDED TABLE SIZE: 769 POINTS
>> Info: ABSOLUTE IMPRECISION IN FAST TABLE ENERGY: 1.69407e-21 AT 11.9974
>> Info: RELATIVE IMPRECISION IN FAST TABLE ENERGY: 1.13046e-16 AT 11.9974
>> CUDA force table updated on pe 0
>> create ComputeNonbondedCUDA
>> Pe 0 found 71 unique exclusion lists needing 172 bytes
>> Info: Startup phase 7 took 0.0047431 s, 6.20007 MB of memory in use
>> Info: Startup phase 8 took 0.000862837 s, 7.20914 MB of memory in use
>> Info: Finished startup at 0.953177 s, 7.20914 MB of memory in use
>>
>> TCL: Running for 10000 steps
>> Pe 0 has 7 local and 0 remote patches and 189 local and 0 remote computes.
>> allocating 3 MB of memory on GPU
>> CUDA EVENT TIMING: 0 0.247392 0.004416 0.004512 29.207968 0.124352 29.588640
>> CUDA TIMING: 34.158945 ms/step on node 0
>> ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC TOTAL TEMP POTENTIAL TOTAL3 TEMPAVG PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG
>>
>> ENERGY: 0 72.9822 159.2849 47.3271 16.7981 -18890.0424 1543.1801 0.0000 0.0000 4128.2526 -12922.2173 350.6509 -17050.4699 -12910.9769 350.6509 -595.5245 -18197.5199 62452.8455 -595.5245 -18197.5199
>>
>> OPENING EXTENDED SYSTEM TRAJECTORY FILE
>> LDB: ============= START OF LOAD BALANCING ============== 2.82501
>> LDB: ============== END OF LOAD BALANCING =============== 2.82514
>>
>> Info: Initial time: 1 CPUs 0.0344173 s/step 0.199174 days/ns 9.11273 MB memory
>> LDB: ============= START OF LOAD BALANCING ============== 4.55477
>> LDB: ============== END OF LOAD BALANCING =============== 4.55488
>>
>> Info: Initial time: 1 CPUs 0.0345514 s/step 0.19995 days/ns 9.11322 MB memory
>> LDB: ============= START OF LOAD BALANCING ============== 6.32215
>> LDB: ============== END OF LOAD BALANCING =============== 6.32223
>>
>> Info: Initial time: 1 CPUs 0.0352922 s/step 0.204237 days/ns 9.17262 MB memory
>> LDB: ============= START OF LOAD BALANCING ============== 8.0031
>> LDB: ============== END OF LOAD BALANCING =============== 8.00317
>>
>> Info: Benchmark time: 1 CPUs 0.0335934 s/step 0.194406 days/ns 9.17316 MB memory
>> Info: Benchmark time: 1 CPUs 0.0337577 s/step 0.195357 days/ns 9.17384 MB memory
>> Info: Benchmark time: 1 CPUs 0.0337366 s/step 0.195235 days/ns 9.17459 MB memory
>> ...
>> ...
>> ...
>> ====================================================
>>
>> WallClock: 339.800354 CPUTime: 339.645233 Memory: 9.246353 MB
>> Program finished.
>>
>> ---------------------------------------------------------------------------------------------
>>
>>
>> On Tue, May 12, 2009 at 7:24 AM, Axel Kohlmeyer
>> <akohlmey_at_cmm.chem.upenn.edu> wrote:
>> > On Tue, 2009-05-12 at 01:32 -0400, Roman Petrenko wrote:
>> >> Dear developers,
>> >> we compared simulations with intel-compiled namd 2.7b1 binaries, with the
>> >> cuda option disabled and enabled. An NVT simulation of a 30-residue
>> >> peptide in a water box with PME and SMD was used. The observed speedup of
>> >> namd with the GPU is just 4x. Is this due to the incompleteness of the
>> >> cuda-namd project, or did we do something wrong?
>> >
>> > roman,
>> >
>> > the one thing you did wrong for certain is that you did not provide any
>> > information about what hardware you are running on, what
>> > compilers/flags/libs you are using, and most importantly, access to your
>> > input so that somebody can validate it. in general, it would be
>> > preferable to use one of the example inputs provided on the namd
>> > homepage, which people may already have reference data for.
>> >
>> > there are CUDA-capable GPUs out there, e.g. the GeForce 8400 GS, that
>> > have very little speedup to offer compared to a GeForce GTX 285
>> > or a Tesla C1060.
>> >
>> > cheers,
>> > axel.
>> >
>> > --
>> > =======================================================================
>> > Axel Kohlmeyer akohlmey_at_cmm.chem.upenn.edu http://www.cmm.upenn.edu
>> > Center for Molecular Modeling -- University of Pennsylvania
>> > Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
>> > tel: 1-215-898-1582, fax: 1-215-573-6233, office-tel: 1-215-898-5425
>> > =======================================================================
>> > If you make something idiot-proof, the universe creates a better idiot.
>> >
>> >
>>
>>
>
>
> --
> =======================================================================
> Axel Kohlmeyer akohlmey_at_cmm.chem.upenn.edu http://www.cmm.upenn.edu
> Center for Molecular Modeling -- University of Pennsylvania
> Department of Chemistry, 231 S.34th Street, Philadelphia, PA 19104-6323
> tel: 1-215-898-1582, fax: 1-215-573-6233, office-tel: 1-215-898-5425
> =======================================================================
> If you make something idiot-proof, the universe creates a better idiot.
>
>
--
Roman Petrenko
Physics Department
University of Cincinnati