From: Dr. Eddie (eackad_at_gmail.com)
Date: Wed Oct 11 2017 - 13:04:29 CDT
Hi all,
I have a NAMD run on a quad-GPU, dual Intel CPU E5-1620 v4 @ 3.50GHz
14-core system. My system is about 150k atoms and I am getting about
4 ns/day, which seems slow. I've compared +p14 and +p28 (+p14 is almost
2x faster), but now I am at a loss. Is there anything else I can try to
get more out of the node?
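Rather than comparing only two thread counts, the same input can be benchmarked over a range of +p values. The minimal sketch below only builds the command lines. The +p, +idlepoll and +devices flags all appear in the output below, and the binary path is copied from the "Running command:" line. The particular +p values and the explicit device list 0,1,2,3 are illustrative assumptions. ++local is left out because the run warns that it was not parsed by the RTS.

```python
from typing import List, Sequence

# Binary path copied from the "Running command:" line in the log.
NAMD2 = "/home/eackad/binaries/NAMD_2.11_Linux-x86_64-multicore-CUDA/namd2"


def namd_command(config: str, pes: int, devices: Sequence[int]) -> List[str]:
    """Build one multicore-CUDA NAMD invocation as an argv list.

    +p sets the number of worker threads; +devices names the CUDA devices
    explicitly instead of NAMD falling back to "using all".
    """
    return [
        NAMD2,
        f"+p{pes}",
        "+idlepoll",
        "+devices", ",".join(str(d) for d in devices),
        config,
    ]


def scan_commands(config: str, pe_counts: Sequence[int],
                  devices: Sequence[int]) -> List[List[str]]:
    """One command per +p value, all using the same GPUs and input file."""
    return [namd_command(config, p, devices) for p in pe_counts]


if __name__ == "__main__":
    # Illustrative +p values (assumption); all four GPUs listed explicitly.
    for cmd in scan_commands("step5_production.inp", [4, 8, 12, 14], [0, 1, 2, 3]):
        print(" ".join(cmd))
```

Each printed line can be run in turn and its "Benchmark time" lines compared.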
Here is the output (I'm not sure what else would be helpful):
Running command:
/home/eackad/binaries/NAMD_2.11_Linux-x86_64-multicore-CUDA/namd2 +idlepoll step5_production.inp ++local +p14
Charm++: standalone mode (not using charmrun)
Charm++> Running in Multicore mode: 14 threads
Charm++> Using recursive bisection (scheme 3) for topology aware partitions
Converse/Charm++ Commit ID: v6.7.0-0-g46f867c-namd-charm-6.7.0-build-2015-Dec-21-45876
Warning> Randomization of stack pointer is turned on in kernel, thread migration may not work! Run 'echo 0 > /proc/sys/kernel/randomize_va_space' as root to disable it, or try run with '+isomalloc_sync'.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (56-way SMP).
Charm++> cpu topology info is gathered in 0.001 seconds.
Info: Built with CUDA version 6000
Did not find +devices i,j,k,... argument, using all
Pe 3 physical rank 3 will use CUDA device of pe 2
Pe 1 physical rank 1 will use CUDA device of pe 2
Pe 7 physical rank 7 will use CUDA device of pe 8
Pe 5 physical rank 5 will use CUDA device of pe 4
Pe 11 physical rank 11 will use CUDA device of pe 12
Pe 10 physical rank 10 will use CUDA device of pe 8
Pe 2 physical rank 2 binding to CUDA device 0 on node1.cl.siue.edu: 'GeForce GTX 1080' Mem: 8113MB Rev: 6.1
Pe 6 physical rank 6 will use CUDA device of pe 4
Pe 0 physical rank 0 will use CUDA device of pe 2
WARNING: ++local is a command line argument beginning with a '+' but was not parsed by the RTS.
If any of the above arguments were intended for the RTS you may need to recompile Charm++ with different options.
Pe 13 physical rank 13 will use CUDA device of pe 12
Pe 9 physical rank 9 will use CUDA device of pe 8
Pe 12 physical rank 12 binding to CUDA device 3 on node1.cl.siue.edu: 'GeForce GTX 1080' Mem: 8113MB Rev: 6.1
Pe 8 physical rank 8 binding to CUDA device 2 on node1.cl.siue.edu: 'GeForce GTX 1080' Mem: 8113MB Rev: 6.1
Pe 4 physical rank 4 binding to CUDA device 1 on node1.cl.siue.edu: 'GeForce GTX 1080' Mem: 8113MB Rev: 6.1
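The device lines above can be tallied to see how the 14 PEs were spread over the four GTX 1080s. The small sketch below maps every PE either to the device it binds directly or to the device of the PE it shares with, then counts PEs per device. For this run the lines show PEs 0 to 3 and 7 to 10 on devices 0 and 2 (four each) and PEs 4 to 6 and 11 to 13 on devices 1 and 3 (three each).

```python
import re
from collections import Counter
from typing import Dict

# "Pe 2 physical rank 2 binding to CUDA device 0 on ..."
BIND_RE = re.compile(r"Pe (\d+) physical rank \d+ binding to CUDA device (\d+)")
# "Pe 3 physical rank 3 will use CUDA device of pe 2"
SHARE_RE = re.compile(r"Pe (\d+) physical rank \d+ will use CUDA device of pe (\d+)")


def pe_to_device(log_text: str) -> Dict[int, int]:
    """Map each PE to the CUDA device it ends up using."""
    # PEs that bind a device directly ("owners")
    owners = {int(pe): int(dev) for pe, dev in BIND_RE.findall(log_text)}
    mapping = dict(owners)
    # PEs that reuse the device of an owner PE
    for pe, owner in SHARE_RE.findall(log_text):
        mapping[int(pe)] = owners[int(owner)]
    return mapping


def pes_per_device(log_text: str) -> Counter:
    """Count how many PEs feed each CUDA device."""
    return Counter(pe_to_device(log_text).values())
```

Run against the full log (for example `pes_per_device(open("namd.log").read())`, where the filename is hypothetical), it should count four PEs on devices 0 and 2 and three on devices 1 and 3.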
Info: NAMD 2.11 for Linux-x86_64-multicore-CUDA
Info:
Info: Please visit http://www.ks.uiuc.edu/Research/namd/
Info: for updates, documentation, and support information.
Info:
Info: Please cite Phillips et al., J. Comp. Chem. 26:1781-1802 (2005)
Info: in all publications reporting results obtained with NAMD.
Info:
Info: Based on Charm++/Converse 60700 for multicore-linux64-iccstatic
Info: Built Mon Dec 21 10:47:12 CST 2015 by jim on despina.ks.uiuc.edu
Info: 1 NAMD 2.11 Linux-x86_64-multicore-CUDA 14 node1.cl.siue.edu eackad
Info: Running on 14 processors, 1 nodes, 1 physical nodes.
Info: CPU topology information available.
Info: Charm++/Converse parallel runtime startup completed at 0.172279 s
CkLoopLib is used in SMP with a simple dynamic scheduling (converse-level notification) but not using node-level queue
Info: 13.5703 MB of memory in use based on /proc/self/stat
Info: Configuration file is step5_production.inp
Info: Working in the current directory /raid6/eddie/namd/output/p450/p450_diltiazem
TCL: Suspending until startup complete.
Info: EXTENDED SYSTEM FILE step4.6_equilibrate0.restart.xsc
Warning: ALWAYS USE NON-ZERO MARGIN WITH CONSTANT PRESSURE!
Warning: CHANGING MARGIN FROM 0 to 0.555
Warning: The parameter fullElectFrequency now defaults to nonbondedFreq (1) rather than stepsPerCycle.
Info: SIMULATION PARAMETERS:
Info: TIMESTEP 2
Info: NUMBER OF STEPS 0
Info: STEPS PER CYCLE 20
Info: PERIODIC CELL BASIS 1 101.628 0 0
Info: PERIODIC CELL BASIS 2 0 98.7346 0
Info: PERIODIC CELL BASIS 3 0 0 129.107
Info: PERIODIC CELL CENTER 0.548182 0.24011 17.7212
Info: WRAPPING WATERS AROUND PERIODIC BOUNDARIES ON OUTPUT.
Info: WRAPPING ALL CLUSTERS AROUND PERIODIC BOUNDARIES ON OUTPUT.
Info: LOAD BALANCER Centralized
Info: LOAD BALANCING STRATEGY New Load Balancers -- DEFAULT
Info: LDB PERIOD 4000 steps
Info: FIRST LDB TIMESTEP 100
Info: LAST LDB TIMESTEP -1
Info: LDB BACKGROUND SCALING 1
Info: HOM BACKGROUND SCALING 1
Info: PME BACKGROUND SCALING 1
Info: MIN ATOMS PER PATCH 40
Info: VELOCITY FILE step4.6_equilibrate0.restart.vel
Info: CENTER OF MASS MOVING INITIALLY? NO
Info: DIELECTRIC 1
Info: EXCLUDE SCALED ONE-FOUR
Info: 1-4 ELECTROSTATICS SCALED BY 1
Info: MODIFIED 1-4 VDW PARAMETERS WILL BE USED
Info: DCD FILENAME step5_production.dcd
Info: DCD FREQUENCY 10000
Info: DCD FIRST STEP 10000
Info: DCD FILE WILL CONTAIN UNIT CELL DATA
Info: XST FILENAME step5_production.xst
Info: XST FREQUENCY 10000
Info: NO VELOCITY DCD OUTPUT
Info: NO FORCE DCD OUTPUT
Info: OUTPUT FILENAME step5_production
Info: BINARY OUTPUT FILES WILL BE USED
Info: RESTART FILENAME step5_production.restart
Info: RESTART FREQUENCY 10000
Info: BINARY RESTART FILES WILL BE USED
Info: SWITCHING ACTIVE
Info: SWITCHING ON 10
Info: SWITCHING OFF 12
Info: PAIRLIST DISTANCE 16
Info: PAIRLIST SHRINK RATE 0.01
Info: PAIRLIST GROW RATE 0.01
Info: PAIRLIST TRIGGER 0.3
Info: PAIRLISTS PER CYCLE 2
Info: PAIRLISTS ENABLED
Info: MARGIN 0.555
Info: HYDROGEN GROUP CUTOFF 2.5
Info: PATCH DIMENSION 19.055
Info: ENERGY OUTPUT STEPS 10000
Info: CROSSTERM ENERGY INCLUDED IN DIHEDRAL
Info: TIMING OUTPUT STEPS 10000
Info: LANGEVIN DYNAMICS ACTIVE
Info: LANGEVIN TEMPERATURE 310
Info: LANGEVIN USING BBK INTEGRATOR
Info: LANGEVIN DAMPING COEFFICIENT IS 1 INVERSE PS
Info: LANGEVIN DYNAMICS APPLIED TO HYDROGENS
Warning: Option useGroupPressure is being enabled due to pressure control with rigidBonds.
Info: LANGEVIN PISTON PRESSURE CONTROL ACTIVE
Info: TARGET PRESSURE IS 1.01325 BAR
Info: OSCILLATION PERIOD IS 200 FS
Info: DECAY TIME IS 100 FS
Info: PISTON TEMPERATURE IS 310 K
Info: PRESSURE CONTROL IS GROUP-BASED
Info: INITIAL STRAIN RATE IS -9.24931e-06 -9.24931e-06 -9.24931e-06
Info: CELL FLUCTUATION IS ISOTROPIC
Info: PARTICLE MESH EWALD (PME) ACTIVE
Info: PME TOLERANCE 1e-06
Info: PME EWALD COEFFICIENT 0.257952
Info: PME INTERPOLATION ORDER 6
Info: PME GRID DIMENSIONS 128 128 128
Info: PME MAXIMUM GRID SPACING 1.5
Info: Attempting to read FFTW data from FFTW_NAMD_2.11_Linux-x86_64-multicore-CUDA.txt
Info: Optimizing 6 FFT steps. 1... 2... 3... 4... 5... 6... Done.
Info: Writing FFTW data to FFTW_NAMD_2.11_Linux-x86_64-multicore-CUDA.txt
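As a quick cross-check of the PME lines above, dividing each periodic cell length by the matching grid dimension gives the grid spacing actually in use. This minimal sketch uses the cell basis diagonal (101.628, 98.7346, 129.107 A) and the 128 × 128 × 128 grid from this log. It only reports the spacings and compares them with the 1.5 A maximum; it says nothing about how NAMD chose the grid size.

```python
from typing import Sequence, Tuple


def pme_spacing(cell: Sequence[float], grid: Sequence[int]) -> Tuple[float, ...]:
    """Grid spacing (Angstrom) along each axis of an orthorhombic cell."""
    return tuple(length / n for length, n in zip(cell, grid))


if __name__ == "__main__":
    cell = (101.628, 98.7346, 129.107)  # PERIODIC CELL BASIS diagonal
    grid = (128, 128, 128)              # PME GRID DIMENSIONS
    max_spacing = 1.5                   # PME MAXIMUM GRID SPACING

    for axis, h in zip("xyz", pme_spacing(cell, grid)):
        print(f"{axis}: {h:.3f} A (limit {max_spacing} A, ok={h <= max_spacing})")
```

For this cell the spacings come out to about 0.794, 0.771 and 1.009 A, all well under the 1.5 A limit.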
Info: FULL ELECTROSTATIC EVALUATION FREQUENCY 1
Info: USING VERLET I (r-RESPA) MTS SCHEME.
Info: C1 SPLITTING OF LONG RANGE ELECTROSTATICS
Info: PLACING ATOMS IN PATCHES BY HYDROGEN GROUPS
Info: RIGID BONDS TO HYDROGEN : WATER
Info: ERROR TOLERANCE : 1e-05
Info: MAX ITERATIONS : 100
Info: RIGID WATER USING SETTLE ALGORITHM
Info: RANDOM NUMBER SEED 1503150210
Info: USE HYDROGEN BONDS? NO
Info: COORDINATE PDB p450_diltiazem_ionized.pdb
Info: STRUCTURE FILE p450_diltiazem_ionized.psf
Info: PARAMETER file: CHARMM format!
Info: PARAMETERS toppar/par_all36_prot.prm
Info: PARAMETERS toppar/par_all36_na.prm
Info: PARAMETERS toppar/par_all36_carb.prm
Info: PARAMETERS toppar/par_all36_lipid.prm
Info: PARAMETERS toppar/par_all36_cgenff.prm
Info: PARAMETERS toppar/toppar_all36_prot_retinol.str
Info: PARAMETERS toppar/toppar_all36_na_rna_modified.str
Info: PARAMETERS toppar/toppar_all36_carb_glycopeptide.str
Info: PARAMETERS toppar/toppar_all36_prot_fluoro_alkanes.str
Info: PARAMETERS toppar/toppar_all36_prot_na_combined.str
Info: PARAMETERS toppar/par_hem_charmmgui.prm
Info: PARAMETERS toppar/LIG.prm
Info: PARAMETERS ../../../input/toppar/toppar_water_ions.str
Info: USING ARITHMETIC MEAN TO COMBINE L-J SIGMA PARAMETERS
Info: BINARY COORDINATES step4.6_equilibrate0.restart.coor
Info: SKIPPING rtf SECTION IN STREAM FILE
Info: SKIPPING rtf SECTION IN STREAM FILE
Info: SKIPPING rtf SECTION IN STREAM FILE
Info: SKIPPING rtf SECTION IN STREAM FILE
Info: SKIPPING rtf SECTION IN STREAM FILE
Warning: DUPLICATE BOND ENTRY FOR CG321-NG2S0
PREVIOUS VALUES k=340 x0=1.451
USING VALUES k=315 x0=1.434
Warning: DUPLICATE ANGLE ENTRY FOR HGA2-CG321-NG2S0
PREVIOUS VALUES k=54 theta0=109.5 k_ub=0 r_ub=0
USING VALUES k=51.5 theta0=109.5 k_ub=0 r_ub=0
Info: SKIPPING rtf SECTION IN STREAM FILE
Info: SUMMARY OF PARAMETERS:
Info: 1053 BONDS
Info: 3254 ANGLES
Info: 8400 DIHEDRAL
Info: 242 IMPROPER
Info: 6 CROSSTERM
Info: 376 VDW
Info: 6 VDW_PAIRS
Info: 0 NBTHOLE_PAIRS
Info: TIME FOR READING PSF FILE: 17.6009
Info: TIME FOR READING PDB FILE: 0.278248
Info:
Info: ****************************
Info: STRUCTURE SUMMARY:
Info: 133527 ATOMS
Info: 104867 BONDS
Info: 120314 ANGLES
Info: 129132 DIHEDRALS
Info: 1952 IMPROPERS
Info: 497 CROSSTERMS
Info: 0 EXCLUSIONS
Info: 84915 RIGID BONDS
Info: 315666 DEGREES OF FREEDOM
Info: 48151 HYDROGEN GROUPS
Info: 4 ATOMS IN LARGEST HYDROGEN GROUP
Info: 48151 MIGRATION GROUPS
Info: 4 ATOMS IN LARGEST MIGRATION GROUP
Info: TOTAL MASS = 800762 amu
Info: TOTAL CHARGE = 2.20051e-05 e
Info: MASS DENSITY = 1.02643 g/cm^3
Info: ATOM DENSITY = 0.103071 atoms/A^3
Info: *****************************
Info: Reading from binary file step4.6_equilibrate0.restart.coor
Info:
Info: Entering startup at 18.7278 s, 74.1602 MB of memory in use
Info: Startup phase 0 took 0.000119925 s, 74.1758 MB of memory in use
Info: ADDED 353884 IMPLICIT EXCLUSIONS
Info: Startup phase 1 took 0.0726631 s, 106.734 MB of memory in use
Info: NONBONDED TABLE R-SQUARED SPACING: 0.0625
Info: NONBONDED TABLE SIZE: 769 POINTS
Info: INCONSISTENCY IN FAST TABLE ENERGY VS FORCE: 0.000325096 AT 11.9556
Info: INCONSISTENCY IN SCOR TABLE ENERGY VS FORCE: 0.000324844 AT 11.9556
Info: ABSOLUTE IMPRECISION IN VDWA TABLE ENERGY: 4.59334e-32 AT 11.9974
Info: RELATIVE IMPRECISION IN VDWA TABLE ENERGY: 7.4108e-17 AT 11.9974
Info: INCONSISTENCY IN VDWA TABLE ENERGY VS FORCE: 0.0040507 AT 0.251946
Info: ABSOLUTE IMPRECISION IN VDWB TABLE ENERGY: 1.53481e-26 AT 11.9974
Info: RELATIVE IMPRECISION IN VDWB TABLE ENERGY: 7.96691e-18 AT 11.9974
Info: INCONSISTENCY IN VDWB TABLE ENERGY VS FORCE: 0.00150189 AT 0.251946
Info: Updated CUDA LJ table with 376 x 376 elements.
Info: Updated CUDA force table with 4096 elements.
Info: Updated CUDA LJ table with 376 x 376 elements.
Info: Updated CUDA force table with 4096 elements.
Info: Updated CUDA LJ table with 376 x 376 elements.
Info: Updated CUDA force table with 4096 elements.
Info: Updated CUDA LJ table with 376 x 376 elements.
Info: Updated CUDA force table with 4096 elements.
Info: Startup phase 2 took 1.79073 s, 470.668 MB of memory in use
Info: Startup phase 3 took 0.000116825 s, 470.707 MB of memory in use
Info: Startup phase 4 took 0.00123 s, 478.883 MB of memory in use
Info: Startup phase 5 took 8.51154e-05 s, 478.906 MB of memory in use
Info: PATCH GRID IS 5 (PERIODIC) BY 5 (PERIODIC) BY 6 (PERIODIC)
Info: PATCH GRID IS 1-AWAY BY 1-AWAY BY 1-AWAY
Info: Reading from binary file step4.6_equilibrate0.restart.vel
Info: REMOVING COM VELOCITY 0.00499493 0.0101335 -0.00197174
Info: LARGEST PATCH (53) HAS 1002 ATOMS
Info: TORUS A SIZE 14 USING 0
Info: TORUS B SIZE 1 USING 0
Info: TORUS C SIZE 1 USING 0
Info: TORUS MINIMAL MESH SIZE IS 1 BY 1 BY 1
Info: Placed 100% of base nodes on same physical node as patch
Info: Startup phase 6 took 0.0593419 s, 503.312 MB of memory in use
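The patch numbers in this block line up with parameters printed earlier: PATCH DIMENSION 19.055 equals PAIRLIST DISTANCE 16 + HYDROGEN GROUP CUTOFF 2.5 + MARGIN 0.555, and flooring each cell length divided by 19.055 gives the 5 × 5 × 6 patch grid (150 patches) reported here. The sketch below simply redoes that arithmetic. Treating the patch dimension as this plain sum is an assumption that is checked only against these numbers, not taken from NAMD's source.

```python
import math
from typing import Sequence, Tuple


def patch_dimension(pairlist_dist: float, hgroup_cutoff: float, margin: float) -> float:
    # Assumed relation: reproduces PATCH DIMENSION 19.055 in this log,
    # not copied from NAMD's source code.
    return pairlist_dist + hgroup_cutoff + margin


def patch_grid(cell: Sequence[float], dim: float) -> Tuple[int, ...]:
    """Whole patches that fit along each cell axis (floor of length / dim)."""
    return tuple(math.floor(length / dim) for length in cell)


if __name__ == "__main__":
    dim = patch_dimension(16.0, 2.5, 0.555)
    grid = patch_grid((101.628, 98.7346, 129.107), dim)
    print(round(dim, 3), grid, math.prod(grid))
```

With the values from this run it prints a patch dimension of 19.055, the grid (5, 5, 6) and 150 patches.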
Info: PME using 13 and 13 processors for FFT and reciprocal sum.
Info: PME GRID LOCATIONS: 1 2 3 4 5 6 7 8 9 10 ...
Info: PME TRANS LOCATIONS: 1 2 3 4 5 6 7 8 9 10 ...
Info: PME USING 1 GRID NODES AND 1 TRANS NODES
Info: Startup phase 7 took 0.00213218 s, 510.254 MB of memory in use
Info: Startup phase 8 took 0.000555992 s, 510.254 MB of memory in use
LDB: Central LB being created...
Info: Startup phase 9 took 0.000712872 s, 510.266 MB of memory in use
Info: CREATING 3248 COMPUTE OBJECTS
CUDA device 1 stream priority range 0 -1
CUDA device 2 stream priority range 0 -1
CUDA device 3 stream priority range 0 -1
CUDA device 0 stream priority range 0 -1
Info: Found 495 unique exclusion lists needing 2212 bytes
Info: Found 495 unique exclusion lists needing 2212 bytes
Pe 8 hosts 0 local and 11 remote patches for pe 8
Pe 9 hosts 0 local and 10 remote patches for pe 8
Pe 5 hosts 0 local and 5 remote patches for pe 8
Pe 13 hosts 0 local and 11 remote patches for pe 8
Pe 10 hosts 0 local and 11 remote patches for pe 8
Pe 6 hosts 0 local and 2 remote patches for pe 8
Pe 11 hosts 0 local and 11 remote patches for pe 8
Pe 7 hosts 0 local and 11 remote patches for pe 8
Pe 6 hosts 0 local and 2 remote patches for pe 2
Pe 2 hosts 0 local and 10 remote patches for pe 2
Pe 7 hosts 0 local and 9 remote patches for pe 2
Pe 0 hosts 0 local and 11 remote patches for pe 2
Pe 5 hosts 0 local and 11 remote patches for pe 2
Pe 8 hosts 0 local and 4 remote patches for pe 2
Pe 1 hosts 0 local and 11 remote patches for pe 2
Pe 3 hosts 0 local and 11 remote patches for pe 2
Info: Found 495 unique exclusion lists needing 2212 bytes
Pe 8 hosts 0 local and 1 remote patches for pe 4
Pe 11 hosts 0 local and 2 remote patches for pe 4
Pe 4 hosts 0 local and 11 remote patches for pe 4
Pe 5 hosts 0 local and 11 remote patches for pe 4
Pe 3 hosts 0 local and 7 remote patches for pe 4
Pe 7 hosts 0 local and 9 remote patches for pe 4
Pe 10 hosts 0 local and 11 remote patches for pe 4
Pe 6 hosts 0 local and 10 remote patches for pe 4
Pe 9 hosts 0 local and 8 remote patches for pe 4
Pe 4 hosts 0 local and 11 remote patches for pe 2
Info: Found 495 unique exclusion lists needing 2212 bytes
Pe 2 hosts 0 local and 8 remote patches for pe 12
Pe 12 hosts 0 local and 10 remote patches for pe 12
Pe 13 hosts 0 local and 11 remote patches for pe 12
Pe 0 hosts 0 local and 11 remote patches for pe 12
Pe 10 hosts 0 local and 4 remote patches for pe 12
Pe 1 hosts 0 local and 11 remote patches for pe 12
Pe 11 hosts 0 local and 11 remote patches for pe 12
Pe 12 hosts 0 local and 10 remote patches for pe 8
Info: useSync: 0 useProxySync: 0
Info: Startup phase 10 took 0.085218 s, 520.344 MB of memory in use
Info: Startup phase 11 took 9.60827e-05 s, 520.344 MB of memory in use
Info: Startup phase 12 took 0.000111103 s, 520.344 MB of memory in use
Info: Finished startup at 20.7409 s, 520.348 MB of memory in use
TCL: Running for 100000000 steps
Pe 2 has 0 local and 80 remote patches and 0 local and 602 remote computes.
Pe 4 has 0 local and 70 remote patches and 0 local and 448 remote computes.
Pe 8 has 0 local and 82 remote patches and 0 local and 602 remote computes.
Pe 12 has 0 local and 66 remote patches and 0 local and 448 remote computes.
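The four summary lines above list the patches and computes assigned to each GPU-owning PE (2, 4, 8 and 12): PEs 2 and 8 each have 602 remote computes, while PEs 4 and 12 have 448. A minimal sketch that parses those lines and reports the ratio of the largest compute count to the mean. Using max/mean as the imbalance measure is a choice made here for illustration; NAMD does not print it.

```python
import re
from statistics import mean
from typing import Dict, Tuple

# "Pe 2 has 0 local and 80 remote patches and 0 local and 602 remote computes."
SUMMARY_RE = re.compile(
    r"Pe (\d+) has (\d+) local and (\d+) remote patches and "
    r"(\d+) local and (\d+) remote computes"
)


def gpu_work(log_text: str) -> Dict[int, Tuple[int, int]]:
    """Map each PE to (total patches, total computes) from the summary lines."""
    work = {}
    for pe, lp, rp, lc, rc in SUMMARY_RE.findall(log_text):
        work[int(pe)] = (int(lp) + int(rp), int(lc) + int(rc))
    return work


def compute_imbalance(work: Dict[int, Tuple[int, int]]) -> float:
    """Busiest PE's compute count divided by the mean (1.0 = perfectly even)."""
    counts = [computes for _, computes in work.values()]
    return max(counts) / mean(counts)
```

For the four lines above the mean is 525 computes, so the ratio is 602/525, about 1.15.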
ETITLE: TS BOND ANGLE DIHED IMPRP
ELECT VDW BOUNDARY MISC
KINETIC TOTAL TEMP POTENTIAL TOTAL3
TEMPAVG PRESSURE GPRESSURE VOLUME
PRESSAVG GPRESSAVG
ENERGY: 0 18592.0692 30821.3813 21449.0928 479.6594
-367064.4538 12629.7559 0.0000 0.0000
97229.4313 -185863.0639 309.9990 -283092.4952 -184059.8378
309.9990 -106.8030 -15.9586 1295490.0449
-106.8030 -15.9586
OPENING EXTENDED SYSTEM TRAJECTORY FILE
LDB: ============= START OF LOAD BALANCING ============== 22.3971
LDB: ============== END OF LOAD BALANCING =============== 22.4037
..
Info: Benchmark time: 14 CPUs 0.0722329 s/step 0.418014 days/ns 818.262 MB memory
Info: Benchmark time: 14 CPUs 0.0599945 s/step 0.347191 days/ns 823.234 MB memory
Info: Benchmark time: 14 CPUs 0.0458627 s/step 0.265409 days/ns 825.172 MB memory
Info: Benchmark time: 14 CPUs 0.045706 s/step 0.264502 days/ns 825.816 MB memory
Info: Benchmark time: 14 CPUs 0.0622721 s/step 0.360371 days/ns 827.113 MB memory
Info: Benchmark time: 14 CPUs 0.0695487 s/step 0.402481 days/ns 828.879 MB memory
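To compare these benchmark lines with the ns/day figure quoted at the top: ns/day is 1/(days/ns), or equivalently (86400 s / (s/step)) × timestep. With TIMESTEP 2 (fs) from this log, the fastest line (0.045706 s/step) works out to about 3.78 ns/day and the slowest (0.0722329 s/step) to about 2.39 ns/day. A small sketch that does the conversion for every Benchmark line; the 2 fs default is this run's TIMESTEP and is an assumption for any other input.

```python
import re
from typing import List, Tuple

SECONDS_PER_DAY = 86400.0

# Matches lines like:
# "Info: Benchmark time: 14 CPUs 0.045706 s/step 0.264502 days/ns 825.816 MB memory"
BENCH_RE = re.compile(r"Benchmark time:\s+(\d+)\s+CPUs\s+([\d.eE+-]+)\s+s/step")


def ns_per_day(seconds_per_step: float, timestep_fs: float) -> float:
    """Simulated nanoseconds per wall-clock day."""
    steps_per_day = SECONDS_PER_DAY / seconds_per_step
    return steps_per_day * timestep_fs * 1e-6  # 1 ns = 1e6 fs


def parse_benchmarks(log_text: str,
                     timestep_fs: float = 2.0) -> List[Tuple[int, float, float]]:
    """(CPUs, s/step, ns/day) for every 'Benchmark time' line in the log."""
    rows = []
    for cpus, sps in BENCH_RE.findall(log_text):
        s = float(sps)
        rows.append((int(cpus), s, ns_per_day(s, timestep_fs)))
    return rows


if __name__ == "__main__":
    sample = ("Info: Benchmark time: 14 CPUs 0.045706 s/step "
              "0.264502 days/ns 825.816 MB memory")
    for cpus, sps, nsd in parse_benchmarks(sample):
        print(f"{cpus} CPUs: {sps:.4f} s/step -> {nsd:.2f} ns/day")
```

The sample line prints "14 CPUs: 0.0457 s/step -> 3.78 ns/day", which agrees with 1/0.264502 days/ns.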
Thanks in advance for any ideas!
-- Eddie