NAMD_2.9_Linux-x86_64-ibverbs RUNS on QueenBee RUN TIME VARIATION

From: Thomas C. Bishop (bishop_at_latech.edu)
Date: Tue Nov 19 2013 - 08:06:31 CST

Dear NAMD,

We are using charmrun with NAMD_2.9_Linux-x86_64-ibverbs on QueenBee.loni.org.
And obtain WILDLY different run times (differences by over a factor of 100%) for (essentially) the same
1ns simulation task (about 375,000 atoms and 256 processors (32 nodes ) of dual quad Core Xeon 64-bit @ 2.33 GHz 10Gbps IB network, PME with gridspacing 1.
standard MD runs)

Our usual experience with similar runs is less than 10% variation and worst case scenario of 30% variation not a factor of 2X!
Suggestions for tracking down this problem?
This is our command line w/in the qsub

$CHARM ++remote-shell /usr/bin/ssh ++timeout 120  ++nodelist nodes ++p $NP  $NAMD  $CONF

(yes we've tried reducing the number processes per node  form a max of 8 down to as few as 6)


The "benchmark" times all look very promising ... about 6hrs for a 1ns run.
But in practice we can OFTEN run out of time even when we request a 12hr wall limit.
Inspection of the "time remaining" shows that tfor typical runs this is a monotonically decreasing function.
For the aberrant runs the values are all over the place.

Thanks for any ideas.
Tom


DETAILS




head -1500 */dyn*log | grep -i benchmark

**********  sin2/dyn24.log **************
Info: Benchmark time: 256 CPUs 0.0428429 s/step 0.247933 days/ns 480.676 MB memory
Info: Benchmark time: 256 CPUs 0.043162 s/step 0.24978 days/ns 485.109 MB memory
Info: Benchmark time: 256 CPUs 0.0434176 s/step 0.25126 days/ns 485.109 MB memory
**********  sin8/dyn21.log **************
Info: Benchmark time: 256 CPUs 0.0410384 s/step 0.237491 days/ns 477.648 MB memory
Info: Benchmark time: 256 CPUs 0.0411387 s/step 0.238071 days/ns 482.082 MB memory
Info: Benchmark time: 256 CPUs 0.0416531 s/step 0.241048 days/ns 483.562 MB memory
**********  sin4/dyn21.log **************
Info: Benchmark time: 256 CPUs 0.0410146 s/step 0.237353 days/ns 478.449 MB memory
Info: Benchmark time: 256 CPUs 0.0410538 s/step 0.23758 days/ns 482.883 MB memory
Info: Benchmark time: 256 CPUs 0.0412462 s/step 0.238693 days/ns 482.883 MB memory
**********  sin7/dyn21.log **************
Info: Benchmark time: 256 CPUs 0.0414661 s/step 0.239966 days/ns 478.672 MB memory
Info: Benchmark time: 256 CPUs 0.0410347 s/step 0.237469 days/ns 483.102 MB memory
Info: Benchmark time: 256 CPUs 0.041631 s/step 0.24092 days/ns 483.102 MB memory
[vsp004@qb3 sin-mutants]$


grep remaining */dyn*log | tail -1

******** sin8/dyn21.log ***************
TIMING: 310000  CPU: 36741.1, 0.102898/step  Wall: 36755, 0.102917/step, 5.43174 hours remaining, 501.289062 MB of memory in use.
******** sin2/dyn24.log ***************
TIMING: 125000  CPU: 5430.1, 0.04337/step  Wall: 5437.7, 0.0434238/step, 4.52331 hours remaining, 502.835938 MB of memory in use.
******** sin4/dyn21.log ***************
TIMING: 95000  CPU: 36786.4, 0.131336/step  Wall: 36819, 0.131405/step, 14.7831 hours remaining, 500.609375 MB of memory in use.
******** sin7/dyn21.log ***************
TIMING: 330000  CPU: 19387.2, 0.098274/step  Wall: 19402.4, 0.0983259/step, 4.64317 hours remaining, 500.820312 MB of memory in use.
TIMING: 335000  CPU: 19894.9, 0.101536/step  Wall: 19910.6, 0.101622/step, 4.65766 hours remaining, 500.820312 MB of memory in use.
[vsp004@qb3 sin-mutants]$


qstat -u vsp004

                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID NDS   TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
724226.qb2           vsp004   checkpt  sin2-24       850    32   1    --  12:00 R 01:28
724227.qb2           vsp004   checkpt  sin2-25       --     32   1    --  12:00 H   --
724406.qb2           vsp004   checkpt  sin4-21     31776    32   1    --  12:00 R 10:43
724407.qb2           vsp004   checkpt  sin4-22       --     32   1    --  12:00 H   --
724408.qb2           vsp004   checkpt  sin4-23       --     32   1    --  12:00 H   --
724409.qb2           vsp004   checkpt  sin4-24       --     32   1    --  12:00 H   --
724410.qb2           vsp004   checkpt  sin4-25       --     32   1    --  12:00 H   --
724414.qb2           vsp004   checkpt  sin7-21      2647    32   1    --  12:00 R 10:26
724415.qb2           vsp004   checkpt  sin7-22       --     32   1    --  12:00 H   --
724416.qb2           vsp004   checkpt  sin7-23       --     32   1    --  12:00 H   --
724417.qb2           vsp004   checkpt  sin7-24       --     32   1    --  12:00 H   --
724418.qb2           vsp004   checkpt  sin7-25       --     32   1    --  12:00 H   --
724429.qb2           vsp004   checkpt  sin8-21      6745    32   1    --  12:00 R 10:13
724430.qb2           vsp004   checkpt  sin8-22       --     32   1    --  12:00 H   --
724431.qb2           vsp004   checkpt  sin8-23       --     32   1    --  12:00 H   --
724432.qb2           vsp004   checkpt  sin8-24       --     32   1    --  12:00 H   --
724433.qb2           vsp004   checkpt  sin8-25       --     32   1    --  12:00 H   --
[vsp004@qb3 sin-mutants]$


AND just for fun

grep -i remaining sin8/dyn21.log

TIMING: 5000  CPU: 492.29, 0.098414/step  Wall: 492.534, 0.0984621/step, 13.5385 hours remaining, 501.289062 MB of memory in use.
TIMING: 10000  CPU: 1644.41, 0.230424/step  Wall: 1645.22, 0.230536/step, 31.3785 hours remaining, 501.289062 MB of memory in use.
TIMING: 15000  CPU: 2372.88, 0.145694/step  Wall: 2374, 0.145756/step, 19.6366 hours remaining, 501.289062 MB of memory in use.
TIMING: 20000  CPU: 3508.73, 0.22717/step  Wall: 3510.33, 0.227266/step, 30.3022 hours remaining, 501.289062 MB of memory in use.
TIMING: 25000  CPU: 3716.95, 0.041644/step  Wall: 3718.66, 0.0416657/step, 5.49756 hours remaining, 501.289062 MB of memory in use.
TIMING: 30000  CPU: 3925.29, 0.041668/step  Wall: 3927.11, 0.0416911/step, 5.44301 hours remaining, 501.289062 MB of memory in use.
TIMING: 35000  CPU: 5212.46, 0.257434/step  Wall: 5214.78, 0.257534/step, 33.2648 hours remaining, 501.289062 MB of memory in use.
TIMING: 40000  CPU: 5772.96, 0.1121/step  Wall: 5775.69, 0.112182/step, 14.3344 hours remaining, 501.289062 MB of memory in use.
TIMING: 45000  CPU: 5980.95, 0.041598/step  Wall: 5983.81, 0.0416241/step, 5.26082 hours remaining, 501.289062 MB of memory in use.
TIMING: 50000  CPU: 6188.35, 0.04148/step  Wall: 6191.32, 0.0415023/step, 5.18779 hours remaining, 501.289062 MB of memory in use.
TIMING: 55000  CPU: 6802.46, 0.122822/step  Wall: 6805.89, 0.122913/step, 15.1935 hours remaining, 501.289062 MB of memory in use.
TIMING: 60000  CPU: 7010.68, 0.041644/step  Wall: 7014.29, 0.0416791/step, 5.09412 hours remaining, 501.289062 MB of memory in use.
TIMING: 65000  CPU: 7613.9, 0.120644/step  Wall: 7617.78, 0.120699/step, 14.5845 hours remaining, 501.289062 MB of memory in use.
TIMING: 70000  CPU: 7821.14, 0.041448/step  Wall: 7825.13, 0.0414693/step, 4.95328 hours remaining, 501.289062 MB of memory in use.
TIMING: 75000  CPU: 8028.73, 0.041518/step  Wall: 8033, 0.041574/step, 4.90804 hours remaining, 501.289062 MB of memory in use.
TIMING: 80000  CPU: 8236.61, 0.041576/step  Wall: 8241, 0.0416/step, 4.85333 hours remaining, 501.289062 MB of memory in use.
TIMING: 85000  CPU: 8727.78, 0.098234/step  Wall: 8732.45, 0.0982896/step, 11.3306 hours remaining, 501.289062 MB of memory in use.
TIMING: 90000  CPU: 8935.14, 0.041472/step  Wall: 8939.94, 0.0414992/step, 4.7263 hours remaining, 501.289062 MB of memory in use.
TIMING: 95000  CPU: 9142.92, 0.041556/step  Wall: 9147.88, 0.0415874/step, 4.67858 hours remaining, 501.289062 MB of memory in use.
TIMING: 100000  CPU: 9893.16, 0.150048/step  Wall: 9898.41, 0.150107/step, 16.6786 hours remaining, 501.289062 MB of memory in use.
TIMING: 105000  CPU: 10454.8, 0.112336/step  Wall: 10460.4, 0.112393/step, 12.332 hours remaining, 501.289062 MB of memory in use.
TIMING: 110000  CPU: 10941.9, 0.097406/step  Wall: 10947.5, 0.0974262/step, 10.5545 hours remaining, 501.289062 MB of memory in use.
TIMING: 115000  CPU: 11995.7, 0.210764/step  Wall: 12001.5, 0.210796/step, 22.5434 hours remaining, 501.289062 MB of memory in use.
TIMING: 120000  CPU: 12205.9, 0.042032/step  Wall: 12211.8, 0.0420594/step, 4.4396 hours remaining, 501.289062 MB of memory in use.
TIMING: 125000  CPU: 12837.5, 0.126326/step  Wall: 12843.7, 0.126376/step, 13.1641 hours remaining, 501.289062 MB of memory in use.
TIMING: 130000  CPU: 13505.4, 0.133582/step  Wall: 13511.7, 0.133601/step, 13.7312 hours remaining, 501.289062 MB of memory in use.
TIMING: 135000  CPU: 14772.3, 0.25339/step  Wall: 14779.4, 0.253544/step, 25.7065 hours remaining, 501.289062 MB of memory in use.
TIMING: 140000  CPU: 14979.7, 0.041466/step  Wall: 14986.8, 0.0414803/step, 4.14803 hours remaining, 501.289062 MB of memory in use.
TIMING: 145000  CPU: 15633.6, 0.130788/step  Wall: 15640.9, 0.130822/step, 12.9005 hours remaining, 501.289062 MB of memory in use.
TIMING: 150000  CPU: 16240.2, 0.121322/step  Wall: 16247.6, 0.121338/step, 11.7967 hours remaining, 501.289062 MB of memory in use.
TIMING: 155000  CPU: 16448.1, 0.04158/step  Wall: 16455.6, 0.0415979/step, 3.98647 hours remaining, 501.289062 MB of memory in use.
TIMING: 160000  CPU: 16654.9, 0.041352/step  Wall: 16662.4, 0.0413741/step, 3.90755 hours remaining, 501.289062 MB of memory in use.
TIMING: 165000  CPU: 17394, 0.147828/step  Wall: 17401.7, 0.147854/step, 13.7586 hours remaining, 501.289062 MB of memory in use.
TIMING: 170000  CPU: 20006.8, 0.522554/step  Wall: 20014.8, 0.522608/step, 47.9057 hours remaining, 501.289062 MB of memory in use.
TIMING: 175000  CPU: 20589.6, 0.116566/step  Wall: 20597.7, 0.116592/step, 10.5257 hours remaining, 501.289062 MB of memory in use.
TIMING: 180000  CPU: 20797.3, 0.041532/step  Wall: 20805.5, 0.0415489/step, 3.69323 hours remaining, 501.289062 MB of memory in use.
TIMING: 185000  CPU: 21956.6, 0.231872/step  Wall: 21966.2, 0.232153/step, 20.3134 hours remaining, 501.289062 MB of memory in use.
TIMING: 190000  CPU: 22486.7, 0.106008/step  Wall: 22496.6, 0.106079/step, 9.13461 hours remaining, 501.289062 MB of memory in use.
TIMING: 195000  CPU: 23028.1, 0.108288/step  Wall: 23038.3, 0.108333/step, 9.17818 hours remaining, 501.289062 MB of memory in use.
TIMING: 200000  CPU: 23584.1, 0.111194/step  Wall: 23594.9, 0.111324/step, 9.277 hours remaining, 501.289062 MB of memory in use.
TIMING: 205000  CPU: 24109.3, 0.105036/step  Wall: 24120.3, 0.105086/step, 8.61123 hours remaining, 501.289062 MB of memory in use.
TIMING: 210000  CPU: 24643, 0.106736/step  Wall: 24654.2, 0.106778/step, 8.60155 hours remaining, 501.289062 MB of memory in use.
TIMING: 215000  CPU: 25751.9, 0.221798/step  Wall: 25763.6, 0.221885/step, 17.5659 hours remaining, 501.289062 MB of memory in use.
TIMING: 220000  CPU: 25959.1, 0.04143/step  Wall: 25970.9, 0.0414426/step, 3.22331 hours remaining, 501.289062 MB of memory in use.
TIMING: 225000  CPU: 26166.5, 0.041482/step  Wall: 26178.4, 0.0415041/step, 3.17045 hours remaining, 501.289062 MB of memory in use.
TIMING: 230000  CPU: 27392.6, 0.245224/step  Wall: 27404.6, 0.245245/step, 18.3934 hours remaining, 501.289062 MB of memory in use.
TIMING: 235000  CPU: 27599.9, 0.041458/step  Wall: 27612, 0.0414758/step, 3.05308 hours remaining, 501.289062 MB of memory in use.
TIMING: 240000  CPU: 28188.2, 0.117664/step  Wall: 28200.4, 0.117678/step, 8.49896 hours remaining, 501.289062 MB of memory in use.
TIMING: 245000  CPU: 28397.4, 0.041842/step  Wall: 28409.8, 0.0418826/step, 2.96668 hours remaining, 501.289062 MB of memory in use.
TIMING: 250000  CPU: 28899.7, 0.100446/step  Wall: 28912, 0.100453/step, 6.9759 hours remaining, 501.289062 MB of memory in use.
TIMING: 255000  CPU: 29992.9, 0.218636/step  Wall: 30005.3, 0.218657/step, 14.8809 hours remaining, 501.289062 MB of memory in use.
TIMING: 260000  CPU: 30200.3, 0.041494/step  Wall: 30213, 0.0415275/step, 2.7685 hours remaining, 501.289062 MB of memory in use.
TIMING: 265000  CPU: 30844, 0.128744/step  Wall: 30856.8, 0.128767/step, 8.40565 hours remaining, 501.289062 MB of memory in use.
TIMING: 270000  CPU: 31051.3, 0.04145/step  Wall: 31064.1, 0.0414648/step, 2.64914 hours remaining, 501.289062 MB of memory in use.
TIMING: 275000  CPU: 31686.8, 0.127102/step  Wall: 31699.8, 0.127124/step, 7.94523 hours remaining, 501.289062 MB of memory in use.
TIMING: 280000  CPU: 32931.1, 0.248864/step  Wall: 32944.3, 0.248904/step, 15.2108 hours remaining, 501.289062 MB of memory in use.
TIMING: 285000  CPU: 33991, 0.211968/step  Wall: 34004.3, 0.212001/step, 12.6612 hours remaining, 501.289062 MB of memory in use.
TIMING: 290000  CPU: 35116.5, 0.225108/step  Wall: 35129.9, 0.225125/step, 13.1323 hours remaining, 501.289062 MB of memory in use.
TIMING: 295000  CPU: 35326.2, 0.04194/step  Wall: 35339.7, 0.0419623/step, 2.38952 hours remaining, 501.289062 MB of memory in use.
TIMING: 300000  CPU: 36018.9, 0.138536/step  Wall: 36032.6, 0.138572/step, 7.69843 hours remaining, 501.289062 MB of memory in use.
TIMING: 305000  CPU: 36226.6, 0.041552/step  Wall: 36240.4, 0.0415704/step, 2.25173 hours remaining, 501.289062 MB of memory in use.
TIMING: 310000  CPU: 36741.1, 0.102898/step  Wall: 36755, 0.102917/step, 5.43174 hours remaining, 501.289062 MB of memory in use.
TIMING: 315000  CPU: 36948.9, 0.041558/step  Wall: 36962.9, 0.0415722/step, 2.13635 hours remaining, 501.289062 MB of memory in use.
TIMING: 320000  CPU: 37156.3, 0.041482/step  Wall: 37170.4, 0.041511/step, 2.07555 hours remaining, 501.289062 MB of memory in use.
TIMING: 325000  CPU: 37363.3, 0.0414/step  Wall: 37377.6, 0.0414302/step, 2.01397 hours remaining, 501.289062 MB of memory in use.
TIMING: 330000  CPU: 37570.6, 0.041464/step  Wall: 37585, 0.0414844/step, 1.95898 hours remaining, 501.289062 MB of memory in use.
[vsp004@qb3 sin-mutants]$


-- 
*******************************
   Thomas C. Bishop
    Tel: 318-257-5209
    Fax: 318-257-3823
   www.latech.edu/~bishop
******************************** 

This archive was generated by hypermail 2.1.6 : Wed Dec 31 2014 - 23:21:55 CST