NAMD_2.9_Linux-x86_64-ibverbs RUNS on QueenBee RUN TIME VARIATION
Dear NAMD,
We are using charmrun with NAMD_2.9_Linux-x86_64-ibverbs on
QueenBee.loni.org.
And obtain WILDLY different run times (differences by over a factor
of 100%) for (essentially) the same
1ns simulation task (about 375,000 atoms and 256 processors (32
nodes ) of dual quad Core Xeon 64-bit @ 2.33 GHz 10Gbps IB network,
PME with gridspacing 1.
standard MD runs)
Our usual experience with similar runs is less than 10% variation
and worst case scenario of 30% variation not a factor of 2X!
Suggestions for tracking down this problem?
This is our command line w/in the qsub
$CHARM ++remote-shell /usr/bin/ssh ++timeout 120 ++nodelist nodes
++p $NP $NAMD $CONF
(yes we've tried reducing the number processes per node form a max
of 8 down to as few as 6)
The "benchmark" times all look very promising ... about 6hrs for a
1ns run.
But in practice we can OFTEN run out of time even when we request a
12hr wall limit.
Inspection of the "time remaining" shows that tfor typical runs this
is a monotonically decreasing function.
For the aberrant runs the values are all over the place.
Thanks for any ideas.
Tom
DETAILS
head -1500 */dyn*log | grep -i benchmark
********** sin2/dyn24.log **************
Info: Benchmark time: 256 CPUs 0.0428429 s/step 0.247933 days/ns
480.676 MB memory
Info: Benchmark time: 256 CPUs 0.043162 s/step 0.24978 days/ns
485.109 MB memory
Info: Benchmark time: 256 CPUs 0.0434176 s/step 0.25126 days/ns
485.109 MB memory
********** sin8/dyn21.log **************
Info: Benchmark time: 256 CPUs 0.0410384 s/step 0.237491 days/ns
477.648 MB memory
Info: Benchmark time: 256 CPUs 0.0411387 s/step 0.238071 days/ns
482.082 MB memory
Info: Benchmark time: 256 CPUs 0.0416531 s/step 0.241048 days/ns
483.562 MB memory
********** sin4/dyn21.log **************
Info: Benchmark time: 256 CPUs 0.0410146 s/step 0.237353 days/ns
478.449 MB memory
Info: Benchmark time: 256 CPUs 0.0410538 s/step 0.23758 days/ns
482.883 MB memory
Info: Benchmark time: 256 CPUs 0.0412462 s/step 0.238693 days/ns
482.883 MB memory
********** sin7/dyn21.log **************
Info: Benchmark time: 256 CPUs 0.0414661 s/step 0.239966 days/ns
478.672 MB memory
Info: Benchmark time: 256 CPUs 0.0410347 s/step 0.237469 days/ns
483.102 MB memory
Info: Benchmark time: 256 CPUs 0.041631 s/step 0.24092 days/ns
483.102 MB memory
[vsp004@qb3 sin-mutants]$
grep remaining */dyn*log | tail -1
******** sin8/dyn21.log ***************
TIMING: 310000 CPU: 36741.1, 0.102898/step Wall: 36755,
0.102917/step, 5.43174 hours remaining, 501.289062 MB of memory in
use.
******** sin2/dyn24.log ***************
TIMING: 125000 CPU: 5430.1, 0.04337/step Wall: 5437.7,
0.0434238/step, 4.52331 hours remaining, 502.835938 MB of memory in
use.
******** sin4/dyn21.log ***************
TIMING: 95000 CPU: 36786.4, 0.131336/step Wall: 36819,
0.131405/step, 14.7831 hours remaining, 500.609375 MB of memory in
use.
******** sin7/dyn21.log ***************
TIMING: 330000 CPU: 19387.2, 0.098274/step Wall: 19402.4,
0.0983259/step, 4.64317 hours remaining, 500.820312 MB of memory in
use.
TIMING: 335000 CPU: 19894.9, 0.101536/step Wall: 19910.6,
0.101622/step, 4.65766 hours remaining, 500.820312 MB of memory in
use.
[vsp004@qb3 sin-mutants]$
qstat -u vsp004
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK
Memory Time S Time
-------------------- -------- -------- ---------- ------ ----- ---
------ ----- - -----
724226.qb2 vsp004 checkpt sin2-24 850 32
1 -- 12:00 R 01:28
724227.qb2 vsp004 checkpt sin2-25 -- 32
1 -- 12:00 H --
724406.qb2 vsp004 checkpt sin4-21 31776 32
1 -- 12:00 R 10:43
724407.qb2 vsp004 checkpt sin4-22 -- 32
1 -- 12:00 H --
724408.qb2 vsp004 checkpt sin4-23 -- 32
1 -- 12:00 H --
724409.qb2 vsp004 checkpt sin4-24 -- 32
1 -- 12:00 H --
724410.qb2 vsp004 checkpt sin4-25 -- 32
1 -- 12:00 H --
724414.qb2 vsp004 checkpt sin7-21 2647 32
1 -- 12:00 R 10:26
724415.qb2 vsp004 checkpt sin7-22 -- 32
1 -- 12:00 H --
724416.qb2 vsp004 checkpt sin7-23 -- 32
1 -- 12:00 H --
724417.qb2 vsp004 checkpt sin7-24 -- 32
1 -- 12:00 H --
724418.qb2 vsp004 checkpt sin7-25 -- 32
1 -- 12:00 H --
724429.qb2 vsp004 checkpt sin8-21 6745 32
1 -- 12:00 R 10:13
724430.qb2 vsp004 checkpt sin8-22 -- 32
1 -- 12:00 H --
724431.qb2 vsp004 checkpt sin8-23 -- 32
1 -- 12:00 H --
724432.qb2 vsp004 checkpt sin8-24 -- 32
1 -- 12:00 H --
724433.qb2 vsp004 checkpt sin8-25 -- 32
1 -- 12:00 H --
[vsp004@qb3 sin-mutants]$
AND just for fun
grep -i remaining sin8/dyn21.log
TIMING: 5000 CPU: 492.29, 0.098414/step Wall: 492.534,
0.0984621/step, 13.5385 hours remaining, 501.289062 MB of memory in
use.
TIMING: 10000 CPU: 1644.41, 0.230424/step Wall: 1645.22,
0.230536/step, 31.3785 hours remaining, 501.289062 MB of memory in
use.
TIMING: 15000 CPU: 2372.88, 0.145694/step Wall: 2374,
0.145756/step, 19.6366 hours remaining, 501.289062 MB of memory in
use.
TIMING: 20000 CPU: 3508.73, 0.22717/step Wall: 3510.33,
0.227266/step, 30.3022 hours remaining, 501.289062 MB of memory in
use.
TIMING: 25000 CPU: 3716.95, 0.041644/step Wall: 3718.66,
0.0416657/step, 5.49756 hours remaining, 501.289062 MB of memory in
use.
TIMING: 30000 CPU: 3925.29, 0.041668/step Wall: 3927.11,
0.0416911/step, 5.44301 hours remaining, 501.289062 MB of memory in
use.
TIMING: 35000 CPU: 5212.46, 0.257434/step Wall: 5214.78,
0.257534/step, 33.2648 hours remaining, 501.289062 MB of memory in
use.
TIMING: 40000 CPU: 5772.96, 0.1121/step Wall: 5775.69,
0.112182/step, 14.3344 hours remaining, 501.289062 MB of memory in
use.
TIMING: 45000 CPU: 5980.95, 0.041598/step Wall: 5983.81,
0.0416241/step, 5.26082 hours remaining, 501.289062 MB of memory in
use.
TIMING: 50000 CPU: 6188.35, 0.04148/step Wall: 6191.32,
0.0415023/step, 5.18779 hours remaining, 501.289062 MB of memory in
use.
TIMING: 55000 CPU: 6802.46, 0.122822/step Wall: 6805.89,
0.122913/step, 15.1935 hours remaining, 501.289062 MB of memory in
use.
TIMING: 60000 CPU: 7010.68, 0.041644/step Wall: 7014.29,
0.0416791/step, 5.09412 hours remaining, 501.289062 MB of memory in
use.
TIMING: 65000 CPU: 7613.9, 0.120644/step Wall: 7617.78,
0.120699/step, 14.5845 hours remaining, 501.289062 MB of memory in
use.
TIMING: 70000 CPU: 7821.14, 0.041448/step Wall: 7825.13,
0.0414693/step, 4.95328 hours remaining, 501.289062 MB of memory in
use.
TIMING: 75000 CPU: 8028.73, 0.041518/step Wall: 8033,
0.041574/step, 4.90804 hours remaining, 501.289062 MB of memory in
use.
TIMING: 80000 CPU: 8236.61, 0.041576/step Wall: 8241, 0.0416/step,
4.85333 hours remaining, 501.289062 MB of memory in use.
TIMING: 85000 CPU: 8727.78, 0.098234/step Wall: 8732.45,
0.0982896/step, 11.3306 hours remaining, 501.289062 MB of memory in
use.
TIMING: 90000 CPU: 8935.14, 0.041472/step Wall: 8939.94,
0.0414992/step, 4.7263 hours remaining, 501.289062 MB of memory in
use.
TIMING: 95000 CPU: 9142.92, 0.041556/step Wall: 9147.88,
0.0415874/step, 4.67858 hours remaining, 501.289062 MB of memory in
use.
TIMING: 100000 CPU: 9893.16, 0.150048/step Wall: 9898.41,
0.150107/step, 16.6786 hours remaining, 501.289062 MB of memory in
use.
TIMING: 105000 CPU: 10454.8, 0.112336/step Wall: 10460.4,
0.112393/step, 12.332 hours remaining, 501.289062 MB of memory in
use.
TIMING: 110000 CPU: 10941.9, 0.097406/step Wall: 10947.5,
0.0974262/step, 10.5545 hours remaining, 501.289062 MB of memory in
use.
TIMING: 115000 CPU: 11995.7, 0.210764/step Wall: 12001.5,
0.210796/step, 22.5434 hours remaining, 501.289062 MB of memory in
use.
TIMING: 120000 CPU: 12205.9, 0.042032/step Wall: 12211.8,
0.0420594/step, 4.4396 hours remaining, 501.289062 MB of memory in
use.
TIMING: 125000 CPU: 12837.5, 0.126326/step Wall: 12843.7,
0.126376/step, 13.1641 hours remaining, 501.289062 MB of memory in
use.
TIMING: 130000 CPU: 13505.4, 0.133582/step Wall: 13511.7,
0.133601/step, 13.7312 hours remaining, 501.289062 MB of memory in
use.
TIMING: 135000 CPU: 14772.3, 0.25339/step Wall: 14779.4,
0.253544/step, 25.7065 hours remaining, 501.289062 MB of memory in
use.
TIMING: 140000 CPU: 14979.7, 0.041466/step Wall: 14986.8,
0.0414803/step, 4.14803 hours remaining, 501.289062 MB of memory in
use.
TIMING: 145000 CPU: 15633.6, 0.130788/step Wall: 15640.9,
0.130822/step, 12.9005 hours remaining, 501.289062 MB of memory in
use.
TIMING: 150000 CPU: 16240.2, 0.121322/step Wall: 16247.6,
0.121338/step, 11.7967 hours remaining, 501.289062 MB of memory in
use.
TIMING: 155000 CPU: 16448.1, 0.04158/step Wall: 16455.6,
0.0415979/step, 3.98647 hours remaining, 501.289062 MB of memory in
use.
TIMING: 160000 CPU: 16654.9, 0.041352/step Wall: 16662.4,
0.0413741/step, 3.90755 hours remaining, 501.289062 MB of memory in
use.
TIMING: 165000 CPU: 17394, 0.147828/step Wall: 17401.7,
0.147854/step, 13.7586 hours remaining, 501.289062 MB of memory in
use.
TIMING: 170000 CPU: 20006.8, 0.522554/step Wall: 20014.8,
0.522608/step, 47.9057 hours remaining, 501.289062 MB of memory in
use.
TIMING: 175000 CPU: 20589.6, 0.116566/step Wall: 20597.7,
0.116592/step, 10.5257 hours remaining, 501.289062 MB of memory in
use.
TIMING: 180000 CPU: 20797.3, 0.041532/step Wall: 20805.5,
0.0415489/step, 3.69323 hours remaining, 501.289062 MB of memory in
use.
TIMING: 185000 CPU: 21956.6, 0.231872/step Wall: 21966.2,
0.232153/step, 20.3134 hours remaining, 501.289062 MB of memory in
use.
TIMING: 190000 CPU: 22486.7, 0.106008/step Wall: 22496.6,
0.106079/step, 9.13461 hours remaining, 501.289062 MB of memory in
use.
TIMING: 195000 CPU: 23028.1, 0.108288/step Wall: 23038.3,
0.108333/step, 9.17818 hours remaining, 501.289062 MB of memory in
use.
TIMING: 200000 CPU: 23584.1, 0.111194/step Wall: 23594.9,
0.111324/step, 9.277 hours remaining, 501.289062 MB of memory in
use.
TIMING: 205000 CPU: 24109.3, 0.105036/step Wall: 24120.3,
0.105086/step, 8.61123 hours remaining, 501.289062 MB of memory in
use.
TIMING: 210000 CPU: 24643, 0.106736/step Wall: 24654.2,
0.106778/step, 8.60155 hours remaining, 501.289062 MB of memory in
use.
TIMING: 215000 CPU: 25751.9, 0.221798/step Wall: 25763.6,
0.221885/step, 17.5659 hours remaining, 501.289062 MB of memory in
use.
TIMING: 220000 CPU: 25959.1, 0.04143/step Wall: 25970.9,
0.0414426/step, 3.22331 hours remaining, 501.289062 MB of memory in
use.
TIMING: 225000 CPU: 26166.5, 0.041482/step Wall: 26178.4,
0.0415041/step, 3.17045 hours remaining, 501.289062 MB of memory in
use.
TIMING: 230000 CPU: 27392.6, 0.245224/step Wall: 27404.6,
0.245245/step, 18.3934 hours remaining, 501.289062 MB of memory in
use.
TIMING: 235000 CPU: 27599.9, 0.041458/step Wall: 27612,
0.0414758/step, 3.05308 hours remaining, 501.289062 MB of memory in
use.
TIMING: 240000 CPU: 28188.2, 0.117664/step Wall: 28200.4,
0.117678/step, 8.49896 hours remaining, 501.289062 MB of memory in
use.
TIMING: 245000 CPU: 28397.4, 0.041842/step Wall: 28409.8,
0.0418826/step, 2.96668 hours remaining, 501.289062 MB of memory in
use.
TIMING: 250000 CPU: 28899.7, 0.100446/step Wall: 28912,
0.100453/step, 6.9759 hours remaining, 501.289062 MB of memory in
use.
TIMING: 255000 CPU: 29992.9, 0.218636/step Wall: 30005.3,
0.218657/step, 14.8809 hours remaining, 501.289062 MB of memory in
use.
TIMING: 260000 CPU: 30200.3, 0.041494/step Wall: 30213,
0.0415275/step, 2.7685 hours remaining, 501.289062 MB of memory in
use.
TIMING: 265000 CPU: 30844, 0.128744/step Wall: 30856.8,
0.128767/step, 8.40565 hours remaining, 501.289062 MB of memory in
use.
TIMING: 270000 CPU: 31051.3, 0.04145/step Wall: 31064.1,
0.0414648/step, 2.64914 hours remaining, 501.289062 MB of memory in
use.
TIMING: 275000 CPU: 31686.8, 0.127102/step Wall: 31699.8,
0.127124/step, 7.94523 hours remaining, 501.289062 MB of memory in
use.
TIMING: 280000 CPU: 32931.1, 0.248864/step Wall: 32944.3,
0.248904/step, 15.2108 hours remaining, 501.289062 MB of memory in
use.
TIMING: 285000 CPU: 33991, 0.211968/step Wall: 34004.3,
0.212001/step, 12.6612 hours remaining, 501.289062 MB of memory in
use.
TIMING: 290000 CPU: 35116.5, 0.225108/step Wall: 35129.9,
0.225125/step, 13.1323 hours remaining, 501.289062 MB of memory in
use.
TIMING: 295000 CPU: 35326.2, 0.04194/step Wall: 35339.7,
0.0419623/step, 2.38952 hours remaining, 501.289062 MB of memory in
use.
TIMING: 300000 CPU: 36018.9, 0.138536/step Wall: 36032.6,
0.138572/step, 7.69843 hours remaining, 501.289062 MB of memory in
use.
TIMING: 305000 CPU: 36226.6, 0.041552/step Wall: 36240.4,
0.0415704/step, 2.25173 hours remaining, 501.289062 MB of memory in
use.
TIMING: 310000 CPU: 36741.1, 0.102898/step Wall: 36755,
0.102917/step, 5.43174 hours remaining, 501.289062 MB of memory in
use.
TIMING: 315000 CPU: 36948.9, 0.041558/step Wall: 36962.9,
0.0415722/step, 2.13635 hours remaining, 501.289062 MB of memory in
use.
TIMING: 320000 CPU: 37156.3, 0.041482/step Wall: 37170.4,
0.041511/step, 2.07555 hours remaining, 501.289062 MB of memory in
use.
TIMING: 325000 CPU: 37363.3, 0.0414/step Wall: 37377.6,
0.0414302/step, 2.01397 hours remaining, 501.289062 MB of memory in
use.
TIMING: 330000 CPU: 37570.6, 0.041464/step Wall: 37585,
0.0414844/step, 1.95898 hours remaining, 501.289062 MB of memory in
use.
[vsp004@qb3 sin-mutants]$
--
*******************************
Thomas C. Bishop
Tel: 318-257-5209
Fax: 318-257-3823
www.latech.edu/~bishop
********************************
This archive was generated by hypermail 2.1.6
: Wed Dec 31 2014 - 23:21:55 CST