AW: NAMD_2.9_Linux-x86_64-ibverbs RUNS on QueenBee RUN TIME VARIATION

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Tue Nov 19 2013 - 09:07:42 CST

Hi Thomas,

 

I think that you are using nodes with Hyperthreading enabled. This means
that there are virtual cores and they share the same physical cores. A
operating system let the processes jump over the cores all the time usually.
Therefore there are points in time where your processes use logical cores
that belong to the same physical core, means running multiple processes on
only some real physical core which is of course bad. Therefore if you can,
disable HT (BIOS) or use f.i. taskset to allow only real cores. To find out
which cores are real, check

 

cat /proc/cpuinfo | grep -E "processor|core id|physical id"

 

The output shows which logical "processor" uses which physical "core id".
Then only use the logical processors having a unique core id. The physical
id will show you the processor socket if you have multiple.

 

Afterwards you can use something like that to run:

 

charmrun [.] taskset -c 0,1,2,3,4 namd2 [..]

 

You can use "top" and press "1" to show the utilization of each individual
core on the compute nodes to check if it is working like expected. This
should give you the expected performance.

 

Let us know.

 

Norman Geist.

 

Von: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] Im Auftrag
von Thomas C. Bishop
Gesendet: Dienstag, 19. November 2013 15:07
An: namd-l_at_ks.uiuc.edu
Cc: Pereddy Venkat
Betreff: namd-l: NAMD_2.9_Linux-x86_64-ibverbs RUNS on QueenBee RUN TIME
VARIATION

 

Dear NAMD,

We are using charmrun with NAMD_2.9_Linux-x86_64-ibverbs on
QueenBee.loni.org.
And obtain WILDLY different run times (differences by over a factor of 100%)
for (essentially) the same
1ns simulation task (about 375,000 atoms and 256 processors (32 nodes ) of
dual quad Core Xeon 64-bit @ 2.33 GHz 10Gbps IB network, PME with
gridspacing 1.
standard MD runs)

Our usual experience with similar runs is less than 10% variation and worst
case scenario of 30% variation not a factor of 2X!
Suggestions for tracking down this problem?
This is our command line w/in the qsub

$CHARM ++remote-shell /usr/bin/ssh ++timeout 120 ++nodelist nodes ++p $NP
$NAMD $CONF

(yes we've tried reducing the number processes per node form a max of 8
down to as few as 6)

The "benchmark" times all look very promising ... about 6hrs for a 1ns run.
But in practice we can OFTEN run out of time even when we request a 12hr
wall limit.
Inspection of the "time remaining" shows that tfor typical runs this is a
monotonically decreasing function.
For the aberrant runs the values are all over the place.

Thanks for any ideas.
Tom

DETAILS

head -1500 */dyn*log | grep -i benchmark

********** sin2/dyn24.log **************
Info: Benchmark time: 256 CPUs 0.0428429 s/step 0.247933 days/ns 480.676 MB
memory
Info: Benchmark time: 256 CPUs 0.043162 s/step 0.24978 days/ns 485.109 MB
memory
Info: Benchmark time: 256 CPUs 0.0434176 s/step 0.25126 days/ns 485.109 MB
memory
********** sin8/dyn21.log **************
Info: Benchmark time: 256 CPUs 0.0410384 s/step 0.237491 days/ns 477.648 MB
memory
Info: Benchmark time: 256 CPUs 0.0411387 s/step 0.238071 days/ns 482.082 MB
memory
Info: Benchmark time: 256 CPUs 0.0416531 s/step 0.241048 days/ns 483.562 MB
memory
********** sin4/dyn21.log **************
Info: Benchmark time: 256 CPUs 0.0410146 s/step 0.237353 days/ns 478.449 MB
memory
Info: Benchmark time: 256 CPUs 0.0410538 s/step 0.23758 days/ns 482.883 MB
memory
Info: Benchmark time: 256 CPUs 0.0412462 s/step 0.238693 days/ns 482.883 MB
memory
********** sin7/dyn21.log **************
Info: Benchmark time: 256 CPUs 0.0414661 s/step 0.239966 days/ns 478.672 MB
memory
Info: Benchmark time: 256 CPUs 0.0410347 s/step 0.237469 days/ns 483.102 MB
memory
Info: Benchmark time: 256 CPUs 0.041631 s/step 0.24092 days/ns 483.102 MB
memory
[vsp004_at_qb3 sin-mutants]$

grep remaining */dyn*log | tail -1

******** sin8/dyn21.log ***************
TIMING: 310000 CPU: 36741.1, 0.102898/step Wall: 36755, 0.102917/step,
5.43174 hours remaining, 501.289062 MB of memory in use.
******** sin2/dyn24.log ***************
TIMING: 125000 CPU: 5430.1, 0.04337/step Wall: 5437.7, 0.0434238/step,
4.52331 hours remaining, 502.835938 MB of memory in use.
******** sin4/dyn21.log ***************
TIMING: 95000 CPU: 36786.4, 0.131336/step Wall: 36819, 0.131405/step,
14.7831 hours remaining, 500.609375 MB of memory in use.
******** sin7/dyn21.log ***************
TIMING: 330000 CPU: 19387.2, 0.098274/step Wall: 19402.4, 0.0983259/step,
4.64317 hours remaining, 500.820312 MB of memory in use.
TIMING: 335000 CPU: 19894.9, 0.101536/step Wall: 19910.6, 0.101622/step,
4.65766 hours remaining, 500.820312 MB of memory in use.
[vsp004_at_qb3 sin-mutants]$

qstat -u vsp004

                                                                   Req'd
Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory
Time S Time
-------------------- -------- -------- ---------- ------ ----- --- ------
----- - -----
724226.qb2 vsp004 checkpt sin2-24 850 32 1 --
12:00 R 01:28
724227.qb2 vsp004 checkpt sin2-25 -- 32 1 --
12:00 H --
724406.qb2 vsp004 checkpt sin4-21 31776 32 1 --
12:00 R 10:43
724407.qb2 vsp004 checkpt sin4-22 -- 32 1 --
12:00 H --
724408.qb2 vsp004 checkpt sin4-23 -- 32 1 --
12:00 H --
724409.qb2 vsp004 checkpt sin4-24 -- 32 1 --
12:00 H --
724410.qb2 vsp004 checkpt sin4-25 -- 32 1 --
12:00 H --
724414.qb2 vsp004 checkpt sin7-21 2647 32 1 --
12:00 R 10:26
724415.qb2 vsp004 checkpt sin7-22 -- 32 1 --
12:00 H --
724416.qb2 vsp004 checkpt sin7-23 -- 32 1 --
12:00 H --
724417.qb2 vsp004 checkpt sin7-24 -- 32 1 --
12:00 H --
724418.qb2 vsp004 checkpt sin7-25 -- 32 1 --
12:00 H --
724429.qb2 vsp004 checkpt sin8-21 6745 32 1 --
12:00 R 10:13
724430.qb2 vsp004 checkpt sin8-22 -- 32 1 --
12:00 H --
724431.qb2 vsp004 checkpt sin8-23 -- 32 1 --
12:00 H --
724432.qb2 vsp004 checkpt sin8-24 -- 32 1 --
12:00 H --
724433.qb2 vsp004 checkpt sin8-25 -- 32 1 --
12:00 H --
[vsp004_at_qb3 sin-mutants]$

AND just for fun

grep -i remaining sin8/dyn21.log

TIMING: 5000 CPU: 492.29, 0.098414/step Wall: 492.534, 0.0984621/step,
13.5385 hours remaining, 501.289062 MB of memory in use.
TIMING: 10000 CPU: 1644.41, 0.230424/step Wall: 1645.22, 0.230536/step,
31.3785 hours remaining, 501.289062 MB of memory in use.
TIMING: 15000 CPU: 2372.88, 0.145694/step Wall: 2374, 0.145756/step,
19.6366 hours remaining, 501.289062 MB of memory in use.
TIMING: 20000 CPU: 3508.73, 0.22717/step Wall: 3510.33, 0.227266/step,
30.3022 hours remaining, 501.289062 MB of memory in use.
TIMING: 25000 CPU: 3716.95, 0.041644/step Wall: 3718.66, 0.0416657/step,
5.49756 hours remaining, 501.289062 MB of memory in use.
TIMING: 30000 CPU: 3925.29, 0.041668/step Wall: 3927.11, 0.0416911/step,
5.44301 hours remaining, 501.289062 MB of memory in use.
TIMING: 35000 CPU: 5212.46, 0.257434/step Wall: 5214.78, 0.257534/step,
33.2648 hours remaining, 501.289062 MB of memory in use.
TIMING: 40000 CPU: 5772.96, 0.1121/step Wall: 5775.69, 0.112182/step,
14.3344 hours remaining, 501.289062 MB of memory in use.
TIMING: 45000 CPU: 5980.95, 0.041598/step Wall: 5983.81, 0.0416241/step,
5.26082 hours remaining, 501.289062 MB of memory in use.
TIMING: 50000 CPU: 6188.35, 0.04148/step Wall: 6191.32, 0.0415023/step,
5.18779 hours remaining, 501.289062 MB of memory in use.
TIMING: 55000 CPU: 6802.46, 0.122822/step Wall: 6805.89, 0.122913/step,
15.1935 hours remaining, 501.289062 MB of memory in use.
TIMING: 60000 CPU: 7010.68, 0.041644/step Wall: 7014.29, 0.0416791/step,
5.09412 hours remaining, 501.289062 MB of memory in use.
TIMING: 65000 CPU: 7613.9, 0.120644/step Wall: 7617.78, 0.120699/step,
14.5845 hours remaining, 501.289062 MB of memory in use.
TIMING: 70000 CPU: 7821.14, 0.041448/step Wall: 7825.13, 0.0414693/step,
4.95328 hours remaining, 501.289062 MB of memory in use.
TIMING: 75000 CPU: 8028.73, 0.041518/step Wall: 8033, 0.041574/step,
4.90804 hours remaining, 501.289062 MB of memory in use.
TIMING: 80000 CPU: 8236.61, 0.041576/step Wall: 8241, 0.0416/step, 4.85333
hours remaining, 501.289062 MB of memory in use.
TIMING: 85000 CPU: 8727.78, 0.098234/step Wall: 8732.45, 0.0982896/step,
11.3306 hours remaining, 501.289062 MB of memory in use.
TIMING: 90000 CPU: 8935.14, 0.041472/step Wall: 8939.94, 0.0414992/step,
4.7263 hours remaining, 501.289062 MB of memory in use.
TIMING: 95000 CPU: 9142.92, 0.041556/step Wall: 9147.88, 0.0415874/step,
4.67858 hours remaining, 501.289062 MB of memory in use.
TIMING: 100000 CPU: 9893.16, 0.150048/step Wall: 9898.41, 0.150107/step,
16.6786 hours remaining, 501.289062 MB of memory in use.
TIMING: 105000 CPU: 10454.8, 0.112336/step Wall: 10460.4, 0.112393/step,
12.332 hours remaining, 501.289062 MB of memory in use.
TIMING: 110000 CPU: 10941.9, 0.097406/step Wall: 10947.5, 0.0974262/step,
10.5545 hours remaining, 501.289062 MB of memory in use.
TIMING: 115000 CPU: 11995.7, 0.210764/step Wall: 12001.5, 0.210796/step,
22.5434 hours remaining, 501.289062 MB of memory in use.
TIMING: 120000 CPU: 12205.9, 0.042032/step Wall: 12211.8, 0.0420594/step,
4.4396 hours remaining, 501.289062 MB of memory in use.
TIMING: 125000 CPU: 12837.5, 0.126326/step Wall: 12843.7, 0.126376/step,
13.1641 hours remaining, 501.289062 MB of memory in use.
TIMING: 130000 CPU: 13505.4, 0.133582/step Wall: 13511.7, 0.133601/step,
13.7312 hours remaining, 501.289062 MB of memory in use.
TIMING: 135000 CPU: 14772.3, 0.25339/step Wall: 14779.4, 0.253544/step,
25.7065 hours remaining, 501.289062 MB of memory in use.
TIMING: 140000 CPU: 14979.7, 0.041466/step Wall: 14986.8, 0.0414803/step,
4.14803 hours remaining, 501.289062 MB of memory in use.
TIMING: 145000 CPU: 15633.6, 0.130788/step Wall: 15640.9, 0.130822/step,
12.9005 hours remaining, 501.289062 MB of memory in use.
TIMING: 150000 CPU: 16240.2, 0.121322/step Wall: 16247.6, 0.121338/step,
11.7967 hours remaining, 501.289062 MB of memory in use.
TIMING: 155000 CPU: 16448.1, 0.04158/step Wall: 16455.6, 0.0415979/step,
3.98647 hours remaining, 501.289062 MB of memory in use.
TIMING: 160000 CPU: 16654.9, 0.041352/step Wall: 16662.4, 0.0413741/step,
3.90755 hours remaining, 501.289062 MB of memory in use.
TIMING: 165000 CPU: 17394, 0.147828/step Wall: 17401.7, 0.147854/step,
13.7586 hours remaining, 501.289062 MB of memory in use.
TIMING: 170000 CPU: 20006.8, 0.522554/step Wall: 20014.8, 0.522608/step,
47.9057 hours remaining, 501.289062 MB of memory in use.
TIMING: 175000 CPU: 20589.6, 0.116566/step Wall: 20597.7, 0.116592/step,
10.5257 hours remaining, 501.289062 MB of memory in use.
TIMING: 180000 CPU: 20797.3, 0.041532/step Wall: 20805.5, 0.0415489/step,
3.69323 hours remaining, 501.289062 MB of memory in use.
TIMING: 185000 CPU: 21956.6, 0.231872/step Wall: 21966.2, 0.232153/step,
20.3134 hours remaining, 501.289062 MB of memory in use.
TIMING: 190000 CPU: 22486.7, 0.106008/step Wall: 22496.6, 0.106079/step,
9.13461 hours remaining, 501.289062 MB of memory in use.
TIMING: 195000 CPU: 23028.1, 0.108288/step Wall: 23038.3, 0.108333/step,
9.17818 hours remaining, 501.289062 MB of memory in use.
TIMING: 200000 CPU: 23584.1, 0.111194/step Wall: 23594.9, 0.111324/step,
9.277 hours remaining, 501.289062 MB of memory in use.
TIMING: 205000 CPU: 24109.3, 0.105036/step Wall: 24120.3, 0.105086/step,
8.61123 hours remaining, 501.289062 MB of memory in use.
TIMING: 210000 CPU: 24643, 0.106736/step Wall: 24654.2, 0.106778/step,
8.60155 hours remaining, 501.289062 MB of memory in use.
TIMING: 215000 CPU: 25751.9, 0.221798/step Wall: 25763.6, 0.221885/step,
17.5659 hours remaining, 501.289062 MB of memory in use.
TIMING: 220000 CPU: 25959.1, 0.04143/step Wall: 25970.9, 0.0414426/step,
3.22331 hours remaining, 501.289062 MB of memory in use.
TIMING: 225000 CPU: 26166.5, 0.041482/step Wall: 26178.4, 0.0415041/step,
3.17045 hours remaining, 501.289062 MB of memory in use.
TIMING: 230000 CPU: 27392.6, 0.245224/step Wall: 27404.6, 0.245245/step,
18.3934 hours remaining, 501.289062 MB of memory in use.
TIMING: 235000 CPU: 27599.9, 0.041458/step Wall: 27612, 0.0414758/step,
3.05308 hours remaining, 501.289062 MB of memory in use.
TIMING: 240000 CPU: 28188.2, 0.117664/step Wall: 28200.4, 0.117678/step,
8.49896 hours remaining, 501.289062 MB of memory in use.
TIMING: 245000 CPU: 28397.4, 0.041842/step Wall: 28409.8, 0.0418826/step,
2.96668 hours remaining, 501.289062 MB of memory in use.
TIMING: 250000 CPU: 28899.7, 0.100446/step Wall: 28912, 0.100453/step,
6.9759 hours remaining, 501.289062 MB of memory in use.
TIMING: 255000 CPU: 29992.9, 0.218636/step Wall: 30005.3, 0.218657/step,
14.8809 hours remaining, 501.289062 MB of memory in use.
TIMING: 260000 CPU: 30200.3, 0.041494/step Wall: 30213, 0.0415275/step,
2.7685 hours remaining, 501.289062 MB of memory in use.
TIMING: 265000 CPU: 30844, 0.128744/step Wall: 30856.8, 0.128767/step,
8.40565 hours remaining, 501.289062 MB of memory in use.
TIMING: 270000 CPU: 31051.3, 0.04145/step Wall: 31064.1, 0.0414648/step,
2.64914 hours remaining, 501.289062 MB of memory in use.
TIMING: 275000 CPU: 31686.8, 0.127102/step Wall: 31699.8, 0.127124/step,
7.94523 hours remaining, 501.289062 MB of memory in use.
TIMING: 280000 CPU: 32931.1, 0.248864/step Wall: 32944.3, 0.248904/step,
15.2108 hours remaining, 501.289062 MB of memory in use.
TIMING: 285000 CPU: 33991, 0.211968/step Wall: 34004.3, 0.212001/step,
12.6612 hours remaining, 501.289062 MB of memory in use.
TIMING: 290000 CPU: 35116.5, 0.225108/step Wall: 35129.9, 0.225125/step,
13.1323 hours remaining, 501.289062 MB of memory in use.
TIMING: 295000 CPU: 35326.2, 0.04194/step Wall: 35339.7, 0.0419623/step,
2.38952 hours remaining, 501.289062 MB of memory in use.
TIMING: 300000 CPU: 36018.9, 0.138536/step Wall: 36032.6, 0.138572/step,
7.69843 hours remaining, 501.289062 MB of memory in use.
TIMING: 305000 CPU: 36226.6, 0.041552/step Wall: 36240.4, 0.0415704/step,
2.25173 hours remaining, 501.289062 MB of memory in use.
TIMING: 310000 CPU: 36741.1, 0.102898/step Wall: 36755, 0.102917/step,
5.43174 hours remaining, 501.289062 MB of memory in use.
TIMING: 315000 CPU: 36948.9, 0.041558/step Wall: 36962.9, 0.0415722/step,
2.13635 hours remaining, 501.289062 MB of memory in use.
TIMING: 320000 CPU: 37156.3, 0.041482/step Wall: 37170.4, 0.041511/step,
2.07555 hours remaining, 501.289062 MB of memory in use.
TIMING: 325000 CPU: 37363.3, 0.0414/step Wall: 37377.6, 0.0414302/step,
2.01397 hours remaining, 501.289062 MB of memory in use.
TIMING: 330000 CPU: 37570.6, 0.041464/step Wall: 37585, 0.0414844/step,
1.95898 hours remaining, 501.289062 MB of memory in use.
[vsp004_at_qb3 sin-mutants]$

-- 
*******************************
   Thomas C. Bishop
    Tel: 318-257-5209
    Fax: 318-257-3823
   www.latech.edu/~bishop
******************************** 

This archive was generated by hypermail 2.1.6 : Tue Dec 31 2013 - 23:24:00 CST