RE: optimising namd ibverb runs

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Mon May 18 2015 - 08:16:11 CDT

Hi,

 

You may want to disable Hyper-Threading. Hyper-Threading (logical cores)
brings no benefit in HPC, and since the OS doesn't distinguish real cores
from logical ones, it can't distribute the processes well. As an alternative
you can use e.g. taskset to pin the job to the non-shared (physical) cores.

 

Example: (depending on core layout; benchmark which is fastest)

 

charmrun +p 192 +nodelist yournodelistfile taskset -c 0-7,16-23,32-39,48-55
namd2 your.in

 

or

 

charmrun +p 192 +nodelist yournodelistfile taskset -c
0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,
54,56,58,60,62 namd2 your.in
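
To figure out which logical CPU IDs share a physical core on your nodes (the
layout differs between machines), something like the following should work,
assuming lscpu and the sysfs topology files are available:

lscpu --extended=CPU,CORE,SOCKET

cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list

IDs that report the same CORE number (or that appear together in
thread_siblings_list) are Hyper-Threading siblings; pick only one ID per core
for the taskset mask.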

 

Personally I think it's easier to just disable HT in the BIOS.

Also have a look at Amdahl's law. You are unlikely to get fully linear
scaling, depending on system size. A penalty of about 20% is normal for a
reasonable number of cores for a given system size.
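
As a rough illustration (the numbers here are only assumed for the example):
Amdahl's law gives the speedup on N cores as S(N) = 1 / ((1 - f) + f/N),
where f is the parallel fraction of the work. Even with f = 0.998 you get
S(128) = 1 / (0.002 + 0.998/128) ~ 102, i.e. roughly 80% parallel efficiency,
which is the ~20% penalty mentioned above.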

 

Norman Geist.

 

From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On Behalf
Of Thanassis Silis
Sent: Monday, May 18, 2015 1:48 PM
To: namd-l_at_ks.uiuc.edu
Subject: namd-l: optimising namd ibverb runs

 

Hello everyone,
I am running some relatively small NAMD simulations on a system of 6 blade
processing servers. Each has the following specs:

POWEREDGE M910 BLADE SERVER
(4x) INTEL XEON E7-4830 PROCESSOR (2.13GHZ)
256GB MEMORY FOR 2/4CPU
900GB, SAS 6GBPS, 2.5-IN, 10K RPM HARD DISK
MELLANOX CONNECT X3 QDR 40GBPS INFINIBAND

Each of the 4 processors has 8 cores, and due to hyper-threading 16 threads
are available per processor.
Indeed, cat /proc/cpuinfo reports 64 CPUs on each system.

I have created a nodelist file using the InfiniBand interface IP addresses,
and I am using the ibverbs NAMD executable. I have run several test
simulations to figure out which settings minimize processing time. Overall
it seems that with 64 CPUs/system * 6 systems = 384 CPUs, processing time is
minimized by using "+p128 +setcpuaffinity".

This seems odd, as it is only 1/3 of the available CPUs. It's not even half,
which would seem more sensible (if only one of each core's two threads does
work, it can use the full resources of the core, which should maximize
performance).

One of the things I tried was to let the system decide which CPUs to use,
with
charmrun namd2 ++nodelist nodelist +setcpuaffinity `numactl --show | awk
'/^physcpubind/ {printf "+p%d +pemap %d",(NF-1),$2;
for(i=3;i<=NF;++i){printf ",%d",$i}}'` sim.conf > sim.log

and also to manually assign worker threads and communication threads. I may
(or may not!) have managed that with the command
charmrun namd2 ++nodelist nodelist +setcpuaffinity +p64 +pemap 0-63:16.15
+commap 15-63:16
(my reading of that mapping is sketched below). With this command, I am not
sure how I should "see" the 64 * 6 CPUs: as 6 identical systems (so use
+p64), or as an aggregate of 384 CPUs (so use +p384)? I did try +p384 but it
seemed to be even slower - way too many threads were spawned.
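
If I read the Charm++ +pemap range syntax (start-end:stride.run) correctly,
those maps would expand on one node roughly as

+pemap 0-63:16.15 -> worker threads on CPUs 0-14, 16-30, 32-46, 48-62
+commap 15-63:16  -> communication threads on CPUs 15, 31, 47, 63

i.e. 15 worker threads plus 1 communication thread per block of 16 CPU IDs,
but I am not certain this matches the actual core numbering on these nodes.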

So I am fuzzy. Why do I get the lowest processing time when only 1/3 of the
384 CPUs are used and no manual settings are in place? Are charmrun and namd2
clever enough in this version (2.10) to assign worker and comm threads
automagically?

Is there some other parameter you would suggest I add? At the very least,
using only 1/3 of the CPUs seems very odd.

Thank you for your time and input.
