Re: 50% system CPU usage when parallel running NAMD on Rocks cluster

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Wed Dec 18 2013 - 02:06:09 CST

Hi again,

I will look through the settings you posted when I get some time. A quick look didn't show anything obvious. So what I already posted earlier, and what Axel also suggests:

>but notice that 16 cores per node is really heavy for 1 Gbit/s Ethernet, and you might want to consider spending some money on an HPC network like InfiniBand or at least 10 Gbit/s Ethernet.

could become the painful truth. You should also really try using an SMP binary to reduce the number of network communicators, as Axel also posted; see the sketch below. You might notice slower timings for single-node cases, but maybe an improvement for multiple nodes.
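
A minimal launch sketch for an SMP net build (the node list, core counts and config name are placeholders, and the exact flag spelling can differ between charm++ versions, so check the notes.txt shipped with your NAMD build):

# assumed: 2 nodes x 16 cores listed in ./nodelist; one process per node,
# 15 worker threads each, leaving one core per node for the communication thread
charmrun ++nodelist ./nodelist +p30 ++ppn 15 ./namd2 myrun.namd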

What you see regarding system CPU is the default "idle polling" behavior of the MPI you used before (I guess Open MPI), which improves latency. You can and should reproduce that behavior by adding +idlepoll to the namd2 command line when running it with charmrun, for example as sketched below. This usually makes a great difference.
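
For the plain (non-SMP) net build the call would look roughly like this; ./nodelist and myrun.namd are placeholders:

# assumed: 2 nodes x 16 cores = 32 PEs, with idle polling enabled
charmrun ++nodelist ./nodelist +p32 ./namd2 +idlepoll myrun.namd

The same +idlepoll flag can be appended to the SMP launch line above.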

Norman Geist.

> -----Original Message-----
> From: Axel Kohlmeyer [mailto:akohlmey_at_gmail.com]
> Sent: Wednesday, December 18, 2013 08:29
> To: 周昵昀
> Cc: Norman Geist; Namd Mailing List
> Subject: Re: namd-l: 50% system CPU usage when parallel running NAMD on Rocks cluster
>
> guys,
>
> i think you are on a wild goose chase here. please look at the numbers
> first.
>
> you have a very fast CPU and an interconnect with a rather high
> latency. yet the test system has "only" 100,000 atoms. to assess
> parallel scaling performance, you have to consider two components:
> - the more processor cores you use, the smaller the number of work
> units per processor is
> - the more processor cores you use, the more communication messages
> you need to send
>
> each message will *add* to the overall time based on latency (constant
> amount of time per message) and bandwidth (added amount depends on the
> chunk of data sent). so the more processors you use, the more overhead
> you create. for systems with a very large number of atoms and few
> processor cores this will primarily be due to bandwidth, for a smaller
> number of atoms and more processors this will primarily be due to
> latency.
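>
> a rough back-of-envelope model of that per-message cost (the latency,
> bandwidth, message-size and message-count numbers below are assumptions
> for illustration, not measurements from this cluster):
>
> awk 'BEGIN {
>   alpha = 50e-6;   # assumed per-message latency over GigE [s]
>   beta  = 125e6;   # assumed usable GigE bandwidth [bytes/s]
>   m     = 8192;    # assumed average message size [bytes]
>   nmsg  = 500;     # assumed messages per node per step
>   printf "comm cost per step ~ %.3f s\n", nmsg * (alpha + m / beta);
> }'
>
> with those assumed numbers the communication cost alone already comes
> out near 0.06 s per step, i.e. the same order as the benchmark step
> times quoted further down; as the core count grows the messages get
> smaller but more numerous, so the fixed latency term takes over.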
>
> now a lot of the good performance of NAMD is due to the fact that it
> can "hide" the cost of communication behind doing computation (which
> is different from many other MD codes and mostly due to using the
> charm++ library), but that goes only so far and then it will quickly
> become very bad (unlike other MD codes, which don't suffer as much). so
> for this kind of setup and a rather small system, i would say getting
> decent scaling to two nodes (32 processors) is quite good, but
> expecting this to go much further is neglecting the fundamental
> limitations of the hardware and the parallelization strategy in NAMD.
> you can tweak it, but only up to a point.
>
> what might be worth investigating would be the impact of using an SMP
> executable vs. a regular TCP/UDP/MPI-only binary, but i would not get
> my hopes up too high. with < 3000 atoms per CPU core, you don't have a
> lot of work to hide communication behind. so if you seriously need to
> go below that and scale to more nodes, you need to invest a lot of
> money in a low-latency interconnect.
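>
> a quick check of the atoms-per-core numbers for this ~100,000-atom
> system (plain arithmetic on the figures quoted in this thread, no new
> data):
>
> awk 'BEGIN { for (p = 16; p <= 64; p += 16) printf "%2d cores: %5.0f atoms/core\n", p, 100000 / p }'
>
> which falls below the ~3000 atoms per core mentioned above somewhere
> between 32 and 48 cores.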
>
> axel.
>
>
> On Tue, Dec 17, 2013 at 3:12 PM, 周昵昀 <malrot13_at_gmail.com> wrote:
> > [root_at_c1 ~]# ifconfig
> > eth1 Link encap:Ethernet HWaddr F8:0F:41:F8:51:B2
> > inet addr:10.1.255.247 Bcast:10.1.255.255 Mask:255.255.0.0
> > UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
> > RX packets:324575235 errors:0 dropped:0 overruns:0 frame:0
> > TX packets:309132521 errors:0 dropped:0 overruns:0 carrier:0
> > collisions:0 txqueuelen:1000
> > RX bytes:471233795133 (438.8 GiB) TX bytes:472407792256 (439.9 GiB)
> > Memory:dfd20000-dfd40000
> >
> > lo Link encap:Local Loopback
> > inet addr:127.0.0.1 Mask:255.0.0.0
> > UP LOOPBACK RUNNING MTU:16436 Metric:1
> > RX packets:44327434 errors:0 dropped:0 overruns:0 frame:0
> > TX packets:44327434 errors:0 dropped:0 overruns:0 carrier:0
> > collisions:0 txqueuelen:0
> > RX bytes:242299288622 (225.6 GiB) TX bytes:242299288622 (225.6 GiB)
> >
> > The MTU has been changed for some tests, but it shows no difference
> > between 1500 and 9000. Normally, the MTU is 1500.
> >
> > [root_at_c1 ~]# ethtool eth1
> > Settings for eth1:
> > Supported ports: [ TP ]
> > Supported link modes: 10baseT/Half 10baseT/Full
> > 100baseT/Half 100baseT/Full
> > 1000baseT/Full
> > Supports auto-negotiation: Yes
> > Advertised link modes: 10baseT/Half 10baseT/Full
> > 100baseT/Half 100baseT/Full
> > 1000baseT/Full
> > Advertised auto-negotiation: Yes
> > Speed: 1000Mb/s
> > Duplex: Full
> > Port: Twisted Pair
> > PHYAD: 1
> > Transceiver: internal
> > Auto-negotiation: on
> > Supports Wake-on: pumbg
> > Wake-on: g
> > Current message level: 0x00000003 (3)
> > Link detected: yes
> >
> > [root_at_c1 ~]# ethtool -k eth1
> > Offload parameters for eth1:
> > Cannot get device udp large send offload settings: Operation not supported
> > rx-checksumming: on
> > tx-checksumming: on
> > scatter-gather: on
> > tcp segmentation offload: on
> > udp fragmentation offload: off
> > generic segmentation offload: off
> > generic-receive-offload: on
> >
> > [root_at_c1 ~]# ethtool -c eth1
> > Coalesce parameters for eth1:
> > Adaptive RX: off TX: off
> > stats-block-usecs: 0
> > sample-interval: 0
> > pkt-rate-low: 0
> > pkt-rate-high: 0
> >
> > rx-usecs: 3
> > rx-frames: 0
> > rx-usecs-irq: 0
> > rx-frames-irq: 0
> >
> > tx-usecs: 3
> > tx-frames: 0
> > tx-usecs-irq: 0
> > tx-frames-irq: 0
> >
> > rx-usecs-low: 0
> > rx-frame-low: 0
> > tx-usecs-low: 0
> > tx-frame-low: 0
> >
> > rx-usecs-high: 0
> > rx-frame-high: 0
> > tx-usecs-high: 0
> > tx-frame-high: 0
> >
> >
> > [root_at_c1 ~]# sysctl -a | grep tcp
> > sunrpc.tcp_slot_table_entries = 16
> > net.ipv4.netfilter.ip_conntrack_tcp_max_retrans = 3
> > net.ipv4.netfilter.ip_conntrack_tcp_be_liberal = 0
> > net.ipv4.netfilter.ip_conntrack_tcp_loose = 1
> > net.ipv4.netfilter.ip_conntrack_tcp_timeout_max_retrans = 300
> > net.ipv4.netfilter.ip_conntrack_tcp_timeout_close = 10
> > net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait = 120
> > net.ipv4.netfilter.ip_conntrack_tcp_timeout_last_ack = 30
> > net.ipv4.netfilter.ip_conntrack_tcp_timeout_close_wait = 60
> > net.ipv4.netfilter.ip_conntrack_tcp_timeout_fin_wait = 120
> > net.ipv4.netfilter.ip_conntrack_tcp_timeout_established = 432000
> > net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_recv = 60
> > net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_sent = 120
> > net.ipv4.tcp_slow_start_after_idle = 1
> > net.ipv4.tcp_dma_copybreak = 4096
> > net.ipv4.tcp_workaround_signed_windows = 0
> > net.ipv4.tcp_base_mss = 512
> > net.ipv4.tcp_mtu_probing = 0
> > net.ipv4.tcp_abc = 0
> > net.ipv4.tcp_congestion_control = highspeed
> > net.ipv4.tcp_tso_win_divisor = 3
> > net.ipv4.tcp_moderate_rcvbuf = 1
> > net.ipv4.tcp_no_metrics_save = 0
> > net.ipv4.tcp_low_latency = 0
> > net.ipv4.tcp_frto = 0
> > net.ipv4.tcp_tw_reuse = 0
> > net.ipv4.tcp_adv_win_scale = 2
> > net.ipv4.tcp_app_win = 31
> > net.ipv4.tcp_rmem = 4096 87380 4194304
> > net.ipv4.tcp_wmem = 4096 16384 4194304
> > net.ipv4.tcp_mem = 196608 262144 393216
> > net.ipv4.tcp_dsack = 1
> > net.ipv4.tcp_ecn = 0
> > net.ipv4.tcp_reordering = 3
> > net.ipv4.tcp_fack = 1
> > net.ipv4.tcp_orphan_retries = 0
> > net.ipv4.tcp_max_syn_backlog = 1024
> > net.ipv4.tcp_rfc1337 = 0
> > net.ipv4.tcp_stdurg = 0
> > net.ipv4.tcp_abort_on_overflow = 0
> > net.ipv4.tcp_tw_recycle = 0
> > net.ipv4.tcp_syncookies = 1
> > net.ipv4.tcp_fin_timeout = 60
> > net.ipv4.tcp_retries2 = 15
> > net.ipv4.tcp_retries1 = 3
> > net.ipv4.tcp_keepalive_intvl = 75
> > net.ipv4.tcp_keepalive_probes = 9
> > net.ipv4.tcp_keepalive_time = 7200
> > net.ipv4.tcp_max_tw_buckets = 180000
> > net.ipv4.tcp_max_orphans = 65536
> > net.ipv4.tcp_synack_retries = 5
> > net.ipv4.tcp_syn_retries = 5
> > net.ipv4.tcp_retrans_collapse = 1
> > net.ipv4.tcp_sack = 1
> > net.ipv4.tcp_window_scaling = 1
> > net.ipv4.tcp_timestamps = 1
> > fs.nfs.nfs_callback_tcpport = 0
> > fs.nfs.nlm_tcpport = 0
> >
> > The TCP congestion control algorithm has been changed with the following command:
> > "echo highspeed > /proc/sys/net/ipv4/tcp_congestion_control"
> > But it shows no obvious improvement.
> >
> > MPI version is:
> > $ mpicxx -v
> > Using built-in specs.
> > Target: x86_64-redhat-linux
> > Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
> > --infodir=/usr/share/info --enable-shared --enable-threads=posix
> > --enable-checking=release --with-system-zlib --enable-__cxa_atexit
> > --disable-libunwind-exceptions --enable-libgcj-multifile
> > --enable-languages=c,c++,objc,obj-c++,java,fortran,ada --enable-java-awt=gtk
> > --disable-dssi --disable-plugin
> > --with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic
> > --host=x86_64-redhat-linux
> > Thread model: posix
> > gcc version 4.1.2 20080704 (Red Hat 4.1.2-52)
> >
> > [root_at_c1 bin]# mpirun -V
> > mpirun (Open MPI) 1.4.3
> >
> > I used mpirun to run NAMD through SGE. Both NAMD and charm++ were
> > compiled on the front node and run on the compute nodes. I think the
> > environment is the same everywhere because of how Rocks Cluster works.
> >
> > Yesterday, I compiled the UDP version of charm++ and NAMD with gcc
> > instead of mpicxx. So I tried charmrun to run NAMD and got some
> > "interesting" benchmark data:
> >
> > 91% user CPU, 4% system CPU, 5% idle CPU
> >
> > Info: Benchmark time: 16 CPUs 0.093186 s/step 0.539271 days/ns 65.9622 MB memory
> >
> > Info: Benchmark time: 16 CPUs 0.0918341 s/step 0.531448 days/ns 66.1155 MB memory
> >
> > Info: Benchmark time: 16 CPUs 0.0898816 s/step 0.520148 days/ns 66.023 MB memory
> >
> >
> > total network speed = 200 Mb/s; 40% user CPU, 5% system CPU, 54% idle CPU
> >
> > Info: Benchmark time: 32 CPUs 0.124091 s/step 0.71812 days/ns 57.4477 MB memory
> >
> > Info: Benchmark time: 32 CPUs 0.123746 s/step 0.716121 days/ns 57.4098 MB memory
> >
> > Info: Benchmark time: 32 CPUs 0.125931 s/step 0.728767 days/ns 57.6321 MB memory
> >
> >
> > total network speed = 270 Mb/s; 28% user CPU, 5% system CPU, 66% idle CPU
> >
> > Info: Benchmark time: 48 CPUs 0.133027 s/step 0.769833 days/ns 55.1507 MB memory
> >
> > Info: Benchmark time: 48 CPUs 0.135996 s/step 0.787013 days/ns 55.2202 MB memory
> >
> > Info: Benchmark time: 48 CPUs 0.135308 s/step 0.783031 days/ns 55.2494 MB memory
> >
> >
> > total network speed = 340 Mb/s; 24% user CPU, 5% system CPU, 70% idle CPU
> >
> > Info: Benchmark time: 64 CPUs 0.137098 s/step 0.793394 days/ns 53.4818 MB memory
> >
> > Info: Benchmark time: 64 CPUs 0.138207 s/step 0.799812 days/ns 53.4665 MB memory
> >
> > Info: Benchmark time: 64 CPUs 0.137856 s/step 0.797777 days/ns 53.4743 MB memory
> >
> > There was not much system CPU usage anymore, but the idle CPU was
> > increasing as more cores were used. I guess the "high idle CPU" in the
> > UDP version has something to do with the "high system CPU usage" in
> > the MPI version.
> >
> >
> > Neil Zhou
> >
> >
> > 2013/12/16 Norman Geist <norman.geist_at_uni-greifswald.de>
> >>
> >> Additionally, what MPI are you using, or do you use charm++?
> >>
> >>
> >>
> >> Norman Geist.
> >>
> >>
>
>
>
> --
> Dr. Axel Kohlmeyer akohlmey_at_gmail.com http://goo.gl/1wk0
> College of Science & Technology, Temple University, Philadelphia PA,
> USA
> International Centre for Theoretical Physics, Trieste, Italy.

