Re: 50% system CPU usage when parallel running NAMD on Rocks cluster

From: ΦάκΗκΐ (malrot13_at_gmail.com)
Date: Tue Dec 17 2013 - 08:12:33 CST

[root_at_c1 ~]# ifconfig
eth1 Link encap:Ethernet HWaddr F8:0F:41:F8:51:B2
          inet addr:10.1.255.247 Bcast:10.1.255.255 Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST MTU:9000 Metric:1
          RX packets:324575235 errors:0 dropped:0 overruns:0 frame:0
          TX packets:309132521 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:471233795133 (438.8 GiB) TX bytes:472407792256 (439.9 GiB)
          Memory:dfd20000-dfd40000

lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          UP LOOPBACK RUNNING MTU:16436 Metric:1
          RX packets:44327434 errors:0 dropped:0 overruns:0 frame:0
          TX packets:44327434 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:242299288622 (225.6 GiB) TX bytes:242299288622 (225.6 GiB)

The MTU was changed for some tests, but it made no difference between 1500
and 9000. Normally, the MTU is 1500.
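
For reference, the MTU switch for those tests can be done on the fly with
ifconfig, roughly like this (a sketch of the commands, not necessarily the
exact ones used; eth1 is the interface shown above, and the setting does not
survive a reboot):

    ifconfig eth1 mtu 9000   # jumbo frames for the test
    ifconfig eth1 mtu 1500   # back to the default afterwards

Jumbo frames only help if every NIC and switch port on the path accepts MTU
9000; otherwise oversized frames are simply dropped.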

[root_at_c1 ~]# ethtool eth1
Settings for eth1:
        Supported ports: [ TP ]
        Supported link modes: 10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Supports auto-negotiation: Yes
        Advertised link modes: 10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
        Advertised auto-negotiation: Yes
        Speed: 1000Mb/s
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 1
        Transceiver: internal
        Auto-negotiation: on
        Supports Wake-on: pumbg
        Wake-on: g
        Current message level: 0x00000003 (3)
        Link detected: yes

[root_at_c1 ~]# ethtool -k eth1
Offload parameters for eth1:
Cannot get device udp large send offload settings: Operation not supported
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off
generic-receive-offload: on

[root_at_c1 ~]# ethtool -c eth1
Coalesce parameters for eth1:
Adaptive RX: off TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 3
rx-frames: 0
rx-usecs-irq: 0
rx-frames-irq: 0

tx-usecs: 3
tx-frames: 0
tx-usecs-irq: 0
tx-frames-irq: 0

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

[root_at_c1 ~]# sysctl -a | grep tcp
sunrpc.tcp_slot_table_entries = 16
net.ipv4.netfilter.ip_conntrack_tcp_max_retrans = 3
net.ipv4.netfilter.ip_conntrack_tcp_be_liberal = 0
net.ipv4.netfilter.ip_conntrack_tcp_loose = 1
net.ipv4.netfilter.ip_conntrack_tcp_timeout_max_retrans = 300
net.ipv4.netfilter.ip_conntrack_tcp_timeout_close = 10
net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_last_ack = 30
net.ipv4.netfilter.ip_conntrack_tcp_timeout_close_wait = 60
net.ipv4.netfilter.ip_conntrack_tcp_timeout_fin_wait = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_established = 432000
net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_recv = 60
net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_sent = 120
net.ipv4.tcp_slow_start_after_idle = 1
net.ipv4.tcp_dma_copybreak = 4096
net.ipv4.tcp_workaround_signed_windows = 0
net.ipv4.tcp_base_mss = 512
net.ipv4.tcp_mtu_probing = 0
net.ipv4.tcp_abc = 0
net.ipv4.tcp_congestion_control = highspeed
net.ipv4.tcp_tso_win_divisor = 3
net.ipv4.tcp_moderate_rcvbuf = 1
net.ipv4.tcp_no_metrics_save = 0
net.ipv4.tcp_low_latency = 0
net.ipv4.tcp_frto = 0
net.ipv4.tcp_tw_reuse = 0
net.ipv4.tcp_adv_win_scale = 2
net.ipv4.tcp_app_win = 31
net.ipv4.tcp_rmem = 4096 87380 4194304
net.ipv4.tcp_wmem = 4096 16384 4194304
net.ipv4.tcp_mem = 196608 262144 393216
net.ipv4.tcp_dsack = 1
net.ipv4.tcp_ecn = 0
net.ipv4.tcp_reordering = 3
net.ipv4.tcp_fack = 1
net.ipv4.tcp_orphan_retries = 0
net.ipv4.tcp_max_syn_backlog = 1024
net.ipv4.tcp_rfc1337 = 0
net.ipv4.tcp_stdurg = 0
net.ipv4.tcp_abort_on_overflow = 0
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_fin_timeout = 60
net.ipv4.tcp_retries2 = 15
net.ipv4.tcp_retries1 = 3
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_max_tw_buckets = 180000
net.ipv4.tcp_max_orphans = 65536
net.ipv4.tcp_synack_retries = 5
net.ipv4.tcp_syn_retries = 5
net.ipv4.tcp_retrans_collapse = 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 1
fs.nfs.nfs_callback_tcpport = 0
fs.nfs.nlm_tcpport = 0

The TCP congestion control algorithm was changed with the following command:
"echo highspeed > /proc/sys/net/ipv4/tcp_congestion_control"
but it brought no obvious improvement.
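
For completeness, the same setting can be inspected and changed with sysctl; a
sketch:

    sysctl net.ipv4.tcp_available_congestion_control      # algorithms the kernel currently offers
    sysctl -w net.ipv4.tcp_congestion_control=highspeed   # same effect as the echo above

To keep it across reboots, the line
"net.ipv4.tcp_congestion_control = highspeed" would go into /etc/sysctl.conf.
The highspeed algorithm is only selectable if the tcp_highspeed module is
available (loaded or built into the kernel).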

MPI version is:
$ mpicxx -v
Using built-in specs.
Target: x86_64-redhat-linux
Configured with: ../configure --prefix=/usr --mandir=/usr/share/man
--infodir=/usr/share/info --enable-shared --enable-threads=posix
--enable-checking=release --with-system-zlib --enable-__cxa_atexit
--disable-libunwind-exceptions --enable-libgcj-multifile
--enable-languages=c,c++,objc,obj-c++,java,fortran,ada
--enable-java-awt=gtk --disable-dssi --disable-plugin
--with-java-home=/usr/lib/jvm/java-1.4.2-gcj-1.4.2.0/jre --with-cpu=generic
--host=x86_64-redhat-linux
Thread model: posix
gcc version 4.1.2 20080704 (Red Hat 4.1.2-52)

[root_at_c1 bin]# mpirun -V
mpirun (Open MPI) 1.4.3

I used mpirun to run NAMD through SGE. Both NAMD and charm++ were compiled on
the front node and run on the compute nodes. I think the environment is the
same on every node because of the way Rocks Cluster works.
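
For context, a typical SGE job script for such an mpirun launch looks
something like this (a sketch with placeholder names: the parallel environment
"orte", the slot count, and the config file are not taken from the actual
setup):

    #$ -pe orte 32
    #$ -cwd
    mpirun -np $NSLOTS namd2 config.namd > run.log

If Open MPI was built with SGE support, mpirun takes the granted hosts from
the parallel environment automatically; otherwise a hostfile has to be passed
explicitly.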

Yesterday, I compiled the UDP version of charm++ and NAMD with gcc instead of
mpicxx. So I tried charmrun to run NAMD and got some "interesting" benchmark
data, listed after the launch sketch below.
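
A charmrun launch of the net (UDP) build typically looks something like this
(a sketch with placeholder file names, not the exact command used here):

    charmrun +p16 ++nodelist ./nodelist namd2 config.namd > run.log

where ./nodelist lists the compute nodes in charm++'s nodelist format, e.g.

    group main
      host compute-0-0
      host compute-0-1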

91% user CPU, 4% system CPU, 5% idle CPU

Info: Benchmark time: 16 CPUs 0.093186 s/step 0.539271 days/ns 65.9622 MB memory

Info: Benchmark time: 16 CPUs 0.0918341 s/step 0.531448 days/ns 66.1155 MB memory

Info: Benchmark time: 16 CPUs 0.0898816 s/step 0.520148 days/ns 66.023 MB memory

total network speed = 200 Mb/s, 40% user CPU, 5% system CPU, 54% idle CPU

Info: Benchmark time: 32 CPUs 0.124091 s/step 0.71812 days/ns 57.4477 MB memory

Info: Benchmark time: 32 CPUs 0.123746 s/step 0.716121 days/ns 57.4098 MB memory

Info: Benchmark time: 32 CPUs 0.125931 s/step 0.728767 days/ns 57.6321 MB memory

total network speed = 270 Mb/s, 28% user CPU, 5% system CPU, 66% idle CPU

Info: Benchmark time: 48 CPUs 0.133027 s/step 0.769833 days/ns 55.1507 MB memory

Info: Benchmark time: 48 CPUs 0.135996 s/step 0.787013 days/ns 55.2202 MB memory

Info: Benchmark time: 48 CPUs 0.135308 s/step 0.783031 days/ns 55.2494 MB memory

total network speed = 340 Mb/s, 24% user CPU, 5% system CPU, 70% idle CPU

Info: Benchmark time: 64 CPUs 0.137098 s/step 0.793394 days/ns 53.4818 MB memory

Info: Benchmark time: 64 CPUs 0.138207 s/step 0.799812 days/ns 53.4665 MB memory

Info: Benchmark time: 64 CPUs 0.137856 s/step 0.797777 days/ns 53.4743 MB memory

There was not much system CPU usage anymore, but the idle CPU increased as
more cores were used. I guess the "high idle CPU" in the UDP version is
related to the "high system CPU usage" in the MPI version.

Neil Zhou

2013/12/16 Norman Geist <norman.geist_at_uni-greifswald.de>

> Additionally, what MPI are you using, or do you use charm++?
>
> Norman Geist.