Re: 50% system CPU usage when parallel running NAMD on Rocks cluster

From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Wed Dec 18 2013 - 07:04:24 CST

On Wed, Dec 18, 2013 at 1:43 PM, 周昵昀 <malrot13_at_gmail.com> wrote:
> Many thanks, guys. I'll keep trying everything I can.
>
> I compiled SMP version of NAMD. Charm was compiled through the following
> command:
> "./build charm++ mpi-linux-x86_64 mpicxx smp -j16 --with-production"
> NAMD was compiled with the SMP version of charm++. I tested it with the same
> system (100,000 atoms). The result, as predicted, showed some improvement. :D
> I ran NAMD with the following command:
> "mpirun -np {number of nodes}
> /apps/apps/namd/2.9/Linux-x86_64-g++-SMP/namd2 +ppn {number of cores for
> computing on each node} {configuration file} > {log file}"
> 1 node, 1 communication CPU

you may be better off using one MPI task per socket and configuring CPU
affinity; that way you get better cache utilization.
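as a sketch (assuming OpenMPI 1.7 or later for the binding flags, and two
8-core sockets per node; adjust the task and thread counts to your hardware):

  # one MPI task per socket, pinned to its socket, 7 worker threads each;
  # the remaining core of each socket serves the SMP communication thread
  mpirun -np {2 x number of nodes} --map-by ppr:1:socket --bind-to socket \
      /apps/apps/namd/2.9/Linux-x86_64-g++-SMP/namd2 +ppn 7 \
      {configuration file} > {log file}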

[...]

> This SMP version is based on MPI. Next, I'll try the SMP version based on
> charm++. BTW, do all gigabit switches have the same switching latency? If
> yes, that may be the reason why there is no performance change at all after
> I switched to a new gigabit switch.

the latency you observe primarily originates in TCP/IP networking.
thus even if you had an infiniband network but used IP-over-IB, you
would still incur those high latencies (up to 1000x as large). the CPU
performance of the switch contributes as well, but only very little.
it becomes most visible when you run a lot of communication across all
nodes, and it mostly manifests itself as a loss of bandwidth, not so
much latency.
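if you want to put a number on that, a quick sketch (assuming the qperf
tool is installed on two of your nodes; osu_latency from the OSU
micro-benchmarks would work as well):

  # on node1: start the qperf server
  qperf
  # on node2: measure TCP round-trip latency through the switch
  qperf node1 tcp_lat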

infiniband communication (or myrinet, quadrics, etc...) has such low
latencies because it bypasses the entire TCP/IP layer; the low-level
encoding is the same. there are some MPI stacks available that allow
TCP bypassing to reduce latency, but those require support from the
hardware, the switch, and the MPI layer. at that point, i would rather
recommend buying second-hand infiniband (SDR or DDR) equipment. while
it is not competitive with current (QDR or FDR) hardware, it still
offers a massive latency reduction over TCP/IP networking. the benefit
of the lower latency of QDR/FDR primarily shows up in very large
clusters.

one more recommendation would be to benchmark with both an even
smaller system (say 10,000 atoms) and a larger system (say 1,000,000
atoms), so you have a better assessment of how many atoms per CPU core
your cluster and network can handle effectively.
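a minimal sketch of such a scan (bench_10k.namd here is a hypothetical
input file for the small system; with 16-core nodes one core per node is
left for the communication thread, and NAMD prints "Benchmark time:"
lines in the log that you can grep for):

  for nodes in 1 2 4 8; do
    mpirun -np $nodes /apps/apps/namd/2.9/Linux-x86_64-g++-SMP/namd2 +ppn 15 \
        bench_10k.namd > bench_10k_${nodes}nodes.log
    grep "Benchmark time" bench_10k_${nodes}nodes.log
  done

then repeat the same loop with the 1,000,000-atom input and compare how
the time per step scales with node count for each system size.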

axel.

>
>
> Neil Zhou
>
> 2013/12/18 Norman Geist <norman.geist_at_uni-greifswald.de>
>>
>> Hi again,
>>
>> I will look through the settings you posted when I get some time. A quick
>> look didn't show anything obvious. So what I already posted earlier and
>> what Axel also suggests:
>>
>> > but notice that 16 cores per node is really heavy for 1 Gbit/s Ethernet,
>> > and you might want to consider spending some money on an HPC network like
>> > Infiniband or at least 10 Gbit/s Ethernet.
>>
>> could become the painful truth. You should also really try using an SMP
>> binary to reduce the number of network communicators, as Axel posted. You
>> might notice slower timings for single-node cases, but maybe an improvement
>> for multiple nodes.
>>
>> What you see regarding system CPU usage is the default "idlepoll" behavior
>> of the MPI you used formerly (I guess OpenMPI), which improves latency. You
>> can and should reproduce that behavior by adding +idlepoll to namd2 when
>> using it with charmrun. This usually makes a great difference.
>>
>> Norman Geist.
>>
>

-- 
Dr. Axel Kohlmeyer  akohlmey_at_gmail.com  http://goo.gl/1wk0
College of Science & Technology, Temple University, Philadelphia PA, USA
International Centre for Theoretical Physics, Trieste, Italy.
