Re: AW: AW: NAMD and NUMA

From: Kenno Vanommeslaeghe (kvanomme_at_rx.umaryland.edu)
Date: Fri May 09 2014 - 11:45:09 CDT

I'm not convinced this is true. The shared FPU on an AMD Bulldozer module
is 256 bits wide, and a single thread can only saturate it through
relatively intensive use of AVX instructions. Under more realistic
workloads, it acts as two 128-bit FPUs. Last time we benchmarked, we could
actually make NAMD run substantially faster by using all the logical
cores, though the speedup was significantly lower than the one we saw when
comparing the same number of cores on a machine with twice as many modules
(frequency scaling might also play a role there). The same could not be
said of our Intel benchmarks, where the speedup from using all the virtual
cores was nearly negligible. In fairness, it should be noted that Intel
*also* has these wide FPUs (even wider in more recent iterations) that are
shared between threads, so we ascribed the difference to more aggressive
frequency scaling on Intel's part.

So far, I've purely been talking about parallel scaling. Generally
speaking, Intel's single-core performance was better, but it scaled worse
(even before tapping the virtual cores, which is part of the reason we
suspect frequency scaling). Then again, that's just *our* benchmarks; in
the end, the moral of the story is to take *nothing* for granted until
you've benchmarked your specific workload on your specific machine.
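
For what it's worth, the comparison Norman describes below is easy to
script on a single node. A minimal sketch, assuming a 64-logical-core
Bulldozer node, a multicore (single-node) NAMD build, that the
even-numbered logical cores land on distinct modules (check with lscpu or
lstopo first, since the numbering depends on how the kernel enumerates
them), and with mysim.namd standing in for your own configuration file:

   # one worker per module: only the 32 even-numbered logical cores
   taskset -c $(seq -s, 0 2 62) namd2 +p32 mysim.namd > mod32.log
   # both logical cores of every module
   namd2 +p64 mysim.namd > all64.log
   # compare the per-step times reported in the two logs
   grep "TIMING:" mod32.log all64.log

If the 32-core run isn't clearly faster per step, module sharing probably
isn't your bottleneck.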

On 05/09/2014 03:11 AM, Norman Geist wrote:
> Ok, did you btw notice that "Bulldozer" has the same issue as HT? There are
> only 32 FPUs which can do the "most" work for NAMD, so running 64 processes
> is definitely slower than using the "right" 32 cores selected via taskset,
> for instance. You might want to try it out. It's either "charmrun/mpirun [...]
> taskset -c 0,2,4,6,8 [...] ,62 namd2 [...]" or "charmrun/mpirun [...]
> taskset -c 0,1,2,3 [...] ,31 namd2 [...]". One of those options is slow, the
> other fast, faster than using all 64 cores.
>
>
> Norman Geist.
>
>> -----Original Message-----
>> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
>> Behalf Of Joseph Farran
>> Sent: Thursday, May 8, 2014 18:29
>> To: Norman Geist
>> Cc: Namd Mailing List
>> Subject: Re: AW: namd-l: NAMD and NUMA
>>
>> Hi Norman.
>>
>> The node the job jumped to is identical to the original one. These are
>> 64-core nodes with 512 GB of memory and the exact same CPUs.
>>
>> Yes, the job remains slow after it jumps, regardless of how long it runs
>> on the new node.
>>
>> These are AMD Bulldozer nodes, so no HyperThreading is involved.
>>
>> Running htop on both the old and new nodes shows the same CPU usage:
>> all 64 cores in use.
>>
>> Best,
>> Joseph
>>
>>
>>
>>
>> On 05/07/2014 11:53 PM, Norman Geist wrote:
>>> Hi Joseph,
>>>
>>> Are you sure that the node the job jumped to isn't just slower? Or were
>>> there interfering jobs on that node, maybe?
>>>
>>> Otherwise, regarding NUMA, the OS should learn which data to store in the
>>> caches, and the performance should therefore rise again after some time. I
>>> can't imagine that memory allocation influences the performance that much,
>>> as NAMD isn't that memory bound. Does the bad performance remain as long as
>>> the simulation continues?
>>>
>>> Also, if your nodes have HyperThreading enabled, you might want to check
>>> whether your job is actually using "real" cores, i.e. not sharing physical
>>> cores. (This would usually show up as strongly fluctuating step times while
>>> processes jump between cores.)
>>>
>>> Norman Geist.
>>>
>>>> -----Original Message-----
>>>> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
>>>> Behalf Of Joseph Farran
>>>> Sent: Wednesday, May 7, 2014 20:55
>>>> To: namd-l_at_ks.uiuc.edu
>>>> Subject: namd-l: NAMD and NUMA
>>>>
>>>> Hi All / NAMD support.
>>>>
>>>> We are running NAMD 2.9 on CentOS 6.5 with Berkeley checkpointing. Jobs
>>>> checkpoint and start up just fine; however, when a job restarts on
>>>> another node, the time to finish increases 2x to 3x:
>>>>
>>>> TIMING: 16000 CPU: 668.71, 0.0411388/step Wall: 668.71,
>>>> 0.0411388/step, 5.53088 hours remaining, 4338.894531 MB of memory in
>>>> use.
>>>> TIMING: 17000 CPU: 710.398, 0.0416875/step Wall: 710.398,
>>>> 0.0416875/step, 5.59307 hours remaining, 4338.894531 MB of memory in
>>>> use.
>>>>
>>>> <job jumped nodes>
>>>>
>>>> TIMING: 18000 CPU: 817.05, 0.106652/step Wall: 817.05, 0.106652/step,
>>>> 14.2795 hours remaining, 4338.894531 MB of memory in use.
>>>> TIMING: 19000 CPU: 943.168, 0.126118/step Wall: 943.168,
>>>> 0.126118/step, 16.8507 hours remaining, 4338.894531 MB of memory in
>>>> use.
>>>>
>>>> The issue seems to be with memory allocation: when the job restarts on a
>>>> different but similar node, the memory allocation is lost.
>>>>
>>>> Does anyone know how to save the current memory allocation and restore
>>>> it with Linux numactl?
>>>>
>>>> Thanks,
>>>> Joseph

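On the numactl question in the quoted thread above: as far as I know,
numactl sets a NUMA policy when a process is launched rather than saving
or restoring an existing allocation, so after a restart the practical
options are to inspect the placement and relaunch with an explicit policy.
A rough sketch, assuming the numactl tools are installed and with
mysim.namd again standing in for the configuration file:

   numactl --hardware      # list NUMA nodes and free memory per node
   numastat -p namd2       # per-node memory usage of a running namd2 process
   # relaunch with an explicit policy instead of trying to restore the old
   # layout, e.g. interleaving pages across all nodes:
   numactl --interleave=all namd2 +p64 mysim.namd > run.log

Whether interleaving (or binding to specific nodes) actually helps will
again depend on the workload, so benchmark it.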