Re: AW: NAMD and NUMA

From: Joseph Farran (jfarran_at_uci.edu)
Date: Thu May 08 2014 - 11:28:46 CDT

Next message: Joseph Farran: "Re: AW: NAMD and NUMA"
Previous message: Norman Geist: "AW: TclForces wrapmode FATAL ERROR: Setting parameter wrapmode from script failed"
In reply to: Norman Geist: "AW: NAMD and NUMA"
Next in thread: Norman Geist: "AW: AW: NAMD and NUMA"
Reply: Norman Geist: "AW: AW: NAMD and NUMA"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ] [ attachment ]

Hi Norman.

The node the job jumped to is identical to the original one. These are
64-core nodes with 512 GB of memory and the exact same CPU.

Yes, the job remains slow after it jumps nodes, regardless of how long
it runs on the new node.
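
To put a number on the slowdown from the TIMING lines quoted below, here
is a minimal sketch in Python. The function names and the choice of step
18000 as the first sample after the jump are assumptions made for
illustration; this is not part of NAMD.

```python
import re

# Captures the step number and the wall-clock seconds per step
# from a NAMD "TIMING:" log line.
TIMING_RE = re.compile(
    r"TIMING:\s+(\d+)\s+CPU:\s+[\d.]+,\s+[\d.]+/step\s+"
    r"Wall:\s+[\d.]+,\s+([\d.]+)/step"
)

def step_times(log_text):
    """Return {step: wall seconds per step} for each TIMING line."""
    return {int(m.group(1)): float(m.group(2))
            for m in TIMING_RE.finditer(log_text)}

def slowdown(log_text, jump_step):
    """Ratio of mean s/step at or after jump_step to mean s/step before it."""
    times = step_times(log_text)
    before = [t for s, t in times.items() if s < jump_step]
    after = [t for s, t in times.items() if s >= jump_step]
    return (sum(after) / len(after)) / (sum(before) / len(before))

# The four TIMING lines from the original report, unwrapped.
LOG = """\
TIMING: 16000 CPU: 668.71, 0.0411388/step Wall: 668.71, 0.0411388/step, 5.53088 hours remaining, 4338.894531 MB of memory in use.
TIMING: 17000 CPU: 710.398, 0.0416875/step Wall: 710.398, 0.0416875/step, 5.59307 hours remaining, 4338.894531 MB of memory in use.
TIMING: 18000 CPU: 817.05, 0.106652/step Wall: 817.05, 0.106652/step, 14.2795 hours remaining, 4338.894531 MB of memory in use.
TIMING: 19000 CPU: 943.168, 0.126118/step Wall: 943.168, 0.126118/step, 16.8507 hours remaining, 4338.894531 MB of memory in use.
"""

print(f"slowdown: {slowdown(LOG, 18000):.2f}x")
```

On these four samples the ratio comes out at about 2.8x, which matches
the 2x to 3x range reported.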

These are AMD Bulldozer nodes and so no HyperThreading is involved.

Running htop on both the old and the new node shows the same CPU usage:
all 64 cores in use.
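
htop shows CPU usage but not where the job's memory pages live. A
minimal sketch for checking that on Linux, assuming the kernel exposes
/proc/<pid>/numa_maps, is to sum the per-node page counts (the N0=..,
N1=.. fields) for a process. The helper names and the sample text are
made up for illustration and are not output from these nodes.

```python
import re
from collections import Counter

# A numa_maps field such as "N1=500" means 500 pages on NUMA node 1.
NODE_FIELD = re.compile(r"N(\d+)=(\d+)")

def pages_per_node(numa_maps_text):
    """Sum page counts per NUMA node over all mappings of one process."""
    totals = Counter()
    for line in numa_maps_text.splitlines():
        for token in line.split():
            m = NODE_FIELD.fullmatch(token)
            if m:
                totals[int(m.group(1))] += int(m.group(2))
    return dict(totals)

def read_numa_maps(pid):
    """Read /proc/<pid>/numa_maps (Linux only, needs permission)."""
    with open(f"/proc/{pid}/numa_maps") as fh:
        return fh.read()

# Hypothetical sample in the numa_maps format, not real data.
SAMPLE = (
    "00400000 default file=/usr/bin/namd2 mapped=100 N0=60 N1=40 kernelpagesize_kB=4\n"
    "7f0000000000 default anon=500 dirty=500 N1=500 kernelpagesize_kB=4\n"
)

print(pages_per_node(SAMPLE))
```

Comparing pages_per_node(read_numa_maps(pid)) for a NAMD process before
and after the restart would show whether the pages ended up on
different NUMA nodes than the cores using them. The counts are in pages,
so mappings with different page sizes are not directly comparable.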

Best,
Joseph

On 05/07/2014 11:53 PM, Norman Geist wrote:
> Hi Joseph,
>
> Are you sure that the node the job jumped to isn't just slower? Or were
> there interfering jobs on that node, maybe?
>
> Otherwise, regarding NUMA, the OS should learn which data to store in the
> caches, and the performance should therefore rise again after some time. I
> can't imagine that memory allocation influences the performance that much,
> as NAMD isn't that memory bound. Does the bad performance remain for as
> long as the simulation continues?
>
> Also, if your nodes have HyperThreading enabled, you might want to check
> that your job is actually using "real" cores, so that its processes don't
> share physical cores. (This would usually show up as largely fluctuating
> step times while processes jump between cores.)
>
> Norman Geist.
>
>> -----Original Message-----
>> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
>> Behalf Of Joseph Farran
>> Sent: Wednesday, May 7, 2014 20:55
>> To: namd-l_at_ks.uiuc.edu
>> Subject: namd-l: NAMD and NUMA
>>
>> Hi All / NAMD support.
>>
>> We are running NAMD 2.9 on CentOS 6.5 with Berkeley checkpoint, and
>> jobs checkpoint and start up just fine. However, when the job restarts
>> on another node, the time to finish increases 2x to 3x:
>>
>> TIMING: 16000 CPU: 668.71, 0.0411388/step Wall: 668.71, 0.0411388/step, 5.53088 hours remaining, 4338.894531 MB of memory in use.
>> TIMING: 17000 CPU: 710.398, 0.0416875/step Wall: 710.398, 0.0416875/step, 5.59307 hours remaining, 4338.894531 MB of memory in use.
>>
>> <job jumped nodes>
>>
>> TIMING: 18000 CPU: 817.05, 0.106652/step Wall: 817.05, 0.106652/step, 14.2795 hours remaining, 4338.894531 MB of memory in use.
>> TIMING: 19000 CPU: 943.168, 0.126118/step Wall: 943.168, 0.126118/step, 16.8507 hours remaining, 4338.894531 MB of memory in use.
>>
>> The issue seems to be with memory allocation. When the job restarts
>> on a different but similar node, the memory allocation is lost.
>>
>> Does anyone know how to save the current memory allocation and restore
>> it with Linux numactl?
>>
>> Thanks,
>> Joseph
>
>


This archive was generated by hypermail 2.1.6 : Wed Dec 31 2014 - 23:22:23 CST