Re: AW: NAMD and NUMA

From: Joseph Farran (jfarran_at_uci.edu)
Date: Thu May 08 2014 - 11:33:25 CDT

Paul @ BLCR believes this is a memory management issue.

The thread is here:

https://hpcrdm.lbl.gov/pipermail/checkpoint/2014-April/000972.html

Joseph

On 05/07/2014 11:53 PM, Norman Geist wrote:
> Hi Joseph,
>
> Are you sure that the node the job jumped to isn't just slower? Or were
> there interfering jobs on that node, maybe?
>
> Otherwise, regarding NUMA, the OS should learn which data to store in the
> caches, and the performance should therefore rise again after some time. I
> can't imagine that memory allocation influences the performance that much, as
> NAMD isn't that memory bound. Does the bad performance remain for as long as
> the simulation continues?
>
> Also, if your nodes have HyperThreading enabled, you might want to check
> whether your job is actually using "real" cores, i.e. that its processes do
> not share physical cores. (Sharing would usually show up as largely
> fluctuating step times while processes jump between cores.)
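
One way to make the HyperThreading check concrete is sketched below. It is only a sketch under assumptions: it assumes a Linux node that exposes the usual sysfs file /sys/devices/system/cpu/cpuN/topology/thread_siblings_list, and the function names (parse_cpu_list, read_siblings, doubled_up_cores, check_current_process) are made up for illustration. It reports which sibling groups have more than one hardware thread in a process's allowed CPU set; it does not show where threads actually ran.

```python
import os

def parse_cpu_list(text):
    """Parse a Linux CPU list such as '0,12' or '0-3,8' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

def read_siblings(cpu):
    """Read the hardware threads sharing a physical core with `cpu` (Linux sysfs)."""
    path = f"/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list"
    with open(path) as fh:
        return parse_cpu_list(fh.read())

def doubled_up_cores(allowed, siblings_of=read_siblings):
    """Return sibling groups in which more than one hardware thread is allowed."""
    allowed = set(allowed)
    groups = {frozenset(siblings_of(cpu)) for cpu in allowed}
    return sorted(sorted(g & allowed) for g in groups if len(g & allowed) > 1)

def check_current_process():
    """Check the calling process's affinity mask (Linux only)."""
    return doubled_up_cores(os.sched_getaffinity(0))
```

An empty result means every allowed CPU sits on its own physical core. A process that is not pinned at all is normally allowed on every CPU, so on a HyperThreading node it would list every sibling pair; the check is only informative for pinned jobs.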
>
> Norman Geist.
>
>> -----Original Message-----
>> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
>> Behalf Of Joseph Farran
>> Sent: Wednesday, May 7, 2014 20:55
>> To: namd-l_at_ks.uiuc.edu
>> Subject: namd-l: NAMD and NUMA
>>
>> Hi All / NAMD support.
>>
>> We are running NAMD 2.9 on CentOS 6.5 with Berkeley checkpoint, and jobs
>> checkpoint and start up just fine. However, when a job restarts on
>> another node, the time to finish increases 2x to 3x:
>>
>> TIMING: 16000 CPU: 668.71, 0.0411388/step Wall: 668.71,
>> 0.0411388/step, 5.53088 hours remaining, 4338.894531 MB of memory in
>> use.
>> TIMING: 17000 CPU: 710.398, 0.0416875/step Wall: 710.398,
>> 0.0416875/step, 5.59307 hours remaining, 4338.894531 MB of memory in
>> use.
>>
>> <job jumped nodes>
>>
>> TIMING: 18000 CPU: 817.05, 0.106652/step Wall: 817.05, 0.106652/step,
>> 14.2795 hours remaining, 4338.894531 MB of memory in use.
>> TIMING: 19000 CPU: 943.168, 0.126118/step Wall: 943.168,
>> 0.126118/step, 16.8507 hours remaining, 4338.894531 MB of memory in
>> use.
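
To put a number on the slowdown, the per-step wall times in the TIMING lines above can be compared before and after the node jump. The following is a minimal sketch, not part of NAMD: the helper names (wall_per_step, slowdown) are hypothetical, and it assumes the log text is passed in without the ">>" quote prefixes.

```python
import re

# Matches the step number and the per-step wall time of a NAMD "TIMING:" line,
# e.g. "TIMING: 16000 ... Wall: 668.71, 0.0411388/step".
# re.S lets the match span a line that was wrapped in the log.
WALL_RE = re.compile(r"TIMING:\s*(\d+)\s.*?Wall:\s*[\d.]+,\s*([\d.]+)/step", re.S)

def wall_per_step(log_text):
    """Return {step: wall seconds per step} for every TIMING line found."""
    return {int(step): float(sec) for step, sec in WALL_RE.findall(log_text)}

def slowdown(log_text, jump_step):
    """Ratio of mean per-step wall time at/after jump_step to the mean before it."""
    times = wall_per_step(log_text)
    before = [t for s, t in times.items() if s < jump_step]
    after = [t for s, t in times.items() if s >= jump_step]
    if not before or not after:
        raise ValueError("need TIMING lines on both sides of jump_step")
    return (sum(after) / len(after)) / (sum(before) / len(before))

LOG = """\
TIMING: 16000 CPU: 668.71, 0.0411388/step Wall: 668.71,
0.0411388/step, 5.53088 hours remaining, 4338.894531 MB of memory in use.
TIMING: 17000 CPU: 710.398, 0.0416875/step Wall: 710.398,
0.0416875/step, 5.59307 hours remaining, 4338.894531 MB of memory in use.
TIMING: 18000 CPU: 817.05, 0.106652/step Wall: 817.05, 0.106652/step,
14.2795 hours remaining, 4338.894531 MB of memory in use.
TIMING: 19000 CPU: 943.168, 0.126118/step Wall: 943.168,
0.126118/step, 16.8507 hours remaining, 4338.894531 MB of memory in use.
"""

if __name__ == "__main__":
    print(f"slowdown after the jump: {slowdown(LOG, 18000):.2f}x")
```

For the four lines above, the ratio comes out at roughly 2.8, consistent with the reported 2x to 3x.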
>>
>> The issue seems to be with memory allocation. When the job re-starts
>> on a different but similar node, memory allocation is lost.
>>
>> Does anyone know how to save the current memory allocation and
>> restore it with Linux numactl?
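
The sketch below does not save or restore anything. It only shows one way to inspect where a process's pages currently live, so that placement before the checkpoint and after the restart can be compared. It assumes a Linux kernel with NUMA support that exposes /proc/<pid>/numa_maps, where each mapping line carries tokens such as N0=7 (7 pages on node 0). The function names are hypothetical.

```python
import re
from collections import Counter

# Per-node page counts appear in numa_maps as tokens like "N0=7" or "N1=1024".
NODE_RE = re.compile(r"\bN(\d+)=(\d+)\b")

def pages_per_node(numa_maps_text):
    """Sum the page counts per NUMA node over all mappings in the given text."""
    totals = Counter()
    for node, pages in NODE_RE.findall(numa_maps_text):
        totals[int(node)] += int(pages)
    return dict(totals)

def read_numa_maps(pid="self"):
    """Read /proc/<pid>/numa_maps (needs permission to inspect that process)."""
    with open(f"/proc/{pid}/numa_maps") as fh:
        return fh.read()
```

Calling pages_per_node(read_numa_maps(pid)) for a NAMD process on both nodes would show whether the restarted job ends up with its pages concentrated on one NUMA node. The counts are in pages, and mappings may use different page sizes, so they are a rough indicator only.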
>>
>> Thanks,
>> Joseph
>
>
