From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Thu May 08 2014 - 01:53:12 CDT
Hi Joseph,
Are you sure that the node the job jumped to isn't just slower? Or were
there perhaps interfering jobs on that node?
Otherwise, regarding NUMA, the OS should learn which data to keep in the
caches, so the performance should rise again after some time. I can't
imagine that memory allocation influences the performance that much, as
NAMD isn't that memory-bound. Does the bad performance persist for as long
as the simulation continues?
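To answer that question from the log itself, one could track the wall-clock time per step across the restart. The following is only a minimal sketch: it assumes the TIMING lines are unwrapped and have exactly the layout quoted below, and the names `step_times` and `slowdown` are made up for illustration.

```python
import re

# Captures the step number and the wall-clock seconds per step of a TIMING line.
TIMING_RE = re.compile(
    r"TIMING:\s+(\d+)\s+CPU:\s+[\d.]+,\s+[\d.]+/step"
    r"\s+Wall:\s+[\d.]+,\s+([\d.]+)/step"
)

def step_times(log_text):
    """Return a list of (step, wall_seconds_per_step) found in the log text."""
    return [(int(step), float(sec)) for step, sec in TIMING_RE.findall(log_text)]

def slowdown(times, baseline_count=2):
    """Relate each later step time to the mean of the first baseline_count entries."""
    if len(times) < baseline_count:
        raise ValueError("not enough TIMING lines for a baseline")
    baseline = sum(sec for _, sec in times[:baseline_count]) / baseline_count
    return [(step, sec / baseline) for step, sec in times[baseline_count:]]

# The four TIMING lines from the quoted message, unwrapped.
sample = """\
TIMING: 16000  CPU: 668.71, 0.0411388/step  Wall: 668.71, 0.0411388/step, 5.53088 hours remaining, 4338.894531 MB of memory in use.
TIMING: 17000  CPU: 710.398, 0.0416875/step  Wall: 710.398, 0.0416875/step, 5.59307 hours remaining, 4338.894531 MB of memory in use.
TIMING: 18000  CPU: 817.05, 0.106652/step  Wall: 817.05, 0.106652/step, 14.2795 hours remaining, 4338.894531 MB of memory in use.
TIMING: 19000  CPU: 943.168, 0.126118/step  Wall: 943.168, 0.126118/step, 16.8507 hours remaining, 4338.894531 MB of memory in use.
"""

for step, ratio in slowdown(step_times(sample)):
    print(f"step {step}: {ratio:.2f}x baseline")
```

If the ratio drifts back towards 1 over the following few thousand steps, a warm-up effect is the more likely explanation; if it stays flat, the new node or its placement is the more likely suspect.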
Also, if your nodes have HyperThreading enabled, you might want to check
whether your job is actually using "real" cores, i.e. that its processes
don't share physical cores. (Sharing would usually show up as strongly
fluctuating step times while processes jump between cores.)
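One way to check this on Linux is to compare the CPUs a process is allowed to run on with the hyperthread sibling lists the kernel exposes in sysfs. This is a rough sketch, not a NAMD feature: the function names are invented here, and it relies on the Linux-only `os.sched_getaffinity` call and the `thread_siblings_list` files being available.

```python
import glob
import os
import re

def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0,8' or '0-3,8-11' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

def shared_cores(allowed, siblings):
    """Return the sibling groups in which more than one logical CPU is allowed.

    allowed:  set of logical CPU ids the process may run on
    siblings: dict mapping a logical CPU id to the set of ids on the same core
    """
    groups = {frozenset(siblings[cpu]) for cpu in allowed if cpu in siblings}
    return sorted(sorted(g & allowed) for g in groups if len(g & allowed) > 1)

def read_siblings():
    """Read the hyperthread sibling list of every logical CPU from sysfs."""
    result = {}
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list"
    for path in glob.glob(pattern):
        cpu = int(re.search(r"cpu(\d+)/topology", path).group(1))
        with open(path) as fh:
            result[cpu] = parse_cpu_list(fh.read())
    return result

def check_process(pid=0):
    """List physical cores on which the given process may use several threads.

    pid 0 means the calling process (Linux only).
    """
    return shared_cores(set(os.sched_getaffinity(pid)), read_siblings())
```

Calling `check_process(pid)` with the PID of a NAMD process returns an empty list when each allowed logical CPU sits on its own physical core. Note that an unpinned process is allowed on every CPU, so every core is then reported; that only means the scheduler is free to co-locate processes, not that it does so.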
Norman Geist.
> -----Original Message-----
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> Behalf Of Joseph Farran
> Sent: Wednesday, May 7, 2014 20:55
> To: namd-l_at_ks.uiuc.edu
> Subject: namd-l: NAMD and NUMA
> 
> Hi All / NAMD support.
> 
> We are running NAMD 2.9 on CentOS 6.5 with Berkeley checkpoint, and jobs
> checkpoint and start up just fine. However, when the job restarts on
> another node, the time to finish increases 2x to 3x:
> 
> TIMING: 16000  CPU: 668.71, 0.0411388/step  Wall: 668.71, 0.0411388/step, 5.53088 hours remaining, 4338.894531 MB of memory in use.
> TIMING: 17000  CPU: 710.398, 0.0416875/step  Wall: 710.398, 0.0416875/step, 5.59307 hours remaining, 4338.894531 MB of memory in use.
> 
> <job jumped nodes>
> 
> TIMING: 18000  CPU: 817.05, 0.106652/step  Wall: 817.05, 0.106652/step, 14.2795 hours remaining, 4338.894531 MB of memory in use.
> TIMING: 19000  CPU: 943.168, 0.126118/step  Wall: 943.168, 0.126118/step, 16.8507 hours remaining, 4338.894531 MB of memory in use.
> 
> The issue seems to be with memory allocation. When the job restarts
> on a different but similar node, the memory allocation is lost.
> 
> Does anyone know how to save the current memory allocation and
> restore it with Linux numactl?
> 
> Thanks,
> Joseph
This archive was generated by hypermail 2.1.6 : Thu Dec 31 2015 - 23:20:46 CST