Re: Re: NAMD and NUMA

From: Norman Geist (norman.geist_at_uni-greifswald.de)
Date: Fri May 09 2014 - 02:11:19 CDT

Ok, did you notice, by the way, that "Bulldozer" has the same issue as HT? There
are only 32 FPUs, which do "most" of the work for NAMD, so running 64 processes
is definitely slower than using the "right" 32 cores, selected via taskset for
instance. You might want to try it out. It's either "charmrun/mpirun [...]
taskset -c 0,2,4,6,8 [...] ,62 namd2 [...]" or "charmrun/mpirun [...]
taskset -c 0,1,2,3 [...] ,31 namd2 [...]". One of these options is slow, the
other is fast, faster than using all 64 cores.

Norman Geist.

> -----Original Message-----
> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> Behalf Of Joseph Farran
> Sent: Thursday, 8 May 2014 18:29
> To: Norman Geist
> Cc: Namd Mailing List
> Subject: Re: Re: namd-l: NAMD and NUMA
>
> Hi Norman.
>
> The node the job jumped to is identical to the original one. These are
> 64-core nodes with 512GB of memory and the exact same CPU.
>
> Yes, the job remains slow after it jumps, regardless of how long it runs
> on the new node.
>
> These are AMD Bulldozer nodes, so no HyperThreading is involved.
>
> Running htop on both the new and old nodes shows the same CPU usage:
> all 64 cores in use.
>
> Best,
> Joseph
>
>
>
>
> On 05/07/2014 11:53 PM, Norman Geist wrote:
> > Hi Joseph,
> >
> > Are you sure that the node the job jumped to isn't just slower? Or were
> > there maybe interfering jobs on that node?
> >
> > Otherwise, regarding NUMA, the OS should learn which data to keep in the
> > caches, so performance should rise again after some time. I can't imagine
> > that memory allocation influences the performance that much, as NAMD
> > isn't that memory bound. Does the bad performance remain as long as the
> > simulation continues?
> >
> > Also, if your nodes have HyperThreading enabled, you might want to check
> > whether your job is actually using "real" cores, i.e. not sharing
> > physical cores. (This would usually show up as largely fluctuating step
> > times while processes jump across cores.)
> >
> > Norman Geist.
> >
> >> -----Original Message-----
> >> From: owner-namd-l_at_ks.uiuc.edu [mailto:owner-namd-l_at_ks.uiuc.edu] On
> >> Behalf Of Joseph Farran
> >> Sent: Wednesday, 7 May 2014 20:55
> >> To: namd-l_at_ks.uiuc.edu
> >> Subject: namd-l: NAMD and NUMA
> >>
> >> Hi All / NAMD support.
> >>
> >> We are running NAMD 2.9 on CentOS 6.5 with Berkeley checkpointing, and
> >> jobs checkpoint and restart just fine. However, when the job restarts
> >> on another node, the time to finish increases 2x to 3x:
> >>
> >> TIMING: 16000 CPU: 668.71, 0.0411388/step Wall: 668.71, 0.0411388/step, 5.53088 hours remaining, 4338.894531 MB of memory in use.
> >> TIMING: 17000 CPU: 710.398, 0.0416875/step Wall: 710.398, 0.0416875/step, 5.59307 hours remaining, 4338.894531 MB of memory in use.
> >>
> >> <job jumped nodes>
> >>
> >> TIMING: 18000 CPU: 817.05, 0.106652/step Wall: 817.05, 0.106652/step, 14.2795 hours remaining, 4338.894531 MB of memory in use.
> >> TIMING: 19000 CPU: 943.168, 0.126118/step Wall: 943.168, 0.126118/step, 16.8507 hours remaining, 4338.894531 MB of memory in use.
> >>
> >> The issue seems to be with memory allocation. When the job restarts
> >> on a different but similar node, the memory allocation is lost.
> >>
> >> Does anyone know how to save the current memory allocation and restore
> >> it with Linux numactl?
> >>
> >> Thanks,
> >> Joseph
> >
> >
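
Regarding the numactl question quoted above: numactl itself cannot save and
restore the page placement of a checkpointed job, it only sets the policy for
a newly started process. A minimal sketch of what one could try at restart
instead (not from the thread; the launcher, the process count and the file
name restart.namd are placeholders, and it assumes the processes are spawned
locally on the single node):

  numactl --hardware                           # show the NUMA layout of the new node
  numactl --interleave=all charmrun +p32 namd2 restart.namd

--interleave=all should spread the pages allocated at restart evenly over all
NUMA nodes; alternatively --cpunodebind/--membind can pin both CPUs and memory
to a chosen set of nodes.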


This archive was generated by hypermail 2.1.6 : Thu Dec 31 2015 - 23:20:47 CST