Re: timing variation during 256core run on abe.ncsa.uiuc.edu

From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Tue Feb 22 2011 - 11:06:26 CST

a few additional comments.

On Tue, Feb 22, 2011 at 11:14 AM, Bishop, Thomas C <bishop_at_tulane.edu> wrote:
> Thanks. You're right, I skipped some fundamentals.
> It's not just CPU time/load... but this is all the batch system really manages.
>
> After running benchmarks from 32 up to 1024 CPUs, I found that things became
> unstable at 512 and really fell apart at 1024 (this was true for a number
> of TG machines).

with the infiniband communication based machines,
there is also the issue of whether your job is entirely
contained in one switch module (fast and consistent)
or spans multiple of them (slightly higher latency
and more susceptible to being impacted by other jobs).
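
as a rough check (just a sketch, and it assumes a PBS environment like abe's
and that the hostnames encode rack/switch membership in their prefix, which
is site specific and not something from this thread), a small python snippet
can show how spread out an allocation is:

#!/usr/bin/env python
# quick look at where a PBS job landed, as a rough proxy for switch locality.
# assumptions (not from the original post): the job runs under PBS/Torque so
# $PBS_NODEFILE exists, and hostnames encode rack/switch in their prefix.
import os
from collections import Counter

with open(os.environ["PBS_NODEFILE"]) as f:   # one hostname per allocated core
    hosts = [line.strip() for line in f if line.strip()]

cores_per_host = Counter(hosts)
print("%d cores on %d hosts" % (len(hosts), len(cores_per_host)))

# group hosts by prefix (everything before the trailing digits); if the job
# spans several prefixes, it likely spans several switch modules.
prefixes = Counter(h.rstrip("0123456789") for h in cores_per_host)
for prefix, count in sorted(prefixes.items()):
    print("prefix %-12s : %d hosts" % (prefix, count))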

another item to consider is whether you turn on
shared request queues in the infiniband layer and
whether you use all cores or leave some idle. on the
pre-nehalem processors, it is often much faster to
not use all cores, which reduces the memory
bandwidth requirements of the job and boosts
the communication bandwidth in return.

i found this to be most extreme with DFT based
codes like CPMD or CP2k, but also with LAMMPS
it was sometimes better to use only half the cores
and thereby effectively double the L2 cache per
MPI task.

in the LAMMPS case, this can be compensated for
by using hybrid OpenMP/MPI code for the
non-bonded interactions. for NAMD, you get that
"for free" through the charm++ infrastructure,
but it may still be faster to run only 7 or 6 cores
per 8-core node.
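
one way to set that up on a PBS machine like abe (just a sketch with assumed
file names, not a recipe from this thread) is to build a charm++ nodelist with
7 entries per allocated node and let charmrun place one process per entry:

#!/usr/bin/env python
# sketch: build a charm++ nodelist that puts only 7 processes on each
# 8-core node of a PBS allocation, then print the matching charmrun line.
# assumptions (not from the original post): PBS/Torque provides $PBS_NODEFILE,
# "namd2" and "dyn11.conf" are placeholder names, and charmrun is the
# ibverbs launcher with its usual +p / ++nodelist options.
import os

with open(os.environ["PBS_NODEFILE"]) as f:
    hosts = sorted(set(line.strip() for line in f if line.strip()))

procs_per_node = 7                       # leave one core per node free
with open("nodelist.7", "w") as out:
    out.write("group main\n")
    for h in hosts:
        for _ in range(procs_per_node):
            out.write(" host %s\n" % h)

nprocs = procs_per_node * len(hosts)
print("charmrun +p%d ++nodelist nodelist.7 namd2 dyn11.conf" % nprocs)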

> I figured 256 was a fair compromise w/ 206K atoms / 256 cores (about 800 atoms/core).
> This is well above the figure of 200 atoms/core I got from someplace.
> True, the CPU/network can handle 200 atoms/core, but the entire SYSTEM becomes
> tasked at that point,
> so it depends on what else is happening on the system.
>
> If I really want this speed I should reconsider writing energies every step!

yep. amdahl's law is a bitch. ;-)
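
to put a number on it (purely illustrative: the 0.015 s/step figure is from
the benchmark in this thread, but the per-step output costs below are made-up
assumptions, not measurements):

#!/usr/bin/env python
# illustrative amdahl-style estimate: how a small, non-parallelizable
# per-step cost (e.g. writing energies every step) eats into 256-core speed.
step_parallel = 0.015      # s/step of well-parallelized work on 256 cores
for output_cost in (0.0, 0.005, 0.05, 0.5):   # assumed serial s/step for i/o
    total = step_parallel + output_cost
    print("output cost %.3f s/step -> %.3f s/step total (%.1fx slower)"
          % (output_cost, total, total / step_parallel))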

cheers,
    axel.

> True, 0.015 s/step is very fast.
>
> As a grad student I was happy w/ 12 s/step, 12 CPUs, 60K atoms, and 100 ps total
> trajectory.
> (note: today I'm ~1000x faster w/ 256 CPUs and 200K atoms, but expect a 100 ns
> trajectory, also 1000x longer)
>
> The more things change the more they stay the same.
>
> Tom
>
>
>
>
> -----Original Message-----
> From: owner-namd-l_at_ks.uiuc.edu on behalf of Axel Kohlmeyer
> Sent: Mon 2/21/2011 3:07 PM
> To: Bishop, Thomas C
> Cc: namd-l_at_ks.uiuc.edu
> Subject: Re: namd-l: timing variation during 256core run on
> abe.ncsa.uiuc.edu
>
> On Mon, Feb 21, 2011 at 2:52 PM, Thomas C. Bishop <bishop_at_tulane.edu> wrote:
>> dear namd,
>>
>> I'm running a 256-CPU NAMD 2.7 Linux-x86_64-ibverbs
>> simulation on abe at NCSA. The simulation contains 206031 atoms.
>>
>> I've run many, many simulations with the same NAMD configuration and
>> consistently get benchmarks of:
>> Info: Benchmark time: 256 CPUs 0.0149822 s/step 0.0867028 days/ns 337.961 MB memory
>> Info: Benchmark time: 256 CPUs 0.0155425 s/step 0.0899449 days/ns 349.344 MB memory
>> Info: Benchmark time: 256 CPUs 0.0148334 s/step 0.0858417 days/ns 351.711 MB memory
>>
>> However, every now and then NAMD slows to a crawl (a factor of ~30 change in speed)
>> during run time (see times below).
>> The simulations themselves are not crashing (i.e. all energies and the
>> trajectory itself look good).
>> Interestingly, the simulation speed recovers and is able to return to
>> benchmark again.
>>
>> Is this symptomatic of a hardware or I/O problem, or a scheduling/load conflict,
>> that I should bring up w/ the sys-admins on ABE?
>
> in the case of abe, the most likely candidate is i/o contention on the
> infiniband network and lustre i/o. you have to share the network with lustre,
> and file access on lustre can experience delays at times, particularly when a
> bunch of quantum chemistry jobs get started that have to suck in large
> integral files (or write them out).
>
> furthermore, you are pushing your system very hard at only 0.015 s/step.
> OS jitter may have an impact on that, too.
>
>> I'd chalk this up to system load, but this shouldn't happen in a batch env,
>> or am I missing something here?
>
> it happens _particularly_ in a batch environment, whenever people have to
> share resources. i see similar fluctuations on kraken, too, especially when
> close to scale out. the only way to get reproducible benchmarks is to
> get everybody off the machine, reboot all nodes before each run, and
> then prime the caches before running for production.
>
> cheers,
>     axel.
>
>>
>> The complete 35M namd output file is available at
>> http://dna.ccs.tulane.edu/~bishop/dyn11.out
>>
>> Thanks for any info.
>> Tom
>>
>> TIMING: 5000  CPU: 355.724, 0.071117/step  Wall: 357.868, 0.071539/step, 9.83661 hours remaining, 351.710938 MB of memory in use.
>> TIMING: 10000  CPU: 428.965, 0.0146482/step  Wall: 432.858, 0.0149981/step, 2.0414 hours remaining, 351.710938 MB of memory in use.
>> TIMING: 15000  CPU: 502.386, 0.0146842/step  Wall: 507.764, 0.0149811/step, 2.01829 hours remaining, 351.710938 MB of memory in use.
>> TIMING: 20000  CPU: 917.611, 0.083045/step  Wall: 925.339, 0.083515/step, 11.1353 hours remaining, 351.710938 MB of memory in use.
>> TIMING: 25000  CPU: 994.495, 0.0153769/step  Wall: 1006.21, 0.0161739/step, 2.13405 hours remaining, 351.710938 MB of memory in use.
>> TIMING: 30000  CPU: 1067.95, 0.0146908/step  Wall: 1083.06, 0.0153707/step, 2.00673 hours remaining, 351.710938 MB of memory in use.
>> TIMING: 35000  CPU: 1676.58, 0.121726/step  Wall: 1692.43, 0.121873/step, 15.7419 hours remaining, 351.710938 MB of memory in use.
>> TIMING: 40000  CPU: 2224.61, 0.109606/step  Wall: 2242.1, 0.109935/step, 14.0472 hours remaining, 351.710938 MB of memory in use.
>> TIMING: 45000  CPU: 2752.26, 0.105531/step  Wall: 2772.49, 0.106078/step, 13.4071 hours remaining, 351.710938 MB of memory in use.
>> TIMING: 50000  CPU: 3329.5, 0.115446/step  Wall: 3351.42, 0.115786/step, 14.4733 hours remaining, 351.710938 MB of memory in use.
>> TIMING: 55000  CPU: 4428.86, 0.219874/step  Wall: 4452.69, 0.220253/step, 27.2258 hours remaining, 351.710938 MB of memory in use.
>> TIMING: 60000  CPU: 5495.78, 0.213383/step  Wall: 5521.81, 0.213824/step, 26.134 hours remaining, 351.710938 MB of memory in use.
>> TIMING: 65000  CPU: 7152.37, 0.331318/step  Wall: 7180.03, 0.331644/step, 40.0736 hours remaining, 351.710938 MB of memory in use.
>> TIMING: 70000  CPU: 9351.38, 0.439802/step  Wall: 9380.11, 0.440017/step, 52.5575 hours remaining, 351.710938 MB of memory in use.
>> TIMING: 75000  CPU: 10993.4, 0.328407/step  Wall: 11024.3, 0.328832/step, 38.8205 hours remaining, 351.710938 MB of memory in use.
>> TIMING: 80000  CPU: 11066.3, 0.0145752/step  Wall: 11098.6, 0.0148747/step, 1.73539 hours remaining, 351.710938 MB of memory in use.
>> TIMING: 85000  CPU: 13187.3, 0.424192/step  Wall: 13222.4, 0.424758/step, 48.9651 hours remaining, 351.710938 MB of memory in use.
>> TIMING: 90000  CPU: 14291.9, 0.220935/step  Wall: 14329.3, 0.221371/step, 25.2117 hours remaining, 351.710938 MB of memory in use.
>> TIMING: 95000  CPU: 15932.4, 0.328092/step  Wall: 15971.4, 0.328431/step, 36.9484 hours remaining, 351.710938 MB of memory in use.
>> TIMING: 100000  CPU: 16479.4, 0.109409/step  Wall: 16519.7, 0.109659/step, 12.1843 hours remaining, 358.207031 MB of memory in use.
>>
>>
>>
>> *******************************
>>   Thomas C. Bishop
>>    Tel: 504-862-3370
>>    Fax: 504-862-8392
>> http://dna.ccs.tulane.edu
>> ********************************
>>
>>
>
>
>
> --
> Dr. Axel Kohlmeyer
> akohlmey_at_gmail.com  http://goo.gl/1wk0
>
> Institute for Computational Molecular Science
> Temple University, Philadelphia PA, USA.
>
>
>

-- 
Dr. Axel Kohlmeyer
akohlmey_at_gmail.com  http://goo.gl/1wk0
Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.
