Re: timing variation during 256core run on abe.ncsa.uiuc.edu

From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Mon Feb 21 2011 - 15:07:50 CST

On Mon, Feb 21, 2011 at 2:52 PM, Thomas C. Bishop <bishop_at_tulane.edu> wrote:
> dear namd,
>
> I'm running a 256-CPU NAMD 2.7 Linux-x86_64-ibverbs
> simulation on abe at ncsa. The simulation contains 206031 atoms.
>
> I've run many, many simulations with the same namd configuration and consistently
> get benchmarks of
> Info: Benchmark time: 256 CPUs 0.0149822 s/step 0.0867028 days/ns 337.961 MB memory
> Info: Benchmark time: 256 CPUs 0.0155425 s/step 0.0899449 days/ns 349.344 MB memory
> Info: Benchmark time: 256 CPUs 0.0148334 s/step 0.0858417 days/ns 351.711 MB memory
>
> However, every now and then namd slows to a crawl (a factor-of-30 change in speed) during the run (see timings below).
> The simulations themselves are not crashing (i.e. all energies and the trajectory itself look good).
> Interestingly, the simulation speed recovers and returns to the benchmark rate again.
>
> Is this symptomatic of a hardware or I/O problem, or a scheduling/load conflict, that I should bring up w/ the sys-admins on ABE?

in the case of abe, the most likely candidate is i/o contention on the
infiniband network and lustre i/o. you have to share the network with
lustre, and file access on lustre can experience delays at times,
particularly when a bunch of quantum chemistry jobs get started that have
to suck in large integral files (or write them out).
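
if you want to check whether lustre itself is stalling, a crude probe is to
time a small synchronous write to your scratch directory while the job is
slow and compare against a quiet period. a minimal python sketch (the
/scratch path is just an example, use whatever your lustre mount is):

  import os, time

  # time a 16 MB synchronous write to lustre scratch (path is an example)
  t0 = time.time()
  with open('/scratch/lustre_probe.bin', 'wb') as f:
      f.write(os.urandom(16 * 1024 * 1024))
      f.flush()
      os.fsync(f.fileno())   # force the data out to the storage servers
  os.remove('/scratch/lustre_probe.bin')
  print('wrote 16 MB in %.2f s' % (time.time() - t0))

if that number jumps by an order of magnitude whenever your namd job
crawls, you have something concrete to show the sys-admins.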

furthermore, you are pushing your system very hard at only 0.015 s/step.
OS jitter may have an impact on that, too.

> I'd chalk this up to system load, but this shouldn't happen in a batch environment, or am I missing something here?

it happens _particularly_ in a batch environment, whenever people have to
share resources. i see similar fluctuations on kraken, too, especially when
running close to the scaling limit. the only way to get reproducible
benchmarks is to get everybody off the machine, reboot all nodes before
each run, and then prime the caches before running for production.
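
btw, you can quantify those stalls directly from the TIMING lines in your
output file. a quick python sketch (the 2x-benchmark threshold is
arbitrary, adjust to taste):

  import re, sys

  # flag TIMING lines where the wall time per step is well above benchmark
  benchmark = 0.015                 # s/step from the Benchmark lines
  pat = re.compile(r'TIMING: (\d+).*Wall: [\d.]+, ([\d.]+)/step')
  for line in open(sys.argv[1]):    # e.g. python stalls.py dyn11.out
      m = pat.search(line)
      if m and float(m.group(2)) > 2 * benchmark:
          print('step %s: %.4f s/step  <-- stall' % (m.group(1),
                                                     float(m.group(2))))

that will tell you what fraction of the run you are losing to the
contention.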

cheers,
    axel.

>
> The complete 35M namd output file is available at
> http://dna.ccs.tulane.edu/~bishop/dyn11.out
>
> Thanks for any info.
> Tom
>
> TIMING: 5000  CPU: 355.724, 0.071117/step  Wall: 357.868, 0.071539/step, 9.83661 hours remaining, 351.710938 MB of memory in use.
> TIMING: 10000  CPU: 428.965, 0.0146482/step  Wall: 432.858, 0.0149981/step, 2.0414 hours remaining, 351.710938 MB of memory in use.
> TIMING: 15000  CPU: 502.386, 0.0146842/step  Wall: 507.764, 0.0149811/step, 2.01829 hours remaining, 351.710938 MB of memory in use.
> TIMING: 20000  CPU: 917.611, 0.083045/step  Wall: 925.339, 0.083515/step, 11.1353 hours remaining, 351.710938 MB of memory in use.
> TIMING: 25000  CPU: 994.495, 0.0153769/step  Wall: 1006.21, 0.0161739/step, 2.13405 hours remaining, 351.710938 MB of memory in use.
> TIMING: 30000  CPU: 1067.95, 0.0146908/step  Wall: 1083.06, 0.0153707/step, 2.00673 hours remaining, 351.710938 MB of memory in use.
> TIMING: 35000  CPU: 1676.58, 0.121726/step  Wall: 1692.43, 0.121873/step, 15.7419 hours remaining, 351.710938 MB of memory in use.
> TIMING: 40000  CPU: 2224.61, 0.109606/step  Wall: 2242.1, 0.109935/step, 14.0472 hours remaining, 351.710938 MB of memory in use.
> TIMING: 45000  CPU: 2752.26, 0.105531/step  Wall: 2772.49, 0.106078/step, 13.4071 hours remaining, 351.710938 MB of memory in use.
> TIMING: 50000  CPU: 3329.5, 0.115446/step  Wall: 3351.42, 0.115786/step, 14.4733 hours remaining, 351.710938 MB of memory in use.
> TIMING: 55000  CPU: 4428.86, 0.219874/step  Wall: 4452.69, 0.220253/step, 27.2258 hours remaining, 351.710938 MB of memory in use.
> TIMING: 60000  CPU: 5495.78, 0.213383/step  Wall: 5521.81, 0.213824/step, 26.134 hours remaining, 351.710938 MB of memory in use.
> TIMING: 65000  CPU: 7152.37, 0.331318/step  Wall: 7180.03, 0.331644/step, 40.0736 hours remaining, 351.710938 MB of memory in use.
> TIMING: 70000  CPU: 9351.38, 0.439802/step  Wall: 9380.11, 0.440017/step, 52.5575 hours remaining, 351.710938 MB of memory in use.
> TIMING: 75000  CPU: 10993.4, 0.328407/step  Wall: 11024.3, 0.328832/step, 38.8205 hours remaining, 351.710938 MB of memory in use.
> TIMING: 80000  CPU: 11066.3, 0.0145752/step  Wall: 11098.6, 0.0148747/step, 1.73539 hours remaining, 351.710938 MB of memory in use.
> TIMING: 85000  CPU: 13187.3, 0.424192/step  Wall: 13222.4, 0.424758/step, 48.9651 hours remaining, 351.710938 MB of memory in use.
> TIMING: 90000  CPU: 14291.9, 0.220935/step  Wall: 14329.3, 0.221371/step, 25.2117 hours remaining, 351.710938 MB of memory in use.
> TIMING: 95000  CPU: 15932.4, 0.328092/step  Wall: 15971.4, 0.328431/step, 36.9484 hours remaining, 351.710938 MB of memory in use.
> TIMING: 100000  CPU: 16479.4, 0.109409/step  Wall: 16519.7, 0.109659/step, 12.1843 hours remaining, 358.207031 MB of memory in use.
>
>
>
> *******************************
>   Thomas C. Bishop
>    Tel: 504-862-3370
>    Fax: 504-862-8392
> http://dna.ccs.tulane.edu
> ********************************
>
>

-- 
Dr. Axel Kohlmeyer
akohlmey_at_gmail.com  http://goo.gl/1wk0
Institute for Computational Molecular Science
Temple University, Philadelphia PA, USA.
