Re: NAMD on Peta-Scale Machines

From: Axel Kohlmeyer (akohlmey_at_gmail.com)
Date: Sun Sep 02 2012 - 07:19:02 CDT

dear keun soo,

On Sat, Sep 1, 2012 at 1:08 AM, Yim, Keun Soo <yim6_at_illinois.edu> wrote:
>
> Hi Guys,
>
> I've studied the behaviors of NAMD, VMD, and other N-body programs under
> transient hardware faults. As such programs run on peta-scale
> machines, it's natural to expect that many of the parallel threads will
> experience transient hardware faults at runtime.
>
> I'd like to share my findings and hear what you think about them. Any
> comments, feedback, and questions will be really helpful for us to investigate
> this problem further, because you are one of the world experts in this
> problem domain.
>
> My data are:
>
> * Many transient hardware faults go undetected by the built-in NAMD error
> detectors. About 18% of transient hardware faults in the GPU kernel lead to
> silent data corruptions in the NAMD program output without being detected by
> the built-in detectors and/or system-level error detectors.

can you be a bit more specific about what you did here?
did you simulate hardware breakage, or did you just randomly
modify data? and what does the 18% number mean: 18% of the
theoretically possible faults, or 18% of the errors that actually occur?
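
just so we are talking about the same thing: by "randomly modify data"
i mean something along the lines of the following python sketch, which
flips one random bit in one value of a made-up array. none of these
names come from your actual GPU fault injection setup, this is purely
illustrative:

import random
import struct

def flip_random_bit(x):
    # reinterpret the 64-bit pattern of x, flip one random bit, convert back
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    bits ^= 1 << random.randrange(64)
    return struct.unpack('<d', struct.pack('<Q', bits))[0]

def inject_fault(values):
    # corrupt one randomly chosen entry in place and return its index
    i = random.randrange(len(values))
    values[i] = flip_random_bit(values[i])
    return i

forces = [0.1 * k for k in range(100)]   # stand-in data, not real per-atom forces
idx = inject_fault(forces)
print("corrupted element", idx, "now holds", forces[idx])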

> * Here, our definition of an SDC is any output data value of NAMD that is
> more than 1% different from the golden output data. My main curiosity is what

again, how did you measure this? did you compare a single set of
force calculations, or did you compare the results of a trajectory after
a certain number of steps?
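
to make this concrete, here is a minimal python sketch of what i
understand your 1% criterion to be: an elementwise relative comparison
against the fault-free "golden" output. whether "golden" and "faulty"
hold a single set of forces or coordinates after many timesteps changes
the meaning of the resulting number a lot (all names here are
hypothetical):

def count_sdc(golden, faulty, rel_tol=0.01):
    # count output values that differ from the golden run by more than rel_tol (relative)
    bad = 0
    for g, f in zip(golden, faulty):
        denom = abs(g) if g != 0.0 else 1.0   # avoid division by zero
        if abs(f - g) / denom > rel_tol:
            bad += 1
    return bad

golden = [1.00, 2.00, 3.00]        # e.g. one set of forces, or coordinates after N steps
faulty = [1.00, 2.05, 3.00]        # second value is 2.5% off
print(count_sdc(golden, faulty))   # -> 1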

> such a data error means for you or for biophysicists. My experiment used three
> data sets (1ubq, 1e79, and stmv). Another question is whether it would be
> possible for biophysicists to recognize the difference between these four VMD
> videos, where two are without faults and the other two are with faults (all
> running the same program with the same data).

well, the answer to what these kinds of errors mean is not an easy one.
first of all, MD is solving a system of coupled non-linear ordinary
differential equations, which essentially means that it is a chaotic system.
in other words, the tiniest changes can result in an (exponential)
divergence between two otherwise identical calculations. however, for the
purpose of analyzing the result, only the statistical relevance of those
trajectories matters, so if the divergence is caused by a truly random
change, it doesn't make a difference. similarly, being able to distinguish
the resulting trajectories doesn't mean anything. now there are several
contributing factors to divergence: using a "thermostat algorithm" (which
simulates coupling the simulation to a large heat bath), and using floating
point math, which is not associative and where, for maximum performance,
usually not even the IEEE 754 standard is fully adhered to. the floating
point math issue is worse when the order of summations is non-deterministic,
e.g. across threads on a GPU, or when using dynamic load balancing
(as NAMD does).
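
the non-associativity part is easy to demonstrate. the following toy
python example sums the exact same numbers in two different orders and
gets two different answers; this is exactly what happens when a GPU
reduction or a dynamically load balanced run changes the summation order:

vals = [1.0e16, 1.0, -1.0e16, 1.0] * 1000    # exact sum is 2000.0

left_to_right = sum(vals)                    # naive running sum
small_first   = sum(sorted(vals, key=abs))   # add small magnitudes first

# with IEEE754 double precision this prints 1.0 and 2000.0:
# every 1.0 added to a running total of 1e16 is rounded away.
print(left_to_right, small_first)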

however, there are some types of errors that are problematic even though
they are not easy to spot. they can be due to problems in algorithms,
a bad choice of parameters, or recurring hardware errors that are *not*
random, i.e. errors that create undesired correlations, which may in turn
produce statistically significant results that are not genuine.
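
a toy illustration of this last point (with completely invented numbers):
a random, zero-mean corruption of some per-frame observable largely
averages out, while a corruption that always pushes in the same direction
shifts the average and can thus mimic a statistically significant result:

import random

random.seed(0)
true_values = [10.0] * 100000   # pretend per-frame value of some observable

random_err = [v + random.uniform(-0.5, 0.5) for v in true_values]   # zero-mean noise
biased_err = [v + 0.5 if i % 100 == 0 else v                        # same corruption,
              for i, v in enumerate(true_values)]                   # always on every 100th frame

def mean(xs):
    return sum(xs) / len(xs)

print(mean(true_values))   # 10.0
print(mean(random_err))    # ~10.0, the random noise averages out
print(mean(biased_err))    # 10.005, a small but systematic shift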

in conclusion, most errors will probably have no bad effects
on the simulations (some may even be beneficial), but errors
that cause unwanted correlations, i.e. ones that always happen
in the same way, could be problematic and may not be easily
detected during a cursory analysis.

hope this helps,
     axel.

> https://netfiles.uiuc.edu/yim6/www/validation_gpu_demo.html
>
> * After this observation, we have developed a lightweight data error
> detection technique that can detect such errors (reducing missed faults
> from 18% to 2.5% on average) with a negligible performance overhead for the
> GPU version of NAMD. Moreover, the technique is automatic and does not need
> any manual engineering effort to use. Would you please help me find the
> right persons in the computational biophysics community who'd be interested
> in using our technique for large-scale simulations?
>
> Thank you.
>
> Best,
> Keun Soo
>
> P.S. I'm CCing my thesis advisor, Prof. Ravi Iyer. Please feel free to
> include his email in your response.
>

-- 
Dr. Axel Kohlmeyer  akohlmey_at_gmail.com  http://goo.gl/1wk0
International Centre for Theoretical Physics, Trieste. Italy.
