From: Iyer, Ravishankar K (rkiyer_at_illinois.edu)
Date: Sun Sep 02 2012 - 11:07:32 CDT
Thanks for your explanation. We have been having similar debates here as
to the significance of these errors. Keun Soo's baseline results also show
that a significant percentage of the injected errors have no impact. The
question we have been debating is what percentage of the remaining errors
(approx. 50 percent) really matter. I have asked Keun Soo to study this
closely, working with the NAMD group here, to better understand this
question. Your insight is very useful; Keun Soo needs to understand this
better.
Best
Ravi
Ravishankar K. Iyer
George and Ann Fisher Distinguished Professor
Department of Electrical and Computer Engineering
Department of Computer Science and
The Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
Urbana IL 61801
On 9/2/12 7:19 AM, "Axel Kohlmeyer" <akohlmey_at_gmail.com> wrote:
>dear keun soo,
>
>On Sat, Sep 1, 2012 at 1:08 AM, Yim, Keun Soo <yim6_at_illinois.edu> wrote:
>>
>> Hi Guys,
>>
>> I've studied the behavior of NAMD, VMD, and other N-body programs
>> under transient hardware faults. As such programs run on petascale
>> machines, it is natural to expect that many of the parallel threads
>> will experience transient hardware faults at runtime.
>>
>> I'd like to share my findings and ask what you think about them. Any
>> comments, feedback, and questions would be very helpful for us to
>> investigate this problem further, because you are among the world
>> experts in this problem domain.
>>
>> My data are:
>>
>> * Many transient hardware faults go undetected by the built-in NAMD
>> error detectors. About 18% of transient hardware faults in the GPU
>> kernels lead to silent data corruptions in the NAMD program output
>> without being detected by the built-in detectors and/or system-level
>> error detectors.
>
>can you be a bit more specific about what you did here?
>did you simulate hardware breakage, or did you just randomly
>modify data? what does the 18% number mean? 18% of the
>theoretically possible faults, or 18% of the errors that actually occur?
>
>> * Here, our definition of an SDC is that any output data value of
>> NAMD differs by more than 1% from the golden output data. My main
>> curiosity is what
>
>again, how did you measure this? did you compare a single set of
>force calculations, or did you compare the results of a trajectory
>after a certain number of steps?
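>for illustration, here is a minimal python sketch of such a per-value
>comparison (the file names and array layout are my assumptions; only
>the 1% threshold is taken from your description). what the number
>means depends entirely on *what* is being compared:
>
>  import numpy as np
>
>  # golden (fault-free) output vs. a run under fault injection;
>  # assumes both are plain-text arrays of identical shape
>  golden = np.loadtxt("golden_output.txt")
>  faulty = np.loadtxt("faulty_output.txt")
>
>  # per-value relative deviation, guarding against zero references
>  denom = np.maximum(np.abs(golden), 1e-12)
>  rel_dev = np.abs(faulty - golden) / denom
>
>  print("max relative deviation:", rel_dev.max())
>  # the stated SDC criterion: any value more than 1% off the golden run
>  print("SDC:", bool((rel_dev > 0.01).any()))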
>
>> such a data error means for you or for biophysicists? My experiment
>> used three data sets (1ubq, 1e79, and stmv). Another question is
>> whether it would be possible for biophysicists to recognize the
>> difference between these four VMD videos, where two are without
>> faults and the other two are with faults (all execute the same
>> program with the same data)?
>
>well, the answer to what these kinds of errors mean is not an easy one.
>first of all, MD integrates a system of coupled non-linear ordinary
>differential equations, which essentially means that it is a chaotic
>system. in other words, the tiniest change can result in an
>(exponential) divergence between two otherwise identical calculations.
>however, for the purpose of analyzing the result, only the statistical
>properties of those trajectories matter, so if the divergence is caused
>by a truly random change, it doesn't make a difference. similarly,
>being able to distinguish the resulting trajectories doesn't mean
>anything. now there are several factors contributing to divergence:
>using a "thermostat algorithm" (which simulates coupling the simulation
>to a large heat bath), and using floating point math, which is not
>associative and which, for maximum performance, usually does not even
>fully adhere to IEEE 754. the floating point issue is worse when the
>order of summations is non-deterministic, as it is across threads on a
>GPU or when using dynamic load balancing (as NAMD does).
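>
>to make these two effects concrete, here is a toy python sketch (my
>illustration, not NAMD: a driven non-linear oscillator stands in for
>the coupled equations of motion, and a shuffled sum stands in for a
>GPU/load-balanced summation order):
>
>  import math, random
>
>  # (1) floating point addition is not associative: the same numbers
>  #     summed in a different order give a slightly different result
>  vals = [random.uniform(-1.0, 1.0) for _ in range(100000)]
>  s1 = sum(vals)
>  random.shuffle(vals)
>  s2 = sum(vals)
>  print("difference from summation order:", s1 - s2)
>
>  # (2) chaotic divergence: two copies of a driven, damped, non-linear
>  #     (Duffing-type) oscillator, started 1e-15 apart, separate
>  #     exponentially over time
>  def step(x, v, t, dt=0.01):
>      a = -0.2 * v + x - x**3 + 0.3 * math.cos(t)
>      return x + v * dt, v + a * dt
>
>  xa, va = 1.0, 0.0
>  xb, vb = 1.0 + 1e-15, 0.0   # the "tiniest change"
>  for i in range(20001):
>      xa, va = step(xa, va, i * 0.01)
>      xb, vb = step(xb, vb, i * 0.01)
>      if i % 5000 == 0:
>          print("step %6d  separation %.3e" % (i, abs(xa - xb)))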
>
>however, there are some types of errors that are problematic even
>when they are not easy to tell apart. in part, they can be due to
>problems in algorithms, bad choices of parameters, and recurring
>hardware errors that are *not* random, i.e. errors that create
>undesired correlations, which may produce statistically significant
>results that are not genuine.
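>
>here is a toy python illustration of why that distinction matters
>(the fault model is made up): truly random errors average out of an
>observable, while a fault that always hits the same way produces a
>statistically significant shift:
>
>  import random
>  from statistics import mean
>
>  random.seed(7)
>  samples = [random.gauss(0.0, 1.0) for _ in range(100000)]
>
>  # truly random fault: flip the sign of a random 1% of the samples;
>  # the distribution is symmetric, so the average is unaffected
>  rnd = [-x if random.random() < 0.01 else x for x in samples]
>
>  # recurring fault: the *same* 1% of slots always get +1.0 added
>  # (think: a stuck bit in one compute unit) -> a correlated bias
>  rec = [x + 1.0 if i % 100 == 0 else x
>         for i, x in enumerate(samples)]
>
>  print("fault-free mean: %+.5f" % mean(samples))
>  print("random faults:   %+.5f" % mean(rnd))
>  print("recurring fault: %+.5f" % mean(rec))
>  # the recurring fault shifts the mean by +0.01, about 3 standard
>  # errors here: a bias that survives averaging and can look genuine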
>
>in conclusion, most errors will probably have no bad effect on the
>simulations (some may even be beneficial), but those that cause
>unwanted correlations, i.e. errors that always happen in the same
>way, could be problematic and may not be easily detected during a
>cursory analysis.
>
>hope this helps,
> axel.
>
>> https://netfiles.uiuc.edu/yim6/www/validation_gpu_demo.html
>>
>> * After this observation, we developed a lightweight data error
>> detection technique that can detect such errors (reducing the missed
>> faults from 18% to 2.5% on average) with negligible performance
>> overhead for the GPU version of NAMD. The technique is automatic and
>> does not need any manual engineering effort to use. Could you please
>> help me find the right people in the computational biophysics
>> community who would be interested in using our technique for
>> large-scale simulations?
>>
>> Thank you.
>>
>> Best,
>> Keun Soo
>>
>> P.S. I'm CCing my thesis advisor, Prof. Ravi Iyer. Please feel free
>> to include his email in your responses.
>>
>
>
>
>--
>Dr. Axel Kohlmeyer akohlmey_at_gmail.com http://goo.gl/1wk0
>International Centre for Theoretical Physics, Trieste. Italy.
This archive was generated by hypermail 2.1.6 : Mon Dec 31 2012 - 23:22:01 CST