From: Yim, Keun Soo (yim6_at_illinois.edu)
Date: Fri Aug 31 2012 - 18:08:13 CDT
Hi Guys,
I've studied the behavior of NAMD, VMD, and other N-body programs under transient hardware faults. As such programs run on petascale machines, it's natural to expect that many of the parallel threads will experience transient hardware faults at runtime.
I'd like to share my findings and ask what you think about them. Any comments, feedback, or questions would be really helpful for us in investigating this problem further, because you are among the world experts in this problem domain.
My data are:
* Many transient hardware faults go undetected by the built-in NAMD error detectors. About 18% of transient hardware faults in GPU kernels lead to silent data corruptions (SDCs) in the NAMD program output without being detected by the built-in detectors and/or system-level error detectors.
* Here, our definition of an SDC is that any output data value of NAMD differs by more than 1% from the golden output data. My main question is what such a data error means for you or for biophysicists. My experiments used three data sets (1ubq, 1e79, and stmv). Another question is whether biophysicists could recognize the difference between these four VMD videos, two of which are without faults and the other two with faults (all execute the same program with the same data): https://netfiles.uiuc.edu/yim6/www/validation_gpu_demo.html
* After this observation, we developed a lightweight data error detection technique that can detect such errors (reducing missed faults from 18% to 2.5% on average) with negligible performance overhead for the GPU version of NAMD. The technique is also automatic and does not need any manual engineering effort to use. Could you please help me find the right people in the computational biophysics community who would be interested in using our technique for large-scale simulations?
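To make the 1% SDC criterion above concrete, here is a minimal sketch in Python. It is only an illustration of the definition, not the tooling used in the experiments. The function names, the flat list of floating-point output values, and the handling of NaN and zero-valued golden entries are all assumptions that the message does not specify.

```python
import math


def relative_difference(observed, golden):
    """Relative difference of one output value against its golden value."""
    if math.isnan(observed):
        # Assumption: a NaN in the output is treated as a corrupted value.
        return math.inf
    if golden == 0.0:
        # Assumption: with a zero golden value, any nonzero output counts
        # as an unbounded relative difference.
        return 0.0 if observed == 0.0 else math.inf
    return abs(observed - golden) / abs(golden)


def is_sdc(observed_values, golden_values, threshold=0.01):
    """Return True if any output value differs from the golden output
    by more than `threshold` (1% by default)."""
    if len(observed_values) != len(golden_values):
        raise ValueError("observed and golden outputs differ in length")
    return any(
        relative_difference(o, g) > threshold
        for o, g in zip(observed_values, golden_values)
    )


if __name__ == "__main__":
    golden = [1.0, 2.0, -4.0]
    print(is_sdc([1.0, 2.0, -4.0], golden))    # identical run
    print(is_sdc([1.005, 2.0, -4.0], golden))  # 0.5% deviation in one value
    print(is_sdc([1.0, 2.03, -4.0], golden))   # 1.5% deviation in one value
```

Under this sketch, a single value that is off by more than 1% classifies the whole run as an SDC, which matches the "any output data value" wording above.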
Thank you.
Best,
Keun Soo
P.S. I'm CCing my thesis advisor, Prof. Ravi Iyer. Please feel free to include his email address in your response.