An algorithmic approach to error localization and partial recomputation for low-overhead fault tolerance
The increasing size and complexity of massively parallel systems (e.g. HPC systems) is making it increasingly likely that individual circuits will produce erroneous results. For this reason, novel…
  • 40
  • 3
Stochastic computing: Embracing errors in architecture and design of processors and applications
As device sizes shrink, device-level manufacturing challenges have led to increased variability in physical circuit characteristics. Exponentially increasing circuit density has not only brought…
  • 48
  • 2
Algorithmic approaches to low overhead fault detection for sparse linear algebra
The increasing size and complexity of High-Performance Computing systems is making it increasingly likely that individual circuits will produce erroneous results, especially when operated in a low…
  • 86
  • 1
A numerical optimization-based methodology for application robustification: Transforming applications for error tolerance
There have been several attempts at correcting process-variation-induced errors by identifying and masking these errors at the circuit and architecture level [10, 27]. These approaches take up…
  • 53
  • 1
Towards analyzing and improving robustness of software applications to intermittent and permanent faults in hardware
Although a significant fraction of emerging failure and wearout mechanisms result in intermittent or permanent faults in hardware, their impact (as distinct from transient faults) on software…
  • 5
On software design for stochastic processors
Much recent research [8, 6, 7] suggests significant power and energy benefits of relaxing correctness constraints in future processors. Such processors with relaxed constraints have often been…
  • 16
Towards scalable reliability frameworks for error prone CMPs
As technology scales and the energy of computation continually approaches thermal equilibrium [1,2], parameter variations and noise levels will lead to larger error rates at various levels of the…
  • 9
Fluid NMR: Performing Power/Reliability Tradeoffs for Applications with Error Tolerance
N-modular redundancy (NMR) [1] has long been the most prevalent fault-tolerance technique. However, traditional NMR is agnostic of application characteristics (especially, an application’s error…
  • 9
Algorithmic Techniques for Fault Detection for Sparse Linear Algebra
The growing complexity and variability of future computing systems is making it increasingly likely that individual circuits will produce erroneous results, especially when operated in a low energy…
  • 1
Hardware/System Support for Four Economic Models for Many Core Computing
This paper argues for a new set of economic models for many-core computing. Current economic models for processors require a customer to estimate her average-case or worst-case computational…