Corpus ID: 11795237

Redundancy and Replication Fault Tolerance on the System Level

Author: Kurt Kanzenbach Friedrich-Alexander
Over the past decade, processor clock rates and densities have increased, enabled by new and smaller manufacturing processes. As a result, current processors are becoming less reliable and more vulnerable to physical effects such as cosmic radiation, which can cause transient failures. One common approach to addressing such hardware failures is to use redundant hardware. However, this strategy is expensive and not suitable in all cases. One alternative is to use…

Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance
This paper proposes a software-based multi-core alternative for transient fault tolerance using process-level redundancy (PLR), which creates a set of redundant processes per application process and systematically compares the processes to guarantee correct execution.
Configurable Transient Fault Detection via Dynamic Binary Translation
Smaller feature sizes, lower voltage levels, and reduced noise margins have helped improve the performance and lower the power consumption of modern microprocessors. These same advances have made…
SWIFT: software implemented fault tolerance
SWIFT is a novel, software-only transient-fault-detection technique that efficiently manages redundancy by reclaiming unused instruction-level resources present during the execution of most programs. It provides a high level of protection and performance with an enhanced control-flow checking mechanism.
Remus: High Availability via Asynchronous Virtual Machine Replication. (Best Paper)
Remus provides an extremely high degree of fault tolerance, to the point that a running system can transparently continue execution on an alternate physical host in the face of failure with only seconds of downtime, while completely preserving host state such as active network connections.
Instruction-Level Fault Tolerance Configurability
The paper shows how some existing FT techniques can be adapted to support instruction-level FT configurability, how a programmer can specify the desired FT level of the instructions, and how the compiler can manage it automatically.
Error detection by duplicated instructions in super-scalar processors
EDDI can provide over 98% fault coverage without any extra hardware for error detection, which is especially useful when designers cannot change the hardware but still need dependability in the computer system.
Hardware error detection using AN-Codes
This thesis provides techniques for detecting hardware errors that disturb the execution of a program and is the first to present an encoding of a complete RISC instruction set, including boolean and bitwise logical operations, casts, unaligned loads and stores, shifts, and arithmetic operations.
Hypervisor-based fault tolerance
This paper describes protocols for implementing a fault-tolerant computing system that augment the hypervisor of a virtual-machine manager and coordinate a primary virtual machine with its backup.
Software Fault Tolerance Techniques and Implementation
Software Fault Tolerance Techniques and Implementation examines key programming techniques such as assertions, checkpointing, and atomic actions, and provides design tips and models to assist in the…
Automatic Instruction-Level Software-Only Recovery
Three automatic, instruction-level, software-only recovery techniques representing different trade-offs between reliability and performance are described.