• Corpus ID: 11795237

Redundancy and Replication Fault Tolerance on the System Level

Kurt Kanzenbach, Friedrich-Alexander
Over the past decade, the clock rates and densities of processors have increased, enabled by new and smaller manufacturing processes. As a result, current processors are becoming less reliable and more vulnerable to physical effects such as cosmic radiation, which can cause transient failures. One common approach to hardware failures like these is to use redundant hardware. However, this strategy is expensive and not suitable in all cases. One alternative is to use…

Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance

This paper proposes a software-based multi-core alternative for transient fault tolerance using process-level redundancy (PLR), which creates a set of redundant processes per application process and systematically compares the processes to guarantee correct execution.

SWIFT: software implemented fault tolerance

A novel, software-only, transient-fault-detection technique, called SWIFT, which efficiently manages redundancy by reclaiming unused instruction-level resources present during the execution of most programs and provides a high level of protection and performance with an enhanced control-flow checking mechanism.

Remus: High Availability via Asynchronous Virtual Machine Replication. (Best Paper)

Remus provides an extremely high degree of fault tolerance, to the point that a running system can transparently continue execution on an alternate physical host in the face of failure with only seconds of downtime, while completely preserving host state such as active network connections.

Error detection by duplicated instructions in super-scalar processors

EDDI can provide over 98% fault coverage without any extra hardware for error detection, which is especially useful when designers cannot change the hardware but still need dependability in the computer system.

Hardware error detection using AN-Codes

This thesis provides techniques for detecting hardware errors that disturb the execution of a program; they are the first to encode a complete RISC instruction set, including boolean and bitwise logical operations, casts, unaligned loads and stores, shifts, and arithmetic operations.

Hypervisor-based fault tolerance

Protocols to implement a fault-tolerant computing system that augment the hypervisor of a virtual-machine manager and coordinate a primary virtual machine with its backup are described.

Software Fault Tolerance Techniques and Implementation

Software Fault Tolerance Techniques and Implementation examines key programming techniques such as assertions, checkpointing, and atomic actions, and provides design tips and models to assist in the

Automatic Instruction-Level Software-Only Recovery

Three automatic, instruction-level, software-only recovery techniques representing different trade-offs between reliability and performance are described.

Fault-tolerant real-time systems - the problem of replica determinism

  • S. Poledna
  • Computer Science
    The Kluwer international series in engineering and computer science
  • 1996
The requirements of automotive electronics are discussed in the remainder of this work and are used as a benchmark to evaluate solutions to the problem of replica determinism.

Architecture Design for Soft Errors