Corpus ID: 11795237

Redundancy and Replication Fault Tolerance on the System Level

@inproceedings{FriedrichAlexander2014RedundancyAR,
  title={Redundancy and Replication Fault Tolerance on the System Level},
  author={Kurt Kanzenbach},
  year={2014}
}
Over the past decade, the clock rates and densities of processors have increased, enabled by new and smaller manufacturing processes. As a result, current processors are becoming less reliable and more vulnerable to physical effects such as cosmic radiation, which can cause transient failures. One common approach to addressing such hardware failures is to use redundant hardware. However, this strategy is expensive and not suitable in all cases. One alternative is to use…

References

Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance

This paper proposes a software-based multi-core alternative for transient fault tolerance using process-level redundancy (PLR), which creates a set of redundant processes per application process and systematically compares the processes to guarantee correct execution.

Configurable Transient Fault Detection via Dynamic Binary Translation

Spot is presented, a software-only fault-detection technique which uses dynamic binary translation to provide software-modulated fault tolerance with fine-grained control of redundancy and can vary the level of protection independently for each register and region of code to provide users with more, and often superior, fault-detection options.

SWIFT: software implemented fault tolerance

A novel, software-only, transient-fault-detection technique, called SWIFT, which efficiently manages redundancy by reclaiming unused instruction-level resources present during the execution of most programs and provides a high level of protection and performance with an enhanced control-flow checking mechanism.

Remus: High Availability via Asynchronous Virtual Machine Replication. (Best Paper)

Remus provides an extremely high degree of fault tolerance, to the point that a running system can transparently continue execution on an alternate physical host in the face of failure with only seconds of downtime, while completely preserving host state such as active network connections.

Error detection by duplicated instructions in super-scalar processors

EDDI can provide over 98% fault-coverage without any extra hardware for error detection, which is especially useful when designers cannot change the hardware, but they need dependability in the computer system.

Hardware error detection using AN-Codes

This thesis provides techniques for detecting hardware errors that disturb the execution of a program and is the first to present the encoding of a complete RISC instruction set, including boolean and bitwise logical operations, casts, unaligned loads and stores, shifts, and arithmetic operations.

Hypervisor-based fault tolerance

Protocols to implement a fault-tolerant computing system that augment the hypervisor of a virtual-machine manager and coordinate a primary virtual machine with its backup are described.

Software Fault Tolerance Techniques and Implementation

Software Fault Tolerance Techniques and Implementation examines key programming techniques such as assertions, checkpointing, and atomic actions, and provides design tips and models to assist in the

Automatic Instruction-Level Software-Only Recovery

Three automatic, instruction-level, software-only recovery techniques representing different trade-offs between reliability and performance are described.

Pin: building customized program analysis tools with dynamic instrumentation

The goals are to provide easy-to-use, portable, transparent, and efficient instrumentation. To illustrate Pin's versatility, two Pintools in daily use to analyze production software are described.