Redundancy and Replication Fault Tolerance on the System Level
@inproceedings{FriedrichAlexander2014RedundancyAR, title={Redundancy and Replication Fault Tolerance on the System Level}, author={Kurt Kanzenbach Friedrich-Alexander}, year={2014} }
Over the past decade, processor clock rates and densities have increased, enabled by new and smaller manufacturing processes. As a result, current processors are becoming less reliable and more vulnerable to physical effects such as cosmic radiation, which can cause transient failures. One common approach to addressing such hardware failures is to use redundant hardware. However, this strategy is expensive and not suitable in all cases. One alternative is to use…
References
Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance
- Computer Science
- 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07)
- 2007
This paper proposes a software-based multi-core alternative for transient fault tolerance using process-level redundancy (PLR), which creates a set of redundant processes per application process and systematically compares the processes to guarantee correct execution.
SWIFT: software implemented fault tolerance
- Computer Science, Engineering
- International Symposium on Code Generation and Optimization
- 2005
A novel, software-only, transient-fault-detection technique, called SWIFT, which efficiently manages redundancy by reclaiming unused instruction-level resources present during the execution of most programs and provides a high level of protection and performance with an enhanced control-flow checking mechanism.
Remus: High Availability via Asynchronous Virtual Machine Replication. (Best Paper)
- Computer Science
- NSDI
- 2008
Remus provides an extremely high degree of fault tolerance, to the point that a running system can transparently continue execution on an alternate physical host in the face of failure with only seconds of downtime, while completely preserving host state such as active network connections.
Error detection by duplicated instructions in super-scalar processors
- Computer Science
- IEEE Trans. Reliab.
- 2002
EDDI can provide over 98% fault-coverage without any extra hardware for error detection, which is especially useful when designers cannot change the hardware, but they need dependability in the computer system.
Hardware error detection using AN-Codes
- Computer Science
- 2010
This thesis provides techniques for detecting hardware errors that disturb the execution of a program, and is the first to present the encoding of a complete RISC instruction set, including boolean and bitwise logical operations, casts, unaligned loads and stores, shifts, and arithmetic operations.
Hypervisor-based fault tolerance
- Computer Science
- TOCS
- 1996
This paper describes protocols for implementing a fault-tolerant computing system that augment the hypervisor of a virtual-machine manager and coordinate a primary virtual machine with its backup.
Software Fault Tolerance Techniques and Implementation
- Computer Science
- 2001
Software Fault Tolerance Techniques and Implementation examines key programming techniques such as assertions, checkpointing, and atomic actions, and provides design tips and models to assist in the…
Automatic Instruction-Level Software-Only Recovery
- Computer Science
- IEEE Micro
- 2007
Three automatic, instruction-level, software-only recovery techniques representing different trade-offs between reliability and performance are described.
Fault-tolerant real-time systems - the problem of replica determinism
- Computer Science
- The Kluwer international series in engineering and computer science
- 1996
The requirements of automotive electronics are discussed in the remainder of this work and are used as a benchmark to evaluate solutions to the problem of replica determinism.