Evaluating the impact of Undetected Disk Errors in RAID systems

@article{Rozier2009EvaluatingTI,
  title={Evaluating the impact of Undetected Disk Errors in RAID systems},
  author={Eric Rozier and Wendy Belluomini and Veera Deenadhayalan and James Lee Hafner and K. K. Rao and Pin Zhou},
  journal={2009 IEEE/IFIP International Conference on Dependable Systems \& Networks},
  year={2009},
  pages={83--92}
}
  • Published 29 September 2009
Despite the reliability of modern disks, recent studies have made it clear that a new class of faults, Undetected Disk Errors (UDEs), also known as silent data corruption events, becomes a real challenge as storage capacity scales. While RAID systems have proven effective in protecting data from traditional disk failures, silent data corruption events remain a significant problem unaddressed by RAID. We present a fault model for UDEs, and a hybrid framework for simulating UDEs in large-scale… 
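As a hypothetical illustration (not the authors' simulator), the signature of a UDE in a parity-protected stripe can be sketched as follows: a drive silently returns or stores wrong data, and an XOR parity scrub can detect the mismatch but cannot, by itself, locate the corrupt block. The block size and stripe width below are arbitrary.

```python
import os
from functools import reduce

BLOCK = 16  # bytes per block in this toy stripe (arbitrary)

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks (RAID-5 style parity)."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# A 3+1 stripe: three data blocks plus one parity block.
data = [os.urandom(BLOCK) for _ in range(3)]
parity = xor_blocks(data)

# A silent (undetected) error flips bits in one data block
# without the drive ever reporting a failure.
corrupted = data[0][:1] + bytes([data[0][1] ^ 0xFF]) + data[0][2:]
stripe = [corrupted] + data[1:]

# A parity scrub detects the inconsistency, but plain XOR parity
# cannot tell WHICH block is corrupt -- the gap the fault model targets.
mismatch = xor_blocks(stripe) != parity
print("parity mismatch detected:", mismatch)  # prints: parity mismatch detected: True
```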


Toward I/O-efficient protection against silent data corruptions in RAID arrays
  • Mingqiang Li, P. Lee
  • Computer Science
    2014 30th Symposium on Mass Storage Systems and Technologies (MSST)
  • 2014
TLDR
A systematic study on I/O-efficient integrity protection against silent data corruptions in RAID arrays is presented, and two integrity protection schemes are constructed that provide complementary performance advantages for storage workloads with different user write sizes.
Cooperative Data Protection
TLDR
An analytical framework to evaluate reliability is developed, and both a straightforward End-to-End ZFS (E2ZFS), with the same protection scheme for all components, and a cooperative Zettabyte-reliable ZFS (Z2FS) are implemented; Z2FS is able to achieve better overall performance than E2ZFS while still offering Zettabyte reliability.
Towards Securing Data Transfers Against Silent Data Corruption
TLDR
This paper investigates the robustness of existing end-to-end integrity verification approaches against silent data corruption and proposes a Robust Integrity Verification Algorithm (RIVA) to enhance data integrity, implementing dynamic transfer and checksum parallelism to overcome performance bottlenecks.
Modeling SSD RAID reliability under general settings
TLDR
A new continuous time Markov chain (CTMC) model is proposed to characterize the reliability dynamics of SSD RAID over time under two general settings: (1) fault tolerance against a general number of device failures and (2) non-uniform workload.
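As an illustrative sketch only (not the paper's model or its parameters), the CTMC idea can be shown with the simplest possible case: a single-parity array where the state counts failed devices and a second failure before rebuild completes is an absorbing data-loss state. All rates below are made-up placeholders.

```python
# Toy CTMC for a single-parity group of n disks.
# State k = number of currently failed disks; state 2 is absorbing data loss.
# Rates are illustrative placeholders, NOT values from the paper.
n = 8                  # disks in the group
lam = 1.0 / 100000.0   # per-disk failure rate (per hour)
mu = 1.0 / 24.0        # rebuild (repair) rate (per hour)

# p[k] = probability of being in state k; start with all disks healthy.
p = [1.0, 0.0, 0.0]
dt = 0.1               # hours per explicit Euler step
hours = 5 * 365 * 24   # integrate over five years

for _ in range(int(hours / dt)):
    flow01 = n * lam * p[0]          # any of the n disks fails
    flow10 = mu * p[1]               # rebuild completes
    flow12 = (n - 1) * lam * p[1]    # a second disk fails during rebuild
    p[0] += dt * (flow10 - flow01)
    p[1] += dt * (flow01 - flow10 - flow12)
    p[2] += dt * flow12

print(f"P(data loss within 5 years) ~ {p[2]:.6f}")
```

The paper's contribution is precisely that real settings need more than this: a general number of tolerated failures and non-uniform workload, which this two-transition chain does not capture.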
Modeling Impact of Human Errors on the Data Unavailability and Data Loss of Storage Systems
TLDR
The results show that employing an automatic fail-over policy (using hot spare disks) can reduce the drastic impact of human errors by two orders of magnitude, and that the conventional beliefs about the dependability of different Redundant Array of Independent Disks (RAID) mechanisms should be revised.
RAID-CUBE: The Modern Datacenter Case for RAID
  • Jayanta Basak
TLDR
This paper introduces a new high-availability storage configuration, called RAID-CUBE, and shows that it is more resilient to data loss as the datacenter scales in capacity than existing RAID dual-parity and triple-parity schemes.
Modeling the Fault Tolerance Consequences of Deduplication
TLDR
It is suggested that data deduplication introduces inter-file relationships that may have a negative impact on the fault tolerance of such systems by creating dependencies that can increase the severity of data loss events.
RIVA: Robust Integrity Verification Algorithm for High-Speed File Transfers
TLDR
Robust Integrity Verification Algorithm (RIVA) is proposed to strengthen the integrity of file transfers: by invalidating the memory mappings of file pages after their transfer, it forces checksum computation tasks to read the files directly from disk.
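The core idea, checksumming what is actually on disk rather than what sits in the page cache, can be approximated in a short sketch. This is not RIVA's implementation; `os.posix_fadvise` with `POSIX_FADV_DONTNEED` is a Linux/POSIX-only hint that asks the kernel to drop cached pages before the read.

```python
import hashlib
import os

def checksum_from_disk(path, chunk=1 << 20):
    """Stream a SHA-256 over a file, first asking the kernel to drop any
    cached pages so the read reflects on-disk contents (a rough,
    Linux-only approximation of RIVA's page invalidation; on other
    platforms the hash is simply computed over whatever is cached)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        if hasattr(os, "posix_fadvise"):  # POSIX systems only
            os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()
```

Comparing this digest against one taken at the sender is what exposes a corruption that the page cache would otherwise mask.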
Evaluation and Performance Modeling of a Burst Buffer Solution
TLDR
The results of an evaluation of an emerging technology, DataDirect Networks' (DDN) Infinite Memory Engine (IME), are presented, investigating in which parameter range burst buffers are able to counteract the widening performance gap between compute and I/O.
Reliability challenges for storing exabytes
  • A. Amer, D. Long, T. Schwarz
  • Computer Science
    2014 International Conference on Computing, Networking and Communications (ICNC)
  • 2014
TLDR
It is demonstrated how such systems will suffer substantial annual data loss if only traditional reliability mechanisms are employed, and it is argued that the architecture for exascale storage systems should incorporate novel mechanisms at or below the object level to address this problem.

References

Showing 1-10 of 29 references
Undetected disk errors in RAID arrays
TLDR
The causes of UDEs and their effects on data integrity are discussed, some of the basic techniques that have been applied to address this problem at various software layers in the I/O stack are described and a family of solutions that can be integrated into the RAID subsystem are described.
An analysis of data corruption in the storage stack
TLDR
This article presents the first large-scale study of data corruption, which analyzes corruption instances recorded in production storage systems containing a total of 1.53 million disk drives, over a period of 41 months.
Parity Lost and Parity Regained
TLDR
This work uses model checking to evaluate whether common protection techniques used in parity-based RAID systems are sufficient in light of the increasingly complex failure modes of modern disk drives and identifies a parity pollution problem that spreads corrupt data across multiple disks, thus leading to data loss or corruption.
IRON file systems
TLDR
It is shown that commodity file system failure policies are often inconsistent, sometimes buggy, and generally inadequate in their ability to recover from partial disk failures, so a new fail-partial failure model for disks is suggested, which incorporates realistic localized faults such as latent sector errors and block corruption.
Disk scrubbing versus intra-disk redundancy for high-reliability raid storage systems
TLDR
The results demonstrate that the reliability improvement due to disk scrubbing depends on the scrubbing frequency and the workload of the system, and may not reach the reliability level achieved by a simple IPC-based intra-disk redundancy scheme, which is insensitive to the workload.
Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?
In large-scale systems, where the number of components can approach a million, component failures are a significant problem. In this paper the authors present and analyze failure data from different large production systems.
An analysis of latent sector errors in disk drives
TLDR
This is the first study of such large scale the sample size is at least an order of magnitude larger than previously published studies and the first one to focus specifically on latent sector errors and their implications on the design and reliability of storage systems.
A Case for Redundant Arrays of Inexpensive Disks (RAID)
TLDR
Five levels of RAID are introduced, giving their relative cost/performance, and compared to an IBM 3380 and a Fujitsu Super Eagle.
Markov Chain Models--Rarity And Exponentiality
Contents (excerpt): 0. Introduction and Summary. 1. Discrete Time Markov Chains; Reversibility in Time. 1.00. Introduction. 1.0. Notation, Transition Laws. 1.1. Irreducibility, Aperiodicity, Ergodicity; Stationary