Corpus ID: 5722163

End-to-end Data Integrity for File Systems: A ZFS Case Study

@inproceedings{Zhang2010EndtoendDI,
  title={End-to-end Data Integrity for File Systems: A ZFS Case Study},
  author={Yupu Zhang and Abhishek Rajimwale and Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau},
  booktitle={FAST},
  year={2010}
}
We present a study of the effects of disk and memory corruption on file system data integrity. […] Our analysis reveals the importance of considering both memory and disk in the construction of truly robust file and storage systems.
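To make the abstract's concern concrete, here is a minimal, self-contained sketch (illustrative only; the class and function names are invented and this is not the paper's actual fault-injection framework). It mimics a ZFS-like store that generates a block checksum on write and verifies it when the block is read back from "disk", so an injected disk corruption is detected, while the same bit flip in the cached in-memory copy is silently returned to the reader.

```python
import hashlib

def checksum(block: bytes) -> bytes:
    # ZFS keeps block checksums in parent block pointers; SHA-256 here is
    # just an illustrative stand-in for whatever checksum is configured.
    return hashlib.sha256(block).digest()

class TinyStore:
    """Toy block store: checksums protect the disk path, not cached copies."""
    def __init__(self):
        self.disk = {}    # block id -> (data, checksum) persisted "on disk"
        self.cache = {}   # block id -> data held in "memory" (page cache)

    def write(self, bid: int, data: bytes) -> None:
        self.cache[bid] = data
        self.disk[bid] = (data, checksum(data))   # checksum generated on write

    def read(self, bid: int) -> bytes:
        if bid in self.cache:                     # cache hit: no verification
            return self.cache[bid]
        data, expected = self.disk[bid]
        if checksum(data) != expected:
            raise IOError(f"block {bid}: disk corruption detected")
        self.cache[bid] = data
        return data

def flip_bit(data: bytes, bit: int) -> bytes:
    buf = bytearray(data)
    buf[bit // 8] ^= 1 << (bit % 8)
    return bytes(buf)

store = TinyStore()
store.write(0, b"important user data")

# Inject a disk fault: caught by checksum verification on the next uncached read.
data, expected = store.disk[0]
store.disk[0] = (flip_bit(data, 3), expected)
store.cache.clear()
try:
    store.read(0)
except IOError as err:
    print(err)

# Inject a memory fault: the corrupted cached copy is returned without complaint.
store.write(0, b"important user data")
store.cache[0] = flip_bit(store.cache[0], 3)
print(store.read(0))
```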

Citations

High Performance Metadata Integrity Protection in the WAFL Copy-on-Write File System
We introduce a low-cost incremental checksum technique that protects metadata blocks against in-memory scribbles, and a lightweight digest-based transaction auditing mechanism that enforces file system consistency invariants.
Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions
TLDR: It is found that modern distributed systems do not consistently use redundancy to recover from file-system faults: a single file-system fault can cause catastrophic outcomes such as data loss, corruption, and unavailability.
Can Applications Recover from fsync Failures?
TLDR: The findings show that although applications use many failure-handling strategies, none are sufficient: fsync failures can cause catastrophic outcomes such as data loss and corruption.
Integrated end-to-end dependability in the Loris storage stack
TLDR: An integrated approach is presented that combines several techniques to protect the Loris storage stack against these dependability threats, all the way from the disk driver layer to the virtual file system (VFS) layer.
Cooperative Data Protection
TLDR: An analytical framework to evaluate reliability is developed, and a straightforward End-to-End ZFS (E²ZFS) that uses the same protection scheme for all components is implemented as a baseline; a more flexible protection scheme achieves better overall performance than E²ZFS while still offering Zettabyte reliability.
Consistency without ordering
TLDR: This paper introduces the No-Order File System (NoFS), a simple, lightweight file system that employs a novel technique called backpointer-based consistency to provide crash consistency without ordering writes as they go to disk.
Understanding the Fault Resilience of File System Checkers
File system checkers serve as the last line of defense to recover a corrupted file system back to a consistent state. Therefore, their reliability is critically important.
NOVA-Fortis: A Fault-Tolerant Non-Volatile Main Memory File System
TLDR: An NVMM-optimized file system called NOVA-Fortis is described that is both fast and resilient in the face of corruption due to media errors and software bugs, and that outperforms reliable, block-based file systems running on NVMM by 3x on average.
Checksumming RAID
TLDR: A checksumming mechanism is integrated into Linux’s Multi-Device Software RAID layer so that it is able to detect and correct silent data corruptions in storage systems.
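The Checksumming RAID entry above pairs per-block checksums with RAID redundancy so that silent corruption can be both detected and repaired. The following is a toy sketch of that verify-then-repair idea for a two-way mirror; it assumes nothing about the actual Linux Multi-Device implementation, and all names are illustrative.

```python
import hashlib

def block_checksum(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

class ChecksummedMirror:
    """Toy two-way mirror with per-block checksums (illustrative only)."""
    def __init__(self):
        self.disks = [{}, {}]   # two replicas: block id -> data
        self.sums = {}          # block id -> expected checksum

    def write(self, bid: int, data: bytes) -> None:
        for disk in self.disks:
            disk[bid] = data
        self.sums[bid] = block_checksum(data)

    def read(self, bid: int) -> bytes:
        # Verify each replica; serve the first good copy and repair the rest.
        for disk in self.disks:
            data = disk[bid]
            if block_checksum(data) == self.sums[bid]:
                for other in self.disks:
                    if other[bid] != data:
                        other[bid] = data       # overwrite the corrupt replica
                return data
        raise IOError(f"block {bid}: all replicas corrupt")

mirror = ChecksummedMirror()
mirror.write(7, b"hello")
mirror.disks[0][7] = b"hellO"          # inject silent corruption on one replica
assert mirror.read(7) == b"hello"      # detected via checksum, served from mirror
assert mirror.disks[0][7] == b"hello"  # ...and the bad replica was repaired
```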

References

SHOWING 1-10 OF 65 REFERENCES
The effects of metadata corruption on NFS
TLDR: This work studies the failure handling and recovery mechanisms of a widely used distributed file system, Linux NFS, and finds that the NFS protocol behaves in unexpected ways in the presence of corruptions.
Unifying File System Protection
TLDR: The protected file system (PFS) is a file system that unifies the meta-data update protection of journaling with strong data integrity and is an end-to-end solution that will work with any block-oriented device, from a disk drive to a monolithic RAID system, without modification.
Analyzing the effects of disk-pointer corruption
TLDR: A new technique called type-aware pointer corruption is developed to systematically explore how a file system reacts to corrupt pointers, and it is found that NTFS and ext3 do not recover from most corruptions, including many scenarios for which they possess sufficient redundant information, leading to further corruption, crashes, and unmountable file systems.
IRON file systems
TLDR: It is shown that commodity file system failure policies are often inconsistent, sometimes buggy, and generally inadequate in their ability to recover from partial disk failures, so a new fail-partial failure model for disks is suggested, which incorporates realistic localized faults such as latent sector errors and block corruption.
An analysis of data corruption in the storage stack
TLDR: This article presents the first large-scale study of data corruption, which analyzes corruption instances recorded in production storage systems containing a total of 1.53 million disk drives, over a period of 41 months.
EXPLODE: a lightweight, general system for finding serious storage system errors
TLDR: EXPLODE is a system that makes it easy to systematically check real storage systems for errors; it takes user-written, potentially system-specific checkers and uses them to drive a storage system into tricky corner cases, including crash recovery errors.
Journaling the Linux ext2fs Filesystem
TLDR: A design intended to improve the speed and reliability of crash recovery in ext2fs by adding a transactional journal to the filesystem.
Dependability Analysis of Virtual Memory Systems
TLDR: It is found that failure handling policies in current virtual memory systems are at best simplistic, and often inconsistent or even absent, and possible reasons for poor failure handling are identified, which can help in the design of a failure-aware virtual memory system.
Parity Lost and Parity Regained
TLDR: This work uses model checking to evaluate whether common protection techniques used in parity-based RAID systems are sufficient in light of the increasingly complex failure modes of modern disk drives, and identifies a parity pollution problem that spreads corrupt data across multiple disks, thus leading to data loss or corruption (a toy illustration of parity pollution follows this reference list).
A white paper on the benefits of chipkill-correct ECC for PC server main memory
TLDR: This paper addresses one area of concern in the RAS arena of PC servers at the lower end of the server spectrum: error recovery when an entire DRAM chip fails.
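As a small illustration of the parity pollution problem named in the "Parity Lost and Parity Regained" entry above (a hypothetical toy example, not the paper's model-checking setup): once a silently corrupted block is folded into a recomputed parity strip, the redundancy agrees with the corruption and the original data can no longer be rebuilt.

```python
from functools import reduce

def xor_blocks(blocks):
    """Bytewise XOR across equal-sized blocks (RAID-5-style parity)."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

# A three-block stripe and its parity.
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks([d0, d1, d2])

# d0 is silently corrupted on disk; without checksums nothing notices.
d0_on_disk = b"AAAX"

# While parity still reflects the original data, d0 could in principle be
# rebuilt from the surviving blocks and the parity strip.
assert xor_blocks([d1, d2, parity]) == b"AAAA"

# Parity pollution: a later parity recomputation reads the corrupt d0 and
# folds it into the new parity. Redundancy now agrees with the corruption,
# so rebuilding d0 returns the corrupt contents and the original is lost.
parity = xor_blocks([d0_on_disk, d1, d2])
assert xor_blocks([d1, d2, parity]) == b"AAAX"
```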