Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to File-System Faults

Abstract

We analyze how modern distributed storage systems behave in the presence of file-system faults such as data corruption and read and write errors. We characterize eight popular distributed storage systems and uncover numerous problems related to file-system fault tolerance. We find that modern distributed systems do not consistently use redundancy to recover from file-system faults: a single file-system fault can cause catastrophic outcomes such as data loss, corruption, and unavailability. We also find that these outcomes stem from fundamental problems in file-system fault handling that are common across many systems. Our results have implications for the design of next-generation fault-tolerant distributed and cloud storage systems.

DOI: 10.1145/3125497

Cite this paper

@article{Ganesan2017RedundancyDN,
  title={Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to File-System Faults},
  author={Aishwarya Ganesan and Ramnatthan Alagappan and Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau},
  journal={TOS},
  year={2017},
  volume={13},
  pages={20:1-20:33}
}