Building Reliable High-Performance Storage Systems: An Empirical and Analytical Study

  • Zhi Qiao, Song Fu, Hsing-bung Chen, Bradley W. Settlemyer
  • Published 1 September 2019
  • Computer Science
  • 2019 IEEE International Conference on Cluster Computing (CLUSTER)
Due to the vast storage needs of high-performance computing (HPC), the scale and complexity of storage systems in HPC data centers continue to grow. Disk failures have become the norm. With ever-increasing disk capacity, RAID recovery based on disk rebuild becomes more and more expensive, causing significant performance degradation and even unavailability of storage systems. Declustered redundant array of independent disks shuffles data and parity blocks among all drives in a RAID group…
A Smart Background Scheduler for Storage Systems
  • Maher Kachmar, D. Kaeli
  • Computer Science
  • 2020 28th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)
  • 2020
This work proposes a priority-based background scheduler that learns a repetitive high/low workload pattern and allows storage systems to maintain peak performance and meet service level objectives (SLOs) while supporting a number of data services.


Characterizing and Modeling Reliability of Declustered RAID for HPC Storage Systems
Improved recovery performance is found to yield higher storage reliability than traditional RAID, and the reliability of declustered RAID is analyzed in terms of the mean time to data loss (MTTDL).
Evaluation of distributed recovery in large-scale storage systems
  • Qin Xin, E. L. Miller, T. Schwarz
  • Computer Science
  • Proceedings. 13th IEEE International Symposium on High performance Distributed Computing, 2004.
  • 2004
This work presents the fast recovery mechanism (FARM), a distributed recovery approach that exploits excess disk capacity and reduces data recovery time, and examines essential factors that influence system reliability, performance, and costs by simulating system behavior under disk failures.
Multi-Partition RAID: A New Method for Improving Performance of Disk Arrays under Failure
A new variation of RAID organization, multi-partition RAID (mP-RAID), is proposed to improve storage efficiency and reduce performance degradation when disk failures occur.
On the role of burst buffers in leadership-class storage systems
It is shown that burst buffers can accelerate the application-perceived throughput to the external storage system and can reduce the amount of external storage bandwidth required to meet a desired application-perceived bottleneck goal.
Improving Availability of RAID-Structured Storage Systems by Workload Outsourcing
The lightweight prototype implementation of WorkOut and extensive trace-driven and benchmark-driven experiments demonstrate that, compared with existing approaches, WorkOut effectively improves the performance of the low-priority background tasks, such as RAID reconstruction and RAID resynchronization.
Performance Analysis of Disk Arrays under Failure
A new variation of the RAID organization is proposed that has significant advantages: it reduces the magnitude of the performance degradation when there is a single failure and can also reduce the mean time to system failure.
An Early Functional and Performance Experiment of the MarFS Hybrid Storage EcoSystem
The system architecture of the proposed MarFS near-POSIX file system is presented, early functional performance testing cases on MarFS's software components are conducted, and the current deployment status and future development work on MarFS are addressed.
RAIDShield: Characterizing, Monitoring, and Proactively Protecting Against Disk Failures
A method is designed that uses the joint failure probability to quantify and predict how likely a RAID group is to face multiple simultaneous disk failures; it can identify disks that collectively represent a risk of failure even when no individual disk is flagged in isolation.
Characterizing Disk Health Degradation and Proactively Protecting Against Disk Failures for Reliable Storage Systems
This paper aims to uncover the entire process by which a disk's health deteriorates and to forecast when disk drives will fail, and models the disk degradation processes as functions of SMART attributes, which eliminates the dependency on time and thus I/O workload.
Parity declustering for continuous operation in redundant disk arrays
It is shown that declustered parity penalizes user response time while a disk is being repaired (before and during its recovery) less than comparable non-declustered (RAID 5) organizations, without any penalty to user response time in the fault-free state.