Reliability of modular mesh-connected intelligent storage brick systems

Abstract

A key objective of the IBM Intelligent Bricks project is to create a highly reliable system from commodity components. We envision such systems as being architected for a service model called fail-in-place, or deferred maintenance. Delaying service actions, possibly for the entire lifetime of the system, simplifies system management. This paper examines the hardware reliability and deferred maintenance of intelligent storage brick (ISB) systems, assuming a mesh-connected collection of bricks in which each brick includes processing power, memory, networking, and storage. On the basis of Monte Carlo simulations, we quantify the fraction of bricks that a distributed data redundancy scheme can no longer use because of degraded internal bandwidth and loss of external host connectivity. We derive a system hardware reliability expression and predict the length of time ISB systems can operate without replacement of failed bricks. We also show via a Markov analysis the level of fault tolerance that the data redundancy scheme requires to achieve a goal of fewer than two data loss events per exabyte-year due to multiple failures.
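To give a feel for the kind of Monte Carlo experiment the abstract refers to, the following is a minimal illustrative sketch, not the authors' simulator. It assumes a hypothetical n x n x n mesh in which each brick fails independently with probability `fail_prob` and is never replaced, and it treats a live brick as usable only if it belongs to the largest connected group of live bricks. The paper's own criteria (internal bandwidth degradation and external host connectivity) are richer than this connectivity-only proxy, and all function names and parameters here are assumptions made for illustration.

```python
import random
from collections import deque

# Six face neighbors of a brick in a 3D mesh (assumed topology).
NEIGHBOR_OFFSETS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0),
                    (0, -1, 0), (0, 0, 1), (0, 0, -1))


def largest_live_component(n, failed):
    """Return the size of the largest connected group of live bricks
    in an n*n*n mesh, given a set of failed (x, y, z) coordinates."""
    seen = set()
    best = 0
    for x in range(n):
        for y in range(n):
            for z in range(n):
                start = (x, y, z)
                if start in failed or start in seen:
                    continue
                # Breadth-first search over live neighbors.
                seen.add(start)
                queue = deque([start])
                size = 0
                while queue:
                    cx, cy, cz = queue.popleft()
                    size += 1
                    for dx, dy, dz in NEIGHBOR_OFFSETS:
                        nb = (cx + dx, cy + dy, cz + dz)
                        inside = all(0 <= c < n for c in nb)
                        if inside and nb not in failed and nb not in seen:
                            seen.add(nb)
                            queue.append(nb)
                best = max(best, size)
    return best


def usable_fraction(n, fail_prob, trials=200, seed=0):
    """Estimate the mean fraction of all bricks that remain usable
    (under the connectivity-only proxy) with no brick replacement."""
    rng = random.Random(seed)
    total = n ** 3
    acc = 0.0
    for _ in range(trials):
        failed = {
            (x, y, z)
            for x in range(n) for y in range(n) for z in range(n)
            if rng.random() < fail_prob
        }
        acc += largest_live_component(n, failed) / total
    return acc / trials
```

Sweeping `fail_prob` from 0 to 1 with a call such as `usable_fraction(6, p)` shows how the usable fraction falls below the live fraction `1 - p` once failures start to isolate bricks, which is the qualitative effect a fail-in-place design has to budget for.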

DOI: 10.1147/rd.502.0199

Cite this paper

@article{Fleiner2006ReliabilityOM, title={Reliability of modular mesh-connected intelligent storage brick systems}, author={Claudio Fleiner and Robert B. Garner and James Lee Hafner and K. K. Rao and Deepak R. Kenchammana-Hosekote and Winfried W. Wilcke and Joseph S. Glider}, journal={IBM Journal of Research and Development}, year={2006}, volume={50}, pages={199-208} }