A Large-Scale Study of Flash Memory Failures in the Field

  title={A Large-Scale Study of Flash Memory Failures in the Field},
  author={Justin Meza and Qiang Wu and Sanjeev Kumar and Onur Mutlu},
  journal={ACM SIGMETRICS Performance Evaluation Review},
  pages={177 - 190}
Servers use flash memory based solid state drives (SSDs) as a high-performance alternative to hard disk drives to store persistent data. Unfortunately, recent increases in flash density have also brought about decreases in chip-level reliability. In a data center environment, flash-based SSD failures can lead to downtime and, in the worst case, data loss. As a result, it is important to understand flash memory reliability characteristics over flash lifetime in a realistic production data center… 
Flash Reliability in Production: The Expected and the Unexpected
This large-scale field study covers many millions of drive days, ten different drive models, and different flash technologies, and finds no evidence that higher-end SLC drives are more reliable than MLC drives within typical drive lifetimes.
Understanding SSD Reliability in Large-Scale Cloud Systems
This paper takes a holistic view of the reliability of SSD-based storage systems in Alibaba's datacenters, covering about half a million SSDs under representative cloud services over three years, and discovers a number of interesting correlations.
Evaluating Reliability of SSD-Based I/O Caches in Enterprise Storage Systems
A physical fault injection and failure detection platform is developed, and the impact of workload-dependent parameters on the reliability of I/O caches is investigated in the presence of two failure types common in data centers: power outages and high-temperature faults.
An In-Depth Study of Correlated Failures in Production SSD-Based Data Centers
It is shown that correlated failures in the same node or rack are common. The factors that may contribute to those correlated failures are studied, and trace-driven simulation is used to evaluate how various redundancy schemes affect storage reliability under correlated failures.
Reliability of NAND-Based SSDs: What Field Studies Tell Us
This article provides an overview of what has been learned about flash reliability in production, contrasting it where appropriate with prior studies based on controlled experiments.
Reliability Characterization of Solid State Drives in a Scalable Production Datacenter
The results show that 1) media wear affects the reliability of SSDs more than any other factor, and 2) SSDs transition from one health group to another, which implies reliability degradation of those drives.
Exploiting Data Longevity for Enhancing the Lifetime of Flash-based Storage Class Memory
An extensive simulation-based analysis of an SLC flash-based SCM demonstrates that D-SLC significantly improves device lifetime with no performance overhead and only very small changes to the FTL software.
Errors in Flash-Memory-Based Solid-State Drives: Analysis, Mitigation, and Recovery
This chapter describes several mitigation and recovery techniques, including cell-to-cell interference mitigation; optimal multi-level cell sensing; error correction using state-of-the-art algorithms and methods; and data recovery when error correction fails.
Reliability Issues in Flash-Memory-Based Solid-State Drives: Experimental Analysis, Mitigation, Recovery
This chapter describes several mitigation and recovery techniques, including cell-to-cell interference mitigation; optimal multi-level cell sensing; error correction using state-of-the-art algorithms and methods; and data recovery when error correction fails.
LDM: Log Disk Mirroring with Improved Performance and Reliability for SSD-Based Disk Arrays
This article proposes Log Disk Mirroring (LDM), a hybrid disk array architecture consisting of several SSDs and two hard disk drives. LDM significantly outperforms pure SSD-based disk arrays and outperforms HPDA by a factor of 5.0 on average.


The bleak future of NAND flash memory
It is shown that future gains in density will come at significant drops in performance and reliability, and SSD manufacturers and users will face a tough choice in trading off between cost, performance, capacity and reliability.
Flash correct-and-refresh: Retention-aware error management for increased flash memory lifetime
Yu Cai, Gulay Yalcin, K. Mai. 2012 IEEE 30th International Conference on Computer Design (ICCD), 2012.
New techniques called Flash Correct-and-Refresh (FCR) are developed that tolerate high bit error rates without requiring prohibitively strong ECC, providing a 46× average lifetime improvement on a variety of workloads at no additional hardware cost.
SDF: software-defined flash for web-scale internet storage systems
Measurements show that SDF can deliver approximately 95% of the raw flash bandwidth and provide 99% of the flash capacity for user data. It increases I/O bandwidth by 300% and reduces per-GB hardware cost by 50% on average compared with the commodity-SSD-based system used at Baidu.
Error patterns in MLC NAND flash memory: Measurement, characterization, and analysis
A framework for fast and accurate characterization of flash memory throughout its lifetime is designed and implemented, and distinct error patterns, such as cycle dependency, location dependency, and value dependency, are demonstrated for various types of flash operations.
Data retention in MLC NAND flash memory: Characterization, optimization, and recovery
This paper describes how the threshold voltage distribution of flash memory changes with retention age (the length of time since a flash cell was programmed) and proposes two new techniques, Retention Optimized Reading and Retention Failure Recovery, which can effectively recover data from otherwise uncorrectable flash errors.
Read Disturb Errors in MLC NAND Flash Memory: Characterization, Mitigation, and Recovery
This paper experimentally characterizes read disturb errors on state-of-the-art 2Y-nm (i.e., 20-24 nm) MLC NAND flash memory chips and finds that lowering pass-through voltage levels reduces the impact of read disturb and extends flash lifetime.
Threshold voltage distribution in MLC NAND flash memory: Characterization, analysis, and modeling
A key result is that the threshold voltage distribution can be modeled, with more than 95% accuracy, as a Gaussian distribution with additive white noise, which shifts to the right and widens as P/E cycles increase.
Neighbor-cell assisted error correction for MLC NAND flash memories
This paper provides a detailed statistical and experimental characterization of the threshold voltage distribution of flash memory cells conditional upon the immediate-neighbor cell values, and shows that such conditional distributions can be used to determine a set of read reference voltages that lead to error rates much lower than when a single set of reference voltage values based on the overall distribution is used.
Program interference in MLC NAND flash memory: Characterization, modeling, and mitigation
A new model is developed that predicts the amount of program interference as a function of threshold voltage values and changes in neighboring cells and can reduce the raw flash bit error rate by 64% and thereby improve flash lifetime by 30%.