A Survey of Techniques for Modeling and Improving Reliability of Computing Systems
Soft errors arising from energetic particle strikes pose a significant reliability concern for computing systems, especially those operating in noisy environments. Technology scaling and aggressive leakage control mechanisms make the problem caused by these transient errors even more severe. It is therefore important to employ reliability-enhancing mechanisms in processor and memory designs to protect them against soft errors. To do so, we first need to model soft errors and then use the model to study cost/reliability tradeoffs among reliability-enhancing techniques so that system requirements can be met. Because cache memories occupy the largest fraction of on-chip real estate today, and their share is expected to keep growing in future designs, they are more vulnerable to soft errors than many other components of a computing system. In this paper, we first develop a soft error model for L1 data caches and then explore different reliability-enhancing mechanisms. More specifically, we define a metric called AVFC (Architectural Vulnerability Factor for Caches), which represents the probability that a fault in the cache becomes visible in the final output of the program. Based on this model, we then propose three architectural schemes for improving reliability in the presence of soft errors. Our first scheme prevents an error from propagating to the lower levels of the memory hierarchy by not forwarding the unmodified data words of a dirty cache block to the L2 cache when the dirty block is replaced. The second scheme selectively invalidates cache blocks to shorten their vulnerable periods, decreasing their chances of being struck by a soft error. Measured with the AVFC metric, our experimental results show that these two schemes are very effective in alleviating soft errors in the L1 data cache. Specifically, with our first scheme it is possible to improve the AVFC metric by 32% without any performance loss.
The second scheme, on the other hand, improves the AVFC metric by between 60% and 97%, at the cost of a performance degradation that varies from 0% to 21.3%, depending on how aggressively cache blocks are invalidated. To reduce the performance overhead caused by cache block invalidation, we also propose a third scheme, which tries to bring a fresh copy of the invalidated block into the cache via prefetching. Our experimental results indicate that this scheme can reduce the performance overhead to less than 1% for all applications in our experimental suite, at the cost of giving up a tolerable portion of the reliability enhancement that the second scheme achieves.
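
To make the first scheme concrete, the following is a minimal sketch, not the implementation evaluated in this paper. It assumes that each L1 block keeps one dirty flag per word; the block size, the `L1Block` class, and the `write_back` helper are hypothetical names introduced only for illustration. On eviction, only words that the program actually stored to are copied to the L2 copy, so a bit flip in an unmodified word is discarded rather than propagated.

```python
from dataclasses import dataclass, field

WORDS_PER_BLOCK = 8  # hypothetical block geometry, chosen for illustration


@dataclass
class L1Block:
    """An L1 data cache block with an assumed per-word dirty flag."""
    words: list
    dirty: list = field(default_factory=lambda: [False] * WORDS_PER_BLOCK)

    def store(self, index, value):
        # A program store modifies the word and marks it dirty.
        self.words[index] = value
        self.dirty[index] = True

    def flip_bit(self, index, bit):
        # Model a soft error: one bit flips, and the dirty flags do not change.
        self.words[index] ^= 1 << bit


def write_back(block, l2_block):
    """Evict a dirty block, forwarding only its modified words to L2."""
    for i, is_dirty in enumerate(block.dirty):
        if is_dirty:
            l2_block[i] = block.words[i]
    return l2_block


# The L2 cache holds a clean copy of the block.
l2_copy = [10, 20, 30, 40, 50, 60, 70, 80]
blk = L1Block(words=list(l2_copy))

blk.store(2, 99)    # the program modifies word 2
blk.flip_bit(5, 3)  # a particle strike corrupts unmodified word 5

after = write_back(blk, list(l2_copy))
print(after)
```

In this sketch the corrupted value of word 5 never reaches the L2 copy, because that word was never marked dirty. An error that strikes a word the program has modified would still be written back, so the scheme narrows the window of vulnerability rather than eliminating it.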