Abstract
Increased power densities (and resultant temperatures) and other effects of device scaling are predicted to cause significant lifetime reliability problems in the near future. In this paper, we study two techniques that leverage microarchitectural structural redundancy for lifetime reliability enhancement. First, in structural duplication (SD), redundant microarchitectural structures are added to the processor and designated as spares. Spare structures can be turned on when the original structure fails, increasing the processor's lifetime. Second, graceful performance degradation (GPD) is a technique which exploits existing microarchitectural redundancy for reliability. Redundant structures that fail are shut down while still maintaining functionality, thereby increasing the processor's lifetime, but at a lower performance. Our analysis shows that exploiting structural redundancy can provide significant reliability benefits, and we present guidelines for efficient usage of these techniques by identifying situations where each is more beneficial. We show that GPD is the superior technique when only limited performance or cost resources can be sacrificed for reliability. Specifically, on average for our systems and applications, GPD increased processor reliability to 1.42 times the base value for less than a 5% loss in performance. On the other hand, for systems where reliability is more important than performance or cost, SD is more beneficial. SD increases reliability to 3.17 times the base value for 2.25 times the base cost, for our applications. Finally, a combination of the two techniques (SD+GPD) provides the highest reliability benefit.
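The difference between SD, GPD, and SD+GPD described above can be illustrated with a small Monte Carlo sketch. This is not the paper's methodology. The function name `simulate_mean_lifetimes`, the exponential unit lifetimes, the assumption that spares do not age while powered off, and the series-failure model (the processor fails as soon as any structure fails) are all assumptions made for illustration only.

```python
import random


def simulate_mean_lifetimes(n_structures=4, units_per_structure=2,
                            trials=5000, seed=0):
    """Estimate mean processor lifetime under four redundancy policies.

    Illustrative assumptions (not from the paper): unit lifetimes are
    independent and exponential with mean 1.0, spares do not age while
    powered off, and the processor fails when any structure fails.
    """
    rng = random.Random(seed)
    totals = {"base": 0.0, "sd": 0.0, "gpd": 0.0, "sd+gpd": 0.0}
    for _ in range(trials):
        per_policy = {name: [] for name in totals}
        for _ in range(n_structures):
            units = [rng.expovariate(1.0) for _ in range(units_per_structure)]
            spares = [rng.expovariate(1.0) for _ in range(units_per_structure)]
            # Base: the structure is lost at its first unit failure.
            per_policy["base"].append(min(units))
            # SD: a spare structure is switched on after the first failure.
            per_policy["sd"].append(min(units) + min(spares))
            # GPD: failed units are shut down; the structure keeps working
            # (at lower performance) until its last unit fails.
            per_policy["gpd"].append(max(units))
            # SD+GPD: degrade gracefully, then switch to the spare structure
            # and degrade it gracefully as well.
            per_policy["sd+gpd"].append(max(units) + max(spares))
        # Series model: the processor lives as long as its weakest structure.
        for name, lifetimes in per_policy.items():
            totals[name] += min(lifetimes)
    return {name: total / trials for name, total in totals.items()}


if __name__ == "__main__":
    means = simulate_mean_lifetimes()
    for name, value in means.items():
        ratio = value / means["base"]
        print(f"{name:7s} mean lifetime relative to base: {ratio:.2f}")
```

The ratios this prints depend entirely on the placeholder failure model and should not be expected to match the 1.42 and 3.17 figures reported above. The sketch also ignores the performance loss of GPD and the area cost of SD, which are the trade-offs the paper actually evaluates.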
Exploiting Structural Duplication for Lifetime Reliability Enhancement
ISCA '05: Proceedings of the 32nd Annual International Symposium on Computer Architecture