Abstract
As chip densities and clock rates increase, processors are becoming more susceptible to transient faults that can affect program correctness. Up to now, system designers have primarily considered hardware-only and software-only fault-detection mechanisms to identify and mitigate the deleterious effects of transient faults. These two fault-detection systems, however, are extremes in the design space, representing sharp trade-offs between hardware cost, reliability, and performance. In this paper, we identify hybrid hardware/software fault-detection mechanisms as promising alternatives to hardware-only and software-only systems. These hybrid systems offer designers more options to fit their reliability needs within their hardware and performance budgets. We propose and evaluate CRAFT, a suite of three such hybrid techniques, to illustrate the potential of the hybrid approach. For fair, quantitative comparisons among hardware, software, and hybrid systems, we introduce a new metric, Mean Work To Failure, which is able to compare systems for which machine instructions do not represent a constant unit of work. Additionally, we present a new simulation framework which rapidly assesses reliability and does not depend on manual identification of failure modes. Our evaluation illustrates that CRAFT, and hybrid techniques in general, offer attractive options in the fault-detection design space.
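The abstract names Mean Work To Failure (MWTF) but does not define it. As a minimal sketch, assume MWTF is the amount of work completed divided by the number of failures encountered, which can equivalently be written as the reciprocal of (raw error rate × vulnerability factor × execution time per unit of work). The function names and all numeric values below are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def mean_work_to_failure(work_units: float, failures: float) -> float:
    """Work completed per failure (assumed definition, not quoted from the abstract)."""
    if failures <= 0:
        raise ValueError("failures must be positive")
    return work_units / failures


def mwtf_from_rates(raw_error_rate: float, vulnerability: float,
                    time_per_work_unit: float) -> float:
    """Equivalent form: 1 / (raw error rate * vulnerability * time per work unit).

    raw_error_rate     -- faults per unit time (hypothetical value)
    vulnerability      -- fraction of faults that become failures, between 0 and 1
    time_per_work_unit -- time needed to finish one unit of work (e.g. one program run)
    """
    denom = raw_error_rate * vulnerability * time_per_work_unit
    if denom <= 0:
        raise ValueError("all factors must be positive")
    return 1.0 / denom


# Hypothetical comparison: a protected system runs 1.5x slower per unit of work
# but lets 10x fewer faults turn into failures.
RAW_ERROR_RATE = 1e-6  # assumed, identical for both systems

baseline = mwtf_from_rates(RAW_ERROR_RATE, vulnerability=0.20, time_per_work_unit=1.0)
protected = mwtf_from_rates(RAW_ERROR_RATE, vulnerability=0.02, time_per_work_unit=1.5)
improvement = protected / baseline

print(f"baseline MWTF:  {baseline:.3e} work units per failure")
print(f"protected MWTF: {protected:.3e} work units per failure")
print(f"improvement:    {improvement:.2f}x")
```

Because work is counted in application-level units (for example, completed program runs) rather than instructions, the slowdown of a technique that adds redundant instructions appears in `time_per_work_unit` and does not inflate the amount of work done.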
Design and Evaluation of Hybrid Fault-Detection Systems
ISCA '05: Proceedings of the 32nd Annual International Symposium on Computer Architecture