
Design and Evaluation of Hybrid Fault-Detection Systems

Published: 01 May 2005

Abstract

As chip densities and clock rates increase, processors are becoming more susceptible to transient faults that can affect program correctness. Up to now, system designers have primarily considered hardware-only and software-only fault-detection mechanisms to identify and mitigate the deleterious effects of transient faults. These two fault-detection systems, however, are extremes in the design space, representing sharp trade-offs between hardware cost, reliability, and performance. In this paper, we identify hybrid hardware/software fault-detection mechanisms as promising alternatives to hardware-only and software-only systems. These hybrid systems offer designers more options to fit their reliability needs within their hardware and performance budgets. We propose and evaluate CRAFT, a suite of three such hybrid techniques, to illustrate the potential of the hybrid approach. For fair, quantitative comparisons among hardware, software, and hybrid systems, we introduce a new metric, Mean Work To Failure, which is able to compare systems for which machine instructions do not represent a constant unit of work. Additionally, we present a new simulation framework which rapidly assesses reliability and does not depend on manual identification of failure modes. Our evaluation illustrates that CRAFT, and hybrid techniques in general, offer attractive options in the fault-detection design space.



Published in

ACM SIGARCH Computer Architecture News, Volume 33, Issue 2 (ISCA 2005), May 2005, 531 pages
ISSN: 0163-5964
DOI: 10.1145/1080695

ISCA '05: Proceedings of the 32nd annual international symposium on Computer Architecture, June 2005, 541 pages
ISBN: 076952270X

Copyright © 2005 Authors

Publisher

Association for Computing Machinery, New York, NY, United States
