DOI: 10.1145/2063384.2063445

System implications of memory reliability in exascale computing

Published: 12 November 2011

ABSTRACT

Resiliency will be one of the toughest challenges in future exascale systems. Memory errors contribute more than 40% of the total hardware-related failures and are projected to increase in future exascale systems. Error correction codes (ECC) and checkpointing are two effective approaches to fault tolerance. While numerous studies examine ECC or checkpointing in isolation, this is the first paper to investigate the combined effect of both on overall system performance and power. Specifically, we study the impact of various ECC schemes (SECDED, BCH, and chipkill) in conjunction with checkpointing on future exascale systems. Our simulation results show that while chipkill is 13% better for computation-intensive applications, BCH has a 28% advantage in system energy-delay product (EDP) for memory-intensive applications. We also propose using BCH in tagged memory systems with commodity DRAMs, where chipkill is impractical. Our proposed architecture achieves 2.3x better system EDP than state-of-the-art tagged memory systems.
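To make the interaction between ECC strength, checkpointing, and EDP concrete, the following is a minimal sketch, not the paper's simulation model. It assumes the standard first-order checkpoint approximation, in which the optimum interval is sqrt(2 * checkpoint cost * MTBF) and the wasted-time fraction is the checkpoint cost per interval plus the expected rework after a failure. It also takes EDP as energy multiplied by delay. All function names and input values are hypothetical and chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost_s, mtbf_s):
    """First-order optimum checkpoint interval: tau = sqrt(2 * delta * M)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def overhead_fraction(interval_s, checkpoint_cost_s, mtbf_s):
    """First-order wasted-time fraction: checkpoint writes plus expected rework."""
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


def system_edp(work_s, power_w, checkpoint_cost_s, mtbf_s):
    """Energy-delay product (J*s) of a run checkpointed at the first-order optimum."""
    tau = young_interval(checkpoint_cost_s, mtbf_s)
    delay_s = work_s * (1.0 + overhead_fraction(tau, checkpoint_cost_s, mtbf_s))
    energy_j = power_w * delay_s
    return energy_j * delay_s


# Hypothetical inputs, not values from the paper: a weaker ECC scheme that
# draws less power but fails more often, and a stronger scheme that draws
# more power but extends the mean time between failures (MTBF).
weak_ecc = system_edp(work_s=3600.0, power_w=100.0,
                      checkpoint_cost_s=60.0, mtbf_s=1800.0)
strong_ecc = system_edp(work_s=3600.0, power_w=110.0,
                        checkpoint_cost_s=60.0, mtbf_s=18000.0)
print(f"weak ECC EDP:   {weak_ecc:.3e} J*s")
print(f"strong ECC EDP: {strong_ecc:.3e} J*s")
```

Under these made-up inputs the stronger scheme wins on EDP because the reduction in checkpoint and rework time outweighs its extra power. With a different ratio of power cost to MTBF gain the ordering can flip, which is the kind of trade-off the abstract describes between chipkill and BCH.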

Published in

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
November 2011
866 pages
ISBN: 9781450307710
DOI: 10.1145/2063384

Copyright © 2011 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
Acceptance Rates

SC '11 paper acceptance rate: 74 of 352 submissions, 21%. Overall acceptance rate: 1,516 of 6,373 submissions, 24%.