ABSTRACT
Resiliency will be one of the toughest challenges in future exascale systems. Memory errors account for more than 40% of all hardware-related failures and are projected to increase in future exascale systems. Error correction codes (ECC) and checkpointing are two effective approaches to fault tolerance. While numerous studies examine ECC or checkpointing in isolation, this is the first paper to investigate the combined effect of both on overall system performance and power. Specifically, we study the impact of various ECC schemes (SECDED, BCH, and chipkill) in conjunction with checkpointing on future exascale systems. Our simulation results show that while chipkill is 13% better for computation-intensive applications, BCH has a 28% advantage in system energy-delay product (EDP) for memory-intensive applications. We also propose using BCH in tagged memory systems with commodity DRAMs, where chipkill is impractical. Our proposed architecture achieves 2.3x better system EDP than state-of-the-art tagged memory systems.
Index Terms
- System implications of memory reliability in exascale computing