research-article

System implications of memory reliability in exascale computing

Authors:
Sheng Li, Hewlett-Packard Labs
Ke Chen, University of Notre Dame and Hewlett-Packard Labs
Ming-Yu Hsieh, Sandia National Labs
Naveen Muralimanohar, Hewlett-Packard Labs
Chad D. Kersey, Georgia Institute of Technology
Jay B. Brockman, University of Notre Dame
Arun F. Rodrigues, Sandia National Labs
Norman P. Jouppi, Hewlett-Packard Labs

SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, November 2011, Article No. 46, Pages 1–12. https://doi.org/10.1145/2063384.2063445

Published: 12 November 2011

ABSTRACT

Resiliency will be one of the toughest challenges in future exascale systems. Memory errors account for more than 40% of total hardware-related failures and are projected to increase in future exascale systems. Error correction codes (ECC) and checkpointing are two effective approaches to fault tolerance. While there are numerous studies on ECC or checkpointing in isolation, this is the first paper to investigate the combined effect of both on overall system performance and power. Specifically, we study the impact of various ECC schemes (SECDED, BCH, and chipkill) in conjunction with checkpointing on future exascale systems. Our simulation results show that while chipkill is 13% better for computation-intensive applications, BCH has a 28% advantage in system energy-delay product (EDP) for memory-intensive applications. We also propose using BCH in tagged memory systems with commodity DRAMs, where chipkill is impractical. Our proposed architecture achieves 2.3x better system EDP than state-of-the-art tagged memory systems.
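The two quantities the abstract combines, system energy-delay product and checkpointing cost, can be illustrated with a minimal sketch. This is not the paper's simulation methodology: the function names and all numbers are hypothetical, and the checkpoint interval uses Young's first-order approximation as an assumed stand-in, since the abstract does not state which checkpoint model the evaluation uses.

```python
import math


def energy_delay_product(energy_joules, delay_seconds):
    """EDP = energy * delay; lower is better."""
    return energy_joules * delay_seconds


def young_checkpoint_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order estimate of the optimum checkpoint interval:
    tau = sqrt(2 * delta * M), where delta is the time to write one
    checkpoint and M is the mean time between failures (assumed model)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def first_order_waste(checkpoint_cost_s, mtbf_s, interval_s):
    """Approximate fraction of time lost to checkpointing and rework:
    delta / tau (writing checkpoints) + tau / (2 * M) (expected recomputation)."""
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


# Hypothetical numbers, for illustration only (not from the paper).
delta = 50.0      # seconds to write one checkpoint
mtbf = 10_000.0   # seconds between failures that ECC cannot correct
tau = young_checkpoint_interval(delta, mtbf)
waste = first_order_waste(delta, mtbf, tau)
```

Under this simplified model, a stronger ECC scheme that leaves fewer uncorrectable errors raises `mtbf`, which lengthens the checkpoint interval and reduces wasted time. Any extra energy or latency the ECC adds per memory access enters through the arguments of `energy_delay_product`.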

Index Terms

Computer systems organization

Published in
SC '11: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis
November 2011
866 pages
ISBN: 9781450307710
DOI: 10.1145/2063384
Conference Chair: Scott Lathrop (University of Chicago)
Program Chairs: Jim Costa (Sandia National Laboratories), William Kramer (National Center for Supercomputing Applications)
Copyright © 2011 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
BCH
DRAM
ECC
checkpointing
chipkill
exascale computing
memory system
reliability
tagged memory
Qualifiers
- research-article
Acceptance Rates
SC '11 Paper Acceptance Rate: 74 of 352 submissions, 21%
Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%