research-article

Cycles, cells and platters: an empirical analysis of hardware failures on a million consumer PCs

Authors:
Edmund B. Nightingale, Microsoft Research, Redmond, WA, USA
John R. Douceur, Microsoft Research, Redmond, WA, USA
Vince Orgovan, Microsoft Corporation, Redmond, WA, USA

EuroSys '11: Proceedings of the sixth conference on Computer systems, April 2011, Pages 343–356, https://doi.org/10.1145/1966445.1966477

Published: 10 April 2011

ABSTRACT

We present the first large-scale analysis of hardware failure rates on a million consumer PCs. We find that many failures are neither transient nor independent. Instead, a large portion of hardware-induced failures are recurrent: a machine that crashes from a fault in hardware is up to two orders of magnitude more likely to crash a second time. For example, machines with at least 30 days of accumulated CPU time over an 8-month period had a 1 in 190 chance of crashing due to a CPU subsystem fault. Further, machines that crashed once had a probability of 1 in 3.3 of crashing a second time. Our study examines failures due to faults within the CPU, DRAM and disk subsystems. Our analysis spans desktops and laptops, CPU vendor, overclocking, underclocking, generic vs. brand name, and characteristics such as machine speed and calendar age. Among our many results, we find that CPU fault rates are correlated with the number of cycles executed, underclocked machines are significantly more reliable than machines running at their rated speed, and laptops are more reliable than desktops.
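
As a rough illustration of the recurrence claim, the two CPU figures quoted in the abstract (1 in 190 for a first crash, 1 in 3.3 for a second) can be compared directly. The sketch below is not from the paper: the function name `relative_risk` is hypothetical, and it assumes the two probabilities describe comparable machine populations, which the abstract does not state explicitly.

```python
import math


def relative_risk(p_first: float, p_second_given_first: float) -> float:
    """Return how many times likelier a repeat crash is than a first crash."""
    if not (0 < p_first <= 1 and 0 <= p_second_given_first <= 1):
        raise ValueError("p_first must be in (0, 1] and p_second_given_first in [0, 1]")
    return p_second_given_first / p_first


# Figures quoted in the abstract for the CPU subsystem.
P_FIRST_CPU = 1 / 190    # chance of a first CPU-subsystem crash
P_REPEAT_CPU = 1 / 3.3   # chance of a second crash, given one crash

ratio = relative_risk(P_FIRST_CPU, P_REPEAT_CPU)
print(f"repeat crash is about {ratio:.1f}x more likely")
print(f"orders of magnitude: {math.log10(ratio):.2f}")
```

Under these assumptions the CPU example works out to a factor of roughly 58, a little under two orders of magnitude, which is consistent with the abstract's "up to two orders of magnitude" wording.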

Index Terms

1. General and reference
  1. Cross-computing tools and techniques
    1. Reliability
2. Software and its engineering
  1. Software organization and properties
    1. Extra-functional properties
      1. Software reliability

Published in
EuroSys '11: Proceedings of the sixth conference on Computer systems
April 2011
370 pages
ISBN: 9781450306348
DOI: 10.1145/1966445
General Chair: Christoph Kirsch, University of Salzburg, Austria
Program Chair: Gernot Heiser, University of New South Wales, Australia
Copyright © 2011 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
fault tolerance
reliability

Acceptance Rates
EuroSys '11 Paper Acceptance Rate: 24 of 161 submissions, 15%
Overall Acceptance Rate: 241 of 1,308 submissions, 18%