DOI: 10.1145/1966445.1966477
research-article

Cycles, cells and platters: an empirical analysis of hardware failures on a million consumer PCs

Published: 10 April 2011

ABSTRACT

We present the first large-scale analysis of hardware failure rates on a million consumer PCs. We find that many failures are neither transient nor independent. Instead, a large portion of hardware-induced failures are recurrent: a machine that crashes from a fault in hardware is up to two orders of magnitude more likely to crash a second time. For example, machines with at least 30 days of accumulated CPU time over an 8-month period had a 1 in 190 chance of crashing due to a CPU subsystem fault. Further, machines that crashed once had a probability of 1 in 3.3 of crashing a second time. Our study examines failures due to faults within the CPU, DRAM, and disk subsystems. Our analysis covers desktops and laptops, CPU vendor, overclocking, underclocking, generic vs. brand-name machines, and characteristics such as machine speed and calendar age. Among our many results, we find that CPU fault rates are correlated with the number of cycles executed, that underclocked machines are significantly more reliable than machines running at their rated speed, and that laptops are more reliable than desktops.
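The recurrence claim can be checked against the two CPU figures quoted above. The following is a minimal sketch, not the authors' analysis: it simply divides the quoted probability of a second crash (1 in 3.3) by the quoted probability of a first crash (1 in 190). The function name `risk_ratio` and the constants are illustrative assumptions introduced here.

```python
import math

# Figures quoted in the abstract for CPU subsystem faults on machines with
# at least 30 days of accumulated CPU time over an 8-month period.
P_FIRST_CRASH = 1 / 190    # chance of a first crash
P_SECOND_CRASH = 1 / 3.3   # chance of a second crash, given one crash


def risk_ratio(p_baseline: float, p_conditional: float) -> float:
    """Return how many times larger p_conditional is than p_baseline.

    Hypothetical helper for illustration; it is not taken from the paper.
    """
    if not (0 < p_baseline <= 1 and 0 <= p_conditional <= 1):
        raise ValueError("probabilities must lie in (0, 1] and [0, 1]")
    return p_conditional / p_baseline


ratio = risk_ratio(P_FIRST_CRASH, P_SECOND_CRASH)
orders_of_magnitude = math.log10(ratio)

if __name__ == "__main__":
    print(f"risk ratio: {ratio:.1f}x")
    print(f"orders of magnitude: {orders_of_magnitude:.2f}")
```

For this CPU example the ratio comes to roughly 58, a little under two orders of magnitude, which is consistent with the abstract's "up to two orders of magnitude" wording.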


Published in

EuroSys '11: Proceedings of the sixth conference on Computer systems
April 2011
370 pages
ISBN: 9781450306348
DOI: 10.1145/1966445

        Copyright © 2011 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Acceptance Rates

        EuroSys '11 paper acceptance rate: 24 of 161 submissions (15%). Overall acceptance rate: 241 of 1,308 submissions (18%).
