ABSTRACT
We present the first large-scale analysis of hardware failure rates on a million consumer PCs. We find that many failures are neither transient nor independent. Instead, a large portion of hardware-induced failures are recurrent: a machine that crashes from a hardware fault is up to two orders of magnitude more likely to crash a second time. For example, machines with at least 30 days of accumulated CPU time over an 8-month period had a 1 in 190 chance of crashing due to a CPU subsystem fault. Further, machines that crashed once had a probability of 1 in 3.3 of crashing a second time. Our study examines failures due to faults within the CPU, DRAM, and disk subsystems. Our analysis spans desktops and laptops, CPU vendor, overclocking, underclocking, generic vs. brand name, and characteristics such as machine speed and calendar age. Among our many results, we find that CPU fault rates are correlated with the number of cycles executed, underclocked machines are significantly more reliable than machines running at their rated speed, and laptops are more reliable than desktops.
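As a rough illustration of the recurrence claim, the two CPU figures quoted above (1 in 190 for a first crash, 1 in 3.3 for a second) can be compared directly. This is a minimal sketch, not the authors' method: the function name `recurrence_ratio` is hypothetical, and treating the two quoted probabilities as directly comparable is an assumption.

```python
import math

# Figures quoted in the abstract for machines with at least 30 days of
# accumulated CPU time over the 8-month observation period.
p_first_crash = 1 / 190   # chance of a first CPU-subsystem crash
p_second_crash = 1 / 3.3  # chance of a second crash after a first one


def recurrence_ratio(p_first: float, p_second: float) -> float:
    """Return how many times more likely a repeat crash is than a first crash.

    Hypothetical helper for illustration only; it assumes both
    probabilities refer to comparable populations and time windows.
    """
    return p_second / p_first


ratio = recurrence_ratio(p_first_crash, p_second_crash)
orders_of_magnitude = math.log10(ratio)

print(f"Repeat crash is about {ratio:.0f}x more likely")
print(f"That is about {orders_of_magnitude:.1f} orders of magnitude")
```

For the CPU example this gives a ratio somewhat below 100, which is consistent with the abstract's wording of "up to" two orders of magnitude.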
Index Terms
- Cycles, cells and platters: an empirical analysis of hardware failures on a million consumer PCs