ABSTRACT
In modern commodity operating systems, core functionality is usually designed assuming that the underlying processor hardware always functions correctly. Shrinking hardware feature sizes break this assumption. Existing approaches to cope with these issues either use hardware functionality that is not available in commercial-off-the-shelf (COTS) systems or pose additional requirements on the software development side, making reuse of existing software hard, if not impossible.
In this paper we present Romain, a framework that provides transparent redundant multithreading as an operating system service for hardware error detection and recovery. When applied to a standard benchmark suite, Romain incurs a maximum runtime overhead of 30% for triple-modular redundancy (while in many cases remaining below 5%). Furthermore, our approach minimizes the complexity added to the operating system for the sake of replication.
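As an illustration of how triple-modular redundancy can both detect and recover from an error, the following minimal sketch compares the states of replicas and takes a majority vote. It is not taken from Romain: the function name `vote` and the representation of a replica state as a plain hashable value are assumptions made for this example only.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence, Tuple


def vote(states: Sequence[Hashable]) -> Tuple[Optional[Hashable], List[int]]:
    """Majority vote over replica states (hypothetical helper, not Romain's API).

    Returns (majority_state, faulty_indices). If no strict majority exists,
    majority_state is None and every replica is reported as suspect.
    """
    if not states:
        raise ValueError("need at least one replica state")
    # Find the most frequent state and how many replicas agree on it.
    winner, count = Counter(states).most_common(1)[0]
    if count * 2 <= len(states):
        # No strict majority: the error is detected but cannot be corrected.
        return None, list(range(len(states)))
    # Replicas that disagree with the majority are considered faulty.
    faulty = [i for i, s in enumerate(states) if s != winner]
    return winner, faulty


if __name__ == "__main__":
    # Three replicas, one of which diverged.
    print(vote([42, 42, 43]))
    # Two replicas that disagree: detection only, no recovery.
    print(vote([7, 8]))
```

With three replicas a single diverging replica can be outvoted and its state replaced by the majority state, whereas with two replicas a mismatch can only be detected.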
Index Terms
- Operating system support for redundant multithreading