Abstract
In several application domains, the effects of soft errors on processor-based systems cannot be addressed by acting on the hardware, whether by changing the technology, the components, or the architecture. In these cases, an attractive solution is to modify only the software: the ability to detect and possibly correct errors is obtained by introducing redundancy in the code and in the data, without modifying the underlying hardware. This chapter provides an overview of the methods based on this technique, outlining their characteristics and summarizing their advantages and limitations.
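As a rough illustration of what redundancy in the code and in the data can look like, the sketch below keeps two copies of every operand, executes the operation once per copy, and compares the results before they are used. It is a minimal sketch written in Python for readability, not a technique taken from the chapter: the names `checked_add`, `flip_bit`, and `SoftErrorDetected` are hypothetical, and a practical implementation would typically apply such transformations to compiled source or assembly code.

```python
class SoftErrorDetected(Exception):
    """Raised when two redundant copies of a value disagree."""


def checked_add(a0, a1, b0, b1):
    """Add two duplicated operands and return the duplicated result.

    Each operand is stored twice (a0/a1 and b0/b1). The addition is
    executed once per copy and the two results are compared before use.
    """
    r0 = a0 + b0  # original instruction on the primary copies
    r1 = a1 + b1  # duplicated instruction on the shadow copies
    if r0 != r1:  # consistency check between the two results
        raise SoftErrorDetected(f"mismatch: {r0} != {r1}")
    return r0, r1


def flip_bit(value, bit):
    """Model a single-bit upset in an integer (hypothetical fault injector)."""
    return value ^ (1 << bit)


# Fault-free run: both copies agree.
result, shadow = checked_add(5, 5, 7, 7)

# Faulty run: one copy of the first operand suffers a bit flip.
try:
    checked_add(flip_bit(5, 3), 5, 7, 7)
    detected = False
except SoftErrorDetected:
    detected = True
```

This sketch only detects the error. Correcting it would need additional redundancy, for example a third copy and a majority vote, which is an assumption beyond what the abstract states.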
Notes
1. The term alternate reflects sequential execution, which is a feature specific to the recovery block approach.
2. Task duplication [40] was introduced to detect transient faults by duplicating the computation of a task on two processors. If the results of the two executions do not match, the task is executed again on another processor until a pair of processors produces identical results. This scheme does not use checkpoints, so every time a fault is detected the task has to be restarted from the beginning.
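The task duplication scheme described in the second note can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation from [40]: each processor is modelled as a Python callable that runs the whole task from its beginning and returns a result, and the names `run_with_task_duplication`, `healthy`, and `faulty` are hypothetical.

```python
def run_with_task_duplication(task, processors):
    """Run `task` on successive processors until two results agree.

    `processors` is a sequence of callables, each modelling one processor:
    processor(task) executes the task from its beginning (no checkpoints)
    and returns its result, which may be corrupted by a transient fault.
    Returns the agreed result and the number of executions performed.
    """
    results = []
    for processor in processors:
        result = processor(task)  # full re-execution, no checkpoint
        if result in results:     # a pair of processors agrees
            return result, len(results) + 1
        results.append(result)
    raise RuntimeError("no pair of processors produced identical results")


def task():
    """Example workload (hypothetical)."""
    return sum(range(10))


def healthy(t):
    """Processor model that executes the task correctly."""
    return t()


def faulty(t):
    """Processor model whose result has one flipped bit."""
    return t() ^ 1


# Fault-free case: the first two executions already match.
ok = run_with_task_duplication(task, [healthy, healthy])

# One faulty execution: a third processor is needed to find a matching pair.
recovered = run_with_task_duplication(task, [healthy, faulty, healthy])
```

Because no checkpoint is kept, every extra execution costs the full running time of the task, which is the limitation the note points out.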
References
M. Rebaudengo, M. Sonza Reorda, M. Torchiano, M. Violante, Soft-error detection through software fault-tolerance techniques. Proceedings of the IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, 1999, pp. 210–218
M. Rebaudengo, M. Sonza Reorda, M. Torchiano, M. Violante, A source-to-source compiler for generating dependable software. Proceedings of the IEEE International Workshop on Source Code Analysis and Manipulation, 2001, pp. 33–42
P. Cheynet, B. Nicolescu, R. Velazco, M. Rebaudengo, M. Sonza Reorda, M. Violante, Experimentally evaluating an automatic approach for generating safety-critical software with respect to transient errors. IEEE Transactions on Nuclear Science 47(6), 2000, 2231–2236
A. Benso, S. Chiusano, P. Prinetto, L. Tagliaferri, A C/C++ source-to-source compiler for dependable applications. Proceedings of the IEEE International Conference on Dependable Systems and Networks, 2000, pp. 71–78
N. Oh, P.P. Shirvani, E.J. McCluskey, Error detection by duplicated instructions in super-scalar processors. IEEE Transactions on Reliability 51(1), 2002, 63–75
G. Sohi, M. Franklin, K. Saluja, A study of time-redundant fault tolerance techniques for high-performance pipelined computers. 19th International Symposium on Fault Tolerant Computing, 1989, pp. 436–443
C. Bolchini, A software methodology for detecting hardware faults in VLIW data paths. IEEE Transactions on Reliability 52(4), 2003, 458–468
N. Oh, E.J. McCluskey, Error detection by selective procedure call duplication for low energy consumption. IEEE Transactions on Reliability 51(4), 2002, 392–402
K. Echtle, B. Hinz, T. Nikolov, On hardware fault detection by diverse software. Proceedings of the 13th International Conference on Fault-Tolerant Systems and Diagnostics, 1990, pp. 362–367
H. Engel, Data flow transformations to detect results which are corrupted by hardware faults. Proceedings of the IEEE High-Assurance System Engineering Workshop, 1997, pp. 279–285
M. Jochim, Detecting processor hardware faults by means of automatically generated virtual duplex systems. Proceedings of the International Conference on Dependable Systems and Networks, 2002, pp. 399–408
S.K. Reinhardt, S.S. Mukherjee, Transient fault detection via simultaneous multithreading. Proceedings of the 27th International Symposium on Computer Architecture, 2000, pp. 25–36
E. Rotenberg, AR-SMT: a microarchitectural approach to fault tolerance in microprocessors. 29th International Symposium on Fault-Tolerant Computing, 1999, pp. 84–91
N. Oh, S. Mitra, E.J. McCluskey, ED4I: error detection by diverse data and duplicated instructions. IEEE Transactions on Computers 51(2), 2002, 180–199
M. Hiller, Executable assertions for detecting data errors in embedded control systems. Proceedings of the IEEE International Conference on Dependable Systems and Networks, 2000, pp. 24–33
J. Vinter, J. Aidemark, P. Folkesson, J. Karlsson, Reducing critical failures for control algorithms using executable assertions and best effort recovery. Proceedings of the IEEE International Conference on Dependable Systems and Networks, 2001, pp. 347–356
S.S. Yau, F.-C. Chen, An approach to concurrent control flow checking. IEEE Transactions on Software Engineering 6(2), 1980, 126–137
N. Oh, P.P. Shirvani, E.J. McCluskey, Control-flow checking by software signatures. IEEE Transactions on Reliability 51(1), 2002, 111–122
Z. Alkhalifa, V.S.S. Nair, N. Krishnamurthy, J.A. Abraham, Design and evaluation of system-level checks for on-line control flow error detection. IEEE Transactions on Parallel and Distributed Systems 10(6), 1999, 627–641
O. Goloubeva, M. Rebaudengo, M. Sonza Reorda, M. Violante, Soft-error detection using control flow assertions. Proceedings of the 18th International Symposium on Defect and Fault Tolerance in VLSI Systems, 3–5 November 2003, pp. 581–588
R. Vemu, J.A. Abraham, CEDA: control-flow error detection through assertions. Proceedings of the 12th IEEE International On-Line Testing Symposium, 2006, pp. 151–158
R. Vemu, J.A. Abraham, Budget-dependent control-flow error detection. Proceedings of the 14th IEEE International On-Line Testing Symposium, 2008, pp. 73–78
C. Babbage, On the mathematical powers of the calculating engine, unpublished manuscript, December 1837, Oxford, Buxton Ms7, Museum of History of Science. Printed in The Origins of Digital Computers: Selected Papers, B. Randell (ed.), Springer, Berlin, 1974, pp. 17–52
A. Avizienis, J.C. Laprie, Dependable computing: from concepts to design diversity. Proceedings of the IEEE 74(5), 1986, 629–638
A. Avizienis, The N-version approach to fault-tolerant software. IEEE Transactions on Software Engineering 11(12), 1985, 1491–1501
B. Randell, System structure for software fault tolerance. IEEE Transactions on Software Engineering 1(2), 1975, 220–232
D. Pradhan, Fault-Tolerant Computer System Design. Prentice-Hall, Englewood Cliffs, NJ, 1996
J.P. Kelly, T.I. McVittie, W.I. Yamamoto, Implementing design diversity to achieve fault tolerance. IEEE Software 8(4), 1991, 61–71
J.H. Lala, L.S. Alger, Hardware and software fault tolerance: a unified architectural approach. Proceedings of the 18th International Symposium on Fault-Tolerant Computing, 1988, pp. 240–245
C.E. Price, Fault tolerant avionics for the space shuttle. Proceedings of the 10th IEEE/AIAA Digital Avionics Systems Conference, 1991, pp. 203–206
D. Briere, P. Traverse, AIRBUS A320/A330/A340 electrical flight controls: a family of fault-tolerant systems. Proceedings of the 23rd International Symposium on Fault-Tolerant Computing, 1993, pp. 616–623
R. Riter, Modeling and testing a critical fault-tolerant multi-process system. Proceedings of the 25th International Symposium on Fault-Tolerant Computing, 1995, pp. 516–521
G. Hagelin, ERICSSON safety system for railway control. Proceedings of the Workshop on Design Diversity in Action, Springer, Vienna, 1988, pp. 11–21
H. Kantz, C. Koza, The ELEKTRA railway signalling system: field experience with an actively replicated system with diversity. Proceedings of the 25th International Symposium on Fault-Tolerant Computing, 1995, pp. 453–458
A. Amendola, L. Impagliazzo, P. Marmo, G. Mongardi, G. Sartore, Architecture and safety requirements of the ACC railway interlocking system. Proceedings of IEEE International Computer Performance and Dependability Symposium, 1996, pp. 21–29
A.M. Tyrrell, Recovery blocks and algorithm-based fault tolerance, EUROMICRO 96. Beyond 2000: Hardware and Software Design Strategies. Proceedings of the 22nd EuroMicro Conference, 1996, pp. 292–299
K.M. Chandy, C.V. Ramamoorthy, Rollback and recovery strategies for computer programs. IEEE Transactions on Computers 21(6), 1972, 546–556
W.K. Fuchs, C.-C.J. Li, CATCH – compiler-assisted techniques for checkpointing. Proceedings of the 20th Fault-Tolerant Computing Symposium, 1990, pp. 74–81
J. Long, W.K. Fuchs, J.A. Abraham, Compiler-assisted static checkpoint insertion. Proceedings of the 22nd Fault-Tolerant Computing Symposium, 1992, pp. 58–65
D.K. Pradhan, N.H. Vaidya, Roll-forward checkpointing scheme: a novel fault-tolerant architecture. IEEE Transactions on Computers 43(10), 1994, 1163–1174
A. Ziv, J. Bruck, Performance optimization of checkpointing scheme with task duplication. IEEE Transactions on Computers 46(12), 1997, 1381–1386
K.H. Huang, J.A. Abraham, Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers C-33(6), 1984, 518–528
A. Roy-Chowdhury, P. Banerjee, Tolerance determination for algorithm based checks using simplified error analysis. Proceedings of the IEEE International Fault Tolerant Computing Symposium, 1993
M. Rebaudengo, M. Sonza Reorda, M. Violante, A new software-based technique for low-cost fault-tolerant application. Proceedings of the IEEE Annual Reliability and Maintainability Symposium, 2003, pp. 25–28
M. Rebaudengo, M. Sonza Reorda, M. Violante, A new approach to software-implemented fault tolerance. Journal of Electronic Testing: Theory and Applications 20, 2004, 433–437
B. Nicolescu, R. Velazco, M. Sonza Reorda, Effectiveness and limitations of various software techniques for “soft error” detection: a comparative study. Proceedings of the IEEE 7th International On-Line Testing Workshop, 2001, pp. 172–177
Copyright information
© 2011 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Rebaudengo, M., Reorda, M.S., Violante, M. (2011). Software-Level Soft-Error Mitigation Techniques. In: Nicolaidis, M. (eds) Soft Errors in Modern Electronic Systems. Frontiers in Electronic Testing, vol 41. Springer, Boston, MA. https://doi.org/10.1007/978-1-4419-6993-4_9
DOI: https://doi.org/10.1007/978-1-4419-6993-4_9
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4419-6992-7
Online ISBN: 978-1-4419-6993-4
eBook Packages: Engineering (R0)