
Understanding prediction-based partial redundant threading for low-overhead, high-coverage fault tolerance

Authors:
Vimal K. Reddy, North Carolina State University
Eric Rotenberg, North Carolina State University
Sailashri Parthasarathy, Intel Corporation

ACM SIGPLAN Notices, Volume 41, Issue 11, November 2006, pp. 83–94. https://doi.org/10.1145/1168918.1168869

Published: 20 October 2006


Abstract

Redundant threading architectures duplicate all instructions to detect and possibly recover from transient faults. Several lighter-weight Partial Redundant Threading (PRT) architectures have been proposed recently. (i) Opportunistic Fault Tolerance duplicates instructions only during periods of poor single-thread performance. (ii) ReStore does not explicitly duplicate instructions and instead exploits mispredictions among highly confident branch predictions as symptoms of faults. (iii) Slipstream creates a reduced alternate thread by replacing many instructions with highly confident predictions. We explore PRT as a possible direction for achieving the fault tolerance of full duplication with the performance of single-thread execution. Opportunistic and ReStore yield partial coverage since they are restricted to using only partial duplication or only confident predictions, respectively. Previous analysis of Slipstream fault tolerance was cursory and concluded that only duplicated instructions are covered. In this paper, we attempt to better understand Slipstream's fault tolerance, conjecturing that the mixture of partial duplication and confident predictions actually closely approximates the coverage of full duplication. A thorough dissection of prediction scenarios confirms that faults in nearly 100% of instructions are detectable. Fewer than 0.1% of faulty instructions are not detectable due to coincident faults and mispredictions. Next we show that the current recovery implementation fails to leverage excellent detection capability, since recovery sometimes initiates belatedly, after already retiring a detected faulty instruction. We propose and evaluate a suite of simple microarchitectural alterations to recovery and checking. Using the best alterations, Slipstream can recover from faults in 99% of instructions, compared to only 78% of instructions without alterations. Both results are much higher than predicted by past research, which claims coverage for only duplicated instructions, or 65% of instructions. On an 8-issue SMT processor, Slipstream performs within 1.3% of single-thread execution whereas full duplication slows performance by 14%.

A key byproduct of this paper is a novel analysis framework in which every dynamic instruction is considered to be hypothetically faulty, thus not requiring explicit fault injection. Fault coverage is measured in terms of the fraction of candidate faulty instructions that are directly or indirectly detectable before.
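
The coverage metric in the abstract, where every dynamic instruction is treated as a candidate faulty instruction and coverage is a fraction of those candidates, can be illustrated with a small sketch. It assumes each instruction in a trace has already been labeled with what would happen if it were faulty. The `Outcome` labels, the `DynInst` record, the `coverage` helper, and the toy trace are all hypothetical illustrations, not the paper's framework or its data.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    """Assumed fate of a hypothetical fault in one dynamic instruction."""
    RECOVERABLE = 1    # detected before the faulty instruction retires
    DETECTED_LATE = 2  # detected, but only after the faulty instruction retired
    UNDETECTED = 3     # never detected


@dataclass(frozen=True)
class DynInst:
    duplicated: bool   # True if the instruction executes in both threads
    outcome: Outcome   # assumed label for a fault in this instruction


def coverage(trace):
    """Return coverage fractions over all candidate faulty instructions."""
    total = len(trace)
    if total == 0:
        raise ValueError("trace must contain at least one instruction")
    duplicated = sum(inst.duplicated for inst in trace)
    detectable = sum(inst.outcome is not Outcome.UNDETECTED for inst in trace)
    recoverable = sum(inst.outcome is Outcome.RECOVERABLE for inst in trace)
    return {
        "duplicated": duplicated / total,
        "detectable": detectable / total,
        "recoverable": recoverable / total,
    }


# Toy trace of 10 instructions with made-up labels (not measured results).
toy_trace = (
    [DynInst(True, Outcome.RECOVERABLE)] * 6
    + [DynInst(False, Outcome.RECOVERABLE)] * 2
    + [DynInst(False, Outcome.DETECTED_LATE)] * 1
    + [DynInst(False, Outcome.UNDETECTED)] * 1
)
result = coverage(toy_trace)
print(result)
```

In this toy trace the recoverable fraction exceeds the duplicated fraction because some instructions that are not duplicated are still labeled as covered, which mirrors the distinction the abstract draws between duplication-only coverage and the coverage it reports.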

Index Terms

1. Computer systems organization
  1. Architectures
  2. Dependable and fault-tolerant systems and networks
2. Hardware
  1. Hardware test
  2. Robustness

Published in
ACM SIGPLAN Notices, Volume 41, Issue 11
Proceedings of the 2006 ASPLOS Conference
November 2006
425 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/1168918

ASPLOS XII: Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
October 2006
440 pages
ISBN: 1595934510
DOI: 10.1145/1168857
General Chair: John Paul Shen, Intel Corp.
Program Chair: Margaret R. Martonosi, Princeton University
Copyright © 2006 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
branch prediction
chip multiprocessor (CMP)
redundant multithreading
simultaneous multithreading (SMT)
slipstream processor
time redundancy
transient faults
value prediction
Qualifiers
- article