Bit Impact Factor: Towards making fair vulnerability comparison

https://doi.org/10.1016/j.micpro.2014.04.009

Abstract

Reliability is becoming a major design concern in contemporary microprocessors since the soft error rate is increasing due to technology scaling. Therefore, design-time estimation of system vulnerability is of paramount importance. Architectural Vulnerability Factor (AVF) is an early vulnerability estimation methodology. However, AVF considers that the value of a bit in a clock cycle is either required for Architecturally Correct Execution (i.e. ACE-bit) or not (i.e. unACE-bit); therefore, AVF cannot distinguish the vulnerability impact level of an ACE-bit. In this study, we present a new dimension which takes into account the vulnerability impact level of a bit. We introduce the Bit Impact Factor metric which, we believe, will be helpful for extending AVF evaluation to provide a more accurate vulnerability analysis.

Introduction

Transient faults such as bit flips, which are mainly caused by particle strikes, are an important problem in digital system design [1], [2]. These particle strikes do not result in permanent faults but may lead to system crashes, and hence the resulting errors are termed “soft errors”. It is predicted that the soft error problem will grow in future systems since, with every new generation of manufacturing technology, feature sizes decrease and, consequently, the error susceptibility of digital circuits increases [3]. This increasing soft error rate makes reliability a major design concern in contemporary microprocessors.

The vulnerability of the system to soft errors should be quantified as early as possible at design time, so that the required precautions can be taken. It is also important neither to overestimate nor to underestimate the vulnerability of the system, because of the performance and power overheads associated with error protection.

Mukherjee et al. define the Architectural Vulnerability Factor (AVF) of processor components to provide early reliability estimation [4]. AVF analysis is implemented based on the fact that systems are known to mask some of the faults either at the circuit level or architectural level and these faults do not propagate to the final outcome of a program. Quantifying this masking effect allows adjusting the level of error protection in the design of a digital system.

AVF is defined as the average fraction of the bits in the system that are required for Architecturally Correct Execution (i.e. ACE-bits) at a given clock cycle. AVF analysis is not concerned with the process by which a bit flip creates an error; rather, it qualifies the outcome: if a flip of a stored bit results in a system-level visible error, the flipped bit is defined as an ACE-bit. In reality, a bit flip may lead to an unwanted outcome through different paths.
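The definition above can be illustrated with a minimal sketch. It assumes that a per-cycle count of ACE-bits is already available for one structure; the function name `avf` and the sample numbers are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def avf(ace_bits_per_cycle: Sequence[int], total_bits: int) -> float:
    """Average fraction of a structure's bits that are ACE over the observed cycles."""
    if total_bits <= 0:
        raise ValueError("total_bits must be positive")
    if not ace_bits_per_cycle:
        raise ValueError("at least one cycle is required")
    # Sum of ACE-bits over all cycles, divided by (bits in structure x cycles).
    return sum(ace_bits_per_cycle) / (total_bits * len(ace_bits_per_cycle))


# Hypothetical 64-bit structure observed for four cycles.
example_avf = avf([16, 32, 0, 16], total_bits=64)
```

In this sketch every ACE-bit contributes equally to the result, which is exactly the property that the rest of the paper questions.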

Mukherjee et al. advocate that the AVF of the whole processor can be calculated by summing the AVFs of all structures, each multiplied by its area normalized with respect to the total chip area. However, this assumption introduces discrepancies in the vulnerability estimate, especially for systems with many components or for long-running workloads [5]. This is mainly due to two reasons: (1) AVF always assumes that all of the ACE-bits in a processor component or in an instruction have the same impact on vulnerability. (2) AVF assumes that ACE-bits in different components are equally important for the vulnerability. Due to these limitations, it is not possible to make an apples-to-apples reliability comparison between hardware components (e.g. register file, issue queue) or between different parts of a hardware component (e.g. the data or tag area of a cache).
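As a rough illustration of the area-weighted sum described above, the following sketch combines per-structure AVFs into a processor-level figure. The function name `processor_avf`, the `(avf, area)` pair layout and the sample values are assumptions made for this example only.

```python
from typing import Iterable, Tuple


def processor_avf(structures: Iterable[Tuple[float, float]]) -> float:
    """Sum of per-structure AVFs, each weighted by its share of the total area.

    `structures` holds (avf, area) pairs; areas may use any consistent unit.
    """
    pairs = list(structures)
    total_area = sum(area for _, area in pairs)
    if total_area <= 0:
        raise ValueError("total area must be positive")
    return sum(structure_avf * area for structure_avf, area in pairs) / total_area


# Two hypothetical structures: AVF 0.5 on 2 area units, AVF 0.25 on 6 area units.
chip_avf = processor_avf([(0.5, 2.0), (0.25, 6.0)])
```

Only area enters the weighting here; nothing in the sum reflects how severe a fault in one structure is compared with a fault in another.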

In this study, our goal is to investigate a new dimension that enables a comparison between different bit types and between different hardware components. To this end, we start from the fact that not all bits belong to the same field type, and we show that the impact of a fault differs depending on the field in which it occurs.

For instance, a single-bit fault occurring in the immediate value of an instruction may cause the result of the instruction to be faulty in one bit position. However, if a fault occurs in the source register identifier, a completely different source value is read from the wrong register, which may cause multiple bit errors in the result of the instruction.
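The contrast described above can be reproduced with a toy model. The sketch below assumes a 32-bit add-immediate instruction and a 32-entry register file with hand-picked register contents; all names and values are hypothetical and chosen so that the addition produces no carry.

```python
MASK32 = 0xFFFFFFFF


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two 32-bit values differ."""
    return bin((a ^ b) & MASK32).count("1")


def execute_addi(regs, rs: int, imm: int) -> int:
    """Toy add-immediate: result = regs[rs] + imm, truncated to 32 bits."""
    return (regs[rs] + imm) & MASK32


# Hypothetical register file contents.
regs = [0] * 32
regs[1] = 0x00000F00
regs[3] = 0xF0F00000

golden = execute_addi(regs, rs=1, imm=0x1)

# Flip bit 1 of the immediate field (0x1 becomes 0x3).
imm_fault = execute_addi(regs, rs=1, imm=0x1 ^ 0b10)

# Flip bit 1 of the source register identifier (r1 becomes r3).
reg_fault = execute_addi(regs, rs=1 ^ 0b10, imm=0x1)

imm_bits_wrong = hamming(golden, imm_fault)
reg_bits_wrong = hamming(golden, reg_fault)
```

With these values the immediate fault corrupts a single result bit, while the register identifier fault corrupts twelve. With other operands, carry propagation in the adder could spread an immediate fault over more than one result bit, so the numbers are illustrative only.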

We first classify bit types in several hardware components (i.e. Register File, Reorder Buffer and Issue Queue) in terms of vulnerability level, and we examine the impact of a single bit flip in each class. We then define the Bit Impact Factor (BIF), which shows the vulnerability level of a bit and indirectly allows the relative vulnerability to be quantified across processor components and component fields.

The contributions of this study are:

  • We classify bit types in several hardware components (i.e. Register File, Reorder Buffer and Issue Queue) in terms of vulnerability level.

  • We define the Bit Impact Factor (BIF), which shows the average number of bits affected in the next dependent component when the given bit fails.

  • We extend AVF with the BIF dimension in order to provide a more accurate vulnerability analysis and to allow the vulnerabilities of different hardware components to be related to one another.
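The BIF definition in the list above, the average number of bits affected in the next dependent component, can be sketched as a mean Hamming distance. This is a minimal sketch under assumptions: the function name `bit_impact_factor`, the input format of (fault-free, faulty) value pairs, and the choice to count fully masked faults as zero affected bits are not taken from the paper.

```python
from typing import Iterable, Tuple


def bit_impact_factor(observations: Iterable[Tuple[int, int]], width: int = 32) -> float:
    """Mean number of differing bits between fault-free and faulty values.

    Each observation is a (fault_free_value, faulty_value) pair seen in the
    dependent component after one fault in the field under study.
    """
    mask = (1 << width) - 1
    distances = [bin((good ^ bad) & mask).count("1") for good, bad in observations]
    if not distances:
        raise ValueError("at least one observation is required")
    return sum(distances) / len(distances)


# Four hypothetical observations for one field type: 1, 4, 0 and 2 bits affected.
samples = [(0x0F01, 0x0F03), (0x00FF, 0x00F0), (0x0001, 0x0001), (0x0000, 0x0003)]
example_bif = bit_impact_factor(samples)
```

Computing one such value per field type (for example register values, register identifiers and opcodes) gives numbers that can be compared across fields. How BIF is folded into AVF is not specified in this excerpt, so that step is left out of the sketch.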

Section snippets

Quantifying soft error impact factor

In this section, we first explain the principles of AVF. Then, we classify the bits within the microarchitecture according to the information they store and their vulnerability impact.

Evaluation

We ran SPEC CPU2006 benchmark applications [6] on the MSIM microarchitectural simulator [7] with the simulation parameters given in Table 1 to observe the Bit Impact Factor (BIF) of register values, source and destination register identifiers, and opcodes. We also measured the AVF of the applications and calculated AVFweighted by including BIF in the AVF calculation. We compared AVF and AVFweighted with the vulnerability measured by the fault injection methodology. We inject 100 faults to each structure

Related work

In this section we explain the previous studies on calculating the vulnerability of the system components.

Mukherjee et al. [4] introduce Architectural Vulnerability Factor (AVF) and Architecturally Correct Execution (ACE) analysis to measure a processor’s failure rate caused by soft errors early in the design process. They classify the unACE bits due to microarchitectural (i.e. idle, invalid or mis-speculated states, predictor structures) or architectural (i.e. NOPs, performance enhancing

Conclusion

In this study, we argue that not every bit in the hardware has the same vulnerability impact. We introduce a new metric, the Bit Impact Factor (BIF), which shows the impact of a single-bit error in a hardware component by measuring the average number of errors in the result of the execution unit. We also include the BIF metric in the AVF and introduce AVFweighted to provide an accurate vulnerability estimation. We believe that AVFweighted will help comparing the vulnerability of

Acknowledgment

This work was partially supported by the Scientific and Technological Research Council of Turkey (TUBITAK) under research Grant 112E004. The work was performed in the framework of COST ICT Action 1103 “Manufacturable and Dependable Multicore Architectures at Nanoscale”.

Serdar Zafer Can received the BS degree in electrical and electronics engineering from TOBB University of Economics and Technology, Ankara, Turkey, in 2012, and he is now an MS student in computer engineering at TOBB University of Economics and Technology, Ankara. His current research interests include computer architecture, reliability, VLSI design and reconfigurable architectures.

References (24)

  • A. Biswas et al., Computing accurate AVFs using ACE analysis on performance models: a rebuttal, IEEE Comput. Archit. Lett. (2008)
  • R. Baumann, Soft errors in advanced computer systems, IEEE Des. Test Comput. (2005)
  • J.F. Ziegler et al., IBM experiments in soft fails in computer electronics (1978–1994), IBM J. Res. Develop. (1996)
  • S. Borkar, Designing reliable systems from unreliable components: the challenges of transistor variability and degradation, IEEE Micro (2005)
  • S.S. Mukherjee, C. Weaver, J. Emer, S.K. Reinhardt, T. Austin, A systematic methodology to compute the architectural...
  • X. Li, S.V. Adve, P. Bose, J.A. Rivers, Architecture-level soft error analysis: examining the limits of common...
  • J.L. Henning, SPEC CPU2006 benchmark descriptions, SIGARCH Comput. Archit. News (2006)
  • J.J. Sharkey, D. Ponomarev, K. Ghose, M-sim: A Flexible, Multithreaded Architectural Simulation Environment, Technical...
  • H. Cho, S. Mirkhani, C.Y. Cher, J.A. Abraham, S. Mitra, Quantitative evaluation of soft error injection techniques for...
  • N.J. Wang, J. Quek, T.M. Rafacz, S.J. Patel, Characterizing the effects of transient faults on a high-performance...
  • M.L. Li, P. Ramachandran, U.R. Karpuzcu, S.K.S. Hari, S.V. Adve, Accurate microarchitecture-level fault modeling for...
  • N.J. Wang, A. Mahesri, S.J. Patel, Examining ACE analysis reliability estimates using fault-injection, in: Proceedings...

Gulay Yalcin is a PhD student at Universitat Politecnica de Catalunya and a research student at the Barcelona Supercomputing Center. She holds a BS degree in Computer Engineering from Hacettepe University and an MS degree in Computer Engineering from TOBB University of Economics and Technology. Her research interests are reliability and energy minimization in computer architecture.

Oguz Ergin received his MS and PhD degrees in Computer Science from the State University of New York, Binghamton. He is currently an Assistant Professor in the Department of Computer Engineering of TOBB University of Economics and Technology, Ankara. He was a senior research scientist at Intel Barcelona Research Center before he joined his current university. His research interests include computer architecture and VLSI design.

Osman Sabri Unsal is co-leader of the Architectural Support for Programming Models group at the Barcelona Supercomputing Center. Dr. Unsal is also a researcher at the BSC-Microsoft Research Centre. He holds BS, MS, and PhD degrees in electrical and computer engineering from Istanbul Technical University, Brown University, and University of Massachusetts, Amherst, respectively.

Adrián Cristal received the “licenciatura” in Computer Science from Universidad de Buenos Aires (FCEN) in 1995 and the PhD degree in Computer Science in 2006 from the Universitat Politécnica de Catalunya (UPC), Spain. From 1992 to 1995 he lectured on Neural Networks and Compiler Design. At UPC, from 2003 to 2006, he lectured on computer organization. Since 2006, he has been a researcher in the Computer Architecture group at BSC. He is currently co-manager of the “Computer Architecture for parallel paradigms”. His research interests cover the areas of microarchitecture, multicore architectures, and programming models for multicore architectures. He has published around 60 publications on these topics and has participated in several research projects with other universities and industries, in the framework of the European Union programmes or in direct collaboration with technology-leading companies.
