Elsevier

Microelectronics Reliability

Volume 63, August 2016, Pages 267-277

PVTA-aware approximate custom instruction extension technique: A cross-layer approach

https://doi.org/10.1016/j.microrel.2016.05.008

Abstract

Process, Voltage, and Temperature variations together with transistor Aging (PVTA) can result in a significant number of timing errors in Custom Instructions (CIs) manufactured at nano-scaled silicon nodes. The state-of-the-art approach to tackling this concern is to add guard-bands. However, this policy can erode the performance gain obtained from CIs, because the gap between the worst-case delay and the true delay widens under PVTA variations. This paper proposes a novel approximate CI selection technique to address this issue. The technique allows applications that do not require perfect accuracy to experience a tolerable number of PVTA-induced timing errors in exchange for a significant improvement in the performance of the extensible processor. To achieve this, the proposed CI selection technique considers not only those CIs whose PVTA-aware delay is less than the given timing constraint, but also approximate CIs (i.e., those CIs that cannot strictly meet the timing constraint, resulting in noisy/approximate computations). First, a timing analysis is performed to precisely compute the delay distribution of CIs in the presence of workload- and circuit-dependent PVTA variations. Then, based on the obtained distribution for each CI, a fault-map (i.e., timing error locations) is extracted. Using the fault-map, each circuit-level timing error is propagated to the application level to evaluate the quality/accuracy of the application output in the presence of PVTA-induced errors in approximate CIs. Finally, based on this cross-layer information, an optimal set of CIs is selected. This set yields the maximum performance per silicon area under the given constraints on power consumption and on the errors that the user can tolerate.
Simulations for various benchmark applications show that the proposed cross-layer technique achieves up to a 2.7× speedup increase compared to existing techniques, at the expense of 6% more error.

Introduction

The evolution from desktop computing to mobile computing creates relentless demand for low-power, high-performance embedded processors with a very short time-to-market window [1], [2], [3], [4], [5]. General Purpose Processors (GPPs) are not very desirable for embedded systems, since they are one-size-fits-all solutions that provide high average efficiency across a wide spectrum of applications. However, workloads come in a variety of shapes and characteristics, and as a result the flexibility of GPPs leads to significant power overheads and suboptimal speedup for specific applications. On the other hand, Application Specific Integrated Circuits (ASICs) provide both high speedup and low power consumption, but they are usually costly and less flexible than GPPs. Moreover, the long time-to-market of ASICs may not be tolerable for the market [4]. Application Specific Instruction Set Processors (ASIPs) have emerged as a tradeoff between GPPs and ASICs [6]. In this methodology, hotspot regions of the applications (i.e., frequently used sequences of instructions) are accelerated by adding specific Custom Instructions (CIs) to the instruction sets of GPPs in order to improve the speedup [3], [4], [7], [8].

As outlined by the International Technology Roadmap for Semiconductors (ITRS) [9], susceptibility to timing errors induced by process, voltage, temperature, and transistor aging (PVTA) variations has become one of the major design challenges for processors [4], [10], [11], [12]. Indeed, due to PVTA variations, the delay of CIs increases over time, and thus they may produce timing errors when the timing constraint (i.e., the clock period) is violated [3], [13], [14]. The most common approaches for tolerating PVTA variations in order to prevent timing errors are guardbanding and adding redundancy at different layers of the design hierarchy [15].

Selecting a CI that fails to meet the timing constraint of the processor may increase the speedup [16]. This comes at the expense of timing violations on critical paths, which result in timing errors. The rate of timing errors in a CI is a statistical parameter that depends on the PVTA-dependent delay distribution of the CI, the operating frequency of the processor, the workload, and the input patterns. However, a timing error that occurs at the circuit level may be masked either by the micro-architecture or by the application. Even an altered application output might still be tolerated by the user when the precise value of the output is not a necessity (i.e., an acceptable output suffices instead of a precise one). Based on this observation, we propose an approximate CI selection technique. Indeed, approximate computing has emerged as a promising new source of efficiency. This approach is driven by the fact that today's computing demand is increasingly dominated by a growing number of applications (such as media applications for mobile devices) that are intrinsically tolerant to noisy/approximate calculations. Approximate computing can significantly increase the speedup of calculations by adjusting the degree of accuracy needed for the given tasks [17]. The key idea behind our proposed approximate CI selection technique is to let the computation precision of the extensible processor be slightly reduced in favor of gaining more speedup. This is achieved by pushing the limit of CI timing and selecting those CIs that do not strictly meet the clock period.
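The dependence of the timing error rate on the delay distribution and the clock period can be illustrated with a minimal sketch. It assumes, purely for illustration, that the PVTA-dependent delay of a CI follows a Gaussian distribution; the function name and all numbers are hypothetical and are not taken from the paper, which derives the distribution from workload- and circuit-dependent analysis.

```python
from statistics import NormalDist


def timing_error_probability(mean_delay_ns, sigma_ns, clock_period_ns):
    """Probability that a CI's delay exceeds the clock period.

    Assumption (illustrative only): the PVTA-dependent delay is Gaussian
    with the given mean and standard deviation.
    """
    delay = NormalDist(mu=mean_delay_ns, sigma=sigma_ns)
    return 1.0 - delay.cdf(clock_period_ns)


# Hypothetical CI with mean delay 2.0 ns and standard deviation 0.1 ns.
# A guard-banded clock leaves a 4-sigma margin; an aggressive clock leaves 1 sigma.
guard_banded = timing_error_probability(2.0, 0.1, 2.4)
aggressive = timing_error_probability(2.0, 0.1, 2.1)

print(f"guard-banded clock: P(error) = {guard_banded:.2e}")
print(f"aggressive clock:   P(error) = {aggressive:.3f}")
```

Under these assumed numbers, shortening the clock period from 2.4 ns to 2.1 ns raises the raw circuit-level error probability from a negligible value to roughly 16%. How much of that reaches the application output depends on the masking effects described above.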

The rest of this paper is organized as follows: the related work is discussed in Section 2. Section 3 compares the conventional selection techniques of CIs with the proposed technique using a motivational example. Section 4 explains the sources of approximation in this paper. Section 5 discusses the problem statement and overviews the proposed cross-layer CI selection technique. Section 6 explains the cross-layer PVTA profiling flow to extract candidate CIs. Section 7 presents the application error estimation technique in the presence of PVTA-aware approximate CIs. Section 8 discusses the CI selection method and its merit function. Section 10 shows the results. Finally, Section 11 is the conclusion of this paper.

Section snippets

Related work

In the field of Instruction Set Architecture (ISA) extension, various techniques have been proposed for CI selection, which can be categorized into deterministic and approximate CI selection groups [5], [3], [20], [21], [22], [18], [23], [16], [24], [25], [26], [27]. In particular, the impact of process variations on deterministic CI selection is studied in [18]. The authors discussed that to improve the speedup of extensible processors while maintaining the desired timing yield, it is

A motivational example

Let us compare approximate CI selection with the traditional CI selection algorithms using a motivational example. To do so, we selected two CIs (i.e., CI1 and CI2) from the MiBench benchmark suite [28]. Both CI1 and CI2 implement the same computation, but with different hardware. CI1 has a low clock saving (low speedup) and a low circuit-level delay. CI2 offers a high clock saving (high speedup) but has a high circuit-level delay. Since CI2 cannot meet the timing constraint (due to its high

Sources of approximation

As discussed in Section 2, approximate CI can be achieved by two means, namely PVTA-aware aggressive clocking and imprecise hardware. Let us explain the impact of each of them on the accuracy of CIs using a hypothetical example as shown in Fig. 3(a).

Proposed cross-layer technique for approximate CI selection

As shown in Fig. 4, the proposed approximate CI selection flow consists of three phases, namely Cross-layer Profiling, Application Error Estimation, and CI Selection. In the profiling phase, the Data Flow Graph (DFG) of the application and instruction-level profiling are used to obtain the candidate CIs, their corresponding speedups, and their input statistics (distributions). Using the input statistics and floorplan information, the statistical PVTA distribution of each CI is

Cross-layer profiling

In this section, we explain how candidate CIs with their speedup and the PVTA profiles are extracted [4]. The obtained PVTA profiles are analyzed in the next phase of the proposed technique in order to obtain the delay distribution and also the TEM of CIs.

Error generation

The first step to compute TEM is PVTA-aware statistical timing analysis. The delay distribution of each output of CI is obtained in this analysis. In this paper, we perform a PVTA-aware timing analysis by extending the statistical techniques presented in [51], [14], [4]. As shown in Fig. 7, the timing analysis flow starts by reading the statistical PVTA profiles of each instance (i.e., gate and flipflop) of the gate-level design. Depending on these profiles, an instance-based

CI selection

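The main objective of the CI selection phase is to select a subset of CI candidates. This selection maximizes the speedup of the processor while the constraint on application error during the target lifetime as well as the constraints on area and power are satisfied:

$$
\begin{aligned}
\text{Maximize:}\quad & \mathit{speedup} = \mathit{ClockSaving} \times \frac{\mathit{Freq}_{CI}}{\mathit{Freq}_{noCI}}, \qquad \mathit{ClockSaving} = \frac{\mathit{Cycle}_{noCI}}{\mathit{Cycle}_{CI}}, \\
\text{subject to:}\quad & CI_i^{in} \le R_{in}, \qquad CI_i^{out} \le R_{out}, \\
& \mathit{TEM}_t^{\mathit{Freq}_{target}} \le \mathit{TEM}_{constraint}, \quad \forall t, \\
& \mathit{Area}\left(\sum_i b_i \times CI_i\right) \le A_{limit}, \\
& \mathit{Power}\left(\sum_i b_i \times CI_i^{t}\right) \le P_{limit}, \quad \forall t,
\end{aligned}
$$

where $CI_i^{in}$ and $CI_i^{out}$ denote the number of inputs and outputs of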
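The shape of this constrained selection problem can be sketched with a small exhaustive search. This is not the selection algorithm of the paper, only a brute-force stand-in that is practical for a handful of candidates. The `Candidate` fields, the additive error model, the single time point (the per-time-step index is dropped), the fixed frequency ratio, and all numbers are assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Candidate:
    """Hypothetical CI candidate; every field is an illustrative assumption."""
    name: str
    n_in: int          # number of input operands
    n_out: int         # number of output operands
    cycles_saved: int  # cycles removed from the baseline run when selected
    area: float
    power: float
    error: float       # estimated application-level error contribution


def select_cis(cands, cycles_no_ci, r_in, r_out,
               err_limit, area_limit, power_limit, freq_ratio=1.0):
    """Exhaustively pick the subset with the highest speedup under the constraints.

    Assumptions: errors, area and power add up across selected CIs, and
    Freq_CI / Freq_noCI is a fixed ratio passed in as freq_ratio.
    """
    # Per-candidate I/O port constraints.
    feasible = [c for c in cands if c.n_in <= r_in and c.n_out <= r_out]
    best, best_speedup = (), 1.0  # selecting nothing gives no speedup
    for k in range(1, len(feasible) + 1):
        for subset in combinations(feasible, k):
            saved = sum(c.cycles_saved for c in subset)
            if saved >= cycles_no_ci:
                continue  # inconsistent profile data, skip
            if sum(c.error for c in subset) > err_limit:
                continue
            if sum(c.area for c in subset) > area_limit:
                continue
            if sum(c.power for c in subset) > power_limit:
                continue
            speedup = cycles_no_ci / (cycles_no_ci - saved) * freq_ratio
            if speedup > best_speedup:
                best, best_speedup = subset, speedup
    return tuple(c.name for c in best), best_speedup


candidates = [
    Candidate("A", 2, 1, 200, 1.0, 1.0, 0.00),  # exact CI
    Candidate("B", 2, 1, 500, 1.5, 1.2, 0.04),  # approximate CI
    Candidate("C", 4, 1, 700, 1.0, 1.0, 0.01),  # needs too many inputs
    Candidate("D", 3, 2, 300, 2.0, 1.5, 0.05),  # approximate CI
]

approx_pick = select_cis(candidates, 1000, r_in=3, r_out=2,
                         err_limit=0.06, area_limit=3.0, power_limit=3.0)
exact_pick = select_cis(candidates, 1000, r_in=3, r_out=2,
                        err_limit=0.0, area_limit=3.0, power_limit=3.0)
print("error budget 0.06:", approx_pick)
print("error budget 0.00:", exact_pick)
```

With these made-up numbers, allowing a small error budget lets the search pick the approximate candidate `B` alongside `A`, while a zero error budget restricts it to `A` alone. Exhaustive enumeration grows exponentially with the number of candidates, so it is only meant to make the objective and constraints concrete.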

Error rate control

It is practically infeasible to explore all combinations of inputs and the entire set of input patterns in order to predict and guarantee the error rate in the offline phase. Therefore, there is always a possibility that the error rate of the extensible processor deviates significantly from the expected error rate, which may result in lower output quality. Unpredictability is a common problem across almost all approximate techniques [17], and it can be tolerated using specific countermeasures. For example, to tackle

Experimental setup

All circuits are synthesized, placed and routed based on NANGATE 45 nm library [56]. Each instance in the gate-level model of CIs is also characterized in the presence of PVTA variations using SPICE simulation. It is done by a library characterizer framework. According to industrial results and ITRS report [9], we consider up to 10% variations in the operating voltage, Vth(0), and Leff and up to 15% threshold voltage increase due to BTI in 7 years. Several benchmark applications including gsm and

Conclusion

CIs manufactured at advanced silicon nodes are very prone to PVTA-induced timing errors. Such timing errors are usually tackled by conservative guard-bands, which lead to significant performance penalties. This paper presents a novel approximate technique for CI selection in the presence of PVTA-induced timing errors. The key idea of approximate CI selection is to relax the delay constraint at the circuit level to be able to select high-speedup CIs, while tolerating the introduced timing errors

References (61)

  • ITRS
  • B. Farahani et al., A cross-layer approach to online adaptive reliability prediction of transient faults
  • B. Farahani et al., An instance-based SER analysis in the presence of PVTA variations
  • M. Gupta, Tribeca: design for PVT variations with local recovery and fine-grained adaptation
  • F. Firouzi, Incorporating the impacts of workload-dependent runtime variations into timing analysis
  • A. Rahimi et al., Hierarchically focused guardbanding: an adaptive approach to mitigate PVT variations and aging
  • M. Kamal, Improving efficiency of extensible processors by using approximate custom instructions
  • H. Esmaeilzadeh et al., Neural acceleration for general-purpose approximate programs
  • M. Kamal et al., An architecture-level approach for mitigating the impact of process variations on extensible processors
  • B.J. Farahani et al., Reliability-aware cross-layer custom instruction screening
  • L. Pozzi et al., Exact and approximate algorithms for the extension of embedded processor instruction sets, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. (2006)
  • X. Chen et al., Fast identification of custom instructions for extensible processors, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. (2007)
  • K. Atasu et al., Chips: custom hardware instruction processor synthesis, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. (2008)
  • P. Bonzini et al., Recurrence-aware instruction set selection for extensible embedded processors, IEEE Trans. Very Large Scale Integr. Syst. (2008)
  • T. Li et al., Fast enumeration of maximal valid subgraphs for custom-instruction identification
  • M. Zuluaga et al., Design-space exploration of resource-sharing solutions for custom instruction set extensions, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. (2009)
  • D. Wu et al., Fast generation of multiple custom instructions under area constraints, J. Semicond. Technol. Sci. (2011)
  • M.R. Guthaus, Mibench
  • M. Kamal et al., Timing variation-aware custom instruction extension technique
  • Z. Kedem et al., Optimizing energy to minimize errors in dataflow graphs using approximate adders
