PVTA-aware approximate custom instruction extension technique: A cross-layer approach

doi:10.1016/j.microrel.2016.05.008

Microelectronics Reliability

Volume 63, August 2016, Pages 267-277

https://doi.org/10.1016/j.microrel.2016.05.008

Abstract

Process, Voltage, and Temperature variations together with transistor Aging (PVTA) can result in a significant number of timing errors in Custom Instructions (CIs) manufactured at nano-scaled silicon nodes. The state-of-the-art approach to tackle this concern is guard-banding. However, this policy can erode the performance gain obtained by CIs, because the gap between the worst-case delay and the true delay widens under PVTA variations. This paper proposes a novel approximate CI selection technique to address this issue. The technique allows applications that do not require perfect accuracy to experience a tolerable amount of timing errors imposed by PVTA variations in exchange for significantly improved performance of the extensible processor. To achieve this, the proposed CI selection technique considers not only those CIs whose PVTA-aware delay is less than the given timing constraint, but also approximate CIs (i.e., CIs that cannot strictly meet the timing constraint, resulting in noisy/approximate computations). First, a timing analysis is performed to precisely compute the delay distribution of CIs in the presence of workload- and circuit-dependent PVTA variations. Then, based on the distribution obtained for each CI, a fault-map (i.e., timing error locations) is extracted. Using the fault-map, each circuit-level timing error is propagated to the application level to evaluate the quality/accuracy of the application output in the presence of PVTA-induced errors in approximate CIs. Finally, based on this cross-layer information, an optimal set of CIs is selected. This set yields the maximum performance per silicon area under the given constraints on power consumption and on the errors that can be tolerated by the user.
Simulations for various benchmark applications show that the proposed cross-layer technique achieves up to a 2.7× speedup increase compared to existing techniques, at the expense of 6% more error.

Introduction

The evolution from desktop computing to mobile computing creates relentless demand for low-power, high-performance embedded processors with a very short time-to-market window [1], [2], [3], [4], [5]. General Purpose Processors (GPPs) are not well suited to embedded systems, since they are one-size-fits-all solutions designed to provide high average efficiency across a wide spectrum of applications. However, workloads come in a variety of shapes and characteristics, and as a result the flexibility of GPPs leads to significant power overheads and sub-optimal speedup for specific applications. On the other hand, Application Specific Integrated Circuits (ASICs) provide both high speedup and low power consumption, but they are usually costly and less flexible than GPPs. Moreover, the long time-to-market of ASICs may not be tolerable by the market [4]. Application Specific Instruction Set Processors (ASIPs) have emerged as a tradeoff between GPPs and ASICs [6]. In this methodology, hotspot regions (i.e., frequently used sequences of instructions) of the applications are accelerated by adding specific Custom Instructions (CIs) to the instruction set of a GPP in order to improve the speedup [3], [4], [7], [8].

As outlined by the International Technology Roadmap for Semiconductors (ITRS) [9], susceptibility to timing errors induced by process, voltage, temperature, and transistor aging (PVTA) variations has become one of the major design challenges for processors [4], [10], [11], [12]. Indeed, due to PVTA variations, the delay of CIs increases over time, and thus they may produce timing errors when the timing constraint (i.e., clock period) is violated [3], [13], [14]. The most common approaches for tolerating PVTA variations and preventing timing errors are guardbanding and adding redundancy at different layers of the design hierarchy [15].

Selecting a CI that fails to meet the timing constraint of the processor may increase the speedup [16]. This comes at the expense of timing violations on critical paths, resulting in timing errors. The rate of timing errors in a CI is a statistical parameter that depends on the PVTA-dependent delay distribution of the CI, the operating frequency of the processor, the workload, and the input patterns. However, a timing error that occurs at the circuit level may be masked either by the micro-architecture or by the application. Even an altered application output may still be tolerated by the user when the precise value of the output is not a necessity (i.e., an acceptable output suffices instead of a precise one). Based on this observation, we propose an approximate CI selection technique. Indeed, approximate computing has emerged as a promising new source of efficiency. This approach is driven by the fact that today's computing demand is increasingly dominated by a growing number of applications (such as media applications for mobile devices) that are intrinsically tolerant to noisy/approximate calculations. Approximate computing can significantly increase the speedup of calculations by adjusting the degree of accuracy to what the given tasks need [17]. The key idea behind our proposed approximate CI selection technique is to let the computation precision of the extensible processor be slightly reduced in favor of gaining more speedup. This is achieved by pushing the limit of CI timing and selecting CIs that do not strictly meet the clock period.
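To make the dependence of the timing error rate on the delay distribution and the clock period concrete, the following is a minimal sketch, not the paper's timing analysis. It assumes, purely for illustration, that the PVTA-aware delay of a CI is normally distributed; the function name `timing_error_probability` and all numeric values are hypothetical.

```python
import math


def timing_error_probability(mean_delay_ns, sigma_ns, clock_period_ns):
    """Probability that the CI delay exceeds the clock period.

    Assumption (not from the paper): the delay is Gaussian with the
    given mean and standard deviation.
    """
    if sigma_ns <= 0:
        # Degenerate case: a deterministic delay either meets timing or not.
        return 1.0 if mean_delay_ns > clock_period_ns else 0.0
    z = (clock_period_ns - mean_delay_ns) / sigma_ns
    # Survival function of the standard normal, written with erfc.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical CI with a mean delay of 1.0 ns and sigma of 0.05 ns.
for period_ns in (1.15, 1.05, 1.00):
    p = timing_error_probability(1.0, 0.05, period_ns)
    print(f"clock period {period_ns:.2f} ns -> error probability {p:.4f}")
```

Under this assumed model, shrinking the clock period from a guard-banded value toward the mean delay raises the error probability from well below 1% to 50%, which is the trade-off an approximate CI exposes.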

The rest of this paper is organized as follows: the related work is discussed in Section 2. Section 3 compares the conventional selection techniques of CIs with the proposed technique using a motivational example. Section 4 explains the sources of approximation in this paper. Section 5 discusses the problem statement and overviews the proposed cross-layer CI selection technique. Section 6 explains the cross-layer PVTA profiling flow to extract candidate CIs. Section 7 presents the application error estimation technique in the presence of PVTA-aware approximate CIs. Section 8 discusses the CI selection method and its merit function. Section 10 shows the results. Finally, Section 11 is the conclusion of this paper.

Section snippets

Related work

In the field of Instruction Set Architecture (ISA) extension, various techniques have been proposed for CI selection, which can be categorized into deterministic CI selection and approximate CI selection groups [5], [3], [20], [21], [22], [18], [23], [16], [24], [25], [26], [27]. In particular, the impact of process variations on deterministic CI selection is studied in [18]. The authors discussed that to improve the speedup of extensible processors while maintaining the desired timing yield, it is

A motivational example

Let us compare approximate CI selection with the traditional CI selection algorithms using a motivational example. To do so, we selected two CIs (i.e., CI₁ and CI₂) from the MiBench benchmark suite [28]. Both CI₁ and CI₂ implement the same computation, but with different hardware. CI₁ has a low clock saving (low speedup) and a low circuit-level delay. CI₂ offers a high clock saving (high speedup) but has a high circuit-level delay. Since CI₂ cannot meet the timing constraint (due to its high

Sources of approximation

As discussed in Section 2, approximate CIs can be obtained by two means, namely PVTA-aware aggressive clocking and imprecise hardware. Let us explain the impact of each of them on the accuracy of CIs using a hypothetical example as shown in Fig. 3(a).

Proposed cross-layer technique for approximate CI selection

As shown in Fig. 4, the proposed approximate CI selection flow consists of three phases, namely Cross-layer Profiling, Application Error Estimation, and CI Selection. In the profiling phase, using the Data Flow Graph (DFG) of the application in addition to instruction-level profiling, the candidate CIs, their corresponding speedups, and their input statistics (distribution) are obtained. Using input statistics and floorplan information, the statistical PVTA distribution of each CI is

Cross-layer profiling

In this section, we explain how candidate CIs with their speedup and the PVTA profiles are extracted [4]. The obtained PVTA profiles are analyzed in the next phase of the proposed technique in order to obtain the delay distribution and also the TEM of CIs.

Error generation

The first step to compute TEM is PVTA-aware statistical timing analysis. The delay distribution of each output of CI is obtained in this analysis. In this paper, we perform a PVTA-aware timing analysis by extending the statistical techniques presented in [51], [14], [4]. As shown in Fig. 7, the timing analysis flow starts by reading the statistical PVTA profiles of each instance (i.e., gate and flipflop) of the gate-level design. Depending on these profiles, an instance-based

CI selection

The main objective of the CI selection phase is to select a subset of CI candidates. This selection maximizes the speedup of the processor while the constraint of application error during target lifetime as well as the constraints of area and power are satisfied:
$\begin{array}{l} \text{Maximize}: \; speedup = ClockSaving \times \frac{Freq_{CI}}{Freq_{noCI}}, \\ ClockSaving = \frac{Cycle_{noCI}}{Cycle_{CI}}, \\ CI_{i_{in}} \leq R_{in}, \\ CI_{i_{out}} \leq R_{out}, \\ TEM(t, Freq_{target}) \leq TEM_{constraint}, \; \forall t, \\ Area(b_{i} \times CI_{i}) \leq A_{limit}, \\ Power(b_{i} \times CI_{i}, t) \leq P_{limit}, \; \forall t, \end{array}$
where $CI_{i_{in}}$ and $CI_{i_{out}}$ denote the number of inputs and outputs of
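The objective and constraints above can be illustrated with a small sketch. This is not the selection algorithm of the paper: it uses a plain exhaustive search over subsets, drops the time dependence, and assumes that the error, area, and power of selected CIs simply add up. The `Candidate` fields, the function names, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Candidate:
    name: str
    inputs: int        # number of CI inputs, checked against r_in
    outputs: int       # number of CI outputs, checked against r_out
    cycles_saved: int  # cycles removed from the baseline run
    area: float
    power: float
    error: float       # assumed additive application-level error


def speedup(selected, cycles_no_ci, freq_ratio):
    """speedup = ClockSaving * (Freq_CI / Freq_noCI).

    Assumes the saved cycles are fewer than the baseline cycle count.
    """
    cycles_ci = cycles_no_ci - sum(c.cycles_saved for c in selected)
    return (cycles_no_ci / cycles_ci) * freq_ratio


def select(cands, cycles_no_ci, freq_ratio, r_in, r_out,
           err_max, area_max, power_max):
    """Exhaustively pick the feasible subset with the highest speedup."""
    legal = [c for c in cands if c.inputs <= r_in and c.outputs <= r_out]
    best, best_speedup = (), speedup((), cycles_no_ci, freq_ratio)
    for k in range(1, len(legal) + 1):
        for subset in combinations(legal, k):
            if sum(c.error for c in subset) > err_max:
                continue
            if sum(c.area for c in subset) > area_max:
                continue
            if sum(c.power for c in subset) > power_max:
                continue
            s = speedup(subset, cycles_no_ci, freq_ratio)
            if s > best_speedup:
                best, best_speedup = subset, s
    return best, best_speedup


# Hypothetical candidates: ci2 is approximate, ci3 has too many inputs.
candidates = [
    Candidate("ci1", 2, 1, 100, 1.0, 1.0, 0.00),
    Candidate("ci2", 2, 1, 400, 2.0, 1.5, 0.05),
    Candidate("ci3", 5, 1, 500, 1.0, 1.0, 0.00),
]
chosen, gain = select(candidates, 1000, 1.0, 4, 2, 0.06, 3.0, 3.0)
print([c.name for c in chosen], gain)
```

With a zero error budget the same search can only pick the exact candidate, so the achievable speedup drops; this is the effect that relaxing the error constraint is meant to exploit. Exhaustive search grows exponentially with the number of candidates and is used here only to keep the sketch short.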

Error rate control

It is practically infeasible to explore all combinations of inputs and the entire set of input patterns to predict and guarantee the error rate in the offline phase. Therefore, there is always a possibility that the error rate of the extensible processor deviates significantly from the expected error rate, which may result in lower output quality. Unpredictability is a common problem across almost all approximate techniques [17], and it can be tolerated using specific countermeasures. For example, to tackle

Experimental setup

All circuits are synthesized, placed, and routed based on the NANGATE 45 nm library [56]. Each instance in the gate-level model of the CIs is also characterized in the presence of PVTA variations using SPICE simulation. This is done by a library characterizer framework. According to industrial results and the ITRS report [9], we consider up to 10% variations in the operating voltage, V_th(0), and L_eff and up to 15% threshold voltage increase due to BTI in 7 years. Several benchmark applications including gsm and

Conclusion

CIs manufactured at advanced silicon nodes are very prone to PVTA-induced timing errors. Such timing errors are usually tackled by conservative guard-bands which lead to significant performance penalties. This paper presents a novel approximate technique for CI selection in the presence of PVTA-induced timing error. The key idea of approximate CI selection is to relax the delay constraint at the circuit-level to be able to select high speedup CIs, while tolerating the introduced timing errors

References (61)

A. Prakash et al.
FPGA-aware techniques for rapid generation of profitable custom instructions
Microprocess. Microsyst.
(2013)

B. Farahani et al.
A cross-layer SER analysis in the presence of PVTA variations
Microelectron. Reliab.
(2015)

S.-K. Lam et al.
Rapid design of area-efficient custom instructions for reconfigurable embedded processing
J. Syst. Archit.
(2009)

D.A. Patterson et al.
Computer Organization and Design: The Hardware/Software Interface
(2013)

P. Yu et al.
Scalable custom instructions identification for instruction-set extensible processors

Y. Hara-Azumi
Instruction-set extension under process variation and aging effects

B. Farahani et al.
Cross-layer custom instruction selection to address PVTA variations and soft error
Microelectron. Reliab.
(2015)

P. Yu et al.
Disjoint pattern enumeration for custom instructions identification

P.-A. Hsiung et al.
Reconfigurable System Design and Verification
(2009)

P. Bonzini et al.
Recurrence-aware instruction set selection for extensible embedded processors
IEEE Trans. Very Large Scale Integr. Syst.
(2008)

ITRS

B. Farahani et al.
A cross-layer approach to online adaptive reliability prediction of transient faults

B. Farahani et al.
An instance-based SER analysis in the presence of PVTA variations

M. Gupta
Tribeca: design for PVT variations with local recovery and fine-grained adaptation

F. Firouzi
Incorporating the impacts of workload-dependent runtime variations into timing analysis

A. Rahimi et al.
Hierarchically focused guardbanding: an adaptive approach to mitigate PVT variations and aging

M. Kamal
Improving efficiency of extensible processors by using approximate custom instructions

H. Esmaeilzadeh et al.
Neural acceleration for general-purpose approximate programs

M. Kamal et al.
An architecture-level approach for mitigating the impact of process variations on extensible processors

B.J. Farahani et al.
Reliability-aware cross-layer custom instruction screening

L. Pozzi et al.
Exact and approximate algorithms for the extension of embedded processor instruction sets
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst.
(2006)

X. Chen et al.
Fast identification of custom instructions for extensible processors
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst.
(2007)

K. Atasu et al.
Chips: custom hardware instruction processor synthesis
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst.
(2008)

P. Bonzini et al.
Recurrence-aware instruction set selection for extensible embedded processors
IEEE Trans. Very Large Scale Integr. Syst.
(2008)

T. Li et al.
Fast enumeration of maximal valid subgraphs for custom-instruction identification

M. Zuluaga et al.
Design-space exploration of resource-sharing solutions for custom instruction set extensions
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst.
(2009)

D. Wu et al.
Fast generation of multiple custom instructions under area constraints
J. Semicond. Technol. Sci.
(2011)

M.R. Guthaus
MiBench

M. Kamal et al.
Timing variation-aware custom instruction extension technique

Z. Kedem et al.
Optimizing energy to minimize errors in dataflow graphs using approximate adders

