PVTA-aware approximate custom instruction extension technique: A cross-layer approach
Introduction
The evolution from desktop computing to mobile computing creates relentless demand for low-power, high-performance embedded processors that can be designed within a very short time-to-market window [1], [2], [3], [4], [5]. General Purpose Processors (GPPs) are not well suited to embedded systems, since they are one-size-fits-all solutions that provide high average efficiency across a wide spectrum of applications. However, workloads vary widely in their characteristics, and as a result the flexibility of GPPs leads to significant power overheads and suboptimal speedup for specific applications. On the other hand, Application Specific Integrated Circuits (ASICs) provide both high speedup and low power consumption, but they are usually costly and less flexible than GPPs. Moreover, the long time-to-market of ASICs may not be tolerated by the market [4]. Application Specific Instruction Set Processors (ASIPs) have emerged as a tradeoff between GPPs and ASICs [6]. In this methodology, hotspot regions (i.e., frequently executed sequences of instructions) of the applications are accelerated by adding specific Custom Instructions (CIs) to the instruction set of a GPP in order to improve the speedup [3], [4], [7], [8].
As outlined by the International Technology Roadmap for Semiconductors (ITRS) [9], susceptibility to timing errors induced by process, voltage, temperature, and transistor aging (PVTA) variations has become one of the major design challenges for processors [4], [10], [11], [12]. Indeed, due to PVTA variations, the delay of CIs increases over time, and thus they may cause timing errors when the timing constraint (i.e., the clock period) is violated [3], [13], [14]. The most common approaches for tolerating PVTA variations and preventing timing errors are guardbanding and adding redundancy at different layers of the design hierarchy [15].
Selecting a CI that fails to meet the timing constraint of the processor may increase the speedup [16]. This comes at the expense of timing violations on critical paths, which result in timing errors. The rate of timing errors in a CI is a statistical parameter that depends on the PVTA-dependent delay distribution of the CI, the operating frequency of the processor, the workload, and the input patterns. However, a timing error that occurs at the circuit level may be masked either by the micro-architecture or by the application. Even an altered application output might still be tolerated by the user when the precise value of the output is not required (i.e., an acceptable output suffices instead of a precise one). Based on this observation, we propose an approximate CI selection technique. Indeed, approximate computing has emerged as a promising new source of efficiency. This approach is driven by the fact that today's computing demand is increasingly dominated by a growing number of applications (such as media applications for mobile devices) which are intrinsically tolerant to noisy/approximate calculations. Approximate computing can significantly increase the speedup of calculations by adjusting the degree of accuracy to what the given tasks need [17]. The key idea behind our proposed approximate CI selection technique is to let the computation precision of the extensible processor be slightly reduced in favor of gaining more speedup. This is achieved by pushing the limit of CI timing and selecting CIs that do not strictly meet the clock period.
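To illustrate why the timing-error rate is a statistical quantity, the following minimal sketch estimates the probability that the delay of a CI exceeds the clock period. It assumes, purely for illustration, that the PVTA-dependent delay follows a Gaussian distribution and it ignores workload and input-pattern effects; the function name and the numeric values are hypothetical and are not taken from the paper.

```python
import math


def timing_error_probability(mean_delay_ns, sigma_ns, clock_period_ns):
    """Probability that a CI's delay exceeds the clock period.

    Assumption (not from the paper): the delay is Gaussian with the given
    mean and standard deviation.
    """
    if sigma_ns <= 0:
        # Deterministic delay: the CI either always meets timing or never does.
        return 1.0 if mean_delay_ns > clock_period_ns else 0.0
    z = (clock_period_ns - mean_delay_ns) / sigma_ns
    # Survival function of the standard normal distribution: 1 - Phi(z).
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Hypothetical CIs evaluated against a 1.00 ns clock period.
p_conservative = timing_error_probability(0.80, 0.05, 1.00)  # large slack
p_aggressive = timing_error_probability(0.98, 0.05, 1.00)    # small slack

print(f"conservative CI: {p_conservative:.2e}")
print(f"aggressive CI:   {p_aggressive:.2e}")
```

Under this assumed model, a CI whose mean delay sits close to the clock period has a non-negligible error probability, whereas one with a large slack almost never fails. This is the margin that an approximate selection can trade for speedup.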
The rest of this paper is organized as follows. Related work is discussed in Section 2. Section 3 compares the conventional CI selection techniques with the proposed technique using a motivational example. Section 4 explains the sources of approximation considered in this paper. Section 5 discusses the problem statement and gives an overview of the proposed cross-layer CI selection technique. Section 6 explains the cross-layer PVTA profiling flow used to extract candidate CIs. Section 7 presents the application error estimation technique in the presence of PVTA-aware approximate CIs. Section 8 discusses the CI selection method and its merit function. Section 10 shows the results. Finally, Section 11 concludes the paper.
Section snippets
Related work
In the field of Instruction Set Architecture (ISA) extension, various techniques have been proposed for CI selection, which can be categorized into deterministic and approximate CI selection groups [5], [3], [20], [21], [22], [18], [23], [16], [24], [25], [26], [27]. In particular, the impact of process variations on deterministic CI selection is studied in [18]. The authors discussed that to improve the speedup of extensible processors while maintaining the desired timing yield, it is
A motivational example
Let us compare approximate CI selection with traditional CI selection algorithms using a motivational example. To do so, we selected two CIs (i.e., CI1 and CI2) from the MiBench benchmark suite [28]. Both CI1 and CI2 implement the same computation, but with different hardware implementations. CI1 has a low clock saving (low speedup) and a low circuit-level delay. CI2 offers a high clock saving (high speedup) but has a high circuit-level delay. Since CI2 cannot meet the timing constraint (due to its high
Sources of approximation
As discussed in Section 2, approximate CIs can be obtained by two means, namely PVTA-aware aggressive clocking and imprecise hardware. Let us explain the impact of each of them on the accuracy of CIs using the hypothetical example shown in Fig. 3(a).
Proposed cross-layer technique for approximate CI selection
As shown in Fig. 4, the proposed approximate CI selection flow consists of three phases, namely Cross-layer Profiling, Application Error Estimation, and CI Selection. In the profiling phase, the candidate CIs, their corresponding speedups, and their input statistics (distributions) are obtained using the Data Flow Graph (DFG) of the application in addition to instruction-level profiling. Using input statistics and floorplan information, the statistical PVTA distribution of each CI is
Cross-layer profiling
In this section, we explain how candidate CIs are extracted along with their speedups and PVTA profiles [4]. The obtained PVTA profiles are analyzed in the next phase of the proposed technique in order to obtain the delay distribution and the TEM of the CIs.
Error generation
The first step in computing the TEM is PVTA-aware statistical timing analysis, which yields the delay distribution of each output of a CI. In this paper, we perform a PVTA-aware timing analysis by extending the statistical techniques presented in [51], [14], [4]. As shown in Fig. 7, the timing analysis flow starts by reading the statistical PVTA profiles of each instance (i.e., gate and flip-flop) of the gate-level design. Depending on these profiles, an instance-based
CI selection
The main objective of the CI selection phase is to select a subset of the CI candidates that maximizes the speedup of the processor while satisfying the application error constraint over the target lifetime as well as the area and power constraints:
where $CI_i^{in}$ and $CI_i^{out}$ denote the number of inputs and outputs of
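The selection objective described above (maximize speedup subject to error, area, and power constraints) can be illustrated with a small exhaustive search. This is only a sketch under stated assumptions, not the paper's merit function: speedup, area, power, and error are all treated as additive across the selected CIs, the input/output port constraints are omitted, and the `Candidate` fields, the `select_cis` function, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Candidate:
    name: str
    speedup: float  # cycle saving contributed by this CI (hypothetical unit)
    area: float
    power: float
    error: float    # estimated application error contribution (assumed additive)


def select_cis(candidates, max_area, max_power, max_error):
    """Return the feasible subset with the highest total speedup.

    Exhaustive enumeration: exact, but only practical for small candidate sets.
    """
    best, best_gain = (), 0.0
    for k in range(1, len(candidates) + 1):
        for subset in combinations(candidates, k):
            if sum(c.area for c in subset) > max_area:
                continue
            if sum(c.power for c in subset) > max_power:
                continue
            if sum(c.error for c in subset) > max_error:
                continue
            gain = sum(c.speedup for c in subset)
            if gain > best_gain:
                best, best_gain = subset, gain
    return [c.name for c in best], best_gain


candidates = [
    Candidate("CI1", speedup=3.0, area=2.0, power=1.0, error=0.0),
    Candidate("CI2", speedup=8.0, area=3.0, power=2.0, error=0.02),  # violates timing
    Candidate("CI3", speedup=5.0, area=4.0, power=2.0, error=0.0),
]

# Tolerating a small application error admits the faster, error-prone CI2.
print(select_cis(candidates, max_area=6.0, max_power=4.0, max_error=0.05))
# A stricter error budget falls back to the error-free candidates.
print(select_cis(candidates, max_area=6.0, max_power=4.0, max_error=0.01))
```

In this toy instance, relaxing the error budget lets the search pick the high-speedup candidate that does not strictly meet timing, which mirrors the trade-off the technique exploits. A realistic flow would replace the additive error model with the application error estimate and use a scalable search instead of enumeration.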
Error rate control
It is practically infeasible to explore all input combinations and input patterns in order to predict and guarantee the error rate in the offline phase. Therefore, there is always a possibility that the error rate of the extensible processor deviates significantly from the expected error rate, which may result in lower output quality. Unpredictability is a common problem across almost all approximate techniques [17], and it can be mitigated using specific countermeasures. For example, to tackle
Experimental setup
All circuits are synthesized, placed, and routed using the NanGate 45 nm library [56]. Each instance in the gate-level model of the CIs is also characterized in the presence of PVTA variations using SPICE simulation, which is carried out by a library characterization framework. According to industrial results and the ITRS report [9], we consider up to 10% variation in the operating voltage, Vth(0), and Leff, and up to a 15% threshold voltage increase due to BTI over 7 years. Several benchmark applications including gsm and
Conclusion
CIs manufactured at advanced silicon nodes are very prone to PVTA-induced timing errors. Such timing errors are usually tackled by conservative guardbands, which lead to significant performance penalties. This paper presents a novel approximate technique for CI selection in the presence of PVTA-induced timing errors. The key idea of approximate CI selection is to relax the delay constraint at the circuit level in order to select high-speedup CIs, while tolerating the introduced timing errors
References (61)
- et al. FPGA-aware techniques for rapid generation of profitable custom instructions. Microprocess. Microsyst. (2013)
- et al. A cross-layer SER analysis in the presence of PVTA variations. Microelectron. Reliab. (2015)
- et al. Rapid design of area-efficient custom instructions for reconfigurable embedded processing. J. Syst. Archit. (2009)
- et al. Computer Organization and Design: The Hardware/Software Interface (2013)
- et al. Scalable custom instructions identification for instruction-set extensible processors
- Instruction-set extension under process variation and aging effects
- et al. Cross-layer custom instruction selection to address PVTA variations and soft error. Microelectron. Reliab. (2015)
- et al. Disjoint pattern enumeration for custom instructions identification
- et al. Reconfigurable System Design and Verification (2009)
- et al. Recurrence-aware instruction set selection for extensible embedded processors. IEEE Trans. Very Large Scale Integr. Syst. (2008)
- A cross-layer approach to online adaptive reliability prediction of transient faults
- An instance-based SER analysis in the presence of PVTA variations
- Tribeca: design for PVT variations with local recovery and fine-grained adaptation
- Incorporating the impacts of workload-dependent runtime variations into timing analysis
- Hierarchically focused guardbanding: an adaptive approach to mitigate PVT variations and aging
- Improving efficiency of extensible processors by using approximate custom instructions
- Neural acceleration for general-purpose approximate programs
- An architecture-level approach for mitigating the impact of process variations on extensible processors
- Reliability-aware cross-layer custom instruction screening
- Exact and approximate algorithms for the extension of embedded processor instruction sets. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst.
- Fast identification of custom instructions for extensible processors. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst.
- CHIPS: custom hardware instruction processor synthesis. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst.
- Recurrence-aware instruction set selection for extensible embedded processors. IEEE Trans. Very Large Scale Integr. Syst.
- Fast enumeration of maximal valid subgraphs for custom-instruction identification
- Design-space exploration of resource-sharing solutions for custom instruction set extensions. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst.
- Fast generation of multiple custom instructions under area constraints. J. Semicond. Technol. Sci.
- MiBench
- Timing variation-aware custom instruction extension technique
- Optimizing energy to minimize errors in dataflow graphs using approximate adders
Cited by (2)
Automating application-driven customization of ASIPs: A survey
2024, Journal of Systems Architecture
Resilience-Aware Frequency Tuning for Neural-Network-Based Approximate Computing Chips
2017, IEEE Transactions on Very Large Scale Integration (VLSI) Systems