Modelling and Automated Implementation of Optimal Power Saving Strategies in Coarse-Grained Reconfigurable Architectures

This paper focuses on how to efficiently reduce power consumption in coarse-grained reconfigurable designs, to allow their effective adoption in heterogeneous architectures supporting and accelerating complex and highly variable multifunctional applications. We propose a design flow for this kind of architecture that, besides automating its customization, is also capable of determining its optimal power management support. Power and clock gating implementation costs are estimated in advance, before their physical implementation, on the basis of the functional, technological, and architectural parameters of the baseline design. Experimental results, on 90 and 45 nm CMOS technologies, demonstrate that the proposed approach guides the designer towards the optimal implementation.


Introduction
Electronic devices on the market rely on the execution of computation-intensive applications on complex heterogeneous systems. Coarse-grained reconfigurable (CGR) platforms combine the high performance levels provided by Application Specific Integrated Circuit (ASIC) designs with an increased flexibility, allowing the execution of a larger set of applications over the same substrate [1,2]. However, in the dark silicon era, due to the limited available power budget, a gap exists between the number of transistors that can be placed within a die and the number that can actually be active during execution [3,4]. Therefore, systems are also required to be energy efficient, and CGR designs must integrate specific power management techniques for the functional logic regions that constitute them.
Several effective techniques for power monitoring [5] and reduction [6] have been presented in the literature. Among them, voltage/frequency scaling [7,8] and power shut-off schemes [9,10] can be extremely beneficial. However, their integration requires manual intervention by the designer, resulting in a complex, error-prone, and time-consuming process. While commercial synthesizers [11] allow the automatic implementation of low-overhead saving strategies at the gate level, such as fine-grained clock gating, they only provide implementation-level instruments to apply more complex strategies, like power gating. In the CGR systems field, the Multi-Dataflow Composer (MDC) tool, combining the dataflow-based system specification approach with the coarse-grained reconfigurable design paradigm, is capable of automatically generating run-time reconfigurable multifunctional systems, featuring flexibility and area minimization [12]. MDC was originally meant to address reconfigurable codec implementations and was conceived to be exploited within MPEG Reconfigurable Video Coding (MPEG-RVC) studies. However, it has also been successfully adopted in different resource- and power-constrained scenarios [13,14], where only microprogrammed solutions have been used so far, either exploiting single-core digital signal processors [15] or custom multicore embedded processors [16]. The MDC design suite is composed of different extensions. The work presented in this paper is related to the MDC power management extension, which has been previously addressed in [17,18]. The MDC tool identifies, in the generated CGR system, the minimum set of disjoint, functionally homogeneous logic areas, called logic regions. These are exploited to automatically implement dynamic power management strategies, applying to all of them, without distinction, either clock gating [17] or power gating [18].
The work presented in this paper proposes a power modelling methodology and improves the MDC power management extension by integrating this methodology within its automated flow. We introduce an algorithm that analyses the identified logic regions and, on the basis of a single synthesis and a minimal set of simulations (one for each scenario of the multifunctional problem), is capable of optimally characterizing the power management support. This flow, separately for each logic region of the CGR design, assesses both clock and power gating management costs and determines which is the optimal power saving strategy (if any) prior to any physical system implementation. The algorithm is based on detailed static and dynamic power consumption models that take into account functional, architectural, and technological parameters to define the potential overhead and benefits of the considered solutions. As a future perspective, besides its application within the MPEG-RVC scenario, the proposed modelling strategy may also be extended to support other complex autonomous computing systems [19], where the number of involved resources may change at runtime.
The rest of this paper is organized as follows. Section 2 reports the background of this work. Section 3 describes the operation of MDC: in particular, Section 3.1 focuses on the base operation, while Section 3.2 details the proposed power estimation models and their integration within the tool. Section 4 presents the designs under test used to validate the proposed approach: Section 4.1 involves an FFT use case targeting a 90 nm CMOS technology, while Section 4.2 presents the experimental results conducted to assess the enhanced MDC flow on a zoom coprocessor (targeting both 90 nm and 45 nm CMOS technologies). Finally, Section 4.4 details the benefits of the proposed models, before concluding with some final remarks in Section 5.

Background
This work presents a model of power consumption for CGR systems. The model is capable of estimating the system power dissipation from the early stages of the design flow, and it has been integrated within an automated flow that decides which power saving technique, between clock gating and power gating, has to be applied to each portion of the design.
This section provides an overview of the state of the art in the field of power-aware optimization and, in more detail, of the main aspects involved in the proposed approach. Section 2.1 introduces the distinctive features of CGR systems and the main architectural trends for such devices. Section 2.2 deals with power issues in digital systems, with a particular emphasis on modelling strategies and design automation.

Coarse-Grained Reconfigurable Systems.
Reconfigurable architectures are usually conceived as collections of functional units (FUs) whose functionality and connections can be configured at run time, to adapt them to different applications or operating modes. Such systems can be classified according to the granularity of the FUs. Fine-grained approaches, typically exploiting FPGA devices as the underlying technology, involve bit-level FUs, resulting in higher flexibility but requiring a long configuration time (due to the configuration bitstream size). Coarse-grained reconfigurability, on the other hand, provides word-level FUs, thus offering less flexibility while guaranteeing faster configuration phases. CGR architectures are usually exploited to design flexible ASICs, making them capable of switching among a finite set of functionalities. In such design cases, high area efficiency can be easily obtained, but not all the resources are evenly involved in the computation. Dedicated power management techniques are needed to reduce the power overhead of the resources that are not involved in each operating mode [17]. CGR architectures have already proved suitable for application scenarios that require flexibility along with strong area, power, and execution efficiency [13,20].
One of the main issues of CGR architectures is their complex mapping and programming [21,22]. Several works have tried to automate the mapping of applications and computational kernels onto CGR and multicore systems [23,24,25]. The mapping problem requires specific knowledge of the considered kernels, which usually have to be identified and specified by means of hardware description languages. The mapping effort is directly proportional to the number of involved kernels [26]. Recently, dataflow models have proved very useful in this scenario [27,28]. Dataflows describe programs through a graph whose nodes are processing elements (actors) linked by point-to-point unidirectional channels managed according to a FIFO protocol. Actors encapsulate their own state and communicate only through atomic packets of data (tokens). Due to their intrinsic modularity, dataflows favour the definition and reuse of hardware and software components. Furthermore, they are natively capable of highlighting the intrinsic parallelism of the specified applications. The Multi-Dataflow Composer (MDC) tool, adopted within the presented work, relies on dataflows (the RVC-CAL formalism by MPEG is currently supported) to solve the CGR mapping problem. It exploits the characteristics of this kind of model to provide several advanced features (e.g., power management [17,18] or automatic generation of coprocessing units [29]).
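The dataflow model described above can be illustrated with a minimal Python sketch. It is not taken from MDC or from RVC-CAL: the `Channel`, `Actor`, and `run` names are hypothetical, and every actor is assumed to consume one token and produce one token per firing.

```python
from collections import deque


class Channel:
    """Point-to-point unidirectional channel managed as a FIFO of tokens."""

    def __init__(self):
        self._fifo = deque()

    def put(self, token):
        self._fifo.append(token)

    def get(self):
        return self._fifo.popleft()

    def __len__(self):
        return len(self._fifo)


class Actor:
    """Processing element: consumes one token, emits one token, keeps private state."""

    def __init__(self, func, src, dst):
        self.func, self.src, self.dst = func, src, dst
        self.fired = 0  # encapsulated state, never shared with other actors

    def can_fire(self):
        return len(self.src) > 0

    def fire(self):
        self.dst.put(self.func(self.src.get()))
        self.fired += 1


def run(actors):
    """Keep firing ready actors until no actor can fire any more."""
    progress = True
    while progress:
        progress = False
        for actor in actors:
            if actor.can_fire():
                actor.fire()
                progress = True


c_in, c_mid, c_out = Channel(), Channel(), Channel()
scale = Actor(lambda t: 2 * t, c_in, c_mid)
offset = Actor(lambda t: t + 1, c_mid, c_out)
for token in (1, 2, 3):
    c_in.put(token)
run([offset, scale])  # scheduling order does not change the result
result = [c_out.get() for _ in range(len(c_out))]
```

Because actors interact only through tokens on FIFO channels, the order in which ready actors are fired does not affect the output stream, which is what makes the model modular and naturally parallel.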

Power Management.
Power consumption in digital devices is composed mainly of two different contributions: dynamic and static. The former is due to capacitance charging/discharging when logic transitions occur (i.e., switching activity). The latter is due to leakage currents and is consumed even when no circuit activity is present. Modern designers need to consider both terms when conceiving smart management strategies. Several techniques (clock gating, multifrequency, operand isolation, multithreshold, multisupply libraries, power gating, etc.) exist and, in some cases, they are automatically implemented by commercial synthesis/place-and-route tools. In custom computing systems, some advanced design tools support the designers in the application-driven customization of the hardware architectures [30,31]. However, generally speaking, invasive techniques, which require the insertion of additional logic and target technology support (such as the availability of dedicated cells and processes on the implementation stack), still need tools to be guided with significant manual effort by the designer.
Clock gating is an example of a fairly noninvasive technique. It may reduce the dynamic power consumption due to the clock tree and to sequential logic by up to 40% [32]. It consists in shutting off the clock of the unused synchronous logic by means of simple AND gates. Clock gating has been extensively automated and is available in most commercial synthesizers. In the MPEG-RVC community, recent studies [33] presented an extension of a High-Level Synthesis tool, Xronos, that selectively switches off the clock signal for parts of the circuit that are idle due to stalls in the pipeline, in order to reduce power consumption. Moreover, as mentioned, the MDC tool is capable of identifying independent circuitry regions by means of a graph-based analysis of the input dataflow specifications. These logic regions can be clock gated to dynamically adapt power consumption when switching between different functionalities [17,18]. From the technical point of view, in ASIC designs AND gates can be used directly on the clock to disable it, while in FPGA designs the clock network cannot be modified by the insertion of any custom logic and dedicated cells are required. Xilinx boards, for example, are equipped with dedicated blocks (BUFGs) whose outputs can drive distinct regions of logic, powering down different design portions when enabled. Clock gating can be applied at different granularities: fine-grained approaches act on single registers, whereas coarse-grained ones refer, as in [17,18,33], to a set of resources. Commercial synthesizers can normally automate only fine-grained strategies.
Power gating is quite invasive. The main idea behind it is as follows: if a specific portion of the design is not used in a given computation mode, then it can be completely switched off by means of a sleep transistor. This technique, like clock gating, is applicable at different granularities: fine-grained approaches require driving a different sleep transistor for every cell in the system, while coarse-grained ones, again, operate on a set of resources, instantiating one sleep transistor to drive different cells connected to a shared power network. MDC, as discussed in [18], also supports automatic power gating for CGR architectures. Each identified logic region in the CGR system is implemented (regardless of its nature or characteristics) as a different power domain (PD) that, in order to be managed, requires inserting and driving the following resources:
(i) The sleep transistor, placed between the gated region and the main power supply, to switch on/off the derived power supply
(ii) The isolation logic, placed between the gated region and the normally-on cells, to avoid the transmission of spurious signals to the inputs of the normally-on cells
(iii) The state retention logic to maintain, where needed, the internal state of the gated region
Besides defining the power gated design netlist, MDC also automatically defines the power format file.

Modelling.
To the best of our knowledge, the literature does not treat the problem of modelling power gating and clock gating costs in CGR designs. Some approaches only partially address the issue. For example, [35] focusses on low-power techniques and power modelling for FPGAs. In [36], only clock gating is taken into account: different power states (on the basis of the clock enable signals) are defined and their consumption is characterized by low-level power analysis results. The work in [37] focusses on estimating the leakage reduction for power gating and reverse body bias.
The CASPER simulator for shared memory many-core processors [38] includes precharacterized libraries containing power dissipation models of different hardware components, enabling accurate power estimation at a high-level exploration stage. In particular, the authors implement Chipwide Dynamic Voltage, Frequency Scaling, and Performance Aware Core-Specific Frequency Scaling. The FALPEM framework [39] provides power estimations at the pre-Register Transfer Level (RTL) stage, specifically targeting the power consumed by the clock network and the interconnect, but power and clock gating costs are not defined. Other approaches perform an estimation that considers different components. Li et al. [40] propose an architecture-level integrated power, area, and timing modelling framework for multicore systems, which evaluates system building blocks (CPU, buses, etc.) for different technology nodes, also providing power gating support. Finally, the work presented in [41] focuses on on-chip networks.

Design Suite for Coarse-Grained Power-Aware Systems
This section discusses the proposed technique for modelling the power consumption of a CGR system when clock gating or power gating is applied. These models, combined in a selection algorithm, can be exploited to develop an automated design flow for power-efficient CGR systems, where the optimal saving strategy is selected for each identified working set of resources.
In this work, we have embedded these models and the algorithm in the Multi-Dataflow Composer (MDC) tool, a framework capable of characterizing CGR systems. MDC provides a comprehensive design suite automating several tasks of the synthesis and development of CGR systems, within design flows targeting both FPGA [29] and ASIC. The tool provides extensive support for dynamic power management [17,18], addressing power-constrained design scenarios and completely automating the implementation and control of clock gating and power gating strategies in the final CGR platform. Nevertheless, such techniques are not currently addressed in a hybrid manner: users must choose the approach to be used in the design without an a priori automated analysis process. Such an unsupported selection may easily lead to suboptimal implementations on the final platform.
In the following, Section 3.1 provides an overview of the MDC baseline functionality and of the current power management support. Section 3.2 discusses the proposed power models and the automated selection algorithm that identifies the optimal power management strategy in CGR systems. In both sections, step-by-step examples are presented to clarify the methodology.

The Multi-Dataflow Composer Tool.
MDC automates the generation and management of CGR systems, facing the complex mapping of multiple applications onto a single reconfigurable architecture. It automates the mapping process and guarantees the minimization of hardware resources, allowing for significant area/energy savings [12,42]. In the literature, this problem is known as datapath merging and it deals with the combination of a set of input datapaths, described by means of graphs, onto a single reconfigurable datapath. It aims at maximally sharing (among the different input graphs) both processing nodes and connections.
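As a toy illustration of datapath merging (not MDC's actual algorithm), the Python sketch below merges two small graphs. It assumes single-input actors and treats actors with the same name as shareable; the `merge_datapaths` function and the actor names are hypothetical.

```python
def merge_datapaths(datapaths):
    """Toy datapath merging over single-input actors.

    datapaths: dict mapping a functionality name to a list of (src, dst) edges.
    Actors carrying the same name are assumed shareable (hypothetical convention).
    """
    actors = set()
    producers = {}  # dst actor -> distinct producers seen across all datapaths
    for edges in datapaths.values():
        for src, dst in edges:
            actors.update((src, dst))
            producers.setdefault(dst, set()).add(src)
    # A shared actor fed by different producers needs a switching element in front.
    switch_points = sorted(a for a, srcs in producers.items() if len(srcs) > 1)
    # Actor instances needed if every datapath were implemented on its own.
    unmerged = sum(
        len({node for edge in edges for node in edge})
        for edges in datapaths.values()
    )
    return actors, switch_points, unmerged


datapaths = {
    "f1": [("A", "B"), ("B", "C")],
    "f2": [("D", "B"), ("B", "E")],
}
actors, switch_points, unmerged = merge_datapaths(datapaths)
```

In this example, sharing actor `B` lets five actor instances implement what would otherwise require six, at the cost of a switching element in front of `B`.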
MDC is naturally compliant with the RVC-CAL formalism and natively supports Dataflow Process Network (DPN) models as input. Currently, it is interfaced with the Open RVC-CAL Compiler, Orcc [43]. The Orcc front-end is responsible for parsing, one by one, the high-level DPN specifications of the different datapaths that MDC will merge within the CGR system. Please note that MDC can be interfaced with other graph parsers, so it can easily be adapted to any other dataflow-based modelling environment.
As depicted by Figure 1, the MDC baseline flow involves three main phases:
(1) The input DPNs parsing, performed by the Orcc front-end, translates the RVC-CAL specifications into Java Intermediate Representations (IRs), which are basically directed graphs.
(2) The datapath merging, performed by the MDC front-end, combines the IRs into a reconfigurable IR, inserting (where necessary) special switching actors (responsible for properly distributing the token flow among the different merged DPNs) and keeping track of the system programmability through a dedicated Configuration Table (TAB in Figure 1).
(3) The hardware platform generation, performed by the MDC back-end, leads to the creation of the RTL that describes the CGR system itself, where each actor of the reconfigurable IR is mapped onto a different hardware FU. In this phase, the hardware communication protocol and the HDL (Hardware Description Language) components library (providing the RTL descriptions of the required FUs, manually or automatically generated) are provided as input to the tool.
At the hardware level, reconfiguration takes place in a single clock cycle. It is achieved through low-overhead switching elements (SBoxes) that allow the sharing of common resources among different input DPNs. SBoxes are simple combinatorial multiplexers and demultiplexers, whose configuration is stored in dedicated Look-Up Tables that, according to the Configuration Table, compute the selectors necessary for the correct data forwarding, in order to implement the requested functionality.
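A minimal behavioural sketch of this mechanism in Python is given below. The table contents, the SBox names, and the 2:1/1:2 port counts are assumptions made for illustration only; they are not taken from the MDC-generated hardware.

```python
# Hypothetical Configuration Table: functionality ID -> selector of each SBox.
CONFIG_TABLE = {
    0: {"SB0": 0, "SB1": 0},
    1: {"SB0": 1, "SB1": 0},
    2: {"SB0": 1, "SB1": 1},
}


def selectors(func_id):
    """Look-up table access: selectors for the requested functionality."""
    return CONFIG_TABLE[func_id]


def sbox_mux(sel, in0, in1):
    """2:1 combinatorial multiplexer: forwards in0 when sel is 0, in1 otherwise."""
    return in1 if sel else in0


def sbox_demux(sel, data):
    """1:2 combinatorial demultiplexer: routes data to output 0 or output 1."""
    return (None, data) if sel else (data, None)


# Selecting functionality 1 only requires reading a new row of the table.
sel = selectors(1)
forwarded = sbox_mux(sel["SB0"], "token_from_A", "token_from_D")
```

Since switching functionality amounts to reading a different table row, the sketch is consistent with a reconfiguration that completes in a single step.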

Automated Power Management.
When dealing with reconfigurable architectures, and in particular with CGR systems, power consumption has to be carefully taken into consideration. Such systems are affected by resource redundancy, mainly due to the FUs that are not shared among different functionalities. Thus, when a certain functionality is executed, the part of the design not involved in the computation is idle and can uselessly consume precious power. Fortunately, the unused resources, for each implemented functionality, depend on the input specifications and are therefore fixed at design time.
Given these considerations, a CGR system can be characterized by a set of disjoint logic regions (LRs), grouping the resources that are always active/inactive at the same time. The MDC power management extension is capable of automatically identifying LRs. It performs the LR identification at a high level of abstraction, on the reconfigurable IR, by exploiting the intrinsic modularity of the dataflow graphs. Once the LRs have been identified, the MDC dynamic power manager automatically applies, according to the user selection, either clock gating [17] or power gating [18] on the resulting CGR hardware platform. The identification of the minimal number of LRs is guaranteed, to minimize the power overhead of the extra logic needed to implement the selected power saving strategies. An overview of the power management extension is provided by Figure 1.
For each input DPN, the currently available algorithm determines the set of all the resources of the reconfigurable IR that the DPN activates. These sets are the initial candidate LRs and represent the starting point for the algorithm, which finds the final LRs by iteratively comparing two sets at a time to determine whether they overlap. If an overlap is found, its resources are removed from the two considered sets and a new set (corresponding to a new LR) containing these shared resources is issued. The new set groups resources that are shared among different input DPNs, while the remaining resources in the two compared sets belong uniquely to the originally considered DPNs. This compare and split identification process guarantees that the number of LRs found by the MDC dynamic power manager is the minimum achievable one. This number may still be too high for the considered target platform, as can happen when FPGAs are the target devices: there, the number of hardware blocks that can drive the different LRs is limited (e.g., 32 BUFG units are available in Xilinx boards for clock management purposes). In that case, an LR merging process has to be applied. MDC users are required to specify the target technology and the maximum number of implementable LRs. The latter is compared with the number of LRs determined by the compare and split identification process and, if necessary, the LR merging process is applied. Two LRs at a time are unified until the constraint fixed by the user is met; details on how to merge different LR sets can be found in [17], where two merging strategies (a power-aware one and a number-aware one) are presented. This process leads to a suboptimal system implementation: each DPN, while activating its corresponding LRs, may also activate some resources that do not contribute to its computation, leading to extra unnecessary power consumption.
During the HDL generation phase, the MDC power management extension also implements the chosen power management strategy upon the identified LRs. It blindly applies the selected strategy to all the identified LRs, without any guarantee of the effectiveness of the approach. Clock gating acts only on the dynamic contribution to the power consumption and requires a minimal logic overhead on the final platform. Indeed, the simplest implementation is achieved by means of one AND gate for each LR plus one unique Enable Generator that properly sets the enable signals of the AND gates according to the desired functionality. On the contrary, power gating is able to reduce both power contributions, static and dynamic, by shutting off the power supply of the region. However, it is considerably more invasive, since it requires one power switch for each LR, one state retention cell for each flip-flop whose state has to be kept when the corresponding LR is off, and one isolation cell for each bit-wise wire that goes from a disabled LR to an enabled one. A different clock gating cell (again an AND gate) is required for each LR, according to the switching-off protocol for the proper operation of the retention cells (details on the power gating switch on/off protocol can be found in [11]). Furthermore, one Power Controller block (involving a different finite state machine for each LR) is needed to properly drive the inserted power switches, state retention cells, and isolation cells.
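The compare and split identification can be sketched in Python as follows. This is a reading of the prose above rather than the MDC implementation: the function name, the data layout, and the resource and functionality names are hypothetical.

```python
from itertools import combinations


def compare_and_split(activated):
    """Split per-DPN resource sets into pairwise-disjoint logic regions.

    activated: dict mapping a DPN name to the set of resources it activates.
    Returns a list of (resources, activating DPNs) pairs.
    """
    regions = [(set(res), frozenset([dpn])) for dpn, res in activated.items()]
    changed = True
    while changed:
        changed = False
        for (i, (ra, da)), (j, (rb, db)) in combinations(enumerate(regions), 2):
            shared = ra & rb
            if shared:
                # Remove the overlap from both sets and issue it as a new region
                # activated by the union of the two groups of DPNs.
                regions[i] = (ra - shared, da)
                regions[j] = (rb - shared, db)
                regions.append((shared, da | db))
                regions = [r for r in regions if r[0]]  # drop emptied sets
                changed = True
                break
    return regions


activated = {
    "f1": {"x", "y", "s"},
    "f2": {"y", "s", "z"},
    "f3": {"s", "w"},
}
regions = compare_and_split(activated)
by_dpns = {dpns: res for res, dpns in regions}
```

Each split strictly reduces the total size of the sets, so the loop terminates. In the example, three functionalities yield five regions: three private ones, one shared by `f1` and `f2`, and one shared by all three.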

Step-by-Step Example.
In order to clarify the features provided by the MDC baseline functionality and its related dynamic power manager extension, this section describes a step-by-step example of the whole flow. Three different input functionalities are considered and modelled as DPNs. As the first step of the baseline MDC functionality, the DPNs are parsed by the Orcc front-end and translated into Java IR graphs. Figure 2 depicts an overview of the whole flow starting from these input IRs. At this level, MDC combines the dataflows into a reconfigurable IR, inserting the SBox actors (SB in Figure 2). Three SBoxes are required to share one actor between two of the functionalities and another actor among all three.
Once the reconfigurable IR has been derived, the dynamic power manager can identify the corresponding LRs. The starting sets, one per input functionality, collect the actors and SBoxes activated by that functionality; the third of them, for instance, includes SB 0, SB 1, and SB 2. The compare and split identification process produces five different LRs: LR 4 and LR 2 involve shared resources, being activated, respectively, by two of the functionalities and by all three, while the remaining LRs (LR 1, LR 3, and LR 5) involve nonshared resources, each being activated by a different single functionality.
At this point, the selected power saving strategy is applied during the CGR system HDL generation. Figure 2 shows both the final designs resulting from the application of clock gating and power gating.
The clock gated platform is shown in the bottom left corner of Figure 2. In this case, the identified LRs become the Clock Domains (CDs) of the resulting architecture, meaning that the involved actors are all driven by the same gated clock. SBoxes are not included in any CD since they are fully combinatorial modules. As can be seen from Figure 2, the clock gating overhead is limited to four AND gates (LR 2 is activated by all the implemented functionalities and does not need to be turned off) and one Enable Generator that properly assigns the clock enable values.
The power gated platform is depicted in the bottom right corner of Figure 2. LRs define the architecture power domains (PDs) where, in this case, SBoxes are also taken into consideration. Power gating turns off the whole PD power supply and therefore also affects combinatorial blocks. Figure 2 clearly shows that the logic overhead of power gating is larger than that of clock gating. In the power gated platform, power switches, state retention cells, isolation cells, and one Power Controller are inserted. Please note that clock gating cells are not reported for simplicity. Again, the logic necessary to switch off LR 2 is avoided, since this region is always on (being activated by all the input DPNs).
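The clock gating overhead of the example can be reproduced with a short Python sketch. The functionality names (`f1`, `f2`, `f3`) and the pair of functionalities assigned to LR 4 are placeholders chosen for illustration; only the structure (three private regions, one shared by two functionalities, one shared by all) follows the example.

```python
def clock_gating_plan(regions, all_dpns):
    """Count coarse-grained clock gating resources and derive the enable table.

    regions: dict mapping an LR name to the set of DPNs that activate it.
    An LR activated by every DPN is always on and needs no gating cell.
    """
    all_dpns = set(all_dpns)
    gated = sorted(lr for lr, dpns in regions.items() if set(dpns) != all_dpns)
    # What a hypothetical Enable Generator would drive for each functionality.
    enables = {
        dpn: {lr: dpn in regions[lr] for lr in gated} for dpn in sorted(all_dpns)
    }
    return {
        "and_gates": len(gated),
        "enable_generators": 1 if gated else 0,
        "enables": enables,
    }


regions = {
    "LR1": {"f1"},
    "LR2": {"f1", "f2", "f3"},  # always on: no gating logic required
    "LR3": {"f2"},
    "LR4": {"f1", "f2"},        # placeholder pair of functionalities
    "LR5": {"f3"},
}
plan = clock_gating_plan(regions, {"f1", "f2", "f3"})
```

The same always-on test applies to power gating, where the region activated by every functionality needs no power switch, isolation, or retention logic.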

Automated Power Management with Hybrid Clock and Power Gating.
To overcome the limits of a blindly applied unique power management strategy, we propose, in this paper, a power estimation flow capable of (1) characterizing, at a high level of abstraction, the LRs identified by the MDC power extension, and of (2) autonomously applying the optimal power reduction technique for each LR. Power and clock gating overheads are estimated, based on the LR characteristics, before any physical implementation. This strategy is meant for ASIC technologies, which allow hybrid power and clock gating support over the same CGR design.
The estimation is based on two sets of models that determine the static and dynamic consumption of each LR when clock gating or power gating is applied. The proposed models are derived from the analysis of the power reports obtained after netlist simulation, following a single logic synthesis of the baseline CGR system generated by MDC, carried out with commercial synthesis tools. Such a synthesis constitutes the only implementation effort required of the designer, besides the characterization of technique-specific blocks such as the Enable Generator or the Power Controller. Models are technology-dependent since they include parameters that are characteristic of the chosen target technology library, as will be discussed in Section 3.2.4.
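The per-region selection step can be pictured with the Python sketch below. It assumes that each LR already has one estimated total power figure per candidate strategy and simply picks the lowest; the selection criterion, the strategy labels, and the numbers are illustrative assumptions, not values or rules from this work.

```python
def select_strategies(estimates):
    """Pick, independently for each LR, the strategy with the lowest estimate.

    estimates: dict mapping an LR name to a dict of strategy name -> estimated
    power (all values in the same unit). "none" means leaving the LR ungated.
    """
    return {lr: min(costs, key=costs.get) for lr, costs in estimates.items()}


# Hypothetical estimates, for illustration only.
estimates = {
    "LR1": {"none": 10.0, "clock_gating": 7.5, "power_gating": 4.0},
    "LR3": {"none": 2.0, "clock_gating": 1.8, "power_gating": 2.6},
    "LR5": {"none": 0.5, "clock_gating": 0.7, "power_gating": 1.9},
}
choice = select_strategies(estimates)
```

In this made-up case the three regions end up with three different strategies, which is the kind of hybrid outcome that a single, blindly applied technique cannot produce.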

Power Gating: Static Power Consumption Model.
Static power can be estimated on the basis of the leakage contributions provided, for each cell, by the targeted ASIC library. Given any hardware FU (uniquely corresponding to an actor of the reconfigurable IR), its static power can be obtained by summing up the single contributions of the adopted cells. The static power consumption term is tightly related to the LR area: the more cells are included in the considered region, the higher its corresponding static dissipation.
The proposed model for the static power consumption is defined as follows:

$$P_{lkg}(LR_i) = P_{lkgON}(LR_i) + \mathrm{ExtOver}_{lkg}(LR_i) \qquad (1)$$

Dealing with a prospective power gating implementation, the static estimation (1) for each LR involves two terms: $P_{lkgON}(LR_i)$ corresponds to the static consumption when the LR is active and $\mathrm{ExtOver}_{lkg}(LR_i)$ refers to the power overhead due to the additional power gating logic. This second term does not consider the power switch overhead, since it is not included in the prelayout netlist. Power gating prevents, by definition, any static dissipation in the LR when disabled; therefore, (1) does not present any $P_{lkgOFF}(LR_i)$ term.
$P_{lkgON}(LR_i)$ is obtained by multiplying the LR activation time $Ti_{ON}$ by the sum of the leakage power of the involved actors, considering combinatorial and sequential logic separately. The former, $P_{lkg}(cmb)$, is equal to the leakage of the combinatorial cells within the considered LR. The latter is related to the number of registers (#reg) within the LR and to their need (according to the implemented functionality) to preserve their status, by means of state retention cells, when the region is inactive. It involves, in turn, two terms. The first one refers to the registers whose state can be lost and it is estimated on the basis of the static consumption of the sequential cells ($P_{lkg}(reg)$), as an average over the number of registers that are not retained. The second one refers to the retention cells and it is estimated as the number of registers whose state has to be maintained (#rtn) multiplied by the leakage of a single state retention cell ($P_{lkg}(RC)$), whose value is retrieved from the target ASIC library.
$\mathrm{ExtOver}_{lkg}(LR_i)$ is composed of three terms: the first one is related to the isolation cells (#iso), the second one to the Power Controller, and the third one to the clock gating cell (the power gating switch-off protocol requires applying clock gating at the region level before retaining the register values). Note that, unlike $P_{lkgON}$, for the three abovementioned terms $\mathrm{ExtOver}_{lkg}$ characterizes the LR static consumption in both its on and off states. In the on state, the model accounts for the static consumption in the on state (e.g., $P_{lkg}(ISO_{ON})$) multiplied by the activation time $Ti_{ON}$ and by the overall number of cells within the LR (e.g., #iso). In the off state, the model accounts for the static consumption in the off state (e.g., $P_{lkg}(ISO_{OFF})$) multiplied by the inactive time $Ti_{OFF}$ and by the overall number of cells within the LR (e.g., #iso). Please note that there is just one Power Controller for all the LRs and one clock gating cell per LR, but an a priori characterization phase is required of the designer, since their consumption values cannot be retrieved directly from any ASIC library.
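The verbal description of the static model can be turned into a small Python sketch. This is a reconstruction from the prose, not the published equations. The field names are hypothetical, the activation time is assumed to be a fraction of the total time (so the inactive time is its complement), and the Power Controller figures are assumed to be the share attributed to one LR.

```python
from dataclasses import dataclass


@dataclass
class RegionLeakage:
    t_on: float      # fraction of time the LR is active (Ti_ON); Ti_OFF = 1 - t_on
    p_cmb: float     # leakage of the combinatorial cells, P_lkg(cmb)
    p_reg: float     # leakage of all the sequential cells, P_lkg(reg)
    n_reg: int       # number of registers in the LR (#reg)
    n_rtn: int       # registers whose state must be retained (#rtn)
    p_rc: float      # leakage of one state retention cell, P_lkg(RC)
    n_iso: int       # number of isolation cells (#iso)
    p_iso_on: float  # leakage of one isolation cell, LR on
    p_iso_off: float # leakage of one isolation cell, LR off
    p_pc_on: float   # Power Controller leakage attributed to this LR, LR on
    p_pc_off: float  # Power Controller leakage attributed to this LR, LR off
    p_cg_on: float   # clock gating cell leakage, LR on
    p_cg_off: float  # clock gating cell leakage, LR off


def leakage_on(r):
    """Static consumption while the LR is active (no contribution when off)."""
    per_reg = r.p_reg / r.n_reg if r.n_reg else 0.0  # average leakage per register
    not_retained = per_reg * (r.n_reg - r.n_rtn)
    retained = r.n_rtn * r.p_rc
    return r.t_on * (r.p_cmb + not_retained + retained)


def leakage_overhead(r):
    """Overhead of the power gating logic, paid in both on and off states."""
    t_off = 1.0 - r.t_on
    iso = r.n_iso * (r.t_on * r.p_iso_on + t_off * r.p_iso_off)
    controller = r.t_on * r.p_pc_on + t_off * r.p_pc_off
    gating_cell = r.t_on * r.p_cg_on + t_off * r.p_cg_off
    return iso + controller + gating_cell


def leakage_power_gated(r):
    """Sum of the active-state term and the overhead term of the static model."""
    return leakage_on(r) + leakage_overhead(r)


# Made-up numbers, used only to exercise the functions.
lr = RegionLeakage(0.5, 10.0, 8.0, 4, 1, 3.0, 2, 1.0, 0.5, 2.0, 2.0, 0.4, 0.2)
total = leakage_power_gated(lr)
```

The sketch makes the trade-off explicit: the first term shrinks with the activation time, while the overhead term is paid in both states, so power gating only pays off for regions that are large enough or idle for long enough.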

Power Gating: Dynamic Power Consumption Model.
Estimating the dynamic power is more complex than estimating the static one, since this term strongly depends on the switching activity of the nodes. Frequently, commercial tools (e.g., Cadence Encounter Digital Implementation System) consider dynamic power as composed of two main terms, as depicted by (2): a net contribution due to the power dissipated throughout the wires linking the cells, and an internal contribution due to the dissipation occurring inside the cells [44]:

$$P_{dyn} = P_{net} + P_{int} \qquad (2)$$

The operating frequency, $f$, influences both terms. $P_{net}$ accounts for the load capacitance of each net $i$ (bearing a specific capacitance $C_{load_i}$) and the related switching activity ($SW_i$), whereas $P_{int}$ depends on the power per MHz dissipated by each cell $j$ ($P_j$) and the related switching activity ($SW_j$).
Currently, the developed model is able to estimate only the $P_{int}$ contribution, which can be expressed for each single LR as in (3). Considering a prospective power gating implementation, the different parts of (3) basically reflect the ones of (1). The main difference between the static power model and the dynamic one is that the latter requires accurate data in terms of node switching activity. For this reason, the netlist of the baseline CGR system alone is not sufficient to retrieve accurate values from the power reports, and a separate simulation of the netlist is required for every implemented functionality. Thus, the dynamic power model takes into consideration the real system switching activity provided by the hardware simulations.
The $P_{net}$ term of (2) is not currently addressed in our model. Nevertheless, as demonstrated further on in this work (please see Tables 8, 9, 12, and 14), neglecting this term does not seem to affect the identification of the optimal region to be gated. We plan to extend the proposed methodology to include the net contribution in the near future.

Clock Gating: Static and Dynamic Power Consumption Models.
Clock gating static and dynamic models are less complicated than the power gating ones, since clock gating requires a very low logic overhead and acts positively only on the dynamic dissipation. Equations (4) and (5) report the models adopted, respectively, for the static and for the dynamic power estimation of a clock gated design. At the logic region level, (4) always considers the combinatorial and sequential contributions for both the on and off states, since clock gating does not affect the system leakage, whereas (5) always considers the combinatorial part for both the on and off states, since combinatorial logic cannot benefit from clock gating, and the sequential contribution only during the LR active time. The overhead, $ExtOver_{int}(LR_i)$, is given by the clock gating cell and the Enable Generator. Please remember that, when clock gating management is implemented at a coarse-grained level, just one clock gating cell per LR has to be inserted within the system. Equation (5) is essentially the same as (4), apart from the fact that, in the dynamic model, clock gating effects are estimated by omitting the contribution of sequential logic when the LR is off.
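The difference between the two clock gating models can be illustrated with a short sketch. It follows the verbal description only and is not the paper's exact formulation of (4) and (5): the function names are hypothetical, the activation time is assumed to be a fraction of the total time, and the overhead terms are passed in as already characterized values.

```python
def cg_static(p_lkg_cmb, p_lkg_reg, ext_over_lkg):
    """Static power of a clock gated LR.

    Leakage is not affected by clock gating, so both the combinatorial and
    the sequential contributions count in the on and in the off state.
    """
    return p_lkg_cmb + p_lkg_reg + ext_over_lkg


def cg_internal(t_on, p_int_cmb, p_int_reg, ext_over_int):
    """Internal power of a clock gated LR.

    The combinatorial part is counted for the whole time, whereas the
    sequential part is counted only while the LR is active (fraction t_on).
    """
    return p_int_cmb + t_on * p_int_reg + ext_over_int
```

With `t_on = 1.0` the two functions have the same structure, which mirrors the remark that (5) differs from (4) only in how the sequential logic is treated when the region is off.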

Parameters Discussion.
The proposed models are determined by the intrinsic features of the LRs. In particular, they consider the following: (i) Architectural Parameters. The LR composition determines the amount of involved combinatorial and sequential cells. (ii) Functional Parameters. The LR behaviour defines the region activation time and whether its status has to be preserved. (iii) Technological Parameters. The target technology has an impact on the ratio between dynamic and static power (as will be demonstrated in Section 4.2) and on the characterization of the different cells.
Table 1 reports the classification of each parameter considered in (1), (3), (4), and (5). A deeper explanation of $P_{lkg/int}(cmb)$ and $P_{lkg/int}(reg)$ is necessary. They are not associated with any specific parameter class; indeed, they depend on the type and number of cells composing the considered LR and also on the system switching activity (especially for the internal contribution). These values are gathered from the reports of the baseline CGR system netlist, assuming that the amount and type of cells composing the FUs do not change as power saving strategies are applied (except for the retained registers).

Step-by-Step Example.
The example proposed in Figure 2 shows the merging process of three DPNs in a CGR system, where five LRs have been identified. Equations (1), (3), (4), and (5) can be applied to all of them. However, as already discussed, LR 2 can be discarded: being common to all the input DPNs, it is always active and does not need to be switched off. As defined in the previous section, the parameters reported in Table 2 are extracted from the reference technology library or characterized by synthesis trials (see the definition provided in Table 1). The power consumption values in Table 3 have been extracted from the synthesis reports of the baseline (with no power saving applied) CGR platform and required three hardware simulations (one for each input kernel). These simulations are necessary to correctly estimate the internal power consumption of the different LRs, taking into account the real switching activity of the design. In practice, power values are determined as an average of those obtained according to the different switching activity profiles.
Starting from the data in Tables 3 and 2, here follows the detailed equations characterization for LR 5 , which include actors  and G.
When power gating is applied, the static power consumption of LR 5 is derived according to (1), and the internal power consumption is given by (3). When clock gating is considered, (4) and (5) are computed instead. Table 4 summarizes all the values achieved by applying the proposed static and dynamic models to all the different logic regions.

Hybrid Clock and Power Gating Support and Integration in MDC.
The discussed models (described in (1), (3), (4), and (5)) have been integrated in the MDC design flow, in order to implement a fully automated power management strategy. Designers are guided towards the optimal solution for each LR, rather than choosing a one-fit-to-all switching-off technique for all of them. This automated selection flow is implemented as reported in Algorithm 1. For each LR identified by the MDC power management extension, Algorithm 1 executes the following steps, embodied by different functions.
(1) Area Thresholding (See evaluate area Function). As previously discussed, power gating is a quite invasive technique, requiring a lot of extra logic to be inserted in the nonswitchable always on domain. Thus, for small LRs, we can assume that it will not bring any benefit, so that power gating is not considered for implementation. Clock gating, instead, may still be beneficial, due to its very small amount of additional logic.
(2) Power Gating Overhead Estimation (See evaluate PG Function in Algorithm 1). Power gating cost is estimated in order to find out whether it can lead to power saving. The prospective power and clock gating implementations are compared on the basis of their overall consumption. Equation (1) is applied and summed to (3). If there is no total power saving, the algorithm goes to the clock gating overhead estimation. On the contrary, if there is a saving, it has to be compared with the sum of (4) and (5) to determine whether the current LR benefits more from power gating (despite its larger overhead) or from clock gating.
(3) Clock Gating Overhead Estimation (See evaluate CG Function in Algorithm 1). Clock gating cost is estimated to investigate the possibility of achieving power saving with this technique. If the saving achievable by clock gating the LR does not counterbalance its implementation costs, the LR is discarded. This means that, when the MDC back-end generates the RTL description of the CGR system, the LR logic is included in the always on domain. On the contrary, if clock gating leads to an overall saving in terms of total power, the LR will be clock gated during the implementation.
The output of Algorithm 1 is the classification of the LRs, stating which ones should be power gated (see PG set in Algorithm 1), which ones should be clock gated (see CG set in Algorithm 1), and which ones should be included in the always on domain.
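The three steps can be sketched as a small classification routine. This is an illustrative reading of the description, not a transcription of Algorithm 1: the `Region` type and `classify` function are hypothetical names, the overheads are assumed to be precomputed percentages of the total baseline power (negative values denote a saving) rather than evaluated on demand, and a region exactly at the threshold is assumed not to qualify for power gating.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    area_pct: float  # LR area as a percentage of the total design area
    pg_over: float   # estimated overhead if power gated, in % (negative = saving)
    cg_over: float   # estimated overhead if clock gated, in % (negative = saving)


def classify(regions, area_th):
    """Split the LRs into power gated, clock gated, and always-on sets."""
    pg_set, cg_set, always_on = [], [], []
    for lr in regions:
        # Steps 1 and 2: only regions above the area threshold are considered
        # for power gating, and only if it saves more than clock gating would.
        if lr.area_pct > area_th and lr.pg_over < 0 and lr.pg_over < lr.cg_over:
            pg_set.append(lr.name)
        # Step 3: otherwise fall back to clock gating when it yields a saving.
        elif lr.cg_over < 0:
            cg_set.append(lr.name)
        # No technique pays off: the region stays in the always on domain.
        else:
            always_on.append(lr.name)
    return pg_set, cg_set, always_on


if __name__ == "__main__":
    # Hypothetical regions, for illustration only.
    demo = [
        Region("A", area_pct=40.0, pg_over=-20.0, cg_over=-3.0),
        Region("B", area_pct=2.0, pg_over=0.0, cg_over=-0.5),
        Region("C", area_pct=1.0, pg_over=0.0, cg_over=0.2),
    ]
    print(classify(demo, area_th=5.0))
```

Returning three separate lists mirrors the PG set, the CG set, and the always on domain that the HDL generation step consumes.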
Figure 3 provides an overview of the modified design flow. As can be noticed, the MDC tool and its power management extension are directly interfaced with the logic synthesizer. Algorithm 1 is implemented within the Power Analysis block. The MDC baseline tool provides the HDL description of the plain CGR system and all the scripts to perform the synthesis of the CGR design and all the different hardware simulations (one for each input DPN), as required by the proposed power estimation models. The power reports are then fed back to the MDC power management extension and parsed within the Power Analysis block to execute Algorithm 1. The LRs classification (see LR class in Figure 3), generated by the Power Analysis block, is used by the CG/PG HDL Generation block to automatically define the hybrid clock and power gating management support for the given CGR design.
Summarizing, this flow, with respect to what is discussed in Section 3.1, does not require designers to opt for a specific power management technique.On the basis of the proposed power estimation models and by linking MDC with a logic synthesis tool, the presented flow is capable of overcoming the limit of providing a one-fit-to-all solution.Each LR, in a CGR design, is supported (where necessary) with the optimal power management technique.

Step-by-Step Example.
In this section, a step-by-step example of the application of Algorithm 1 is presented, considering the same example proposed in Figure 2. In that case, MDC LRs identification led to determining five LRs, and the user-specified power management technique is blindly applied to all of them except LR 2. This region is used by all the input DPNs; thus, it is never disabled and does not require any power management support. In the following step-by-step example, shown in Figure 4, the threshold on the area ($area_{th}$) is set to 5%.
LR 1. (a) Its area is calculated: $area_{LR_1}$ = 52% of total area. (b) $area_{LR_1} > area_{th}$, so that a prospective power gating implementation on LR 1 is taken into consideration by invoking evaluate PG(LR 1).
(1) The static and the dynamic overheads are estimated by applying, respectively, (1) and (3). The power gating overhead on the overall consumption is calculated as the difference between the power consumption of the LR when PG is applied and its power consumption in the baseline design; the result is then divided by the total power consumption of the baseline design, in order to estimate the total percentage power variation. The power variation when PG is applied to region LR 1 is −86.45%. Since this value is negative, power gating may be convenient if its total saving is larger than the clock gating one. (2) Equation (4) is calculated and summed to (5) to determine the clock gating overhead on the overall consumption, which is −2.15%. (3) Power gating is more beneficial than clock gating, determining, overall, a larger power saving. Thus, LR 1 is added to PG set.
LR 3. (a) Its area is calculated: $area_{LR_3}$ = 0.4% of total area. (b) $area_{LR_3} < area_{th}$, so that a prospective clock gating implementation on LR 3 is considered straight away by invoking evaluate CG(LR 3).
(1) Equations (4) and (5) are evaluated to determine the clock gating overhead on the overall consumption: CG over = +0.01%. (2) Clock gating is not beneficial, since its overhead is positive. Thus, LR 3 is discarded and no power management policy will be applied to it.
LR 4. (a) Its area is calculated: $area_{LR_4}$ = 7% of total area. (b) $area_{LR_4} > area_{th}$, so that a prospective power gating implementation on LR 4 is taken into consideration by invoking evaluate PG(LR 4). (1) The static and the dynamic overheads are estimated by applying, respectively, (1) and (3). The power gating overhead on the overall consumption is −1.2%. Since this value is negative, power gating may be convenient if its total saving is larger than the clock gating one. (2) Equation (4) is calculated and summed to (5) to determine the clock gating overhead on the overall consumption, which is −2.0%. (3) Clock gating is more beneficial than power gating, determining, overall, a larger power saving. Thus, LR 4 is added to CG set.
LR 5. (a) Its area is calculated: $area_{LR_5}$ = 15% of total area. (b) $area_{LR_5} > area_{th}$, so that a prospective power gating implementation on LR 5 is taken into consideration by invoking evaluate PG(LR 5).
(1) The static and the dynamic overheads are estimated by applying, respectively, (1) and (3). The power gating overhead on the overall consumption is −0.89%. Since this value is negative, power gating may be convenient if its total saving is larger than the clock gating one. (2) Equations (4) and (5) are evaluated to determine the clock gating overhead on the overall consumption, which is −1.04%. (3) Clock gating is more beneficial than power gating, determining, overall, a larger power saving. Thus, LR 5 is added to CG set.
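The percentage overheads quoted in the steps above follow from a single normalization, which can be sketched as below. The function name and the sample power values are hypothetical; the sign convention (negative means saving) is the one used throughout the example.

```python
def overhead_pct(p_lr_gated, p_lr_baseline, p_total_baseline):
    """Power variation of one LR, as a percentage of the total baseline power.

    A negative result means that the gated implementation of the region
    consumes less than its baseline counterpart, i.e. a saving.
    """
    return 100.0 * (p_lr_gated - p_lr_baseline) / p_total_baseline


if __name__ == "__main__":
    # Hypothetical values in mW: the region drops from 2.0 to 1.5 in a
    # design whose baseline total is 10.0.
    print(overhead_pct(1.5, 2.0, 10.0))
```

Normalizing by the total baseline power, rather than by the power of the region itself, makes the power gating and clock gating figures of different regions directly comparable.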
The resulting hardware design with the hybrid application of clock gating and power gating is shown in Figure 5. Comparing this design with the two reported in Figure 2 ... of the clock gating. All the remaining logic, which includes region LR 3, is always on.

Assessments
In order to assess the proposed power estimation flow and the effectiveness of the hybrid clock and power gating management, in this section we discuss two use cases, which are completely different in terms of both behaviour and resulting power consumption contributions. The first one deals with a simple FFT algorithm implemented on a 90 nm CMOS technology and has been mainly adopted to evaluate the proposed flow in detail. The second one presents a more complex scenario. An image coprocessing unit, accelerating a zoom application, has been implemented on both a 90 nm and a 45 nm CMOS technology, in order to assess the robustness of the proposed flow with different technology parameters.
In the following, Section 4.1 discusses the evaluation phase involving the FFT use case, while Section 4.2 describes the experiments conducted to validate the approach on the zoom application. Finally, Section 4.4 details the benefits of adopting the proposed models and their correlated flow.

Evaluation Phase.
This section discusses in detail the results obtained for the FFT use case, targeting a 90 nm CMOS technology.

Fast Fourier Transform Algorithm.
Fast Fourier Transform (FFT) is an optimised algorithm for the Discrete Fourier Transform (DFT) calculation. It is widely adopted in several applications, ranging from the solution of differential equations to digital signal processing. We refer to the original DFT equation:

$$X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}, \qquad k = 0, \ldots, N-1.$$

The FFT algorithm adopted for this use case was proposed by Cooley and Tukey [45]. It aims at speeding up the calculation of a DFT of a given size $N$ by considering smaller DFTs of size $r$, called radix. To obtain the whole original DFT, $s$ stages of size-$r$ DFTs are required, where $N = r^s$. Small DFTs have to be multiplied by the so-called twiddle factors, according to the decimation in time variant of the algorithm. When the radix is $r = 2$, the DFTs take the name of butterflies, after the shape of their block scheme. The equations describing a butterfly are

$$X_0 = x_0 + W_N^k x_1, \qquad X_1 = x_0 - W_N^k x_1,$$

where $X_0$ and $X_1$ are the outputs, while $x_0$ and $x_1$ are the corresponding inputs. $W_N^k$ are the twiddle factors, defined as

$$W_N^k = e^{-j 2\pi k / N},$$

where $k$ and $N$ are integers depending on the butterfly position in the FFT. The adopted use case involves a radix-2 FFT of size 8, as depicted in Figure 6, obtained by means of three stages involving four butterflies each, meaning 12 butterflies overall ($r = 2$, $s = 3$, $N = r^s = 2^3 = 8$). Stages have been pipelined to keep the system critical path short. The baseline 12-butterfly design then requires three clock periods to produce the outputs.
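As a software illustration of the structure just described, the following sketch implements a generic radix-2 decimation in time FFT and counts the butterflies it executes. It is a textbook formulation written for this note, not the hardware dataflow of the use case; the function names are assumptions.

```python
import cmath


def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the index i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def butterfly(x0, x1, w):
    """Radix-2 butterfly: two outputs from two inputs and one twiddle factor."""
    t = w * x1
    return x0 + t, x0 - t


def fft_radix2(x):
    """Return (spectrum, number of butterflies) for a power-of-two length."""
    n = len(x)
    stages = n.bit_length() - 1
    assert n == 1 << stages, "length must be a power of two"
    # Decimation in time: inputs are consumed in bit-reversed order.
    data = [x[bit_reverse(i, stages)] for i in range(n)]
    count = 0
    size = 2
    while size <= n:  # one iteration per stage
        half = size // 2
        for start in range(0, n, size):
            for k in range(half):
                w = cmath.exp(-2j * cmath.pi * k / size)  # twiddle factor
                a, b = start + k, start + k + half
                data[a], data[b] = butterfly(data[a], data[b], w)
                count += 1
        size *= 2
    return data, count


def dft(x):
    """Direct DFT, used only as a reference for checking the FFT."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]
```

For an input of length 8 the loop runs three stages of four butterflies each, so the returned count is 12, matching the size of the baseline design.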
From the baseline 12-butterfly design, several variants have been derived by decreasing the number of involved butterflies. In these variants, the available resources of the design must be multiplexed in time and reused. Therefore, the overall computation latency increases and the throughput becomes lower. In particular, four size-8 FFT configurations are considered: (i) 12b is the baseline 12-butterfly FFT design, taking 3 clock periods to complete the transform. (ii) 4b involves 4 butterflies for an overall execution latency of 6 cycles. (iii) 2b involves 2 butterflies for an overall execution latency of 12 cycles. (iv) 1b involves 1 single butterfly for an overall execution latency of 24 cycles.

FFT CGR System Implementation.
The abovementioned configurations have been modelled as dataflow networks and the corresponding CGR system has been assembled with MDC. The activation percentage, resource utilization, and power consumption of each FFT variant are shown in Table 5. In general, the higher the number of butterflies, the larger the corresponding area and dissipation. The main purpose of the resulting CGR system is to enable several trade-off levels between power dissipation and throughput, as illustrated in Figure 7. Such a system is capable of dynamically switching among the different configurations, adapting to the requests of the external environment. For instance, in a battery operated environment, when the battery level becomes lower than a given threshold, some throughput can be waived to consume less power.
MDC identifies 8 different LRs in the CGR system. The LRs activated by each FFT variant are listed in Table 5, while their characteristics are reported in Table 6. In this table, given any LR, its activation time ($T_{ON}$) has been obtained by summing up the activation times of the FFT configurations activating the same region (provided in Table 5). For example, LR 2 is activated by 1b, 2b, and 4b. Its $T_{ON}$ is 0.67, which is the sum of $T_{ON}$(1b) = 0.42, $T_{ON}$(2b) = 0.21, and $T_{ON}$(4b) = 0.04.
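The activation time of a region is therefore a plain sum over the configurations that use it. A minimal sketch follows, using only the values quoted above for LR 2; the function name and the dictionary layout are assumptions, and the mapping of the other regions (Table 5) is not reproduced.

```python
def region_activation(config_t_on, region_configs):
    """Sum, for each LR, the activation times of the configurations using it."""
    return {
        lr: sum(config_t_on[cfg] for cfg in cfgs)
        for lr, cfgs in region_configs.items()
    }


if __name__ == "__main__":
    # Activation times of the configurations that activate LR 2.
    t_on = {"1b": 0.42, "2b": 0.21, "4b": 0.04}
    result = region_activation(t_on, {"LR2": ["1b", "2b", "4b"]})
    print(round(result["LR2"], 2))
```

Since the values are floating point, comparisons against 0.67 should be made with a tolerance or after rounding.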

Power Modelling and Hybrid Power Management Assessment.
As explained in Section 3.2, the proposed flow requires a preliminary synthesis of the baseline (without any implemented power saving strategy) CGR system. From the synthesized design, it is possible to retrieve the area occupancy and logic composition (combinatorial and sequential contributions) of the 8 LRs, as depicted in Figure 8. The area is given as a percentage of the overall system area. The biggest region is LR 1; it occupies more than 60% of the whole system area and is the region that mainly impacts power consumption. Furthermore, it is almost entirely combinatorial (99.18%), so that power gating should be a very suitable strategy for this LR. From Figure 8 it is possible to notice that LR 2, LR 4, LR 5, LR 6, and LR 8 are extremely small. For all these regions, the proposed power modelling strategy can be extremely beneficial to investigate whether power saving techniques lead to an effective power saving.
In order to evaluate the proposed power model, Figures 9 and 10 compare the estimated and measured (retrieved from the postsynthesis reports) overhead due, respectively, to power gating and clock gating. In both cases, the reported power refers to both the static and internal contributions, as taken into consideration by the power model. The remaining term, the net one, is neglected. The error of neglecting the net contribution is discussed in Section 4.1.4. The proposed power models are generally able to accurately approximate the power saving strategy overhead. As expected, LR 1 is the region with the highest power saving, regardless of the considered strategy (please note that a negative overhead implies a saving in power). It is interesting to notice that LR 8, despite being one of the smallest regions, does not provide any saving if power gated, but it can achieve a small power reduction when clock gated.
The static and internal power estimations obtained by applying (1) and (3), for a prospective power gating implementation, and (4) and (5), considering a possible clock gating implementation, are shown in Table 7. Please notice that the clock gating static overhead is not appreciable, since one single clock gating cell is required per LR. Algorithm 1 (see Section 3.2.6) implies a preliminary area thresholding step. Two different thresholds have been considered for the algorithm evaluation: (i) DAT 5%: Threshold set to 5%. Regions with area above 5% are LR 1, LR 3, and LR 7, so that the power gating overhead estimation step is performed for each of them. All the considered regions lead to an overall saving (negative overhead) larger than that achievable with a prospective clock gating implementation; therefore, they are selected as eligible regions for power gating. Clock gating overhead estimation is performed on all the remaining subthreshold regions. The regions capable of providing saving, when clock gated, are LR 2 and LR 8, since LR 4, LR 5, and LR 6 are fully combinatorial. Thus, clock gating will be implemented only on LR 2 and LR 8. (ii) DAT 10%: Threshold set to 10%. Only LR 1 and LR 3 are above the area threshold and, as for DAT 5%, they both achieve power saving if implemented with power gating strategies. In this second case, the clock gating overhead estimation step is performed also on LR 7, which results in a negative overhead. Then, the regions to be clock gated are LR 2, LR 7, and LR 8, while LR 4, LR 5, and LR 6 are again discarded.
To assess the proposed flow, five designs have been assembled: (i) Base: the baseline CGR design without any power saving. (ii) PG full: the CGR design, where power gating is applied blindly to all the regions. (iii) CG full: the CGR design, where clock gating is applied blindly to all the regions. (iv) DAT 5%: derived with the proposed automated flow capable of hybrid power and clock gating support, setting the Area Threshold to 5% in Algorithm 1. (v) DAT 10%: derived with the proposed automated flow capable of hybrid power and clock gating support, setting the Area Threshold to 10% in Algorithm 1. These designs have been synthesized with Cadence RTL Compiler, targeting the same 90 nm CMOS technology adopted to synthesise and simulate the baseline CGR design, whose results have fed the Power Analysis block of the proposed enhanced power management flow to assemble DAT 5% and DAT 10%. The power consumption (internal, static, and total) of these designs is depicted in Figure 11. The reported data correspond to the real power consumed by the synthesised designs. Since LR 1 occupies more than 60% of the design area and is mainly combinatorial, only small differences between the entirely power gated design (PG full) and the hybrid clock and power gating ones (DAT 5% and DAT 10%) are visible. Nevertheless, DAT 5% achieves the largest power saving (−45.12%) among all the designs, validating the proposed hybrid and selective management with respect to a one-fit-to-all solution. CG full, capable of diminishing only the dynamic power consumption, is the worst design among those applying power management.
The area overhead of the implemented power management strategies, reported in the legend of Figure 11, proves that the proposed hybrid management leads to less area-hungry designs than the entirely power gated one. In fact, DAT 5% and DAT 10% present half of the area overhead of PG full. CG full data confirm that clock gating has very little impact on the baseline design, presenting a negligible area overhead (two orders of magnitude smaller than DAT 5% and DAT 10%).
For the sake of completeness, in Figure 12 the trade-off levels between power and latency (and, in turn, throughput) are illustrated for all the considered designs.The trade-off curves demonstrate that power management strategies are generally extremely beneficial within a CGR scenario.

Accuracy and Errors.
The accuracy of the proposed power modelling approach is assessed by Tables 8 and 9, respectively, considering the power gating overhead estimation step and the clock gating overhead estimation step of Algorithm 1.These tables, in each row, report the estimation errors with respect to the real consumption of the baseline CGR system, where the given power saving strategy (i.e., power gating in Table 8 and clock gating in Table 9) is applied only on the LR specified in the first column.
Looking at Table 8, power overhead estimations per LR prove to be very accurate, leading to errors that are always below 1.1%. The errors related to the estimation of the state retention and Power Controller overhead are also quite low (below 5% and 1%, respectively). The isolation cells overhead estimation is less precise, resulting in an error of 16.36% for PD3, due to the fact that the static and internal values of (ISO ON) and (ISO OFF) are characterized as average values, the same for each LR. Nevertheless, this error has no visible impact on the total estimation error, which is 1.07%.
Table 9 provides an overview of the estimation accuracy for the clock gating overhead. In this case, estimations are even more accurate. The error on the clock gating overhead is always below 0.3%. The error due to the clock gating cells overhead is very limited too, being always under 1.1%.
Tables 8 and 9 also report the errors caused by omitting the net power term (column net % (Err.)) in (3) and (5). This error is obtained by comparing the estimated overhead (which does not include the net contribution) with the real measured overhead including the net term, as extracted from the synthesis reports. The net error is higher in the power gating overhead estimation (13.6% for LR 7) than in the clock gating one (at most 0.47% for LR 8), since power gating requires more additional logic to be implemented.

Validation Phase.
In order to validate the proposed approach, a second use case has been assessed targeting the same 90 nm technology used for the FFT use case and a smaller 45 nm library. The reconfigurable computing core of an image coprocessing unit, accelerating a zoom application, has been assembled. Its characteristics are discussed in Section 4.2.1, while Sections 4.2.2 and 4.2.3 analyse the achieved results.
The zoom application has been profiled on a general purpose machine in order to identify the most computationally intensive portions of the code (computational kernels), with the intention of accelerating them on a CGR hardware accelerator. Seven computational kernels have been identified and modelled as dataflow networks. Table 10 summarizes the kernels composition (in terms of number of dataflow actors), activation time, and main functionality. Then, these dataflow kernels have been combined by MDC to obtain a multidataflow specification, constituting the computing core of the CGR accelerator in charge of accelerating the zoom application. Thirteen LRs are identified on the CGR zoom coprocessor. The main difference between this scenario and the FFT one is that in the zoom coprocessor it is not necessary to retain the status of any kernel when switching among them. This means that, when power gating is applied, no retention cells are needed in the identified regions.

Zoom Coprocessor Validation Results at 90 nm CMOS Technology.
This section discusses the results achieved in the zoom coprocessor scenario using the same 90 nm CMOS technology adopted for the assessment of the FFT CGR designs. From the implementation point of view, we discuss the same designs considered for the FFT use case, defined as follows: (i) Base: the baseline CGR design without any power saving. (ii) PG full: the CGR design, where all the 13 LRs are power gated. (iii) CG full: the CGR design, where all the 13 LRs are clock gated. (iv) DAT 5%: the CGR design, where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 5% in Algorithm 1. (v) DAT 10%: the CGR design, where hybrid power and clock gating support is implemented by means of the proposed flow, setting the Area Threshold to 10% in Algorithm 1. Please refer to Table 11 for the composition of DAT 5% and DAT 10%. Figure 13 depicts static, internal, and total power consumption for each considered design. In this case, the differences among CG full, PG full, DAT 5%, and DAT 10% are not so evident. The reason is that, in this scenario, the dynamic power consumption (due to the internal power) is considerably higher than the static one. As the reported histograms show, the difference is, on average, more than two orders of magnitude. Power gating and clock gating prove to be equally capable of cutting down the internal power consumption. LR 5 is the only region that Algorithm 1 completely excludes from any form of power management, both in the DAT 5% design and in the DAT 10% one. It is fully combinatorial; therefore, clock gating does not provide any positive effect on it. Moreover, it is so small (0.65% of the whole system area) that, if power gated, it cannot provide any substantial benefit. A closer observation of the histograms confirms what was already observed for the FFT: despite the similar trend for all the designs, which lead to more than 62% of power saving, DAT 5% consumes less than any other (62.61% of saving), while the CG full design is the least beneficial (62.29% of saving).
Focusing on the static histograms, as expected, the CG full design introduces a small overhead with respect to Base. This is due to the 12 clock gating cells (one for each region but LR 5) introduced in the always on domain of this design, which never contribute to saving any static power. When power gating is applied, there is always a benefit in terms of static power consumption: the DAT 5% saving is slightly higher than the PG full one, both being over 51%; DAT 10% is still beneficial, but its saving is limited to 15%. Please note that the difference between DAT 5% and DAT 10% (in terms of static consumption) demonstrates that, in the area thresholding step of the proposed Algorithm 1, it is better to opt for small area threshold values to achieve higher savings.
In terms of area occupancy, reported in the legend of Figure 13, the PG full design is the one with the largest overhead, +6.4%. DAT 5%, which is the most beneficial in terms of power, shows a slightly smaller overall overhead, +4.55% of the whole system area. CG full is again the least invasive one, leading to just +1.73% of area overhead. Summarizing, DAT 5% constitutes the optimal solution for the zoom coprocessor scenario, considering a 90 nm technology. DAT 10%, which is less beneficial than DAT 5% in saving static power consumption, is a better solution than a fully power gated design, presenting basically the same power saving (−62.38% for DAT 10% versus −62.29% for PG full) but a smaller area overhead (+3.19% for DAT 10% versus +6.4% for PG full).
Table 12 reports the estimation errors of the proposed automated hybrid power management design flow on the power saving percentages of the considered domains, for both the power gating overhead estimation and the clock gating overhead estimation. PG saving% errors are always below 0.3% and CG saving% ones do not exceed 1.5%. These data confirm the accuracy of the proposed models, as in the FFT use case. Table 12 also reports, for both power and clock gating implementations, the error of neglecting the net term in the dynamic power consumption. Again, as in the previously discussed scenario, the models are not affected by this simplification.

Zoom Coprocessor Validation Results at 45 nm CMOS Technology.
In order to provide a robust validation of the proposed approach, we decided to assess the same zoom coprocessor designs on a different technology, targeting a 45 nm CMOS library. The idea is to assess what changes when the ratio between the static and the dynamic consumption varies. Here follows the list of the implemented designs: (i) Base: the same as in the 90 nm synthesis trial. (ii) PG full: the same as in the 90 nm synthesis trial. (iii) CG full: the same as in the 90 nm synthesis trial. Targeting a smaller technology and having already established that the 10% area threshold leads to power results comparable to those of the fully power gated design, we have decided to add an additional design in this second trial: DAT 1%. Setting the area threshold to 1%, almost all the LRs are considered for a prospective power gating implementation. Please refer to Table 13 for the composition of DAT 1%, DAT 5%, and DAT 10%. The power consumption in terms of static, internal, and total contributions is illustrated in Figure 14. The dynamic power consumption is still higher than the static one, determining the overall trend of the total power. However, with the scaling of the channel length, the ratio between internal and static power has decreased, on average, from a factor of 100 to approximately 10. In this second trial, the influence of the static power consumption is partially reflected on the total one. Technology scaling and the different static versus dynamic power ratio are such that PG full is capable of providing better overall saving results than DAT 5% and DAT 10%. At 45 nm technology, designers are required to select a very low area threshold in Algorithm 1 to achieve really optimal results. DAT 1%, which basically excludes only 3 LRs from power gating, saves up to 61.84% of the total base power and represents the optimal design solution for the zoom coprocessor in this second synthesis trial. Please note also that, by lowering the area threshold, the area of the optimal design and that of the fully power gated one become very similar.
We can conclude that, as technology gets smaller, the area thresholding step of the proposed algorithm becomes less beneficial. Still, in the automated flow, its presence makes the overall process more robust: it avoids useless iterations on designs that are not convenient by construction, when the technology is not so constrained or the ratio between static and dynamic consumption is larger.
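The area thresholding step discussed above can be sketched as follows. This is only an illustrative sketch, not the MDC implementation: it assumes that the threshold is expressed as a fraction of the total area of the LRs, and the function name `select_pg_candidates` and the area values are hypothetical.

```python
def select_pg_candidates(lr_areas, area_threshold):
    """Return the LRs whose share of the total area exceeds the threshold.

    Assumption: the threshold is a fraction (e.g. 0.10 for 10%) of the
    summed area of all LRs. Only LRs above it are kept as candidates for
    the power gating evaluation.
    """
    total_area = sum(lr_areas.values())
    return [name for name, area in lr_areas.items()
            if area / total_area > area_threshold]


# Hypothetical LR areas in arbitrary units (not taken from the paper).
lr_areas = {"LR1": 400.0, "LR2": 300.0, "LR3": 200.0,
            "LR4": 60.0, "LR5": 35.0, "LR6": 5.0}

for threshold in (0.10, 0.05, 0.01):
    print(f"{threshold:.0%}:", select_pg_candidates(lr_areas, threshold))
```

With these made-up areas, lowering the threshold progressively admits more LRs to the power gating evaluation, which mirrors the behaviour described for DAT 10%, DAT 5%, and DAT 1%.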
The accuracy of the proposed models, targeting the 45 nm CMOS technology, is reported in Table 14, which contains both power gating and clock gating estimation errors. The models, even neglecting the net contribution in the discussed equations, are extremely accurate (the error never exceeds 3.70%).

Power Switch Overhead. The sleep transistors are inserted in the design during the place-and-route process and their overhead is strictly use case dependent, since it is related to the aspect ratio of the macro and to the style of power routing that is selected in the target design. Since the proposed power estimation model is based on the synthesis of the design, the contribution of these cells is not considered yet.
The insertion of header/footer switches (we used header ones in this work as reference) determines two kinds of power overhead: (1) a leakage-related overhead; (2) the dynamic power dissipated during sleep and wake-up transitions. Another overhead that has to be taken into consideration is the time necessary to wake up the power domain. For the proper operation of the power gating methodology, the gating logic has to be enabled/disabled according to a switch on/off protocol [11] that requires 4 clock cycles for each transition. Thus, when a kernel is off, at least 4 clock cycles are necessary before it is switched on and can receive new input data.
The leakage-related contribution is fixed by the technology library and it is always present regardless of the ON/OFF state of the power domain. It could be inserted in (1) as $P_{lkg}(SW) \times \#switches$, where $P_{lkg}(SW)$ is the static power consumption of the considered power switch, as reported in the technology library, and $\#switches$ is the number of power switches inserted in the power domain. In the library that we took as reference, a header switch has the same leakage power as a 64-bit wide register, and one column of $n$ switches can be used to switch down $n$ horizontal virtual power stripes of the power grid. If we assume to use only one vertical real power stripe to supply power to $n$ horizontal virtual power stripes of a power domain, only one column of $n$ power switches is inserted. In this case, gating the virtual power supply easily saves enough leakage power to counterbalance the leakage dissipation of the inserted switches.
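As a rough illustration of the leakage term described above (the static power of one switch multiplied by the number of inserted switches), the following sketch compares the switch overhead with the leakage of the gated region. All names and numeric values are hypothetical, and the sketch idealizes the gated region as saving its whole leakage when switched off.

```python
def switch_leakage_overhead(p_lkg_switch_nw, n_switches):
    """Leakage added by the power switches: per-switch leakage times count."""
    return p_lkg_switch_nw * n_switches


def net_static_saving(p_lkg_region_nw, p_lkg_switch_nw, n_switches):
    """Leakage saved by gating a region, minus the always-on switch leakage.

    Assumption (idealization): the gated region leaks nothing while off.
    """
    return p_lkg_region_nw - switch_leakage_overhead(p_lkg_switch_nw, n_switches)


# Hypothetical values in nW: one column of 8 switches gating a region
# whose leakage is 2000 nW.
overhead = switch_leakage_overhead(50, 8)
saving = net_static_saving(2000, 50, 8)
print(overhead, saving)
```

A positive net saving indicates that the leakage removed by gating counterbalances the leakage of the inserted switches for that region.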
The dynamic power contribution is only relevant when the intervals between successive kernel switches are in the order of tens of cycles (Hu et al. [46]). When the computation of the kernels lasts tens of cycles, the wake-up time is also not relevant. Thus, in designs with low switching rates, these two overhead contributions could be neglected.
The FFT use case is a really simple design, used only for the development of the power estimation model and, as reported in Section 4.1, its kernels are far from lasting tens of cycles. The zoom application adopted for the validation phase of the proposed model is a real use case, but it is a small design, where the execution of the fastest kernels lasts 24 clock cycles. The wake-up time overhead is then, in the worst case, almost 17% of the total execution time (4 wake-up cycles over 24 execution cycles). If we consider a bigger and more complex real use case, such as interpolation filtering for motion compensation in High Efficiency Video Coding [47], we would achieve the condition for neglecting the dynamic power consumption of the sleep transistors and the wake-up time overhead. This application involves two-dimensional filters working on subblocks of pictures belonging to the same video sequence. The smallest block, corresponding to the fastest execution time, has 8 × 8 pixels. In this case, 160 overall cycles are required to filter the whole block and the wake-up time overhead is just 2.5% of the total execution time.
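The wake-up overhead figures quoted above follow from dividing the 4 cycles of the switch on/off protocol by the kernel execution length. A minimal sketch of this arithmetic is given below; the function name `wake_up_overhead` is illustrative.

```python
WAKE_UP_CYCLES = 4  # length of the switch on/off protocol stated in the text


def wake_up_overhead(kernel_cycles, wake_up_cycles=WAKE_UP_CYCLES):
    """Wake-up time as a fraction of the kernel execution time."""
    return wake_up_cycles / kernel_cycles


# Fastest zoom kernel (24 cycles) and smallest HEVC block (160 cycles).
print(f"zoom: {wake_up_overhead(24):.1%}")
print(f"HEVC 8x8 block: {wake_up_overhead(160):.1%}")
```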
4.4. Advantages of the Proposed Approach. Considering a CGR system implementing $N$ different functionalities and partitioned into $M$ different LRs, the proposed selection algorithm, based on the power models embodied in (1), (3), (4), and (5), requires the synthesis of the baseline CGR design (without any power saving strategy applied) and $N$ hardware simulations, each one running a different functionality (i.e., executing the different DPNs provided as input to the MDC tool). The hardware simulations are needed since the real switching activity is essential for a correct dynamic power estimation. Table 15, targeting the FFT scenario and the power gating implementation, reports the estimated and real power overhead when estimations are performed adopting the default synthesis reports (without taking into account the real switching activity). The estimation errors are extremely high when the switching activity is neglected; in that case, the proposed models are not capable of properly determining which LRs would actually benefit from power gating.
In order to understand the advantages of the proposed approach, let us compute the effort needed to determine the optimal saving strategy for each region if our flow is not adopted. It is required to (1) synthesize the baseline design without any power management support; (2) synthesize one power gated design and one clock gated design for each LR; (3) perform $N$ different hardware simulations for the baseline design, to retrieve the real switching activity of the system; (4) perform $N$ different hardware simulations for each power gated and clock gated design, to retrieve their real switching activity; (5) compare each power gated design and clock gated design, in the different operating conditions, with the synthesized baseline CGR design.
Our flow requires only points (1) and (3). In numbers, this corresponds to one single synthesis and $N$ hardware simulations. On the contrary, without our approach, $2M + 1$ syntheses ($M$ for power gating evaluation, $M$ for clock gating evaluation, plus the baseline one) and $N \times (2M + 1)$ hardware simulations are necessary. The only simplification that may be done, even without adopting the proposed approach, is when a given region is fully combinatorial: this saves the effort related to its prospective clock gating evaluation. Dealing with the presented use cases, for the FFT (Section 4.1) there are $N = 4$ different functionalities and $M = 8$ LRs, 3 of which are fully combinatorial. The proposed approach required one synthesis and 4 hardware simulations, rather than 14 syntheses (8 power gated LRs, 5 clock gated LRs, and the baseline design) and 56 hardware simulations (4 for each synthesized design). Considering the zoom coprocessor (Section 4.2), $N = 7$ and $M = 13$, with only 1 fully combinatorial LR. The proposed approach required one synthesis and 7 hardware simulations, rather than 26 syntheses (13 power gated, 12 clock gated, and the baseline design) and 182 hardware simulations.
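The effort comparison above can be reproduced with a few lines of code. The sketch below is illustrative only: the function and parameter names are hypothetical, and fully combinatorial LRs are assumed to need no clock gated synthesis, as stated in the text.

```python
def exhaustive_effort(n_functionalities, n_lrs, n_combinatorial_lrs=0):
    """Syntheses and simulations needed without the proposed flow.

    One baseline synthesis, one power gated design per LR, and one clock
    gated design per LR that is not fully combinatorial. Every synthesized
    design is simulated once per functionality.
    """
    syntheses = 1 + n_lrs + (n_lrs - n_combinatorial_lrs)
    simulations = n_functionalities * syntheses
    return syntheses, simulations


def proposed_effort(n_functionalities):
    """Proposed flow: baseline synthesis plus one simulation per functionality."""
    return 1, n_functionalities


print("FFT :", proposed_effort(4), "vs", exhaustive_effort(4, 8, 3))
print("zoom:", proposed_effort(7), "vs", exhaustive_effort(7, 13, 1))
```

For the two use cases, the counts match those reported in the text: 14 syntheses and 56 simulations for the FFT, 26 syntheses and 182 simulations for the zoom coprocessor.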

Conclusions
In this paper, we addressed the problem of power management in coarse-grained reconfigurable (CGR) systems. Such systems are well suited to accelerating multifunctional applications but are, potentially, energy inefficient. In fact, on a CGR substrate, while a particular task is executed, the resources not involved in the computation may waste precious power if not properly managed. On top of that, these architectures are also characterized by an intrinsic design difficulty: mapping several applications on the same substrate, customizing the datapath, is not straightforward and requires a deep knowledge of the target applications. Dataflow models of computation turned out to be very efficient for automating the mapping problem. In our studies, we have exploited a dataflow-based approach to define a complete design suite for multifunctional CGR systems: the Multi-Dataflow Composer (MDC) tool. Besides automatically managing dataflow-to-hardware system composition, MDC also supports the automated characterization of power and clock gated platforms.
The proposed work takes some steps forward both in the MPEG-RVC field, which the MDC tool belongs to, and in the definition of optimal power management strategies for CGR designs. In this paper, we have presented an automated methodology capable of estimating, prior to any physical implementation, the effectiveness and costs that power gating or clock gating would have when implemented on top of the functional logic regions constituting a CGR system. This methodology is based on static and dynamic power estimation models that, separately for each logic region in the CGR design, are capable of determining the overhead of clock gating and power gating on the basis of the functional, technological, and architectural parameters of the baseline CGR system. These models and the corresponding estimation algorithm are applicable in any CGR scenario and are currently integrated in the MDC tool, improving its basic functionality. In fact, MDC previously applied a one-size-fits-all user-specified power reduction technique, either clock or power gating, without any guarantee of its effectiveness on the different identified regions.
By considering two different scenarios and adopting different ASIC technologies, our assessments proved that the enhanced MDC flow is capable of guiding the designers towards the definition of the optimal power management support. It is more efficient than the previous, blindly applied, methodology, and the proposed models turned out to be extremely accurate. Finally, as demonstrated in Section 4.4, the new flow drastically reduces the number of designs to be synthesized and simulated, saving both designer effort and computational time.

Figure 2: Step-by-step example of the MDC baseline and dynamic power management features. Saving strategies are blindly applied by the dynamic power manager on each identified LR.

Figure 3: Enhanced MDC design suite: integration of the automated hybrid, clock, and power gated support.

Figure 5: Enhanced MDC design suite: hardware platform with hybrid application of clock gating and power gating methodologies.

Figure 6: FFT use case: original design with 12 radix-2 butterflies for an FFT of size 8. Twiddle factors are calculated according to (11).

Figure 7: FFT use case: latency versus power consumption trade-off for the 4 different 8-size FFT configurations.

Figure 8: FFT use case: area percentage per LR.

Figure 9: FFT use case: comparison between the estimated and real power variation due to the power gating integration.

Figure 10: FFT use case: comparison between the estimated and real power variation due to the clock gating integration.

Figure 11: FFT use case: comparison between the base design and the four gated designs. The legend shows, in brackets, the power management area overhead for each design with respect to the Base one.

Figure 12: FFT use case: latency versus power consumption trade-off for the 4 different 8-size FFT configurations, when gated designs are adopted.

Figure 13: Zoom coprocessor at 90 nm CMOS technology: comparison between the base design and the four gated designs. The legend shows, in brackets, the power management area overhead for each design with respect to the Base one.

Figure 14: Zoom coprocessor at 45 nm CMOS technology: comparison between the base design and the five gated designs. The legend shows, in brackets, the power management area overhead for each design with respect to the Base one.

Table 1: Parameter classification. For each considered parameter, the table reports typology (architectural, functional, or technological), description, and extraction method.

Table 2: Contributions of static and internal power consumption extracted from the reference technology library or characterized by synthesis trials.

foreach LR_i in set LRs do evaluate_area(LR_i, area_th) end
function evaluate_area(LR_i, area_th): calculate LR_i area; if area_LR > area_th then evaluate_PG(LR_i);

Table 3: Parameters and power consumption of each LR, extracted from the synthesis reports of the baseline CGR platform.

Table 4: Resulting power consumption of the different LRs when the proposed models are applied.
, we can notice as now the power gating is applied only to region LR 1 (called PD1 in the figure), while the clock gating is applied to regions LR 4 and LR 5 (called CD4 and CD5 in the figure); the SBoxes SB 0 and SB 1 included in region LR 4 are purely combinatorial, so they are not affected by the application

Table 5: FFT use case: features of the different configurations integrated on the CGR design. Data refer to a 90 nm CMOS target technology.

Table 6: FFT use case: logic regions architectural and functional characteristics.
also suggests that LR 4 , LR 5 , and LR 6 cannot benefit from clock gating, since they are fully combinatorial.

Table 7: FFT use case: detailed static and dynamic saving due to power gating (PG) and clock gating (CG).

Table 8: FFT use case: power gating overhead estimation step accuracy.

Table 9: FFT use case: clock gating overhead estimation step accuracy.

Table 10: Zoom coprocessor use case: computational kernels distinctive features.

Table 12: Zoom coprocessor at 90 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy.

Table 14: Zoom coprocessor at 45 nm CMOS technology: power gating overhead estimation step and clock gating overhead estimation step accuracy.

Table 15: FFT use case at 90 nm CMOS technology: power gating overhead estimation step accuracy, using reports generated without the real switching activity.