Asymmetric Aging Effect on Modern Microprocessors

Reliability is a crucial requirement in any modern microprocessor to assure correct execution over its lifetime. As mission-critical components become common in commodity systems, e.g., the control of autonomous cars, the demand for reliable processing has heightened even further. The latest process technologies have worsened the situation; thus, microprocessor design has become highly susceptible to reliability concerns. This paper examines the asymmetric aging phenomenon, which is a major reliability concern in advanced process nodes. In this phenomenon, logical elements and memory cells suffer from unequal timing degradation over time and consequently introduce reliability concerns. So far, most studies have approached asymmetric aging from a circuit or physical design viewpoint, but these solutions were quite limited and suboptimal. In this paper we introduce an asymmetric aging-aware microarchitecture that aims to reduce its impact. The study focuses mainly on the following subsystems: execution units, register files and the memory hierarchy. Our experiments indicate that the proposed solutions incur minimal overhead while significantly mitigating the asymmetric aging stress.


I. INTRODUCTION
Chip reliability, a crucial requirement for modern microprocessors, is essential to assure correct execution over a system's lifetime. New mission-critical, computation-intensive applications (e.g., autonomous vehicles, data centers, cloud computing, and life-support systems) impose strict reliability requirements throughout a product's lifetime and across its operating conditions. For example, the automotive industry expects an integrated circuit (IC) to function reliably for 10-15 years at a given temperature (usually about 125°C [1,2]) and under various workloads. In data centers, the requirements are slightly relaxed but remain challenging: the required lifetime is at least ten years, under conditions where the temperature can range from 105 to 110°C with arbitrary workloads. None of these reliability-sensitive applications can afford microprocessor faults caused by reliability issues. While reliability requirements have been substantially heightened, advanced FinFET VLSI technologies (28nm and lower) have become highly susceptible to reliability issues and, in particular, to transistor aging.
This paper was submitted for review on August 4, 2020. F. Gabbay is with the Electrical Engineering Department, Ruppin Academic Center, Israel (e-mail: freddyg@ruppin.ac.il).
Transistor aging is the deterioration process of transistors due to charge carriers trapped at the dielectric insulator of a transistor gate. This phenomenon is induced by Hot Carrier Injection (HCI) and Bias Temperature Instability (BTI) effects, which are further discussed in Section 2. The common approach to handling such degradation in digital circuits is to add extra timing margins to the clock-cycle time to account for the timing degradation (as a result, microprocessors incur performance degradation). One may consider this an adequate solution as long as the aging is symmetric, i.e., all transistors age at the same rate. Unfortunately, many digital circuits may incur asymmetric transistor aging. Consequently, different paths in a circuit may incur uneven degradation that may result in critical timing violations. Asymmetric aging is mainly induced by the BTI effect ([3-5]) as a result of applying a constant voltage to transistor gates for long periods. The time required for a transistor to incur such degradation may range from tens of seconds to several weeks ([4]). Accordingly, even if the asymmetric delay shift is relatively small, critical timing constraints in the logical circuit may not be met. This observation makes logical elements that are held at constant logical values for long periods susceptible to severe reliability issues. The problem is even more severe for logical structures such as register files and memory arrays, where a single element failure may cause the whole module to fail and, as a result, compromise data integrity. Past studies indicated that, from the architectural point of view, asymmetric aging is commonly induced by dynamic power-saving techniques, e.g., clock gating, which enforce a static state on logical circuits that consequently incur BTI stress ([6]).
Many of the prior studies, described in Section 2.3, approached asymmetric aging from the physical design point of view. Such approaches were not straightforward since they involved highly complex simulations and analysis methods. Even today, many of the common electronic design automation (EDA) tools cannot yet handle such tasks and, in particular, are very limited in their ability to analyze very large-scale circuits that may employ billions of transistors.
In this paper, our main focus is minimizing microprocessor susceptibility to asymmetric aging induced by static BTI stress applied for long periods. We present three novel mechanisms to mitigate asymmetric aging under constant stress conditions. The first mechanism avoids asymmetric aging in execution units that are under BTI stress through periodic injection of pseudorandom data at low rates. The second mechanism avoids asymmetric aging in architectural and control registers by employing periodic shifts and remapping of register identifiers. The third mechanism handles asymmetric aging in memory systems using a new approach that combines swap-shift of cache sets with pseudorandom data generation.
The remainder of this paper is organized as follows: Section 2 introduces asymmetric aging reliability challenges and reviews previous works. Section 3 describes our asymmetric aging observations and Section 4 presents our proposed microarchitecture enhancements for modern microprocessors accompanied by experimental results. Finally, Section 5 summarizes the study and suggests directions for future research work.

II. ASYMMETRIC AGING
The susceptibility of modern process technologies to reliability-related issues has grown dramatically. Starting at the 28nm process technology and below (16, 7, 5, and 3nm), design efforts dedicated to reliability have substantially increased. The design community has mainly tried to enhance the synthesis and place-and-route flows to minimize and eliminate reliability-related issues. Such flows involve substantial design efforts and, in many cases, require multiple iterations to make the IC comply with the design rules (also known as the "sign-off process").
We now describe the asymmetric aging effect and thereafter provide an overview of previous studies.

A. Asymmetric Transistor Aging
Transistor aging is the deterioration process of transistors residing in logical gates and memory elements ([8,9]), due to charge carriers from the transistor inversion channel being trapped at the dielectric insulator of a transistor gate. There are two physical mechanisms that cause charge carriers to be trapped: 1) Hot Carrier Injection (HCI), which involves charge carriers that flow from the transistor source to the drain; the charge carriers may get trapped in the gate oxide due to excessive energy levels. 2) Bias Temperature Instability (BTI), where charge carriers are also caught in the dielectric gate insulator, but this time no current flow is required between the source and drain of the transistor; rather, they may get caught whenever voltage is applied to the transistor gate. When the gate voltage is removed after a short period of time (<10s), the damage is partially reversible, and some of the charge carriers are detached.
Both BTI and HCI increase the transistor threshold voltage, reduce the mobility of charge carriers in the channel, and mandate a higher voltage to switch on the transistor. In addition, they slow down the transistor due to the degradation in the transistor current. As a result, ICs may experience major frequency degradation over their lifetime. Various methods for dealing with aging effects have been offered, all of which are physical design or circuit-based solutions ([9-11]). The most common approach is to provide extra margins for the clock-cycle time to compensate for the lifetime performance degradation.
Recent studies have discovered that the degradation due to aging may not be uniformly distributed. This may happen in the following scenarios: 1) Inside a logical cell, when p-devices and n-devices age unequally and, as a result, rising and falling transient edges experience different delay shifts. 2) Between different paths in a logical circuit [7,12] that incur uneven aging degradation, which may result in critical timing constraint violations. When such violations involve setup timing constraints, they can be mitigated by reducing the clock frequency; however, when hold constraints are violated, the circuit will incur severe reliability issues that cannot be mitigated. This phenomenon, referred to as "asymmetric aging", has become a major reliability concern in mission-critical systems.
Asymmetric aging is induced by static stress applied to logic gates or memories for long periods, which may range from tens of seconds to several weeks ([7]). BTI has been found to be the main contributor to this phenomenon, and it may affect both p-type (known as NBTI) and n-type (PBTI) transistors. NBTI exhibits a higher impact than PBTI by several orders of magnitude, though in advanced process technologies PBTI was shown to also have a considerable impact. As a result of the BTI effect, logical paths that are under different static stress will age asymmetrically and introduce new timing violations that cannot be identified by conventional timing verification methods.
Asymmetric aging is highly complex to model, analyze, predict, and avoid in very large-scale ICs and, therefore, it has become a major reliability issue. In addition, timing analyses that take the asymmetric aging effect into account are nontrivial, as they depend not only on the mode of operation (static vs. dynamic stress) but also on the operating conditions and technology specifications. Conventional timing verification tools ([13]) lack any information related to the lifetime activation modes of the digital circuit, e.g., standby modes, constant values and clock gates that are activated for long periods. The complexity of handling asymmetric aging also exists in reliability qualification tests such as High-Temperature Operating Life (HTOL) ([14]), which are commonly used by the industry to test the lifetime reliability of ICs. HTOL tests accelerate the aging degradation by running circuits at high temperature and under high voltage; however, they assume symmetric aging and use designated circuitry to assure that all logical paths keep toggling throughout the test period in order to avoid BTI stress. Such limitations, which are major concerns in assuring IC reliability, motivate us to explore new approaches to mitigate the asymmetric aging effect.
From an architectural point of view, asymmetric aging is in many cases a result of dynamic power-saving techniques that enforce a static state on logical circuits, so that they incur a BTI effect. Fig. 1 demonstrates this effect in five typical scenarios common to modern microprocessors and SoCs. Fig. 1 (a) depicts a bit cell circuit, which is the basic circuit of SRAMs. A bit cell consists of two cross-coupled inverters and pass transistors that provide access to the bit cell. No matter what data is stored in the bit cell, there is always one inverter with a logical 1 at its input while the other inverter has a 0. Whenever a 0 is applied to a logical gate, the p-type transistor suffers from NBTI stress and, as a result, the inverter ages asymmetrically with respect to the other inverter. Fig. 1 (b) illustrates a similar phenomenon in an SR latch. SR latches are the building blocks of master-slave D flip-flop registers, which are broadly used in digital circuits. Whenever the latch maintains the same storage value (SR=00), it behaves like the bit-cell cross-coupled inverters and consequently incurs asymmetric aging. Another example, a general synchronous digital circuit, is shown in Fig. 1 (c), where the launch clock path is controlled by a clock gate while the capture clock path is free running. Such a circuit implementation is common in microprocessors to save dynamic power, particularly when the launch path changes infrequently. If the clock gate disables the launch clock for a long time, static stress is applied to all the clock buffers in the launch clock path. Hence, they experience a higher delay shift relative to the capture path, resulting in an unbalanced clock tree between the launch and capture clock endpoints. Such a design incurs a degradation in setup slack that may lead to a setup violation due to the late arrival of the launch clock edge.
If the clock gate is instead placed in the capture path, the circuit may suffer from a hold violation. The logical data path between the flops, illustrated in Fig. 1 (c) by a chain of inverters, may suffer from timing violations as well. Whenever the launch flop output value is kept static for a long time, the static stress propagates through all the inverters in the chain. In particular, all inverters with a 0 on their gate will stress their p-type transistors and, as a result, will introduce a higher rise delay shift with respect to the other inverters.
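To make the clock-path part of the Fig. 1 (c) discussion concrete, the following minimal Python sketch evaluates the standard single-cycle setup and hold slack expressions for one launch/capture flop pair, before and after an asymmetric delay shift. All delay values (in picoseconds) and all function names are hypothetical illustrations and are not measurements from this study.

```python
def setup_slack(period, launch_clk, capture_clk, clk_to_q, data_max, t_setup):
    """Setup slack: latest allowed arrival at the capture flop minus actual arrival."""
    arrival = launch_clk + clk_to_q + data_max
    required = period + capture_clk - t_setup
    return required - arrival


def hold_slack(launch_clk, capture_clk, clk_to_q, data_min, t_hold):
    """Hold slack: earliest arrival minus the time the old data must stay stable."""
    arrival = launch_clk + clk_to_q + data_min
    required = capture_clk + t_hold
    return arrival - required


# Hypothetical fresh-silicon delays (ps): balanced launch and capture clock paths.
fresh_setup = setup_slack(1000, 200, 200, 50, 850, 40)
fresh_hold = hold_slack(200, 200, 50, 30, 60)

# Gated launch clock path aged by +80 ps: the launch edge arrives late.
aged_setup = setup_slack(1000, 280, 200, 50, 850, 40)

# Gated capture clock path aged by +30 ps: the capture edge arrives late.
aged_hold = hold_slack(200, 230, 50, 30, 60)

# Lowering the frequency (longer period) recovers setup slack only;
# the hold expression has no period term at all.
relaxed_setup = setup_slack(1100, 280, 200, 50, 850, 40)

print(fresh_setup, aged_setup, relaxed_setup)  # 60 -20 80
print(fresh_hold, aged_hold)                   # 20 -10
```

The sketch reflects why a setup violation can be mitigated by reducing the clock frequency while a hold violation cannot: the clock period appears only in the setup expression.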
The example illustrated by Fig. 1 (d) is a delay chain circuit that is commonly used in I/O interfaces such as DDR DRAM to adjust timing delays. Such a circuit consists of a controllable chain of delay gates where the programmable delay is determined either by production testing or at boot time based on a training sequence. The unused delay gates are excluded from the delay path and remain static to save dynamic power. As a result, these delay gates will incur asymmetric aging and, if they are used in the future, they will exhibit severe reliability issues due to the delay shift between the rise and fall times. In certain cases, the signal will not even be able to propagate through the chain when the delay shift becomes high.
The last example depicted by Fig. 1 (e) shows an idle data path of an execution unit (e.g., an FP adder or multiplier) that is under static stress. The inputs to the block are stored by clockgated registers and when the unit is idle, the clock is disabled.

B. Prior Studies
Many of the prior works approached asymmetric aging from the physical design point of view. Such an approach is not straightforward, since the process of simulating, analyzing and fixing asymmetric aging issues in large-scale circuits is highly complex ([15]), and only a few EDA tools (e.g., BERT, RelXpert) currently offer such a capability, and only to a limited extent ([16,17]). The strategies that were suggested attempted to cope with the problem from various directions. One approach was to enhance the process node to reduce the impact of the BTI effect. This, however, became highly challenging due to the down-scaling of the gate oxide dimensions. Traditional approaches relied on taking margins in timing closure for both setup and hold that would take the asymmetric aging effect into account. This was found to necessitate a highly complex analysis and, in many cases, ended up in overdesign. Other studies attempted to model and predict the degradation resulting from NBTI ([3-5, 7, 12, 13, 18-20]) and suggested various solutions such as transistor sizing, VDD tuning, duty cycle reduction and decreasing the transistor channel length. Agrawal et al. in [21] presented a mechanism for circuit failure prediction by collecting data from special sensors placed in different locations on the silicon die. Their results indicate that by using these sensors, they can reduce the conservative margins used by traditional design flows and improve chip performance. Further studies ([22-25]) also introduced methods for analyzing digital circuits and identifying the critical gates that are most susceptible to NBTI stress. This was done by employing an aging model (consisting of BTI-aware libraries) and an aging-aware timing analysis.
Several other studies ([26-29]) suggested applying power gating and drowsy memory to mitigate BTI stress in memories. Power gating employs transistor switches that power down the entire SRAM and, as a result, eliminates the BTI stress throughout the shutdown period. Power gating, however, has several limitations. First, when applied, the entire content of the SRAM is lost and cannot be recovered when the SRAM is switched back on. In addition, power gating introduces a performance overhead when the memory is switched back on, and, therefore, it is inefficient when the memories become idle for a long time. Drowsy memory is another technique for power saving in SRAMs. When the memory enters a low power mode ("drowsy" mode), the memory voltage is reduced to the minimum retention voltage that allows bit cells to continue maintaining their values safely without being accessible. When the memory is accessed, the voltage is switched back to the nominal voltage. Throughout the drowsy state, the BTI stress on bit cells is reduced since the VDD reduction also reduces the gate voltage.
Other studies examined asymmetric aging on different logical circuits that can suffer from significant NBTI degradation. Velamala et al. ([7,18]) introduced the NBTI effect on DDR delay chains and SRAM internal circuits. Another SRAM asymmetric aging study by [30] suggested an on-chip reliability monitor to measure SRAM BTI impact. Their study found that SRAMs can incur significant internal clock duty-cycle shifts in read operations as a result of asymmetric aging. Yan et al. ([31]) studied NBTI impact on master-slave D flip-flops under different duty-cycle assumptions.
While many studies attempted to cope with asymmetric aging from the physical design point of view, only a limited number of works examined this phenomenon from an architectural point of view. Firouzi et al. suggested NOP instruction insertion to reduce the impact of NBTI on the execution stage of MIPS processors ([32]). They found that such NBTI degradation is dominated by the values of source operands rather than opcodes and suggested software and hardware approaches to relax NBTI stress by using different variations of NOP instructions. This method was found to provide limited improvement in processors that employ a negligible number of NOP instructions ([33]). In addition, it may increase GPR utilization and, as a result, reduce the number of registers available to the application. Abbas et al. suggested running anti-aging programs instead of idle tasks when the processor is not utilized ([33]). The anti-aging programs generate specific value patterns to repair the BTI asymmetric aging effect in the execution unit combinatorial circuits. This technique was efficient; however, it required complex analysis of the critical paths and the requisite anti-aging values. Moreover, it was limited to handling the execution stage combinatorial circuits only and assumed a scalar processor. Note that in the case of out-of-order processors, such techniques may become limited in mapping the anti-aging patterns to the multiple execution units. Chen et al. examined the performance degradation due to asymmetric aging in multicore systems where processors may age asymmetrically due to different workloads and utilizations ([34]). They suggested reserving certain cores at early stages of the system lifetime to be used for executing critical missions at late stages.
Field Programmable Gate Array (FPGA) devices may also be highly susceptible to NBTI. Unused FPGA logic can suffer from long constant logical stress and, when such logic is used again, it may incur asymmetric aging degradation. A technique to reduce the impact of asymmetric aging on FPGAs was introduced by [35], who suggested bundling unused FPGA elements in logical chains and toggling them at low rates to prevent constant NBTI stress.
Other studies proposed solutions for asymmetric aging in the processor's memory system. As previously discussed, SRAM memories are highly susceptible to asymmetric aging due to the basic structure of bit cells, and when a bit cell stores static values for a long period, e.g., a startup value (SUV), secret keys etc., severe reliability concerns induced by asymmetric aging may arise ([36]). Various techniques aimed at mitigating SRAM asymmetric aging degradation by balancing the signal probability of the 0 and 1 states. Kumar et al. suggested a periodic bit-flipping process for the first level cache where every cache line is periodically read, inverted and written back to the cache ([37]). When managed by software, this mechanism introduced significant performance overhead due to the disruption of real cache accesses by the processor. A hardware-based approach that maintains the bit flipping locally in the cache SRAM was reported to reduce the memory access overhead. Gebregiorgis et al. proposed a self-controlled bit-flipping (SCF) method that performs the bit flipping upon every I-cache and D-cache line write access and avoids processor interruptions ([38]). This approach was found to be limited when the cache miss rate was low or when the rate of writes was small. Duan et al. introduced a cell flipping technique with distributed refresh phases (CFDR) to reduce the NBTI effect in the I-cache ([39]). This technique flips and refreshes I-cache blocks at a certain refresh rate by scanning cache blocks from lower to higher index. Since the refresh rate is very low, the processor disruption time becomes negligible. Such a method was reported to introduce a 125% lifetime improvement.
Several past studies attempted to handle cache asymmetric aging using a different approach. Calimera et al. ([40]) suggested probing and scrambling functions for cache reindexing in fine-grain and coarse-grain modes of operation. Whenever the indexing function was changed, the cache content had to be flushed. Their study was examined under two different cache power saving modes: power gating and drowsy power. The results showed that their method, when used with power saving techniques, can extend cache lifetime up to six-fold. Nevertheless, it may incur a timing overhead on the cache access timing path.

III. ASYMMETRIC AGING ENHANCEMENTS IN MODERN MICROPROCESSORS
In this section we present new microarchitecture enhancements for modern microprocessors to cope with the asymmetric aging problem. We start this section by presenting our simulation environment and then we present our experimental observation of asymmetric aging in existing microprocessor architectures. Finally, we suggest microarchitecture enhancements to mitigate asymmetric aging and present our experimental observations of them.

A. Experimental Environment
For this study, we used the Sniper x86-64 microarchitecture simulator [41]. We modified the simulation platform and added the mechanisms needed to model the behavior and measure the characteristics required for our experiments. The simulation environment included both a detailed cycle-level x86 core model and a memory system. TABLE 1 summarizes the configuration of the simulation environment (based on the Intel Gainestown core [42]). As workloads, we used the Spec2017 benchmarks [43,44] with ref inputs and EEMBC CoreMark (two different runs). Every Spec2017 benchmark was run as a single-core workload in two different regions of interest: the initialization phase and the main execution phase (denoted "Init" and "Main," respectively). Each experiment used 10 billion instructions (for both the initialization and main execution phases). The EEMBC CoreMark benchmark was run from beginning to completion.

B. Asymmetric Aging Experimental Observations
Our experimental observations focus on three domains in modern microprocessors: CPU architectural registers, execution units, and the memory hierarchy and page tables. Our main focus is identifying elements that are under static stress for long durations. We start our examination by inspecting all architectural registers: general-purpose registers, floating point (FP) registers, vector registers and various control registers. Our experimental results are presented in TABLE 2, which summarizes all register groups with static BTI stress, i.e., registers that are written only once during the simulation or not written at all. As can be observed, control registers, e.g., CRs and MSRs, incur major BTI stress as they are kept constant through very long execution periods. In addition, memory protection, segment and debug registers incur similar BTI stress. Surprisingly, we also observe that even computation-oriented registers such as FP/vector registers, as well as temporary registers, suffer from constant stress. These observations are highly concerning as they indicate that critical functions, e.g., page table pointers, cache configuration registers, memory protection mechanisms and pure computational values stored in such registers, may experience severe asymmetric aging.

Notes to TABLE 2: registers whose values change very seldom; registers that are static in integer benchmarks.
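The static-stress criterion used in these observations (an element written only once during the simulation, or not written at all) can be expressed compactly. The Python sketch below is only an illustration of that criterion; the function name, the trace format and the example trace are hypothetical and do not reflect the actual instrumentation added to the simulator.

```python
from collections import Counter


def static_entries(num_entries, write_trace):
    """Return the indices of entries written at most once in the trace.

    write_trace is a sequence of entry indices, one per write event.
    """
    writes = Counter(write_trace)
    # Counter returns 0 for entries that never appear in the trace.
    return [entry for entry in range(num_entries) if writes[entry] <= 1]


# Hypothetical write trace for a structure with 8 entries.
trace = [0, 2, 2, 3, 2, 0, 5]
print(static_entries(8, trace))  # [1, 3, 4, 5, 6, 7]
```

The same counting applies equally to registers, cache lines and TLB entries, since only the number of writes per element matters.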
Our next experiments examine execution units that are under static BTI stress. TABLE 3, which presents our basic observations, shows that the FP adder/subtractor and multiplier/divider execution units may incur very long periods of static stress, as they are not utilized during execution in the CoreMark benchmarks and in some of the Spec2017 benchmarks that do not use any FP operation (e.g., 602.gcc, 605.mcf, 631.deepsjeng and others). These observations are also highly troubling as they may augur major reliability issues. Our final observations pertain to the memory hierarchy and the page translation tables. The results are presented in Fig. 2, which illustrates the number of entries with constant stress throughout the entire simulation experiment (either written once or not written at all). Our observations indicate that the L1-I cache is much more susceptible to asymmetric aging, mainly due to a small miss ratio that lets lines remain static in the cache. In the L2 cache, we also observe a significant number of lines with static stress, especially in the CoreMark benchmarks, which have a smaller footprint than the Spec2017 benchmarks (which also introduce a small number of entries with constant stress). The L3 cache is the module with the largest number of cache lines under constant stress; this is due either to cache entries with low temporal locality that wait a very long time until they are evicted, or to unutilized cache lines. Page translation tables also exhibit a significant number of entries with constant stress. It can be observed that the ITLB is very dominant for Spec2017; this is due to the high locality of code footprints that consist of many pages that fit into the main memory and, as a result, maintain a significant number of page entries with constant values. In the CoreMark benchmark we also observe that both the DTLB and the STLB maintain a significant number of entries with constant stress.
These results raise further concern for the page translation tables residing in main memory that are kept at constant values for long periods. The existence of memory elements under such major constant BTI stress is very disturbing, as bit cells are highly susceptible to BTI stress and any single failure may induce a major reliability issue for the entire system. We realize that the likelihood of memory BTI stress in systems with frequent context switches is very low. Nevertheless, machines that run constant jobs, e.g., microcontrollers in embedded systems or database servers, are likely to experience the reported observations.
1 www.opencores.org
When examining the root cause of these observations from an architectural point of view, we identify three main causes. First, backward compatible features that are not in use by new applications, e.g., x87 control registers, may lead to underutilization and BTI stress. Second, new forward compatible features, e.g., x86 temporary registers, when not efficiently utilized due to legacy considerations, may also generate asymmetric aging concerns. Third, microprocessors are by nature general purpose machines and, consequently, it is highly challenging to inhibit static stress on all their functional units, e.g., FP execution units. The latter observation is also supported by [45], who estimates that 21-50% of microprocessor transistors are underutilized.

IV. ASYMMETRIC AGING-AWARE MICROARCHITECTURE ENHANCEMENTS
This section introduces microarchitectural solutions to reduce the asymmetric aging effect on different microprocessor architectural subsystems: execution units, register files, and cache memories.

A. Execution Units
Based on our experimental observations summarized in TABLE 3, FP execution units may be under static stress for long periods. One may argue that in general purpose environments the use of FP resources may be distributed more evenly, yet in many other environments, such as embedded systems, autonomous cars or even in-memory data servers that run the same workload for very long times, the impact of asymmetric use of the FP execution units may be extremely severe. In this section, we propose a novel scheme to mitigate BTI stress on execution units. Note that although the technique is suggested for FP execution units, it is applicable to any other data path computational module that may be susceptible to asymmetric aging. Our scheme utilizes a pseudorandom bit sequence (PRBS) generator that is driven by a low-frequency clock. The PRBS generator produces pseudorandom patterns that are fed into the execution units to prevent extended periods of constant stress. The clock frequency of such a PRBS generator can be on the order of MHz or even lower, to minimize dynamic power overhead. A variety of PRBS generators are used in communication and security applications. As part of our study, we examined a simple PRBS circuit ([46]), which introduces very small logic and power overhead while being able to generate random patterns that are sufficient to toggle the execution unit at a low rate.
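As an illustration of the injection principle, the following Python sketch models a small linear-feedback shift register (LFSR) whose output replaces the operand of an idle execution unit. The specific polynomial (the common PRBS-7 polynomial x^7 + x^6 + 1), the seed, the `divider` parameter that stands in for the slow PRBS clock, and all function names are assumptions made for this example; the PRBS circuit of [46] may differ.

```python
def prbs7_next(state):
    """Advance a 7-bit Fibonacci LFSR for x^7 + x^6 + 1 by one step."""
    new_bit = ((state >> 6) ^ (state >> 5)) & 1
    return ((state << 1) | new_bit) & 0x7F


def prbs7_period(seed=0x02):
    """Count the steps until the LFSR state returns to the seed."""
    state = prbs7_next(seed)
    steps = 1
    while state != seed:
        state = prbs7_next(state)
        steps += 1
    return steps


def operand_stream(busy, operands, divider=4, seed=0x02):
    """Select the operand seen by the unit in each cycle.

    busy[i] is True when the unit executes a real operation in cycle i.
    When the unit is idle, the current PRBS pattern is injected instead.
    The PRBS state advances once every `divider` core cycles.
    """
    state = seed
    selected = []
    for cycle, (is_busy, operand) in enumerate(zip(busy, operands)):
        if cycle % divider == 0:
            state = prbs7_next(state)
        selected.append(operand if is_busy else state)
    return selected


print(prbs7_period())  # 127
print(operand_stream([True, False, False, True], [10, 0, 0, 20], divider=1))
# [10, 8, 16, 20]
```

A larger `divider` corresponds to a slower PRBS clock, trading toggle rate for lower dynamic power.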
We examined the effectiveness of the proposed technique on a double precision FP adder design from OpenCores (footnote 1). The block was connected to a PRBS generator as illustrated in Fig. 3 and synthesized, and the signal probability (the fraction of time a gate-level signal node is active) was measured during a simulation of one million clock cycles. The histogram presented in Fig. 4 depicts the signal probability distribution across all the signals in the design. It can be clearly observed that the majority of the signals have a signal probability of approximately 50% due to the values injected by the PRBS circuit. The histogram shows that only a small group of signals could not be toggled effectively and remained static through most of the simulation. We identified that this group is related to the FP addition result shift that involves zero-padding. This can be easily fixed by forcing the PRBS patterns to be injected into this group of signals, which thereby avoids the constant state. As part of our examination, we also summarized (TABLE 4) the overall overhead on 28nm process technology using high VT (HVT) standard cells, which is negligible in terms of power and area.
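The signal probability metric can be sketched as follows. This Python example assumes that a signal trace is available as one integer sample per clock cycle, with each bit representing one signal node; the function name and both traces are hypothetical and serve only to contrast a static node with a toggling one.

```python
def signal_probability(trace, width):
    """Return, per bit (LSB first), the fraction of samples in which it is 1."""
    if not trace:
        raise ValueError("trace must be non-empty")
    return [
        sum((sample >> bit) & 1 for sample in trace) / len(trace)
        for bit in range(width)
    ]


# A 4-bit bus held at a constant value: every node sits at 0% or 100%.
static_trace = [0b0101] * 16

# A 4-bit bus that cycles through all values: every node sits at 50%.
toggling_trace = list(range(16))

print(signal_probability(static_trace, 4))    # [1.0, 0.0, 1.0, 0.0]
print(signal_probability(toggling_trace, 4))  # [0.5, 0.5, 0.5, 0.5]
```

Nodes whose probability is near 0 or 1 correspond to the static group in the histogram, while nodes near 0.5 experience balanced stress.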

B. Architectural Registers
The experimental observations presented in TABLE 2 clearly indicate that there is a significant number of registers that incur BTI stress. Typically, out-of-order microprocessor architectural registers are renamed as physical ones and are hosted by a cyclic buffer (as part of the RoB). This implementation mitigates BTI stress for the physical registers since architectural register mapping changes their physical locations rapidly. Nonetheless, architectural registers, and in particular static control and configuration registers, such as CRs and MSRs, still suffer from asymmetric aging. The proposed architectural solution, illustrated in Fig. 5, avoids BTI hotspots by periodically changing the mapping of registers to their corresponding architectural hosting locations. The scheme is based on modulo rotation of the mapping between the architectural or control register identifiers and their physical locations. As illustrated in Fig. 3, a pulse trigger is asserted to shift the register mapping in the register file (RF) either periodically at low frequency (or each time we change CR3) or as part of the return-from-interrupt procedure before saving the values of the user-level process. A modulo-counter (RF rotator) serves to map the architectural or control register number to the mapped register location by modulo addition. After each assertion of the rotation trigger (at any arbitrary time point), the counter is incremented, and the register values are shifted between registers, as illustrated in Fig. 5. When examining the proposed scheme in our simulation environment, we observe that it is able to prevent the static stress that was reported in TABLE 2. The rotation trigger in our simulation was asserted every 10 million clock cycles. We examined different rotation trigger rates and found that this value does not impact performance. 
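The modulo rotation described above can be sketched in a few lines of Python. This is a behavioral model under stated assumptions, not the hardware implementation: the class and method names are hypothetical, and each trigger is assumed to move every stored value by one location while the RF rotator counter is incremented.

```python
class RotatingRegisterFile:
    """Behavioral model of a register file with a modulo rotation counter."""

    def __init__(self, num_regs):
        self.num_regs = num_regs
        self.cells = [0] * num_regs  # physical storage locations
        self.rotation = 0            # RF rotator (modulo counter)

    def location(self, arch_id):
        """Map an architectural register number to its current location."""
        return (arch_id + self.rotation) % self.num_regs

    def read(self, arch_id):
        return self.cells[self.location(arch_id)]

    def write(self, arch_id, value):
        self.cells[self.location(arch_id)] = value

    def rotate(self):
        """Rotation trigger: shift all values by one location, bump the counter."""
        self.cells = [self.cells[-1]] + self.cells[:-1]
        self.rotation = (self.rotation + 1) % self.num_regs


rf = RotatingRegisterFile(4)
for reg in range(4):
    rf.write(reg, 100 + reg)

before = rf.location(0)
rf.rotate()
after = rf.location(0)

# The architectural view is unchanged, but register 0 now lives elsewhere.
print([rf.read(reg) for reg in range(4)])  # [100, 101, 102, 103]
print(before, after)                       # 0 1
```

Because software only ever sees architectural register numbers, the rotation is transparent to the running program while a constant value no longer stresses the same storage cells indefinitely.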
Table 5 summarizes the power, timing path and gate count overhead for 28nm process technology (for a bulk of 32 registers) using HVT standard cells. As can be observed, the overall overhead is very small. The proposed solution may appear to resemble the register windows of the Sun SPARC and Berkeley RISC CPUs [47], which serve a different purpose. Register windows are a scheme that aims to evenly distribute sets of GPR registers between different sections of code, typically procedure calls. At every nested call, the register window is shifted to provide the program with a new working set of registers. In contrast, our proposed scheme extends to all architectural registers (FP, vector, control, etc.) and shifts one register at a time, unlike the register window technique, which shifts a bulk of registers and is limited to integer registers. It should also be noted that register windows involve more frequent window switches, resulting in excessive dynamic power, while the rotation frequency of our proposed scheme is very low.
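The modulo rotation of the register mapping can be sketched behaviorally as follows. This is only an illustrative model under stated assumptions: the class RotatingRegisterFile and its method names are hypothetical, and the single-step shift direction is chosen arbitrarily, since the text specifies only that the counter is incremented and the values are shifted on each trigger.

```python
class RotatingRegisterFile:
    """Behavioral model of a register file with a modulo rotator.

    Hypothetical illustration: names and shift direction are assumptions.
    """

    def __init__(self, num_regs=32):
        self.num_regs = num_regs
        self.offset = 0                 # the modulo counter (RF rotator)
        self.phys = [0] * num_regs      # physical storage locations

    def physical_index(self, arch_reg):
        # Map the architectural register number by modulo addition.
        return (arch_reg + self.offset) % self.num_regs

    def read(self, arch_reg):
        return self.phys[self.physical_index(arch_reg)]

    def write(self, arch_reg, value):
        self.phys[self.physical_index(arch_reg)] = value

    def rotate(self):
        """Rotation trigger: move every value one slot and bump the counter."""
        self.phys = [self.phys[-1]] + self.phys[:-1]
        self.offset = (self.offset + 1) % self.num_regs
```

After each call to rotate(), every architectural register still reads back the same value, but it resides in a different physical location, so a static value such as a configuration register no longer stresses the same cells indefinitely.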

C. Memory Hierarchy
Our observations, presented in Fig. 2, suggest that cache entries may incur BTI stress at different levels of the memory hierarchy. Again, one may argue that in general-purpose multiprocessing environments such phenomena are rare due to frequent context switching. Our experiments, however, indicate that this situation is quite common in system environments that run the same program for very long periods (e.g., embedded microcontrollers), since they cannot leverage context switching to purge constant values from the cache memories. Our proposed technique can mitigate major reliability concerns in such systems. We suggest combining a PRBS generator with the swap-shift set index remapping method introduced by Wang et al. [48]. The swap-shift method was proposed to handle a different reliability phenomenon, termed write endurance, in PCM-based nonvolatile memories (NVMs). Unlike NBTI, where the stress is induced by constant values maintained for long periods, PCM-based NVMs experience bit cell wear-out after an excessive number of writes, leaving the bit cell stuck in a low- or high-resistance state. The original swap-shift technique, illustrated in Fig. 6, periodically swaps a pair of cache sets whenever the number of writes reaches a certain threshold. It should be noted that the swapping only changes the set identifiers, and it invalidates the data residing in the swapped pair of lines. A swapped-set counter is maintained throughout the shift process to identify the last set that was shifted. This counter is incremented whenever a set is swapped. Once all sets have been swapped, the swapped-set counter wraps to 0, the cache index mapping is rotated by one, and the whole process restarts. The cumulative number of cache index rotations is maintained by the set-shift counter, which wraps to 0 once it reaches the number of sets.
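The interaction of the two counters can be sketched as follows. This is one plausible realization rather than the exact logic of [48]: it assumes that each step swaps two adjacent physical sets, so that a full pass of N-1 swaps rotates the whole index mapping by one position. The function names physical_set and advance are hypothetical.

```python
def physical_set(logical, shift, swapped, num_sets):
    """Map a logical set index to its physical set.

    shift   -- set-shift counter: completed rotations of the mapping
    swapped -- swapped-set counter: swaps done in the current pass
    Assumption: swaps proceed over adjacent physical pairs (0,1), (1,2), ...
    """
    base = (logical - shift) % num_sets   # position at the start of the pass
    if base == 0:
        return swapped                    # the set currently travelling upward
    if base <= swapped:
        return base - 1                   # sets it has already been swapped with
    return base                           # sets not yet reached in this pass


def advance(shift, swapped, num_sets):
    """Perform one swap step.

    Returns the updated (shift, swapped) counters and the pair of physical
    sets that were exchanged (whose contents must be invalidated).
    """
    pair = (swapped, swapped + 1)
    swapped += 1
    if swapped == num_sets - 1:           # all sets swapped: pass complete
        swapped = 0
        shift = (shift + 1) % num_sets    # mapping is now rotated by one
    return shift, swapped, pair
```

Because the mapping is derived from the two counters alone, no per-set remapping table is required in this sketch; only the swapped pair needs to be invalidated at each step.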
We propose employing a modified swap-shift scheme combined with a PRBS generator, as illustrated in Fig. 7. The swap-shift process is activated by a periodic shift trigger that is asserted at low frequency. The set-shift and swapped-set counters operate in a similar fashion to the original swap-shift method, except that the set swapping is triggered by the periodic shift trigger signal rather than by a write-count threshold. Upon any swap, a pseudorandom pattern generated by the PRBS module is written into the invalidated swapped cache sets. To minimize the impact on the cache access time, we retime the index remapping logic and place it in a pipeline stage prior to the cache access stage (address generation and memory order buffer). The combined swap-shift and PRBS generator serves to eliminate constant BTI stress induced by two scenarios: 1) Constant values residing in the cache for long periods: in this case, the set rotation spreads the BTI stress uniformly across all sets. 2) Unutilized cache lines that remain unchanged for long periods despite the set shift: in this scenario, the PRBS generator helps eliminate the BTI stress even if the physical set remains unutilized after the swapping. The proposed scheme offers several advantages with respect to the CFDR and scrambling-probing methods. First, it eliminates the flushing of the entire cache and hence minimizes the performance overhead. The proposed technique was examined in our experimental environment with various shift-trigger rates, and it was found that a shift trigger every 10M cache accesses yields minimal performance overhead (less than 0.01% impact on cycle count) while fully eliminating the BTI stress of constant values. In addition, our scheme requires simpler logic because it avoids the read-modify-write sequences required by the CFDR method. Such a sequence overloads the cache port with a significant timing overhead and may require sampling stages and an additional dedicated FSM to maintain such a flow. Last, we minimize the impact on the cache access time by retiming the index remapping to prior pipeline stages. Our logical scheme was synthesized, and Table 6 summarizes the power, timing and area overhead obtained by the synthesis for 28nm process technology. Since the propagation delay of the index remapping in our proposed scheme may be significant, we present our synthesis results using different threshold voltage standard cell options: HVT, standard VT (SVT), low VT (LVT) and ultra-low VT (uLVT). This allows us to find the optimal tradeoff for the propagation delay given system power constraints. It can be observed that the area overhead is very small. In addition, the index propagation delay can be reduced by more than 3x when using uLVT cells with respect to HVT cells. The power, however, increases by approximately 50-60x. SVT and LVT cells introduced propagation delay improvements of nearly 1.4x and 2x, respectively, with respect to HVT cells, but with a power overhead significantly smaller than that of the uLVT cells.
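A toy behavioral model of the access-count trigger and the PRBS fill is sketched below. It makes several simplifying assumptions that are not taken from the paper: each set holds a single data word with no tags or ways, the remapping is kept in an explicit table, adjacent physical sets are swapped, and the pseudorandom word comes from a 16-bit Galois LFSR with tap mask 0xB400. The class name SwapShiftPrbsCache and its parameters are hypothetical.

```python
class SwapShiftPrbsCache:
    """Toy model: periodic set swap plus PRBS fill (illustration only)."""

    def __init__(self, num_sets=4, trigger_period=3, seed=0xACE1):
        self.num_sets = num_sets
        self.trigger_period = trigger_period      # accesses between swaps
        self.logical_to_phys = list(range(num_sets))
        self.data = [0] * num_sets
        self.valid = [False] * num_sets
        self.accesses = 0
        self.swapped = 0                          # swapped-set counter
        self.lfsr = seed                          # must be non-zero

    def _prbs_word(self):
        # One step of a 16-bit Galois LFSR (assumed PRBS source).
        lsb = self.lfsr & 1
        self.lfsr >>= 1
        if lsb:
            self.lfsr ^= 0xB400
        return self.lfsr

    def _swap_step(self):
        lo, hi = self.swapped, self.swapped + 1
        a = self.logical_to_phys.index(lo)
        b = self.logical_to_phys.index(hi)
        self.logical_to_phys[a], self.logical_to_phys[b] = hi, lo
        for phys in (lo, hi):
            self.valid[phys] = False              # swapped data is invalidated
            self.data[phys] = self._prbs_word()   # overwrite the stale pattern
        self.swapped = (self.swapped + 1) % (self.num_sets - 1)

    def _count_access(self):
        self.accesses += 1
        if self.accesses % self.trigger_period == 0:
            self._swap_step()                     # periodic shift trigger

    def write(self, logical_set, value):
        phys = self.logical_to_phys[logical_set]
        self.data[phys] = value
        self.valid[phys] = True
        self._count_access()

    def read(self, logical_set):
        phys = self.logical_to_phys[logical_set]
        value = self.data[phys] if self.valid[phys] else None
        self._count_access()
        return value
```

In this model a read of a swapped set misses (returns None) until the set is refilled, which corresponds to the invalidation cost of the scheme; a very low trigger rate keeps such misses rare.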

V. CONCLUSIONS AND DISCUSSION
Microprocessor reliability is a crucial requirement that has been highly challenged by advanced process technologies and new computation-intensive applications such as autonomous vehicles, data centers, cloud computing and life-support systems. Recent advanced process nodes have become highly susceptible to asymmetric aging, which can cause critical timing violations in ICs and overall system failure. Asymmetric aging is primarily induced by the BTI effect when a constant voltage is applied to transistor gates for a long duration. We summarize the contributions of this paper as follows:
1) We examined the root cause of asymmetric aging in microprocessors and identified three new main sources of BTI stress. First, backward-compatible features may lead to underutilization and BTI stress. Second, new forward-compatible features, when not efficiently utilized due to legacy considerations, may also prompt asymmetric aging concerns. Third, microprocessors are by nature general-purpose machines and, as a result, it is highly challenging to inhibit static BTI stress on all their logical circuits.
2) We introduced a novel mechanism to avoid asymmetric aging in execution units that are under BTI stress through periodic injection of pseudorandom data at low rates.
3) We presented a mechanism for asymmetric aging avoidance in architectural and control registers by employing periodic shifting and remapping of register identifiers.
4) We offered a mechanism to handle asymmetric aging in memory systems using a new approach that combines swap-shift of cache sets and a pseudorandom data generator.
5) Our experimental results indicated that the proposed techniques were able to efficiently eliminate constant BTI stress with negligible performance impact. In addition, we performed a synthesis trial, examined the power, timing and area impact of the proposed mechanisms, and found that all methods introduce very small overhead.
Asymmetric transistor aging is becoming a highly important phenomenon in many fields such as embedded systems, autonomous cars, memory databases, and more. Many of these environments require system architects to guarantee the lifetime of products, which may be governed by their reliability. Meeting such demands requires further extensive studies by different disciplines: process technology, physical design, EDA tools and system microarchitecture. In the most advanced process technologies of 5nm and 3nm, reliability-related issues are expected to become even more complex, mainly because the HCI effect becomes more dominant there. Physical design flows should be developed to better analyze and fix asymmetric aging violations in large-scale circuits. This is a major challenge that requires both the industry and research communities to find practical solutions that allow the development of future reliable large-scale ICs. System architects are also encouraged to conduct asymmetric aging studies in order to mitigate this phenomenon by architectural means. Additional processing systems such as GPUs, FPGAs, networking systems and dedicated processing accelerators should be further examined to find innovative architectural solutions for asymmetric aging mitigation.
As part of his industrial role, he worked for National Semiconductor on the team that invented and developed the first PC-on-Chip. At Intel he worked for five years as a researcher in Intel Research Labs and for six years as a principal engineer in the mobile CPU architecture team, where he was chief architect of the first CMP (multicore) feature of Intel cores. For this task and leadership, he received the IAA (Intel Achievement Award).
Prof. Avi Mendelson is an IEEE Fellow, was a member of the Board of Governors of the IEEE Computer Society and served as a second VP of the IEEE Computer Society.