Advanced power analysis methodology targeted to the optimization of a digital pixel readout chip design and its critical serial powering system

A dedicated power analysis methodology, based on modern digital design tools and integrated with the VEPIX53 simulation framework developed within the RD53 collaboration, is being used to guide vital choices for the design and optimization of the next generation ATLAS and CMS pixel chips and their critical serial powering circuit (shunt-LDO). Power consumption is studied at different stages of the design flow under different operating conditions. Significant effort is put into extensive investigations of dynamic power variations in relation to the decoupling seen by the powering network. Shunt-LDO simulations are also reported to prove the reliability at the system level.


Introduction
The High Luminosity Large Hadron Collider (HL-LHC) pixel detectors for the ATLAS and CMS experiments at CERN require higher granularity and significantly higher performance of the pixel readout chips compared to the previous generation, i.e. smaller pixel size (e.g. 50 × 50 µm²), tolerance to unprecedented radiation levels (up to 500 Mrad or above), higher hit rate capabilities (3 GHz/cm²), extended trigger latency (12.5 µs) and higher readout data rates (100×) [1,2]. These specifications impose the use of scaled CMOS technology, featuring a low supply voltage, to guarantee both the target resolution and a sustainable power density while still implementing all the necessary functionality. The combination of these requirements makes it challenging to keep the material budget low enough to deliver the required physics performance. In particular, an independent passive power distribution system from low voltage power supplies located outside the experiment (as present in the current LHC pixel systems) is excluded, as it would imply huge power losses along thousands of cables. A serial powering scheme, based on a constant current supply distributed across the detector modules, is instead considered a very promising option which would allow the experiments to achieve low mass power cabling, a low number of powering chains and low losses on power cables. The ATLAS pixel community has already experimentally proven the feasibility of the scheme with previous generation pixel chips such as FE-I3 [3] and FE-I4 [4] (even though it was not installed in the experiment during the IBL upgrade due to the limited time available).
A new pixel chip in 65 nm CMOS technology is being designed within the RD53 collaboration [5] to meet design goals for advanced detectors at HL-LHC. Low power design techniques are adopted and an on-chip regulator supporting operation with serial powering is integrated to prove its feasibility and reliability. In this context, an upgraded version of the shunt-LDO circuit [6] is used. It is composed of a Low-DropOut (LDO) regulator generating the low supply voltage and a shunt regulator consuming the current not drawn by the load, as shown in figure 1. In particular, two shunt-LDOs are integrated to power the digital and the analog domains of the chip separately. Digital power variations are a major concern, as they could couple into the analog domain and could cause chip failure if the current drawn exceeds the current provided to the serial power chain.

Figure 1. Block diagram of a serial powered chip with integrated regulators for analog and digital domains (on the left); sketch showing the effect of power variations in the adopted powering scheme (on the right).

In this work, modern digital low power methodologies, mostly meant to achieve the optimal trade-off between performance and energy consumption (e.g. [7,8]), are adopted and innovatively targeted to the specific needs of serial powering for pixel detector modules. In this context, digital optimization is not only focused on the average power budget, but strongly involves minimization of peak power, taking into account the impact of local decoupling on power variations. The defined power methodology and power results are presented in section 2, whereas the optimization results on the RD53 pixel array logic are described in section 3. The results of detailed shunt-LDO simulations in a serial powering configuration combined with the digital power fluctuations are presented in section 4. Finally, conclusions and further developments are described in section 5.

Power analysis methodology and results
Digital design tools have been used to perform power analysis starting from the gate-level netlist and moving to detailed post-layout power estimation with parasitic annotation. In both cases, it is essential to obtain digital activity for different operating conditions in order to accurately estimate power consumption and its variations. The power analysis has been integrated with the RD53 SystemVerilog-UVM simulation framework VEPIX53 [9], capable of generating the proper stimuli and of simulating the Design Under Test (DUT) up to the detailed post-layout netlist. The resulting full activity (not only at the inputs, but also for each single module, cell, wire and port of the netlist) can be provided to the power analysis tools for accurate results.

Power estimation for architectural choices
Power analysis at gate-level, i.e. after synthesis to gates without layout parasitics information, is useful to drive substantial architectural choices before going into a complete detailed design. In the context of RD53, a key design choice is related to the use of the clock gating technique [8], since it is a source of power variations. It is commonly used to reduce dynamic power consumption, which plays a major role in the chip given the high activity conditions. However, its use was initially discouraged to keep power as constant as possible. A 64 × 4 pixel array, featuring similar characteristics to the foreseen RD53 pixel chip, has been synthesized by means of the Cadence RTL Compiler tool.
It should be underlined that clock gating was performed manually in the pixel array logic, as the automated insertion from the tools is implemented "wherever possible", which has been seen to be less efficient (clock gating cells also come at power and timing costs). Simulations of the obtained netlists (i.e. with and without the implementation of clock gating) have been run within VEPIX53 under 3 GHz/cm² hit rate, 1 MHz trigger rate and 12.5 µs trigger latency. A power profile showing power variations averaged over a 1 µs time scale is presented in figure 2. It has been obtained by means of a defined iterative algorithm which requests power reports in sequence over small time windows. As will be described in more detail in the following sections, such a time scale (1 µs) emulates the effect of the on-chip decoupling. A significant increase (∼ ×5) in power consumption is seen when excluding any form of clock gating in the architecture and cannot be tolerated according to RD53 specifications, where the goal for average digital power consumption is set to be lower than 5 µW/pixel, excluding the chip periphery [10]. Moreover, power variations observed at this initial stage of the design flow (excluding accurate parasitics information) are not particularly critical.
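The windowed averaging behind such a power profile can be sketched as follows. This is a minimal illustration, assuming the power has already been sampled on a uniform time grid; the function name `power_profile`, the sample values and the 1 ns sampling step are hypothetical and are not taken from the actual tool flow, which issues one power report per time window.

```python
from typing import List, Sequence, Tuple


def power_profile(samples_w: Sequence[float], dt_s: float,
                  window_s: float) -> List[Tuple[float, float]]:
    """Return (window start time, mean power) for consecutive windows."""
    n = max(1, round(window_s / dt_s))  # number of samples per window
    profile = []
    # Step window by window; a trailing partial window is dropped.
    for start in range(0, len(samples_w) - n + 1, n):
        chunk = samples_w[start:start + n]
        profile.append((start * dt_s, sum(chunk) / n))
    return profile


# Hypothetical trace (arbitrary units): 1 ns samples with a flat baseline
# and one 100 ns burst of higher activity.
trace = [2.0] * 3000
trace[1200:1300] = [6.0] * 100

profile = power_profile(trace, dt_s=1e-9, window_s=1e-6)
for t_start, mean_power in profile:
    print(f"{t_start * 1e6:.1f} us: {mean_power:.2f}")
```

In this toy trace the burst raises the instantaneous value by a factor of three, but it only shifts the 1 µs average of the window that contains it by a modest amount, which is the smoothing effect that the averaging time scale is meant to emulate.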

Post-layout power analysis
A more detailed power analysis is necessary to provide initial specifications to the powering system, since gate-level analysis has been shown to underestimate power by around 50% due to limited modeling of parasitics and of the clock tree. For this reason, the implementation flow with the Cadence Encounter/Innovus tool has been advanced to the post Place&Route (P&R) stage. Parasitics have been extracted (SPEF, Standard Parasitic Exchange Format, file) and the more detailed post P&R netlist has been simulated by means of the VEPIX53 framework to annotate activity (VCD, Value Change Dump, file).
At first, average power estimations have been obtained under multiple hit and trigger conditions, as described in the following, in order to assess the power impact of different factors and guide design choices. A summary, where results are given per pixel and also scaled to the full pixel matrix (with 400 × 400 pixels foreseen for both the ATLAS and CMS experiments), is reported in table 1. Presented results are based on the technology typical corner (1.2 V, 25 °C). The activity conditions included are: extreme hit and trigger rate as described in section 2.1, high hit rate and trigger absence (to decouple hit and trigger effects), and no hits and no triggers (i.e. just clocking the logic). It can be highlighted that the power consumed by the clock tree, including both global and local clock delivery, is dominant. This is mainly due to a combination of the high switching activity of the clock and the high total load of the clock buffers, with many registers as well as interconnects.

JINST 12 C02017

Power profiles at different averaging time scales (1 ns, 25 ns, 100 ns, 1 µs, 10 µs) are compared in figure 3: absolute peak value and percentage increase with respect to average power are highlighted. This study makes it possible to investigate the impact of the decoupling seen from the chip to the serial power network, which acts as a low-pass filter on current variations. It can be noticed that digital power variations are very high within the clock time period (25 ns), but they get much smoother after averaging over 1 µs. Even in the case of high hit and trigger rates, variations at this time scale are limited to within 20%. In the plot at the bottom of figure 3, variations are already filtered at short time scales, since digital activity is stable. These results can be used as an input to shunt-LDO simulations to verify its functionality and demonstrate the reliability of the serial powering scheme.
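The comparison of peak power across averaging time scales described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `peak_vs_timescale` and the synthetic trace (a 25 ns clock period with a 500 ns burst of higher activity, in arbitrary units) are hypothetical and do not reproduce the simulated chip activity.

```python
from typing import Dict, Sequence


def peak_vs_timescale(samples: Sequence[float], dt_ns: int,
                      windows_ns: Sequence[int]) -> Dict[int, Dict[str, float]]:
    """Peak of the window-averaged power and its increase over the global average."""
    average = sum(samples) / len(samples)
    result = {}
    for w in windows_ns:
        n = w // dt_ns  # samples per averaging window
        means = [sum(samples[i:i + n]) / n
                 for i in range(0, len(samples) - n + 1, n)]
        peak = max(means)
        result[w] = {"peak": peak,
                     "increase_pct": 100.0 * (peak - average) / average}
    return result


# Hypothetical 10 us trace sampled every 1 ns: each 25 ns clock period draws
# power only during its first 5 ns; a 500 ns burst doubles the activity.
period = [5.0] * 5 + [0.0] * 20
burst = [10.0] * 5 + [0.0] * 20
trace = period * 80 + burst * 20 + period * 300

table = peak_vs_timescale(trace, dt_ns=1, windows_ns=[1, 25, 100, 1000, 10000])
for window_ns, row in table.items():
    print(f"{window_ns:>6} ns  peak={row['peak']:.2f}  +{row['increase_pct']:.1f}%")
```

The peak of the averaged profile can only stay equal or decrease as the window grows, so the table gives a quick view of how much filtering a given time scale provides.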

Test case: pixel array logic optimization
The described methodology has been adopted to assess the power performance of the digital array logic throughout the design process. The target of the design optimization has been the trade-off among area, hit loss and power. The adopted pixel region architecture, i.e. digital logic shared between small groups of pixels, is based on a mapping in 65 nm CMOS of the FE-I4 digital architecture [11], featuring Pixel Regions (PR) made of 2×2 pixels. Its main characteristic is that Time over Threshold (ToT) information is stored locally in the pixels, while the memories for the timestamps and the triggering logic are shared in the PR.
Architectural changes have been evaluated: a summary is reported in table 2, where area utilization, average power and peak power per pixel, with averaging over a 1 µs time scale, are shown. In the initial implementation (case #1), ToT information was calculated with 4-bit counters, stored in flip-flops and read through asynchronous readout. Only 7 memories (for the timestamp and the 4-pixel ToTs) could be fitted while keeping the area utilization below ∼ 90%, which is required for the digital tools to close the design at the final stages. A significant area utilization decrease (∼ 10%) and a reduction in peak power per pixel have been obtained by storing the ToT data in latches (case #2). A different approach for ToT calculation without counters, i.e. local subtraction of the timestamp values at the trailing and leading edges of the incoming hit, has been evaluated in case #3 with the aim of limiting power variations. At the rates of interest for the application, an increase in average and peak power has actually been observed due to the additional logic. In case #4, slightly improved results have been achieved with fully synchronous memory readout, which is also preferable for timing constraint reasons. As can be seen in case #5, the area gain has allowed designers to fit in an additional memory, which significantly reduces the hit loss of the digital logic. The additional memory has implied a small power increase. Finally, in case #6:
• clock gating was implemented by means of special Integrated Clock Gating (ICG) cells, which profit from more efficient placement and timing. Such standard cells are composed of an AND or OR combinational gate (to stop the clock to the pixel logic when not operating) preceded by a level-sensitive latch (to prevent glitches on the resulting gated clock);
• improved results on area, average and peak power led to the choice of the case #6 architecture as the baseline;
• power analysis performed with power models including total ionizing dose effects (500 Mrad) has shown less than 5% power increase and no dominant impact of leakage power induced by radiation.
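The role of the level-sensitive latch in the clock gating cells described above can be illustrated with a small behavioral model. This is a sketch of the generic latch-plus-AND structure only, not of the actual library cell: the function names and the sampled waveform are hypothetical, and the latch is assumed to be transparent while the clock is low.

```python
from typing import Iterable, List, Tuple


def icg_and(waveform: Iterable[Tuple[int, int]]) -> List[int]:
    """Latch-based AND clock gate applied to a sequence of (clk, enable) samples."""
    latched = 0
    gated = []
    for clk, enable in waveform:
        if clk == 0:
            latched = enable         # latch is transparent while the clock is low
        gated.append(clk & latched)  # latch holds its value while the clock is high
    return gated


def plain_and(waveform: Iterable[Tuple[int, int]]) -> List[int]:
    """Naive gating without the latch, shown for comparison."""
    return [clk & enable for clk, enable in waveform]


# Hypothetical waveform: the enable rises and falls while the clock is high.
clk = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
enable = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0]
samples = list(zip(clk, enable))

naive = plain_and(samples)  # contains two shortened (glitch-like) pulses
gated = icg_and(samples)    # contains a single full-width clock pulse
```

With the naive AND, an enable change during the high phase of the clock produces truncated pulses, whereas the latched version only lets complete clock pulses through.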

Topology and setup
A serial powering topology is being considered for the pixel detector, where modules (containing multiple chips) are powered in series, whereas chips in each module are powered in parallel. This solution both profits from the serial powering advantages and allows chips in the same module (connected to the same sensor) to be kept at the same potential. The shunt-LDO regulator design gives the possibility to connect multiple shunt-LDOs on different chips in parallel and to provide regulated output voltages to the analog and digital power domains. It is foreseen that under normal operating conditions a 25% current headroom will be injected on a serial power chain. The headroom current will be consumed by the shunt part of the regulators, which, in combination with the local decoupling, makes it possible to absorb dynamic peaks from the expected chip activity. The shunt-LDO regulator can support operation up to 2 A. An example topology of two serially powered modules, each composed of four chips, was simulated based on the detailed shunt-LDO design, as shown in figure 4. The chip in red represents the one with simulated digital activity, the green chips are its neighboring chips within the same module and the light blue ones belong to the neighboring module in the serial power chain. Each chip was simulated as a pair of shunt-LDOs, for analog and digital, operated in parallel. In the case of the digitally active chip (red), the load was simulated as a current sink based on the VCD files previously described in section 2.2, after being scaled for a full-size chip (400 × 400 pixels). In the other cases, the load was simulated as a constant current sink of 800 mA (assuming 5 µW per pixel for a voltage of 1 V). As shown in figure 4, local decoupling capacitances (chip, power grid, input/output shunt-LDO capacitors with ESR), parasitic inductances (wire-bonds, cabling), resistances and capacitances (pads) were also included in the simulation.
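The 800 mA constant load follows directly from the figures quoted above, and the same arithmetic shows what the 25% headroom leaves for the shunt. The sketch below is illustrative only: the function names are hypothetical, it considers a single digital domain in isolation, and it assumes that the headroom is applied as a factor of 1.25 on the nominal load current.

```python
PIXELS = 400 * 400        # full-size pixel matrix quoted in the text
POWER_PER_PIXEL_W = 5e-6  # 5 µW per pixel
SUPPLY_V = 1.0            # voltage assumed in the text for the constant load
HEADROOM = 0.25           # 25% current headroom (assumed to scale the nominal load)


def load_current_a(pixels: int, power_per_pixel_w: float, supply_v: float) -> float:
    """Nominal load current from total power and supply voltage (I = P / V)."""
    return pixels * power_per_pixel_w / supply_v


def supplied_current_a(load_a: float, headroom: float) -> float:
    """Current injected for this load when a fractional headroom is added."""
    return load_a * (1.0 + headroom)


def shunt_current_a(supplied_a: float, load_a: float) -> float:
    """Current left for the shunt; a negative value means the load exceeds the supply."""
    return supplied_a - load_a


nominal = load_current_a(PIXELS, POWER_PER_PIXEL_W, SUPPLY_V)  # about 0.8 A
supplied = supplied_current_a(nominal, HEADROOM)               # about 1.0 A
margin = shunt_current_a(supplied, nominal)                    # about 0.2 A
```

Under these assumptions, a load peak can be absorbed by the shunt alone as long as it stays below the supplied current; shorter and larger peaks rely on the local decoupling.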

Simulation results
The impact of the digital activity of a chip on the regulated output voltages was studied. Maximum variations of 10% and 1% for the digital and analog domains, respectively, were considered to be acceptable without compromising functionality and performance. The digital activity of a chip was simulated for the extreme case with maximum peaks (1 ns resolution) in figure 3, in order to confirm the effect of the decoupling capacitance. As shown in the top plot of figure 5, in the digital domain a variation of less than 100 mV is noticed for the active chip itself and less than 10 mV for the voltages of the rest of the chips on the chain. The bottom plot shows the respective effect for the sensitive analog domain, where it can be seen that the digital activity of one chip causes a variation of less than 1 mV in the rest of the module, while the impact on the rest of the serial power chain is negligible. Overall, the performance of the shunt-LDO regulator with a digitally active load is demonstrated to be within acceptable limits. The presence of local decoupling is proven to filter short power fluctuations, which get averaged over the µs time scale. This effect assures stable operation of serially powered modules.
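The comparison between the observed variations and the acceptance limits can be written out explicitly. This is a minimal sketch under two assumptions that are not stated in the text: the percentage limits are taken relative to the nominal regulated voltage, and that voltage is set to 1.2 V (the typical corner quoted earlier). The helper name `within_limit` is hypothetical, and the quoted variations are used as upper bounds.

```python
def within_limit(variation_v: float, nominal_v: float, limit_fraction: float) -> bool:
    """True if the voltage variation stays within the given fraction of nominal."""
    return abs(variation_v) <= limit_fraction * nominal_v


NOMINAL_V = 1.2  # assumed regulated voltage; not specified for this simulation

# Upper bounds quoted in the text: 100 mV (digital, active chip), 1 mV (analog).
digital_ok = within_limit(0.100, NOMINAL_V, 0.10)
analog_ok = within_limit(0.001, NOMINAL_V, 0.01)
```

Both checks pass with these numbers; with a nominal voltage of 1 V the digital bound would sit exactly at the 10% limit.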

Conclusions
The presented power analysis methodology has been successfully used to guide architectural choices and to obtain accurate power profiling in realistic operation at different time scales. Moreover, it has been proven to be fundamental for the detailed optimization of the critical digital array logic of the RD53 chip. In addition, the obtained digital activity power profiles have been used as an input to a system simulation of the serial powering topology, including the detailed shunt-LDO design and parasitics. Promising results have shown that the impact of digital power fluctuations on the digital and analog voltage supplies of chips in the same or neighboring modules is acceptable.
Further developments will be focused on performing additional simulations under extreme conditions for power variations (e.g. background events, machine cycle, failure scenarios) and also using radiation models, for extensive verification. Moreover, further power optimization of the digital array logic will be investigated (e.g. multi-bit latches, power optimized latches, minimization of clock tree parasitics, optimization of clock distribution). Finally, it is foreseen to perform tests of prototype shunt-LDO chips to confirm the simulation results and to study serial power aspects, such as the dynamic behavior and failure scenarios, more extensively.