A UVM simulation environment for the study, optimization and verification of HL-LHC digital pixel readout chips

The operating conditions of the High Luminosity upgrade of the Large Hadron Collider are very demanding for the design of next generation hybrid pixel readout chips in terms of particle rate, radiation level and data bandwidth. To this end, the RD53 Collaboration has developed for the ATLAS and CMS experiments a dedicated simulation and verification environment using industry-consolidated tools and methodologies, such as SystemVerilog and the Universal Verification Methodology (UVM). This paper presents how the so-called VEPIX53 environment has first guided the design of digital architectures, optimized for processing and buffering very high particle rates, and then how it has been reused for the functional verification of the first large scale demonstrator chip designed by the collaboration, which has recently been submitted.


Introduction
The High Luminosity project for the Large Hadron Collider (HL-LHC), recently approved at CERN, aims at a maximum peak luminosity of 5 × 10³⁴ cm⁻²s⁻¹ and at accumulating 3000 fb⁻¹ of integrated luminosity over about a decade. This will impose extreme operating conditions on the upgraded, so-called Phase 2, pixel detectors in terms of particle rate, radiation levels and data bandwidth.
The RD53 collaboration, established in 2013 between the ATLAS and CMS experiments, has addressed the design challenges of next generation hybrid pixel readout chips by proposing the use of an advanced submicron technology, namely 65 nm CMOS. RD53 has focused on different aspects: radiation tolerance, analog front-ends, an Intellectual Property (IP) block library, I/O interfaces, global floorplan, integration and simulation [1]. A full scale chip called RD53A has recently been designed and submitted by the collaboration, with the goal of demonstrating in a large format integrated circuit the suitability of the proposed technology. A block diagram of the chip is shown in figure 1 and the main specifications are the following:
• 20 × 11.8 mm² chip size, with a pixel size of 2500 µm²;
• radiation tolerance up to 500 Mrad total ionizing dose;
• stable low threshold operation (600 e⁻ minimum);
• very high hit and trigger rate capabilities (up to 3 GHz/cm² and 1 MHz, respectively);
• hit loss below 1%;
• high speed readout (up to 5 Gbit/s);
• operation with serial powering.
The complexity of the system and the challenging specifications are such that a traditional design and verification approach, based only on directed tests and targeted architectural simulations, is no longer sufficient. Digital architecture optimization and functional verification are among the greatest challenges of the RD53A demonstrator design. To this end we developed a simulation and verification platform called VEPIX53 (Verification Environment for RD53 PIXel chips) [2] that addresses both issues and is the focus of this paper. The platform also supports the definition of ad-hoc directed tests, which remain useful for addressing specific features and blocks, as a trade-off between comprehensive verification and time constraints.
VEPIX53 is based on consolidated high level tools and methodologies from the industry domain: the hardware description and verification language SystemVerilog (SV), itself defined as an extension of Verilog [3], and the Universal Verification Methodology (UVM) library [4]. The former provides both constructs for describing the Design Under Test (DUT) at different levels of abstraction and advanced verification features; the latter provides a documented set of standard classes for the building blocks of the environment. VEPIX53 offers a set of dynamic components that are reused to support the design flow at its different steps, from initial architectural modeling to the verification of the final design, providing i) the generation of different types of input pixel hits, including Monte Carlo physics data, ii) the possibility of simulating DUTs at different levels of abstraction and iii) automated verification features, e.g. pixel chip output prediction, conformity checks, statistics and coverage collection, and configurable reporting.
This paper presents three ways we have profited from the environment for different purposes: evaluating the performance of digital architectures, addressing their optimization for the implementation on the RD53A chip and performing extensive functional verification. It should be mentioned that the environment was also used for performing power profiling in order to verify serial powering operation [6]. This paper is organized as follows: section 2 describes the VEPIX53 simulation environment and its most relevant verification components. Section 3 reports on the initial architecture exploration, which was carried out with VEPIX53 simulations at behavioral level and RTL and was finalized (section 4) with the optimization of different pixel chip architectures implemented in RD53A. Section 5 gives an overview of how VEPIX53 has been reused for the RD53A functional verification, including the verification plan, the testbench infrastructure and the test regressions. Finally, section 6 concludes the work and discusses further developments.

Description of the VEPIX53 framework
The VEPIX53 framework, represented in figure 2, consists of a top level testbench containing the DUT (wrapped in a top level harness module) and different UVM verification components (UVCs); these are instantiated and configured according to the test scenario, which is specified by the tests defined in the test library. The framework is based on UVM version 1.1, supported by default in the adopted simulation tool. UVCs communicate with each other using transaction class objects. Most of them are associated with pixel chip interfaces and can be configured as either active or passive, depending on whether the interface is an input or an output of the chip. An interface UVC generally contains sequencer and driver components in charge of generating the input stimuli to the DUT: this takes place in a layered fashion, through sequences of transactions, specified in the tests, that are translated into physical signals. It also contains monitor and subscriber components that work the other way around, i.e. they create transactions from the physical signals on the interface for checks, coverage collection and forwarding to other parts of the testbench. When an interface UVC is configured as active, all the mentioned components are instantiated; in a passive configuration, the stimulus generation components are omitted. The interface UVCs are shared and versioned in a generic (not RD53A-specific) repository, as they are reused for different purposes in multiple testbenches. Alongside the interface UVCs there are module UVCs, which are specific to a pixel chip and contain predictors and checkers associated with individual building blocks.
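The automated checking performed by the module UVCs follows the standard UVM monitor → reference model → scoreboard pattern. As a language-neutral illustration, the following Python sketch mirrors that flow at transaction level; the class and field names are illustrative and not taken from the actual SystemVerilog code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HitTransaction:
    """Transaction created by a monitor from observed interface activity."""
    timestamp: int
    address: int
    tot: int

class ReferenceModel:
    """Transaction-level golden model: predicts the DUT output stream."""
    def __init__(self):
        self.expected = []

    def write_hit(self, hit):
        # A real model would apply trigger matching and buffering rules;
        # this toy version simply forwards every monitored hit.
        self.expected.append(hit)

class Scoreboard:
    """Checks conformity between predicted and actual output transactions."""
    def __init__(self, model):
        self.model = model
        self.mismatches = 0

    def write_actual(self, actual):
        expected = self.model.expected.pop(0)
        if expected != actual:
            self.mismatches += 1
```

In the real environment the same comparison is performed on UVM sequence items delivered through analysis ports, with the reference model predicting the triggered pixel array output.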
The VEPIX53 UVCs are described in the following:
• the hit UVC, associated with the hit interface, has the main function of generating the charge signals associated with particles crossing the detector and injecting them into the pixel matrix. The input hits can either be generated in a constrained-random fashion, according to a set of pre-defined classes of clustered hits, or read from physics data in ROOT format produced by Monte Carlo pixel detector simulations. It is also possible to mix externally sourced hit data with internally generated hits;
• the trigger UVC, associated with the trigger interface, is in charge of generating the external trigger signal of the pixel array according to a configurable trigger rate and latency;
• the virtual sequencer controls the coordinated generation of hit and trigger transactions;
• the output UVC, associated with the pixel array output interface, produces data transactions by monitoring the data at the output of the pixel array;
• the pixel array analysis UVC is the module UVC associated with the whole pixel array. It contains a reference model that predicts the pixel array output from the monitored hit and trigger transactions (in practice, a transaction level description of the pixel array used as a golden reference for the DUT); a scoreboard that checks for conformity between predicted and actual output; and additional components, used for performance assessment, that monitor internal signals of the pixel array and keep track of its status. The main features of the pixel array analysis UVC are generic and can be used seamlessly up to full chip top-level post-layout simulations. Monitoring internal signals for detailed performance studies, instead, requires adapting the signal probing performed at top level, as the hierarchical paths are modified by synthesis;
• the command UVC and the Aurora UVC are additionally defined for the functional verification of the pixel chip, as explained in more detail in section 5: the command UVC is in charge of generating the input command stream of the chip according to a dedicated serial protocol, and the Aurora UVC monitors data transactions at the pixel chip output, encoded with the Xilinx Aurora protocol [7].
Regarding the Monte Carlo data, for the simulations described in this paper we have used CMS ROOT trees produced by a workflow based on the CMS data analysis framework (CMSSW). These data sets contain events related to layer 0 of the CMS pixel detector with different pixel sizes (50×50 or 25×100 µm², where the size is expressed as z×φ with reference to the cylindrical coordinate system of the pixel detector), a sensor thickness of 150 µm, a digitizer threshold of 1500 e⁻ and a pileup of 140. Subsets related to modules at the center and at the edges of the barrel, i.e. with particles hitting the sensor at different angles and hence producing different cluster sizes, have been extracted.
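Both hit sources described above feed the same interface; the internally generated component can be sketched as follows in Python. The rate arithmetic follows directly from the specifications (3 GHz/cm² over 50×50 µm² pixels at a 40 MHz bunch crossing clock), while the cluster shape classes, the mean cluster multiplicity and the sampling scheme are purely illustrative stand-ins for the configurable classes of the hit UVC:

```python
import math
import random

PIXEL_AREA_CM2 = 50e-4 * 50e-4   # one 50x50 um^2 pixel, in cm^2
HIT_RATE_HZ_CM2 = 3e9            # 3 GHz/cm^2 target hit rate
BX_PERIOD_S = 25e-9              # 40 MHz bunch crossing clock

def mean_hits_per_bx(n_pixels):
    """Mean number of pixel hits per bunch crossing for an n-pixel matrix."""
    return HIT_RATE_HZ_CM2 * PIXEL_AREA_CM2 * BX_PERIOD_S * n_pixels

def poisson(lam, rng):
    """Knuth's Poisson sampler (adequate for the small means used here)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

# Illustrative cluster shape classes: pixel offsets from a seed pixel.
CLUSTER_SHAPES = [
    [(0, 0)],                          # single pixel
    [(0, 0), (0, 1)],                  # 1x2 cluster
    [(0, 0), (1, 0), (0, 1), (1, 1)],  # 2x2 cluster
]

def generate_bx(rows, cols, rng, mean_cluster_size=2.0):
    """Constrained-random clustered hits for one bunch crossing."""
    mean_clusters = mean_hits_per_bx(rows * cols) / mean_cluster_size
    hits = set()
    for _ in range(poisson(mean_clusters, rng)):
        shape = rng.choice(CLUSTER_SHAPES)
        r0, c0 = rng.randrange(rows), rng.randrange(cols)
        for dr, dc in shape:
            if 0 <= r0 + dr < rows and 0 <= c0 + dc < cols:
                hits.add((r0 + dr, c0 + dc))
    return hits
```

A real hit UVC would additionally attach a charge to each fired pixel and, as described above, interleave such internally generated clusters with hits read from the Monte Carlo ROOT files.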

Architecture study and optimization
The design requirements on pixel size, hit efficiency (at the specified hit and trigger rates and trigger latency) and low power demand a dedicated optimization of the digital pixel array logic. A relevant gain in storage resources can be obtained when the information associated with multiple hits from the same physical cluster is stored locally: an efficient architectural solution is to group pixels into so-called pixel regions, which share buffering logic, leading to compact circuitry and low power. Such an approach is already followed by the FE-I4 and Timepix3/Velopix chips [8][9][10].
An architecture exploration was therefore conducted, focused on investigating the optimal sharing strategy of digital logic in pixel chip arrays: how many pixels should share storage logic within a region, in what pattern, with which internal organization, and how region boundaries are handled. The optimization depends on cluster size distributions, which in turn depend on sensor type and location in the detector, and on physics input. Initial statistical and analytical studies based on cluster shapes [11] had shown that the most convenient pixel region configuration is square, with either 2×2 or 4×4 pixels.

Behavioral level study
A parameterized SystemVerilog pixel chip model at behavioral level was developed [12], with two buffering schemes to choose from: a centralized one (Central Buffering Architecture, CBA), featuring a single shared hit packet buffer, and a distributed one (Distributed Buffer Architecture, DBA), containing a shared hit time buffer composed of latency counters and independent Time over Threshold (ToT) buffers in each pixel unit cell. Both schemes are shown in figure 3. The behavior of the analog front-end (FE) is abstracted with a charge converter module, which converts the input hit charge into a discriminator output pulse; the ToT is then determined with a counter. The CBA and DBA behavioral architectures were studied by running VEPIX53 simulations for different pixel region sizes and numbers of buffer locations. Two performance metrics were defined: buffer occupancy and hit loss, the latter being due either to the dead time of the single pixel/pixel region or to the overflow of the latency buffer. Simulation results (figure 4) showed an increasing dead time for the CBA as the region grows in size; for the DBA, instead, the hit loss rate is constant with respect to the pixel region size. The buffer occupancy was used to build an occupancy histogram: from this it is possible to derive the buffer overflow probability as a function of the number of locations and subsequently determine the optimal buffer size that keeps this probability below a given design value, set to 0.1%. At this stage of the architecture evaluation the DBA appeared preferable in terms of dead time losses.
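The buffer-sizing procedure just described can be condensed into a few lines. In the actual study the occupancy histogram comes from simulation; the sketch below (with assumed rate and latency numbers) replaces it with an idealized Poisson-distributed occupancy, which is enough to show how the minimum depth for a 0.1% overflow probability is obtained:

```python
import math

def poisson_tail(mean, depth):
    """P(occupancy > depth) for a Poisson-distributed buffer occupancy."""
    p, cdf = math.exp(-mean), 0.0
    for k in range(depth + 1):
        cdf += p              # accumulate P(occupancy == k)
        p *= mean / (k + 1)   # next Poisson term
    return 1.0 - cdf

def min_buffer_depth(region_packet_rate_hz, trigger_latency_s,
                     max_overflow=1e-3):
    """Smallest number of buffer locations keeping the overflow
    probability below the design value (0.1% by default)."""
    mean = region_packet_rate_hz * trigger_latency_s
    depth = 0
    while poisson_tail(mean, depth) > max_overflow:
        depth += 1
    return depth

# Example with assumed numbers: 80 kHz of region packets and a 12.5 us
# trigger latency give a mean occupancy of one stored packet.
depth = min_buffer_depth(80e3, 12.5e-6)
```

With these assumptions the result is a depth of 5 locations; in the real flow the tail probability is read off the simulated occupancy histogram rather than an analytical distribution.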

RTL study
Both architectures were prototyped on small scale chips featuring design improvements suggested by the initial behavioral level study. In particular, the DBA was implemented with a 2×2 pixel region configuration in the FE65-P2 prototype [13], featuring a buffer depth of 7 in both the local ToT memory and the shared hit time memory. The CBA, instead, was implemented in the CHIPIX65 prototype [14] using 4×4 pixel regions. Its write logic is such that the ToT is counted within a fixed dead time, equal to the time needed to count the longest possible value; the end of processing is flagged to a region digital writer module, which saves into the shared buffer a reduced information packet containing a timestamp, a binary hit map of every pixel in the region, and up to six pixel ToTs. VEPIX53 simulations were run on the two RTL architectures in order to assess their compliance with the RD53A chip specifications. The comparison was conducted on a DUT consisting of a 4×64 pixel multicolumn array and with respect to several parameters: i) different analog FE behavioral models, with multiple charge-ToT relations and ToT counting clock frequencies [5]; ii) different numbers of memory locations; iii) input hits at a 3 GHz/cm² rate with different charge distributions, produced by mixing CMS Monte Carlo data with internally generated clusters. Single pixel hit loss was used as the performance metric. The results, some of which were presented in [5], are summarized in table 1, including the pixel cell area occupation of the synthesized architectures. In terms of dead time the CBA shows higher losses, due to the fixed buffer writing time, when a standard analog FE running at the 40 MHz bunch crossing clock is used; the digital logic of the DBA, on the contrary, does not introduce any additional dead time (the range of results is only due to the multiple charge-to-ToT conversion functions used in the analog FE behavioral models).
From the buffering resources point of view, instead, the CBA performs better thanks to the reduced number of fired pixels stored in each region packet. This leads to a smaller area, which makes it possible both to fit up to 16 memory locations (corresponding to negligible hit loss due to buffer overflow) and to leave room for further optimization of the architecture by inserting additional features. The DBA, by contrast, requires more area for comparable buffering performance.
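The dead time effect discussed above can be estimated to first order with a standard non-paralyzable dead time model. The sketch below is an illustrative back-of-the-envelope check, not a reproduction of the simulated figures in table 1; the fixed conversion window (a full-scale 4-bit ToT at the counting clock) is an assumption made for the example:

```python
def dead_time_loss(pixel_rate_hz, dead_time_s):
    """Fraction of hits lost while a pixel is busy
    (non-paralyzable dead time model: loss = r*tau / (1 + r*tau))."""
    rt = pixel_rate_hz * dead_time_s
    return rt / (1.0 + rt)

# 3 GHz/cm^2 over a 50x50 um^2 pixel -> 75 kHz per pixel
PIXEL_RATE_HZ = 3e9 * (50e-4 * 50e-4)

# Assumed fixed conversion window: 16 ToT counts at the counting clock.
loss_40mhz = dead_time_loss(PIXEL_RATE_HZ, 16 * 25e-9)    # 40 MHz counting
loss_80mhz = dead_time_loss(PIXEL_RATE_HZ, 16 * 12.5e-9)  # 80 MHz counting
```

Under these assumptions, doubling the ToT counting clock roughly halves the dead time loss, which is the qualitative effect behind the faster FE models mentioned above.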

Architecture optimization in RD53A
Given the complementary advantages and disadvantages of the two architectures, both the CBA and the DBA have been integrated into the RD53A chip. The limitations of each have been addressed: limited area (and therefore increased buffer losses) for the DBA; power consumption and dead time losses for the CBA. A performance comparison of the optimized architectures has then been performed by means of VEPIX53 simulations.
The optimization of the DBA has aimed at reducing area occupation and buffer losses. It should be noted that the area density is calculated as the percentage of the total area available for the digital logic after integration with a specific analog FE. The FE65-P2 prototype has 7 buffer locations per pixel region, with an area density after place and route of almost 90%, which leaves no room for additional logic. As seen in table 1, a buffer depth of 7 is not sufficient to keep the overall hit losses below 1%. This issue has been addressed by implementing the ToT memories with latches instead of flip-flops, leading to a significant decrease of 10% in area utilization. An additional memory location could therefore be fitted, while keeping the area density at ∼82%, depending on the analog FE. The pixel region shape was optimized as well: VEPIX53 simulations with Monte Carlo data, using the behavioral pixel chip model described in section 3.1, have shown that an elongated pixel region shape is preferable to a square one, contrary to what was found from the purely analytical assumptions of [11]. Simulation results for 2×2 and 4×1 pixel regions are compared in table 2. In the RD53A implementation of the DBA the pixel region shape has therefore been changed to 4×1.
On the CBA side, the optimization has made it possible to overcome the fixed pixel dead time issue and to increase the number of ToT values per event for each pixel region. The implemented pixel region architecture inserts an intermediate level of buffering before the ToT values are moved to the shared region memory. This stage, called the staging buffer, contains a 3-deep, latch-implemented buffer storing the pixel region hit map; instead of the previous fixed dead time scheme, a pixel region synchronization counter coordinates the access of each pixel to the shared memory: when the counter reaches the waiting time, the hit map is propagated to a ToT compressor module together with the ToT values, and up to 8 ToTs are then stored in the shared buffer. The comparison between the optimized RD53A architectures for the digital matrix is reported in table 3, including hit loss, obtained through RTL simulation (within VEPIX53), and area occupation after place and route.
The hierarchical block simulated for the DBA and CBA comparison is an 8-pixel wide column of full chip height, i.e. a total of 8×192 pixels, corresponding to a so-called pixel core column in the RD53A chip. The common simulation parameters are the following:
• mixed Monte Carlo and internally generated hits (pixel size of 50×50 µm² for the center of the barrel, 25×100 µm² for the edges) to achieve the target specification of 3 GHz/cm². Internally generated hits feature the same pixel charge distribution and similar cluster shapes as extracted from the Monte Carlo data.
RD53A integrates three different analog FE flavors, called Linear, Differential and Synchronous; a single analog FE behavioral model has been described for the simulations, featuring an improved charge-ToT relation. The slope of the conversion function (not necessarily linear for all FEs) is controlled through the bias and is normally defined as a trade-off between efficiency, charge resolution and dynamic range. Two points have been chosen for the linear approximation: the threshold corresponds to a unit ToT, and the pileup inefficiency is kept in the order of 1%. In particular, the second point is defined such that a Minimum Ionizing Particle (MIP) traversing a 50 µm × 50 µm × 150 µm sensor perpendicularly (12 ke⁻) corresponds to a half-range ToT value, in order to improve resolution without excessively compromising losses and dynamic range. The conversion function is represented in figure 5. Note that the conversion function does not saturate, emulating the actual behavior of the FEs developed for RD53A, which do not feature any discharge mechanism in case of overflow. This directly translates into the dead time cycles needed for the ToT conversion (even though the value stored by the digital logic saturates at a ToT of 15, for a 4-bit measurement). It should be highlighted that the choice of the final conversion function is not meant to be definitive: it will depend on future optimizations of the analog FE and digital architecture, as well as on updated simulations including the position on the detector and/or the needs of the experiments. No fast analog FE models have been taken into account, as they do not provide further insight on the digital architectures and have not been integrated in the whole pixel chip matrix. A faster ToT counting leads to a reduction in dead time losses, as also seen before.
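The two-point linear conversion just described, together with the 4-bit saturation applied only at storage time, can be captured in a small sketch. The threshold charge used here (600 e⁻, the minimum specified for stable operation) and the exact half-range ToT code are assumptions made for the example; figure 5 defines the actual curve:

```python
THRESHOLD_E = 600       # assumed threshold charge (electrons)
MIP_CHARGE_E = 12000    # MIP traversing 150 um of silicon perpendicularly
TOT_AT_THRESHOLD = 1    # first calibration point: threshold -> unit ToT
TOT_AT_MIP = 8          # second point: MIP -> half-range ToT (assumed code)

def charge_to_tot(charge_e):
    """Linear charge-to-ToT conversion through the two calibration points.
    The conversion itself does not saturate: the FE has no overflow
    discharge mechanism, so large charges keep the pixel busy longer."""
    if charge_e < THRESHOLD_E:
        return 0.0  # below threshold: no discriminator pulse
    slope = (TOT_AT_MIP - TOT_AT_THRESHOLD) / (MIP_CHARGE_E - THRESHOLD_E)
    return TOT_AT_THRESHOLD + slope * (charge_e - THRESHOLD_E)

def stored_tot(charge_e):
    """Value recorded by the digital logic: 4-bit, saturating at 15."""
    return min(15, int(round(charge_to_tot(charge_e))))
```

The distinction between the two functions matters for the dead time discussion above: a very large charge still costs the full, unsaturated number of counting cycles even though only a ToT of 15 is stored.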
In terms of inefficiency, the DBA and CBA architectures integrated in RD53A feature comparable losses, which are close to the specification if the charge-to-ToT curve is chosen so as to limit dead time losses. The latter are higher at the edges of the barrel unless a different conversion function is used, since the Monte Carlo data show on average a higher charge per pixel (for the 25×100 pixel size). The digital logic also has an impact on dead time, above all in the case of the CBA (2 clock cycles versus 1 clock cycle for the DBA). Losses from latency buffer overflow are an order of magnitude lower than those from the analog part. Moreover, the DBA profits from the implemented elongated pixel region shape, as latency losses are the same in the different portions of the detector. At the high hit rates of operation the inefficiency is dominated by dead time, so a charge-ToT conversion function may need to be chosen that limits it, possibly penalizing physics considerations (e.g. position and charge resolution). To this end, the adoption of a faster ToT counting is a valuable solution, which needs to be evaluated. With respect to area, the DBA and CBA architectures integrated in RD53A feature rather similar density, the main difference being the analog FE size (Linear: 35 µm × 35 µm, Differential: 34.71 µm × 32.44 µm, Synchronous: 35 µm × 33.2 µm).

Functional verification
The VEPIX53 testbench for verifying the RD53A demonstrator reuses and extends the environment implemented for the architecture study, which was dedicated to the pixel array logic only. When taking into account the full chip, as shown in the digital functional block diagram of figure 6, there are additional building blocks to be verified: the array also includes pixel configuration, calibration injection pulse generation circuitry and an additional HitOR channel, while the chip periphery contains control and readout logic and global configuration registers.
Different interfaces can be identified, apart from the one related to incoming particle hits. The command interface is associated with the serial 160 Mbps input stream of the chip, which follows a dedicated, DC-balanced custom protocol that encodes clock, trigger and other commands on a single link and features built-in framing and error detection. A corresponding command UVC has been implemented. Its main function is to generate, drive and monitor all the possible command types, among which are the synchronization pulse, the calibration pulse and read/write accesses to configuration registers; moreover, it extends the functionality of the pre-existing trigger UVC, as its driver encodes incoming trigger transactions into high priority input commands to the chip. The chip output encodes pixel data, configuration data and messages on 1 to 4 serial links (programmable) at 1.28 Gbps nominal bandwidth, using the Xilinx Aurora 64b/66b protocol. The associated Aurora UVC monitors the chip output, decoded by a Xilinx IP receiver module, and builds the corresponding hit transactions or monitoring data (configuration or message) transactions. The additional HitOR output of the chip is monitored as well, with a dedicated UVM component. On the module UVC side, a simple monitor and scoreboard have been implemented for the automatic verification of the RD53A command decoder module and its error correction functions.
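At monitor level, the essential property of the Aurora 64b/66b framing is that every 66-bit block starts with a 2-bit sync header distinguishing data from control blocks. The sketch below shows this classification step in Python; it is a simplified stand-in for the decoding actually performed by the Xilinx IP receiver module:

```python
def split_block(block66):
    """Split a 66-bit Aurora block into its sync header and 64-bit payload."""
    sync = (block66 >> 64) & 0b11
    payload = block66 & ((1 << 64) - 1)
    return sync, payload

def classify_block(block66):
    """64b/66b sync headers: 0b01 marks a data block, 0b10 a control block;
    0b00 and 0b11 are invalid and indicate a transmission or lock error."""
    sync, _ = split_block(block66)
    if sync == 0b01:
        return "data"
    if sync == 0b10:
        return "control"
    raise ValueError("invalid 64b/66b sync header")
```

On top of this framing, the Aurora UVC rebuilds hit and monitoring transactions from the decoded payloads, as described above.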
Due to tight time-to-submission constraints, a fully metric-driven verification approach was not pursued for the RD53A demonstrator. For the verification of crucial functions of the pixel chip (such as the triggered hit data path) it was possible to reuse and extend the constrained-random tests initially defined for the architecture exploration. Concerning the functions related to the chip periphery, instead, not all the verification components have yet been implemented, notably a UVM register model, which would enable automated verification of the global and pixel configuration. These functions have therefore been verified with several traditional directed tests, which were nevertheless implemented as custom sequences run by the command UVC. It should also be noted that radiation effects, such as Single Event Upset (SEU) injection, were not included in the simulations, as the UVCs do not yet support this feature. Concerning the verification of the analog domain, UVM tests were run to produce Value Change Dump (VCD) files, which were used as input stimuli for fully analog simulations, at schematic or post-place and route level, of the analog building blocks. This has allowed the definition of a set of meaningful digital stimuli for the analog blocks and the verification of the interface between the analog and digital domains. For example, connectivity issues and incorrect default configurations of such analog blocks have been uncovered thanks to this approach.
Both constrained-random and directed tests were written against a detailed verification plan document, which lists in a prioritized fashion all the main functions and test cases to be verified, from the single building block to the full chip level. The tests were first run at RTL, both on single pixel core columns and on the full pixel array, to find possible bugs in the logic. Then, full chip test regressions containing a set of the most relevant functional tests (e.g. processing of random hits and triggers, digital injections, read back of configuration data) were run at gate level in order to verify the chip with timing back-annotation for three different simulation corners (figure 7). For the final debug of the chip several design iterations were needed, with fixes applied both to functional bugs and to timing issues not covered by static timing analysis (e.g. reset strategy, signals crossing asynchronous clock domains, etc.). Test regressions were repeated at both RTL and gate level at each iteration until an all-pass result was obtained.

Conclusion
A UVM simulation and verification environment has been developed by the RD53 Collaboration. The environment has made it possible over the years to perform incremental architecture studies of critical processing and buffering sections of hybrid pixel readout chip prototypes, as well as their functional verification.
On the architecture exploration side, the collected simulation metrics have helped in understanding the amount of buffering needed for efficiently processing particle hits at the High Luminosity LHC operating conditions, as well as how to optimize different configurations to meet the requirements of the first large scale prototype. The UVM infrastructure implemented so far can now easily be enhanced to pursue the architecture optimization for the final ATLAS and CMS pixel readout chips, whose design will take place in 2018.

As far as functional verification is concerned, the same testbench components have been reused and expanded, profiting from the UVM library. The constrained-random tests simulating operating conditions have been crucial for finding and fixing bugs in the RD53A chip design, especially in the particle hit data path; such a result would have been very hard to achieve with a fully directed testing approach. Time-to-submission constraints did not allow finalizing the implementation of all the needed UVM components and constrained-random tests, so a considerable number of directed tests had to be included in the verification plan. However, since RD53A is a prototype and not a production chip, the extensive verification of lower priority functions could be forgone. This will by no means be the case when verifying the final ATLAS and CMS chips, so it will be very important to continue the testbench development. In particular, a future extension including SEU injection for radiation tolerance assessment will be mandatory.