Proton-induced radiation effects in the I/O blocks of an SRAM-based FPGA

This work presents an experimental study of proton-induced failures in the Input/Output (I/O) blocks and in the associated configuration of an SRAM-based Field Programmable Gate Array (FPGA), using a single-inverter ring oscillator configuration. The tests were performed on KINTEX-7 FPGAs exposed to a 35 MeV proton beam. The cross-sections for three classes of single-event effects, measured in an I/O user logic design that uses less than 1% of the configuration memory, were determined to be $2.22^{+1.4}_{-1} \cdot 10^{-11}$ cm$^2$/device, $1.53 \cdot 10^{-11}$ cm$^2$/device (upper limit) and $0.97^{+1}_{-0.6} \cdot 10^{-11}$ cm$^2$/device, respectively. The conclusions and the probable impact of the results on the end application are presented.


Introduction
The applicability range of reconfigurable devices, especially commercial off-the-shelf FPGAs, has been extended in recent decades even to harsh environments with a large radiation background (e.g. space and particle accelerator experiments). Given the nature of these environments, the end applications demand reliable semiconductor devices that operate with very low error rates.
In high energy physics (HEP) experiments, SRAM-based commercial FPGAs have some advantages over application specific integrated circuits (ASICs), such as lower price, low non-recurring engineering cost, in-flight reconfigurability, high logic density and partial reconfiguration. FPGAs are therefore becoming a possible option in HEP experiments with such extreme environments.
During the second long shutdown of the Large Hadron Collider (LHC) accelerator at CERN, scheduled for 2019-2020, the Large Hadron Collider beauty (LHCb) collaboration [1] will implement the upgrade program in each sub-detector [2]. This upgrade program will allow LHCb to operate at an instantaneous luminosity of up to $2 \cdot 10^{33}$ cm$^{-2}$ s$^{-1}$, 5 times higher than the previous maximum values. To achieve this goal, the present trigger system will be replaced by a trigger-less full readout followed by a full software trigger, allowing the detector to increase the readout rate from 1 MHz up to the LHC bunch crossing rate of 40 MHz in all sub-systems [3,4].
In particular, the Ring Imaging Cherenkov (RICH) sub-detectors are to be upgraded given their central role in the identification of charged hadrons over a broad momentum range. The photo-detection system will be redesigned in a modular architecture and will use multi-anode photomultiplier tubes (MaPMT) as photodetectors. For the digital readout, an SRAM-based FPGA from the KINTEX-7 family [5], the XC7K160T-FBG676, has been proposed as the central unit of a digital board which reads the trigger signals from the front-end electronics and sends them through gigabit transceivers to the data acquisition system.
As a commercial off-the-shelf component, the FPGA needs to be tested to prove that it withstands the high radiation levels expected for the RICH front-end electronics: 200 krad (in silicon) total ionizing dose (TID) and a fluence of $1.2 \cdot 10^{12}$ HEH/cm$^2$ for 50 fb$^{-1}$ total luminosity [4]. The FPGA reliability in a given radiation environment can be established by exposing the device to a beam of ionizing particles, monitoring the changes in its parameters, and extrapolating the results to the expected LHCb environment. SRAM-based FPGAs are well known for their tendency to suffer from radiation-induced single event upsets (SEUs) in the configuration memory [6,7]. Even one SEU in the configuration data might corrupt the FPGA circuit, which could lead to operational failure.
The Xilinx KINTEX-7 family is manufactured using the TSMC high performance and low power (HPL) process with a 28 nm high-k metal gate (HKMG) technology node [8]. The smallest device from the KINTEX-7 family, the XC7K70T-FBG484C6 (lidless), has been intensively investigated and tested by our group with various particle beams: ions, protons and X-rays [9]-[11]. The XC7K70T device under test (DUT) has the following resources: 82000 user flip-flops, 240 DSP slices, 4.86 Mb of Block RAM, 65600 logic cells, 8 GTX transceivers, up to 300 I/O pins and 18884576 bits of configuration memory [5]. Because the DUT is manufactured in a flip-chip package, its top layer, ∼ 0.185 ± 0.02 mm thick, was thinned to ∼ 0.05 mm. This procedure was needed to reduce the energy loss in the device when using ion beams.
The methodology and measurement of proton-induced Input/Output (I/O) block failures are given in this paper. Other researchers have tested these resources using similar techniques, but on different devices and with a slightly different testing methodology [12].

Setup preparation
Due to radiation and facility constraints, a custom DUT test board and a custom data acquisition system have been designed [13]. Device electrical parameters and firmware integrity were monitored and controlled using graphical user interfaces (GUIs) designed with LabVIEW™. Figure 1 presents the architecture of the setup used to test and monitor the operation of the KINTEX-7 I/O blocks. The I/O blocks were tested using a custom firmware which embeds them in ring oscillator (RO) structures. The RO waveforms were monitored using an oscilloscope which acquires the signal over 5 meters of high-quality coaxial cables, each with a 50 Ω matching impedance as well as a 50 Ω termination inside the oscilloscope.
As an optional feature, the integrity of the configuration memory can be monitored with the SEM IP core, a Xilinx proprietary error detection, classification and mitigation tool [14]. However, we have seen that using the SEM IP core as a mitigation tool to correct the configuration memory is not enough, since the tool itself gets corrupted with higher probability than the actual LHCb firmware, which has fewer critical bits than the SEM IP. Besides error mitigation, the SEM IP core has an embedded feature which allows users to inject SEUs in the DUT configuration memory. This feature is very useful for users who want to test, debug and validate their design before going to a radiation facility.

DUT firmware
To test the reliability of the I/O blocks in the presence of ionizing radiation, a custom firmware architecture with 4 independent RO structures has been implemented in 5 out of 6 I/O banks of the FPGA. This was done using the I/O buffers configured as delay elements, in which the signal is shifted from the first up to the last delay element and through the entire chain. The output of the last element is then inverted using a basic inverter and fed into the first delay element. In this way, the configuration presented in figure 2 meets the minimum conditions to generate a complete self-oscillation while not using additional logic resources besides the single inverter. To evaluate the I/Os under different conditions, each ring oscillator was implemented using a different configuration: a different type of I/O bank, either high performance (HP) or high range (HR), and a different number of I/O pins. Each I/O buffer is configured with the following attributes: "drive strength = 12 mA", "slew rate = slow" and "IO standard = LVCMOS18 or LVCMOS15". Depending on its configuration, and based on the internal delay and propagation paths, each RO has a fixed oscillation frequency; the details are given in table 1.
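The dependence of the oscillation frequency on the loop delay can be illustrated with a minimal sketch. It assumes an idealised ring in which a signal edge must traverse the whole loop twice per period, so that f = 1/(2·t_loop). The function name and all delay values below are hypothetical and are not taken from table 1.

```python
def ring_oscillator_frequency(stage_delays_s, inverter_delay_s):
    """Return the ideal oscillation frequency (Hz) of a single-inverter ring.

    A full period needs two trips around the loop (one per logic level),
    so f = 1 / (2 * total loop delay).
    """
    loop_delay_s = sum(stage_delays_s) + inverter_delay_s
    if loop_delay_s <= 0:
        raise ValueError("total loop delay must be positive")
    return 1.0 / (2.0 * loop_delay_s)


# Hypothetical example: 20 I/O buffer stages of 5 ns each plus a 1 ns inverter.
nominal_delays = [5e-9] * 20
f_nominal = ring_oscillator_frequency(nominal_delays, 1e-9)

# If one stage became slower (for instance after a change of an I/O attribute),
# the loop delay grows and the frequency drops.
upset_delays = nominal_delays.copy()
upset_delays[0] = 30e-9  # hypothetical delay of the affected stage
f_upset = ring_oscillator_frequency(upset_delays, 1e-9)

relative_shift = (f_upset - f_nominal) / f_nominal
print(f"nominal: {f_nominal / 1e6:.2f} MHz, after upset: {f_upset / 1e6:.2f} MHz, "
      f"shift: {100 * relative_shift:+.1f}%")
```

In this simplified picture, any change that alters the delay of a single stage shows up directly as a shift of the measured RO frequency, which is why the frequency is a convenient observable for the I/O blocks.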
About 71% of the I/O block resources available in the DUT have been used to implement these RO circuits, and this particular firmware uses 36688 (0.19%) essential bits out of the entire FPGA configuration memory of 18884576 bits. The Xilinx essential bits [15] are a subset of the configuration bits; if one of them is upset, e.g. by a radiation-induced SEU, it changes the design circuitry, but it might not affect the function of the design. By including the SEM IP core as an error mitigation tool for the configuration memory, the firmware goes up to 188458 essential bits (1%), with 154474 (0.82%) essential bits used only by the tool. Because the SEM IP core uses about 4 times more essential bits than the actual design, we decided not to include it in these tests, as its probability to fail before the ROs is higher.
Figure 3 presents the RO waveforms in an oscilloscope snapshot where, from top to bottom, they are labelled as follows: RO2, RO1, RO3 and RO4. The waveforms are acquired continuously while their parameters are measured automatically and transferred to a LabVIEW™ GUI, from where they are saved in ASCII files for later analysis.
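The essential-bit fractions quoted above can be checked with a few lines of arithmetic. The sketch below uses only the bit counts given in the text; the variable and function names are illustrative.

```python
# Bit counts quoted in the text.
N_CRAM = 18_884_576            # total configuration memory bits
ESSENTIAL_RO = 36_688          # essential bits of the RO-only firmware
ESSENTIAL_WITH_SEM = 188_458   # essential bits with the SEM IP core included
ESSENTIAL_SEM_ONLY = 154_474   # essential bits used only by the SEM IP core


def percent_of_cram(bits, total=N_CRAM):
    """Return the share of the configuration memory, in percent."""
    return 100.0 * bits / total


ro_pct = percent_of_cram(ESSENTIAL_RO)
with_sem_pct = percent_of_cram(ESSENTIAL_WITH_SEM)
sem_only_pct = percent_of_cram(ESSENTIAL_SEM_ONLY)

# Ratio between the SEM IP core essential bits and those of the RO design.
sem_to_design_ratio = ESSENTIAL_SEM_ONLY / ESSENTIAL_RO

print(f"RO firmware:   {ro_pct:.2f}%")
print(f"with SEM IP:   {with_sem_pct:.2f}%")
print(f"SEM IP only:   {sem_only_pct:.2f}%")
print(f"SEM/design:    {sem_to_design_ratio:.1f}x")
```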

Test beam results
The reliability of the I/O blocks has been tested in a 35 MeV proton beam (energy at the DUT surface) using the JULIC cyclotron, which is the injector for the COSY facility [16]. The energy loss in the DUT package was estimated to be about 1.68 ± 0.2 MeV.
Each of the three tested FPGAs was configured with only the RO structures and without configuration memory mitigation. Each was exposed to a TID of 50 krad (Si) and accumulated a fluence of about $2.3 \cdot 10^{11}$ protons/cm$^2$, within 7% uncertainty, at an average dose rate of 30 rad/s.
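As a rough cross-check of these exposure figures, the beam time and mean proton flux per device can be estimated from the quoted dose, dose rate and fluence. This is a sketch under the assumption of a constant dose rate over the whole exposure; the derived values are not stated in the text.

```python
# Values quoted in the text (per device).
TID_RAD = 50e3           # total ionizing dose, rad (Si)
DOSE_RATE_RAD_S = 30.0   # average dose rate, rad/s
FLUENCE_CM2 = 2.3e11     # accumulated fluence, protons/cm^2

# Assumption: the dose rate is constant during the whole exposure.
exposure_s = TID_RAD / DOSE_RATE_RAD_S
mean_flux_cm2_s = FLUENCE_CM2 / exposure_s

print(f"beam time per device: {exposure_s / 60.0:.1f} min")
print(f"mean flux: {mean_flux_cm2_s:.2e} protons/cm^2/s")
```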

Before and during the irradiation procedure each RO was monitored continuously, and trigger conditions were implemented to count the SEUs in the I/O blocks. Every RO was monitored for variations in its waveform parameters: amplitude, frequency and duty cycle. If a parameter deviates from its nominal value during exposure to radiation, the oscilloscope records the current waveform on its drive. Figure 4 shows the DUT placed on the sample holder at its irradiation position in front of the beam exit.
Complete loss of oscillation is most likely due to the corruption of the configuration memory by a radiation-induced SEU, which changes the specific RO circuit and disrupts its oscillation. The signal is then stuck in a random state, either at 0 V or at the voltage level given by the I/O standard configuration (1.5 or 1.8 V). Such events occur when the output buffer of an I/O block is disabled by its 3-state T pin value (logic high), when a connection inside the RO circuit is rerouted or created, or, with a smaller probability, when the LUT-based inverter is affected. This type of failure was recovered only by a reconfiguration of the DUT.
For each DUT, the RO failure probability can be expressed as a cross-section (σ), obtained by dividing the number of observed failures by the total fluence accumulated on the device during the tests. These values include the experimental uncertainties at confidence levels (CL) of 68% and 95% [17,18]. The number of recorded failures, the RO oscillation and the associated cross-section values are given in table 2 for each tested sample.
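The cross-section estimate and its confidence interval can be sketched as follows. The interval construction used here is the central (equal-tailed) Poisson interval, which is an assumption: the text does not state which construction [17,18] prescribe, so the output need not reproduce the values of table 2. The inputs in the example (7 failures and a summed fluence of $6.9 \cdot 10^{11}$ protons/cm$^2$) are the figures quoted in this paper for the frequency failures; all function names are illustrative.

```python
import math


def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu)."""
    if k < 0:
        return 0.0
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total


def _bisect(func, lo, hi, iterations=200):
    """Root of a monotonic function that changes sign on [lo, hi]."""
    f_lo = func(lo)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        f_mid = func(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def poisson_central_interval(n, cl):
    """Central interval for the Poisson mean given n observed events.

    Assumption: equal tail probabilities (1 - cl) / 2 on each side.
    """
    alpha = 1.0 - cl
    bracket = n + 10.0 * math.sqrt(n + 1.0) + 20.0  # safely above the upper bound
    upper = _bisect(lambda mu: poisson_cdf(n, mu) - alpha / 2.0, 0.0, bracket)
    if n == 0:
        return 0.0, upper
    lower = _bisect(
        lambda mu: (1.0 - poisson_cdf(n - 1, mu)) - alpha / 2.0, 0.0, bracket
    )
    return lower, upper


def cross_section(n_failures, fluence_cm2, cl):
    """Return (sigma, sigma_low, sigma_high) in cm^2/device."""
    n_lo, n_hi = poisson_central_interval(n_failures, cl)
    return (n_failures / fluence_cm2, n_lo / fluence_cm2, n_hi / fluence_cm2)


sigma, sigma_lo, sigma_hi = cross_section(7, 6.9e11, 0.95)
print(f"sigma = {sigma:.2e} cm^2/device, "
      f"95% CL interval [{sigma_lo:.2e}, {sigma_hi:.2e}]")
```

Only the statistical uncertainty on the number of failures is treated here; the 7% fluence uncertainty quoted for the exposures is not propagated.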
Assuming to first approximation that all 16 errors given in table 2 are due to corruption of the configuration memory, the number of critical bits ($N_{crit}$) can be approximated by equation (3.1):

$$N_{crit} = \frac{\sigma_A}{\sigma_{CRAM}} \cdot N_{CRAM} \qquad (3.1)$$

where $\sigma_A$ is the total cross-section value from table 2, $N_{CRAM}$ is the size of the DUT configuration memory, and $\sigma_{CRAM}$ is the SEU cross-section for the total configuration memory, which was determined by our group to be about $1.5 \cdot 10^{-7}$ cm$^2$/device [8]. The $N_{crit}$ value is then calculated to be 2790 bits, representing 0.015% of the total configuration memory size and 8% of the number of essential bits.
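The critical-bit estimate can be reproduced numerically. The sketch below assumes the scaling $N_{crit} = (\sigma_A / \sigma_{CRAM}) \cdot N_{CRAM}$ and takes $\sigma_A = 2.22 \cdot 10^{-11}$ cm$^2$/device, the first cross-section quoted in the abstract, as the total value from table 2; both are assumptions made for illustration.

```python
# Inputs quoted in the text; SIGMA_A is assumed to be the first value
# given in the abstract (table 2 itself is not reproduced here).
SIGMA_A = 2.22e-11      # cm^2/device, failure cross-section
SIGMA_CRAM = 1.5e-7     # cm^2/device, SEU cross-section of the whole CRAM
N_CRAM = 18_884_576     # configuration memory size, bits
N_ESSENTIAL = 36_688    # essential bits of the RO firmware

# The fraction of CRAM upsets that cause a failure, scaled to the memory
# size, gives the number of critical bits.
n_crit = SIGMA_A / SIGMA_CRAM * N_CRAM

pct_of_cram = 100.0 * n_crit / N_CRAM
pct_of_essential = 100.0 * n_crit / N_ESSENTIAL

print(f"N_crit ~ {n_crit:.0f} bits")
print(f"{pct_of_cram:.3f}% of the configuration memory")
print(f"{pct_of_essential:.1f}% of the essential bits")
```

With these inputs the result is close to 2800 bits, consistent with the rounded value of 2790 bits quoted in the text.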
Duty cycle shifts are failures which may occur only during half of the RO oscillation period. A possible cause for these failures is either a corruption of the configuration memory (most likely) or a direct strike of a proton in the I/O block region, causing a transient effect (glitch) to propagate forward into the RO structure. To distinguish the duty cycle failures from the other types of RO failure, a comprehensive analysis was done of all parameters measured for a specific RO during an irradiation run, not only the duty cycle. If only the duty cycle value is shifted while the other parameters (frequency and amplitude) remain within their nominal values, the failure is labeled as a duty cycle failure. By design the RO duty cycle is close to 50%; assuming an acceptable variation of ±2%, no shifts outside this range were seen. Hence, the upper limit of the duty cycle failure cross-section was found to be $1.53 \cdot 10^{-11}$ cm$^2$/DUT for a CL of 95%. This value corresponds to the cumulated fluence of $6.9 \cdot 10^{11}$ protons/cm$^2$ summed over all 3 samples and all runs.
The frequency shift failures can be caused by modifications of the FPGA design circuit or by changes in the I/O attributes (slew rate, drive strength or I/O standard). When the FPGA design circuit is modified by radiation, additional parasitic circuits might be created that interfere with the RO circuit nodes. Such modifications cause positive or negative shifts in the RO's nominal frequency.
Several frequency failures were seen in all 3 DUTs during irradiation; these were observed as positive and negative frequency shifts larger than 10%. Hence, a sum of $7^{+3.8}_{-3.1}$ (68% CL) and $7^{+7.5}$ … The frequency failure cross-section was found to be $0.97^{+0.5}_{-0.4} \cdot 10^{-11}$ cm$^2$/DUT for a CL of 68% and $0.97^{+1}_{-0.6} \cdot 10^{-11}$ cm$^2$/DUT for a CL of 95%. The left side of figure 5 shows a typical waveform of a given RO recorded before the irradiation run. As can be seen, its edges are noisy due to the influence of the long cables, which are connected directly to the DUT without any reconstruction circuitry. The waveform is nevertheless good enough to be analyzed and for its parameters to be extracted. The right side of figure 5 shows, as an example, a radiation-induced frequency failure of the same RO while the DUT is being exposed to radiation.
Figure 5. Sample 1 RO3 waveform data, recorded before exposure to radiation (left side) and while the DUT is exposed to radiation (right side).
By zooming in on the time axis of both plots, see figure 6, the changes in the RO frequency during radiation exposure become more visible. Assuming that the waveform on the right side of figure 6 is composed of 2 signals, we get a signal with a period $T_1 \sim 1.15 \cdot 10^{-7}$ s and frequency $F_1 \sim 0.87$ MHz, and another one with a period $T_2 \sim 0.67 \cdot 10^{-7}$ s and frequency $F_2 \sim 1.49$ MHz. These values show a negative shift of the RO frequency and highlight the fact that the RO circuit has been affected and no longer operates within nominal parameters. These failures were completely recovered by a full reconfiguration of the DUT, and similar failures were seen in the other samples and their associated ROs.
Figure 6. Zoom in on the sample 1 RO3 waveform data, recorded before exposure to radiation (left side) and while the DUT is exposed to radiation (right side).

Conclusions
The reliability of the I/O blocks has been investigated by our group due to their important role in connecting the user logic with other devices embedded in a larger system (e.g. with the front-end electronics in LHCb-RICH). During exposure to a 35 MeV proton beam, failures in the I/O blocks have been seen. These lead to complete loss of the oscillation or at least to some modification of the I/O block attributes, e.g. frequency shifts. However, all I/O block failures have been fully mitigated by a full reconfiguration of the DUT.
The I/O block and firmware failures seen during the irradiation have a small cross-section compared with the total SEU cross-section per device [9,10,19]. The majority of the I/O failures are correlated with the corruption of the configuration memory, as we have seen similar failures in our laboratory tests when injecting SEUs with the SEM IP core tool. Their impact on the end application is not dangerous, but they might modify the timing of a specific I/O, which can in turn cause false-positive events to propagate in the user logic. Careful error-mitigation techniques therefore need to be implemented, even if the number of essential and critical bits is very low with respect to the total configuration memory size.
Based on these results, we can extrapolate to 3000 FPGAs with 7000 hours of operation and 50 fb$^{-1}$ total luminosity, as will be the case in the LHCb-RICH system during the upgrade. With a CL of 95% we get the following preliminary upper limits per hour in all FPGAs: 20 complete losses of I/O bank pins, 1 duty cycle failure, and 11 frequency failures out of 512 K input pins and 84 K output pins. This will have a low impact on the LHCb-RICH operation, as the error mitigation for such events will be done either online, by partial or full reconfiguration of the affected FPGAs using the DAQ system, or offline, by masking the data from the affected FPGAs until they recover.
Other beam tests have been carried out with ion and X-ray beams, and an extensive analysis, including time domain analysis, is ongoing in order to better extrapolate these results to the harsh radiation environment of the LHC.