Soft error rate estimations of the Kintex-7 FPGA within the ATLAS Liquid Argon (LAr) Calorimeter

This paper summarizes the radiation testing performed on the Xilinx Kintex-7 FPGA to determine whether the Kintex-7 can be used within the ATLAS Liquid Argon (LAr) Calorimeter. The Kintex-7 device was tested with wide-spectrum neutrons, protons, heavy ions, and mixed high-energy hadron environments. The results of these tests were used to estimate the configuration RAM and block RAM upset rates within the ATLAS LAr. These estimates suggest that the configuration memory will upset at a rate of 1.1 × 10^-10 upsets/bit/s and the block RAM will upset at a rate of 9.06 × 10^-11 upsets/bit/s. For the Kintex 7K325 device, this translates to 6.85 × 10^-3 upsets/device/s for configuration memory and 1.49 × 10^-3 upsets/device/s for block memory.


Introduction
The density and capability of modern FPGAs are growing rapidly. The 28 nm Xilinx Kintex7, for example, is a low-power, high-capacity FPGA that can implement a large amount of computation, I/O processing, and configurable logic. The Kintex 7K325T contains 840 DSP slices, over 300,000 logic cells, 445 Block RAM memories (16.4 Mb of internal memory), and 16 12.5 Gb/s transceivers (for a total of 200 Gb/s of I/O bandwidth) [1].
The capability and benefits of modern FPGAs can be exploited within high-energy experiments and detectors such as the ATLAS detector at CERN [2]. SRAM-based FPGAs, however, are sensitive to ionizing radiation, and fault mitigation methods must be employed to counter its effects. Fortunately, the space electronics community has invested significant effort in identifying and deploying methods for exploiting the advantages of FPGAs within the radiation environment of space [3]. Through methods such as configuration scrubbing, triple-modular redundancy, and error correction coding, FPGAs have been shown to operate robustly in harsh radiation environments.
The purpose of this work is to investigate the feasibility of deploying the Xilinx Kintex7 FPGA within the ATLAS Liquid Argon (LAr) Calorimeter front end electronics [4]. The FPGA can potentially replace custom ASICs that are currently used to route the digitized signals to the optical links, monitor the front end analog to digital converters, and in a later stage process signals at the front end. Specifically, this work will investigate the sensitivity of the Kintex7 to high energy hadrons, estimate the upset rate of the FPGA configuration memory and BRAM within the LAr, and validate mitigation methods to address the estimated failure rates. This work will show that with proper mitigation, the Kintex7 can be used effectively within the LAr.

Radiation testing
A variety of radiation tests were conducted from October, 2012 through September, 2013 to better understand the effects of radiation on the Kintex7 architecture. The goals of these tests were as follows: (1) measure the sensitive static cross-section of key architectural components of the Kintex7 device, (2) estimate the upset rates within the LAr environment, (3) identify single-event functional interrupt (SEFI) mechanisms within the Kintex7, and (4) explore SEE mitigation methods on the Kintex7 and validate their operation. This paper will report on the progress on items (1) and (2) listed above. Future publications will report on items (3) and (4).

Radiation tests
The Kintex7 device was tested in the following radiation environments:
LANSCE [5], October 2012. This initial test measured the sensitive cross section of the configuration memory and BRAM to a wide-spectrum neutron beam. The test involved a periodic readback of the configuration contents to identify upsets within the configuration memory and BRAM. The test setup is shown in figure 1. The results from this test confirmed a similar test performed by Xilinx [6].
H4IRRAD, November 2012. This test measured the sensitive cross section of the configuration memory and BRAM within the CERN H4IRRAD facility, which provides a well-characterized mixed field of high-energy hadrons (HEH). The H4IRRAD test followed the same test procedure as the initial LANSCE test. The H4IRRAD field more closely resembles the LAr field than the other tests performed during this campaign.
TSL [7], May 2013. An additional test was performed at the TSL PAULA facility to measure the sensitive cross section of the CRAM and BRAM with high-energy protons (180 MeV). In addition, the ANITA wide-spectrum neutron facility was used to validate the results from LANSCE and to complete initial verification of a dynamic configuration memory scrubber.
Texas A&M [8], September 2013. The Kintex7 architecture was also tested with heavy ions (nitrogen, xenon, and argon) in an effort to understand single-event latchup behavior and to obtain initial data for use in space upset-rate estimations. Results from this test will be described in a future publication.
LANSCE, September 2013. The final test verified the JTAG scrubber and investigated the sensitivity of the multi-gigabit transceivers (MGT) and the benefits of mitigation methods such as TMR.
Although the test infrastructure and radiation environments of these tests differ slightly, all tests were organized in a similar manner. The Kintex 705 development board was placed in the radiation beam and a remote laptop was connected to monitor its behavior while under test. A single run was conducted as follows: (1) configure the FPGA with the design under test (DUT), (2) expose the DUT to a fixed amount of radiation, (3) query the device state through configuration readback, and (4) reconfigure or repower the device in preparation for the next run.
The design used for the neutron and proton testing was a simple circuit that included a counter tied to the LEDs (to indicate operational status) and a BRAM module that instantiates all 445 BRAM primitives in the device (the BRAM primitives must be used in order to observe upsets). The configuration bitstream was loaded onto the FPGA through the JTAG port, and the post-radiation configuration bitstream was read back through this same JTAG port.
The upsets within the configuration memory (CRAM) and the BlockRAM memory (BRAM) were identified by comparing the readback bitstream obtained after irradiation against a golden bitstream (i.e., the pre-radiation readback bitstream). The readback bitstream is partitioned into two different regions: one region for the CRAM and another region for the BRAM. The number of upsets for the CRAM and BRAM is determined by performing a bit-wise compare of the corresponding readback bitstream section with the golden bitstream section.
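The bit-wise comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tooling: the function names are hypothetical, and the single byte offset `bram_start` that separates the CRAM region from the BRAM region is an assumption made for simplicity (a real readback file interleaves frames in a device-specific way).

```python
def count_upsets(golden: bytes, readback: bytes) -> int:
    """Count bit positions that differ between two equal-length bitstreams."""
    if len(golden) != len(readback):
        raise ValueError("bitstreams must have the same length")
    # XOR leaves a 1 wherever the two bytes disagree; count those 1 bits.
    return sum(bin(g ^ r).count("1") for g, r in zip(golden, readback))


def count_by_region(golden: bytes, readback: bytes, bram_start: int):
    """Count upsets separately in the CRAM and BRAM sections.

    bram_start is a hypothetical boundary: bytes before it are treated as
    CRAM, bytes from it onward as BRAM content.
    """
    cram = count_upsets(golden[:bram_start], readback[:bram_start])
    bram = count_upsets(golden[bram_start:], readback[bram_start:])
    return cram, bram


# Toy 4-byte "bitstreams": one flipped bit in the first half, two in the second.
golden = bytes([0b10101010, 0b11110000, 0b00001111, 0b00000000])
readback = bytes([0b10101011, 0b11110000, 0b00001111, 0b00011000])
cram_upsets, bram_upsets = count_by_region(golden, readback, bram_start=2)
print(cram_upsets, bram_upsets)
```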
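Once the numbers of CRAM and BRAM upsets were identified, the cross section could be computed. The static cross section for each architectural element (CRAM and BRAM) was estimated as
$$\sigma = \frac{N_{\mathrm{upsets}}}{\Phi \, N_{\mathrm{bits}}}$$
where N_upsets is the number of upsets observed in the element, Φ is the particle fluence delivered during the run (particles/cm^2), and N_bits is the number of bits in the element.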
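As a minimal sketch, assuming the conventional per-bit definition (observed upsets divided by the product of particle fluence and the number of bits tested), the calculation looks like this in Python. The function name and the example numbers are illustrative placeholders, not measured values from these tests.

```python
def static_cross_section(upsets: int, fluence: float, bits: int) -> float:
    """Return the per-bit static cross section in cm^2/bit.

    upsets  -- number of bit upsets observed in the memory region
    fluence -- particles per cm^2 delivered during the run
    bits    -- number of bits in the memory region under test
    """
    if fluence <= 0 or bits <= 0:
        raise ValueError("fluence and bits must be positive")
    return upsets / (fluence * bits)


# Placeholder inputs: 1000 upsets after 1e10 particles/cm^2 on a 16,404,480-bit memory.
sigma = static_cross_section(upsets=1000, fluence=1.0e10, bits=16_404_480)
print(f"{sigma:.3e} cm^2/bit")
```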

Static testing results
The static cross-section results are listed in table 1. This table includes the static cross-section estimates of the configuration memory (CRAM) and the block memory (BRAM) for each radiation environment.
There are several important observations to make from these results. First, there is a strong correlation in cross-section estimates from the various neutron test facilities. The results obtained by Xilinx at LANSCE during an independent test are very close to those obtained by the authors at LANSCE. Further, the results from the neutron testing at TSL closely match those obtained at LANSCE. Second, the BRAM cross section is lower than the CRAM cross section for all test configurations (on average the BRAM has a cross section 7% lower than CRAM). Third, the 180 MeV proton cross section is about 20% larger than the wide-spectrum neutron cross section. Fourth, the measured cross section at the H4 facility is 2.2× higher than the cross section measured at LANSCE. Although the reasons for this difference are not yet known, we suspect it is due to the presence of higher energy particles in H4, dosimetry calibration issues, thermal neutrons, and unaccounted particles in H4.

Configuration memory multi-bit upsets (MBU)
An important result from this test was the observation that many events caused multiple bit upsets within the memory cells of the device. These events, called multiple-bit upsets or MBUs, are very important because they can defeat single-bit correction fault mitigation methods such as error correction codes (ECC).
The configuration memory can be protected against single-bit upsets using the FrameECC primitive and an internal scrubbing controller. The configuration memory is organized as a series of 101 × 32-bit "Frames" (3232 bits). Within the frame is a 32-bit ECC word that can be used by the FrameECC primitive to identify the location within the frame of a single-bit upset. In addition, the FrameECC primitive can detect the presence of multiple bit upsets within a frame (but not the location).
Fortunately, the configuration bits of frames are interleaved so that physically adjacent configuration bits are in different configuration frames (see figure 2). If a single event causes an upset within two adjacent frames, the FrameECC mechanism can repair the upset individually in both frames. Because of this interleaving, individual MBU events can be interpreted in one of two ways: as an inter-frame MBU or an intra-frame MBU. An MBU event interpreted as an intra-frame MBU will consider an upset in adjacent bits of adjacent frames as separate, single-bit events. When viewed as an intra-frame MBU, the event of figure 2 is interpreted as two distinct single-bit upsets, one within Frame #0 and one within Frame #1. This interpretation is useful when analyzing the ability of the FrameECC to detect and correct single-bit upsets.

JINST 9 C01025
An MBU event interpreted as an inter-frame MBU will combine the upsets found in adjacent bits of adjacent frames and is viewed as a single, multiple-bit event. When viewed as an inter-frame MBU, the event of figure 2 is interpreted as a single, two-bit upset that occurred in adjacent Frames #0 and #1. This interpretation is useful in understanding the impact of ionizing radiation within a particular FPGA device or family.
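The two interpretations can be illustrated with a short Python sketch. It assumes each upset is recorded as a hypothetical `(frame, bit)` pair and that two upsets are physically adjacent when their frame indices and bit offsets each differ by at most one; the real physical layout of the device is proprietary, so this adjacency rule is only a stand-in.

```python
from collections import Counter


def intra_frame_counts(upsets):
    """Intra-frame view: number of upset bits seen in each affected frame."""
    return Counter(frame for frame, _bit in upsets)


def inter_frame_events(upsets):
    """Inter-frame view: merge adjacent upsets into single multi-bit events.

    Adjacency is an assumption: frame index and bit offset each differ by
    at most one between neighbouring upsets.
    """
    remaining = set(upsets)
    events = []
    while remaining:
        stack = [remaining.pop()]
        event = []
        while stack:
            frame, bit = stack.pop()
            event.append((frame, bit))
            neighbours = {(f, b) for (f, b) in remaining
                          if abs(f - frame) <= 1 and abs(b - bit) <= 1}
            remaining -= neighbours
            stack.extend(neighbours)
        events.append(sorted(event))
    return sorted(events)


# Two upsets in neighbouring frames at the same offset, plus one unrelated upset.
upsets = [(0, 17), (1, 17), (40, 900)]
per_frame = intra_frame_counts(upsets)
event_sizes = [len(event) for event in inter_frame_events(upsets)]
print(dict(per_frame))
print(event_sizes)
```

In the intra-frame view every affected frame holds a single upset, so frame-level single-bit correction applies to each; in the inter-frame view the first two upsets are counted as one two-bit event.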
The MBU events for the wide-spectrum neutron source are tabulated in table 2. The first data row reports observed data interpreted as intra-frame MBUs and indicates the percentage of observed events in which N upsets were observed in a frame (where N is specified by the column). For example, 90.1% of the observed events had only a single (N = 1) upset within a frame and 7.5% of the observed events had two upsets within a frame. These results suggest that 90.1% of the configuration upsets can be repaired with the internal FrameECC and 9.9% of the events will require correction from some external CRAM scrubbing architecture.
The second data row of the table interprets the same observed data as inter-frame MBUs and indicates the percentage of events in which N adjacent upsets were observed regardless of the frame boundaries. These results suggest that 65.0% of the events caused only one CRAM cell to upset while the other 35.0% of the events caused more than one CRAM cell to upset. A few "large" events (1.3%) were observed in which six or more CRAM cells were upset by a single particle. It is interesting to observe that these results are not monotonically decreasing: for reasons that are likely due to the proprietary layout of the device, it is more common to observe four-bit upsets than three-bit upsets. This data suggests that the 28 nm process technology used for the Kintex 7 is relatively susceptible to MBUs. Fortunately, internal CRAM frame interleaving reduces the impact of this behavior, and the internal scrubbing and the FrameECC mechanism can repair the majority of events.

ATLAS Liquid Argon upset estimation
To estimate the expected SEU rates in the LAr electronics, the SEU cross section as a function of hadron energy is required. However, our measurement yields only the cross section integrated over the white neutron energy spectrum provided by LANSCE-WNR. The measured SEU cross section is given by eq. (3.1), in which dn/dE is the hadron energy spectrum in neutrons·cm^-2·s^-1, E is the hadron energy in MeV, w(E) is the dependence of the SEU cross section on the hadron energy, σ_0 is the overall normalization, and E_0 is the onset energy for an SEU occurrence. The SEU rate, on the other hand, is calculated with eq. (3.2). The difference between the actual application and the measurements at LANSCE lies in the different energy spectra, dn/dE, which are available from measurements at LANSCE and from simulations for the LAr.
The two unknown quantities that prevent us from performing the evaluation are σ_0 and w(E). It is customary to assume that w(E) is a Weibull function,
$$w(E) = 1 - \exp\left[-\left(\frac{E - E_0}{W}\right)^{\alpha}\right],$$
where the three parameters are the onset energy E_0, the width W, and the power parameter α, and E is the hadron energy. These parameters depend on the device characteristics and are often sensitive to the transistor geometry, voltage, etc. In the absence of parameters for the Kintex 7 or a similar device, we extract them by fitting the interaction cross section of fast neutrons at LANSCE as measured by a Timepix device [9,10]. It is assumed that the energy cluster observed by the Timepix device, 230 keV, will cause an SEU. This procedure yields E_0 = 0.5 ± 0.4, W = 63.6 ± 4.6, and α = 0.986 ± 0.038. σ_0 can then be evaluated using eq. (3.1) with the experimentally measured energy spectrum at LANSCE-WNR. The expected rates for ATLAS can now be obtained from eq. (3.2) and are shown in table 3.
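The procedure can be sketched numerically in Python under stated assumptions. The sketch assumes the standard Weibull turn-on, w(E) = 1 − exp(−((E − E_0)/W)^α) above the onset energy, treats the measured cross section as the spectrum-weighted average of σ_0·w(E), and takes the rate as the integral of σ_0·w(E)·dn/dE over the target spectrum. The tabulated 1/E spectra, the placeholder measured cross section, and all function names are hypothetical; only the three Weibull parameters come from the text.

```python
import math

# Fitted Weibull parameters quoted in the text (units as given there).
E0, W, ALPHA = 0.5, 63.6, 0.986


def weibull(energy):
    """Assumed turn-on curve: 0 at or below onset, rising towards 1 above it."""
    if energy <= E0:
        return 0.0
    return 1.0 - math.exp(-(((energy - E0) / W) ** ALPHA))


def trapezoid(xs, ys):
    """Trapezoidal integral of tabulated ys over xs."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for x0, x1, y0, y1 in zip(xs, xs[1:], ys, ys[1:]))


def weighted_flux(energies, spectrum):
    """Integral of w(E) * dn/dE over the tabulated energy grid."""
    return trapezoid(energies, [weibull(e) * s for e, s in zip(energies, spectrum)])


def fit_sigma0(sigma_measured, energies, spectrum):
    """Solve the assumed spectrum-weighted average for the normalization."""
    return sigma_measured * trapezoid(energies, spectrum) / weighted_flux(energies, spectrum)


def seu_rate(sigma0, energies, spectrum):
    """Upsets per bit per second for a given target spectrum."""
    return sigma0 * weighted_flux(energies, spectrum)


# Hypothetical 1/E spectra on a 1..800 MeV grid; not facility or ATLAS data.
energies = [1.0 + 0.5 * i for i in range(1599)]
beam_spectrum = [1.0e4 / e for e in energies]
target_spectrum = [2.0e2 / e for e in energies]
sigma_measured = 1.0e-15  # placeholder value in cm^2/bit

sigma0 = fit_sigma0(sigma_measured, energies, beam_spectrum)
rate = seu_rate(sigma0, energies, target_spectrum)
print(f"sigma0 = {sigma0:.3e} cm^2/bit, rate = {rate:.3e} upsets/bit/s")
```

Because w(E) is at most 1, the fitted σ_0 is always at least as large as the spectrum-averaged cross section it is derived from.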
The first column shows the value obtained for σ_0 for CRAM and BRAM, the second column shows the expected rates using the procedure described above, and the third column shows a simple estimate formed as the product of the integrated flux of hadrons above 20 MeV and the SEU cross section measured at LANSCE-WNR. We estimate that the errors are 45% for the first column and 35% for the second column. These uncertainties arise mostly from the normalization provided by the LANSCE-WNR facility, the ATLAS simulations, and the fitting procedure. The similarity of the two rate estimates arises from the sharp turn-on of the Weibull curve obtained from the analysis of the Timepix data.
The flux within the detector is estimated to integrate 2 fb^-1 in 10 hours (5.56 × 10^-5 fb^-1/s). Multiplying this estimated flux by the estimated cross section in table 3 produces the following estimated upset rates: 1.1 × 10^-10 upsets/bit/s for CRAM and 9.06 × 10^-11 upsets/bit/s for BRAM (estimated accuracy of 50%).
The actual device upset rate (number of upsets per device per second) depends upon the device and the number of CRAM and BRAM cells within it. Table 4 summarizes the number of CRAM cells, the number of BRAM cells, and the device upset rate for both CRAM and BRAM. For example, the upset rate of the Kintex 7K325 device (the device used within our radiation tests) is 6.84 × 10^-3 CRAM upsets per second. This corresponds to approximately 1 CRAM upset every 150 seconds and 1 BRAM upset every 670 seconds.
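The arithmetic behind these figures can be checked with a short Python sketch. The BRAM bit count is an assumption derived from the 445 Block RAMs quoted in the introduction, taking each block as 36 Kb (which reproduces the quoted 16.4 Mb); the CRAM bit count appears only in table 4, so the quoted device-level CRAM rate is used directly. Variable and function names are illustrative.

```python
SECONDS_PER_HOUR = 3600

# Integrated luminosity rate: 2 fb^-1 accumulated in 10 hours.
lumi_per_second = 2.0 / (10 * SECONDS_PER_HOUR)  # fb^-1 per second

# Per-bit BRAM rate quoted in the text (upsets/bit/s).
bram_rate_per_bit = 9.06e-11

# Assumption: 445 blocks of 36 Kb each, 1 Kb = 1024 bits.
bram_bits = 445 * 36 * 1024
bram_device_rate = bram_rate_per_bit * bram_bits  # upsets/device/s

# Device-level CRAM rate quoted in the text for the 7K325.
cram_device_rate = 6.84e-3  # upsets/device/s


def mean_interval(rate_per_second: float) -> float:
    """Mean time between upsets, in seconds, for a constant upset rate."""
    return 1.0 / rate_per_second


print(f"luminosity rate: {lumi_per_second:.3e} fb^-1/s")
print(f"BRAM device rate: {bram_device_rate:.3e} upsets/s")
print(f"CRAM interval: {mean_interval(cram_device_rate):.0f} s")
print(f"BRAM interval: {mean_interval(bram_device_rate):.0f} s")
```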

The results from table 4 suggest that active SEU mitigation methods will be required to incorporate Kintex7 FPGAs within the ATLAS LAr detector. Active configuration scrubbing will be necessary to repair upsets within the configuration memory [11] and structural redundancy such as TMR [12] will be needed to preserve the functionality of the circuitry while the CRAM memory is temporarily upset. Further, FPGA circuit designs operating on FPGAs within the ATLAS LAr should employ the built-in BRAM ECC modes to protect the memory from these upsets.
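To illustrate why TMR preserves functionality while one copy of the logic is temporarily corrupted, here is a minimal sketch of a bit-wise majority voter in Python. It models only the voting step on integer words; the function name and the example values are hypothetical, and a real FPGA design would implement the voter in hardware on triplicated logic.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bit-wise 2-of-3 majority vote over three redundant copies of a word."""
    return (a & b) | (a & c) | (b & c)


# Three copies of the same 8-bit value; an upset flips one bit in the second copy.
good = 0b10110010
upset_copy = good ^ 0b00001000
voted = majority(good, upset_copy, good)
print(voted == good)
```

The vote masks an error confined to a single copy, which is why scrubbing must repair the upset before a second copy is hit in the same bit position.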

Conclusions
This paper describes the results of radiation tests performed on the Kintex7 in an effort to evaluate the suitability of deploying the Kintex7 within the ATLAS LAr Calorimeter. Several tests were conducted to estimate the SEU cross section of the device for a number of radiation environments, including the ATLAS environment. The results from these tests suggest that the Kintex7 will experience relatively frequent upsets within the CRAM and BRAM and that active SEU mitigation is needed. Fortunately, a number of previous efforts have demonstrated the ability to use FPGAs within such environments. The results of this work, along with the results of previous experiments with FPGAs, suggest that the estimated SEU rate of the Kintex7 within ATLAS is manageable and that previously demonstrated techniques will be able to mitigate the effects of these SEUs.
This work is the first step in a number of activities to validate the Kintex7 for use within ATLAS. Now that we have an estimate of the CRAM and BRAM upset rates, specific scrubbing architectures and mitigation approaches can be tested. Additional steps to validate the Kintex7 for the ATLAS LAr include the testing and verification of a configuration scrubber, the deployment of TMR, testing for single-event functional interrupts (SEFI), testing to measure the total-ionizing dose (TID) requirements, and testing of the multi-gigabit transceivers. The results from these tests will be reported in future publications.