Firmware implementation of a recurrent neural network for the computation of the energy deposited in the liquid argon calorimeter of the ATLAS experiment

The ATLAS experiment measures the properties of particles that are products of proton-proton collisions at the LHC. The ATLAS detector will undergo a major upgrade before the high luminosity phase of the LHC. The ATLAS liquid argon calorimeter measures the energy of particles interacting electromagnetically in the detector. The readout electronics of this calorimeter will be replaced during the aforementioned ATLAS upgrade. The new electronic boards will be based on state-of-the-art field-programmable gate arrays (FPGA) from Intel allowing the implementation of neural networks embedded in firmware. Neural networks have been shown to outperform the current optimal filtering algorithms used to compute the energy deposited in the calorimeter. This article presents the implementation of a recurrent neural network (RNN) allowing the reconstruction of the energy deposited in the calorimeter on Stratix 10 FPGAs. The implementation in high level synthesis (HLS) language allowed fast prototyping but fell short of meeting the stringent requirements in terms of resource usage and latency. Further optimisations in Very High-Speed Integrated Circuit Hardware Description Language (VHDL) allowed fulfilment of the requirements of processing 384 channels per FPGA with a latency smaller than 125 ns.



Introduction
The ATLAS experiment [1] at the Large Hadron Collider (LHC) [2] measures the properties of particles produced in proton-proton collisions at energies of several teraelectronvolts (TeV) and a collision frequency of 40 MHz. In the years 2026-2029 the LHC will undergo a major upgrade to increase its instantaneous luminosity by a factor of 5-7, leading to the High Luminosity LHC (HL-LHC). During the same period the ATLAS detector will be upgraded to cope with the increased luminosity of the HL-LHC; this upgrade is called the phase-II upgrade. The readout electronics of the ATLAS liquid argon (LAr) calorimeter will be replaced as part of the phase-II upgrade [3]. The new frontend boards will shape, sample, and digitise the electronic signal from the calorimeter at 40 MHz before sending the samples to the backend electronics through optical fibres. Figure 1 shows a typical pulse shape in the detector before and after the bipolar shaping. The new backend boards employ FPGAs to compute the energy deposited in the calorimeter from the samples received from the frontend boards. The computed energy is then sent to the trigger system at 40 MHz, and to the readout system at 1 MHz in case of a level-1 trigger accept decision.
The LAr calorimeter measures the energy of particles produced in LHC collisions. An excellent energy resolution and accurate detection of the energy-deposit time are crucial to enhance the ATLAS physics discovery potential at the HL-LHC. Currently, the transverse energy is computed using optimal filtering algorithms [4] that assume a nominal pulse shape of the electronic signal. Calorimeter electronic signals of up to 25 subsequent collisions overlap and distort the pulse shape. This increases the difficulty of the energy reconstruction and of the identification of the corresponding proton-proton bunch crossing. Up to 200 simultaneous proton-proton collisions are expected at the HL-LHC, which will lead to a high rate of overlapping signals in a given calorimeter channel. This will result in a significant degradation of the energy reconstruction, especially for small time gaps between two consecutive pulses, as discussed in [5]. Figure 2 shows a sequence of energy deposits in the calorimeter and the corresponding pulses simulated using AREUS [6]. The energy deposits correspond to events with a pileup of 140 collisions per bunch crossing. High energy deposits from hard-scatter events are simulated by adding flat random energy deposits between zero and five GeV, separated by 30 bunch crossings on average.

Figure 2. Sample sequence (black) of an EMB middle-layer cell located at a pseudorapidity η = 0.5125 and an azimuthal angle φ = 0.0125 within the ATLAS coordinate system, simulated by AREUS, together with the true transverse energy (E_T) deposits (red), at an average pileup of 140, as a function of the bunch crossing (BC) counter. The sample amplitudes are normalised to the value of the deposited energy in GeV.
Neural networks implemented in FPGAs have demonstrated enhanced object reconstruction and identification at trigger level in LHC experiments [7][8][9][10][11]. The use of neural networks improves the energy resolution in the ATLAS LAr calorimeter especially in the low time-gap region as shown in [5]. Both convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are shown to outperform the optimal filtering algorithm. The transverse energy reconstruction is performed in custom electronic boards based on the latest state-of-the-art FPGAs [3]. Each FPGA should reconstruct the energies of 384 independent channels. The computation is done on-the-fly at the collision frequency of 40 MHz. Furthermore, the energy reconstruction should be done within a latency of 125 ns. These requirements put very stringent constraints on the firmware implementation of the energy reconstruction algorithms. The implementation should balance serialisation of the channels to save FPGA computation resources, and parallelisation to reduce the implementation latency and keep up with the high bandwidth.
The electronic boards for the LAr calorimeter phase-II upgrade are currently under development and will use the Agilex [12] family of Intel FPGAs. A demonstrator board is already produced with Stratix 10 FPGAs [13]. In this article we consider the implementation of a vanilla RNN [14] algorithm in a Stratix 10 FPGA from Intel (part number 1SG280HU1F50E2VG). The implementation is carried out initially in HLS for fast prototyping and optimisation of the network parameters and the implementation architecture. After this initial implementation, VHDL is used for the final optimisation, ensuring that the requirements in terms of FPGA resource occupancy and latency are met.

Vanilla RNN
The neural network used in this article is the vanilla RNN with the sliding-window approach described in [5]. It is trained using Keras [15] with input simulated data from AREUS. At each bunch crossing, the network receives as input five consecutive samples of the electronic pulse of one channel of the LAr calorimeter and computes the corresponding deposited transverse energy. The five samples correspond to a time window of five bunch crossings. The computed transverse energy corresponds to a possible deposit at the second bunch crossing. Four of the samples are around the pulse peak generated by the deposited energy, while the first sample precedes the pulse, allowing the detection of overlapping pulses from past energy deposits.
The network implementation is composed of five consecutive RNN cells followed by a dense layer, as shown in figure 3. Each cell processes the input from one sample corresponding to one bunch crossing. Each of the cells is composed of six computation blocks, as shown in figure 4: two addition blocks, one multiplication of a scalar with a vector, one multiplication of a vector with a matrix, and one activation function. The weights of the network obtained from the training are stored in memory and are given as input to the RNN cells along with the electronic pulse samples. The same weights are used for each of the five network cells. The ReLU, described in equation 2.1, is used as the activation function for its simple implementation in FPGAs:

ReLU(x) = max(0, x).    (2.1)
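To make the data flow concrete, the following C++ sketch mirrors the computation described above (five cells, state size n = 8, shared weights, ReLU activation). It is only an illustrative model: the weight values are placeholders rather than the trained Keras weights, and the fixed-point arithmetic of the firmware is replaced here by doubles.

```cpp
#include <array>

// Sketch of the vanilla RNN inference: five cells with shared weights,
// state size n = 8, followed by a dense layer producing E_T.
constexpr int N = 8;      // state vector size
constexpr int CELLS = 5;  // one cell per input sample

using Vec = std::array<double, N>;
using Mat = std::array<Vec, N>;

double relu(double x) { return x > 0.0 ? x : 0.0; }

// One RNN cell: S_{t+1} = ReLU(R * S_t + W * x_{t+1} + B)
Vec cell(const Vec& s, double x, const Mat& R, const Vec& W, const Vec& B) {
    Vec out{};
    for (int i = 0; i < N; ++i) {
        double acc = W[i] * x + B[i];   // kernel weight and bias
        for (int j = 0; j < N; ++j)
            acc += R[i][j] * s[j];      // recurrent weight matrix
        out[i] = relu(acc);
    }
    return out;
}

// Full network: five cells followed by a dense layer.
double infer(const std::array<double, CELLS>& samples,
             const Mat& R, const Vec& W, const Vec& B, const Vec& dense) {
    Vec s{};                            // initial state is zero
    for (int t = 0; t < CELLS; ++t)
        s = cell(s, samples[t], R, W, B);
    double e = 0.0;
    for (int i = 0; i < N; ++i)
        e += dense[i] * s[i];
    return e;
}
```

The five `cell` calls correspond to the five cells of figure 3; in the firmware they are instantiated as parallel hardware blocks rather than executed sequentially.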

High-Level Synthesis Implementation
High-Level Synthesis (HLS) is a design process that takes as input a behavioural specification of a digital system and converts it to a register-transfer level (RTL) description realising the specified functions.
For the study presented in this article, the Intel HLS compiler [16] is used. It takes C++ code as input and produces RTL optimised for Intel FPGAs. It provides several macros, inlined in the C++ code, that allow control over the RTL implementation. The Intel HLS implements the standard C++ types but also defines several other types, in particular arbitrary-precision fixed-point representations.
Neural network cell computations are made of additions and multiplications. The aim of the firmware implementation is to reproduce with high precision the software computation results while keeping low resource usage in the FPGA and low latency.

Implementation of multiplications
Inside the vanilla RNN cell there is one vector multiplication and one matrix multiplication; the latter reduces to several vector multiplications, which in turn reduce to several scalar multiplications and scalar additions. The FPGA contains a dedicated component to perform scalar multiplications, called the Digital Signal Processing (DSP) block. The DSP can be used in three possible modes. The first mode performs one 32×32 bit multiplication in the floating-point representation. The second mode performs one 27×27 bit multiplication in the fixed-point representation. The third mode performs two independent 19×18 bit multiplications in the fixed-point representation. As explained in section 3.2, the number of multiplications available in the FPGA limits the number of calorimeter cells that can be handled by one FPGA. The third mode is thus chosen since it doubles the available dedicated multiplication resources of the FPGA.

Figure 4. Schematics of the operations performed inside each of the RNN cells. The recurrent weight multiplication multiplies the state vector from the previous cell (S_t) with the recurrent kernel weight matrix (R). Simultaneously, the LAr cell input (X_t+1) is multiplied by the kernel weight vector (W) and added to the bias weight (B). The results of the two above operations are added to create the internal vector T_t+1. The ReLU activation function is applied to the elements of T_t+1 to create the state vector S_t+1. The state vector size (n) is equal to 8.

Multiplexing
The number of multiplications inside the neural network depends on the size of the state vector (n) and the number of network cells (N_cells) following equation 3.1:

N_mult = (N_cells − 1) n² + (N_cells + 1) n.    (3.1)

The first term counts the recurrent matrix multiplications, which are not needed in the first cell since its input state vector is zero; the second term counts the kernel multiplication of each cell plus the final dense layer. For n = 8 and N_cells = 5 this gives 304 multiplications.
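As a cross-check, the count of equation 3.1 can be evaluated for the parameters used here (n = 8, five cells), under the assumption, consistent with section 4.1, that the first cell performs no recurrent matrix multiplication:

```cpp
// Multiplication count of the network as a function of the state size n
// and the number of cells, assuming the first cell skips the recurrent
// matrix multiplication (its input state vector is zero) and a dense
// layer of n multiplications closes the network.
int n_mult(int n, int cells) {
    int recurrent = (cells - 1) * n * n;  // R * S_t in cells 2..cells
    int kernel    = cells * n;            // W * x in every cell
    int dense     = n;                    // final dense layer
    return recurrent + kernel + dense;
}
```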

Implementation of arithmetic operations
Intel HLS implements two binary representations of numbers: a fixed-point representation and a floating-point representation. The implementation of the floating-point representation is relatively complex and uses more arithmetic and logic blocks for the addition operations. Furthermore, the DSP blocks of the Stratix 10 FPGA allow only one floating-point multiplication instead of two simultaneous fixed-point multiplications, as explained in section 3.1. The fixed-point representation is therefore chosen in order to minimise the resource usage in the FPGA.
The fixed-point representation implemented in Intel HLS follows the Algorithmic C (AC) datatypes defined by Mentor Graphics [17]. Four parameters define this representation: the total number of bits, the number of bits of the integer part, the quantisation type, and the treatment of the overflow. Three different data categories are defined for the vanilla RNN implementation: the input and output data, the neural network weights, and the intermediate data, which are the results of the internal computations inside the neural network blocks. A width of 19 bits is used for the internal and input/output categories, while 16 bits are used for the weights. This ensures an efficient use of the DSP resources inside the FPGA while providing very good compatibility between the firmware computation and the one performed in the Keras software with floating-point arithmetic operations. The firmware resolution, defined as the relative difference between the firmware and the Keras computed energies, is less than 0.1% with this choice of bit widths.
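The idea behind the fixed-point conversion can be illustrated with a minimal model: a signed value with f fractional bits stores round(x · 2^f). This is only a sketch of the principle; the actual `ac_fixed` types additionally implement the configurable overflow and quantisation behaviour discussed below.

```cpp
#include <cmath>
#include <cstdint>

// Minimal model of a signed fixed-point number with `frac` fractional
// bits: the stored integer is round(x * 2^frac). The Intel HLS ac_fixed
// types add configurable overflow and quantisation handling on top.
int64_t to_fixed(double x, int frac) {
    return static_cast<int64_t>(std::llround(x * std::ldexp(1.0, frac)));
}

double to_double(int64_t q, int frac) {
    return static_cast<double>(q) * std::ldexp(1.0, -frac);
}
```

With rounding, the conversion error is bounded by half an LSB, i.e. 2^−(f+1); choosing the split between integer and fractional bits fixes the trade-off between range and resolution.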
The overflow treatment defines how the bits to the left of the most significant bit (MSB) are handled when a value exceeds the representable range. The number of bits and the position of the radix point are chosen to be able to represent the maximum value that can occur in the network. Thus, no saturation detection is needed and a simple drop-of-bits implementation is used.
The quantisation defines how the bits to the right of the least significant bit (LSB) are lost. The loss of bits can occur during the conversion of the floating-point values representing the inputs and weights. The inputs to the neural network are given by the AREUS simulation while the weights are provided by Keras. Both use 32-bit signed floating-point representations that need to be converted to a fixed-point representation in order to be used in the firmware. The loss of bits can also occur inside the internal computations of the neural network: the 37-bit output of the DSP is reduced to the number of bits used internally. Two types of quantisation are implemented in the Intel HLS compilation: truncation and rounding. Each of these types possesses different subtypes, which are explained in [17]. Figure 5 shows a comparison of the resource usage, in terms of DSPs, arithmetic lookup tables (ALUT), flip-flops (FF), and random access memory (RAM), and of the firmware latency for different implementations of the quantisation. The same quantisation is applied to all categories of data of the network. Two modes stand out: the default truncation (TRN) and the default rounding (RND), which use fewer resources and have lower latency than the other quantisation modes. Figure 6 shows a comparison of the transverse energy computed in firmware and the one computed in software, with a full floating-point implementation, for the different quantisation modes. All quantisation modes show similar resolution with the exception of the TRN quantisation modes, which have large tails. The RND mode gives a good compromise among resource usage, latency, and resolution.
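The difference between the two default modes can be seen on a two's-complement value when its k least significant bits are dropped: plain truncation always rounds toward negative infinity and so introduces a systematic bias, while rounding adds half an LSB before the shift. The following is a minimal sketch of the two behaviours, not the Intel HLS implementation itself:

```cpp
#include <cstdint>

// Dropping the k least significant bits of a signed value.
// TRN: plain arithmetic shift, rounds toward negative infinity.
// RND: add half an LSB before shifting, rounds to nearest.
int64_t drop_trn(int64_t v, int k) { return v >> k; }
int64_t drop_rnd(int64_t v, int k) {
    return (v + (int64_t{1} << (k - 1))) >> k;
}
```

The TRN bias of, on average, half an LSB per operation is what produces the tails seen in figure 6 when many truncations accumulate through the network.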
Figure 6. Comparison of the firmware and software transverse energies for the different quantisation modes. In each mode, the same quantisation is applied to all categories of the data of the network. The different quantisation modes are described in [17]. The different RND modes give very similar results and their corresponding curves overlap. A lower cut of 240 MeV is applied on E_T (software) to remove low energies below the 3σ noise level as described in [5].

To further optimise the firmware implementation, a mix of quantisation procedures is used for the different data categories. Figure 7 shows the resource usage and the latency when applying the TRN and RND quantisations to different data categories. Rounding the weights does not require any additional resources in the FPGA since it can be done in software before loading the weights into the FPGA. The input data will be digitised and quantised in the frontend boards and does not require additional resources for rounding in the FPGA; the rounding of the simulated input data is thus also done offline before loading it into the FPGA. Rounding the output data induces a slight increase in latency. Rounding the internal data category increases the needed resources and the latency significantly. Figure 8 shows a comparison of the transverse energy computed in firmware and the one computed in software depending on which of the data categories is rounded. For each test, the letters I, W, and D indicate that RND is applied to the internal data category, the weights, and the input/output data, respectively, while the TRN mode is applied by default to all other categories.

One can see that it is important to round the weights and the input/output categories, while rounding the internal data category does not have any significant impact on the resolution. The root mean square (RMS) of the TRN distribution is 0.2%. It decreases to 0.07% if all categories are rounded (RND_IWD). The RMS becomes slightly worse (0.09%) if only the weights and the input/output data are rounded (RND_WD). Therefore, the TRN mode is used for the internal data computation while RND is used for the other data categories. These optimisations significantly improve the firmware resolution at a low resource and latency cost.

Implementation of the neural network
The computation inside the neural network requires 304 multiplications and 231 additions. The multiplications are implemented inside the DSPs, and the additions are initially implemented using ALUTs and FFs. The DSP allows summing the outputs of its two multiplications internally; using this functionality reduces the number of additions implemented in ALUTs and FFs from 231 to 131. Furthermore, the DSP contains an additional adder that can take an external input. It is possible to chain two DSPs to sum their outputs by feeding the output of the first DSP into the additional adder of the second DSP. When doing so, the DSPs must be synchronised so that their results arrive at the same time at the additional adder of the second DSP: the second DSP in the chain should be shifted by one clock cycle. The same procedure is applied to more DSPs to build a full chain, in which case all additions can be implemented in DSPs.
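The synchronisation constraint can be illustrated with a cycle-level sketch of a chain of registered adders, one per DSP: stage k adds its own product to the registered partial sum of stage k−1, so its operand must be presented k cycles late. The code below is an illustrative model of this timing behaviour, not the firmware itself.

```cpp
#include <vector>

// Cycle-level model of chained DSP adders: each cycle, stage k computes
// reg[k] = reg[k-1] + product, using last cycle's reg[k-1]. The operand
// of stage k is read with a k-cycle delay, modelling the delay lines
// (FIFOs) described in the text.
std::vector<long> run_chain(const std::vector<std::vector<long>>& products,
                            int cycles) {
    const int stages = static_cast<int>(products.size());
    std::vector<long> reg(stages, 0);
    std::vector<long> out;  // value of the last register each cycle
    for (int c = 0; c < cycles; ++c) {
        // Update back to front so each stage sees last cycle's value
        // of the previous register, as real pipeline registers do.
        for (int k = stages - 1; k >= 0; --k) {
            long prev = (k == 0) ? 0 : reg[k - 1];
            int idx = c - k;  // k-cycle input delay for stage k
            long in = (idx >= 0 && idx < static_cast<int>(products[k].size()))
                          ? products[k][idx]
                          : 0;
            reg[k] = prev + in;
        }
        out.push_back(reg[stages - 1]);
    }
    return out;
}
```

After an initial latency equal to the chain length, one complete sum emerges per clock cycle; without the per-stage delays, products belonging to different data sets would be added together.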
To perform the timing shift, each input of the DSPs needs an additional level of registers to delay the data. This is done by combining several ALUTs to create a memory logic array block (MLAB) implementing a first-in first-out (FIFO) memory. However, the MLAB frequency is limited to 450 MHz in the read-during-write mode needed for the FIFO implementation. At higher frequencies the synchronisation cannot be implemented in MLABs; in that case the delays are implemented in basic ALUTs and FFs, which increases the number of needed logic elements significantly. At high frequency, more ALUTs and FFs are needed to synchronise the DSPs in chained mode than to implement the additions in non-chained mode. Figure 9 shows the FPGA resource usage as a function of the frequency for DSPs used in chained or non-chained mode for one matrix multiplication. The chained mode is advantageous below 450 MHz. Above this frequency, it is more advantageous to use logic elements for the additions than to chain DSPs. This is the option retained for the implemented RNN, since higher frequencies are sought to increase the multiplexing and thus reduce the overall resource usage.

Results of the HLS Implementation
The designed RNN HLS code is compiled with Intel HLS and Quartus [18]. The results of these two compilations are given in table 1. For one implemented network the design can run at 455 MHz, which a priori allows multiplexing up to 11 channels per network for a data input rate of 40 MHz. However, with a multiplexing of 10 the maximum frequency reached is 393 MHz, while a frequency of 400 MHz is needed for the firmware to run without timing violations. The maximum frequency is reduced when applying multiplexing because of the additional weights needed by the network to perform the computation for several channels. Up to 37 networks can fit in the FPGA, which leads to the usage of 100% of the available DSP resources and most of the logic resources. The logic resources are given in terms of ALUTs and FFs in the HLS report and of adaptive logic modules (ALM) in the Quartus report. The ALMs are the actual physical resources in the FPGA; each ALM can be configured as two ALUTs or four FFs. The HLS design does not allow reaching the required 384 channels processed in one FPGA, even with full utilisation of the FPGA resources. In practice only part of these resources will be available for the transverse energy computation, as discussed in section 4.4. Additional optimisations are needed to fulfil the requirements of the LAr calorimeter. Furthermore, the latency of this HLS firmware is 277.5 ns, significantly larger than the required 125 ns. The HLS design was optimised for the maximum possible frequency, which leads to additional registers added by the compiler to meet the timing constraints; this in turn increases the latency of the design.
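The multiplexing arithmetic above is simple but worth making explicit: serving m channels at the 40 MHz bunch-crossing rate requires a clock of m × 40 MHz, so the achievable multiplexing is the integer part of f_max / 40 MHz.

```cpp
// Multiplexing m channels at a 40 MHz input rate requires a clock of
// m * 40 MHz; conversely, the achievable multiplexing is f_max / 40.
int required_mhz(int multiplexing) { return multiplexing * 40; }
int max_multiplexing(int f_max_mhz) { return f_max_mhz / 40; }
```

This shows why the HLS design falls short: a multiplexing of 10 requires 400 MHz, above the 393 MHz the multiplexed design actually achieves.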

VHDL Implementation
The HLS implementation adds a level of abstraction that allows fast and efficient optimisation of the network parameters and of the firmware implementation. However, this additional level of abstraction prevents some finer optimisations that are possible in VHDL. These finer optimisations make it possible to meet the specifications, which the HLS implementation cannot. VHDL is therefore used for the final optimisation and placement of the RNN firmware implementation.

Reuse of common computations between RNN cells
As shown in figure 3, every cell applies the same kernel weights and bias, and the sliding-window readout means that each input sample is processed by each of the five cells in consecutive bunch crossings. The kernel part of the computation therefore needs to be evaluated only once per sample, in the first cell, and its result can be passed on, with the appropriate delays, to the four other cells; this is why the first cell is connected to all other cells. In addition, the input state vector of the first cell is zero, so the first cell does not need a recurrent block.

Placement constraints
Several instances of the neural network are needed to process all 384 channels. In a given compilation each instance has a different placement shape. Moreover, these shapes change between compilations due to the randomisation in Quartus. This complicates the optimisation of the timing-critical paths that is needed to reach higher frequencies and thus higher multiplexing.
Placement constraints are used to force the same placement shape for every instance of the implemented neural network, which simplifies the optimisation of the critical paths. Moreover, the placement of the five network cells is optimised to minimise the distance between connected cells. The shape of the neural network is shown in figure 10. The first cell is placed in the middle since it is connected to all four other cells, as described in section 4.1. The other cells are placed around the first cell and ordered to reduce the distance between consecutive cells. The dense layer is placed next to the fifth cell.
All cells use the same set of recurrent weights except for the first cell, which does not have a recurrent block as explained in section 4.1. These weights are stored in memory (M20K blocks) and are directly connected to the DSPs performing the matrix multiplications. To reduce the mean distance between the M20K blocks and each DSP, the recurrent weights are duplicated: each set of weights serves two cells instead of four.
These placement constraints allow increasing the maximum frequency from 434 MHz to 492 MHz with 28 instances of the RNN implemented in the FPGA.

Incremental compilation
Quartus compilation is composed of three parts: Analysis and Synthesis, Fitting, and Timing Analysis. The Analysis and Synthesis step translates the VHDL code into RTL, a high-level representation of the circuit. The Fitter places and routes the design into the FPGA, determining the required resources and the wiring between the different components. The Timing Analysis determines the maximum frequency that the FPGA can reach with a given design.
Quartus can divide a firmware design into multiple partitions. It also provides the possibility to preserve the results of a given compilation at different steps of the process for each partition. To further increase the maximum frequency, the firmware design is partitioned such that each partition corresponds to one neural network instance. A sequence of compilations is performed, and each partition that reaches the target frequency is preserved while the others are recompiled. Several combinations of target frequencies and numbers of neural network instances were tested. We converged on a configuration with a target frequency of 560 MHz and 28 instances of the RNN. This allows running the RNN with a multiplexing of 14 and reaching the required number of channels processed in one FPGA. Four compilations are needed to reach the target frequency in this configuration, while 18 of the 28 instances reach the target frequency at the first compilation.

Results of the VHDL implementation
The final firmware implemented in VHDL contains 28 instances of the vanilla RNN, each with a multiplexing of 14, which allows covering 392 channels. The results from this firmware are summarised in table 2. This firmware can run at 561 MHz, while the required frequency for a multiplexing of 14 is 560 MHz. The final requirements on the resource usage are not known yet since the full final phase-II firmware is not yet available. However, we require a resource usage of the neural network block similar to that of the transverse energy computation block used for the phase-I upgrade [19] of the LAr calorimeter, which is currently in operation. This block uses about 70% of the DSPs and 30% of the logic. The optimised RNN resource usage is within these requirements, as shown in table 2. The latency of the firmware is 65 clock cycles, corresponding to 116 ns, which is within the required 125 ns. The optimisations done in VHDL do not affect the results of the energy computation; therefore, the firmware resolution stays below 0.1%, as for the HLS implementation.
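The headline numbers can be cross-checked directly: 28 instances with a multiplexing of 14 cover 392 ≥ 384 channels, and 65 clock cycles at 560 MHz correspond to about 116 ns, inside the 125 ns budget.

```cpp
// Channel coverage and latency of the final VHDL design.
int channels(int instances, int multiplexing) {
    return instances * multiplexing;
}

double latency_ns(int cycles, double clock_mhz) {
    return cycles * 1000.0 / clock_mhz;  // one cycle = 1000/f_MHz ns
}
```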

Conclusion
The phase-II upgrade of the LAr calorimeter provides a unique opportunity to implement neural networks in FPGAs in order to improve the computation of the transverse energy deposited in the calorimeter. This report presents the implementation of a vanilla RNN in Stratix 10 FPGAs from Intel. The firmware is first developed in HLS for fast prototyping and optimisation of the network architecture. The implemented network in firmware matches the computation of the deposited transverse energy in software with a resolution better than 1‰ after the optimisation of the bit width of the fixed-point representation and of the quantisation of the arithmetic operations. However, the HLS implementation could reach neither the frequency required to implement 384 channels per FPGA nor the required latency. VHDL is used to further optimise the HLS implementation and to add placement constraints, making it possible to reach the required specifications of the firmware. The final result is a firmware containing 28 instances of the RNN capable of running at 560 MHz with a multiplexing of 14 and a latency of 65 clock cycles (116 ns). This firmware requires less than 70% of the available DSPs and less than 20% of the available logic elements in the Stratix 10 FPGA.
Acknowledgements

This work would not have been possible without the support of the Institut für Kern- und Teilchenphysik of the Technische Universität Dresden.