Use of FPGA embedded processors for fast cluster reconstruction in the NA62 liquid krypton electromagnetic calorimeter

The goal of the NA62 experiment at the CERN SPS is to measure the branching ratio of the very rare kaon decay K+→π+ ν ν̄ with 10% accuracy by collecting 100 events in two years of data taking. An efficient photon veto system is needed to reject the K+→π+ π0 background, and a liquid krypton electromagnetic calorimeter will be used for this purpose in the 1-10 mrad angular region. The L0 trigger system for the calorimeter consists of a peak reconstruction algorithm implemented on FPGAs using a mixed parallel architecture based on soft-core Altera NIOS II embedded processors together with custom VHDL modules. This solution allows an efficient and flexible reconstruction of the energy-deposition peak. In total, the system will comprise 36 TEL62 boards, 108 mezzanine cards and 215 high-performance FPGAs. We describe the design, the current status and the results of the first performance tests.

Figure: Schematic view of the NA62 detector.
The Level 0 (L0) trigger algorithm is based on a few sub-detectors (the charged hodoscope, the muon detector, the liquid krypton electromagnetic calorimeter and possibly the large-angle vetoes) and is performed by dedicated custom hardware modules, with a maximum output rate of 1 MHz and a maximum latency of 1 ms.
The data from each sub-detector, except the Liquid Krypton (LKr) calorimeter, are sent to a farm of PCs where the Level 1 (L1) and Level 2 (L2) software triggers are performed. L1 algorithms run on the data of individual detectors. A positive L1 decision triggers the readout of the calorimeter data (which are kept in memories until then) and, subsequently, L2 algorithms are executed on the complete event. The L1 trigger has a maximum output rate of 100 kHz and a total latency of 1 s, while the L2 trigger has an output rate of the order of 15 kHz with a maximum total latency equal to the basic data-taking time unit, the period of the SPS beam-delivery cycle.

The Liquid Krypton electromagnetic calorimeter
In order to suppress the background from the K+→π+π0 decay, an efficient photon veto system is foreseen. The NA48 electromagnetic calorimeter [6] is used in the 1-10 mrad angular region. This calorimeter is a quasi-homogeneous ionization device that uses liquid krypton as the active medium and is characterized by excellent time and energy resolution.
The Liquid Krypton (LKr) calorimeter will be read out by the new Calorimeter REAdout Modules (CREAMs) [7], which will provide 40 MHz 14-bit sampling for all 13248 calorimeter readout channels, data buffering, optional zero suppression and programmable trigger sums for the L0 LKr calorimeter trigger processor.

The Liquid Krypton Level 0 trigger
The L0 LKr electromagnetic calorimeter trigger (figure 2) identifies electromagnetic clusters in the calorimeter and prepares a time-ordered list of reconstructed clusters together with the arrival time, position, and energy measurements of each cluster. Information on reconstructed clusters is used to veto decays with more than one cluster in the LKr calorimeter.
The trigger processor also provides a coarse-grained readout of the LKr calorimeter that can be used in software triggers and off-line as a cross-check for the CREAM high-granularity readout.

Trigger algorithm
The trigger algorithm is based on energy deposits in tiles of 16 calorimeter cells, which are available from the CREAM readout boards. The electromagnetic cluster search is executed in two steps with two one-dimensional (1D) algorithms (figure 3).
The calorimeter is divided into slices parallel to the vertical axis. In the first step, peaks in space and time are searched for independently in each slice with a 1D algorithm. In the second step, different peaks that are close in time and space are merged and assigned to the same electromagnetic cluster.
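The second step can be illustrated with a minimal Python sketch, offered only as an illustration and not as the firmware implementation. The Peak fields, the greedy time-ordered merging strategy and the three matching windows (max_dt_ns, max_dslice, max_dpos) are all assumptions introduced here; the text does not specify them.

```python
from dataclasses import dataclass


@dataclass
class Peak:
    slice_index: int  # vertical slice in which the 1D search found the peak
    position: float   # coordinate along the slice, in tile units (assumed)
    time_ns: float    # reconstructed peak time
    energy: float     # reconstructed peak energy


def merge_peaks(peaks, max_dt_ns=5.0, max_dslice=1, max_dpos=1.0):
    """Greedily group peaks that are close in time and space.

    Peaks are visited in time order; each peak joins the first existing
    cluster that already contains a peak within all three windows,
    otherwise it starts a new cluster. The window values are hypothetical.
    """
    clusters = []
    for p in sorted(peaks, key=lambda q: q.time_ns):
        for cluster in clusters:
            if any(
                abs(p.time_ns - q.time_ns) <= max_dt_ns
                and abs(p.slice_index - q.slice_index) <= max_dslice
                and abs(p.position - q.position) <= max_dpos
                for q in cluster
            ):
                cluster.append(p)
                break
        else:
            clusters.append([p])
    return clusters


def cluster_energy(cluster):
    """Total energy of a cluster, taken as the sum of its peak energies."""
    return sum(p.energy for p in cluster)
```

For example, two peaks in adjacent slices 1 ns apart end up in one cluster, while a third peak 300 ns later forms its own cluster. A real implementation would also need a policy for peaks that bridge two existing clusters, which this sketch ignores.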

Trigger processor implementation
The main parameters driving the design of the processor are the expected high instantaneous hit rate (30 MHz), the required single cluster time resolution (1.5 ns) and a maximum allowed latency of 100 µs from detector hit generation to trigger primitives output to the L0 trigger processor.
The processor is a three-layer parallel system, composed of Front-End and Concentrator boards, both based on the 9U TEL62 cards [8,9] equipped with custom dedicated mezzanines (figure 4).
The LKr L0 trigger continuously receives from the LKr readout modules (CREAMs) 864 trigger sums, each one corresponding to a tile of 4 × 4 calorimeter cells. Data transmission from the [...]
The processor input stage is composed of 28 Front-End boards; each Front-End board receives 32 trigger sums as 16-bit tiles at 40 MHz from two TELDES mezzanines (figure 5). Each board performs a peak search in space and computes time, position and energy for each detected peak. In order to extract timing information at the ns level, a parabolic interpolation in time around the sample maximum and a digital constant fraction discrimination are performed after the peak search algorithms. Information on reconstructed peaks is transferred from the Front-End boards to the Concentrator boards on low-latency, high-bandwidth dedicated trigger links. Raw data received by the readout modules are also stored in latency memories, to be read out upon request.
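As an illustration of the parabolic interpolation in time, the following Python sketch fits a parabola through the maximum sample and its two neighbours and returns the vertex time and amplitude. The function name and the three-point form are assumptions made for this sketch; the 25 ns default sample period follows from the 40 MHz sampling quoted above.

```python
def parabolic_peak(y_prev, y_max, y_next, t_max_ns, sample_period_ns=25.0):
    """Three-point parabolic interpolation around a sample maximum.

    y_prev, y_max, y_next are consecutive samples with y_max the largest;
    t_max_ns is the time of the y_max sample. Returns (peak time, amplitude).
    """
    curvature = y_prev - 2.0 * y_max + y_next
    if curvature >= 0:
        # Flat or non-peaked triplet: fall back to the sample itself.
        return float(t_max_ns), float(y_max)
    # Vertex position in units of the sample period, relative to y_max.
    offset = 0.5 * (y_prev - y_next) / curvature
    t_peak = t_max_ns + offset * sample_period_ns
    amplitude = y_max - 0.25 * (y_prev - y_next) * offset
    return t_peak, amplitude
```

For samples taken from an exact parabola the vertex is recovered exactly; for a real pulse shape the result is only an approximation whose bias depends on the shape.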
The Concentrator board receives trigger data from up to 8 Front-End boards and combines peaks detected by different Front-End boards into a single cluster. Overlap between neighbouring Concentrators is foreseen to guarantee that each cluster will be fully contained in at least one Concentrator board, with proper logic to avoid double counting. The reconstructed clusters are also stored in latency memories, to be read out upon request. Eight Concentrator boards equipped with 24 custom mezzanines are foreseen in the whole system.
High-speed, low-latency trigger data transmission from the Front-End to the Concentrator boards is performed by dedicated mezzanines (Trigger and Readout TX mezzanines and Trigger RX mezzanines, see figures 4 and 6).
The Trigger and Readout TX mezzanines transmit up to 4.8 Gbps (48 bits at 100 MHz) over halogen-free individually shielded twisted pairs using the DS90CR485 serializer. The Trigger RX mezzanines receive and deserialize data using the DS90CR486 deserializer.
Readout data is transmitted over two standard gigabit Ethernet cables using an Altera IP MAC core together with an external PHY.

Embedded processors for trigger logic
Highly selective L0 triggers traditionally require a careful implementation in dedicated high-speed logic. FPGA-based design is a common choice that allows some degree of flexibility, but it remains far from the quick development, test and update cycle possible in software. Additionally, development effort is often concentrated where timing performance is not crucial.
The L0 trigger of the NA62 LKr calorimeter is implemented with a combination of custom logic on Altera Stratix III FPGAs tightly coupled with NIOS II embedded processors [10]. The NIOS II used here is the "fast" version, aimed at high-performance applications. It allows operation at 250 MHz and above (240 MHz is used in this work) with performance over 300 MIPS, and it is optimized for performance-critical applications as well as applications with large amounts of data.
Software written in standard C implements part of the peak-reconstruction algorithm. This makes it possible to tune the split between software and hardware in terms of execution time, development time and validation time. Higher performance is also easily achievable by using a multiprocessor architecture.
The entire architecture fits well inside the Altera Stratix III FPGA used (EP3SL110) (see figure 7). The code running on the NIOS II processor has been optimized to reduce the size of the processor's on-chip instruction RAM (e.g. it can fit in M9K memory blocks instead of using the scarcer M144K blocks).
Figure: The peak reconstruction algorithm. Left: peak-finder logic, implemented in VHDL, is a pipelined stage performing peak recognition with the criteria peak in time, peak in space and over threshold. Right: NIOS II software-based parabolic fit and fine estimation of the peak rising time.

Performance tests
We present the results of the first tests aimed at verifying that the designed architecture meets the performance requirements for the L0 trigger processing of the NA62 LKr calorimeter.
The architecture of the test, performed on a single Pre-Processing FPGA, is shown in figure 8. It has been designed to test the stand-alone trigger processor by providing dummy calorimeter data from an internal memory. The Experiment Control System (ECS) of NA62 is a standardized system to access firmware registers, FIFOs and RAMs from a PC platform, implemented through the PCI interface of the Credit-Card PC on board the TEL62. In order to control the tests and access the results, the ECS has been connected to the memory with input data, the configuration registers and the performance counters system. The mux in figure 8 sends the input data, coming either from the test memory or from the TELDES, to the processing firmware, making it possible to switch between real data and dummy data at any moment. The data are 8 channels of 16-bit ADC values at 40 MHz. The pipelined peak-finder logic (VHDL) performs, for each of the 8 input channels, a peak recognition based on the criteria peak in time, peak in space and over threshold, as shown in the upper part of figure 9. Peaks are identified on each tile with the two vertical neighbors¹ and on 4 consecutive time slices. The peak is therefore fully described by 240 bits.
Figure 10. In red, the normalized distribution of the NIOS II processing time for the peak-fitting algorithm with a sample of 1000 events. The various colors show how I/O and different mathematical operations contribute to the total processing time. Vertical lines indicate worst cases.
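The three peak-finder criteria (peak in time, peak in space, over threshold) can be sketched in software as follows. This Python version is only a behavioural illustration, not the pipelined VHDL: it compares each sample with the previous and next time samples rather than modelling the 4-slice window, treats missing neighbours at the edges as zero, and breaks ties with asymmetric comparisons. All of these choices, and the function name, are assumptions.

```python
def find_peaks(samples, threshold):
    """Return (time_index, channel) pairs satisfying all three criteria.

    samples[t][ch] is the ADC value of channel ch at time sample t.
    """
    n_t = len(samples)
    n_ch = len(samples[0]) if n_t else 0
    peaks = []
    for t in range(1, n_t - 1):
        for ch in range(n_ch):
            v = samples[t][ch]
            # Criterion 1: over threshold.
            if v <= threshold:
                continue
            # Criterion 2: peak in time (strictly above the previous
            # sample, not below the next one).
            in_time = samples[t - 1][ch] < v and v >= samples[t + 1][ch]
            # Criterion 3: peak in space against the two vertical
            # neighbours; channels outside the range count as zero.
            below = samples[t][ch - 1] if ch > 0 else 0
            above = samples[t][ch + 1] if ch + 1 < n_ch else 0
            in_space = below < v and v >= above
            if in_time and in_space:
                peaks.append((t, ch))
    return peaks
```

The asymmetric comparisons ensure that a plateau shared by two samples or two channels is reported only once.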
Peak data enter a load-balancing logic block that delivers the data to four NIOS II cores, which perform a parabolic fit and a fine estimation of the rising time of the peak (see right part of figure 9). In this first test we chose to implement a simple round-robin scheduling algorithm: the first peak goes to the first NIOS II, the second peak to the second NIOS II and so on, returning to the first NIOS II for the fifth peak.
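The round-robin assignment described above amounts to the following, shown as a short Python sketch. The function name and the use of per-core lists as queues are illustrative assumptions; in the firmware the distribution is done by the load-balancing logic block.

```python
from itertools import cycle


def round_robin_dispatch(peaks, n_cores=4):
    """Assign peaks to cores in strict rotation: 0, 1, ..., n_cores-1, 0, ..."""
    queues = [[] for _ in range(n_cores)]
    for core, peak in zip(cycle(range(n_cores)), peaks):
        queues[core].append(peak)
    return queues
```

With five peaks and four cores, the first core receives the first and fifth peaks and every other core receives one. Round-robin ignores how busy each core is, which is adequate when the processing time per peak is nearly constant.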
The performance was measured with counters that record the processing time of the NIOS II cores. As shown in the normalized distributions in figure 10, multiple measurements have been performed with different programs running on the NIOS II: the complete peak reconstruction or, in addition to the I/O operation, different and increasingly complex mathematical operations. Results agree with expectations, such as a higher cost for division compared to other operations. The algorithm has therefore been designed to minimize its computational cost: e.g. the fine-time reconstruction of the peak rising time is calculated by constructing a linear approximation between the two data samples on the rising edge of the peak and by finding the time at which it crosses a threshold level (a fraction of the peak value). The width of the distributions corresponds to a variation in the algorithm latency, to be attributed to the bit-banging technique used, in the current implementation, to interface control signals between the NIOS II and the external on-chip logic. The distributions show no tails, so the maximum (worst-case) processing time for this test can be determined to be 139 clock cycles. Considering the 240 MHz NIOS II system frequency, this is equivalent to a 1.9 MHz processing rate per core, hence a total of 7.6 MHz with 4 cores processing in parallel.
¹ The remaining horizontal dimension is handled on the Concentrator boards, not included in this test.
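The fine-time estimate by linear approximation on the rising edge can be sketched as follows in Python. This is an illustration under assumptions, not the NIOS II C code: the function name, the default fraction of 0.5 and the backward search for the two bracketing samples are all hypothetical, and a positive peak value is assumed.

```python
def fine_time(samples, peak_index, fraction=0.5, sample_period_ns=25.0):
    """Time at which the rising edge crosses fraction * peak value.

    The crossing is found by linear interpolation between the two
    rising-edge samples that bracket the threshold level. Returns None
    if the rising edge is not contained in the sample window.
    """
    level = fraction * samples[peak_index]
    i = peak_index
    # Walk back from the peak until the previous sample is below the level.
    while i > 0 and samples[i - 1] >= level:
        i -= 1
    if i == 0:
        return None
    low, high = samples[i - 1], samples[i]
    # One division per peak; the rest is additions and multiplications.
    return (i - 1 + (level - low) / (high - low)) * sample_period_ns
```

For the samples [0, 20, 120, 200, 150] with the peak at index 3, the 50% level of 100 is crossed between samples 1 and 2, at 45 ns from the first sample.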

Discussion
Performance results must be conservatively compared with the rate of incoming peaks from the calorimeter, that is, the output rate of the peak-finder logic. We therefore considered the maximum instantaneous hit rate on the LKr calorimeter of 30 MHz and assumed that each hit produces a wide cluster of 256 calorimeter cells. Using simulations to estimate the spatial non-uniformity of the peak rate, we estimated, for the tiles read by a Pre-Processing FPGA, a worst-case peak rate of 4.2 MHz in the calorimeter center. This is significantly smaller than the measured capability of 7.6 MHz. The proposed architecture for the LKr L0 trigger can therefore sustain the estimated worst-case incoming peak rate of 4.2 MHz.
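The comparison above is simple arithmetic, sketched here in Python with hypothetical helper names. Note that dividing the 240 MHz clock by the 139-cycle worst case gives about 1.73 MHz per core (about 6.9 MHz for four cores), slightly below the quoted 1.9 MHz and 7.6 MHz; the quoted values may rest on a different cycle count or accounting than this sketch assumes. With either figure the capacity exceeds the 4.2 MHz worst-case demand.

```python
def per_core_rate_mhz(clock_mhz, cycles_per_peak):
    """Peaks per microsecond one core can process at a fixed cycle cost."""
    return clock_mhz / cycles_per_peak


def headroom(capacity_mhz, demand_mhz):
    """Ratio of processing capacity to incoming peak rate (above 1 is safe)."""
    return capacity_mhz / demand_mhz


# Figures quoted in the text: 7.6 MHz capacity, 4.2 MHz worst-case demand.
quoted_margin = headroom(7.6, 4.2)

# Capacity recomputed from 139 cycles at 240 MHz with 4 cores in parallel.
cycle_based_capacity = 4 * per_core_rate_mhz(240.0, 139)
cycle_based_margin = headroom(cycle_based_capacity, 4.2)
```

The quoted figures correspond to a safety factor of about 1.8, and the cycle-based estimate to about 1.6.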

Conclusion
A fast parallel architecture, based on a mixture of VHDL design and NIOS II processors, has been designed for cluster reconstruction and counting in the LKr electromagnetic calorimeter of the NA62 experiment. The test results presented here show that the L0 trigger system fully meets the timing and bandwidth requirements of the experiment. More extensive tests to stress the system capabilities are under way and will include communication between different TEL62 boards. The system will be commissioned in the last part of 2014, ready for data taking at the end of 2014 or the beginning of 2015.