First performance results of the ALICE TPC Readout Control Unit 2

This paper presents the first performance results of the ALICE TPC Readout Control Unit 2 (RCU2). With an upgraded hardware topology and a new readout scheme in the FPGA design, the RCU2 is designed to achieve twice the readout speed of the present Readout Control Unit. Design choices such as using the flash-based Microsemi SmartFusion2 FPGA and applying mitigation techniques in the interfaces and the FPGA design ensure a high degree of radiation tolerance. This paper presents the system-level irradiation test results as well as the first commissioning results of the RCU2, and concludes with a discussion of the planned firmware updates.


Introduction
ALICE (A Large Ion Collider Experiment) is a general-purpose heavy-ion detector at the CERN LHC focusing on the quark-gluon plasma (QGP), which is believed to exist at extremely high temperature, density, or both [1]. Due to its short lifetime, the QGP cannot be observed directly. Therefore, a set of detectors, which aim at observing events and signatures that indicate the existence of the QGP, was designed and installed [2]. The Time-Projection Chamber (TPC) is the main tracking detector of the central barrel in ALICE. Through the study of hadronic observables, it is optimized to provide, together with the other central-barrel detectors, charged-particle momentum measurements with good two-track separation, particle identification, and vertex determination [3]. The TPC data is collected by 557568 readout pads on its two end plates, behind which the readout electronics are connected [4]. The readout electronics consist of 4356 Front-End Cards (FECs) in 216 readout partitions, distributed over 36 sectors. Each readout partition includes one RCU connecting to 18 to 25 FECs via a multi-drop Gunning Transceiver Logic (GTL) bus. More information about the present TPC readout electronics is given in [3][4][5].
In LHC Run 1, the RCU worked stably [5]. However, with the upgrades during Long Shutdown 1 (LS1), the energy of the colliding beams will be increased to 13 ∼ 14 TeV compared to the 7 ∼ 8 TeV during Run 1. As a result, the event size is expected to increase by 20%, and the radiation load on the TPC electronics located in the innermost partitions is estimated to increase from 0.8 kHz/cm² to 3.0 kHz/cm² [4]. This leads to requirements of higher readout speed and improved radiation tolerance which cannot be fulfilled by the current TPC readout electronics. In order to provide the needed performance, the present Readout Control Unit (RCU1) is upgraded to the Readout Control Unit 2 (RCU2). Further information on what motivated the upgrade of the RCU can be found in [4].
As presented in figure 1, the upgrades from the RCU1 to the RCU2 comprise five aspects: (1) the GTL bus is divided into four branches instead of the current two-branch structure, (2) the speed of the Detector Data Link (DDL) [6] is increased from 1.28 Gbps to 4.25 Gbps, (3) the functionalities of three PCBs in the RCU1 are integrated into a single PCB in the RCU2, (4) the flash-based Microsemi SmartFusion2 (SF2) FPGA [7] replaces the two SRAM-based FPGAs and the one flash-based FPGA of the RCU1, and (5) a detector-pad-based readout scheme, aiming at utilizing the parallelism of the improved hardware, is designed. More details regarding the upgrades from RCU1 to RCU2 are discussed in [4,6,8].

The ALICE TPC Readout Control Unit (RCU2)
As shown in figure 2, the RCU2 consists of two major systems: the Readout System, which is implemented in the SF2 FPGA fabric [7], and the Control and Monitor System, which runs on the SF2 Microcontroller Subsystem (MSS) [9]. In the Readout System, the Trigger Receiver accepts, decodes and processes the trigger sequence coming from the ALICE Central Trigger Processor (CTP) [10] before passing the generated local triggers [11] to the Readout Module. Based on the local trigger information, the Readout Module reads data from the four branches of FECs in parallel, checks its quality, and merges and packages it into the ALICE data format [12]. At the final stage, the packaged data is pushed into the DDL2 Module [6], through which it is shipped to the ALICE data acquisition system [13]. In addition, an Internal Logic Analyzer, which provides the capability of debugging the internal logic of the Readout System, has been implemented.
The Control and Monitor System includes the Monitoring and Safety Module (MSM) [14], the Ethernet Module and the SF2 MSS with its peripherals. The Monitoring and Safety Module is responsible for monitoring the status of the FECs and reporting to the ALICE Detector Control System (DCS) [15] in case of abnormal situations. As shown in figure 3, a tailored 32-bit Linux system runs on the ARM Cortex-M3 [9] of the SF2 MSS using three 16-bit DDR3 SDRAMs [16]. Two of the SDRAMs together store the 32-bit data words of the Linux system, and the third one stores the parity bits used in the SECDED mechanism. When SECDED is enabled, the MSS DDR controller [9] computes and adds parity bits to the data while writing into the DDR3 SDRAMs. In a read operation, the data and the parity bits are then checked, supporting 1-bit error correction and 2-bit error detection.
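The SECDED principle used here can be illustrated with a short Python sketch: a textbook Hamming code over a 32-bit word plus one overall parity bit, giving 7 check bits per word. The bit layout below is purely illustrative; the actual code used by the SF2 MSS DDR controller is implemented in hardware and may differ.

```python
# Illustrative SECDED (single-error-correct, double-error-detect) for a
# 32-bit word: 6 Hamming check bits plus one overall parity bit.
PARITY_POS = [1, 2, 4, 8, 16, 32]                    # power-of-two positions
DATA_POS = [p for p in range(1, 39) if p & (p - 1)]  # the 32 data positions

def secded_encode(word):
    """Return the 7 check bits stored alongside a 32-bit data word."""
    bits = {p: (word >> i) & 1 for i, p in enumerate(DATA_POS)}
    check = 0
    for k, p in enumerate(PARITY_POS):
        par = 0
        for pos, b in bits.items():   # check bit p covers positions with bit p set
            if pos & p:
                par ^= b
        check |= par << k
    # overall parity makes the popcount of (word + all 7 check bits) even
    overall = (bin(word).count("1") + bin(check).count("1")) & 1
    return check | overall << 6

def secded_decode(word, stored):
    """Return (corrected_word, status) for a word read back from memory."""
    syndrome = (stored ^ secded_encode(word)) & 0x3F
    odd_total = (bin(word).count("1") + bin(stored).count("1")) & 1
    if syndrome == 0:
        return word, "ok" if not odd_total else "corrected"   # overall-bit flip
    if odd_total:                                  # single-bit error: fixable
        if syndrome in DATA_POS:
            word ^= 1 << DATA_POS.index(syndrome)  # flip the bad data bit back
        return word, "corrected"
    return word, "uncorrectable"   # even parity with nonzero syndrome: 2 flips
```

A single flipped bit is corrected (syndrome nonzero, overall parity odd); two flipped bits yield a nonzero syndrome with even overall parity and are reported as uncorrectable, which is the case that forces the Linux system to handle a memory fault.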

System level irradiation test
As mentioned above, the increased luminosity in LHC Run 2 with respect to LHC Run 1 will lead to a higher radiation load on the TPC electronics; thus an improved radiation tolerance of the RCU2 is required. The FPGA on the RCU2 is the Microsemi SmartFusion2 (SF2) SoC FPGA, whose configuration is stored in Single Event Upset immune flash cells [7]. In addition, several of the interfaces on the SF2, such as the Ethernet and the DDR interface, are protected by native mitigation techniques in the hardware. The RCU2 has been through several irradiation campaigns, and the results on the final version of the RCU2 hardware have so far been promising. More details on the previous irradiation test results can be found in [8].
In April 2015, a system level irradiation campaign was performed at The Svedberg Laboratory (TSL) in Uppsala using a 170 MeV proton beam. During this campaign the RCU2 was operated in a close to normal running situation while exposed to a wide proton beam at a moderate flux. As shown in figure 4, the test setup consists of three parts.
In the radiation area, the RCU2 is connected to four FECs, and the supply voltage and current consumption of the SF2 FPGA are monitored by an SF2 starter kit [17]. The trigger crate, the data computer with the C-RORC [6], and the PC providing serial communication to the RCU2 are located some meters away in a shielded area. Via the LAN, all the above-mentioned devices are controlled and monitored by three PCs located in the control room. In this test, the RCU2 was receiving and processing triggers upon which it performed the basic data-taking operation. At the same time, all available registers in the RCU2 were monitored. This section presents the observations on the RCU2 stability, especially regarding the readout and the Linux system, and discusses the corresponding mitigation actions. To evaluate the radiation tolerance of the RCU2, the Mean Time Between Failures (MTBF) in Run 2 for the different kinds of failures was calculated based on the cross-sections extracted from the test. When calculating the MTBF for Run 2, the radiation load on the TPC electronics in the innermost partitions (3.0 kHz/cm² [4]) is used, and all 216 RCUs plus 4356 FECs are taken into account. Since the flux of high-energy hadrons in the outermost partitions is expected to be one third of that in the innermost partitions, the numbers listed in this paper are worst-case estimates.
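The MTBF extrapolation described above reduces to scaling the measured cross-section by the Run 2 flux and the number of exposed devices. A minimal Python sketch (the authors' exact per-device bookkeeping is not spelled out here):

```python
def mtbf_hours(cross_section_cm2, flux_hz_cm2, n_devices):
    """Mean time between failures for a population of identical devices:
    each device fails at rate sigma * flux, and N devices fail N times as
    often, so MTBF = 1 / (sigma * flux * N), converted to hours."""
    rate_hz = cross_section_cm2 * flux_hz_cm2 * n_devices
    return 1.0 / rate_hz / 3600.0

# Ethernet figures reported later in this paper: sigma = 2.5e-11 cm^2,
# innermost-partition flux 3.0 kHz/cm^2, 216 readout partitions
print(round(mtbf_hours(2.5e-11, 3.0e3, 216), 1))  # -> 17.1
```

With the Ethernet cross-section this reproduces the ∼17 h MTBF quoted in the Ethernet stability results, which is a useful sanity check on the worst-case (innermost-flux) assumption.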

Readout stability
To evaluate the readout stability, data taking of the RCU2 was monitored with the trigger rate set to 10 Hz, and two test cases were performed: irradiating the whole RCU2, and irradiating solely the SF2, which was realized by shielding the other parts of the RCU2 with a collimator. The FECs were always irradiated; however, in the second case they were partially shielded.
During the test, the readout was observed to stop several times due to three categories of errors: resets due to the PLL losing lock, SEU-induced errors on the FECs, and data transmission errors. The cross-sections and the MTBF in Run 2 of these errors are presented in table 1.
At the time of testing, the PLL lock signal was directly used as a reset signal in the RCU2 FPGA design, thus any loss of lock would stop data taking. In the SF2, the PLL has three configuration options [9]: (1) holding reset until lock is acquired, (2) outputting the clock before lock without resynchronizing after lock is acquired, and (3) outputting the clock before lock and resynchronizing after lock is acquired. From figure 5 it is concluded that the output clock of the PLL is not reliable if it loses lock, and thus its usage should be minimized: there will be no output clock at all if the PLL is configured with option (1), and the output clock will be unstable for several clock cycles with option (2) or (3) selected. Following the irradiation campaign, the reset strategy of the RCU2 has been redesigned: the PLL lock signal is used as a reset signal only when the RCU2 is powered up, after which it no longer contributes to the reset scheme.
To deal with SEU-induced errors on the FECs, which may cause the data taking to get stuck, the following mitigation actions have been implemented. Firstly, the front-end control bus on the FECs is continuously monitored. Secondly, the communication protocols between the RCU2 and the FECs are monitored. Thirdly, the trailer word of each data package coming from the FECs to the RCU2, which contains signature information such as the channel address and the length of the data, is verified. With all these actions, it is expected that error situations will be detected and corrected at an early stage. In case of a data transmission error, the ALICE DAQ will enter a Pause and Recover (PAR) state so that the physics run does not need to stop. This PAR scheme benefits all the detectors and is to be supported by the RCU2. In addition, although no scenario that could be interpreted as an FPGA fabric error was seen in this irradiation test, critical registers and state machines are to be protected with Triple Module Redundancy (TMR) or Hamming encoding as suggested in [18].
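The trailer-word verification can be sketched as follows. The field layout (marker value, bit positions, field widths) is purely illustrative and does not reproduce the actual ALTRO/RCU2 trailer format:

```python
def check_trailer(trailer, expected_channel, expected_nwords):
    """Verify the signature fields of a hypothetical 32-bit trailer word.
    Any mismatch flags the data block so the error is caught at an early
    stage, as described in the text."""
    marker = (trailer >> 30) & 0x3     # fixed 2-bit end-of-data marker
    nwords = (trailer >> 16) & 0x3FF   # payload length in 10-bit words
    channel = trailer & 0xFFF          # hardware address of the channel
    if marker != 0x2:
        return "bad marker"
    if channel != expected_channel:
        return "address mismatch"
    if nwords != expected_nwords:
        return "length mismatch"
    return "ok"

# a well-formed trailer, then the same trailer with one flipped address bit
good = (0x2 << 30) | (100 << 16) | 0x3AB
print(check_trailer(good, 0x3AB, 100))      # -> ok
print(check_trailer(good ^ 1, 0x3AB, 100))  # -> address mismatch
```

The point of checking redundant signature fields is that a single upset anywhere in the block is overwhelmingly likely to break at least one of them, turning a silent corruption into a detectable one.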

Linux stability
As mentioned in section 2, the Linux system of the RCU2 runs on the ARM processor in the SF2 MSS together with three DDR3 SDRAMs, on which SECDED [9] protection can be enabled. While testing its stability, two kinds of errors were observed: sometimes the Linux system reboots, and in some cases it freezes. The possible reason for these errors may be single-event upsets (SEUs) and multi-bit upsets (MBUs) in the DDR SDRAMs and in the ARM processor, which lead to kernel panics. The cross-sections and MTBF of the Linux reboot and freeze errors, for the different test cases, are presented in table 2. Due to the limited statistics, it is hard to conclude whether the SECDED protection on the DDR memories helps or not. To reduce the impact of instabilities caused by Linux errors, several mitigation actions have been taken or explored. First of all, a stand-alone module for the DDL2 SERDES [6] initialization has been designed to replace the default initialization scheme, in which the SERDES is initialized by the SF2 MSS on system boot-up. Furthermore, configuring the FECs via the DDL2 has been realized. With these two measures, the readout can be separated from the Linux system, so that the RCU2 can continue taking data in case any error occurs in Linux. In addition, an exploration of replacing the Linux system with a real-time operating system (RTOS) that resides solely in the internal eSRAM of the SF2 is ongoing. As a part of this activity, the cross-section of SEUs in the SF2 eSRAM has been characterized and the mean time between SEUs in Run 2 has been calculated. As shown in figure 6, provided a single eSRAM is used on each RCU2, an SEU is expected around every 220 s.

Figure 6. SEUs in SF2 eSRAM

Trigger interface, Ethernet and MSM stability
In accordance with the previous tests [8], the trigger reception (TTCrx) is stable: no error was seen in this irradiation test. The Monitoring and Safety Module (MSM) is also stable, meaning that no error has been seen on the RCU2 side. Additionally, the stability of the Ethernet is acceptable: two errors were observed in the tests, corresponding to a cross-section of 2.5E−11 ± 71% and an MTBF in Run 2 of 17.0 ± 12.1 hours.

Readout performance of the RCU2
The readout time of single events has been measured in the setup shown in subplot (c) of figure 7, where one full readout partition, consisting of one RCU2 and 25 FECs (the maximum number), is used. The benchmarking has been performed over the full range of readout parameters: the number of data samples in each ALTRO channel [19] was varied from 0 to 1000, with the DDL2 working at 2.125 Gbps and at 4.25 Gbps, separately. As presented in subplot (b) of figure 7, the size of a single event is linearly proportional to the number of samples, and it is also consistent with that of the events recorded by the RCU1 [5].
At the speed of 2.125 Gbps (∼ 200 MB/s), the DDL2 link starts to saturate when the number of samples exceeds ∼ 50. In this condition, the readout speed is improved by a factor of ∼ 1.3 with respect to the RCU1 [5]. With the DDL2 working at 4.25 Gbps (∼ 400 MB/s), the readout speed of the RCU2 is increased by a factor of ∼ 2 compared to the RCU1. In this case, it is the Readout System operating at 80 MHz that limits the performance, because it can provide a maximum bandwidth of only ∼ 305 MB/s. A further performance improvement is expected by changing the internal clock frequency from 80 MHz to 100 MHz, in which case the readout speed is estimated to be ∼ 2.6 times that of the RCU1. The 100 MHz clock source will be provided by an on-board oscillator, thus the usage of PLLs in the SF2 can be fully avoided.
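The bandwidth figures above can be combined into a toy model of the readout time, where the slower of the DDL2 link and the ∼305 MB/s fabric limit sets the rate. The event-size constants (channel count, bytes per sample) are illustrative assumptions, not measured values:

```python
FABRIC_MB_S = 305.0  # quoted bandwidth limit of the 80 MHz Readout System

def event_size_mb(n_samples, n_channels=3200, bytes_per_sample=1.25):
    """Event size grows linearly with the samples per channel, as in
    figure 7(b); channel count and packing factor are assumptions."""
    return n_channels * n_samples * bytes_per_sample / 1e6

def readout_time_ms(n_samples, link_mb_s):
    """Readout time limited by the slower of the DDL2 link and the fabric."""
    return event_size_mb(n_samples) / min(link_mb_s, FABRIC_MB_S) * 1e3

# at ~200 MB/s the link is the bottleneck; at ~400 MB/s the fabric is,
# so the gain from doubling the link speed caps out at 305/200
speedup = readout_time_ms(100, 200.0) / readout_time_ms(100, 400.0)
print(round(speedup, 3))  # -> 1.525
```

The model makes the bottleneck shift explicit: doubling the link speed gains only 305/200 ≈ 1.5× while the fabric runs at 80 MHz, which is why raising the internal clock to 100 MHz is expected to bring a further improvement.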

Commissioning results for the RCU2
In total, 255 RCU2 boards have been produced, including more than 10% spare cards. Since January 2015, 6 RCU2s have been installed and commissioned on one of the 36 TPC sectors. Their geometric locations and appearance can be seen in subplots (a) and (b) of figure 8, respectively.
During this commissioning period, the readout of the RCU2 has been working stably with the DDL2 at a speed of 2.125 Gbps. This is verified by the following method: a fixed pattern is written into the pedestal memories [19] of the FECs, read back by the RCU2 and checked by the ALICE DAQ. During the commissioning, several TB of data have been looped, and no data corruption or readout stops have been observed.
Furthermore, no Linux reboots or freezes have been seen on the RCU2 boards. The statistics are, however, too low to draw any conclusion on the Linux stability of the RCU2. For comparison, only about 10 Linux reboots have been experienced on a total of 210 RCU1s. The trigger reception, the Monitoring and Safety Module (MSM) and the Ethernet are working stably. The In-System Programming (ISP) of the RCU2 SF2 FPGA is in general operational; however, in 10-15 out of 100 attempts it exits prematurely. The reason could not be clearly identified, as the ISP is handled internally by the SF2 MSS. In these cases, a retry of the ISP leads to the desired result.

Conclusion and outlook
In April 2015, a system-level irradiation campaign of the RCU2 was performed. It revealed some stability issues, especially regarding the Linux system and the readout. All the radiation-related problems have so far been solved, or mitigation actions for them have been planned. Since January 2015, 6 RCU2s have been commissioned on the ALICE TPC. They were verified with all the surrounding systems (trigger, DCS and DAQ) and found to be working stably with the DDL2 at a speed of 2.125 Gbps. The RCU2 FPGA design is entering the finalization phase, while some development is still ongoing: integration and verification of the DDL2 at 4.25 Gbps, increasing the system clock frequency from 80 MHz to 100 MHz, implementing a novel data sorting algorithm, and implementing multi-event buffering for triggers. With the DDL2 working at 4.25 Gbps and the system clock at 100 MHz, the readout speed will be improved by a factor of at least 2 compared to the current system, which fulfills the requirements for Run 2 operation. With all the major building blocks in place, the RCU2 is planned to be installed in the ALICE TPC during the LHC winter break (December 2015 to March 2016).