Upgrade of the TOTEM DAQ using the Scalable Readout System (SRS)

The main goals of the TOTEM Experiment at the LHC are the measurement of the elastic and total p-p cross sections and the study of diffractive dissociation processes. At the LHC, collisions are produced at a rate of 40 MHz, imposing strong requirements on the Data Acquisition System (DAQ) in terms of trigger rate and data throughput. The TOTEM DAQ adopts a modular approach that, in stand-alone mode, is based on the VME bus. The VME-based Front End Driver (FED) modules host mezzanines that receive data through optical fibres directly from the detectors. After data checks and formatting in the mezzanine, the data are retransmitted to the VME interface and to another mezzanine card plugged into the FED module. The maximum bandwidth of the VME bus limits the first-level trigger (L1A) rate to 1 kHz. To remove the VME bottleneck and improve the scalability and overall capabilities of the DAQ, a new system was designed and built based on the Scalable Readout System (SRS), developed within the RD51 Collaboration. The project aims to increase the efficiency of the current readout system by providing higher bandwidth and by increasing data filtering through a second-level trigger event selection based on hardware pattern-recognition algorithms. This goal is to be achieved while preserving maximum backward compatibility with the LHC Timing, Trigger and Control (TTC) system and with the CMS DAQ. The results obtained and the perspectives of the project are reported. In particular, we describe the system architecture and the new Opto-FEC adapter card developed to connect the SRS with the FED mezzanine modules. A first test bench was built and validated during the last TOTEM data-taking period (February 2013). A set of 3 TOTEM Roman Pot silicon detectors was read out to verify performance in the real LHC environment. In addition, the test allowed data consistency and quality to be checked.


Introduction
The TOTEM Experiment [1] measures the total pp cross-section with a luminosity-independent method and studies elastic scattering and diffractive phenomena at LHC energies. Furthermore, TOTEM's physics program aims at a deeper understanding of proton structure by studying elastic scattering with large momentum transfers and diffractive processes in cooperation with the CMS Experiment. To perform these measurements, TOTEM requires good acceptance for particles produced at very small angles with respect to the beam. TOTEM's pseudorapidity coverage spans the ranges 3.1 ≤ |η| ≤ 4.7 and 5.3 ≤ |η| ≤ 6.5 on both sides of the interaction point; this is achieved by two gas-detector telescopes, named T1 and T2. The T1 and T2 telescopes use Cathode Strip Chambers (CSC) and triple Gas Electron Multiplier (GEM) chambers, respectively, which detect inelastically produced charged particles. The inelastic telescopes are complemented by silicon detectors housed in special movable structures embedded in the beam pipe, called Roman Pots, placed at about 147 m and 220 m from the interaction point. The Roman Pots are designed to detect leading protons down to a few mm from the beam centre.
The TOTEM trigger and readout electronics are, by design, modular and compliant with the CMS DAQ and trigger system, into which TOTEM can be integrated during joint data-taking campaigns. All TOTEM detectors use the same readout chip, called VFAT (Very Forward Atlas and Totem) [2,3]. The VFAT chip also provides trigger capabilities by generating fast logical-OR trigger outputs, so that all TOTEM detectors have full triggering capabilities. The trigger primitives produced by the VFAT chips are processed by the TOTEM trigger system, which distributes the event selection signal, named Level 1 Accept (L1A), back to all the VFATs. Upon receiving the L1A, the VFATs generate the data frame containing the event information. Custom-designed Gigabit Optical Hybrid (GOH) transmitters collect up to 16 VFAT frames and, after serialization, transmit them through optical fibres to the counting room. In the counting room, bundles of 12 GOH fibres are connected to the so-called OptoRx mezzanine modules [4]. The OptoRx represents the cross-connection between the TOTEM stand-alone data acquisition infrastructure and the CMS one. Up to 3 OptoRx are plugged into VME boards that are read out by the readout PCs in the stand-alone run configuration. The maximum bandwidth of the VME bus limits the first-level trigger (L1A) rate to 1 kHz.
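To make the frame-based data path more concrete, the sketch below decodes a single VFAT data frame and verifies its checksum. The frame layout used here is an assumption for illustration: twelve 16-bit words, namely three header words tagged 0xA (bunch counter), 0xC (event counter plus flags) and 0xE (chip ID), eight words of channel hits and a trailing CRC word. The CRC parameters (CRC-16/CCITT, polynomial 0x1021, initial value 0xFFFF, MSB first) are likewise assumed; the authoritative definitions are in the VFAT documentation [2,3]. `VfatFrame`, `decode_vfat_frame` and `crc16_ccitt` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical decoded view of one VFAT frame (layout assumed, see text).
struct VfatFrame {
    uint16_t bunchCounter;         // 12-bit BC from word 0
    uint8_t eventCounter;          // 8-bit EC from word 1, bits 11..4
    uint8_t flags;                 // 4-bit flags from word 1, bits 3..0
    uint16_t chipId;               // 12-bit chip ID from word 2
    std::array<uint16_t, 8> hits;  // 128 channel bits, 16 per word
};

// CRC-16/CCITT over 16-bit words, MSB first (assumed parameters).
// XOR-ing a whole 16-bit word and shifting 16 times is equivalent to
// feeding its bits one by one, most significant bit first.
uint16_t crc16_ccitt(const uint16_t* words, std::size_t n) {
    assert(words != nullptr || n == 0);
    uint16_t crc = 0xFFFF;
    for (std::size_t i = 0; i < n; ++i) {
        crc ^= words[i];
        for (int bit = 0; bit < 16; ++bit) {
            crc = (crc & 0x8000) ? static_cast<uint16_t>((crc << 1) ^ 0x1021)
                                 : static_cast<uint16_t>(crc << 1);
        }
    }
    return crc;
}

// Returns the decoded frame, or std::nullopt if a header tag or the CRC is wrong.
std::optional<VfatFrame> decode_vfat_frame(const std::array<uint16_t, 12>& w) {
    if ((w[0] >> 12) != 0xA || (w[1] >> 12) != 0xC || (w[2] >> 12) != 0xE)
        return std::nullopt;                       // hypothetical header tags
    if (crc16_ccitt(w.data(), 11) != w[11])
        return std::nullopt;                       // transmission error
    VfatFrame f{};
    f.bunchCounter = w[0] & 0x0FFF;
    f.eventCounter = static_cast<uint8_t>((w[1] >> 4) & 0xFF);
    f.flags = static_cast<uint8_t>(w[1] & 0x0F);
    f.chipId = w[2] & 0x0FFF;
    for (std::size_t i = 0; i < 8; ++i) f.hits[i] = w[3 + i];
    return f;
}
```

Because a CRC detects every single-bit error, flipping any one bit of a frame built this way makes `decode_vfat_frame` reject it.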
2 The Data Acquisition upgrade project

Overview and requirements
During the first LHC Long Shutdown (LS1), TOTEM started a consolidation program aimed at improving the overall performance of its systems in terms of data throughput, trigger rate and efficiency. The DAQ system upgrade project is part of this program. In particular, the project aims to remove the VME bottleneck and to increase the DAQ capabilities by achieving higher bandwidth and introducing data-filtering algorithms. This goal has to be reached while maintaining full compatibility with the LHC Timing, Trigger and Control (TTC) transmission system as well as with the CMS DAQ.

The SRS system
The Scalable Readout System (SRS) [6], proposed in 2009 and developed within the Micro-Pattern Gas Detector (MPGD) development community [5] as a scalable, multi-channel, general-purpose readout platform, is a suitable and cost-effective solution for the TOTEM project. The SRS has two main building blocks: the Front-End Concentrator card (FEC) and the Scalable Readout Unit (SRU). Both the FEC and the SRU can be controlled and read out via Ethernet links. When several FECs have to be read out, the SRU can act as a concentrator, merging data from up to 40 FECs. The modularity and scalability of the SRS offer a wide range of options for building an Ethernet-based data acquisition system. By replacing communication buses and interfaces with standard Ethernet links, the SRS reduces the data acquisition system to a network-based architecture and allows data to enter the data acquisition PC cluster through standard Network Interface Cards (NICs) supported by vendors with multi-platform drivers. In addition, thanks to the large amount of logic resources available in the FEC and SRU modules, data processing can be implemented at different stages of the DAQ chain. The Xilinx Virtex-5 FPGA hosted on the FEC and the Virtex-6 FPGA on the SRU allow the SRS user to customize the module functionality and implement algorithms tailored to a given application.
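The concentrator role mentioned above is essentially event building: fragments arriving from several FECs are grouped by event number and released only once every source has contributed. The C++ sketch below illustrates that idea in software; it is not the SRU firmware, and the `Fragment` layout and the `EventBuilder` class are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical fragment: the share of one event read out by one FEC.
struct Fragment {
    uint32_t eventNumber;          // event counter common to all sources
    uint16_t sourceId;             // FEC index, 0 .. nSources-1
    std::vector<uint8_t> payload;  // raw data from that FEC
};

// Hypothetical builder: groups fragments by event number and releases an
// event once every source has contributed exactly one fragment.
class EventBuilder {
public:
    explicit EventBuilder(std::size_t nSources) : nSources_(nSources) {
        assert(nSources > 0);
    }

    // Returns the complete event (fragments ordered by sourceId) if this
    // fragment completes it, std::nullopt otherwise.
    std::optional<std::vector<Fragment>> add(Fragment f) {
        if (f.sourceId >= nSources_) throw std::out_of_range("unknown source");
        const uint32_t evt = f.eventNumber;  // copy keys before moving f
        const uint16_t src = f.sourceId;
        auto& slot = pending_[evt];
        if (!slot.emplace(src, std::move(f)).second)
            throw std::runtime_error("duplicate fragment");
        if (slot.size() < nSources_) return std::nullopt;

        std::vector<Fragment> event;
        event.reserve(slot.size());
        for (auto& entry : slot) event.push_back(std::move(entry.second));
        pending_.erase(evt);
        return event;
    }

    // Number of events still waiting for at least one fragment.
    std::size_t pendingEvents() const { return pending_.size(); }

private:
    std::size_t nSources_;
    std::map<uint32_t, std::map<uint16_t, Fragment>> pending_;
};
```

Fragments of different events may interleave freely; incomplete events simply stay pending, which is also where a real builder would add time-outs and alignment checks.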

The DAQ upgrade using the SRS
Figure 1 shows the DAQ system architecture integrating the SRS into the TOTEM DAQ system. The OptoRx modules are plugged into a custom-designed card, named Opto-FEC, that provides the connection with the FEC board. Data received from the OptoRx are processed in the FEC and formatted using the User Datagram Protocol (UDP). Several FECs can be read out directly by different PCs over point-to-point connections or in a switched Ethernet network architecture. At the present stage the SRU is used as a TTC signal receiver and fan-out. In particular, the optical signal from the TOTEM TTC system is connected to the SRU. The on-board TTCrx chip decodes the TTC data stream, while the SRU FPGA distributes the clock, trigger and fast commands over the Data Trigger Clock and Control (DTCC) links [6,7]. FEC modules are connected to the SRU DTCC ports via CAT6 Ethernet cables. In addition, the SRU module can act as a data concentrator, receiving data via the DTCC links. This operation mode will be exploited at a later stage if further processing is needed on the dataset merged from different FECs.
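On the PC side, reading an FEC over a point-to-point link amounts to receiving UDP datagrams on a bound socket. The sketch below uses the POSIX socket API as available on Linux; the address, port and the 9000-byte buffer (sized for jumbo frames) are placeholders rather than SRS defaults, and `open_udp_receiver` and `receive_datagram` are hypothetical helper names.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: opens a UDP socket bound to an IPv4 address and port.
int open_udp_receiver(const std::string& ip, uint16_t port) {
    int fd = ::socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) throw std::runtime_error("socket() failed");
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);  // port 0 lets the kernel choose one
    if (::inet_pton(AF_INET, ip.c_str(), &addr.sin_addr) != 1) {
        ::close(fd);
        throw std::runtime_error("bad IPv4 address");
    }
    if (::bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
        ::close(fd);
        throw std::runtime_error("bind() failed");
    }
    return fd;
}

// Hypothetical helper: blocks until one datagram arrives. UDP preserves
// message boundaries, so each call returns exactly one datagram
// (silently truncated if it is longer than maxSize).
std::vector<uint8_t> receive_datagram(int fd, std::size_t maxSize = 9000) {
    assert(fd >= 0);
    std::vector<uint8_t> buf(maxSize);
    ssize_t n = ::recv(fd, buf.data(), buf.size(), 0);
    if (n < 0) throw std::runtime_error("recv() failed");
    buf.resize(static_cast<std::size_t>(n));
    return buf;
}
```

In a switched architecture the same code applies unchanged; only the sender addresses differ, and `recvfrom` can be used instead of `recv` to tell the FECs apart.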
3 The first SRS-based DAQ demonstrator for TOTEM

3.1 Opto-FEC board
The Opto-FEC is an electronic board custom-designed to interface the TOTEM OptoRx mezzanine with the FEC board via the FEC's back-end PCI connector. This 8-layer printed circuit board (PCB) passed post-routing simulations evaluating the signal integrity (SI) of its high-speed paths. The first version of the card was conceived as a prototype aimed at exploiting the full connectivity between one OptoRx and one FEC board. The FEC board offers different options for interfacing the back-end user card: several communication interfaces, such as high-speed 8b/10b serial links and parallel buses, can be implemented by configuring the FPGA I/Os available on the PCI connector.
The Opto-FEC was designed to fulfil the following requirements:
• Implement the SLink32 parallel bus for data streaming;
• Implement the high-speed bi-directional serial link for data streaming;
• Implement the I2C bus for OptoRx configuration;
• Implement the Trigger Throttling System (TTS) connection;
• Implement stand-alone powering mode;
• Implement stand-alone external clock connection;
• Implement the JTAG bus for OptoRx programming.
Ten Opto-FEC units were produced and assembled. Figure 2 shows the first Opto-FEC prototype with its OptoRx, as tested and operated with the SRS system.

First test at the Interaction Point IP5
We commissioned the first prototype of the new DAQ system in the LHC environment during the last p-Pb runs in February 2013. One FEC module was equipped with one OptoRx receiving data from 3 complete RP detectors, comprising about 120 VFATs. The FEC was read out directly by a standard PC, and the data were stored locally on a commercial SATA storage device. The FEC was connected to the SRU module in order to receive the TTC signals; the SRU was connected to the TOTEM TTC network via optical fibre, delivering the LHC clock, orbit, L1A and fast-command signals. The SRS-based DAQ demonstrator in IP5 is shown in figure 3. Data were acquired at a maximum trigger rate of 10 kHz, limited only by the disk write speed. We acquired 10 M events, corresponding to ∼50 GB of raw data. Data consistency and quality checks were performed off-line, running CRC checks on the VFAT frames and SLink frames, as well as event alignment checks. No transmission errors were detected along the whole DAQ chain.
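The figures quoted above allow a quick consistency check. Dividing the stored volume by the number of events gives the average event size, and multiplying it by the trigger rate gives the sustained rate the disk had to absorb. The comparison with a Gigabit Ethernet link assumes decimal units, that the ∼50 GB consists of event data only, and the nominal 1 Gb/s line rate with protocol overhead neglected, so the last line is only an upper bound.

```latex
\begin{aligned}
\bar{s} &\approx \frac{50\ \mathrm{GB}}{10^{7}\ \text{events}} = 5\ \mathrm{kB/event},\\
R_{\mathrm{disk}} &\approx 10\ \mathrm{kHz} \times 5\ \mathrm{kB} = 50\ \mathrm{MB/s} = 400\ \mathrm{Mb/s},\\
f_{\max}^{\mathrm{GbE}} &\lesssim \frac{1\ \mathrm{Gb/s}}{8 \times 5\ \mathrm{kB}} = \frac{125\ \mathrm{MB/s}}{5\ \mathrm{kB}} = 25\ \mathrm{kHz}.
\end{aligned}
```

Under these assumptions the demonstrator used roughly 40% of the nominal capacity of a single GbE link at 10 kHz, which is consistent with the disk, rather than the network, being the limiting factor.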

4 Firmware and software development

4.1 Firmware development
The firmware of the FEC and SRU is mainly developed in SystemVerilog [8], which integrates hardware description and verification in the same standard language. Although SystemVerilog is relatively new and the EDA tools supporting it are not yet fully mature, the language is gradually gaining attention in industry thanks to its compactness and its syntax structures, which translate into more reusable and less error-prone code.
In order to take advantage of SystemVerilog, the firmware shipped with the SRS hardware has been completely reviewed and rewritten in the new language. The firmware of both the FEC and the SRU consists of two large blocks: the System and User units. The System unit supports all the physical interfaces used by the FPGA to communicate with external components, e.g. Gigabit Ethernet (GbE), DTCC, I2C, etc. It also handles configuration, control and timing-signal distribution to the rest of the system, providing a set of services to the User unit. This approach allows potential users of the system to focus only on the User unit when adapting the firmware to a specific application. With a System unit skeleton that already supports a complete set of services, and that can later be improved gradually by the common effort of the whole user community, individual users can invest their time in designing and implementing processing and pre-analysis algorithms that benefit from FPGA hardware acceleration.
The communication between all modules, especially between the System and User units, is provided by standard interfaces of the Advanced Microcontroller Bus Architecture (AMBA) [9] family:
• Advanced High-performance Bus (AHB): interconnect dedicated to memory-mapped modules;
• Advanced eXtensible Interface 4 Stream (AXI4-Stream): unidirectional interconnect for modules exchanging streams of packets.
The use of a standard and well-documented bus such as AMBA facilitates the development, verification and integration of new modules into the design. Moreover, adopting an industrial standard allows IP cores widely available on the market to be reused, which speeds up the development stage.
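The AXI4-Stream rule that matters most when chaining User-unit modules is the handshake: a word is transferred only on a clock cycle where the source asserts TVALID and the sink asserts TREADY, and TLAST marks the final word of a packet. The cycle-level C++ model below is an illustrative sketch of that rule, not the SRS firmware; `AxisBeat` and `AxisSink` are hypothetical names and only a subset of the AXI4-Stream signals is modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical subset of the AXI4-Stream signals driven by a source on one cycle.
struct AxisBeat {
    bool tvalid;     // source has a valid word
    uint32_t tdata;  // payload word
    bool tlast;      // last word of the current packet
};

// Hypothetical sink model: accepts a word only when TVALID && TREADY on the
// same cycle and groups accepted words into packets delimited by TLAST.
class AxisSink {
public:
    // Evaluates one clock edge; returns true if a transfer took place.
    bool clock(const AxisBeat& in, bool tready) {
        if (!(in.tvalid && tready)) return false;  // no handshake: nothing moves
        ++transfers_;
        current_.push_back(in.tdata);
        if (in.tlast) {
            packets_.push_back(std::move(current_));
            current_.clear();  // restart from a well-defined empty packet
        }
        return true;
    }
    const std::vector<std::vector<uint32_t>>& packets() const { return packets_; }
    std::size_t transfers() const { return transfers_; }

private:
    std::size_t transfers_ = 0;
    std::vector<uint32_t> current_;
    std::vector<std::vector<uint32_t>> packets_;
};
```

A compliant source must keep TVALID asserted and TDATA stable until the handshake completes; modelling back-pressure is then just a matter of driving `tready` low on some cycles while the source repeats the same beat.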

Firmware verification
The design under test is stimulated by sequences, which communicate with the drivers through the sequencers.
Since SystemVerilog supports randomization natively (at the syntax level), properly written sequences let test vectors automatically cover a wide range of events; achieving a similar result with a traditional, hard-coded approach would require thousands of lines of verification code. Moreover, using constraints and distributions, the user may guide the test bench to explore different regions of the probability space in search of possible bugs in the firmware. Coverage metrics allow us to estimate how thoroughly a design has been verified. The monitor inside each agent collects the traffic generated by the driver and reports it to a scoreboard, an object that gathers all stimuli and firmware responses and evaluates whether the firmware behaves correctly. The verification code takes advantage of the object-oriented constructs of SystemVerilog, so it can easily be reused and written at a high level of abstraction.
For the FEC firmware, a common Universal Verification Methodology (UVM) environment has been developed; its block scheme is shown in figure 4. Adding a new functionality requires writing a specific sequence rather than a whole test bench. After any modification, the firmware has to pass a set of predefined tests before being validated for production. This approach makes it possible to track down logic, and sometimes also timing, pitfalls at the development stage, where all signals can easily be probed and analysed in the Hardware Description Language (HDL) simulator. It has an undeniable advantage over the quite common on-hardware test approach, in which signal visibility is limited even with the help of logic analyser cores.
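The interplay of constrained-random stimulus, a scoreboard and functional coverage described above is not tied to SystemVerilog. The C++ sketch below transposes it onto a toy design under test, a hypothetical 8-bit saturating adder standing in for firmware logic: a weighted distribution plays the role of the sequence constraints, an independently written reference model feeds the scoreboard, and three coverage bins record which corner cases were reached. It is an analogy only, not UVM, and all names in it are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical DUT: 8-bit saturating adder (stands in for firmware logic).
uint8_t dut_saturating_add(uint8_t a, uint8_t b) {
    unsigned s = unsigned(a) + unsigned(b);
    return static_cast<uint8_t>(s > 255 ? 255 : s);
}

// Hypothetical reference model used by the scoreboard, written independently.
uint8_t ref_saturating_add(uint8_t a, uint8_t b) {
    return static_cast<uint8_t>(std::min(255, int(a) + int(b)));
}

struct Result {
    int checked = 0;
    int mismatches = 0;
    std::array<bool, 3> coverage{};  // bins: sum < 255, sum == 255, saturated
};

// "Sequence": draws operands from a weighted distribution that favours the
// edge values 0 and 255, then lets the scoreboard compare DUT and model.
Result run_random_test(int nTransactions, unsigned seed) {
    assert(nTransactions >= 0);
    std::mt19937 rng(seed);
    std::discrete_distribution<int> pick{1.0, 1.0, 4.0};  // 0, 255, uniform
    std::uniform_int_distribution<int> uniform(0, 255);
    auto draw = [&]() -> uint8_t {
        switch (pick(rng)) {
            case 0: return 0;
            case 1: return 255;
            default: return static_cast<uint8_t>(uniform(rng));
        }
    };
    Result r;
    for (int i = 0; i < nTransactions; ++i) {
        uint8_t a = draw(), b = draw();
        uint8_t got = dut_saturating_add(a, b);  // driver + monitor
        uint8_t exp = ref_saturating_add(a, b);  // scoreboard prediction
        ++r.checked;
        if (got != exp) ++r.mismatches;
        int sum = int(a) + int(b);
        r.coverage[sum < 255 ? 0 : (sum == 255 ? 1 : 2)] = true;
    }
    return r;
}
```

Testing a new feature in this style means writing a new stimulus function while the reference model, scoreboard and coverage collection stay unchanged, which mirrors the "new sequence, same environment" workflow described in the text.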

Readout software
The DATE data acquisition framework [11] developed by the ALICE Collaboration is an off-the-shelf solution for building scalable, multi-node data acquisition systems. DATE supports the readout of SRS equipment through Linux UDP sockets [12]. TOTEM will adopt DATE for stand-alone runs. Moreover, to allow system testing and software migration to other platforms and frameworks, a tool-kit based on the Boost C++ libraries has been developed. A stand-alone, multi-threaded application is available for fully portable SRS readout. The application implements the well-known producer-consumer pattern and uses three threads. Thread tasks are implemented by objects specialised for the producer and consumer requirements. Combining design patterns with object-oriented programming has several advantages: on one side it keeps the code generic and allows flexible use of multi-threading, on the other it decouples the application tasks into two independent domains. The producer thread handles the readout, receiving data from the UDP socket, while the consumer processes the data. The processing tasks are fully customizable: both on-line consistency checks and data storage can be implemented. The third thread supervises the producer and consumer activity and collects statistics useful for studying system performance. The producer and consumer communicate via two queues: the first contains empty memory buffer objects ready to store new data, the second contains buffers filled with data waiting to be processed by the consumer. The software queues are thread-safe containers; two different queue versions have been designed using C++ generic programming techniques. At compile time the user decides whether to use the mutex-locked queue or the more efficient lock-free queue, which takes advantage of atomic memory access.
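The buffer-recycling scheme described above, with one queue of empty buffers and one of filled buffers shared by a producer and a consumer thread, can be sketched with the C++ standard library alone. The sketch uses the mutex-locked variant (`std::mutex` plus `std::condition_variable`); it does not reproduce the Boost-based tool-kit, omits the supervisor thread, and replaces the UDP socket with fabricated data. `BlockingQueue`, `Buffer` and `run_pipeline` are hypothetical names.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical mutex-locked blocking FIFO (the simpler of the two queue variants).
template <typename T>
class BlockingQueue {
public:
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(item));
        }
        cv_.notify_one();
    }
    T pop() {  // blocks until an item is available
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        T item = std::move(q_.front());
        q_.pop();
        return item;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
};

using Buffer = std::unique_ptr<std::vector<uint8_t>>;  // hypothetical buffer type

// Hypothetical driver: pushes nPackets through the pipeline using nBuffers
// pre-allocated buffers and returns the sum of all bytes seen by the consumer.
uint64_t run_pipeline(std::size_t nBuffers, std::size_t nPackets) {
    assert(nBuffers > 0);  // with no buffers the producer would wait forever
    BlockingQueue<Buffer> freeQ, fullQ;
    for (std::size_t i = 0; i < nBuffers; ++i)
        freeQ.push(std::make_unique<std::vector<uint8_t>>());

    std::thread producer([&] {
        for (std::size_t i = 0; i < nPackets; ++i) {
            Buffer b = freeQ.pop();                  // wait for an empty buffer
            b->assign(4, static_cast<uint8_t>(i));   // stand-in for recv() on a socket
            fullQ.push(std::move(b));
        }
        fullQ.push(nullptr);                         // end-of-stream marker
    });

    uint64_t sum = 0;
    std::thread consumer([&] {
        while (Buffer b = fullQ.pop()) {             // nullptr ends the loop
            for (uint8_t byte : *b) sum += byte;     // stand-in for checks/storage
            b->clear();
            freeQ.push(std::move(b));                // recycle the buffer
        }
    });

    producer.join();
    consumer.join();
    return sum;
}
```

Because buffers circulate between the two queues instead of being allocated per packet, memory use is bounded by `nBuffers`, and a full free-queue drain naturally throttles the producer. Making `run_pipeline` a template over the queue type would reproduce the compile-time choice between the locked and lock-free variants described in the text.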

Conclusion and outlook
The prototype of the TOTEM DAQ system using the SRS was fully successful: the first field tests showed the expected performance. The next step is to scale the present system up to its final size (20 FECs), exploiting the full throughput of the Ethernet links. The project will also leverage the FPGA resources to reduce the event size by introducing pattern-recognition algorithms. Achieving this goal will further increase the overall performance and efficiency of the DAQ system.