Real-time data processing with GPUs in high energy physics

As high energy physics experiments reach higher luminosities and intensities, the computing burden for real time data processing and reduction grows. Following the developments in the computing landscape, multi-core processors such as graphics processing units (GPUs) are increasingly used for such tasks. These proceedings provide an introduction to the GPU architecture and describe how it maps to common tasks in real time data processing. In addition, specific use cases of GPUs in the trigger systems of five different high energy physics experiments are presented.


Computing challenge in high energy physics
In the quest to explore new physics scenarios in high energy physics (HEP), the collection of high statistics data sets is crucial. Therefore, experimental facilities are designed for increasing luminosities and beam intensities. This goes hand in hand with increasing data rates and higher computing demands to process the data. In Run 3 of the LHC, beginning in 2021, LHCb and ALICE will process data rates of 40 Tbit/s and 30 Tbit/s respectively in software. At the software level trigger, ATLAS and CMS will reach this order of magnitude in Run 4, beginning in 2027. Not only the LHC experiments but also dedicated fixed-target experiments, such as the Mu3e experiment at the Paul Scherrer Institute and NA62 at CERN, will face increasingly high decay and data rates in the future.
As the computing landscape is changing, the hardware used for real-time data selection needs to be adapted. With the stagnation of single-thread computing performance since the mid-2000s, the growing computing demand cannot be met with traditional single-core CPUs. What is more, the prediction typically assumed in HEP that the same amount of money will buy hardware with 15-20% better performance in the next year ("flat budget") does not hold without adopting new emerging architectures and computing models. Therefore, the importance of multi-core architectures has increased significantly over the last years. One particular parallel processor provides thousands of cores and tens of TFLOPS in processing power for highly parallelizable tasks: the graphics processing unit (GPU). When fully utilized, a GPU offers the best TFLOPS/$ performance, as illustrated in Figure 1. The price-performance advantage is especially pronounced for consumer GPU cards, but scientific cards also offer a higher performance per dollar than CPUs. Furthermore, the price-performance ratio improves faster over time for GPUs than for CPUs. In HEP, the task of real-time data processing and reduction ("trigger") poses a significant compute load, since particle decays in the detector need to be reconstructed partially or fully to efficiently select signal decays of interest. In these proceedings, I discuss the usage of GPUs in the context of real-time data processing in HEP. In section 2, the GPU architecture is introduced and compared to other processors; section 3 covers the typical tasks and algorithms in trigger applications and how they fit the GPU architecture. Finally, section 4 discusses examples of how GPUs are used in data processing at different experiments.

Introduction to GPU computing
GPUs are designed to display graphics on a computer screen. To transform the color and shade of every object onto the millions of pixels of a screen, the hardware prioritizes throughput over latency. Since the mid-2000s, programmable GPU processors can be used for general purpose GPU computing.

The GPU architecture
The GPU computing paradigm follows the Single Instruction Multiple Threads (SIMT) approach, which has similarities with the Single Instruction Multiple Data (SIMD) paradigm used for vector processing on CPUs. In SIMT, a single instruction decoder serves multiple threads, which execute in lock-step. One algorithm, the so-called "kernel", is executed on many threads, and every thread processes an independent data set. Groups of threads make up a core and the cores are again grouped into units that then make up the GPU. The two most commonly used frameworks for programming GPUs are CUDA [1], developed by Nvidia for their GPUs, and the cross-platform standard OpenCL [2], maintained by the Khronos Group, which supports both AMD and Nvidia GPUs as well as other hardware accelerators. In addition, the open source project HIP [3] is being developed to support both AMD and Nvidia GPUs, and several cross-architecture frameworks such as SYCL [4] (also maintained by the Khronos Group and building on the concepts of OpenCL) and Alpaka [5] are emerging to allow software development for various back ends. CUDA and OpenCL each use distinct terminology to describe the GPU architecture. In the software abstraction for the hardware illustrated in Figure 2, the parallelization of an algorithm is assigned on two levels: threads are grouped into blocks and many blocks of threads make up a grid. Threads within one block share a common memory and can be synchronized. All threads in one block are always assigned to the same Streaming Multiprocessor (CUDA) / Compute Unit (OpenCL). Given the hardware constraints, one has to optimize the resources needed by a kernel, as well as the number of threads per block and blocks per grid, to achieve an efficient usage of the GPU processing power.
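The two-level decomposition into blocks and threads can be illustrated with a serial emulation in plain Python. This is only a sketch: the function names are hypothetical, the "kernel" (squaring each element) is chosen purely for illustration, and the index formula follows the usual one-dimensional CUDA convention of block index times block size plus thread index.

```python
def blocks_needed(n_items, threads_per_block):
    """Number of blocks so that every item gets its own thread (ceiling division)."""
    return (n_items + threads_per_block - 1) // threads_per_block


def global_thread_index(block_idx, threads_per_block, thread_idx):
    """Flat index of a thread within a one-dimensional grid."""
    return block_idx * threads_per_block + thread_idx


def emulate_kernel(data, threads_per_block=4):
    """Serially emulate a kernel that squares every element of `data`.

    On a GPU the two loops below would not exist: every (block, thread)
    pair would execute the loop body concurrently.
    """
    out = [None] * len(data)
    n_blocks = blocks_needed(len(data), threads_per_block)
    for block_idx in range(n_blocks):                 # blocks of the grid
        for thread_idx in range(threads_per_block):   # threads of one block
            i = global_thread_index(block_idx, threads_per_block, thread_idx)
            if i < len(data):                         # last block may be partly idle
                out[i] = data[i] * data[i]
    return out
```

The bounds check in the innermost body reflects a common pattern: the grid is usually rounded up to a whole number of blocks, so some threads in the last block have no data to process.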

I/O of a GPU
A GPU is typically connected to the host CPU via a PCIe connection. All data processed on the GPU is copied from the CPU to GPU memory and any results required on the CPU are copied back via this PCIe connection. It is therefore crucial to verify that the PCIe bandwidth does not pose a limitation to using the GPU as an accelerator. Current GPU models are equipped with PCIe 3.0 connections (16 GB/s for 16 lanes), while the next generation of cards is foreseen to support PCIe 4.0 (32 GB/s for 16 lanes). Scientific Nvidia GPUs also provide the NVLink protocol with a data rate of up to 100 GB/s as interconnect among GPUs.
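A rough bandwidth check of this kind can be sketched as follows. The link speeds are the nominal values quoted above; the function names are hypothetical, and the estimate assumes that the input is split evenly across cards and that the nominal bandwidth is fully achievable, which is optimistic in practice.

```python
import math

# Nominal link bandwidths quoted in the text, in GB/s for 16 lanes.
PCIE3_X16_GBYTE_PER_S = 16.0
PCIE4_X16_GBYTE_PER_S = 32.0


def gbit_to_gbyte(rate_gbit_per_s):
    """Convert a data rate from Gbit/s to GB/s."""
    return rate_gbit_per_s / 8.0


def min_cards_for_input(rate_gbit_per_s, link_gbyte_per_s):
    """Smallest number of cards whose combined link bandwidth covers the input rate.

    This is a lower bound from I/O alone; the compute load may require more cards.
    """
    return math.ceil(gbit_to_gbyte(rate_gbit_per_s) / link_gbyte_per_s)
```

For example, an input stream of 40 Tbit/s (5000 GB/s) would require at least 313 cards on PCIe 3.0 and at least 157 cards on PCIe 4.0 from the point of view of host-to-device transfers alone.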

GPUs compared to other processors
As GPUs are designed to apply the same arithmetic to independent data, they excel at parallel workloads. Compared to CPUs, the GPU cache is smaller with higher latency, the processor runs at a lower frequency and there is no speculative execution. However, by scheduling the thousands of threads optimally, the GPU cores always have work to do and hide the latency via high throughput. Apart from CPUs and GPUs, field programmable gate arrays (FPGAs) are typically used in the data acquisition of HEP experiments. With their fixed, short latency and versatile I/O connectors they are well suited for the early stages of the readout chain. A summary of the different characteristics of CPUs, GPUs and FPGAs is listed in table 1.

Real-time data processing tasks
Over the past decade, GPUs have found more and more use cases within the field of HEP. They are used in data analysis and simulation, mainly through machine learning tools. In these proceedings, the focus lies on the usage of GPUs for data reduction via real-time analysis.

Data reduction strategies
Data reduction in HEP can be divided into two main categories: a selection is either possible based on information from specific detectors or detector regions ("local"), or the information from several detectors is combined ("global"). If the decays of interest generate local characteristic signatures, such as an energy deposit in the calorimeter, the former method can be used. The general purpose detectors ATLAS and CMS fall into this category with their main interest lying in Higgs, jet and electroweak physics. Local selections are also necessary if the data stream is too large to be read out from the detector completely. In this case, low latency and high bandwidth are required, such that FPGAs or application-specific integrated circuits (ASICs) are best suited and the type of selection is referred to as "hardware level trigger". Global selections on the other hand are possible if the whole data stream can be read out or has already been reduced by a hardware level trigger. They are especially useful if signal decays are in large abundance or highly resemble background processes. In this case, the characteristics of the decays are determined by reconstructing the particle trajectories within the detector ("track reconstruction") and possibly performing particle identification and / or adding information from the calorimeters. Since the full data stream has already been read out at this stage, the latency requirements are relaxed compared to the hardware level trigger. Consequently, this type of selection is referred to as "software level trigger" and processors such as CPUs and GPUs can be used.

Mapping real time analysis tasks to GPUs
To determine whether the compute performance of a GPU can efficiently be exploited for real time analysis, it is crucial to establish how many tasks are "parallelizable". A "parallelizable" task is one where the same arithmetic acts on independent sets of data. Only the parallel part of a program benefits from the speedup due to many processors, as stated in Amdahl's law [6]. Therefore, only problems and algorithms with large portions of parallel tasks can make use of the processing power of a GPU. Real-time data reduction in software typically contains several or all of the following tasks:
• Decoding the raw input into the global coordinate system of an experiment;
• Clustering of measurements caused by the passage of the same particle in one detector unit into single coordinates ("hits");
• Finding combinations of hits originating from the same particle trajectory (pattern recognition);
• Describing the track candidates from the pattern recognition step with a track model (track fitting);
• Reconstructing primary and secondary vertices from the fitted tracks (vertex finding);
• Performing particle identification with dedicated detectors;
• Reconstructing the shower caused by a particle in the calorimeter;
• Applying selections to the reconstructed candidates.
In addition to processing many particle collisions (or time slices in the case of continuous beam experiments) in parallel, the above tasks are also parallelizable. The decoding of raw input factorizes by readout unit. Clustering in tracking detectors or calorimeters can be performed in contained regions of a detector. The main compute burden of pattern recognition is the combinatorics of the many possible hit combinations, which can also be processed in parallel. Finally, tracks can be fitted independently from one another, as well as the different combinations of tracks forming a vertex. Particle identification can typically be processed per candidate track. It is often convenient to map the execution of a specific algorithm for one event or time slice to a block of threads, as communication among threads is possible in this case. In the case of large events, the processing of data in a sub-detector may be split into several blocks based on the geometry of the detector.
The various possibilities for parallelization within the software level trigger make all of its tasks or at least a few compute intensive ones optimal candidates to be processed on GPUs.
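Amdahl's law, referred to above, makes the benefit of a large parallel fraction quantitative: for a parallel fraction p of the work and n processors, the speedup is 1 / ((1 - p) + p / n). A minimal sketch, with hypothetical function names:

```python
import math


def amdahl_speedup(parallel_fraction, n_processors):
    """Speedup predicted by Amdahl's law for a given parallel fraction."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


def max_speedup(parallel_fraction):
    """Limit of the speedup for an unlimited number of processors."""
    serial_fraction = 1.0 - parallel_fraction
    return math.inf if serial_fraction == 0.0 else 1.0 / serial_fraction
```

As an illustration, a program that is 95% parallel can be accelerated at most about 20-fold, however many cores are available, which is why a high parallel fraction is a precondition for exploiting thousands of GPU cores.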

GPU usage in real-time analysis at HEP experiments
Various experiments in HEP consider the usage of GPUs for real-time data reduction or have already employed GPUs in the trigger. Track reconstruction in particular is highly compute intensive, so this task is a typical candidate to be processed on GPUs. The following sub-sections describe five different approaches of using GPUs at the trigger level, first for the fixed target experiments NA62 and Mu3e and then for the three LHC experiments CMS, ALICE and LHCb. A common feature among all described use cases is the coherence of the work flow on the GPU itself, as it reduces the number of memory copies required between the GPU and the host CPU. A comparison of the GPU usage in the different experiments is summarized in table 2.

NA62
The NA62 experiment at CERN is dedicated to the study of rare kaon decays. During the low level trigger, the event rate is reduced from 10 MHz to 1 MHz and muon-pion particle identification occurs via a Ring Imaging Cherenkov Detector (RICH). An R&D project exists to perform the reconstruction of the ring-shaped patterns in the RICH detectors on GPUs already at the first trigger level [7]. Rings are reconstructed either by filling, in parallel, histograms of the distances from every photomultiplier to the measurements and finding the one that best fits a circle, or by making use of the Almagest algorithm [16]. The challenge in using a GPU at the earliest trigger stage lies in meeting the strict latency requirements. Therefore, a dedicated network interface card was designed to handle the data transfers to and from the GPU directly over the PCIe switch [8]. With this setup, receiving the data, sending it to the GPU from the network interface card, processing on the GPU and the transfer back take at most 350 µs [7]. A test bed processing data at 5-6 MHz was installed during 2017 and 2018 data taking; the GPU reconstruction is planned to run at 10 MHz in 2021.
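The histogram-based ring search can be sketched in a few lines of Python. This is a simplified illustration of the idea, not the NA62 implementation: the function names, the bin width and the maximum radius are assumptions, only a single ring is searched for, and the Almagest algorithm is not shown. Each candidate centre is treated independently, which is what makes the approach suitable for one GPU thread per candidate.

```python
import math


def ring_hits(centre, radius, n_hits):
    """Generate n_hits points equally spaced on a circle (toy input data)."""
    cx, cy = centre
    return [(cx + radius * math.cos(2.0 * math.pi * k / n_hits),
             cy + radius * math.sin(2.0 * math.pi * k / n_hits))
            for k in range(n_hits)]


def find_ring(hits, candidate_centres, bin_width=1.0, r_max=50.0):
    """Return (peak_count, centre, radius) of the best single-ring hypothesis.

    For every candidate centre, the distances to all hits are histogrammed.
    Hits on a ring around that centre pile up in one bin, so the candidate
    with the highest peak is taken as the ring centre.
    """
    n_bins = int(math.ceil(r_max / bin_width))
    best = None
    for centre in candidate_centres:          # independent: one thread each on a GPU
        hist = [0] * n_bins
        for x, y in hits:
            d = math.hypot(x - centre[0], y - centre[1])
            b = int(d / bin_width)
            if b < n_bins:
                hist[b] += 1
        peak_bin = max(range(n_bins), key=hist.__getitem__)
        if best is None or hist[peak_bin] > best[0]:
            radius = (peak_bin + 0.5) * bin_width   # bin centre as radius estimate
            best = (hist[peak_bin], centre, radius)
    return best
```

In this sketch the candidate centres play the role of the photomultiplier positions mentioned above; the radius resolution is limited by the chosen bin width.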

Mu3e
Designed to search for the lepton flavour violating decay µ⁺ → e⁺e⁻e⁺, the Mu3e experiment is being constructed at the Paul Scherrer Institute in Switzerland. The complete data stream of 80 Gbit/s is read out and split into 50 ns time slices. The data selection will occur fully on GPUs based solely on data from the central pixel detector. Combinations of three hits are already determined in the readout FPGA board and transferred to the GPU at a data rate of 32 Gbit/s. Then the 3-hit stubs are extended to the fourth pixel layer and the linear three-dimensional track fit with multiple scattering developed for Mu3e [17] is processed. Finally, three-track vertices are reconstructed based on geometric constraints and selection decisions are copied back from the GPU to the host CPU. Processing time slices and track seeds in parallel, 12 GTX 1080 Ti GPU cards are sufficient to process the full data stream and reduce the event rate by a factor of 100 [9].

CMS
To cope with a pile-up of 140 at the high luminosity LHC (Run 4, starting in 2027), CMS plans to introduce a new trigger stage for track reconstruction using high performance computing platforms. In this context, track reconstruction in the pixel detector is proposed to run on GPUs [10,11] within the high level trigger. At this point, the event rate has already been reduced to 100 kHz by the hardware level trigger. Decoding of the pixel raw data, clustering and pattern recognition are offloaded per event to a GPU coprocessor. The cellular automaton algorithm [18] is used for pattern recognition. Based on a graph of interconnected cells, this algorithm is easily parallelizable as the cells (segments of a track) can be built independently. Offloading the pixel track reconstruction to GPUs in the high level trigger is already planned for Run 3 of the LHC in 2021.
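The idea behind cellular automaton pattern recognition can be illustrated with a toy example in two dimensions. The sketch below is not the CMS implementation: the geometry (equidistant layers, one coordinate per hit, distinct hit positions within a layer), the cuts `max_slope` and `max_kink`, and all function names are assumptions made for illustration. Cells are hit doublets on adjacent layers; each cell is then assigned the length of the longest chain of compatible cells ending in it, and a track is read off by walking back from the cell with the highest state.

```python
def build_cells(layers, max_slope):
    """Build cells (hit doublets on adjacent layers); each is independent of the others."""
    cells = []
    for layer in range(len(layers) - 1):
        for y0 in layers[layer]:
            for y1 in layers[layer + 1]:
                if abs(y1 - y0) <= max_slope:
                    cells.append((layer, y0, y1))
    return cells


def is_inner_neighbour(inner, outer, max_kink):
    """True if `inner` ends on the hit where `outer` starts and the slopes agree."""
    return (inner[0] == outer[0] - 1 and inner[2] == outer[1]
            and abs((outer[2] - outer[1]) - (inner[2] - inner[1])) <= max_kink)


def evolve(cells, max_kink):
    """Iterate until every cell holds the length of the longest chain ending in it."""
    state = {c: 1 for c in cells}
    changed = True
    while changed:
        changed = False
        new_state = dict(state)
        for c in cells:                      # independent per cell within one iteration
            for n in cells:
                if is_inner_neighbour(n, c, max_kink) and state[n] + 1 > new_state[c]:
                    new_state[c] = state[n] + 1
                    changed = True
        state = new_state
    return state


def extract_longest_track(cells, state, max_kink):
    """Walk back from the cell with the highest state; returns (layer, y) hits."""
    c = max(cells, key=lambda cell: state[cell])
    hits = [(c[0] + 1, c[2]), (c[0], c[1])]
    while state[c] > 1:
        for n in cells:
            if is_inner_neighbour(n, c, max_kink) and state[n] == state[c] - 1:
                c = n
                hits.append((n[0], n[1]))
                break
        else:
            break                            # no predecessor found; stop defensively
    return hits[::-1]
```

Both the cell building and each evolution step act on every cell independently, which is the property that makes the algorithm map well onto many GPU threads.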

ALICE
At ALICE, GPUs have been used in the high level trigger [12] to perform track reconstruction within the time projection chamber (TPC) for calibration purposes already since Run 1 of the LHC [19].
Since the primary goal of ALICE is studying the quark-gluon plasma in Pb-Pb collisions, the event size is orders of magnitude larger than for other LHC experiments, but events occur at a lower rate. As opposed to the other experiments described in these proceedings, ALICE compresses its data rather than selecting it. For the compression, high level objects such as reconstructed tracks are required. Therefore, the steps are similar to those used in other experiments for selection. Similar to the CMS approach, the cellular automaton algorithm is used for pattern recognition to find track seeds within the TPC. In addition, a Kalman filter [20] is employed for track forwarding and the track fit. During Runs 1 and 2 of the LHC, a hardware level trigger was used, followed by calibration and compression in the high level trigger. In Run 3, ALICE switches to a fully software trigger scheme and the data rate processed in the high level trigger will increase from 384 Gbit/s in Run 2 to 30 Tbit/s. The GPU algorithms are updated for the higher event rate and for the upgraded TPC detector. Beyond the TPC track reconstruction, other parts of the event selection might also be processed on the GPU, such as the extension of tracks inside the TPC to the Inner Tracking System (ITS), consisting of pixel detectors [13,14].
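The predict and update steps of a Kalman filter can be shown with a scalar toy example. This sketch is far simpler than a real track fit, where the state is a vector of track parameters that is propagated through the detector; here a single constant quantity is estimated from repeated measurements, and all names and default values are assumptions.

```python
def kalman_step(x, p, z, r, q=0.0):
    """One predict/update cycle for a scalar, constant state.

    x, p: current estimate and its variance
    z, r: new measurement and its variance
    q:    process noise added in the prediction step
    """
    p_pred = p + q                   # predict: state unchanged, uncertainty grows
    gain = p_pred / (p_pred + r)     # weight of the new measurement
    x_new = x + gain * (z - x)       # update the estimate with the residual
    p_new = (1.0 - gain) * p_pred    # updated variance is never larger than p_pred
    return x_new, p_new


def kalman_fit(measurements, r, x0=0.0, p0=1e6, q=0.0):
    """Process measurements sequentially, starting from a very uncertain estimate."""
    x, p = x0, p0
    for z in measurements:
        x, p = kalman_step(x, p, z, r, q)
    return x, p
```

Because every track is filtered with its own state, many such fits can run side by side, in line with the statement that tracks can be fitted independently from one another.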

LHCb
The LHCb experiment is designed for the study of beauty and charm quarks. Part of the extensive detector upgrade ongoing for Run 3 is the complete readout of all detectors at 40 Tbit/s and an entirely software based trigger, split into two stages. During the first stage, High Level Trigger 1 (HLT1), the event rate is reduced from 30 MHz by a factor of 30-60 based on inclusive 1- and 2-track selections. The second trigger stage on the other hand is mainly based on exclusive selections. The baseline design of the data acquisition system foresees the two trigger stages to be processed entirely by CPUs [21]. An alternative approach has also been developed in the so-called "Allen" project, where the full HLT1 sequence was implemented to run on GPUs [15]. The concept is based on transferring raw data to the GPU, processing everything from decoding the binary payload to event selections on the GPU, and copying only the decisions and underlying objects back to the host CPU. Decoding for four sub-detectors, clustering in the pixel detector, track reconstruction in three sub-detectors, muon identification, as well as primary and secondary vertex reconstruction and the selections are implemented on the GPU. The full data stream is processed on fewer than 500 Nvidia V100, Quadro RTX 6000 or RTX 2080 Ti cards.
In the baseline solution, two distinct server farms handle the data stream: one that receives data from the different sub-detectors and builds events and a second one where both high level trigger stages are executed. The data stream is only reduced in the second server farm in this scenario. Since 500 GPU cards physically fit into the first server farm, the data rate can already be reduced at an earlier level if HLT1 is executed on GPUs. Therefore, a significantly cheaper network is required between the first and second server farm and money can instead be spent on GPUs. This demonstrates that GPUs naturally integrate into LHCb's data acquisition system and make it more compact.
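The rates quoted for HLT1 translate into simple throughput requirements. The following sketch uses only the numbers given above (30 MHz input, a reduction factor of 30-60, at most 500 cards); the function names are hypothetical and an even distribution of events across cards is assumed.

```python
def output_rate_mhz(input_rate_mhz, reduction_factor):
    """Event rate after a trigger stage with the given reduction factor."""
    return input_rate_mhz / reduction_factor


def min_event_rate_per_card_khz(input_rate_mhz, n_cards):
    """Event rate each card must sustain if events are shared evenly among n_cards."""
    return input_rate_mhz * 1000.0 / n_cards
```

With these inputs, HLT1 delivers between 0.5 MHz and 1 MHz to the second trigger stage, and with 500 cards each GPU has to sustain 60 kHz; since fewer than 500 cards are used, the actual requirement per card is higher.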

Conclusion
The computing demand of real-time data processing is increasing with higher luminosities and beam intensities in HEP. To address this challenge in view of the changing computing landscape, parallel processors such as GPUs are emerging in trigger systems due to their high price-performance ratio. The strength of GPUs lies in applying the same computation to independent data. This concept maps well onto algorithms used in real-time analysis, such as the reconstruction of particle trajectories. GPU cards are employed either to perform a specific data processing task or to handle a full trigger stage, mostly at the level of software triggers. Numerous HEP experiments already use or plan to use GPUs, both at colliders and in fixed target facilities. These developments will likely impact the design of trigger systems at future experiments and facilities.