Pile-Up Mitigation using Attention

Particle production from secondary proton-proton collisions, commonly referred to as pile-up, impair the sensitivity of both new physics searches and precision measurements at LHC experiments. We propose a novel algorithm, PUMA, for identifying pile-up objects with the help of deep neural networks based on sparse transformers. These attention mechanisms were developed for natural language processing but have become popular in other applications. In a realistic detector simulation, our method outperforms classical benchmark algorithms for pile-up mitigation in key observables. It provides a perspective for mitigating the effects of pile-up in the high luminosity era of the LHC, where up to 200 proton-proton collisions are expected to occur simultaneously.


Introduction
The Large Hadron Collider (LHC) at CERN, Geneva, will remain the most powerful tool on Earth to produce and study heavy elementary particles, at least for this decade. Further maximizing the potential of its experiments, such as ATLAS and CMS, and thereby increasing the chances of discovering new physics in the coming LHC runs, is of paramount importance. These runs will be characterized by an ever-increasing instantaneous luminosity, i.e., by a larger average number of simultaneous proton-proton collisions. During the final High Luminosity phase of the LHC (HL-LHC) starting in 2027, this number is expected to reach 200, which is almost an order of magnitude more than what was seen during Run 2 (2016Run 2 ( -2018.
This poses an enormous challenge to the experiments, because their reconstruction algorithms have to identify interesting signatures among a large number of signals ‡ These authors contributed equally to this work.
coming from secondary collisions. Reconstructed objects falsely attributed to the primary collision are called "pile-up", and they can dramatically impair the sensitivity of an analysis. Especially in the Run 3 and High-Luminosity scenarios, removing pile-up contamination will become a primary objective. We present a method for identifying pile-up particles with the help of deep neural networks based on sparse transformers adapted from the field of natural language processing.
One example of a widely-used algorithm for rejecting pile-up particles is the charged-hadron subtraction as used in the CMS particle-flow (PF) algorithm [1]. It removes charged particles whose tracks have not been assigned to the primary vertex. As more sophisticated algorithms, PUPPI [2] and SoftKiller [3] aim at further identifying the pile-up component present among neutral particles.
First studies on the performance of pile-up mitigation using image recognition techniques [4] to identify pile-up contributions, and graph-based [5,6] methods have focused on a subset of particles in the event clustered into hadronic jets or did not consider detector resolution effects. In addition, the weights calculated for each particle using the PUPPI algorithm are used as input features in these networks. Here, we present for the first time a machine learning algorithm relying only on raw reconstructed information that outperforms the classical benchmarks like PUPPI on event-and jet-level metrics, and in a realistic detector setting. This is a crucial step towards demonstrating the superiority of machine learning-based pile-up rejection on a global event level at future detectors and scenarios like the HL-LHC.
The paper is organized as follows. Firstly, a description of the setup and of the datasets is provided that we use for training and for performing the comparisons between the different pile-up mitigation algorithms. We then give a detailed explanation of the implementation of Puma. Third, we introduce the key metrics to benchmark the performance of the algorithms and provide details on the algorithm training. Finally, we quantify the performance of Puma.

The Delphes Simulation Framework
The goal of Delphes [7] is to allow the simulation of a multipurpose detector for phenomenological studies. The simulation includes a track propagation system embedded in a magnetic field, electromagnetic and hadron calorimeters, and a muon identification system. Physics objects that can be used for data analysis are then reconstructed from the simulated detector response. These include tracks and calorimeter deposits and high level objects such as isolated electrons, jets, taus, and missing energy.
The Delphes framework allows for a fast simulation of an approximated detector response for typical LHC detectors using parameterized resolution and efficiency functions. The simulation includes a tracking system within a magnetic field, electromagnetic (ECAL) and hadronic calorimeters (HCAL) as well as a muon systems. High-level objects like isolated electrons, particle jets or missing transverse energies are reconstructed using low level observables such as tracks and energy deposits in the calorimeters. The calorimeter systems within Delphes is finely segmented in η and φ and it is assumed that ECAL and HCAL have the same granularity, i.e. each ECAL cell has a corresponding cell in the HCAL. The geometrical center of each cell then defines the coordinate of the calorimeter energy deposit.
The stable charged particles on generator level with a minimal transverse momentum (e.g. p T > 100 MeV) undergo the track reconstruction. The track reconstruction efficiency as well as the resolution and scale is parameterized vs. p T , η and φ.
Particles on generator level that reach the calorimeter system leave the fractions f ECAL and f HCAL in the electromagnetic and hadronic calorimeter cells, respectively. The relevant cells can then be group together in one calorimeter tower. When several particles reach the same cells the total energy of one tower is simple the sum over all particles, that leave energies either in the ECAL or HCAL. Delphes assumes that all electrons and photons leave their full energy on the ECAL system, i.e. f ECAL = 1.0 and f HCAL = 0.0. Muons and neutrinos are assumed to leave no energy at all in the calorimeter system. All stable hadrons are treated with f ECAL = 0.0 and f HCAL = 1.0, with the exception of Kaons and Λ, where the values f ECAL = 0.3 and f HCAL = 0.7 are used.
The resolutions of the electromagnetic and the hadronic calorimeters are independently parameterized in dependence of the particle kinematics, using a stochastic, a noise and a constant as parameters.
The CMS collaboration was one of the first that implemented a PF algorithm for the reconstruction and measurement of finale state objects, in order to maximize the use of sub-detector measurements for the event reconstruction. Since the full PF approach is rather complex, a simplified version is implemented within the Delphes framework, which results in two types of object collections: PF tracks and PF towers. Each tower is described by the total energy deposited in the electromagnetic and the hadronic calorimeter, E ECAL and E HCAL , respectively. In addition, the total energy deposit originating from charged particles in a given tower is stored in the variables E ECAL,trk and E HCAL,trk . It is important to note in this context, that Delphes assumes that the momentum of charged particles is always measured best by the tracking system. This allows to define the energy flow of a tower by where ∆ ECAL = E ECAL − E ECAL,trk and ∆ HCAL = E HCAL − E HCAL,trk . The PF algorithms then create a PF track for each reconstructed track and a PF tower with the energy E eflow Tower if E eflow Tower > 0. Particle-flow tracks describe therefore all charged particles with a good resolution. Particle-flow towers on the other hand, contain the energy information of neutral particles and charged particles that have not been reconstructed by the tracking system. In addition, also energy deposits due to the positive smearing of the calorimeters are taken into account. A detailed description of the algorithms can be found in Reference [7], where also several validation studies are shown.

Collision datasets
To emulate the HL-LHC data taking situation, we generate Monte-Carlo simulations of LHC proton-proton collisions at √ s = 14 TeV using Pythia version 8.244 [8]. Pythia is used for both matrix-element generation and parton showering and hadronization, employing the parton shower tune 4C [9], which also provides a description of effects from multi-parton interaction and the underlying event. The stable generator-level particles are subsequently passed through a model of the CMS detector built with Delphes version 3.4.3pre01, retaining the correspondence between reconstructed PF objects and the incident truth particles. The layout of the detector roughly corresponds to the Phase-II upgrade conditions at CMS, including novel, high-granularity forward detector components. The processes simulated are dileptonic tt production, Z(→ νν)+jets production, vector boson fusion (VBF) production of a Higgs boson with subsequent H → cc or H → dark matter decays, and soft-QCD production. For each of the studied SM processes, 200 thousand events have been generated for training and evaluation. On average, 140 soft-QCD events are mixed with the hard scattering event, sampled from a total of 50 million soft-QCD events. In the following, "PF candidate" and "particle" will be used interchangeably to refer to a PF object.
Very few simplifying assumptions are made in the simulation and reconstruction of the events: for stable, charged particles (electrons, muons, charged hadrons), the vertex assignment is assumed to be perfect. No underlying event is simulated. We note that apart from these assumptions, this study is based on a fast but sophisticated detector simulation that realistically models particle reconstruction efficiencies and smearing of track momenta and calorimeter tower energies.

Particle and event definitions
Each PF candidate is represented by a set of features: its Lorentz four-vector, the particle species, the electric charge, the corresponding vertex (if available), and local cluster information (see Section 3.1 for the method to obtain event sub-clusters). An event is represented as a set of PF candidates, as well as the number of reconstructed vertices. For each particle, we also compute the target quantity to be learned: where E gen LV is the summed energy of the incident generated particles from the leading vertex (LV), i.e., from the hard interaction, associated with the PF candidate. Similarly, E gen PU is the summed energy of associated incident generated particles stemming from pileup interactions. The quantity y is referred to as hard energy fraction in the following.

Modeling Pile-up with Sparse Transformers
We formulate the problem of pile-up identification as one of regressing the hard energy fraction of each particle. That is, we define a model g(·; Θ) to minimize the loss L(·; Θ): where x i is the feature vector of particle i in a given event, y i is the hard energy fraction of particle i, Θ denotes free parameters of the model g, and g(· · ·) j is the prediction of the model for particle j. Note that this prediction is conditioned on all particles in the event {x i }, but the loss is computed independently for each particle. This energy fraction regression task is distinct from previous ML approaches to the pile-up problem [5,6], which treat the problem as one of classification: a particle is uniquely identified as arising from a hard or pile-up vertex. This definition is no longer valid when considering the realistic scenario of a detector with finite spatial and energy resolutions. A measured PF candidate is frequently the result of several particles depositing energy in the same detector component. Accordingly, we train our models to decompose each detected particle into co-linear LV and PU components.
We parameterize g as a deep neural network Puma: Pile-Up Mitigation using Attention. Attention was first introduced in 2014 [10] for neural machine translation of natural languages, and further developed as self-attending Transformer networks [11] for a multitude of natural language tasks.
The dynamics of particle decay and hadronization at the LHC share some conceptual similarities with natural language problems: much information of particle dynamics can be described in a local picture (in phase space), with some information requiring higher-order global abstraction (e.g. jet hadronization). This is analogous to a natural language sentence, where most words are closely coupled to nearby words, but some long-range dependencies do arise. Therefore, we hypothesize that a similar model parameterization may work in both scenarios.
At the core of Puma lie several transformer layers [11]. The full model can be written as: where X = [x 1 , . . . , x N ] T is the particle feature matrix, MLP are multi-layer perceptrons, and Transformer is a stack of Transformer layers.
In defining a model that uses full self-attention, a difficulty arises: the time and memory complexity of self-attention scales quadratically with the cardinality of the input set. In a typical particle collision with 140 pile-up interactions, the number of particles, N , can reach ten thousand. This makes training the model with reasonable batch sizes, even on large GPUs, prohibitively difficult.
To reduce the memory consumption of the transformer, we find a way to sparsify the self-attention mechanism. While many variants of sparse attention exist [12,13,14], we use the Longformer [15] implementation. Essentially, we construct a nearest-neighbors graph among the particles, such that the non-zero elements of the adjacency matrix are a subset of a banded-diagonal matrix. By limiting the size of this band, w, we ensure that the attention complexity scales as O(N w) O(N 2 ).

Hierarchical particle clustering and sparsification
First, we cluster particles in a hierarchical fashion on the surface of a cylinder parameterized by (sin(φ), cos(φ), η). The result of IterativeCluster({x i }, w clust , k) (Algorithm 1) is a set of clusters of particles, of size w clust or smaller. Algorithm 1: Hierarchical particle clustering algorithm: w clust is the maximum cluster size, and k is the number of clusters at each iteration.
KMeans is a k-means clustering function that computes clusters on a cylindrical surface. Function Once the clusters {C 1 , . . . , C nc } have been computed, we define a unique ordering. For each cluster C i , we compute three quantities: These represent, respectively, the total transverse momentum of the cluster, the p Tweighted pseudorapidity of the cluster, and the p T -weighted azimuthal angle of the cluster. We define a permutation of the clusters π. It is initialized as: Then, for a > 0: where The results of this hierarchical clustering and ordering are shown for an example event in Fig. 1. The ordering of the clusters give a natural ordering of the particles, in which the particles are first ordered according to the cluster they belong to, and within each  Figure 1: Particle assignment to clusters (indicated by color) for a certain event after 1 iteration, after 2 iterations, and after convergence. In all three figures, k = 4 and w clust = 10. In Fig. 1c, clusters are colored based on their ordering (from dark to light), and particle marker areas are proportional to particle p T .
cluster, by decreasing p T . The particles can then be thought of as a graph with n c complete, mutually-disconnected subgraphs. Each particle is a vertex, and two particles are connected by an edge if they belong two the same cluster. As a consequence of our particle ordering, the adjacency matrix A clust is block-diagonal, as illustrated in Fig. 2. Furthermore, our cluster ordering means each block is adjacent to blocks arising from clusters that are proximal in (η, φ)-space. With <n PU > = 140, the number of reconstructed particles per event averages at about 6000, with some events having up to 9000 particles. Therefore, the sequence of particles is zero-padded up to a length of 9000 if fewer particles are reconstructed.

Banded attention
The clustering allows us to avoid attending over particles that are far away from the query particle. However, if we were to only allow the attention mechanism to consider  Fig. 2c: the minimum banded-diagonal matrix required to cover the w clust = 5 adjacency matrix. Fig. 2d: a slightly wider banded-diagonal matrix that allows for longer-range attention across clusters.
keys adjacent to the query in A clust , the model could not consider relationships between clusters. Instead, we use a banded-diagonal matrix A with bandwidth w > w clust . Larger values of w allow stronger cross-cluster attention within a Transformer layer. Figure 2 illustrates the relationship between w clust , A clust and w, A. In practice, we find that values of w two to four times larger than w clust allow for sufficient information flow between clusters. Attention band widths smaller than w = 100 allows us to leverage the efficient attention kernels provided by the authors of [15]. For widths larger than 100, we encounter memory limitations on our hardware. However, by stacking Transformer layers, we are effectively still able to increase the receptive field, as the transformed feature vectors after layer n carry information from particles n × w away from the query particle.
Puma is implemented in Python using PyTorch [16] and uses elements of the Transformers [17] library. The data processing, including IterativeCluster, is Table 1: Variables used as input to the network. While they are attributes of each particle, they resemble three categories of different locality: variables corresponding to properties of the individual particle (p T , η, φ, E, particle ID, vertex ID); variables characterizing the cluster the particle is in (cluster ID, cluster R, cluster p T ); an eventwide variable (N PV ).
p T particle transverse momentum η particle pseudorapidity φ particle azimuthal angle E particle energy particle ID particle ID vertex ID vertex of particle, -1 if neutral cluster ID index of cluster containing particle cluster ∆R max. pairwise distance ∆R = ∆η 2 + ∆φ 2 between particles found in cluster cluster p ch T scalar sum of p T 's of all charged LV particles in cluster containing the particle cluster p neut T scalar sum of p T 's of all neutral particles in cluster containing the particle N PV number of reconstructed vertices in the event implemented in C++ using Delphes and ROOT [18].

Training procedure
The standard MSE loss (Eq. 2) puts all particles on equal footing. However, in a high pile-up event, we expect the vast majority of particles to be of extremely low-momentum, while the event dynamics are dominated by a handful of high-momentum particles. To bias Puma towards more accurate modeling of high-momentum particles, we modify the loss: where k is the index over the particles contained in the leading cluster and α ≥ 0 is a tunable hyperparameter. We choose α = 2 for our studies. As vertex identification for charged particles is essentially perfect, Puma only needs to optimize the loss function in Eq. 7 for neutral particles. This is equivalent to setting f j = 0 for charged particles. In practice, we find doing so does not improve performance on neutral particle energy regression, nor other downstream metrics. In what follows, the loss is computed over all particles.
The loss function in Eq. 7 is minimized stochastically using the Adam optimizer [19]. We employ batch sizes between 512 and 8192, distributed across four Nvidia Tesla V100 GPUs. To achieve the higher end of this range, we accumulate gradients over a maximum of 8 iterations before applying weight updates. The learning rate is initialized to 10 −3 and follows a cyclical schedule [20], with a period of 4 epochs and decay factor of 0.97. All models are trained to convergence, which typically occurs after O(10 5 ) steps.

Evaluation metrics
In addition to the key metric of the weighted mean square error, we evaluate the performance of pile-up mitigation techniques using three other metrics: the transverse momentum imbalance, (p miss T ); the leading jet p T ; and the RMS of the transverse component of the hadronic recoil vector.
The vector sum of the p T of all produced particles must be zero because the initial state of a hadron collision has no net momentum in the transverse plane. We calculate this as variable as: In events with undetectable particles (e.g. neutrinos), we expect to find p miss T ≡ | p T miss | > 0. Many crucial Standard Model measurements, e.g. of the W mass, involve neutrinos and rely on an optimal pile-up rejection, and several beyond-SM models predict other undetectable particles (e.g. dark matter, stable SUSY particles, gravitons). Pile-up interactions are isotropically distributed in φ and have minimal p miss T which causes an increase of the variance of reconstructed p miss T . Therefore, it is of great importance to assess the impact of pile-up, and pile-up mitigation techniques, on the missing transverse momentum. We define three per-event metrics from the momentum imbalance: where p refers to the true momentum imbalance of generated particles prior to detector effects andp refers to the estimated momentum imbalance of reconstructed particles after detector effects, PF reconstruction, and pile-up mitigation.
As the LHC is a hadron collider, many events produce jets, i.e., bundles of collimated hadrons. In some cases, the jet is incidental to the process being studied (e.g. production of a Z boson at high p T ), while in other cases it is a fundamental signature of the process (e.g. VBF production of a Higgs boson). Pile-up interactions typically produce soft jets. However, even if the hard interaction has produced several hard jets, particles from the soft pile-up jets can affect the reconstruction of the hardest jets. Therefore, we also consider misreconstruction in the mass of the dijet system in the VBF Higgs production mode as another metric: We do not expect any of these metrics to reach the ideal value of zero, even with perfect pile-up regression, becausep includes the effects of detector resolution and PF reconstruction. Some previous work has neglected these effects, probably to isolate the impact of pile-up. Finally, the measured hadronic recoil, U , in a Z+jets event can be decomposed in a parallel, U || , and a perpendicular, U ⊥ , component with respect to the true vector boson transverse momentum, p Z T , where an ideal measurement yields p Z T + U || = 0 and U ⊥ = 0. The RMS error of the U ⊥ distribution is typically taken as a measure of the hadronic recoil resolution. However, this definition has the disadvantage that any rescaling of the measured hadronic recoil components by some factor α also changes the resolution.
The width of the rescaled U ⊥ distribution will be smaller for α < 1, however no real gain in a physics measurement has been achieved, because an additional bias in p Z T + U || is introduced and the sensitivity of U || on p Z T decreases. In order to ensure a fair comparison between the different methods, the factors α i in Eq. 11 for each bin of p Z T have been chosen such that the average bias < p Z T + U || > in a given bin of p Z T is the same for all methods. The figure of merit is then the RMS error of the resulting U ⊥ distribution.
We define our gold standards as the event descriptions achievable assuming perfect pile-up regression but imperfect particle reconstruction. That is, we consider four scenarios: • CHS: charged-hadron subtraction: only consider charged particles if they are associated with the primary vertex; leave neutral particles unscaled • Puppi: scale each particle's 4-vector by the pile-up likelihood w puppi • Puma: scale each particle's 4-vector by the estimated hard energy fractionŷ • Gold standard: compare to a sample generated with n PU = 0

Model hyperparameters
In this section we describe the final choice of model hyperparameters selected. The cluster size used in IterativeCluster is set to w clust = 10. The remainder of the hyperparameters are described in Tab. 2. The model has 127074 trainable parameters.
Of particular interest is the attention band width, as this is chosen to be small to limit the memory footprint of the model. As shown in Fig. 3, it turns out Puma . Note that the difference in the absolute values of the two curves is not meaningful: the loss is dependent on a number of process-specific factors, like particle p T , η, charge, etc. The purpose of this figure is to demonstrate that in both processes, the loss is not strongly a function of w. The dark red points show the optimal result for VBF H(cc), i.e., when Puma is trained on this process instead of tt. The increase in performance for this configuration is minimal considering the large improvement over PUPPI that is present for all Puma scenarios. Right: If the PF candidate sequence is ordered by p T , only very large attention band widths can recover the performance of IterativeCluster with an attention band width of 10.
is robust across a range of attention widths. This holds true both for the process Puma has been trained on, as well as for the inference for a different physics process, namely Higgs boson production with subsequent decays of the Higgs boson to charm quarks. We hypothesize this is due to the optimal cluster ordering, describing pile-up primarily as a local problem; additionally, by stacking many Transformer layers and thereby increasing the receptive field, we are sensitive to residual long-range relations between pile-up particles even with small attention band widths. As noted in Sec. 3.2, we do not explore w > 100 due to computational constraints. For inference, we consequently choose the model trained for 300 epochs with an attention band width of 15.

Physics performance
In terms of the key metrics introduced in Sec. 4.2, we find that Puma outperforms PUPPI across the board. Figures 4a-4b show the difference between estimated p miss T and true p miss T for the same process that Puma was trained on, i.e., dileptonic  tt production. A sample that is statistically independent from the training sample has been used for inference. Note that the gold standard (n PU = 0) does not have perfect reconstruction, due to finite detector and PF resolution. An absolute improvement in the p miss T resolution, computed as the variance of the respective distribution, of about 20% is observed compared to PUPPI. Alternatively, this can be characterized as Puma bridging half of the gap between PUPPI and the theoretical minimum gold standard.
This demonstrates for the first time that a machine learning algorithm can mitigate pile-up effects more effectively than the currently best rule-based algorithm (PUPPI) at the event level, with a realistic detector simulation, and without actually using the scores of traditional algorithms like PUPPI as input features.
The same improvement is achieved for a different physics process in Figs. 4c-4d, where the model was applied on a sample of Z(νν)+jets events. Finally, as can be seen from Figs. Figs. 4e-4f, the dijet mass and p miss T for VBF Higgs production events with invisible decays of the Higgs boson -an essential search channel when looking for dark matter at the LHC -likewise demonstrate an increase in resolution compared to PUPPI.
The RMS error of the response-corrected U ⊥ distributions in tt production and Z+jets production is shown in Fig. 5, where a clear improvement over PUPPI is evident across the entire range of hadronic recoil. By eliminating virtually all contributions from pile-up in the case of Z+jets, we obtain the same resolution as observed in the gold standard sample.
Finally, Fig. 6 demonstrates that the particle grouping found by IterativeCluster is necessary to achieve the observed performance. We compare to a version of Puma trained using p T -ordered particles. While momentum ordering does still lead to an improvement over PUPPI, this improvement is only half of what is observed with our optimal Puma. This demonstrates that pile-up is to first order a local problem,  Figure 6: Left: Comparison between two Puma algorithms and PUPPI. An attention band width of 15 was used for both Puma models; the PF candidate sequence is either ordered by p T or according to IterativeCluster, which gives the best performance. Right: Using the PUPPI weight per particle as input in addition to the variables listed in Tab. 1 does not result in sizable improvement over the default Puma. In both figures, the parameter δ is estimated for each distribution and consists of the mean and the variance, reflecting the resolution in the estimation of the respective quantity.
where the particles in the vicinity of the query particle contain the most information about its vertex of origin. It also shows that attention mechanisms combined with IterativeCluster are optimally suited to exploit this information. The marginal improvement shown in the right panel of Fig. 6 when using the PUPPI weight per particle as input feature in addition to the ones listed in Tab. 1 suggests that Puma manages to capture the information contained in PUPPI and further improves on it. The observed improvement in the resolutions in p miss T for a variety of processes, which are crucial for SM precision measurements and essential backgrounds in many searches for BSM physics, would directly translate to a more powerful analysis of a plethora of signatures expected in LHC collisions, especially towards the HL-LHC.

Summary
We have presented a highly effective pile-up mitigation algorithm Puma based on sparse self-attention. This method is the first machine learning approach relying only on raw reconstructed observables to demonstrate superior performance in a realistic detector scenario over current state-of-the-art, rule-based pile-up mitigation techniques. This holds true for both event-level and particle-level quantities. As pile-up effects will continue to worsen as the LHC moves to increasing luminosity, this is an important step towards showing that statistically-learned algorithms like sparse transformers can be very useful at the HL-LHC and beyond.