Identifying quenched jets in heavy ion collisions with machine learning

Measurements of jet substructure in ultra-relativistic heavy ion collisions suggest that the jet showering process is modified by the interaction with quark gluon plasma. Modifications of the hard substructure of jets can be explored with modern data-driven techniques. In this study, a machine learning approach to the identification of quenched jets is designed. Jet showering processes are simulated with a jet quenching model Jewel and a non-quenching model Pythia 8. Sequential substructure variables are extracted from the jet clustering history following an angular-ordered sequence and are used in the training of a neural network built on top of a long short-term memory network. We show that this approach successfully identifies the quenching effect in the presence of the large uncorrelated background of soft particles created in heavy ion collisions.

In addition to measurements on high p T hadrons, which are the leading fragments of jets, the LHC opened new opportunities for direct observations of jet quenching and jet modifications in the QGP. Significant dijet transverse momentum asymmetry [12][13][14][15] and suppressed jet production [16][17][18] were observed in lead-lead (PbPb) collisions. Jet-hadron correlations gave insights into how the energy lost by the jets is distributed in the medium produced in the collisions. Measurements of the jet shapes (transverse momentum radial profile) [19][20][21] and the jet fragmentation functions [22,23] were performed to further investigate the mechanism of the jet shower modifications in the QGP. To quantify the medium modifications, measurements in proton-proton (pp) collisions [24][25][26][27][28][29][30][31][32] are used as a reference for jet modifications in the medium. The jet splitting function has been measured in both pp and PbPb collisions [33][34][35], through the usage of a jet grooming algorithm that is able to split ("decluster") a single jet into two subjets and locate the hard splitting by removing softer wide-angle radiation contributions. The hard splitting, characterized by two well-separated subjets ("two-prong" structure), provides access to the early stages in the parton shower evolution.
In recent years many machine learning applications have been developed for studies in high-energy physics [36]. For the identification of quenched jets, a previous study of this classification problem [37] explored several ways of jet representation and different neural network architectures. It showed some promising results of deep learning techniques applied for the study of quenched jets in QGP at simulation level. However, the application of the convolutional neural network (CNN) often involves image preprocessing techniques, such as rotation, and normalization of the pixel intensity, which are not universal. These techniques are usually designed for a specific jet species, for example the boosted W/Z jets as studied in Ref. [38], and may use a particular jet substructure feature, as studied in Ref. [39,40] for QCD jets. In addition, as a common practice in the implementation of modern Monte Carlo event generators, events associated with small cross-sections which often produce high p T jets, can be over-sampled for statistical benefits, and event weights are assigned accordingly for compensation. How the event weights fit into the development of a machine learning approach is not well studied. Such problem is addressed in Ref. [39] and a re-weighting method is designed. The sample weights are assigned based on the so-called effective sample number [41], which are not related to event cross-sections. For a more straight forward application to experimental data, it is desirable to have sample distributions corresponding to realistic cross-sections. Another challenge is that in the experiment the jets exist in the presence of a thermal background of uncorrelated particles created in heavy-ion collisions. Therefore, before the feasibility of machine learning can be established, techniques such as event mixing and background subtraction should be applied to simulated events, after which the jet finding algorithms can be performed. Such problem is addressed in Ref. [39] and the authors found a decrease in the performance due to the presence of the underlying event. A similar conclusion is reached in Ref. [42].
In this study, we use the successive splittings of the parton shower as input for a long short-term memory (LSTM) [43,44] neural network. These splittings are obtained using jet declustering procedures in which all the splittings from the primary Lund plane [45] are extracted. The LSTM neural network is a special type of recurrent neural network that is capable of learning long-term dependencies and, therefore, has the potential to provide insight on the parton shower evolution in the medium. It has been used for jet tagging based on the primary Lund sequence [45,46], showing a reasonable performance in W tagging, Top tagging and quark/gluon discrimination. Such tagging problems have also been studied in heavy-ion collisions with CNN [47]. In addition, the LSTM neural network shows its ability of jet grooming with a carefully designed training strategy [48]. We use the LSTM neural network and perform supervised machine learning in which a well-trained classifier is able to distinguish quenched jets from vacuum jets on a jet-by-jet basis.
We present a machine learning approach to the identification of quenched jets in the presence of a large uncorrelated underlying event as present in heavy ion collisions. In Section 2, we describe the procedure of event simulation and the subtraction procedure on the uncorrelated underlying event. In Section 3, we describe how the supervised machine learning is performed, including a general feature engineering method on jets, a training method with event weights taken into consideration, a method of hyper-tuning and a study of robustness. In Section 4, we test the discrimination power of the trained classifier for quenched against non-quenched jets. Finally, we present our conclusions and outlook in Sec. 5. The full code and data used in this study are available as open-source [49].

Data Sample
In this study, we simulate hard scattering events with Monte Carlo event generators Pythia 8 [50] and Jewel v2.2.0 [51,52] at a center-of-mass energy per nucleon pair ( √ s NN ) of 5.02 TeV. A minimum outgoing parton transverse momentump T = 100 GeV. Pythia 8 is used to simulate dijet events in pp collisions, while Jewel simulates dijet events in heavy-ion collisions including parton-medium interactions. An important consequence of the parton-medium interactions implemented in Jewel is the medium response. In our studies, we use Jewel with keeping track of the recoils to explore the possible extreme. It has to be pointed out that such implementation of the medium response is a modeldependent component, while other generators may implement their own descriptions of the medium response. More details can be found in Ref. [53]. To simulate the thermal background originating from the hadronization of all partons in the quark-gluon plasma, which is not correlated with the jets, dijet events from both event generators are embedded into underlying events, which have general features similar to those observed experimentally. Following the implementation in Ref. [53], the particles in the uncorrelated underlying events are generated from a Boltzmann distribution in p T , and uniform distribution in pseudorapidity η and azimuthal angle ϕ. We prepare the uncorrelated underlying events using two different multiplicity settings, corresponding roughly to the mid-central (40-50%) and the most-central (0-10%) PbPb collisions at √ s NN = 5.02 TeV, as inferred from published experiment results [54]. More details about the uncorrelated underlying events are listed in The mixed event, constructed by embedding a dijet event into an uncorrelated underlying event, contains a hard scattering in the presence of the soft thermal background in heavy-ion collisions. To eliminate the energy and momentum contribution of thermal background particles into the jets, we perform background subtraction on the mixed events. The event-wide constituent subtraction method [55] is used with maximum distance parameter between the signal and background particles ∆R max = 0.3, using p T weight α = 1. Particles that remain after subtraction are clustered into jets with the anti-k T algorithm [56] with a distance parameter R = 0.4.
After jet finding, each jet is reclustered with the Cambridge/Aachen algorithm [57] from its constituents to form a pairwise and angular-ordered structure. We then apply the soft drop declustering [58] procedure, continuously removing soft branches and keeping the splittings that satisfy the soft drop condition, where p T1 and p T2 are the transverse momenta of the two separated subjets, ∆R is their angular distance, and R 0 is the jet distance parameter, in our case R 0 = 0.4. In this study, we set the free parameters to values z cut = 0.1, and β = 0. We also applied a cut of p T,jet > 200 GeV to optimize the purity in locating the hard splittings [59]. To evaluate the background subtraction procedure in Fig. 1 we compare the distributions of jet substructure variables for jets simulated with Pythia and for Pythia jets embedded into thermal background events with particle multiplicities corresponding to mid-central and most-central PbPb collision events after the background subtraction. Ideally, if the background subtraction corrects both the jet energy and the jet substructure, the results will be identical. The top panels of Fig. 1 show the jet splitting function z g , the angular distance between the subjets R g , and the groomed jet mass m g divided by the jet p T , for jets propagating in vacuum (simulated with Pythia) before and after the embedding into the underlying PbPb events and the corresponding background subtraction. The bottom panels show the closure test of the subtraction procedure, e.g. the ratio of the jet substructure variables for jets embedded in the PbPb-backgrounds after subtraction, and the original Pythia jets. We find that for most of the phase space of interest the subtraction procedure works well, although there are some discrepancies that can be associated with mis-tagging of the subleading subjet due to fluctuations in the thermal background, as suggested in Ref. [59]. This indicates that large background fluctuations may produce effects that mimic a jet quenching signal when the soft drop grooming is applied. Other grooming methods may help control the mis-tagging and could be a subject of further investigations. In this study, we use the soft drop grooming imposing a jet p T cut of p T,jet > 200 GeV, which reduces mis-tagging.
To investigate the quenching effects on groomed jets, we compare the substructure variables of quenched jets with respect to vacuum jets, where both undergo embedding and background subtraction. The distributions of substructure variables of the groomed jets in the mid-central and the most-central mixing scenarios are shown in Fig. 2. Significant modifications are seen on all jet substructure variables. Additionally, in Fig. 3 we plot the primary Lund plane density [45],  where z g and R g are taken from the first pair of hard branches that pass the soft drop condition (Eq. 2.1) for each jet. Strong modifications of the hard splittings with enhancement in wider (larger R g ) and more unbalanced (lower z g ) splittings are apparent in the Jewel events. Next, we implement a machine learning approach to classify quenched and non-quenched jets.

Feature Engineering
The feature engineering procedure starts from jets reclustered with the Cambridge/Aachen algorithm. This leads to a clustering history with a binary tree structure, of which branches are separated based on their angular distance. The soft drop groomer continuously removes the softer branch, until finding the hard splitting that satisfies the soft drop condition (Eq. 2.1), after which the declustering continues on the harder subjet, as illustrated in Fig. 4.
Each splitting can be described with a feature vector, x t , consisting of substructure variables including, but not limited to, the momentum fraction z, angular distance ∆R,  perpendicular momentum k ⊥ , and the invariant jet mass m inv , which are all calculated from the pair of declustered subjets with: where i, j denote subjets at the declustering step t. In this way, sequential vectors [x 0 , ..., x t , ...] are extracted, and are then used in the training of a neural network.

Training of LSTM-based Neural Network
A classification problem, that is, whether the hot and dense QGP medium modifies the jet showering process, is investigated based on a supervised machine learning strategy. Jets simulated with Pythia 8 are labeled as 0, in the later context referring to the non-quenched or the negative class. Conversely, those simulated with Jewel are labeled as 1, referring to the quenched or the positive class. Data samples are split into training and validation sets in a 1:1 ratio. For each jet, sequential feature vectors are extracted from their clustering history, as described in Section 3.1. Considering that the lengths of the sequential feature vectors may vary, all the sequences in the same batch have post-sequence zeros as pads, such that they have the same length after padding. The zero-padded batch of sequential feature vectors is fed to a neural network built with LSTM layers and fully connected (FC) layers, of which the implementation is provided by the PyTorch framework [60]. Details about the parameters of the neural network architecture are further discussed in Section 3.3.
A common LSTM unit [44] is composed of a cell, an input gate, an output gate, and a "forget" gate. The cell remembers relevant information throughout the processing of sequences, and the three gates regulate the flow of information into and out of the cell.
The output of the LSTM network at the last step 1 is directed to FC layers which map high-dimensional values to a vector of predictive values, from which the batch losses are calculated. We use the weighted mean squared error (MSE) loss function: where x i , y i refer to the predictive label and the truth label of the ith jet, and ω i is the event weight scaled from the event cross-section. During the training, learnable neural network parameters (e.g., coefficients or biases) are updated to minimize the batch loss during training, as shown in Fig. 5, using the "gradient decent" algorithm. As mentioned previously, event weights come from generators due to over-sampling on small cross-section events, for the purpose of compensation. This is the default behavior in Jewel, but is optional in Pythia8. In our study the event weight is accounted for in the batch loss calculation. It can easily be replaced with unity for unbiased experimental samples. In the current study the introduction of event weights has brought challenges to the training process, showing significant loss fluctuation when batch size is small. Larger batch sizes help to regulate the batch loss fluctuation, with the trade-off of making the computation heavier.

Hyper-tuning
The so-called hyper-parameters of a model, those that are not trainable but specify the details of the architecture and its training, are tuned with the Hyperopt [61] package. The hyper-tuning procedure helps locate the optimal hyper-parameter set on its search space, such that a local minimum can be avoided. The hyper-parameters related to the neural network architecture are the number of LSTM layers, and the input dimensions of two FC layers 1 . The hyper-parameters related to the training process include the number of training epochs, batch size, learning rate and its decay factor. We scan over the hyperparameter search space 50 times and arrange three trainings for each hyper-parameter set. The distribution of validation loss (defined in the same way as the batch loss, but calculated with the validation dataset) for discrete hyper-parameter options are displayed in Fig. 6 with violin plots, showing both the probability density of the validation loss and its quartiles. The trained model with minimal validation loss shown with the red dot is selected as the best trained model, with the optimal hyper-parameter values listed in Table 2.

Robustness
The trained neural network can be used as a classifier, making predictions on the probability for a jet to be quenched. Here, the distributions of the raw LSTM output from the top three best-trained networks are shown in Fig. 7, which exhibit a non-deterministic behavior in the metric. During the training, samples of the negative class are labeled with 0, while samples of the positive class are labeled with 1, without intermediate labels in between. Hence, the metric that quantifies the probability for a jet to be quenched, which is established during the training process, suffer from the stochastic nature of machine learning. Such non-determinism is rooted in the randomness of selecting samples to form a batch. To solve this issue, a calibration method is designed. The calibration method is designed such that samples of the negative class are uniformly distributed along the newly established metric, meanwhile the relative quantitative relationships between the samples' raw LSTM outputs are kept in order. The new metric can be obtained with the help of the receiver operating curve (ROC), which plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The FPR quantifies the percentage of samples in the negative class that are falsely predicted as positive. For illustration, with a decision threshold of 0, all negative samples should have predictions larger than 0, which leads to FPR=1. The new metric is bound to the FPR, meanwhile the threshold on the raw LSTM output corresponding to a specific FPR value can be determined from the ROC curve. The distribution of the raw LSTM output from trained neural networks and their values after calibration are shown in Fig. 7 (top and middle panels, respectively). For the three well-trained classifiers our calibration method produces almost identical distributions of the calibrated LSTM values.
Given the distributions of the LSTM outputs, either raw or calibrated, it can be seen that not all samples in the positive class (jets from Jewel) are predicted to be quenched by the trained neural network. To determine if a jet is quenched, a threshold should be used, such that jets that have predictions larger than such a threshold would be classified as quenched, while those with predictions smaller than the threshold would be considered non-quenched. In the subsequent discussion use the threshold determined from TPR=0.4. With this threshold, the positive samples from Jewel are divided into top 40% subset, and bottom 60% subset. It can be seen from the red point on the ROC curve that this TPR corresponds to a FPR of roughly 10%, indicating that about 10% of the samples in the negative class have predictions larger than the threshold. Further, we consider the top 40% Jewel jets as quenched and compare their substructure variables with the remaining 60%, as well as those simulated with Pythia.

Results
Samples of the positive class (Jewel) are divided into two subsets, consisting of top 40% and bottom 60% samples, respectively, based on the raw LSTM output. To better understand the relation between the LSTM output and the quenching effects, we plot the distributions of the substructure variables of the groomed jets in Fig. 8  density for the hard splittings after grooming in Fig. 9, for two subsets separately. Samples of the negative class (Pythia) that serve as a baseline are also shown for comparison.
These two subsets of samples show quenching effects at different levels. The top 40% samples of the positive class show severe modifications of the jet substructure with enhancement of wider and softer splittings. In comparison, the bottom 60% samples exhibit a quenching pattern that is similar to that of the negative class. Two neural networks that are trained at two different multiplicity scenarios behave consistently, and show a similar ability in quantifying quenching effects. This indicates that the machine learning approach presented here successfully handles the uncorrelated underlying events independent of the background density.

Conclusion
Our study has shown a promising machine learning approach to identifying jet quenching in the presence of uncorrelated underlying event background. Sequential substructure variables extracted from the jet clustering history carry information about the parton showering process with or without quenching effects, and can be learned by an LSTM neural network. The well-trained neural network can then be used to select jets at different quenching levels. We also studied the non-deterministic behavior of the supervised machine learning algorithm. With the help of the ROC curve, a calibration method is designed to avoid the non-determinism.
The procedure of extracting sequential substructure variables can be used as a general feature engineering method on jets. In comparison with jet image approaches, which utilize jet substructure in image pre-processing, our feature engineering method extracts sequential  variables directly and explicitly from the jet substructure. The angular-ordered clustering history, which is in the basis of the sequential features carries information about the parton showering process.
In the supervised machine learning setup, the truth labels, either quenched or nonquenched, are related to two different Monte Carlo simulators that produce two types of dijet events. In a real experiment, such labels can simply be replaced with the collision systems, either pp collisions or heavy ion collisions. Unlike the study in Ref. [39] which uses the jet energy loss ratio χ jh in a regression setup, out approach does not rely on simulation-only variables.
The well-trained neural network is capable of quantifying quenching effects, making it possible to divide jets simulated from a quenching model (Jewel) into two subsets. The two separated subsets of jets have shown differences in the jet substructure variables, related to two different parton showering patterns: a vacuum-like pattern, similar to what is implemented in Pythia 8, and a medium-like pattern, which produces highly quenched jets. Moreover, the medium-like pattern tends to mark up a zone on the Lund diagram, which may help extract information about medium properties. As pointed in Ref. [53], the jet transport coefficientq, which is of interest, is connected to boundaries of the quenching zone on the Lund diagram.
The techniques involved in our study, such as the background subtraction and the jet grooming, have been previously applied to experimental data and their effectiveness is well-studied. Thus, our machine learning approach is very promising in the exploration of different jet classification topics with real experimental data. [31] ALICE Collaboration, Measurements of the groomed jet radius and momentum splitting fraction with the soft drop and dynamical grooming algorithms in pp collisions at √ s = 5.02 TeV, arXiv:2204.10246.