Jet flavour classification using DeepJet

Jet flavour classification is of paramount importance for a broad range of applications in modern-day high-energy-physics experiments, particularly at the LHC. In this paper we propose a novel architecture for this task that exploits modern deep learning techniques. This new model, called DeepJet, overcomes the limitations in input size that affected previous approaches. As a result, the heavy flavour classification performance improves, and the model is extended to also perform quark-gluon tagging.


Introduction
The Standard Model of particle physics (SM) [1,2] is a remarkably effective theory, able to describe the experimental observations made thus far in high energy physics with unprecedented precision and completeness. Despite its success, however, this model fails to explain several observations, such as the baryon asymmetry and the presence of dark matter, which inspires searches for extensions to the SM. The study of the recently discovered [3][4][5] Higgs boson [6][7][8][9][10][11] and the search for extensions of the electroweak sector are two of the most active research areas in the field.
In both cases the investigated phenomenon exhibits a clear flavour asymmetry, making the ability to classify jets originating from heavy-flavour (bottom and charm) quarks extremely beneficial. Heavy-flavour (HF) jets contain an open-bottom or open-charm hadron as a result of the fragmentation process. This hadron carries a large fraction of the initial parton momentum. HF hadrons also have a sizeable lifetime, with a cτ of ∼0.5 mm and ∼0.3 mm for bottom and charm, respectively. Most of the HF hadrons produced in the fragmentation process decay far enough from the primary interaction vertex (PV) to produce displaced tracks, and their decay products can be clustered into resolved secondary vertices (SV).
Traditionally, jet flavour classification has leveraged both the fragmentation and lifetime properties of HF hadrons. The discrimination power provided by single features is not enough for a classification that is both efficient and pure, but the collective behaviour of the tracks in the jet, as well as the presence of a reconstructed secondary vertex, allows a combined algorithm to perform well. For this reason, machine learning has long been used in the field of jet flavour classification, starting from simple naive Bayes classifiers [12,13] and moving to more complex models such as shallow neural networks [14,15] or tree ensembles [14].
More recently, simple dense deep neural networks [16] have been successfully employed for this classification task alongside recurrent neural networks (LSTM) [17], outperforming previous classifiers. The ATLAS RNN algorithm [18], which is based on a recurrent neural network, processes a small number of selected high-purity charged particle tracks associated with the jet as a sequence, sorted by their impact parameter. It exceeds the performance of the IP3D algorithm [19], which considers the same inputs but does not fully take into account the correlations between the tracks.
Within CMS, the DNN-based b-tagger DeepCSV [20] has long served as the baseline reference for b-tagging performance. DeepCSV is a fully connected neural network consisting of 5 layers with 100 nodes each, using high-level features of selected tracks and vertices as input. In contrast with the ATLAS RNN tagger, DeepCSV combines properties from selected tracks and secondary vertices directly, but does not employ sequence processing.
The common feature between these two approaches, however, is that both models use only a small subset of the charged jet constituents that pass stringent quality criteria in order to provide the cleanest and simplest environment possible for the classifier to perform its task.
Yet a stringent selection comes with information loss and therefore potential performance degradation. It has recently been shown that architectures compressing and converting individual track features to a latent space can overcome these limitations. The ATLAS Deep Sets b-tagging algorithm [21] relies on the idea of Deep Sets [22], in which the features of the input tracks are transformed into a latent space representation before being combined by a simple per-feature sum. Given the same inputs, it has been shown to perform similarly to the RNN approach with reduced training and inference time. Studies on relaxing the input selection criteria on the tracks show a significant gain.
In this paper, we describe DeepJet, a new network architecture that does not rely on a selection of the jet constituents and thus overcomes the previous limitations on the purity and number of inputs. In addition, DeepJet considers the full information of all jet constituents, charged and neutral particles, secondary vertices, and global event variables simultaneously.

Training Samples
The DeepJet model is trained and tested using a sample of simulated anti-$k_T$ [23] jets (with R = 0.4) drawn from a mix of QCD multijet events and fully hadronic top-quark pair ($t\bar{t}$) events, generated with PYTHIA8 [24] and POWHEGv2 [25][26][27][28], respectively. In both samples the hadronization and showering are performed using PYTHIA8. In order to apply the approach to a realistic reconstruction problem, GEANT4 [29] is used to simulate the response of the CMS Phase 1 detector [30,31] to the generated particles. The jet constituents are reconstructed using the Particle Flow (PF) algorithm [32] within the CMS Phase 1 detector reconstruction algorithms [33]. The entire training sample consists of roughly 130 million jets. This is then split into a training, a validation and a testing sample in the ratio 0.765:0.135:0.1.
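As a rough illustration of the quoted split, the sketch below turns the ratio 0.765:0.135:0.1 into jet counts. It assumes the roughly 130 million jets are the total being split; the function name `split_counts` and the rounding scheme are hypothetical and not taken from the DeepJet code.

```python
# Split ratios quoted in the text: training : validation : testing.
SPLIT = {"train": 0.765, "validation": 0.135, "test": 0.1}

def split_counts(n_jets, fractions=SPLIT):
    """Return the number of jets per subset; the last subset takes the remainder."""
    names = list(fractions)
    counts = {}
    assigned = 0
    for name in names[:-1]:
        counts[name] = int(round(n_jets * fractions[name]))
        assigned += counts[name]
    # Giving the remainder to the last subset guarantees the counts add up exactly.
    counts[names[-1]] = n_jets - assigned
    return counts

# Assumption: 130 million is the size of the full sample before splitting.
counts = split_counts(130_000_000)
print(counts)
```

Note that 0.765 = 0.9 × 0.85 and 0.135 = 0.9 × 0.15, so the ratio is numerically consistent with holding out 10% for testing and dividing the remainder 85:15, although the text does not state that this is how the split was made.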

Labeling
The jets are labelled by ghost association [20]. In this approach the last instance of each generated b- and c-hadron before its decay has its momentum scaled to a very small value, so that it carries only directional information. These "ghosts" are then included among the particles to be clustered by the jet algorithm. If a jet contains at least one b-hadron, it is labelled a b jet. If the jet contains at least one c-hadron and no b-hadrons, it is labelled a c jet. Everything else is labelled a light jet. The jets labelled as b jets are further sub-labelled into three categories: bb for jets containing two b-hadrons, $b_\mathrm{lept}$ for jets containing b-hadrons decaying leptonically, and b for jets with b-hadrons decaying hadronically. The light jets are also sub-labelled into quark and gluon jets using the CMS definition [34].
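The labelling rules above can be sketched as a small decision function. This is only an illustration: the function and argument names are hypothetical, the ghost-association step itself is not shown, and the precedence of bb over $b_\mathrm{lept}$ as well as the treatment of more than two b-hadrons as bb are assumptions that the text does not specify.

```python
def flavour_label(n_b_hadrons, n_c_hadrons, has_leptonic_b_decay=False, is_gluon=False):
    """Assign one of six labels from the ghost-associated hadron content of a jet.

    n_b_hadrons / n_c_hadrons: number of ghost b- and c-hadrons clustered into the jet.
    has_leptonic_b_decay: True if a b-hadron in the jet decays leptonically.
    is_gluon: outcome of the quark/gluon definition for light jets (not modelled here).
    """
    if n_b_hadrons >= 2:
        # Assumption: bb takes precedence over the leptonic sub-label.
        return "bb"
    if n_b_hadrons == 1:
        return "b_lept" if has_leptonic_b_decay else "b"
    if n_c_hadrons >= 1:
        # At least one c-hadron and no b-hadrons.
        return "c"
    # Everything else is a light jet, split into quark and gluon jets.
    return "g" if is_gluon else "uds"

print(flavour_label(1, 2, has_leptonic_b_decay=True))
```

The six possible return values correspond to the six output nodes of the classifier.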

Input features
DeepJet uses approximately 650 input variables, divided into four categories: global variables, charged PF candidate features, neutral PF candidate features, and SV features associated with the jet. The global variables include jet-level information such as the jet kinematics, the number of tracks in the jet, and the number of secondary vertices in the jet. The global variables also include the number of reconstructed primary vertices in the event, in order to help the network learn the effect of multiple interactions per bunch crossing (pileup). The information related to the charged PF candidates is summarised in 16 features for each of the first 25 candidates, ordered by displacement significance. These features contain information on the track kinematics, track fit quality, displacement, and displacement uncertainty with respect to the PV. Similarly, 6 features of the leading 25 neutral candidates are provided to the network. Finally, up to 4 secondary vertices are considered, each with 12 variables, such as the flight distance significance. A full list of the variables used can be found in the appendix.
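To make the input layout concrete, the following sketch arranges the per-object inputs as fixed-size arrays. Zero-padding of jets with fewer objects than the maximum is an assumption made for this illustration, and the names `BLOCKS` and `pad_block` are hypothetical.

```python
import numpy as np

# Per-jet input blocks quoted in the text: (max. number of objects, features per object).
BLOCKS = {"charged": (25, 16), "neutral": (25, 6), "sv": (4, 12)}

def pad_block(objects, max_objects, n_features):
    """Truncate or zero-pad a (n_objects, n_features) array to a fixed shape."""
    out = np.zeros((max_objects, n_features), dtype=np.float32)
    objects = np.asarray(objects, dtype=np.float32).reshape(-1, n_features)
    n_keep = min(len(objects), max_objects)
    out[:n_keep] = objects[:n_keep]  # objects are assumed to be ordered already
    return out

# A jet with only 3 charged candidates still yields a (25, 16) block.
charged = pad_block(np.ones((3, 16)), *BLOCKS["charged"])

# Number of inputs contributed by the three per-object blocks.
n_per_object_inputs = sum(n * f for n, f in BLOCKS.values())
print(charged.shape, n_per_object_inputs)
```

The three per-object blocks account for 25 × 16 + 25 × 6 + 4 × 12 = 598 inputs; the global variables, whose number is not given in this section, come on top of this.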

Preprocessing of features
Since the b-tagger is meant to generalise beyond the specific kinematic domain used for the training, it is important to remove any flavour dependence on transverse momentum ($p_T$) or pseudorapidity (η), so that the classifier cannot exploit those differences. During the data pre-processing, we downsample each of the flavour classes in $p_T$ and η such that they have the same shape as the b-jet class. Several variables are also bounded within a physically well-defined range. If features of an object are either not available or infinite, they are replaced with an appropriately chosen value for the specific feature. The particle objects are ordered based on their assumed importance. Charged particle candidates are ordered by impact parameter significance, while neutral candidates are ordered by the shortest distance to a secondary vertex. If there is no secondary vertex in the jet, $p_T$ ordering is used instead. Secondary vertices are sorted by flight distance significance.
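One common way to realise the kinematic downsampling described above is to keep each jet with a probability given by the ratio of the reference (b-jet) histogram to the histogram of its own class. The sketch below follows that idea; it is not taken from the DeepJet code, the binning and the toy values are invented, and all jets are assumed to lie inside the binning.

```python
import numpy as np

def keep_probabilities(pt, eta, ref_pt, ref_eta, pt_edges, eta_edges):
    """Per-jet keep probability that reshapes a class to the reference (pT, eta) shape."""
    h_cls, _, _ = np.histogram2d(pt, eta, bins=(pt_edges, eta_edges))
    h_ref, _, _ = np.histogram2d(ref_pt, ref_eta, bins=(pt_edges, eta_edges))
    # Ratio reference / class per bin; empty class bins get probability 0.
    ratio = np.divide(h_ref, h_cls, out=np.zeros_like(h_ref), where=h_cls > 0)
    if ratio.max() > 0:
        ratio /= ratio.max()  # scale so that no probability exceeds 1
    # Find the bin of every jet (values on the upper edge go into the last bin).
    i = np.clip(np.digitize(pt, pt_edges) - 1, 0, len(pt_edges) - 2)
    j = np.clip(np.digitize(eta, eta_edges) - 1, 0, len(eta_edges) - 2)
    return ratio[i, j]

# Toy example with two pT bins and one eta bin (values are illustrative only).
pt_edges = np.array([0.0, 50.0, 100.0])
eta_edges = np.array([-2.5, 2.5])
light_pt = np.array([10.0, 20.0, 30.0, 40.0, 60.0, 70.0])  # 4 jets in bin 0, 2 in bin 1
light_eta = np.zeros(6)
b_pt = np.array([25.0, 75.0])                              # flat reference shape
b_eta = np.zeros(2)

p_keep = keep_probabilities(light_pt, light_eta, b_pt, b_eta, pt_edges, eta_edges)
rng = np.random.default_rng(seed=1)
keep = rng.random(len(light_pt)) < p_keep  # boolean mask of jets that survive
print(p_keep)
```

In the toy example the over-populated low-$p_T$ bin is thinned by a factor of two, so that on average the surviving light jets follow the flat reference shape.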

Architecture
The concept of DeepJet is to use low-level features from as many jet constituents as possible, as opposed to selecting a few that are well identified and reconstructed and exploiting higher-level features. The exact number of features and constituents can vary with experiment and use case, but the all-inclusive concept remains the same. To process this large input space, a suitable architecture is needed. An illustration of the DeepJet architecture can be seen in Fig. 1. In the first step, automatic feature engineering is performed for each constituent using convolutional branches comprising 1x1 convolutional layers [35]. The filter size of 1x1 is chosen because all candidates should inherently undergo the same feature transformation without taking into account the other constituents in the jet, and a priori a specific class of physics objects should be treated in a consistent manner independent of input ordering.¹ Separate convolutional branches are used for vertices, charged PF candidates and neutral PF candidates. The convolutional branches use a sequence of layers with a decreasing number of filters, which projects and compresses the features of each constituent into a lower-dimensional space. Each of the outputs is then fed into a dedicated recurrent layer of the LSTM type. This treats the constituents as a sequence, following the order described in the section "Preprocessing of features". In particular, LSTM nodes are well suited to the variable-length input sequences that occur in jet physics. The charged candidate LSTM uses 150 nodes, whereas the neutral and secondary vertex LSTMs use 50 nodes each. The three LSTM outputs are concatenated with the global features of the jet and then fed into a fully connected layer with 200 nodes, followed by 7 fully connected layers with 100 nodes each. Finally, 6 output nodes are integrated into the multi-classifier, allowing it to perform b-tagging, c-tagging and quark/gluon tagging. Throughout the network ReLU activation functions are used, except for the output layer, where softmax activation is applied.
At the beginning of the network, and in between each layer, batch normalization [36] is performed, and dropout [37] with a rate of 0.1 is applied for regularisation.
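The role of the 1x1 convolutions can be illustrated with a few lines of NumPy: a convolution with kernel size 1 applies the same dense transformation to every constituent, so reordering the inputs merely reorders the outputs. This is a minimal sketch with random weights; the choice of 8 output filters is hypothetical, since the filter counts of the default branches are not quoted in this section.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """1x1 convolution over a (n_constituents, n_features) array followed by ReLU.

    With kernel size 1 this is one shared dense layer applied to each constituent.
    """
    return np.maximum(x @ weight + bias, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(25, 16))  # 25 charged candidates with 16 features each
w = rng.normal(size=(16, 8))   # 8 output filters: a hypothetical choice
b = np.zeros(8)

y = conv1x1(x, w, b)
perm = rng.permutation(25)
# Permuting the constituents permutes the outputs in the same way:
# no constituent sees its neighbours at this stage.
assert np.allclose(conv1x1(x[perm], w, b), y[perm])
print(y.shape)
```

The ordering of the constituents therefore only becomes relevant in the subsequent LSTM layers, which process the transformed constituents as a sequence.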

Training procedure
The neural network is trained using the Adam optimizer [38] with a learning rate of $3 \times 10^{-4}$ for 65 epochs and a categorical cross-entropy loss. The learning rate is halved if the validation sample loss stagnates for more than 10 epochs. After the first training epoch, the input batch normalization layer is frozen. Possible over-training is monitored through the validation loss, which is mostly dominated by medium-$p_T$ jets, and additionally through ROC curves in different $p_T$ ranges. The training is performed on an NVIDIA GeForce GTX 1080 Ti GPU. With a training sample size of roughly 130 million examples, the training convergence time is around 3 days.
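The learning-rate rule can be sketched as follows. The text does not define "stagnates" precisely; here it is assumed to mean that the validation loss has not improved on its best value for more than 10 consecutive epochs. The function name and the reset of the counter after each halving are choices made for this illustration.

```python
def halve_on_plateau(val_losses, initial_lr=3e-4, patience=10, factor=0.5):
    """Return the learning rate used in each epoch, given the validation losses.

    The rate is multiplied by `factor` once the loss has failed to improve on
    its best value for more than `patience` consecutive epochs.
    """
    lr = initial_lr
    best = float("inf")
    stagnant = 0
    schedule = []
    for loss in val_losses:
        schedule.append(lr)  # rate in effect during this epoch
        if loss < best:
            best = loss
            stagnant = 0
        else:
            stagnant += 1
            if stagnant > patience:
                lr *= factor
                stagnant = 0  # assumption: start counting afresh after a reduction
    return schedule

# A loss that never improves after the first epoch triggers one halving.
flat = halve_on_plateau([1.0] * 13)
# A steadily improving loss over 65 epochs keeps the initial rate throughout.
improving = halve_on_plateau([1.0 / (epoch + 1) for epoch in range(65)])
print(flat[11], flat[12])
```

Deep learning frameworks typically provide an equivalent callback, so a hand-written version like this would mainly serve to make the rule explicit.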

Comparison with other b-tagging algorithms
The overall performance of DeepJet is compared to the CMS b-tagger DeepCSV, which is a multiclassifier embedded in the CMS reconstruction framework, and which uses similar, but strongly preselected, inputs along with additional high-level variables. The comparison is made on a fully hadronic $t\bar{t}$ sample, as shown in Fig. 2. DeepJet performs significantly better than DeepCSV, with an efficiency² increase of almost 20% at $10^{-3}$ misidentification probability³ for jets with $p_T > 90$ GeV.
¹ This concept is closely related to the idea of Deep Sets [22].

The performance of the classifiers can also be compared on a set of multi-jet QCD events, which allow access to higher jet momenta. The inclusive results are shown in Fig. 3. A comparison for different jet $p_T$ can be found in Fig. 4 for fixed light jet misidentification probabilities. In both cases DeepJet shows an increasing performance gain, in particular for higher jet $p_T$, presumably exploiting the information contained in all constituents.
The multi-classifier nature of DeepJet and DeepCSV allows the models to also efficiently perform c-jet identification. Figure 5 shows that DeepJet outperforms DeepCSV in this task as well. In this case, both DeepCSV and DeepJet use a binary discriminator defined as $\frac{P(c)}{P(c) + P(udsg)}$. Finally, DeepJet can be used for quark/gluon discrimination. DeepJet is compared to the "quark/gluon likelihood" [34], a binary quark/gluon classifier included in the CMS reconstruction framework. For this purpose, the DeepJet output is also projected to a binary classifier, $\frac{P(uds)}{P(uds) + P(g)}$.
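Comparisons at a fixed misidentification probability amount to choosing a discriminator threshold from the background (light jet) scores and then measuring the fraction of signal (b jet) scores above it. The sketch below shows one way to do this in plain Python; the function name and the toy scores are invented for illustration and do not come from the analysis.

```python
def efficiency_at_mistag(signal_scores, background_scores, mistag_rate):
    """Signal efficiency at the loosest cut with background pass rate <= mistag_rate.

    Returns (efficiency, threshold); a jet passes if its score is above the threshold.
    """
    bkg = sorted(background_scores, reverse=True)
    n_allowed = int(mistag_rate * len(bkg))  # background jets allowed to pass
    # Using the (n_allowed + 1)-th highest background score as the threshold
    # lets at most n_allowed background jets through.
    threshold = bkg[n_allowed] if n_allowed < len(bkg) else float("-inf")
    n_pass = sum(score > threshold for score in signal_scores)
    return n_pass / len(signal_scores), threshold

# Toy scores: 8 background jets, 4 signal jets, 25% misidentification probability.
background = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
signal = [0.65, 0.9, 0.3, 0.75]
eff, thr = efficiency_at_mistag(signal, background, mistag_rate=0.25)
print(eff, thr)
```

Repeating this in bins of jet $p_T$ would give efficiency curves of the kind described for Fig. 4, although working points such as $10^{-3}$ require far more background jets than in this toy example.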
Also here, a performance increase can be observed when employing the DeepJet model on a sample consisting purely of light quark and gluon jets, as shown in Fig. 6. Moreover, given that a standard quark/gluon classifier uses a binary approach, we expect a significantly larger performance gain in scenarios with more realistic heavy-flavour contamination.
² True positive rate.
³ False positive rate.
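The two binary discriminators used in this section can be computed directly from the multi-class outputs. In the sketch below the dictionary keys naming the six output nodes are hypothetical, and P(udsg) is assumed to be the sum of the light-quark and gluon outputs.

```python
def c_vs_light(p):
    """P(c) / (P(c) + P(udsg)), taking P(udsg) = P(uds) + P(g) as an assumption."""
    p_udsg = p["uds"] + p["g"]
    return p["c"] / (p["c"] + p_udsg)

def quark_vs_gluon(p):
    """P(uds) / (P(uds) + P(g)), the projection used for quark/gluon discrimination."""
    return p["uds"] / (p["uds"] + p["g"])

# Illustrative softmax outputs for a single jet (they sum to 1).
scores = {"b": 0.05, "bb": 0.01, "b_lept": 0.04, "c": 0.30, "uds": 0.45, "g": 0.15}
print(c_vs_light(scores), quark_vs_gluon(scores))
```

Because softmax outputs are strictly positive, the denominators cannot vanish, and each ratio ignores the classes that are irrelevant to the comparison at hand.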

Qualitative assessment of the performance gain source
Given that the architecture and the network inputs are changed simultaneously between DeepCSV and DeepJet, a new network configuration is trained to assess the relative impact of these changes on the performance. This model contains the same inputs as DeepCSV, but its architecture is changed to mimic that of DeepJet, with a series of convolutional layers feeding into an LSTM that processes the list of inputs belonging to different tracks. Additionally, DeepCSV uses the six most displaced tracks in the jet passing the pre-selection criteria. In this implementation the limit is raised to 25, without removing the selection requirements. An architecture in which all the inputs used by DeepJet are fed into a simple dense deep neural network has been found practically impossible to train to convergence. The performance of this model, DeepCSV RNN, is shown in Fig. 7 for inclusive $t\bar{t}$ events. The gain in performance of DeepJet comes largely from the removal of the track selection, complemented by the network structure that allows efficient processing of the larger and less pure track set. Similar improvements from similar preprocessing of unselected jet constituent inputs have been observed in the Deep Set based tagger [21]. This qualitative statement also explains the additional gain observed in DeepJet at high jet $p_T$, as the track selection was historically optimised for mild jet boosts. The default DeepJet architecture is also compared to an adapted architecture employing the Deep Set idea. For this purpose, the DeepJet architecture is adapted as follows: the 1x1 convolutional branches are extended to contain 100 filters each, increasing the number of free parameters. The final layer of each branch is enlarged further, with the charged branch being set to 256 filters, whereas the vertex and the neutral branches are set to 128 filters. At the same time, the LSTM layers are replaced by a simple sum that is evaluated independently for each convolutional branch, after which these sums are concatenated and fed to a series of fully connected layers.
As shown in Fig. 8, the Deep Set based architecture shows a slight gain for b versus c jet identification, but no significant difference in b versus light jet identification or in quark/gluon discrimination performance compared to the default DeepJet architecture.
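The pooling step of the Deep Set variant can be sketched in NumPy: each branch output is summed over its constituents and the three sums are concatenated. The latent widths are the ones quoted in the text; the random arrays stand in for the outputs of the convolutional branches, and the function name is hypothetical.

```python
import numpy as np

# Width of the final 1x1 convolution in each branch, as quoted in the text.
LATENT = {"charged": 256, "neutral": 128, "sv": 128}

def deep_sets_features(branch_outputs):
    """Sum each branch over its constituents and concatenate the three sums."""
    pooled = [branch_outputs[name].sum(axis=0) for name in LATENT]
    return np.concatenate(pooled)

rng = np.random.default_rng(42)
# Stand-ins for the per-constituent latent vectors of one jet.
branches = {
    "charged": rng.normal(size=(25, LATENT["charged"])),
    "neutral": rng.normal(size=(25, LATENT["neutral"])),
    "sv": rng.normal(size=(4, LATENT["sv"])),
}
features = deep_sets_features(branches)

# Unlike an LSTM, the sum does not depend on the order of the constituents.
shuffled = dict(branches, charged=branches["charged"][rng.permutation(25)])
assert np.allclose(deep_sets_features(shuffled), features)
print(features.shape)
```

The resulting vector of length 256 + 128 + 128 = 512 is what would be passed on to the fully connected layers in this variant; how padded constituents are handled in the sum is not addressed here.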

Conclusion
A multiclass flavour tagging algorithm, DeepJet, has been developed to exploit the full information in a jet. The model was tested using CMS simulation. The model relies on low-level variables and a loose selection of the inputs, and it uses an architecture that can process these inputs in an efficient manner. When compared with fully connected models using a smaller set of engineered features, a gain in performance is observed in all topologies of flavour tagging, in some cases exceeding a two-fold efficiency gain for the same misidentification rate. The model goes beyond heavy-flavour tagging and performs quark-gluon discrimination as well. The applicability of the DeepJet approach has been successfully extended in different directions, such as the construction of tagging algorithms for exotic long-lived particles [39].
DeepJet has been trained using the GitHub package of the same name [40,41], with a streamlined training process and multi-threaded data access. The training set has been extracted from the simulation data using [42], which contains an algorithmic description of all the variables used in the training.

Figure 1. An illustration of the DeepJet architecture. Three separate branches are used to process charged candidates, neutral candidates and secondary vertices. The algorithm makes use of 1x1 convolutional layers to perform automatic feature engineering for each class of jet constituents. The three RNN (LSTM) layers combine the information for each sequence of constituents. Finally, the full jet information is combined using fully connected layers.

Figure 3. The performance of DeepJet and DeepCSV in three different jet $p_T$ ranges in QCD multijet events. The performance is shown for both b vs c (dashed lines) and b vs light (solid lines).

Figure 4. Performance of DeepJet and DeepCSV as a function of jet $p_T$. The efficiency for b-jets is shown for three different fixed light-jet misidentification probabilities (mistag rates) using QCD multijet events.

Figure 6. Quark/gluon discrimination performance of DeepJet compared to the CMS "quark-gluon likelihood" method in a sample of pure light quark (uds) and gluon jets.

Figure 8. Comparison of the DeepJet architecture with an adapted DeepJet architecture based on the Deep Sets approach. Left: b jet identification performance. Right: quark/gluon jet separation performance.