From the bottom to the top—reconstruction of t t̄ events with deep learning

The reconstruction of top-quark pair-production (tt̄) events is a prerequisite for many top-quark measurements. We use a deep neural network, trained with Monte-Carlo simulated events, to reconstruct tt̄ decays in the lepton+jets final state. Comparing our approach to a widely-used kinematic fit, we find significant improvements in the correct assignment of jets to the partons from the decay, and we study the reconstruction performance of several kinematic top-quark properties. We document our workflow for the optimisation of the hyperparameters of the deep neural network. This workflow can be followed by experimental collaborations to retrain the network taking into account their detailed detector simulations.


Introduction
At the Large Hadron Collider (LHC), top quarks have been produced copiously in Runs 1 and 2 and will be produced in even higher numbers in Run 3 and at the HL-LHC. The dominant process is the production of a top quark and an antitop quark (tt̄) with a cross section of $832^{+40}_{-46}$ pb [1] at a centre-of-mass energy of √s = 13 TeV. Datasets comprising millions of tt̄ events allow for measurements with unprecedented precision. Examples are measurements of Standard Model (SM) parameters, such as the top-quark mass [2,3], and measurements of differential cross sections [4,5], which may unveil deviations from the SM predictions and provide evidence for physics beyond the SM.
The decay tt̄ → W⁺b W⁻b̄ has a branching ratio (BR) close to unity, so that this decay is almost exclusively used for top-quark measurements. Many measurements have been made in the lepton+jets decay channel, where one of the W bosons decays leptonically and the other one decays hadronically. It is often chosen because of its significant BR (≈ 30% if electrons and muons are considered) and its relatively low background. Several important measurements in the lepton+jets channel require a reconstruction of the tt̄ decay, such as measurements of the top-quark mass [2,3], the W-helicity in top-quark decays [6,7], the tt̄ charge asymmetry [8], and the top-quark width [9].
The reconstruction of events in the lepton+jets channel, i.e. the assignment of jets to the four partons from the tt̄ decay, is often achieved with kinematic fitting. In various analyses, the KLFitter framework [10] has been used. In this algorithm, each possible jet-to-parton assignment is tested for consistency with the kinematics of the tt̄ decay chain using a maximum-likelihood approach. Wrong jet-to-parton assignments result in combinatorial background for the measurement and reduce its precision, so that a good tt̄ reconstruction is essential for such measurements.
Deep neural networks (DNNs) have been proposed for various applications in data analysis at the LHC [11]. One area in which they have been shown to outperform algorithms that were previously used by the experimental collaborations is the identification of particles with a complex signature. Examples include the identification of jets from the hadronisation of b-quarks and c-quarks [12,13], the identification of high-momentum hadronically decaying top quarks [14,15] and W, Z and Higgs bosons [14,16], the classification of jets from the hadronisation of quarks and gluons [17,18], and global particle identification [19].
In this paper, we study the reconstruction of tt̄ events with a DNN, and we compare its performance to KLFitter as an established benchmark algorithm. We use Monte-Carlo (MC) simulated events that are passed through a simplified simulation of a multipurpose detector at the LHC. We then train the DNN with basic kinematic information of the reconstructed objects and flavour-tagging information for the jets. For data analysis at an LHC experiment, the DNN would likely be retrained using the detailed detector simulations that are used at these experiments. We hence present a detailed workflow for retraining the DNN, focusing on the optimisation of the choice of hyperparameters. While automated algorithms for such optimisations exist [20–23], the high dimensionality of the hyperparameter space and the large number of events needed result in a training that is computationally very intensive and hence often not practical in data analysis at LHC experiments. We present a simple sequential grid search, which in particular is parallelisable and helps to understand how the performance depends on certain hyperparameters. The proposed strategy is targeted at LHC experiments, but it may also be used for tt̄ reconstruction at future hadron- and lepton-collider experiments.
For the reconstruction of tt̄ events in association with a Higgs boson that decays via H → bb̄, a DNN was already shown to outperform a boosted decision tree as an alternative machine-learning approach [24]. The added value of the work documented in this paper is threefold. First, we quantify the performance in reconstructing tt̄ events with a DNN, which has a wide range of applications, as outlined above. Second, we compare the performance to the standard approach (kinematic fitting), which has been used for many analyses at the LHC [2,3,6–9,25–28]. Third, we document a simple sequential optimisation procedure that we consider ready to use for data analysis at LHC experiments.

This paper is structured as follows: In section 2, we present the MC simulation samples and the event selection that we used. The training and optimisation of the DNN is the topic of section 3. In section 4, we discuss the performance of the reconstruction using the DNN and compare it to the performance achieved with KLFitter. We present our conclusions in section 5.

Monte Carlo samples and event selection
Samples of simulated events are produced for tt̄ production in proton-proton collisions at √s = 13 TeV with up to one additional parton in the final state using MadGraph5_aMC@NLO [29] at leading order in the strong coupling constant and with a top-quark mass of 173 GeV. Pythia 8 [30] is used for the simulation of the top-quark decays to a W boson and a b-quark, for the decay of the W bosons, and for the simulation of the parton shower and the hadronisation. The W-boson decays are simulated so that one W boson decays to an electron or muon and the corresponding neutrino, and the other W boson decays to two quarks. Additional radiation in the matrix element is matched with the parton shower using the MLM procedure [31]. A simplified simulation of an LHC-type detector¹ is used, as implemented in Delphes [32]. Electrons in the detector are required to have a transverse momentum, p_T, larger than 25 GeV and an absolute value of the pseudorapidity, η, of at most 2.5. Muons in the detector are required to fulfil p_T > 25 GeV and |η| < 2.4. An isolation criterion is used for the charged leptons: the maximum energy around the electron or muon within a radius of $\Delta R = \sqrt{(\Delta\phi)^2 + (\Delta\eta)^2} = 0.5$ is required to be smaller than 12% of the electron's p_T or 25% of the muon's p_T, respectively. Calorimeter jets are reconstructed with the anti-k_t algorithm [33] with a radius parameter of 0.5 and required to fulfil p_T > 25 GeV. Jets are b-tagged using a p_T-dependent efficiency for jets that are geometrically matched to a b-quark. The efficiency is ≈ 52% at p_T = 25 GeV, rises to ≈ 73% at p_T ≈ 150 GeV, and decreases for higher p_T, such that at p_T = 500 GeV it is ≈ 55%. Mistagging efficiencies are applied to jets that are not geometrically matched to a b-quark, with higher mistagging efficiencies for jets that are matched to a c-quark than for jets that are matched to u-, d- or s-quarks or to gluons.

¹ We use the Delphes implementation of a CMS-like detector. However, the exact choice of the simplified detector simulation is believed to have little impact on the conclusions of this study.
The lepton+jets final state is characterised by a charged lepton, missing transverse momentum from the undetected neutrino, and at least four jets, two of which are expected to originate from the hadronisation of a b-quark. In order to select events in this final state, we require exactly one reconstructed electron or muon in the event, and at least four jets with |η| < 2.5. We make no requirements on the amount of missing transverse momentum, E_T^miss, or on the number of b-tagged jets. Out of 225 467 022 generated events, 36 053 266 events fulfil these selection criteria. As additional jets can be present in tt̄ events, for example due to initial-state or final-state radiation, we also study events with at least five and at least six jets. Out of all generated events, 19 781 437 events have at least five and 8 654 315 events have at least six jets.
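The object and event requirements above can be condensed into a short sketch. The class names, field names and the `iso_energy` quantity are hypothetical and chosen only for illustration; the thresholds are the ones quoted in the text, and the sketch is not the Delphes configuration that was actually used.

```python
from dataclasses import dataclass


@dataclass
class Lepton:          # hypothetical container, not a Delphes class
    flavour: str       # "e" or "mu"
    pt: float          # transverse momentum in GeV
    eta: float         # pseudorapidity
    iso_energy: float  # energy within Delta R < 0.5 around the lepton, in GeV


@dataclass
class Jet:             # hypothetical container
    pt: float
    eta: float
    btag: bool = False


def passes_lepton_cuts(lep):
    """Kinematic and isolation requirements quoted in the text."""
    if lep.flavour == "e":
        return lep.pt > 25.0 and abs(lep.eta) <= 2.5 and lep.iso_energy < 0.12 * lep.pt
    if lep.flavour == "mu":
        return lep.pt > 25.0 and abs(lep.eta) < 2.4 and lep.iso_energy < 0.25 * lep.pt
    return False


def select_event(leptons, jets, min_jets=4):
    """Exactly one selected electron or muon and at least min_jets selected jets."""
    good_leptons = [lep for lep in leptons if passes_lepton_cuts(lep)]
    good_jets = [jet for jet in jets if jet.pt > 25.0 and abs(jet.eta) < 2.5]
    return len(good_leptons) == 1 and len(good_jets) >= min_jets
```

Calling `select_event(leptons, jets, min_jets=5)` or `min_jets=6` would give the five-jet and six-jet selections; no requirement on the missing transverse momentum or on b-tags enters, in line with the text.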
The jets that are used for the reconstruction of the tt̄ decay chain are selected by the following procedure: In a first step, the b-tagged jets in the event are used. If more than two b-tagged jets are present, only the two b-tagged jets with the largest p_T are considered. In a second step, additional jets are considered in descending order in p_T, regardless of whether they are b-tagged or not. We consider either the first four, five or six jets that are selected using this procedure. We perform an unambiguous matching of the four partons from the tt̄ decay to the considered jets with the geometrical criterion ∆R(parton, jet) < 0.3. When considering the first four jets, the fraction of events in which all partons can be unambiguously matched to the considered jets is 19.6%. When considering the first five or six jets in events with at least five or six jets, respectively, this fraction increases to 30.4% and 38.3%.
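A minimal sketch of the jet ordering and of the geometrical matching is given below. The dictionary keys and function names are hypothetical. The text does not spell out the exact meaning of "unambiguous", so the sketch assumes that each parton must have exactly one considered jet within ∆R < 0.3 and that no jet may be shared between two partons.

```python
import math


def order_jets(jets, n_consider=4):
    """Up to two leading b-tagged jets first, then the remaining jets by descending pT."""
    by_pt = sorted(jets, key=lambda j: j["pt"], reverse=True)
    leading_b = [j for j in by_pt if j["btag"]][:2]
    rest = [j for j in by_pt if not any(j is b for b in leading_b)]
    return (leading_b + rest)[:n_consider]


def delta_r(a, b):
    """Distance in eta-phi space, with the azimuthal difference wrapped to [-pi, pi)."""
    dphi = (a["phi"] - b["phi"] + math.pi) % (2 * math.pi) - math.pi
    return math.hypot(a["eta"] - b["eta"], dphi)


def match_partons(partons, jets, max_dr=0.3):
    """Return {parton name: jet index}, or None if the matching is not unambiguous.

    Assumption of this sketch: one and only one jet per parton, no shared jets.
    """
    assignment = {}
    for name, parton in partons.items():
        close = [i for i, jet in enumerate(jets) if delta_r(parton, jet) < max_dr]
        if len(close) != 1 or close[0] in assignment.values():
            return None
        assignment[name] = close[0]
    return assignment
```

Events for which `match_partons` returns `None` would not enter the training, but they remain part of the selected sample.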

Optimisation of the neural network
For the training of the DNN, we only use events with an unambiguous matching of all partons to the considered jets, so that each matched jet is assigned the label of its matched parton. The goal of the tt̄ reconstruction algorithm is to predict the correct label for each of the considered jets, i.e. whether it corresponds to the b-quark from the hadronic top-quark decay, the b-quark from the leptonic top-quark decay, one of the light quarks from the hadronic W-boson decay, or whether it does not belong to the tt̄ decay chain and is unmatched.
If all considered jets are assigned the correct label, the corresponding jet permutation is called the "correct permutation". The prediction of the jet-to-parton assignment by an algorithm is called the "predicted permutation". The performance of a reconstruction algorithm can be quantified by the "reconstruction efficiency", defined as the fraction of unambiguously matched events in which the predicted permutation corresponds to the correct permutation. In unambiguously matched events with exactly four jets, there are 4! = 24 permutations, but most reconstruction algorithms do not distinguish between the two light jets from the decay of the hadronic W boson, so that only 4!/2 = 12 independent permutations remain. In events with five jets, there are 5!/2 = 60 independent permutations, and in events with six jets, there are 6!/(2·2) = 180 permutations, where the second factor of 1/2 originates from the fact that the two unmatched jets are also not distinguished.
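The counting of independent permutations can be checked by brute-force enumeration. The function below is a hypothetical helper written only for this check; it treats the two light-quark positions and any unmatched positions as unordered sets.

```python
from itertools import permutations


def independent_permutations(n_jets):
    """Distinct assignments of n_jets jets to (b_had, b_lep, {q1, q2}, {unmatched})."""
    seen = set()
    for perm in permutations(range(n_jets)):
        b_had, b_lep = perm[0], perm[1]
        light = frozenset(perm[2:4])      # the two light jets are interchangeable
        unmatched = frozenset(perm[4:])   # unmatched jets are interchangeable as well
        seen.add((b_had, b_lep, light, unmatched))
    return seen


for n in (4, 5, 6):
    print(n, len(independent_permutations(n)))
```

For four, five and six jets this enumeration gives 12, 60 and 180 assignments, matching 4!/2, 5!/2 and 6!/(2·2).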
The DNN is trained as a binary classifier with the correct permutations as signal instances and with all other permutations as background instances.² This means that we use each event as often as permutations exist for that event. We sort the jets such that in the correct permutation, the first jet corresponds to the b-quark from the hadronically decaying top quark, the second jet corresponds to the b-quark from the leptonically decaying top quark, and the third and fourth jets correspond to the two jets from the hadronic W-boson decay. For the wrong permutations, the ordering is hence different. The input features for the DNN are the four-momenta of the considered jets, the information whether a jet is b-tagged or not, the three-momentum of the charged lepton (assuming that the mass of the charged lepton is negligible), and the magnitude and the azimuth, φ, of the missing transverse momentum. As the azimuth is 2π-periodic, for each object, sin φ and cos φ are used instead of φ itself. All variables are scaled using scikit-learn's [34] StandardScaler, which transforms each variable so that it has a mean of zero and a standard deviation of unity. We treat events with electrons and events with muons together.
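The feature construction can be sketched as follows. The parameterisation of the momenta as (p_T, η, sin φ, cos φ, E) and all dictionary keys are assumptions made for illustration, since the text does not fix them. The `standardise` function is a plain-Python stand-in for the StandardScaler step: it uses the population standard deviation and leaves constant columns unscaled.

```python
import math
from statistics import fmean, pstdev


def event_features(jets, lepton, met):
    """Flat feature vector for one jet permutation (jets given in the tested order)."""
    feats = []
    for jet in jets:
        feats += [jet["pt"], jet["eta"], math.sin(jet["phi"]), math.cos(jet["phi"]),
                  jet["e"], float(jet["btag"])]
    # The charged lepton is treated as massless, so three components suffice.
    feats += [lepton["pt"], lepton["eta"], math.sin(lepton["phi"]), math.cos(lepton["phi"])]
    # Missing transverse momentum: magnitude and azimuth only.
    feats += [met["et"], math.sin(met["phi"]), math.cos(met["phi"])]
    return feats


def standardise(rows):
    """Shift each column to zero mean and scale it to unit standard deviation."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # constant column: divide by 1
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]
```

In practice the means and standard deviations would be derived from the training sample only and then applied unchanged to the validation and test samples.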
We use a fully-connected feedforward DNN implemented with Keras [35] using TensorFlow [36] as the back end. We optimise the hyperparameters of the DNN according to the workflow detailed below, where we use 60% of the selected events for training (training sample), 20% for the evaluation of the performance during the hyperparameter optimisation (validation sample) and 20% for the final evaluation of the performance (test sample). For the minimisation during the training, we use the Adam optimiser [37] with binary cross entropy as loss function. The rectified linear unit is used as the activation function in the hidden layers, and the sigmoid function is used as activation function in the output layer. The batch size and the learning rate are hyperparameters of the Adam optimiser that define the sample size over which gradients are calculated during stochastic gradient descent and the step size in the optimisation, respectively. We choose the batch size to be a multiple of the number of permutations in order to always include all permutations of a single event in the batch.³ In each case we train for a maximum of 200 iterations over the training sample (epochs). The measure that we use to evaluate the performance of a trained DNN is the reconstruction efficiency. The best network of one training is chosen using the epoch with the largest reconstruction efficiency on the validation sample. If the value of the loss function on the validation sample does not improve over 20 epochs, the training is stopped in order to avoid strong overfitting. We further prevent overfitting by regularising the DNN using dropout [38] and the L2 norm. Regularisation with dropout is based on randomly removing a certain fraction (called "dropout rate") of nodes in the hidden layers during the training. Because using dropout requires wider network structures and we keep the dropout rate constant in the hidden layers, we increase the number of nodes in the hidden layers to compensate for the number of dropped nodes in order to avoid instabilities during the training, which can otherwise occur in particular in the last and narrower layers. In the L2 regulariser, a penalty term is added to the loss function that is given by the sum of the squared weights in a hidden layer, multiplied by a strength factor (the L2 parameter). We divide the value of the L2 parameter in a given layer by two with respect to the previous hidden layer, which effectively reduces the strength of the L2 regularisation from the first to the last hidden layer.

² We also studied classifiers for which only a fraction of the wrong permutations was included as background, which reduces the imbalance in the number of signal and background instances in the training. However, we found that this has a negative effect on the classifier performance and we concluded that it is beneficial to include all wrong permutations of an event during the training.

³ We ensure that all permutations that belong to one event are always included in the same batch. We found that the classifier performance decreases if they are randomly distributed over batches.

JINST 14 P11015
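Several of the conventions described above are simple bookkeeping and can be sketched without a deep-learning library. The function names are hypothetical. The halving of the layer widths follows the rule stated in the optimisation workflow, the rounding of the batch size down to a multiple of the number of permutations is an assumption about how "a multiple" is chosen, and the widening of the layers that compensates for dropout is not included because the text gives no formula for it.

```python
def layer_widths(n_hidden, first_width):
    """Each hidden layer has half the nodes of the previous one."""
    return [first_width // 2**i for i in range(n_hidden)]


def l2_schedule(n_hidden, first_l2):
    """The L2 parameter is halved from one hidden layer to the next."""
    return [first_l2 / 2**i for i in range(n_hidden)]


def batch_size_for(n_permutations, target):
    """Largest multiple of n_permutations not exceeding target (at least one event)."""
    return max(1, target // n_permutations) * n_permutations


def pick_best_epoch(val_loss, val_eff, patience=20, max_epochs=200):
    """Stop once the validation loss has not improved for `patience` epochs,
    then return the epoch with the highest validation reconstruction efficiency."""
    best_loss, stale, last = float("inf"), 0, 0
    for epoch, loss in enumerate(val_loss[:max_epochs]):
        last = epoch
        if loss < best_loss:
            best_loss, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
    trained = val_eff[: last + 1]
    return max(range(len(trained)), key=trained.__getitem__)
```

For example, `layer_widths(5, 512)` yields the widths 512, 256, 128, 64 and 32, and `batch_size_for(180, 6000)` yields 5940, i.e. 33 complete six-jet events per batch.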
The workflow for the optimisation of the hyperparameters follows a sequence of two-dimensional grid searches:
1. Optimisation of the DNN structure: We vary the number of hidden layers and the number of nodes in the first hidden layer. The number of nodes in each following hidden layer is always half of the number of nodes in the previous layer. The batch size and the value of the learning rate are kept constant and are chosen manually from a few trial trainings as 6000 and 10⁻³, respectively. Based on the results of all trainings, we choose a network structure with a good performance and a moderate amount of overfitting, which indicates that the structure has a capacity that is large enough for our task.
2. Optimisation of the hyperparameters of the optimiser: For the chosen structure from step 1, we vary the batch size and the learning rate. Based on the results of all trainings, we choose the hyperparameters that result in the best performance.
3. Optimisation of the regulariser: Since we do not apply regularisation in steps 1 and 2, the DNN may overfit to the training sample. We use the hyperparameters from the previous optimisation steps and we additionally optimise the dropout fraction and the L 2 parameter to choose a final set of hyperparameters that balances performance and overfitting.
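The three steps can be sketched as a generic sequence of two-dimensional grid searches. All names are hypothetical, and `train_fn` stands for a full training that returns the validation reconstruction efficiency for a given set of hyperparameters. As a simplification, the sketch always picks the grid point with the highest value, whereas the choices described above also weigh the amount of overfitting.

```python
from itertools import product


def grid_search_2d(train_fn, fixed, name_a, values_a, name_b, values_b):
    """Evaluate train_fn on a 2D grid while all other hyperparameters stay fixed."""
    results = {}
    for a, b in product(values_a, values_b):
        params = {**fixed, name_a: a, name_b: b}
        results[(a, b)] = train_fn(params)  # grid points are independent of each other
    best = max(results, key=results.get)
    return {**fixed, name_a: best[0], name_b: best[1]}, results


def sequential_search(train_fn, start, steps):
    """Run the 2D searches one after another, carrying the best values forward."""
    params = dict(start)
    for name_a, values_a, name_b, values_b in steps:
        fixed = {k: v for k, v in params.items() if k not in (name_a, name_b)}
        params, _ = grid_search_2d(train_fn, fixed, name_a, values_a, name_b, values_b)
    return params
```

Because every grid point within one step is an independent training, the loop in `grid_search_2d` can be distributed over many batch jobs, which is the parallelisation referred to in the introduction.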
We optimise DNNs separately for events with at least four jets, events with at least five jets and events with at least six jets. With an increasing number of jets, the classification problem becomes more and more challenging due to the increasing number of possible jet permutations. We discuss the steps of the optimisation procedure for events with at least four jets in detail in the following. The results of the optimisation of the DNNs for at least five and at least six jets are presented afterwards. When we train the DNNs with events with at least four, five or six jets, we always consider the corresponding number of jets for the jet permutations (four, five or six), following the ordering described at the end of section 2.
1. Optimisation of the DNN structure: In figure 1, the reconstruction efficiency is shown for networks where the number of hidden layers is varied from 2 to 7 in steps of one and the number of nodes in the first hidden layer is varied using the values 128, 256, 512, 1024 and 2048. The reconstruction efficiency for networks with only 2 hidden layers and for networks with only 128 nodes in the first hidden layer is lower than the reconstruction efficiency of networks with a larger number of hidden layers or number of nodes in the first hidden layer. We conclude that the capacity of such small networks is not large enough for the classification task. The best reconstruction efficiency is seen for the network with 5 hidden layers and 256 […]

[Figures: reconstruction efficiency for tt̄ → e/µ + ≥ 4 jets; legend entries n = 2 to n = 7 hidden layers, "best value from training" and "best value of grid search".]

3. Optimisation of the regulariser: In figure 5, the reconstruction efficiency is shown for the network with 5 hidden layers and 512 nodes in the first hidden layer and with a learning rate of 0.01 and a batch size of 12 000 if the dropout rate is varied from 0% to 25% in steps of 5% and the L2 parameter is varied using the values 0, 10⁻¹⁰, 10⁻⁹, 10⁻⁸ and 10⁻⁷. The reconstruction efficiency decreases if the network is too strongly regularised, i.e. if the dropout rates and the values of the L2 parameter are too large. Networks with a mild regularisation, however, show a similar performance to the network without regularisation. One such network with a mild regularisation uses L2 regularisation with a parameter of 10⁻⁸ and no dropout regularisation (dropout rate of 0).
We chose this regularisation with L2 over the setup with the best reconstruction efficiency using dropout, because the difference in performance between these setups is marginal and the difference between the reconstruction efficiency on the training and the validation datasets was found to be smaller in this case. The reconstruction efficiency as a function of the training epoch for the networks that are only regularised using L2 is shown in figure 6 for different values of the L2 parameter. Compared to the performance of the DNN without regularisation (figure 4), overtraining is reduced if an L2 parameter of 10⁻⁸ is chosen and the reconstruction efficiency is slightly higher than without regularisation.
The final optimised hyperparameters for the DNN trained with events with at least four jets are shown in table 1. The reconstruction efficiency as a function of the training epoch is shown in figure 7. The epoch with the best reconstruction efficiency on the validation sample is chosen, which is indicated by the vertical line. Overall, the reconstruction efficiency increases from 80.32% to 80.36% from step 1 to step 3. The increase is small due to a good initial guess of the hyperparameters, which is not guaranteed if the network is retrained with a more detailed detector simulation. In order to test for residual overtraining in the optimised DNN output, the distribution of the "permutation probability" in the training sample and in the test sample is compared. The permutation probability is defined as the DNN output for the predicted permutation divided by the sum of the DNN outputs of all permutations in an event. The distributions of the permutation probability in the training and the test sample are shown for the correct and for all wrong jet permutations in figure 8. The comparison of training-sample and test-sample distributions shows that only a small amount of residual overtraining is present in the optimised DNN.
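The two per-event quantities used here, the permutation probability and the reconstruction efficiency, can be written down directly from their definitions. The function names and the input layout are hypothetical; `outputs` is the list of DNN outputs for all permutations of one event.

```python
def predicted_permutation(outputs):
    """Index of the permutation with the largest DNN output."""
    return max(range(len(outputs)), key=outputs.__getitem__)


def permutation_probability(outputs):
    """DNN output of the predicted permutation divided by the sum over all permutations."""
    return outputs[predicted_permutation(outputs)] / sum(outputs)


def reconstruction_efficiency(events):
    """Fraction of events whose predicted permutation is the correct one.

    events: list of (outputs, index_of_correct_permutation) pairs,
    restricted to unambiguously matched events.
    """
    n_correct = sum(predicted_permutation(out) == correct for out, correct in events)
    return n_correct / len(events)
```

Since the sigmoid outputs are positive, the permutation probability lies between 1/N and 1 for an event with N permutations.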
For the optimisation of the DNNs for events with at least five jets and at least six jets, we follow the same procedure as discussed above. For the optimisation of the regulariser, however, we only use the L2 regulariser, as this is found to be sufficient for regularising the four-jet DNN discussed above. The hyperparameters of the final optimised networks are shown in tables 2 and 3. In both cases, similar values are found for the number of hidden layers, the number of nodes in the first hidden layer, the learning rate and the value of the L2 parameter, as in the four-jet case. The optimal batch size, however, is found to be larger, which is expected because the number of permutations (and hence the number of wrong permutations) per event is larger in the five- and six-jet cases. The reconstruction efficiencies as a function of the training epoch are shown in figure 9. The epochs with the best reconstruction efficiency on the validation sample are chosen (indicated by the vertical line in the figures). The distributions of the permutation probability evaluated on the training and on the test sample for the correct and for all wrong jet permutations are shown in figure 10. The comparison of training-sample and test-sample distributions shows that only a small amount of residual overtraining is present in the optimised DNN for the five-jet case. In the six-jet case, the overtraining is larger than in the other two cases, since the network is less regularised.

Results
We evaluate the performance of the optimised DNN and we compare it to the performance obtained with the KLFitter algorithm. The KLFitter algorithm uses a maximum-likelihood approach based on Breit-Wigner distributions of the top-quark and W-boson invariant masses that are convolved with transfer functions for the jet and lepton energies, as well as for the x- and y-components of the missing transverse momentum. The transfer functions model the detector response and resolution of these quantities. In each jet permutation, a likelihood is maximised by varying the free parameters in a kinematic fit,⁴ which are: the energies of the four partons from the tt̄ decay, the energy of the lepton from the leptonic W-boson decay, and the x-, y- and z-components of the neutrino four-momentum. For a competitive comparison, we have chosen a KLFitter configuration that results in a large probability for the correct prediction of the jet-to-parton assignment: The central value of the Breit-Wigner distribution for the top-quark invariant mass is fixed to the value that is used in the generation of the MC samples ("fixed top-quark mass"), and b-tagging information for the jets is used by vetoing permutations in which a b-tagged jet is in a position of a light quark.⁵ For an additional comparison, we also show the performance of a KLFitter configuration where the top-quark invariant mass is not fixed but the reconstructed masses of both top quarks are required to be compatible with each other ("floating top-quark mass"). Depending on the number of b-tagged jets, the reconstruction with the KLFitter algorithm can result in fewer independent jet permutations than the reconstruction with the DNN. The jet permutation with the largest value of the likelihood is the KLFitter prediction for the jet-to-parton assignment. In figure 11, the performance of the DNN is compared to the performance achieved with the KLFitter configurations using unambiguously matched events, as defined in section 2 and used for the DNN training.
The values from the figure are also displayed in table 4. In these events, the reconstruction efficiency for the correct assignment of jets to all four quarks from the tt̄ decay is higher for the DNN than for KLFitter, where (as expected) the KLFitter configuration with the fixed top-quark mass shows a better performance than the floating-mass configuration [10]. The reconstruction efficiency is in addition compared for the correct assignment of only a part of the event. Also in the cases of the correct assignment of jets to the two quarks from the hadronic W-boson decay, to the b-quark from the hadronic top-quark decay and to the b-quark from the leptonic top-quark decay, the DNN outperforms the KLFitter algorithm. The improvement is particularly strong in the case of events with more than four jets, which are especially difficult to reconstruct, because in these events at least one additional jet is present with respect to the leading-order picture of lepton+jets tt̄ events with four quarks in the final state. Such events are a challenge for the KLFitter algorithm, which is constructed based on this leading-order picture. The DNN, however, also receives five- and six-jet events during the training, can learn the properties of jets originating from additional radiation and is hence able to identify the correct jet permutation with a higher probability in these cases.
⁴ In the case of Gaussian approximations for the Breit-Wigner distributions and the transfer functions, this approach reduces to a traditional kinematic fit using a χ² function.
⁵ As no permutation would survive this veto in events with more than two b-tagged jets, in such cases only permutations in which a non-b-tagged jet is in the position of a b-quark are vetoed.
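The b-tag veto described in the text and in footnote 5 can be illustrated with a few lines of logic. This is a hypothetical sketch of the rule as worded here and not KLFitter code; in particular, the switch to the alternative veto for more than two b-tagged jets follows the footnote literally.

```python
def survives_veto(perm, btags):
    """Decide whether a jet permutation passes the b-tag veto.

    perm:  jet indices in the order (b_had, b_lep, q1, q2, ...)
    btags: list of booleans, one per considered jet
    """
    b_had, b_lep, q1, q2 = perm[:4]
    if sum(btags) > 2:
        # Footnote rule: veto permutations with a non-b-tagged jet in a b-quark position.
        return btags[b_had] and btags[b_lep]
    # Default rule: veto permutations with a b-tagged jet in a light-quark position.
    return not (btags[q1] or btags[q2])
```

With two b-tagged jets among four considered jets, only the two permutations that place the b-tagged jets in the b-quark positions survive out of the 12 independent ones, which illustrates why the number of permutations tested can be smaller than for the DNN.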
Table 4. Reconstruction efficiency for all four quarks from the tt̄ decay ("all"), for the two quarks from the hadronic W-boson decay, for the b-quark from the hadronic top-quark decay and for the b-quark from the leptonic top-quark decay for events with at least four, five or six jets.

In figures 12-14, distributions of the reconstructed mass of the top quark that decays hadronically, its difference in direction with respect to the top quark at parton level ("true top quark") in η-φ space and the relative difference of the p_T with respect to the true top quark's p_T are shown for the four-, five- and six-jet selections. In all figures, only events are shown that are unambiguously matched, i.e. the type of events that is used for training the DNN and where a correct permutation is well-defined. The distribution of the correct permutation is also shown, corresponding to a perfect reconstruction algorithm.
The high reconstruction efficiency of the DNN is reflected in its mass distribution being close to the distribution of the correct permutation for all jet selections (figures 12(a)-14(a)). KLFitter in the fixed-mass mode shows a mass distribution that is close to the distribution of the correct permutation for the four-jet selection and also shows narrow mass peaks for the five- and six-jet selections, which is expected as the algorithm uses the true top-quark mass as the central value of the top-quark-mass Breit-Wigner functions. However, the mass distributions are slightly shifted to higher values, especially for the five- and six-jet selections, because with increasing freedom in the choice of permutations, KLFitter may choose a wrong permutation, in which at least one chosen jet does not originate from the hadronically decaying top quark. This is also indicated by the fact that the mass peak of the distribution for correct permutations is less pronounced than the one reconstructed by KLFitter for the five- and six-jet selections.
The reconstruction of the direction of the hadronically decaying top quark with respect to the true top quark's direction is better with the DNN than with both KLFitter modes in the four-jet case, with the DNN distribution being closest to the distribution of the correct permutation (figure 12(b)). The same holds for the distribution of the relative difference of the reconstructed p_T with respect to the true top quark's p_T (figure 12(c)). The improvement due to the reconstruction with the DNN is considerably stronger for the selections with at least five or at least six jets (figures 13(b,c) and 14(b,c)) than in the four-jet case. As expected, KLFitter in the floating-mass mode results in a worse performance in all of these distributions than KLFitter in the fixed-mass mode.
These results indicate that the DNN is indeed able to learn the kinematics and b-tagging characteristics of the tt̄ decay chain (including additional radiation) and to identify the correct permutation by exploiting this information. As expected from the higher reconstruction efficiencies, the DNN outperforms the two KLFitter modes also in terms of the studied kinematic variables. In the training of the DNN and in the studies presented so far, only unambiguously matched events were used. However, it is important to evaluate the reconstruction with the DNN not only in this particular phase space, but using all events that pass the selection. In unmatched events a true permutation is not well-defined, and we evaluate the performance of the reconstruction based on top-quark kinematic variables that are built using the predicted permutation. A better reconstruction algorithm, i.e. a better jet-to-parton assignment, is expected to result in a better reconstruction of the kinematics of the top quarks, which are constructed by adding the four-vectors of the jets that are assigned by the reconstruction algorithm.
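The kinematic variables of the hadronically decaying top quark follow from the sum of the three assigned jet four-vectors. The sketch below uses hypothetical helper names and represents four-vectors as (E, p_x, p_y, p_z) tuples in GeV. The sign convention of the relative p_T difference, (reconstructed minus true) divided by true, is an assumption, and the pseudorapidity helper requires a non-zero transverse momentum.

```python
import math


def four_vector(pt, eta, phi, mass=0.0):
    """Build (E, px, py, pz) from transverse momentum, pseudorapidity, azimuth and mass."""
    px, py, pz = pt * math.cos(phi), pt * math.sin(phi), pt * math.sinh(eta)
    return (math.sqrt(px * px + py * py + pz * pz + mass * mass), px, py, pz)


def add(*vectors):
    """Component-wise sum, e.g. of the b-jet and the two light jets of the hadronic top."""
    return tuple(sum(components) for components in zip(*vectors))


def invariant_mass(v):
    e, px, py, pz = v
    return math.sqrt(max(e * e - px * px - py * py - pz * pz, 0.0))


def transverse_momentum(v):
    return math.hypot(v[1], v[2])


def pseudorapidity(v):
    return math.asinh(v[3] / transverse_momentum(v))  # requires pT > 0


def azimuth(v):
    return math.atan2(v[2], v[1])


def delta_r(v, w):
    """Distance in eta-phi space between two four-vectors."""
    dphi = (azimuth(v) - azimuth(w) + math.pi) % (2 * math.pi) - math.pi
    return math.hypot(pseudorapidity(v) - pseudorapidity(w), dphi)


def relative_pt_difference(reco, true):
    """(pT_reco - pT_true) / pT_true; the sign convention is an assumption."""
    return (transverse_momentum(reco) - transverse_momentum(true)) / transverse_momentum(true)
```

A reconstructed top quark would be `add(b_had, q1, q2)` for the predicted permutation, and the fraction of events with `delta_r(reco_top, true_top) < 1.0` gives a direction-based figure of merit of the kind quoted later in table 5.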
In figures 15-17, distributions of the reconstructed mass of the top quark that decays hadronically, its difference in direction with respect to the true top quark's direction in η-φ space and the relative difference of the p_T with respect to the true top quark's p_T are shown for the four-, five- and six-jet selections.
The top-quark mass distribution obtained with the DNN is similar to the distribution obtained with the KLFitter algorithm in the fixed-mass configuration, with the KLFitter distribution shifted to slightly larger values than the true top-quark mass (figures 15(a)-17(a)), as also observed using only unambiguously matched events. Mean values and standard deviations of a Gaussian fit to the reconstructed mass peak in a window of ±50 GeV around the true top-quark mass are presented in table 5, showing a similar mass resolution for the DNN and KLFitter in the fixed-mass mode, the slight upward shift of the KLFitter reconstruction in this mode, and the significantly worse resolution obtained with KLFitter in the floating-mass mode. The DNN leads to a larger fraction of top quarks within ∆R = 1.0 of the true top quark's direction (figures 15(b)-17(b)). We observe again that the improvements due to the use of the DNN are stronger in the more challenging cases of events with more than four jets. The fractions of reconstructed top quarks within a radius of ∆R = 1.0 of the true top quark's direction are shown in table 5, quantifying the improvement that is obtained when using the reconstruction with the DNN compared to the reconstruction with KLFitter. Using the DNN for the reconstruction, the fractions even increase with the five- and six-jet selection compared to the four-jet selection, while for both KLFitter modes the opposite trend is observed. One explanation for this increase is that with the four-jet selection, one of the jets originating from the hadronically decaying top quark is often not selected, but is chosen in the five- or six-jet selection, and the DNN is able to identify the correct permutation while the KLFitter reconstruction is not able to suppress the large combinatorial background. A similar behaviour is observed for the relative p_T resolution of the hadronically decaying top quark (figures 15(c)-17(c)).
While only a small improvement in resolution is seen for the DNN compared to the two KLF modes for the selection with at least four jets, the improvement is much stronger for the five- and six-jet selections.
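The two directional and momentum quantities discussed above can be illustrated with a minimal sketch. It assumes the usual collider definition ∆R = sqrt(∆η² + ∆φ²) with ∆φ wrapped into [-π, π), and a relative p_T difference defined as (reconstructed − true) / true; the sign convention, the function names, and the list-of-tuples event layout are assumptions made for illustration and are not taken from the analysis code.

```python
import math


def delta_phi(phi1, phi2):
    """Signed azimuthal difference, wrapped into [-pi, pi)."""
    return (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi


def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance in the eta-phi plane."""
    return math.hypot(eta1 - eta2, delta_phi(phi1, phi2))


def fraction_within(reco, truth, max_dr=1.0):
    """Fraction of events whose reconstructed top quark lies within max_dr
    of the true top quark.

    reco and truth are equal-length lists of (eta, phi) tuples
    (a hypothetical layout chosen for this sketch).
    """
    if len(reco) != len(truth) or not reco:
        raise ValueError("need two non-empty lists of equal length")
    n_close = sum(
        1
        for (eta_r, phi_r), (eta_t, phi_t) in zip(reco, truth)
        if delta_r(eta_r, phi_r, eta_t, phi_t) < max_dr
    )
    return n_close / len(reco)


def relative_pt_difference(pt_reco, pt_true):
    """Relative p_T difference; the sign convention is an assumption."""
    return (pt_reco - pt_true) / pt_true
```

Wrapping ∆φ matters for top quarks close to φ = ±π: without it, two nearly collinear directions on either side of the boundary would appear almost back to back and would be counted as failing the ∆R < 1.0 requirement.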
Table 5. Mean (m_t) and standard deviation (σ(m_t)) of a Gaussian fit to the peak of the reconstructed mass distribution for the hadronically-decaying top quark, and the fraction of hadronically-decaying top quarks that are reconstructed within ∆R = 1.0 of the true top quark, f(∆R < 1.0), for different jet selections and for reconstruction with the DNN and with the two configurations of KLF. The statistical uncertainties due to the limited size of the MC samples are smaller than the precision with which the values are cited.

Although the DNN was only trained using unambiguously matched events, the improvement in performance compared to KLF holds also when kinematic top-quark variables are reconstructed using all events that pass the selection. The fact that the DNN outperforms KLF in the reconstruction task implies that the DNN learns from additional information in the events that is not used in the KLF algorithm. As the improvement in performance is more pronounced for events with additional jets, this indicates that such information is in particular related to additional radiation, which is not part of the leading-order picture that is the basis for the KLF algorithm. Interestingly, the DNN output is positively correlated with the p_T of the true hadronically-decaying top quark, with correlations of 24%, 30% and 32% in the case of the four-, five- and six-jet selections. This means that a better reconstruction is achieved for a larger top-quark p_T. This is consistent with the improved performance of tt̄ reconstruction found for the KLF algorithm at higher p_T [10].
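The text does not state which correlation coefficient is quoted. As a sketch, and assuming it is the linear (Pearson) coefficient between the per-event DNN output and the true top-quark p_T, the quantity could be computed as follows. The function name and the input layout are hypothetical.

```python
import math


def pearson_correlation(xs, ys):
    """Linear (Pearson) correlation coefficient of two equal-length sequences.

    Here xs would hold the per-event DNN output and ys the true p_T of the
    hadronically-decaying top quark (an assumed pairing for this sketch).
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of centred products; the common 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

A value of 0.24 from such a function would correspond to the 24% quoted for the four-jet selection, under the assumption above.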

Conclusions
We have used a deep neural network, trained with Monte-Carlo simulated events, to reconstruct tt̄ decays in the lepton+jets channel. We have optimised the hyperparameters of the neural network by maximising the reconstruction efficiency, i.e. the probability to identify the correct jet permutation out of all possible jet permutations, separately for events with at least four, at least five and at least six jets. We have found that deep networks with five or six hidden layers and up to 512 nodes in the first hidden layer are necessary to reach the best performance. We have compared this approach to a widely-used kinematic fit and have found significant improvements in identifying the correct jet permutation. We have observed particularly large improvements for the more challenging reconstruction of events with more than four jets, where at least one additional jet is selected that does not correspond to the leading-order tt̄ lepton+jets topology. Although the network was trained using only events in which all considered jets were geometrically matched to the true partons from the tt̄ decay, the effect of the improved reconstruction on several kinematic top-quark properties was studied using all events. The network's performance in events with at least four jets was found to be slightly better than that of the benchmark algorithm, and larger improvements were found for events with more than four jets.
We conclude that deep neural networks provide improvements for the reconstruction of tt̄ events, which promise to improve the precision of top-quark measurements at the LHC and at the HL-LHC. We have documented our workflow for the optimisation of the hyperparameters of the deep neural network. This workflow can be followed by experimental collaborations to retrain the network taking into account their detailed detector simulations.