Reconstruction of $\tau$ lepton pair invariant mass using an artificial neural network

The reconstruction of the invariant mass of $\tau$ lepton pairs is important for analyses containing Higgs and Z bosons decaying to $\tau^{+}\tau^{-}$, but highly challenging due to the neutrinos from the $\tau$ lepton decays, which cannot be measured in the detector. In this paper, we demonstrate how artificial neural networks can be used to reconstruct the mass of a di-$\tau$ system and compare this procedure to an algorithm used by the CMS Collaboration for this purpose. We find that the neural network output shows a smaller bias and better resolution of the di-$\tau$ mass reconstruction and an improved discrimination between a Higgs boson signal and the Drell-Yan background with a much shorter computation time.


Introduction
Tau leptons can be produced singly or in pairs through the decay of heavier mesons and baryons, or through the decay of Standard Model bosons that are created in particle collisions. The Z 0 /γ * and Higgs (H) bosons are the only particles that can mediate resonant di-τ production, and Higgs boson production has been recently observed in this decay by the CMS and ATLAS experiment at CERN [12] [5]. The reconstruction of the di-τ system invariant mass is fundamental in distinguishing between the Z and the Higgs boson. The dominant background in identifying the H → τ − τ + signal is the Drell-Yan (DY) process of di-τ production, therefore, it is important to distinguish between the two processes. However, the reconstruction of the mass of the di-τ system is challenging. The τ lepton decays after a short time into leptons or hadrons, both accompanied by neutrinos. Charged leptons and hadrons can be observed in the detector, thus they are called visible particles in this paper. Neutrinos, on the other hand, interact very weakly with matter and escape the experiment undetected, but can be identified in a relatively hermetic detector as used in the CMS and ATLAS experiments [10,3] as an imbalance in the measured momentum calculated in the transverse plane. The missing transverse momentum (MET) is defined as the negative vectorial sum of momenta in the transverse plane of all the visible particles produced in the collision. The momentum of the neutrinos in the beam direction, on the other hand, cannot be quantified, because it depends on the momentum of the quarks and gluons within the proton before the collision, which is only known statistically and not in any individual event. The CMS Collaboration makes use of two different SVfit algorithms for the reconstruction of the di-τ mass. One is based on a likelihood method to reconstruct the mass on an event-by-event basis [8], while the second, improved algorithm employs a likelihood function of arbitrary normalization [7]. The latter algorithm is additionally able to reconstruct the kinematic properties of the di-τ system. The ATLAS and CDF Collaborations use a Missing Mass Calculation (MMC) method that is based on minimizing a likelihood function in the kinematically allowed detector phase space [14]. The method presented in this paper is based on an artificial neural network (called neural network or NN from now on) to reconstruct the mass of the di-τ system. The neural network is implemented with the Python deep learning library "Keras" [9] and can be trained with a data set of simulated di-τ events, which contain all the known parameters of the visible particles and missing energy from neutrinos as input, and the mass of the simulated di-τ system as a training target. After the training, the neural network is able to calculate an approximate value of the mass of the di-τ system using the known parameters of the visible particles and the missing energy for any event with two τ leptons.
The paper is structured as follows: in section 2, events with simulated τ -lepton pairs are described, and are used to train and test the neural network. Section 3 focuses on the specific configuration of the neural network used to deliver the best predictions in comparison with the true mass of the simulated di-τ system. The performance of the neural network is compared to a standard tool for the di-τ mass reconstruction used by the CMS Collaboration in section 4. Conclusions are presented in the final section.

Simulation and selection of τ lepton pair events
Monte Carlo simulation is used to generate τ + τ − events at different masses. The events are generated using MadGraph_aMC@NLO 2.5.5 [2], and then showered using PYTHIA 8.226 [17] [18]. The detector response is modeled using a simplified fast simulation (DELPHES 3.4.1 [13]) of the phase-0 CMS detector with the acceptance and expected performance of the detector [10]. No additional proton-proton collisions (pile-up) are simulated. DY events occur when a mediator Z 0 /γ * is produced through quark-antiquark annihilation. This mediator can then decay into a pair of τ leptons. In proton collisions, the main production processes for the Higgs boson, ordered from largest cross section to smallest, are gluon-gluon fusion, vector boson fusion, W and Z associated production, and t associated production. Events are simulated using the most dominant production mode via gluon-gluon fusion, which occurs through an intermediate heavy-quark loop that is dominated by the top quark [6]. In order to increase the mass range over which the NN can reconstruct the di-τ system, the Higgs boson mass is artificially varied from 80 to 300 GeV in 5 GeV steps. The fully leptonic, semi-leptonic and fully hadronic Higgs decay channels are included. Events with produced Higgs bosons must pass selection requirements similar to those applied in the CMS H → τ + τ − search [7]. The MET must be greater than 20 GeV for all events, in order to make sure that the neutrino momenta are not pointing in opposite directions. For events where both τ leptons decay leptonically, the e/µ with the higher p T must satisfy p T > 20 GeV and |η| < 2.4, with the other e/µ satisfying p T > 10 GeV and |η| < 2.4. For events where only one τ lepton decays leptonically, the e/µ must satisfy p T > 20 GeV and |η| < 2.1, with the vectorial sum of the visible decay products of the hadronically decaying τ lepton having p T > 30 GeV and |η| < 2.3. For the case of events with two τ leptons decaying hadronically, both vectorial sums must have p T > 20 GeV and |η| < 2.1. Figure 1 shows the generated mass and visible mass spectrum for the simulated Higgs boson decaying to τ leptons. The mass corresponds to the true value of the mediator, while the visible mass is obtained by summing the momenta of the final state decay particles, not including the neutrinos. The di-τ gen mass spectrum is not flat due to selection requirements that are related to detector acceptance and identification effects. The visible mass of the di-τ system (visible di-τ gen mass) shows a much different spectrum. In order to separate signal from background, it is desirable to achieve a di-τ mass as close to the generated value as possible, and so information about the neutrino products must be inferred to improve the mass resolution. To prevent the neural network from learning the mass distribution of the training sample instead of the relationship between the inputs and the training target, the training sample consists of events from all the different Higgs boson masses, with the same number of events for each mass value.
The neural network is trained with 270'000 events and the performance of the neural network is tested with 100'000 independent events. The inputs for the neural network are the following parameters: 4 binary numbers classifying the decay channel (fully leptonic, semi-leptonic or fully hadronic), the p T , η, φ, energy, and invariant mass of the visible decay products from each τ lepton, the MET and φ of the MET vector, and the collinear di-τ mass. The collinear di-τ mass is the mass of the di-τ system computed assuming that the vectorial sum of the visible τ lepton decay products point along the original τ lepton direction. The collinear di-τ mass can be calculated as shown in equation 1 [15]: where m coll stands for the collinear di-τ mass, m di−τ vis for the visible di-τ mass, p Tτ vis1,2 and MET 1,2 for the visible transverse momentum and missing transverse momentum of τ 1 and τ 2 , respectively.

Model of the neural network
The neural network used is feedforward, has fully connected layers, and is written with the Python deep learning library, Keras [9]. Keras is a set of high-level building blocks that implement the deep learning model, and it interfaces with a backend, which handles operations such as tensor products and convolutions. The backends used in this work are Theano [19] for running on CPU and Tensorflow [1] for running on GPU. The model used for reconstructing the Higgs boson mass from the di-τ system is shown in Table 1. Batch sizes of 128 and 400 epochs are used for the training. The activation function used is the commonly employed "rectified linear unit (ReLU)" and the loss function is the mean squared error (MSE), which is described by Eq. 2: whereŶ i is the prediction, Y i is the training target value, n is the number of the training events, w i are the weights and b i are the biases used for the predictionŶ i . The optimizer used is called adam which stands for Adaptive Moment Estimation [16]. While the choice of network structure has been evaluated carefully and seems rather optimal for the problem and data set under investigation, the structure would have to be adjusted in case of adding or removing variables. Once the neural network is trained, the performance can be tested with an independent second sample of simulated events. A good quantifier for the performance of the neural network is the mean squared error of the deviation between the predicted value and the true value of the mass from the simulated di-τ system of the test sample, as well as the reconstructed di-τ mass resolution.

SVfit
The performance of the neural network is evaluated with respect to other current working methods for the reconstruction of the di-τ mass. For example the CMS Collaboration uses SVfit [8], which reconstructs the mass of the di-τ system using dynamical likelihood techniques. The term dynamical likelihood techniques refers to likelihood-based methods used for the reconstruction of kinematic quantities on an event-by-event basis. The inputs to SVfit are the visible decay products of the τ leptons, METx and METy as well as the MET covariance matrix. The MET covariance matrix represents the expected resolution of the MET reconstruction in the detector (since the MET measurement is affected by the accuracy of the energy calibrations).

Comparison between neural network and SVfit
If not specified differently, events with fully leptonic, semi-leptonic and fully hadronic decays are used. The predictions of the neural network are shown in figure 2 in comparison with the results from SVfit and with the mass of the simulated events (di-τ gen mass). The shape of the di-τ gen mass distribution originates from the applied cuts. The predictions of the neural network show deviations at the limits of the mass range. Because there are more events in the higher mass range, this effect can most significantly be seen in the range of 250 GeV to 300 GeV. The deviations at the mass range limits are most probably caused by the distribution of events in the training sample, which does not contain any events below 80 GeV or above 300 GeV. However, the results from the neural network show less of a deviation than that of SVfit. Such a deviation in the mean of the reconstructed di-τ mass would not lead to a wrong result in an analysis as the same function would be applied to both data and simulation. In figure 3, the loss (mean squared error) on the training and on the test sample can be seen. The loss shows no sign of overtraining. Overtraining would lead to a discrepancy between the loss on the training and on the test sample and is caused by the neural network describing minor fluctuations in the training sample instead of the underlying relationship.
In figure 4, the relative differences per event between the predictions of the neural network and SVfit are shown for the different decay modes. The relative difference per event is relative difference per event = di-τ gen mass − di-τ mass di-τ gen mass .
The mean and the standard deviation of the histograms in figure 4 refer to the bias and the resolution of the di-τ mass reconstruction, respectively. The bias is the systematic deviation of the predictions in comparison with the target values, and is independent of the bias used in the neural network, while the resolution is a measure of the accuracy of the reconstruction. The bias in the mass determined from the neural network approach using all decay modes is -0.001, which corresponds to only -0.1 % of the mean, and is smaller than the bias using SVfit, which is 0.06 or 6 % of the mean. The mass resolution determined from the neural network using all decay modes is 0.084 while SVfit has 0.17, which is twice as large. The relative differences are also shown for all the events containing only fully leptonic τ decays, semi-leptonic τ decays, and fully hadronic τ decays. The smallest resolution is achieved for both the neural network and SVfit for events which contain the least number of neutrinos, which are the fully hadronic decays. The resolution is larger for events containing semi-leptonic decays and largest for events which contain fully leptonic decays. In figure 5, the bias of the di-τ reconstruction and its standard error for each di-τ gen mass is shown for the neural network and SVfit in comparison. The neural network and SVfit have their smallest bias around 250 GeV and 100 GeV, respectively. Figure 6 shows the comparison of the resolution per di-τ gen mass using either the neural network or SVfit. The resolution of both the neural network and SVfit is constant over the mass range and is larger for SVfit as already shown in figure 4.

Computation time of neural network and SVfit
An important part of the comparison between the neural network and SVfit is the computation time. The benchmark SVfit algorithm requires 6 seconds of computation time relative difference per event  for one event in our computer architecture. Our neural network algorithm requires 110 microseconds per event, a factor of fifty thousand times faster than the SVfit algorithm in the same architecture. The neural network algorithm, however, requires about 17 hours to train on 270'000 events with 400 epochs. But once the neural network is trained, the model architecture and the model weights can be saved and used later to make new predictions. In a different computing architecture, using a GPU accompanied by Tensorflow [1] as the backend, the computation time to train the neural network was reduced down to 1.4 hours, with a prediction time of 7 microseconds per event.

Discrimination between Higgs boson Signal and Drell-Yan Background using NN and SVfit
A better discrimination between signal and background events can be achieved by an improved performance of the mass reconstruction, which can be estimated using the metric: to ascertain the signal significance for finding a signal of known cross section σ S and efficiency S over that of a known background, σ B and B , for a given integrated luminosity, L. Figure 7 shows the mass reconstruction for a Higgs boson mass of 125 GeV, consistent with the CMS and ATLAS observations [11] with MadGraph [2]. The integrated luminosity is set to a value of L = 100 fb −1 , similar to the data sets collected by the ATLAS and CMS experiments until the end of 2017. The efficiency is the number of events in the signal range divided by the total number of events passing the requirements in section 2. The DY background events are slightly biased towards higher masses since the neural network is only trained for masses of 80 GeV and above. Using a larger mass range of the training sample would lead to a smaller deviation for the DY background and therefore to an even larger signal to background significance.

Discussion of neural network and SVfit
In section 4.2, it can be seen that the reconstruction using a neural network shows a significantly smaller bias and better resolution in comparison with SVfit. The computation time, once the neural network is trained, is much shorter than using SVfit, as shown in section 4.3. Despite those advantages, the neural network has a few drawbacks. The  construction of a neural network model, which delivers predictions close to the target values, is time consuming because a slight modification in the model can change the predictions drastically. This is due to the fact that the neural network is trained multiple times with a large sample of events. Unlike SVfit, a large sample of simulated events is required for training the neural network. Using a neural network for reconstruction has a lot of potential, which warrants further study. A neural network is, in general, not restricted to one target parameter as used here. For example, the whole 4-vector of the di-τ system could be reconstructed if p T , η, φ and the mass are set as target parameters. If the target parameters vary over different scales, like p T as compared to η, finding a model, and especially an activation function, which delivers predictions close to the target values for all the target parameters, is more difficult than for just one target parameter.

Conclusion
We presented a technique for the reconstruction of the mass of a di-τ system using a neural network. The predictions of the neural network are compared to the results using the SVfit likelihood technique, which is currently used by the CMS Collaboration to reconstruct the di-τ mass. For the neural network, the bias in the reconstructed mass is -0.1 % and the mass resolution is 8.4 %, whereas SVfit has a bias of 6 % and a mass resolution of 17 %. The signal to background significance of a Higgs boson signal with a mass of 125 GeV and Drell-Yan background using the neural network is 16.5 ± 0.2 and shows an improvement with respect to SVfit, which has a signal to background significance of 11.2 ± 0.1. Using a neural network instead of SVfit improves the performance of the di-τ mass reconstruction significantly. The neural network predicts the di-τ mass approximately fifty thousand times faster than SVfit. With the neural network technique, one must factor a few hours of retraining each time the detector reconstruction or data conditions change in order to ensure that it performs without bias. This extra computing time is not significant, and is commonplace in other machine learning software algorithms in today's experiments. Using a carefully-optimized neural network for reconstruction of the invariant mass of di-τ resonances has significant advantages over common approaches, delivering better mass resolution, reduced bias, and a much faster computation time.