Identification of hadronic tau lepton decays using a deep neural network

A new algorithm is presented to discriminate reconstructed hadronic decays of tau leptons (τ h) that originate from genuine tau leptons in the CMS detector against τ h candidates that originate from quark or gluon jets, electrons, or muons. The algorithm inputs information from all reconstructed particles in the vicinity of a τ h candidate and employs a deep neural network with convolutional layers to efficiently process the inputs. This algorithm leads to a significantly improved performance compared with the previously used one. For example, the efficiency for a genuine τ h to pass the discriminator against jets increases by 10–30% for a given efficiency for quark and gluon jets. Furthermore, a more efficient τ h reconstruction is introduced that incorporates additional hadronic decay modes. The superior performance of the new algorithm to discriminate against jets, electrons, and muons and the improved τ h reconstruction method are validated with LHC proton-proton collision data at √s = 13 TeV.

The CMS collaboration
E-mail: cms-publication-committee-chair@cern.ch
Keywords: Large detector systems for particle and astroparticle physics; Particle identification methods; Pattern recognition, cluster finding, calibration and fitting methods

Introduction
Tau leptons play a crucial role in determining the nature of the 125 GeV Higgs boson [1][2][3][4][5], in searching for additional Higgs bosons [6][7][8][9][10][11][12][13], and in searching for other new particles [14][15][16]. Since tau leptons most frequently decay to hadrons and neutrinos, all these analyses require the efficient reconstruction and identification of hadronic tau lepton decays (τ h ). In this paper, we introduce a new algorithm to identify τ h candidates coming from hadronic tau lepton decays in the CMS detector and measure its performance with collision data from the CERN LHC. The new algorithm is based on a deep neural network (DNN) that significantly improves the performance of the τ h identification. Furthermore, we discuss an optimized τ h reconstruction algorithm that includes additional τ h decay modes.
The validation of the τh reconstruction and identification with collision events is then discussed in section 5. The paper is summarized in section 6. Tabulated results are provided in the HEPData record for this analysis [32].

The CMS detector
The central feature of the CMS apparatus is a superconducting solenoid of 6 m internal diameter, providing a magnetic field of 3.8 T. Within the solenoid volume are a silicon pixel and strip tracker, a lead tungstate crystal electromagnetic calorimeter (ECAL), and a brass and scintillator hadron calorimeter (HCAL), each composed of a barrel and two endcap sections. Forward calorimeters extend the coverage provided by the barrel and endcap detectors. Muons are detected in gas-ionization chambers embedded in the steel flux-return yoke outside the solenoid. A more detailed description of the CMS detector, together with a definition of the coordinate system used and the relevant kinematic variables, is reported in ref. [33].

Event reconstruction
The particle-flow (PF) algorithm [34] reconstructs and identifies each individual particle in an event, with an optimized combination of information from the various elements of the CMS detector. The energy of photons is obtained from the ECAL measurement. The energy of electrons is determined from a combination of the electron momentum at the primary interaction vertex, as determined by the tracker, the energy of the corresponding ECAL cluster, and the energy sum of all bremsstrahlung photons spatially compatible with originating from the electron track. The energy of muons is obtained from the curvature of the corresponding track. The energy of charged hadrons is determined from a combination of their momentum measured in the tracker and the matching ECAL and HCAL energy deposits, corrected for the response function of the calorimeters to hadronic showers. Finally, the energy of neutral hadrons is obtained from the corresponding corrected ECAL and HCAL energies.
For each event, hadronic jets are clustered from all the PF candidates using the infrared- and collinear-safe anti-kT algorithm [35,36] with a distance parameter of 0.4. Jet momentum is determined as the vectorial sum of all particle momenta in the jet, and is found from simulation to be, on average, within 5 to 10% of the true momentum over the whole pT spectrum and detector acceptance. Additional pp interactions within the same or nearby bunch crossings (pileup) can contribute additional tracks and calorimetric energy depositions, increasing the apparent jet momentum. To mitigate this effect, charged particles identified as originating from pileup vertices are discarded and an offset correction is applied for remaining contributions [37]. Jet energy corrections are derived from simulation to bring the measured response of jets to that of particle-level jets on average. In situ measurements of the momentum balance in dijet, photon+jet, Z+jet, and multijet events are used to correct any residual differences in the jet energy scale between data and simulation [38]. Additional selection criteria are applied to each jet to remove jets potentially dominated by instrumental effects or reconstruction failures [37].
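In practice the clustering is performed with dedicated software; the following minimal sketch only illustrates the anti-kT distance measures that drive the clustering with distance parameter R = 0.4 (the function names are ours, for illustration only):

```python
import math

def antikt_dij(pt_i, y_i, phi_i, pt_j, y_j, phi_j, R=0.4):
    """Pairwise anti-kT distance d_ij = min(1/pT_i^2, 1/pT_j^2) * dR^2 / R^2."""
    dphi = (phi_i - phi_j + math.pi) % (2.0 * math.pi) - math.pi  # wrap to (-pi, pi]
    dr2 = (y_i - y_j) ** 2 + dphi ** 2
    return min(pt_i ** -2, pt_j ** -2) * dr2 / R ** 2

def antikt_diB(pt_i):
    """Beam distance d_iB = 1/pT_i^2; the algorithm recombines the pair
    (or promotes a pseudojet to a jet) with the smallest of all distances."""
    return pt_i ** -2
```

Because the distances scale with the inverse squared pT, hard particles cluster first, which is what makes the resulting jets cone-like and infrared/collinear safe.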
The missing transverse momentum vector p⃗Tmiss is computed as the negative vector sum of the transverse momenta of all the PF candidates in an event, and its magnitude is denoted as pTmiss [39]. The p⃗Tmiss is modified to include corrections to the energy scale of the reconstructed jets in the event.

JINST 17 P07023
The candidate vertex with the largest value of summed physics-object pT² is the primary pp interaction vertex. The physics objects specifically used for this purpose are jets, clustered using the jet finding algorithm [35,36] with the tracks assigned to candidate vertices as inputs, and the associated missing transverse momentum, taken as the negative vector sum of the pT of those jets, which include tracks from leptons.
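The vertex choice described above can be sketched as follows; the helper names and the flat per-vertex input format are ours, not CMS code:

```python
import math

def vertex_score(jet_pts, jet_phis):
    """Summed physics-object pT^2 for one candidate vertex: the jets clustered
    from its tracks, plus the associated missing transverse momentum taken as
    the negative vector sum of those jet pT."""
    score = sum(pt ** 2 for pt in jet_pts)
    px = -sum(pt * math.cos(phi) for pt, phi in zip(jet_pts, jet_phis))
    py = -sum(pt * math.sin(phi) for pt, phi in zip(jet_pts, jet_phis))
    return score + px ** 2 + py ** 2

def select_primary_vertex(candidates):
    """candidates: list of (jet_pts, jet_phis) per vertex; return the index
    of the vertex with the largest score, i.e. the primary vertex."""
    scores = [vertex_score(pts, phis) for pts, phis in candidates]
    return scores.index(max(scores))
```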
Muons are measured in the range |η| < 2.4, with detection planes made using three technologies: drift tubes, cathode strip chambers, and resistive plate chambers. The single muon trigger efficiency exceeds 90% over the full η range, and the efficiency to reconstruct and identify muons is greater than 96%. Matching muons to tracks measured in the silicon tracker results in a relative pT resolution for muons with pT < 100 GeV of 1% in the barrel and 3% in the endcaps. The pT resolution in the barrel is better than 7% for muons with pT up to 1 TeV [40].
The electron momentum is estimated by combining the energy measurement in the ECAL with the momentum measurement in the tracker. The momentum resolution for electrons with pT ≈ 45 GeV from Z → ee decays ranges from 1.7 to 4.5%. It is generally better in the barrel region than in the endcaps, and also depends on the bremsstrahlung energy emitted by the electron as it traverses the material in front of the ECAL [41].
Events of interest are selected using a two-tiered trigger system. The first level, composed of custom hardware processors, uses information from the calorimeters and muon detectors to select events at a rate of around 100 kHz within a fixed latency of about 4 µs [42]. The second level, known as the high-level trigger, consists of a farm of processors running a version of the full event reconstruction software optimized for fast processing, and reduces the event rate to around 1 kHz before data storage [43].

Simulated event samples
Monte Carlo simulated event samples are used to optimize the τh reconstruction, as input for the training of the neural network, and to validate the performance of the τh reconstruction and identification. Unless explicitly mentioned, the simulated event samples include boson decays to all lepton flavours (ℓ), i.e. to electrons, muons, and tau leptons. The samples listed in the following are produced separately for conditions corresponding to the 2016, 2017, and 2018 data-taking periods; a range of versions of the different generators is used as specified below.
Events from the production of Z/γ* and W bosons in association with jets (Z+jets and W+jets) are generated with MadGraph5_aMC@NLO v2.2.2 or v2.4.2 [44] at leading order (LO) in perturbative quantum chromodynamics (QCD) with the MLM jet merging scheme [45]. A separate Z+jets sample for the training, as well as diboson event samples (WW, WZ, and ZZ), are generated with MadGraph5_aMC@NLO at next-to-leading order (NLO) in perturbative QCD with the FxFx merging scheme [46]. Events from top quark-antiquark pair (tt) and single top quark production are generated with powheg v2.0 at NLO in perturbative QCD [47][48][49][50][51][52]. Events from multijet production via the strong interaction, referred to as QCD multijet production, are generated at LO using pythia 8.223, 8.226, or 8.230 [53]. The pythia event generator is also used to produce heavy additional gauge boson event samples at LO, Z′ → ℓℓ, with m(Z′) ranging from 1 to 5 TeV. The generation of 125 GeV Higgs boson (H) events via gluon fusion at NLO with H decays to tau leptons (H → ττ) is also performed with the powheg 2.0 generator [54]. The NNPDF 3.0 or 3.1 parton distribution functions are used as input in all the calculations [55,56].

The τh reconstruction
The reconstruction of τ h candidates is carried out with the hadrons-plus-strips algorithm. The reconstruction proceeds in four steps, as summarized in the following and described in detail in refs. [17][18][19].
First, seed regions are defined, with the goal of reconstructing one τh candidate per seed region. Each seed region is defined by a reconstructed hadronic jet. The jets for the seeding are clustered from all particles reconstructed by the PF algorithm using the anti-kT algorithm with a distance parameter of 0.4. All particles in an η-φ cone of radius ΔR ≡ √(Δη² + Δφ²) = 0.5 around the jet axis are considered for the next steps of the τh reconstruction.
Second, π0 candidates are reconstructed using "strips" in η-φ space in which the four-momenta of electrons and photons are added, and charged-hadron (h±) candidates are selected using charged particles from the PF algorithm as input.
Third, all possible τh candidates are reconstructed in a number of decay modes from the reconstructed charged hadrons and strips. The τh four-momentum is obtained by summing the four-momenta of the charged hadrons and strips used to reconstruct the τh candidate in a given decay mode. An overview of the τ decay modes and their branching fractions is given in table 1. There are seven different reconstructed τh decay modes, including three new ones (with respect to the algorithm documented in ref. [19]) that target τ− → h−h+h−π0ντ decays and τh decays with three charged hadrons of which one is not reconstructed as a charged particle.
Fourth, a single τh candidate is chosen among all possible reconstructed τh candidates within a seed region. Reconstructed τh candidates are subject to the following constraints:
• The mass of the reconstructed τh candidate is required to be loosely compatible with: (i) the ρ(770) resonance if reconstructed in the h±π0 mode; (ii) the a1(1260) resonance if reconstructed in either of the h±π0π0 and h±h∓h± modes [19]; or (iii) the expected mass spectrum including the ρ(1450) resonance if reconstructed in the h±h∓h±π0 modes. The mass of all h± candidates is assumed to be the charged-pion mass, in line with the output from the PF algorithm.
• The τh charge needs to be ±1, unless the τh candidate is reconstructed in a mode with a missed charged hadron, in which case the charge is set to that of the charged hadron with higher pT.
• All reconstructed h± and π0 candidates need to be in the signal cone, defined with radius ΔR = 3.0 GeV/pT (with ΔR limited to the range 0.05-0.1) with respect to the τh momentum.
Among the selected τh candidates, the one with the highest pT is chosen. This τh candidate can subsequently be discriminated against quark or gluon jets, electrons, and muons.
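The signal-cone requirement can be illustrated with a short sketch; the function names are ours, while the 3.0 GeV/pT scaling and the 0.05-0.1 clipping follow the text:

```python
def signal_cone_radius(tau_pt):
    """Signal cone dR = 3.0 GeV / pT(tau), clipped to the range [0.05, 0.1]."""
    return min(max(3.0 / tau_pt, 0.05), 0.1)

def in_signal_cone(deta, dphi, tau_pt):
    """True if a decay product at (deta, dphi) relative to the tau momentum
    lies inside the pT-dependent signal cone."""
    return (deta ** 2 + dphi ** 2) ** 0.5 < signal_cone_radius(tau_pt)
```

The shrinking cone reflects the Lorentz boost: decay products of a high-pT τh are more collimated, so the cone narrows with pT until it reaches the 0.05 floor.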
A large fraction of τh decays are reconstructed in their targeted decay modes, as inferred from figure 1, which shows the fractions of reconstructed decay modes for generated τh with a given decay mode that fulfil pT > 20 GeV and |η| < 2.3. The fraction of generated τh not reconstructed in any of the modes ranges from 11% for h± to 25% for h±π0. The overall reconstruction efficiency is mostly limited by the track reconstruction efficiency for charged hadrons of around 90% [34]. For the decay modes without missing charged hadrons, the charge assignment is 99% correct for an average Z → ττ event sample, 98% for τh with pT ≈ 200 GeV, and 92% for τh with pT ≈ 1 TeV.
The decay modes with missing charged hadrons recover 19% of the τ− → h−h+h−ντ decays and 13% of the τ− → h−h+h−π0ντ decays. However, because of the missing charged hadron, the charge assignment is only correct in ≈70% of the cases, as opposed to an average of 99% for the other decay modes. Since most physics analyses apply requirements on the reconstructed τh charge to suppress background events, these decay modes are only useful for analyses that are not limited by background events. None of the current analyses within the CMS collaboration fall into this category. Although we include the decay modes with missing charged hadrons in the new τh identification algorithm, most of the subsequent results are shown only for reconstructed τh candidates in one of the other decay modes, unless explicitly stated.

Figure 1. Decay mode confusion matrix. For a given generated decay mode, the fractions of reconstructed τh in different decay modes are given, as well as the fraction of generated τh that are not reconstructed. Both the generated and reconstructed τh need to fulfil pT > 20 GeV and |η| < 2.3. The τh candidates come from a Z → ττ event sample with pT(ττ) > 50 GeV. Decay modes with the same numbers of charged hadrons and one or two π0s are combined and labelled as "π0s".

The τh identification with a deep neural network
The newly developed τh identification algorithm uses a deep neural network structure. Its architecture is based on the following three premises:
• Multiclass: previously, separate dedicated algorithms were used to reject electrons, muons, or quark and gluon jets reconstructed as a τh candidate, either based on tree ensembles (jets and electrons) or on a number of selection criteria (muons) [19]. Including electrons, muons, and jets in the same algorithm is expected both to improve identification performance and to reduce maintenance efforts.
• Usage of lower-level information: the MVA discriminators used previously were built on higher-level input variables and showed improved performance with respect to cutoff-based criteria. However, jets hadronize and fragment in complex patterns, and particles from concurrent interactions lead to similarly complex detector patterns. We expect that a machine-learned algorithm, which uses a sufficiently large data set for the training and is able to exploit lower-level information, can lead to improved performance. Therefore, information about all reconstructed particles near the τh candidate is directly used as input to the algorithm.
• Domain knowledge: sets of handcrafted higher-level input variables used previously are utilized as inputs in addition to the information about the single reconstructed particles.
Although it should, in principle, be possible to achieve the same performance with and without using these variables given a sufficiently large set of events for the training and a suitable network architecture, the usage of the higher-level inputs may reduce the number of training events needed and improve the convergence of the training, as seen in other studies [30].
Various network architectures were considered under the constraints that they satisfy these premises with manageable complexity. In particular, the algorithm must be trainable on the available hardware, and both its memory footprint and evaluation time are subject to constraints when used in the CMS reconstruction. To process the information on all reconstructed particles near the τh axis, convolutional layers are employed. Compared with other approaches like fully connected layers or graph-based structures, convolutional layers have two main advantages: they have a smaller computational complexity and they are implemented efficiently in current deep learning software. On the other hand, they require a partitioning of the inputs, in the case at hand a two-dimensional one in η-φ space. As explained in the following, this partitioning leads both to many empty input cells and to cells in which information is lost because multiple particles fall into them. The details of the architecture and the inputs are given in the following.

Particle-level inputs
Two grids are defined in η-φ space, centred around the τh axis, as displayed in figure 2: an inner grid with 11×11 cells and a cell size of 0.02×0.02, and an outer grid with 21×21 cells and a cell size of 0.05×0.05. These two grids overlap, i.e. the information in the inner grid is also part of the information that is processed by the outer grid. For each grid cell, properties of the contained reconstructed particles of seven different types are used as inputs. If there is more than one particle of a given type, only the highest-pT particle is used as input. The various types are the five kinds of particles reconstructed by the PF algorithm in the central detector, i.e. muons, electrons, photons, charged hadrons, and neutral hadrons. The muons and electrons are additionally reconstructed by dedicated standalone reconstruction algorithms, which provide more specific information to distinguish them from the other particles. For each particle, various types of information are taken into account, as listed in table 2. The major tradeoffs defining the grid size are the computational cost and the loss of information when more than one particle of a specific type is present in a grid cell. On average, 1.7% of the inner and 7.1% of the outer grid cells are occupied. Up to approximately 10% of the occupied cells contain more than one particle of a specific type.
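A simplified sketch of the grid filling, keeping only the highest-pT particle per cell and per particle type as described above (the dict-based particle format and function name are ours, for illustration):

```python
def fill_grid(particles, n_cells, cell_size):
    """Assign particles (dicts with 'deta', 'dphi', 'pt', 'type' relative to
    the tau axis) to an n_cells x n_cells grid centred on the tau axis.
    Only the highest-pT particle of each type is kept per cell."""
    half = n_cells * cell_size / 2.0
    grid = {}
    for p in particles:
        if not (-half <= p["deta"] < half and -half <= p["dphi"] < half):
            continue  # outside this grid's acceptance
        i = int((p["deta"] + half) / cell_size)
        j = int((p["dphi"] + half) / cell_size)
        key = (i, j, p["type"])
        if key not in grid or p["pt"] > grid[key]["pt"]:
            grid[key] = p
    return grid
```

With the paper's numbers, the inner grid is `fill_grid(particles, 11, 0.02)` and the outer grid `fill_grid(particles, 21, 0.05)`; the overwrite on `key` is exactly where information is lost when a cell holds several particles of one type.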
Table 2. Input variables used for the various kinds of particles that are contained in a given cell. For each type of particle, basic kinematic quantities (pT, η, φ) are included but not listed below. Similarly, the reconstructed charge is included for all charged particles. An estimated per-particle probability for the particle to come from a pileup interaction using the pileup identification (PUPPI) algorithm [61] is labelled as PUPPI. A number of input variables that give the compatibility of the track with the primary interaction vertex (PV) or a possible secondary vertex (SV) from the τh reconstruction are denoted as "Track PV" and "Track SV".

Figure 2. Layout of the grids in η-φ space around the reconstructed τh axis used to process the particle-level inputs for the convolutional layers of the DNN. The inner grid comprises 11×11 cells with a cell size of 0.02×0.02 and contains the signal cone with a radius of 0.05-0.1, which is defined in the τh reconstruction (the charged hadrons and π0 candidates used to reconstruct the τh candidate need to be within the signal cone). For high-pT quark and gluon jets, the finer grid is also able to resolve the dense core of the jet. The outer grid comprises 21×21 cells with a cell size of 0.05×0.05 and contains the isolation cone with a radius of 0.5 that is used to define higher-level observables that correlate with quark or gluon jet activity.
The finer binning near the τh axis is motivated as follows: first, the τh decay products occur predominantly in an η-φ cone with radius 0.1 around the reconstructed τh axis. Second, high-pT jets are more collimated, so a finer grid is needed to capture all particles within them.

High-level inputs
The high-level input variables correspond primarily to those proven to be useful in the previous MVA classifier [19]. These variables include: (i) the τh four-momentum and charge; (ii) the number of charged and neutral particles used to reconstruct the τh candidate; (iii) isolation variables; (iv) the compatibility of the leading τh track with coming from the PV; (v) the properties of a secondary vertex in the case of a multiprong τh; (vi) observables related to the η and φ distributions of energy reconstructed in the strips; (vii) observables related to the compatibility of the τh candidate with being an electron; and (viii) the estimated pileup density in the event. In all, there are 47 high-level input variables.

Transformations
Input variables that are either integer in nature or have a significantly skewed distribution are modified with a linear transformation such that the values lie in the interval [−1, 1] or [0, 1]. Most of the other input variables are standardized, x → (x − μ)/σ, with μ denoting the mean of the distribution of x and σ its standard deviation. The standardized inputs are truncated to [−5, 5] to protect against outliers.
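The two transformations can be sketched as follows; μ, σ, and the interval bounds would in practice come from the training sample, and the function names are ours:

```python
def standardize(x, mean, std, clip=5.0):
    """x -> (x - mean)/std, truncated to [-clip, clip] to protect
    against outliers."""
    z = (x - mean) / std
    return max(-clip, min(clip, z))

def rescale(x, lo, hi, target=(-1.0, 1.0)):
    """Linear map of [lo, hi] onto the target interval, used for inputs that
    are integer in nature or have strongly skewed distributions."""
    t0, t1 = target
    return t0 + (x - lo) * (t1 - t0) / (hi - lo)
```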

Architecture
The overall DNN architecture is visualized in figure 3. The goal of the DNN is to use and process the inputs to optimally classify the τh candidate as belonging to a target class, which corresponds to determining whether the reconstructed τh originates from a genuine τh, a muon, an electron, or a quark or gluon jet. Three different subnetworks are created that independently process the inputs from the high-level variables, the outer cells, and the inner cells. The outputs from these three subnetworks are then concatenated and passed through four fully connected layers with 200 nodes each and then to a final layer with four nodes that yield outputs y_α, with α ∈ {τh, jet, µ, e}. A softmax activation function [62] is then applied to yield estimates of the probabilities for the τh candidate to come from each of the four target classes (τh, jet, µ, e).
The two subnetworks that process the inputs from the inner and outer cells have similar structures. As explained above, the main idea is to process the grids with convolutional layers [22][23][24]. In our case, the large number of input parameters for each cell makes it computationally too expensive to directly process the grid cells. To reduce the dimensionality, the inputs from each cell of the inner and outer grid are first sent through a number of fully connected layers. Since these layers act on single cells of the grids, with the full grid later on processed with convolutional layers, these fully connected layers can also be considered as one-dimensional convolutional layers, and the nodes can be considered as filters. For each cell, inputs from the electron, muon, and hadron blocks are first processed separately in three fully connected layers with a decreasing number of nodes. The outputs from the three separate processing blocks are then concatenated and passed through four additional fully connected layers with a decreasing number of nodes. The 188 input parameters are reduced to 64 output parameters (filters) for each cell that are input to the convolutional layers.
The convolutional layers each make use of 64 filters with size 3×3. These filters reduce the size of the grid with each step by two in each dimension since no padding is applied (i.e. the grid is not artificially extended with additional cells beyond the dimension of the grid). For both the inner cells with a dimension of 11×11 and the outer cells with a dimension of 21×21, the convolutional layers

Figure 3. The DNN architecture. The three sets of input variables (inner cells, outer cells, and high-level features) are first processed separately through different subnetworks, whose outputs are then concatenated and processed through five fully connected layers before the output is calculated that gives the probabilities for a candidate to be either a τh, an electron, a muon, or a quark or gluon jet. The subnetwork for the high-level inputs consists of three fully connected layers with decreasing numbers of nodes, taking 47 inputs and yielding 57 outputs. The features of both the inner and outer cells are input to more complex subnetworks, applied similarly for inner and outer cells. In the first part, the observables in each grid cell are processed through a set of fully connected layers, first separately for electrons/photons (containing both the features for PF electrons and electrons from the standalone reconstruction), muons (similarly containing both features from PF and standalone muons), and charged/neutral hadrons, passing through three fully connected layers each. The outputs are concatenated and passed through four additional fully connected layers, yielding 64 outputs for each cell. The grids are then processed with convolutional layers, which successively reduce the size of the grid: for the inner cells, 5 convolutional layers reduce the grid from 11×11 to a single cell; for the outer cells, 10 convolutional layers reduce the grid from 21×21 to a single cell. The numbers of trainable parameters (TP) are also given for the different subnetworks.
are applied until the grid is reduced to a single cell. As a consequence, there are 5 convolutional layers for the inner cells and 10 for the outer cells. For both the inner and the outer cells, the final single cell implies that there are 64 outputs.
The 47 high-level input variables are processed by three fully connected layers with 113, 80, and 57 nodes, yielding 57 outputs. Taken together, the three subnetworks yield 185 output parameters that are processed through a number of fully connected layers as explained above.
After each of the fully connected and convolutional layers, batch normalization [63] and dropout regularization [64,65] with a dropout rate of 0.2 are applied. Furthermore, nonlinearity is introduced by applying the parametric rectified linear unit activation function, which is given by f(x) = a·x for x < 0 and by f(x) = x for x ≥ 0, with a being a trainable parameter [66]. A few alternative approaches were tried. First, networks using only high-level inputs and an overall lower number of trainable parameters were tested. These showed significantly inferior performance in terms of τh identification efficiency versus the misidentification probabilities for jets, electrons, and muons, as illustrated in figures 4, 6, and 7. Second, alternative approaches were tried that were based on the idea of having two-dimensional grids for the particle-level inputs, but with different input cell sizes and with differently organized convolutional layers, e.g. convolutional layers with padding and pooling (combining the information from a number of nearby cells). The chosen architecture yielded the best performance among the ones tested within the predefined computing constraints.
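The quoted grid reductions and the width of the concatenated feature vector can be checked with a small sketch: a 3×3 convolution without padding ("valid") shrinks each dimension by two, and the three subnetwork outputs (64 + 64 + 57) are concatenated. The helper names are ours:

```python
def conv_out(size, kernel=3):
    """Output size of a 'valid' (no padding) convolution along one dimension."""
    return size - kernel + 1

def layers_to_collapse(size):
    """Number of valid 3x3 convolutional layers needed to reduce a square
    grid of the given size to a single cell."""
    layers = 0
    while size > 1:
        size = conv_out(size)
        layers += 1
    return layers

def concat_width():
    """Width of the concatenated vector fed to the final fully connected
    layers: 64 filters from each collapsed grid plus 57 high-level outputs."""
    return 64 + 64 + 57
```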

Loss function and classifier training
The loss function is a sum of three different terms, given the four target classes (jet, µ, e, τh):
1. a regular cross-entropy term [62] for the τh target class,
2. a term implementing focal loss [67] for the binary classification of τh against all backgrounds combined, and
3. another focal-loss term with three components for the classification as µ, e, or jet, each combined with a smoothed step function that is nonzero only for τh candidates that are sufficiently likely to be classified as τh.
The introduction of the two focal-loss terms gives superior performance compared with using cross-entropy only. It is motivated by two considerations. First, analyses with τ h candidates in the final state, like the ones mentioned in the introduction, show the highest sensitivity for efficiencies in the range 50-80%, whereas performance for the highest efficiencies and purities is less important. Second, the classification performance as τ h is more important than the classification as e, µ, or jet; particularly in the high-purity regime where it is less important to distinguish between the three other classes. The full definition of the loss function is given in appendix A.
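For reference, the focal-loss building block [67] has the form FL(p) = −(1 − p)^γ log p, which down-weights already well-classified examples relative to plain cross-entropy. A minimal sketch follows; γ = 2 is the default from the original focal-loss paper and an assumption here, not necessarily the value tuned for this training:

```python
import math

def focal_loss(p_true, gamma=2.0):
    """Focal loss -(1 - p)^gamma * log(p) for the probability p_true that the
    model assigns to the true class. gamma = 0 recovers the cross entropy."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)
```

The (1 − p)^γ factor suppresses the loss for confident, correct predictions, which matches the stated motivation: performance in the intermediate-efficiency regime matters more than perfecting the already well-separated candidates.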
For the training, events from the following simulated processes are used: Z+jets (NLO), W+jets, tt, Z′ → ττ, Z′ → ee, and Z′ → µµ (with m(Z′) ranging from 1 to 5 TeV), and QCD multijet production. For testing, additional event samples are used, including H → ττ and Z+jets (LO) event samples, and more event samples corresponding to the processes used for the training. The event samples are simulated and reconstructed according to the 2017 data-taking conditions. The prospective τh candidates need to fulfil pT > 20 GeV and |η| < 2.3. The τh candidates are sampled from the various event samples such that the contributions of the different classes (jet, µ, e, τh) in different (pT, η) bins are the same. Additional weights are added to make the distributions from the different classes uniform within each (pT, η) bin. In total, around 140 million τh candidates are used for the training, whereas 10 million are assigned as the validation samples.
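The per-bin class balancing can be sketched as a simple reweighting; this is a stand-in for the sampling and weighting actually used, and the tuple-based candidate format is ours:

```python
from collections import Counter

def balance_weights(samples):
    """Per-candidate weights that give every class an equal share of the total
    weight within each (pT, eta) bin. `samples` is a list of
    (class_label, pt_bin, eta_bin) tuples."""
    per_class_bin = Counter((c, pt, eta) for c, pt, eta in samples)
    per_bin = Counter((pt, eta) for _, pt, eta in samples)
    n_classes = len({c for c, _, _ in samples})
    return [
        per_bin[(pt, eta)] / (n_classes * per_class_bin[(c, pt, eta)])
        for c, pt, eta in samples
    ]
```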
The loss function is minimized with Nesterov-accelerated adaptive momentum estimation [68], which combines the Adam algorithm [69] with Nesterov momentum. The training setup uses the Keras library [70] with TensorFlow [71] as backend. The training was carried out with a single

Nvidia GeForce RTX 2080 graphics processing unit. The training of the final classifier was run for 10 epochs, with the training for an epoch lasting 69 hours on average. The best performance on the validation samples was obtained after 7 epochs. The final discriminators against jets, muons, and electrons are given by

D_α = y_τh / (y_τh + y_α),

with y_α taken from eq. (4.1) and α ∈ {jet, µ, e}. In the following, these discriminators are labelled as DeepTau discriminators against jets (Djet), muons (Dµ), and electrons (De).
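The discriminators follow directly from the softmax outputs of the network; a minimal sketch, with class labels and function names chosen by us:

```python
import math

def softmax(scores):
    """Convert raw network outputs into class probabilities."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def deeptau_discriminators(y):
    """D_alpha = y_tauh / (y_tauh + y_alpha) for alpha in {jet, mu, e};
    `y` maps class names to the softmax probabilities."""
    return {a: y["tau"] / (y["tau"] + y[a]) for a in ("jet", "mu", "e")}
```

Each discriminator is a two-class renormalization of the softmax output: it compares the τh hypothesis only against one background class, so a candidate can simultaneously score high against jets and low against muons.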

Expected performance
The performance of the DeepTau discriminators is evaluated with the validation samples. Working points are defined to guide usage in physics analyses and to derive suitable data-to-simulation corrections. The target τh identification efficiencies for the various working points used in the wide range of physics analyses with tau leptons are presented in table 3. The target τh identification efficiencies are defined as the efficiency for genuine τh in the H → ττ event sample that are reconstructed as τh candidates with 30 < pT < 70 GeV to pass the given discriminator. The target τh identification efficiencies range from 40 to 98% for Djet, from 99.5 to 99.95% for Dµ, and from 60 to 99.5% for De. In figure 4, the performance of the Djet discriminator is evaluated by studying the efficiencies for quark and gluon jets and genuine τh to pass the discriminator and comparing it with the previously used MVA classifier. All generated and reconstructed τh candidates in this figure and those that follow are required to have pT > 20 GeV and |η| < 2.3 and to pass the VVVLoose De and VLoose Dµ working points. For the previously used MVA classifier, two curves are shown: the first curve corresponds to the τh decay mode reconstruction as used in the previous publication and in previous physics analyses, whereas the second curve includes the additional τh decay modes introduced in section 3, leading to an overall increase of the MVA classifier performance, in particular for higher efficiency values. The efficiency is evaluated separately for the W+jets and tt event samples. The W+jets sample is enriched in quark jets, whereas the tt event sample has a larger fraction of b quark and gluon jets and a busier event topology. The efficiencies are also evaluated separately for τh candidates with pT < 100 GeV and pT > 100 GeV to test the performance in different pT regimes.
For all considered scenarios, the D_jet discriminator shows a large improvement of the signal efficiency at a given background efficiency compared with the previously used MVA classifier. At a given efficiency for jets to pass the discriminator, there is a relative increase in signal efficiency of more than 30% for the tightest working points and more than 10% for the loosest working points. The larger gain at high p_T is a consequence of the inclusion of additional decay modes in the τ_h reconstruction.

Figure 4. Efficiency for quark and gluon jets to pass different tau identification discriminators versus the efficiency for genuine τ_h. The upper two plots are obtained with jets from the W+jets simulated sample and the lower two plots with jets from the tt̄ sample. The left two plots include jets and genuine τ_h with p_T < 100 GeV and the right two plots those with p_T > 100 GeV.

Figure 5. Efficiencies for simulated τ_h decays with |η| < 2.3 to pass the following reconstruction and identification requirements: to be reconstructed in any decay mode with p_T > 20 GeV and |η| < 2.3 (black dashed line), to be reconstructed in a decay mode except for those with missing charged hadrons (labelled "2-prong" and shown as a full black line), and to be reconstructed in a decay mode except the 2-prong ones and to pass the Loose, Medium, or Tight working point of the D_jet discriminator (blue lines), obtained with a Z → ττ event sample. The efficiencies are shown as a function of the visible genuine τ_h p_T obtained from simulated decay products.
The p_T dependence of the efficiency for genuine generated τ_h with p_T > 20 GeV to be reconstructed with p_T > 20 GeV and to pass the D_jet discriminator is shown in figure 5. The reconstruction efficiency exceeds 80% for p_T > 30 GeV, is close to 90% for p_T > 100 GeV, and is mostly limited by the charged-hadron reconstruction efficiency. If decay modes with missing charged hadrons (the so-called two-prong decay modes) are excluded, the efficiency is reduced by around 10%. The efficiency to additionally pass either of the shown working points of the D_jet discriminator (for reconstructed τ_h candidates without missing charged hadrons) is in line with the target τ_h identification efficiencies listed in table 3.
The D_e discriminator leads to a significantly improved rejection of electrons compared with the MVA discriminator, as can be inferred from figure 6. The efficiencies for τ_h candidates (with p_T > 20 GeV, |η| < 2.3, and passing the VVVLoose D_jet and VLoose D_µ working points) and electrons to pass the two discriminators are shown separately for τ_h and electrons with 20 < p_T < 100 GeV and p_T > 100 GeV. Within the range of applicability of the previously used MVA discriminator, the D_e discriminator increases the τ_h efficiency by consistently more than 10% for a constant misidentification probability. For example, the τ_h efficiency increases from 87 to 99% for the loosest working point of the two discriminators with an electron efficiency of around 7% for p_T < 100 GeV. Besides increasing the reach in τ_h identification efficiency, the new discriminator also provides the possibility to reject electrons with an efficiency of 10^−4 for a τ_h efficiency of 55%. Similar, albeit slightly smaller, gains are observed for p_T > 100 GeV.

Figure 6. Efficiency for electrons versus efficiency for genuine τ_h to pass the MVA and D_e discriminators, separately for electrons and τ_h with 20 < p_T < 100 GeV (left) and p_T > 100 GeV (right). Vertical bars correspond to the statistical uncertainties. The τ_h candidates are reconstructed in one of the τ_h decay modes without missing charged hadrons. Compared with the MVA discriminator, the D_e discriminator reduces the electron efficiency by more than a factor of two for a τ_h efficiency of 70% and by more than a factor of 10 for τ_h efficiencies larger than 88%. Furthermore, working points (indicated as full circles) are now provided for previously inaccessible τ_h efficiencies larger than 90%, for a misidentification efficiency between 0.3 and 8%.
Gains are also observed in terms of the discrimination of τ_h candidates (with p_T > 20 GeV, |η| < 2.3, and passing the VVVLoose D_jet and VVVLoose D_e working points) against muons, as shown in figure 7. Compared with the cutoff-based discriminator, the D_µ discriminator leads to an increase of the τ_h efficiency of around 0.5% for a given prompt muon efficiency. For the τ_h efficiency range of 99.1-99.4% addressed by the cutoff-based discriminator, the D_µ discriminator provides a factor of 3-10 larger prompt muon rejection, and thereby a significantly improved rejection of prompt muons for analyses with hadronically decaying tau leptons, while incurring only a percent-level loss of τ_h identification efficiency.
The performance results discussed so far have been obtained with simulated samples that correspond to the 2017 data-taking conditions. Applying the discriminators to simulated samples that correspond to the 2016 and 2018 data-taking conditions yields consistent performance, both in absolute terms and relative to other discriminators. Based on these findings, the discriminators D_e, D_µ, and D_jet have been recommended for all CMS analyses of LHC data at √s = 13 TeV. To summarize, the DeepTau discriminators against jets, electrons, and muons lead to large gains in discrimination performance against various backgrounds. The gains in τ_h identification efficiency at a typical jet or electron efficiency used in physics analyses amount to more than 30% and 10%, respectively. Therefore, a large gain in reach is expected in physics analyses with tau leptons, in particular for final states with two τ_h candidates. These improvements are confirmed by physics analyses that have already used the DeepTau discriminators [72-74].

Performance with √s = 13 TeV data
The performance of the τ_h reconstruction and of the DeepTau discriminators is validated with collision data at √s = 13 TeV. The total integrated luminosity for the 2016-2018 data collected with the CMS detector and validated for use in analysis amounts to 138 fb^−1, of which 36 fb^−1 were recorded in 2016, 42 fb^−1 in 2017, and 60 fb^−1 in 2018 [75-77]. The reconstruction and identification efficiencies for τ_h are measured separately for the three data-taking years using a Z → ττ event sample in the µτ_h final state, following methods similar to those established previously [19]. In addition, measurements are made of the misidentification efficiency with which jets, electrons, and muons are reconstructed and identified as τ_h candidates. All of these measurements are essential ingredients for physics analyses that use τ_h candidates.
In this paper, we present a subset of the measurements that were carried out, mostly based on the largest data set, recorded in 2018. The identification efficiencies are generally measured for all working points introduced above; we discuss here only measurements for representative working points. Consistent results have been obtained for the other working points, and for the 2016 and 2017 data sets. We have not performed such efficiency measurements for the decay modes with missing charged hadrons; future analyses including these decay modes will need to derive appropriate corrections. The full set of efficiency and energy scale corrections together with their corresponding uncertainties is available for data analyses in the CMS collaboration.

JINST 17 P07023

Reconstruction and identification efficiency
The measurement of the τ_h reconstruction and identification efficiencies uses a µτ_h event sample that is selected in the same way as in the previous measurement [19], with the following updates. Events are recorded with a single-muon trigger with a nominal p_T threshold of 24 GeV and are required to have at least one reconstructed and isolated muon with p_T > 25 GeV. The τ_h candidate is required to be reconstructed with the decay mode finding algorithm, to have p_T > 20 GeV, and to pass a given threshold on the D_jet discriminator. In addition, the τ_h candidate needs to pass the VVLoose working point of the D_e discriminator and the Tight working point of the D_µ discriminator, to reject background from muons or electrons misidentified as a τ_h candidate. Additional criteria are applied to increase the purity of events with genuine τ_h candidates, including a requirement on the transverse mass of the muon and the missing transverse momentum vector p⃗_T^miss, m_T = √(2 p_T^µ p_T^miss (1 − cos Δφ)), where Δφ is the angle in the transverse plane between the muon momentum and p⃗_T^miss, and a maximum difference in η between the reconstructed muon and the τ_h candidate (|Δη| < 1.5). The resulting data set is enriched in Z → ττ events, with the residual contamination from other processes amounting to approximately one fifth of the total yield (figure 10). In addition to the µτ_h event sample, a µµ event sample is defined to normalize the Z → ττ event yields. This event sample is subject to the same trigger and muon selection criteria as the µτ_h event sample, such that related uncertainties partially cancel in the normalization scale factor.
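The transverse-mass variable used in this selection follows the standard definition; a short sketch (the cut value applied in the analysis is not reproduced here):

```python
import math

def transverse_mass(pt_mu, pt_miss, dphi):
    """m_T = sqrt(2 * pT(mu) * pT(miss) * (1 - cos(dphi))), with dphi the angle
    in the transverse plane between the muon momentum and the missing-pT vector."""
    return math.sqrt(2.0 * pt_mu * pt_miss * (1.0 - math.cos(dphi)))

# Back-to-back configuration (dphi = pi) gives the maximum, m_T = 2*sqrt(pt_mu*pt_miss)
mt_max = transverse_mass(30.0, 30.0, math.pi)   # 60 GeV
```

A low-m_T requirement suppresses W+jets events, in which the muon and p⃗_T^miss tend to be back-to-back and m_T peaks near the W boson mass.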
The efficiencies are extracted from a maximum likelihood fit to the distribution of the reconstructed visible invariant mass of the µτ_h system, m_vis, and to the expected and observed event yields in the µµ region. In the fit, the ratio of the observed and expected efficiency for a genuine τ_h to pass the selection is incorporated as a free parameter, such that the fit yields a scale factor with respect to the simulated efficiency. All known sources of systematic uncertainty are incorporated into the fit. These include the uncertainty in the integrated luminosity; uncertainties in the muon trigger, identification, and isolation efficiencies; uncertainties due to the limited number of simulated events; and uncertainties in the normalizations of the tt̄, QCD multijet, and Z/γ*+jets backgrounds. Since the scale factors are extracted after the full event selection, they incorporate any differences between data and simulation in the full product of all reconstruction and identification efficiencies. The scale factors are extracted from the maximum likelihood fit together with their corresponding uncertainties using the asymptotic properties of the likelihood function [78].
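The essence of extracting a scale factor as a free parameter of a binned likelihood fit can be sketched with a toy Poisson likelihood scan. All numbers here are invented; the real fit additionally includes systematic uncertainties as nuisance parameters and the µµ normalization region:

```python
import numpy as np

def fit_scale_factor(observed, signal, background):
    """Scan a binned Poisson negative log-likelihood in the scale factor that
    multiplies the expected genuine-tau_h (signal) yield; return the minimum."""
    grid = np.linspace(0.5, 1.5, 2001)
    nll = [float(np.sum(sf * signal + background
                        - observed * np.log(sf * signal + background)))
           for sf in grid]
    return float(grid[int(np.argmin(nll))])

sig = np.array([50.0, 120.0, 80.0, 30.0])   # expected Z->tautau yields per m_vis bin
bkg = np.array([20.0, 25.0, 20.0, 15.0])    # expected background yields
obs = 0.9 * sig + bkg                        # toy "data" with a 10% signal deficit
sf = fit_scale_factor(obs, sig, bkg)         # minimum at sf = 0.9
```

Because the scale factor multiplies the signal after the full selection, it absorbs any data-to-simulation difference in the product of reconstruction and identification efficiencies, exactly as described above.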
The scale factors are extracted in different p_T bins, p_T ∈ {[20, 25], [25, 30], [30, 35], [35, 40], [40, 50], [50, 70], >70} GeV, to account for a possible p_T dependence of the extracted scale factors, which would affect analyses that use different minimum p_T thresholds or that analyze τ_h final states from processes with different p_T distributions. While analyses usually use all reconstructed decay modes except for the modes with missing charged hadrons, so that a parametrization in terms of decay modes is not necessary, analyses that select events using the di-τ_h trigger algorithms observe a different proportion of τ_h candidates reconstructed in the various decay modes. This is a consequence of the dependence of the τ_h trigger efficiency on the decay mode. Therefore, scale factors are also determined separately in four bins of reconstructed decay modes (h±; h±π0 or h±π0π0; h±h∓h±; and h±h∓h±π0). These scale factors are shown in figure 8 for τ_h p_T > 40 GeV, which corresponds to the minimum recommended threshold in analyses that use events recorded with the di-τ_h trigger.
Since the reach in τ_h p_T is limited in the Z → ττ sample, a separate off-shell W boson (W*) event sample is used to measure additional scale factors for τ_h p_T > 100 GeV in a τ_h-plus-p_T^miss final state without an additional hadronic jet, following closely the method discussed in ref. [19]. The requirement of high τ_h p_T and high p_T^miss combined with the hadronic jet veto leads to an event sample dominated by off-shell W* → τν_τ decays. The events are recorded with a trigger that requires a large p_T^miss. The efficiencies are obtained from a maximum likelihood fit to the transverse mass distribution of the τ_h candidate and p_T^miss. To constrain the fiducial W* cross section in the signal region, a W* → µν_µ control region is also included in the fit.

The p_T-dependent uncertainties in the τ_h scale factors range from 2 to 5%. For the τ_h p_T spectrum in the Z → ττ sample, the combined scale factor uncertainty amounts to ≈2%. These uncertainties are considerably smaller than in previous measurements, which reported combined uncertainties of ≈6% [19]. The previous measurements used the "tag-and-probe" technique [79], which only measured the identification efficiency for the discriminator under consideration and added uncertainties in the reconstruction efficiency from independent measurements of single charged-hadron reconstruction efficiencies. The measurement of the product of all reconstruction and identification efficiencies with a maximum likelihood fit hence represents another important improvement for physics analyses with τ_h final states, as it leads to a considerable reduction of the systematic uncertainties related to the τ_h reconstruction and identification.
The data-to-simulation scale factors as a function of the τ_h decay mode are shown in figure 8 (right) for τ_h p_T > 40 GeV for the three data-taking years (2016, 2017, and 2018). The scale factors are generally slightly smaller than one but consistent with unity within 10% (20% for τ_h candidates reconstructed in the h±h∓h±π0 decay mode). The scale factors below one can be traced back to several origins, including the imperfect modelling of hadronization and an imperfect simulation of the detector alignment and of the track hit reconstruction efficiencies.

The τ_h energy scale
The τ_h energy scale is obtained from the same µτ_h event samples as the scale factors discussed in the previous section, following the method described in previous publications [19]. For the main measurement discussed here, τ_h candidates are required to pass the Medium working point of the D_jet discriminator. Consistent results are obtained with cross-check measurements for various working points of the D_jet discriminator. To improve the a priori agreement between data and simulation, the scale factors described in the previous section are applied to simulated samples with τ_h final states beforehand.
Two different observables are used to measure the τ_h energy scale: the µτ_h visible invariant mass (m_vis) and the reconstructed τ_h mass (m(τ_h)). For τ_h candidates reconstructed in the h± decay mode, m(τ_h) is constant, so the measurement is only performed with the m_vis observable.
Simulated templates of the m_vis and m(τ_h) distributions are created for τ_h energy scale shifts between −3% and +3% in steps of 0.2%. For each shift, a maximum likelihood fit is performed, using the combined expectation from Z → ττ events in the µτ_h final state and the background events, together with the set of systematic uncertainties discussed above. The observed value of the τ_h energy scale shift and its corresponding uncertainty are obtained from the distribution of the likelihood ratio.
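The ±3% template scan can be illustrated with a toy peak whose position scales with the shift. Everything below is invented for illustration; the actual fit uses the full m_vis and m(τ_h) templates with systematic uncertainties and a likelihood rather than a chi-square:

```python
import numpy as np

centers = np.linspace(0.05, 2.95, 30)   # toy m(tau_h) bin centers [GeV]

def template(shift):
    """Toy m(tau_h) template whose peak position scales with (1 + shift)."""
    mean = 1.2 * (1.0 + shift)
    return 1e4 * np.exp(-0.5 * ((centers - mean) / 0.3) ** 2)

def best_shift(data_hist, shifts):
    """Pick the energy-scale shift whose template best matches the data."""
    chi2 = [float(np.sum((data_hist - template(s)) ** 2
                         / np.maximum(template(s), 1e-9)))
            for s in shifts]
    return float(shifts[int(np.argmin(chi2))])

shifts = np.round(np.arange(-0.03, 0.0301, 0.002), 3)   # -3% to +3% in 0.2% steps
fitted = best_shift(template(0.01), shifts)              # recovers the +1% shift
```

Scanning a discrete grid of shifted templates and interpolating the likelihood around the minimum also yields the uncertainty, in the spirit of the likelihood-ratio treatment described above.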
The resulting relative differences between the τ_h energy scale in data and simulation are shown in figure 9 separately for the two measurements. The relative difference is given as a correction, in percent, to a unit multiplicative factor applied to the τ_h four-momentum in the simulation. The results obtained with the two methods are consistent with each other, and the τ_h energy scale shifts are consistent with no difference between data and simulation, with the largest deviation amounting to 1.5 standard deviations. The relative uncertainties range from 0.6% for the h±π0 and h±h∓h± decay modes to 0.8% for the h± mode.
To validate the extracted scale factors and to illustrate the distributions used for the extraction of the scale factors and the τ_h energy corrections, distributions of m_vis and m(τ_h) are investigated, requiring the τ_h to pass the Tight working point of the D_jet discriminator (figure 10). These distributions are obtained from a maximum likelihood fit to the µτ_h data, applying the same selection used in the extraction of the scale factors. This fit incorporates the derived scale factors and energy corrections with their uncertainties, together with the full set of systematic uncertainties used in the scale factor extractions. In the fit to data, the Z → ττ normalization is a freely floating parameter. Data and predictions agree within the uncertainties, indicating that the derived scale factors and energy corrections lead to a good and complete description of the data, including the description of the different τ_h decay modes.

Lepton and jet misidentification efficiencies
The efficiencies for electrons, muons, and jets to pass the respective DeepTau discriminators are measured with dedicated event samples, following closely the measurements explained in detail in ref. [19]. The efficiencies for electrons and muons to be reconstructed and misidentified as τ_h candidates are measured with Z → ℓℓ events (ℓ ∈ {e, µ}) in which one ℓ is misreconstructed as a τ_h candidate, so that the event is reconstructed in an ℓτ_h final state. The efficiencies are extracted with the tag-and-probe method, where ℓ candidates that pass a given working point of the D_µ or D_e discriminator enter the pass region and those that do not enter the fail region. A maximum likelihood fit is performed to obtain the efficiencies. The distribution used in the pass region is the invariant ℓτ_h mass (m_vis), whereas the fail region is included as a single bin. Similar to the extraction of the τ_h efficiencies, all relevant uncertainties are included in the maximum likelihood fit. To reject background from jets reconstructed as τ_h candidates, the candidates are also required to pass the Medium working point of the D_jet discriminator. An additional parameter is introduced in the fit that scales the contributions in the pass and fail regions simultaneously and corresponds to the scale factor for the efficiency to pass the D_jet discriminator. This approach is based on the assumption that the probability for electrons or muons to pass the D_jet discriminator is independent of the probability to pass the D_e or D_µ discriminator. Another parameter, introduced for electrons misreconstructed as τ_h candidates, allows for a shifted energy scale. The eτ_h events for the measurement of the scale factors for electrons to pass different working points of the D_e discriminator are recorded with a single-electron trigger with a p_T threshold of 35 GeV.
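The tag-and-probe bookkeeping behind these measurements can be sketched with simple counts. The numbers are illustrative; the actual extraction is a simultaneous fit of the m_vis distribution in the pass region and the yield in the fail region, with background subtraction:

```python
def efficiency_and_scale_factor(n_pass_data, n_fail_data, n_pass_sim, n_fail_sim):
    """Misidentification efficiency in data and simulation from pass/fail
    yields, and their ratio (the data-to-simulation scale factor)."""
    eff_data = n_pass_data / (n_pass_data + n_fail_data)
    eff_sim = n_pass_sim / (n_pass_sim + n_fail_sim)
    return eff_data, eff_sim, eff_data / eff_sim

# Toy yields: 120 of 100000 probes pass in data, 100 of 100000 in simulation
eff_d, eff_s, sf = efficiency_and_scale_factor(120, 99_880, 100, 99_900)
```

Because the pass and fail regions partition the probe sample, the efficiency and the total normalization can be determined simultaneously, which is why the fail region enters the fit even as a single bin.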
The efficiencies for electrons to pass the different working points in data and simulated events, as well as the ratios of the efficiencies, are shown in figure 11. The efficiencies are derived separately in the ECAL barrel region (|η| < 1.46) and the ECAL endcap region (|η| > 1.56). The observed and expected efficiencies agree within a few percent for the Loose and VLoose working points in the barrel and within 11% in the endcap region. For tighter working points, scale factors are obtained that differ significantly from one; they are generally greater than one in the barrel, with a maximum scale factor of 1.61 ± 0.28 for the VTight working point, and smaller than one in the endcap.
The µτ_h events for the measurement of the scale factors for muons to be misreconstructed and misidentified as τ_h candidates are recorded with a single-muon trigger with a threshold of 27 GeV. The leading muon is required to have p_T > 28 GeV, and the transverse mass of the muon and p⃗_T^miss is required to be smaller than 30 GeV. The resulting data-to-simulation scale factors are extracted from a fit to the m_vis distribution and are shown in figure 12. The scale factors are shown separately for the Loose and Tight working points, and for five different bins in |η|. Although the scale factors are consistent with each other for the different working points and consistent with unity for four of the |η| bins, they differ significantly from one for |η| > 1.7. As a result of the high rejection power of D_µ, muons that are not discarded belong to atypical regions of phase space, far from the bulk of the distributions. For such uncommon muons, the simulation does not model the data well, especially for observables related to the muon track quality at large |η|; this is compatible with being the cause of the sizeable data-to-simulation difference at |η| > 1.7. The large D_µ scale factors at |η| > 1.7 have no notable impact on physics analyses because of the small probability for muons to pass any D_µ working point and the limited extent of the region with large data-to-simulation differences.
For a measurement of the efficiency for jets to pass the D_jet discriminator, events are selected that are consistent with a W(→ µν_µ)+jet topology. Similar to the previously discussed measurement, the events are recorded with a single-muon trigger with a p_T threshold of 27 GeV, and the reconstructed muon is required to have p_T > 28 GeV. Furthermore, there must be exactly one central jet with p_T > 20 GeV and |η| < 2.3, and events with additional reconstructed muons or a reconstructed electron passing loose identification criteria are rejected. To select a pure sample of W+jets events, the transverse mass of the muon and p⃗_T^miss is required to be greater than 60 GeV. The reconstructed jet serves as a probe for the measurement of the jet-to-τ_h misidentification efficiency. The efficiencies for jets to be reconstructed as τ_h candidates and to pass the Loose and Tight D_jet working points are shown in figure 13 as a function of the reconstructed jet p_T. In the measurement of the efficiency, background events with genuine τ_h candidates are subtracted. The observed and expected efficiencies agree within the uncertainties. Besides the statistical uncertainty in the observed events, the error bars in the ratio of data to simulation also include uncertainties from the subtraction of events with genuine τ_h candidates, electrons, or muons. The efficiency first rises with p_T near the 20 GeV threshold because it becomes more likely for a jet to give rise to a reconstructed τ_h candidate that passes this threshold. At higher p_T, the particle multiplicity in a quark or gluon jet increases with p_T; the jets therefore become easier to distinguish from genuine τ_h candidates, and the probability for quark or gluon jets to pass the D_jet discriminator decreases with p_T.
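The genuine-τ_h subtraction entering this misidentification efficiency can be sketched per p_T bin (all counts below are invented; in practice the subtracted contamination is taken from simulation with its own uncertainties):

```python
import numpy as np

def jet_misid_efficiency(n_probe, n_pass, n_probe_true, n_pass_true):
    """Per-pT-bin efficiency for jets to pass a D_jet working point, after
    subtracting events in which the probe is a genuine tau_h."""
    num = np.asarray(n_pass, float) - np.asarray(n_pass_true, float)
    den = np.asarray(n_probe, float) - np.asarray(n_probe_true, float)
    return num / den

# Toy counts in two jet-pT bins: probes, passing probes, and the genuine-tau
# contamination in each category
eff = jet_misid_efficiency([10_000, 8_000], [150, 90], [400, 300], [120, 70])
```

The subtraction matters because the contamination is concentrated in the passing category, so neglecting it would bias the misidentification efficiency upward.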

Summary
A new algorithm has been introduced to discriminate hadronic tau lepton decays (τ_h) against jets, electrons, and muons. The algorithm is based on a deep neural network and combines fully connected and convolutional layers. As input, the algorithm combines information from individual reconstructed particles near the τ_h axis with information about the reconstructed τ_h candidate and other high-level variables. In addition, an improved τ_h reconstruction algorithm is introduced that increases the overall efficiency of the reconstruction by explicitly considering the τ_h decay mode with three charged hadrons and a neutral pion, and by applying looser quality criteria to the charged hadrons in the case of three-prong τ_h decays. The performance of the new τ_h identification and reconstruction algorithms significantly improves over the previously used algorithms, in particular in terms of discrimination against the background from jets and electrons. For a given jet rejection level, the efficiency for genuine τ_h candidates increases by 10-30%. Similarly, the efficiency for genuine τ_h candidates to pass the discriminator against electrons increases by 14% for the loosest working point, which is employed in many analyses. Given its superior performance, the new algorithm will significantly increase the sensitivity of CMS physics analyses with tau leptons. The performance of the algorithm is validated with collision data. The observed efficiencies for genuine τ_h, jets, and electrons to be identified as τ_h typically agree within 10% with the efficiencies expected from simulated events. The agreement is similar to the one observed with previous algorithms and confirms the improvements.

Acknowledgments
We congratulate our colleagues in the CERN accelerator departments for the excellent performance of the LHC and thank the technical and administrative staffs at CERN and at other CMS institutes for their contributions to the success of the CMS effort. In addition, we gratefully acknowledge the computing centers and personnel of the Worldwide LHC Computing Grid and other centers for delivering so effectively the computing infrastructure essential to our analyses. Finally, we acknowledge the enduring support for the construction and operation of the LHC, the CMS detector, and the supporting computing infrastructure provided by the following funding agencies:

A Loss function
The loss function used for the training of the DeepTau algorithm has the form

L(y^true, y; κ, γ, σ) = H(y^true, y; κ)  (a)
    + N F_τ(y^true, y; κ, γ)  (b)
    + ŝ(y_τ; σ) N F_{e,µ,jet}(y^true, y; κ, γ),  (c)  (A.1)

where bold fonts indicate groups of parameters; H(·, ·; ·) corresponds to the categorical cross entropy and F(·, ·; ·) to the focal loss function [67]; N is a factor that normalizes F(·, ·; ·) to unity in the interval 0-1; and ŝ(·) is a smoothened step function that approaches 1 for y_τ > 0.1. The focal loss terms (b) and (c) in eq. (A.1) put more emphasis on the part of parameter space in which it is more difficult to separate τ_h candidates from e, µ, and jets. The step function in term (c) disregards the part of parameter space in which the probability for the τ_h candidate to correspond to a genuine τ_h is low, and in which optimal separation between e, µ, and jets is not of interest. The values and meanings of the parameters κ, γ, and σ are given in table 4. The values of the κ-factors are chosen such that the numerical values of the different components of the loss function are close to each other, while putting more emphasis on the discrimination against jets, since jets are the dominant source of background.
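A minimal NumPy sketch of the two ingredients named above, the focal loss and the smoothened step function, assuming a sigmoid-shaped step (the exact smoothing and parameter values used in the training are those of table 4, not reproduced here):

```python
import numpy as np

def focal_loss(y_true, y_pred, gamma=2.0, eps=1e-7):
    """Categorical focal loss [Lin et al.]: the cross entropy with each term
    down-weighted by (1 - p)^gamma, so well-classified examples contribute less."""
    p = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * (1.0 - p) ** gamma * np.log(p), axis=-1)

def smooth_step(y_tau, threshold=0.1, width=0.02):
    """A sigmoid-based smoothened step that approaches 1 for y_tau > threshold."""
    return 1.0 / (1.0 + np.exp(-(y_tau - threshold) / width))

y_true = np.array([[0.0, 1.0, 0.0, 0.0]])
y_pred = np.array([[0.1, 0.7, 0.1, 0.1]])
ce = focal_loss(y_true, y_pred, gamma=0.0)   # ordinary cross entropy, -log(0.7)
fl = focal_loss(y_true, y_pred, gamma=2.0)   # down-weighted: (1-0.7)^2 * (-log 0.7)
```

Setting gamma = 0 recovers the categorical cross entropy, which makes explicit how the focal terms shift the training emphasis toward candidates that are hard to classify.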