Identification of hadronic tau lepton decays using a deep neural network

A new algorithm is presented to discriminate reconstructed hadronic decays of tau leptons ($\tau_\mathrm{h}$) that originate from genuine tau leptons in the CMS detector against $\tau_\mathrm{h}$ candidates that originate from quark or gluon jets, electrons, or muons. The algorithm inputs information from all reconstructed particles in the vicinity of a $\tau_\mathrm{h}$ candidate and employs a deep neural network with convolutional layers to efficiently process the inputs. This algorithm leads to a significantly improved performance compared with the previously used one. For example, the efficiency for a genuine $\tau_\mathrm{h}$ to pass the discriminator against jets increases by 10-30% for a given efficiency for quark and gluon jets. Furthermore, a more efficient $\tau_\mathrm{h}$ reconstruction is introduced that incorporates additional hadronic decay modes. The superior performance of the new algorithm to discriminate against jets, electrons, and muons and the improved $\tau_\mathrm{h}$ reconstruction method are validated with LHC proton-proton collision data at $\sqrt{s} =$ 13 TeV.


Introduction
Tau leptons play a crucial role in determining the nature of the 125 GeV Higgs boson [1][2][3][4][5], in searching for additional Higgs bosons [6][7][8][9][10][11][12][13], and in searching for other new particles [14][15][16]. Since tau leptons most frequently decay to hadrons and neutrinos, all these analyses require the efficient reconstruction and identification of hadronic tau lepton decays (τ h ). In this paper, we introduce a new algorithm to identify τ h candidates coming from hadronic tau lepton decays in the CMS detector and measure its performance with collision data from the CERN LHC. The new algorithm is based on a deep neural network (DNN) that significantly improves the performance of the τ h identification. Furthermore, we discuss an optimized τ h reconstruction algorithm that includes additional τ h decay modes.
Hadronic tau lepton decays are reconstructed and identified as follows. First, τ h candidates are reconstructed in one of their decay modes with the "hadrons-plus-strips" algorithm [17][18][19], which reconstructs the decay mode and the visible four-momentum of each candidate, i.e. the four-momentum of all decay products except for the neutrinos. Second, the τ h candidates are identified as coming from genuine τ h decays rather than from jets, electrons, or muons. Quark and gluon jets often constitute the most important source of background since they are produced abundantly in proton-proton (pp) collisions at the LHC.
Previously, τ h candidates were reconstructed only in their main decay modes: one-prong decays with and without π 0 s, and three-prong decays without π 0 s. Now, we also consider three-prong τ h decays with π 0 s, which account for 7.4% of all τ h decays. Furthermore, three-prong τ h decays are recovered in which one of the charged hadrons is either not reconstructed as a charged hadron or not reconstructed as a track at all.
The identification of τ h candidates in CMS previously proceeded with separate discriminators against jets, electrons, and muons [18,19]. The discriminators against jets and electrons combined information from high-level input variables (i.e. derived quantities, such as the transverse momentum p T sum of particles near the τ h axis) using a multivariate analysis (MVA) classifier based on an ensemble of boosted decision trees. A similar approach is used for τ h identification by the ATLAS Collaboration [20,21]. Our previous discriminator against muons was based on explicit thresholds on a few discriminating observables.

Here, we introduce a new τ h identification algorithm, DEEPTAU, based on a DNN that simultaneously discriminates against jets, electrons, and muons. The DNN uses a combination of high-level inputs, similar to previous algorithms, and information from all reconstructed particles in the vicinity of the τ h candidate. The information from all reconstructed particles near the τ h axis is processed with convolutional layers [22][23][24] in pseudorapidity-azimuth (η-φ) space. Convolutional layers have been developed in the context of image recognition and are based on the premise that a fragment of the input, e.g. an image, can be processed independently of its position, exploiting the translational invariance of the problem. In a given physics analysis, the superior performance of such an algorithm leads to an increase of the signal efficiency for a given target background rate from jets, electrons, or muons misidentified as τ h , which typically translates directly into superior sensitivity or precision of the analysis.

Similar approaches have been proposed and employed in the context of τ h reconstruction and identification as well as in the classification of jets. Identification algorithms for τ h using shallow neural networks were studied at LEP [25]. The ATLAS Collaboration developed an algorithm to identify visible τ h decay products with a recurrent DNN [26]. The ATLAS and CMS Collaborations also use DNNs to identify jets as coming from decays of b quarks [27][28][29][30]. In particular, the DEEPJET algorithm [30] employs multiclass classification to identify jets as coming from gluons, leptons, b quarks, c quarks, or light quarks. Convolutional layers have been used in algorithms that identify large-radius jets as originating from the decays of high-p T heavy particles, e.g. in the DEEPAK8 algorithm that identifies jets with p T > 200 GeV as coming from decays of top quarks or decays of W, Z, or Higgs bosons [31].

This paper is structured as follows. After an overview of the CMS detector and the event reconstruction in Section 2, the recent advances in the τ h reconstruction are discussed in Section 3. Section 4 describes the new τ h identification algorithm based on neural networks and gives an overview of its performance with simulated events. The performance of the algorithm with LHC collision events is then discussed in Section 5. The paper is summarized in Section 6. Tabulated results are provided in the HEPData record for this analysis [32].

The CMS detector
The central feature of the CMS apparatus is a superconducting solenoid of 6 m internal diameter, providing a magnetic field of 3.8 T. Within the solenoid volume are a silicon pixel and strip tracker, a lead tungstate crystal electromagnetic calorimeter (ECAL), and a brass and scintillator hadron calorimeter (HCAL), each composed of a barrel and two endcap sections. Forward calorimeters extend the η coverage provided by the barrel and endcap detectors. Muons are detected in gas-ionization chambers embedded in the steel flux-return yoke outside the solenoid. A more detailed description of the CMS detector, together with a definition of the coordinate system used and the relevant kinematic variables, is reported in Ref. [33].

Event reconstruction
The particle-flow (PF) algorithm [34] reconstructs and identifies each individual particle in an event, with an optimized combination of information from the various elements of the CMS detector. The energy of photons is obtained from the ECAL measurement. The energy of electrons is determined from a combination of the electron momentum at the primary interaction vertex, as determined by the tracker, the energy of the corresponding ECAL cluster, and the energy sum of all bremsstrahlung photons spatially compatible with originating from the electron track. The energy of muons is obtained from the curvature of the corresponding track. The energy of charged hadrons is determined from a combination of their momentum measured in the tracker and the matching ECAL and HCAL energy deposits, corrected for the response function of the calorimeters to hadronic showers. Finally, the energy of neutral hadrons is obtained from the corresponding corrected ECAL and HCAL energies.
For each event, hadronic jets are clustered from all the PF candidates using the infrared- and collinear-safe anti-k T algorithm [35,36] with a distance parameter of 0.4. Jet momentum is determined as the vectorial sum of all particle momenta in the jet, and is found from simulation to be, on average, within 5 to 10% of the true momentum over the whole p T spectrum and detector acceptance. Additional pp interactions within the same or nearby bunch crossings (pileup) can contribute additional tracks and calorimetric energy depositions, increasing the apparent jet momentum. To mitigate this effect, charged particles identified as originating from pileup vertices are discarded and an offset correction is applied to correct for remaining contributions [37]. Jet energy corrections are derived from simulation to bring the measured response of jets to that of particle-level jets on average. In situ measurements of the momentum balance in dijet, photon+jet, Z+jet, and multijet events are used to correct any residual differences in the jet energy scale between data and simulation [38]. Additional selection criteria are applied to each jet to remove jets potentially dominated by instrumental effects or reconstruction failures [37].
The missing transverse momentum vector p miss T is computed as the negative vector sum of the transverse momenta of all the PF candidates in an event, and its magnitude is denoted as p miss T [39]. The p miss T is modified to include corrections to the energy scale of the reconstructed jets in the event.
The candidate vertex with the largest value of summed physics-object p 2 T is the primary pp interaction vertex. The physics objects specifically used for this purpose are jets, clustered using the jet finding algorithm [35,36] with the tracks assigned to candidate vertices as inputs, and the associated missing transverse momentum, taken as the negative vector sum of the p T of those jets, which include tracks from leptons.
Muons are measured in the range |η| < 2.4, with detection planes made using three technologies: drift tubes, cathode strip chambers, and resistive plate chambers. The single muon trigger efficiency exceeds 90% over the full η range, and the efficiency to reconstruct and identify muons is greater than 96%. Matching muons to tracks measured in the silicon tracker results in a relative p T resolution for muons with p T < 100 GeV of 1% in the barrel and 3% in the endcaps. The p T resolution in the barrel is better than 7% for muons with p T < 1 TeV [40].
The electron momentum is estimated by combining the energy measurement in the ECAL with the momentum measurement in the tracker. The momentum resolution for electrons with p T ≈ 45 GeV from Z → ee decays ranges from 1.7 to 4.5%. It is generally better in the barrel region than in the endcaps, and also depends on the bremsstrahlung energy emitted by the electron as it traverses the material in front of the ECAL [41].
Events of interest are selected using a two-tiered trigger system. The first level, composed of custom hardware processors, uses information from the calorimeters and muon detectors to select events at a rate of around 100 kHz within a fixed latency of about 4 µs [42]. The second level, known as the high-level trigger, consists of a farm of processors running a version of the full event reconstruction software optimized for fast processing, and reduces the event rate to around 1 kHz before data storage [43].

Simulated event samples
The PYTHIA program is used to simulate tau lepton decays, parton showering, hadronization, and multiparton interactions, with version 8.223, 8.226, or 8.230 and tune CUETP8M1 or CP5 [57]. For the Z → samples, TAUOLA is used instead to simulate tau lepton decays [58]. Pileup interactions are generated with PYTHIA and are overlaid on all simulated events according to the luminosity profile of the considered data. All generated events are passed through a detailed simulation of the CMS detector with GEANT4 [59]. The simulated and recorded events are reconstructed with the same CMS reconstruction software.

The τ h reconstruction
The reconstruction of τ h candidates is carried out with the hadrons-plus-strips algorithm. The reconstruction proceeds in four steps, as summarized in the following and described in detail in Refs. [17][18][19].
First, seed regions are defined, with the goal of reconstructing one τ h candidate per seed region. Each seed region is defined by a reconstructed hadronic jet. The jets for the seeding are clustered from all particles reconstructed by the PF algorithm using the anti-k T algorithm with a distance parameter of R = 0.4. All particles in an η-φ cone of radius ∆R ≡ √(∆η² + ∆φ²) = 0.5 around the jet axis are considered for the next steps of the τ h reconstruction.
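As an illustration of this selection, the angular distance ∆R can be computed as in the following minimal Python sketch; the function and variable names are ours and not part of the CMS software:

```python
import math

def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance DeltaR = sqrt(Deta^2 + Dphi^2), with Dphi wrapped to [-pi, pi]."""
    deta = eta1 - eta2
    dphi = math.remainder(phi1 - phi2, 2.0 * math.pi)  # wrap the phi difference
    return math.hypot(deta, dphi)

# Hypothetical example: keep particles within DeltaR < 0.5 of the jet axis
jet_eta, jet_phi = 0.3, 1.2
particles = [(0.35, 1.25), (1.2, -2.0)]  # (eta, phi) pairs
selected = [p for p in particles if delta_r(jet_eta, jet_phi, *p) < 0.5]
```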
Second, π 0 candidates are reconstructed using "strips" in η-φ space in which the four-momenta of electrons and photons are added, and charged-hadron (h ± ) candidates are selected using charged particles from the PF algorithm as input.
Third, all possible τ h candidates are reconstructed in a number of decay modes from the reconstructed charged hadrons and strips. The τ h four-momentum is obtained from summing the four-momenta of the charged hadrons and strips used to reconstruct the τ h candidate in a given decay mode. An overview of the τ decay modes and their branching fractions is given in Table 1. There are seven different reconstructed τ h decay modes, including three new ones (with respect to the algorithm documented in Ref. [19]) that target τ − → h − h + h − π 0 ν τ decays and τ h decays with three charged hadrons of which one is not reconstructed as a charged particle.

Fourth, a single τ h candidate is chosen among all possible reconstructed τ h candidates within a seed region. Reconstructed τ h candidates are subject to the following constraints:
• The mass of the reconstructed τ h candidate is required to be loosely compatible with: (i) the ρ(770) resonance if reconstructed in the h ± π 0 mode; (ii) the a 1 (1260) resonance if reconstructed in either of the h ± π 0 π 0 and h ± h ∓ h ± modes [19]; or (iii) the expected mass spectrum including the ρ(1450) resonance if reconstructed in the h ± h ∓ h ± π 0 modes. The mass of all h ± candidates is assumed to be the charged-pion mass, in line with the output from the PF algorithm.
• The τ h charge needs to be ±1, unless the τ h candidate is reconstructed in a mode with a missed charged hadron, in which case the charge is set to the charge of the charged hadron with higher p T .
• All reconstructed h ± and π 0 need to be in the signal cone, defined with radius ∆R = 3.0/p T (GeV) (with ∆R limited to the range 0.05-0.1) with respect to the τ h momentum.
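A minimal sketch of the p T -dependent cone-size clamp described above (the function name is illustrative):

```python
def signal_cone_radius(pt_gev):
    """Signal cone radius DeltaR = 3.0 / pT (GeV), clamped to the range [0.05, 0.1]."""
    return min(max(3.0 / pt_gev, 0.05), 0.1)

# A 30 GeV candidate gets the maximum radius: 3.0/30 = 0.1.
# A 100 GeV candidate would give 3.0/100 = 0.03, clamped up to 0.05.
assert signal_cone_radius(30.0) == 0.1
assert signal_cone_radius(100.0) == 0.05
```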
Among the selected τ h candidates, the one with the highest p T is chosen. This τ h candidate can subsequently be discriminated against quark or gluon jets, electrons, and muons.
A large fraction of τ h decays are reconstructed in their targeted decay modes, as inferred from Fig. 1, which shows the fractions of reconstructed decay modes for generated τ h with a given decay mode that fulfil p T > 20 GeV and |η| < 2.3. The fraction of generated τ h not reconstructed in any of the modes ranges from 11% for h ± to 25% for h ± π 0 . The overall reconstruction efficiency is mostly limited by the efficiency to reconstruct tracks from charged hadrons, which is around 90% [34]. For the decay modes without missing charged hadrons, the charge assignment is 99% correct for an average Z → ττ event sample, 98% for τ h with p T ≈ 200 GeV, and 92% for τ h with p T ≈ 1 TeV.
Figure 1: Decay mode confusion matrix. For a given generated decay mode, the fractions of reconstructed τ h in different decay modes are given, as well as the fraction of generated τ h that are not reconstructed. Both the generated and reconstructed τ h need to fulfil p T > 20 GeV and |η| < 2.3. The τ h candidates come from a Z → ττ event sample with m ττ > 50 GeV. Decay modes with the same numbers of charged hadrons and one or two π 0 s are combined and labelled as "π 0 s".
Unless explicitly stated, results in the following are presented for reconstructed τ h candidates in one of the decay modes without missing charged hadrons.

The τ h identification with a deep neural network
The newly developed τ h identification algorithm uses a deep neural network structure. Its architecture is based on the following three premises:
• Multiclass: Previously, separate dedicated algorithms were used to reject electrons, muons, or quark and gluon jets reconstructed as a τ h candidate, either based on tree ensembles (jets and electrons) or on a number of selection criteria (muons) [19]. Including electrons, muons, and jets in the same algorithm is expected both to improve identification performance and to reduce maintenance efforts.
• Usage of lower-level information: The MVA discriminators used previously were built on higher-level input variables and showed improved performance with respect to cutoff-based criteria. However, jets hadronize and fragment in complex patterns, and particles from concurrent interactions lead to similarly complex detector patterns. We expect that a machine-learned algorithm, which uses a sufficiently large data set for the training and is able to exploit lower-level information, can lead to improved performance. Therefore, information about all reconstructed particles near the τ h candidate is directly used as input to the algorithm.
• Domain knowledge: Sets of handcrafted higher-level input variables used previously are utilized as additional inputs to the information about the single reconstructed particles. Although it should, in principle, be possible to achieve the same performance with and without using these variables given a sufficiently large set of events for the training and a suitable network architecture, the usage of the higher-level inputs may reduce the number of training events needed and improve the convergence of the training, as seen in other studies [30].
Various network architectures were considered under the constraints that they satisfy these premises with manageable complexity. In particular, the algorithm must be trainable on the available hardware, and both its memory footprint and evaluation time are subject to constraints when used in the CMS reconstruction. To process the information on all reconstructed particles near the τ h axis, convolutional layers are employed. Compared with other approaches like fully connected layers or graph-based structures, convolutional layers have two main advantages: they have a smaller computational complexity and they are implemented efficiently in current deep learning software. On the other hand, they require a partitioning of the inputs, in the case at hand a two-dimensional one in η-φ space. As explained in the following, this partitioning leads both to many empty input cells and to cells in which information is lost because multiple particles are found in the same cell. The details of the architecture and the inputs are given below.

Particle-level inputs
Two grids are defined in η-φ space, centred around the τ h axis, as displayed in Fig. 2: an inner grid with 11×11 cells and a cell size of 0.02×0.02, and an outer grid with 21×21 cells and a cell size of 0.05×0.05. These two grids overlap, i.e. the information in the inner grid will also be part of the information that is processed by the outer grid. For each grid cell, properties of contained reconstructed particles of seven different types are used as inputs. If there is more than one particle of a given type, only the highest p T particle is used as input. Five of the types are the kinds of particles reconstructed by the PF algorithm in the central detector, i.e. muons, electrons, photons, charged hadrons, and neutral hadrons. The remaining two types are muons and electrons reconstructed by dedicated standalone reconstruction algorithms, which provide more specific information to distinguish them from the other particles. For each particle, various types of information are taken into account, as listed in Table 2. The major tradeoffs defining the cell size are the computational cost and the loss of information when more than one particle of a specific type is present in a grid cell. On average, 1.7% of the inner and 7.1% of the outer grid cells are occupied. Up to approximately 10% of the occupied cells contain more than one particle of a specific type.
The finer binning near the τ h axis is motivated as follows: First, the τ h decay products occur predominantly in an η-φ cone with radius 0.1 around the reconstructed τ h axis. Second, high-p T jets are more collimated, so a finer grid is needed to capture all particles within them.
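The cell assignment can be illustrated with the following Python sketch, which keeps only the highest-p T particle of each type per cell; the data layout (dictionaries with pt, eta, phi, and ptype keys) is a hypothetical stand-in for the actual CMS data format:

```python
import math

# (number of cells per side, cell size in eta and phi) from the text
GRIDS = {"inner": (11, 0.02), "outer": (21, 0.05)}

def fill_grid(particles, tau_eta, tau_phi, grid="inner"):
    """Assign particles to cells relative to the tau_h axis, keeping the
    highest-pT particle per (cell, particle type)."""
    n_cells, cell_size = GRIDS[grid]
    half = n_cells // 2
    cells = {}
    for p in particles:
        deta = p["eta"] - tau_eta
        dphi = math.remainder(p["phi"] - tau_phi, 2.0 * math.pi)
        ieta, iphi = round(deta / cell_size), round(dphi / cell_size)
        if abs(ieta) > half or abs(iphi) > half:
            continue  # particle falls outside this grid
        key = (ieta, iphi, p["ptype"])
        if key not in cells or p["pt"] > cells[key]["pt"]:
            cells[key] = p
    return cells
```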

High-level inputs
The high-level input variables correspond primarily to those proven to be useful in the previous MVA classifier [19]. These variables include: (i) the τ h four-momentum and charge; (ii) the number of charged and neutral particles used to reconstruct the τ h candidate; (iii) isolation variables; (iv) the compatibility of the leading τ h track with coming from the PV; (v) the properties of a secondary vertex in case of a multiprong τ h ; (vi) observables related to the η and φ distributions of energy reconstructed in the strips; and (vii) observables related to the compatibility of the τ h candidate with an electron hypothesis.

Figure 2: Layout of the grids in η-φ space around the reconstructed τ h axis used to process the particle-level inputs for the convolutional layers of the DNN. The inner grid comprises 11×11 cells with a cell size of 0.02×0.02 and contains the signal cone with a radius of 0.05-0.1, which is defined in the τ h reconstruction (the charged hadrons and π 0 candidates used to reconstruct the τ h candidate need to be within the signal cone). For high-p T quark and gluon jets, the finer grid is also able to resolve the dense core of the jet. The outer grid comprises 21×21 cells with a cell size of 0.05×0.05 and contains the isolation cone with a radius of 0.5 that is used to define higher-level observables that correlate with quark or gluon jet activity.

Table 2: Per-particle input variables. Input variables used for the various kinds of particles that are contained in a given cell. For each type of particle, basic kinematic quantities (p T , η, φ) are included but not listed below. Similarly, the reconstructed charge is included for all charged particles. An estimated per-particle probability for the particle to come from a pileup interaction using the pileup identification (PUPPI) algorithm [61] is labelled as PUPPI. A number of input variables that give the compatibility of the track with the primary interaction vertex (PV) or a possible secondary vertex (SV) from the τ h reconstruction are denoted as "Track PV" and "Track SV".

Some input variables are scaled linearly to lie in the range [0, 1]. Most of the other input variables are subject to a standardization procedure, x → (x − µ)/σ, with µ denoting the mean of the distribution of x and σ its standard deviation. The standardized inputs are truncated to [−5, 5] to protect against outliers.
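The preprocessing described above can be sketched as follows (a minimal illustration; µ, σ, and the scaling bounds would be taken from the training sample):

```python
import numpy as np

def scale_unit(x, lo, hi):
    """Linear scaling of an input variable to the range [0, 1]."""
    return (np.asarray(x, dtype=float) - lo) / (hi - lo)

def standardize(x, mu, sigma, clip=5.0):
    """Standardization x -> (x - mu) / sigma, truncated to [-5, 5]."""
    z = (np.asarray(x, dtype=float) - mu) / sigma
    return np.clip(z, -clip, clip)
```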

Architecture
The overall DNN architecture is visualized in Fig. 3. The goal of the DNN is to use and process the inputs to optimally classify the τ h candidate as belonging to a target class, which corresponds to determining whether the reconstructed τ h originates from a genuine τ h , a muon, an electron, or a quark or gluon jet. Three different subnetworks are created that independently process the inputs from the high-level variables, the outer cells, and the inner cells. The outputs from these three subnetworks are then concatenated and passed through four fully connected layers with 200 nodes each and then to a final layer with four nodes that yield outputs x α , with α ∈ {τ h , jet, µ, e}. A softmax activation function [62] is then applied to yield estimates y α of the probabilities for the τ h candidate to come from each of the four target classes (τ h , jet, µ, e).
The two subnetworks that process the inputs from the inner and outer cells have similar structures. As explained above, the main idea is to process the grids with convolutional layers [22][23][24]. In our case, the large number of input parameters for each cell makes it computationally too expensive to directly process the grid cells. To reduce the dimensionality, the inputs from each cell of the inner and outer grid are first sent through a number of fully connected layers. Since these layers act on single cells of the grids, with the full grid later on processed with convolutional layers, these fully connected layers can also be considered as one-dimensional convolutional layers, and the nodes can be considered as filters. For each cell, inputs from the electron, muon, and hadron blocks are first processed separately in three fully connected layers with a decreasing number of nodes. The outputs from the three separate processing blocks are then concatenated and passed through four additional fully connected layers with a decreasing number of nodes. The 188 input parameters are reduced to 64 output parameters (filters) for each cell that are input to the convolutional layers.
The convolutional layers each make use of 64 filters with size 3×3. These filters reduce the size of the grid with each step by two in each dimension since no padding is applied (i.e. the grid is not artificially extended with additional cells beyond the dimension of the grid). For both the inner cells with a dimension of 11×11 and the outer cells with a dimension of 21×21, the convolutional layers are applied until the grid is reduced to a single cell. As a consequence, there are 5 convolutional layers for the inner cells and 10 for the outer cells. For both the inner and the outer cells, the final single cell implies that there are 64 outputs.
The 47 high-level input variables are processed by three fully connected layers with 113, 80, and 57 nodes, yielding 57 outputs. Taken together, the three subnetworks yield 185 output parameters that are processed through a number of fully connected layers as explained above.
After each of the fully connected and convolutional layers, batch normalization [63] and dropout regularization [64,65] with a dropout rate of 0.2 are applied. Furthermore, nonlinearity is introduced by applying the parametric rectified linear unit activation function, which is given by αx for x < 0 and by x for x ≥ 0, with α being a trainable parameter [66].
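The structure described in this and the previous subsections can be condensed into the following Keras sketch. It is a simplified illustration, not the exact CMS configuration: the per-cell reduction is collapsed into a single chain of 1×1 convolutions (the text describes separate electron/photon, muon, and hadron blocks), and the node counts in that chain are placeholders.

```python
import tensorflow as tf
from tensorflow.keras import layers

def block(x, units):
    # fully connected layer, followed by batch normalization,
    # a PReLU activation, and dropout with rate 0.2
    x = layers.Dense(units)(x)
    x = layers.BatchNormalization()(x)
    x = layers.PReLU()(x)
    return layers.Dropout(0.2)(x)

def grid_subnetwork(n_cells, n_features):
    # per-cell dimensionality reduction: 1x1 convolutions act like fully
    # connected layers shared across cells, reducing each cell to 64 filters
    inp = tf.keras.Input(shape=(n_cells, n_cells, n_features))
    x = inp
    for units in (104, 88, 64):  # placeholder node counts
        x = layers.Conv2D(units, 1)(x)
        x = layers.BatchNormalization()(x)
        x = layers.PReLU(shared_axes=[1, 2])(x)
        x = layers.Dropout(0.2)(x)
    # 3x3 convolutions without padding shrink the grid by two per layer:
    # 5 layers for the 11x11 grid, 10 layers for the 21x21 grid
    while x.shape[1] > 1:
        x = layers.Conv2D(64, 3, padding="valid")(x)
        x = layers.BatchNormalization()(x)
        x = layers.PReLU(shared_axes=[1, 2])(x)
        x = layers.Dropout(0.2)(x)
    return inp, layers.Flatten()(x)  # 64 outputs per grid

inner_in, inner_out = grid_subnetwork(11, 188)
outer_in, outer_out = grid_subnetwork(21, 188)

hl_in = tf.keras.Input(shape=(47,))  # high-level inputs: 47 -> 113 -> 80 -> 57
x = hl_in
for units in (113, 80, 57):
    x = block(x, units)

merged = layers.Concatenate()([inner_out, outer_out, x])  # 64 + 64 + 57 = 185
for _ in range(4):  # four fully connected layers with 200 nodes each
    merged = block(merged, 200)
outputs = layers.Dense(4, activation="softmax")(merged)  # (tau_h, e, mu, jet)
model = tf.keras.Model([inner_in, outer_in, hl_in], outputs)
```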
A few alternative approaches were tried. First, networks using only high-level inputs and an overall lower number of trainable parameters were tested. These showed significantly inferior performance in terms of τ h identification efficiency versus the misidentification probabilities for jets, electrons, and muons, as clarified in Figs. 4, 6, and 7. Second, alternative approaches were tried that were based on the idea of having two-dimensional grids for the particle-level inputs, but with different input cell sizes and with differently organized convolutional layers, e.g. convolutional layers with padding and pooling (combining the information from a number of nearby cells). The chosen architecture yielded the best performance among the ones tested within the predefined computing constraints.

Figure 3: The DNN architecture. The three sets of input variables (inner cells, outer cells, and high-level features) are first processed separately through different subnetworks, whose outputs are then concatenated and processed through five fully connected layers before the output is calculated that gives the probabilities for a candidate to be either a τ h , an electron, a muon, or a quark or gluon jet. The subnetwork for the high-level inputs consists of three fully connected layers with decreasing numbers of nodes, taking 47 inputs and yielding 57 outputs. The features of both the inner and outer cells are input to complex subnetworks, applied similarly for the inner and outer cells. In the first part, the observables in each grid cell are processed through a set of fully connected layers, first separately for electrons/photons (containing both the features for PF electrons and electrons from the standalone reconstruction), muons (similarly containing both features from PF and standalone muons), and charged/neutral hadrons, passing through three fully connected layers each. The outputs are concatenated and passed through four additional fully connected layers, yielding 64 outputs for each cell. The grids are then processed with convolutional layers, which successively reduce the size of the grid. For the inner cells, there are hence 5 convolutional layers that reduce the grid from 11×11 to a single cell; for the outer cells, there are 10 convolutional layers that reduce the grid from 21×21 to a single cell. The numbers of trainable parameters (TP) for the different subnetworks are also given.

Loss function and classifier training
The loss function is a sum of three different terms, given the four target classes (jet, µ, e, τ h ):
1. a regular cross-entropy term [62] for the τ h target class,
2. a term implementing focal loss [67] for the binary classification of τ h against all backgrounds combined, and
3. another focal-loss term with three components for the classification as µ, e, or jet, each combined with a smoothened step function that is nonzero only for τ h candidates that are sufficiently likely to be classified as τ h .
The introduction of the two focal-loss terms gives superior performance compared with using cross-entropy only. It is motivated by two considerations. First, analyses with τ h candidates in the final state, like the ones mentioned in the introduction, show the highest sensitivity for efficiencies in the range 50-80%, whereas performance at the highest efficiencies and purities is less important. Second, the classification performance as τ h is more important than the classification as e, µ, or jet, particularly in the high-purity regime, where it is less important to distinguish between the three background classes. The full definition of the loss function is given in Appendix A.
For the training, events from the following simulated processes are used: Z+jets (NLO), W+jets, tt, Z → ττ, Z → ee, Z → µµ (with m(Z) ranging from 1 to 5 TeV), and QCD multijet production. For testing, additional event samples are used, including H → ττ and Z+jets (LO) event samples, and more event samples corresponding to the processes used for the training. The event samples are simulated and reconstructed according to the 2017 data-taking conditions. The prospective τ h candidates need to fulfil p T > 20 GeV and |η| < 2.3. The τ h candidates are sampled from the various event samples such that the contributions of the different classes (jet, µ, e, τ h ) and in different (p T , η) bins are the same. Additional weights are added to make the distributions from the different classes uniform within each (p T , η) bin. In total, around 140 million τ h candidates are used for the training, whereas 10 million are assigned as the validation samples.
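One simple way to construct such flattening weights, sketched here under the assumption of NumPy-array inputs (names are illustrative):

```python
import numpy as np

def flat_weights(pt, eta, labels, pt_bins, eta_bins, n_classes=4):
    """Per-candidate weights that flatten the (pT, eta) distribution of each
    class, so all classes contribute uniformly across the bins."""
    w = np.zeros(len(labels), dtype=float)
    for c in range(n_classes):
        sel = labels == c
        counts, _, _ = np.histogram2d(pt[sel], eta[sel], bins=(pt_bins, eta_bins))
        ipt = np.clip(np.digitize(pt[sel], pt_bins) - 1, 0, len(pt_bins) - 2)
        ieta = np.clip(np.digitize(eta[sel], eta_bins) - 1, 0, len(eta_bins) - 2)
        w[sel] = 1.0 / np.maximum(counts[ipt, ieta], 1.0)
        w[sel] *= sel.sum() / w[sel].sum()  # preserve the overall class yield
    return w
```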
The loss function is minimized with Nesterov-accelerated adaptive momentum estimation (Nadam) [68], which combines the Adam algorithm [69] with Nesterov momentum. The training setup uses the KERAS library [70] with TENSORFLOW [71] as backend. The training was carried out with a single Nvidia GeForce RTX 2080 graphics processing unit. The training of the final classifier was run for 10 epochs, with the training for an epoch lasting 69 hours on average. The best performance on the validation samples was obtained after 7 epochs.
The final discriminators against jets, muons, and electrons are given by

D α = y τ / (y τ + y α ),

with y α taken from Eq. (1) and α ∈ {jet, µ, e}. In the following, these discriminators are labelled as DEEPTAU discriminators against jets (D jet ), muons (D µ ), and electrons (D e ).
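A minimal sketch of this combination of the softmax outputs:

```python
def deeptau_discriminator(y, alpha):
    """D_alpha = y_tau / (y_tau + y_alpha), built from the softmax outputs y."""
    return y["tau"] / (y["tau"] + y[alpha])

# Illustrative softmax output for one tau_h candidate
y = {"tau": 0.90, "jet": 0.06, "mu": 0.01, "e": 0.03}
d_jet = deeptau_discriminator(y, "jet")  # 0.90 / 0.96 ≈ 0.9375
```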

Expected performance
The performance of the DEEPTAU discriminators is evaluated with the validation samples. Working points are defined to guide usage in physics analyses and to derive suitable data-to-simulation corrections. The target τ h identification efficiencies for the various working points used in the wide range of physics analyses with tau leptons are presented in Table 3. The target τ h identification efficiencies are defined as the efficiency for genuine τ h in the H → ττ event sample that are reconstructed as τ h candidates with 30 < p T < 70 GeV to pass the given discriminator. The target τ h identification efficiencies range from 40 to 98% for D jet , from 99.5 to 99.95% for D µ , and from 60 to 99.5% for D e .

In Fig. 4, the performance of the D jet discriminator is evaluated by studying the efficiencies for quark and gluon jets and genuine τ h to pass the discriminator and comparing it with the previously used MVA classifier. All generated and reconstructed τ h candidates in this figure and those that follow are required to have p T > 20 GeV and |η| < 2.3, and to pass the VVVLoose D e and VLoose D µ working points. For the previously used MVA classifier, two curves are shown: the first curve corresponds to the τ h decay mode reconstruction as used in the previous publication and in previous physics analyses, whereas the second curve includes the additional τ h decay modes introduced in Section 3, leading to an overall increase of the MVA classifier performance, in particular for higher efficiency values. The efficiency is evaluated separately for the W+jets and tt event samples. The W+jets sample is enriched in quark jets, whereas the tt event sample has a larger fraction of b quark and gluon jets and has a busier event topology. The efficiencies are also evaluated separately for τ h candidates with p T < 100 GeV and p T > 100 GeV to test the performance in different p T regimes. For all considered scenarios, the D jet discriminator shows a large improvement of the signal efficiency at a given background efficiency compared with the previously used MVA classifier. At a given efficiency for jets to pass the discriminator, there is a relative increase in signal efficiency of more than 30% for the tightest working points and more than 10% for the loosest working points. The larger gain at high p T is a consequence of the inclusion of additional decay modes in the τ h reconstruction.
The p T dependence of the efficiency for genuine generated τ h with p T > 20 GeV to be reconstructed with p T > 20 GeV and to pass the D jet discriminator is shown in Fig. 5. The reconstruction efficiency exceeds 80% for p T > 30 GeV and is close to 90% for p T > 100 GeV, and is mostly limited by the charged-hadron reconstruction efficiency. If decay modes with missing charged hadrons (the so-called two-prong decay modes) are excluded, the efficiency is reduced by around 10%. The efficiency to additionally pass either of the shown working points of the D jet discriminator (for reconstructed τ h candidates without missing charged hadrons) is in line with the target τ h identification efficiencies listed in Table 3.
The D e discriminator leads to a significantly improved rejection of electrons compared with the MVA discriminator, as can be inferred from Fig. 6. The efficiencies for τ h candidates (with p T > 20 GeV, |η| < 2.3, and passing the VVVLoose D jet and VLoose D µ working points) and electrons to pass the two discriminators are shown separately for τ h and electrons with 20 < p T < 100 GeV and p T > 100 GeV. Within the range of applicability of the previously used MVA discriminator, the D e discriminator increases the τ h efficiency by consistently more than 10% for a constant misidentification probability. For example, the τ h efficiency increases from 87 to 99% for the loosest working point of the two discriminators with an electron efficiency of around 7% for p T < 100 GeV. Besides increasing the reach in τ h identification efficiency, the new discriminator also provides the possibility to reject electrons with an efficiency of 10 −4 for a τ h efficiency of 55%. Similar, albeit slightly smaller, gains are observed for p T > 100 GeV.

Figure 4: Efficiency for quark and gluon jets to pass different tau identification discriminators versus the efficiency for genuine τ h . The upper two plots are obtained with jets from the W+jets simulated sample and the lower two plots with jets from the tt sample. The left two plots include jets and genuine τ h with p T < 100 GeV, whereas the right two plots include those with p T > 100 GeV. The working points are indicated as full circles. The efficiency for jets from the W+jets event sample, enriched in quark jets, to pass the discriminators is higher compared with jets from the tt event sample, which has a larger fraction of gluon and b quark jets. The jet efficiency for a given τ h efficiency is larger for jets and τ h with p T < 100 GeV than for those with p T > 100 GeV. Compared with the previously used MVA discriminator, the DEEPTAU discriminator reduces the jet efficiency for a given τ h efficiency by consistently more than a factor of 1.8, and by more at high τ h efficiency. The additional gain at high p T comes from the inclusion of updated decay modes in the τ h reconstruction, as illustrated by the curves for the previously used MVA discriminator but including reconstructed τ h candidates with additional decay modes.

Figure 5: Efficiencies for simulated τ h decays with |η| < 2.3 to pass the following reconstruction and identification requirements: to be reconstructed in any decay mode with p T > 20 GeV and |η| < 2.3 (black dashed line), to be reconstructed in a decay mode except for those with missing charged hadrons (labelled "2-prong" and shown as full black line), and to be reconstructed in a decay mode except the 2-prong ones and to pass the Loose, Medium, or Tight working point of the D jet discriminator (blue lines), obtained with a Z → ττ event sample. The efficiencies are shown as a function of the visible genuine τ h p T obtained from simulated decay products.
Gains are also observed in terms of the discrimination of τ h candidates (with p T > 20 GeV, |η| < 2.3 and passing the VVVLoose D jet and VVVLoose D e WP) versus muons, as shown in Fig. 7. Compared with the cutoff-based discriminator, the D µ discriminator leads to an increase of the τ h efficiency of around 0.5% for a given prompt muon efficiency. For the τ h efficiency range of 99.1-99.4% addressed by the cutoff-based discriminator, the D µ discriminator leads to a factor of 3-10 larger prompt muon rejection, and thereby provides a significantly improved rejection of prompt muons for analyses with hadronically decaying tau leptons, while leading to only a percent-level loss of τ h identification efficiency.
The performance results discussed so far have been obtained with simulated samples that are consistent with the 2017 data-taking conditions. Moreover, after applying the discriminators to simulated samples that correspond to the 2016 and 2018 data-taking conditions, we see consistent performance, both in absolute terms and relative to other discriminators. Based on these findings, the discriminators D e , D µ , and D jet have been recommended for all CMS analyses of LHC data at √ s = 13 TeV.
To summarize, the DEEPTAU discriminators against jets, electrons, and muons lead to large gains in discrimination performance against various backgrounds. The gains in τ h identification efficiency at a typical jet or electron efficiency used in physics analyses amount to more than 30% and 10%, respectively. Therefore, a large gain in reach is expected in physics analyses with tau leptons, in particular for final states with two τ h candidates. These improvements are confirmed by physics analyses that already used the DEEPTAU discriminators [72-74].

Figure 6: Efficiency for electrons versus efficiency for genuine τ h to pass the MVA and D e discriminators, separately for electrons and τ h with 20 < p T < 100 GeV (left) and p T > 100 GeV (right). Vertical bars correspond to the statistical uncertainties. The τ h candidates are reconstructed in one of the τ h decay modes without missing charged hadrons. Compared with the MVA discriminator, the D e discriminator reduces the electron efficiency by more than a factor of two for a τ h efficiency of 70% and by more than a factor of 10 for τ h efficiencies larger than 88%. Furthermore, working points (indicated as full circles) are now provided for previously inaccessible τ h efficiencies larger than 90%, for a misidentification efficiency between 0.3 and 8%.

Performance with √ s = 13 TeV data
The performance of the τ h reconstruction and of the DEEPTAU discriminators is validated with collision data at √ s = 13 TeV. The total integrated luminosity for the 2016-2018 data collected with the CMS detector and validated for use in analysis amounts to 138 fb −1 , of which 36 fb −1 were recorded in 2016, 42 fb −1 in 2017, and 60 fb −1 in 2018 [75-77]. The reconstruction and identification efficiencies for τ h are measured separately for the three data-taking years using a Z → ττ event sample in the µτ h final state, following methods similar to those established previously [19]. In addition, measurements are made of the misidentification efficiency with which jets, electrons, and muons are reconstructed and identified as τ h candidates. All of these measurements are essential ingredients for physics analyses that use τ h candidates.
In this paper, we present a subset of the measurements that were carried out, mostly based on the largest data set, recorded in 2018. The identification efficiencies are generally measured for all working points introduced above. We discuss here only a subset of measurements, for representative working points. Consistent results have been obtained for the other working points, and for the 2016 and 2017 data sets. We have not performed such efficiency measurements for the decay modes with missing charged hadrons; future analyses including these decay modes will need to derive appropriate corrections. The full set of efficiency and energy scale corrections together with their corresponding uncertainties is available for data analyses in the CMS Collaboration.

Figure 7: Efficiency for muons versus efficiency for simulated τ h to pass the cutoff-based and D µ discriminators, separately for muons and τ h with 20 < p T < 100 GeV (left) and p T > 100 GeV (right). The four working points are indicated as full circles. Vertical bars correspond to the statistical uncertainties. In both p T regimes, the D µ discriminator rejects up to a factor of 10 more muons at τ h efficiencies around 99%, and it leads to an increase of the τ h efficiency for a similar background rejection by about 0.5%.

Reconstruction and identification efficiency
The measurement of the τ h reconstruction and identification efficiencies uses a µτ h event sample that is selected in the same way as in the previous measurement [19], with the following updates. Events are recorded with a single-muon trigger with a nominal p T threshold of 24 GeV and are required to have at least one reconstructed and isolated muon with p T > 25 GeV. The τ h candidate is required to be reconstructed with the decay mode finding algorithm, to have p T > 20 GeV, and to pass a given threshold on the D jet discriminator. In addition, the τ h candidate needs to pass the VVLoose working point of the D e discriminator and the Tight working point of the D µ discriminator, to reject background from muons or electrons misidentified as a τ h candidate. Additional criteria are applied to increase the purity of events with genuine τ h candidates, including a requirement on the transverse mass of the muon and p miss T , m T = √(2 p T µ p miss T (1 − cos Φ)) (where Φ is the angle in the transverse plane between the muon momentum and p miss T ), and a maximum difference in η between the reconstructed muon and the τ h candidate (|∆η| < 1.5). The resulting data set is enriched in Z → ττ events, with the residual contamination from other processes amounting to approximately one fifth of the total yield (Fig. 10). In addition to the µτ h event sample, a µµ event sample is defined to normalize the Z → ττ event yields. This event sample is subject to the same trigger and muon selection criteria as the µτ h event sample, such that related uncertainties partially cancel in the normalization scale factor.
The efficiencies are extracted from a maximum likelihood fit to the distribution of the reconstructed visible invariant mass of the µτ h system, m vis , and to the expected and observed event yields in the µµ region. In the fit, the ratio of the observed and expected efficiency for a genuine τ h to pass the selection is incorporated as a free parameter, such that the fit yields a scale factor with respect to the simulated efficiency. All known sources of systematic uncertainties are incorporated into the fit. Uncertainties include the uncertainty in the integrated luminosity, uncertainties in the muon trigger, identification, and isolation efficiencies, uncertainties due to the limited number of simulated events, and uncertainties in the normalizations of tt, QCD multijet, and Z/γ * +jets background. Since the scale factors are extracted after the full event selection, they incorporate any differences between data and simulation of the full product of all reconstruction and identification efficiencies. The scale factors are extracted from the maximum likelihood fit together with their corresponding uncertainties using the asymptotic properties of the likelihood function [78].
The scale factors are extracted in different p T bins, p T ∈ {[20, 25], [25, 30], [30, 35], [35, 40], [40, 50], [50, 70], >70 GeV}, to account for a possible p T dependence of the extracted scale factors, which would affect analyses that use different minimum p T thresholds or that analyze τ h final states from processes with different p T distributions. Analyses usually use all reconstructed decay modes except for the modes with missing charged hadrons, so a parametrization in terms of decay modes is generally not necessary. However, analyses that select events using the di-τ h trigger algorithms observe a different proportion of τ h candidates reconstructed in the various decay modes, as a consequence of the dependence of the τ h trigger efficiency on the decay mode. Therefore, scale factors are separately determined in four bins of reconstructed decay modes (h ± ; h ± π 0 or h ± π 0 π 0 ; h ± h ∓ h ± ; and h ± h ∓ h ± π 0 ). These scale factors are shown in Fig. 8 for τ h p T > 40 GeV, which corresponds to the minimum recommended threshold in analyses that use events recorded with the di-τ h trigger.
Since the reach in τ h p T is limited in the Z → ττ sample, a separate off-shell W boson (W * ) event sample is used to measure additional scale factors for τ h p T > 100 GeV in a τ h -plus-p miss T final state without an additional hadronic jet, following closely the method discussed in Ref. [19]. The requirement of high τ h p T and high p miss T combined with the hadronic jet veto leads to an event sample dominated by off-shell W * → τν τ decays. The events are recorded with a trigger that requires a large p miss T . The efficiencies are obtained from a maximum likelihood fit to the transverse mass distribution of the τ h candidate and p miss T . To constrain the fiducial W * cross section in the signal region, a W * → µν µ control region is also included in the fit.
Figure 8 (left) shows the scale factors as a function of the reconstructed τ h candidate p T separately for the Loose, Medium, and Tight working points. The plots also show the uncertainties for the τ h candidates with p T > 40 GeV, which are obtained from a constant fit up to p T values of 500 GeV, assuming there is no statistically significant p T dependence of the scale factors. Beyond 500 GeV, the uncertainties are enlarged because of the lack of τ h candidates with such high p T values, with the uncertainty at 1000 GeV twice as large as at 500 GeV.
The p T -dependent uncertainties in the τ h scale factors range from 2 to 5%. For the τ h p T spectrum in the Z → ττ sample, the combined scale factor uncertainty amounts to ≈2%. These uncertainties are considerably smaller than in previous measurements, which reported combined uncertainties of ≈6% [19]. The previous measurements used the "tag-and-probe" technique [79] that only measured the identification efficiency for the discriminator under consideration and added additional uncertainties in the reconstruction efficiency from independent measurements of single charged-hadron reconstruction efficiencies. The measurement of the product of all reconstruction and identification efficiencies with a maximum likelihood fit hence represents another important improvement for physics analyses with τ h final states, as it leads to a considerable reduction of the systematic uncertainties related to the τ h reconstruction and identification. The scale factors are generally slightly smaller than one but consistent with unity within 10% (20% for the τ h candidates reconstructed in the h ± h ∓ h ± π 0 decay mode). The scale factors below one can be traced back to different origins, including the imperfect modelling of hadronization and an imperfect simulation of the detector alignment and track hit reconstruction efficiencies.

The τ h energy scale
The τ h energy scale is obtained from the same µτ h event samples as the scale factors discussed in the previous section, following the same method described in previous publications [19]. For the main measurement discussed here, τ h candidates are required to pass the Medium working point of the D jet discriminator. Consistent results are obtained with cross-check measurements for various working points of the D jet discriminator. To improve the a priori data-simulation agreement, scale factors as described in the previous section are already applied to simulated samples with τ h final states.
Two different observables are used to measure the τ h energy scale: the µτ h invariant mass (m vis ) and the reconstructed τ h mass (m(τ h )). For τ h candidates reconstructed in the h ± decay mode, m(τ h ) is constant, so the measurement is only performed with the m vis observable.
Simulated templates of the m vis and m(τ h ) distributions are created for τ h energy scale shifts between −3% and +3% in steps of 0.2%. For each shift, a maximum likelihood fit is performed, using the combined expectation from the Z → ττ process in the µτ h final state and the background events, using the set of systematic uncertainties discussed above. The observed value of the τ h energy scale shift is obtained from the distribution of the likelihood ratio together with the corresponding uncertainty.

Figure 9: Relative difference between the τ h energy obtained in data and simulated events for the four main reconstructed τ h decay modes for the 2018 data set. The results are obtained from fits to the distribution of either the reconstructed m vis (µ, τ h ) (blue lines) or m(τ h ) (black lines). The horizontal bars represent the uncertainties in the measurements. The measured values are consistent with no shift of the τ h energy scale between data and simulation, with the largest difference amounting to 1.5 standard deviations.
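The scan over energy scale shifts can be illustrated with the following hypothetical sketch, where a simple least-squares comparison stands in for the maximum likelihood fit used in the actual measurement:

```python
import numpy as np

shifts = np.arange(-3.0, 3.0 + 1e-9, 0.2)  # energy scale shifts in percent

def best_shift(data_hist, make_template):
    """make_template(shift) is assumed to return the expected m_vis or m(tau_h)
    histogram for a given energy scale shift."""
    chi2 = [np.sum((data_hist - make_template(s)) ** 2) for s in shifts]
    return shifts[int(np.argmin(chi2))]
```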
The resulting relative differences between the τ h energy scale in data and simulation are shown in Fig. 9 separately for the two different measurements. The relative difference is given as a correction, in percent, to a unit multiplicative factor applied to the τ h four-momentum in the simulation. The results obtained with the two methods are consistent with each other, and the τ h energy scale shifts are consistent with no data-simulation differences, with the largest difference amounting to 1.5 standard deviations. The relative uncertainties range from 0.6% for the h ± π 0 and h ± h ∓ h ± decay modes to 0.8% for the h ± mode.
To validate the extracted scale factors and to illustrate the distributions used for the extraction of the scale factors and the τ h energy corrections, distributions of m vis and m(τ h ) are investigated, requiring the τ h to pass the Tight working point of the D jet discriminator (Fig. 10). These distributions are obtained from a maximum likelihood fit to the µτ h data, applying the same selection used in the extraction of the scale factors. This fit incorporates the derived scale factors and energy corrections with their uncertainties, together with the full set of systematic uncertainties used in the scale factor extractions. In the fit to data, the Z → ττ normalization is a freely floating parameter. Data and predictions agree within the uncertainties, indicating that the derived scale factors and energy corrections lead to a good and complete description of the data, including the description of the different τ h decay modes.

Lepton and jet misidentification efficiencies
The efficiencies for electrons, muons, and jets to pass the respective DEEPTAU discriminators are measured with dedicated event samples, following closely the measurements explained in detail in Ref. [19]. The efficiencies for electrons and muons to be reconstructed and misidentified as τ h candidates are measured with Z → ℓℓ events (ℓ ∈ {e, µ}) where one ℓ is misreconstructed as a τ h candidate; the event is reconstructed as an ℓτ h final state. The efficiencies are extracted with the tag-and-probe method, where candidates that pass a given working point of the D µ or D e discriminator end up in the pass region and those that do not pass in the fail region. A maximum likelihood fit is performed to obtain the efficiencies. The distribution used in the pass region is the invariant mass of the ℓτ h system (m vis ), whereas the fail region is included as a single bin. Similar to the extraction of the τ h efficiencies, all relevant uncertainties are included in the maximum likelihood fit. To reject background from jets reconstructed as τ h candidates, the candidates are also required to pass the Medium working point of the D jet discriminator. An additional parameter is introduced in the fit that scales the contributions in the pass and fail regions simultaneously and corresponds to the scale factor for the efficiency to pass the D jet discriminator. This approach is based on the assumption that the probability for electrons or muons to pass the D jet discriminator is independent of the probability to pass the D e or D µ discriminator. Another parameter is introduced for electrons misreconstructed as τ h candidates that allows for a shifted energy scale.
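Schematically, and ignoring the background subtraction and the simultaneous fit, the tag-and-probe efficiency and the data-to-simulation scale factor take the form sketched below (illustrative names):

```python
def efficiency(n_pass, n_fail):
    """Tag-and-probe efficiency from pass/fail event counts."""
    return n_pass / (n_pass + n_fail)

def scale_factor(eff_data, eff_sim):
    """Data-to-simulation correction applied in physics analyses."""
    return eff_data / eff_sim
```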
The eτ h events for the measurement of the scale factors for electrons to pass different working points of the D e discriminator are recorded with a single-electron trigger with a p T threshold of 35 GeV. The efficiencies for electrons to pass the different working points in data and simulated events, as well as the ratio of the efficiencies, are shown in Fig. 11. The efficiencies are derived separately in the ECAL barrel region (|η| < 1.46) and the ECAL endcap region (|η| > 1.56). The observed and expected efficiencies agree within a few percent for the Loose and VLoose working points in the barrel and within 11% in the endcap region. For tighter working points, scale factors are obtained that differ significantly from one; they are generally greater than one in the barrel, with a maximum scale factor of 1.61 ± 0.28 for the VTight working point, and smaller than one in the endcaps.
The µτ h events for the measurement of the scale factors for muons to be misreconstructed and misidentified as τ h candidates are recorded with a single-muon trigger with a threshold of 27 GeV. The leading muon is required to have p T > 28 GeV, and the transverse mass of the muon and p miss T is required to be smaller than 30 GeV. The resulting data-to-simulation scale factors are extracted from a fit to the m vis distribution and are shown in Fig. 12. The scale factors are shown separately for the Loose and Tight working points, and for five different bins in |η|. Although the scale factors are consistent with each other for the different working points and consistent with unity for four of the |η| bins, they differ significantly from one for |η| > 1.7. As a result of the high rejection power of D µ , muons that are not discarded belong to atypical regions of the phase space, far from the bulk of the distributions. For such uncommon muons, we observe that the simulation does not model the data well, especially for observables related to the muon track quality at large |η|; this is compatible with being the cause of the sizeable data-to-simulation difference at |η| > 1.7. The large D µ scale factors at |η| > 1.7 have no notable impact on physics analyses because of the small probability for muons to pass any D µ working point and the limited extent of the region presenting large data-to-simulation differences.
For a measurement of the efficiency for jets to pass the D jet discriminator, events are selected that are consistent with a W(→ µν µ )+jet topology. Similar to the previously discussed measurement, the events are recorded with a single-muon trigger with a p T threshold of 27 GeV, and the reconstructed muon is required to have p T > 28 GeV. Furthermore, there must be exactly one central jet with p T > 20 GeV and |η| < 2.3, and events with additional reconstructed muons or a reconstructed electron passing loose identification criteria are rejected. To select a pure sample of W+jets events, the transverse mass of the muon and p miss T is required to be greater than 60 GeV. The reconstructed jet serves as a probe for the measurement of the jet-to-τ h misidentification efficiency. The efficiencies for jets to be reconstructed as τ h candidates and to pass the Loose and Tight D jet working points are shown in Fig. 13 as a function of the reconstructed jet p T . In the measurement of the efficiency, background events with genuine τ h candidates are subtracted. The observed and expected efficiencies agree within the uncertainties. Besides the statistical uncertainty in the observed events, the error bars in the ratio of data to simulation in Fig. 13 also include uncertainties from the subtraction of events with genuine τ h candidates, electrons, or muons. The efficiency first rises with p T near the 20 GeV threshold because it becomes more likely for a jet to give rise to a reconstructed τ h candidate that passes this threshold. For higher p T , the particle multiplicity in a quark or gluon jet increases with p T , so the jets become easier to distinguish from genuine τ h candidates and the probability for quark or gluon jets to pass the D jet discriminator decreases with p T .

Summary
A new algorithm has been introduced to discriminate hadronic tau lepton decays (τ h ) against jets, electrons, and muons. The algorithm is based on a deep neural network and combines fully connected and convolutional layers. As input, the algorithm combines information from individual reconstructed particles near the τ h axis with information about the reconstructed τ h candidate and other high-level variables. In addition, an improved τ h reconstruction algorithm is introduced that increases the overall efficiency of the reconstruction by explicitly considering the τ h decay mode with three charged hadrons and a neutral pion, and by applying looser quality criteria for the charged hadrons in the case of three-prong τ h decays. The performance of the new τ h identification and reconstruction algorithms significantly improves over the previously used algorithms, in particular in terms of discrimination against the background from jets and electrons. For a given jet rejection level, the efficiency for genuine τ h candidates increases by 10-30%. Similarly, the efficiency for genuine τ h candidates to pass the discriminator against electrons increases by 14% for the loosest working point that is employed in many analyses. Given its superior performance, CMS physics analyses with tau leptons will significantly increase their sensitivities when using the new algorithm. The superior performance of the algorithm is validated with collision data. The observed efficiencies for genuine τ h , jets, and electrons to be identified as τ h typically agree within 10% with the expected efficiencies from simulated events. The agreement is similar to the one observed with previous algorithms and confirms the improvements.

Acknowledgments
We congratulate our colleagues in the CERN accelerator departments for the excellent performance of the LHC and thank the technical and administrative staffs at CERN and at other CMS institutes for their contributions to the success of the CMS effort. In addition, we gratefully acknowledge the computing centers and personnel of the Worldwide LHC Computing Grid and other centers for delivering so effectively the computing infrastructure essential to our analyses. Finally, we acknowledge the enduring support for the construction and operation of the LHC, the CMS detector, and the supporting computing infrastructure provided by the following funding agencies.

References
[9] CMS Collaboration, "Search for a heavy pseudoscalar Higgs boson decaying into a 125 GeV Higgs boson and a Z boson in final states with two tau and two light leptons at √s = 13 TeV", JHEP 03 (2020) 065, doi:10.1007/JHEP03(2020)065, arXiv:1910.11634.

[72] CMS Collaboration, "Measurements of Higgs boson production in the decay channel with a pair of τ leptons in proton-proton collisions at √s = 13 TeV", 2022. arXiv:2204.12957. Submitted to Eur. Phys. J. C.

A Loss function
The loss function used for the training of the DEEPTAU algorithm has the form

L(y true , y; κ, γ, ω) = κ τ H τ (y true , y; ω)   (a)
  + (κ e + κ µ + κ jet ) F cmb (1 − y true τ , 1 − y τ ; γ cmb )   (b)
  + ∑ α∈{e, µ, jet} κ α F α (y true α , y α ; γ α ) θ̃(y τ ),   (c)        (A.1)

where bold fonts indicate groups of parameters; H(·, ·; ·) corresponds to the categorical cross entropy and F(·, ·; ·) to the focal loss function [67]; N is a factor to normalize F(·, ·; ·) to unity in the interval 0-1; and θ̃(·) is a smoothened step function that approaches 1 for y τ > 0.1. The focal loss terms in Eq. A.1 (b) and (c) put more emphasis on the part of parameter space in which it is more difficult to separate τ h candidates from e, µ, and jets. The step function in Eq. A.1 (c) disregards the part of parameter space in which the probability for the τ h candidate to correspond to a genuine τ h is low, for which we are not interested in optimal separation between e, µ, and jets.
The values and meanings of the parameters κ, γ, and ω are given in Table A
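For illustration, the building blocks of this loss can be sketched in TensorFlow as follows; this is a minimal sketch with illustrative parameter values (γ = 2, step steepness k = 100), not the trained CMS configuration, and the normalization factor N is omitted:

```python
import tensorflow as tf

def focal_loss(y_true, y_pred, gamma=2.0, eps=1e-7):
    """F(y_true, y; gamma) = -sum y_true * (1 - y)^gamma * log(y)."""
    y_pred = tf.clip_by_value(y_pred, eps, 1.0)
    return -tf.reduce_sum(y_true * (1.0 - y_pred) ** gamma * tf.math.log(y_pred), axis=-1)

def smooth_step(y_tau, x0=0.1, k=100.0):
    """Smoothened step function that approaches 1 for y_tau > x0."""
    return tf.sigmoid(k * (y_tau - x0))
```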