Identification of heavy, energetic, hadronically decaying particles using machine-learning techniques

Machine-learning (ML) techniques are explored to identify and classify hadronic decays of highly Lorentz-boosted W/Z/Higgs bosons and top quarks. Techniques without ML have also been evaluated and are included for comparison. The identification performances of a variety of algorithms are characterized in simulated events and directly compared with data. The algorithms are validated using proton-proton collision data at $\sqrt{s} =$ 13 TeV, corresponding to an integrated luminosity of 35.9 fb$^{-1}$. Systematic uncertainties are assessed by comparing the results obtained using simulation and collision data. The new techniques studied in this paper provide significant performance improvements over non-ML techniques, reducing the background rate by up to an order of magnitude at the same signal efficiency.


Introduction
At the CERN LHC [1], an efficient classification of hadronic decays of heavy standard-model (SM) particles (objects) that are reconstructed within a single jet would provide a significant improvement in the sensitivity of searches for physics beyond the SM (BSM) and in measurements of SM parameters. The understanding of jet substructure in highly Lorentz-boosted W/Z/H bosons (where H is the Higgs boson) and top (t) quark jets has advanced dramatically in recent years, both experimentally [2] and theoretically [3]. For a particle with a Lorentz boost of γ, the angular separation between its decay products scales as θ ∼ 2/γ in radians. A knowledge of the radiation patterns of these jets and their substructure is an important topic in theoretical and experimental research.
In this paper, we present studies using the CMS detector [4] at the LHC to evaluate and compare the performances of a variety of algorithms ("taggers") designed to distinguish hadronically decaying massive SM particles with large Lorentz boosts, namely W/Z/H bosons and t quarks, from other jets originating from lighter quarks (u/d/s/c/b) or gluons (g). We refer to such jets as "boosted W/Z/H/t jets," or "W/Z/H/t-tagged jets". The machine-learning (ML) algorithms include the energy correlation functions tagger (ECF), the boosted event shape tagger (BEST), the ImageTop tagger, and the DeepAK8 tagger. Algorithms without ML techniques have also been evaluated and are included for comparison. An alternative approach for jet clustering and identification, named the "heavy object with variable R (HOTVR)", where the heavy object is a W/Z/H boson or t quark, is also studied.
The theoretical and experimental understanding of jet substructure has gained significant precision in recent years. The CMS Collaboration has made many relevant measurements of jet substructure, including measurements of the cross section of highly Lorentz-boosted t quarks [5], jet mass in tt [6], dijet [7,8], samples enriched in light-flavors [7], and substructure observables in jets of different light-quark flavors [9] in resolved tt events. Similar measurements by the AT-LAS Collaboration are found in Refs. [10][11][12][13][14]. Overall, the systematic effects of jet substructure are well understood and, after correcting for detector effects, the results are generally consistent with theoretical expectations as expressed in simulations. Residual differences between data and simulation can be adjusted using scale factors.
ML-based approaches can be tailored to suit the needs of individual analyses. Some analyses require as pure a sample as possible, with optimized signal efficiency for a fixed background rejection. Others require well-behaved background estimates as a function of kinematic variables. A characteristic example is the use of jet mass sidebands for the background estimation. In this case, removing dependencies on the jet mass is collectively referred to as "mass decorrelation", as described in Ref. [15]. This paper provides tools derived from a strong program of previous study [16][17][18][19][20] for both the jet-mass-decorrelated and nominal scenarios.
The paper is organized as follows. A brief description of the CMS detector is presented in Section 2. The Monte Carlo (MC) simulated events used for the results are discussed in Section 3, and details of the CMS event reconstruction and the event selections used for the studies are summarized in Sections 4 and 5, respectively. Section 6 presents an overview of the methods currently used in CMS for heavy-resonance (i.e., W/Z/H bosons and t quarks) identification, and describes a set of novel algorithms that utilize ML methods and observables for this task. Our discussion of the CMS methods builds on the work documented in Refs. [16][17][18][19][20]. Section 7 details the analyses performed to understand the complementarity between the algorithms using simulated events. The performance of the algorithms is validated in data samples collected in proton-proton (pp) collisions at √ s = 13 TeV by the CMS experiment at the LHC in 2016, and corresponding to an integrated luminosity of 35.9 fb −1 . The results, along with the effect of systematic uncertainties in their measurement, are presented in Section 8, followed by a discussion of the results and a summary in Section 9.

The CMS detector
The central feature of the CMS apparatus is a superconducting solenoid of 6 m internal diameter, providing a magnetic field of 3.8 T. Within the solenoid volume are a silicon pixel and strip tracker, a lead tungstate crystal electromagnetic calorimeter (ECAL), and a brass and scintillator hadron calorimeter (HCAL), each composed of a barrel and two endcap sections. Forward calorimeters extend the pseudorapidity (η) coverage provided by the barrel and endcap detectors [4]. Muons are measured in gas-ionization chambers embedded in the steel flux-return yoke outside the solenoid.
In the barrel section of the ECAL, an energy resolution of about 1% is achieved for unconverted or late-converting photons in the tens of GeV energy range. The remaining barrel photons have a resolution of about 1.3% up to |η| = 1, rising to about 2.5% at |η| = 1.4. In the endcaps, the resolution of unconverted or late-converting photons is about 2.5%, while the remaining endcap photons have a resolution between 3 and 4% [21].
In the region |η| < 1.74, the HCAL cells have widths of 0.087 in η and 0.087 radians in azimuth (φ). In the η-φ plane, and for |η| < 1.48, the HCAL cells map onto 5×5 ECAL crystals arrays to form calorimeter towers projecting radially outwards from close to the nominal interaction point. At larger values of |η|, the size of the towers increases and the matching ECAL arrays contain fewer crystals.
Muons are measured in the η range |η| < 2.4, with detection planes made using three technologies: drift tubes, cathode strip chambers, and resistive-plate chambers. Matching muons to tracks measured in the silicon tracker results in a relative transverse momentum (p T ) resolution for muons with 20 < p T < 100 GeV of 1.3-2.0% in the barrel and better than 6% in the endcaps. The p T resolution in the barrel is better than 10% for muons with p T up to 1 TeV [22].
The silicon tracker measures charged particles within the pseudorapidity range |η| < 2.5. It consists of 1440 silicon pixel and 15 148 silicon strip detector modules. Isolated particles of p T = 100 GeV emitted at |η| < 1.4 have track resolutions of 2.8% in p T and 10 (30) µm in the transverse (longitudinal) impact parameter [23].
Events of interest are selected using a two-tiered trigger system [24]. The first level (L1), composed of custom hardware processors, uses information from the calorimeters and muon detectors to select events at a rate of around 100 kHz. The second level, known as the high-level trigger (HLT), consists of a farm of processors running a version of the full event reconstruction software optimized for fast processing, and reduces the event rate to around 1 kHz before data storage.
A more detailed description of the CMS detector, together with the definition of the coordinate system used and the relevant kinematic variables, is given in Ref. [4].

Simulated event samples
Simulated pp collision events are generated at √ s = 13 TeV using various generators described below. They are used for the design and the performance studies of the heavy-resonance identification algorithms to compare with data and to estimate systematic uncertainties. The signal samples, enriched in one or more W/Z/H/-tagged jets, are obtained from the simulation of BSM processes. The t and W jet signal samples are obtained from heavy spin-1 Z resonances decaying to either a pair of t quarks (tt) or a pair of W bosons, respectively. These resonances are narrow, having intrinsic widths equal to 1% of the resonance mass. The Z-and H-tagged jet samples are obtained from decays of spin-2 Kaluza-Klein graviton resonances in the Randall-Sundrum model [25,26] to a pair of Z or H bosons, following the narrow-width assumption. The Z and graviton samples are simulated at leading order (LO) with MADGRAPH5 aMC@NLO 2.2.2 [27] interfaced with PYTHIA 8.212 [28,29] with the CUETP8M1 underlying event tune [30] for the fragmentation and hadronization description. Signal events are generated over a wide range of p T for different Z and graviton mass values. The background sample is represented by jets produced via the strong interaction of quantum chromodynamics (QCD), referred to as "QCD multijet" processes. The QCD multijet events are generated using PYTHIA in exclusivê p T bins using the NNPDF2.3 LO [31] parton distribution function (PDF) set.
A variety of MC simulations are needed for the study of the performance of the tagging algorithms in data. The tt process is generated with the next-to-leading-order (NLO) generator POWHEG v2.0 [32][33][34] interfaced with PYTHIA for the fragmentation and hadronization description. Simulated events originating from W+jets, Z+jets, and γ+jets, are generated using MAD-GRAPH5 aMC@NLO at LO accuracy using the NNPDF3.0 LO [31] PDF set. The WZ, ZZ, ttW, and tt γ+jets processes are generated using MADGRAPH5 aMC@NLO at NLO accuracy, the single t quark process in the t W channel and the WW process are generated at NLO accuracy with POWHEG, all using the NNPDF3.0 NLO PDF set. In all of these cases, parton showering and hadronization are simulated in PYTHIA. Double counting of partons generated using PYTHIA and MADGRAPH5 aMC@NLO is eliminated using the MLM [35] and FxFx [36] matching schemes for the LO and NLO samples, respectively.
The systematic uncertainties associated with the performance of the taggers are evaluated using simulated events produced with alternative generation settings. For the tt process, an additional sample is generated using POWHEG interfaced with HERWIG++ v2.7.1 [37,38] with the UE-EE-5C underlying event tune [39] to assess systematic uncertainties related to the modeling of the parton showering and hadronization. Additional QCD multijet samples are generated at LO accuracy using MADGRAPH5 aMC@NLO, interfaced with PYTHIA to test the modeling of the hard scattering in background events, or generated solely with HERWIG++ with the CUETHppS1 underlying event tune [30] to provide an alternative model of the background jets.
The most precise cross section calculations available are used to normalize the SM simulated samples. In most cases, this is next-to-NLO accuracy in the inclusive cross section. Finally, the p T spectrum of top quarks in tt events is reweighted (referred to as "top quark p T reweighting") to account for effects due to missing higher-order corrections in MC simulation, according to the results presented in Ref. [40]. The simulation of the QCD multijet and γ+jets processes is based on LO calculations. To account for missing higher-order corrections, the simulated QCD multijet events and the γ+jets events are reweighted such that the p T distribution of the leading jet in simulation matches that in data. Before extracting the weights, contributions from other processes are subtracted from data using the predicted cross sections in both cases.
A full GEANT4-based model [41] is used to simulate the response of the CMS detector to SM background samples. Event reconstruction is performed in the same manner for MC simulation as for collision data. A nominal distribution of multiple pp collisions in the same or neighboring bunch crossings (referred to as "pileup") is used to overlay the simulated events. The events are then weighted to match the pileup profile observed in the data. For the data used in this paper, there were an average of 23 interactions per bunch crossing.

Event reconstruction and physics objects
Events are reconstructed using the CMS particle-flow (PF) algorithm [42], which aims to reconstruct and identify each individual particle in the event with an optimized combination of information from the various elements of the detector. Particles are identified as charged or neutral hadrons, photons, electrons, or muons, and cannot be classified into multiple categories. The PF candidates are then used to build higher-level objects, such as jets. Events are required to have at least one reconstructed vertex. The physics objects are those returned by a jet-finding algorithm [43,44] applied to the tracks associated with the vertex, and the associated missing transverse momentum p miss T , taken as the negative vector sum of the p T of those jets. In the case of multiple overlapping events with multiple reconstructed vertices, the vertex with the largest value of summed physics object p 2 T is defined to be the primary pp interaction vertex (PV).
Photons are reconstructed from energy depositions in the ECAL using identification algorithms that use a collection of variables related to the spatial distribution of shower energy in the supercluster (a group of 5×5 ECAL crystals), the photon isolation, and the fraction of the energy deposited in the HCAL behind the supercluster relative to the energy observed in the supercluster [21,45]. The requirements imposed on these variables ensure an efficiency of 80% in selecting prompt photons. Photon candidates are required to be reconstructed with p T > 200 GeV and |η| < 2.5. Simulation-to-data correction factors are used to correct photon identification performance in MC.
Electrons are reconstructed by combining information from the inner tracker with energy depositions in the ECAL [45]. Muons are reconstructed by combining tracks in the inner tracker and in the muon system [22]. Tracks associated with electrons or muons are required to originate from the PV, and a set of quality criteria is imposed to assure efficient identification [22,45]. To suppress misidentification of charged hadrons as leptons, we require electrons and muons to be isolated from jet activity within a p T -dependent cone in the η-φ plane, ∆R = √ (∆η) 2 + (∆φ) 2 , where φ is the azimuthal angle in radians. The relative isolation, I rel , is defined as the p T sum of the PF candidates within the cone divided by the lepton p T . Neither charged PF candidates not originating from the PV, nor those identified as electrons or muons, are included in the sum.
The isolation sum I rel is corrected for contributions of neutral particles originating from pileup interactions using an area-based estimate [46] of pileup energy deposition in the cone. The requirements imposed on the electron and muon candidates lead to an average identification efficiency of 70 and 95%, respectively. In addition, the electron and muon candidates are required to have p T > 40 GeV and be within the tracker acceptance of |η| < 2.5. The electron and muon identification performance in simulation is corrected to match the performance in data.
The primary jet collection in this paper, referred to as "AK8 jets", is produced by clustering PF candidates using the anti-k T algorithm [43] with a distance parameter of R = 0.8 with the FASTJET 3.1 software package [43,44].
A collection of jets produced using the Cambridge-Aachen (CA) [47,48] clustering algorithm with R = 1.5, referred to as "CA15 jets", is also used in this paper. In both jet collections, the "PileUp Per Particle Identification (PUPPI)" [49] method is used to mitigate the effect of pileup on jet observables. This method makes use of local shape information around each particle in the event, the event pileup properties, and tracking information. This PUPPI algorithm operates at the PF candidate level, before any jet clustering is performed. A local variable α is computed for each PF candidate, which contrasts the collinear structure of QCD with the low-p T diffuse radiation arising from pileup interactions. This α variable is used to calculate a weight correlated with the probability that an individual PF candidate originates from a pileup collision. These per PF candidate weights are used to rescale the four-momenta of each PF candidate to correct for pileup. The resulting PF candidate list is used as an input to the clustering algorithm. A detailed description of the PUPPI implementation in CMS can be found in Ref. [50]. No additional pileup corrections are applied to jets clustered from these weighted inputs. Corrections are applied to the jet energy scale to compensate for nonuniform detector response [51]. Jets are required to have p T > 200 GeV and |η| < 2.4.
A collection of jets, reconstructed with the anti-k T algorithm and a smaller distance parameter R = 0.4, referred to as "AK4 jets", are used to define the event samples for the validation of the algorithms. To reduce the effect of pileup collisions, charged PF candidates identified as originating from pileup vertices are removed before the jet clustering, based on the method known as "charged-hadron subtraction" [51]. An event-by-event correction based on jet area [51] is applied to the jet four-momenta to remove the remaining neutral energy from pileup vertices. As with the AK8 and CA15 jets described above, additional corrections to the jet energy scale are applied to compensate for nonuniform detector response. The AK4 jets are required to have p T > 30 GeV and be contained within the tracker volume of |η| < 2.4.
Jets originating from the hadronization of bottom (b) quarks are identified, or "tagged", using the combined secondary vertex (CSVv2) b tagging algorithm [20]. The working point, i.e., a selection on the algorithm's discriminant providing a well defined signal (e.g., b quarks) and background (e.g., light quarks) efficiency, used provides an efficiency for the b tagging of jets originating from b quarks that varies from 60 to 75%, depending on p T , whereas the misidentification rate for light quarks or gluons is ∼1%, and ∼15% for charm quarks.
For the studies presented in this paper, the simulated signal jets (AK8 or CA15 jets) are identified as W/Z/H/t-tagged jets when the ∆R between the reconstructed jet and the closest generated particle (W/Z/H boson or t quark) before the decay, denoted as ∆R(jet, generated particle), is less than 0.6. This definition allows for a consistent comparison of the performance of the algorithms using collections of jets clustered with different R. The choice of the 0.6 value approximately corresponds to the minima of the ∆R distribution between jets and the closest generated particle based on studies reported in Ref. [17]. The fraction of AK8 jets with ∆R(AK8, generated particle) < 0.6 as a function of the p T of the generated particle for jets initiated from the decay of a W boson (left) or t quark (right) is shown in Fig. 1. This "matching" efficiency of W bosons (t quarks) reaches a plateau of nearly 100% for p T 200 (400) GeV. The corresponding efficiency curve for CA15 jets is superimposed on the plots, and shows consistent efficiency with AK8 jets. A similar efficiency is obtained when a relaxed selection of ∆R(CA15, generated particle) < 1.2 is applied. This justifies the use of the same ∆R(jet, generated particle) reconstruction criteria for both jet collections.
Additional criteria are applied to simulated jets for the evaluation of the performance in data and for the calibration of the algorithms. The partonic decay products (b, q 1 , q 2 for t quarks, or q 1 , q 2 for W, Z or H bosons) are required to be fully contained in the AK8 (CA15) jet, satisfying ∆R(AK8, q i ) < 0.6 (∆R(CA15, q i ) < 1.2). These requirements were derived from the studies in Ref. [17]. The "merging" probability as a function of the p T of the generated particle (i.e., the efficiency for the decay products of the t quark or W boson to be fully contained in a single jet based on the above requirements) is also shown in Fig. 1. For W bosons (t quarks) with p T 200 (650) GeV, at least 50% of the AK8 jets fully contain the W (t) decay products. In the case of CA15 jets, similar efficiency is achieved for W bosons (t quarks) with p T 150 (350) GeV.
In the case of background jets, partons (u, d, s, c, b, and gluon) from the hard scattering are required to be contained in the jet cone for the jet to be classified as such.  Figure 1: Matching efficiency as a function of the p T of the generated particle, for hadronically decaying W bosons (left) and t quarks (right). This efficiency is defined as the fraction of the generated particles (t quarks or W bosons) that are within ∆R < 0.6 with an AK8 or CA15 jet with p T > 200 GeV and |η| < 2.4. Superimposed is the merging efficiency as a function of the generated particle p T when all decay products are within ∆R(AK8, q i ) < 0.6 (∆R(CA15, q i ) < 1.2) with an AK8 (CA15) jet.
Finally, the p miss T is defined as the negative of the vectorial sum of the p T of all PF candidates in the event [52]. Its magnitude is denoted as p miss T . The jet energy scale corrections applied to the jets are propagated to p miss T .

Event selection
Several samples are used to validate the performance of the tagging algorithms in data. A single-µ signal sample is used to calibrate the t quark and W boson identification performance in a sample enriched in hadronically decaying t quarks, as explained below. A dijet sample, dominated by light-flavor quarks and gluons, enables the study of the identification probability of background jets (misidentification rate) in a wide range of p T . The misidentification rate depends on the flavor of the parton that initiated the jet. Thus, in addition to the dijet sample, the single-γ background sample is further used. The dijet and single-γ samples differ in the light-flavor quark and gluon fractions. The former has a larger fraction of gluon jets than the latter.
Systematic effects are quantified using these samples to determine uncertainties in measurements corrected for the detector effects.

The single-µ signal sample
The single-µ signal sample was recorded using a single-muon trigger that selects events online based on the muon p T . Candidate events are required to have exactly one muon with p T > 55 GeV, satisfying the identification criteria defined in Section 4, except for the requirement related to the isolation of leptons I rel . In high-p T leptonic decays of the t quarks, the lepton from the W boson decay often overlaps with the b jet from the t quark decay, leading to large values of I rel , causing the event to be rejected. Therefore, a custom isolation criterion is applied by requiring a minimal distance between the muon and the nearest AK4 jet, ∆R(µ, AK4) > 0.4, or the perpendicular component of the muon p T with respect to the nearest AK4 jet, p T,rel > 25 GeV. This has been extensively used in measurements [5] and searches [53][54][55][56] involving high momentum t quarks in the single-µ sample.
The AK4 jets used in this selection are clustered from PF candidates after removing muons with p T > 55 GeV. The custom isolation requirement results in an up to 40% increase in the statistical power of the sample. To suppress the contribution from QCD multijet processes we require p miss T > 50 GeV. To enhance the sample purity in tt events, we require the presence of two or more AK4 jets, at least one of which is reconstructed as a b jet. In addition, to probe high momentum topologies, we require the p T of the leptonically decaying W bosons, defined as p T (W) = p T (µ) + p miss T , and the scalar p T sum of the AK4 jets, denoted as H T , to be greater than 250 GeV. The t/W candidate is the highest p T AK8 or CA15 jet in the event with p T > 200 GeV, satisfying the criteria discussed in Section 4. To further improve the purity, we require the azimuthal angle ∆φ between the AK8 or CA15 jet and the muon to be greater than 2 radians. The purity of the sample in semileptonic tt events is ∼70%; other contributions arise from QCD multijet (∼15%) and W+jets (∼10%) processes.

The dijet background sample
The dijet background sample was recorded with a trigger that uses H T . Events with H T > 1000 GeV are selected to ensure 100% trigger efficiency. Events are required to have at least one AK8 or CA15 jet meeting the requirements presented in Section 4, and the absence of electrons or muons, leading to a sample dominated by jets from the QCD multijet process, which are backgrounds to the algorithms presented here.

The single-γ background sample
The single-γ background sample was collected using an isolated-single-photon trigger. Events with a photon with p T > 200 GeV are selected to ensure 100% trigger efficiency. The photon is further required to satisfy the criteria presented in Section 4. In addition to the photon, the single-γ sample is required to have at least one AK8 or CA15 jet and no electrons or muons. The sample consists of ∼80% γ+jets events, but only ∼15% QCD multijet events.

Overview of the algorithms
This section presents recently developed ML-based CMS heavy-object tagging methods. However, to understand the historical developments and their limitations, we first present tagging algorithms that do not rely on selections involving ML-based methods, but instead rely on selections based on a set of jet substructure observables ("cutoff-based" approaches). To better explore the complementarity between the jet substructure variables, alternative tagging algorithms were developed using multivariate methods. Lastly, to exploit the full potential of the CMS detector and event reconstruction, methods based on Deep Neural Networks (DNNs) are explored using either high level inputs (e.g., jet substructure observables), or lower level inputs, such as PF candidates and secondary vertices. Finally, dedicated versions of the algorithms are developed that are only loosely correlated with the jet mass. A detailed discussion of each algorithm is presented in this Section and a summary of all t quark, W, Z or H boson identification algorithms is given in Table 1. Table 1: Summary of the CMS algorithms for the identification of hadronically decaying t quarks and W, Z and H bosons. See text for explanation of the algorithm names. The column "Subsection" indicates the subsection where the algorithm is described, and the column "jet p T [GeV]" indicates the jet p T threshold to be used in each algorithm. The * in DeepAK8 and DeepAK8-MD algorithms indicates the ability of these algorithm to also identify the decay modes of each particle.

Algorithm
Subsection

Substructure variable based algorithms
Historically, the high momentum t quark and W/Z/H boson tagging methods used by the CMS Collaboration are based on a combination of selection criteria on the jet mass and the energy distribution inside the jet [16][17][18][19][20].
The jet mass is one of the most powerful observables to discriminate t quark and W/Z/H boson jets from background jets (i.e., jets stemming from the hadronization of light-flavor quarks or gluons). The QCD radiation will cause a radiative shower of quarks and gluons, which will be collimated within a jet. The probability for a gluon to be radiated from a propagating quark or gluon is inversely proportional to the angle and energy of the radiated gluon. Hence, the radiated gluon will tend to appear close to the direction of the original quark or gluon. These radiated gluons tend to be soft, resulting in a characteristic "Sudakov peak" structure. This is explained in detail in Ref. [8]. Contributions from initial-state radiation, the underlying event, and pileup also contribute strongly to the jet mass, especially at larger values of R. As such, jet mass from QCD radiation scales as the product of the jet p T and R.
Several methods have been developed to remove soft or uncorrelated radiation from jets, a procedure generally called "grooming". These methods strongly reduce the Sudakov peak structure in the jet mass distribution. The removal of the soft and uncorrelated radiation results in a much weaker dependence of the jet mass on its p T .
The t quark and W/Z/H bosons have an intrinsic mass, and the jet substructure tends to be dominated by electroweak splittings [57] at larger angles than QCD. This can be exploited to separate such jets from jets arising from heavy SM particles.
The grooming method used most often in CMS is the "modified mass drop tagger" algorithm (mMDT) [58], which is a special case of the "soft drop" (SD) method [59]. This algorithm systematically removes the soft and collinear radiation from the jet in a manner that can be theoretically calculated [60,61] (comparisons to data are found in Ref. [8]).
The first step in the SD algorithm is the reclustering of the jet constituents with the CA algorithm, and then the identification of two "subjets" within the main jet by reversing the CA clustering history. The jet is considered as the final jet if the two subjets meet the SD condition: where R 0 is the distance parameter used in jet clustering algorithm, p T1 (p T2 ) is the p T of the leading (subleading) subjet and ∆R 12 is their angular separation. The parameters z cut and β define what the algorithm considers "soft" and "collinear," respectively. The values used in CMS are z cut = 0.1 and β = 0 (making this identical to the mMDT algorithm, although for notation we still denote this as SD). If the SD condition is not met, the subleading subjet is removed and the same procedure is followed until Eq. (1) is satisfied or no further declustering can be performed.
The two subjets returned by the SD algorithm are used to calculate the jet mass. Figure 2 shows the distribution of the AK8 jet mass after applying the SD algorithm (m SD ) in simulated signal and background jets. The jet mass has been measured in data in previous papers by CMS for t-tagged [6] and QCD jets [7,8].
The m SD in background jets peaks close to zero because of the suppression of the Sudakov peak [58], whereas the m SD for signal jets peaks around the mass of the heavy SM particle (t quark, or W/Z/H bosons). In Fig. 2 (right), the peak around 80 GeV is from jets that contain just the two quarks from the W decay and not all three quarks from the t decay. Similar conclusions also hold for CA15 jets. Based on these observations, we define three regions in m SD . The "W/Z mass region" with 65 < m SD < 105 GeV, the "H mass region" with 90 < m SD < 140 GeV, and the "t mass region" with 105 < m SD < 210 GeV. These definitions will be used throughout this paper unless stated otherwise.  An additional handle to separate signal from background events is to exploit the energy distribution inside the jet. Jets resulting from the hadronic decays of a heavy particle to N separate quarks or gluons are expected to have N subjets. For two-body decays like W/Z/H, there are two subjets, while for t quarks, there are three. In contrast, jets arising from the hadronization of light quarks or gluons are expected to only have one or two (in the case of gluon splitting) subjets. The N-subjettiness variables [62,63], provide a measure of the number of subjets that can be found inside the jet. The index i refers to the jet constituents, while the ∆R terms represent the spatial distance between a given jet constituent and the subjets. The quantity d 0 is a normalization constant. The centers of hard radiation are found by applying the exclusive k T algorithm [64,65] on the jet constituents before the use of any grooming techniques. The values of the τ N variables are typically small if the jet is compatible with having N or more subjets. However, a more discriminating observable is the ratio of different τ N variables. For this purpose, the ratio τ 3 /τ 2 ≡τ 32 is used for t quark identification, whereas the ratio τ 21 is used for W/Z/H boson identification. The distribution τ 21 and τ 32 for signal and background AK8 jets is shown in Fig. 3. Measured values of these distributions at CMS can also be found for light-flavor jets in Ref.  Figure 3: Comparison of the τ 21 (left) and τ 32 (right) shape in signal and background AK8 jets. The fiducial selection on the jets is displayed in the plots. As signal jets we consider jets stemming from hadronic decays of W, Z, or H bosons (left), or t quarks (right), whereas background jets are obtained from the QCD multijet sample.
The baseline W and Z boson (collectively referred to as V boson) tagging algorithm, based on selections on m SD and τ 21 , will be labelled as "m SD + τ 21 " in this paper. The V tagging with this method is used frequently in current analyses (e.g., in Refs. [66][67][68][69]) starting at approximately 200 GeV in p T .
For t quark tagging we studied a tagger based on m SD and τ 32 , which will be referred to as "m SD + τ 32 ". An additional improvement in the performance of the t quark identification is achieved by applying the CSVv2 b tagging algorithm discussed in Section 4 on the subjets returned by the SD algorithm. In the studies presented in this paper we require at least one of the two subjets to pass the loose working point of the CSVv2 algorithm, corresponding to the b quark jet identification efficiency ∼85%, with a misidentification rate for light-flavor quarks and gluon jets of ∼10%, and ∼60% for the c quark jets. This version of the baseline t quark tagging algorithm is referred to as "m SD + τ 32 + b". Top-quark tagging with this method is used extensively in physics analyses (e.g., in Refs. [56,[70][71][72]) tagging high-p T t quarks, which start to merge into the AK8 cone at p T ∼ 350 GeV and are 50% efficient at around 600 GeV. For applications below this mass range, analyses can profit from the larger (or variable) R clustering algorithms discussed in the following sections.

Heavy object tagger with variable R
The heavy object tagger with variable R (HOTVR) [73] is a new cutoff-based algorithm for the identification of jets originating from hadronic decays of boosted heavy objects. It introduces a new jet clustering technique with a variable R and removal of soft contributions during the clustering. The clustering is similar to other standard sequential clustering algorithms such as the CA algorithm, where particles are sequentially added. However, instead of a fixed R, HOTVR uses a p T -dependent R (R HOTVR ), defined as: ( The value of ρ is chosen to correspond to a typical energy scale of the event (O(100) GeV). In the case of ρ → 0, the algorithm is identical to the CA algorithm for R = R min , whereas for ρ → ∞ it is identical to the CA algorithm for R = R max . Higher values of ρ result in larger jet sizes. The parameters R min and R max are introduced for robustness of the algorithm with respect to detector effects.
Inspired by Ref. [73], at each clustering step, the invariant mass m ij between two subjets i and j is calculated. If m ij is greater than a threshold, µ, the following condition is verified: where m i and m j are the masses of the two subjets, and θ is a parameter that determines the strength of the condition and ranges from 0 to 1. If the condition in Eq. (4) is not fulfilled, the subjet with the lower mass is discarded; otherwise depending on the relative p T difference of the subjets they are either combined into a single subjet or the softer one is discarded. The algorithm continues until no other subjet is found. The detailed description of the HOTVR algorithm is presented in Ref. [73]. Table 2 lists the values of HOTVR parameters used in CMS.
In the CMS implementation, HOTVR jets are clustered using PUPPI corrected PF candidates. The HOTVR clustering algorithm is currently being explored in CMS for t quark identification. The jets returned by HOTVR (i.e., "HOTVR jets") are required to have mass consistent with m t , namely 140 < m HOTVR < 220 GeV, and at least three subjets, N sub, HOTVR ≥ 3, the minimum pairwise mass of which should be m disub, min > 50 GeV. In addition, the p T of the hardest subjet must be less than 80% of the HOTVR jet p T . Lastly, to further improve the discrimination, τ 32 < 0.56 is required. The shape comparison of the main variables of the HOTVR algorithm for signal and background, for different parton p T ranges, is shown in Fig. 4.

Energy correlation functions
A new set of N-prong identification algorithms, the generalized energy correlation functions (ECFs) [74], are now used by the CMS Collaboration. The ECFs explore the energy distribution inside a jet by aiming to quantify the number of centers of hard radiation using an axis-free approach, differing from the axis-dependent definition used by N-subjettiness, which reduces the dependence of the observable on the jet p T . This allows the exploration of complementary information between the two techniques. For a jet containing N C particles, an ECF is defined as: where 1 ≤ i 1 < i 2 < · · · < i N ≤ N C range over the jet constituents. The symbols p i k T and p J T are the p T of the constituent i k and the p T of the jet, respectively. The notation min (m) refers to the mth smallest element, and ∆R i j ,i k is the angular distance between constituents i j and i k . The parameters N and q must be positive integers, and the exponent β must be positive as well. For a concrete example, we calculate the ECF corresponding to q = 2, N = 3, β = 1. This ECF tests the compatibility of a jet with three centers of hard radiation, but only considering the two smallest angles (q = 2): Moreover, there is the possibility to select subsets of the jet that contain large energy fractions and pairwise opening angles only if the size of the subset is less than or equal to the number of the centers of radiation in the jet. In general, a jet with N centers of radiation has e N e M , for M > N.

The ECFs for 3-prong decay identification
The ratios of type (N = 4)/(N = 3) can identify the hadronic 3-body decays, such as those of t quarks. Reference [74] proposes to use the specific ratio N 3 for this purpose: (7) Since a jet contains N C ∼ O(p T /GeV) constituents, and the sum has ( N C N ) terms, it is prohibitively expensive to compute e(N = 4) on high-p T jets. For example, about 10-15% of CA15 jets with p T ∼ 500 GeV have more than 100 particles. However, we find that these functions are dominated by the hardest particles, and therefore limiting to the 100 hardest particles makes the calculation tractable without significant performance degradation.
In our reconstruction, the ECF ratios are calculated for jets after the SD grooming is applied, which improves the stability of ECF as a function of jet mass and p T . An example of the ECF ratios is shown in the left plot of Fig. 5 for simulated t quark and QCD jets. The ECF ratios are measured in data in Ref. [9] showing reasonable agreement with the expectation from simulation. While N 3 is designed to have comparable performance with τ 32 , its dependence on p T is reduced.  (left) and the N 3 -BDT (CA15) discriminant (right) in t quarks jets (signal) and jets from QCD multijet processes (background).
Therefore, a set of ECFs is chosen based on the improvement in the performance of the t tagging algorithm, while in parallel maintaining small dependence on jet p T . Despite the fact that the terms of the ECFs are dimensionless, the angular component of ECF function is modified according to the boost of the parent particle. Hence, scale invariant ECF ratios are constructed by only considering those ratios that satisfy: Only ratios that are not highly correlated among themselves are considered for the t quark tagging algorithm, and ECF ratios that are not well described by simulation are discarded. The and the second is the f rec variable of the HEPTopTagger algorithm [75][76][77], which quantifies the difference between the reconstructed W boson and t quark masses and their expected values, and is defined as: where i, j range over the three chosen subjets, m ij is the mass of subjets i and j, and m 123 is the mass of all three subjets.
The ECF-based t quark tagger, referred to as "N 3 -BDT (CA15)", is based on a boosted decision tree (BDT) [78] with the 11 ECF ratios, the τ SD 32 , and the f rec as inputs. The N 3 -BDT (CA15) algorithm was trained using jets with 110 < m SD < 210 GeV. To avoid possible bias in the identification performance due to differences in the p T spectrum of the signal (t quarks) and background (light quarks or gluons) jets, their contributions are reweighted such that they have a flat distribution in jet p T .
Figure 5 (right) shows a comparison of the N 3 -BDT (CA15) discriminant distribution between signal and background jets. The final N 3 -BDT (CA15) algorithm also requires at least one of the two subjets returned by the SD method to be identified as a b jet by the CSVv2 algorithm using the loose working point. The ECF BDT tagger is used for t quark jet identification in the context of dark matter production in association with a single t quark in the p T > 250 GeV range [79].

The ECFs for 2-prong decay identification
The use of ECFs is also explored for the identification of 2-prong decays, such as hadronic decays of W/Z/H bosons. In this case, the signal jets have a stronger 2-point correlation than a 3-point correlation and the discriminant variable N 1 2 can be used to separate jets originating from W/Z/H bosons. The N 2 variable is constructed via the ratio and shows similar performance to N-subjettiness ratio τ 21 , with the advantage that it is more stable as a function of the jet mass and p T . This method is referred to as "m SD + N 2 ".
A decorrelation procedure is further applied to avoid distorting the jet mass distribution when a selection based on N 2 is made. We design a transformation from N 2 to N DDT 2 , where DDT stands for "designed decorrelated tagger" described in Ref. [15]. The transformation is defined as a function of the dimensionless scaling variable ρ = ln(m 2 SD /p 2 T ) and the jet p T : where N (X%) 2 is the X percentile of the N 2 distribution in simulated QCD events. This ensures that the selection N DDT 2 < 0 yields a constant QCD background efficiency of X% across the mass and p T range considered with no loss in performance. The value X = 5 is used throughout this paper, following the choice in [80]. The m SD + N DDT 2 observable was used and validated in several analyses, including the ones described in Refs. [80,81].

The double-b tagger
The standard b tagging tools, such as the CSVv2 discussed in Section 4, can be applied to the subjets returned by the SD algorithm applied to AK8 jets. Characteristic examples are the m SD + τ 32 + b and N 3 -BDT (CA15) algorithms. However, these tools have limitations in certain topologies, for example when the two subjets become very collimated. The "double-b" tagger was developed to specifically target Higgs decays to pairs of b quarks in the boosted regime [20]. While it utilizes many of the variables used in the standard CSVv2 b tagging algorithm, it also employs variables related to the track properties, such as the track impact parameter and its significance, the positions of secondary vertices, and information from the two-secondary-vertex system, among others listed in Ref. [20]. An important feature of the double-b algorithm is that it uses the N-subjettiness axes, defined in Eq. (2), for N = 2, to group the tracks to the direction of the partons giving rise to the two subjets. The double-b variables are then used as inputs to a BDT. A key feature of the double-b algorithm is that it is designed to minimize the dependence of the BDT discriminant on the jet mass and p T , thus making it suitable for other topologies such as decays of boosted Z bosons to bottom quarks [81].
The performance of the double-b tagger in simulation is detailed in Ref. [20] using H boson jets as signal, and single-b, double-b jets from gluon splitting to a pair of b quarks, and light-flavor quark or gluon jets. The H → bb identification efficiency is ∼25% (∼70%) for ∼1% (∼10%) misidentification rate [20].
The double-b tagger performance in data is studied in [20] using data in a recent inclusive search for the Higgs boson in the bb decay mode [81]. In that analysis, the Z boson was ob-served for the first time in the single-jet topology and bb decay mode, with a rate consistent within uncertainties with the SM expectation, validating the double-b tagging algorithm for the Higgs boson measurements and future searches.
The double-b tagger will serve as a reference for the performance of the new methods explored in CMS.

Boosted event shape tagger
The boosted event shape tagger (BEST) [82] is a multi-classification algorithm designed to discriminate hadronic decays of high-p T t quarks and W/Z/H bosons from jets arising from b quarks, light flavor quarks, and gluons. The original algorithm was demonstrated using generator-level particles and efficiently separated jets originating from W/Z/H bosons, t quarks, and b jets. The algorithm has been extended and deployed for use in the CMS experiment, adding an additional category to discriminate jets from light-flavor quarks and gluons.
The BEST algorithm obtains discrimination on a jet-by-jet basis by transforming the entire set of jet constituents four times, each with a different boost vector. The boost vectors are obtained by assuming the jet originating from one of the heavy objects under consideration (W/Z/H/t). The jet momentum is held constant while the mass of the jet is adjusted to the theoretical value of the corresponding particle. This results in four distributions of constituents that can be used to discriminate between particle origins. If a jet did originate from one of the hypothesized heavy objects, its jet constituents will, in general, be more isotropic in the rest frame of that particle. By examining the differences between heavy object hypotheses, discrimination is obtained between the categories of interest (W/Z/H/t/b/other).
In total, 59 quantities are used to train a neural network (NN) and classify the AK8 jets. The variables are listed in Table 3. For each boost transformation, we calculate the following observables: Fox-Wolfram moments [83]; aplanarity, sphericity, and isotropy quantities based on the eigenvalues of sphericity tensor, as defined in Ref. [84]; and jet thrust [85]. Additionally, in each boost hypothesis, AK4 subjets are clustered from the constituents and used to compute pairwise subjet masses for the leading three subjets, as well as the combined mass of the leading four subjets m 1234 . These AK4 subjets are also used to compute the longitudinal asymmetry A L , defined as the ratio of the sum of longitudinal components of the AK4 subjet momenta to the sum of the total AK4 subjet momenta. In addition to these quantities evaluated for each set of jet constituents, the m SD , rapidity, charge, τ 32 , τ 21 , and the CSVv2 discriminant for each subjet provide additional inputs for each set of boosted jet constituents.
The NN is trained with the SCIKIT-LEARN package [86] using the MLPCLASSIFIER module. The network architecture is fully connected and consists of 3 hidden layers with 40 nodes in each layer using a rectified linear unit (ReLU) [87] activation function. The six output nodes correspond to the 6 particle species of interest. We use 500 000 jets to train the network, split evenly between the 6 training samples. The training is performed using the ADAM [88] optimizer to minimize the cross entropy loss with a constant learning rate of 0.001. Cross entropy is a measure of the difference (entropy) between two probability distributions and it is used for optimizing a classification model. The BEST W/Z/H/t/b/other multi classification is currently used for tagging high-p T jets in the search for vector-like quark pair production [69].

Identification using particle-flow candidates: ImageTop
Recent studies, e.g., in Ref. [89], have shown that jet identification algorithms deploying ML methods directly on the jet constituents yield significantly improved performance compared to traditional algorithms. To this end, the "ImageTop" t quark identification algorithm was developed. The ImageTop algorithm closely follows the network framework described in Ref. [89], which is an optimization based on the DeepTop framework described in Ref. [90]. This tagging approach uses standard image recognition techniques based on two-dimensional convolutional neural networks (CNNs) to discriminate t quark jets from QCD jets. This is performed by pixelizing the jet energy deposits and define different channels based on relevant detector information. Before pixelization, the centroid of the jet is shifted to the origin and then a rotation is performed to make the major principal axis vertical. The image is then flipped along both the horizontal and vertical axes as appropriate such that the maximum intensity is in the lower-left quadrant. After this, the image intensity is normalized and the image is pixelized using 37×37 pixels with a total ∆η = ∆φ = 3.2, with channels split into neutral p T , track p T , number of muons, and of tracks as an analogue to colors used in image recognition. The network architecture uses a layer of 128 feature maps with a 4×4 kernel followed by a second convolutional layer of 64 feature maps each. Then a max-pooling layer with a 2×2 reduction factor is used, followed by two more consecutive convolutional layers with 64 features maps followed by another maxpooling layer. A zero-padding in each convolutional layer is used to correct for image-border effects. In the last pooling layer, the 64 maps are flattened into a single one that is passed into a set of three fully connected dense layers, one of 64 neurons, and two more with 256 neurons. The training is performed using the Tensorflow [91] software package using the ADADELTA optimizer [92] with a learning rate of 0.3, a minibatch size of 128, and the binary cross entropy loss function.
The tagger is modified to use the PF candidates contained in the AK8 jets as inputs, with the colors being the p T of the PF candidates for the full greyscale image, and a separate color for each PF candidate flavor, namely charged and neutral hadrons, photons, electrons, and muons. The pixelized greyscale images used in the ImageTop network for QCD and t quark jets are shown in Fig. 7. The characteristic flavor of the t quark decay is included by applying the DeepFlavor [93] b tagging algorithm to the SD subjets of the AK8 jet. The subjet b tagging outputs include the probability of the jet to originate from the following six sources: b quark, bb pair, leptonic b decays, c quark, light-flavor quark, or gluon. These output probabilities calculated for both subjets along with m SD , are used as inputs (13 in total) into a 64-neuron dense layer and merged with the previous flattened CNN layer and finally input into three fully connected layers of 256 neurons each. The factorization of the b flavor discrimination is important for the versatility of the network, allowing for the flavor identification to be easily removed or validated in parallel, which can be necessary for the validation of objects with no SM analog. The diagram of the CMS application of this NN can be seen in Fig. 8.
The training is performed for jets in the p T > 600 GeV region. To sustain the ImageTop perfor-mance over a wide range of p T (jet), the image is adaptively zoomed based on p T (jet) to account for the increased collimation of the t quark decay products at high Lorentz boosts and maintain a static pixel size. The functional form of the zoom is extracted from the average ∆R of the three generator-level hadronic t quark decay products, and the jet energy deposits are corrected to make this constant on average, as evaluated from a fit using the inverse jet p T functional form f (p T ) = 0.066 + 264/p T .
A jet p T bias is further reduced by ensuring that the input p T distributions for signal and background jets are similarly shaped by probabilistically removing QCD events based on the ratio of t quark and QCD jet p T distributions when training the nominal ImageTop tagger. The mass correlation of the tagger is reduced by additionally constraining m SD in a similar manner to define a new discriminator, which will be referred to as "ImageTop-MD". Since the inputs are relatively simple and do not exhibit secondary mass correlation, this passive approach for decorrelating the ImageTop network is sufficient to remove the mass bias in the fiducial training region (p T > 600 GeV and |η| < 2.4). This method of mass decorrelation also leads to a factorized sensitivity where the sensitivity of the full ImageTop network in the t quark mass region is closely approximated by the sensitivity of the mass-decorrelated version after including a mass selection.  Figure 7: The pixelized images used in the ImageTop network with PF candidate colors summed together ("greyscale") for QCD (left) and t quark (right) jets. The x and y axes are the pixel number, and roughly scale with ∆R. The Z axis is the intensity of the greyscale image in the given pixel, related to the PF candidate p T , and has been normalized to unity. This figure shows an ensemble of overlaid images after the image post processing; we can see clear differences between the QCD jet energy and t quark deposition patterns.

Identification using particle-flow candidates: DeepAK8
An alternative approach to exploit particle-level information directly with customized ML methods is the "DeepAK8" algorithm, a multiclass classifier for the identification of hadronically decaying particles with five main categories, W/Z/H/t/other. To increase the versatility of the algorithm, the main classes are further subdivided into the minor categories corresponding to the decay modes of each particle (e.g., Z → bb, Z → cc and Z → qq).
In the DeepAK8 algorithm, two lists of inputs are defined for each jet. The first list (the "par-  The neural network inputs are the 37x37 pixelized PF candidate p T map, which is split into colors based on the PF candidate flavor, and the DeepFlavor subjet b tags applied to both subjets. The pixelized images are sent through a two-dimensional CNN, and the subjet b tags are inputs to a dense layer. After flattening the CNN, the two networks are taken as input to three dense layers and finally to the two-node output, which is used as the top tagging discriminator. ticle" list) consists of up to 100 jet constituent particles, sorted by decreasing p T . Typically less than 5% of the jets have more than 100 reconstructed particles, therefore restricting to the 100 hardest particles results in a negligible loss of performance. Measured properties of each particle, such as the p T , the energy deposit, the charge, the angular separation between the particle and the jet axis or the subjet axes, etc., are included to help the algorithm extract features related to the substructure of the jet. For charged particles, additional information measured by the tracking detector is also included, such as the displacement and quality of the tracks, etc. These inputs are particularly useful to enable the algorithm to extract features related to the presence of heavy-flavor (b or c) quarks. In total, 42 variables are included for each particle in the "particle" list. A secondary vertex (SV) list consists of up to 7 SVs, each with 15 features, such as the SV kinematics, the displacement, and quality criteria. The SV list helps the network to extract features related to the heavy-flavor content of the jet. The elements of the SV list as sorted based on the two-dimensional impact parameter significance (S IP2D ). A significant challenge posed by the direct use of particle-level information is a substantial increase in the number of inputs. Additionally, the correlations between these inputs are of vital importance. Therefore, an algorithm that can both process the inputs efficiently and exploit the correlations effectively is required. A customized DNN architecture is thus developed in DeepAK8 to fulfill this requirement. As illustrated in Fig. 10, the architecture consists of two steps. In the first step, two one-dimensional CNNs are applied to the particle list and the SV list in parallel to transform the inputs and extract useful features. In the second step, the outputs of these CNNs are combined and processed by a simple fully connected network to perform the jet classification. The CNN structure in the first step is based on the ResNet model [94], but adapted from two-dimensional images to one-dimensional particle lists. The CNN for the particle list has 14 layers, and the one for the SV list has 10 layers. A convolution window of length 3 is used, and the number of output channels in each convolutional layer ranges between 32 to 128. The ResNet architecture allows for an efficient training of deep CNNs, thus leading to a better exploitation of the correlations between the large inputs and improving the performance. The CNNs in the first step already contain strong discriminatory ability, so the fully connected network in the second step consists of only one layer with 512 units, followed by a ReLU activation function and a Dropout [95] layer of 20% drop rate. The NN is implemented using the MXNET package [96] and trained with the ADAM optimizer to minimize the cross-entropy loss. A minibatch size of 1024 is used, and the initial learning rate is set to 0.001 and then reduced by a factor of 10 at the 10th and 20th epochs to improve convergence. The training completes after 35 epochs. A sample of 50 million jets is used, of which 80% are used for training and 20% for validation. Jets from different signal and background samples are reweighted to yield flat distributions in p T to avoid any potential bias in the training process. The DeepAK8 algorithm is designed for jets with p T > 200 GeV and typical operating regions for which the misidentification rate is greater than 0.1%.

A mass-decorrelated version of DeepAK8
As will be discussed in Section 7, background jets selected by the DeepAK8 algorithm exhibit a modified mass distribution similar to that of the signal. The mass of a jet is one of the most discriminating variables and, although it is not directly used as an input to the algorithm, the CNNs are able to extract features that are correlated to the mass to improve the discrimination power. However, such modification of the mass distribution may be undesirable (as described in Ref. [15]) if the mass variable itself is used for separating signal and background processes. Thus, an alternative DeepAK8 algorithm, "DeepAK8-MD", is developed to be largely decorrelated with the mass of a jet, while preserving the discrimination power as much as possible using an adversarial training approach [97]. Jets from different signal and background samples are also weighted to yield flat distributions in both p T and m SD to aid the training. The architecture of DeepAK8-MD is shown in Fig. 10. Compared to the nominal version of DeepAK8, a mass prediction network is added with the goal of predicting the mass of a background jet from the features extracted by the CNNs. The mass prediction network consists of 3 fully-connected layers, each with 256 units and a SELU activation function [98]. It is trained to predict the m SD of background jets to the closest 10 GeV value between 30 and 250 GeV by minimizing the cross-entropy loss. When properly trained, the mass prediction network becomes a good indicator of how strongly the features extracted by the CNNs are correlated with the mass of a jet, because the stronger the correlation is, the more accurate the mass prediction will be. With the introduction of the mass prediction network, the training target of the algorithm can be modified to include the accuracy of the mass prediction for the background jets as a penalty, therefore preventing the CNNs from extracting features that are correlated with the mass. In this way, the final prediction of the algorithm also becomes largely independent of the mass. As the features extracted by the CNNs evolve during the training process, the mass prediction network itself needs to be updated regularly to adapt to the changes of its inputs and remain as an effective indicator of mass correlation. Therefore, for each training step of the DeepAK8 network (the Particle and SV CNNs and the 1-layer fully-connected network), the mass prediction network is trained for 10 steps. Each training step corresponds to a minibatch of 6000 jets. A large minibatch size is used to reduce statistical fluctuation on the mass correlation penalty evaluated by the mass prediction network, since only background jets are used in the evaluation. Both the DeepAK8 network and the mass prediction network are trained with the ADAM optimizer. A constant learning rate of 0.001 (0.0001) is used for the training of the DeepAK8 (mass prediction) network.
Forcing the algorithm to be decorrelated with the jet mass, inevitably leads to a loss of discrimination power, and the resulting algorithm is a balance between performance and mass independence. Because the training of DeepAK8-MD is carried out only on jets with 30 < m SD < 250 GeV, jets with m SD outside this range should be removed when using DeepAK8-MD.

Performance in simulation
As presented in Section 6, a variety of algorithms have been developed by the CMS Collaboration to identify the hadronic decays of W/Z/H/ bosons and t quarks. To gain an initial understanding of the tagging performance and the complementarity between the different approaches, the algorithms were studied in simulated events. The performance of the algorithms is evaluated using the signal and background efficiencies, S and B , respectively, as a figure of merit. The efficiencies S and B are defined as: ) is the number of signal (background) jets satisfying the identification criteria of each algorithm, and N total ) is the total number of generated particles considered to be signal (background). Hadronically decaying W/Z/H bosons or t quarks are signal, whereas quarks (excluding t quarks) and gluons from the QCD multijet process are background.
First, for each algorithm, the B as a function of S is evaluated in terms of a receiver operating characteristic (ROC) curve. Figures 11-14 summarize the ROC curves of all algorithms for the identification of t quarks, and W, Z, and H bosons, respectively. The comparisons are performed at low and high values of the generated particle p T . The fiducial selection criteria applied to the generator-level particles are displayed in the plots. For the cutoff-based algorithms, namely m SD + τ 32 , m SD + τ 32 + b, m SD + τ 21 , m SD + N 2 , and m SD + N DDT 2 , all selections except the selection on τ 32 , τ 21 , or N 2 , are applied, as described in Sections 6.1 and 6.3.2.
In t tagging, the addition of the subjet b tagging in the m SD + τ 32 algorithm reduces the misidentification probability for t quarks by up to ∼50% depending on the p T . The performance of the HOTVR algorithm lies between m SD + τ 32 and m SD + τ 32 + b, and the N 3 -BDT (CA15) algorithm shows improved performance compared to these algorithms, particularly in the low-p T range. The improved performance stems from the usage of the ECFs, which provide complementary information to τ 32 . Particularly in the low-p T region, the gain is mainly due to the use of larger-cone jets (i.e., jets clustered with R = 1.5). The BEST algorithm targets the high-p T regime and shows similar performance to the ECF algorithm in this regime. The best discrimination is achieved with algorithms based on lower-level information, namely the ImageTop and DeepAK8 algorithms. ImageTop and DeepAK8-MD yield comparable performance in the low and high p T regions. The best performance in terms of ROC curves is achieved with the nominal version of DeepAK8 over the entire p T region.
Various arguments contribute to the significantly improved performance of ImageTop and DeepAK8 with respect to the other algorithms. First, the usage of lower-level variables as inputs to the network exploits the high granularity of the CMS detector. Second, the architectures of these algorithms provide quark-gluon discrimination information. Moreover, information about the jet flavor content is extracted, which is particularly important for t quark and Z/H boson identification. The flavor identification in jets from boosted object decay is very challenging because the decay products overlap and traditional b tagging algorithms perform significantly less well. The usage of the type of the PF candidates, and the secondary vertices in the case of DeepAK8, provides a more precise description of the flavor content inside the jet. Similar conclusions hold for the identification of hadronically decaying W and Z bosons. The BEST, DeepAK8, and DeepAK8-MD algorithms show enhanced performance compared with the simpler m SD + τ 21 algorithm. The gain in terms of misidentification rate can be as large as an order of magnitude in the case of DeepAK8. The smaller relative gain of DeepAK8 over BEST for discriminating between W or Zbosons, and t quarks occurs because flavor information for the W and Z bosons is not as critical as for t quarks. The m SD + N 2 and m SD + N DDT 2 show weaker performance compared with the m SD + τ 21 algorithm.
The double-b, BEST, DeepAK8, and DeepAK8-MD algorithms are used to identify hadronic decays of the H boson. In Fig. 14, the H boson decays to a pair of b quarks. The performance of the BEST algorithm lies between the double-b algorithm and DeepAK8. The gain with DeepAK8 is expected just as in t quark identification for similar arguments.
To gain a deeper understanding of the DeepAK8 performance, two alternative versions of DeepAK8 were trained using a subset of the input features. Three sets of input features were studied and compared. The "Particle (kinematics)" set consists of only the kinematic information on the PF candidates, e.g., the four-momenta and the distances to the jet and subjet axes. This set serves as a baseline to evaluate the performance using only substructure of the jets. The "Particle (w/o Flavor)" set includes additional experimental information for each PF candidate, such as the electric charge, particle identification, and track quality information. Compared with the nominal DeepAK8 algorithm, input features that contribute to the identification of heavy-flavor quarks, such as the displacement of the tracks, the association of tracks to the reconstructed vertices, and the SV features, are not included in the "Particle (w/o Flavor)" set. The performances of the three versions of DeepAK8 are compared in Fig. 15 for t quark and Z boson identification. In both cases, the addition of experimental information brings sizable  Figure 11: Comparison of the identification algorithms for hadronically decaying t quark in terms of ROC curves in two regions based on the p T of the generated particle; Left: 300 < p T < 500 GeV, and Right: 1000 < p T < 1500 GeV. Additional fiducial selection criteria applied to the jets are listed on the plots.  Figure 12: Comparison of the identification algorithms for hadronically decaying W boson in terms of ROC curves in two regions based on the p T of the generated particle; Left: 300 < p T < 500 GeV, and Right: 1000 < p T < 1500 GeV. Additional fiducial selection criteria applied to the jets are listed on the plots.  Figure 13: Comparison of the identification algorithms for hadronically decaying Z boson in terms of ROC curves in two regions based on the p T of the generated particle; Left: 300 < p T < 500 GeV, and Right: 1000 < p T < 1500 GeV. Additional fiducial selection criteria applied to the jets are listed on the plots.  Figure 14: Comparison of the identification algorithms for hadronically decaying H boson in terms of ROC curves in two regions based on the p T of the generated particle; Left: 300 < p T < 500 GeV, and Right: 1000 < p T < 1500 GeV. The H boson decays to a pair of b quarks. Additional fiducial selection criteria applied to the jets are listed on the plots. improvement in performance. Although the additional features contributing to heavy-flavor identification lead to no improvement for the identification of Z bosons decaying to a pair of light-flavor quarks, a significant improvement is observed for Z bosons decaying to a pair of b quarks, as well as t quark decays, showing the strong complementarity between heavyflavor identification and jet substructure for heavy-resonance identification where heavy-flavor quarks are involved in the decay.

Robustness of tagging algorithms
In addition to the performance of the algorithms in pure discrimination, an important ingredient is their robustness to changes in jet kinematics and data-taking conditions. To quantify this, we study the S and B of the algorithms as a function of the p T of the generated particle and the number of reconstructed vertices (N PV ) in the event. For these studies, a common working point is defined, corresponding to S = 30 (50)% for t quark (W/Z/H boson) with 500 < p T (generated particle) < 600 GeV. Working points used in CMS analyses vary from analysis to analysis, since they are optimized to achieve the best sensitivity for the targeted sig-nal processes. For example, CMS employs a t quark tagging working point at approximately 40% signal efficiency in the search for BSM tt production [56], a W tagging working point at approximately 20% signal efficiency in the search for BSM diboson production [66], and an H tagging working point at approximately 30% signal efficiency in the search for H boson pair production [99].
The distributions of the S and B as a function of p T of the generated particle for the different particle identification scenarios are displayed in Figs. 16 and 17, respectively. In the low-p T range for the t tagging case, the S for the algorithms using AK8 jets increases rapidly until p T 600 GeV, where a large fraction of jets contain all the t decay products. As expected, the N 3 -BDT (CA15) and HOTVR algorithms have a stable S as a function of the generator-level particle p T . Similar behavior is observed for the t quark misidentification rate.
In the case of the W and Z boson tagging, the S for the m SD + τ 21 algorithm decreases as a function of p T (generated particle), whereas the BEST, DeepAK8, and DeepAK8-MD algorithms exhibit improvements in S as a function of p T (generated particle). The drop in S for m SD + τ 21 is a result of the correlation between m SD + τ 21 and the jet p T , leading to a shift in the jet mass distribution to higher values. The m SD + N 2 algorithm shows similar behavior to BEST and DeepAK8 algorithms, whereas the S in the case of m SD + N DDT 2 is stable as a function of p T (generated particle). In contrast to N-subjettiness, the ECF observable uses an axis-free approach, which is more efficient in the case of highly collimated decay products.
The misidentification rate has a nontrivial behavior for most algorithms. In the case of DeepAK8 and DeepAK8-MD the B value decreases with p T (generated particle), which is mainly a result of the use of low-level features as inputs to the algorithm. For m SD + N 2 , the B increases with p T (generated particle), whereas for m SD + N DDT 2 , it is, by design, significantly more stable. In the case of m SD + τ 21 , the decrease of B as a function of p T (generated particle) is mainly caused by the strong shift of the m SD shape of the background jets to larger values as a result of the selection on τ 21 . This will be discussed in more detail in Section 7.2. Finally, for the BEST, the B decreases up to p T (generated particle) ∼ 1000 GeV, and then increases again. This is a feature of the training of the BEST algorithm, stemming from an imbalance in the relative fraction of jets between the low-and high-p T regimes.
In the case of H tagging, the BEST and DeepAK8 algorithms have stable S for p T (generated particle) 600 GeV, whereas for the double-b algorithm the S starts to decrease around this p T regime. There are two main reasons for this behavior. First, the double-b algorithm exploits axis-dependent observables, similar to τ 21 , which are less efficient at high p T where the decay products become highly collimated. Second, the selection on the tracks used to construct the variables used for the training of the double-b algorithm, discussed in Section 6.4, is suboptimal in the very high-p T regime. The efficiency B for both double-b and DeepAK8 decreases as a function of p T (generated particle), whereas for BEST it shows a modest increase for p T (generated particle) 1000 GeV, for the same reasons as in the W and Z boson tagging case.
The dependence of the algorithms on N PV is also examined using simulated events. Figure 18 shows the distribution of S , and Fig. 19 that of B , as a function of N PV for generated particles with 500 < p T < 1000 GeV, operating at a working point with S = 30 (50)% for t quark (W/Z/H boson) identification as defined above. The algorithms make use of jets that employ PUPPI for pileup mitigation, which results in a roughly constant S and B for all different pileup scenarios.

Correlation with jet mass
A set of studies was performed to understand the correlation of the algorithms with the jet mass. This understanding benefits from the theoretical progress made in jet substructure studies [3], which can result in reduced systematic uncertainties [15]. The jet mass is one of the most discriminating variables, and many analyses require a smoothly falling background jet mass spectrum under a signal peak (e.g., in Ref. [100]). Figure 20 displays the normalized m SD distribution for jets obtained from the QCD multijet sample, inclusively and after applying a selection with each algorithm. The working point chosen corresponds to S = 30 (50)% for t quark (W/Z/H boson). The results are shown for one region of the generated particle p T distribution, but similar behavior is seen for other p T regions as well. By design, the BEST and the nominal version of the DeepAK8 algorithms lead to significant sculpting of the background jet mass shape, but this does not affect analyses unless the jet mass distribution is explicitly used in signal extraction, e.g., Ref. [101]. An alternative way of presenting the sculpting of the background jet mass introduced by each tagging algorithm is displayed in Fig. 21. To quantify the level of mass sculpting we use the Jensen-Shannon divergence (JSD) [102], which is a symmetrized version of the Kullback-Leibler divergence (KLD) [103], and provides a metric for the similarity of the shape between distributions. The KLD is defined as: where P(i) and Q(i) are the normalized mass distributions of the background jets that fail and pass a selection with a given algorithm, respectively, and the symbol || represents the divergence of P from Q. The index i runs over the bins of the distributions.
The JSD metric is defined as: Lower values of JSD indicate larger similarity between the mass distributions of jets passing and failing a selection on a given algorithm.
In our studies, the jet mass distributions lay between 30 and 300 GeV with a bin size of 10 GeV.
The JSD values for successively tighter selections (expressed in terms of decreasing B ) for the various t quark and W boson tagging algorithms are shown in Fig. 22. The best decorrelation for the t tagging cases is achieved with the DeepAK8-MD algorithm, which exploits an adversarial network to reduce the correlation of the tagging score with the jet mass. For W tagging, m SD + N DDT 2 and DeepAK8-MD achieve similar levels of mass decorrelation. As expected, tighter selection on the tagging score results in an increase of the mass sculpting. A similar behavior is observed for all algorithms.
The robustness of the mass decorrelation techniques is further studied as a function of jet p T and N PV . These studies are performed for a working point corresponding to S = 30 (50)% for t (W) tagging. Figure 23 shows the JSD values as a function of the jet p T for jets from QCD multijet events. Most algorithms show modest dependence on jet p T , except for ImageTop-MD, where the mass dependence increases rapidly when p T 600 GeV as the training was  only performed for jets with p T > 600 GeV. The DeepAK8-MD and m SD + N DDT 2 algorithms for W tagging also show modestly increased mass dependence in the p T range of 1200 and 1600 GeV, respectively. The dependence of the mass mitigation techniques on N PV was also studied and was small.

Performance in data and systematic uncertainties
In this section, the validation of the algorithms using data is presented. The validation is performed in two steps. In the first step, we focus on studying the overall modeling of key variables in simulation and their agreement with data, as well as the dependence on the simulation details. The second step is to use these results to extract corrections to the simulation so that the algorithms perform similarly in simulation and data. Differences in the performance between data and simulation are corrected by scale factors (SF) extracted by comparing the efficiencies in data and simulation. To account for effects not captured in the SF, multiple sources of systematic uncertainties are considered. The data and simulated samples used for these studies are described in Section 5.
In this paper, we focus on the calibration of the t quark and W/Z boson tagging algorithms. The calibration of tagging algorithms where Z and H bosons decay to a pair of bottom or charm quarks requires alternative methods that go beyond the scope of this paper. Since it is challenging to obtain a pure Z or H boson sample, the calibration of such taggers relies on the use of a proxy jet, i.e., a jet obtained in the dijet sample with characteristics similar to signal jets. Data-to-simulation correction factors are extracted based on these proxy jets, which are then applied to signal jets. Therefore, the proxy jets should be selected to have similar characteristics to the signal jets. To this end, jets arising from gluon splitting to bb or cc are used as proxy jets from a sample dominated by QCD multijet events. Such approaches have been followed in Refs. [20,81,104].

Systematic uncertainties
A number of sources of systematic effects can affect the modeling of the performance of the algorithms in data by the simulation. These include systematic uncertainties in the parton showering model, renormalization and factorization scales, PDFs, jet energy scale and resolution, p miss T unclustered energy, trigger and lepton identification, pileup modeling, and integrated luminosity, as well as statistical uncertainties of simulated samples.
Parton shower uncertainties for signal jets are evaluated using samples with the same event generators but a different choice for the modeling of the parton showering. For background jets, a sample produced using an alternative generator for both the hard scattering and the parton shower is used. The details of the samples are discussed in Section 3. Changes in renormalization (µ R ) and factorization (µ F ) scales are estimated by varying the scales separately by a factor of two up and down, relative to the choices of the scale values used in the sample generation. The uncertainty related to the choice of the PDFs is obtained from the standard deviation in 100 replicas of the NNPDF3.0 PDF set [31]. The jet energy scale and resolution are changed within their p T -and η-dependent uncertainties, based on the studies presented in Ref. [51]. Their effects are also propagated to p miss T . The effect of the uncertainty in the measurement of the unclustered energy (i.e., contribution of PF candidates not associated to any of the physics objects) is evaluated based on the momentum resolution of each PF candidate, which depends on the type of the candidate [52]. Uncertainties in the measurement of the trigger efficiency and in the energy/momentum scale and resolution of the leptons are propagated in the SF extraction. The uncertainty in the pileup weighting procedure is determined by varying   the minimum bias cross section used to produce the pileup profile by ±5% from the measured central value of 69.2 mb [105,106]. The limited size of the simulated samples and the size of the data control samples are also considered.
The uncertainties described above contribute in different ways to the modeling of jet kinematics and the extraction of SF. Because many of the algorithms detailed in this paper use jet substructure and jet constituent information, either directly or as input to multivariate techniques, the uncertainties in the choice of parton shower are significant. Different parton showers directly affect the number, momentum, and distribution of jet constituents, influencing the observables used as inputs to the multivariate techniques, and eventually propagating to the outputs of those algorithms. The magnitude of this source of systematic uncertainty is from 10-30%. The uncertainty in the value of µ R and µ F chosen for event generation also has a sizable impact (5-15%), because this changes the amount of radiation that can enter into a reconstructed jet. These dominant components contribute a total combined uncertainty of 10-50%, depending on the specific jet kinematics of interest.
Additional sources of systematic uncertainties, with smaller impact, are also considered. For example, the trigger and lepton identification uncertainties are a few percent, and do not include uncertainties in the kinematic distributions. The identification of leptons, especially muons, is nearly fully efficient, and the trigger is selected to ensure full efficiency in the regime of interest. The jet energy scale and resolution uncertainties are similar, including shape components, and are between 1 and 5% for the high-p T jets studied here. Uncertainties related to pileup modeling and the integrated luminosity measurement have an effect smaller than 3%.
These uncertainties partially cancel in the SF measurement, as will be discussed in Section 8.4.

The t quark and W boson identification performance in data
The single-µ event selection discussed in Section 5.1 provides a sample dominated by semileptonic tt events. One of the t quarks decays to a Wboson that decays leptonically (to pass the selection), and the other provides a hadronic decay to be used in validating the algorithms.
To study possible dependence of the tagging efficiency on the parton showering scheme, we consider two alternative simulated tt samples. As discussed in Section 3, both samples are generated with the same generator (i.e., POWHEG), but one uses PYTHIA for the modeling of the parton showering, whereas the other uses HERWIG++. The total SM expectation from simulation using the latter tt sample will be referred to as "SM (Herwig)". As we will see, the choice of the parton showering generator has only a small impact on the overall agreement between data and simulation in signal jets.
To account for the differences in the design of the algorithms, the large-R jets discussed in Section 5.1 are either AK8, CA15, or HOTVR jets. For brevity we focus mainly on results using AK8 jets, unless otherwise stated, but similar conclusions can be drawn from all three jet collections.
The data-to-simulation comparisons of basic jet kinematic and substructure variables: p T (jet), m SD , the N-subjettiness ratios τ 32 and τ 21 , and the N 2 and N DDT 2 , are shown in Fig. 24. Figure 25 displays the main observables of the HOTVR algorithm, m HOTVR , m min,HOTVR and N sub,HOTVR , in data and simulation. The next set of comparisons includes tagging algorithms that are based on high-level jet substructure observables and explore ML techniques to improve performance, namely the BEST and the N 3 -BDT (CA15) algorithms. Figure 26 shows the t quark and W boson identification probabilities of BEST and the t tagging discriminant for the N 3 -BDT (CA15), in data and simulation. The last set of comparisons is related to the ImageTop and the DeepAK8 algorithms, which both explore lower-level observables. Figure 27 displays the distributions of the t quark identification probability for the two versions of ImageTop, and the t quark and W boson identification probabilities for DeepAK8 algorithms.
Because the selection applied to events shown in Figs. 24-27 results in a sample with low purity of fully merged t quark decay products, we also study the same distributions after applying a tighter requirement on the jet momenta: p T > 500 GeV. This selection results in a sample consisting of a higher fraction of fully merged t quark jets, relative to the jet component from the decay of a boosted W boson jet. Figures 28-31 show the same distributions for this high-p T selection.
The total background yield is normalized to the observed number of data events. The systematic uncertainties discussed in Section 8.1 are also considered and are shown via the shaded dark-grey band in the figures. Overall, the shapes in data are compatible with the expectation from simulation within uncertainties for all the algorithms.

Misidentification probability in data
The misidentification probability of the algorithms is studied in the dijet and single-γ data samples. The two samples differ in the relative fraction of light-flavor quarks and gluons in the final state. To study the dependence of the misidentification probability on the choice of the event generator and the parton showering scheme, we consider two different simulated samples to model the QCD multijet background. The nominal sample uses MADGRAPH for the event generation and PYTHIA (P8) for the parton showering and hadronization, whereas the alternative sample uses HERWIG++ for event generation and the modeling of the parton showering. More information on the generation of these samples is discussed in Section 3. The total SM contribution estimated using the HERWIG++ QCD multijet sample is referred to as "SM (Herwig)". As in Section 8.2, we will focus on results using jets with R = 0.8, unless otherwise stated. To account for possible differences in the p T distribution of the QCD multijet and γ+jet simulated events, the total background yield is weighted to match the p T distribution in data, following the procedure discussed in Section 3.
The distributions of m SD , jet p T , the N-subjettiness ratios τ 32 and τ 21 , and the N 2 and N DDT 2 , in the dijet sample are displayed in Fig. 32. For this event selection, the shapes of the m SD and the N-subjettiness ratios are described well by simulation, whereas there is disagreement between data and simulation for high values of N 2 and N DDT 2 . A better description of the data, particularly for N DDT 2 , is achieved with the HERWIG++ QCD multijet sample, which hints that the disagreement is related to the description of the parton shower. For the other observables we observe similar level of agreement between the two generators.
The same set of variables is presented in Fig. 33 for the single-γ sample. From previous measurements [8], the m SD agrees very well with simulation except at low masses. The modeling of the N-subjettiness and N 2 ratios is poorer in the single-γ sample. Figures 34 and 35 show the distribution of the main observables of the HOTVR algorithm, namely m HOTVR , m min,HOTVR , and N sub,HOTVR , in data and simulation, in the dijet and single-γ samples, respectively. In both samples, m HOTVR and m min,HOTVR show good agreement between data and simulation. The N sub,HOTVR distribution in data is softer than in simulation. Similar conclusions hold using HERWIG++ to simulate the QCD multijet events. The difference is more pronounced in the single-γ sample. The N sub,HOTVR is particularly sensitive to the precise modeling of the parton showering.
The distribution of the t quark and W boson identification probabilities for BEST and the t quark tagging discriminant for the N 3 -BDT (CA15) algorithm in the dijet sample are presented   ), and N sub,HOTVR (lower right) in data and simulation in the single-µ signal sample. The pink line corresponds to the simulation distribution obtained using the alternative tt sample. The background event yield is normalized to the total observed data yield. The lower panel shows the data to simulation ratio. The solid dark-gray (shaded light-gray) band corresponds to the total uncertainty (statistical uncertainty of the simulated samples), the pink line to the data to simulation ratio using the alternative tt sample, and the vertical black lines correspond to the statistical uncertainty of the data. The vertical pink lines correspond to the statistical uncertainty of the alternative tt sample. The distributions are weighted according to the top quark p T weighting procedure described in the text.  Figure 26: Distribution of the t quark (upper left) and W boson (upper right) identification probabilities for the BEST algorithm, and the N 3 -BDT (CA15) discriminant in data and simulation in the single-µ signal sample. The pink line corresponds to the simulation distribution obtained using the alternative tt sample. The background event yield is normalized to the total observed data yield. The lower panel shows the data to simulation ratio. The solid dark-gray (shaded light-gray) band corresponds to the total uncertainty (statistical uncertainty of the simulated samples), the pink line to the data to simulation ratio using the alternative tt sample, and the vertical black lines correspond to the statistical uncertainty of the data. The vertical pink lines correspond to the statistical uncertainty of the alternative tt sample. The distributions are weighted according to the top quark p T weighting procedure described in the text.  Figure 27: Distribution of the ImageTop (upper left) and ImageTop-MD (upper right) discriminant in data and simulation in the single-µ sample. The plots in the middle row show the t quark (left) and W boson (right) identification probabilities in data and simulation for the DeepAK8 algorithm. The corresponding plots for DeepAK8-MD are displayed in the lower row. The pink line corresponds to the simulation distribution obtained using the alternative tt sample. The background event yield is normalized to the total observed data yield. The lower panel shows the data to simulation ratio. The solid dark-gray (shaded light-gray) band corresponds to the total uncertainty (statistical uncertainty of the simulated samples), the pink line to the data to simulation ratio using the alternative tt sample, and the vertical black lines correspond to the statistical uncertainty of the data. The vertical pink lines correspond to the statistical uncertainty of the alternative tt sample. The distributions are weighted according to the top quark p T weighting procedure described in the text. (lower right) in data and simulation in the single-µ signal sample after applying a jet momentum cut p T > 500 GeV. The pink line corresponds to the simulation distribution obtained using the alternative tt sample. The background event yield is normalized to the total observed data yield. The lower panel shows the data to simulation ratio. The solid dark-gray (shaded light-gray) band corresponds to the total uncertainty (statistical uncertainty of the simulated samples), the pink line to the data to simulation ratio using the alternative tt sample, and the vertical black lines correspond to the statistical uncertainty of the data. The vertical pink lines correspond to the statistical uncertainty of the alternative tt sample. The distributions are weighted according to the top quark p T weighting procedure described in the text. The background event yield is normalized to the total observed data yield. The lower panel shows the data to simulation ratio. The solid dark-gray (shaded light-gray) band corresponds to the total uncertainty (statistical uncertainty of the simulated samples), the pink line to the data to simulation ratio using the alternative tt sample, and the vertical black lines correspond to the statistical uncertainty of the data. The vertical pink lines correspond to the statistical uncertainty of the alternative tt sample. The distributions are weighted according to the top quark p T weighting procedure described in the text.  Figure 30: Distribution of the t quark (upper left) and W boson (upper right) identification probabilities for the BEST algorithm, and the N 3 -BDT (CA15) discriminant in data and simulation in the single-µ signal sample, after applying a jet momentum cut p T > 500 GeV. The pink line corresponds to the simulation distribution obtained using the alternative tt sample. The background event yield is normalized to the total observed data yield. The lower panel shows the data to simulation ratio. The solid dark-gray (shaded light-gray) band corresponds to the total uncertainty (statistical uncertainty of the simulated samples), the pink line to the data to simulation ratio using the alternative tt sample, and the vertical black lines correspond to the statistical uncertainty of the data. The vertical pink lines correspond to the statistical uncertainty of the alternative tt sample. The distributions are weighted according to the top quark p T weighting procedure described in the text. and W boson (right) identification probabilities in data and simulation for the DeepAK8 algorithm after applying a jet momentum cut p T > 500 GeV. The corresponding plots for DeepAK8-MD are displayed in the lower row. The pink line corresponds to the simulation distribution obtained using the alternative tt sample. The background event yield is normalized to the total observed data yield. The lower panel shows the data to simulation ratio. The solid dark-gray (shaded light-gray) band corresponds to the total uncertainty (statistical uncertainty of the simulated samples), the pink line to the data to simulation ratio using the alternative tt sample, and the vertical black lines correspond to the statistical uncertainty of the data. The vertical pink lines correspond to the statistical uncertainty of the alternative tt sample. The distributions are weighted according to the top quark p T weighting procedure described in the text.
in Fig. 36, and the equivalent plots for the single-γ selection are shown in Fig. 37. In both samples the agreement between data and simulation is reasonable. Some disagreement is observed in the very high values ( 0.95) for the t quark identification probability of the BEST algorithm in the single-γ sample. The disagreement is observed in the region of the t quark probability greater than 0.95, which is significantly higher than the recommended operating points. Some disagreement is observed between the nominal QCD multijet simulated sample and the alternative sample for large values of the W boson probability of the BEST algorithm, with the nominal sample showing better agreement with the data.
The distributions of the ImageTop and DeepAK8 discriminants are shown in Figs. 38 and 39 for jets in the dijet and single-γ samples, respectively. The overall agreement between data and simulation in the single-γ is better than in the dijet sample. Moreover, the discrepancy in the shape is mainly observed at the very low values of the discriminant and is more enhanced in the t tagging case. The dijet sample is dominated by jets initiated by gluons, especially at low values of the discriminant. In addition, ImageTop and DeepAK8 are very sensitive to mismodeling of quarks or gluons in the simulation, and so exhibit more sample dependence. QCD multijet events simulated using HERWIG++ generally show better agreement with the data.

Corrections to simulation
The measurement of the t quark and W boson tagging efficiencies in data are performed in the single-µ sample using a "tag-and-probe" method [107]. The muon, in combination with the b-tagged jet, is used as the "tag". In the opposite hemisphere of the event, the jet is considered as the "probe jet".
The total SM sample is decomposed into three categories based on the spatial separation of the partons from the t quark decay with respect to the AK8 jet, following the discussion in Section 4. The "Merged t quark" category includes cases where the three partons and the jet have ∆R < 0.6. The "Merged W boson" category includes cases where only the two partons from the W boson decay are within ∆R < 0.6 of the jet and the b quark from the top quark decay is outside the jet cone. Any other topology falls in the "Nonmerged" category. In the cases of the HOTVR and N 3 -BDT (CA15) algorithms, the matching requirement is adjusted from 0.6 to 1.2.
The jet mass distributions in simulation of each one of the three categories are used to derive templates to fit the jet mass distribution in data. For a given working point, the fit is done for all three categories simultaneously for both the "passing" and "failing" events. The fit is performed in the range from 50 to 250 GeV with a bin width of 10 GeV. The sources of systematic uncertainties discussed in Section 8.1 are considered and are treated as nuisance parameters in the fit. After calculating the efficiencies in data ( Data ) and simulation ( Simulation ), the SF is determined as the ratio of Data over Simulation .
The SFs are extracted differentially in jet p T for the t quark and W boson tagging working points discussed in Section 7.1. For the case of t quark identification the following exclusive jet p T regions are considered: 300-400, 400-480, 480-600, and 600-1200 GeV. To increase the purity of "Merged W boson" candidates, we consider regions with lower jet p T : 200-300, 300-400, 400-550, and 550-800 GeV. The effects of the systematic sources discussed in Section 8.1 are propagated to uncertainties in the SF. The m SD distributions after performing the maximum likelihood fit for data and simulation in the passing and failing categories for DeepAK8-MD for 400 < p T < 480 GeV are displayed in Fig. 40. (lower right) in data and simulation in the dijet sample. The pink line corresponds to the simulation distribution obtained using the alternative QCD multijet sample. The background event yield is normalized to the total observed data yield. The lower panel shows the data to simulation ratio. The solid dark-gray (shaded light-gray) band corresponds to the total uncertainty (statistical uncertainty of the simulated samples), the pink line to the data to simulation ratio using the alternative QCD multijet sample, and the vertical black lines correspond to the statistical uncertainty of the data. The vertical pink lines correspond to the statistical uncertainty of the alternative QCD multijet sample. The distributions are weighted so that the jet p T distribution of the simulation matches the data. (lower right) in data and simulation in the single-γ sample. The background event yield is normalized to the total observed data yield. The lower panel shows the data to simulation ratio. The solid dark-gray (shaded light-gray) band corresponds to the total uncertainty (statistical uncertainty of the simulated samples), and the vertical lines correspond to the statistical uncertainty of the data. The distributions are weighted so that the jet p T distribution of the simulation matches the data. The pink line corresponds to the simulation distribution obtained using the alternative QCD multijet sample. The background event yield is normalized to the total observed data yield. The lower panel shows the data to simulation ratio. The solid dark-gray (shaded light-gray) band corresponds to the total uncertainty (statistical uncertainty of the simulated samples), the pink line to the data to simulation ratio using the alternative QCD multijet sample, and the vertical black lines correspond to the statistical uncertainty of the data. The vertical pink lines correspond to the statistical uncertainty of the alternative QCD multijet sample. The distributions are weighted so that the jet p T distribution of the simulation matches the data.  Figure 36: Distribution of the t quark (upper left) and W boson (upper right) identification probabilities for the BEST algorithm, and the N 3 -BDT (CA15) discriminant in data and simulation in the dijet sample. The background event yield is normalized to the total observed data yield. The pink line corresponds to the simulation distribution obtained using the alternative QCD multijet sample. The background event yield is normalized to the total observed data yield. The lower panel shows the data to simulation ratio. The solid dark-gray (shaded light-gray) band corresponds to the total uncertainty (statistical uncertainty of the simulated samples), the pink line to the data to simulation ratio using the alternative QCD multijet sample, and the vertical black lines correspond to the statistical uncertainty of the data. The vertical pink lines correspond to the statistical uncertainty of the alternative QCD multijet sample. The distributions are weighted so that the jet p T distribution of the simulation matches the data.  Figure 37: Distribution of the t quark (upper left) and W boson (upper right) identification probabilities for the BEST algorithm, and the N 3 -BDT (CA15) discriminant in data and simulation in the single-γ sample. The background event yield is normalized to the total observed data yield. The background event yield is normalized to the total observed data yield. The lower panel shows the data to simulation ratio. The solid dark-gray (shaded light-gray) band corresponds to the total uncertainty (statistical uncertainty of the simulated samples), and the vertical lines correspond to the statistical uncertainty of the data. The distributions are weighted so that the jet p T distribution of the simulation matches the data.  Figure 38: Distribution of the ImageTop (upper left) and ImageTop-MD (upper right) discriminant in data and simulation in the dijet sample. The plots in the middle row show the t quark (left) and W boson (right) identification probabilities in data and simulation for the DeepAK8 algorithm. The corresponding plots for DeepAK8-MD are displayed in the lower row. The pink line corresponds to the simulation distribution obtained using the alternative QCD multijet sample. The background event yield is normalized to the total observed data yield. The lower panel shows the data to simulation ratio. The solid dark-gray (shaded light-gray) band corresponds to the total uncertainty (statistical uncertainty of the simulated samples), the pink line to the data to simulation ratio using the alternative QCD multijet sample, and the vertical black lines correspond to the statistical uncertainty of the data. The vertical pink lines correspond to the statistical uncertainty of the alternative QCD multijet sample. The distributions are weighted so that the jet p T distribution of the simulation matches the data.  Figure 39: Distribution of the ImageTop (upper left) and ImageTop-MD (upper right) discriminant in data and simulation in the single-γ sample. The plots in the middle row show the t quark (left) and W boson (right) identification probabilities in data and simulation for the DeepAK8 algorithm. The corresponding plots for DeepAK8-MD are displayed in the lower row. The background event yield is normalized to the total observed data yield. The lower panel shows the data to simulation ratio. The solid dark-gray (shaded light-gray) band corresponds to the total uncertainty (statistical uncertainty of the simulated samples), and the vertical lines correspond to the statistical uncertainty of the data. The distributions are weighted so that the jet p T distribution of the simulation matches the data. Data / Post-fit Figure 40: The m SD distribution in data and simulation in the passing (left) and failing (right) categories for DeepAK8-MD for the jet p T in the 400-800 GeV range. The solid lines correspond to the contribution of each category after performing the maximum likelihood fit as described in the text. The dashed lines are the expectation from simulation before the fit. The lower panel shows the data to simulation ratio. The vertical black lines correspond to the total uncertainty, including the statistical uncertainty of the data, after the fit.
The SFs measured for each of the t quark and W boson identification algorithms are summarized in Fig. 41. The SFs are typically consistent with unity, within the uncertainties. The largest SF is measured for the identification of t quarks using DeepAK8-MD. The statistical and parton shower uncertainties dominate the SF measurement. The algorithms designed to avoid strong dependence on the mass, such as the DeepAK8-MD, have typically smaller uncertainties than the other algorithms. The effect of systematic uncertainties is more pronounced in algorithms that utilize a larger set of observables to increase discrimination power. These algorithms (i.e., BEST, ImageTop, and DeepAK8) are more sensitive to the simulation details. The features are more evident in the W boson case, due to the larger sample size of the "Merged W boson" category compared to the "Merged t quark" category, which allows for more precise comparisons due to increased number of events.
The misidentification rate as a function of the jet p T , is displayed in Figs. 42 and 43 for the t and W tagging algorithms. To study the dependence of the misidentification probability on the matrix element generator, and on the parton showering, we use an additional simulation sample for the QCD multijet background, which uses HERWIG++ for both the hard scattering generation and parton showering. In some cases, the misidentification probabilities show a significant dependence (up to ∼25%) on the simulation details, particularly for the ImageTop and DeepAK8 algorithms. The main source of this dependence is the description of gluon content; these are the only algorithms that have access to quark-gluon separation to improve the performance. Differences in the quark/gluon content can have large effects on the uncertainties.
The misidentification probability is also studied in the single-γ sample. Overall the performance in data and simulation agrees better in this sample than in the dijet sample. This can be attributed to the fact that the single-γ sample has a larger fraction of light-flavor quarks, which are better modeled in simulation [18].  Figure 41: Summary of the scale factors (SF) measured for each of the t quark (upper) and W boson (lower) identification algorithms. The markers correspond to the SF value, the error bars to the statistical uncertainty on the SF measurement, and the band is the total uncertainty, including the systematic component.

Summary
A review of the heavy-object tagging methods recently developed in CMS has been presented. The variety of tagging strategies is diverse, including algorithms based on more traditional theory-inspired high-level per-jet observables with and without multivariate techniques, as well as methods based on lower-level information from individual particles. New tagging approaches, such as the Energy Correlation Functions (ECF) tagger and the Boosted Event Shape Tagger (BEST), utilize multivariate methods (i.e., boosted decision trees and deep neural networks) on physically motivated high-level observables and attain enhanced performance. Two novel tagging algorithms, ImageTop and DeepAK8, are developed based on candidate-level information, allowing the exploitation of more information, where lower-level information is processed using advanced machine-learning methods. Moreover, the BEST and DeepAK8 algorithms are developed to provide multi-class tagging capabilities. Finally, dedicated versions of the algorithms that are only weakly correlated with the jet mass are developed. Such tools are particularly important for analyses that rely on the jet mass sidebands to estimate the background contribution under the heavy resonance mass. The mass-decorrelated algorithms (m SD + N DDT 2 , ImageTop-MD, and DeepAK8-MD) typically show weaker discriminating power  Figure 42: The ratio of the misidentification rate of t quarks in data and simulation in the dijet (upper and middle rows) and the single-γ (lower row) samples. The QCD multijet process is simulated using MADGRAPH for the hard process and PYTHIA for parton showering (upper) and HERWIG++ for both (middle). The vertical lines correspond to the statistical uncertainty of the data and the simulated samples.  Figure 43: The ratio of the misidentification rate of W bosons in data and simulation in the dijet (upper and middle rows) and the single-γ (lower row) samples. The QCD multijet process is simulated using MADGRAPH for the hard process and PYTHIA for parton showering (upper) and HERWIG++ for both (middle). The vertical lines correspond to the statistical uncertainty of the data and the simulated samples. than their counterparts. However, they can yield better sensitivity in some physics analyses because of smaller uncertainties in background estimations.
The performances of the various tagging algorithms are directly compared using simulation in a jet p T range from 200 to 2000 GeV. Overall, the application of machine-learning techniques for jet tagging shows strong improvement compared to cutoff-based methods. The approaches based on low-level information yield the best performance, with as much as an order of magnitude gain in background rejection for the same signal efficiency. Another important aspect essential for the application of the new techniques in physics analysis is the systematic uncertainties associated to each algorithm. Those based on low-level features and advanced machine-learning techniques are typically prone to larger systematic uncertainties. However, these uncertainties are usually small enough to preserve the significant improvements observed. The techniques have also been validated in collision data, with scale factors extracted, including systematic uncertainties. The performances of these tagging algorithms are in good agreement between data and simulation. [5] CMS Collaboration, "Measurement of the integrated and differential tt production cross sections for high-p t top quarks in pp collisions at √ s = 8 TeV", Phys. Rev. D 94 (2016) 072002, doi:10.1103/PhysRevD.94.072002, arXiv:1605.00116.