A new tagger for hadronically decaying heavy particles at the LHC

A new algorithm for the identification of boosted, hadronically decaying, heavy particles at the LHC is presented. The algorithm is based on the known procedure of jet clustering with variable distance parameter R and adapts the jet size to its transverse momentum p_T. Subjets are found using a mass jump condition. The resulting algorithm, called Heavy Object Tagger with Variable R (HOTVR), features little algorithmic complexity and combines jet clustering, subjet finding and rejection of soft clusters in one sequence. While the HOTVR algorithm can be used for the identification of any heavy object decaying hadronically, e.g. W, Z, H, t, or possible new heavy resonances, this paper targets specifically the tagging of boosted top quarks. The studies presented here demonstrate a stable performance of the HOTVR algorithm in a wide range of top quark p_T, from low p_T, where the decay products can be resolved, to the region of boosted decays at high p_T.


Introduction
The identification of hadronically decaying heavy standard model (SM) particles (W, Z, H, t) is an important ingredient in an increasing number of SM analyses and searches for new physics at the LHC. For a particle with high energy, the large Lorentz factor leads to decay products which are collimated in the laboratory rest frame and result in a single jet. The task of separating these decays from the vast amount of background from QCD multijet production has been approached with a variety of jet substructure developments in recent years [1][2][3][4][5][6][7]. The techniques face the challenge of a stable performance in significantly different kinematic regimes: from the region of low transverse momentum p_T, where the decay products can be resolved, to the boosted regime of high p_T.
e-mail: roman.kogler@physik.uni-hamburg.de
The existing algorithms can be classified into two approaches. The bottom-up approach extrapolates from the resolved into the boosted regime by successively combining small radius jets, similar to an exclusive jet clustering (e.g. the JADE algorithm [8,9]). Modern algorithms have been devised for the task of heavy object identification; examples are the collection of jets in buckets [10,11] or the recently proposed XCone algorithm [12,13]. These methods combine manageable complexity with promising performance, but have not been studied in experimental analyses so far. Most algorithms follow the top-down approach, which starts from large radius jets and applies subsequent declustering steps. These algorithms are based on jet clustering with a fixed distance parameter R, where jet grooming methods like filtering [1], pruning [14], trimming [15] or soft drop [16] are used to remove soft radiation and contributions from the underlying event such that substructure observables like the jet mass reflect the hard underlying process.
Alternatively the variable R jet algorithm [17] can be used to dynamically reduce the jet distance parameter with increasing p T of the decaying particle. The algorithm was used in studies of new heavy resonances decaying to final states with two and four gluons [15], and also top quark, W and Higgs decays at LHC energies [18], and a similar algorithm at energies of a future hadron collider [19]. Additional substructure information like the k t splitting scales [20], N-subjettiness [21][22][23], energy correlation functions [24] or Qjets [25] are often used to further improve the performance of substructure algorithms. Combinations of these methods are used for the tagging of top quarks [26][27][28][29][30], where also more theoretically motivated taggers have been proposed [31,32]. The ATLAS and CMS collaborations have commissioned a number of the techniques mentioned above and studied their behaviour [33][34][35][36][37][38][39][40]. Several top-tagging algorithms have been employed successfully in various searches for new physics  and in SM top quark measurements [70,71] in LHC analyses with Run 1 data. At LHC's Run 2 the production rates of particles with high p T have increased and the importance of boosted analyses is further enhanced. Modifications of existing taggers have been proposed and studied in simulation [72][73][74][75]. In most cases, a modest performance improvement is contrasted with a significantly increased algorithmic complexity. A simple but robust algorithm [76] proposed by the ATLAS collaboration has a slightly reduced performance. In addition, recent developments of top-down taggers aim at closing the gap between the resolved and the boosted regime with rather complex algorithmic procedures using several clustering, declustering, mass drop and filtering steps. Examples are the scale invariant tagger [77] and the HEPTopTagger in OptimalR mode [30].
In this work we introduce a new tagger useful in the resolved, the transition and the boosted regime, achieved with only little algorithmic complexity. The tagger is based on the variable R jet algorithm [17], which adapts the jet distance parameter dynamically to the p T of the boosted object. A mass jump condition [78,79] is included in the clustering process, which forms subjets reflecting the dynamics of the underlying hard decay, and enables efficient background suppression. The resulting Heavy Object Tagger with Variable R (HOTVR) accommodates jet clustering, subjet finding and the rejection of soft radiation in one sequence, without the need of declustering and following grooming steps. In this paper we demonstrate the algorithm's properties and characteristics in hadronic top decays and leave studies of the decays of W, Z, H and possible new resonances to future work.
The paper is organised as follows. In Sect. 2 the HOTVR algorithm is described. Its characteristics, free parameters and their influence on the jet and subjet clustering, the collinear and infrared safety and timing performance are discussed in Sect. 3. The algorithm's performance for hadronic top quark decays and a comparison with other commonly used taggers is presented in Sect. 4. A conclusion is given in Sect. 5.

The algorithm
The HOTVR algorithm is based on the variable R (VR) jet algorithm [17]. Like all sequential recombination algorithms, it starts with an input list of pseudojets and continues the processing until the input list is empty. The algorithm uses the distance measures d_ij and d_iB, defined as
$$d_{ij} = \min\left[p_{\mathrm{T},i}^{2n},\, p_{\mathrm{T},j}^{2n}\right] \Delta R_{ij}^{2}, \tag{1}$$
$$d_{i\mathrm{B}} = p_{\mathrm{T},i}^{2n}\, R_{\mathrm{eff}}^{2}(p_{\mathrm{T},i}), \tag{2}$$
$$R_{\mathrm{eff}}(p_{\mathrm{T}}) = \frac{\rho}{p_{\mathrm{T}}}. \tag{3}$$
The value of d_ij can be interpreted as the distance between two pseudojets i and j, where p_T,i is the transverse momentum of pseudojet i and ΔR_ij = √((y_i − y_j)^2 + (φ_i − φ_j)^2) is the angular distance in rapidity y and azimuth φ between the pseudojets i and j. The value of d_iB denotes the distance between pseudojet i and the beam. For a fixed distance parameter of R_eff = R in Eq. (3), the anti-k_t [80], Cambridge/Aachen (CA) [81,82] and k_t [83,84] algorithms are obtained for the choices n = −1, 0, 1, respectively. For the HOTVR algorithm n = 0 is used, corresponding to CA clustering. However, in the VR algorithm R_eff is an effective distance parameter, which scales with 1/p_T (cf. Eq. (3)), leading to broader jets at low p_T and narrower jets at high p_T. The scale ρ determines the slope of R_eff. For robustness of the algorithm with respect to experimental effects, a minimum and a maximum cut-off for R_eff are introduced,
$$R_{\mathrm{eff}} = \begin{cases} R_{\min} & \text{for } \rho/p_{\mathrm{T}} < R_{\min}, \\ R_{\max} & \text{for } \rho/p_{\mathrm{T}} > R_{\max}, \\ \rho/p_{\mathrm{T}} & \text{otherwise}. \end{cases} \tag{4}$$
A known shortcoming of the VR algorithm is the clustering of additional radiation into jets in QCD multijet production, resulting in a higher jet p_T on average and an increased rate once a p_T selection is applied [17]. The HOTVR algorithm approaches this issue by modifying the jet clustering procedure with a veto based on the invariant mass of the pseudojet pair, inspired by the recently proposed mass jump algorithm [78]. The mass jump veto prevents the recombination of two pseudojets i and j if the combined invariant mass m_ij is not large enough, i.e. if the mass jump criterion
$$\theta \cdot m_{ij} > \max\left[m_{i},\, m_{j}\right] \tag{5}$$
is not fulfilled. The parameter θ determines the strength of the mass jump veto and can be chosen from the interval [0, 1].
The mass jump criterion (5) is only applied if the mass m_ij is larger than a mass threshold μ,
$$m_{ij} > \mu. \tag{6}$$
In case a mass jump is found and the p_T of the pseudojets i and j fulfil
$$p_{\mathrm{T},i} > p_{\mathrm{T,sub}} \quad \text{and} \quad p_{\mathrm{T},j} > p_{\mathrm{T,sub}}, \tag{7}$$
the pseudojets are combined. The resulting pseudojet enters the next clustering step and the initial pseudojets are stored as separate subjets. In case the mass jump criterion is not fulfilled or the pseudojets are softer than p_T,sub, the lighter pseudojet or the one too soft is removed from the list. This step reduces the effect of additional activity (soft radiation, underlying event, pile-up) and effectively stabilises the jet mass over a large range of p_T. The full HOTVR algorithm can be summarised as follows.
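(1) If the smallest distance is d_iB, store the pseudojet i as a jet and remove it from the input list of pseudojets.
(2) If the smallest distance is d_ij and m_ij ≤ μ, combine i and j.
(3) If the smallest distance is d_ij and m_ij > μ, check the mass jump criterion θ · m_ij > max[m_i, m_j].
(a) If the mass jump criterion is not fulfilled, compare the masses of the two pseudojets and remove the one with the lower mass from the input list.
(b) If the mass jump criterion is fulfilled, check the transverse momenta of the pseudojets i and j.
(i) If p_T,i < p_T,sub or p_T,j < p_T,sub, remove the respective pseudojet from the input list.
(ii) Else, combine pseudojets i and j. Store the pseudojets i and j as subjets of the combined pseudojet. In case i or j already have subjets, associate their subjets with the combined pseudojet.
(4) Continue with (1) until the input list of pseudojets is empty.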
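The steps above can be sketched in a few lines of Python. This is a naive, illustrative re-implementation with a brute-force distance search, not the FastJet plugin. The `PseudoJet` class, the `massless` helper and the function names are hypothetical. The default parameter values are placeholders chosen for illustration; only μ = 30 GeV is quoted in the text, and the actual defaults are those of Table 1. Inputs are assumed to have p_T > 0 and not to be parallel to the beam.

```python
import math
from dataclasses import dataclass, field


@dataclass
class PseudoJet:
    px: float
    py: float
    pz: float
    e: float
    subjets: list = field(default_factory=list)

    @property
    def pt(self):
        return math.hypot(self.px, self.py)

    @property
    def mass(self):
        m2 = self.e**2 - self.px**2 - self.py**2 - self.pz**2
        return math.sqrt(max(m2, 0.0))

    @property
    def rapidity(self):
        return 0.5 * math.log((self.e + self.pz) / (self.e - self.pz))

    @property
    def phi(self):
        return math.atan2(self.py, self.px)


def massless(pt, y, phi):
    """Hypothetical helper: massless pseudojet from (pT, rapidity, azimuth)."""
    return PseudoJet(pt * math.cos(phi), pt * math.sin(phi),
                     pt * math.sinh(y), pt * math.cosh(y))


def delta_r2(a, b):
    dphi = abs(a.phi - b.phi)
    if dphi > math.pi:
        dphi = 2.0 * math.pi - dphi
    return (a.rapidity - b.rapidity) ** 2 + dphi**2


def r_eff(pt, rho, r_min, r_max):
    return min(max(rho / pt, r_min), r_max)


def hotvr(particles, rho=600.0, r_min=0.1, r_max=1.5,
          mu=30.0, theta=0.7, pt_sub=30.0):  # placeholder parameter values
    pjs, jets = list(particles), []
    while pjs:
        # n = 0 (CA-like): d_ij = dR^2 and d_iB = R_eff^2
        best = (math.inf, None, None)
        for i, a in enumerate(pjs):
            d_ib = r_eff(a.pt, rho, r_min, r_max) ** 2
            if d_ib < best[0]:
                best = (d_ib, i, None)
            for j in range(i + 1, len(pjs)):
                d_ij = delta_r2(a, pjs[j])
                if d_ij < best[0]:
                    best = (d_ij, i, j)
        _, i, j = best
        if j is None:  # step (1): beam distance is smallest
            jets.append(pjs.pop(i))
            continue
        a, b = pjs[i], pjs[j]
        merged = PseudoJet(a.px + b.px, a.py + b.py, a.pz + b.pz, a.e + b.e)
        if merged.mass <= mu:  # step (2): below threshold, plain merge
            merged.subjets = a.subjets + b.subjets
            drop, keep = [i, j], merged
        elif theta * merged.mass <= max(a.mass, b.mass):  # step (3a)
            drop, keep = [i if a.mass < b.mass else j], None
        elif a.pt < pt_sub or b.pt < pt_sub:  # step (3b-i)
            drop = [k for k, p in ((i, a), (j, b)) if p.pt < pt_sub]
            keep = None
        else:  # step (3b-ii): merge and record subjets
            merged.subjets = (a.subjets or [a]) + (b.subjets or [b])
            drop, keep = [i, j], merged
        for k in sorted(drop, reverse=True):
            pjs.pop(k)
        if keep is not None:
            pjs.append(keep)
    return jets


# Toy event: two hard particles plus one soft, wide-angle particle.
event = [massless(200.0, 0.0, 0.3), massless(200.0, 0.0, -0.3),
         massless(5.0, 0.0, 1.0)]
jets = hotvr(event)
```

In this toy event the two hard particles pass the mass jump criterion and are stored as subjets, while the soft wide-angle particle fails it in step (3a) and is dropped, so the jet mass is not inflated by it.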
The algorithm results in jets with an effective size depending on p_T and associated subjets. It incorporates jet finding, subjet finding and the rejection of soft radiation in one clustering sequence. The algorithm is available as a plugin to FastJet [85,86] and can be obtained through the FastJet Contribs package [87]. Its implementation is based on the implementations of the mass jump and VR algorithms in the FastJet Contribs packages ClusteringVetoPlugin 1.0.0 and VariableR 1.1.1, respectively. These implementations have been adapted and modified to make the HOTVR software an independent FastJet plugin.

Parameters, jet and subjet finding
In total, the algorithm has six parameters, which are listed in Table 1. While the first three parameters steer the VR part of the algorithm, the last three define the mass jump condition.
The default values given in the table have been optimised for top quark tagging in pp collisions at √s = 13 TeV. The original VR algorithm is recovered for μ → ∞. In this case, for ρ → 0 the algorithm is identical to the CA algorithm with R = R_min, since R_eff then always takes its minimum cut-off value.
The number of subjets found is modified by the mass jump parameters μ, θ and p_T,sub. Once the pseudojets become sufficiently heavy due to clustering, the mass jump threshold μ results in a rejection of soft and light pseudojets. For a fixed value of μ, the strength of this jet grooming depends on the parameters θ and p_T,sub. For θ = 1 the condition (5) is always fulfilled and no pseudojets are rejected (equivalent to the case μ → ∞). Conversely, the case of θ = 0 results in a VR jet clustering which stops as soon as a jet mass of μ is reached. The algorithm results in subjets with a maximum mass of μ. Additional jet grooming is obtained by setting p_T,sub > 0. This results in subjets with a minimum p_T of p_T,sub, effectively removing soft radiation and improving the tagging performance at small p_T of the heavy object.
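The role of ρ and of the two cut-offs can be made concrete with a small sketch of the effective distance parameter. The function name and the numerical values of ρ, R_min and R_max below are illustrative assumptions, not the defaults of Table 1.

```python
def effective_radius(pt, rho, r_min, r_max):
    """R_eff = rho / pT, limited to the interval [r_min, r_max]."""
    return min(max(rho / pt, r_min), r_max)


# Placeholder values (assumptions for illustration only), rho and pT in GeV.
RHO, R_MIN, R_MAX = 600.0, 0.1, 1.5

# Jet size shrinks with pT between the two cut-offs.
radii = {pt: effective_radius(pt, RHO, R_MIN, R_MAX)
         for pt in (100.0, 400.0, 1000.0, 10000.0)}

# Limiting cases: a vanishing slope pins R_eff to R_MIN for any pT,
# a very large slope pins it to R_MAX, i.e. fixed-R clustering.
small_rho = effective_radius(500.0, 1e-9, R_MIN, R_MAX)
large_rho = effective_radius(500.0, 1e9, R_MIN, R_MAX)
```

With n = 0 in the distance measures, a constant R_eff means the clustering reduces to CA with a fixed distance parameter, which is the limit discussed in the text.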
The algorithm's behaviour is visualised in Fig. 1, where two example tt̄ events, generated with Pythia 8 [88][89][90] at low p_T (top row, Event 1) and at high p_T (bottom row, Event 2), are clustered with the CA algorithm (left column) and with the HOTVR algorithm (right column). The active catchment areas of the hard jets are obtained using ghost particles [91] and are illustrated by the coloured (orange/blue) areas. The impact of the VR part of the algorithm is illustrated by the largely different jet sizes of the two events clustered with the HOTVR algorithm (right column). The grey regions in the right panels were rejected by the mass jump criterion and are not part of the HOTVR jets. This criterion has the largest impact in events at low p_T, as exemplified in Event 1 (top, right). The HOTVR jets together with their subjets reproduce the kinematics of the top decay adequately, both at low and high p_T, demonstrating a better adaptation to the decay topology than CA jets. A similar picture is obtained when comparing HOTVR jets to anti-k_t jets.

Collinear and infrared safety
The HOTVR algorithm is infrared and collinear (IRC) safe, except for the unnatural parameter choice of μ = 0. For parameter choices corresponding to the original VR clustering, the HOTVR algorithm is trivially IRC safe [17]. Similarly, for choices of μ > 0 the algorithm is IRC safe, as soft and collinear splittings do not generate mass. This has also been verified in a numerical test, where the stability of the jets as well as subjets found with the HOTVR algorithm was studied with respect to soft radiation and collinear splittings. The algorithm proved to be IRC safe, with no events out of 10^6 failing the test [92].
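The statement that soft and collinear splittings do not generate mass can be illustrated numerically. The sketch below is not the test of Ref. [92]; it only evaluates the standard invariant mass of two massless particles, m^2 = 2 p_T,1 p_T,2 (cosh Δy − cos Δφ), in the collinear and soft limits. The function name and the example kinematics are assumptions made for illustration.

```python
import math


def pair_mass(pt1, y1, phi1, pt2, y2, phi2):
    """Invariant mass of two massless particles given (pT, y, phi)."""
    m2 = 2.0 * pt1 * pt2 * (math.cosh(y1 - y2) - math.cos(phi1 - phi2))
    return math.sqrt(max(m2, 0.0))


# Collinear splitting: same direction, momentum shared 60/40 (GeV).
m_collinear = pair_mass(60.0, 0.5, 1.0, 40.0, 0.5, 1.0)

# Soft emission at wide angle: pT of 1e-6 GeV next to a 100 GeV particle.
m_soft = pair_mass(100.0, 0.0, 0.0, 1e-6, 1.0, 2.0)

# Hard wide-angle pair for comparison: two 200 GeV particles, dphi = 0.6.
m_hard = pair_mass(200.0, 0.0, 0.3, 200.0, 0.0, -0.3)
```

Both limiting cases give pair masses far below any threshold μ > 0, so such splittings are merged in step (2) without ever reaching the mass jump veto, whereas the hard pair lies well above a threshold of 30 GeV.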

Timing
For timing tests, and throughout this work, the FastJet 3.2.1 [85,86] framework is used, together with FastJet Contribs version 1.024. Starting from FastJet version 3.2, advanced clustering strategies became available which led to substantial speed improvements, especially at high particle multiplicities. For this reason the run time of the algorithm has been studied for different particle multiplicity scenarios, low O(50), medium O(300) and high O(3000).
In Table 2 the CPU time of the HOTVR algorithm with default parameters (cf. Table 1) is compared to those of the CA jet algorithm [81,82], the CMS top tagger [26,27], the HEPTopTagger [28,29], the HEPTopTagger in OptimalR mode [30], the VR algorithm [17] as well as the mass jump algorithm [78]. For the various top taggers the CPU time listed includes the time for the underlying jet finding as well as for the top tagger specific processing steps. The developments in FastJet 3.2 result in a much faster runtime of the VR and HOTVR clustering, compared to previous versions (not shown). At low and medium multiplicities, the runtime of the HOTVR algorithm is comparable to that of the other top-tagging algorithms tested. At high multiplicities, it is about a factor of four slower than the HEPTopTagger algorithms, but it is still fast enough for practical uses. The original mass jump algorithm has not been updated to employ the new clustering strategies, which leads to a much longer runtime at medium and high multiplicities.

Physics performance
Studies of the physics performance are carried out using the event generator Pythia 8 [88][89][90]. A pp → tt̄ sample is used as signal process; background events are obtained by simulating QCD dijet production in pp collisions. For both samples a centre-of-mass energy of √s = 13 TeV is used, and the multiple parton interaction tune Monash 2013 [93] and the LO NNPDF2.3 QCD+QED [94] PDFs with α_s(M_Z) = 0.130 are employed. At this stage no additional pp interactions during a single bunch crossing (pile-up) are simulated. Throughout this work, jets are clustered using all stable particles from the Pythia 8 output. In some studies, additional jets (labelled parton jets) are obtained using a list of all final state partons as input to the anti-k_t algorithm with distance parameter R = 0.4 and a minimum p_T of 100 GeV. In case of tt̄ production, the top quark is effectively treated as stable for the purposes of defining the parton jet: after showering the top quarks are added to the parton list, and all partons from the top quark decay are removed. In case a matching between particle and parton jets is employed, the geometrical matching condition ΔR < R_eff is used.
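The geometrical matching condition can be written down compactly. The following sketch assumes that jets are described by their rapidity and azimuth and that R_eff of the particle jet is supplied by the caller; the function names are hypothetical.

```python
import math


def delta_r(y1, phi1, y2, phi2):
    """Angular distance in the (y, phi) plane with phi wrapped to [-pi, pi)."""
    dphi = (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi
    return math.hypot(y1 - y2, dphi)


def is_matched(jet_y, jet_phi, parton_y, parton_phi, r_eff):
    """True if the parton jet axis lies within R_eff of the particle jet axis."""
    return delta_r(jet_y, jet_phi, parton_y, parton_phi) < r_eff


# Two axes close to each other across the phi = +-pi boundary.
dr_wrapped = delta_r(0.0, 3.0, 0.0, -3.0)
matched = is_matched(0.0, 3.0, 0.0, -3.0, r_eff=0.6)
```

The wrap-around in φ matters here: without it the two axes in the example would appear to be separated by Δφ = 6 instead of 2π − 6 ≈ 0.28.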

Reconstruction of masses and transverse momenta
The key to the tagger's effectiveness is the accurate reconstruction of subjets originating from the top quark decay, achieved by the VR condition and the mass jump criterion. This leads to a stable peak position for the mass of top jets over a large range of jet p_T, as shown in Fig. 2. The jet mass m_jet distribution for jets with two different subjet multiplicity N_sub selections is shown for two ranges in the p_T of the parton jet matched to the particle jet. For tt̄ events with N_sub ≥ 2 the distributions feature a dominant peak, stable around the top quark mass for fully merged decays, and two smaller peaks at lower masses corresponding to partially merged top quark decays. The requirement of N_sub ≥ 3 leads to a depletion of the two secondary peaks, while the peak around the mass of the top quark is hardly affected. At low p_T (left) the top quark peak is wider with a larger tail and is situated on a larger plateau than at high p_T (right) because of contributions from additional radiation which aggregate in the jet due to its large size. While this leads to a larger misidentification rate at low p_T, it results in a non-vanishing efficiency already at top quark transverse momenta as low as 100 GeV. For typical QCD jets a falling distribution is observed. The wide peak at mass values around 140 GeV observed at low p_T (left) is a result of the subjet kinematics, where an angular separation of ΔR = 1.0-1.5 leads to jet masses around this value. When changing the kinematics by relaxing the p_T,sub requirement, the peak vanishes and a falling distribution is obtained. The width of this peak is reduced for intermediate (400 < p_T < 600 GeV) transverse momenta (not shown) and a monotonically falling background distribution is obtained for values of 600 < p_T < 800 GeV (right). Very similar distributions are obtained for values of p_T > 800 GeV.
The distributions of the leading subjet's fractional transverse momentum f_pT = p_T,1 / p_T are shown in Fig. 2 (middle). Signal jets contain subjets with more evenly distributed transverse momenta, while for background jets the leading subjet carries a larger fractional p_T on average. The variable f_pT shows good separation power between signal and background jets before a subjet multiplicity selection. After the requirement of N_sub ≥ 3, the separation power is reduced, but the variable is still useful, especially at high p_T.
[Fig. 2 caption, partially recovered: the distributions are shown for subjet multiplicities N_sub ≥ 2 (dashed lines) and N_sub ≥ 3 (solid lines). Note that the minimum pairwise mass is only defined for N_sub ≥ 3. The distributions have been normalised to unit area for N_sub ≥ 2.]
For jets with N sub ≥ 3, the distribution of the minimum pairwise mass m min [26,27], defined as the minimum invariant mass of pairs of the three highest p T subjets m min = min[m 12 , m 13 , m 23 ], is shown in Fig. 2 (bottom) for two regions of p T of the parton jet. The distributions show a clear cut-off at the chosen value of the mass jump threshold (μ = 30 GeV). Above this value the distribution is steeply falling for background jets, while tt signal jets exhibit a pronounced peak around the value of the W boson mass, as expected for top quark jets. The tail below the mass jump threshold is a result of light subjets combined with a heavier pseudojet, fulfilling the mass jump criterion in step (3) of the algorithm.
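The definition of m_min translates directly into code. The sketch below assumes subjets are given as (E, px, py, pz) tuples in GeV; the function names and the toy subjets are illustrative and not taken from the HOTVR software.

```python
import math
from itertools import combinations


def pt(p):
    return math.hypot(p[1], p[2])


def inv_mass(p, q):
    """Invariant mass of the sum of two four-vectors (E, px, py, pz)."""
    e, px, py, pz = (a + b for a, b in zip(p, q))
    return math.sqrt(max(e * e - px * px - py * py - pz * pz, 0.0))


def min_pairwise_mass(subjets):
    """m_min = min[m_12, m_13, m_23] of the three highest-pT subjets."""
    if len(subjets) < 3:
        raise ValueError("m_min is only defined for at least three subjets")
    leading = sorted(subjets, key=pt, reverse=True)[:3]
    return min(inv_mass(a, b) for a, b in combinations(leading, 2))


def massless_at_y0(pt_value, phi):
    """Toy massless subjet at rapidity 0 (assumption for the example)."""
    return (pt_value, pt_value * math.cos(phi), pt_value * math.sin(phi), 0.0)


# Three hard subjets plus a fourth, softer one that must not enter m_min.
subjets = [massless_at_y0(100.0, 0.0), massless_at_y0(80.0, 0.8),
           massless_at_y0(60.0, -0.5), massless_at_y0(10.0, 0.1)]
m_min = min_pairwise_mass(subjets)
```

Restricting the pairs to the three leading subjets is what keeps the soft fourth subjet from dominating the minimum.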
Besides an adequate reconstruction of masses, algorithms should also be able to reconstruct the kinematics of the initial heavy particle. In particular the size of the catchment area, which is responsible for the amount of additional radiation clustered into the jets, and the intensity of the grooming procedure are critical components for the performance in this area. For an evaluation of the kinematic object reconstruction by the HOTVR algorithm, we calculate the p T ratio of the HOTVR jets and the matched parton jets containing a top quark. We find a mean value of the p T ratio of 1.0 within small deviations of the order of 1%, independent of the parton jet p T . The widths of the p T ratio distributions are about 5%. This shows that the HOTVR algorithm is able to accurately reconstruct the kinematics of the heavy object with the parameter choice given above.

Selection cuts in top-tagging mode
For the discrimination of hadronically decaying top quarks from QCD multijets a selection based on simple cuts using commonly employed substructure variables has been implemented. The variables m_jet and m_min calculated from the HOTVR subjets are in principle sufficient for building a robust top tagger over a large region of p_T. However, cuts on additional variables have been added to obtain a selection that allows a fair comparison with other top-tagging algorithms using similar selections. To ensure only a limited impact of experimental effects that are not included (e.g. broadening of distributions), the cut values have not been optimised rigorously. Nevertheless, they result in an improved discrimination between signal and background. The following selection defines the standard working point of the HOTVR algorithm in top-tagging mode.
1. The leading subjet is required to have a fractional transverse momentum with respect to the jet, f_pT = p_T,1 / p_T < 0.8, which ensures that the jet's momentum is distributed among its subjets and not carried by only the leading subjet.
2. The number of subjets N_sub is required to be N_sub ≥ 3, which increases the probability of reconstructing fully merged top jets and rejects a fair amount of QCD jets.
3. The jet mass is required to fulfil 140 < m_jet < 220 GeV.
4. The minimum pairwise mass has to fulfil m_min > 50 GeV.
These selection criteria lead to similar subjet kinematics as obtained by the CMS and HEPTopTagger algorithms with default parameters. This provides the basis for the comparison made in the following.
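The four cuts of the standard working point can be collected in a single predicate. This is a minimal sketch with a hypothetical function name; it takes the already computed substructure variables as inputs (masses in GeV) and does not perform any clustering itself.

```python
def passes_top_tag(f_pt, n_sub, m_jet, m_min=None):
    """Standard HOTVR top-tagging working point as listed in the text.

    f_pt  : leading-subjet pT divided by jet pT
    n_sub : number of subjets
    m_jet : jet mass in GeV
    m_min : minimum pairwise subjet mass in GeV (None if fewer than 3 subjets)
    """
    if n_sub < 3 or m_min is None:  # m_min is undefined below three subjets
        return False
    return f_pt < 0.8 and 140.0 < m_jet < 220.0 and m_min > 50.0


# Example inputs (invented numbers for illustration).
top_like = passes_top_tag(f_pt=0.5, n_sub=3, m_jet=172.0, m_min=80.0)
qcd_like = passes_top_tag(f_pt=0.9, n_sub=3, m_jet=172.0, m_min=80.0)
```

Checking the subjet multiplicity first avoids evaluating m_min for jets where it is not defined.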

Performance comparison with ROC curves
The signal efficiency and misidentification rate are studied using single variable receiver operating characteristic (ROC) curves. The signal efficiency ε_S is defined as the fraction of tagged jets matched to parton jets containing the top quark, with respect to all top quarks decaying hadronically. The background efficiency (or misidentification rate) ε_B is calculated as the fraction of tagged jets matched to parton jets in a QCD multijet sample, with respect to the total number of parton jets. Both ε_S and ε_B therefore combine identification and matching efficiencies. These definitions allow for a comparison of different tagging algorithms, in particular using different choices of the jet distance parameter R, since the reference p_T is defined by the parton jet matched to the tagged jet and does not depend on the specifics of the tagging algorithm under study. In the following the performance of the HOTVR algorithm in top-tagging mode is compared with the performance of three top-tagging algorithms especially designed for dedicated regions of p_T: the CMS top tagger targets the region of high p_T, the HEPTopTagger is designed for low p_T and its improved version with OptimalR has been developed to extend its usability to higher p_T. The free parameters of these taggers are listed in Table 3 together with a choice of working points [29,73,95]. The ROC curves are obtained by keeping the free parameters fixed at the values given and scanning only the N-subjettiness [21][22][23] ratio τ_3/2 = τ_3/τ_2 with β = 1. The choice of τ_3/2 as scanning variable ensures an unprejudiced comparison of the algorithms, which all rely on different reconstruction techniques and substructure variables, since this variable is not used in the definition of any of the taggers under study. Furthermore, τ_3/2 has been shown to improve the performance of existing taggers (see for example Refs. [30,73]).
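The scan described above can be sketched as follows. The inputs are the τ_3/2 values of jets that already pass the fixed tagger selection and are matched to a parton jet, together with the denominators of the two efficiency definitions. The function name, the cut direction (τ_3/2 below the threshold, which selects three-prong-like jets) and all numbers are assumptions made for illustration; no result of the paper is reproduced.

```python
def roc_points(sig_tau32, n_sig_total, bkg_tau32, n_bkg_total, cuts):
    """Return (cut, eps_S, eps_B) for each upper cut on tau_3/2.

    sig_tau32   : tau_3/2 of tagged, matched jets in the signal sample
    n_sig_total : number of hadronically decaying top quarks
    bkg_tau32   : tau_3/2 of tagged, matched jets in the background sample
    n_bkg_total : total number of parton jets in the background sample
    """
    points = []
    for cut in cuts:
        eps_s = sum(t < cut for t in sig_tau32) / n_sig_total
        eps_b = sum(t < cut for t in bkg_tau32) / n_bkg_total
        points.append((cut, eps_s, eps_b))
    return points


# Invented toy inputs, for illustration only.
signal = [0.3, 0.4, 0.5, 0.7]
background = [0.6, 0.8, 0.9]
curve = roc_points(signal, 10, background, 100, cuts=[0.45, 0.65, 1.0])
```

Because the denominators count all hadronic top quarks and all parton jets rather than only the tagged jets, each point folds the matching efficiency into ε_S and ε_B, as in the definitions above.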
In Fig. 3 the ROC curves of the four top-tagging algorithms are shown for four different p_T regions, where p_T is defined by the parton jet matched to the tagged jet. The events were reweighted to obtain a flat p_T spectrum such that all events in the interval have the same weight. At low p_T (200 < p_T < 400 GeV, top left) the CMS top-tagging algorithm has very small efficiency due to the choice of R = 0.8, which results in jets not large enough to cluster all particles from the top quark decay chain. The HOTVR algorithm is able to provide a performance comparable to that of the two HEPTopTagger algorithms, which were optimised for this p_T region. For increasing values of p_T the CMS tagger becomes more efficient, with a similar performance as the OptimalR HEPTopTagger starting from p_T > 600 GeV. In the p_T regions with 400 < p_T < 600 GeV (top right) and 600 < p_T < 1000 GeV (bottom left) the HOTVR algorithm shows a similar relation between ε_S and ε_B as the CMS and OptimalR HEPTopTagger, and is especially useful for high efficiencies. In the highest p_T region considered (1000 < p_T < 2000 GeV, bottom right) the HOTVR algorithm features overall the best performance over all ε_S values, outperforming the CMS tagger, which was designed for the high p_T region.
Table 3 Settings of the top-tagging algorithms used. The parameter R is the distance parameter of the jet clustering. The definition of the parameters follows Ref. [95] for the CMS top tagger, Ref. [29] for the HEPTopTagger and Ref. [30] for the HEPTopTagger in OptimalR mode. [Table body not recovered; surviving entries: "CMS top tagger", "HEPTopTagger", "Same as HEPTopTagger".]
In summary, the HOTVR algorithm shows a remarkably stable performance over a large range in p T with similar or even better performance than algorithms especially designed for certain p T regions. Detector reconstruction and resolution effects, which are not included in these studies, are expected to improve the performance of the HOTVR algorithm relative to the other algorithms studied [92].

Conclusion
A new algorithm for the reconstruction and identification of hadronically decaying heavy particles at the LHC has been introduced in this paper. The algorithm combines variable R jet clustering with a veto based on a mass jump criterion. It performs jet and subjet finding, and the rejection of soft radiation in one sequence. This combination results in a stable determination of jet substructure variables like the jet mass over a large range in p T of the heavy object. In top-tagging mode the HOTVR algorithm provides an excellent ratio of signal to background efficiency at low top quark p T as well as at high p T , making the HOTVR algorithm useful in the regions of resolved and boosted decays at the same time.
While we focussed on top tagging in this work, the algorithm is also applicable for the tagging of W, Z, H or possible BSM resonances, where studies are ongoing. Because of its algorithmic simplicity combined with remarkable performance, this tagger could become a helpful ingredient for future boosted analyses at the LHC.