Evidence for four-top quark production in proton-proton collisions at $\sqrt{s}$ = 13 TeV

The production of four top quarks ($\mathrm{t\bar{t}t\bar{t}}$) is studied with LHC proton-proton collision data samples collected by the CMS experiment at a center-of-mass energy of 13 TeV, and corresponding to integrated luminosities of up to 138 fb$^{-1}$. Events that have no leptons (all-hadronic), one lepton, or two opposite-sign leptons (where lepton refers only to prompt electrons or prompt muons) are considered. This is the first $\mathrm{t\bar{t}t\bar{t}}$ measurement that includes the all-hadronic final state. The observed significance of the $\mathrm{t\bar{t}t\bar{t}}$ signal in these final states of 3.9 standard deviations (1.5 expected) provides evidence for $\mathrm{t\bar{t}t\bar{t}}$ production, with a measured cross section of 36 $^{+12}_{-11}$ fb. Combined with earlier CMS results in other final states, the signal significance is 4.0 standard deviations (3.2 expected). The combination returns an observed cross section of 17 $\pm$ 4 (stat) $\pm$ 3 (syst) fb, which is consistent with the standard model prediction.


Introduction
The production of four top quarks (tttt) is predicted to occur very rarely in the standard model (SM). In proton-proton (pp) collisions at √ s = 13 TeV, the production cross section has been calculated to be 12.0 +2.2 −2.5 fb at next-to-leading order (NLO) in quantum chromodynamics (QCD) and electroweak (EW) contributions [1][2][3][4][5]. Examples of the SM lowest order contributions to tttt production in pp collisions are shown in Fig. 1. Deviations from the predicted value occur in many proposed models of physics beyond the SM, such as supersymmetry [6,7], composite models [8], top quark compositeness [9], two Higgs doublet models [10][11][12], and models with extra spatial dimensions [13,14]. Measurements of tttt production can also be used to constrain the top quark Yukawa coupling, CP-related parameters, and effective field theory operators [15].
In the SM, the top quark dominantly decays to a bottom quark and a W boson. Each W boson decays to either leptons or quarks, so the tttt final state consists of four bottom quarks and up to four leptonic W boson decays. No analyses described or cited in this Letter attempt to explicitly identify τ lepton products and hence hereafter "lepton" will refer only to e and µ, whether produced directly in W → e/µ + ν decays or via W → τ + ν with τ → e/µ + 2ν.
The production of tttt has been searched for by both ATLAS [16][17][18] and CMS [19][20][21][22] at the CERN LHC. The ATLAS Collaboration has reported evidence for tttt production in final states with either two same-sign leptons or at least three leptons (referred to as "SSDL&ML") [18], with an observed significance of 4.3 standard deviations (2.4 expected) and a measured production cross section of 24 +7 −6 fb (assuming SM branching ratios). The significance observed in searches for tttt production by the CMS Collaboration in similar SSDL&ML final states is 2.6 standard deviations (2.7 expected), with a measured production cross section of 12.6 +5. 8 −5.2 fb, using data collected in 2016-2018 with an integrated luminosity of 138 fb −1 [21]. The CMS Collaboration has also reported an observed significance of 1.4 standard deviations and a measured production cross section of 13 +11 −9 fb in final states with one lepton or two opposite-sign leptons, using data collected in 2016 with an integrated luminosity of 38 fb −1 [22].
In this Letter, we present a search by the CMS Collaboration for the production of tttt in final states with zero leptons (all-hadronic), one lepton (single-lepton), or two opposite-sign leptons (referred to as "opposite-sign dileptons" or "OSDL"). The search uses data samples of pp collision data collected in 2016-2018 at √ s = 13 TeV, with integrated luminosities of up to 138 fb −1 [23][24][25]. To discriminate the tttt signal events from the dominant background of tt production we take advantage of the higher multiplicity of jets, particularly those produced by the hadronization of b quarks. Events are required to have at least three candidate b jets in the single-lepton and all-hadronic final states, and at least two candidate b jets in the OSDL final state.
The all-hadronic final state was not used in previous measurements because of the difficulty of modeling the QCD multijet background in the relevant kinematic regions. Utilizing this channel for the first time is made possible by a novel background estimation strategy that uses a deep neural network (DNN) to estimate background distributions from data in control regions (CRs). The analyses of channels with leptons in the final state improve upon previous results by taking advantage of the increased integrated luminosity, upgraded detectors, and improved algorithms for the identification of bottom and top quarks. The results are combined with the previous SSDL&ML CMS results [21], which are also based on the 2016-2018 data set but use non-overlapping data regions (including CRs) to ensure statistical independence. Tabulated results are provided in the HEPData record for this analysis [26].

The CMS detector
The central feature of the CMS apparatus is a superconducting solenoid providing a magnetic field of 3.8 T, and enclosing a silicon pixel and strip tracker, a lead tungstate crystal electromagnetic calorimeter, and a brass and scintillator hadron calorimeter, each composed of a barrel and two endcap sections. Forward calorimeters extend the pseudorapidity coverage provided by the barrel and endcap detectors. Muons are measured in gas-ionization detectors embedded in the steel flux-return yoke outside the solenoid. A two-level trigger system reduces the rate of events retained for further processing to around 1 kHz. The first-level trigger is composed of custom hardware processors, using information from the calorimeters and muon detectors [27]. The software-based high-level trigger [28] uses the full event information. A more detailed description of the CMS detector, together with a definition of the coordinate system used and the relevant kinematic variables, can be found in Ref. [29].

Simulated event samples
Signal and background processes are modeled using several Monte Carlo event generators. Multiple simulated minimum bias interactions within the same or nearby bunch crossings (pileup) are superimposed on the hard scattering process with the multiplicity distribution matched to the data. Signal events are generated using the MADGRAPH5 aMC@NLO generator [1] versions 2.2.2 (2016) and 2.4.2 (2017-2018) in a tree-level approximation, with emission of up to two additional partons in matrix element calculations. Inclusive tt production is generated at NLO precision using POWHEG v2 [30-32]. Smaller contributions from tt production in association with one or two bosons (H, W, Z, WH, ZH, WW, WZ, ZZ), W and Z production, triple top quark production (ttt, tttW), single top quark (tW) production, and Drell-Yan processes constitute the remaining backgrounds, particularly where additional hadronic jets are produced by QCD radiation [ Cross sections at NLO and NNLO are used to normalize the simulated background samples. Events in the tt sample with one or more additional b jets from initial-or final-state radiation are reweighted to match corresponding cross sections measured by the CMS experiment: Ref.
[41] is used for the SL analysis and Ref. [42] for the OSDL analysis. The detector response is simulated using GEANT4 [43]. In the single-lepton and OSDL final states, the contributions from all background processes are estimated using simulation. In the all-hadronic final state, CRs in data are used to estimate the dominant multijet and tt backgrounds.

Event reconstruction and data samples
A particle-flow algorithm [44] is used to reconstruct and identify each particle in an event, with an optimized combination of information from various CMS subdetectors. The objects identified by the algorithm comprise candidate electrons, muons, photons, and charged or neutral hadrons. Muon and electron candidates are restricted to the ranges |η| < 2.4 and |η| < 2.5, respectively [45][46][47], and are required to be isolated from other objects [45,47,48]. Jets are reconstructed from particle flow objects using the anti-k T algorithm [49,50] with distance parameters of 0.4 (AK4) and 0.8 (AK8). Residual differences in the jet energy scale and resolution between data and simulation are corrected [51]. Charged particles identified as originating from pileup are discarded, and the measured energy is corrected to remove the estimated contribution from neutral pileup particles [52,53]. The H T in an event is then defined as the scalar p T sum of all AK4 jets with p T > 30 GeV and |η| < 2.5.
The DEEPCSV [54] and DEEPJET algorithms [55] are used to discriminate (b tag) AK4 jets produced by the hadronization of bottom quarks from those produced by gluon and lighter quark hadronization. A misidentification probability of 1% and efficiencies of approximately 68% for DEEPCSV and 75% for DEEPJET are measured in simulated tt events [56].
Hadronic top quark decays with a small or moderate Lorentz boost will typically produce three separate AK4 jets. These decays are referred to as "resolved" and are identified using multivariate top quark tagging (t tagging) algorithms. For the analysis of the single-lepton final state we use the DNN-based DEEPRESOLVED t tagger [57,58] while for the all-hadronic final state we use a custom boosted decision tree (BDT) t tagger [59,60] based on the one described in Ref. [61], updated to use DEEPJET b tagging inputs.
In contrast, hadronically decaying top quarks with a large boost can result in a single merged jet. Such decays are identified in the all-hadronic channel by applying the CMS DEEPAK8 algorithm [62] to AK8 jets with p T > 400 GeV. In order to avoid double counting of objects, we require merged top quark jets to be separated from resolved top quark candidates by ∆R ≡ √ (∆η) 2 + (∆φ) 2 > 0.8.
The OSDL events are collected using electron-muon, dimuon, and dielectron triggers. The eµ selection uses a combination of triggers that require either an electron with p T > 23 GeV and a muon with p T > 12 GeV, or vice-versa. For the dimuon channel, a trigger with p T thresholds of 17 and 8 GeV for the two highest p T muons is used. Similarly the dielectron channel uses a trigger with p T thresholds of 23 and 12 GeV for the two highest p T electrons. A prioritized trigger strategy of eµ, µµ, and ee is used to ensure that events satisfying multiple triggers exclusively enter the appropriate OSDL final state. For the single-lepton channel, two different triggers are used. The first requires events to contain an isolated electron (muon) with p T > 35 (50) GeV. The second requires a very loosely isolated electron or muon with p T > 15 GeV in addition to the event having H T > 450 GeV. All-hadronic events are selected with a variety of triggers that require at least six AK4 jets and H T greater than thresholds in the range 380-450 GeV. A minimum of either one or two b-tagged jets is required, depending on the trigger. Scale factors are applied to simulated samples in all final states to correct for the differences in trigger efficiencies between data and simulation.

Opposite sign dilepton final state
The OSDL channel contains data corresponding to an integrated luminosity of 101 fb −1 collected in 2017-2018, with previously published results on 2016 data included in the final fit to all channels. Offline, events in the OSDL channel are required to have exactly two opposite-sign leptons, one with p T > 25 GeV and the other with p T > 15 GeV, at least 4 jets with p T > 30 GeV, and H T > 500 GeV. Events with ee and µµ invariant masses below 20 GeV or within 15 GeV of the Z boson mass are excluded. The distribution of H T is chosen as the input to the final fit, as this variable is sensitive to the presence of the extra two hadronic top quark decays in the signal as compared to the tt background. To increase sensitivity, events are categorized by lepton decay channel (ee, µµ, eµ), the number of AK4 jets (N j = 4, 5, 6, 7, ≥8), and how many of them are b tagged (N b = 2, 3, ≥4). The signal regions (SRs) contain seven or more jets, three or more of which are b tagged by DEEPJET, while categories containing fewer jets or exactly two b-tagged jets serve as CRs.
The number of simulated tt events with no jets from the hadronization of additional b quarks is corrected to ensure consistency with the observed data. The correction is determined from CRs which contain exactly one b-tagged jet but satisfy all other criteria of the OSDL SRs and CRs included in the fit (i.e., two opposite-sign leptons, N j ≥ 4, and H T > 500 GeV). The selected sample is depleted of tt with extra b jets. The required correction is determined to be a scaling factor of 0.78 ± 0.05 (statistical uncertainty only), independent of lepton and jet kinematic properties.
The OSDL signal and background normalizations are obtained from a simultaneous binned maximum likelihood fit to all the categories, where a template is created for each jet and b tag category, leptonic decay channel, and year. Figure 2 shows the jet multiplicity distributions for the ≥4 b tag categories after the fit to data. The dominant background contribution changes with b tag multiplicity. With increasing N b , backgrounds such as tt + ≥1b, tt + H, and tt + V (V = W, Z) become more important. In the most sensitive categories, tt + ≥1b becomes the dominant background.

Single-lepton final state
Single-lepton events are required to have exactly one lepton with p T > 20 GeV, at least four AK4 jets, at least two of which must be b tagged using the DEEPCSV algorithm, and H T > 500 GeV. A BDT is trained using the TMVA package [63] to discriminate between signal and background in regions with large N j and N b . A large number of possible BDT inputs are constructed, based on kinematic variables such as p T , b tagging discriminants, resolved t tagging discriminants, mass, and angular separations of various objects and their combinations (e.g., jets, dijets, trijets, b-tagged jets, lepton-b pairs, and bb pairs). Information about the event topology is incorporated via event shape variables, such as centrality, planarity, sphericity, and the second Fox-Wolfram moment [64] calculated using all AK4 jets. These potential inputs are evaluated based on whether they are well modeled in the simulation (using χ 2 , Kolmogorov-   Here, tt + ≥1b refers to tt events with at least one additional b jet, tt + 0b includes all other tt events not produced in association with a boson, and EW refers to events that contain W and Z bosons but no top quarks. The backgrounds and tttt signal (derived from the fit described below) are shown as a stacked histogram. The hatched bands correspond to the estimated total uncertainty after the fit.
Smirnov, or Anderson-Darling goodness of fit tests) and whether they increase the BDT discrimination power. The final BDT classifier uses 40 inputs. Signal and background samples are randomly divided into three equally populated parts: for training, for testing the performance, and for evaluating the classifier for the maximum likelihood fit. The contributions from the dominant tt background and all other SM processes are included in the training. The training is optimized by targeting a region enriched in signal events by requiring N j ≥ 6 and N b ≥ 2.
The BDT is validated in the N b = 2 region, which has a low signal-to-background ratio. Small corrections to the slope of the BDT distributions are derived in this region and applied to simulated events in the SR with N b ≥ 3. The correction is verified to be valid in higher b-tag multiplicities, within uncertainties.
To maximize sensitivity to tttt production, events are categorized based on lepton flavor (e, µ), N j (6, 7, 8, 9,   CMS e/ + jets  Here, tt + ≥1b refers to tt events with at least one additional b jet, while tt + 0b includes all other tt events not produced in association with a boson. The TOP grouping contains single top quark production along with the other tt processes not explicitly shown, and EW refers to events that contain W and Z bosons but no top quarks. The backgrounds and tttt signal (derived from the fit described below) are shown as a stacked histogram. The hatched bands correspond to the estimated total uncertainty after the fit. While the bins are shown to be equal width, they do not correspond to equal width in BDT value.

All-hadronic final state
The all-hadronic SR is defined by requiring H T > 700 GeV, no prompt isolated leptons, at least nine AK4 jets of which at least three are b tagged by DEEPJET, and at least one resolved top quark candidate. This selection results in a signal-to-background ratio of about 10 −5 assuming the SM expectation for the signal.
Events in the SR are subdivided into 12 categories based on N RT , the number of boosted top quarks (N BT ), and H T as outlined in Table 1. The categorization by top quark tags defines three groups: N RT ≥ 2, N RT = 1 and N BT ≥ 1, and N RT = 1 and N BT = 0. The first two groups are each further categorized into two ranges in H T : 700-1100 and >1100 GeV. For the third group, there are six equally spaced bins in the range 700 < H T < 1300 GeV, and two additional bins with H T in the ranges 1300-1500 and ≥1500 GeV. The SR categories were chosen to optimize sensitivity to tttt production.
An event-level BDT is trained using CATBOOST [65] in each category of the SR to further distinguish between tttt signal events and the dominant backgrounds originating from tt and QCD multijet production, by exploiting differences in kinematic distributions of reconstructed objects. The 20 optimized BDT input variables include N j and N b ; the kinematic distributions of jets, b-tagged jets, and t-tagged candidates; and variables related to the angular distributions of jets.
Techniques using CRs in data are employed to estimate the dominant backgrounds from QCD multijet and tt + jet production, as described in the following text. The ratio of the QCD multijet to the tt + jet background is expected to be approximately 3:2 in the SR.
Estimates of the absolute normalization of the background and the shape of the BDT distributions in the 12 SR categories are obtained from an extrapolation based on five CRs. While the SR categories have N j ≥ 9 and N b ≥ 3, these five CRs are defined to have (N j , N b ) = (7, 2), (8, 2), (≥9, 2), (7, ≥3), and (8, ≥3). Figure 5 illustrates how these control regions are related as a function of N j and N b . The absolute normalization of the background is estimated using an "extended ABCD" method [66], where the number of events in the SR is derived from the number of events in several independent CRs. This method improves the accuracy of background yield estimates in cases where control variables are weakly correlated (such as N j and N b ), compared to the traditional ABCD method in which only three CRs are used. Specifically, this method is used to predict the number of tt and QCD multijet events in the SR from the number of events observed in the five (N j , N b ) CRs, after subtracting the number of events from minor backgrounds.
The shape of the BDT distribution in each SR is predicted using a DNN trained on the same five CRs that are used to estimate the absolute normalization. A normalizing autoregressive flow [67] is trained on the CRs to learn a bijective transformation of the H T and BDT output distributions between a source and a target. The procedure is sensitive to statistical fluctuations in the source sample, so the input source used is a tt sample with a large number of simulated events. The target is the total tt and multijet background (estimated by subtracting the other, simulated, background contributions from the data). The QCD simulation is not included as an input source due to the very small number of simulated events in the regions of interest.
In each of the CRs, the normalizing flow algorithm learns a transformation from the source distribution to the target distribution, before applying that transformation to the source distribution in the SR. The validity of the method was demonstrated with various tests, including transformations between distributions from physics processes with large differences in shape. The algorithm is trained regressively, starting from the less SR-like CRs (in terms of N j and N b ) and including weights from previous training cycles in subsequent more SR-like training cycles. For each data-taking period (2016, 2017, and 2018), the BDT output and H T distributions are predicted simultaneously in the three SR N RT and N BT categories. The predicted BDT distributions are then split according to the predicted H T , resulting in predicted BDT output distributions binned in H T for each of the 12 SR categories per data-taking period. These distributions are normalized to the absolute normalizations estimated from the "extended ABCD" extrapolation.
The predictions are checked in validation regions (VRs) defined in parallel to those for the SR, but with N j = 8, N b ≥ 3 and extrapolating from CRs in which (N j , N b ) = (6, 2), (7, 2), (8, 2), (6, ≥3), and (7, ≥3). The VRs and their associated CRs are split into the same N RT , N BT , and H T categories as the SRs and their associated CRs as detailed in Table 1. Figure 4 shows BDT disciminant distributions in two VRs, which demonstrate the robustness of this method.
For each category, the difference between the total number of predicted and observed events in the corresponding VR is taken as a normalization uncertainty. This is added in quadrature to a second normalization uncertainty estimated from the bin-to-bin variations within the VR. Shape uncertainties are also estimated for each category, characterized as a linear scaling of the BDT discriminant for each event. The maximum scaling required for the predicted and observed distribution shapes to agree within the uncertainty in all bins in any VR is ±3%. These uncertainties are considered to be correlated for any H T categories that use the same DNN training.
Other backgrounds, originating from the associated production of tt with W, Z, or H bosons, are estimated from simulation. Figure 6 shows the BDT output distributions for the two SR categories that are most sensitive to the tttt signal. The distribution of the BDT discriminant in each category is used as the input to the final fit.

Statistical procedure and systematic uncertainties
An estimate of the tttt production significance and the measurement of the cross section is obtained by a likelihood fit, based on the procedure described in Refs. [68,69]. The fit incorporates the experimental and theoretical uncertainties as nuisance parameters, assuming that shared   systematic uncertainties between analyses are correlated. Different b tagging algorithms are treated as uncorrelated; studies confirm that this choice has no significant effect on the fit results. The significance of signal events is obtained from a comparison to the background-only hypothesis assuming the asymptotic regime as detailed in Ref. [70].
A likelihood fit is performed to the all-hadronic, single-lepton, and OSDL results first, and then a second likelihood fit is performed including previously published results. Uncertainties in both the normalizations and the shapes of the discriminant distributions are considered for all background processes. For the signal, the uncertainties are applied to the predicted shapes and normalizations of the discriminant distributions while the cross section is unconstrained. Systematic uncertainties that come from the same experimental or theoretical source are treated consistently between results.
The sensitivity of the analysis is limited primarily by the small size of the sample, which contributes an uncertainty of 22% in the measured signal strength. Approximately 400 nuisance parameters representing systematic uncertainties are considered; the two with the largest individual effects are the theoretical uncertainty in the ttH cross section (4.6% impact on signal strength) and the uncertainty in the modeling of tt with additional b jets (3.7%). The background estimation in the all-hadronic final state contributes up to 2.7% per SR category, dominated by statistical fluctuations in the CRs. The jet energy scale contributes up to 2.4% (depending on the year and channel), renormalization and factorization scales 2.1%, and leptonic fake rates 1.9%. The largest components of the b tagging and light quark mistagging efficiency uncertainties each contribute 1.8%.
Further nuisance parameters with smaller individual contributions include those relating to lepton reconstruction and trigger efficiencies, the input cross section for tt + additional b quark production, jet energy resolution, matrix element to parton shower matching, the modeling of the resolved t tagging in the single-lepton channel, PDFs, pileup, the theoretical uncertainty in the tt cross section, the theoretical modeling of initial-state radiation in tt production, and the delivered luminosity. The total systematic uncertainty, considering the effects of all nuisance parameters, is 17%. Table 2 shows the fitted values of the signal strength (the ratio of the measured cross section to the prediction), the measured cross section, and the expected and observed significance of tttt production from 2017-2018 data in the OSDL channel, and 2016-2018 data in the single-lepton and all-hadronic channels. The signal strength is calculated with all systematic uncertainties, including all theoretical uncertainties. The cross section measurement is performed with all systematic uncertainties except theoretical uncertainties that affect the rate of the signal process. The table also includes the combination of these new results with the CMS OSDL analysis of 2016 data [22], and the same-sign dilepton and multilepton analysis of the 2016-2018 data [21], following the procedure described in Refs. [68,69]. The expected and observed significances for each final state and the combination are also shown in Fig. 7. The combination of results described in this Letter gives a signal strength of 2.5 ± 0.5 (stat) ± 0.5 (syst) and a measured cross section of 36 ± 7 (stat) +10 −8 (syst), providing evidence for tttt production with an observed significance of 3.9 standard deviations (1.5 expected). The full combination of all channels for the CMS 2016-2018 data set gives a signal strength of 1.4 ± 0.3 (stat) ± 0.2 (syst) and a measured cross section of 17 ± 4 (stat) ± 3 (syst), increasing the observed significance to 4.0 standard deviations (3.2 expected). This is the most sensitive tttt analysis to date. Table 2: Measured signal strength (µ = σ tt tt /σ SM tt tt where σ SM tt tt = 12 fb), corresponding cross section (in fb), and the expected and observed significance (in standard deviations) for tttt production from all analysis channels. This table shows production from each analysis channel in this Letter, the combination of those channels, the results from previously published results, and the full combination of all CMS 2016-2018 results.

Analysis
Signal

Summary
We have measured the cross section for the simultaneous production of four top quarks (tttt) in proton-proton collisions. The data were collected by the CMS experiment at the LHC in 2016-2018, and correspond to an integrated luminosity of up to 138 fb −1 at a center-of-mass energy of 13 TeV. The all-hadronic final state has been studied for the first time in a tttt production analysis, using a background estimation strategy based on a deep neural network trained  using control regions in data. Final states with one lepton (electron or muon), or two oppositesign leptons have also been analyzed. The observed and expected significances obtained from the combination of the new analyses described here are 3.9 and 1.5 standard deviations, respectively. When combined with published CMS results in other final states, the significances increase to 4.0 (observed) and 3.2 (expected) standard deviations. This is a significant improvement compared to previous CMS results and the first CMS evidence for tttt production with a significance above three standard deviations.

Acknowledgments
We congratulate our colleagues in the CERN accelerator departments for the excellent performance of the LHC and thank the technical and administrative staffs at CERN and at other CMS institutes for their contributions to the success of the CMS effort. In addition, we gratefully acknowledge the computing centers and personnel of the Worldwide LHC Computing Grid and other centers for delivering so effectively the computing infrastructure essential to our analyses. Finally, we acknowledge the enduring support for the construction and operation of the LHC, the CMS detector, and the supporting computing infrastructure provided by the following funding agencies:   [18] ATLAS Collaboration, "Evidence for tttt production in the multilepton final state in proton-proton collisions at √ s = 13 TeV with the ATLAS detector", Eur. Phys. J. C 80 (2020) 1085, doi:10.1140/epjc/s10052-020-08509-3, arXiv:2007.14858.
[37] CMS Collaboration, "Investigations of the impact of the parton shower tuning in Pythia 8 in the modelling of tt at √ s = 8 and 13 TeV", CMS Physics Analysis Summary CMS-PAS-TOP-16-021, 2016.