A deep neural network to search for new long-lived particles decaying to jets

A tagging algorithm to identify jets that are significantly displaced from the proton-proton (pp) collision region in the CMS detector at the LHC is presented. Displaced jets can arise from the decays of long-lived particles (LLPs), which are predicted by several theoretical extensions of the standard model. The tagger is a multiclass classifier based on a deep neural network, which is parameterised according to the proper decay length $\mathrm{c}\tau_0$ of the LLP. A novel scheme is defined to reliably label jets from LLP decays for supervised learning. Samples of pp collision data, recorded by the CMS detector at a centre-of-mass energy of 13 TeV, and simulated events are used to train the neural network. Domain adaptation by backward propagation is performed to improve the simulation modelling of the jet class probability distributions observed in pp collision data. The potential performance of the tagger is demonstrated with a search for long-lived gluinos, a manifestation of split supersymmetric models. The tagger provides a rejection factor of 10000 for jets from standard model processes, while maintaining an LLP jet tagging efficiency of 30-80% for gluinos with 1 mm $\leq c\tau_0 \leq$ 10 m. The expected coverage of the parameter space for split supersymmetry is presented.


Introduction
Machine-learned algorithms are routinely deployed to perform event reconstruction, particle identification, event classification, and other tasks [1] when analysing data samples recorded by experiments at the CERN LHC. Machine learning techniques have been widely adopted to classify a jet, a collimated spray of particles that originates from the hadronisation of a parton, according to the underlying flavour of the initial parton [2]. For example, jets that originate from the hadronisation of b quarks (b jets) exhibit characteristic experimental signatures that can be exploited by dedicated algorithms to identify b jets, in a procedure known as b tagging. The b hadrons, with proper lifetimes of $\mathcal{O}(10^{-12}\,\mathrm{s})$, typically travel distances of approximately 1-10 mm, depending on their momenta, before decaying. As a result, charged particle tracks in jets can originate from one or more common vertices that may be displaced with respect to the proton-proton (pp) collision region. Furthermore, the impact parameter of each track, defined as the spatial distance between the originating pp collision and the track at its point of closest approach, can have a significant nonzero value. The ATLAS [3] and CMS [4] Collaborations have developed numerous algorithms based on boosted decision trees or neural networks to identify b jets [5,6] using the aforementioned and other high-level engineered features. The latest b tagging algorithm developed by the CMS Collaboration is the DeepJet tagger [7,8], which is a multiclass classifier that discriminates between jets originating from the hadronisation of heavy-flavour (b or c) or light-flavour (u, d, and s) quarks or gluons (g) with unprecedented performance. The algorithm is based on a deep neural network (DNN) that exploits particle-level information, as well as jet-level engineered features used in preceding b tagging algorithms [5,6].
Various theoretical extensions of the standard model (SM) [9-15] predict the existence of long-lived particles (LLPs) with a proper lifetime $\tau_0$ that can be very different from those of known SM particles. Consequently, the production and decay of LLPs at the LHC could give rise to atypical experimental signatures. These possibilities have led to the development of a broad search programme at the LHC, based around LLP simplified models [16,17] and novel reconstruction techniques. A comprehensive review of LLP searches at the LHC can be found in Ref. [18].
In this paper, we present the novel application of a DNN to tag (i.e. identify) a jet originating from the decay of an LLP (LLP jet). The DNN is trained and evaluated using a range of signal hypotheses comprising simplified models of split supersymmetry (SUSY) [9,10] with R parity [19] conservation. These models assume the production of gluino ($\tilde{g}$) pairs. The gluino is a long-lived state that decays to a quark-antiquark ($\mathrm{q\bar{q}}$) pair and a weakly interacting and massive neutralino ($\tilde{\chi}^0_1$), which is the lightest SUSY particle and a dark matter candidate. Simplified models of SUSY that yield LLPs are widely used as a benchmark for searches in final states containing jets and an apparent imbalance in transverse momentum, $p_\mathrm{T}^\mathrm{miss}$ [20-31]. The search in Ref. [27] does not explicitly target LLPs, but its inclusive approach provides sensitivity over a large range of lifetimes. This search is used later as a performance benchmark for the methods presented in this paper.
The LLP jet tagger is inspired by the DeepJet approach, albeit with significant modifications. The DNN extends the multiclass classification scheme of the DeepJet algorithm to accommodate the LLP jet class. A procedure to reliably label LLP jets using generator-level information from the Monte Carlo (MC) programs is defined. The experimental signature for an LLP jet depends strongly on the proper decay length $c\tau_0$. Hence, a parameterised approach [32] is adopted by using $c\tau_0$ as an external parameter to the DNN, which permits hypothesis testing using a single network for models with values of $c\tau_0$ that span several orders of magnitude. Furthermore, the jet momenta depend strongly on the masses $m_{\tilde{g}}$ and $m_{\tilde{\chi}^0_1}$ [...]
[...] muon hypothesis. Charged hadrons are identified as charged particle tracks identified neither as electrons nor as muons. Finally, neutral hadrons are identified as HCAL energy clusters not linked to any charged hadron trajectory, or as a combined ECAL and HCAL energy excess with respect to the expected charged hadron energy deposit. The inclusive vertex finding algorithm [41] is used as the standard secondary vertex reconstruction algorithm, which uses all reconstructed tracks in the event with $p_\mathrm{T} > 0.8$ GeV and a longitudinal impact parameter of less than 0.3 cm.
Jets are clustered from PF candidates using the anti-$k_\mathrm{T}$ algorithm [35, 36] with a distance parameter of 0.4. Additional pp interactions within the same or nearby bunch crossings (pileup) can contribute additional tracks and calorimetric energy depositions to the jet momentum. To mitigate this effect, charged particles identified to be originating from pileup vertices are discarded [42] and an offset correction [43] is applied to correct for remaining contributions. In this study, jets are required to satisfy $p_\mathrm{T} > 30$ GeV and $|\eta| < 2.4$, and are subject to a set of loose identification criteria [44] to reject anomalous activity from instrumental sources, such as detector noise. These criteria ensure that each jet contains at least two PF candidates and at least one charged particle track, the energy fraction attributed to charged- and neutral-hadron PF candidates is nonzero, and the fraction of energy deposited in the ECAL attributed to charged and neutral PF candidates is less than unity.
The most accurate estimator of p miss T is computed as the negative p T vector sum of all the PF candidates in an event, and its magnitude is denoted as p miss T [45]. The p miss T is modified to account for corrections to the energy scale [43] of the reconstructed jets in the event. Anomalous high-p miss T events can be due to a variety of reconstruction failures, detector malfunctions, or noncollision backgrounds. Such events are rejected by event filters that are designed to identify more than 85-90% of the spurious high-p miss T events with a rejection (mistagging) rate of less than 0.1% for genuine events [45].
Events of interest are selected using a two-tiered trigger system [46]. The first level, composed of custom hardware processors, uses information from the calorimeters and muon detectors alone, whereas a version of the full event reconstruction software optimised for fast processing is performed at the second level, which runs on a farm of processors.

Event selection and sample composition
Split SUSY, as characterised by the simplified models discussed in this paper, would reveal itself in events containing jets, significant p miss T from undetected neutralinos, and an absence of photons and leptons. Candidate signal events are required to satisfy a set of selection requirements that define a signal region (SR). Conversely, event samples that are enriched in the same background processes that populate the SR, while being depleted in contributions from SUSY processes, are identified as control regions (CRs).
Candidate signal events in the SR are required to satisfy the following set of selection requirements. Events are required to contain at least three jets, as defined in Section 2. Events are vetoed if they contain at least one electron (muon), isolated from other activity in the event, that satisfies $p_\mathrm{T} > 15\,(10)$ GeV, $|\eta| < 2.4$, and loose identification criteria [39,47]. The mass scale of each event is estimated from the scalar $p_\mathrm{T}$ sum of the jets, $H_\mathrm{T} = \sum_{i \in \mathrm{jets}} p_\mathrm{T}^{i}$, which is required to be larger than 300 GeV. An estimator for $p_\mathrm{T}^\mathrm{miss}$ is given by the magnitude of the vector $p_\mathrm{T}$ sum of the jets, $H_\mathrm{T}^\mathrm{miss} = |\sum_{i \in \mathrm{jets}} \vec{p}_\mathrm{T}^{\,i}|$, which is required to be larger than 300 GeV. Events in the SR can be efficiently recorded with a trigger condition that requires the presence of a single jet with $p_\mathrm{T} > 80$ GeV, $H_\mathrm{T}^\mathrm{miss} > 120$ GeV, and $p_\mathrm{T}^\mathrm{miss} > 120$ GeV.
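The two jet-based sums defined above can be illustrated with a minimal sketch. It assumes that each jet is given as a hypothetical (pT, φ) pair that already satisfies the jet requirements of Section 2; the function names are illustrative only and the lepton veto and trigger requirements are not modelled.

```python
import math

def ht_and_mht(jets):
    """jets: list of (pt, phi) pairs in GeV and radians.
    Returns the scalar pT sum H_T and the magnitude of the vector pT sum H_T^miss."""
    ht = sum(pt for pt, _ in jets)
    px = sum(pt * math.cos(phi) for pt, phi in jets)
    py = sum(pt * math.sin(phi) for pt, phi in jets)
    return ht, math.hypot(px, py)

def passes_sr_kinematics(jets):
    """Jet multiplicity, H_T, and H_T^miss thresholds quoted in the text."""
    ht, mht = ht_and_mht(jets)
    return len(jets) >= 3 and ht > 300.0 and mht > 300.0

# Hypothetical event: three selected jets, given as (pT [GeV], phi [rad])
event = [(400.0, 0.0), (150.0, 2.5), (60.0, -2.8)]
print(ht_and_mht(event), passes_sr_kinematics(event))
```

Because $H_\mathrm{T}^\mathrm{miss}$ is built from jets only, it can differ from $p_\mathrm{T}^\mathrm{miss}$, which is why the text treats it as an estimator rather than an identical quantity.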
Following these selections, the dominant contribution to the SM background comprises multijet events produced via the strong interaction, a manifestation of quantum chromodynamics (QCD). The multijet contribution is reduced to a negligible level using the following three criteria. Events containing at least one jet that satisfies p T > 50 GeV and 2.4 < |η| < 5 are vetoed to ensure an adequate resolution performance for the H miss T variable. Events are required to satisfy H miss T /p miss T < 1.25, which mitigates the rare circumstance in which several jets with p T below the aforementioned 30 GeV threshold and collinear in φ lead to large values of H miss T relative to p miss T . The minimum azimuthal separation between each jet and the vector p T sum of all other jets in the event, denoted ∆φ min [48], is required to be greater than 0.2 radians.
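A minimal sketch of the last two multijet-rejection criteria is given below. It assumes jets are supplied as hypothetical (pT, φ) pairs and that the recoil direction for jet k is the negative vector pT sum of all other jets, as for $H_\mathrm{T}^\mathrm{miss}$-type variables; the text does not spell out this sign convention, so it is an assumption of the sketch. The forward-jet veto is not modelled.

```python
import math

def delta_phi(a, b):
    """Absolute azimuthal separation, wrapped into [0, pi]."""
    d = abs(a - b) % (2.0 * math.pi)
    return 2.0 * math.pi - d if d > math.pi else d

def delta_phi_min(jets):
    """Minimum, over jets, of the angle between a jet and the negative
    vector pT sum of all other jets (assumed sign convention)."""
    best = math.pi
    for k, (_, phi_k) in enumerate(jets):
        others = [jet for j, jet in enumerate(jets) if j != k]
        px = -sum(pt * math.cos(phi) for pt, phi in others)
        py = -sum(pt * math.sin(phi) for pt, phi in others)
        if px == 0.0 and py == 0.0:
            continue  # no recoil direction defined for this jet
        best = min(best, delta_phi(phi_k, math.atan2(py, px)))
    return best

def passes_multijet_cleaning(jets, mht, met):
    """H_T^miss / pT^miss < 1.25 and delta-phi-min > 0.2, as quoted in the text."""
    return mht / met < 1.25 and delta_phi_min(jets) > 0.2
```

With this convention, a balanced multijet event in which one jet is undermeasured yields a small value, because the recoil of the remaining jets points along the mismeasured jet.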
The remaining background events in the SR are dominated by contributions from processes that involve the production of high-$p_\mathrm{T}$ neutrinos in the final state, such as the associated production of jets and a Z boson that decays to $\nu\bar{\nu}$. A further significant background contribution arises from events that contain a W boson that undergoes a leptonic decay, W($\to \ell\nu$)+jets, where the charged lepton ($\ell \equiv$ e, µ, or τ) is outside the experimental acceptance, or is not identified, or is not isolated. A substantial contribution is also expected from the production of single top quarks and top quark-antiquark pairs ($\mathrm{t\bar{t}}$), both of which can lead to a final state containing one leptonically decaying W boson and at least one b jet. Residual contributions from rare SM processes, such as diboson production or the associated production of $\mathrm{t\bar{t}}$ and a vector or scalar boson, are not considered in this study.
Two CRs are used to assess differences in the performance of the tagger when using simulated events or pp collision data. The CRs are defined in terms of leptons and jets that satisfy $|\eta| < 2.4$ and the $p_\mathrm{T}$ requirements defined below. The single muon (µ+jets) CR is required to contain exactly one muon satisfying $p_\mathrm{T} > 26$ GeV. The dimuon (µµ+jets) CR is required to contain a second muon that satisfies $p_\mathrm{T} > 15$ GeV. The muons are required to be isolated from other activity in the event and satisfy identification criteria [47]. Events containing additional electrons (muons) with $p_\mathrm{T} > 15\,(10)$ GeV that satisfy looser identification criteria are vetoed. Both CRs must contain at least two jets satisfying $p_\mathrm{T} > 30$ GeV. Events in the µ+jets and µµ+jets CRs are required, respectively, to satisfy $p_\mathrm{T}^\mathrm{miss} > 150$ GeV and $p_\mathrm{T}(\mu\mu) > 100$ GeV, where the latter variable is the magnitude of the vectorial $p_\mathrm{T}$ sum of the two muons. The µ+jets CR comprises, in approximately equal measure, events from the associated production of jets and a W boson, and single top quark and $\mathrm{t\bar{t}}$ production. The µµ+jets CR contains Drell-Yan ($\mathrm{q\bar{q}} \to \mathrm{Z}/\gamma^{*} \to \mu^{\pm}\mu^{\mp}$) events with subdominant contributions from $\mathrm{t\bar{t}}$ and Wt-channel single top quark production. Events in both CRs are efficiently recorded with a trigger condition that requires the presence of a single isolated muon that satisfies $p_\mathrm{T} > 24$ GeV.

Monte Carlo simulation
The DNN is trained to predict the jet class using supervised learning, which relies on generator-level information from MC programs. Various simulated event samples are also used during the evaluation of the DNN to benchmark the performance of the tagger.
Split SUSY predicts the unification of the gauge couplings at high energy [49-51] and a candidate dark matter particle, the neutralino. Apart from a low mass scalar Higgs boson, assumed in this model to be the state observed at 125 GeV [52,53], only the fermionic SUSY particles may be kinematically accessible at the LHC. All other SUSY particles are assumed to be ultraheavy. The gluino is only able to decay through the highly virtual squark states. Hence, the gluino hadronises and forms a bound state with SM particles called an R hadron. The R hadron can travel a significant distance before the gluino undergoes a three-body decay to $\mathrm{q\bar{q}}\tilde{\chi}^0_1$, depending on $c\tau_0$.
The split SUSY simplified models are defined by three parameters: $c\tau_0$ and the masses of the gluino $m_{\tilde{g}}$ and the neutralino $m_{\tilde{\chi}^0_1}$. The following model parameter space is considered by the search: $600 < m_{\tilde{g}} < 2400$ GeV, $m_{\tilde{g}} - m_{\tilde{\chi}^0_1} > 100$ GeV, and 10 µm $< c\tau_0 <$ 10 m. The lower and upper bounds of the $c\tau_0$ range are motivated by the $\mathcal{O}(10\,\mathrm{µm})$ position resolution of the tracker subdetector [54] and the physical dimensions of the CMS detector, respectively. The tagger performance is also assessed using two split SUSY benchmark models that feature: an "uncompressed" mass spectrum, with a large mass difference between the gluino and $\tilde{\chi}^0_1$, $(m_{\tilde{g}}, m_{\tilde{\chi}^0_1}) = (2000, 0)$ GeV; or a "compressed" spectrum that is nearly degenerate in mass, (1600, 1400) GeV. The value of $c\tau_0$ for both models is defined in the text on a case-by-case basis.
Two benchmark models, inspired by gauge-mediated SUSY breaking (GMSB) and R parity violating (RPV) SUSY, are also considered to demonstrate the generalisation of the DNN. The GMSB-inspired model assumes a long-lived gluino, with a mass of 2500 GeV and $c\tau_0$ = 1 mm or 1 m, that decays to a gluon and a light gravitino of mass 1 keV. Again, all other SUSY particles are assumed to be ultraheavy and decoupled from the interaction. The RPV-inspired model assumes the production of top squark-antisquark pairs. The decay of the long-lived top squark, with a mass of 1200 GeV and $c\tau_0$ = 1 mm or 1 m, to a bottom quark and a charged lepton is suppressed through a small R parity violating coupling.
Samples of simulated events are produced with several MC generator programs. For the split SUSY simplified models, the associated production of gluino pairs and up to two additional partons is generated at leading-order (LO) precision in QCD with the MADGRAPH5_aMC@NLO 2.2.2 [55] program. The decay of the gluino is performed with the PYTHIA 8.205 [56] program. The RHADRONS package within the PYTHIA program, steered according to the default parameter settings, is used to describe the formation of R hadrons through the hadronisation of gluinos [57-59]. A similar treatment is performed for the GMSB- and RPV-inspired benchmark models.
The MADGRAPH5_aMC@NLO event generator is used to produce samples of W($\to \ell\nu$)+jets, Z($\to \nu\bar{\nu}$)+jets, and Z/$\gamma^{*}$($\to \mu\mu$)+jets events at next-to-leading-order (NLO) precision in QCD. The samples of W($\to \ell\nu$)+jets events are generated with up to two additional partons at the matrix element level and are merged with jets from the subsequent PYTHIA parton shower simulation using the FxFx scheme [60]. The POWHEG v2 [61-63] event generator is used to simulate $\mathrm{t\bar{t}}$ production [64] and the t-channel [65] and Wt-channel [66] production of single top quarks at NLO accuracy. Multijet events are simulated at LO accuracy using PYTHIA.
The PYTHIA program with the CUETP8M1 tune [71] is used to describe parton showering and hadronisation for all simulated samples except top quark-antiquark production, which used the CUETP8M2T4 tune [72]. The NNPDF3.0 LO and NLO parton distribution functions [73] are used with the LO and NLO event generators, respectively. Minimum bias events are overlaid with the hard scattering event to simulate pileup interactions, with the multiplicity distribution matched to that observed in pp collision data. The resulting events undergo a full detector simulation using the GEANT4 [74] package. The model-dependent interactions of R hadrons with the detector material are not considered in this study.

The LLP jet algorithm
In this section, the DNN architecture and its technical implementation are presented. The use of a single parameterised DNN for hypothesis testing and the application of domain adaptation (DA) to samples of pp collision data recorded by the CMS experiment are presented for the first time.

Jet labelling
Generator-level information from MC programs is often used to label a jet according to its initiating parton for supervised learning. A standard procedure known as "ghost" labelling [42] determines the jet flavour by clustering the reconstructed final-state particles and the generator-level b and c hadrons into jets. Only the directional information of the four-momentum of the generator-level (ghost) hadron is retained to prevent any modification to the four-momentum of the corresponding reconstructed jet. Jets containing at least one ghost hadron are assigned the corresponding flavour label, with b hadrons preferentially selected over c hadrons. Similarly, labels are defined for jets originating from gluons (g) or light-flavour (uds) quarks.
The LLP jet tagger adopts the ghost labelling approach for jets originating from SM background processes. However, a complication arises when applying ghost labelling to the jets originating from $\tilde{g}$ decays. The quark and antiquark produced in the gluino decay can interact with each other, potentially leading to one or more jets that do not point in the same direction as the quarks. Two examples of $\tilde{g} \to \mathrm{q\bar{q}}\tilde{\chi}^0_1$ decays are shown in Fig. 1. For each example, the final-state particles resulting from the hadronisation of one of the quarks are sufficiently diffuse that they are clustered into multiple distinct jets. By definition, ghost labelling cannot account for multiple jets originating from a single ghost particle, and it may even fail to associate the ghost particle with any of the jets if the jets are sufficiently distanced in η-φ space.
Figure 1: Two example $\tilde{g} \to \mathrm{q\bar{q}}\tilde{\chi}^0_1$ decay chains, constructed from information provided by the MADGRAPH5_aMC@NLO [55] and PYTHIA [56] programs. The positions of various particles in the η-φ plane are shown: the LLP ($\tilde{g}$) and its daughter particles ($\mathrm{q\bar{q}}\tilde{\chi}^0_1$) are shown in the lower and middle planes, respectively; the upper plane depicts the location of the stable particles after hadronisation, with shaded ellipses overlaid to indicate the reconstructed jets. Each quark and its decay is assigned a unique colour. The dotted lines indicate the links between parent and daughter particles.
An alternative labelling scheme is defined for jets originating from gluino decays, which can be extended to other LLP decays. All stable SM particles are grouped according to their simulated vertex position and linked to one of the quark daughters from the LLP decay. All stable SM particles, except neutrinos, are clustered into generator-level jets using the anti-$k_\mathrm{T}$ algorithm [35, 36] with a distance parameter of 0.4. Given that constituent particles in a jet may originate from different vertices, the momentum $\vec{p}$ of a given jet is shared between vertices according to the vectorial momentum sum $\sum_i \vec{p}_i$ of the constituent particles $i$ in a jet that share the same vertex $v$. Per jet, the jet-vertex shared momentum fraction $f_v$ with respect to vertex $v$ is then defined as
$$f_v = \frac{\vec{p} \cdot \sum_{i \in v} \vec{p}_i}{p^2}.$$
Each jet is then associated to the vertex $\hat{v}$ from which the majority of its momentum originates, i.e. $\hat{v} = \mathrm{argmax}_v(f_v)$. This criterion prevents the coincidental association of jets containing very few or very soft particles from a gluino decay to a vertex for which the majority of the constituent particles stem from initial-state radiation or the underlying event. A reconstructed jet is uniquely associated with a generator-level jet and they adopt the same label if their axes are aligned within a cone in $\Delta R = \sqrt{(\Delta\eta)^2 + (\Delta\phi)^2}$ [...]. The LLP label is prioritised over the other jet labels to prevent ambiguities. Jets from the split SUSY samples that are not labelled as LLP by this scheme can still comprise a nonnegligible fraction of displaced particles and are thus discarded to prevent class contamination. Non-LLP jets for the DNN training are instead taken from simulated samples of SM backgrounds. Applying an artificial 20% contamination of the LLP jet class from pileup jets, to test the robustness of the labelling procedure, leads to no discernible effect on the tagger performance given the statistical precision of the study.
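The vertex association can be sketched as follows. The sketch assumes that $f_v$ is the projection of the per-vertex momentum sum onto the jet momentum, normalised by the squared jet momentum, so that the fractions of a jet sum to one; this form, the vertex identifiers, and the function names are assumptions made for illustration rather than a statement of the implementation used in the paper.

```python
def jet_vertex_fractions(constituents):
    """constituents: list of (vertex_id, (px, py, pz)) for one generator-level jet.
    Returns {vertex_id: f_v}, assuming f_v = (p_jet . sum_{i in v} p_i) / |p_jet|^2."""
    jet = [sum(p[k] for _, p in constituents) for k in range(3)]
    p2 = sum(c * c for c in jet)
    fractions = {}
    for vertex, p in constituents:
        dot = sum(a * b for a, b in zip(jet, p))
        fractions[vertex] = fractions.get(vertex, 0.0) + dot / p2
    return fractions

def associated_vertex(constituents):
    """Vertex carrying the largest share of the jet momentum (argmax of f_v)."""
    fractions = jet_vertex_fractions(constituents)
    return max(fractions, key=fractions.get)

# Hypothetical jet: two particles from an LLP decay vertex, one from the primary vertex
jet = [("llp", (30.0, 0.0, 0.0)), ("llp", (20.0, 5.0, 0.0)), ("pv", (10.0, -5.0, 0.0))]
print(jet_vertex_fractions(jet), associated_vertex(jet))
```

Taking the argmax rather than requiring any nonzero fraction is what protects against jets that contain only a few soft particles from the LLP decay.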

Deep neural network architecture to predict the jet class
The architecture used to predict the jet class, inspired by the DeepJet algorithm, is shown in Fig. 2. The DNN considers approximately 600 input features, which can be grouped into four categories: up to 25 charged (neutral) PF candidates, each described by 17 (6) features and ordered by impact parameter significance (transverse momentum); up to four secondary vertices, each described by 12 features; and 14 global features associated with the jet. Zero padding is used to accommodate the variable numbers of PF candidates and secondary vertices.
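The fixed-size input layout and the zero padding can be illustrated with a short sketch. The dictionary keys and the helper name `pad` are hypothetical, and the candidates are assumed to be already ordered as described above.

```python
# (maximum multiplicity, features per object) for each group quoted in the text
LAYOUT = {"charged": (25, 17), "neutral": (25, 6), "sv": (4, 12)}
N_GLOBAL = 14

def pad(rows, max_rows, n_features):
    """Truncate to max_rows, then append all-zero rows up to a fixed shape."""
    kept = [list(r) for r in rows[:max_rows]]
    return kept + [[0.0] * n_features for _ in range(max_rows - len(kept))]

# Total number of scalar inputs implied by the layout: 25*17 + 25*6 + 4*12 + 14
n_inputs = sum(n * f for n, f in LAYOUT.values()) + N_GLOBAL

# A jet with a single reconstructed secondary vertex still yields a 4 x 12 block
sv_block = pad([[1.0] * 12], *LAYOUT["sv"])
```

The layout gives 637 scalar inputs, consistent with the approximate figure quoted in the text.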
Each charged PF candidate is described by the following features: the kinematical properties of the associated track (both absolute and relative to the jet axis), the track quality, and the transverse and three-dimensional impact parameters (and their significances) of the track. Each neutral PF candidate is described by its energy, the fractions of its energy deposited within the ECAL and HCAL subdetectors, the compatibility with the photon hypothesis, the compatibility with the pileup hypothesis as determined by the PUPPI algorithm [75,76], and the collinearity with respect to the jet axis and the nearest secondary vertex. The features that describe each reconstructed secondary vertex include its kinematical properties, the number of associated tracks, and the three-dimensional displacement (and significance) with respect to the primary pp collision vertex. The global jet features comprise the jet momentum and pseudorapidity, the number of constituent PF candidates, the number of reconstructed secondary vertices, and several high-level engineered features used by the CSV b tagging algorithm [5].
Four sequential layers of one-dimensional convolutions with a kernel size of one are used, with each layer comprising 64, 32, 16, 8, or 4 filters depending on the group of input features. Per particle candidate or vertex, each convolutional layer transforms the features from the preceding layer into a number of outputs equal to its number of filters. By choosing a small number of filters for the final layer, the overall operation can be viewed as a compression. After each layer, a leaky rectified linear (LeakyReLU) [77] activation function is used. Dropout [78] layers are interleaved throughout the network with a 10% dropout rate to mitigate overfitting. After the final convolutional layer [...] The resulting feature vector is fed into a multilayer perceptron, a series of dense layers comprising 200, 100, or 50 neurons. The softmax activation function is used in the last layer. Categorical cross entropy is used for the loss function to predict the jet class probability. The DNN provides an estimate of the probabilities for the following jet classes: LLP jet, b or c jet, uds (light-flavour quarks) jet, and g jet. The latter two classes are frequently combined when evaluating the network performance to give a light-flavour (udsg) jet class.
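A convolution with a kernel size of one applies the same small dense layer to every candidate independently, which the following pure-Python sketch makes explicit. The filter sequence (64, 32, 16, 8), the LeakyReLU slope, the random weights, and the four example class scores are illustrative assumptions; dropout and training are not modelled.

```python
import math
import random

def leaky_relu(x, slope=0.01):
    # The slope is an assumed value; the text does not quote one.
    return x if x > 0.0 else slope * x

def conv1x1(candidates, weights, bias):
    """Kernel-size-one convolution: one shared linear map plus LeakyReLU per candidate.
    candidates: [n_cand][n_in], weights: [n_out][n_in], bias: [n_out]."""
    return [[leaky_relu(sum(w * x for w, x in zip(row, cand)) + b)
             for row, b in zip(weights, bias)]
            for cand in candidates]

def softmax(scores):
    """Numerically stable softmax, as used in the last layer."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def random_layer(n_in, n_out, rng):
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(1)
x = [[rng.random() for _ in range(17)] for _ in range(25)]  # 25 charged candidates
n_in = 17
for n_out in (64, 32, 16, 8):  # illustrative filter counts for one input group
    w, b = random_layer(n_in, n_out, rng)
    x = conv1x1(x, w, b)
    n_in = n_out
compressed = [value for cand in x for value in cand]  # flattened: 25 x 8 values
class_probs = softmax([2.0, 0.5, 0.1, -1.0])  # hypothetical class scores
```

In this sketch the 25 × 17 block of charged-candidate features is compressed to 25 × 8 values before flattening, which is the sense in which the convolutional stack acts as a compression.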

Network parameterisation according to cτ 0
The experimental signature for a jet produced in a gluino decay depends strongly on cτ 0 . Information from all CMS detector systems is available if the gluino decay occurs promptly, in the vicinity of the pp collision region, while information can be limited if the decay occurs in the outermost detector systems. Hence, cτ 0 is introduced as an input parameter to the dense network, as indicated in Fig. 2. This parameterised approach [32] allows for hypothesis testing with a single network for values of cτ 0 that span six orders of magnitude: 10 µm < cτ 0 < 10 m.
During the network evaluation for jets from both signal and SM processes, the cτ 0 value is given by the signal hypothesis under test.

Domain adaptation by backward propagation
The simulated event samples are of limited accuracy and do not exhaustively reproduce all features observed in the pp collision data. Hence, a neural network may produce different identification efficiencies when evaluated on simulated samples and pp collision data if the training of the neural network relies solely on simulation. Domain adaptation is a technique that attempts to construct a feature representation that is invariant with respect to the domain from which the features are obtained. In this study, the domain is either pp collision data or simulation, and the domain-invariant representation is obtained from the output of the feature extraction subnetwork, as indicated in Fig. 2. Hence, for the subsequent task of classifying jets, a similar performance is expected for both simulated events and pp collision data [79].
In this paper, we use DA by backpropagation of errors [34]. To achieve this, an additional branch is added after the first dense layer of 200 nodes. Binary cross entropy is used for the loss function to predict the jet domain probability. During training, the combined loss L class + λL domain is minimised, where λ is a hyperparameter that controls the magnitude of the jet domain loss. A gradient reversal layer is inserted in the domain branch directly after the first dense layer. This special layer is only active during backward propagation and reverses the gradients of the domain loss L domain with respect to the network weights w in the preceding layers. This forces the feature extraction subnetwork to focus less on domain-dependent features because its weights effectively minimise L class − λL domain instead. Only features that are common to both pp collision data and simulation are retained. For a full mathematical description of the gradient reversal layer, see Ref. [34].
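The effect of the gradient reversal layer on the gradients reaching the feature extraction subnetwork can be illustrated with a toy sketch. The class and function names are hypothetical, gradients are represented as plain lists, and no automatic differentiation framework is assumed.

```python
class GradientReversal:
    """Identity in the forward pass; multiplies incoming gradients by -lam
    in the backward pass."""

    def __init__(self, lam):
        self.lam = lam

    def forward(self, features):
        return list(features)

    def backward(self, grad_from_domain_branch):
        return [-self.lam * g for g in grad_from_domain_branch]

def feature_gradient(grad_class, grad_domain, lam):
    """Gradient seen by the shared features: dL_class/df - lam * dL_domain/df."""
    reversed_grad = GradientReversal(lam).backward(grad_domain)
    return [gc + gr for gc, gr in zip(grad_class, reversed_grad)]

# Hypothetical per-feature gradients from the class and domain branches
print(feature_gradient([1.0, 2.0], [0.5, -1.0], lam=2.0))
```

The domain branch itself still receives the unreversed gradient and therefore keeps trying to separate data from simulation, while the shared layers are updated so as to make that separation harder.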

Deep neural network training
Supervised training of the DNN is performed to predict the jet class and domain. The Adam optimiser [80] is used to minimise the loss function with respect to the network parameters. The DNN training relies on simulated events of gluino production from split SUSY models, and multijet and $\mathrm{t\bar{t}}$ production. Jets in these samples that have an uncorrected $p_\mathrm{T}$ greater than 20 GeV and satisfy the loose requirements outlined in Section 2 are considered for the DNN training. Approximately 20 million jets are used. Event samples of split SUSY models are utilised over a wide range of $(m_{\tilde{g}}, m_{\tilde{\chi}^0_1})$ [...] The jets from SM processes are drawn randomly (from a larger sample) to obtain the same $p_\mathrm{T}$ and η distributions available for the LLP jet class. This resampling is performed to ensure an adequate generalisation that is largely independent of kinematical features related to the physics process. The $c\tau_0$ values assigned to the SM jets are generated randomly according to the $c\tau_0$ distribution obtained per batch for the LLP jet class.
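The per-batch assignment of $c\tau_0$ values to SM jets can be sketched as below. The sketch assumes that the values are drawn with replacement from those carried by the LLP jets in the same batch, which is one way to follow the per-batch distribution; the dictionary keys and the function name are hypothetical, and the batch is assumed to contain at least one LLP jet.

```python
import random

def assign_ctau(batch, rng):
    """batch: list of dicts with a 'label' key and, for LLP jets, a 'ctau' key (mm).
    Returns a copy in which every non-LLP jet carries a ctau value drawn from the
    empirical ctau distribution of the LLP jets in the same batch."""
    llp_ctaus = [jet["ctau"] for jet in batch if jet["label"] == "LLP"]
    assigned = []
    for jet in batch:
        jet = dict(jet)  # leave the input batch untouched
        if jet["label"] != "LLP":
            jet["ctau"] = rng.choice(llp_ctaus)
        assigned.append(jet)
    return assigned

# Hypothetical batch: two LLP jets and two SM jets
batch = [{"label": "LLP", "ctau": 1.0}, {"label": "LLP", "ctau": 1000.0},
         {"label": "b"}, {"label": "udsg"}]
print(assign_ctau(batch, random.Random(0)))
```

Matching the $c\tau_0$ distribution between classes prevents the network from using the parameter value itself to separate LLP jets from SM jets.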
The training of the domain branch uses batches of 10 000 jets, drawn from samples of µ+jets data and simulated W($\to \ell\nu$)+jets and $\mathrm{t\bar{t}}$ events. The DNN is trained using the class and domain batches simultaneously. Events in the domain batch are assigned the same $c\tau_0$ values as used by the class batch. For the domain batch, only the six highest $p_\mathrm{T}$ jets are used, and jets may be reused multiple times per epoch.
The DNN is initially trained to predict only the jet class to determine the optimal scheduling of the learning rate α, which decays from an initial value of 0.01 according to $\alpha = 0.01/(1 + \kappa n)$, where $n$ is the epoch number and κ is the decay constant. The classifier performance is optimal for κ = 0.1 and only weakly dependent on κ. The DNN is then trained to predict both the jet class and domain, and the λ hyperparameter is increased according to $\lambda = \lambda_0 [2/(1 + e^{-0.2n}) - 1]$ with $\lambda_0 = 30$, such that λ increases from 0 to $0.9\lambda_0$ after 15 epochs. The parameterisations used to evolve the values of the α and λ hyperparameters during the DNN training were chosen from several trials to ensure reproducible and optimal performance.
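The two schedules can be written out directly from the expressions above; the function names are illustrative.

```python
import math

def learning_rate(n, kappa=0.1, alpha0=0.01):
    """alpha = alpha0 / (1 + kappa * n), with n the epoch number."""
    return alpha0 / (1.0 + kappa * n)

def domain_weight(n, lam0=30.0):
    """lambda = lam0 * (2 / (1 + exp(-0.2 n)) - 1): rises from 0 towards lam0."""
    return lam0 * (2.0 / (1.0 + math.exp(-0.2 * n)) - 1.0)

for epoch in (0, 5, 15, 30):
    print(epoch, learning_rate(epoch), domain_weight(epoch) / 30.0)
```

At n = 15 the bracket evaluates to 2/(1 + e^{-3}) − 1 ≈ 0.905, in line with the statement that λ reaches about $0.9\lambda_0$ after 15 epochs.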

Workflow
The KERAS v2.1.5 [81] software package is used to implement the DNN architecture. The TENSORFLOW v1.6 [82] queue system is used to read and preprocess files for the DNN training. A schematic overview of the pipeline used to preprocess batches of jets for the class prediction is given in Fig. 3, while a similar queue is also used for the domain prediction. At the beginning of each epoch, a queue holding a randomised list of the input file names is initialised. File names are dequeued asynchronously in multiple threads. For each thread, ROOT v6.18.00 [83] trees contained in the files are read from disk to memory in batches using a TENSORFLOW operation kernel, developed in the context of this paper. The resulting batches are resampled to achieve the same distributions in $p_\mathrm{T}$ and η for all jet classes and are enqueued asynchronously into a second queue, which caches a list of tensors. The DNN training commences by dequeuing a randomised batch of tensors and generating $c\tau_0$ values for all SM jets within the batch. The advantages of this system lie in its flexibility to adapt to new input features or samples on-the-fly. The reading of ROOT trees and the ($p_\mathrm{T}$, η) resampling for the SM jets proceed asynchronously in multiple threads, managed by TENSORFLOW, on the CPU while the network is being trained. A demonstration of the workflow can be found at Ref. [84].
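The two-queue structure can be mimicked with the Python standard library, as in the sketch below. This is an analogy only: it does not use the TENSORFLOW queue system or the custom ROOT-reading kernel, the `read_file` callable is a hypothetical stand-in for reading and resampling one file, and the worker threads are joined before the batches are consumed, whereas the pipeline described above overlaps reading with training.

```python
import queue
import random
import threading

def run_epoch(file_names, read_file, n_threads=2, seed=0):
    """Shuffle the file names into a first queue, process them in worker threads,
    and collect the resulting batches from a second queue."""
    names = list(file_names)
    random.Random(seed).shuffle(names)
    file_queue = queue.Queue()
    for name in names:
        file_queue.put(name)
    batch_queue = queue.Queue()

    def worker():
        while True:
            try:
                name = file_queue.get_nowait()
            except queue.Empty:
                return  # no file names left for this epoch
            batch_queue.put(read_file(name))  # stand-in for read + resample

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    batches = []
    while not batch_queue.empty():
        batches.append(batch_queue.get())
    return batches

# Hypothetical file names; the reader simply tags the name
print(run_epoch(["a.root", "b.root", "c.root"], lambda name: ("batch", name)))
```

The order in which batches arrive depends on thread scheduling, which is acceptable here because the training consumes batches in randomised order anyway.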

Validation with pp collision data
In the absence of DA, the LLP jet probability $P(\mathrm{LLP}|c\tau_0)$ obtained from the simulation of the relevant SM backgrounds and the CR data can differ significantly, with deviations of up to 50% in the binned counts of $P(\mathrm{LLP}|c\tau_0)$ distributions. A related event-level variable is $P_\mathrm{max}(\mathrm{LLP}|c\tau_0)$, which is defined as the maximum value of $P(\mathrm{LLP}|c\tau_0)$ obtained from all selected jets in a given event. A comparison of the $P_\mathrm{max}(\mathrm{LLP}|c\tau_0)$ distributions obtained from pp collision data and simulated events in both the µ+jets and µµ+jets CRs, using a DNN trained with and without DA, is shown in Fig. 4. The use of DA in the network leads to a significant improvement in the level of agreement in the binned counts of $P_\mathrm{max}(\mathrm{LLP}|c\tau_0)$ for the two domains of data and simulation, with only small residual differences remaining. This improvement is expected for the µ+jets CR, as the same events are used to train, evaluate, and optimise the domain branch of the DNN. The µµ+jets CR, comprising a statistically independent event sample, validates the method.
Figure 4: Distributions of the maximum probability for the LLP jet class obtained from all selected jets in a given event, $P_\mathrm{max}(\mathrm{LLP}|c\tau_0)$. The distributions from pp collision data (circular marker) and simulated events (histograms) are compared in the µ+jets (upper row) and µµ+jets (lower row) CRs, using a DNN trained without (left column) and with (right column) DA. All probabilities are evaluated with $c\tau_0$ = 1 mm. The Jensen-Shannon divergence (JSD) is introduced in the text. The lower subpanels show the ratios of the binned yields obtained from data and Monte Carlo (MC) simulation. The statistical (hatched bands) and systematic (solid bands) uncertainties due to the finite-size simulation samples and the simulation mismodelling of the mistag rate, respectively, are also shown.
An estimate of the uncertainty in P max (LLP|cτ 0 ) due to simulation mismodelling is determined from jets in a statistically independent sample of µµ+jets events that satisfy p T (µµ) < 100 GeV.
The magnitude of the uncertainty is assessed by weighting up or down the simulated events by the factor $w^{\pm} = \prod_{i \in \mathrm{jets}} \left[ 1 \pm \left( \xi_i(\mathrm{LLP}) - 1 \right) \right]$, where $\xi_i(\mathrm{LLP})$ is the ratio of counts from data and simulation in bin i of the P(LLP|cτ 0 ) distribution. The ratios of event counts binned according to P max (LLP|cτ 0 ) from pp collision data and simulation, as well as the corresponding uncertainty estimates, are shown in the lower panels of Fig. 4. The ratios are closer to unity, with reduced uncertainties, following the application of DA. The level of agreement between data and simulation is further quantified by the Jensen-Shannon divergence (JSD) [85], a measure of similarity between two probability distributions that is bounded to [0, 1] and takes a value of zero for identical distributions. The JSD is reduced by an order of magnitude following the application of DA. The quoted uncertainties in JSD reflect the finite sizes of the data and simulated samples.
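The two quantities used in this paragraph can be sketched in a few lines of Python. The sketch assumes that the weight factor is read as a product over jets of 1 ± (ξ i − 1), and that the JSD is computed with a base-2 logarithm so that it lies in [0, 1]; the function names and the toy inputs are hypothetical.

```python
import math


def event_weights(xi_per_jet):
    """Return (w_plus, w_minus) as the product over jets of 1 +/- (xi_i - 1).

    xi_per_jet holds, for each jet, the data/simulation ratio in the bin of
    the P(LLP|ctau0) distribution that the jet falls into (assumed reading).
    """
    w_plus = 1.0
    w_minus = 1.0
    for xi in xi_per_jet:
        w_plus *= 1.0 + (xi - 1.0)
        w_minus *= 1.0 - (xi - 1.0)
    return w_plus, w_minus


def jensen_shannon_divergence(p_counts, q_counts):
    """JSD between two binned distributions, using log base 2 (range [0, 1])."""
    p_total = float(sum(p_counts))
    q_total = float(sum(q_counts))
    p = [count / p_total for count in p_counts]
    q = [count / q_total for count in q_counts]
    mixture = [0.5 * (pi + qi) for pi, qi in zip(p, q)]

    def kl_divergence(a, m):
        # Bins with a_i = 0 contribute zero by convention.
        return sum(ai * math.log2(ai / mi) for ai, mi in zip(a, m) if ai > 0.0)

    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


if __name__ == "__main__":
    print(event_weights([1.1, 0.9]))
    print(jensen_shannon_divergence([10, 20, 30], [12, 18, 30]))
```

Identical histograms give a JSD of zero and histograms with no overlapping bins give a JSD of one, which matches the bounds quoted in the text.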
The application of DA leads to significantly reduced biases and uncertainties in the modelling of P(LLP|cτ 0 ) and related variables in the signal-depleted CRs. This behaviour would translate into an improved treatment for the estimation of SM backgrounds in the SR and the associated experimental systematic uncertainties. However, only modest gains in sensitivity to new high-mass particle states may be expected, as the dominant uncertainties arise from the finite-size samples of pp collision data and simulated events.

Performance
The performance of the tagger is demonstrated using simulated event samples for split SUSY benchmark models with an uncompressed mass spectrum, as defined in Section 4, and cτ 0 values of 1 mm and 1 m. The two values of cτ 0 give greater weight to the roles of the tracker and calorimeter systems, respectively. Negligible SM background contributions are expected for the 1 m scenario. An inclusive sample of $\mathrm{t\bar{t}}$ events is used to provide both light-flavour (udsg) jets, through initial-state radiation and hadronic decays of the W boson, and b jets, with p T > 30 GeV and |η| < 2.4.
The efficiency of the tagger to identify correctly the LLP jet class depends on the chosen working point, defined by a threshold requirement on the jet class probability. The mistag rates for the remaining jet classes also depend on the same working point. The receiver operating characteristic (ROC) curves that provide the LLP jet tagging efficiency and the mistag rate for the udsg jet class as a function of the working point are shown in Fig. 5. The uncertainties indicated by the shaded bands are determined from the maximum variations obtained from a threefold cross validation. Given a mistag rate of 0.01%, equivalent to a background rejection factor of 10 000, efficiencies of 40 and 70% are obtained for split SUSY models with cτ 0 values of 1 mm and 1 m, respectively. These efficiencies decrease by a factor ≈0.6 if the DNN relies solely on the global jet features. Additional studies reveal that the tagger efficiency for the LLP jet class falls to 25% in order to achieve the same mistag rate of 0.01% for the (SM) b jet class when considering split SUSY models with cτ 0 = 1 mm. Figure 5 also shows ROC curves when evaluating the DNN using simulated events from the GMSB and RPV SUSY benchmark models, as defined in Section 4. The jets originate from uds quarks (gluons) from the gluino decay in the case of split (GMSB) SUSY models, and b hadrons from the top squark decay for RPV SUSY. A similar level of performance is observed for these SUSY models, in which the LLP jets have a different underlying flavour. Furthermore, the ROC curves indicated by the thick and thin solid curves illustrate the tagger performance when the DNN is trained with or without DA, respectively, for the split SUSY benchmark models. Studies demonstrate a comparable performance for split SUSY models with and without DA for the region cτ 0 ≤ 10 mm. 
For larger values of cτ 0 , the performance is overestimated and the uncertainties are small (not visible in the figure) in the absence of DA, as indicated by the cτ 0 = 1 m scenario shown in Fig. 5. This is because the DNN is able to exploit patterns in the features obtained from simulation that are not representative of those obtained from data.

The LLP jet tagging efficiency is shown in Fig. 6 as a function of the jet p T , η, and the number of reconstructed secondary vertices within the jet (N SV ) for a working point that yields a mistag rate of 0.01% for the udsg jet class obtained from simulated $\mathrm{t\bar{t}}$ events. Efficiencies are highest for high-p T , centrally produced jets with N SV = 0. The latter observation demonstrates the complementary performance of the tagger with respect to a more standard approach of relying on reconstructed secondary vertices to identify displaced jets.

The performance of the DNN parameterisation according to cτ 0 is shown in Fig. 7. The left panel shows the LLP jet tagging efficiency, using a working point that yields a mistag rate of 0.01% for the udsg jet class, as a function of the generated cτ 0 value, for both an uncompressed and a compressed split SUSY model. Efficiencies in the range 40-80 (30-70)% are achieved for uncompressed (compressed) scenarios with 1 mm ≤ cτ 0 ≤ 10 m. The compressed scenarios are characterised by low-p T jets, resulting in lower efficiencies.
The performance is further tested using two uncompressed split SUSY models with cτ 0 = 1 mm and 1 m. The LLP jet tagging efficiency is obtained by evaluating the DNN for each value of cτ 0 in the range 10 µm ≤ cτ 0 ≤ 10 m. Again, the efficiency is determined using a working point that is tuned for each evaluated cτ 0 value to yield a mistag rate of 0.01% for the udsg jet class. The efficiency as a function of the evaluated cτ 0 value is shown in Fig. 7 (right). The maximum efficiency is obtained when the evaluated value of cτ 0 approximately matches the parameter value of the split SUSY model. This behaviour may be used to characterise a potential signal contribution in terms of the model parameter cτ 0 . Finally, studies demonstrate that the parameterised approach does not significantly impact performance with respect to the training of multiple DNNs, one per cτ 0 value.
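The notion of a working point tuned to a fixed mistag rate, used throughout this section, can be sketched as follows. This is an illustrative implementation under simple assumptions: a jet is tagged when its LLP probability is strictly above the threshold, and the threshold is taken from the sorted background scores. The function names and the toy scores are hypothetical.

```python
def threshold_for_mistag_rate(background_scores, target_mistag_rate):
    """Return a threshold such that the fraction of background jets with a
    score strictly above it does not exceed target_mistag_rate."""
    ordered = sorted(background_scores, reverse=True)
    # Maximum number of background jets that may pass the requirement.
    n_allowed = int(target_mistag_rate * len(ordered))
    if n_allowed >= len(ordered):
        return float("-inf")  # every jet is allowed to pass
    return ordered[n_allowed]


def efficiency(scores, threshold):
    """Fraction of jets with a score strictly above the threshold."""
    return sum(1 for score in scores if score > threshold) / len(scores)


if __name__ == "__main__":
    # Toy scores; in practice these would be P(LLP|ctau0) values evaluated
    # for udsg jets (background) and LLP jets (signal) at one ctau0 value.
    background = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
    signal = [0.85, 0.99, 0.5, 0.9]
    working_point = threshold_for_mistag_rate(background, 0.2)
    print(working_point, efficiency(signal, working_point))
```

Repeating the threshold determination for each evaluated cτ 0 value gives one working point per value, as described in the text.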

Application to a search for LLPs
The performance of the LLP jet tagger is demonstrated by applying it to the search for long-lived gluinos with 10 µm ≤ cτ 0 ≤ 10 m. Expected limits on the theoretical production cross section for gluino pairs are determined.

Categorisation of events and uncertainties
Candidate events that satisfy the SR requirements defined in Section 3 are categorised according to: the number of jets in the event, N jet ; the number of jets for which P(LLP|cτ 0 ) is above a predefined threshold, N tag ; and H T . The resulting six categories are summarised in Table 1. Models with a compressed mass spectrum, $m_{\tilde{g}} - m_{\tilde{\chi}^0_1} < 200$ GeV, are characterised by lower values of H T because of the limited kinematic phase space available to the jets from the gluino decay and an increased reliance on associated jet production from initial-state radiation. Events that satisfy N tag < 2 are grouped into a single additional category, which is used to constrain the normalisation of simulated background events during the statistical evaluation.

Table 1: The event counts and uncertainties for SM backgrounds and split SUSY models, as determined from simulation, in categories defined by H T and (N jet , N tag ). The simulated samples are normalised to an integrated luminosity of 35.9 fb −1 . The uncompressed and compressed split SUSY models are defined in Section 4. The value of cτ 0 is assumed to be 1 mm. The uncertainties include both statistical and systematic contributions. Expected counts for events that satisfy N tag < 2 are not shown.
H T (GeV): 300-800, 300-800, 300-800, >800, >800, >800

The tagger is evaluated using simulated samples of all relevant SM backgrounds, described in Section 4. The negligible background contribution from multijet events in the SR is not considered in this exploratory study. The predefined threshold on P(LLP|cτ 0 ) is determined per cτ 0 value such that the most sensitive event categories are nearly free of SM background contributions, while control over uncertainties due to the finite-size simulated samples is maintained.
The P(LLP|cτ 0 ) thresholds fall in the range 30-50% and yield an LLP jet tagging efficiency of about 30-90% for cτ 0 ≥ 1 mm. Table 1 summarises the expected counts and uncertainties for the contributions from SM backgrounds, in the various event categories, for cτ 0 = 1 mm. The statistical uncertainty arising from the finite size of the simulated samples is the dominant contribution to the quoted uncertainties. Additional systematic contributions are described below. The expected event counts from the uncompressed and compressed benchmark models of split SUSY, defined in Section 4, are also provided.
Several sources of systematic uncertainty in the SM background expectations are considered. An uncertainty of ±20% is assumed in the normalisation of each dominant background process, W($\to \ell\nu$)+jets, Z($\to \nu\bar{\nu}$)+jets, and single top quark and $\mathrm{t\bar{t}}$ production, which is motivated by theoretical uncertainties in the production cross sections and uncertainties in the experimental acceptance for the final-state leptons [86]. The uncertainty in the mistag rate for jets from SM processes is typically below 10%, determined by the procedure described in Section 6. The jet energy is varied within its uncertainty and resolution [43], and the resulting shifts are propagated to p miss T . The unclustered component of p miss T is varied within its uncertainties. The uncertainty in the number of pileup interactions is determined by varying the inelastic pp cross section within its uncertainty of ±5% [87]. The renormalisation and factorisation scales of the aforementioned four dominant SM backgrounds are varied independently per process by factors of 0.5 and 2 [88] to estimate the migration of events between categories. An uncertainty of ±2.5% in the integrated luminosity is assumed [89].
The LLP jet tagging efficiency determined from simulated events of split SUSY models, $\epsilon_{\mathrm{MC}}$, may require the application of corrections to account for sources of potential mismodelling. The corrections are applied by reweighting simulated events with a per-event weight that depends on the SF and $\epsilon_{\mathrm{MC}}$, where SF denotes the scale factor correction to $\epsilon_{\mathrm{MC}}$, and the product of $\epsilon_{\mathrm{MC}}$ and the SF is bounded to [0, 1]. The (N jet , N tag ) categorisation scheme allows the SF to be constrained in situ during hypothesis testing by adding the SF as a nuisance parameter to the likelihood model, described below.
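A common way to apply a tagging-efficiency scale factor, familiar from b tagging, is to weight each tagged jet by the SF and each untagged jet by (1 − SF·ε MC )/(1 − ε MC ). The sketch below implements this standard prescription as an assumption; the text above does not confirm that it is the exact reweighting formula used, and the function name and arguments are hypothetical.

```python
def scale_factor_event_weight(jet_is_tagged, eff_mc_per_jet, sf):
    """Per-event weight for a tagging-efficiency scale factor (assumed form).

    jet_is_tagged: list of booleans, one per LLP jet in the event.
    eff_mc_per_jet: simulated tagging efficiency for each of those jets.
    sf: scale factor applied to the simulated efficiency.
    """
    weight = 1.0
    for tagged, eff_mc in zip(jet_is_tagged, eff_mc_per_jet):
        # The corrected efficiency SF * eff_mc must remain a probability.
        if not 0.0 <= sf * eff_mc <= 1.0:
            raise ValueError("SF * eff_mc must lie in [0, 1]")
        if tagged:
            weight *= sf
        else:
            if eff_mc >= 1.0:
                raise ValueError("an untagged jet requires eff_mc < 1")
            weight *= (1.0 - sf * eff_mc) / (1.0 - eff_mc)
    return weight


if __name__ == "__main__":
    # One tagged and one untagged jet, each with a simulated efficiency of 50%.
    print(scale_factor_event_weight([True, False], [0.5, 0.5], 1.2))
```

With this form the weights average to unity over the tagged and untagged outcomes of each jet, so the SF moves events between N tag categories without changing the overall normalisation, which is what allows the categorisation scheme to constrain it.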

Signal hypothesis testing
A likelihood model is used to test for the presence of new-physics signals in the SR. The observed event count in each event category is modelled as a Poisson-distributed variable around the sum of the SM expectation and a potential signal contribution. The expected event counts from SM backgrounds are obtained from the simulated samples. The uncertainties resulting from the finite simulated samples are modelled using the Barlow-Beeston method [90]. The systematic uncertainties in the SM background estimates are accommodated in the likelihood model as nuisance parameters.
Hypothesis testing is performed using an Asimov data set [91] to provide expected constraints on simplified models of split SUSY. A modified frequentist approach is used to determine the expected upper limits at 95% confidence level (CL) on the theoretical gluino pair production cross section as a function of $m_{\tilde{g}}$, $m_{\tilde{\chi}^0_1}$, and cτ 0 . The signal strength parameter, r UL , expresses the upper limit on the production cross section relative to the theoretical value. Alternatively, expected lower limits at 95% CL on $m_{\tilde{g}}$ can be determined as a function of cτ 0 . The approach is based on the profile likelihood ratio as the test statistic [92], the CLs criterion [93,94], and the asymptotic formulae [91] to approximate the distributions of the test statistic under the background-only and signal-plus-background hypotheses.

Figure 8 shows the negative log-likelihood from a maximum likelihood fit to the Asimov data as a function of both the SF and r/r UL , where the signal strength parameter r represents the injected gluino pair production cross section relative to the theoretical value, for two split SUSY benchmark scenarios with an uncompressed and compressed mass spectrum, and cτ 0 = 10 mm. The model assumptions are indicated in the figure legends. The SF is constrained to >0.6 (>0.3) at 68% CL for r/r UL = 1 for the uncompressed (compressed) model, and it is bounded from above at 1.2 by the condition SF · $\epsilon_{\mathrm{MC}}$ ∈ [0, 1].

Figure 9 summarises the expected lower limit on $m_{\tilde{g}}$ (95% CL) as a function of cτ 0 for simplified models of split SUSY that assume the production of gluino pairs. The model assumptions [...] 1 mm. The coverage is also competitive with respect to a dedicated reconstruction technique that is reported in Ref. [23].
For the region cτ 0 < 1 mm, the tagger performance degrades because of a limited ability to tag an LLP jet in the vicinity of the primary pp interaction vertex, while the reference search is able to exploit the distinguishing kinematical features of split SUSY events through a finer categorisation of candidate signal events. For the region cτ 0 > 1 m, the LLPs frequently decay outside the experimental acceptance, which leads to an increased reliance on the presence of initial-state radiation.
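The expected-limit procedure used in this section can be illustrated with a strongly simplified counting experiment. The sketch below assumes Poisson-distributed counts in independent categories, a background-only Asimov data set, and no nuisance parameters, so it omits the profiling of systematic uncertainties and the Barlow-Beeston treatment of the full analysis. It uses the asymptotic approximation in which the median expected CLs equals 2(1 − Φ(√q)), with q the test statistic evaluated on the Asimov data. The function names and toy yields are hypothetical.

```python
import math
from statistics import NormalDist


def asimov_q_mu(mu, signal, background):
    """Likelihood-ratio test statistic for signal strength mu, evaluated on
    background-only Asimov data (observed count = expected background)."""
    q = 0.0
    for s, b in zip(signal, background):
        expected = mu * s + b
        q += 2.0 * (expected - b - b * math.log(expected / b))
    return max(q, 0.0)  # guard against tiny negative rounding errors


def expected_cls(mu, signal, background):
    """Median expected CLs under the background-only hypothesis (asymptotic)."""
    p_sb = 1.0 - NormalDist().cdf(math.sqrt(asimov_q_mu(mu, signal, background)))
    return p_sb / 0.5  # CL_b is 0.5 for the median background-only outcome


def expected_upper_limit(signal, background, cl=0.95, mu_max=1000.0):
    """Signal strength at which the expected CLs falls to 1 - cl (bisection)."""
    low, high = 0.0, mu_max
    for _ in range(100):
        mid = 0.5 * (low + high)
        if expected_cls(mid, signal, background) > 1.0 - cl:
            low = mid  # not yet excluded: the limit lies at larger mu
        else:
            high = mid
    return 0.5 * (low + high)


if __name__ == "__main__":
    # Toy yields for two categories (signal at nominal cross section, background).
    print(expected_upper_limit([10.0, 3.0], [100.0, 1.0]))
```

A model point is expected to be excluded at 95% CL when the returned signal strength is below one, which corresponds to r UL < 1 in the notation of the text.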

Summary
Many models of new physics beyond the standard model predict the production of long-lived particles (LLPs) in proton-proton (pp) collisions at the LHC. Jets arising from the decay of LLPs (LLP jets) can be appreciably displaced from the pp collisions. A novel tagger to identify LLP jets is presented. The tagger employs a deep neural network (DNN) using an architecture inspired by the CMS DeepJet algorithm. Simplified models of split supersymmetry (SUSY), which yield neutralinos and LLP jets from the decay of long-lived gluinos, are used to train the DNN and demonstrate its performance.
The application of various techniques related to the tagger is reported. A custom labelling scheme for LLP jets based on generator-level information from Monte Carlo programs is defined. The proper decay length cτ 0 of the gluino is used as an external parameter to the DNN, which allows hypothesis testing over several orders of magnitude in cτ 0 with a single DNN. The DNN is trained using samples of both simulated events and pp collision data. The application of domain adaptation by backward propagation significantly improves the agreement of the DNN output for simulation and data, by an order of magnitude according to the Jensen-Shannon divergence, when compared to training the DNN with simulation only. The method is validated using signal-depleted control samples of pp collisions at a centre-of-mass energy of 13 TeV. The samples were recorded by the CMS experiment and correspond to an integrated luminosity of 35.9 fb −1 . Training the DNN with pp collision data does not significantly degrade the tagger performance. The tagger rejects 99.99% of light-flavour jets from standard model processes, as measured in an inclusive $\mathrm{t\bar{t}}$ sample, while retaining approximately 30-80% of LLP jets for split SUSY models with 1 mm ≤ cτ 0 ≤ 10 m and a gluino-neutralino mass difference of at least 200 GeV.
Finally, the potential performance of the tagger is demonstrated in the framework of a search for split SUSY in final states containing jets and significant missing transverse momentum. Simulated event samples provide the expected contributions from standard model background processes. Candidate signal events are categorised according to the scalar sum of jet momenta, the number of jets, and the number of tagged LLP jets. Expected lower limits on the gluino mass at 95% confidence level are determined with a binned likelihood fit as a function of cτ 0 in the range from 10 µm to 10 m. A procedure to constrain a correction to the LLP jet tagger efficiency in the likelihood fit is introduced. Competitive limits are demonstrated: models with a long-lived gluino of mass 2 TeV, a neutralino mass of 100 GeV, and a proper decay length in the range 1 mm ≤ cτ 0 ≤ 1 m are expected to be excluded by this search.

Acknowledgments
We congratulate our colleagues in the CERN accelerator departments for the excellent performance of the LHC and thank the technical and administrative staffs at CERN and at other CMS institutes for their contributions to the success of the CMS effort. In addition, we gratefully acknowledge the computing centers and personnel of the Worldwide LHC Computing Grid for delivering so effectively the computing infrastructure essential to our analyses.

[20] ATLAS Collaboration, "Search for long-lived neutral particles in pp collisions at √s = 13 TeV that decay into displaced hadronic jets in the ATLAS calorimeter", Eur. Phys. J. C 79 (2019) 481, doi:10.1140/epjc/s10052-019-6962-6, arXiv:1902.03094.
[31] LHCb Collaboration, "Updated search for long-lived particles decaying to jet pairs", Eur.