Simultaneous energy and mass calibration of large-radius jets with the ATLAS detector using a deep neural network

The energy and mass measurements of jets are crucial tasks for the Large Hadron Collider experiments. This paper presents a new calibration method to simultaneously calibrate these quantities for large-radius jets measured with the ATLAS detector using a deep neural network (DNN). To address the specificities of the calibration problem, special loss functions and training procedures are employed, and a complex network architecture, which includes feature annotation and residual connection layers, is used. The DNN-based calibration is compared to the standard numerical approach in an extensive series of tests. The DNN approach is found to perform significantly better in almost all of the tests and over most of the relevant kinematic phase space. In particular, it consistently improves the energy and mass resolutions, with a 30% better energy resolution obtained for transverse momenta $p_{\text{T}}>500$ GeV.


Introduction
Jets, the collimated sprays of hadrons resulting from a high-momentum quark or gluon emission, are abundantly produced during the operation of the Large Hadron Collider (LHC). They are relevant to almost all physics analyses of the collisions recorded by the LHC experiments, either because they are part of the final state of the studied signal or because they generate background for the process of interest. To achieve high-quality physics results, it is crucial to precisely measure the angular position and energy of the jets. This task is made difficult by multiple effects, including fluctuations in the parton showers or in the interactions between the particles forming the jets and the detectors [1]. Another important task is to measure the collimated hadronic decay of high-momentum heavy particles such as top quarks, $W$, $Z$ or Higgs bosons. These are reconstructed as jets with large angular opening (large-$R$ jets) and the measurement of their mass as well as their energy is therefore necessary.
To measure the jet energy and mass from raw detector signals, the ATLAS and CMS collaborations use calibration procedures comprising multiple steps [2,3]. In these procedures, the main energy and mass corrections are evaluated numerically from simulated collision events. The corrections are evaluated and applied in two distinct successive steps, in bins of a limited number (usually two or three) of characteristic variables. However, calibrating both the energy and the mass, which are highly correlated, while exploiting all shower variables and their correlations in a single correction step is possible through approaches based on deep neural networks (DNN). This paper presents a full implementation developed by the ATLAS Collaboration using 13 TeV simulated collision events, overcoming several difficulties that are specific to the calibration problem. This new calibration method aims at replacing the current simulation-based corrections of the energy and mass of large-$R$ jets. After a brief introduction to the ATLAS detector (Section 2), the paper explains why jet energy and mass calibrations are necessary, reviews the current procedure in ATLAS and motivates the new DNN approach (Section 3). The samples used and the selections applied to them are described in Section 4. Section 5 details the specific methodology needed to construct and properly train the network, while Section 6 discusses the validation and performance of the resulting DNN calibration, comparing it with the current numerical method. Finally, Section 7 identifies and discusses limitations in this approach as well as some possible solutions.

The ATLAS detector
The ATLAS detector [4] covers nearly the entire solid angle around the collision point.¹ It consists of an inner tracking detector surrounded by a thin superconducting solenoid, electromagnetic and hadron calorimeters, and a muon spectrometer incorporating three large superconducting air-core toroidal magnets.
The inner-detector system (ID) is immersed in a 2 T axial magnetic field and provides charged-particle tracking in the range of $|\eta| < 2.5$. The high-granularity silicon pixel detector covers the vertex region and typically provides four measurements per track, the first hit normally being in the insertable B-layer (IBL) [5,6] installed before Run 2 of the LHC. It is followed by the silicon microstrip tracker (SCT), which usually provides eight measurements per track. These silicon detectors are complemented by the transition radiation tracker (TRT), which enables radially extended track reconstruction up to $|\eta| = 2.0$. The TRT also provides electron identification information based on the fraction of hits (typically 30 in total) above a higher energy-deposit threshold corresponding to transition radiation.
¹ ATLAS uses a right-handed coordinate system with its origin at the nominal interaction point (IP) in the centre of the detector and the $z$-axis along the beam pipe. The $x$-axis points from the IP to the centre of the LHC ring, and the $y$-axis points upwards. Cylindrical coordinates $(r, \phi)$ are used in the transverse plane, $\phi$ being the azimuthal angle around the $z$-axis.
The calorimeter system covers the pseudorapidity range of $|\eta| < 4.9$. In the region $|\eta| < 3.2$, electromagnetic calorimetry is provided by barrel and endcap high-granularity lead/liquid-argon (LAr) calorimeters, with an additional thin LAr presampler covering $|\eta| < 1.8$ to correct for energy loss in material upstream of the calorimeters. Hadron calorimetry is provided by the steel/scintillator-tile calorimeter, segmented into three barrel structures with $|\eta| < 1.7$, and two copper/LAr hadron endcap calorimeters. The solid angle coverage is completed with forward copper/LAr and tungsten/LAr calorimeter modules optimised for electromagnetic and hadronic energy measurements respectively.
The angular and longitudinal segmentation of the calorimeters provides valuable 3-dimensional information about the location of the energy deposits. This allows the calculation of quantities useful for the characterization of the energy shower resulting from various types of particles or groups of particles such as hadronic jets.
An extensive software suite [7] is used in data simulation, in the reconstruction and analysis of real and simulated data, in detector operations, and in the trigger and data acquisition systems of the experiment.

Jets and their calibration with the ATLAS detector
After the parton emission, the complete hadronisation process results in a set of particles from which the particle jets are formed. These are also named true jets in the context of simulated events discussed in this paper. In practice the group of particles forming a true jet, its constituents, is obtained by applying a physically motivated procedure called a jet clustering algorithm to the set of kinematic 4-vectors of all the stable interacting particles of the collision (details are given in Section 4.1). The sum of the 4-vectors of a jet's constituents defines its momentum and thus the true jet energy and mass. These are the reference quantities for the jet calibration that is applied to experimental jets reconstructed with the ATLAS detector. Jet reconstruction is based on two types of primary signals: charged-particle tracks, which are reconstructed from measurements of hits in the inner detector [8,9], and topological clusters, which are reconstructed from energy deposits in the calorimeters [10]. Tracks and clusters can then be combined to form higher-level objects, each characterised by a kinematic 4-vector [11,12]. Together, these objects approximate the flow of hadronic particles and are used as the constituents of the reconstructed jets. The final jets are obtained by regrouping these constituents using the same jet algorithm as used for the true jets.
Due to various detector inefficiencies and limitations, the kinematics of the experimentally reconstructed jets differ significantly from the corresponding true jets. Precise and complex corrections are thus applied to the jet energy and mass before these quantities are used in physics analyses [2]. These corrections are separated into two steps. Firstly, a series of corrections is obtained from simulation that allows alignment between the reconstructed-level energies and masses and the generated-level ones; they ensure that the jet energy scale (JES) or the jet mass scale (JMS) of reconstructed jets is correct (a more precise definition is given below). They also compensate for the effect of multiple proton-proton interactions happening in the same bunch crossing and residual effects in the detector from collisions in nearby crossings (these effects are collectively referred to as pile-up). Secondly, other corrections obtained from real data, using known physical processes or well-reconstructed objects such as photons, allow corrections for residual differences (of the order of 2-3%) to be made between real and simulated reconstructed jets. The calibration method presented in this paper is an alternative simulation-based correction.
Simulation-based corrections are derived from studies of the individual response of the quantity of interest $E$, the jet energy (all the following equations hold identically for the mass $m$), defined by $r_E = E^{\text{reco}}/E^{\text{true}}$, where $E^{\text{true}}$ is calculated from the jets clustered from generated particles and $E^{\text{reco}}$ from the jets reconstructed after detector simulation (true and reco jets are more precisely defined in Section 4.2). Because of the stochastic nature of both the quantum chromodynamics (QCD) shower and its interactions with the detector, true jets that have the same parameters (such as energy, angular position, mass, ...) can generate distributions of possible $r_E$. Individual response distributions depend on these parameters and are usually unimodal: their most probable value, the mode $R_E = \text{mode}(\{r_E\})$, is called the average response, JES (or JMS in the equivalent case for jet mass), or simply response when the context is unambiguous. The mode, as opposed to the mean or the median, is chosen as the definition of such a central value to avoid ambiguities or biases that can arise when the distributions are asymmetric. The width of the distribution around the mode is used to define the resolution in $E$. In this paper, the chosen definition is the interquantile range (IQR) at 68%² divided by the mode $R_E$. Because of experimental limitations, the average response may differ from unity and the objective of the calibration is to find the relevant corrections such that $R_E = 1$ for all populations of true jets (which formalizes the idea that true and reco scales are on average identical), with the resolution $\sigma_E$ as small as possible.
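The definitions above can be illustrated with a minimal sketch on a toy sample. The helper names `response_mode` and `iqr68_resolution` are hypothetical, the mode is approximated by the most populated histogram bin, and the 68% IQR is assumed here to be the half-width between the 16% and 84% quantiles (so that it equals the standard deviation for a Gaussian); the exact conventions used in the paper may differ.

```python
import numpy as np

def response_mode(responses, bins=50):
    """Approximate the mode by the centre of the most populated histogram bin."""
    counts, edges = np.histogram(responses, bins=bins)
    i = int(np.argmax(counts))
    return 0.5 * (edges[i] + edges[i + 1])

def iqr68_resolution(responses, mode):
    """Half-width of the central 68% quantile interval, divided by the mode.

    Assumption: the interval runs from the 16% to the 84% quantile.
    """
    q16, q84 = np.quantile(responses, [0.16, 0.84])
    return 0.5 * (q84 - q16) / mode

# Toy sample (not ATLAS simulation): true energies in GeV and
# reconstructed energies with a Gaussian response centred at 0.9.
rng = np.random.default_rng(0)
e_true = rng.uniform(200.0, 3000.0, size=100_000)
e_reco = e_true * rng.normal(0.9, 0.1, size=e_true.size)

r_e = e_reco / e_true                 # individual responses r_E
R_e = response_mode(r_e)              # average response R_E (mode)
sigma_e = iqr68_resolution(r_e, R_e)  # relative resolution

print(f"R_E ~ {R_e:.3f}, resolution ~ {sigma_e:.3f}")
```

In this toy case the response is below unity, so a calibration would need to scale the reconstructed energy up by roughly $1/R_E$.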
The individual response distributions can be parameterised by a set of parameters describing the true jet (here taken to be the energy, pseudorapidity and mass) but also by inter-dependent parameters of the reconstructed jets such as the energy, pseudorapidity, mass, fractions of energy deposited in the various layers of the calorimeter, fraction of the momentum of the associated tracks, and others. Formally, it is then possible to model the distribution as a probability distribution function $p(r_E \mid \vec{x})$, where $\vec{x}$ is the list of input parameters, which can be separated into the true and reconstructed parameters, $\vec{x} = (\vec{x}^{\text{true}}, \vec{x}^{\text{reco}})$. The average response for true jets with parameters $\vec{x}^{\text{true}}$ is then given by
$$R_E(\vec{x}^{\text{true}}) = \text{mode}\big(p(r_E \mid \vec{x}^{\text{true}})\big). \qquad (1)$$
The aim of the calibration is then to find a function of the reconstructed parameters, $c(\vec{x}^{\text{reco}})$, defining the calibrated quantity $E^{\text{calib}} = c(\vec{x}^{\text{reco}})\,E^{\text{reco}}$ such that the most probable value becomes
$$\text{mode}\big(p(r_E \times c(\vec{x}^{\text{reco}}) \mid \vec{x}^{\text{true}})\big) = 1, \qquad (2)$$
where $p(r_E \times c(\vec{x}^{\text{reco}}))$ is the probability distribution function obtained from $p(r_E)$ by changing $r_E \to r_E^{\text{calib}} = r_E \times c(\vec{x}^{\text{reco}})$ and the mode is taken over all $\vec{x}^{\text{reco}}$.
In general this mathematical problem is difficult. It has an exact solution in simple cases where the response depends on only one parameter, the quantity $E$ itself, and when $R_E(E^{\text{true}})\,E^{\text{true}}$ is an affine function [13].
In this case the solution is to choose the calibration function, denoted here $c$, such that
$$c\big(R_E(E^{\text{true}})\,E^{\text{true}}\big) = \frac{1}{R_E(E^{\text{true}})}. \qquad (3)$$
This solution is called the numerical inversion (NI) because $R_E$ (and thus the pair $(R_E(E^{\text{true}})\,E^{\text{true}},\ 1/R_E(E^{\text{true}}))$) can be obtained numerically from the simulated data and because recovering $E^{\text{true}}$ from $R_E(E^{\text{true}})\,E^{\text{true}}$ represents an inversion of the true quantity.³ This method (referred to here as standard calibration) is employed by ATLAS to obtain sequentially a jet energy scale calibration and then a mass scale calibration. In practice, in the standard calibration, $R_E$ is evaluated as a function of $(E^{\text{true}}, \eta^{\text{true}})$ (and $m^{\text{true}}$ for mass calibration) by collecting individual responses in bins of these quantities. This function $R_E$ is then converted into a correction factor function depending only on reconstructed quantities ($E^{\text{reco}}$, $\eta^{\text{reco}}$ and $m^{\text{reco}}$) according to the NI principles described above and further detailed in [2,14]. This procedure gives satisfactory performance. The variables $E$, $\eta$ and $m$ are the principal dependencies of the JES and JMS, and the use of true quantities to form the response evaluation bins allows potential biases due to the energy or mass spectrum of the generated sample to be avoided when evaluating the average response.
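The numerical inversion principle can be sketched in one dimension as follows. This is only an illustration under stated assumptions: the function name `build_ni_correction` is hypothetical, the response is a toy function chosen so that $R_E(E^{\text{true}})\,E^{\text{true}}$ is affine, and linear interpolation stands in for whatever smoothing the standard calibration applies.

```python
import numpy as np

def build_ni_correction(e_true_points, response_points):
    """Return c(E_reco) built from R(E_true) sampled at a few true energies.

    Each true energy is mapped to its most probable reconstructed energy
    R(E_true) * E_true, and 1 / R(E_true) is tabulated at that point.
    Assumption: R(E_true) * E_true is strictly increasing, so the map
    can be inverted by interpolation.
    """
    e_true_points = np.asarray(e_true_points, dtype=float)
    response_points = np.asarray(response_points, dtype=float)
    e_reco_points = response_points * e_true_points
    if np.any(np.diff(e_reco_points) <= 0.0):
        raise ValueError("R(E_true) * E_true must be strictly increasing")

    def correction(e_reco):
        return np.interp(e_reco, e_reco_points, 1.0 / response_points)

    return correction

# Toy response (not from the paper): R * E_true = 0.9 * E_true - 10 GeV.
e_true_points = np.linspace(200.0, 3000.0, 50)
response_points = 0.9 - 10.0 / e_true_points
correction = build_ni_correction(e_true_points, response_points)

# A jet with E_true = 1000 GeV is most probably reconstructed at 890 GeV.
e_reco = 0.9 * 1000.0 - 10.0
e_calib = correction(e_reco) * e_reco
print(f"calibrated energy: {e_calib:.2f} GeV")
```

The calibrated energy comes out close to the 1000 GeV true value, which is the closure property that the inversion is designed to provide in the affine case.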
As mentioned above, there is much more information available to describe the reconstructed hadronic showers than just the energy, pseudorapidity and mass. Using these additional variables can potentially help obtain a more precise JES and JMS and improve their resolution, including for subsamples of jets such as quark-initiated or gluon-initiated jets. However, these variables can be correlated. Optimally exploiting all the correlations in the same way as performed in the standard calibration would require highly multidimensional bins, which is practically impossible. On the other hand, the task of numerically modelling a multidimensional function can be successfully performed with modern DNN techniques. Furthermore, a key problem of the calibration evaluation, estimating the mode of distributions, is solvable with these techniques; a solution is discussed in detail in Section 5.1. A proof of concept of using DNN for JES and JMS calibration, including an implementation of the NI procedure, has already been presented in [15] and DNN techniques are also employed to perform a particular step of the energy (but not the mass) calibration for small-R jets [16]. Building on this, a complete and operational solution and its performance are presented here. This new procedure extends several aspects of the DNN calibration but does not implement NI, which simplifies the method by avoiding the need to train two DNNs sequentially, one to model the response as a function of the true quantities, the second to model the calibration function as in Eq. (2). Instead, it directly evaluates the uncalibrated responses in terms of reconstructed quantities, which means that the network evaluates $R_E(\vec{x}^{\text{reco}}) = \text{mode}\big(p(r_E \mid \vec{x}^{\text{reco}})\big)$ instead of $\text{mode}\big(p(r_E \mid \vec{x}^{\text{true}})\big)$ as in Eq. (1). The calibration function is then chosen to be simply $1/R_E(\vec{x}^{\text{reco}})$. The price to pay for this simpler procedure is that the response $\text{mode}\big(p(r_E \mid \vec{x}^{\text{reco}})\big)$ a priori depends on the distribution of the simulated input samples and this dependency can bias the resulting calibration even in the simple cases where the standard calibration, which uses $\text{mode}\big(p(r_E \mid \vec{x}^{\text{true}})\big)$, would not be biased. However, because $\vec{x}^{\text{reco}}$ includes many variables relevant to the description of the hadronic shower, it can be expected that the response distributions at each point of the phase space are narrow enough so that the overall bias on the mode is negligible. Although this hypothesis is not testable in practice because of the high dimensionality, verifications are performed to ensure the input samples do not bias the performance (they are presented in Section 6.3).

Event simulation
The training and validation of the DNN are based on Monte Carlo simulations of proton-proton ($pp$) collisions at a centre-of-mass energy of $\sqrt{s} = 13$ TeV. All events were passed through the complete ATLAS detector simulation based on Geant4 [17,18]. Various physics processes were simulated as described in Table 1 to train and test the DNN on different jet topologies. Training is performed on the QCD dijet sample as this type of process is, by orders of magnitude, more probable at the LHC and enters as a background to many analyses. In a validation step, it is verified that the training generalises and produces a good calibration for other important physics processes such as those listed in Table 1. All samples (except the Sherpa one) were produced using Pythia 8.230 [19] with the NNPDF2.3lo [20] parton distribution function (PDF) set and the A14 set of tuned parameters [21]. The training sample containing jets originating from light quarks or gluons was obtained by generating dijets in slices of the jet transverse momentum, $p_{\text{T}}$, to sufficiently populate the kinematic region of interest (from around 200 GeV up to 6000 GeV). For the validation samples, physics processes modelling phenomena beyond the Standard Model are used to populate the high-$p_{\text{T}}$ region.
• The $W$/$Z$-boson sample was obtained by generating the $W' \to WZ \to q\bar{q}q\bar{q}$ process with $m_{W'} = 2$ TeV.
• The top-quark sample was obtained by generating the $Z' \to t\bar{t} \to W^{+}b\,W^{-}\bar{b} \to q\bar{q}b\,q\bar{q}\bar{b}$ process with $m_{Z'} = 4$ TeV.
• The sample of Higgs bosons, which displays a broad Higgs boson $p_{\text{T}}$ spectrum, was obtained by generating the production of Randall-Sundrum gravitons $G^{*}$ in a benchmark model with a warped extra dimension [22] over a range of graviton masses from 300 to 6000 GeV, where the graviton decays according to $G^{*} \to HH \to b\bar{b}b\bar{b}$.
The Sherpa dijet sample was generated using the Sherpa 2.2.5 [23] generator. The matrix element calculation was included for the 2 → 2 process at leading order, and the default Sherpa parton shower [24] based on Catani-Seymour dipole factorisation was used for the showering with $p_{\text{T}}$ ordering, using the CT14nnlo PDF set [25]. These samples used the dedicated Sherpa AHADIC model for hadronisation [26], based on cluster fragmentation ideas.
Finally, the effect of multiple interactions in the same and neighbouring bunch crossings (pile-up) was modelled by overlaying the simulated hard-scattering event (event of interest) with inelastic $pp$ events generated with Pythia 8.186 [27] using the NNPDF2.3lo set of PDFs [20] and the A3 set of tuned parameters [28]. The number of overlaid events is representative of the amount of pile-up present in the 2017-2018 ATLAS $pp$ data sample.

Jet reconstruction and selection
In each event, the flow of hadronic particles is represented by a set of 4-vectors built by combining momentum and energy measurements from the inner tracker and calorimeters. In this paper, the combination technique producing unified flow objects (UFO) [29] is used. These input constituents are clustered into jets with the anti-$k_t$ jet algorithm with angular radius $R = 1.0$ [30] using the FastJet software package [31]. The jets then undergo a grooming procedure aimed at suppressing soft radiation from pile-up and the underlying event; the soft-drop procedure [32] is employed with parameters $\beta = 1$ and $z_{\text{cut}} = 0.1$.
Stable particles, with a lifetime $c\tau > 10$ mm, produced by the generators (with the exception of neutrinos and muons) are also clustered into jets using the same algorithm to obtain true jets.
To obtain training samples that are not biased by experimental effects such as pile-up, noise or misidentifications, some requirements are imposed on the jets. Reconstructed jets are selected if they are matched to a true jet and isolated as described in Ref. [2]. A reconstructed and a true jet are matched if they are within an angular distance $\Delta R = 0.3$. Two isolation requirements are then applied; reconstructed jets are rejected if a true jet other than the matched one is found within $\Delta R = 2.5$ or if another reconstructed jet is found within $\Delta R = 1.5$. Finally, both the reconstructed and true jets are required to have $p_{\text{T}} > 50$ GeV and, in an event, only the two highest-$p_{\text{T}}$ jets satisfying these criteria are considered. The selection results in about 270 million jets being available for the DNN training sample.
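The matching and isolation requirements can be sketched for a single event as below. This is a simplified illustration, not the ATLAS implementation: the function names and the `(pt, eta, phi)` array layout are hypothetical, matching is assumed to use the closest true jet, and no transverse-momentum threshold is applied to the neighbouring jets considered for isolation, which the procedure of Ref. [2] may treat differently.

```python
import numpy as np

def delta_r(eta1, phi1, eta2, phi2):
    """Angular distance, with the azimuthal difference wrapped to [-pi, pi)."""
    dphi = np.mod(phi1 - phi2 + np.pi, 2.0 * np.pi) - np.pi
    return np.hypot(eta1 - eta2, dphi)

def select_jets(reco, true, match_dr=0.3, true_iso_dr=2.5,
                reco_iso_dr=1.5, pt_min=50.0):
    """Return up to two (reco index, true index) pairs passing the selection.

    reco, true: arrays of shape (n_jets, 3) with columns (pt, eta, phi).
    """
    selected = []
    if len(true) == 0:
        return selected
    for i, (pt_r, eta_r, phi_r) in enumerate(reco):
        dr_true = delta_r(eta_r, phi_r, true[:, 1], true[:, 2])
        j = int(np.argmin(dr_true))            # assumed: closest true jet
        if dr_true[j] >= match_dr:
            continue                           # no matched true jet
        if np.any(np.delete(dr_true, j) < true_iso_dr):
            continue                           # another true jet nearby
        dr_reco = delta_r(eta_r, phi_r, reco[:, 1], reco[:, 2])
        if np.any(np.delete(dr_reco, i) < reco_iso_dr):
            continue                           # another reco jet nearby
        if pt_r <= pt_min or true[j, 0] <= pt_min:
            continue
        selected.append((i, j))
    # Keep only the two highest-pT reconstructed jets.
    selected.sort(key=lambda pair: reco[pair[0], 0], reverse=True)
    return selected[:2]

# Toy event: two back-to-back jets, then the same event with an extra
# reconstructed jet close to the leading one.
true_jets = np.array([[520.0, 0.12, 0.02], [495.0, -0.18, 3.12]])
reco_jets = np.array([[500.0, 0.10, 0.00], [480.0, -0.20, 3.10]])
clean = select_jets(reco_jets, true_jets)

crowded_reco = np.vstack([reco_jets, [60.0, 0.50, 0.50]])
crowded = select_jets(crowded_reco, true_jets)
print(clean, crowded)
```

In the second toy event the leading jet fails the reconstructed-jet isolation and the extra jet has no matched true jet, so only the sub-leading jet survives.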

Methodology of the deep neural network calibration
The simultaneous calibration of the jet energy and mass is performed using the calculations of a single DNN taking as inputs reconstructed jet-level and event-level quantities. The DNN is set up to output two numbers corresponding to the energy and mass responses. However, as mentioned in the previous section, it is not possible to directly predict the exact individual $E$ (or $m$) response for a given reconstructed jet, as each true jet corresponds to a distribution of possible responses. The DNN predictions are thus interpreted as the most probable values of the possible responses for the input jets (that is, the modes of their distribution). Following Eq. (3), the DNN is trained so that
$$R_E^{\text{DNN}}(\vec{x}^{\text{reco}}) = \text{mode}\big(p(r_E \mid \vec{x}^{\text{reco}})\big).$$
The calibrated energy (or mass) is then defined as
$$E^{\text{calib}} = \frac{E^{\text{reco}}}{R_E^{\text{DNN}}(\vec{x}^{\text{reco}})}.$$
Thus, the DNN output predictions are
$$\vec{y}^{\text{pred}} = (R_E^{\text{DNN}}, R_m^{\text{DNN}}).$$
During the training, the target quantities corresponding to the inputs $\vec{x}^{\text{reco}}$ are the individual responses discussed in Section 3:
$$\vec{y}^{\text{true}} = (r_E, r_m) = \left(\frac{E^{\text{reco}}}{E^{\text{true}}}, \frac{m^{\text{reco}}}{m^{\text{true}}}\right).$$
These are not directly the desired average responses $R_E$ and $R_m$, and the necessity to predict distribution modes (and not means or medians) requires a careful choice of the loss function. Besides these mathematical concerns, experimental difficulties related to the detector geometry, and especially the geometry of the calorimeters, strongly interfere with the calibration predictions for certain angular regions. This is accounted for by a dedicated processing of the jet angular position.
Finally, the mass calibration alone has specific difficulties. Firstly, the mass response distributions are generally wider and asymmetric, making it harder for the DNN to predict their mode. Secondly, large-$R$ jets are mainly used to reconstruct heavy particles whose decays are collimated (top quarks, $W$, $Z$, or Higgs bosons) and for which the mass measurement is crucial. The DNN performance therefore needs to be good for these jets, even if their topology differs from standard QCD jets. This has motivated the use of a specific architecture (including a residual connection-like structure) and training procedure (with a dedicated mass-only step).

Loss function
The loss function $\mathcal{L}(\vec{y}^{\text{true}}, \vec{y}^{\text{pred}})$ is a crucial part of the DNN training as it defines what mathematical quantity the DNN will learn to predict. This is particularly true in the case presented here where the targets ($\vec{y}^{\text{true}}$) are realizations of a stochastic distribution. In this case, it is proven that minimising the usual mean square error loss ($\mathcal{L} = (\vec{y}^{\text{true}} - \vec{y}^{\text{pred}})^2$) would have $\vec{y}^{\text{pred}}$ converge to the mean of the distribution, while minimising the mean absolute error ($\mathcal{L} = |\vec{y}^{\text{true}} - \vec{y}^{\text{pred}}|$) would imply a convergence to the median [33]. Converging to the distribution mode, as is necessary for the jet calibration, would require taking a Dirac delta distribution as the loss, $\mathcal{L} = -\delta(\vec{y}^{\text{true}} - \vec{y}^{\text{pred}})$. This is not possible in practice and an approximate function is therefore necessary. One possibility is the Gaussian kernel as suggested in Ref. [33]; another solution is inspired by mixture-density-network (MDN) techniques and is described below. Numerical tests show the MDN losses are less biased than the Gaussian kernel losses. They are also found to converge faster and allow more flexibility in the training procedure.
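The link between the loss and the predicted statistic can be checked numerically with a small sketch. The toy targets follow a log-normal distribution (an arbitrary asymmetric choice, not a jet response), the "network" is reduced to a single constant prediction scanned over a grid, and the kernel width `h` is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
# Asymmetric toy targets: mode ~ 0.78, median ~ 1.00, mean ~ 1.13.
targets = rng.lognormal(mean=0.0, sigma=0.5, size=20_000)
candidates = np.linspace(0.2, 3.0, 561)  # constant predictions, step 0.005

def mse(y, c):
    return (y - c) ** 2

def mae(y, c):
    return np.abs(y - c)

def gaussian_kernel(y, c, h=0.1):
    # Smooth stand-in for -delta(y - c); h is an assumed kernel width.
    return -np.exp(-0.5 * ((y - c) / h) ** 2)

def best_constant(loss):
    """Constant prediction minimising the average loss over the sample."""
    averages = [loss(targets, c).mean() for c in candidates]
    return candidates[int(np.argmin(averages))]

best_mse = best_constant(mse)               # close to the sample mean
best_mae = best_constant(mae)               # close to the sample median
best_kernel = best_constant(gaussian_kernel)  # close to the mode

print(f"MSE -> {best_mse:.3f}, MAE -> {best_mae:.3f}, "
      f"kernel -> {best_kernel:.3f}")
```

The three minima are clearly separated for an asymmetric distribution, which is why the choice of loss matters when the mode is the quantity of interest.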
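First, as the DNN produces two outputs $\vec{y}^{\text{pred}} = (R_E^{\text{DNN}}, R_m^{\text{DNN}})$, two independent loss functions are considered, the sum of them being the final minimised quantity:
$$\mathcal{L} = \mathcal{L}_E + \mathcal{L}_m.$$
The MDN loss is constructed from the assumption that the response distribution of the DNN targets is a Gaussian distribution with central value $\mu$ and width $\sigma$.
With the parameter of interest being $\mu$ (the mode of the distribution), an estimator $\hat{\mu}$ (and $\hat{\sigma}$) can be obtained from the maximum of the likelihood (LH) over a given sample of $N$ targets $y_i^{\text{true}}$:
$$\text{LH}(\mu, \sigma) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y_i^{\text{true}} - \mu)^2}{2\sigma^2}\right). \qquad (4)$$
As a consequence, a DNN outputting two values $\vec{y}^{\text{pred}} = (\mu^{\text{pred}}, \sigma^{\text{pred}})$ and trained with the loss
$$\mathcal{L}_{\text{MDN}} = \ln \sigma^{\text{pred}} + \frac{(y^{\text{true}} - \mu^{\text{pred}})^2}{2\,(\sigma^{\text{pred}})^2} \qquad (5)$$
would minimise the negative logarithm of Eq. (4), up to an additive constant, and thus predict the estimators, that is $(\mu^{\text{pred}}, \sigma^{\text{pred}}) = (\hat{\mu}, \hat{\sigma})$. Using this MDN loss, the DNN has to predict the width of this distribution ($\sigma^{\text{pred}}$) in addition to predicting the mode of the response distribution. For an energy and mass calibration DNN:
$$\vec{y}^{\text{pred}} = (R_E^{\text{DNN}}, \sigma_E^{\text{pred}}, R_m^{\text{DNN}}, \sigma_m^{\text{pred}}),$$
where only the predicted modes are eventually used in the final calibration procedures.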
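A minimal sketch of the Gaussian MDN loss follows, written as a per-jet negative log-likelihood with the constant term dropped. The function name `mdn_loss` and the grid scan are illustrative assumptions: the scan replaces the network by a single constant pair (mode, width) and checks that the loss minimum sits at the maximum-likelihood estimators.

```python
import numpy as np

def mdn_loss(y_true, mu_pred, sigma_pred):
    """Per-jet Gaussian negative log-likelihood, without the constant term."""
    return np.log(sigma_pred) + 0.5 * ((y_true - mu_pred) / sigma_pred) ** 2

# Toy Gaussian targets standing in for individual responses.
rng = np.random.default_rng(2)
targets = rng.normal(0.95, 0.08, size=5000)

# Scan constant predictions (mu, sigma) on a grid with step 0.005.
mu_grid = np.linspace(0.80, 1.10, 61)
sigma_grid = np.linspace(0.04, 0.16, 25)
mean_loss = np.array([[mdn_loss(targets, mu, s).mean() for s in sigma_grid]
                      for mu in mu_grid])
i_mu, i_sigma = np.unravel_index(np.argmin(mean_loss), mean_loss.shape)
mu_hat, sigma_hat = mu_grid[i_mu], sigma_grid[i_sigma]

print(f"loss minimum at mu = {mu_hat:.3f}, sigma = {sigma_hat:.3f}")
print(f"sample mean = {targets.mean():.3f}, sample std = {targets.std():.3f}")
```

For a Gaussian sample the minimum coincides, within the grid step, with the sample mean and standard deviation, which for this symmetric case are also the mode and width of the distribution.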
In practice, the energy or mass response distributions are not perfect Gaussian distributions, especially for jets with low $p_{\text{T}}$ and high mass or with high $|\eta|$. To take this into account, the assumed underlying distribution can be replaced by an asymmetric Gaussian distribution or a truncated Gaussian distribution. The asymmetric Gaussian uses a different width on each side of the mode:
$$p_{\text{asym}}(y) \sim \begin{cases} \exp\left(-\dfrac{(y-\mu)^2}{2\sigma_1^2}\right), & y < \mu, \\ \exp\left(-\dfrac{(y-\mu)^2}{2\sigma_2^2}\right), & y \geq \mu. \end{cases}$$
Although not physically motivated, these losses help with the numerical convergence of the network. In particular, the asymmetric MDN loss (MDNA) helps to avoid biases from asymmetric shapes in the response distributions, but it requires the DNN to output three values per prediction $(\mu, \sigma_1, \sigma_2)$. The use of a truncated distribution can apply to both the MDN and MDNA cases. In practice, truncation consists of excluding from the loss evaluation the jets for which $|y^{\text{true}} - \mu^{\text{pred}}|$ exceeds the truncation threshold, and adapting Eq. (5) with the relevant distribution normalisation factor. The asymmetric loss helps to converge faster in regions of phase space where the distributions are asymmetric, and the truncation is used at the end of the training to improve the accuracy of the calibration. Details of the choice and tuning of the parameters of the loss function are further discussed in Section 5.4.
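The two variants can be sketched as follows. Both functions are hypothetical illustrations: `sigma1` is assumed to be the width below the mode and `sigma2` the width above it, and the truncation threshold is assumed to be expressed in units of the predicted width (`n_sigma`), which is a guess about the convention rather than something stated in the text.

```python
import numpy as np
from math import erf, sqrt

def mdna_loss(y_true, mu, sigma1, sigma2):
    """Negative log-likelihood of an asymmetric Gaussian, constant dropped.

    The normalisation is proportional to (sigma1 + sigma2); using
    0.5 * (sigma1 + sigma2) makes the loss reduce to the symmetric
    Gaussian case when both widths are equal.
    """
    sigma = np.where(y_true < mu, sigma1, sigma2)
    return np.log(0.5 * (sigma1 + sigma2)) + 0.5 * ((y_true - mu) / sigma) ** 2

def truncated_mdn_loss(y_true, mu, sigma, n_sigma):
    """Gaussian loss restricted to |y - mu| <= n_sigma * sigma.

    Jets outside the window get zero loss; the log(erf) term is the
    normalisation of the truncated Gaussian (a constant for fixed n_sigma).
    Returns the per-jet loss and the boolean mask of jets kept.
    """
    inside = np.abs(y_true - mu) <= n_sigma * sigma
    norm = erf(n_sigma / sqrt(2.0))  # Gaussian fraction inside the window
    loss = np.log(sigma) + 0.5 * ((y_true - mu) / sigma) ** 2 + np.log(norm)
    return np.where(inside, loss, 0.0), inside

y = np.array([0.8, 1.0, 1.05, 1.5])
symmetric = mdna_loss(y, 1.0, 0.1, 0.1)
skewed = mdna_loss(y, 1.0, 0.1, 0.2)
trunc_loss, kept = truncated_mdn_loss(y, 1.0, 0.1, 1.0)
print(symmetric, skewed, trunc_loss, kept)
```

With a larger upper width, targets above the mode are penalised less than targets the same distance below it, while the truncation removes the far tail (here the jet at 1.5) from the loss altogether.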

Input features
To obtain precise predictions, it is necessary to give the DNN a description of the jet and the event that is as complete as possible. A list of 21 jet or event variables is chosen as the input to the DNN. These features are chosen for their relevance in calibration or jet identification tasks as demonstrated in previous studies [2,34]. They can be divided into four categories: the jet kinematics, which are the main dependencies of the energy and mass responses;⁴ the jet substructure variables, which describe the particle-shower topology used to classify jets; the detector variables, which quantify the relative charged-particle contribution and the development of the calorimeter showers; and finally, the event variables that describe the jet environment. Table 2 presents these input variables and their definitions. To have homogeneous inputs for the DNN, each input is normalised so that its distribution falls within a common range.
In addition, the DNN has difficulties learning the sharp variations in the energy and mass responses along $\eta$ that are due to the complex ATLAS detector geometry and in particular to the poorly instrumented regions between the central and forward calorimeters, between $|\eta| = 0.8$ and $|\eta| = 1.5$.

Architecture
The DNN architecture is mainly based on two specific parts, the $\eta$ annotation (described in Section 5.2) and a fork between the energy and mass outputs. Both are designed to address the main difficulties of the jet calibration: the $\eta$ dependency of the responses and the important differences between the energy and mass response distributions. Building on this base, the architecture was iteratively improved and tested to obtain a structure giving good performance. Consequently, the DNN has a relatively complex structure as represented in Figure 2, where:
• Dense (N) blocks represent the matrix, bias and activation function resulting in a hidden layer with N neurons.
• The Concatenate block is a static layer (without trainable parameters) that concatenates its two input layers.
• The Multiply block is a static layer that performs an element-wise multiplication of its input layers.
• The final Output layers give the predictions for the energy and mass responses. They may consist of two or three neurons depending on the loss function (as discussed in Section 5.1).
The DNN contains approximately 764 000 trainable parameters and is structured in different parts as indicated in Figure 2:
• Input processing: contains the input layer with the 21 input features and the $\eta$ annotation described above.
• Core: contains several hidden layers (210 neurons each) common to both the energy and mass predictions.
• Early fork: early splitting between the energy and mass predictions to allow for a strong specialisation of the DNN in each calibration, both forks containing the same hidden layers. Having independent weights also allows some of them to be frozen during the training procedure.
• Multiplicative residual connection: loosely inspired by Ref. [40], with the addition replaced by a multiplication to increase the class of functions that can be modelled by the network. These layers link the input layer directly to the mass output, with the intention of making the DNN learn which inputs are the most important for the mass calibration and thus helping to improve the mass predictions.⁵
• Outputs: the DNN has two output predictions, one for the energy calibration and one for the mass calibration, $\vec{y}^{\text{pred}} = (R_E^{\text{DNN}}, R_m^{\text{DNN}})$, along with subsidiary predictions $(\sigma_E^{\text{pred}}, \sigma_m^{\text{pred}})$.
The activation function for each hidden layer is chosen to be the MISH function [41] and is given by:
$$f(x) = x \tanh\big(\ln(1 + e^{x})\big).$$
The only exception is the activation function of the last hidden layer of the residual connection, which is the softmax function: its outputs are restricted to [0, 1] and are used as multiplicative weights.
The last activation function, applied to the output layers, is chosen to restrict the predictions to values between 0 and 2, which is the expected range of the energy and mass central responses.
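The activation functions and the idea of a multiplicative residual connection can be sketched with plain NumPy. This is not the network of Figure 2: `mass_branch`, the parameter names, the layer sizes of the toy example and the wiring (softmax weights multiplied element-wise with the inputs, then concatenated with a hidden representation) are hypothetical, and `1 + tanh(x)` is only one simple function with range (0, 2), assumed here because the exact output activation is not reproduced in the text.

```python
import numpy as np

def mish(x):
    """Mish activation: x * tanh(softplus(x)), with a stable softplus."""
    return x * np.tanh(np.logaddexp(0.0, x))

def softmax(x):
    """Softmax over the last axis; outputs lie in [0, 1] and sum to 1."""
    shifted = x - x.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def output_activation(x):
    # Assumption: one possible function mapping onto (0, 2).
    return 1.0 + np.tanh(x)

def dense(x, w, b, activation):
    """Dense block: matrix, bias and activation function."""
    return activation(x @ w + b)

def mass_branch(inputs, hidden, params):
    """Hypothetical mass head with a multiplicative residual connection."""
    gate = dense(inputs, params["w_gate"], params["b_gate"], softmax)
    gated_inputs = inputs * gate                               # Multiply block
    merged = np.concatenate([hidden, gated_inputs], axis=-1)   # Concatenate block
    outputs = dense(merged, params["w_out"], params["b_out"], output_activation)
    return gate, outputs

# Toy forward pass: 4 jets, 21 input features, 210 hidden neurons,
# 2 outputs (mode and width) as for the symmetric MDN loss.
rng = np.random.default_rng(3)
inputs = rng.normal(size=(4, 21))
hidden = mish(rng.normal(size=(4, 210)))
params = {
    "w_gate": 0.1 * rng.normal(size=(21, 21)), "b_gate": np.zeros(21),
    "w_out": 0.1 * rng.normal(size=(231, 2)), "b_out": np.zeros(2),
}
gate, outputs = mass_branch(inputs, hidden, params)
print(outputs)
```

The softmax weights act as a learned per-input importance, which matches the stated intention of letting the network emphasise the inputs most relevant to the mass calibration.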

Training procedure
Establishing the training strategy cannot be left to a simple hyperparameter scan, firstly because the energy and mass responses vary significantly across a phase space that is not uniformly populated and, secondly, because converging to a loss minimum does not guarantee that the physics performance goals are satisfied for the full phase space. Furthermore, these performance goals (described in Section 6) are costly to evaluate and cannot be tested during the training procedure. As a consequence, the strategy presented here was obtained through a series of attempts and refinements that gradually improved the performance. This training strategy is described in Table 3: it totals 107 epochs and is divided into three parts.
In a first initialisation step, the DNN is trained with small batch sizes and an MDNA loss with no or small truncation of the loss. This aims at quickly converging to a rough estimate of the mode of the response distributions thanks to the small batch size (i.e., frequent weight updates) and a few epochs, while adapting to possible asymmetric distributions (e.g., at high $|\eta|$) with the MDNA loss.
The second step is designed to obtain very precise predictions by taking advantage of large batch sizes and a progressive narrowing of the loss truncation. This narrowing allows a gradual focus on the centre of the response distributions, improving the predictions. The larger batch sizes improve the calculation speed and thus the training time.
The third step is a fine-tuning step with an exclusive mass training. The weights involved in the energy prediction⁶ are frozen to their values at the end of the second step and only those involved in the mass prediction⁷ are free to be tuned. This step is necessary in order to have predictions for the mass that are as good as those for the energy, for which the first two steps are enough. For this last step, an MDN loss with a large truncation at 1.0 is used in order to focus more on the core of the mass response distribution. A symmetric loss at this stage is observed to converge better, possibly because it results in a simpler network that is easier to optimize on the central and more symmetric parts of the response distributions.
During the training, two different optimizers are used: Rectified Adam [42] in the initialisation step and diffGrad [43] in the common training and exclusive mass training. The learning rate is decreased in the later steps of the training, following the improvements in the predictions.

Performance and validation of the calibration
The evaluation of the DNN performance and the validation of the predicted energy and mass calibrations are presented in this section. Unless mentioned otherwise, these are evaluated on the dijet sample, hence for standard quark and gluon jets.

Technical performance
For this study, the validation data sample is a subset of the dijet data sample that is set aside at the beginning of the training (about ten million jets corresponding to ∼4% of the training data sample). Figure 3 shows the evolution of the loss with the number of epochs (i.e., the number of jets processed). The training and validation losses (obtained for the energy, the mass and their total) are shown as solid and dotted lines, respectively. The different losses display very good agreement, showing that the DNN does not overfit the training data sample, as expected given the size of the training data sample; overfitting would take many more epochs. It also shows that the loss does not evolve significantly after two thirds of the training. For a more common DNN set-up, it could be wise to stop the training earlier, as it seems, by that metric alone, that its performance is no longer improving. However, it is observed that the predicted calibrations continue to improve even if the loss does not, which makes it difficult to find an optimum training procedure.
The consistency of the training is also verified by training DNNs that differ only by weight initialisation and ensuring that their losses and calibration performance are identical.

Calibration performance
The performance of the energy and mass calibrations is quantified using the response and the resolution of the calibrated distributions, as defined in Section 3.
More precisely, the individual energy (or mass) calibrated response of a jet is obtained by dividing its measured energy (or mass) response by the corresponding response predicted by the DNN. As discussed in Section 3, these individual values are realisations of a distribution from which the calibrated response can be obtained as its most probable value. In practice, the individual calibrated responses are collected in histograms for each bin of the true jet variables (typically $(p_{\text{T}}^{\text{true}}, m^{\text{true}}, \eta^{\text{true}})$). From these histograms, the calibrated response in each bin is extracted from a fit to a Gaussian distribution$^{8}$ and the calibrated response resolution from the calculation of the IQR divided by the calibrated response (not to be confused with the resolution predicted by the network and described in Sec. 5.1).
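The extraction of the response and resolution in one kinematic bin can be illustrated with a minimal sketch. It is not the analysis code: the function name `response_and_resolution` and the toy Gaussian sample are hypothetical, and the median is used as a simple stand-in for the central value that the text obtains from a Gaussian fit.

```python
import numpy as np

def response_and_resolution(calibrated_responses):
    """Return (central value, relative resolution) for one kinematic bin.

    Assumption: the median replaces the Gaussian-fit central value
    described in the text; the resolution is the IQR divided by it.
    """
    r = np.asarray(calibrated_responses, dtype=float)
    q25, q50, q75 = np.percentile(r, [25, 50, 75])
    central = q50
    return central, (q75 - q25) / central

# Toy bin: individual calibrated responses drawn from a Gaussian around 1.0.
rng = np.random.default_rng(0)
toy_bin = rng.normal(loc=1.0, scale=0.1, size=100_000)
central, resolution = response_and_resolution(toy_bin)
print(f"response = {central:.3f}, resolution = {resolution:.3f}")
```

For a Gaussian distribution the IQR is about 1.349 times the standard deviation, so this relative resolution is not directly comparable to a quoted Gaussian width.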
In the following, the performance is presented by plotting the response or resolution as a function of one of the bin-defining variables. Better performance is indicated by a response closer to 1.0 (called response closure) and a resolution closer to 0.0 in all of the phase space. All plots include statistical uncertainties computed from the fits of the response distributions but they are generally too small to be visible.

Performance on QCD dijets
As a closure test, the DNN calibration is applied to the QCD sample used during training. This allows verification that the DNN convergence translates into good physics performance in a sample with large statistics. Overall, the calibration is found to perform very well for $|\eta| < 2$, with the best results at high mass in terms of response closure and resolution. Figure 4 presents the energy and mass calibrated responses as a function of true jet kinematic variables, including $p_{\text{T}}^{\text{true}}$, and the energy and mass resolution as a function of $p_{\text{T}}^{\text{true}}$ for specific bins. It shows excellent response closure for both energy and mass, closer to unity than the standard calibration [2]. The DNN calibration also shows a substantial improvement in the resolution relative to the standard calibration. For the mass, the uncalibrated resolution appears better than the calibrated ones. However, this is a relative quantity (as defined in Section 3) while the absolute value of the IQR scales with the central value. As a consequence, comparing the uncalibrated resolution corresponding to responses significantly differing from 1.0 with a calibrated resolution is not very meaningful.

Effect of 𝜼 annotation
Section 5.2 describes the necessity to add the $\eta$ annotation to help the DNN adapt to the complex detector geometry. The assumption is that the extra information represented by the angular encoding, together with the related weights, allows the training process to specialise the predictions according to the detector region. As a validation step, Figure 6 shows a comparison of the jet energy and mass response closure between two DNN setups with the same training procedure and the same architecture except for the $\eta$ annotation step. Its inclusion clearly improves the DNN calibrated response closure in almost the complete $\eta$ range and especially in the calorimeter transition region above $|\eta| = 1.2$. The JMS improvement is smaller, as expected since the variations of the uncalibrated response (square markers) are smaller for the mass than for the energy. These improvements are present in every $p_{\text{T}}$ and $m$ bin, demonstrating the importance of the $\eta$ annotation step.

Performance on other topologies
Physics analyses mainly use large-$R$ jets to reconstruct the decays of massive and boosted particles like top quarks, $W$, $Z$ or Higgs bosons. These decays result in jets with different shower topologies than those originating from the dijet samples with which the DNN is trained; this can have an impact on the mass response, as demonstrated in Ref. [16] where significant differences between jets of different origin were observed. It is hence very important to verify that the DNN predictions generalise to other such topologies, since the same calibration is applied to all jets in data, whose origin is unknown. For this, simulated samples of processes with heavy hadronically decaying particles, as described in Section 4.1, are considered. For each sample, the same jet selection and DNN calibration as for QCD jets are applied and the resulting scale and resolution are evaluated in mass bins centred on the mass of the corresponding massive particle. The population of these massive, high-momentum jets is then expected to be entirely dominated by the hadronic decays of the heavy particles considered so that the effect of the topology can be assessed.
Figure 7 presents the jet mass response distributions obtained with boosted massive jets ($W$/$Z$, Higgs boson and top-quark decays) for different calibrations (uncalibrated, standard calibration and DNN calibration). It shows the benefits of the DNN calibration compared with the standard calibration or the absence of a calibration; the distributions are narrower with less asymmetric tails and are better centred on 1.0.
Figure 8 shows the jet energy and mass average responses as a function of the true $p_{\text{T}}$ for the same samples and calibrations. As expected, large differences between the different topologies are observed before calibration, especially in the mass response. While both the standard and the DNN calibrations reduce these response differences, the DNN calibration outperforms the standard one as it gives better closure for each topology, showing that the DNN predictions generalise well to more complex topologies. A small non-closure in the energy response is however observed for all the boosted topologies at low $p_{\text{T}}^{\text{true}}$ for the DNN calibration. For the $W$/$Z$ and Higgs boson jets, this small under-performance can be explained by the internal structure of the boosted massive jets; at high momentum, $W$/$Z$ and Higgs boson large-$R$ jets are usually composed of two sub-jets with an angular separation of $\Delta R \sim 2m/p_{\text{T}}$. At low $p_{\text{T}}$, their angular separation becomes comparable to the large-$R$ jet radius of 1.0, and the structure becomes neither 1-prong nor 2-prong. Such a structure is possibly rare in the dijet training sample or not captured by the input substructure variables, making it impossible for the DNN to learn to correct for this effect. For top-quark jets, a similar effect could be at play when the 3-prong decay $t \to Wb \to q\bar{q}'b$ is not fully contained within the jet radius, thus leading to atypical jets. To solve this difficulty, a dedicated additional network structure and training can be envisaged and is discussed in Section 7.
Figure 9 presents the jet energy and mass resolution for the boosted topologies. Here as well, the DNN calibration shows improved performance relative to the standard calibration. The resolution differences between the topologies here can be explained by the increasing mass bins in which the responses are evaluated; the resolution usually gets better at higher masses.

Validation of the DNN calibration
An extensive series of tests is performed to ensure the good physical behaviour of the DNN calibration. As described above, both the standard and DNN calibrations are tested and compared.

Spectrum dependency:
The DNN is trained and validated on a specific set of jets produced through a particular Monte Carlo process with a large range of energy. This set of jets leads to specific distributions of the input variables. The DNN calibration performance should be independent of the distribution to which it is applied, so that it can be used in different physics analyses with different event selections resulting in different input distributions. Since the calibration is already assessed as a function of true $p_{\text{T}}$, the impact of the distribution of this variable (or of the energy) is not expected to be strong. However, other input variables are correlated with the $p_{\text{T}}$ and are not binned in the performance tests; their distribution shapes may thus impact the calibration. To verify this, the performance is re-evaluated by changing the response distributions using statistical weights when constructing the response histograms. The weights are chosen to obtain a flat jet energy distribution, entirely different from the original one used in the training, as illustrated in Figure 10(a).$^{9}$ The performance is then evaluated in large $p_{\text{T}}$ bins similar to those that physics analyses may use and inside which distribution differences may have significant effects. Figure 10 (b, c) shows the jet energy and mass responses obtained with the DNN calibration with and without re-weighting. The differences between the two calibrated response closures are very small, showing that the DNN calibration does not depend significantly on the input jet distribution.

Pile-up dependency:
The impact of the pile-up conditions is tested and evaluated by considering the dependency of the response on two quantifying variables: the number of reconstructed primary vertices in an event, NPV, and the average number of collisions per bunch crossing, $\mu$. For this, the energy (or mass) calibrated response is estimated in bins of the pile-up variable (NPV or $\mu$) and of the true jet variables. A linear fit to the calibrated response against the pile-up variable is then performed for each bin of the true jet variables and the corresponding slope, i.e. the gradient of the response relative to the pile-up variable, is extracted. Figure 11 presents the gradient relative to NPV of the energy and mass calibrated response modes as a function of $|\eta^{\text{true}}|$. For both the energy and the mass, the DNN calibration shows smaller gradients than the standard calibration. Similar results are obtained for $\mu$.
This shows that the grooming procedure performed during the jet reconstruction is not enough to suppress pile-up effects, while the DNN is able to do so by including NPV and $\mu$ as input variables.
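The gradient extraction used in the pile-up test amounts to a straight-line fit of the calibrated response against the pile-up variable in each bin of the true jet variables. The following is a minimal sketch under that reading; the function name `pileup_gradient` and the toy numbers are hypothetical and do not come from the analysis.

```python
import numpy as np

def pileup_gradient(pileup_values, calibrated_response):
    """Slope of a linear fit of the calibrated response versus a pile-up variable."""
    # np.polyfit returns the coefficients from highest to lowest degree.
    slope, _intercept = np.polyfit(pileup_values, calibrated_response, deg=1)
    return slope

# Toy example for one bin of the true jet variables:
# the response drifts by 2e-4 per additional primary vertex.
npv_centres = np.arange(5.0, 45.0, 5.0)
toy_response = 1.0 + 2e-4 * (npv_centres - 20.0)
gradient = pileup_gradient(npv_centres, toy_response)
print(f"gradient relative to NPV = {gradient:.1e}")
```

Repeating the fit in every bin and plotting the slopes against the bin variable gives the kind of summary shown in Figure 11.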
Jet flavour dependence:
As discussed above, the DNN performance is assessed with different topologies (QCD, $W$/$Z$, top-quark, or Higgs boson jets). It is also important to understand the dependency of the DNN calibrations on the actual parton that initiated the jet, also known as its flavour (light-quark ($u$, $d$, $s$, $c$), $b$-quark or gluon jets), as different responses for different flavours will translate into systematic uncertainties through the uncertainty in the flavour composition of the jets. Figure 12 presents the energy and mass calibrated responses as a function of the jet energy for different jet flavours, comparing the DNN calibration to the standard calibration. The jet flavour is defined here as the flavour of the parton with the highest energy in the parton shower. These show that the DNN predicts better calibrations for the different partons, with smaller inter-flavour differences relative to the standard calibration, especially at low $p_{\text{T}}$.

Dependency on the generator:
The DNN calibration will be applied to all simulated samples used in ATLAS analyses, which are produced with a variety of generators, as well as to jets in real data events. It is therefore desirable to evaluate whether the performance of the network is strongly affected by the modelling of jets within Pythia 8. Generators have specific ways of evaluating the parton shower and hadronisation, resulting in differences in the predicted particle flow and simulated detector showers. Since such differences can be representative of those between simulation and real data, it is important to verify that they are not detrimental to the performance of the DNN calibration. A data sample of Sherpa-generated dijets with the same data preparation as described in Section 4.2 is used to compute jet energy and mass responses with the standard calibration and the DNN calibration.$^{10}$ Both are then compared with the performance obtained with the Pythia dijet data sample. Figure 13 shows this comparison. The DNN energy calibration shows almost identical performance for the two generators, except at low $p_{\text{T}}^{\text{true}}$ where a small under-performance is observed with the Sherpa data sample, although closure is still obtained. Moreover, the DNN calibration shows a smaller dependency on the generator than the standard calibration for the energy. For the mass calibration with the Sherpa data sample, the DNN response closes within 5% but outside 1% over the whole $p_{\text{T}}^{\text{true}}$ range. This can be expected as the jet's mass is more dependent on its particle flow structure. This under-performance is observed to be reduced in higher mass bins. The standard mass calibration shows smaller differences between the two generators but overall, the DNN mass calibration is still better than the standard one for both the Pythia and Sherpa generators.

Limitations and possible solutions
Despite the satisfying results obtained with the calibration methodology presented above, some limitations and open questions are identified. They are discussed together with possible solutions in this section.

Training convergence and ending criterion
A weakness in this DNN training procedure is the lack of a clear criterion to mark the final convergence of the network and end the training. It is observed that early interruptions of the training (as soon as the losses start to fluctuate around a minimum) result in worse calibration performance in terms of JES, JMS or resolution in significant parts of the phase space. Instead, training for numerous additional epochs (such as described in Section 5.4), during which the training loss remains essentially constant, is found to be necessary. A possible explanation is that the performance is judged against relatively strict expectations (e.g., closure within 1% for the energy response) to be met over the whole kinematic phase space, while only a part of it is statistically dominant in the training sample. As a consequence, the global loss can be mostly determined by this dominant phase space region in which the optimum is quickly reached, while it requires more training to improve the sparsely populated regions. This behaviour also resembles the 'grokking phenomenon' observed in some DNN problems, where performance sometimes suddenly increases after a very long training period [45].
Although the present study shows that training long enough allows coverage of the full phase space, a robust methodology would require a well-understood criterion to stop the training. Evaluating the calibration performance across all the phase space is not a practical option as it is too costly to perform automatically at each epoch. An interesting approach would be to evaluate the loss in predefined, well-chosen bins (in terms of statistical contents and of physics relevance) of the phase space. Monitoring the losses in these bins with training and validation samples would possibly be more indicative than monitoring the virtually flat total loss.
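The binned loss monitoring suggested above could look like the following minimal sketch. It is only an illustration of the idea: the function name `per_bin_mean_loss`, the $p_{\text{T}}$ bin edges and the toy loss values are hypothetical, and the per-jet loss is assumed to be available as a plain array.

```python
import numpy as np

def per_bin_mean_loss(per_jet_loss, bin_index, n_bins):
    """Average the per-jet loss separately in each predefined phase-space bin."""
    sums = np.bincount(bin_index, weights=per_jet_loss, minlength=n_bins)
    counts = np.bincount(bin_index, minlength=n_bins)
    # Empty bins are reported as NaN instead of raising a division warning.
    return np.divide(sums, counts, out=np.full(n_bins, np.nan), where=counts > 0)

# Toy example with three pT bins (edges in GeV are illustrative only).
pt_edges = np.array([200.0, 500.0, 1000.0, 3000.0])
jet_pt = np.array([250.0, 300.0, 1500.0, 2000.0])
jet_loss = np.array([1.0, 3.0, 2.0, 4.0])
bin_index = np.digitize(jet_pt, pt_edges) - 1
binned_loss = per_bin_mean_loss(jet_loss, bin_index, n_bins=len(pt_edges) - 1)
print(binned_loss)
```

Tracking such an array at each epoch for both the training and validation samples would expose sparsely populated bins that still improve while the total loss is flat.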

Calibration of heavy boosted jets
During the validation of the DNN calibration performance for $W$/$Z$, Higgs boson and top-quark jets, non-closures of up to a few percent are observed for the energy (see Figure 8). While this under-performance is also visible for the standard calibration, the DNN calibration performance is slightly worse in these cases. Contrary to the DNN calibration presented here, the standard calibration is obtained using different types of jets. At high $p_{\text{T}}$ (above ∼500 GeV) QCD jets are used, while at low $p_{\text{T}}$ the calibration is only derived from $W$/$Z$ jets as this choice leads to better calibration performance for both $W$/$Z$ jets and QCD jets.
A similar approach with mixed training samples is possible with the DNN. However, simply adding $W$/$Z$ samples to the training does not improve, and can even degrade, the performance for mass regions away from the $W$ and $Z$ masses. This issue could be mitigated with the availability of special pseudo-$W$/$Z$ samples having a flat mass distribution. Alternatively, an ad hoc solution mimicking the standard calibration is possible by adding a subnetwork to the already QCD-trained DNN. This subnetwork would branch from the input layer with $\eta$ annotation, produce output values used as multiplicative correction factors to the main network predictions and enforce these factors to be 1.0 for large $p_{\text{T}}$ jets (typically above 500 GeV) to maintain the good performance of the DNN in this phase space. The subnetwork could then be trained on $W$/$Z$ samples, while keeping the main network frozen, to improve the lower $p_{\text{T}}$ regions.

Importance of the features
Insights into which input features are the most influential on the DNN predictions are important for two reasons. Firstly, they generally help to understand the jet energy and mass responses. Secondly, they can indicate if some features could be removed from the inputs, potentially leading to a simpler DNN or even to better performance if the removed variables are confusing the network. A simple way to quantify the importance of a feature is to measure the average of the relative variation in the prediction when this feature is systematically varied for each input jet. A ranking of the different features obtained by following this method is shown in Figure 14, where each feature is varied by ±1% of the standard deviation of its overall distribution. Both the energy and mass prediction variations are displayed for each feature. As expected, the main kinematic input variables have a large influence on the predictions according to this measure. Some substructure features$^{11}$ have a stronger impact on the mass prediction than on the energy prediction. This is also expected since the mass depends on the structure of the particle flow objects in the jet. While by this metric $\mu$ and NPV do not seem so important, it was demonstrated in Section 6.3 that these features are important as they allow the DNN calibration to remove the response dependence on these variables. Further investigations are therefore necessary to understand more precisely the importance of the features. Those could consist of considering the magnitude of the DNN gradient with respect to its inputs in bins of $p_{\text{T}}$ or mass, or of applying the SHAP methodology.$^{12}$
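The variation-based importance measure described above can be sketched as follows. This is an illustration only: `feature_importance`, `toy_predict` and the toy inputs are hypothetical, and averaging the upward and downward variations together is an assumption, since the text does not specify how the two signs are combined.

```python
import numpy as np

def feature_importance(predict, features, fraction=0.01):
    """Mean relative change of the prediction when each feature is shifted
    by +/- `fraction` of the standard deviation of its distribution."""
    features = np.asarray(features, dtype=float)
    baseline = predict(features)
    importance = np.empty(features.shape[1])
    for j in range(features.shape[1]):
        step = fraction * features[:, j].std()
        variations = []
        for sign in (+1.0, -1.0):
            shifted = features.copy()
            shifted[:, j] += sign * step
            variations.append(np.abs(predict(shifted) - baseline) / np.abs(baseline))
        # Assumption: both signs are averaged into a single number per feature.
        importance[j] = np.mean(variations)
    return importance

# Toy stand-in for a trained network: depends on feature 0 only.
def toy_predict(x):
    return 1.0 + 0.05 * x[:, 0]

rng = np.random.default_rng(1)
toy_features = rng.normal(size=(1000, 2))
importance = feature_importance(toy_predict, toy_features)
print(importance)
```

Because the shift is local and small, a feature can rank low here while still mattering for the calibration, which is consistent with the caveat made in the text for the pile-up variables.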

Conclusion
A full implementation of a deep neural network based simultaneous calibration of the energy and mass of large-$R$ jets with the ATLAS detector is presented. It solves problems that are specific to the calibration task. Firstly, rather than segmenting the calibration in angular slices, a single network covers multiple detector regions and a dedicated encoding of the jet angular direction is included to help it to adapt to the corresponding strongly varying response. Secondly, the network architecture is designed to efficiently accommodate both the energy and the mass predictions. Finally, loss functions are carefully chosen so that the DNN prediction converges as desired toward the most probable value of the energy or mass response distribution; this also implies non-trivial training procedures involving variants of the loss functions.

$^{11}$ The Width, groomMRatio, ChargedPTFrac, ChargedMFrac, Split12, and C2.
$^{12}$ SHapley Additive exPlanations [46].
The resulting calibration is compared to the standard calibration on QCD jets and on other jet topologies not seen during the DNN training. Over all the phase space and for all processes of interest, the DNN is found to achieve better energy and mass scale closure and similar or better resolution, with a typical 30% improvement in the energy resolution for $p_{\text{T}} > 500$ GeV. The DNN calibration is also shown to be robust against various tests such as pile-up dependency or changes to the sample distribution. Finally, some limitations and possible improvements are discussed. Overall, as this simulation-based method can directly replace the current analogous steps of the ATLAS large-$R$ jet calibration chain, no adaptation of the data-based methodology is required and this DNN procedure can be used for the calibration of large-radius jets used in future ATLAS analyses.
In an effort to help the modelling, an additional input preparation step called $\eta$ annotation is implemented as the first operation of the DNN. This step computes extra features based on $\eta$ that are passed along with the other input features to the rest of the DNN. These extra features are calculated by evaluating Gaussian functions of $\eta$ with different central values and widths, chosen to cover key angular regions of the detector (such as the central part or edges of the electromagnetic barrel calorimeter and of other calorimeters). Figure 1 shows how the 11 different Gaussian functions are positioned, providing a continuous encoding of the proximity of the jet to these regions. These processed values are included alongside the unprocessed $\eta$ for a total of 12 representations.
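A minimal sketch of such an annotation step is given below. The central values and widths are purely illustrative placeholders (evenly spaced, equal width), not the values used for the detector regions, and the Gaussians are assumed to be unnormalised with a peak value of 1; `eta_annotation`, `CENTRES` and `WIDTHS` are hypothetical names.

```python
import numpy as np

# Hypothetical placement of the 11 Gaussian functions; the real central values
# and widths are chosen to match key detector regions and are not reproduced here.
CENTRES = np.linspace(-2.5, 2.5, 11)
WIDTHS = np.full(11, 0.3)

def eta_annotation(eta):
    """Return the raw eta followed by 11 Gaussian encodings (12 columns per jet)."""
    eta = np.asarray(eta, dtype=float).reshape(-1, 1)
    # Assumption: unnormalised Gaussians, equal to 1 at their central value.
    gaussians = np.exp(-0.5 * ((eta - CENTRES) / WIDTHS) ** 2)
    return np.concatenate([eta, gaussians], axis=1)

annotated = eta_annotation([0.0, 1.3])
print(annotated.shape)
```

Each jet then carries a smooth measure of its proximity to every encoded region, which the subsequent layers can weight differently.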

Figure 1: The principle of the $\eta$ annotation (left) and the various Gaussian functions of $\eta$ (right). The vertical shaded band on the right plot marks the calorimeter transition region.

Figure 2: The architecture of the DNN.

Figure 3: The loss evolution for the training (plain lines) and validation (dotted lines) data samples. The different spikes and drops correspond to the transitions between training steps (see Table 3) and are due to the variations of the loss function.

Figure 5 presents the excellent consistency of the DNN calibration performance for different $p_{\text{T}}^{\text{true}}$ bins. In every bin of $(p_{\text{T}}^{\text{true}}, m^{\text{true}}, \eta^{\text{true}})$ with $p_{\text{T}}^{\text{true}} > 200$ GeV, $m^{\text{true}} > 40$ GeV and $|\eta^{\text{true}}| < 2$, the DNN calibration performs as well as or better than the standard calibration, with a better response closure and resolution. Improvements of a few percent up to around 10% are observed in both energy and mass response closure and resolution.

Figure 5: The (a) jet energy and (b) mass calibrated response for the DNN calibration in multiple bins of $p_{\text{T}}^{\text{true}}$. The horizontal bands represent 1% (dark) and 5% (light) up and down divergences from closure.

Figure 6: The (a) jet energy and (b) mass response before calibration (squares) and after the DNN calibration with (triangles) or without (circles) the $\eta$ annotation step as a function of $\eta^{\text{true}}$. The horizontal bands represent 1% (dark) and 5% (light) up and down divergences from closure.

Figure 8: Boosted $W$/$Z$, top-quark and Higgs boson jet energy (left) and mass (right) responses as a function of $p_{\text{T}}^{\text{true}}$ for different calibrations: (a, b) no calibration, (c, d) standard calibration and (e, f) DNN calibration. The horizontal bands represent 1% (dark) and 5% (light) up and down divergences from closure.

Figure 10: Normalised distributions of (a) the jet energy before (black) and after (red) re-weighting, and the corresponding (b) jet energy and (c) mass responses obtained with the DNN calibration as a function of $p_{\text{T}}^{\text{true}}$ for these two energy distributions: the unweighted distribution (squares), as used for training, and the flat-energy weighted distribution (triangles). The horizontal bands represent 1% (dark) and 5% (light) up and down divergences from closure.

Figure 11: The dependence of (a) the jet energy and (b) the mass response on NPV for different calibrations (squares: no calibration, triangles: standard calibration, circles: DNN predicted calibration) as a function of $|\eta^{\text{true}}|$. The horizontal band represents 0.1% up and down divergences from a null gradient.

Figure 12: Jet energy (left) and mass (right) responses versus $E^{\text{true}}$ for different jet flavours, calibrated with the standard calibration (a, b) and the DNN calibration (c, d). The horizontal bands represent 1% (dark) and 5% (light) up and down divergences from closure.

Figure 13: Jet energy and mass responses versus $p_{\text{T}}^{\text{true}}$ calibrated with (a, b) the standard calibration and (c, d) the DNN predictions for a dijet data sample generated with Pythia (squares) or Sherpa (triangles). The horizontal bands represent 1% (dark) and 5% (light) up and down divergences from closure.

Figure 14: The variation of the energy (upward triangles) and mass (downward triangles) predictions when input features are varied (see the text for details).

Table 1: Simulated samples used for training and validation.

Table 2: The input features of the DNN.
$E$: Energy of the jet in GeV; $\log E$ is taken to reduce the spread of its distribution
$m$: Mass of the jet in GeV; $\log m$ is taken to reduce the spread of its distribution
Width: $\sum_i p_{\text{T}}^{i}\,\Delta R(i, \text{jet}) / \sum_i p_{\text{T}}^{i}$, where $\Delta R$ is the angular distance (sum over the jet constituents)
Split12, Split23: Splitting scales at the 1st and 2nd exclusive $k_t$ declusterings [35]
C2, D2: Energy correlation ratios [36, 37]
$\tau_{21}$, $\tau_{32}$: N-subjettiness ratios using WTA axis [38, 39]
Qw: Smallest invariant mass among the proto-jet pairs of the last 3 steps of a $k_t$ reclustering sequence
Detector level
EMFrac: Energy fraction deposited in the electromagnetic calorimeter
EM3Frac: Energy fraction deposited in the third layer of the electromagnetic calorimeter
Tile0Frac: Energy fraction deposited in the 1st layer of the hadronic calorimeter
EffNConsts

Table 3: The specific training procedure of the DNN. MDN(A) stands for mixture density network (asymmetric). The loss truncation is indicated in number of standard deviations, $\sigma$. When $E$ and $m$ are not mentioned, it means that the energy and mass losses are identical.