1 Introduction

Measurements of deep-inelastic scattering (DIS) by a multitude of experiments all over the world [1,2,3,4,5] and the study of these measurements by the theoretical and experimental communities [5] have revealed information on the quark–gluon structure of nuclear matter and established quantum chromodynamics (QCD) as the theory of the strong interaction. The experiments at the HERA collider facility at the DESY research centre in Hamburg, Germany, have played an important role in these studies. HERA has been the only electron–proton collider built so far [6]. The data collected in the years 1992–2007 have provided truly unique information on the internal structure of the proton and other hadrons [7].

The key component in these studies has been a precise reconstruction of the DIS kinematics, using information from the accelerator and the detectors. Multiple methods have been applied at the HERA experiments [8] to reach an optimal precision in each particular measurement. Each of the classical reconstruction methods uses only partial information from the DIS event and is subject to specific limitations, either arising from the detector or the assumptions used in the method.

With the DIS measurements at the upcoming Electron-Ion Collider in mind [9], we present in this work a novel method for the reconstruction of DIS kinematics based on supervised machine learning and study its performance using Monte Carlo simulated data from the ZEUS experiment [10] at HERA. For our approach, we develop deep neural network (DNN) models that are optimised for the problem and are allowed to take full information from the DIS event into account. We train the DNN models on simulated data from the ZEUS experiment and compare the results from our trained model with the results from the classical reconstruction methods.

We show that the reconstruction of the DIS kinematics using deep neural networks provides a rigorous, data-driven method to combine and outperform classical reconstruction methods over a wide kinematic range. Neural networks have already been used in the context of DIS experiments in the past [11], and we expect that our novel method and similar approaches will play an even more important role in ongoing and future DIS experiments [12, 13].

2 Deep inelastic scattering

Deep inelastic scattering is a process in which a high-energy lepton (l) scatters off a nucleon or nucleus target (h) with large momentum transfer (the momentum of each entity is given in parentheses):

$$\begin{aligned} l({k}) + h({P}) \rightarrow l^\prime ({k^\prime }) + {\mathcal {H}}({P^\prime }) +\textrm{remnant}. \end{aligned}$$
(1)

The detectors in collider experiments are designed to measure the final state of the DIS process, consisting of the scattered lepton \(l^\prime \) and the hadronic final state (HFS) \({\mathcal {H}}\). The latter consists of hadrons with a relatively long lifetime as well as some photons and leptons but does not include the hadron remnant. The H1 [14] and ZEUS experiments were not able to register the remnant of the target due to its proximity to the proton beam pipe.

2.1 Deep inelastic scattering at Born level

In the leading order (Born) approximation, the leptons interact with quarks in the hadrons by the exchange of a single virtual \(\gamma \) or Z boson in the neutral current (NC) reaction, or of a single \(W^{\pm }\) boson in the charged current (CC) reaction. The kinematics of the leading order DIS process in a Feynman diagram-like form is shown in Fig. 1. In this paper, we consider only the neutral current scattering of an electron off a proton in a collider experiment. In this reaction, the final state lepton is a charged particle (electron or positron) that can be easily registered and identified in the detector.

Fig. 1
figure 1

Schematic representation of Deep Inelastic Scattering process at Born level

With a fixed centre-of-mass energy, \(\sqrt{s}=\sqrt{({k}+{P})^2}\), two independent, Lorentz-invariant, scalar quantities are sufficient to describe the deep inelastic scattering event kinematics in the Born approximation. The quantities typically used are:

  • the negative squared four-momentum of the exchanged electroweak gauge boson:

    $$\begin{aligned} Q^2 = -{q}\cdot {q} = -({k}-{k^\prime })^2, \end{aligned}$$
    (2)
  • the Bjorken scaling variable, interpreted in the frame of a fast moving nucleon as the fraction of incoming nucleon longitudinal momentum carried by the struck parton:

    $$\begin{aligned} x = \frac{Q^2}{2{P} \cdot {q}}. \end{aligned}$$
    (3)

In addition to that, the inelasticity y is used to define the kinematic region of interest. It is defined as the fraction of incoming electron energy taken by the exchanged boson in the proton rest frame

$$\begin{aligned} y = \frac{{P} \cdot {q}}{{P} \cdot {k}}. \end{aligned}$$
(4)

Therefore, for DIS the relation

$$\begin{aligned} Q^2 = s y x \end{aligned}$$
(5)

holds. However, the Born-level picture of the DIS process is not sufficient for the description of the observed physics phenomena. A realistic description of DIS requires the inclusion of higher order QED and QCD processes [15].
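
As a concrete numerical illustration of the Born-level definitions above, the following minimal Python sketch computes \(Q^2\), x and y from four-vectors and checks Eq. (5). The beam energies and the scattered-electron kinematics are purely illustrative HERA-like values, and all particle masses are neglected:

```python
import numpy as np

def mdot(a, b):
    """Four-vector dot product in the (+,-,-,-) metric; a, b = (E, px, py, pz)."""
    return a[0]*b[0] - a[1]*b[1] - a[2]*b[2] - a[3]*b[3]

# Illustrative HERA-like beams (GeV), particle masses neglected.
k = np.array([27.5, 0.0, 0.0, -27.5])    # incoming electron
P = np.array([920.0, 0.0, 0.0, 920.0])   # incoming proton

# An illustrative scattered electron with energy E (GeV) and polar angle theta (rad).
E, theta = 25.0, 2.7
kp = np.array([E, E*np.sin(theta), 0.0, E*np.cos(theta)])

q = k - kp                                    # exchanged boson momentum at Born level
Q2 = -mdot(q, q)                              # Eq. (2)
x = Q2 / (2.0 * mdot(P, q))                   # Eq. (3)
y = mdot(P, q) / mdot(P, k)                   # Eq. (4)
s = mdot(k + P, k + P)

print(f"Q2 = {Q2:.2f} GeV^2, x = {x:.5f}, y = {y:.4f}")
print(f"Q2 - s*x*y = {Q2 - s*x*y:.2e}")       # Eq. (5), zero for massless kinematics
```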

2.2 Higher order corrections to deep-inelastic scattering process

The DIS process with leading order QED corrections can be written as

$$\begin{aligned} l({k}) + h({P}) \rightarrow l^\prime ({k^\prime }) + \gamma ({P_{\gamma }}) + X({P^\prime }) +\textrm{remnant}, \end{aligned}$$
(6)

so the kinematics is defined not only by the kinematics of the scattered electron and the struck parton, but also by the momentum of the radiated photon, \({P_{\gamma }}\). The lowest order electroweak radiative corrections can be depicted in the form of Feynman diagrams as shown in Fig. 2a–d and should be considered together with the virtual corrections in Fig. 2e–g. Figure 2a, d correspond to initial state radiation (ISR) and Fig. 2b, c to final state radiation (FSR).

Fig. 2
figure 2

Feynman diagrams for the Deep Inelastic Scattering process with some leading order electroweak corrections (a–g) and QCD corrections (h–k). The proton remnant is omitted

With the virtual corrections taken into account in the DIS process, Eq. (2) no longer holds, i.e.

$$\begin{aligned} {q}\cdot {q} \ne ({k}-{k^\prime })^2. \end{aligned}$$
(7)

The presence of higher order QCD processes (e.g. the boson–gluon fusion in Fig. 2h and the QCD Compton processes in Fig. 2i, j) makes the kinematic description of the DIS process even more complicated. Therefore, the exact definitions of the kinematic observables used in the analysis of the DIS events and in the corresponding simulations are essential for the correct physics interpretation.

2.3 The simulation of the DIS events and the kinematic variables in the simulated events

The simulation of inclusive DIS events in Monte Carlo event generators (MCEGs) starts at the parton level, i.e. with the simulation of the hard scattering process and of the kinematics of the involved partons, the latter given e.g. by parton distribution functions (PDFs) [16] for the given hadron, considering processes with all types of partons in the initial state. The modelling of the hard scattering process combines the calculations of the perturbative QED and/or QCD matrix elements for the \(2\rightarrow n \) processes at parton level with different QCD parton cascade algorithms designed to take into account at least some parts of the higher order perturbative QCD corrections not present in the matrix-element calculations.

The simulated collision events on the particle (hadron) level are obtained using the parton level simulations as input and applying phenomenological hadronisation and particle decay models to them.

As of 2022, multiple MCEG programs are capable of simulating the inclusive DIS process at the hadron level with different levels of theory precision and sophistication of modelling of hadronisation, beam remnant and parton cascades, e.g. Pythia6 [17], Pythia8 [18], SHERPA-MC [19], WHIZARD [20] and Herwig7 [21]. In addition to that, the Lepto [22], Ariadne [23], Cascade [24] and Rapgap [25] programs can simulate the DIS process using parts of the Pythia6 framework for the simulation of hadronisation processes and decays of particles.

As discussed above, DIS beyond the Born approximation has a complicated structure which involves QCD and QED corrections [9]. The most recent MCEG programs, e.g. Herwig7, Pythia8, SHERPA-MC, or WHIZARD, contain these corrections as a particular case of their own general-purpose frameworks or are able to use specialised packages like OpenLoops [26], BlackHat [27] or MadGraph [28]. In general, the modern MCEGs do not specify their definitions of the DIS kinematic observables, but in some cases these can be calculated from the kinematics of the initial and final state, both for the true and reconstructed kinematics. For instance, under some assumptions, the \(Q^{2}\) in the event could be calculated according to Eq. (2). A total elimination of the ambiguities in such calculations is not possible, as the final state kinematics even at the parton level depend on the kinematics of all the emitted partons. Calculating the kinematic observables from the momenta of particles at hadron level adds a further ambiguity related to the identification of the scattered lepton and the distinction of that lepton from leptons produced in the hadronisation and decay processes.

Contrary to the approaches adopted in modern MCEG programs, the MCEGs used for the HERA experiments relied upon generator-by-generator implementations of the higher order QED and QCD corrections specific for DIS or alternatively applied HERACLES [15] for the corrections.

The simulation of a DIS collision event in a specific detector proceeds in the same way as for any other type of particle collision event. It involves the simulation of particle transport through the detector material and the simulation of the detector response, and is typically performed with Geant [29, 30] or similar tools. The simulated detector response is passed to the experiment-specific reconstruction programs and should be indistinguishable from the real data recorded by the detector and processed in the same way.

2.4 Reconstruction of the kinematic variables at the detector level

The kinematics of the DIS events are reconstructed in collider experiments by identifying and measuring the momentum of the scattered lepton \(l^\prime \) and/or by measurements of the hadronic final state (\({\mathcal {H}}\)). The identification of the scattered lepton is ambiguous even at the particle level of the simulated DIS collision events. The same ambiguity is present in the reconstructed real and simulated data at the detector level. Therefore, the identification of the scattered electron candidate for the purposes of physics analyses is a complicated task in itself and has been the subject of multiple studies in the past, some of which also involved neural network-related techniques [11]. In our paper, we rely on the standard method of electron identification at the ZEUS experiment [11] and discuss solely the reconstruction of the kinematic variables using the identified electron and other quantities measured in the detector.

The physics analyses performed in the experiments at HERA relied on the following quantities for the calculation of the kinematic observables x and \(Q^2\):

  • The energy (\(E_{l^\prime }\)) and polar angle (\(\theta _{l^\prime }\)) of the scattered electron. Most of the DIS experiments are equipped to register the scattered electron using the tracking and calorimeter detector subsystems. While the tracking system is able to provide information on the momentum of the scattered electron, the calorimeter system can be used to estimate the energy of the electron and the total energy of the collinear radiation emitted by the electron. At the detector level, the estimation of these energies can be done by comparing the momentum of the electron reconstructed in the tracking system and the energy deposits registered in the calorimeter system around the extrapolated path of the electron in the calorimeter.

  • The energy of the HFS expressed in terms of the following convenient variables:

    $$\begin{aligned} \delta _{{\mathcal {H}}} =\sum _{i\in {\mathcal {H}}} \left( E_i-P_{Z,i}\right) \end{aligned}$$
    (8)

    and

    $$\begin{aligned} P_{T,{\mathcal {H}}}=\sqrt{\left( \sum _{i\in {\mathcal {H}}} P_{X,i} \right) ^2 + \left( \sum _{i\in {\mathcal {H}}} P_{Y,i} \right) ^2}, \end{aligned}$$
    (9)

where the sums run over the registered objects i, excluding the scattered electron. Depending on the analysis requirements, the objects used could be either registered tracks, energy deposits in the calorimeter system, or a combination of both.

The measurements of the quantities listed above overconstrain the reconstruction of the DIS kinematics. Therefore, in the simplest case, any two of the observables \(E_{l^\prime }\), \(\theta _{l^\prime }\), \(\delta _{{\mathcal {H}}}\), and \(P_{T,{\mathcal {H}}}\) can be used for the reconstruction.

In our analysis, we consider three specific classical reconstruction methods based on these observables which were used by the ZEUS collaboration in the past: the electron, the Jacquet-Blondel, and the double-angle methods. We briefly provide some details on the methods in this section, while a more detailed description can be found elsewhere [31].

The electron (EL) method uses only measurements of the scattered lepton, \(E_{l^\prime }\) and \(\theta _{l^\prime }\), to reconstruct \(Q^2\) and x. The kinematic variables calculated from these measurements are given by:

$$\begin{aligned} Q^2_{EL} = 2 E_l E_{l^\prime }(1+\cos \theta _{l^\prime }) \end{aligned}$$
(10)

and

$$\begin{aligned} x_{EL} = \frac{E_l E_{l^\prime }(1+\cos \theta _{l^\prime })}{E_P(2E_l-E_{l^\prime }(1-\cos \theta _{l^\prime }))}. \end{aligned}$$
(11)

The electron method provides a precise reconstruction of the kinematics, but, since it uses only information from the scattered lepton, it is affected by initial and final state QED radiation. Namely, QED radiation registered in the detector separately from the scattered electron is not taken into account in the calculations with this method. In practice, the reconstruction with this method gives reasonable results when \(E_l\) and \(E_{l^\prime }\) are significantly different from one another, but the resolution and stability become poor otherwise.

The Jacquet-Blondel (JB) method uses only measurements of the final state hadronic system, \(\delta _{{\mathcal {H}}}\) and \(P_{T,{\mathcal {H}}}\), for the reconstruction. The kinematic variables are calculated from these as:

$$\begin{aligned} Q^2_{JB}= & {} \frac{2 E_l P_{T,{\mathcal {H}}}^2}{2 E_l - \delta _{{\mathcal {H}}}}, \end{aligned}$$
(12)
$$\begin{aligned} x_{JB}= & {} \frac{2 E_l Q^2_{JB}}{s \delta _{{\mathcal {H}}}}. \end{aligned}$$
(13)

The JB method is resistant to possible biases from unaccounted QED FSR, but requires precise measurements of the particle momenta in the hadronic final state. The factors that limit the precision of the measurements are the uncertainties in the particle identification, the finite resolution of the calorimeter and tracking detectors, the inefficiencies of these detectors, the impossibility of detecting particles close to the beampipe, and the presence of objects that escape detection (e.g. neutrinos from particle decays).

The double-angle (DA) method combines measurements from the scattered lepton and the final-state hadronic system, \(\theta _{l^\prime }\) and \(\gamma _{\mathcal {H}}\), to perform the kinematic reconstruction as follows:

$$\begin{aligned} Q^2_{DA}= & {} \frac{4E_l^2\sin \gamma _{{\mathcal {H}}}(1+\cos \theta _{l^\prime })}{\sin \gamma _{{\mathcal {H}}} +\sin \theta _{l^\prime }-\sin (\gamma _{{\mathcal {H}}} + \theta _{l^\prime })}, \end{aligned}$$
(14)
$$\begin{aligned} x_{DA}= & {} \frac{E_l\sin \gamma _{{\mathcal {H}}}(1+\cos \theta _{l^\prime })}{E_P\sin \theta _{l^\prime }(1-\cos \gamma _{{\mathcal {H}}})}, \end{aligned}$$
(15)

where the angle \(\gamma _{\mathcal {H}}\) is defined as

$$\begin{aligned} \cos \gamma _{\mathcal {H}}=\frac{P_{T,{\mathcal {H}}}^2-\delta _{{\mathcal {H}}}^2}{P_{T,{\mathcal {H}}}^2+\delta _{{\mathcal {H}}}^2}. \end{aligned}$$
(16)

The angle \(\gamma _{\mathcal {H}}\) depends on the ratio of the measured quantities \(\delta _{{\mathcal {H}}}\) and \(P_{T,{\mathcal {H}}}\), and thus, uncertainties in the hadronic energy measurement tend to cancel, leading to good stability of the reconstructed kinematic variables. Similar to the electron method, when \(E_l\) and \(E_{l^\prime }\) are significantly different from one another, the double-angle method provides reliable results, but the resolution and stability are poor otherwise.
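
For reference, the three classical methods can be collected in a short Python sketch that follows Eqs. (10)–(16) directly; the beam energies \(E_l\) and \(E_P\) enter as assumed constants:

```python
import numpy as np

E_l, E_P = 27.5, 920.0            # assumed beam energies (GeV)
s = 4.0 * E_l * E_P               # centre-of-mass energy squared (massless beams)

def electron_method(E_e, theta_e):
    """Eqs. (10)-(11): reconstruction from the scattered electron only."""
    Q2 = 2.0 * E_l * E_e * (1.0 + np.cos(theta_e))
    x = (E_l * E_e * (1.0 + np.cos(theta_e))
         / (E_P * (2.0 * E_l - E_e * (1.0 - np.cos(theta_e)))))
    return Q2, x

def jacquet_blondel_method(delta_h, pt_h):
    """Eqs. (12)-(13): reconstruction from the hadronic final state only."""
    Q2 = 2.0 * E_l * pt_h**2 / (2.0 * E_l - delta_h)
    x = 2.0 * E_l * Q2 / (s * delta_h)
    return Q2, x

def double_angle_method(theta_e, delta_h, pt_h):
    """Eqs. (14)-(16): reconstruction from the electron and hadronic angles."""
    cos_gamma = (pt_h**2 - delta_h**2) / (pt_h**2 + delta_h**2)  # Eq. (16)
    gamma = np.arccos(cos_gamma)
    denom = np.sin(gamma) + np.sin(theta_e) - np.sin(gamma + theta_e)
    Q2 = 4.0 * E_l**2 * np.sin(gamma) * (1.0 + np.cos(theta_e)) / denom
    x = (E_l * np.sin(gamma) * (1.0 + np.cos(theta_e))
         / (E_P * np.sin(theta_e) * (1.0 - cos_gamma)))
    return Q2, x
```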

2.5 The methodology of measurements in the deep inelastic ep collisions

The methodology of measurements at lepton–hadron colliders is in general similar to the methodologies used at \(e^+ e^-\) and hadron–hadron colliders. Briefly, in most cases the quantity of interest is measured from the real data registered in the detector and corrected for detector effects. The corrections are estimated by comparing the analysis at the detector level with the same analysis at particle level, using detailed simulations of the collision events that include higher order QED and QCD processes.

The main difference between the measurements at lepton–hadron colliders and elsewhere is in the way the measurements involve the collision kinematics. At \(e^+ e^-\) colliders, the initial kinematics of the interactions is given by the lepton beam energies, which are known parameters of the accelerators. Therefore, it is straightforward for most of the measurements in \(e^+ e^-\) experiments to estimate the centre-of-mass energy of the hard collision process. In hadron–hadron collider experiments, there is no way to measure the kinematic properties of the partons initiating the collision process, as the involved partons cannot be observed in a free state, and most measurements in hadron–hadron collisions are inclusive in the kinematics of the initial state. DIS collisions at electron–proton colliders take a middle stance between these cases: the kinematic observables of the DIS process are measured on an event-by-event basis at the detector level using the methods described above.

In an experiment, the measurement of the event kinematics is affected by various effects. For a proper comparison of measurements of the HFS, e.g. jet cross-sections or event shape observables, to the corresponding perturbative QCD (pQCD) predictions, the detector-level measurements are unfolded for detector effects, while hadronisation correction factors are calculated using MCEGs or specialised programs and applied to the pQCD predictions [32,33,34,35]. The prescriptions for the calculation of those correction factors vary depending on the HFS quantities measured and the definitions of the kinematic observables used. Typically, at ZEUS and the other experiments at HERA, after the unfolding of the detector effects, the measurements were also scaled by radiation correction factors to facilitate a comparison to theoretical calculations available at Born level in QED, see e.g. Ref. [36]. The factors were obtained from separate high-statistics MC simulations. This is a well-understood Monte Carlo approach, and our deep learning technique can be used with it in exactly the same way as the classical methods, both for the experimental and the MC data. For future DIS measurements, e.g. at the upcoming Electron-Ion Collider, it is expected that the effects of QED and QCD radiation can be treated in a unified formalism [35], with the QED effects taken into account in a factorised approach. Our DNN-based reconstruction of DIS kinematics is compatible with such a factorised approach as well.

Therefore, in this analysis, we keep the calculation of correction factors for QED and hadronisation effects out of scope and limit the discussion to detector-level measurements. At particle level in the MC generated events, we use a single definition of “true” kinematic observables that is based on the kinematics of the exchanged boson extracted from the MC event information.

Namely, we use the definitions of the true-level observables \(Q^{2}_{\textrm{true}}\) and \(x_{\textrm{true}}\) that follow the definitions implemented in the ZEUS Common NTuples [37, 38] and were used in many ZEUS analyses. With these definitions, \(Q^{2}_{\textrm{true}}\) is calculated directly from the squared four-momentum of the exchanged boson \({q}_{\textrm{boson}}\),

$$\begin{aligned} Q^{2}_{\textrm{true}} = - |{q}_{\textrm{boson}}|^2. \end{aligned}$$
(17)

The \(x_{\textrm{true}}\) is calculated according to the formula

$$\begin{aligned} x_{\textrm{true}} = \frac{Q^{2}_{\textrm{true}}}{y_{\mathrm{after\ ISR}} s_\mathrm{after\ ISR}}, \end{aligned}$$
(18)

with

$$\begin{aligned} y_{\mathrm{after\ ISR}} = \frac{{q}_{\textrm{boson}}\cdot {P}_\mathrm{proton\ beam}}{ ({P}_{\mathrm{lepton\ beam}} - {P}_\mathrm{ISR\ radiation} )\cdot {P}_{\mathrm{proton\ beam}}} \end{aligned}$$
(19)

and

$$\begin{aligned} s_{\mathrm{after\ ISR}} = | {P}_{\mathrm{proton\ beam}} + {P}_\mathrm{lepton\ beam} - {P}_{\mathrm{ISR\ radiation}} |^2, \end{aligned}$$
(20)

where \({P}_{\mathrm{proton\ beam}}\), \({P}_{\mathrm{lepton\ beam}}\) and \({P}_{\mathrm{ISR\ radiation}}\) represent the four-momenta of the proton beam particles, the lepton beam particles and the momenta lost by the lepton beam particles to ISR. Thus, \(y_{\mathrm{after\ ISR}}\) corresponds to the fraction of the energy of the bare lepton (i.e. without the ISR) transferred to the HFS in the proton rest frame, and \(s_{\mathrm{after\ ISR}}\) to the centre-of-mass energy squared of the proton and the incoming bare electron.
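
Given the relevant four-momenta from an MC event record, the true-level definitions of Eqs. (17)–(20) reduce to a few lines. In this sketch the argument names are illustrative and do not correspond to actual ZEUS Common NTuple branches:

```python
import numpy as np

def mdot(a, b):
    """Four-vector dot product in the (+,-,-,-) metric; a, b = (E, px, py, pz)."""
    return a[0]*b[0] - np.dot(a[1:], b[1:])

def true_kinematics(q_boson, P_proton, P_lepton, P_isr):
    """True-level Q2 and x from MC event information, Eqs. (17)-(20)."""
    Q2_true = -mdot(q_boson, q_boson)                     # Eq. (17)
    lepton_after_isr = P_lepton - P_isr
    y_after_isr = (mdot(q_boson, P_proton)
                   / mdot(lepton_after_isr, P_proton))    # Eq. (19)
    p_tot = P_proton + lepton_after_isr
    s_after_isr = mdot(p_tot, p_tot)                      # Eq. (20)
    x_true = Q2_true / (y_after_isr * s_after_isr)        # Eq. (18)
    return Q2_true, x_true
```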

3 Machine learning methods

The reconstruction of DIS event kinematics is overconstrained by the previously mentioned measurements, \(E_{l^\prime }\), \(\theta _{l^\prime }\), \(\delta _{{\mathcal {H}}}\), and \(P_{T,{\mathcal {H}}}\). We trained an ensemble of deep neural networks (DNNs) to reconstruct x and \(Q^2\) by correcting results from classical reconstruction methods using the information on the scattered lepton and the final-state hadronic system.

The ensemble DNN method presented here is a new approach designed specifically to address this reconstruction problem. The remainder of this section discusses in detail the specifics of the DNN architectures, the optimisation methods used to find the optimal parameters defining the DNNs, and the structure of the ensemble models. As shown rigorously in what follows, the main motivations for using this model are:

  • the universal approximation capabilities of the DNN models with our specific architectures

  • the necessary reduction in the approximation error with the increase in the depth of our DNN models

  • avoidance of a degradation problem [39] in training due to the residual structure of our DNN models.

Further details on the DNN-based reconstruction method will be provided in Ref. [40].

3.1 Neural networks architecture

The problem of reconstructing the DIS event kinematics can be posed as the task of constructing a functional relationship between a set of input variables \(\textrm{x}\) and a set of response variables \(\textrm{y}\). In many cases, such a relation is a continuous function and can be approximated to arbitrary accuracy with a neural network. The corresponding neural network can have various architectures. The simplest architecture is a sequence of fully-connected, feed-forward (hidden) layers. For an input \(\textrm{x} \in {\mathbb {R}}^m\), the output of the first hidden layer is:

$$\begin{aligned} h^1 (\textrm{x}) = \alpha \left( A_0 \textrm{x} \right) , \quad h^1 \in {\mathbb {R}}^{d_1}, \end{aligned}$$
(21)

where \(A_0 : {\mathbb {R}}^m \rightarrow {\mathbb {R}}^{d_1}\) is an affine transformation, the composition of a translation and a linear mapping, and \(\alpha : {\mathbb {R}}\rightarrow {\mathbb {R}}\) is a nonlinear function, called an activation function. Define the application of \(\alpha \) on a vector \(\textrm{x}:=\left( \textrm{x}_1,\textrm{x}_2,...,\textrm{x}_m \right) ^T \in {\mathbb {R}}^m\), by \(\alpha (\textrm{x}) := \left( \alpha (\textrm{x}_1),\alpha (\textrm{x}_2),...,\alpha (\textrm{x}_m) \right) ^T\).

For \(A_1 : {\mathbb {R}}^{d_1} \rightarrow {\mathbb {R}}^{d_2}\), the output \(h^1\) is passed into a second hidden layer:

$$\begin{aligned} h^2 (\textrm{x})= \alpha \left( A_1 h^1(\textrm{x}) \right) , \quad h^2 \in {\mathbb {R}}^{d_2}, \end{aligned}$$
(22)

and so on, for a desired number of hidden layers. An affine transformation on the output of the final hidden layer is the output of the network. For \(i\in {\mathbb {N}}\), call \(d_i\) the number of nodes in hidden layer i.

Define \(\varvec{\Psi }^D_{m,n}(\alpha )\) to be the set of neural networks with D hidden layers that map \({\mathbb {R}}^m \rightarrow {\mathbb {R}}^n\) with a continuous, nonlinear activation function \(\alpha \). If \(\psi \in \varvec{\Psi }^D_{m,n}(\alpha )\), then there exists a set of natural numbers \(\{d_i\}_{i=0}^{D+1} \subset {\mathbb {N}}\) with \(d_0 = m\) and \(d_{D+1} = n \) and a set of affine transformations \(\{A_i : {\mathbb {R}}^{d_{i}} \rightarrow {\mathbb {R}}^{d_{i+1}} \}_{i=0}^D\), such that, for \(\textrm{x} \in {\mathbb {R}}^m\),

$$\begin{aligned} {\left\{ \begin{array}{ll} h^0 (\textrm{x})= \textrm{x}, \\ h^{i+1} (\textrm{x})= \alpha \left( A_{i} h^i (\textrm{x}) \right) \text { for } 0 \le i < D, \\ \psi (\textrm{x}) = A_D h^{D} (\textrm{x}). \end{array}\right. } \end{aligned}$$
(23)

Specific universality properties of \(\varvec{\Psi }^D_{m,n}(\alpha )\) are proven in Refs. [41, 42]. In particular, there exists some element in the class of neural networks that is arbitrarily close to any continuous function. For this reason, we say that \(\varvec{\Psi }^D_{m,n}(\alpha )\) is dense in the space of continuous functions.

We consider a particular subclass of neural networks that also maintains the universal approximation property. We call it \(\varvec{\Phi }^D_{m,n}(\alpha )\). If a function \(\phi \in \varvec{\Phi }^D_{m,n}(\alpha ) \), there exists a set of natural numbers \(\{d_i\}_{i=0}^{D+1} \subset {\mathbb {N}}\) with \(d_0 = m\) and \(d_{D+1} = n \), a set of affine transformations \(\{A_i : {\mathbb {R}}^{d_{i}} \rightarrow {\mathbb {R}}^{d_{i+1}} \}_{i=0}^{D-1}\), and a set of matrices \(\{W_i \in {\mathbb {R}}^{n \times d_i}\}_{i=0}^{D}\), such that, for \(\textrm{x} \in {\mathbb {R}}^m\),

$$\begin{aligned} {\left\{ \begin{array}{ll} h^1 (\textrm{x})= \alpha \left( A_{0} \textrm{x} \right) , \\ h^{i+1} (\textrm{x})= \alpha \left( A_{i} h^i (\textrm{x}) \right) \text { for } 1 \le i < D, \\ \phi (\textrm{x}) = W_0 \textrm{x} + \sum _{i=1}^D W_i h^i (\textrm{x}). \end{array}\right. } \end{aligned}$$
(24)

Here the affine transformations \(A_i\), \(i=0,1,\dots , D-1\), have the form

$$\begin{aligned} A_i\textrm{x}=M_i\textrm{x}+\textrm{b}_i, \ \ \text{ for }\ \ \textrm{x}\in {\mathbb {R}}^{d_i} \end{aligned}$$

for matrices \(M_i\in {\mathbb {R}}^{d_{i+1}\times d_i}\) and vectors \(\textrm{b}_i\in {\mathbb {R}}^{d_{i+1}}\). The function \(\phi \) defined by (24) is thus determined by the matrices \(M_i\) and \(W_i\) and the vectors \(\textrm{b}_i\), whose entries will later be collected in a single notation \(\omega \). That is, \(\phi =\phi _\omega \). When \(\omega \) is chosen via an optimisation problem, \(\phi (\textrm{x})\) is a function of the input variable \(\textrm{x}\in {\mathbb {R}}^{d_0}\).

Let \(\varvec{\Phi }^D_{m,n}(\alpha )\) be the function class in which we search for an optimal kinematic reconstruction method. In addition to the universality property, this class is well suited to the search for an appropriate reconstruction function because the approximation error necessarily decreases with increasing depth, while the residual structure avoids a degradation problem in the weights during training [40].
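
To make the structure of Eq. (24) concrete, the following PyTorch sketch implements a network from \(\varvec{\Phi }^D_{m,n}(\alpha )\) with a constant width for all hidden layers. It illustrates only the residual output structure discussed above and is not the implementation of Ref. [40]:

```python
import torch
import torch.nn as nn

class ResidualOutputNet(nn.Module):
    """Sketch of Eq. (24): every hidden layer h^i feeds the output through
    its own linear map W_i, plus a direct linear term W_0 x on the input."""

    def __init__(self, m, n, depth, width):
        super().__init__()
        dims = [m] + [width] * depth
        # Affine maps A_i : R^{d_i} -> R^{d_{i+1}}, i = 0, ..., D-1
        self.affines = nn.ModuleList(
            nn.Linear(dims[i], dims[i + 1]) for i in range(depth))
        # Linear maps W_i : R^{d_i} -> R^n, i = 0, ..., D (W_0 acts on the input)
        self.skips = nn.ModuleList(
            nn.Linear(d, n, bias=False) for d in dims)

    def forward(self, x):
        out = self.skips[0](x)           # W_0 x
        h = x
        for A, W in zip(self.affines, self.skips[1:]):
            h = torch.relu(A(h))         # h^{i+1} = alpha(A_i h^i)
            out = out + W(h)             # ... + W_{i+1} h^{i+1}
        return out
```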

3.2 Optimisation methods

In any given problem to be solved with the DNN, the task is to choose an optimal function from the class \({\mathcal {F}}=\varvec{\Phi }^D_{m,n}(\alpha )\) to be a functional relationship between a set of input variables \(\textrm{x}\) and a set of response variables \(\textrm{y}\).

Each function \(\phi \in {\mathcal {F}}\) is completely determined by the elements of the matrices and vectors that determine the values of the hidden layers. Let \(\omega \in {\mathbb {R}}^p\) be the collection of all these elements, where p is the total number of them. We can then write for any \(\textrm{x}\in {\mathbb {R}}^m\): \(\phi (\textrm{x})=\phi _\omega (\textrm{x})\).

Choosing an optimal function from this class entails finding the collection of parameters \(\omega \) in \({\mathbb {R}}^p\) that minimises \({\mathcal {L}}_{\mathcal {D}}\), the expected discrepancy or generalisation error between \(\textrm{y}\) and \(\phi _\omega (\textrm{x})\), over the joint probability distribution of \(\textrm{x}\) and \(\textrm{y}\), \({\mathcal {D}}\), defined by:

$$\begin{aligned} {\mathcal {L}}_{\mathcal {D}}(\omega ) = \mathop {{\mathbb {E}}}_{\textrm{x},\textrm{y}\sim {\mathcal {D}}} \ell (\omega ,\textrm{x},\textrm{y}) = \int \ell (\omega ,\textrm{x},\textrm{y}) d {\mathcal {D}}(\textrm{x},\textrm{y}), \end{aligned}$$
(25)

where \(\ell \) is some loss function measuring the discrepancy between \(\textrm{y}\) and \(\phi _\omega (\textrm{x})\).

With a randomly sampled data set \(\{\textrm{x}_i,\textrm{y}_i\}_{i=1}^N\), if N is sufficiently large, then the generalisation error can be approximated by an empirical error:

$$\begin{aligned} {\mathcal {L}}_{\mathcal {D}}(\omega ) \approx \frac{1}{N} \sum _{i=1}^N \ell (\omega ,\textrm{x}_i,\textrm{y}_i). \end{aligned}$$
(26)

This summation is sometimes called the fidelity term, as it measures the discrepancy between the data and model predictions. A commonly used fidelity term is the mean square error, with:

$$\begin{aligned} \ell (\omega ,\textrm{x}_i,\textrm{y}_i) = \Vert \textrm{y}_i - \phi _\omega (\textrm{x}_i)\Vert ^2_2. \end{aligned}$$
(27)

The fidelity term used in this study is a modification of this, the mean square logarithmic error:

$$\begin{aligned} \ell (\omega ,\textrm{x}_i,\textrm{y}_i) = \Vert \log (\textrm{y}_i) - \log (\phi _\omega (\textrm{x}_i))\Vert ^2_2. \end{aligned}$$
(28)

Due to the universality of neural networks, there is a model that can achieve zero empirical error. However, in the presence of noise this means the model can overfit to the data sample and lose its generalisability. This problem can be addressed by adding a regularisation term to the optimisation problem as a penalty for certain irregular behaviour. Common regularisation terms include the \(\ell ^2\)-norm, used to limit the size of the parameters, and the \(\ell ^1\)-norm, which can induce sparse solutions. For our analysis, we choose the \(\ell ^1\)-regularisation. Theorem 1.3 in [43] and Proposition 27 of [44] provide evidence that minimising the \(\ell ^1\) norm yields sparse optimal solutions with a minimal number of nonzero elements. Well-constructed models should be able to generalise the information from one given sample to any possible event. Thus, regularisation with the \(\ell ^1\) norm produces a model determined by a minimal number of parameters, so that the optimal solution does not fit completely to the training set and lose its generalisability.

Therefore, the determination of the optimal neural network model consists of minimising this final loss function, the sum of the sample fidelity term and a weighted regularisation term:

$$\begin{aligned} \min _{\omega \in {\mathbb {R}}^p} \frac{1}{N} \sum _{i=1}^N \ell (\omega ,\textrm{x}_i,\textrm{y}_i) + {\mathcal {R}} \cdot ||\omega ||_{1}, \end{aligned}$$
(29)

where \(\ell \) is defined in Eq. (28) and \({\mathcal {R}}\) is a hyperparameter to determine the magnitude of the regularisation.
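
In code, the objective of Eq. (29) with the fidelity term of Eq. (28) can be sketched as follows. The sketch assumes positive targets and model outputs, as is the case for \(Q^{2}\) and x; the numerical guard on the logarithm is our addition:

```python
import torch

def loss_fn(model, x_batch, y_batch, reg):
    """Eq. (29): MSLE fidelity term (Eq. (28)) plus a weighted l1 penalty."""
    pred = model(x_batch).clamp_min(1e-12)         # guard the logarithm
    fidelity = torch.mean((torch.log(y_batch) - torch.log(pred)) ** 2)
    l1 = sum(p.abs().sum() for p in model.parameters())
    return fidelity + reg * l1
```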

Expression (29) can be minimised using stochastic gradient methods on batches of the data sample [45]. The training is accelerated using classical momentum methods [46]. In particular, an initial set of parameters \(\omega ^0\) is selected randomly, together with a sequence of step sizes, or learning rates, \({{\mathcal {L}}}^k\) that diminish to zero, a randomly selected batch of data with indices \(I \subset \{ 1,...,N \}\), and a momentum parameter \(\mu \). The updates are defined by:

$$\begin{aligned}{} & {} v^{k+1} = \mu v^k - {{\mathcal {L}}}^k \nabla _\omega L_I(\omega ^k), \end{aligned}$$
(30)
$$\begin{aligned}{} & {} \omega ^{k+1} = \omega ^k + v^{k+1}. \end{aligned}$$
(31)

The sequence \(\omega ^k\) then converges to a set of parameters defining the optimal neural network with minimal generalisation error, in the sense described here.
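
Written out, one update step of Eqs. (30) and (31) takes only a few lines; in practice a library optimiser such as torch.optim.SGD with a momentum argument implements a closely related rule. The momentum value below is an illustrative choice:

```python
import numpy as np

def momentum_step(omega, v, grad, lr, mu=0.9):
    """One classical-momentum update; mu = 0.9 is an assumed, typical value."""
    v_new = mu * v - lr * grad     # Eq. (30)
    omega_new = omega + v_new      # Eq. (31)
    return omega_new, v_new
```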

3.3 The model

We construct a model to rigorously weight classically derived reconstructions of x and \(Q^2\) with corrections based on measurements of the final state lepton and hadronic system. The final reconstructions of these observables with the neural network approach are labelled below as \(Q^2_{NN}\) and \(x_{NN}\), respectively.

The constructed model is an ensemble of networks from the previously defined function class \(\varvec{\Phi }^D_{m,1}(\alpha )\), where \(\alpha \) is the rectified linear unit (ReLU). The values of D and m vary with each network in the ensemble.

The NC DIS events studied in this analysis are by definition events containing the scattered electron in the final state; therefore we aim to reconstruct \(Q^2_{NN}\) primarily from the related observables, i.e. using the properties of the electron directly measured in the experiment. In particular, we use as inputs three sets of variables: the classically reconstructed kinematic observables \(\left( Q^2_{EL},Q^2_{DA},Q^2_{JB}\right) \), measurements of the scattered lepton \(\left( E_{l^{\prime }},\theta _{l^{\prime }}\right) \), and measurements of the final-state hadronic system \(\left( \delta _{{\mathcal {H}}},P_{T,{\mathcal {H}}}\right) \). We reconstruct \(Q^2\) in the form:

$$\begin{aligned} Q^2_{NN}&= A_{Q^2}\left( Q^2_{EL},Q^2_{DA},Q^2_{JB}\right) + L_{Q^2}\left( A_{Q^2},E_{l^{\prime }},\theta _{l^{\prime }}\right) \nonumber \\&\quad + H_{Q^2}\left( A_{Q^2},\delta _{{\mathcal {H}}},P_{T,{\mathcal {H}}}\right) , \end{aligned}$$
(32)

in which \(A_{Q^2}\) could be understood as a rigorous average of classically derived reconstructions, \(L_{Q^2}\) is a correction term computed from the scattered lepton, and \(H_{Q^2}\) is another correction term computed from the final-state hadronic system. In our analysis \(A_{Q^2}\), \(L_{Q^2}\), \(H_{Q^2}\) are simultaneously trained networks in \(\varvec{\Phi }^5_{3,1}(\alpha )\), with each hidden layer of the networks containing 2000 nodes.

The x observable for the NC DIS events is actually calculated from the electron-related observables as well. Therefore, we reconstruct \(x_{NN}\), with \(Q^2_{NN}\) also as an input, for a total of eight inputs (\(x_{EL},x_{DA},x_{JB},E_{l^{\prime }},\theta _{l^{\prime }},\delta _{{\mathcal {H}}}, P_{T,{\mathcal {H}}},Q^2_{NN}\)), in the form:

$$\begin{aligned} \begin{aligned} x_{NN}&= A_{x}\left( x_{EL},x_{DA},x_{JB}\right) + L_x\left( A_{x},Q^2_{NN},E_{l^{\prime }},\theta _{l^{\prime }}\right) \\&\quad + H_{x}\left( A_{x},Q^2_{NN},\delta _{{\mathcal {H}}},P_{T,{\mathcal {H}}}\right) , \end{aligned} \end{aligned}$$
(33)

where \(A_{x}\), \(L_x\), and \(H_x\) are defined similarly to \(A_{Q^2}\), \(L_{Q^2}\), and \(H_{Q^2}\), but \(A_{x}\) is a network in \(\varvec{\Phi }^{20}_{3,1}(\alpha )\) with each hidden layer containing 1000 nodes, and \(L_{x}\), \(H_{x}\) are networks in \(\varvec{\Phi }^{10}_{4,1}(\alpha )\), with each hidden layer containing 500 nodes.
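
Combining the pieces, the ensemble of Eq. (32) can be sketched as three simultaneously trained networks whose outputs are summed. ResidualOutputNet refers to the sketch of \(\varvec{\Phi }^D_{m,n}(\alpha )\) given in Sect. 3.1; the \(x_{NN}\) ensemble of Eq. (33) is built analogously with the depths and widths quoted above:

```python
import torch
import torch.nn as nn

class Q2Ensemble(nn.Module):
    """Sketch of Eq. (32): Q2_NN = A(Q2_EL, Q2_DA, Q2_JB)
    + L(A, E_e, theta_e) + H(A, delta_h, pt_h)."""

    def __init__(self):
        super().__init__()
        self.A = ResidualOutputNet(m=3, n=1, depth=5, width=2000)
        self.L = ResidualOutputNet(m=3, n=1, depth=5, width=2000)
        self.H = ResidualOutputNet(m=3, n=1, depth=5, width=2000)

    def forward(self, q2_cl, lepton, hfs):
        # q2_cl = (Q2_EL, Q2_DA, Q2_JB); lepton = (E_e, theta_e); hfs = (delta_h, pt_h)
        a = self.A(q2_cl)
        corr_l = self.L(torch.cat([a, lepton], dim=1))
        corr_h = self.H(torch.cat([a, hfs], dim=1))
        return a + corr_l + corr_h
```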

The number of hidden layers and the number of nodes in each hidden layer were each progressively increased until sufficiently desirable results were achieved. Smaller networks provided good results on average, but larger networks were needed to find the best results in small, specific regions of the kinematic space. Increasing the number of hidden layers and nodes per layer beyond the chosen values was found not to alter the performance of the kinematic reconstruction significantly, in line with the convergence results for deep neural networks. The convergence theorems for deep neural networks with the ReLU activation function as the number of layers increases were recently established in [47, 48].

In the ensemble neural network model, we used the ReLU

$$\begin{aligned} \alpha (\textrm{x}) := \max (\textrm{x},0), \ \ \textrm{x}\in {\mathbb {R}} \end{aligned}$$
(34)

as the nonlinear activation function. It has been shown [49] that with a gradient descent algorithm, using the ReLU function as the activation function provides a smaller training time than the use of functions with saturating nonlinearities, such as a sigmoid or hyperbolic tangent function. The reduced training time enabled us to experiment with more sophisticated networks. In addition, the ReLU functions do not need input normalisation to prevent them from saturating, which is a desirable property for the present analysis. Moreover, it was shown in [47] that the selection of the ReLU activation function produces a neural network that is a piecewise linear function over a nonuniform partition of the domain determined by the parameters of the affine transformations \(A_i\). Such a structure ensures a good representability of the neural network and can overcome the problem of underfitting.

To accommodate for the large range of the \(Q^{2}\) and x variables in the analysis, we select the loss function defined in Eq. (28) for the training of the DNN models.

4 Experimental setup

The Monte Carlo simulated events used to train our deep neural networks were specifically generated using the conditions of \(e^{\pm }p\) scattering in the ZEUS detector at HERA. A detailed description of the ZEUS detector can be found elsewhere [10].

In our analysis, we have used two samples of Monte Carlo simulated \(e^{+}p\) DIS events that are provided by the ZEUS collaboration. These samples were generated with the inclusion of QED and higher order QCD radiative effects using the HERACLES 4.6.6 [15] package with the DJANGOH 1.6 [50] interface, and the ARIADNE 4.12 and LEPTO 6.5.1 packages for the simulation of the parton cascade. For both samples, the same set of kinematic cuts was applied during the generation, the same set of PDFs, CTEQ5D [51], was used, and the hadronisation was modelled with the same settings of the Pythia6 program. Therefore, the essential difference between the two samples is the way the higher order corrections are partially modelled with the corresponding algorithms (QCD cascades). Namely, the LEPTO MCEG utilises the parton shower approach [52], while ARIADNE implements a colour-dipole model [53]. Accordingly, we label the data sets produced by the LEPTO generator as “MEPS” data sets and those produced with ARIADNE as “CDM” data sets.

The generated particle-level events were passed through the ZEUS detector and trigger simulation programs based on Geant 3.21 [29], assuming the running conditions of the ZEUS experiment in the year 2007 with a proton beam energy of 920 GeV. The simulated detector response was processed and reconstructed using exactly the same software chain and the same procedures as used for the real data. The results of the processing were saved in ROOT [54] files in the form of ZEUS Common NTuples [37, 38], a format that can be easily used for physics analysis without any ZEUS-specific software.

5 Event selection

The selection of events for the neural network training is motivated by the selection procedure applied in previous ZEUS analyses [55,56,57,58,59,60]. Even though the presented analysis is performed on Monte Carlo simulated events, the selection cuts are chosen and applied as if the analysis were performed on real data, in order to stay as close as possible to the measurements. The general motivation for these cuts is the same as in many analyses performed by the ZEUS collaboration: to define unambiguously the phase space of the measurement, to ensure a low fraction of background events, and to ensure a reasonable description of the detector acceptance by the Monte Carlo simulations.

5.1 Phase space selection

The phase space for the training of the neural networks in this analysis is selected as \(100{\,\text {GeV}}^{2}<Q^{2}<20480{\,\text {GeV}}^{2}\), being close to the phase space of the physics analysis in Ref. [59].

In the phase-space region at low x and very low inelasticity y, the QED predictions from the Monte Carlo simulations are not reliable because of the limited higher orders included in the calculations [61]. To avoid these phase-space regions, the events are required to have \(y_{JB}\cdot (1-x_{DA})^2>0.004\) and \(y_{JB} > 0.04\) [61]. To ensure optimal electron identification and electron energy resolution, similar to the previous physics analyses, a kinematic cut \(0.2<y_{EL}< 0.7\) is used.

5.2 Event selection

The deep inelastic scattering events of interest are those characterised by the presence of a scattered electron in the final state and a significant deposit of energy in the calorimeter from the hadronic final state. The scattered electrons are registered in the detector as localised energy depositions primarily in the electromagnetic part of the calorimeter, with little energy flow into the hadronic part of the calorimeter. On the other hand, hadronic showers propagate in the detector much more extensively, both transversely and longitudinally.

In addition to the DIS process, there are also background processes which leave signatures in the detector similar to those described above. Therefore, the correct and efficient identification of the scattered lepton is crucial for the selection of the NC DIS events. For this analysis, the SINISTRA algorithm is used to identify lepton candidates [11]. Based on the information from the detector and the results of the SINISTRA algorithm, the following selection criteria are applied to select the events for further analysis (a code sketch of representative cuts follows the list):

  • Detector status: It is required that the detector was fully functional for all events.

  • Electron energy: At least one electron candidate with energy greater than \(10{\,\text {GeV}}\) [59] is identified in the event.

  • Electron identification probability: The SINISTRA [11] probability of a lepton candidate being the DIS lepton was required to be greater than 90%. If several lepton candidates satisfy this condition, the candidate with the highest probability is used. In addition to this, there must be no problems reported by the SINISTRA algorithm.

  • Electron isolation: To assist in removing events where the energy deposits from the hadronic system overlap with those of the scattered lepton, the energy not associated with the lepton within a cone around the lepton candidate is required to be less than 10% of the total energy deposited in that cone. The cone is defined with a radius of 0.7 units in the pseudorapidity–azimuth plane around the lepton momentum direction [59].

  • Electron track matching: The tracking system covers the region of polar angles restricted to \(0.3< \theta < 2.85\) rad. Electromagnetic clusters within that region that have no matching track are most likely photons. If the lepton candidate is within this region, the presence of a matched track is required. This track must have a distance of closest approach between the track extrapolation point at the front surface of the CAL and the cluster centre-of-gravity position of less than 10 cm. The track energy must be greater than 3 GeV [59].

  • Electron position: To minimise the impact of imperfect simulation of some detector regions, additional requirements on the position of the electromagnetic shower are imposed. Events in which the lepton is found in the following regions (x, y and z being the Cartesian coordinates in the ZEUS detector) are rejected [62]:

    • \(\sqrt{x^2 + y^2} < 18 \text { cm}\), regions close to the beam pipe,

    • \(z < -148 \text { cm}\) and \(y > 90 \text { cm}\) and \(-14< x < 12 \text { cm}\), a part of the RCAL where the depth was reduced due to the cooling pipe for the solenoid (chimney),

    • \(-104< z < -98.5 \text { cm}\) or \(164< z < 174 \text { cm}\), regions in-between calorimeter sections (super-crack).

  • Primary vertex position: It was required that the reconstructed primary vertex position is close to the central region of the detector, applying the selection \(-28.5<Z_{\textrm{vtx}}<26.7\text { cm}\) [59].

  • Energy-longitudinal momentum balance: To suppress photoproduction and beam-gas interaction background events and imperfect Monte Carlo simulations of those, restrictions are put on the energy-longitudinal momentum balance. This quantity is defined as:

    $$\begin{aligned} \delta= & {} \delta _l + \delta _{{\mathcal {H}}} = (E_{l^\prime }-P_{z,l^\prime }) + (E_{{\mathcal {H}}}-P_{z,{\mathcal {H}}})\nonumber \\= & {} \sum _{i} (E_i-P_{z,i}), \end{aligned}$$
    (35)

    where the final summation index runs over all energy deposits in the detector. In this analysis, we apply a condition of \(38< \delta < 65 {\,\text {GeV}}\) [59].

  • Missing transverse energy: To remove beam-related background and cosmic-ray events, a cut on the missing transverse momentum is imposed: \(P_{T,\mathrm miss}/\sqrt{E_{T}}<2.5{\,\text {GeV}}^{1/2}\), where \(E_{T}\) is the total transverse energy in the CAL and \(P_{T,\mathrm miss}\) is the missing transverse momentum, i.e. the transverse component of the vector sum of the hadronic final state and scattered electron momenta.
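
The sketch below collects representative cuts from the list above in a single predicate. The event-record field names (ev.*) are hypothetical placeholders for the corresponding ZEUS Common NTuple quantities:

```python
import numpy as np

def passes_selection(ev):
    """Representative kinematic and event-selection cuts from Sect. 5 (sketch)."""
    if ev.electron_energy <= 10.0:                      # GeV
        return False
    if ev.sinistra_probability <= 0.90:
        return False
    if not (-28.5 < ev.z_vertex < 26.7):                # cm
        return False
    if not (38.0 < ev.delta < 65.0):                    # E - p_z balance, GeV
        return False
    if ev.pt_miss / np.sqrt(ev.et_total) >= 2.5:        # GeV^(1/2)
        return False
    # Phase-space cuts from Sect. 5.1
    if not (0.2 < ev.y_el < 0.7):
        return False
    if ev.y_jb <= 0.04 or ev.y_jb * (1.0 - ev.x_da) ** 2 <= 0.004:
        return False
    return True
```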

6 Training the DNN models

We use the neural network models defined in Sect. 3.3 and consider them as an optimisation problem in terms of Eq. (29), i.e., we minimise the loss function across the selected training set and satisfy sparse regularity conditions. All neural networks making up the ensemble models for x and \(Q^2\) are trained simultaneously. The optimal regularity condition depends on the selection of the training set, the event batch size, the initial learning rate, and the regularisation parameter. We aim to select these in a way that minimises overfitting in particular regions of the kinematic space while still maximising the mean accuracy of the model.

We select a set of events from the “MEPS” data sets for training, as described in Sect. 5, and define the true values of the x and \(Q^{2}\) as described in Sect. 2.3. Figure 3 shows a distribution of selected events in the \((x_{\textrm{true}},Q^2_{\textrm{true}})\) plane and the boundaries of the chosen analysis bins.

Fig. 3
figure 3

Distribution of events from the training set in \((x_{\textrm{true}},Q^2_{\textrm{true}})\) plane and the boundaries of the analysis bins from Table 3

First, we train the network to reconstruct \(Q^2\) by optimising Eq. (29) with an initial learning rate of \({\mathcal {L}} = 1.0 \times 10^{-5}\) and a regularisation parameter of \({{\mathcal {R}} =1.0\times 10^{-5}}\). We select a batch size of 10,000.

Table 1 Resolution of \(\log {x}\) reconstruction after 200 epochs of training with different values of initial learning rate \({\mathcal {L}}\) and batch size \({\mathcal {B}}\)
Table 2 Resolution of \(\log {x}\) reconstruction after 200 epochs of training with different values of regularisation parameter \({\mathcal {R}}\)
Fig. 4
figure 4

Training history for the x reconstruction model using different training parameters but the same \(Q^2\) reconstruction model obtained with \({{\mathcal {L}}}=1.0\times 10^{-5}\), \({\mathcal R}=1.0\times 10^{-6}\) and \({{\mathcal {B}}}=10{,}000\). In each of the cases, the initial learning rate was set to \({{\mathcal {L}}}=1.0\times 10^{-5}\) and the regularisation parameter to \({{\mathcal {R}}}=1.0\times 10^{-6}\)

The reconstruction of x is more complex. We fix the regularisation parameter to \({\mathcal {R}} = 1.0\times 10^{-5}\) and select optimal values for the learning rate \({{\mathcal {L}}}\) and batch size \({{\mathcal {B}}}\) experimentally by varying the learning rate against the batch size. The appropriate selection of these parameters ensures fast convergence of the stochastic gradient method by balancing noise in the gradients with the stability of the algorithm. For each set of parameters, ten attempts are made and the best result in terms of the mean square error of the x reconstruction model over the training set is chosen. The results are listed in Table 1. The smaller learning rate assures a higher stability of the results than a larger learning rate; it prolongs the training process but avoids the poor convergence of the learning process observed for larger learning rates. The larger batch size does not offer any advantages in our analysis, as shown in Table 1. We therefore select an initial learning rate of \(10^{-5}\) with a minimal batch size.

To choose a regularisation parameter close to the optimal one, we vary its value with a constant batch size of 10,000 and an initial learning rate of \(10^{-5}\), and again observe the mean square error of the x reconstruction model over the training set. For each set of parameters, ten attempts are made and the best result is chosen. The results are presented in Table 2. Accordingly, we choose a regularisation parameter of \(10^{-6}\). With this regularisation, the neural network models for both x and \(Q^2\) are defined by weight parameters of which 50% are effectively zero, i.e. less than \(10^{-8}\).

Following the suggestions in Ref. [63], we start with a small batch size and increase it in the initial training epochs. We test this approach by comparing the mean square error of the x reconstruction model over the training set during the first 200 epochs for three different training regimes. The results are summarised in Fig. 4 and suggest using a gradually increasing batch size up to a maximum batch size of 1000.
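
A condensed sketch of this training procedure is given below, reusing the ResidualOutputNet and loss_fn sketches from Sect. 3. The learning rate, regularisation parameter, momentum value and the exact growth schedule of the batch size (up to the maximum of 1000 quoted above) are illustrative:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train(model, inputs, targets, epochs=200, lr=1.0e-5, reg=1.0e-6):
    """Illustrative loop: momentum SGD (Eqs. (30)-(31)) on the loss of Eq. (29),
    with a batch size that grows during the initial epochs."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    data = TensorDataset(inputs, targets)
    for epoch in range(epochs):
        batch_size = min(1000, 100 * (epoch + 1))   # grow towards 1000
        for xb, yb in DataLoader(data, batch_size=batch_size, shuffle=True):
            opt.zero_grad()
            loss = loss_fn(model, xb, yb, reg)      # sketch from Sect. 3.2
            loss.backward()
            opt.step()
    return model
```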

Table 3 Kinematic bins in x and \(Q^2\), see also Fig. 3
Fig. 5
figure 5

Distributions of \(\log {Q^2/1{\,\text {GeV}}^{2}} - \log {Q^2_{\textrm{true}}/1{\,\text {GeV}}^{2}}\) for various reconstruction methods in individual analysis bins. For better visibility, the data points for each reconstruction method are connected with straight lines

Fig. 6
figure 6

Distributions of \(\log {x}-\log {x_{\textrm{true}}}\) for various reconstruction methods in individual analysis bins. For better visibility, the data points for each reconstruction method are connected with straight lines

7 Results

We evaluate the performance of our approach for the reconstruction of DIS kinematics by applying it to detailed Monte Carlo simulations from the ZEUS experiment and by comparing our results to the results from the electron, double-angle, and Jacquet-Blondel reconstruction methods. For the comparison, we use various combinations of statistically independent data sets, one for the training and another for the evaluation. In our systematic studies, we have found no signs of overtraining and also no indication that the results depend on the selected Monte Carlo simulations. For the results presented in this section, we use the “MEPS” data set for the training and the “CDM” data set for the evaluation. The main quantities of the comparison are the resolutions of the reconstructed variables \(\log {Q^2/1{\,\text {GeV}}^{2}}\) and \(\log {x}\) as measured in selected \(x-Q^{2}\) regions (bins). The resolutions are defined as \(\sqrt{\sum ^{N}_{i} \left( \log {Q^2_{i}/1{\,\text {GeV}}^{2}} - \log {Q^2_{i, \mathrm true}/1{\,\text {GeV}}^{2}} \right) ^{2} /N}\) and \(\sqrt{ \sum ^{N}_{i} \left( \log {x_{i}}-\log {x_{i,\mathrm true}} \right) ^{2}/N}\), where N stands for the number of events in the corresponding bin. The boundaries of the bins are given in Table 3 and are chosen to be close to the bins used in ZEUS DIS analyses, e.g. in Ref. [57].
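
Per analysis bin, the resolution figure of merit can be computed as in the following sketch, where reco and true are arrays of reconstructed and true values for the events falling into the bin:

```python
import numpy as np

def resolution(reco, true):
    """Resolution as defined in Sect. 7: the RMS of log(reco) - log(true)
    over the events in a given (x, Q2) analysis bin."""
    d = np.log(reco) - np.log(true)
    return np.sqrt(np.mean(d ** 2))
```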

Table 4 Resolution of the reconstructed kinematic variables in bins of x and \(Q^2\). The resolution for x and \(Q^{2}\) is defined as the RMS of the distributions \(\log (x)-\log (x_{\textrm{true}})\) and \(\log (Q^{2})-\log (Q^{2}_{\textrm{true}})\) respectively

The distributions of the \(\log {Q^2/1{\,\text {GeV}}^{2}} - \log {Q^2_\textrm{true}/1{\,\text {GeV}}^{2}}\) and \(\log {x}-\log {x_{\textrm{true}}}\) quantities are shown in Figs. 5 and 6, respectively. The numerical values of the resolution are summarised for all bins and methods in Table 4. The DNN optimisation procedure minimises the generalisation error described in Eq. (25) plus the regularisation penalty, so the distributions in Figs. 5 and 6 do not necessarily peak at zero. In addition, Figs. 7 and 8 show the two-dimensional distributions of events in the \(\log {Q^2/1{\,\text {GeV}}^{2}}\) vs. \(\log {Q^2_{\textrm{true}}/1{\,\text {GeV}}^{2}}\) and \(\log {x}\) vs. \(\log {x_\textrm{true}}\) planes.

Fig. 7
figure 7

Distribution of events in \(L_\textrm{reco}=\log {Q^2/1{\,\text {GeV}}^{2}}\) versus \(L_{\textrm{true}}=\log {Q^{2}_\textrm{true}/1{\,\text {GeV}}^{2}}\) plane for various reconstruction methods in individual analysis bins

Fig. 8
figure 8

Distribution of events in \(L_{\textrm{reco}}=\log {x}\) versus \(L_{\textrm{true}}=\log {x_\textrm{true}}\) plane for various reconstruction methods in individual analysis bins

The comparison of the DNN-based approach with the classical methods demonstrates that the DNN-based approach is well suited for the reconstruction of DIS kinematics. Specifically, for most of the bins, our approach provides the best resolution as measured by the standard deviation of the logarithmic differences of true and reconstructed variables. The better performance of the DNN-based approach in most of the bins is a consequence of using additional available information about the final state. In this respect, the DNN-based approach is similar to averaging the values provided by the classical methods with some weights, or to alternative approaches for the same task, e.g. kinematic fitting in Ref. [64]. However, in addition to a better resolution, the reconstruction with the DNN has an important advantage over the classical methods or any simple combination of them: it allows the methods to be combined without the intrinsic biases of each method and enables a robust extension of the model with additional physics observables.

The resolution is improved by the DNN-based approach in two ways. For the first couple of bins, the main improvement comes from the more precise estimation of the reconstructed observables, as clearly seen in Figs. 5 and 6. For the bins with higher \(Q^2\) and x, the main improvement is due to the rejection of outliers, which can be seen in Figs. 7 and 8. The bins with higher \(Q^2\) and x also demonstrate another, very specific advantage of the DNN approach. Due to the low number of training events in the high \(Q^2\) and x region, it would not be possible to train a DNN model or combine the \(Q^{2}\) and x observables with other methods using information from this region only. However, the DNN training process benefits from the constraints provided by the larger number of events available elsewhere in the kinematic space and delivers models that perform well even in the bins with the highest \(Q^{2}\) and x.

8 Conclusions

We have presented the use of DNNs to reconstruct the kinematic observables \(Q^2\) and x in the study of neutral current DIS events at the ZEUS experiment at HERA. The DNN models are specially designed to be effective in their universal approximation capability, robust in the sense that increasing the depth of the networks necessarily reduces the empirical error, and computationally efficient with a structure that avoids “vanishing” gradients arising in the backpropagation algorithm.

Compared to the classical reconstruction methods, the DNN-based approach enables significant improvements in the resolution of \(Q^{2}\) and x. At the same time, the DNN approach makes it easy to match any true-level definition of \(Q^{2}\) and x preferred for a given physics analysis.

The large samples of simulated data required for the training of the DNN can be generated rapidly at modern data centres, and DNNs allow information to be extracted effectively from large data sets. This suggests that our new approach for the reconstruction of DIS kinematics can serve as a rigorous method to combine and outperform the classical reconstruction methods at ongoing and upcoming DIS experiments. We will extend the approach beyond inclusive DIS measurements and will next study the use of DNNs for the reconstruction of event kinematics in semi-inclusive and exclusive DIS measurements.