Diffusion model approach to simulating electron-proton scattering events

Generative AI is a fast-growing area of research offering various avenues for exploration in high-energy nuclear physics. In this work, we explore the use of generative models for simulating electron-proton collisions relevant to experiments like CEBAF and the future Electron-Ion Collider (EIC). These experiments play a critical role in advancing our understanding of nucleons and nuclei in terms of quark and gluon degrees of freedom. The use of generative models for simulating collider events faces several challenges such as the sparsity of the data, the presence of global or event-wide constraints, and steeply falling particle distributions. In this work, we focus on the implementation of diffusion models for the simulation of electron-proton scattering events at EIC energies. Our results demonstrate that diffusion models can accurately reproduce relevant observables such as momentum distributions and correlations of particles, momentum sum rules, and the leading electron kinematics, all of which are of particular interest in electron-proton collisions. Although the sampling process is relatively slow compared to other machine learning architectures, we find diffusion models can generate high-quality samples. We foresee various applications of our work including inference for nuclear structure, interpretable generative machine learning, and searches of physics beyond the Standard Model.


I. INTRODUCTION
High-energy particle and nuclear collider experiments along with theoretical progress in the past decades have allowed for an increasingly sophisticated understanding of the quark and gluon dynamics at subatomic scales.Electron-proton scattering experiments including HERA at DESY, CEBAF at JLab, and the future Electron-Ion Collider (EIC) at BNL [1] and LHeC [2]/FCC-eh at CERN [3] play a critical role in advancing our understanding of the structure of hadrons, probing cold nuclear matter effects, and searching for physics beyond the Standard Model.In particular, the measurement of the scattered leading electron provides a clean electromagnetic probe of the inner structure of hadrons and nuclei.The experimental data has been analyzed within the framework of QCD factorization to extract the threedimensional structure of hadrons in terms of quantum correlation functions, such as parton distribution functions (PDFs).In addition, collider studies related to the emergence of hadrons and the associated neutralization of color have remained at the forefront of collider experiments.See Fig. 1 for an illustration of the average distribution of particles in high-energy electron-proton collisions, which will be discussed in more detail below.
The rapid development of AI and machine learning in recent years has led to a wide range of applications in particle and nuclear physics [4,5].Examples include the simulation of lattice gauge configurations [6][7][8][9][10], the classification of jets [11][12][13][14][15][16][17][18][19][20], the simulation of collider FIG. 1. Momentum distribution of the particles in the ηϕ plane created in electron-proton scattering events in the laboratory frame at √ s = 105 GeV.The events that have been generated with Pythia8 [50] are rotated such that the scattered electron is in the same azimuthal direction for all events as indicated in the figure .events [21][22][23][24][25], the unfolding of detector effects [26][27][28][29][30], data analyses with machine learning-improved Bayesian posterior sampling [31][32][33][34][35][36], regression tasks [37][38][39][40][41][42], and searches of physics beyond the Standard Model [43][44][45][46][47][48].See Ref. [49] for a broad overview.Several of these applications rely on generative models that can learn the structure or latent space of a data set and generate new samples.Various types of generative models have been developed including Variational Autoencoders [51], Autoregressive Models [52], Generative Adversarial Networks (GANs) [53], flow based models [54] and diffusion models [55,56].The different types of generative models each have their own advantages and disadvantages.The choice of generative models for a particular application depends for example on the computational cost of training and sampling from the model, the quality of the generated samples, the scalability, and the stability of the training procedure, etc.In this work, we implement a diffusion model, which can generate samples from a data distribution by learning to reverse a stepwise noising or diffusion process, see Fig. 2. While sampling from diffusion models is generally relatively slow compared to other architectures, they have been shown to generate high-quality samples and allow for a scalable and stable training procedure.For example, in Ref. [57] it was found that diffusion models outperform GANs in image synthesis.The ability of diffusion models to generate high-quality samples is essential for the applications we foresee in the context of high-energy collider physics.In addition, Ref. [58] reported that diffusion models may cover a larger portion of the target distribution as GANs.
The development of generative models for simulating collider events or jets, collimated sprays of particles, was first initiated with GANs in Refs.[21,59,60].Since then, different architectures such as normalizing flows [61] have been explored as well as different data representations such as point clouds instead of images have been considered [62][63][64][65][66].In addition, efforts have been made to increase the interpretability of generative models [67,68].One of the challenges for generative models is the sparsity of collider events, which is distinct from typical tasks encountered in computer vision.In addition, event-wide constraints such as momentum conservation and steeplyfalling momentum distributions of particles add to the complexity of the problem.Recently, diffusion and scorebased generative models [69] have been developed for simulations of calorimeter showers [70][71][72][73][74] and jets using point clouds [62][63][64][65]75].In this work, we will simulate full collider events with diffusion models focusing in particular on the unique characteristics of electron-proton (and similarly electron-nucleus) collisions.For example, the kinematics of the leading electron play a critical role since it is used to determine both the photon virtuality Q 2 and the scaling variable Bjorken x of the event.In addition, particle spectra span up to 6 orders in magnitude and, depending on the type of particle, they peak in different regions of phase space.We address this challenge by using a suitable preprocessing step before training the model on the data.In addition, we explore different loss functions and optimization procedures of diffusion models.Our findings demonstrate that diffusion models are able to generate high-quality samples indicating their potential for various future applications in high-energy nuclear and particle physics.
We foresee various applications of generative models for collider events.For example, generative models are closely related to the development of parton showers and Monte Carlo event generators.While the perturbative part is increasingly well understood from first principles in QCD [76][77][78][79], other components of Monte Carlo event generators can be simulated with generative models, see for examples the approach developed in Ref. [67].Nuclear physics applications include the modification of the shower due to hot or cold nuclear matter.Moreover, anomaly detection techniques based on generative models have been developed for searches of physics beyond the Standard Model.The identification of anomalous signals requires an accurate modeling of the background distribution.Generative modeling also finds applications in hadron structure studies, which are a prominent subject of research within the Jefferson Lab 12 GeV program and the future EIC.The increasing sophistication required to extract parton-level information including transverse momentum distributions (TMDs) and generalized parton distributions (GPDs), necessitates the analysis of multi-dimensional phase space distributions from semiinclusive and exclusive observables.In this context, generative modeling can be used as a generator of partonic structures for QCD global analysis, a phase space generator for particle reactions, or emulators for detector simulation [28,29].It is also opening new avenues to integrate theory and experiment within a unified eventlevel analysis.
The remainder of this paper is organized as follows.In section II, we provide a review of diffusion models.In section III, we discuss the generation of electron-proton scattering events with Pythia8 [50], which is subsequently used as the training data, as well as the data representation.In section IV, we provide details of our implementation and the training procedure.In section V, we present numerical results comparing the diffusion model to Pythia8 using various metrics relevant to simulating collider events.Lastly, we conclude and present an outlook in section VI.

II. DIFFUSION MODELS
Diffusion models are a class of generative machine learning models that can learn the underlying distribution of a given data set.The training procedure of diffusion models consists of two components -a noising, and a denoising process, see Fig. 2. Starting with pixelated images of the training data set, noise is incrementally added to the image until it is ultimately transformed into pure noise.Subsequently, the inverse denoising process can be learned by a suitably chosen machine learning architecture.Due to the stepwise nature of the diffusion process, the results of the entire chain can be included in the loss function allowing for a scalable training process.After the training procedure is finished, we can generate new samples of the target data set by passing noise through the trained machine learning architecture.
Before starting the diffusion process, the pixelated images are treated as data vectors, which we label as x 0 .The corresponding probability distribution of the data is x 0 ∼ q 0 (x 0 ).Analogously, we denote the data vector at time step t of the diffusion process as x t , with t ∈ [0, T ], and the corresponding probability distribution is x t ∼ q t (x t ).The stepwise forward diffusion or noising FIG. 2. Sequence of images illustrating the noising/denoising process of a diffusion model trained on Pythia8 simulations of electron-proton scattering events.Pixels colored in black are empty, and the three RGB color channels correspond to charged pions π + , electrons e − , and kaons K + , respectively.
process is given by q(x 1 , . . ., x T |x 0 ) = T t=1 q(x t |x t−1 ) . ( The probability distribution for a given timestep of the noising process x t−1 → x t , is given by Here N is a multi-variate Gaussian distribution with a diagonal covariance matrix.The values of β t are chosen according to a predefined variance schedule {β t ∈ (0, 1)} T t=1 .We are thus adding a certain amount of Gaussian noise at each time step leading to a sequence of increasingly noisy samples x 0 , . . ., x T , where the variance schedule and the time steps are chosen such that x T is eventually an isotropic Gaussian distribution q(x T ) = N (x T ; 0, I).Note that Eqs.(1) and (2) describe a Markovian process since the probability distribution at time step t only depends on the current sample at time t − 1.
Next, we consider the reverse diffusion or denoising process.We need to train a suitable machine-learning model to approximate the probability distribution of the inverse process q(x t−1 |x t ).The diffusion process is stochastic, which does not allow for the use of backpropagation techniques to obtain the gradient.Instead, the reparametrization trick is used to make the problem tractable and learn the parameters of a Gaussian distribution for which backpropagation can be used [51].The denoising process x t → x t−1 proceeds again in T time steps where the following Gaussians are sampled from The mean µ θ and covariance Σ θ are learned by the model, where θ denotes the trainable parameters.Typically a Ushaped convolutional neural network (U-Net) is used as a model to learn the mean and variance at each time step t [80].
The model parameters are obtained by minimizing a loss function during the training procedure.Different options have been explored in the literature.We start by considering the Variational Lower Bound (VLB), which can be written as follows Here L t−1 is used for all terms in Eq. ( 4) except for t = 0, T .Except for L 0 , closed-form expressions can be found for all KL divergences since each term involves two Gaussian distributions.In Ref. [56], it was found empirically that a simplified objective function can improve the sample quality.Instead of using a neural network to predict µ θ and Σ θ , the network is used to predict x 0 and the noise ϵ θ at each time step, which can be related to µ θ , see Ref. [56] for more details.The simplified mean squared error objective function, is given by where ϵ represents the noise of the forward diffusion process.Since this simplified objective function is only sensitive to µ θ but not Σ θ , a hybrid loss function was introduced in Ref. [58] L hybrid = L simple + λL VLB .
Here, λ is a hyperparameter that determines the relative importance of the two objective functions.Typically, λ is chosen to be relatively small such that µ θ is primarily determined by L simple and Σ θ is related to L VLB .In section IV, we will provide more details of the setup used in this work.
After the training procedure, we can obtain new samples q(x 0 ) from the target distribution by sampling x T ∼ N (0, I) and running the reverse process of the diffusion model.The Markovian noising and denoising processes described in Eqs. ( 2), (3) are analogous to the diffusion process in non-equilibrium thermodynamics and in the continuous-time limit, a stochastic differential equation is obtained [81].See also Ref. [82] for a more detailed introduction to diffusion models.

III. TRAINING DATA SET AND DATA REPRESENTATION
We generate the training data set, by simulating neutral-current electron-proton scattering events with Pythia8 [50] using √ s = 105 GeV as a representative center-of-mass (CM) energy for the future EIC [1].Since the photoproduction region (low photon virtuality Q 2 ), and deep inelastic scattering (DIS) region (high Q 2 ) are sensitive to different physics, we choose to impose a lower cut of Q 2 > 25 GeV 2 to exclude the photoproduction events.Fig. 1 shows a 2D histogram of the momentum distribution of particles in electron-proton scattering events in the laboratory frame.We highlight in particular the kinematic region of the scattered electron.Note that the electron pseudorapidity does not extend to very low values due to the Q 2 cut.In addition, we indicate the recoiling hadronic system or jet, which is produced in the opposite azimuthal direction as the scattered electron.For each generated particle i in the event, we record the transverse momentum p T i relative to the beam axis, the pseudorapidity η i = − ln tan θ/2 with the polar angle θ i with respect to the direction of colliding electron, the azimuthal angle ϕ i and the particle identification (PID i ).We impose a cut on the pseudorapidity of |η i | < 10, which captures the vast majority of particles produced by Pythia8, see Fig. 1.We do not apply a lower cut on the transverse momentum p T i of the particles.Instead of directly incorporating the transverse momentum of each particle as a feature of the training data set, we choose to work with a rescaled variable.A natural choice for the rescaled momentum variable is Here y i is the rapidity and M 2 T i = p 2 T i + m 2 i is the transverse mass, where m i is the hadron mass.This variable is of great interest for simulating full collider events since it satisfies the following event-wide momentum sum rule where we sum over all particles in a given event.This provides an important global constraint for simulating full collider events.However, in practice, this requires having access to all particle species and fully hermetic detectors.
In general, the rapidity coverage of detectors is limited and, in this work, we also limit the diffusion model setup to simulating only three particle species.As a result, 0.0001 0.01 1 100 p T (GeV) Eq. ( 11) is not exactly satisfied.Instead, Eq. ( 11) provides an upper bound on the sum over all zi values in each event.In the limit of massless particles, zi reduces to where η is the pseudorapidity, as introduced above.This variable is frequently used in the perturbative QCD literature, see for example Refs.[79,83].Therefore, we choose z i as the rescaled momentum variable for this work.Eq. ( 11) also provides an upper bound for the sum over all z i in each event.In summary, we record the variables (z i , η i , ϕ i , PID i ) for each particle in the event.
When training the diffusion model, as described below, we limit ourselves to only three "color" channels for which we choose: pions π + , kaons K + , and electrons e − from the full Pythia8 generated event.In particular, the (leading) scattered electron in the event plays an important role in electron-proton scattering and we will study its kinematic distributions in detail below.Since we are primarily interested in a proof-of-concept study, we limit ourselves to modeling only a few representative particle species.In particular, the leading electron kinematics differ significantly from the rest of the hadronic system.
A convenient way to digitize collider events is to represent them as images, which is well suited for machine learning applications.We partition the cylindrical detector around the scattering vertex into a grid of uniformly sized rectangular pixels located at regular intervals in both rapidity and azimuth.We choose the pixel intensity to be the rescaled particle momentum z i , and the rapidity η i and azimuthal angle ϕ i index the location of each pixel on the cylinder.The particle type (PID i ) is stored as indices of the different image color channels similar to RGB color channels.Whenever multiple particles of the same type are in the same pixel, their z i values are added.
In an actual experiment, the natural choice for the pixel intensity is p T i since z i is a derived quantity given in terms of the measured value of p T i and η i .Therefore, combining multiple particles in a given pixel will occur at the level of p T i instead of z i .However, as we will discuss in section V below, the reconstruction of physical observables in inclusive DIS scattering experiments is strongly affected by distortions induced by the η i pixelation.In part, this is due to the large pseudorapidity interval chosen for this work.In principle, this can be mitigated by increasing the number of pixels for the rapidity.Due to limited computing resources, we were not able to increase the number of pixels further in this work.In future work, this can be addressed by increasing the number of pixels or by changing the data representation.Here, we opt for using the z i as the pixel intensities, which reduces pixelation effects.
We quantify the discretization effect by considering as an example the inclusive momentum distribution of electrons e − in Pythia8-generated electron-proton scattering events, which is shown in Fig. 3.We observe a largep T peak due to the leading scattered electron and a continuous spectrum at intermediate to small-p T values due to electrons generated during the shower.We compare the actual distribution with its discretized counterparts using three different choices for the number of pixels for the image: 16 × 16, 32 × 32, and 64 × 64.We observe that the actual distribution is increasingly well reproduced as the number of pixels is increased.Throughout this work, we use 64 × 64 pixels as our default choice.With larger computing resources this can be increased until the experimental resolution of the detector is reached.
As mentioned in the introduction, different than typical tasks in computer vision, images of collider events are very sparse, especially at the relatively low energies of the CEBAF experiment at JLab and the future EIC.For a CM energy of √ s = 105 GeV, we find that the average level of sparsity or the percentage of empty pixels is 99.95±0.02%(including all particle species) for 64×64 images, which can be challenging for generative models.We address this problem by choosing a suitable data representation as discussed in the following.Besides the sparsity of the data, a significant challenge is the steeply falling distributions of particles.The inclusive momentum z-distributions peak close to the endpoints z → 0 for π + , K + since soft hadrons have a large production cross section in QCD.Instead, for electrons e − the distribution peaks in the region z → 1, see Fig. 3.The large-z peak of the electron/positron momentum distribution is a unique feature of electron-proton scattering events.Instead, in proton-proton collisions, all distributions peak at small-z values.This feature appears due to the scattered leading electron, which plays a unique role in electron-proton collisions since it is used to determine the virtuality of the exchanged photon Q 2 and Bjorken x.Therefore, the accurate modeling of its kinematics plays a critical role.In order to take into account the logarithmic behavior of the data near both endpoints z → 0, 1 and to improve the training procedure, we rescale z as follows Here E(z) is an exponential function, L(z) is a logarithmic function, and S(z) is a sigmoid.The three functions are defined as where we have introduced additional parameters that will be discussed in the following.First, we require that the parameters are chosen such that Eq. ( 13) is a bijective function allowing us to eventually recover to the original momentum distribution.Second, we choose the parameters such that the rescaling in Eq. ( 13) matches the peak structures of the z distributions near both endpoints.Note that we apply the same z rescaling to all three channels.We choose the rescaling to be linear in the intermediate z region while near the upper (lower) endpoint, the function approximates an exponential (logarithmic) function.This can be achieved by choosing two values z 1 < z 2 ∈]0, 1[ with z 2 being the value where the exponential E(z) smoothly becomes a linear function, i.e.
Similarly, we require the logarithmic function L(z) to become linear at z 1 .These conditions along with the need to construct a bijective function fixes or constrains several of the parameters in Eq. ( 13).We then choose the sigmoid S(z) to smoothly interpolate between the exponential and logarithmic functions.The remaining parameters are chosen such that the rescaled z distribution is sufficiently smooth for the training of the diffusion model.We note that without the preprocessing of the training data described, the results of the diffusion model can be off by several orders of magnitude near the kinematic endpoints.
The diffusion model takes as input values of the momentum fraction in the range [−1, 1] (floating point numbers).Before pixelation and the rescaling in Eq. ( 13), the range of the particle momentum fractions is in the range of z ∼ [10 −6 , 1].We choose a suitable range of values z ′ ∈ [z ′ min , z ′ max ] with −1 < z ′ min < z ′ max < 1 to which we map the original z values.This includes the rescaling in Eq. ( 13) as well as an additional linear transformation to match the targeted range.In practice, we find that z ′ min = −0.76 and z ′ max = 0.86 work well for the purposes of this work.These values are chosen to allow for an upper and lower gap from the endpoints at −1, 1.The lower gap allows us to train a diffusion model that can generate empty pixels.This is achieved by mapping empty pixel values in the training data set, i.e. initially at z = 0, to the lower end of the allowed interval z ′ = −1.Since any finite z value is mapped to z ′ > z ′ min there is a sufficiently large gap to the z ′ values associated with empty pixels.When generating new images by passing Gaussian noise through the denoising process as described above, the diffusion model does not need to generate pixels with exactly z = 0 but, instead, it is sufficient to generate a narrow peak around z ′ ∼ −1.We can then apply a lower cut at z ′ min and consider pixels with smaller z ′ values as empty.This allows us to generate sparse images of collider events.The upper gap associated with z ′ max is introduced to avoid distortions of the distribution generated by the diffusion model near the upper endpoint.Any values produced by the diffusion model that are outside of the z ′ ∈ [−1, 1] range are clipped, which would lead to artifacts near the endpoint without including the upper gap.Note that we do not enforce a hard upper cutoff at z ′ max .This requires the model to learn momentum conservation, see Eq. ( 11) above, which we quantify numerically below.
We note that one can likely choose E(z) = 0 and/or S(z) = 0 for simulating proton-proton or heavy-ion collisions since in this case all particle spectra peak in the small-z region.We leave the exploration of this for future work.In addition, we note that an alternative approach to simulating sparse collider data is the use of point clouds as employed in Refs.[62][63][64][65][66] instead of the image-based data representation that we use here.

IV. IMPLEMENTATION AND TRAINING
In this section, we are going to present more details about the implementation of the diffusion model and the training procedure.As a starting point for our work, we use the implementation of the diffusion model presented in Ref. [58].In the following, we discuss aspects of the training data, the evaluation of the loss function and the parametrization of the inverse diffusion process.
• The size of the training data set is 10 6 images of DIS events as described in section III above.We choose a batch size of 8.
• For the noising process q(x t |x t−1 ), we use the cosine variance schedule introduced in Ref. [58] with 500 diffusion steps.We find that the number of diffusion steps is sufficient for the 64 × 64 images used in this work.The cosine variance schedule adds noise relatively slowly and is well-suited for the relatively low-resolution images considered here.For the denoising process p θ (x t−1 |x t ), we use a diagonal covariance matrix Σ θ = σ 2 t I, where σ 2 t are timedependent trainable parameters.
• We use the hybrid loss function L hybrid given in Eq. ( 9) with λ = 0.001 following Ref.[58].Since the gradient of L VLB in Eq. ( 4) can be very noisy, we use importance sampling instead of uniform sampling of the this part of the objective function as proposed in Ref. [58].We also explored the use of the simplified loss function proposed in Ref. [56] and λ = 0, which corresponds to the L VLB , which generally underperformed compared to the hybrid loss function for the purposes of this work.
• To parametrize the denoising process p θ (x t−1 , x t ), we use a 3-layer U-Net [80] with circular or periodic convolutions.Note that this is only relevant for the azimuthal coordinate ϕ.We choose a kernel size of 3 with stride 1 and padding 1. Multi-head attention layers [84] and down/up sampling blocks are included to obtain a U-shaped neural network.It takes O(hours) to generate 10 6 samples using a single NVIDIA A100 GPU.Possible speedups can be achieved using for example the methods developed in Refs.[71,73,85].
• We use the AdamW optimizer [86] with a learning rate of 10 −4 .The logarithm of the loss is shown in Fig. 4 as a function of the number of samples that the diffusion model is trained on.We observe a steep decrease of the loss at the beginning.Even though the curve flattens out later during the training process, we still observe a significant improvement of the sample quality.Despite the importance sampling mentioned above, the loss turns out to be relatively noisy.

V. NUMERICAL RESULTS AND BENCHMARKS
In this section, we will assess the quality of the trained diffusion model in simulating fully exclusive events (three particle species) in electron-proton collisions.We stress that our present work is limited in exploring the full extent of uncertainty quantification stemming from model We show the diffusion model results (red) including statistical uncertainties and compare them to Pythia8 with (black) and without (purple) pixelation effects.Under each panel, we include cross sections ratios between the diffusion model and Pythia8 with pixelation.
uncertainties, limited training, and other factors.The results presented here should be viewed as an exploratory study, and we will focus on describing the qualitative overall agreement between the reconstructed synthetic phase space distributions and the training data.Dedicated studies of aleatoric and epistemic uncertainties are beyond the scope of our current work and will be addressed in the future.We start with the inclusive momentum, rapidity, and azimuthal angle distributions for electrons e − , pions π + , kaons K + .The comparison between the diffusion model results and Pythia8 with and without pixelation is shown in Fig. 5.The systematic effects due to the pixela-tion are relatively small for the one-dimensional projections of the phase space.However, as we will find later on, they can become significant for other observables considered in this work.Analogous to Fig. 3, the electron z-distribution peaks at z → 1, whereas the pion and kaon distributions peak at small-z values, see the first column of Fig. 5.We find that our mapping in Eq. ( 13) allows us to train the diffusion model such that the z-spectra match relatively well.This is particularly noteworthy given the fact that the distributions of the three particle species differ substantially.However, achieving an agreement within 1σ of the estimated statistical uncertainties was not attainable.The model appears to have difficul- ties in reproducing the distributions near the edges of the phase while performing better in regions where the distributions are relatively flat.That being said, without the use of the mapping in Eq. ( 13), we find that it is impossible to train the diffusion model to agree with the z spectra, with deviations as large as several orders of magnitude in several regions of phase space.We conclude that a suitable mapping of the z-values is essential for training the diffusion model for electron-proton collisions.The rapidity distribution for electrons shows qualitatively different features compared to pions and kaons, see the middle column of Fig. 5.The difference is again due to the unique role of the leading electron in electronproton scattering events.The steep drop of the rapidity distributions for electrons near η ∼ −3 is due to the imposed cut on the photon virtuality Q 2 , see also Fig. 1.Our model achieves a reasonable description of all three  11), (12).
rapidity distributions with discrepancies that again tend to grow near the edges of phase space.We note that if an additional transverse momentum cut for the electron is included, the rapidity distributions for pions and electrons would be more similar.Lastly, the right column of Fig. 5 shows the azimuthal distributions where the events have been rotated such that ϕ = 0 corresponds to the direction opposite to the leading electron, which itself is not included in these histograms.Overall, we observe that the three bell-shaped curves of the azimuthal angular correlations are approximately reproduced by the diffusion model.We observe small differences in the kaon distributions in the tails of the distribution.This is likely due to the relatively low multiplicity of kaons.Although we did not achieve full agreement within the statistical uncertainties, being able to approximately reproduce the distributions with the help of the remapping of the zvalues encourages further explorations in this direction.Extending this work to include the estimation of model uncertainties is necessary to assess whether the observed disagreement stems from biases due to finite statistics or from a potential lack of model expressivity in our implementation.
Next, we consider particle multiplicity distributions ⟨N i ⟩.The comparison between Pythia8 with and without pixelation and the diffusion model results are shown in Fig. 6.The electron multiplicity peaks at ⟨N e ⟩ = 1, which corresponds to the scattered leading electron, and falls off steeply as the multiplicity increases.The case of ⟨N e ⟩ = 0 corresponds to an unphysical distortion induced by the pixelation algorithm since we are only considering neutral current DIS Pythia samples.The pion multiplicity distribution peaks at intermediate values ⟨N π + ⟩ ∼ 4, and exhibits a long tail extending up to ∼ 20 pions per event.The kaon multiplicity distribution also declines rapidly toward larger values, with many events having no kaons.
In contrast to the situation encountered for the momentum distributions, the multiplicity distributions are rather well reproduced and within 1σ of the statistical uncertainties across most of the available phase space for the three particles.Only in the electron case, we find that the diffusion model generates non-vanishing ⟨N e ⟩ = 5 and it systematically deviates in the unphysical case of ⟨N e ⟩ = 0.However, these events are very rare, occurring at a rate of 10 −4 relative to the peak value.Next, we consider the sum over the rescaled momentum fractions, see Eq. (11) above.This provides an important test of the global characteristics of the events generated by the diffusion model.The diffusion model result compared to Pythia8 is shown in Fig. 7.The distribution peaks near i z i ∼ 1 and falls off steeply toward the endpoints, where the upper limit results from momentum conservation, see Eq. ( 11) above.It is worth noting that while we simulate pions π + and kaons K + , which correspond to some of the most frequently produced particles in the events, the distributions here are significantly shifted to the left compared to the upper boundary due to the omission of other particle species (e.g.π − , π 0 ) in our current implementation.Our diffusion model is able to approximately reproduce this distribution.While the peak around Σ ∼ 1 is well reproduced, the deviations grow toward the edges of phase space.We find that the model does not generate unphysical events with Σ > 2. We conclude that the model is able to approximately learn momentum conservation, i.e. without it being imposed directly as an additional constraint.
Next, we consider a set of multi-hadron correlations, which serve as an important benchmark for evaluating the ability of the diffusion model to capture correlated features beyond the one-dimensional projections shown earlier.The two-dimensional histograms of the momentum fractions of leading versus subleading pions (upper row) and kaons (lower row) are shown in Fig. 8.The diffusion model clearly learns the correlations between the leading and subleading particles that vary within two orders of magnitude in phase space regions with large statistics and loses its ability to reproduce the correlations toward the edges of phase space.As mentioned before, quantifying the model uncertainty is required to better assess the degree of agreement.However, we conclude that our results in Fig. 8 encourage future explorations in this direction.
We are now going to evaluate the diffusion model's performance in describing inclusive DIS reactions characterized by the virtuality of the exchanged photon Q 2 = −q 2 = −(l − l ′ ) 2 , and the Bjorken scaling variable, defined as x = Q 2 /2p • q ≤ 1.Here l, l ′ denote the incoming and outgoing four-momenta of the scattered electron, respectively, and p is the momentum of the incoming proton.Assessing the diffusion model's ability to reproduce the distribution of these variables is a stringent test that allows us to gauge its ability to capture the correlations between the outgoing electron phase space and the initial state momenta.In Fig. 9 we display the reconstructed 2D density in the x, Q 2 space.Overall, we find good agreement between the diffusion model results and Pythia8 with pixelation effects with some noticeable differences only in the x → 1 region.The presence of stripes in the 2D density plots can be attributed to pixelation effects, which can be mitigated by increasing the number of pixels used to represent the events, analogous to the momentum distributions shown in Fig. 3.We defer improvements to future work due to limitations of computational resources.Nevertheless, within the described limitations, we conclude that the diffusion model is able to approximately reproduce the correlations of the DIS phase space.
As discussed above, we use the variable z as the pixel intensity rather than the transverse momentum p T .Here, we discuss why the pixelation of p T induces larger systematic uncertainties compared to z when reconstructing the DIS kinematic variables x and Q 2 .The photon virtuality can be expressed as Q 2 = 2l 0 l ′ T e −y l ′ , where l 0 is the incident lepton energy, and l ′ T and y l ′ are the outgoing lepton transverse momentum and rapidity in the lab frame, respectively.Focusing on the uncertainty induced by the rapidity pixelation, we find that δQ 2 ∼ Q 2 δy l ′ for fixed values of l ′ T .This implies that uncertainties on Q 2 from the rapidity pixelation are amplified by a factor of Q 2 , which is typically required to be large for phenomenological applications.Instead, when using the variable z i , we find the that the photon virtuality is given by Q 2 = l 0 √ sz(e −y l ′ / cosh y l ′ ), where √ s is the CM energy.In this case, the uncertainty induced by the rapidity pixelation for fixed values of z is δQ 2 = Q 2 (e y l ′ / cosh y l ′ )δy l ′ , i.e. there is an additional factor K = (e y l ′ / cosh y l ′ ) relative to the p T pixelation case.In the CM frame of the electron-proton reaction, with the incoming beam of electrons moving along the z-axis, the rapidity of the outgoing electrons is mostly negative, and typical values for K are around 0.01, which significantly suppresses systematic errors.
One of the salient features of the DIS process is the presence of scaling in the variable Q 2 which is interpreted as the evidence of point-like constituents inside the proton.The so-called DIS neutral current (NC) reduced cross sections is defined as Here we defined Y ± = 1 ± (1 − y) 2 , with the inelasticity y = Q 2 /(s − M 2 )/x.M is the proton mass, and α corresponds to the electromagnetic fine-structure constant.
The structure functions F 2,L,3 are independent of Q 2 up to logarithmic corrections that can be predicted within perturbative QCD, provided that Q 2 is sufficiently large relative to any other hadronic scale.In this regime the F 2 structure function is the dominant contribution to the reduced cross section, hence it is approximately invariant under changes in Q 2 .In Fig. 10 we compare the reconstructed cross section σ ep→e ′ +X red.NC from diffusion model generated events and the Pythia8 samples.The reconstructued reduced cross sections from Pythia8 shows the expected scaling behaviour.In contrast after pixelation, distortions on the scaling behavior are induced mostly in the large-x valence region.We stress that this can be mitigated by enlarging the number of pixels.Taking into account these systematic scaling violations induced by the pixelation, the reconstructed reduced cross sections from the diffusion model are qualitatively in agreement with the pixelized version of Pythia8, albeit the diffusion model exhibits deviations at high values of x compared to Pythia8.These deviations may be associated with the epistemic uncertainties of the diffusion model.We would like to highlight that achieving a faithful representation of DIS events serves as the starting point for studying cross sections such as semi-inclusive DIS that are differential in up to approximately 10 variables, which will play a critical role at the future EIC.We note that further improvements of our results may be achieved by using diffusion models based on point clouds [64] and adapting methods developed in Ref. [87] in the context of emulating hard-scattering events.

VI. CONCLUSIONS
In this work, we presented simulations of electronproton scattering events using diffusion models.Our results are relevant for simulations at CEBAF, the future Electron-Ion Collider and LHeC/FCC-eh.The diffusion model is based on a noising schedule that sequentially turns the images from the training data set into Gaussian noise.The stochastic reverse process is learned by a U-Net architecture based on convolutional layers with small filters.We trained the diffusion model on Pythia8 simulations of electron-proton scattering events at EIC energies and observed that it can generate high-quality sparse samples of collider events.We achieved good agreement near the kinematic endpoints by rescaling the particle momenta with a mixed exponential-logarithmic function, which accounts for the unique role that the scattered electron plays in electron-proton collisions.We employed an image-based representation of the training data and explored associated pixelation effects.As a first step, we limited ourselves in this work to electrons, pions, and kaons, which are represented by different "color" channels.Overall, we found good agreement for various observables and their correlations as well as event-wide constraints such as momentum conservation.We foresee various applications of our work in the context of generative modeling for collider physics including eventlevel analysis of hadron structure, data storage, studies of hadronization, exclusive processes like Deeply Virtual Compton Scattering (DVCS), and searches of new physics, which will be addressed in future work.

FIG. 3 .
FIG.3.Momentum distribution of electrons e − in electronproton scattering events for three different levels of pixelation compared to the original Pythia8 result.

FIG. 4 .
FIG. 4. The log loss of the diffusion model training procedure as a function of the number of samples that the model is trained on.

2 FIG. 5 .
FIG.5.Differential cross sections in arbitrary units (A.U) of the rescaled momentum variable z, pseudorapidity η, and azimuthal angle ϕ (left to right).The results are shown separately for electrons e − , charged pions π + , and kaons K + (top to bottom).We show the diffusion model results (red) including statistical uncertainties and compare them to Pythia8 with (black) and without (purple) pixelation effects.Under each panel, we include cross sections ratios between the diffusion model and Pythia8 with pixelation.

1 FIG. 9 .
FIG. 9. Correlations of the DIS variables Bjorken x and the photon virtuality Q 2 comparing Pythia8 events (left) to results from the diffusion model (middle) and ratio to Pythia8 (right).

8 Q 2 ∈Q 2 ∈ [ 86 FIG. 10 .
FIG. 10.Photon virtuality Q 2 scaling of the DIS reduced cross sections as a function of Bjorken x.The three panels show the Pythia8 results without and with pixelation and the diffusion model results (from left to right).