Towards a Deep Learning Model for Hadronization

Hadronization is a complex quantum process whereby quarks and gluons become hadrons. The widely-used models of hadronization in event generators are based on physically-inspired phenomenological models with many free parameters. We propose an alternative approach whereby neural networks are used instead. Deep generative models are highly flexible, differentiable, and compatible with Graphical Processing Unit (GPUs). We make the first step towards a data-driven machine learning-based hadronization model by replacing a compont of the hadronization model within the Herwig event generator (cluster model) with a Generative Adversarial Network (GAN). We show that a GAN is capable of reproducing the kinematic properties of cluster decays. Furthermore, we integrate this model into Herwig to generate entire events that can be compared with the output of the public Herwig simulator as well as with $e^+e^-$ data.


Introduction
Simulations are essential tools for nearly all aspects of data analysis at particle colliders (see e.g., Ref. [1]). These simulations are rooted in particle and nuclear physics and must model a large range in energy scales. At the smallest distance scales, various forms of perturbation theory offer accurate, first-principles descriptions of hard-scatter particle reactions and collinear parton shower radiation. The conversion from quarks and gluons to hadrons is performed using hadronization models. Such approaches are physically inspired but are ultimately phenomenological models with many parameters that must be fit to data. There are currently two main hadronization models, each inspired by a different description of strong dynamics in the low-energy region. The linear confining potential motivated the string model [2,3] implemented in Pythia [4,5] and preconfinement [6,7] inspired the cluster model [8] in Herwig [9][10][11][12] and Sherpa [13,14]. In both models, there is an intermediate object between quarks/gluons and hadrons. This intermediate object (string or cluster) takes as input the kinematic and flavor information from quarks and gluons and then has an approximately universal fragmentation into different hadron species that carry some fraction of the object's momentum.
While existing hadronization models have been used successfully in a large number of phenomenological and experimental studies at the Large Hadron Collider and beyond, there is also significant room for innovation. Existing models are not flexible enough to describe all of the properties of hadronization (see e.g. Ref. [15]). Even so, these models still have a large number of parameters that need to be fit to data, which are adjusted ('tuned') using semi-automated programs like Professor [16]. Existing tuning methods are not able to process high-dimensional observables or simultaneously tune many parameters because they rely on relatively simple surrogate models to approximate the dependence of the data on the model. A number of recently proposed automated tuning approaches employ sophisticated surrogate models [17][18][19], but they all still require approximating complex relationships in high dimensions and therefore often are limited to relatively lowdimensional parameter spaces.
On the path towards a fully flexible, data-optimized, machine learning-based hadronization model, we demonstrate the first step by training a GAN to mimic a component of the cluster hadronization implementation in Herwig. In particular, we replace part of the cluster decayer inside Herwig with a GAN using the Open Neural Network Exchange (ONNX) [86] interface to call the neural network inside the C++ code. This GAN-based cluster decayer, HADML, is trained on Herwig. Future work will add additional complexity (cluster to cluster decays, color reconnection of clusters [87][88][89], etc.) and will ultimately lead to a model that can be trained (tuned) on data. This ultimate model will benefit from new, highdimensional future measurements [90] that will provide the necessary constraining power for the flexible neural network approaches. This paper is organized as follows. Section 2 briefly introduces details of the Herwig Monte Carlo event generator and how we interface a GAN in the hadronization stage. Then, Sec. 3 presents the first numerical results with the HADML hadronization model. The paper ends with conclusions and outlook in Sec. 4.

Dataset
The training data was created using the hadronization cluster model [8] . The cluster model is based on t'Hooft's planar diagram theory [91]: the dominant color structure of Quantum Chromodynamics (QCD) diagrams in the perturbation expansion in 1/N c can be represented in a planar form using color lines, which is commonly known as the limit N c → ∞. The resulting color topology in Monte Carlo events with partons in the finalstate color features open color lines after the parton showers. Following a non-perturbative isotropic decay of any left gluons in the parton jets to quark-antiquark pairs, the event finally consists of color-connected partons in color triplet or anti-triplet states. These parton pairs form color-singlet clusters. This is so-called color preconfinement [6]: the tendency of the partons generated in the parton shower to be arranged in color singlet clusters (pre-hadrons) with limited extension in both coordinate and momentum space. The principle of color preconfinement states that the mass distribution of these clusters is independent of the hard-scattering process and its center-of-mass energy. The cluster mass spectrum is not only universal but also peaked at low masses; therefore, most of the clusters decay into two hadrons and some just into one hadron. However, there is a small fraction of clusters that are too heavy for this to be a reasonable approach. Therefore, these heavy clusters are first split into lighter clusters before they decay. Such decays of massive clusters are beyond the scope of this publication, and we will consider it in future work. Since the kinematics of a cluster decaying into a single hadron is trivial, our training data set only includes cases of decay into two hadrons. To further simplify the training data, we consider only decays into pairs of π 0 . Each decay in our data set was described with the following information: the four-momentum of the cluster, the four-momenta of the two hadrons together with their flavor (encoded as a Particle Data Group (PDG) [92] code), and the Pert flag. Pert = 1 means that hadrons that contain a parton produced in the perturbative stage of the event remember the direction of the parton in the rest frame of the cluster. To create the training data, we used e + e − collisions at √ s = 91.2 GeV generated by Herwig version 7.2.1. The only modification to the default generator settings was the change that the hadrons produced from cluster decays were on the mass shell * .

GAN Model and Training
We trained a conditional GAN to simulate the cluster decays. In a GAN, there is a Generator neural network (Generator for short) and a Discriminant neural network (Discriminator for short). Inputs to the Generator are the cluster's four vectors (E, p x , p y , p z ), and N features sampled from a Gaussian distribution. The N numbers are called noise. N is a hyperparameter and set to be 10. Outputs of the Generator are the polar angle, φ, and azimuthal angle, θ, of the leading hadron's momentum in the spherical coordinate system in the cluster frame, in which the two hadrons are created back-to-back. With the two angular variables, θ and φ, and the cluster's four vector, we reconstruct the four vectors of the two outgoing hadrons as a postprocessing step. Inputs to the Discriminator are just the two angular variables coming from either the Generator, labeled as background, or those from the Herwig, labeled as signal. The output of the Discriminator is a score that is higher for events from the Herwig and lower for events from the Generator. The Discriminator is trained to separate signal from background. However, the Generator is trained to yield signal-like Discriminator score.
The GAN is based on multilayer perceptrons (MLPs). Both the Generator and the Discriminator are composed of a two-layer perceptron. Each perceptron consists of a sequence of Keras [93] modules: a fully connected (dense) network of a hidden size of 256, a batch normalization layer, and a LeakyReLU activation function [94]. These parameters were not extensively optimized.
To help train a GAN, we preprocessed the training data. The incoming cluster's four vector is scaled so that their values are between -1 and 1; so are the two angular variables (φ and θ). In this way, all inputs and outputs are within the same scale. Finally, we use the tanh activation function as the last layer of the Generator. The Discriminator and the Generator are trained separately and alternately by two independent Adam optimizers [95], both with a learning rate of 10 −4 , for about 1000 epochs. Best Wasserstein Dis Figure 1. Generator loss and discriminator loss and progressive best Wasserstein distance as a function of the training epochs for training a GAN with events where two partons are with Pert = 0. Both Generator and Discriminator loss are the binary-crossentropy loss, and the Discriminator loss is divided by two for visualization purposes. The progressive Wasserstein distance is gauged by the right side of the y axis. Figure 1 shows the evolution of the Discriminator loss, which is divided by two for visualization purposes, the Generator loss, and the progressive best total Wasserstein distances † [96,97] for training a GAN with events where two partons are with Pert = 0. The total Wasserstein distance summing over the distances of all variables, is calculated after training for one epoch and only the smallest value is plotted. At the beginning of the training (epoch < 70), even though the Generator loss is going up, we see a rapid drop in the Wasserstein distance until the Generator loss is beyond 0.8. For more than 100 epochs, the Discriminator keeps outperforming the Generator as seen by the increasing Generator † This is a common metric in machine learning that quantifies the minimal 'work' required to transform one density into another, where work, in this case, is defined as the integral of the density multiplied by the distance moved. loss and the decreasing Discriminator loss. This situation is changed around epoch 200 and finally, the two networks reach an equilibrium around epoch 250. Beyond epoch 600, we only see about 0.002 improvements in the Wasserstein distance. The best model for events with partons of Pert = 0, is found at the epoch 849 with a total Wasserstein distance of 0.0228. A similar analysis was performed when training events with at least one parton with Pert = 1.

Integration into Herwig
Each part of Herwig is implemented as a C++ class that contains the implementation of the Herwig physics models, inheriting from an abstract base class in ThePEG [98]. The ClusterHandronizationHandler is the class that controls the cluster hadronization model. Our ultimate goal will be to replace the entire ClusterHandronizationHandler with its ML counterpart. However, since in these studies, we concentrate on the decay of clusters into two hadrons, it was sufficient to modify ClusterDecayer -a helper class of the ClusterHan-dronizationHandler that controls this process. The generative model trained in Python using TensorFlow is converted into the ONNX format [86] and integrated into the Herwig chain using the C++ API of ONNX Runtime [99]. The advent of the ONNX format makes it possible to train a model in one software and hardware environment and then apply it in a completely different environment. ONNX Runtime is well suited for running fast neural network inference as part of a large C++ workflow, and by using it, we avoid having to integrate and maintain TensorFlow [100] within the Herwig framework.
All preprocessing and postprocessing steps performed for training are repeated within Herwig for inference. The entire simulation chain, including the GAN, is then run in Herwig in order to produce the final comparisons and results.

Results
Section 3.1 provides low-level results of individual cluster decays while Sec. 3.2 includes full event simulations and comparisons to e + e − data.

Low-level Validation
Since the training data contained only clusters produced in e + e − collisions at √ s = 91.2 GeV that decayed into π 0 pairs, we begin by comparing the π 0 kinematic variables generated by HADML and Herwig precisely in such decays. The data generated by Herwig, with which we compared the results of HADML in this section, were not used for training. In Fig. 2 we show the distribution of the pseudorapidity (left panels) and transverse momentum distribution (right panels) of π 0 from the decays of the Pert = 0 (upper panels) and Pert=1 (lower panels) clusters. As expected, we see that the transverse momentum spectra of pions coming from clusters containing "perturbative" quarks (Pert=1) are harder compared to those containing only non-perturbative partons (Pert=0). However, the most important observation from Fig. 2 is that Herwig 7 + HADML (labeled on figures as H7+HADML) matches the pseudorapidity of the pions generated by Herwig 7 with the cluster model (labeled as H7 on figures). Transverse momentum spectra that extend over Pseudorapidity distribution of π ± and π 0 multiplicity, Pert=1 1/σ π dσ/dp T Figure 2. Pseudorapidity (left panels) and transverse momentum (right panels) distribution of π 0 from decays of Pert=0 (upper panels) and Pert=1 (lower panels) clusters produced in e + e − collisions at √ s = 91.2 GeV.
several orders of magnitude are also well approximated by H7+HADML. Taking a closer look at these distributions, we see minor differences for low transverse momenta in the case of clusters that have a memory of perturbative quarks (bottom-left panel in Fig. 2). Such small differences are, of course, acceptable, especially since the information about the four-momentum of partons that make up the clusters were not used for training. Taking this additional information into account in the training process will likely eliminate these minor differences. However, this is beyond the scope of this publication, and we will leave this problem for future work. It is crucial that the hadronization model is universal, i.e., that it works independently of the hard process or collision energy. As we described in the Sec. 2.1 the cluster model has this property. To test whether HADML also is universal, we decided to repeat the comparison made at the beginning of this section, but this time generating events with collision energies twice as high as those used in the training data. In Fig. 3 we show π 0 kinematic variables generated by H7+HADML and Herwig 7 in e + e − collisions at √ s = 192 GeV. We can see that all distributions are described very similarly by both models, which reassured us that the HADML model is also universal.
The last thing we need to check before using HADML to simulate the decay of all clusters into hadron pairs in Herwig is whether the model is able to describe the kinematics of other hadrons than π 0 . In Fig. 4 we present the pseudorapidity (left panels) and trans- 1/σ π dσ/dp T Figure 3. Pseudorapidity (left panels) and transverse momentum (right panels) distribution of π 0 from decays of Pert=0 (upper panels) and Pert=1 (lower panels) clusters produced in e + e − collisions at √ s = 192 GeV.
verse momentum (right panels) distribution of π ± and π 0 (first row), kaons (second row) and lambdas (third row). We see that the distributions differ for the various hadrons, but they are all described almost identically by both models. This encouraged us to perform a comparison with experimental data in which the kinematics of all hadrons ‡ in Herwig are generated by HADML model.

Full-event Validation
In this section, we generate full events using HADML integrated into Herwig and compare the results also to data from LEP § . In particular, we consider an analysis from DELPHI with data collected at √ s = 91.2 GeV [101] using RIVET ¶ [102]. These events correspond to hadronic Z boson decays with a number of event shape and identified hadron spectra. These data have been used for hadronization parameter tuning [101,103]. Figure 5 shows histograms of various event shapes. Thrust [104,105] is the quintessential e + e − event shape: ‡ Except for a small number of hadrons that come from the decay of a cluster into a single hadron for which the kinematics is trivial. § Note that the data are for illustration only -given that the GAN is trained on Herwig, we cannot expect it to outperform Herwig. Tuning to data is a longer-term goal of this research (see Sec. 4). ¶ https://rivet.hepforge.org/analyses/DELPHI_1996_S3430090. 1/σ Λ dσ/dp T Figure 4. Pseudorapidity (left panels) and transverse momentum (right panels) distribution of π ± and π 0 (first row), Kaons (second row) and Lambdas (third row).
where the sum runs over all final state particle three momenta. The direction n that maximizes the argument of Eq. 3.1 is called the Thrust axis. Thrust major is defined similarly to Eq. 3.1 but with n replaced with vectors transverse to the Thrust axis and Thrust minor is the same, but with an optimization only over directions perpendicular to both the Thurst and Thurst major axes. The Sphericity is computed from the eigenvalues of the quadratic momentum tensor where α, β are the spatial momentum indices, and the sum runs over the same particles as in Eq. 3.1. Sphericity is defined as 3 2 (λ 2 + λ 3 ) for eigenvalues λ i of the 3 × 3 matrix defined in Eq. 3.2 and λ 3 ≤ λ 2 ≤ λ 1 . Hadronization shifts event shapes (see e.g., Ref. [106]) and so these observables are sensitive to hadronization modeling. Figure 5 shows that HADML agrees with Herwig within 10% across most of the spectra, which itself agrees with data at a similar level. Individual particle spectra are shown in Fig. 6 for the transverse momenta along the Thurst major and minor directions. The level of agreement is similar to the event shapes where there is sufficient statistical power.   Figure 6. Normalized, differential cross-sections of particle transverse momenta along the Thrust major (left) and Thurst minor (right) axes for Herwig, Herwig with our HADML, and for data from DELPHI at LEP. Error bars on the predictions represent statistical uncertainties.

Summary and Outlook
In this paper, we have established a first step on the path towards a neural network-based hadronization model. The cluster hadronization model from Herwig has been emulated with a Generative Adversarial Network. This model is designed to reproduce the twobody decay of clusters into pions. The GAN is integrated into the full Herwig program by using all other hadronization components from the Herwig default model. The kinematic properties of other hadrons are emulated using the pion model and conservation of energy. We have shown that the HADML is able to reproduce Herwig's light cluster decays and when integrated with the full Herwig simulation, is able to reproduce results from e + e − data as well.
The ultimate goal of this research direction is to train the ML model directly on data to improve upon the existing hadronization models. A number of technical and methodological steps are required to achieve this vision. First, the deep generative model needs to be extended to directly accommodate multiple hadron species and to model the relative probabilities of the various final states. In this work, we have modeled different hadron species using conservation of energy, but this means that the fragmentation is assumed universal. Architectural modifications could allow for perturbations on universality. Hyperparameter optimization, including the investigation of alternative generative models, is an important component of future work. Once the deep generative model has the capacity to reproduce all of the physics of the Herwig cluster model, methodological innovation is required to explore how to tune the model to data. Traditionally, e + e − data are used for tuning. Optimization with a large set of one-dimensional, binned measurements will need to be explored. A non-trivial aspect of this optimization is that while the hadronization model would be differentiable, the parton shower input would not be. Building in a model of uncertainty would also be a central aspect of model tuning. It may also be possible to tune with unbinned, and higher-dimensional results from ep and pp data [90,[107][108][109][110][111].
While we have focused on hadronization in the context of collider physics, the ideas and concepts described in this paper have broader implications. First of all, hadronization is used across high energy particle and nuclear physics (see e.g., Ref. [112]) and perturbations on the collider model may be required to accurately describe other systems. Second, there are other physical systems where first-principles input is combined with phenomenological models. For example, a complete description of observational cosmology requires an N -body simulation of the dark matter to be combined with a description of visible matter around dark matter halos (see e.g., Ref. [113][114][115][116][117]). While different applications call for domain-specific adaptations, some components and core methodology is common. Further development in this research area will enable important advances in simulation to improve inference in high energy physics and beyond.
Note added: As this manuscript was being finalized, we became aware of the recent work in Ref. [118], which has a similar goal. That study uses a different Monte Carlo program (Pythia instead of Herwig) and uses a different generative model (Variational Autoencoder instead of a GAN). Reference [118] also focuses on the pion-only case.