Paper • Open access

LHC hadronic jet generation using convolutional variational autoencoders with normalizing flows


Published 31 October 2023 © 2023 The Author(s). Published by IOP Publishing Ltd
Focus on Generative AI in Science. Citation: Breno Orzari et al 2023 Mach. Learn.: Sci. Technol. 4 045023. DOI: 10.1088/2632-2153/ad04ea


Abstract

In high energy physics, one of the most important processes for collider data analysis is the comparison of collected and simulated data. Nowadays, the state of the art for data generation is Monte Carlo (MC) event generation. However, because of the upcoming high-luminosity upgrade of the Large Hadron Collider (LHC), there will not be enough computational power or time to produce the required amount of simulated data using MC methods. An alternative approach under study is the usage of machine learning generative methods to fulfill that task. Since the most common final-state objects of high-energy proton collisions are hadronic jets, which are collections of particles collimated in a given region of space, this work aims to develop a convolutional variational autoencoder (ConVAE) for the generation of particle-based LHC hadronic jets. Given the ConVAE's limitations, a normalizing flow (NF) network is coupled to it in a two-step training process, which improves the quality of the generated jets. The ConVAE+NF network is capable of generating a jet in $18.30 \pm 0.04\,\,{\mu\text{s}}$, making it one of the fastest methods for this task up to now.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

In the CERN Large Hadron Collider (LHC) [1, 2], approximately one billion proton–proton (pp) collisions occur every second. Those collisions generate outgoing particles that are measured by dedicated, custom-made particle detectors. The main data stream is collected by such a detector at a rate close to 1.4 kHz. After the upcoming LHC upgrade to the high-luminosity LHC (HL-LHC) [3], this rate will increase by a factor of 3.5–5. In high energy physics (HEP), an important part of the data analysis is the simulation of pp collisions, which is used to test analysis techniques, estimate the particle detectors' efficiencies and resolutions, and make comparisons with collected data. Usually, the number of simulated events is an order of magnitude higher than the number of recorded events in real data, such that the statistical uncertainty of the simulated events does not dominate the overall uncertainty. For simulating collision data, Monte Carlo (MC) event generators are used. However, the increased granularity of the detectors and the higher complexity of the collision events after the HL-LHC upgrade pose significant challenges that cannot be solved by increasing the computing resources alone [4].

One of the most common final-state objects in high-energy pp collisions is an energetic cluster of hadrons, known as a hadronic jet [5]. Jets can be described by the features of their constituent particles, mainly their four-momentum. The particle-flow [6] algorithm is used to combine information obtained by several sub-detectors to precisely measure each particle's properties. Jet clustering algorithms [7] can be used to cluster the particle constituents into a jet and obtain its relevant properties for physics analysis such as its invariant mass, transverse momentum (${p_\mathrm{T}}$), energy, azimuthal angle φ, and pseudorapidity η 5 .

Accelerating jet simulation in a particle detector would be of great advantage to mitigate the required overhead in computing resources. One of the avenues that shows promising results is the use of generative machine learning (ML) algorithms to perform the jet simulation. The goal of this program is to maintain the accuracy of MC methods while making the process orders of magnitude faster. Several neural network (NN) based approaches have been studied and applied for this purpose, including generative adversarial networks (GANs) [8, 9], normalizing flows [10], and score-based diffusion models [11, 12]. Another promising technique is the use of variational autoencoders (VAEs) [13], which have yet to demonstrate high fidelity results.

In this work, a new method for the generation of hadronic jets from noise is presented. It is based on a VAE coupled with a normalizing flow (NF) and trained in a two-step process. This technique opens up a path to explore and further our expertise in generating point-cloud-like objects in general, and in this paper we focus on a possible application in HEP. The paper is organized as follows. Section 2 contains a literature overview of ML applied to HEP, mainly focusing on generative methods for jet simulation. In section 3 the dataset and the model are described. Section 4 details the custom loss function and the evaluation metrics; results are presented and discussed in section 5. Finally, in section 6, we summarize the work and give an outlook for using the new method.

2. Related work

ML techniques have been applied to hadronic jet generation following several distinct approaches. Some of the first attempts [14, 15] were performed using convolutional neural networks (CNNs) because of their impressive results in computer vision and the fact that an image-like representation of a jet arises naturally from the particle detectors' data. For a similar reason, the use of CNNs for the simulation of calorimeter showers [16–18] was explored. Other jet representations such as vectors of energy deposits in calorimeters or high-level jet characteristics have also been studied [19, 20].

Recently, particle-based jet datasets have also been extensively used by the ML community in the context of jet generation [8, 9, 13, 21]. They are composed of the properties of the jets' constituent particles and give a detailed description of the particle distribution inside the jet, which is known as the jet substructure [22, 23]. GANs based on graph networks [8, 9] have provided impressive results when aiming to generate particle-based jets, retaining great similarity between generated and MC-simulated jets while improving the speed of simulation by 4 to 5 orders of magnitude, depending on the number of particles inside the generated jet. Some VAEs based on convolutional networks have also been implemented using the same dataset for fast jet simulation [21] and for jet generation as well [13]. However, for the latter, there is considerable room for improvement when comparing the VAE-generated jets with MC-simulated jets.

The premise of this work is that the VAE by itself is not enough to capture the full characteristics of a hadronic jet, mainly due to the a priori choice of the latent space distribution. On the other hand, a combined VAE+NF approach, which has already been used for image generation [24], time series prediction [25], and calorimeter shower simulation [26, 27], among others, has not yet been applied in the context of hadronic jet generation. In this work, this combined approach is used to improve the generative capacity of the VAE.

3. Dataset and model

We use the gluon HLS4ML LHC Jet dataset [28, 29], which is composed of ∼177 k gluon jets, containing the jet constituent particles' three-momenta $({p_{x}}, {p_{y}}, {p_{z}})$. These were obtained from simulated pp collisions with a center-of-mass energy of $\sqrt{s} = 13{\,\text{TeV}}$ that produced jets with ${p_\mathrm{T}} \approx 1{\,\text{TeV}}$. A simplified particle detector parameterization, similar to the apparatuses deployed by the large LHC experiments, is used to emulate the detector response. The hadronic jets are reconstructed using the anti-${k_\mathrm{T}}$ algorithm [30, 31] with a distance parameter R = 0.8. The 177 k samples are divided into ∼70% for training and ∼15% each for validation and testing. Even though there is no intrinsic ordering of the particles inside a jet, the particles are ordered in a list by decreasing ${p_\mathrm{T}}$ and only the first 30 particles are used as input to the ML algorithm. If a jet contains fewer than 30 particles, the rest of the list is filled with zeros (zero-padding). A feature-wise standardization was performed to bound the values of the three-momentum to the range [0.0, 1.0] as input to the network.
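
For illustration, a minimal preprocessing sketch along these lines is shown below; it assumes the jets are available as arrays of per-particle $({p_{x}}, {p_{y}}, {p_{z}})$ values, and the min-max rescaling used here is an illustrative assumption rather than the exact scheme of the published code.

```python
import numpy as np

def preprocess_jets(jets, n_particles=30):
    """Order particles by decreasing pT, zero-pad (or truncate) to n_particles,
    and rescale each feature to [0, 1]; the min-max rescaling is illustrative."""
    processed = []
    for jet in jets:                                   # jet: (n_const, 3) array of (px, py, pz)
        pt = np.hypot(jet[:, 0], jet[:, 1])            # particle transverse momentum
        jet = jet[np.argsort(-pt)][:n_particles]       # decreasing pT, keep the first 30
        pad = np.zeros((n_particles - len(jet), 3))    # zero-padding for shorter jets
        processed.append(np.vstack([jet, pad]))
    x = np.stack(processed)                            # shape (n_jets, 30, 3)
    x_min = x.min(axis=(0, 1), keepdims=True)          # feature-wise bounds
    x_max = x.max(axis=(0, 1), keepdims=True)
    return (x - x_min) / (x_max - x_min)               # features bounded to [0, 1]
```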

3.1. Convolutional VAE

To accommodate this representation of the input dataset, the chosen model is a VAE [32–34] composed of convolution layers (ConVAE). The VAE consists of three components: (i) an encoder, which learns to compress the representation of the input data into a lower dimensional representation, (ii) a latent space, which stores this lower dimensional representation as values that closely follow a probability distribution defined by the user, and (iii) a decoder, which decodes the latent space values into the output. The probability distribution of the latent space values has to be differentiable, easy to implement, and easy to sample from. The most common choice is the standard Gaussian ${\mathcal{N}(0,1)}$. The goal of the VAE is to produce output data whose distribution is as similar as possible to that of the input data. In addition, it is desirable to be able to generate new data samples resembling the training data by sampling from ${\mathcal{N}(0,1)}$ and feeding those values into the decoder.

The network was implemented in the PyTorch library [35]. In this work, the encoder and decoder structures are mirrored, and only the encoder will be described. It is composed of a number of consecutive two-dimensional convolution layers; the output of the last convolution layer is flattened and fed into two linear layers, the second of which feeds the latent space. An activation function is applied after each layer, except for the last linear layer of the encoder. The input and output layers of the network have a fixed size of 3×30. The activation function between layers, the number of convolutional layers in the encoder and decoder, the kernel size, the latent dimension size, the number of filters, and the number of nodes in the linear layers between the encoder (decoder) and the latent space are hyperparameters to be optimized. The activation function of the last deconvolution layer of the decoder was chosen to be the sigmoid [36] function, to bound the output values between 0.0 and 1.0, consistent with the feature-wise standardization.

In the encoder architecture, the number of input nodes of the first linear layer is fixed by the size of the last convolutional layer, while the number of output nodes of the second linear layer is fixed to two times the number of dimensions of the latent space, to provide the means and standard deviations of the latent distributions 6 . In the decoder architecture, the number of input nodes in the first linear layer is fixed by the latent space size, and the number of output nodes in the second linear layer is fixed by the size of the first deconvolution layer. The stride and padding of the convolution layers were kept at 1 and 0, respectively. The first convolutional layer has 1 input filter and $N_\mathrm{filters}$ output filters, and the remaining convolutional layers have an input number of filters equal to 2$^{\ell-2} \times N_\mathrm{filters}$ and an output number of filters equal to 2$^{\ell-1} \times N_\mathrm{filters}$, where $\ell$ is the layer number. Using as an example an encoder of a ConVAE composed of three convolutional and two linear layers, table 1 summarizes the sequential layers of the network.

Table 1. Sequential layers of a representation of the encoder of the ConVAE model. The sizes of the Conv2D and Linear layers are given as [input channels, output channels, kernel size[0], kernel size[1]] and [input nodes, output nodes], respectively. The decoder architecture is an exact mirror of the encoder up to the last linear layer, using 2D deconvolution layers. An activation function is applied after all layers, except for the last linear layer of the encoder. The activation function, number of filters, kernel size, linear size and latent dimension size were obtained via hyperparameter optimization.

Layer type      | Size                                                                                             | $\ell$
Input list      | [1, 3, 30]                                                                                       | —
Conv2D          | [1, $N_\mathrm{filters}$, 3, kernel size]                                                        | 1
Conv2D          | [2$^{\ell-2} \times N_\mathrm{filters}$, 2$^{\ell-1} \times N_\mathrm{filters}$, 1, kernel size] | 2
Conv2D          | [2$^{\ell-2} \times N_\mathrm{filters}$, 2$^{\ell-1} \times N_\mathrm{filters}$, 1, kernel size] | 3
Linear          | [2$^{\ell-2} \times N_\mathrm{filters}$ × (30 − (kernel size − 1) × ($\ell$ − 1)), linear size]  | 4
Linear          | [linear size, 2 × latent dimension size]                                                         | —
Mean & Variance | latent dimension size                                                                            | —
Latent vector   | latent dimension size                                                                            | —
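
A minimal PyTorch sketch of such an encoder, following the structure of table 1 with the illustrative values of table 2 as defaults and including the reparametrization step of footnote 6, is given below; it is a simplified stand-in, not the actual implementation, whose hyperparameters are fixed by the optimization described in this section.

```python
import torch
import torch.nn as nn

class ConVAEEncoder(nn.Module):
    """Sketch of the table 1 encoder: three Conv2D layers followed by two Linear layers."""
    def __init__(self, n_filters=50, kernel=5, linear_size=1500, latent_dim=30):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, n_filters, (3, kernel)), nn.ReLU(),            # collapses the 3 features
            nn.Conv2d(n_filters, 2 * n_filters, (1, kernel)), nn.ReLU(),
            nn.Conv2d(2 * n_filters, 4 * n_filters, (1, kernel)), nn.ReLU(),
        )
        conv_out = 4 * n_filters * (30 - (kernel - 1) * 3)              # flattened size after 3 convs
        self.linear = nn.Sequential(
            nn.Linear(conv_out, linear_size), nn.ReLU(),
            nn.Linear(linear_size, 2 * latent_dim),                     # means and log-variances
        )

    def forward(self, x):                                               # x: (batch, 1, 3, 30)
        h = self.linear(self.conv(x).flatten(1))
        mu, logvar = h.chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)         # reparametrization trick
        return z, mu, logvar
```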

At training time, the batch size was kept fixed at 100 samples and the Adam optimizer [37] was used with a learning rate (LR) that was set by the hyperparameter optimization.

3.2. Normalizing flows

When the ConVAE is optimized for reconstruction, i.e. when the weight of the Kullback–Leibler divergence term of the loss (detailed in section 4) is small, only a loose constraint forces the latent space values to follow the chosen probability distribution. In this case, the probability distribution of the latent space values, $p(\mathbf{z})$, is shaped by the reconstruction objective rather than strictly matching the imposed prior. These reconstruction-driven latent space distributions are generally unknown and one cannot easily sample from them.

An NF [38, 39] is a transformation of a simple probability distribution (e.g. ${\mathcal{N}(0,1)}$) into a more complex distribution (and vice versa) through a sequence of invertible and differentiable mappings, usually learned by training an NN. Therefore, we can combine the ConVAE approach with an NF to find the transformation from the more complex latent space distribution to a simpler one, such as $p(\mathbf{z}) \overset{f}\longrightarrow u(\mathbf{x})$ where $u(\mathbf{x}) = {\mathcal{N}(0,1)}$. Since these transformations are invertible by construction, it is possible to start from values that follow the simple distribution, pass them through the inverse of the transformation, and obtain values that follow the complex and unknown distribution, as $u(\mathbf{x}) \overset{{f^{-1}}}\longrightarrow p(\mathbf{z})$. The network is trained by maximizing the right-hand side of equation (1),

$\log p(\mathbf{z}) = \log u\left(f(\mathbf{z})\right) + \log \left|\det\left(\partial f(\mathbf{z})/\partial \mathbf{z}\right)\right|$    (1)

where $\partial f(\mathbf{z})/\partial \mathbf{z}$ is the Jacobian of the transformation.

The NF network we use is based on the RealNVP network 7 [40, 41] that makes use of simple analytic expressions for each intermediate transformation

$\mathbf{y}_{1:d} = \mathbf{x}_{1:d}$    (2)

$\mathbf{y}_{d+1:D} = \mathbf{x}_{d+1:D} \odot \exp\left(s(\mathbf{x}_{1:d})\right) + t(\mathbf{x}_{1:d})$    (3)

using the coupling layers [42], where only a subset of the input data undergoes the transformations while the rest remains unchanged, and the subsets are permuted every epoch to ensure that all input data goes through the flows. The parameters s and t are obtained as the outputs of two multilayer perceptrons (MLPs), which can be as complex as needed.

The NF network was implemented using the PyTorch library. The Adam optimizer was chosen to update the network parameters by minimizing the negative of the mean of the expression in equation (1) as the loss function. The LR, number of flows and the number of layers and nodes of the two MLPs that determine s and t were hyperparameters to be optimized. The ReLU [43] activation function was used for each layer of the MLPs, and for s the hyperbolic tangent was used after its last layer to constrain its values to be in the range $(-1.0, 1.0)$. The number of input and output nodes of the s and t networks were set to match the number of latent dimensions of the ConVAE.
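
A minimal sketch of a single RealNVP-style affine coupling layer in this spirit is shown below; the full flow stacks several such layers with the transformed subset permuted between them, and the fixed half-half split used here is a simplification of that scheme.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One RealNVP-style coupling layer: half of the dimensions are transformed
    with a scale s and a translation t predicted from the untouched half."""
    def __init__(self, dim, hidden=400):
        super().__init__()
        self.d = dim // 2
        def mlp(out_dim, last_tanh=False):
            layers = [nn.Linear(self.d, hidden), nn.ReLU(), nn.Linear(hidden, out_dim)]
            if last_tanh:
                layers.append(nn.Tanh())                # keeps s in (-1, 1), as described above
            return nn.Sequential(*layers)
        self.s = mlp(dim - self.d, last_tanh=True)
        self.t = mlp(dim - self.d)

    def forward(self, z):                               # latent values -> base distribution
        z1, z2 = z[:, :self.d], z[:, self.d:]
        s, t = self.s(z1), self.t(z1)
        x2 = z2 * torch.exp(s) + t                      # affine transform of the second subset
        log_det = s.sum(dim=1)                          # log |det Jacobian| of the transformation
        return torch.cat([z1, x2], dim=1), log_det

    def inverse(self, x):                               # base distribution -> latent values
        x1, x2 = x[:, :self.d], x[:, self.d:]
        s, t = self.s(x1), self.t(x1)
        z2 = (x2 - t) * torch.exp(-s)
        return torch.cat([x1, z2], dim=1)
```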

4. ConVAE loss function and evaluation metric

To define a customized loss function for the ConVAE model, we adapt the standard VAE loss [32], which is given by the expression:

$L_\mathrm{VAE} = -\mathbb{E}_{q_{\phi}(\mathbf{z}|\mathbf{x})}\left[\log p_{\theta}(\mathbf{x}|\mathbf{z})\right] + {D_\mathrm{KL}}\left(q_{\phi}(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\right)$    (4)

where x is the vector of data values, z is the vector of latent space values, $q_{\phi}(\mathbf{z}|\mathbf{x})$ the variational posterior (i.e. the encoder), $p_\theta(\mathbf{x}|\mathbf{z})$ the conditional likelihood (i.e. the decoder), and $p(\mathbf{z})$ the prior distribution. The first term on the right-hand side can be seen as the reconstruction loss $L_{\mathrm{rec}}$, measuring the distance between the output produced by the network and the input data. The second term ${D_\mathrm{KL}}$ is the Kullback–Leibler divergence [44], which measures the difference between the probability distributions of the encoder output values and the latent space values. The prior distribution of the latent space values is defined a priori, while the variational posterior produced by the encoder is chosen to be easy to work with yet expressive enough to describe its output, the usual choice being a multivariate Gaussian.

The minimization of this loss function ensures that the output will be as close as possible to the input, while the values of the latent space closely follow the chosen distribution. In addition, one can add a hyperparameter β [45] to control the importance of each term in the loss. The loss function can be written as:

$L_\mathrm{VAE} = (1-\beta)\,L_{\mathrm{rec}} + \beta\,{D_\mathrm{KL}}\left(q_{\phi}(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\right)$    (5)
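
As a concrete illustration, assuming a diagonal-Gaussian encoder output (so that ${D_\mathrm{KL}}$ has its usual closed form) and the convex β-weighting written in equation (5), the objective can be sketched as:

```python
import torch

def vae_loss(rec_loss, mu, logvar, beta):
    """Beta-weighted VAE objective sketched after equation (5); rec_loss is the
    reconstruction term.  The KL divergence is the closed form between the
    diagonal-Gaussian posterior N(mu, sigma^2) and the N(0, 1) prior."""
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()
    return (1.0 - beta) * rec_loss + beta * kl
```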

The reconstruction loss term was customized to the task of particle generation. The loss function that compares output particles with input dataset particles was chosen to be the Chamfer distance [46], also referred to as nearest neighbor distance (NND). This loss function is preferred over the commonly-used mean-squared error (MSE) function because it is permutation invariant, which is desirable since particles in a jet have no intrinsic ordering. The NND loss is expressed as:

$L^\mathrm{NND} = \sum_{k}\left[\sum_{i \in \mathcal{J}_{k}} \min_{j \in \hat{\mathcal{J}}_{k}} D(\vec{p}_i, \vec{\hat{p}}_j) + \sum_{j \in \hat{\mathcal{J}}_{k}} \min_{i \in \mathcal{J}_{k}} D(\vec{p}_i, \vec{\hat{p}}_j)\right]$    (6)

where i and j are indices of the particles in the input and output samples, respectively, k is the index of a given jet, $\mathcal{J}_{k}$ represents the k-th jet in the dataset, a hat distinguishes output objects (with a hat) from input objects (without), and $D(\vec{p}_i, \vec{\hat{p}}_j)$ is the Euclidean distance between input and output particles, treating each as a vector in $({p_{x}}, {p_{y}}, {p_{z}})$ space. The first term finds the closest output particle to a given input particle, the second term finds the closest input particle to a given output particle, and their distances are summed.
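
A compact PyTorch sketch of this loss (treating zero-padded particles like any other entry, which is a simplification) could read:

```python
import torch

def nnd_loss(jets_in, jets_out):
    """Chamfer (nearest-neighbour) distance between input and output jets.
    jets_in, jets_out: tensors of shape (batch, n_particles, 3) in (px, py, pz)."""
    dist = torch.cdist(jets_in, jets_out)             # pairwise distances, (batch, n_in, n_out)
    in_to_out = dist.min(dim=2).values.sum(dim=1)     # closest output particle for each input
    out_to_in = dist.min(dim=1).values.sum(dim=1)     # closest input particle for each output
    return (in_to_out + out_to_in).sum()              # summed over particles and jets
```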

The loss $L^\mathrm{NND}$ alone was not enough to ensure good quality of the generated jets. Therefore, physics-inspired constraints were included in the form of distances between the input and output jet properties. The best results were achieved by combining jet mass and ${p_\mathrm{T}}$ terms, using the MSE as a distance function. The additional loss term was added as follows

$L^{\mathrm{jet}} = \gamma_{{p_\mathrm{T}}}\,\mathrm{MSE}\left({p_\mathrm{T}}^{\mathcal{J}}, {p_\mathrm{T}}^{\hat{\mathcal{J}}}\right) + \gamma_{m}\,\mathrm{MSE}\left(m^{\mathcal{J}}, m^{\hat{\mathcal{J}}}\right)$    (7)

where the parameters $\gamma_{{p_\mathrm{T}}}$ and γm were used to weight each contribution. Finally, the combined reconstruction loss is given by

$L_{\mathrm{rec}} = \mathbb{E}_{r_{\phi,\theta}(\hat{\mathcal{J}}|\mathcal{J})}\left[\alpha\,L^{\mathrm{NND}} + \gamma\,L^{\mathrm{jet}}\right]$    (8)

in which α and γ were used to weight the importance of each contribution to the loss function and $r_{\phi,\theta}(\hat{\mathcal{J}}|\mathcal{J})$ represents the full VAE probability distribution from input to output. The full set of loss parameters $(\alpha, \beta, \gamma, \gamma_{{p_\mathrm{T}}}, \gamma_m)$ was optimized via hyperparameter optimization.
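
For illustration, the jet-level terms can be sketched as below, assuming massless constituents so that E = |p| for each particle and working directly in $({p_{x}}, {p_{y}}, {p_{z}})$ space; whether the loss is evaluated in physical or in standardized units is left open here. Schematically, the combined reconstruction term of equation (8) then corresponds to α · L^NND + γ · L^jet.

```python
import torch

def jet_observables(jets):
    """Jet pT and invariant mass from summed constituent momenta,
    assuming massless constituents (E = |p|)."""
    px, py, pz = jets[..., 0].sum(-1), jets[..., 1].sum(-1), jets[..., 2].sum(-1)
    e = jets.norm(dim=-1).sum(-1)                      # sum of |p| over constituents
    pt = torch.sqrt(px**2 + py**2)
    m2 = torch.clamp(e**2 - (px**2 + py**2 + pz**2), min=0.0)
    return pt, torch.sqrt(m2)

def physics_loss(jets_in, jets_out, gamma_pt, gamma_m):
    """MSE between input and output jet pT and mass, weighted as in equation (7)."""
    pt_in, m_in = jet_observables(jets_in)
    pt_out, m_out = jet_observables(jets_out)
    mse = torch.nn.functional.mse_loss
    return gamma_pt * mse(pt_out, pt_in) + gamma_m * mse(m_out, m_in)
```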

The evaluation metric chosen to quantitatively measure the generation capabilities of the distinct models was the earth mover's distance (EMD) [47], or 1-Wasserstein distance, calculated for each histogram of the input and output jets' mass, energy, ${p_\mathrm{T}}$, η, and φ, and summed to obtain the EMD${_{\small\textrm{sum}}}$. The choice of variables was based on relevant quantities for jet-based searches in LHC experiments [48, 49]. This metric was used as the hyperparameter optimization objective, as the criterion to choose the best model, and to compare different approaches.
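
A minimal sketch of this metric, computed here directly from the unbinned samples with SciPy rather than from histograms, is:

```python
from scipy.stats import wasserstein_distance

def emd_sum(obs_in, obs_out):
    """Sum of 1-Wasserstein (EMD) distances over jet observables.
    obs_in, obs_out: dicts mapping observable name -> 1D array of per-jet values,
    e.g. {'mass': ..., 'energy': ..., 'pt': ..., 'eta': ..., 'phi': ...}."""
    return sum(wasserstein_distance(obs_in[k], obs_out[k]) for k in obs_in)
```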

The hyperparameter optimization was performed using the Optuna package [50], which allows for fast convergence thanks to aggressive pruning based on intermediate results and easy parallelization. Each ConVAE was set to go through 300 trials of the optimization, where each trial was executed for 300 epochs of training. The procedure was carried out in two steps. In the first, only the loss parameters and the optimizer LR were optimized while the network parameters were fixed at reasonable choices established after manual tests. In the second, all the network parameters were optimized given the best set of loss parameters. This was done to ensure convergence of the optimization procedure in a feasible amount of time given the available resources. During the optimization, the EMD${_{\small\textrm{sum}}}$ was minimized using the value evaluated on the validation dataset every 5 epochs.
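
Schematically, the first optimization step can be sketched with Optuna as below; `train_and_validate` is a hypothetical placeholder for the actual ConVAE training loop, and only a subset of the loss parameters is shown.

```python
import optuna

def objective(trial):
    """First step: loss parameters and learning rate only, architecture fixed (table 2)."""
    lr    = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    alpha = trial.suggest_float("alpha", 0.1, 1.0)
    gamma = trial.suggest_float("gamma", 0.1, 1.0)
    emd_sum_val = None
    for epoch in range(0, 300, 5):
        # train for 5 more epochs and evaluate EMD_sum on the validation set
        emd_sum_val = train_and_validate(lr, alpha, gamma, n_epochs=5)  # hypothetical helper
        trial.report(emd_sum_val, epoch)          # intermediate value used for pruning
        if trial.should_prune():
            raise optuna.TrialPruned()
    return emd_sum_val

study = optuna.create_study(direction="minimize", pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=300)
```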

To combine the ConVAE with the NF network (ConVAE+NF), a two-step training was performed. First, the hyperparameters of the ConVAE were optimized and the resulting best model was trained for 1500 epochs. The model with the smallest EMD${_{\small\textrm{sum}}}$ when comparing the input with reconstructed jets in the validation dataset, calculated every 5 epochs, was saved. The latent space values produced by the saved network when it receives the training and validation jet datasets as input were extracted to form new training and validation datasets for the NF. A diagram containing the full training procedure is displayed in figure 1.


Figure 1. Illustration of ConVAE+NF network training scheme.


The metric used to evaluate the combination of both networks as the hyperparameter optimization objective was EMD${_{\small\textrm{sum}}}$. After every four epochs, $N_\mathrm{latent}$ values were sampled from ${\mathcal{N}(0,1)}$ and input into the inverse of the NF network, whose output was given as input to the decoder of the ConVAE tuned for reconstruction, and its output was compared with jets from the validation dataset. The optimization was run for 300 trials with 20 epochs of NF training per trial. The best set of hyperparameters was used to train the final NF network for 100 epochs, and the model that exhibited the smallest metric value, checked after every epoch, was saved.
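
Once both networks are trained, generation reduces to the following sketch, where `nf_model.inverse` and `decoder` stand for the trained flow (inverted) and the decoder of the reconstruction-tuned ConVAE; these are hypothetical interfaces of the implementation, and the output still has to be mapped back from the standardized to the physical momentum range.

```python
import torch

@torch.no_grad()
def generate_jets(nf_model, decoder, n_jets, latent_dim, device="cuda"):
    """Jet generation with the trained ConVAE+NF: sample from N(0, 1),
    invert the flow, and decode with the reconstruction-tuned ConVAE decoder."""
    u = torch.randn(n_jets, latent_dim, device=device)   # samples from the base distribution
    z = nf_model.inverse(u)                               # map to the ConVAE latent distribution
    jets = decoder(z)                                     # (n_jets, 1, 3, 30) standardized momenta
    return jets
```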

5. Results and discussion

5.1. Hyperparameter optimization

As described above, the hyperparameter optimization was performed using the Optuna framework and executed in steps to speed up the convergence of the method given the limited computational resources. The ranges considered for each hyperparameter during the optimization of both types of NN are shown below. In the ConVAE case, the set of parameters to be optimized was: the LR of the optimizer (Adam); the loss function parameters α, β, γ, $\gamma_{{p_\mathrm{T}}}$, and γm ; and the architecture parameters, namely the number of latent dimensions, the number of filters, the kernel size related to the number of particles to convolve, the number of nodes in the linear layers, the number of convolution layers, and the activation function. For the 2D kernel size $(k_1, k_2)$, only k2, the number of particles to convolve, was varied, while k1, the number of particle features to convolve, was set to 3 for the first (last) convolution (deconvolution) layer of the encoder (decoder) and 1 otherwise.

The ranges of the parameters $\gamma_{{p_\mathrm{T}}}$ and γm were determined by the dataset, because the jet ${p_\mathrm{T}}$ and mass have different scales, which implies different magnitudes for those loss components. The range for $\gamma_{{p_\mathrm{T}}}$ was set from 20% below to 20% above unity, while the range for γm was set from 20% below to 20% above the ratio of the mean jet ${p_\mathrm{T}}$ to the mean jet mass. For the loss function hyperparameter optimization, the network architecture hyperparameters were fixed at the reasonable choices displayed in table 2. For both ConVAE networks (ConVAE and ConVAE+NF approaches), almost all the hyperparameter ranges in the optimization were the same, and are shown in table 3. The only exception was the loss hyperparameter β which, due to the difference in the magnitudes of $L_{\mathrm{rec}}$ and ${D_\mathrm{KL}}$, needed to be much closer to 1.0 for the ConVAE tuned for generation, where the optimization range was [0.9, 1.0] in log scale, and much closer to 0.0 for the ConVAE tuned for reconstruction, where the optimization range was [0.0, 0.1] in log scale.
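
For concreteness, these ranges can be derived from the training jets as in the short sketch below, where the per-jet `jet_pt` and `jet_mass` arrays are assumed to be available:

```python
import numpy as np

def gamma_ranges(jet_pt, jet_mass):
    """Search ranges for the loss weights from per-jet pT and mass arrays.
    For this dataset mean(pT)/mean(mass) is about 12.5, reproducing table 3."""
    ratio = np.mean(jet_pt) / np.mean(jet_mass)
    return (0.8, 1.2), (0.8 * ratio, 1.2 * ratio)   # (gamma_pT range, gamma_m range)
```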

Table 2. Architecture hyperparameters kept fixed during the optimization of the ConVAE loss parameters.

$N_\mathrm{latent}$ | Filters | Kernel | $N_\mathrm{linear}$ | Layers | Act. Func.
30 | 50 | 5 | 1500 | 3 | ReLU

Table 3. Ranges of the common ConVAE hyperparameters considered in the optimization. The choice of activation function was also optimized, chosen from one of ReLU, GeLU, LeakyReLU, SELU or ELU.

LR | α | γ | $\gamma_{{p_\mathrm{T}}}$ | $\gamma_m$
$10^{-5}$–$10^{-1}$ | 0.1–1.0 | 0.1–1.0 | 0.8–1.2 | 9.986–14.98
$N_\mathrm{latent}$ | Filters | Kernel | $N_\mathrm{linear}$ | Layers
10–300 | 5–100 | 1–8 | 100–3000 | 1–4

Since the number of hyperparameters of the RealNVP network is much smaller than for the ConVAE, its hyperparameter optimization was performed in a single step. The set of optimized hyperparameters was: the LR of the optimizer (Adam) and the network architecture hyperparameters, namely the number of flows, the number of linear layers of the s and t MLPs, and the number of nodes in each layer of those MLPs. For both the RealNVP network used together with the ConVAE and the one used by itself, the ranges of the hyperparameter optimization are shown in table 4.

Table 4. Ranges of architecture hyperparameters for the RealNVP optimization. The number of nodes in each linear layer ($N_\mathrm{linear}$) is optimized separately for each layer.

LR | $N_{\mathrm{flows}}$ | Layers | $N_\mathrm{linear}$
$10^{-5}$–$10^{-1}$ | 5–100 | 1–4 | 50–400
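
A sketch of the corresponding Optuna search space, with the number of nodes drawn separately for each MLP layer as in table 4, could read:

```python
import optuna

def nf_search_space(trial):
    """Sketch of the RealNVP search space of table 4; the number of nodes is
    suggested separately for each of the s/t MLP layers."""
    lr       = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    n_flows  = trial.suggest_int("n_flows", 5, 100)
    n_layers = trial.suggest_int("n_layers", 1, 4)
    nodes = [trial.suggest_int(f"n_linear_{i}", 50, 400) for i in range(n_layers)]
    return lr, n_flows, n_layers, nodes
```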

5.2. Experiments

We experiment with three approaches: a baseline ConVAE-only model, the ConVAE+NF model, and a second baseline NF-only model. The final set of optimized hyperparameters for the ConVAE-only model is displayed in table 5 and results in an EMD${_{\small\textrm{sum}}}$ value of 0.0033. Figure 2 shows the comparisons of the distributions of input and output jet mass, ${p_\mathrm{T}}$, energy, η and φ after hyperparameter optimization. Qualitatively, the distributions of jet energy, η, and φ are very similar, with differences only at the extremes of the histograms, while for ${p_\mathrm{T}}$ there is a slight difference at the peak of the distribution and small discrepancies at both tails. The greatest discrepancies are in the distributions of jet mass, which are only similar in the approximate range $80{\,\text{GeV}} \lt m_\mathrm{jet} \lt 175{\,\text{GeV}}$.


Figure 2. Comparison of input jets variable distributions (yellow) with randomly generated jets from the ConVAE model (green). From left to right, top to bottom: mass, energy, ${p_\mathrm{T}}$, η, and φ distributions are displayed. In the subpanels, the ratio ${{\mathrm{generated}}}/{{\mathrm{input}}}$ is shown.


Table 5. Set of optimal hyperparameters for the ConVAE tuned for generation.

LR | α | β | γ | $\gamma_{{p_\mathrm{T}}}$ | $\gamma_m$ | $N_\mathrm{latent}$ | Filters | Kernel | $N_\mathrm{linear}$ | Layers | Act. Func.
6.1$\times 10^{-4}$ | 0.449 | 0.998 | 0.118 | 0.817 | 11.7 | 190 | 75 | 3 | 1100 | 4 | ReLU

The complete sets of optimized hyperparameters for the ConVAE+NF model are displayed in tables 6 and 7, which result in an EMD${_{\small\textrm{sum}}}$ of 0.0026. The histograms of the jet variables are displayed in figure 3.


Figure 3. Comparison of input jets variable distributions (red) with randomly generated jets from the ConVAE+NF model (blue). From left to right, top to bottom: mass, energy, ${p_\mathrm{T}}$, η, and φ distributions are displayed. In the subpanels, the ratio ${{\mathrm{generated}}}/{{\mathrm{input}}}$ is shown.


Table 6. Set of optimal hyperparameters for ConVAE tuned for reconstruction.

LR | α | β | γ | $\gamma_{{p_\mathrm{T}}}$ | $\gamma_m$ | $N_\mathrm{latent}$ | Filters | Kernel | $N_\mathrm{linear}$ | Layers | Act. Func.
$5.3\times 10^{-4}$ | 0.425 | 0.046 | 0.121 | 1.15 | 11.8 | 180 | 50 | 2 | 3000 | 4 | ReLU

Table 7. Set of optimal hyperparameters for the NF to be used on the ConVAE+NF approach.

LR | $N_{\mathrm{flows}}$ | Layers | $N_\mathrm{linear}$
$2.9\times 10^{-4}$ | 55 | 4 | [400, 350, 400, 400]

Based on the EMD${_{\small\textrm{sum}}}$, there is a large improvement in the generative capabilities of the ConVAE+NF model compared to the ConVAE model. In particular, the agreement in the jet mass distribution improves dramatically, with disagreement only for masses smaller than 25 GeV. However, there remain notable deviations from unity in the ratios of the generated to input η, φ, and ${p_\mathrm{T}}$ histograms.

For comparison, a RealNVP network was also optimized, in the same way as the NF network of the ConVAE+NF approach, and trained to receive as input a flattened jet 8 , transform each feature distribution into a standard normal distribution, and invert the transformation back to flattened jets. Table 8 contains the set of best parameters after hyperparameter optimization for this network. The value of EMD${_{\small\textrm{sum}}}$ for this case was 0.0049, and the histograms of jet variables are shown in figure 4. The agreement between the distributions and the EMD${_{\small\textrm{sum}}}$ are significantly worse than for the ConVAE+NF, demonstrating the advantage of combining the inference capability of the ConVAE with the invertibility of the NF network.


Figure 4. Comparison of input jets variable distributions (grey) with randomly generated jets from the NF only model (purple). From left to right, top to bottom: mass, energy, ${p_\mathrm{T}}$, η, and φ distributions are displayed. In the subpanels, the ratio ${{\mathrm{generated}}}/{{\mathrm{input}}}$ is shown.


Table 8. Set of optimal hyperparameters for the NF.

LR | $N_{\mathrm{flows}}$ | Layers | $N_\mathrm{linear}$
3.4$\times 10^{-4}$ | 65 | 4 | [150, 200, 350, 150]

Following the work in [8, 51], another set of metrics was calculated and compared to the results obtained using the state-of-the-art message-passing generative adversarial network (MPGAN) on this dataset. The metrics $W_{1}^{M}$, $W_{1}^{P}$ and $W_{1}^\mathrm{EFP}$ correspond to the average 1-Wasserstein distances between input and generated distributions of the jet mass; of the constituents' $\eta^{\mathrm{rel}}$, $\phi^{\mathrm{rel}}$, and ${p_\mathrm{T}}^{\mathrm{rel}}$; and of five energy-flow polynomials (EFPs) [52], a set of variables that capture n-point particle correlations and jet substructure. The EFPs were computed using the JetNet package [53], with the default choice of EFPs defined in the package. Also measured are the Fréchet ParticleNet Distance (FPND), the ParticleNet [54] adaptation of the Fréchet inception distance [55] that is standard in computer vision, as well as coverage (COV) and minimum matching distance (MMD), which measure, respectively, the fraction of samples in X (e.g. the generated jets) that have a match in Y (e.g. the input jets) and the average distance between matched samples. Table 9 shows that, among the models implemented in this work, the ConVAE+NF approach is the best in almost every metric. However, MPGAN outperforms the VAE-based approaches in every single metric. The advantage of the ConVAE+NF approach is that it is approximately two times faster: the average time to generate a single jet is $18.30 \pm 0.04$ µs for ConVAE+NF versus 35.7 µs for MPGAN, measured using an NVIDIA Tesla T4 GPU. Nevertheless, when compared with the EPiC-GAN model, which also achieves strong physics performance and can generate a single jet in 2 µs, the ConVAE+NF model still lacks speed. It is nonetheless interesting to continue developing the latter; since the techniques applied in the two models are completely distinct, their strengths and weaknesses can be expected to be complementary. Additionally, the availability of multiple tools for the same objective is usually expected by the HEP community.

Table 9. Set of metrics proposed in [8, 51] to compare the performance of distinct ConVAE and NF approaches. In bold are the best values of the metrics when comparing the methods implemented in this work. The last rows display the performance of the MPGAN and EPiC-GAN for the same gluon jet dataset.

Model | $W_{1}^{M}$ ($\times10^{-3}$) | $W_{1}^{P}$ ($\times10^{-3}$) | $W_{1}^\mathrm{EFP}$ ($\times10^{-5}$) | FPND | COV$\uparrow$ | MMD
ConVAE | 9.1 ± 0.6 | 8.5 ± 0.8 | 576 ± 103 | 543.6 | 0.32 | 0.036
ConVAE+NF | 4.5 ± 0.5 | 5.3 ± 0.4 | 197 ± 247 | 34.3 | 0.38 | 0.034
NF | 12.7 ± 0.7 | 8.5 ± 0.8 | 8.6k ± 13.2k | 93.6 | 0.38 | 0.033
MPGAN | 0.7 ± 0.2 | 0.9 ± 0.3 | 0.7 ± 0.2 | 0.12 | 0.56 | 0.037
EPiC-GAN | 0.3 ± 0.1 | 1.6 ± 0.2 | 0.4 ± 0.2 | 1.01 | — | —

From the comparisons of the low- and high-level jet feature distributions, we observe that the mixed ConVAE+NF approach exhibits an improvement over the ConVAE alone. We attribute this to two reasons: (1) the standard Gaussian distribution, used by the ConVAE alone but altered in the ConVAE+NF approach, is not the optimal probability distribution for the elements of the latent space, and (2) the decompression of those latent values is improved by using the parameters of the ConVAE tuned for reconstruction. Still, there are discrepancies between the ConVAE+NF generated jets and the input jets, mainly related to the substructure variables, which are not present for the MPGAN approach. Beyond the structural distinction between GAN and VAE, one of the core differences between the two approaches is that the MPGAN architecture is a graph neural network. The representation of jet constituents as an unordered set of particles is more natural, intrinsically preserves permutation invariance, and most likely contributes to the differences between the ConVAE and MPGAN results.

6. Summary and outlook

We presented a novel technique in the context of hadronic jet generation in simulated high-energy pp collisions, based on an ML approach that combines a ConVAE with NFs. This method provides an opportunity to explore and enhance our capabilities for creating point-cloud objects in a broader context, with a particular emphasis on its potential application within the field of high energy physics. From the values of EMD${_{\small\textrm{sum}}}$ obtained, there was a clear improvement with respect to the work in [13], not only from the application of hyperparameter optimization to the ConVAE, but also from the usage of the ConVAE+NF technique. The generation of gluon jets was performed using the ConVAE+NF model and compared with baseline ConVAE and NF models alone, as well as the state-of-the-art MPGAN. The ConVAE+NF approach is superior to the first two, but needs further improvements to be comparable to the latter.

There is, however, an improvement in inference time compared to MPGAN, as the ConVAE+NF can generate a jet twice as fast, though it still lacks the speed to be comparable with EPiC-GAN. Even if the ConVAE+NF is not the fastest network, it still offers advantages in physics applications where large amounts of simulated samples are needed but perfect accuracy is not a strict requirement; since the techniques used in each model are completely distinct, their strengths and weaknesses can be expected to be complementary, especially in the HEP context, where more than one method for the generation of new data is desirable. It is worth noting that the MPGAN architecture is based on a graph NN, which provides a much more natural representation for a particle-based jet dataset. In future work, a graph-based VAE+NF can be implemented, trained, and tested to check for improvements in the jet generation capabilities and timing.

Acknowledgments

B O, J F, R C and T T are supported by Grant 2018/25225-9, São Paulo Research Foundation (FAPESP). B O was also partially supported by Grants 2019/16401-0 and 2020/06600-3, São Paulo Research Foundation (FAPESP). R C and T T were also partially supported by Grant 2022/02950-5, São Paulo Research Foundation (FAPESP). R K was partially supported by an IRIS-HEP fellowship through the U.S. National Science Foundation (NSF) under Cooperative Agreement OAC-1836650. R K was additionally supported by the LHC Physics Center at Fermi National Accelerator Laboratory, managed and operated by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the U.S. Department of Energy (DOE). J D is supported by the DOE, Office of Science, Office of High Energy Physics Early Career Research program under Award No. DE-SC0021187, the DOE, Office of Advanced Scientific Computing Research under Award No. DE-SC0021396 (FAIR4HEP), and the NSF HDR Institute for Accelerating AI Algorithms for Data Driven Discovery (A3D3) under Cooperative Agreement OAC-2117997. M P and M T were supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (Grant Agreement No. 772369). D G is partially supported by the EU ICT-48 2020 Project TAILOR (No. 952215).

Data availability statement

The data that support the findings of this study are openly available at the following URL/DOI: 10.5281/zenodo.3601436.

Footnotes

  • In the LHC experiments, the origin of the right-handed coordinate system is the center of the local pp collision region, the y axis is defined vertically upward, and the z axis is along the proton beam direction. Additional coordinates are the azimuthal angle φ and pseudorapidity $\eta = -\ln \tan (\theta/2)$, where θ is the polar angle.

  • A.k.a., the reparametrization trick [34].

  • The implementation of the RealNVP network in this work was performed following the GitHub repository https://github.com/ispamm/realnvp-demo-pytorch.

  • The input jet 3 × 30 feature matrix representing each particle's ${p_{x}}$, ${p_{y}}$ and ${p_{z}}$, was flattened to a 1 × 90 feature vector containing all particles' ${p_{x}}$, then all ${p_{y}}$ and, finally, all ${p_{z}}$.
