1 Introduction

A cornerstone of modern high-energy collider physics is first-principles simulations that encode the underlying physical laws. These simulations are crucial for performing and interpreting analyses, linking theoretical predictions and experimental measurements. However, the most precise simulations are computationally demanding and may become prohibitively expensive for data analysis in the era of the High-Luminosity Large Hadron Collider (HL-LHC) [1, 2]. As a result, various fast simulation approaches have been developed to mimic precise simulations with only a fraction of the computational overhead.

In the language of machine learning (ML), fast simulations are called surrogate models. While most fast simulations use some flavor of machine learning, a growing number of proposed surrogate models are based on deep learning. Deep generative models include Generative Adversarial Networks (GANs) [3, 4], Variational Autoencoders [5, 6], Normalizing Flows [7, 8], and Diffusion Models [9,10,11], which often make use of score matching [12,13,14].

In collider physics, deep learning-based surrogate models have been proposed for all steps in the simulation chain, including phase-space sampling, amplitude evaluation, hard-scatter processes, minimum bias generation, parton shower Monte Carlo, and detector effects (see e.g. Refs. [15, 16] for recent reviews). In this paper, we focus on the emulation of hard-scatter event generation [17,18,19,20,21], although the approaches are more general.

Simulations can be viewed as deterministic functions from a set of fixed and random numbers into structured data. The fixed numbers represent physics parameters like masses and couplings, while the random numbers give rise to outgoing particle properties sampled from various phase space distributions. For deep generative models, the random numbers are typically sampled from simple probability densities like the Gaussian or uniform probability distributions. The random numbers going into a simulation constitute the latent space of the generator. A key challenge with deep generative models is achieving precision. One way to improve the precision is to modify the map from random numbers to structure. A complementary approach is modifying the probability density of the generated data space or the latent space. This approach is what we investigate in this paper.

Normalizing flows have shown impressive precision for a variety of HEP applications [21,22,23,24,25,26,27]. However, these models suffer from topological obstructions due to their bijective nature [28,29,30], which for instance become relevant for multi-resonant processes and at sharp phase-space boundaries that are often introduced by kinematic cuts. Topological obstructions do not only affect normalizing flows but can also limit the performance of autoencoders [31]. If the phase-space structure is known, the problematic phase-space regions can either be smoothed by a local transformation [21] or the flow can be directly defined on the correct manifold [32]. However, we generally do not know where these problematic regions lie in the highly complex and multi-dimensional phase space. Moreover, even if we identify and fix problematic patterns in a lower-dimensional subspace, we are not guaranteed to cure all higher-dimensional topological structures. Therefore, we study two complementary strategies to improve the precision of generative networks affected by topological obstructions and other effects. These strategies are both generalizable and computationally inexpensive.

As a first method, we use the Laser protocol [30] which relies on the refinement of the latent space of the generative model. The Laser protocol is an extension of the DctrGAN [33] method that is not specific to GANs and circumvents the limited statistical power of weighted events by propagating the weights into the latent space. While this idea has also been proposed solely for GANs in Ref. [34], the Laser approach works for any generative model and further extends Ref. [34] by allowing more sophisticated sampling techniques.

In the second method, we employ augmented normalizing flows [28, 35,36,37]. In contrast to standard normalizing flows, the feature space is augmented with additional variables such that the latent space can be chosen to have a higher dimension than the original feature space. As a result, possibly disconnected patches in the feature space, which are difficult to describe with deep generative models, can be connected through the additional dimensions, simplifying the mapping the network has to learn. In order to recover the original feature space, we use a lower-dimensional slice of the generated feature space.

The Laser protocol and the augmented flow show excellent results when applied to simple toy examples. In this paper, we apply both methods to a complicated high-energy physics example for the first time and show how they alleviate current limitations in precision while being conservative regarding network sizes and the number of parameters. Note that augmentation describes only one possible incarnation of a surjective transformation, some of which have already been applied successfully to HEP data in Ref. [26]. In our studies, we consider W-boson production with associated jets [38,39,40,41,42,43,44,45] at the LHC, i.e.

$$\begin{aligned} pp \rightarrow W + \text {jets}, \end{aligned}$$
(1.1)

as illustrated in Fig. 1. Owing to the high multiplicity of the multi-jet process and phase space restrictions (e.g. jet \(p_{\textrm{T}}\) thresholds), the associated phase space is large and topologically complex, with many non-trivial features that need to be correctly addressed by the generative models.

Fig. 1 Examples of LO Feynman diagrams contributing to the process

This paper is organized as follows. In Sect. 2, we introduce and compare both methods and describe their implementation. We define and list different data representations in Sect. 3 and introduce the Mahambo algorithm. As a first application, we show simple 2-dimensional toy examples to illustrate our approaches in Sect. 4. Afterwards, we apply both methods to a fully comprehensive LHC example and investigate the impact of various data representations on the models’ performance. Finally, we conclude in Sect. 5.

2 Methods

The baseline method for both of our models is a normalizing flow [7, 8]. Normalizing flows rely on the change of variables formula

$$\begin{aligned} \log p_{g({\mathcal {Z}})}(x)=\log p_{\mathcal {Z}}(g^{-1}(x)) + \log \left| \frac{\partial g^{-1}(x)}{\partial x}\right| , \end{aligned}$$
(2.1)

which explicitly encodes an estimate of the probability density \(p_{g({\mathcal {Z}})}\approx p_{{\mathcal {X}}}\) by introducing a bijective mapping g (parameterized as a neural network) from a latent space \({\mathcal {Z}}=\{z\in {\mathbb {R}}^N\vert z\sim p_{\mathcal {Z}}\}\) to a feature space with data described by \({{\mathcal {X}}}=\{x\in {\mathbb {R}}^N\vert x\sim p_{{\mathcal {X}}}\}\). In contrast, for VAEs and GANs, the mapping g does not need to be bijective. However, VAEs only yield a lower bound of the density (evidence lower bound, or ELBO) and for GANs the posterior is only defined implicitly.
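To make Eq. (2.1) concrete, the following minimal sketch evaluates the induced density for a hand-written one-dimensional affine map; the scale and shift values and the standard-normal latent density are illustrative assumptions, not part of the trained models used later.

```python
import torch

# Minimal illustration of the change-of-variables formula, Eq. (2.1),
# for a hand-written invertible map g(z) = a*z + b in one dimension.
# In a real normalizing flow, a and b would be outputs of trained
# coupling-layer networks; here they are fixed placeholder values.
a, b = torch.tensor(2.0), torch.tensor(0.5)
latent = torch.distributions.Normal(0.0, 1.0)   # p_Z

def log_prob_x(x):
    z = (x - b) / a                              # g^{-1}(x)
    log_pz = latent.log_prob(z)                  # log p_Z(g^{-1}(x))
    log_det = -torch.log(torch.abs(a))           # log |d g^{-1} / dx|
    return log_pz + log_det

x = torch.tensor([0.5, 1.0, 2.0])
print(log_prob_x(x))   # log p_{g(Z)}(x) for a few test points
```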

We study two methods that aim to improve the precision and expressiveness of the generative model g by either modifying the latent space prior \(p_{\mathcal {Z}}\) or by post-processing the data space, either with importance weights or by augmenting the feature space \({{\mathcal {X}}}\) with an additional variable \(r\sim q(r\vert x)\). These methods are illustrated in Fig. 2 and are described in more detail in the following.

Fig. 2 A schematic diagram illustrating the augmented flow (left) and the Laser protocol (right)

2.1 Latent space refinement

As the first method, we use the Laser protocol [30], illustrated on the right of Fig. 2. The Laser method acts as a post-hoc procedure that modifies the latent space prior \(p_{\mathcal {Z}}\) to improve the generated samples x while keeping the generative model g fixed. For this method, the baseline generator g(z) can be any generative model or physics simulator (as long as we have access to the input random numbers) and does not have to be bijective in general. Throughout this paper, we choose g as a normalizing flow for simplicity and to make it more directly comparable with the second method.

As a first step in the Laser protocol, we train a classifier f to distinguish samples x generated by g and samples drawn from the truth data probability distribution, \(p_{{{\mathcal {X}}}}\). If the generative model g is part of a GAN, it is possible to use the discriminator d as the classifier [34], but it might be more appropriate to train a new unbiased network from scratch [33]. If this classifier is trained with the binary cross entropy (BCE) and converges to its optimum, it obeys the well-known (see e.g. Refs. [46, 47]) property

$$\begin{aligned} f(x)\approx \frac{p_{{\mathcal {X}}}(x)}{p_{g({\mathcal {Z}})}(x)+p_{{\mathcal {X}}}(x)}. \end{aligned}$$
(2.2)

This can be used to assign a weight

$$\begin{aligned} w(x)=\frac{f(x)}{1-f(x)}\approx \frac{p_{{\mathcal {X}}}(x)}{p_{g({\mathcal {Z}})}(x)}, \end{aligned}$$
(2.3)

to each generated sample. This idea has been widely studied in HEP for simulation reweighting and inference (see e.g. Refs. [48,49,50,51,52,53,54,55,56,57,58,59,60]) and is the core of the Dctr protocol [57] that has also been proposed for simulation surrogate refinement [33, 61] so we will refer to events weighted in this way as Dctr.
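As an illustration of Eqs. (2.2)–(2.3), the following sketch trains a small classifier with binary cross entropy and converts its output into per-event weights; the architecture, training loop and the placeholder tensors x_gen and x_truth are assumptions for the example, not the setup used in our studies.

```python
import torch
import torch.nn as nn

# Sketch of classifier-based (Dctr-style) reweighting: label generated
# samples 0 and truth samples 1, train with binary cross entropy, and
# turn the classifier output into weights w(x) = f(x) / (1 - f(x)).
dim = 10
f = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                  nn.Linear(128, 128), nn.ReLU(),
                  nn.Linear(128, 1), nn.Sigmoid())

x_gen = torch.randn(5000, dim)           # placeholder "generated" events
x_truth = torch.randn(5000, dim) + 0.1   # placeholder "truth" events
x = torch.cat([x_gen, x_truth])
y = torch.cat([torch.zeros(len(x_gen), 1), torch.ones(len(x_truth), 1)])

opt = torch.optim.Adam(f.parameters(), lr=1e-3)
for _ in range(100):                     # a few illustrative full-batch steps
    opt.zero_grad()
    loss = nn.functional.binary_cross_entropy(f(x), y)
    loss.backward()
    opt.step()

with torch.no_grad():
    fx = f(x_gen).clamp(1e-6, 1 - 1e-6)  # avoid division by zero
    w = fx / (1 - fx)                    # Dctr weights for the generated events
```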

If weighted events are acceptable, then one could stop here and use the weights in the data space (as was proposed in Ref. [33]). For Laser, we next backpropagate this weight to the corresponding latent space point \(g(z_i)=x_i\).

If g is not bijective, then the weights pulled back from the data space to the latent space can be made a proper function of the latent space via Step 2 of the OmniFold algorithm [30, 58]. Finally, we sample from the new weighted latent space \((z,\bar{w}(z))\). The Laser protocol proposes three possible ways to do so:

  • Rejection sampling: The weights \(\bar{w}(z)\) can be used to perform rejection sampling on the latent space. In general, however, the weights are broadly distributed, which makes this method inefficient and the least favored option.

  • Markov chain Monte Carlo: The weights \(\bar{w}(z)\) induce a modified probability distribution

    $$\begin{aligned} q(z,\bar{w})= p_{\mathcal {Z}}(z)\,\bar{w}(z)/Z_0, \end{aligned}$$
    (2.4)

    with some normalization constant \(Z_0\). If both the probability q and its derivative \(\partial q/\partial z\) are tractable, we can employ a Hamiltonian Monte Carlo (HMC) [62, 63] algorithm for sampling. Differentiability is natural when w is a neural network.

  • Refiner network \(\Phi \): Finally, if the derivative \(\partial q/\partial z\) is not tractable, the weighted latent space \((z, \bar{w}(z))\) can instead be converted into an unweighted latent space. For this, another generative model \(\Phi \) is trained on the weighted latent space [22, 64] to generate unweighted samples following the same probability distribution \(p_\Phi (z)\approx q(z,\bar{w})\). The output of the refinement network \(\Phi \) can then be used as an input to the generative model g.

In contrast to the MCMC algorithm, the refiner network \(\Phi \) requires the training of an additional generative model, which might be affected by the same topological obstructions or even introduce new types of problems. Therefore, when applying the Laser protocol in our experiments, we always use an HMC algorithm to sample from the refined latent space when possible.
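For illustration, the refined latent density of Eq. (2.4) can be written as a differentiable (unnormalized) log-probability whenever the weight is given by a neural-network classifier, which is what the HMC step requires. In the sketch below, generator and classifier stand for a trained flow g and classifier f; both names and their exact interfaces are assumptions of this example.

```python
import torch

# Unnormalized refined latent log-density, log q(z) = log p_Z(z) + log wbar(z),
# with wbar(z) = w(g(z)) obtained from the classifier via Eq. (2.3).
# The normalization constant Z_0 drops out of HMC, so it is ignored here.
def log_q(z, generator, classifier):
    log_pz = torch.distributions.Normal(0.0, 1.0).log_prob(z).sum(dim=-1)
    x = generator(z)                                  # map latent point to data space
    f = classifier(x).squeeze(-1).clamp(1e-6, 1 - 1e-6)
    log_w = torch.log(f) - torch.log(1.0 - f)         # log w(x)
    return log_pz + log_w

# Gradient of log q with respect to z, as needed by the leapfrog integrator.
def grad_log_q(z, generator, classifier):
    z = z.detach().requires_grad_(True)
    log_q(z, generator, classifier).sum().backward()
    return z.grad
```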

2.2 Augmented normalizing flows

Augmented normalizing flows [28, 35,36,37] bridge the gap between VAEs and NFs. While they allow for (partially) bijective mappings, they also tackle the bottleneck problem [36] by increasing the dimensionality of the original data. As has been pointed out in great detail in Ref. [37], augmented flows represent only a particular example of surjective normalizing flows. In general, there are multiple and potentially interesting surjective transformations that might enhance the overall performance of the network [26]. In our studies, we restrict ourselves to investigating only the effect of an augmentation transformation. As all necessary formulae for surjective transformations have been derived and documented exhaustively in the literature, we only briefly explain how the augmentation transformation is defined and refer to Ref. [37] for details.

Augmentation (or slicing in the inverse direction) starts by augmenting the data x with an additional \(D_k\) dimensional random variable \(z_2\in {\mathbb {R}}^{D_k}\) which is drawn from a distribution \(q(z_2)\). While not necessary in our application, the more general case allows for \(z_2\) to be conditional on the data x. In all our examples, \(q(z_2)\) is simply given by a normal distribution, i.e. \(q(z_2)={\mathcal {N}}(0,1)\). This means that the forward and the inverse passes are given by

$$\begin{aligned} \text {forward:}&\quad x \rightarrow [x, z_2]\ \text {with}\ z_2\sim q(z_2), \nonumber \\ \text {inverse:}&\quad [x, z_2] \rightarrow x. \end{aligned}$$
(2.5)

As this mapping is surjective in the inverse direction but stochastic in the forward direction, the full likelihood of this transformation is not tractable. Consequently, the augmented flow is optimized using only the tractable likelihood contribution, the ELBO, which connects surjective flows with VAEs.
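A minimal sketch of the augmentation and slicing operations, assuming the unconditional Gaussian q(z₂) used in our examples, could look as follows; the returned −log q(z₂) term is the bookkeeping contribution that enters the ELBO together with the log-likelihood of the augmented vector.

```python
import torch

# Sketch of the augmentation / slicing transformation around Eq. (2.5):
# the forward (inference) direction appends D_k Gaussian noise features
# to the data, the inverse (generative) direction simply drops them again.
D_k = 2
noise = torch.distributions.Normal(0.0, 1.0)     # q(z_2), unconditional here

def augment(x):
    z2 = noise.sample((x.shape[0], D_k))
    elbo_term = -noise.log_prob(z2).sum(dim=-1)  # -log q(z_2), ELBO bookkeeping
    return torch.cat([x, z2], dim=-1), elbo_term

def slice_back(x_aug):
    return x_aug[:, :-D_k]                       # discard the augmented dimensions
```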

We note that a similar idea was also introduced in Ref. [65] where a noise-extended INN was used to match the dimensions of parton-level and detector-level events to better describe the stochastic nature of detector effects.

3 Data representation

While it is well known that proper data preprocessing can increase the accuracy and reliability of a neural network, it is often difficult to find the best data representation for any particular dataset.

In order to better understand which data representations are well suited in the context of event generation, we investigate several different proposals from the literature and develop our own parametrizations, which are built to enhance the sensitivity of the neural network depending on its specific task. In detail, we consider parton-level events from a hadronic \(2\rightarrow n\) scattering process with an intrinsic dimensionality of \(d=3n-2\). We implemented the following preprocessings:

  1. 1.

    4-Mom: We parametrize each final-state particle in terms of its four-momentum and represent the entire event by 4n features

    $$\begin{aligned} x = \{p_1, p_2,\ldots , p_n\} \end{aligned}$$
    (3.1)

    where n denotes the number of final-state particles, and standardize each feature according to

    $$\begin{aligned} {\tilde{x}}_i = \frac{x_i - \mu _i}{\sigma _i}. \end{aligned}$$
    (3.2)
  2. 2.

    MinRep: We start by considering the 3-momenta of all final-state particles, neglecting the energy component due to the on-shell conditions. Further, in order to obey momentum conservation in the transverse plane, we omit the \(p_x\) and \(p_y\) components of the last particle and treat them as dependent quantities. This reduces the dataset to a \(3n-2\) dimensional data representation. Again, we normalize all features using the standardization in Eq. (3.2).

  3. 3.

    Precisesiast : We follow the preprocessing suggested in Ref. [21] and thus denote it as the precision enthusiast (Precisesiast) parametrization. For parton-level events, it starts by representing each final state in terms of

    $$\begin{aligned} \{p_\text {T},\eta ,\phi \}, \end{aligned}$$
    (3.3)

    followed by the additional transformations

    (3.4)
    (3.5)
    (3.6)

    to Gaussianize all features. This leads to a 3n-dimensional representation.

  4. 4.

    Mahambo : In this preprocessing, we map the phase space onto a unit hypercube with \(3n-2\) dimensions. Hence, we reduce the representation to the relevant degrees of freedom, similar to the MinRep approach. While the MinRep parametrization depends on the ordering of the momenta and is hence potentially prone to instabilities, the Mahambo preprocessing relies on the Rambo [66, 67] algorithm, which populates the phase space uniformly (at least for massless final states). The Mahambo algorithm is described in detail in Sect. 3.1.

  5. 5.

    Elfs : This starts from a Precisesiast-inspired representation

    $$\begin{aligned} \{\log p_\text {T}, {\widetilde{\phi }}, \eta \} \end{aligned}$$
    (3.7)

    and additionally augments it with the features

    $$\begin{aligned} \{{\text {atanh}}(2\Delta \phi _{ij}/\pi -1), \log (\Delta \eta _{ij}), \log (\Delta R_{ij})\} \end{aligned}$$
    (3.8)

    for all possible jet pairs ij, where we define

    $$\begin{aligned} \Delta \eta _{ij}&= \vert \eta _i - \eta _j \vert \nonumber \\ \Delta \phi _{ij}&= {\left\{ \begin{array}{ll} \vert \phi _i - \phi _j \vert , &{} \text {if}\ \vert \phi _i - \phi _j \vert < \pi \\ 2\pi - \vert \phi _i - \phi _j \vert , &{} \text {otherwise} \end{array}\right. }\nonumber \\ \Delta R_{ij}&= \sqrt{(\Delta \phi _{ij})^2 + (\Delta \eta _{ij})^2}. \end{aligned}$$
    (3.9)

    This yields a \((3n+k)\)-dimensional representation, where \(k=3\) and \(k=9\) for the 2- and 3-jet events considered in our studies, respectively. We therefore denote it as the enlarged feature space (Elfs) representation; a short sketch of the pairwise features is given after this list.

The list of parametrizations above is not exhaustive, but it covers common representations that we expect to be representative.
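As an illustration of the Elfs augmentation, the following sketch computes the pairwise features of Eqs. (3.8)–(3.9); it assumes numpy arrays eta and phi of shape (n_events, n_jets) and makes no attempt to regularize the edge cases Δφ = π or Δη = 0, where the transformations diverge by construction.

```python
import numpy as np

# Pairwise jet features of Eqs. (3.8)-(3.9) used in the Elfs representation.
def pairwise_features(eta, phi):
    feats = []
    n_jets = eta.shape[1]
    for i in range(n_jets):
        for j in range(i + 1, n_jets):
            deta = np.abs(eta[:, i] - eta[:, j])
            dphi = np.abs(phi[:, i] - phi[:, j])
            dphi = np.where(dphi < np.pi, dphi, 2.0 * np.pi - dphi)   # Eq. (3.9)
            dR = np.sqrt(dphi**2 + deta**2)
            feats += [np.arctanh(2.0 * dphi / np.pi - 1.0),           # Eq. (3.8)
                      np.log(deta),
                      np.log(dR)]
    return np.stack(feats, axis=1)   # shape (n_events, 3 * n_pairs)
```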

3.1 The Mahambo algorithm

In order to reduce the dataset to the relevant degrees of freedom while avoiding the limitations of the MinRep parametrization, which is sensitive to the order of the particle momenta, we employ a modified version of the Rambo [66, 67] algorithm which allows for an invertible mapping between momenta p and random numbers r living on the unit-hypercube \(U=[0,1]^d\). While the original Rambo [66] algorithm is not invertible as it requires 4n random numbers to parametrize n particle momenta, the RamboOnDiet [67] algorithm solves this problem and only requires \(3n-4\) random numbers. By construction, the RamboOnDiet algorithm is defined for a fixed partonic center-of-mass (COM) energy and provides momenta within this frame.

However, as we are considering scattering processes at hadron colliders the corresponding events only obey momentum conservation in the transverse plane and are only observed in the lab frame. Consequently, we need two additional random numbers \(r_{3n-3}\), \(r_{3n-2}\) that parametrize the momentum fractions \(x_1\) and \(x_2\) of the scattering partons, which are relevant for the PDFs and connect the lab frame with the partonic COM frame via a boost along the z-axis. This boost can be parametrized in terms of a rapidity parameter

$$\begin{aligned} \zeta = \frac{1}{2}\log \frac{x_1}{x_2}={\text {atanh}}\frac{q^3}{q^0},\quad \text {with}\ q=\sum _i p_i. \end{aligned}$$
(3.10)

In particular, we choose the following parametrization

$$\begin{aligned} \tau = \tau _{\text {min}}^{r_{3n-3}},\quad x_1=\tau ^{r_{3n-2}},\quad x_2=\tau ^{1-r_{3n-2}}, \end{aligned}$$
(3.11)

with \(\tau _{\text {min}}\) reflecting a possible cut on the squared partonic center-of-mass energy \({\hat{s}}>{\hat{s}}_{\text {min}}=\tau _{\text {min}}s\). We denote the modified algorithm as Mahambo.

The derivation and proofs for the unmodified parts adopted from RamboOnDiet are given in Ref. [67].

Algorithm 1 The Mahambo algorithm

The Mahambo procedure to generate a single phase-space point x from random numbers r proceeds through the following steps:

  1. 1.

    Sample the partonic momentum fractions \(x_1\) and \(x_2\) according to Eq. (3.11).

  2. 2.

    Define the partonic COM energy \(\sqrt{{\hat{s}}}=\sqrt{x_1 x_2 s}\) and the Lorentz boost parameter \(\zeta \) according to Eq. (3.10).

  3. 3.

    Pass \(\sqrt{{\hat{s}}}\) to the RamboOnDiet algorithm to generate n massless 4-momenta in the partonic COM frame.

  4. 4.

    Possibly reshuffle these four-momenta following the procedure in Ref. [66] to obtain massive final states.

  5. 5.

    Boost all momenta into the lab frame with \(\Lambda _z(\zeta )\).

The full details of the mapping between the \(3n-2\) random numbers \(r_i\) and the n 4-momenta are given in Algorithm 1, and its inverse in Algorithm 2; a minimal code sketch of the Mahambo-specific steps is given at the end of this subsection.

Algorithm 2 The inverse Mahambo algorithm

To solve the equations in lines 5 and 15 of Algorithm 1 and line 4 of Algorithm 2, we use Brent’s method [68]. The function \(\rho (M_{i-1},M_i,m_{i-1})\) in line 9 corresponds to the two-body decay factor defined as

$$\begin{aligned} \rho (a,b,c) =\frac{\sqrt{(a^2-(b+c)^2)(a^2-(b-c)^2)}}{8a^2}, \end{aligned}$$
(3.12)

which simplifies in the massless \((m_i=0)\) case to

$$\begin{aligned} \rho (M_{i-1},M_i,0)=\dfrac{M_{i-1}^2-M_{i}^2}{8M_{i-1}^2}. \end{aligned}$$
(3.13)
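The Mahambo-specific steps wrapped around RamboOnDiet can be sketched as follows. The helper rambo_on_diet, standing for Algorithm 1 of Ref. [67], is assumed and not reproduced here, the massive reshuffle of step 4 is omitted, and the momentum layout (E, px, py, pz) is an assumption of this sketch.

```python
import numpy as np

# Sketch of the Mahambo forward pass: Eqs. (3.10)-(3.11) plus the boost
# into the lab frame, with RamboOnDiet as an assumed external helper.
def mahambo(r, n, s, tau_min, rambo_on_diet):
    # Step 1: momentum fractions from the last two random numbers, Eq. (3.11)
    tau = tau_min ** r[-2]
    x1, x2 = tau ** r[-1], tau ** (1.0 - r[-1])
    # Step 2: partonic COM energy and boost rapidity, Eq. (3.10)
    sqrt_shat = np.sqrt(x1 * x2 * s)
    zeta = 0.5 * np.log(x1 / x2)
    # Step 3: n massless momenta in the partonic COM frame (assumed helper),
    # each row laid out as (E, px, py, pz)
    momenta = rambo_on_diet(r[:-2], n, sqrt_shat)
    # Step 5: boost along the z-axis into the lab frame, Lambda_z(zeta)
    E, pz = momenta[:, 0].copy(), momenta[:, 3].copy()
    momenta[:, 0] = np.cosh(zeta) * E + np.sinh(zeta) * pz
    momenta[:, 3] = np.sinh(zeta) * E + np.cosh(zeta) * pz
    return momenta, x1, x2
```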

4 Experiments

We demonstrate the performance of Laser and augmented flows in two numerical experiments. We start with two illustrative toy examples in Sect. 4.1, before turning to a particle physics problem in Sect. 4.2. For the more complicated example, we further perform a dedicated analysis of the optimal parametrization choice for both the generative model and the classifier needed for the Dctr or Laser approach.

In order to have full control of all parameters and inputs, we implemented our own version of Hamiltonian Monte Carlo in PyTorch, using its automatic differentiation module to compute the gradients needed to run the algorithm. For all examples, we initialize a total of 100 Markov chains running in parallel to reduce computational overhead and solve the equations of motion numerically using the leapfrog algorithm [63] with a step size of \(\epsilon = 0.01\) and 30 leapfrog steps. To avoid artifacts originating from the random initialization of the chains, we start with a burn-in phase and discard the first 5000 points from each chain.
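A minimal single-chain version of this sampler, assuming a differentiable unnormalized log-density log_q on the latent space (for instance the one sketched in Sect. 2.1), could look as follows; the step size and number of leapfrog steps are the values quoted above, while everything else is illustrative.

```python
import torch

def grad(log_q, z):
    z = z.detach().requires_grad_(True)
    log_q(z).sum().backward()
    return z.grad

# One HMC update with a leapfrog integrator (eps = 0.01, 30 steps).
def hmc_step(z, log_q, eps=0.01, n_leapfrog=30):
    p = torch.randn_like(z)                          # resample momenta
    z_new, p_new = z.clone(), p.clone()
    p_new = p_new + 0.5 * eps * grad(log_q, z_new)   # initial half kick
    for _ in range(n_leapfrog - 1):
        z_new = z_new + eps * p_new                  # drift
        p_new = p_new + eps * grad(log_q, z_new)     # full kick
    z_new = z_new + eps * p_new
    p_new = p_new + 0.5 * eps * grad(log_q, z_new)   # final half kick
    # Metropolis accept/reject on the joint energy of (z, p)
    h_old = -log_q(z).detach() + 0.5 * (p**2).sum(dim=-1)
    h_new = -log_q(z_new).detach() + 0.5 * (p_new**2).sum(dim=-1)
    accept = torch.rand_like(h_old) < torch.exp(h_old - h_new)
    return torch.where(accept.unsqueeze(-1), z_new.detach(), z)
```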

4.1 Toy examples

We consider two different toy examples commonly used in the literature: a probability density defined by thin concentric circles and another density described by a circular array of two-dimensional Gaussian random variables. These are illustrated pictorially in the leftmost column of Fig. 3.

4.1.1 Network architecture

We use a normalizing flow with affine coupling blocks as the baseline model in all toy examples. Our implementation relies partially on SurVAE [37]. The NF consists of 8 coupling blocks, each containing a fully connected multi-layer perceptron (MLP) with two hidden layers, 32 units per layer, and ReLU activation functions. The model is trained for 100 epochs by minimizing the negative log-likelihood with an additional weight decay of \(10^{-5}\). The optimization is performed using Adam [69] with default \(\beta \) parameters and \(\alpha _0 = 10^{-3}\). We employ an exponential learning rate schedule, i.e. \(\alpha _n = \alpha _0 \gamma ^{n}\) with \(\gamma = 0.995\), which decays the learning rate after each epoch.
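The stated optimizer settings correspond to a setup like the following sketch; the small placeholder network only stands in for the coupling-block flow, and the inner training loop is omitted.

```python
import torch

# Adam with default betas, initial learning rate 1e-3, weight decay 1e-5,
# and the exponential schedule alpha_n = alpha_0 * gamma^n (gamma = 0.995)
# applied once per epoch.
flow = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.ReLU(),
                           torch.nn.Linear(32, 2))   # placeholder for the NF

optimizer = torch.optim.Adam(flow.parameters(), lr=1e-3, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.995)

for epoch in range(100):
    # ... minimize the negative log-likelihood over the training batches ...
    scheduler.step()   # decay the learning rate after each epoch
```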

In the scenario using the augmented flow, we extend the dimensionality from \(2\rightarrow 4\) by appending an additional 2-dimensional random variable \(r\sim {\mathcal {N}}(0,1)\) to the feature vector x.

For the classifier, we use a simple feed-forward neural network consisting of 8 layers with 80 units each and leaky ReLU activations [70] in the hidden layers. We note that none of these hyperparameters has been extensively optimized in our studies.

Fig. 3 Comparison of the baseline flow model, the Laser refined output and the augmented flow for the Circle (top) and Gaussians (bottom) toy examples

4.1.2 Probability distance measures

In the 2-dimensional toy examples, we can quantify the performance gain of our Laser-flow and the augmented flow compared to our baseline flow model. For this, we implemented different f-divergences [71]. All f-divergences used in this paper are shown in Table 1.

Table 1 A comparison of different f-divergences and the earth mover distance (EMD)

Unlike the total variation (V), the squared Hellinger distance (H\(^2\)) and the Jensen–Shannon divergence (JS) are not proper metrics as they do not obey the triangle inequality. In contrast to the Kullback–Leibler (KL) divergence, which is also commonly used, we only considered f-divergences that are symmetric in p and q.

Table 2 Earth mover distance (EMD), total variation (V), squared Hellinger distance (H\(^2\)) and Jensen–Shannon divergence (JS) on test data for the baseline model and the proposed methods for various two-dimensional examples. The best results are written in bold face. The errors on the scores are indicated in parenthesis

Additionally, we implemented the earth mover’s distance (EMD) [72] between two probability distributions p(x) and q(x). The EMD is directly connected to an optimal transport problem. If the two distributions are represented as sets of tuples \(P=\{(x_i,p(x_i))\}_{i=1}^N\) and \(Q=\{(y_j,q(y_j))\}_{j=1}^{M}\), the EMD is the minimum cost required to transform P into Q. This transformation is parameterized by discrete probability flows \(f_{ij}\) from \(x_i\in P\) to \(y_j\in Q\) and reads

$$\begin{aligned} {\textrm{EMD}}(p,q)&=\min _{\{f_{ij}>0\}}\sum _{i=1}^N \sum _{j=1}^{M}\,f_{ij}\,||x_i - y_j||_2,\nonumber \\ \sum _{j=1}^{M}\,f_{ij}&\le p(x_i),\quad \sum _{i=1}^{N}\,f_{ij}\le q(y_j),\nonumber \\ \sum _{i=1}^N \sum _{j=1}^{M}\,f_{ij}&=\min \!\left( \sum _{i=1}^{N}p(x_i), \sum _{j=1}^{M}q(y_j)\right) , \end{aligned}$$
(4.1)

where \(\vert \vert \cdot \vert \vert _2\) is the Euclidean norm and the optimal flow \(f_{ij}\) is found by solving the optimization problem. For an efficient numerical implementation we use the POT library [73]. An advantage of the EMD over the f-divergences is that it also takes the metric space of the points \(x_i\) and \(y_j\) into account.
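A minimal example of this evaluation with POT, using placeholder two-dimensional point clouds with uniform weights, is sketched below; the function names are the library's own (ot.dist, ot.emd2), but the data are purely illustrative.

```python
import numpy as np
import ot  # Python Optimal Transport (POT)

# Sketch of the EMD of Eq. (4.1) between two weighted point clouds.
x = np.random.randn(500, 2)          # placeholder samples from p
y = np.random.randn(500, 2) + 0.5    # placeholder samples from q

p = np.full(len(x), 1.0 / len(x))    # weights p(x_i)
q = np.full(len(y), 1.0 / len(y))    # weights q(y_j)

cost = ot.dist(x, y, metric='euclidean')   # ||x_i - y_j||_2 cost matrix
emd = ot.emd2(p, q, cost)                  # minimal total transport cost
print(emd)
```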

All f-divergences introduced above have been evaluated in a histogram-based way using 1M data points for all models and examples. By comparing the same metrics for two equally large but independent samples drawn from the truth distribution, we can estimate the uncertainty of these quantities. The extracted error estimates are indicated in parentheses in Table 2.

We note that we used a total of \(10^4=100\times 100\) bins to provide a detailed test of the phase space for evaluating all metrics while still providing stable results for the given statistics. We did not optimize the number of bins, but found that the qualitative conclusions are unaffected by the details of this choice.
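The histogram-based evaluation can be sketched as follows, using standard definitions of the total variation, squared Hellinger and Jensen–Shannon expressions on a common 100 × 100 binning; the normalization conventions in Table 1 may differ by constant factors, and the samples a and b are placeholders for truth and model data.

```python
import numpy as np

# Histogram-based f-divergences on a 100 x 100 binning of the 2-d plane.
def binned_densities(a, b, bins=100):
    lo = np.minimum(a.min(axis=0), b.min(axis=0))
    hi = np.maximum(a.max(axis=0), b.max(axis=0))
    edges = [np.linspace(lo[k], hi[k], bins + 1) for k in range(2)]
    p, _, _ = np.histogram2d(a[:, 0], a[:, 1], bins=edges)
    q, _, _ = np.histogram2d(b[:, 0], b[:, 1], bins=edges)
    return p / p.sum(), q / q.sum()

def divergences(a, b):
    p, q = binned_densities(a, b)
    tv = 0.5 * np.abs(p - q).sum()                       # total variation
    h2 = 0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum()    # squared Hellinger
    m = 0.5 * (p + q)
    kl = lambda u, v: np.sum(u[u > 0] * np.log(u[u > 0] / v[u > 0]))
    js = 0.5 * kl(p, m) + 0.5 * kl(q, m)                 # Jensen-Shannon
    return tv, h2, js
```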

4.1.3 Results

In the first toy example, illustrated in the top panel of Fig. 3, we considered multiple concentric circles representing a multimodal distribution that is topologically different from the simple latent space given by a Gaussian random variable. Consequently, the baseline flow fails to properly learn the truth density and generates fake samples, resulting in a blurred distribution that does not properly separate the modes. In contrast, the Laser improved generator as well as the AugFlow show highly improved distributions, which is both easily visible by eye and quantified by the lower divergences given in Table 2. Overall, when looking at the scores in Table 2, the performance of the Laser improved model is superior to both the baseline model and the AugFlow, except for the EMD metric, where the AugFlow yields the best score.

Similarly to the first example, the baseline model is again not expressive enough to adequately describe the second toy example, illustrated in the lower panel of Fig. 3. Both Laser and AugFlow are nearly indistinguishable from the truth by eye, which is reflected in numerically even lower scores than for the circle example in Table 2. In this case, the Laser improved flow gives the best score for all evaluated metrics and outperforms the AugFlow. This indicates that the Laser protocol typically performs better than the augmentation with additional dimensions, an observation we will make again in the higher-dimensional physics examples in Sect. 4.2.

Finally, we want to remark that simply ramping up the number of parameters in the baseline model would also improve its performance in these simple low-dimensional toy examples. However, this parameter growth scales poorly for more complicated and higher-dimensional distributions, leading to increasingly minor improvements. Further, as Ref. [30] noted, making a bijective generative model larger will not fundamentally overcome the topology problem.

4.2 Physics example

As a representative HEP example, we consider the

$$\begin{aligned} pp \rightarrow W + \{2,3\}~\text {jets} \end{aligned}$$
(4.2)

scattering process. For all simulations in this paper, we consider a centre-of-mass energy of \(\sqrt{s}= 14\, {\textrm{TeV}}\). All events used as training and test data are generated at parton level using MadGraph5_aMC@NLO [74] (v3.4.1). We use the NNPDF 3.0 NLO PDF set [75] with a strong coupling constant \(\alpha _s(M_{\textrm{Z}}) = 0.119\) in the four flavor scheme provided by LHAPDF6 [76]. In order to stay in the perturbative regime of QCD we apply a cut on the transverse momenta of the jets as

$$\begin{aligned} p_{{\textrm{T, j}}}>20\,{\textrm{GeV}}. \end{aligned}$$
(4.3)

Additionally, we require the jets to fulfill

$$\begin{aligned} \vert \eta _{\textrm{j}}\vert <6,\quad \Delta R_{{\textrm{jj}}}>0.4. \end{aligned}$$
(4.4)

We have 1M data points for each jet multiplicity and train on 800k events each. The network is trained on each multiplicity separately. In general, it is also possible to set up an architecture which can be trained on different multiplicities simultaneously [21, 26].

4.2.1 Network architecture

For the physics case considered in this work, the affine coupling blocks are replaced with rational quadratic spline (RQS) blocks [77] with 10 spline bins unless mentioned otherwise. In order to cope with the higher dimensions, we now use 14 coupling blocks, where each block has 2 hidden layers, 80 units per layer and ReLU activation functions. We now employ a one-cycle training schedule with a starting and maximum learning rate of \(\alpha _0 = 5\cdot 10^{-4}\) and \(\alpha _{\text {max}} = 3\cdot 10^{-3}\), respectively. The networks are trained for 50 epochs. In contrast to the toy examples, the classifier now consists of 10 layers with 128 units each, with all other parameters unchanged.

Fig. 4 Transverse momentum distributions for the W-boson and the two jets (top), and high-level angular distributions (bottom) for the process. We show the baseline model as well as AugFlow and a Laser improved version. The baseline model uses the 4-vector parametrization introduced in Eq. (3.1)

A possible bottleneck when running the Laser protocol in terms of an HMC on the latent space is the necessary fine tuning of the step size \(\epsilon \). Usually, the average acceptance rate of a proposed phase-space point is a good indication of the quality of the chosen hyperparameters. In an optimal scenario, the acceptance rate of an HMC should be around 0.65 [63]. Consequently, during the initial burn-in phase, we adapt the step size to achieve the optimal acceptance rate of 0.65 on average. We found that this additional adaptive phase was only important for the physics examples considered.

Further, for the augmented flow, we have freedom in choosing the dimensionality of the augmented random number \(r\in {\mathbb {R}}^{D_r}\). We find that for both physics examples considered we obtain the best performance when choosing \(D_r=d\), where d is the dimension of the used data representation. We note, however, that we did not perform any extensive tuning of this parameter.

4.2.2 Two-jet events

Before diving into the results of the various tested methods, we want to emphasize that the structure and storyline of what is coming next are pedagogical. We will start by using a very naive model to highlight and explain the reason for its limitations. Only then will we move forward by alleviating most of these issues and finish this section with a recommended workflow for multi-dimensional event generation.

Naive baseline model In our first study, shown in Fig. 4, we tested a simple affine flow as the baseline model, trained on events given in the naive 4-Mom parametrization. This is the simplest and most straightforward parametrization possible and consequently serves as a benchmark point for all further improvements. Thus, we also kept the affine blocks in this example to keep the architecture as close as possible to the ones used in the toy examples. Within this parametrization, we train our classifier model, which subsequently yields the necessary weights to perform the Laser protocol. As we can see in Fig. 4, both Laser and the AugFlow improve the precision of the baseline network. However, the improvement obtained by the AugFlow is minimal compared to the large gain in precision from the Laser protocol. In particular, we find that the more expressive the baseline model is, the less effective the AugFlow is in these high-dimensional examples. Consequently, we omit further comparisons of the AugFlow with the baseline model for all other tests performed and focus on techniques that are more relevant for reaching a higher precision.

Different parametrizations in the generator Next, we study the effect of different data preprocessings on the quality of the generated events. In particular, we compare the parametrizations 4-Mom, MinRep, Precisesiast and Mahambo introduced in Sect. 3. In order to be more compatible with state-of-the-art generators, we now replace the affine blocks with RQS blocks and keep them throughout the rest of our analyses. In the various distributions shown in Fig. 5, we see that the previously tested 4-Mom parametrization is strictly the worst, likely because it has the highest dimensionality. This leaves the network with the task of learning the likelihood of a lower-dimensional manifold within this higher-dimensional representation, making it harder to match the redundant features. Consequently, the naive MinRep parametrization, which merely omits all redundant features in the data representation, outperforms the 4-Mom preprocessing and achieves higher precision in all observables.

The results become even better when picking either the Precisesiast or Mahambo preprocessing, as can be seen in Fig. 5. At first glance, this seems to somewhat conflict with our conjecture about dimensionality, since the Precisesiast parametrization only reduces the data representation to 3n dimensions and is thus not minimal. However, this parametrization is deliberately designed to perform well at the otherwise hard-to-learn phase-space boundaries introduced by the cuts. As a result, this parametrization yields the best precision in all transverse-momentum observables. In contrast, the Precisesiast parametrization fails to correctly describe one of the angular distributions. This is in line with the observations made in Ref. [21], where the authors introduced a ‘magic’ transformation to smooth out the hard phase-space cut. In comparison, within the Mahambo preprocessing, the flow renders this phase-space boundary automatically and is generally more precise in the connected angular observables shown in the lower panel of Fig. 5. While there is indeed still room for improvement even for the flows using the Precisesiast and Mahambo representations, we postpone the combination of better preprocessing with refinement (via the Laser protocol) to the three-jet process, where the impact will be more important.

Fig. 5 Transverse momentum distributions (top) for the final state particles and high-level angular distributions (bottom) for the process. We show the results for different parametrizations used for the generation

4.2.3 Three-jet events

The three-jet case is much more complex than the two-jet case because the dimensionality of the data manifold is relatively smaller than the data space dimensionality.

However, as the relevant phase space is bigger than in the two-jet case, this allows us to study the scaling behavior of our surrogate model for increasing multiplicities and hence phase-space dimensions. This study is relevant because the expected precision of the surrogate model, which is trained on data following a given underlying distribution, scales solely with the dimension and is not affected by the intrinsic precision of the training data. In other words, given a fixed phase-space dimensionality, the surrogate model’s performance will not degrade when replacing the LO training data with more accurate NLO or even NNLO simulation data. Hence, we only test our methods on easy-to-generate LO data for simplicity and without loss of generality. Finally, we expect the effective gain of the surrogate model to be more significant for more complicated processes without losing precision. This has been shown in Ref. [61] and can be explained by the fact that the surrogate model is generally much faster to evaluate than event weights derived from first principles. This speed advantage becomes even bigger for NLO and NNLO scenarios, where the evaluation time of the matrix elements becomes the limiting aspect.

Different parametrizations in the generator reloaded First, we study the impact of preprocessing in Fig. 6 (the three-jet analog of Fig. 5). We omit the 4-Mom parametrization and observe that the MinRep parametrization is already within 10% across the phase space and often better than or at least as good as Precisesiast and Mahambo. By design, Precisesiast is excellent for the jet transverse momenta but struggles with some of the \(\Delta R\) observables. Mahambo is the worst for the second jet transverse momentum but the best at the \(\Delta R\) and \(\Delta \eta \) between the subleading two jets. Elsewhere, MinRep and Mahambo are comparable.

Fig. 6 Transverse momentum distributions for the W-boson and the two leading jets (top) and high-level angular distributions (bottom) for the process. We show the results for different parametrizations used for the generation

Classifier improved event generation The remaining figures show how reweighting can improve the precision in both the data space and the latent space.

Initially, the classifier was fed directly with the events generated in the Mahambo preprocessing, meaning the discriminator acts on a 10-dimensional hypercube. However, as shown by the red curve in Fig. 7, the classifier does not learn any sensible weights and the weight distribution sharply peaks at \(w(x)\approx 1\). This means the classifier cannot distinguish between the generated data and the truth using the Mahambo parametrization, even though we tested a rather large network size to ensure that the performance is not limited by too few network parameters. Therefore, to make better use of the classifier, we change the data parametrization before feeding it into the network to enhance the classification capabilities as much as possible. In particular, if the classifier inputs directly contain critical features such as \(\Delta R\), \(\Delta \eta \) and \(\Delta \phi \), the obtained weights provide much more information, as can be seen from the weight distributions for the Precisesiast and Elfs parametrizations shown in Fig. 7.

First, Fig. 8 demonstrates the reweighting in the data space when generating events in the Mahambo parametrization and feeding the data into the classifier in the Elfs parametrization. In this case, the weights defined by the classifier are no longer a proper function of the latent space, which naively does not allow us to employ the Laser protocol. Since the data space and latent space do not have the same dimensionality, we have to use the OmniFold protocol to pull back the weights to the latent space. However, pulling the weights back onto the latent space with OmniFold turns out to be as difficult as obtaining useful classifier weights when training the classifier in the Mahambo representation. Apparently, it is difficult to detect the non-trivial deformations of the Gaussian latent space required to alleviate the small deficiencies in the Mahambo representation.

Instead, we can use these weights to reweight the data space directly, which is the basis of the Dctr protocol. As was observed in many other studies that employ classifier-based reweighting, including for ML-based simulation surrogate refinement [33, 61], we find that the weighted data space is nearly indistinguishable from the truth and can capture sharp and subtle features.

Fig. 7 Learned classifier weight w(x) for different parametrizations applied to events generated in the Mahambo parametrization

However, the weighted samples in the data space have less statistical power due to the variance of the weights. Figure 9 explores how we can pull back the data-space weights to the latent space to produce unweighted examples in the data space. We start with Elfs instead of Mahambo, since pulling back the Mahambo weights was ineffective because it requires OmniFold. If we instead use Elfs in both the generator and the discriminator, the obtained weights are already proper weights of the latent space and allow using the Laser protocol. Without any refinement, Elfs cannot capture many of the key features of the phase space. However, Dctr works well, and the improvements translate almost verbatim when pulled back to the latent space and integrated into Laser. This includes all sharp and subtle features, especially in the variables that describe angles between jets. Even though Elfs starts further from the truth, the refinement makes it closer than Mahambo.

5 Conclusions and outlook

Neural network surrogate models are entering an era of precision, where a qualitative description is replaced with quantitative agreement across the full phase space. Using W+jets matrix element simulations as an example, we explore how modifying the data before, during, and after generation can enhance the precision.

Unsurprisingly, preprocessing is found to have a significant impact on downstream precision. Therefore, we proposed Mahambo, a variation of RamboOnDiet, to automatically reduce the data to its minimal-dimensional representation without the need for process-specific interventions. This approach matched or outperformed other parametrizations across most of the phase space.

In agreement with previous studies, we find that classifier-based reweighting is able to precisely match the truth since the support of the probability density for our surrogate models is a superset of the truth.

We also study two latent space refinement methods to avoid weighted samples. The best approach for the example was Laser, where the data-space weights are used in the latent space to produce unweighted examples.

Fig. 8 Transverse momentum distributions for the W-boson and the two leading jets (top) and high-level angular distributions (bottom) for the process. We show the results for the pure Mahambo parametrization and including Dctr reweighting

Fig. 9 Transverse momentum distributions for the W-boson and the two leading jets (top) and high-level angular distributions (bottom) for the process. We show the results for the pure Elfs parametrization and including Dctr and Laser refinement

While our primary focus has been on matrix element generation, the presented ML approaches have the potential for broader applications in the study of surrogate models. Furthermore, these techniques can be adapted for physics-based simulators, particularly in cases where the random numbers within the program are accessible so that the latent space can be refined. Finally, improving surrogate models to align accurately with physics-based simulations or observational data holds immense potential for enhancing various downstream inference tasks. We eagerly anticipate further exploration of these connections in future endeavors.