Learning to discover: expressive Gaussian mixture models for multi-dimensional simulation and parameter inference in the physical sciences

We show that density models describing multiple observables with (i) hard boundaries and (ii) dependence on external parameters may be created using an auto-regressive Gaussian mixture model. The model is designed to capture how observable spectra are deformed by hypothesis variations, and is made more expressive by projecting data onto a configurable latent space. It may be used as a statistical model for scientific discovery in interpreting experimental observations, for example when constraining the parameters of a physical model or tuning simulation parameters according to calibration data. The model may also be sampled for use within a Monte Carlo simulation chain, or used to estimate likelihood ratios for event classification. The method is demonstrated on simulated high-energy particle physics data considering the anomalous electroweak production of a $Z$ boson in association with a dijet system at the Large Hadron Collider, and the accuracy of inference is tested using a realistic toy example. The developed methods are domain agnostic; they may be used within any field to perform simulation or inference where a dataset consisting of many real-valued observables has conditional dependence on external parameters.


Introduction
In the physical sciences, we often use statistical methods to make quantifiable statements about how compatible experimental observations are with different hypotheses about nature. These frameworks, typically frequentist or Bayesian, usually require us to model the expected probability density function (PDF) for all possible observations, conditioned on the hypotheses of interest. Finding such a parameterization for the PDF can be very challenging when data are multi-dimensional.
Within experimental particle physics, often the problem is simplified by observing only one or two dimensions of the data at a time following some initial data selections. For these low-dimensional measurements, we are able to approximate the PDF either parametrically or using histograms, allowing for statistical interpretation of the data. To ensure these simplified measurements contain maximum sensitivity to the processes of interest, hereafter referred to as the 'signal' in contrast with the 'background' of all other processes contained in the dataset, we only select data in regions of phase space for which the frequency of signal is high relative to the background. We note several disadvantages of this approach: (a) By analyzing data only in select regions of phase space, we lose any potentially useful information contained within other regions. (b) Different hypotheses may predict different distributions of the data in the high-dimensional space. However, we lose this information when collapsing data into one or two dimensions. (c) When analyzing histograms, the binning of data discards finely-grained information about the shape of the distribution.
(d) The experimentalist must manually design the selection criteria, observables and binning, making it difficult to ensure that an analysis provides fully optimized sensitivity to all accessible regions of the theory parameter space.
If the expected signal and background PDFs can be modeled parametrically in a space spanning all data dimensions, the PDF ratio contains the expected signal-to-background-ratio at every point in phase space. This means that we do not require restrictive data selections to optimize statistical sensitivity to the signal component. We also do not require binning. The information described above is therefore retained and may be used to provide greater exclusion and discovery potential for all possible new physics models 1 .
It has recently been demonstrated that machine-learned density models may be constructed which describe PDFs (or PDF ratios) in a high-dimensional observable space [1][2][3][4][5][6][7][8][9]. Provided that model bias can be mitigated and systematic uncertainties properly described, these can be used to perform parameter inference or construct likelihood ratios for event classification 2 .
Many PDF models may also be sampled from, which is not the case when exclusively modeling the PDF ratio. This has several benefits: (a) We can verify that the distribution obtained by sampling the model is well-behaved when compared with the training data. Such cross checks are desirable in the physical sciences, where rigorous data interpretation is emphasized. (b) It may be used to generate new datasets at arbitrary points in parameter space, which the model accomplishes by interpolating between the external parameter values at which training data were provided. (c) We can numerically estimate the expected distribution of a test-statistic under different parameter hypotheses, instead of assuming an asymptotic form. This aids in the estimation of rigorous frequentist confidence limits. (d) Once trained, sampling from the density model may be more computationally efficient than running the full simulation package used to generate training data. In this context, density models provide a compelling alternative to other stochastic generative models such as generative adversarial networks [13] and variational auto-encoders [14,15] for performing steps in a simulation chain [16][17][18][19].
In this work, we will show that density models describing multiple observables with (1) a complex multi-dimensional distribution, (2) hard boundaries and (3) dependence on external parameters may be created using an auto-regressive Gaussian mixture model (GMM) [5,6,20,21] 3 . The model is made more expressive by projecting data onto a configurable latent space. The method is designed to capture how observable spectra are continuously deformed as the external parameters are varied, behaviour which is common in the physical sciences. We hope that this work will provide users with a simple but expressive way to model such datasets in their own domains.
To study the performance of our method on a high-dimensional dataset of physically realistic observables, we use simulations of particle physics data sensitive to anomalies in the electroweak production of a Z boson in association with a dijet system. We demonstrate the degree to which our trained density models can describe this data, capturing how it is deformed as two physical parameters are varied. We then use a toy example, in which we can access the ground-truth PDF, to demonstrate that accurate parameter estimates and exclusion limits may be obtained using our method. This is not possible using the physical example because we do not have access to the ground-truth PDF with which to compare. This paper is structured as follows. In section 2 we describe the generation of training data used throughout the paper, and explain the physical basis behind it. In section 3 we describe how data are transformed onto the latent space and how the density model is built. We then discuss several features of the model. In section 4 we construct a 12-dimensional model to study the ability to describe a highly multi-dimensional dataset. In section 5 we construct a 4-dimensional model with dependence on two external parameters to study the ability to learn the parameter dependence. In section 6 we study the accuracy of inference using our toy example. In section 7 we conclude. 1 Here we consider only the optimization of statistical sensitivity and assume that the PDFs can be modeled with sufficient accuracy and well-described systematic uncertainties. This may be challenging in a real-world analysis which includes data-driven constraints and regions with large systematic effects. 2 See e.g. [10][11][12] for alternative approaches for enhancing sensitivity to new physics models using machine-learned classifiers and anomaly detection. 3 Whilst we were unable to find examples which combine all these properties, Bishop [20] and Variani et al [21] provide examples of GMMs for density estimation parameterized using neural networks and Papamakarios et al [5] and Uria et al [6] of auto-regressive density estimation used to model multi-dimensional data.
Whilst these experiments demonstrate that the method is performant on datasets of realistic observables within the domain of high-energy physics, we emphasize that it may be used to model any dataset of continuous observables for which a high-dimensional PDF is deformed by parameter variations, regardless of scientific domain, provided that appropriate training data may be provided.

Experimental setup
To test our method in a real-world environment, we consider the electroweak production of a Z boson in association with a dijet system occurring in high-energy proton-proton collisions at the Large Hadron Collider. This process is labeled EW Zjj in the remainder of this text. It is often referred to as the Vector Boson Fusion production of a Z boson. We choose to model the EW Zjj process for several reasons. Firstly, it provides a number of physically interesting observables which are correlated, challenging our method to capture a feature-rich high-dimensional distribution. Secondly, there exist new physics models which are expected to continuously deform this distribution in distinct ways as different parameters-of-interest are varied. Finally, it is a process of interest for current and future LHC experiments. Nonetheless, we emphasize that the EW Zjj process is intended to be a representative example using which we test the ability of our method to overcome general modeling challenges, and we hope that the method may be used to model smoothly-varying parameter-dependent high-dimensional datasets in any domain.
Each 'event' is the observation of many particles created by a single proton-proton collision. High-energy physics datasets typically consists of O (100 − 100 M) events, depending on the pre-selection criteria applied. By identifying the particles produced, and measuring their kinematic properties and other high-level 'observables' , we study the processes which contributed to their production.
The EW Zjj process is characterized by a final state of two jets of hadrons along with two oppositely charged electrons or muons which are produced by a Z-boson decay. Since the EW Zjj process is defined by a t-channel exchange of a colour-neutral weak boson between the two incoming partons, these jets are typically separated by a wider rapidity than in the dominant background process which contains a t-channel exchange of a gluon. As a result, experimental analyses often select events with a large dijet rapidity separation (or large invariant mass) to enhance the proportion of signal within their sample. We may measure the event rate as a function of many observables. We expect that the presence of certain new particles/forces will induce distortions in the shape or magnitude of these spectra relative to the precise predictions of the Standard Model of Particle Physics (SM). These measurements enable a rich discovery potential for new natural phenomena and the derivation of constraints on the theoretical models describing them.
The binned one-dimensional kinematic spectra of particles produced via EW Zjj in high-energy proton-proton collisions were recently measured [22,23] by the ATLAS experiment [24]. Exclusion limits were derived for several parameters of the SM effective field theory (SMEFT) in the Warsaw basis [25], which characterize the presence of any novel physics phenomena in such interactions. In this work, we consider how EW Zjj events are affected by variations of the SMEFT parameters c HWB andc W . These parameters extend the SM Lagrangian L SM by the addition of two non-renormalizable terms with mass dimension six. These additional terms modify how electroweak bosons interact with one another, impacting the rate and expected kinematic distribution of EW Zjj events. These modifications reflect the indirect effects of new physics interactions above some energy scale Λ which is not directly probed by the experiment. We will assume Λ = 1 TeV throughout, noting that other choices simply correspond to a re-scaling of c HWB andc W within this parameterization. The effective Lagrangian is [25][26][27]: where H is the Higgs doublet, τ are the Pauli matrices, W µν and B µν are the electroweak field strength tensors, ε are anti-symmetric tensors with ϵ 012 = ϵ 0123 = 1,W µν = 1 2 ϵ µν ρσ W ρσ and we neglect Hermitian conjugates. In this work, we use simulated events to construct high-dimensional statistical models which describe many of the kinematic observables considered in the ATLAS analysis. Of the six parameters constrained within the ATLAS analysis, we choose to study c HWB andc W because they are shown to vary the expected PDF in distinctly different ways. Simultaneously modeling both parameters therefore provides a more ambitious test for the efficacy of our methods.
Ground truth events are generated using the Madgraph5 (MG5) [28] program with perturbative calculations at leading order in the strong coupling constant. This models the primary high-energy interaction of interest, simulating the resultant array of particles and their properties. Subsequent hadronization of these particles and modeling of the underlying event [29,30] are simulated using Pythia8 [31,32]. Definition and selection of stable and detectable particles produced in the collision is performed using Rivet [33]. Neural networks are implemented using TensorFlow v2.4.3 interfaced with Keras v2.4.0 [34,35]. SMEFT interactions are implemented in MG5 using the SMEFTSim [36] package. 1M datapoints are generated at the Standard Model value of (c HWB ,c W ) = (0, 0). 400k datapoints are generated in increments of 0.1 on the intervalc W ∈ [−0.4, 0.4] with c HWB = 0, excluding the SM configuration. 200k datapoints are generated in a 2D grid with increments of 0.2 on the intervalc W ∈ [−0.4, 0.4] and increments of 2 on the interval c HWB ∈ [−4, 4], excluding pairs with c HWB = 0.
All objects are defined at particle level, i.e. after parton showering and hadronization (as they would appear in a particle detector). Testing our method on such a dataset demonstrates that it fulfills the key objective of this work: to effectively model a high-dimensional PDF of physically realistic observables with external parameter dependence. Since the method is not restricted to any particular experiment or domain, we do not simulate the effects of detector efficiency and resolution when generating our training data. However, we note that end-users who wish to perform (for example) parameter estimation using detector-level experimental data can accomplish this by simulating the impact of their detector when generating their own training data. We expect this to smear the PDF, but not impact the key modeling challenges identified above. We emphasize that there are no practical barriers preventing the modeling of detector-level datasets for use within a given experimental context.

EW Zjj event selection and observable definitions
Selection requirements and observables of interest are chosen based on the recent ATLAS measurement [22], and the ATLAS co-ordinate system [24] is used throughout with all observables defined in the laboratory reference frame.
All final state objects are required to satisfy a pseudorapidity of |η| ⩽ 5. Electrons and muons are 'dressed' [37] with photons within a cone of ∆R ⩽ 0.1. Electrons are required to satisfy p T ⩾ 25 GeV and have |η| < 2.47 excluding 1.37 < |η| < 1.52 where p T is the momentum component transverse to the beamline. Muons are required to satisfy p T ⩾ 25 GeV and |η| < 2.4. Jets arise from collimated streams of stable particles and are clustered [38] from all final state particles excluding muons and neutrinos using the anti-k T algorithm [39] within a cone of ∆R ⩽ 0.4. Reconstructed jets are required to satisfy p T ⩾ 30 GeV and have a rapidity of |y| < 4.4. Jets are rejected if they fall within ∆R ⩽ 0.2 of a selected electron, to reflect the limitations of a real detector in accurately distinguishing jets and electrons produced at small angular separations.
Events are required to have at least two selected electrons or muons, where the two leptons with the highest p T are used to define the dilepton system and are required to have opposite charge. Events are also required to contain two selected jets, and the two jets with the highest p T are used to define the dijet system. The following observables are calculated from the selected objects: • m ll , p ll T and |y ll | are respectively the mass, transverse momentum and absolute rapidity of the dilepton system. • m jj , p jj T and |y jj | are respectively the mass, transverse momentum and absolute rapidity of the dijet system. • p j1 T and p j2 T are the transverse momenta of the highest and second-highest p T jets.
• ∆ϕ (j, j) is the angular spread of the dijet system in a plane transverse to the beamline, measured clockwise with respect to the highest rapidity jet and defined on a domain of [−π, π]. • |∆y (j, j) | is the absolute rapidity spread of the dijet system. • N jet is the number of selected jets, and N gapjet is the number of selected jets which have a rapidity in the interval bounded by the rapidities of the two highest p T jets. Table 1 shows the intervals over which these observables are defined. Events are rejected if any observable falls outside of its interval. The total selection efficiency is estimated to be 64% using the events simulated under the SM hypothesis.

Method overview
Consider that we measure datapoints x ∈ X on an n-dimensional observable space X ≡ R n . The PDF is p(x|θ), where θ ∈ Θ represents the set of parameters of interest and nuisance parameters. This conditional dependence allows us to constrain a set of possible physical models according to their consistency with experimental observations.

Gaussian mixture models
We can model a conditional one-dimensional density p(x|θ) by simulating data for a variety of θ and fitting this with a conditional GMM. This parameterizes the density as a linear sum of Gaussian distributions according to: where N G labels the number of Gaussian modes; N is a Gaussian PDF; f ϕ,g , µ ϕ,g and σ ϕ,g are respectively the amplitude, mean and width of the g th Gaussian subject to NG g=1 f ϕ,g = 1 and f ϕ,g ⩾ 0 ∀ g; ϕ label the parameters of a neural network used to capture the functional forms of f ϕ,g , µ ϕ,g and σ ϕ,g (see e.g. [20,21]).
We use mixture models in this work because they allow us to model arbitrarily complex positive-definite distributions which can be analytically normalized to unity and easily sampled from. This is achieved by writing the density as the linear sum of simple parametric probability distributions. They are often used to model multi-modal data [40], and are well-suited for our probability spectra which we can imagine as being composed from a series of overlapping local probability masses. Each local mass may be modeled as having a different dependence on the external parameters θ, allowing us to express how every region of the spectrum is deformed when θ is varied. In this work we use Gaussian distributions to model each local mass of density. This is because they are simple distributions (each defined by only two parameters) which are peaked in the center and smoothly vary to 0 without excessively sharp or sparse tails, ensuring continuity in the model and retaining the local nature of the probability mass. They are also easily normalized and sampled from.
However, there are several ways in which the shape of p(x|θ) may not be well-suited to a GMM: (a) GMMs naturally model a smooth turn-off at the boundaries of a distribution, whereas the data distribution may have hard boundaries due to strict physical constraints or event pre-selection. (b) The structural features of the PDF, and any deformations induced by variations of θ, must be smooth and wide enough to be modulated by the Gaussian modes. (c) In order to deform the PDF downwards, the model must contain a Gaussian mode with finite amplitude local to the deformation, the amplitude of which can be modulated downwards without impacting the rest of the distribution 4 .
Points (b) and (c) mean that a GMM which is dominated by few wide Gaussian modes will have limited ability to describe local deformations of the PDF as θ is varied. Instead, we wish to have a distribution which is described by a spectrum of many narrow overlapping Gaussian modes and which contains no deformations narrower than the Gaussians themselves. We will now show that these conditions may be achieved by transforming the data and using a suitable network architecture to model f ϕ,g , µ ϕ,g and σ ϕ,g . We find that this method resolves the failure conditions listed above in the experiments presented.

Modeling a single observable
Datapoints are projected by a function h : x → u ∈ U onto a latent space U ≡ R n . The properties of the projection may be tuned to optimize the performance of a GMM describing the density p ϕ (u|θ). We will now explore this idea using our EW Zjj example.
Consider the case where x = ∆ϕ (j, j) is the only observable. Figure 1 (left) shows the probability density p (x) for the SM case of c HWB =c W = 0. This plot is obtained by histogramming the datapoints simulated using MG5. We note that this distribution has hard physical boundaries at [−π, π] which a GMM would be unable to model. Figure 1 (right) shows the probability density of the same datapoints after projecting x onto the latent space. This distribution is designed to be well described by a series of overlapping narrow Gaussian modes. We will now describe how this projection function h (x) was derived, then train a GMM to model this spectrum for a variety ofc W .
To derive h (x), we first construct a response curve Q x (x) between the physical boundaries of x. This is written as: where D x (x) is the cumulative distribution function of the data simulated at the SM and L x (x) is a linear function. The hyperparameter f is tuned to ensure that wide regions in X are not collapsed onto narrow regions in U, whilst also providing a smooth turn-off at the boundaries of the distribution. This function is shown as the solid black line in figure 2 (left). We then construct a response curve Q u (u) over the latent space, shown as the solid blue line in figure 2 (middle), defined as the cumulative distribution function of a target functionq u (u) given by: This function, shown in figure 2 (right) using values of (α, β, γ) = (4, 3, 1), is heuristically designed to be flat in the centre and smooth at the edges. This encourages the optimal GMM description to contain many narrow overlapping Gaussian modes. We note that it may seem natural to choose a Gaussian distribution for q u (u) (see e.g. [9]), however this will often result in a GMM which is dominated by a single wide Gaussian mode, violating our target behaviour. The mapping function between X and U is defined as , and its derivation is shown visually as the green dotted line connecting the points x * and u * in figure 2 (left and middle).
We compute Q u (u) as a piecewise-linear function over the interval u ∈ [−5, 5]. Whilst the domain of u could be extended arbitrarily far so that all sampled points u * ∈ U are mapped onto the physically allowed domain of X, we found that limiting the domain improved numerical stability in our experiments by avoiding dilute tails in the latent distribution.
We now apply the projection function h to all our datasets with nonzero values ofc W 5 . It is crucial that h are derived using data at a single point in parameter space (herec W = 0) and applied to the data at all values ofc W . Asc W is varied, the probability density p (u|c W ) is deformed. This is modeled as p ϕ (u|c W ) where the neural network parameters ϕ are trained using maximum likelihood estimation evaluated over the simulated training data for allc W , i.e.: where w label Monte Carlo event weights, used to account for how integration of probabilities is handled within a particular simulation package [29,30], if applicable. We train a GMM with N G = 30 individual modes to describe the probability density. Figure 3 (top row) compares the training data and post-fit model p ϕ (u|c W ) at values ofc W = {−0.4, 0, 0.4}. Thin coloured lines show the decomposition into individual Gaussian modes. Asc W is varied, we see that deformations in the spectrum are captured by modulating the amplitudes, positions and widths of the narrow Gaussian modes. Figure 3 (middle row) shows the ratio between the training data and the model PDF, offset to 0 so we study the residual difference between the two. This demonstrates that systematic mis-modelling is below 5% except in the sparsely populated tails of the distribution for all three values ofc W . The dark shaded band around the data shows the Poisson estimate of the statistical uncertainty. The thickness of this band is comparable with the residual difference between the data and the model, suggesting that this residual is mostly dominated by random fluctuations in the data. Figure 3 (bottom row) shows the ratio between p ϕ (u|c W ) and p ϕ (u|0), the model PDF evaluated at c W = 0, once again offset to 0 so we study the residual difference between the two. This quantifies how the shape of the distribution is deformed when translating acrossc W . Training data are also shown, demonstrating that the model has captured how the spectrum is deformed asc W is varied.

Extending to multiple observables
When modeling d observables on the latent space, we write an auto-regressive probability density: where i label observables and u <i is the list of all prior latent observables. The conditional probability density for each u i is modeled using a GMM parameterized by a neural network with parameters ϕ according to: where f ϕ,g,i , µ ϕ,g,i and σ ϕ,g,i are respectively the amplitude, mean and width of the g th Gaussian for observable index i. By including u <i as input to the network, it now captures the dependence on both external parameters and preceding observables. This means that high-dimensional observable correlations may be described by the model.  Figure 4 shows a schematic diagram of the neural network architecture used to model the GMM for latent observable u i ∈ u min i , u max i . Fully connected layers at depth l are shown in grey and labelled Dense, with a number of neurons equal to N l as specified and an activation function shown in parentheses. These are either linear, equivalent to applying no activation function, or LeakyReLU [41] with a negative gradient of 0.2 defined for input x according to:

Neural network architecture
Inputs θ and u <i of lengths N θ and N u respectively are compressed onto the interval [−2, 2] and fed into initial layers of size N 1 and N 2 . The configurable constants {A 1 , A 2 , B 1 , B 2 } determine the width of these layers. The outputs are concatenated and fed into a sequence of C layers of width N 1 + N 2 . The constant C determines the ultimate depth of the network. The outputs are then fed into three separate channels, which will separately assign the Gaussian amplitudes ⃗ f i , means ⃗ µ i and widths ⃗ σ i . In each channel, activations x pass through two further dense layers of size D · N G and N G , creating three vectors of length The configurable constant f σ therefore determines how many standard deviations of overlap exist between the initial Gaussian modes. Finally, a constant of ϵ = 10 −4 is added to prevent the evaluation of Gaussian modes with zero width. We note that these transformations impact the gradients of the loss function with respect to the three different channels, leading to different learning rates for the amplitudes, means and widths respectively. This likely impacts the post-fit model, and future optimization may be achieved by controlling the balance of these gradients to preferentially enhance model updates in one channel.
The resulting network contains optimization is performed using the Adam [42] algorithm with a learning rate of λ lr . An adaptive learning rate is used, such that λ lr is multiplied by a factor of λ update factor lr < 1 if the training loss does not improve for λ patience lr epochs. This mitigates underfitting when the initial λ lr is large. Network biases are initialized to zero and weights are drawn randomly from a uniform distribution over the interval ±10/ (3 √ N in ) where N in is the number of input neurons. This mitigates vanishing/exploding activations and gradients in the initial state.

Impact of transforming the likelihood
The function h performs a monotonic one-dimensional change of variables between x and u. The probability density p u (u) over the latent space may therefore be transformed into a probability density over the original data space p x (x) according to: where h (x) is evaluated using a piecewise linear function calculated from the training data, and so dh(x) dx is a step function over x. Whilst it leads to a tractable density over x, equation (10) contains no dependence on θ. This means that statistical inference is equivalent when performed on U and X. Applying such a transformation is therefore not necessary, and we will always perform inference using observations in the latent representation unless stated otherwise.
We also note that the transformation h (x) must preserve the total probability contained within a span, i.e.ˆx and so we can integrate the probability contained within [x 1 , x 2 ] simply by transforming x 1 and x 2 and performing the integration over the latent space. However, this integration may only be performed analytically when data are one-dimensional. We do not perform a rotation when transforming between x and u. This secures three desirable features: it ensures a diagonal Jacobian matrix, it retains an easily understood relationship between each component of x and u, and it mitigates potential concerns about loss of generalization [43].

Complexity of likelihood evaluation
Consider that we wish to model d observables, using d neural networks each containing L hidden layers and W neurons per layer. Assuming that d W and N G LW, the calculation of p (u|θ) has a complexity of O dLW 2 . However, each of the d conditional probability densities may be computed in parallel, resulting in O LW 2 complexity. This may be further accelerated up to a limit of O (L) by using a GPU for efficient matrix multiplication. Since u <i are used as input to the networks for all i > 0, network outputs must be computed separately for every datapoint except in the case of the first observable u 0 , for which a single pass through the network can be used to provide the Gaussian parameters needed to evaluate every datapoint.

Complexity of generative sampling
We have noted that the density model may be sampled, allowing it to be used as a generative model for event simulation. We achieve this by randomly drawing u * 0 ∼ p ϕ,0 (u 0 |θ), u * 1 ∼ p ϕ,1 (u 1 |u * 0 , θ) and so on until a datapoint u * in d dimensions is constructed. This may be transformed back onto data space using Since this process is sequential in the latent observables, they may not be simulated in parallel. As with likelihood evaluation, the complexity of sampling is O dLW 2 . This may be accelerated up to a limit of O (dL) using a GPU. Since p ϕ,0 (u 0 |θ) contains no dependence on other observables, many u * 0 may be sampled using a single evaluation of the network. However, sampling u * i for i > 0 requires the network to be evaluated for every datapoint.

Modelling of systematic uncertainties
In this work, we focus on the expressive power of the model and do not consider the impact of systematic uncertainties. However, it is crucial that such uncertainties are accounted for when performing a statistical interpretation on a measured dataset. Here we briefly discuss how this may be done, whilst noting the limitations. We note that cross-section uncertainties may be trivially accounted for, since they do not impact the distribution of events throughout phase space.
We may separate modelling uncertainties into three categories. The first category are uncertainties associated with the simulation of training data which are parameterizable in terms of a nuisance parameter θ NP . These may be accounted for either by including θ NP within the vector θ input to the network, or by training a separate model r (u, θ NP ) = p (u|θ NP ) /p u|θ ref NP for some fixed reference θ ref NP and writing: The second category are non-parameterizable uncertainties associated with the simulation of training data. In high energy physics, these may account for poorly understood differences between the simulated data and control measurements. In a binned one-dimensional analysis, they may be mitigated by performing auxiliary observations which are uncorrelated with the observable being modelled and 'transferring' the data-driven constraint on a bin-by-bin basis. Residual uncertainties may then be parameterized according to systematic variations of this transfer procedure. It is challenging to extend such techniques to our model because we must cover possible mismodelling of the high-dimensional observable correlations.
The third category are uncertainties associated with the density model. These biases are caused by the inductive bias of the model as well as under-or over-fitting. Over-fitting may be mitigated using techniques such as regularization, dropout and early stopping, and by limiting model complexity. Under-fitting may be studied by sampling the density model for all simulated θ and showing that the marginal projections are compatible with the simulated data. Quantifying and parameterizing the remaining mismodelling is once again challenging, and we leave this for future work.
We consider overcoming these challenges to be one of the main hurdles facing the use of high-dimensional density models in high energy physics.

Model optimization
A strength of the proposed method is that there are many ways in which modelling may be improved if under-fitting is observed. These strategies include: (a) Increase the model capacity by using more complicated networks or larger N G . (b) Tune the parameters s f , s µ and s σ , which modulate the size of the initial state perturbations of the Gaussian amplitudes, positions and widths as described in figure 4, to balance the stability of the initial model with the size of perturbations which provide gradients for the learning process. (c) Tune f σ to configure the initial width of the Gaussian modes. Whilst narrow modes tend to describe local features of the data, fulfilling the objectives of our model design, training data do not provide significant learning potential for Gaussian modes several standard deviations away. We find that successful training occurs when the value of f σ balances these effects. (d) Tune the hyperparameter f or the functional form ofq u to create a latent distribution which is well described by a mixture of narrow Gaussians. (e) Alter the ordering of the observables, since p (B|A) may be more easily described than p (A|B) for two latent observables A and B. (f) Alter the training procedure to improve convergence towards likelihood maxima. (g) Rotate observables onto the eigenvectors of their covariance, reducing strong correlations in the data.
These opportunities for tuning improve the chance of finding a model which captures the salient features of the dataset provided.

EW Zjj with 12 observables and no external parameter dependence
In this section we create a density model to describe 12 observables with no external parameter dependence. This demonstrates that the method can learn a joint probability density over a high-dimensional dataset of physically realistic observables. Table 2 shows the observable ordering as well as the f -values used to configure the projection onto the latent space.
We include the two discrete observables N gapjet and N jet in the model. This demonstrates that there are no barriers to modelling continuous and discrete observables at the same time. A discrete observable taking integer values on the inclusive interval [u min i , u max i ] is modelled using a neural network which outputs a categorical probability distribution of length N p = 1 + u max i − u min i . Inputs θ and u <i are projected onto the interval [−2, 2] and passed through dense layers of size N 1 and N 2 respectively. These are followed by two fully connected layers of size 300 and 200, and an output layer of size N p . All intermediate layers use a LeakyReLU activation function with a negative gradient of 0.2. The output layer uses a SoftMax activation function to ensure that outputs represent a normalized multinomial probability distribution. The network is trained using a cross entropy loss function and the same training scheme as used to model continuous observables. Table 3 shows the constants used to configure the remaining neural networks and their training. The networks contain between 27k and 304k trainable parameters. This reflects a degree of over-parameterization of the model, since the number of parameters is the same order of magnitude as the number of training samples. We note that any resultant over-training is mitigated by the use of a GMM which naturally smooths each conditional PDF in the auto-regressive chain. Each network is initially trained for up to 400 epochs,  stopping early if the loss function does not improve over a period of 12 epochs. We observe that O (10 −4 ) relative updates to the log-likelihood are important, since they may lead to %-level improvements in the description of the tails. Training should therefore not be halted until a true plateau in the loss function is obtained. The model is trained using the 640k selected MG5 events generated assuming the SM hypothesis. To evaluate its performance, we randomly sample 4M datapoints from the model and compare the 1D and 2D marginal distributions with those of the training data. This large number is chosen to reduce fluctuations due to sampling variance. Figure 5 presents the 1D marginal distributions. For each observable, an upper panel presents the absolute spectrum in units normalized such that the highest bin takes a value of 1, and a lower panel shows a ratio taken with respect to the MG5 events. MG5 events are shown in red and compared with events sampled from the density model, shown in black. Shaded areas present Poisson estimates of the statistical uncertainty arising from finite sample size. We observe that all spectra are well described within a systematic precision of ±5%, with many spectra achieving precision similar to the statistical variance of the training data. We note that fewer bins than the expected O (32%) lie outside of the uncertainty bands, indicating that the model may be over-trained. Since this work is intended as a proof-of-principle for the method, we make no further attempt to mitigate over-training, whilst noting that this will be important for future applications. Figure 6 presents the 2D marginal distributions for all pairs of observables as measured using the MG5 events. This demonstrates that complex correlations exist between all observables. Figure 7 presents the 2D marginal distributions using the samples from the density model. Comparing figures 6 and 7 shows that the model has captured the high-dimensional correlations between all pairs of observables. Bins are coloured white if no entries exist, and black if a small number of entries are observed. We note that several fully-white regions of figure 6 are black in figure 7, suggesting that the density model may predict a small non-zero probability in regions of phase space which are unpopulated when simulating from-first-principles, as is the case with MG5.
If the modelled density in such regions is sufficiently small, we expect that this artifact should have minimal impact on inference tasks. This is because any overflow of density into physically-disallowed regions of phase space will mainly cause a small under-estimate of the normalization in physically-allowed regions, where all observed events must necessarily exist. Furthermore, this normalization shift may cancel when considering likelihood ratios. A greater problem may occur when using the density model for event sampling, since events may be generated in the physically-disallowed regions. Whilst not solving this problem at this time, we foresee potential for mitigation using two methods: T is required to satisfy p j1 ′ T ⩾ 0, preventing such unphysical behaviour. A drawback is that we cannot enforce the original boundary limits of p j1 T , because these must now be defined relative to the value of p j2 T . Furthermore, most physical boundary conditions may not be easily enforced by such a transformation, either because they are too complicated or because the user is not aware of them. (b) In high energy physics, one can model the components of object four-vectors and reconstruct observables accordingly. This naturally imposes many physical constraints, although not all, and once again we cannot enforce simple boundary conditions for high-level observables.
With these caveats, figures 6 and 7 demonstrate that the 2D projections of events sampled using density model are qualitatively very similar to the ground truth events throughout most of the space. The comparison is quantified in figure 8. This shows the pull on the ratio of these histograms, defined as: where p model and p MG5 are the densities estimated using events sampled from the density model and MG5 respectively, and ∆ p model pMG5 represents the estimated statistical uncertainty on the ratio between them. This dominated by the estimated statistical uncertainty on p MG5 . The pull can be interpreted as 'the number of standard deviations by which the ratio differs from unity.' It therefore shows the sign and statistical significance of the difference between the two distributions. If the density model represents an unbiased fit to the MG5 events then we expect O (68%) of bins to fall inside the interval [−1, +1]. Due to random fluctuations in the event sampling, we still expect O (32%) to fall outside of this interval by chance, even when the density model is equal to the ground truth. Extending this idea, we expect O (5%) of bins to fall outside the interval [−2, +2] and O (0.3%) outside [−3, +3] due to random fluctuations. If the density model is over-trained, we expect to observe an excess of bins with small pulls. Where mis-modeling occurs, we expect to observe a systematic trend of large pulls.
This allows us to study the agreement between the density model and training events in the following way. All bins with pulls less than 1 in magnitude are shown in green in figure 8. These are bins where the agreement between the density model and training data is better than the estimated statistical uncertainty. Bins with pulls above +1 (below −1) are shown in increasingly dark shades of red (blue). Since we expect    that O (32%) of bins will be coloured red or blue, the presence of these bins does not indicate mis-modeling. Instead we search for the following signatures: • Adjacent dark red or blue bins indicate that the difference between the density model and training data is unlikely to occur by chance. These regions are likely mis-modeled. • Multiple adjacent bins which are all coloured red or blue suggest an effect of systematic mis-modeling rather than statistical fluctuation. • More bins shaded in red or blue than expected, indicating that more bins than expected exceed the statistical variance due to sampling, suggesting that mis-modeling is present in some of these regions.
Since most bins in figure 8 are coloured green or light red/blue, we observe that most of the space is well-described within ±2 standard deviations. This indicates that, in general, the model is able to describe the high-dimensional distribution of the data at a level comparable with the statistical precision of the training data.
Some red or blue bands are observed, for example (i) in the steeply falling tail of the ∆y (j, j) distribution when projected along with m jj and |y jj |, and (ii) in the region p j1 T ≈ p j2 T in the projection of the two. This suggests some systematic mis-modeling in these regions, and scope for tuning using the optimization methods suggested in section 3.
White regions indicate that no density is present, whilst black regions indicate that events are present when sampling the density model but not MG5, repeating the observations discussed above. Table 4 summarizes the total frequency with which pulls are observed in the different colour bins.

EW Zjj with 4 observables and 2 external parameters
We now train a model which captures the dependence of EW Zjj data on the external parameters ⃗ c = {c HWB ,c W }. Such a model may be used to perform maximum likelihood estimation or derive exclusion limits on the space of⃗ c based on an observed dataset 6 .
In this case, two SMEFT coefficients would be profiled with all others assumed to be 0. This is consistent with experimental analyses in the Higgs and electroweak sectors, in which only one or two parameters are usually profiled at a time. In general, it is not possible to constrain many more parameters. This is because we must simulate training data at regular intervals in all directions of⃗ c. The number of required simulations therefore grows exponentially with the number of parameters profiled, which quickly becomes computationally intractable 7 .
We note that the external parameters also impact the rate σ fid (⃗ c) at which signal is expected to be produced within the observable phase space. When performing an experiment with a fixed exposure (rather than a fixed number of events), we expect to observe events at a point x in phase space at a rate of: 6 We emphasize that detector effects have not been applied to our training data, but would be for such an analysis. 7 We note that global fits of SMEFT parameters are possible when using binned measurements [27,44,45]. This is because the prediction for a given⃗ c may be decomposed into a parametric relationship between a number of pure-SM, pure-SMEFT and interference terms. In this case, since the number of unique terms rises slower than exponentially with the number of parameters, all parameters which impact EW Zjj events may be profiled together. However, for general new physics models where no such parameterization exists, the number of parameters profiled will be limited by the curse-of-dimensionality. Exploiting this special case is not possible using our method because we cannot express negative event densities which may arise in the interference term.  In this work we consider the modeling of p (x|⃗ c). We note that σ fid (⃗ c) may typically be modelled using a simple feed-forward neural network, allowing the event rate to be used as a discriminating observable if desired.
We also note that we are not modeling any backgrounds to the EW Zjj process. This is because we wish to test our ability to model multi-dimensional data with a non-trivial parameter dependence. This is best achieved by isolating the signal component, since in general background processes will not depend on the same parameters. However, we note that background modeling must be considered when performing parameter inference using detector-level data, and in particular a large irreducible background from non-electroweak Zjj production would exist in a 'real-world' EW Zjj analysis. For such an analysis, a statistical model combining individual components p sig (x|⃗ c) and p bkg (x) with expected cross-sections σ sig (⃗ c) and σ bkg may be constructed as: assuming that interference is either small or absorbed into the background model. For simplicity we select four observables to model, in the sequential order p ll T , p j1 T , m jj and finally ∆ϕ (j, j), excluding the other eight from consideration. All four observables are expected to depend on the external parameters, and we aim to capture this dependence within our model.
The projection onto the latent space is performed using the same f -values as presented in table 2 and used in the previous section. Table 5 presents the constants used to configure the neural networks which contain 18k-85k trainable parameters. Compared with those in table 3, we note that larger values of s f , s µ and s σ are used. This initializes the model such that external parameter variations deform the kinematic spectra, and so impact the log-likelihood, significantly enough that we find an improved parameter dependence to be learned during training. However, we note that large values may excessively enhance fluctuations and lead to an unstable initial state, and the final constants are chosen to balance these effects. The constant f σ is tuned to ensure that the initial Gaussian width is not much larger than the scale of latent space features which are deformed by parameter variations.
Each neural network is trained for up to 200 epochs, stopping early if the log-likelihood does not improve by an amount greater than 10 −10 over a period of 15 epochs. Figure 9 shows the 1D marginal distributions evaluated at the SM hypothesis of⃗ c = (0, 0), obtained by sampling 4 M events from the density model. Figure 10 shows the corresponding pulls on the 2D marginal spectra. Replicating the results of the previous section, these demonstrate that the model describes the 1D distributions to within ±5% at this point in parameter space, and without significant pulls in the 2D projections.
To investigate whether the parameter dependence has been learned, we scan across all hypotheses in the ⃗ c-plane and study the ratio of the 1D marginal distributions when compared with the SM. To reduce sampling variance when studying the density model, we form this ratio using importance sampling. We first sample 100k events from the model assuming the SM hypothesis. We then use the density model to evaluate the probability density of every datapoint under both the SM and⃗ c hypotheses, labelled p SM and p c respectively. The distribution under the⃗ c hypothesis is then obtained by assigning a weight of pc pSM to every datapoint. This approach assumes that the probability distribution under the SM hypothesis fully spans the support of that of the⃗ c hypothesis. The result is that the distributions obtained under the SM and⃗ c hypotheses have strongly correlated statistical fluctuations. These largely cancel when we take the ratio, which can be estimated using fewer samples than if the hypotheses were sampled independently. Figure 11 shows how the p ll T PDF, expressed as a ratio with respect to the SM, varies as a function of the⃗ c hypothesis which is indicated by the green box in every panel. Events generated with MG5 are shown in red, and those sampled from the density model are shown in black. We observe a significant enhancement of the high energy tail whenc W is large in magnitude, approximately independent of its sign. We observe that negative values of c HWB lead to a modest enhancement of the tail, whilst positive values suppress the tail by a comparable factor. The combination of these effects, plus any interference between them, manifests as a non-trivial structure throughout the plane of⃗ c. We observe that the density model has captured this external parameter dependence well, since it is able to describe the deformations with an accuracy significantly better Figure 9. Marginal distributions of events sampled using the density model (black) compared with those generated using MG5 (red) for a value of (cHWB,cW) = (0, 0). Shaded areas show sampling uncertainties. Note that the two spectra are not statistically independent, since the density model was trained using the MG5 events. Figure 10. Pull on the ratio between the 2D marginal distributions comparing events simulated with MG5 (denominator) with those sampled from a GMM trained on a latent space (numerator), both assuming the SM hypothesis of (cHWB,cW) = (0, 0). The model accepts cHWB andcW as input parameters.
than the size of the deformations themselves. The double ratio, quantitatively comparing the two histograms, is presented in figure A1 of appendix A. Figure 12 shows how the p j1 T PDF varies as a function of⃗ c. We observe an enhancement of the high-energy tail whenc W is large in magnitude. We also observe a low-energy enhancement when c HWB is highly negative, resulting in another non-trivial structure as we scan the plane of⃗ c. Once again, we find that the density model has captured this external parameter dependence well. The double ratio, quantitatively comparing the two histograms, is presented in figure A2 of appendix A. Figure 11. Evolution of the p ll T PDF as a function of (cHWB,cW), presented as a ratio with respect to the SM hypothesis. The dependence is well captured by the density model. The double ratio comparing the two histograms is shown in figure A1. T PDF as a function of (cHWB,cW), presented as a ratio with respect to the SM hypothesis. The dependence is well captured by the density model. The double ratio comparing the two histograms is shown in figure A2. Figure 13 shows how the m jj PDF varies as a function of⃗ c. We observe that highly negative values of c HWB lead to significant structure at m jj ∼ 0.15 TeV. As shown in figure 9, this is also where the bulk of the data is expected to be measured. When measuring other observables, experimental analyses typically apply pre-selection criteria requiring m jj to exceed O (1 TeV) in order to preferentially reject non-electroweak processes. By instead modelling an inclusive range of m jj simultaneously with all other observables and performing a high-dimensional unbinned analysis, such a restrictive requirement would not be required, provided that all backgrounds can also be sufficiently well modelled. The double ratio, quantitatively comparing the two histograms, is presented in figure A3 of appendix A. Figure 14 shows how the ∆ϕ (j, j) PDF varies as a function of⃗ c. We observe thatc W modulates the amplitude of an approximately sinusoidal oscillation introduced into the ∆ϕ (j, j) spectrum. We observe that negative values of c HWB modulate an enhancement at ∆ϕ (j, j) ∼ 0, whereas positive values of c W cause a suppression. This observable is therefore sensitive to the sign of both parameters. Once again we note that Figure 13. Evolution of the m jj PDF as a function of (cHWB,cW), presented as a ratio with respect to the SM hypothesis. The dependence is well captured by the density model. The double ratio comparing the two histograms is shown in figure A3. Figure 14. Evolution of the ∆ϕ ( j, j) PDF as a function of (cHWB,cW), presented as a ratio with respect to the SM hypothesis. The dependence is well captured by the density model. The double ratio comparing the two histograms is shown in figure A4. the distribution shows a significantly non-trivial dependence as a function of⃗ c, and that this dependence is captured well by the model. The double ratio, quantitatively comparing the two histograms, is presented in figure A4 of appendix A.

Demonstration of statistical interpretation using a toy model
In the previous two sections we have demonstrated that we can construct density models which replicate the behaviour of simulated training data when sampled. Whilst this implies that good behaviour should also be obtained when performing inference tasks at the trained points in parameter space, this cannot be demonstrated because we are not able to evaluate the ground truth PDF for any given datapoint.
Nonetheless, we consider such a demonstration to be important. This is because the quality of inference is impacted not only by the ability to fit the training data but by (1) the degree of under-or over-training and (2) the way in which the probability distribution is interpolated between training points, hereafter referred to as the inductive bias. Whilst the probability distribution may be learned with arbitrarily high accuracy at the training points, depending on the complexity of the model configuration and number of training samples provided, it is likely that the interpolation between training points will not exactly match the true behaviour, which is unobserved. We aim to show that the approximate behaviour of the model can work sufficiently well for inference tasks, provided that training data are provided at dense enough points in parameter space.
To achieve this, we construct a toy model from which to sample ground truth training data. This is projected onto a latent space and used to train a density model using the method proposed in this paper. The toy contains four observables which vary according to two external parameters. Several pseudo-datasets are sampled from the true model assuming different parameter hypotheses. For each dataset, the density model is used to compute exclusion bounds on the latent space, and the results are compared with ground truth exclusion bounds computed using the true PDF on the data space. The level of agreement is then analyzed. Use of a toy model allows us to compute these ground truth bounds, which are typically intractable for real simulations.
We define a toy model with four observables This observable is sensitive to the sign and amplitude of both external parameters. Observable x 3 follows a smooth-peak distribution with no physical limits, and is correlated with all observables and external parameters.
Data are projected onto the latent space using values of f = 0.5 for all observables. Neural networks are configured using the constants presented in table 6 and contain 18k-85k trainable parameters. Each network is trained on 60% of the available data until the log-likelihood evaluated over the other 40% no longer improves by an amount greater than 10 −6 over a period of 8 consecutive epochs, after which the solution with the least-positive (or most-negative) validation loss is chosen. Training is found to terminate after 33-46 epochs. Figure 15 (bottom) shows the latent space distributions. A third panel compares the 1D marginal distributions obtained from the ground truth data and from drawing 50 k samples from the resulting density model. The level of agreement is found to be comparable with the statistical precision of the data.
We now test the accuracy of inference performed using the density model. We select nine different 'true' hypotheses⃗ c true in a 2D grid with edges at c x ∈ [−0.8, 0, 0.8] and c y ∈ [−0.8, 0, 0.8]. For each value of⃗ c true , a pseudo-dataset with a size of 400 events is created by sampling the true PDF. We assume that the expected number of observed events is identical for every value of⃗ c. Figure 16(a) shows nine panels in which the different⃗ c true hypotheses are presented as black dots. Open circles show the points in parameter space⃗ c trained at which the model was trained, excluding those which lie outside of the axis range.
The true PDF is used to profile the likelihood of the dataset. Using this method we evaluate (1) the true maximum likelihood estimate (MLE) and (2) the frequentist 68% and 95% confidence limits, assuming that the expected distribution of the profile likelihood ratio follows the asymptotic approximation described by Wilks' theorem [46,47]. In figure 16(a), orange crosses present the MLE evaluated using the true PDF, whilst orange contours present the confidence limits. We note that, since the pseudo-datasets are stochastically sampled from the true PDF, we expect each MLE to fluctuate away from⃗ c true as observed. The datasets are then transformed onto the latent space, and the same analysis is performed using the density model to evaluate the likelihood. Blue crosses present the MLE evaluated using the density model, whilst blue contours present the confidence limits. Figure 16(a) demonstrates generally good agreement between the exclusions bounds evaluated using the density model and ground truth PDF, although we observe a mild over-coverage when c x ∼ 0 or c y ∼ 0. We expect that this is because these axes represent turning points in the function p (x|⃗ c), the form of which is only approximated by the inductive bias of the density model. To test this, we train a second model which contains additional training data at c x = ±0.2 and c y = ±0.2. The resulting contours are shown in figure 16(b). We observe that the additional training data have constrained the model at |c x |, |c y | ∼ 0, resulting in an improved agreement with the ground truth. We conclude that the most reliable results will be achieved when the spacing of⃗ c trained points is smaller than the size of the expected exclusion bounds.
In both cases, figure 16 shows that accurate MLEs and exclusion contours have been estimated using density models on the latent space. Reliable results could therefore be obtained in this example without having access to the true PDF.

Conclusion
We present a method for modelling probability distributions over a high-dimensional space of observables with dependence on external parameters, a dataset type which is common within the physical sciences. The method uses a novel transformation of input data and a targeted network architecture to improve the expressive power of GMMs. It is designed to capture smooth deformations of the probability density induced by external parameter variations, and respects strict boundaries on the observables. The model may be used to perform inference on observed data, or sampled to act as a stochastic generator. We demonstrate the power of the method by applying it to two high-energy particle physics datasets: one which contains twelve highly correlated observables, and one which depends on two external parameters. We then use a toy model to demonstrate that fast and accurate inference may be performed from experimental data. We demonstrate that the problem-of-interest may also contain discrete observables, which are modelled with a relatively simple categorical model. Whilst the method enables interpretations to be performed using unbinned multi-dimensional data, it may also be used within the experimental design of binned measurements (which are intended to characterize observed data with minimal physical model assumptions). Such an analysis may proceed as follows. An experimenter may assign benchmark hypotheses to which a planned measurement should have reasonably optimized sensitivity. We expect that a near-optimal classifier 8 for a given parameter hypothesis may be created using the ratio of the PDFs evaluated at the null and alternative hypotheses. By isolating the regions of the high-dimensional space which provide the most discrimination power, they may ensure that these regions are targeted by dedicated bins.
The method presented is not domain-specific, and may be used to model any dataset of continuous observables which follow a smooth PDF, and to subsequently perform statistical inference from experimental data for the purposes of scientific discovery.

Data availability statement
The data that support the findings of this study are openly available at the following URL/DOI: https://dx.doi.org/10.48420/17136839.