Efficient description of experimental effects in amplitude analyses

Amplitude analysis is a powerful technique to study hadron decays. A significant complication in these analyses is the treatment of instrumental effects, such as background and selection efficiency variations, in the multidimensional kinematic phase space. This paper reviews conventional methods to estimate efficiency and background distributions and outlines the methods of density estimation using Gaussian processes and artificial neural networks. Such techniques see widespread use elsewhere, but have not gained popularity in use for amplitude analyses. Finally, novel applications of these models are proposed, to estimate background density in the signal region from the sidebands in multiple dimensions, and a more general method for model-assisted density estimation using artificial neural networks.


Introduction
Amplitude analysis of hadron decays is a powerful technique employed in many flavour physics studies, such as measurements of CP violation, searches for effects beyond the Standard Model, spectroscopic studies of excited hadrons, and searches for previously unobserved hadronic states.In this kind of analysis, multidimensional kinematic distributions of the decay products of a parent particle are studied to reveal the dynamical structure of the decay amplitude [1].In addition to the decay dynamics, the kinematic distributions are in general affected by non-uniform acceptance, or detection efficiency, and background density, which need to be accounted for in the fit.
In this paper we briefly review the conventional approaches, recall a few already known but rarely used methods, and, finally, propose new techniques to model non-uniform acceptance and background distributions.The proposed techniques not only offer more accurate descriptions of these distributions, but also provide improved avenues to control the systematic uncertainties arising from conventional approaches.Furthermore, the ability to make efficient use of expensive detailed simulation will be of key importance in the future, when data rates are expected to grow faster than the availability of computing resources [2] A simple, yet typical example of an amplitude analysis is the study of the twodimensional distribution of a three-body decay of a (pseudo)scalar meson into three (pseudo)scalar mesons: Dalitz plot analysis [1,3].This is the simplest case where the decay has internal degrees of freedom, yet the amplitude is a function of only two kinematic variables.In more complicated cases, such as decays involving non-spin-zero states or decays with greater than three particles in the final state, one is required to analyse kinematic distributions in more than two dimensions.Here we focus on the simple twodimensional case, however the approaches presented here can be easily generalised to the cases with higher dimensionality.
We deliberately avoid any quantitative direct comparisons of the performance for the illustrated techniques: the optimal technique for each individual analysis depends on many factors, such as the size of the data sample; dimensionality of the kinematic phase space; requirements on statistical and systematic uncertainties; complexity of the amplitude model; signal selection procedure; or structure of the background contributions.As such, it is important to investigate a variety of complementary approaches.
The structure of the paper is as follows: the formalism of multidimensional maximumlikelihood fits is recalled in Section 2 and non-parametric methods to deal with background and acceptance are presented.The samples used to illustrate the background and acceptance parameterisation techniques are described in Section 3. Conventional techniques to parametrise the acceptance distribution are illustrated in Section 4. Further, two rarely used but yet efficient approaches are presented: a technique using Gaussian processes (Section 5) and density estimation with artificial neural networks (Section 6).Finally, two novel approaches are proposed: the technique for inter-or extrapolation of background density from the sidebands using one of the methods above (Section 7), and a model-assisted parameterisation of background or acceptance density using neural networks (Section 8).

Formalism of amplitude analyses
A typical amplitude analysis in flavour physics deals with the distribution of kinematic observables x that characterise the multibody decay.The goal is to determine the unknown parameters Θ A that characterise the amplitude of the decay A(x|Θ A ).Given a set of decay candidates characterised by a vector of observables x i (1 ≤ i ≤ N ) obtained in an experiment, an unbinned maximum-likelihood fit is performed to infer the model parameters Θ A .The negative logarithm of the likelihood, −2 ln L, minimised in the fit, is of the form where N is the size of sample being fit, F (x|Θ A ) is the normalised probability density of the decay that depends on the model A(x|Θ A ), with a normalisation term I, the acceptance (x) and the normalised probability density function for the background events B(x), f sig and f bkg are the relative fractions of signal and background contributions, respectively.Another instrumental effect that needs to be taken into account in the fit, particularly if the amplitude contains narrow resonant states, is the finite resolution of the kinematic observables x.This effect is beyond the scope of this paper and is not considered.The contribution of background events is typically obtained by analysing the distribution of selection variables, m.In the simplest case, m is a single variable that is taken to be the combined invariant mass of the final state particles, which typically peaks at the mass of the parent particle for the signal events, and is distributed more uniformly for the background.However, other parameters of the event can also be included in the background selection.Alternatively, instead of treating the background contribution explicitly as shown in Eq. (2), one can also assign to each candidate i in the data sample x i a weight w i , such that the background contribution is statistically subtracted.This procedure will be discussed in more detail in Section 2.2.
While the amplitude A(x|Θ A ) is driven by the model of the decay dynamics and is the primary object under study, the experimental effects of background and non-uniform acceptance exist as "nuisance" objects that, nevertheless, have to be modelled accurately for correct interpretation of the results.Their description, especially in the case of the multidimensional kinematic phase space of the decay, often presents a major difficulty in an analysis.Below we review several conventional techniques employed in amplitude analyses to deal with effects of background and non-uniform acceptance.

Treatment of non-uniform acceptance
Non-uniform acceptance is usually handled either explicitly, using a parametric or nonparametric model of the decay density as shown in Eq. (2), or in an implicit way, by including its effect in the normalisation term of the likelihood.In the latter case, the scattered data from simulation can be directly used, and no functional representation of the acceptance is required.
To demonstrate the implicit approach, let us consider Eq. ( 1) that is being minimised in the unbinned fit (the background contribution has been omitted for simplicity): Here the normalisation term I is calculated by taking the mean of the values of the density function on a uniformly distributed sample y j (1 ≤ j ≤ M ) in a volume V .The constant terms that do not depend on the parameters of the model Θ A can be omitted, which leads to the following expression: The second term in the above equation can be seen as the sum of |A| 2 calculated over the sample y j distributed uniformly over the decay phase space, where each event j enters with the weight (y j ).The weight of each event can also be interpreted as the probability of that event passing the detector acceptance.Such an interpretation hints at a way to prepare the normalisation sample: one has to generate the decays uniformly in the decay phase space, and then simulate the reconstruction and selection of the events.The retained events will serve as a normalisation sample for the likelihood and no further corrections to the acceptance are required.In a real analysis, however, additional weighting of the normalisation sample may be needed to account for the imperfections in the simulated sample.
With this approach, no explicit parameterisation of the acceptance distribution is needed, and therefore it is often used in amplitude analyses with more degrees of freedom than the two-dimensional Dalitz plot, such as [6] decays where the amplitudes are described in a five-or sixdimensional phase space.This method is, however, statistically sub-optimal, since it does not exploit the fact that the acceptance distribution can be assumed to be at least locally smooth.As a result, these analyses usually require simulation data sizes several times larger than the real data samples.
Explicit modelling of the acceptance function is more typically used in two-dimensional Dalitz-plot analyses.The techniques often used are two-dimensional polynomials with the parameters obtained from fitting a simulated data sample [7], two-dimensional histograms smoothed with cubic splines [8,9], and kernel density estimation [10][11][12].Analyses in more than two dimensions do sometimes also use an explicit acceptance parameterisation, albeit with assumptions on the factorisation of the acceptance in some variables [4] to reduce the dimensionality.

Treatment of background contributions
As in the case of acceptance, backgrounds can also be treated in the amplitude analyses either in an explicit or implicit fashion.The implicit inclusion of the background into the likelihood fit can be performed using the sPlot technique [13,14], where each event is assigned a weight that depends on the value of the selection variables m.These weights are positive in the signal-dominated regions of the selection variables and negative in the background-dominated regions, and as such the contribution of the background events can be statistically subtracted from the likelihood.This procedure does not require explicit parameterisation of the background density, however, it suffers some drawbacks.Firstly, it assumes that the kinematic observables x are uncorrelated with the selection variables m, which as will be demonstrated in Section 7, is an assumption that in general is not well motivated.The presence of correlations will thus introduce bias in the fit results, especially if the background level is large.Secondly, since the procedure does not make any assumptions on the functional form of the background density, it results in larger statistical uncertainties on the results than when a reasonable functional form is assumed.

Illustrative simulated samples
For the purposes of illustration, the D + s → K + π − π + decay is considered in this paper.As this is a three-body decay with scalar particles in both the initial and final states, its dynamics are fully characterised by two kinematic variables.This decay is also convenient as all the final states of the decay are charged tracks, which makes the selection easier to be implemented in a simplified Monte-Carlo simulation framework.There are no identical particles in the final state, avoiding any need to deal with Bose symmetrisation of the kinematic phase space.Finally, being a singly Cabibbo-suppressed decay, it is interesting from a physics point of view, since it can exhibit significant CP violation.
Here we choose to parametrise the phase space in terms of the two square Dalitz-plot observables x = (m , θ ) [15], defined as where m(K + π − ) is the invariant mass of the K + π − combination, m min Kπ = M K + M π and m max Kπ = M B −M π are the minimum and maximum values of m(K + π − ) variations, M K and M π are the masses of K and π mesons, respectively [16], and θ(K + π − ) is the helicity angle of the K + π − combination (the angle between the K + π − resonance and the lone hadron, in the resonance rest frame).While the square Dalitz plot was proposed for analyses of B mesons in order to give more weight to the interference regions between different two-body combinations that predominantly occur in the corners of the conventional Dalitz plot, in our case it is used purely to avoid complications related to the curved boundaries of the conventional Dalitz plot.
Although the three-body Dalitz-plot analysis is the simplest case for the techniques considered here, they scale well with dimensionality of the kinematic phase space, and can be applied for more complicated amplitude analyses.Similarly, the exact definition of the kinematic phase space (conventional or square Dalitz plots, or various representations for four-body kinematics) is not a limitation for the techniques under study.

Efficiency distribution for D
The sample of D + s → K + π − π + decays is simulated using a simplified Monte-Carlo technique, where only kinematic properties of the initial and final state particles are considered.The production of D + s mesons and reconstruction of decay products is inspired by the conditions of LHCb experiment [17], however the numerical values of the parameters used in the simulation are largely arbitrary and differ from those at LHCb.
For the initial D + s mesons, the transverse momentum p T (the component of momentum perpendicular to the ẑ-axis, which in the case of a proton-proton collider corresponds to the direction of the beams) is generated according to an exponential distribution with a mean of 1 GeV. 1 The angle θ T , which is the angle between the direction of D + s momentum and the ẑ-axis, is generated such that the pseudorapidity, η = − ln tan(θ T /2), is distributed uniformly in the range from 2 to 5 (which roughly corresponds to the fiducial volume of LHCb).Assuming that the D + s decay is spherically symmetric, the momenta of the final state products are generated such that they are uniform in the square Dalitzplot variables m and θ .These momenta in the D + s rest frame are then boosted to the laboratory frame.
To simulate the selection of D + s candidates in the experiment, only the candidates that satisfy certain kinematic criteria are retained: the total momentum p and the transverse momentum p T of each track are required to exceed 3.0 and 0.4 GeV, respectively; the p T of at least one of the final state tracks is required to be greater than 1.0 GeV; the p T of the D + s candidate is required to be greater than 2.0 GeV; and the sum of transverse momenta of the three tracks is required to exceed 3.0 GeV.
The square Dalitz-plot distribution for the retained candidates is shown in Fig. 1.Since we are only interested in relative variations of the efficiency, it is normalised such that its average over the full kinematic phase space equals 1.The same applies to other generated and estimated two-dimensional efficiency and background distributions presented in this paper.The plot in Fig. 1 uses a high-statistics sample of 4 × 10 6 events after selection, where a smaller sample of 10 5 selected candidates is used in the examples presented elsewhere in this paper to estimate this distribution.

Combinatorial background density for D
The simulated combinatorial background contribution to the D + s → K + π − π + decays contain purely random combinations of three tracks, as well as the combinations of ρ(770) 0 → π + π − or K * (892) 0 → K + π − with a K + or a π + , respectively.The kaon and pion tracks, as well as the K * 0 and ρ 0 resonances, are generated uniformly in pseudorapidity, η, and with an exponential p T distribution (with a mean p T of 0. and 2.0 GeV for pions, kaons and resonances, respectively).The fractions of ρ 0 and K * 0 in the combinations before applying selection cuts are set to be 20% and 10%, respectively.The invariant masses of the ρ 0 and K * 0 resonances are generated according to the relativistic Breit-Wigner distribution, with masses and widths equal to their world-average values [16], and their decay products are generated assuming that the resonances are unpolarised (i.e. they are isotropic in the resonance rest frame and are uncorrelated with the third track).
In a second stage, the three charged tracks are combined to form the D + s candidate.Only the tracks with p T greater than 0.4 GeV and total momentum greater than 3 GeV are used to make the candidates.Finally, a kinematic fit is performed which adjusts the momenta of the final state tracks in such a way that the invariant mass of the three-body combination is coincident with the world-average D + s mass M D + s = 1.97 GeV [16].The three-dimensional distribution of the invariant mass, m D ≡ m(K + π − π + ), of the three particles before the kinematic fit, and the square Dalitz-plot variables m and θ after the kinematic fit, are used to extract the density of the background events in the signal region.
The distributions of each variable with the definition of signal and sideband regions are shown in Fig. 2. To clearly show the features of the background density, the twodimensional projections in Fig. 2(a,b,c) are obtained with the high-statistics Monte-Carlo sample of 4 × 10 6 events satisfying the selection requirements.The examples of background density estimation in this paper, as well as the one-dimensional projections in Fig. 2(d,e,f) use the smaller sample of 10 5 candidates.The following examples assume that the background density is estimated only using the sideband regions of the m D distribution, defined as 1.77 < m D < M D + s − 0.  clearly different, with the positions of the peaking structures due to K * 0 and ρ 0 resonances shifted in the sidebands with respect to the signal region.In this particular case, it is purely explained by the kinematic fit procedure applied to the background sample.In a real analysis, this can also be caused by dependence of background production properties on the selection variable, m D .

Conventional acceptance parameterisations
In this section, we present the estimation of the acceptance variation over the amplitude fit variables using conventional methods that involve use of Legendre polynomials and cubic splines.These methods use histograms to estimate the local density of events before and after a selection requirements, and then interpolate the efficiency value between the bin centres of the histogram.
Numerous analyses use multidimensional orthogonal Legendre polynomials [7,[18][19][20] to parameterise the variation of the acceptance.In d dimensions these consist of the product of d n d -order polynomials, where n d can be different for each dimension, that describe the form of the acceptance.As such, the number of free parameters in this method are O(n 1 × n 2 × ... × n d ), exhibiting the power-law growth in the number of dimensions.The coefficients of these polynomials are then extracted in a maximumlikelihood fit.As the optimal order of each of these polynomials is a priori unknown, often cross-validation and/or regularisation is used to prevent overfitting.
Another method commonly used is interpolation between the histogram bin centres via cubic splines [8,21,22].Here, the scale of the variation over the phase-space is determined by the initial histogram bin size, and therefore to avoid overfitting the size and location of bins are optimised before the spline function is calculated.Unlike in the case for the Legendre polynomials, no free parameters exist for the spline functions (except for the values at the bin centres).Therefore these are particularly susceptible to over-fitting, as they simply "connect" the points.
In Figure 3, we show the results of these two conventional approaches [23,24].For the Legendre polynomial fit, additive L 1 and L 2 regularisation terms were included into the likelihood to reduce overfitting and improve likelihood fit stability [25]: with and where c ij are the coefficients of the polynomials and λ 1 and λ 2 are regularisation parameters.A cross-validation procedure was performed to determine the optimal degree of the Legendre polynomials (set to be equal in each dimension for simplicity), as well as the magnitude of each of the L 1 and L 2 terms.This resulted in a polynomial degree of n = 8 in each dimension, an L 1 regularisation parameter λ 1 = 0.01, and an L 2 regularisation parameter λ 2 = 0.1.The reduced χ 2 for this fit is calculated using an independent test set, and the number of effective degrees of freedom is calculated using bootstrap resampling, and yields a value of χ 2 /nDoF = 2517/2429 = 1.04.
For the fit with cubic splines, a 10 × 10 binning in (m , θ ) is used, and is chosen ad hoc such that the bin size is similar to the smallest structure size in these variables (∼ 0.1), but not so fine as to exhibit dramatic overfitting if these are compared to the underlying distribution.Here the reduced χ 2 is calculated using the number of bins in the χ 2 test (50 × 50) minus the number of points fitted by the spline (100), and a value of χ 2 /nDoF = 2460/2400 = 1.03 is obtained.

Gaussian processes
A Gaussian process [26] is a statistical model which associates each point in the input space with a normally distributed random variable.The joint distribution of these random variables is also a normal distribution, yielding a closed-form expression for the model at any arbitrary point the space.Fortunately, to avoid having to fit for the parameters of an infinite number of normal distributions, Gaussian processes are completely determined by parameterising the covariance matrix using a covariance function.As such, it is possible to obtain model estimates in a large number of dimensions with relatively few parameters, which can then be used to extrapolate the behaviour of the model with reliable estimates of the uncertainty.Furthermore, these parameters can be robustly extracted directly from the data, which gives Gaussian processes an advantage over other "non-parametric" models, such as kernel density estimates or piece-wise spline interpolation.A pedagogical introduction to Gaussian processes can be found in Ref. [27], and other applications in high-energy physics can be found in Refs.[28,29] Given some input data vector X = (x 1 , . . ., x N ), of length N (where the elements x i can themselves be vectors of arbitrary dimension), the output, y, of a Gaussian process is defined as where N is a multivariate normal distribution with zero mean, and the covariance matrix, Σ(X; Θ), given by Here k(x i , x j ; Θ) is a covariance function, with hyperparameter vector Θ, that is defined between any two input points, x i , x j , that are elements of X.For a new input point x * , the conditional probability of predicting an output that is equal to the true unknown value y * , given the previously observed outputs, Y = (y 0 , . . ., y N ), follows a normal distribution, where Σ * = [k(x * , x 0 ; Θ), k(x * , x 1 ; Θ), . . ., k(x * , x N ; Θ)], and Σ * * = k(x * , x * ; Θ).The best estimate of the true value, y * , is equal to the mean of the above probability distribution, and the uncertainty is the square-root of the variance, The negative logarithm of the likelihood for this construction is given by The vector of hyperparameters Θ of the covariance function can then be inferred by minimising the negative logarithm of the likelihood (Eq.14), or obtained via marginalisation using suitable priors and Markov chain Monte-Carlo.One such covariance function is the Matérn function, where d = ||(x j − x i )/ρ|| is the scaled distance between two points in the input space, Γ is the gamma function, K ν is the modified Bessel function of the second kind, ρ is a vector of length scales, ν is a non-negative parameter, and σ controls the absolute magnitude of the covariance.The Matérn function is defined in terms of the distance between two points, rather than the location of each point, and therefore describes a stationary distribution.For half-integer values of ν, this can be expressed as a product of an exponential function and a polynomial of order p = ν − 1/2.The parameters ν can be thought of as controlling the smoothness of the function, and when ν → ∞, the Matérn function converges to the squared-exponential covariance function Here we use the Matérn function with ν = 5 2 throughout, as this provides a good balance between replicating and smoothing the observed structures, however the parameter values, and the best model choice in general, depends on the data in question.These are ideally selected using cross-validation, or a similar procedure, and can be considered as a source of systematic uncertainty on the final distribution.
In reality, one would also want to describe distributions that are non-stationary (i.e., where the mean in Eq. ( 9) is non-zero).However, due to the linearity of the model, this represents a simple subtraction of the mean function from the observed data, and therefore there is no loss in generality due to this description.Specific assumptions on the mean distribution will be discussed for the applications described in Sections 5.1 and 7.

Acceptance parameterisation
The dataset described in Section 3 parameterises the acceptance in terms of the square Dalitz-plot variables, m and θ .As the Gaussian process does not estimate the density directly (although there are modifications that would permit this [30]), the density in (m , θ )-space is estimated first by a uniformly binned histogram, with 50 bins in each axis.The location of the centres of these bins are then the input points, X, defined above, where x i = [m i , θ i ].As these were generated uniformly in (m , θ )-space, the acceptance probability in each bin is simply the reciprocal of the bin content, which is the output, Y , of the Gaussian process.Therefore for each input x i = [m i , θ i ] there is an associated output y i .As the acceptance is positive definite everywhere, the resulting Gaussian process does not represent a stationary process, and as such, a mean function constant in (m , θ ) is added to the Gaussian process to account for this scaling.
An advantage of the Gaussian process is that it is relatively robust to statistical fluctuations due to low sample sizes, as the uncertainty at each point is estimated directly.As such, the aforementioned binning can be considerably finer than in other methods, provided that the assumption of normally distributed uncertainties holds (processes where the likelihood is replaced with a Poisson distribution can also be used, however this is less computationally tractable than in the Gaussian case, as Poisson distributions are not closed under linear combination).
Here, Gaussian processes are implemented using the GPy package [31] (and the functionality described in this section and Section 7.1 can be obtained with the package in Ref. [32]), and the resulting acceptance model as a function of m and θ can be seen in Figure 4.The parameters of this model are the overall scale of the Matérn function, σ 2 GP , the overall scale of the constant mean function σ 2 mean , the characteristic length scale over which points covary for each dimension, ρ m and ρ θ , and a term describing the additive  Gaussian noise at each point, .This fit was performed using a maximum-likelihood approach, and the resulting parameter values can be found in Table 1.
The χ 2 per number of degrees of freedom of this acceptance model with respect to the data is evaluated using an independent dataset of simulated D + s → K − π + π − decays.Here, the effective number of degrees of freedom is calculated approximately as the number of bins in the χ 2 test (50 × 50), minus the number of model parameters (5), and a value of χ 2 /nDoF = 2471/2495 = 0.99 is obtained.This indicates that the model reproduces the underlying distribution very well, where the smallness of the χ 2 value compared to the values reported in Section 4 is mostly due to the fact that this reproduction can be achieved with a small number of parameters.
As the only parameters that scale with dimensionality are the characteristic length scale of the Matérn function, the increase in the number of parameters is linear in the dimensionality.Furthermore, the time complexity of Gaussian processes is also linear in the dimensionality, making these very efficient in high dimensions compared to other parameterisations.Unfortunately however, the complexity is cubic in the input data size, due to the dependence on a matrix inversion, and therefore these do not scale well with large data sizes.Nevertheless, methods exist to mitigate this, such as the use of binned data in the strategies described here, or by selection of a small number of "pseudo" inputs [33].

Density estimation with neural networks
Multivariate techniques such as artificial neural networks (ANNs) or boosted decision trees provide an alternative approach to parametrise multidimensional probability density from scattered data [34,35].The approach involving ANNs exploits a property of neural networks, where a "feed forward" network (when layers of neurons are arranged in a non-cyclical structure), with smooth activation functions can approximate any continuous function.Here, the parameters of the ANN are treated as free parameters in a maximum-likelihood fit to the unbinned data, performed by treating the negative logarithm of the likelihood as a custom loss function.Since this technique does not require binning the input data, and in general ANNs have successfully shown their ability for multivariate generalisation, we expect that density estimation approach using ANNs can become useful for multidimensional amplitude analyses.This approach is demonstrated below for the parameterisation of the two-dimensional acceptance of the D + s → K + π − π + decay, described in Section 3.
The outputs of the n th neuron of the l th layer in the ANN is given by where w nm,l is the matrix of weights, b n,l is the vector of biases for l th layer, and f (x) is a non-linear activation function.For the estimated density to be smooth, it is convenient to use a smooth differentiable activation function such as a sigmoid function, In this structure, the first layer of neurons is the input layer, and accepts kinematic variables x as inputs, while the output neuron return a single scalar, the density estimate P (x) Density estimation can be performed by treating the weights and biases Θ ≡ {w nm,l , b n,l } of the ANN as free parameters, and minimising the negative logarithm of the likelihood, where x i (i = 1 . . .N ) are data points, and y i (j = 1 . . .M ) is a uniformly distributed sample used for normalisation.The function (19) is used as the loss function to train the ANN given the training sample x i .
As in many applications of machine learning techniques, special care needs to be taken to avoid overfitting, where the model configuration or parameters become too specialised to the training dataset, and therefore fail to generalise properly.In the case of density estimation with ANN, overfitting manifests itself as isolated peaks of the PDF around training data points.Regularisation techniques, where the likelihood is explicitly penalised to promote smoothness or sparsity, are thus essential to control overfitting.It was found that regularisation which penalises large neuron weights (and therefore those that result in large gradients of the density function) works well in the typical cases when density is expected to be smooth.Specifically, an L 2 regularisation term of the form is added to the loss function (19), where λ 2 is the regularisation parameter that ultimately controls the smoothness of the PDF.
Additional constraints can be imposed by the choice of the input variables.For example, in the case of a decay process where angular observables are necessary to describe it, the periodicity of the density as a function of angles can be enforced by using two variables, sin φ and cos φ, as the input variables for the ANN instead of a single angular variable φ.Similarly, if the density is expected to be an even function of a kinematic observable (e.g.cosine of a helicity angle), the square of this might be a better choice as an input to the ANN.
The density estimation of the D + s → K + π − π + decay acceptance is performed with an ANN consisting of four hidden layers with the number of neurons, from the first to fourth layer, equalling 32, 64, 32, and 8.The regularisation parameter λ 2 is chosen to be equal to 0.1.Normalisation is performed with 5 × 10 5 events distributed uniformly over the space of inputs, the square Dalitz plot.The likelihood minimisation is performed using the TensorFlow framework [36] and the Adam optimiser [37] with learning rate of 10 −3 .The resulting estimate of the density after 30 000 training epochs (passes through the data) is shown in Fig. 5.The χ 2 value obtained for the fit with these conditions for 50 × 50 bins equals 2418.3.It is difficult to estimate the effective number of degrees of freedom for a training with regularisation.By varying the regularisation parameter one can obtain a broad range of χ 2 values, both lower (which might indicate overtraining) and higher (indicating a bias in density estimation).Cross-validation can be used to obtain the optimal value for the λ 2 parameter and control overtraining.
The code used to produce the results in this paper with ANNs is available at Ref. [38].This package uses the TensorFlowAnalysis and AmpliTF libraries [39,40] for the imple-mentation of the functions related to flavour physics analyses within TensorFlow framework.

Extrapolation of background density from sidebands
As highlighted in Section 2.2, the conventional methods that either use sideband distributions or the sPlot technique to determine the combinatorial background contribution can in general introduce systematic biases, since they ignore correlations between the amplitude fit variables and the selection variables (such as the combined invariant mass of the final state particles m D ).The bias can become pronounced if only one of the sidebands can be used to estimate the background (e.g.due to presence of specific peaking backgrounds in the other sideband as often happens in B meson decays).The sPlot procedure also introduces additional statistical uncertainty compared to the parametric approach due to the lack of any assumptions on the behaviour of the background.
To overcome these issues, one can add the selection variables to the background parameterisation.For example, in the case of Dalitz-plot analysis, a 3D fit can be performed to obtain the probability density function P (m , θ , m D ), which can then be used to extrapolate the desired combinatorial PDF B(m , θ ) in the signal region.We present in this section two such approaches, one using a Gaussian process [41], and another using an ANN [42], to extrapolate the combinatorial background PDF from both the upper and lower sidebands of m D to the signal region.To illustrate the performance of both of these approaches, simulated combinatorial background of D + s → K + π − π + decay is used, as described in Section 3.

Gaussian process background fit
In the Gaussian process method, the fit variables (m , θ ) and selection variable (m D ) taken from the sideband sample are first binned to obtain a local estimate of the density, and the parameters of the covariance function are then inferred by fitting the model using the location of the bin centres and their respective yields.As mentioned in Section 5, the model is fairly robust to variations in the choice of the location and size of these bins, providing that they capture sufficient variation in the input variables.
Here, the .92] GeV, and upper, m D ∈ [2.02, 2.17] GeV, D + s sideband described in Section 3.2 are separated into three bins of 0.05 GeV each, for a total of six bins in m D .In each of these bins, the square Dalitz plot is separated into 20 × 20 uniform bins (with bin size 0.05 × 0.05), for a total of 6 × 20 × 20 = 2400 inputs to the Gaussian process.
The Gaussian process uses the Matérn kernel, defined in Section 5, with ν = 5 2 , along with a constant mean function in (m , θ ), and a linear mean function in m D .Results of the estimation of the background density in the sideband regions of m D variable are presented in Fig. 6.Here it can be seen specifically that the Gaussian process model estimates well the variation in the resonance structure in (m , θ ) as a function of m D due to the kinematical constraints, permitting accurate estimation of this background structure in the unobserved signal region.The corresponding kernel parameters that are extracted from the data can be seen in Table 2.
The distribution of the background in the signal region is obtained by querying the Gaussian process at m D = 1.97 GeV, the results of which can be seen in Figure 7.While the resulting density somewhat smears the narrow structure seen in m distribution (Fig. 7(b)), the bias of the distribution is clearly smaller than that obtained from the simple projections of the distribution in the sidebands.

Neural network background fit
If there are no narrow structures in the background, one can consider a background PDF that is positive-definite, reasonably smooth in m D , and is sufficiently generic in square Dalitz-plot variables, such as where P 1,2 (m , θ ) are the functions modelled with ANN.One can then perform an unbinned fit of P (m , θ , m D ) to sideband data, with regularisation to avoid overtraining, with the weights and biases Θ 1,2 of ANN functions P 1,2 and α as the free parameters.The background in the signal region can then be extrapolated using the trained model as B(m , θ ) = P (m , θ , m D = M D + s ).In the presence of narrow structures in the amplitude that vary as a function of m D (such as the resonant K * 0 and ρ 0 contributions in the D + s → K + π − π + sample, see Fig. 2), the approximation shown in Eq.( 21) may not work well.As ANN density estimation can be performed in multiple dimensions, one can estimate the background density in the selection variable m D in addition to that in the Dalitz-plot variables, m and θ .Therefore, as an alternative to the PDF of the form (21), the full three-dimensional ANN can be used to parametrise the background as a function of square Dalitz-plot variables and m D , where additional regularisation has to be applied to ensure continuity as a function of the selection variable m D .The latter can be done by adding an extra penalty term in the likelihood which penalises configurations where the neurons in the input layer have large weights corresponding to m D variable.Such regularisation will effectively result in the  ANN where the first input layer consists of features that are slowly varying as a function of m D .
The fit in the sideband regions to the simulated combinatorial background of the D + s → K + π − π + decay using the ANN is shown in Fig. 8.For the neurons in the first layer that take the m D dimension as input, the regularisation parameter λ 2 is set equal to 10, while for the other neurons this is equal to 1.In Fig. 9, the predicted combinatorial background density in the signal region using ANN approach is shown, where the ANN is trained using Adam optimised for 50 000 epochs with the learning rate of 0.0002.The χ 2 value for 50 × 50 bins equals 2457.9.

Model-assisted density parameterisation with neural networks
In the training of the ANN, it is often difficult to replicate specific features of the resulting density, especially in the case of limited training data.The only handles on the generic ANN training are the topology of the network and the generic regularisation terms, and careful tuning of these is needed to obtain a reasonable description of the density.The procedure of parameterising the background or acceptance using only the input data, without any external knowledge of the processes that govern the features of the distributions, is not the most optimal approach.In general, it is known, for instance, that the acceptance function should be relatively smooth with a fall-off at the boundaries of the phase space due to kinematic selection requirements, or that the combinatorial background should contain contributions from certain two-body resonances.The implication of this is that the behaviour is constrained much more than conventional parameterisation techniques assume, and an ideally efficient procedure should take this prior information into account.For example, one can introduce a simplified model of these processes, and extract only the parameters of this simplified model from the training data samples.
In the case of background and efficiency distributions, even a simple analytical model for this would be difficult to express.Instead, here we propose a technique to perform nonparametric estimation of these distribution, using the formalism described in Section 6, with the assumption that the complex observed behaviour explicitly depends only on a  few underlying parameters that are sufficient to describe the efficiency or background behaviour in the region of interest.The values of these latent parameters can then be inferred from the observed data, or detailed simulation in the case of a description for the acceptance, in order to parameterise the distributions.This way, the features of the resulting density are controlled by the simplified model, which leverages prior information on the correlations between the model parameters and kinematic observables.This results in more stable training of the ANN, as only data for the simplified parameters are required, rather than resource-intensive detailed simulation, and therefore larger sample sizes can be generated.Crucially, reliance on a few latent parameters also results in a an ad hoc regularisation effect, and as such the density obtained via this procedure is less sensitive to statistical fluctuations when obtaining the values of the simplified parameters.A similar technique has recently been independently proposed that utilises generative adversarial networks [43].

Implementation
In the initial stage, an estimate of the joint probability distribution P ≡ P (x, Θ), is constructed, in terms of the kinematic observables x, in which the background or efficiency description is required, and the latent parameters on which the background or efficiency depend, Θ.The variables x could comprise the (square) Dalitz-plot variables, such as in the examples in this paper, but could also be anything else that is required to be parameterised in a physics analysis, such as the invariant mass of the reconstructed particle, or its decay time.The parameters Θ are those that directly control physical constraints on the system, and influence x via their correlations.These parameters necessarily vary between analyses, but it is likely that these would include, e.g., effective threshold values on the final state particle momenta, parameters that describe the shape of these momentum distributions, or fractions of potential background contributions.
An estimate of the joint probability distribution is parameterised using an ANN, obtained via the probability density estimation technique described in Section 6.The ANN is trained using a sample of simulated data, S train = {X train , Θ train }, that encapsulates dependencies between the kinematic observables, x, and the latent parameters Θ.These data are required to span the space of possible model parameter values, however accurate description of any specific configuration of Θ or x is not required (that is, there is no requirement for the set of input data points in this initial construction to overlap with the set of eventual evaluation points, due to the model smoothing).Effectively, the ANN parametrisation obtained at this stage will represent the functional representation of the model that is implemented using simulation.
Secondly, an estimate of the specific values of the latent parameters, Θ pred , that correspond to the background data or detailed simulation, X data , is obtained.This is done by fixing the weights of the ANN obtained at the first stage, and performing a maximumlikelihood fit for these values, treating the ANN output as the probability of the latent parameters conditioned on the known vector of kinematic observables, P (Θ|X data ), such that Θ pred = arg max Θ P (Θ|X data ).
Lastly, again using the ANN as a joint probability function, the sample X is drawn from the distribution P (x|Θ pred ), to obtain the probability density of kinematic observables.This approach is illustrated below for the estimation of the acceptance and combinatorial background distributions of the D + s → K + π − π + decay, described in Section 3. It is worth noting that, whilst these latent parameters should in principle comprise the set of features on which the parameters of interest depend strongly upon, this set need not be exhaustive.Providing that the included parameters are at least reasonably correlated with any additional parameters that are not considered, yet influence the kinematic distributions, an "effective" value of these parameters can be obtained in the maximum likelihood fit stage.As such, these latent parameters can differ from those that can be calculated directly from the dataset that the model is evaluated on, in such a way that the efficiency or background distribution can nevertheless be correctly parameterised using these.

Acceptance parameterisation
To demonstrate the feasibility of the model-assisted approach for the parameterisation of acceptances, the same model as the one described in Section 3.1 is used, however the requirements on the reconstructed D + s mesons and their decay products are varied in the generation of the training sample used to construct the ANN, as in step one above.The range of the variations for each of the five parameters of the model is given in Table 3, where the entire sample consists of 2 000 000 events that satisfy the selection requirements.Since the efficiency model is relatively simple, generation of the training sample takes only a few minutes, and therefore this can be arbitrarily large.Since the events that do not satisfy the selection requirements are rejected during generation, the initial uniform distributions of the model parameters can become significantly non-uniform for the accepted events in the training sample.To compensate for this effect, the model parameters are sampled from non-uniform prior distributions (exponential in our case), where the parameters of the prior distributions are tuned to ensure that the distribution of model parameters for the accepted events are roughly uniform.
The functional form of the efficiency model, a 7-dimensional probability density function p(m , θ , Θ), where Θ is the vector of parameters listed in Table 3, is obtained by the ANN density estimation procedure described in Section 6.The ANN topology and training parameters are the same as those highlighted in that Section 6, with the exception that here the L 2 regularisation parameter is taken to be λ 2 = 5 × 10 −3 .The projections of the result of the ANN training after 50 000 epochs in the square Dalitz-plot variables, as well as the correlations between the Dalitz-plot variables and the latent model parameters, are given in Appendix A.
An unbinned maximum likelihood fit is then performed to obtain the effective model parameters for a test sample corresponding to that described in Section 3.1.The results of the fit are presented in Fig. 10, and the model parameters, both the true generated values and the values obtained from the fit, are given in Table 3.The maximum likelihood fit is performed using iminuit [44], and the interface between the ANN implemented in TensorFlow and iminuit is provided by TensorFlowAnalysis package [39].
Since the same underlying model is used to generate both the test samples and the samples used for the ANN training, one should expect the reconstructed parameters to be statistically consistent with the true generated ones.Certain tension is observed between the parameters in Table 3, which may indicate some deficiency or ambiguities in the ANN model.Nevertheless, this gets corrected by the maximum likelihood fit to the test dataset, and a good-quality parameterisation of the distribution results is obtained with the fit quality calculated with 50 × 50 binning equal to χ 2 /nDoF = 2528.9/2494.In the case of estimating the acceptance distribution from the detailed simulation, the quality of the parameterisation will depend on how well the simplified efficiency model approximates the experimental selection.

Background parameterisation
The estimation of the combinatorial background density is performed in a similar way to that of the acceptance parameterisation, except that the ANN training sample also includes the full range of the selection variable, m D , and the test sample only contains the events in the sidebands (defined in Section 7) to reproduce a more realistic analysis scenario.As in the case of the acceptance parameterisation, the background model from Section 3.2 is used to both generate samples for the initial joint density estimation by the ANN, and the test dataset for the subsequent maximum likelihood fit.The list of  generated parameters of the model, with ranges of their variation, is given in Table 4, where the size of the training sample is 1 000 000 events.Here, the topology of the ANN is unchanged from that described in Section 7, and the penalty factor in the L 2 regularisation term is taken to be λ

Conclusion
Techniques have been proposed here to efficiently parameterise background and acceptance variations that are an essential component to multidimensional fits of hadronic decay amplitudes.Often, treatments of the acceptance variations are sub-optimal, as they either do not exploit rudimentary assumptions of local smoothness, and require large quantities of computationally expensive simulated data, or a sizeable systematic uncertainty results from the use of an inefficient parameterisation.For the background description, assumptions often have to be made on the functional form of specific backgrounds, or on the validity of an extrapolation into the signal region.These assumptions are difficult to validate, and therefore in these cases, the background can be a considerable source of systematic uncertainty.
Here, several new applications of Gaussian processes and neural networks are proposed that attempt to mitigate these issues, by utilising a more efficient parameterisation, or imposing regularisation constraints on a model with a large number of degrees of freedom.Additionally, a method is proposed that utilises a neural network to extract latent dependencies on physics parameters, which permits a more physically motivated way of imposing constraints on the resulting probability density.
The techniques proposed in this paper reduce the overall systematic uncertainty from the aforementioned effects by providing a more efficient, regularised description of the acceptance variations.Furthermore, by doing this they also reduce the computational burden of generating sufficient simulated data to obtain robust results.
These approaches also mitigate biases in the estimation of the background contributions due to correlations with selection variables; permit extrapolation into the signal region of backgrounds that, for example, are not constant throughout the control variables due to decay kinematics or kinematical constraints; and scale efficiently with increasing dimensionality.

Figure 2 :
Figure 2: Simulated combinatorial background to D + s → K + π − π + decay.Twodimensional (a) m vs. θ (b) m vs. m D , and (c) θ vs. m D projections, one-dimensional projections onto (d) m D variable with the definition of signal and sideband regions, and projections of signal and sideband regions onto (e) m and (f) θ variables.Twodimensional distributions are normalised such that the average density is equal to one.

Figure 3 :
Figure 3: Estimate of the acceptance variation using (top) Legendre polynomial and (bottom) cubic spline model over (left to right) two-dimensional m vs. θ variables, m and θ projections.

Figure 4 :
Figure 4: Result of the density estimation of the simulated sample of D + s → K + π − π + decays using Gaussian process method (a) in two square Dalitz-plot variables m and θ , and projections of the two-dimensional distribution onto (b) m and (c) θ variables.

Figure 5 :
Figure 5: Result of the density estimation of the simulated sample of D + s → K + π − π + decays using an ANN (a) in the two-dimensional square Dalitz-plot variables m and θ , and projections of this distribution onto the (b) m and (c) θ variables.

Figure 6 :
Figure 6: Results of the estimation of combinatorial background density in sideband regions using Gaussian process.Two-dimensional (first row) m D vs. m , (second row) θ vs. m , and (third row) m D vs. θ projections of the (left) simulated background sample and (right) density predictions from the fit model.(Bottom row) One-dimensional projections onto (from left to right) m , θ and m D .

Figure 7 :
Figure 7: Results of the interpolation of the combinatorial background density in the signal region using Gaussian process: (a) two-dimensional density and (b,c) its projections.

Figure 8 :
Figure 8: Results of the estimation of combinatorial background density in sideband regions using an ANN.Two-dimensional (first row) m D vs. m , (second row) θ vs. m , and (third row) m D vs. θ projections of the (left) simulated background sample and (right) density predictions from the fit model.(Bottom row) One-dimensional projections onto (from left to right) m , θ and m D .

Figure 9 :
Figure 9: Results of the interpolation of the combinatorial background density in the signal region using an ANN: (a) its two-dimensional density and (b,c) its projections.

Figure 10 :
Figure 10: Result of the density estimation of the simulated sample of D + s → K − π + π − decays using the model-assisted ANN (a) in the two square Dalitz plot variables m and θ , and the projections of the two-dimensional distribution on to the(b) m and (c) θ variables.

Figure 11 :
Figure 11: Results of the interpolation of the combinatorial background density in the signal region using the model-assisted ANN: (a) the two-dimensional density and (b,c) its projections.

2 = 1 .
The projections of ANN variables, after 50 000 training epochs, in the m , θ and m D variables, as well as the correlations of these variables with each other and with the model parameters, are shown in Appendix B. The projections of the resulting background density estimation are shown in Fig. 11, and the true and reconstructed values of the model parameters are given in Table 4.The values of the reconstructed parameters are consistent with the true values within their uncertainties, which indicates a good quality of the 11-dimensional ANN parameterisation of the background model.The fit quality calculated with 50 × 50 binning is χ 2 /nDoF = 2472.6/2491.

FitFigure 12 :FitFigure 13 :
Figure 12: Results of the estimation of the simulated acceptance distribution, as a function of the effective model parameters, using an ANN.

FitFigure 14 :
Figure 14: Results of the estimation of the simulated combinatorial background density, as a function of effective model parameters, using an ANN.

FitFigure 15 :
Figure 15: Results of the estimation of the simulated combinatorial background density, as a function of effective model parameters, using an ANN.
Figure 1: Relative efficiency variation over the square Dalitz-plot variables for D + s → K + π − π + sample obtained from high-statistics Monte-Carlo simulation.The distribution is normalised such that the average efficiency equals 1.

Table 1 :
Parameters of the Gaussian process fit to the simulated D + s → K + π − π + decays, with a Matérn kernel and a constant mean function.

Table 2 :
Parameters of the Gaussian process fit to the simulated D + s → K − π + π − background, with a Matérn kernel, constant mean function in (m , θ ), and linear mean function in m D .

Table 3 :
Parameters of the D + s → K + π − π + efficiency model: the range of parameter variations used at the ANN training stage, the true values used in the generated test sample, and the reconstructed values extracted from the fit of the ANN model to the test sample.

Table 4 :
Parameters of the D + s → K + π − π + background model: the range of parameter variations used at the ANN training stage, the true values used in the generated test sample, and the reconstructed values extracted from the fit of the ANN model to the test sample.