Nuclear Inst. and Methods in Physics Research, A

A common problem in data analysis is the separation of signal and background. We revisit and generalise the so-called sWeights method, which allows one to calculate an empirical estimate of the signal density of a control variable using a fit of a mixed signal and background model to a discriminating variable. We show that sWeights are a special case of a larger class of Custom Orthogonal Weight functions (COWs), which can be applied to a more general class of problems in which the discriminating and control variables are not necessarily independent and still achieve close to optimal performance. We also investigate the properties of parameters estimated from fits of statistical models to sWeighted data and provide closed formulas for the asymptotic covariance matrix of the fitted parameters. To illustrate our findings, we discuss several practical applications of these techniques.


Introduction
This article takes a fresh look at the sWeights (or sPlot) formalism discussed by Barlow [1] and popularised more recently by Pivk and Le Diberder [2]. The sWeights method is used to infer properties of a signal distribution in a mixed data set containing signal and background events. The signal distribution is extracted non-parametrically by applying weights to individual events; inference is then done on the weighted data set. The method is applicable when individual points of the data distribution consist of a discriminating variable (or variables), here called m, and one or more statistically independent control variables, here called t. Both m and t can be vectors of different dimensions without changing any of the conclusions; we refer only to the one-dimensional case to simplify the discussion, but the general case is always implied. By fitting parametric models to the signal and background in the discriminating variable m, one can compute the weighted distribution that represents the signal density in the control variable t. The advantage of this method, compared to a fully parametric fit to the (m, t) distribution, is that one avoids the need to parameterise the background density in the control variable t, which is often challenging.
In Section 2 we re-derive the established sWeights method from the starting point of orthonormal functions. We show several ways of calculating the weights, compare their trade-offs, and emphasise that sWeights can easily be computed without some of the restrictions seen previously. In Section 3 we then discuss a generalisation of the sWeights method which we dub Custom Orthogonal Weight functions (COWs). COWs relax most of the requirements of the sWeights formalism and can be applied to a larger class of problems, at a small loss in precision.
In Section 4 we then discuss the properties of estimates obtained when fitting models to weighted data. We give an asymptotically correct formula for the covariance matrix of the parameters obtained from such a fit.
Finally, in Section 5 we perform a variety of studies on simulated (Monte Carlo) data which deploy sWeights and COWs in various applications and compare their performance. We also discuss a test of independence that can be used to determine whether sWeights are applicable or whether the more general COWs method is needed.

sWeights as orthonormal functions
To compute the weights for the signal distribution in the control variable t, we use a discriminant variable m (often the invariant mass of some particle's decay products). The signal and background densities only need to be parameterised in the discriminant variable m. In the sWeights formalism, the variables m and t must be statistically independent in each component, so that the respective p.d.f.s of the variables factorise. The total p.d.f. then has the form
f(m, t) = z g_s(m) h_s(t) + (1 − z) g_b(m) h_b(t),   (1)
where z is the signal fraction, g_s(m) and h_s(t) are the signal p.d.f.s in the discriminating and control variables, respectively, and g_b(m) and h_b(t) are the corresponding background p.d.f.s. The sWeights method allows one to obtain an asymptotically efficient non-parametric estimate of h_s(t) while only requiring parametric models for g_s(m) and g_b(m). We stress that the sWeights method is only applicable when the p.d.f.s in m and t factorise for both the signal and the background, which is conditional on their independence. Independence is a stronger condition than lack of correlation; tests which demonstrate a lack of correlation between m and t provide necessary, but not sufficient, evidence for the applicability of the sWeights method. We come back to proper tests of independence in Section 5.
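The factorising mixture of Eq. (1) can be made concrete with a small sketch. The shapes and numbers below (Gaussian signal in m, exponential control-variable densities, flat background) are our own illustrative choices, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy model: the signal factorises into a Gaussian g_s(m) and an
# exponential h_s(t); the background into a flat g_b(m) and a broader
# exponential h_b(t).  All shapes and numbers are illustrative.
n, z = 10_000, 0.3
n_sig = rng.binomial(n, z)                 # number of signal events

m_sig = rng.normal(0.5, 0.05, n_sig)       # ~ g_s(m)
t_sig = rng.exponential(0.2, n_sig)        # ~ h_s(t)
m_bkg = rng.uniform(0.0, 1.0, n - n_sig)   # ~ g_b(m)
t_bkg = rng.exponential(1.0, n - n_sig)    # ~ h_b(t)

m = np.concatenate([m_sig, m_bkg])
t = np.concatenate([t_sig, t_bkg])
```

Because each component factorises, the t-distribution of the signal carries no information about m, which is exactly the independence requirement stated above.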

Construction of an optimal weight function
We postulate that a weight function, w_s(m), exists which extracts the signal component, h_s(t), when f(m, t) is multiplied by it and integrated over m¹:
∫ w_s(m) f(m, t) dm ≐ z h_s(t).   (2)
The left- and right-hand sides of Eq. (2) are equal in general only if the following conditions hold:
∫ w_s(m) g_s(m) dm = 1  and  ∫ w_s(m) g_b(m) dm = 0.   (3)
If we regard ∫ w(m) g(m) dm as the inner product of a vector space over functions, then these conditions define w_s(m) as the vector orthogonal to g_b(m) and normalised with respect to g_s(m). In other words, w_s(m) is an orthonormal function in this space.
Since the vector space over m is infinite-dimensional, there are infinitely many orthonormal functions w_s(m) that satisfy these conditions. For example, the classic sideband-subtraction method can be regarded as a special case where w_s(m) is a piece-wise constant function which is positive in the signal region and negative in the background region.
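The sideband picture can be sketched directly; the window edges and the flat-background assumption below are illustrative choices of ours:

```python
import numpy as np

def sideband_weight(m, signal_window=(0.4, 0.6), sidebands=((0.0, 0.4), (0.6, 1.0))):
    """Piece-wise constant weight: +1 inside the signal window, negative in the
    sidebands.  The sideband value is scaled so that a background that is flat
    in m integrates to zero; window edges here are illustrative."""
    m = np.asarray(m, dtype=float)
    w = np.zeros_like(m)
    w[(m >= signal_window[0]) & (m < signal_window[1])] = 1.0
    width_sig = signal_window[1] - signal_window[0]
    width_sb = sum(hi - lo for lo, hi in sidebands)
    for lo, hi in sidebands:
        w[(m >= lo) & (m < hi)] = -width_sig / width_sb
    return w
```

For a background flat in m, this weight satisfies ∫ w g_b dm = 0 by construction, and ∫ w g_s dm ≈ 1 to the extent that the signal is contained in the signal window, so it is one (non-optimal) member of the family of orthonormal functions described above.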
In order to obtain a unique solution for w_s(m), we can choose to minimise its variance. Since f(m, t) factorises and w_s(m) is only a function of m, we can obtain all information about w_s from the density g(m), computed by integrating Eq. (1) over t,
g(m) = z g_s(m) + (1 − z) g_b(m).
The expectation of w_s over g(m) is
E[w_s] = ∫ w_s(m) g(m) dm = z,   (5)
and the variance of w_s over g(m) is given by
Var(w_s) = ∫ w_s²(m) g(m) dm − z².
Minimising the variance, Var(w_s), guarantees that the sample estimate ẑ = (1/N) Σ_i w_s(m_i) asymptotically has minimum variance. As a by-product, this choice also produces minimum variance for the estimated background fraction (1 − ẑ), and generally smooth functions w_s(m), since oscillating solutions have larger variance. To find the function w_s(m) which minimises Var(w_s), we have to solve a constrained minimisation problem. The solution, computed in Appendix A, is
w_s(m) = (α_s g_s(m) + α_b g_b(m)) / g(m).   (7)

¹ Throughout this paper the symbol ≐ is used to indicate that the equation should be solved.
The constants α_s and α_b are obtained by inserting Eq. (7) into Eq. (3) and solving the resulting system of linear equations. The signal component plays no special role in the derivation so far. We could equally have postulated a weight function w_b(m) to extract the background, which leads to the conditions
∫ w_b(m) g_b(m) dm = 1  and  ∫ w_b(m) g_s(m) dm = 0,
and a solution of the same form as Eq. (7). The coefficients A_{ks} and A_{kb} with k ∈ {s, b}, which build the weight functions w_k(m) = (A_{ks} g_s(m) + A_{kb} g_b(m)) / g(m), can be computed by solving
Σ_l A_{kl} W_{lj} = δ_{kj},  with  W_{kl} = ∫ g_k(m) g_l(m) / g(m) dm,   (11)
where k, l, j ∈ {s, b}. In other words, the matrix A, formed by the coefficients to compute w_s(m) and w_b(m), is the inverse of the symmetric positive-definite matrix W. The discussion throughout this section assumes just two components (one signal and one background) but is equally applicable to any number of components, K, in which case the A and W matrices are not 2 × 2 but K × K. Applying Cramer's rule to Eq. (11), we get
A_ss = W_bb / det W,  A_sb = A_bs = −W_sb / det W,  A_bb = W_ss / det W.   (12)
One can further replace g(m) in the denominator of Eq. (7) (or Eq. (10)) by inserting Eq. (7) into Eq. (5) to find that z = A_ss + A_sb, and similarly one finds 1 − z = A_bs + A_bb. With these ingredients, we obtain the final equations
w_s(m) = (A_ss g_s(m) + A_sb g_b(m)) / (z g_s(m) + (1 − z) g_b(m)),
w_b(m) = (A_bs g_s(m) + A_bb g_b(m)) / (z g_s(m) + (1 − z) g_b(m)).
In summary, to obtain w_s(m) or w_b(m) one has to compute the matrix elements W_ss, W_sb, W_bb, which depend only on z, g_s(m) and g_b(m).
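The construction above can be checked numerically. The sketch below uses an illustrative Gaussian-plus-flat model of our own choosing, builds the W matrix by numerical integration, inverts it, and verifies the orthonormality conditions of Eq. (3):

```python
import numpy as np
from scipy import integrate, stats

# Illustrative model (our choice, not from the text): truncated-Gaussian
# signal and flat background on m in [0, 1], true signal fraction z.
z = 0.3
gs = lambda m: stats.truncnorm.pdf(m, (0 - 0.5) / 0.05, (1 - 0.5) / 0.05, 0.5, 0.05)
gb = lambda m: np.ones_like(np.asarray(m, dtype=float))
g = lambda m: z * gs(m) + (1 - z) * gb(m)

# W_kl = int g_k(m) g_l(m) / g(m) dm and A = W^{-1}
comps = [gs, gb]
W = np.array([[integrate.quad(lambda m: gk(m) * gl(m) / g(m), 0, 1)[0]
               for gl in comps] for gk in comps])
A = np.linalg.inv(W)

def w_s(m):
    # w_s(m) = (A_ss g_s(m) + A_sb g_b(m)) / g(m)
    return (A[0, 0] * gs(m) + A[0, 1] * gb(m)) / g(m)

# orthonormality conditions: int w_s g_s dm = 1 and int w_s g_b dm = 0
i_ss = integrate.quad(lambda m: w_s(m) * gs(m), 0, 1)[0]
i_sb = integrate.quad(lambda m: w_s(m) * gb(m), 0, 1)[0]
```

One also recovers z = A_ss + A_sb numerically, the identity used in the text to replace g(m) in the denominator.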

Application to finite samples
The calculations so far were carried out for the true p.d.f.s, g_s(m) and g_b(m), and the true signal fraction, z, on which the matrix elements depend. In practice, these need to be replaced by sample estimates ẑ, ĝ_s(m) and ĝ_b(m), typically obtained from a maximum-likelihood fit, although any kind of estimation can be used. The plug-in estimate [3] of Eq. (15) is
ŵ_s(m; ẑ) = (Â_ss ĝ_s(m) + Â_sb ĝ_b(m)) / (ẑ ĝ_s(m) + (1 − ẑ) ĝ_b(m)).   (17)
For the computation of the estimates Ŵ_{kl} we face a choice between several options.
• Variant A: We replace the true quantities in Eq. (12) with their plug-in estimates and compute the integral analytically or numerically,
Ŵ_{kl} = ∫ ĝ_k(m) ĝ_l(m) / ĝ(m) dm,  with  ĝ(m) = ẑ ĝ_s(m) + (1 − ẑ) ĝ_b(m).   (18)
Solving the one-dimensional integral numerically is not an issue; this is a standard problem for which efficient and robust algorithms exist. The computation is independent of the number of data points; a numerical integration typically requires around 100 to 1000 function evaluations. If the p.d.f.s ĝ_k(m) have no shape parameters that need to be estimated from the data, we have ĝ_k(m) = g_k(m) and only ẑ has to be estimated. In this special case, it is shown in Appendix C that the sums over sWeights in bins of t are uncorrelated and their variance is estimated by the sum of weights squared. In practice, however, the component p.d.f.s are often not known, and thus in general the sums of weights in different bins are not uncorrelated and the sum of weights squared is not a proper estimate of the bin-wise variance. Calculation of the covariance then requires the sandwich estimator described in Appendix C.

• Variant B: The integral in Eq. (18) can be replaced by a sum over the observations in the data sample. In general, an integral over a function q(m) can be written as an expectation value over the p.d.f. g(m),
∫ q(m) dm = ∫ [q(m)/g(m)] g(m) dm = E[q/g].
In a finite sample, the arithmetic mean is an unbiased estimate of the expectation, thus we can construct an estimator by replacing the expectation with the arithmetic mean over the sample,
Ŵ_{kl} = (1/N) Σ_i ĝ_k(m_i) ĝ_l(m_i) / ĝ²(m_i),   (21)
where m_i is the i-th observed value of m and N is the sample size. This is the formula to compute sWeights given in Ref. [2]. We will refer to sWeights computed with variant B as classic sWeights throughout the paper. The computation with variant B is straightforward; there is no need for a numerical integration algorithm. For large samples that exceed 1000 items, variant A will in general be faster than variant B, but the difference is hardly noticeable in practice. Sums over sWeights in bins of t computed with Eq. (21) are always correlated, even if the component p.d.f.s are parameter-free (in contrast to variant A), as shown in Ref. [4].
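A minimal numerical sketch of variant B follows; the shapes are the illustrative toy choices used before, and in this sketch the shapes and the fraction z are simply set to their true values rather than fitted:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Toy sample (assumed shapes): Gaussian signal on top of a flat background.
z, n = 0.3, 50_000
n_sig = rng.binomial(n, z)
m = np.concatenate([rng.normal(0.5, 0.05, n_sig), rng.uniform(0, 1, n - n_sig)])

# Here the shapes and the fraction are set to their true values; in a real
# analysis they come from a fit to the m-distribution.
gs = stats.norm.pdf(m, 0.5, 0.05)
gb = np.ones_like(m)
g = z * gs + (1 - z) * gb

# Variant B: W_kl = (1/N) sum_i g_k(m_i) g_l(m_i) / g(m_i)^2
G = np.stack([gs, gb]) / g
W = G @ G.T / len(m)
A = np.linalg.inv(W)

ws = (A[0, 0] * gs + A[0, 1] * gb) / g   # per-event signal sWeights
```

The sum of these per-event weights reproduces N ẑ exactly, the variant-B identity noted at the end of this subsection.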
In general, both variants A and B produce correlations between sums of sWeights in bins of t (a weighted histogram), which makes further analysis more complicated. Ref. [2] states that such bins are uncorrelated and have simple variance estimates, but this is correct only under the special circumstances discussed in the following.
In the case of variant A, the correlations vanish only if the true shapes of the component p.d.f.s, g_k(m), are known, which is almost never the case in practice. In the case of variant B, correlations are present even if the component p.d.f.s are known. The correlations are usually small, but do not vanish as the sample size grows.
In general, the covariance matrix for a weighted histogram has to be computed with a sandwich estimator, irrespective of whether variant A or B is used. Variant A has a slight advantage over B, since the computation of the sandwich estimator is a bit simpler. Sandwich estimators for binned and unbinned fits to sWeighted data are given in Ref. [4]; the sandwich estimator for an unbinned fit is described in Section 4 and an outline of the computation for a weighted histogram is given in Appendix C. The computation of these sandwich estimators can be automated by software. Altogether, we recommend variant B for practical applications, since it produces estimates with smaller variance than variant A, as described in Section 4, although the difference between the two was negligible in most toy examples that we tried.
Finally, we note that the sum over sWeights computed with either variant is exactly equal to the previously estimated signal yield ŷ_s = N ẑ that is used in their calculation,
Σ_i ŵ_s(m_i) = ŷ_s.   (22)
The proofs are provided in Appendix D. For variant B this is generally true; for variant A it is true if the yields ŷ_s, ŷ_b and the component p.d.f.s ĝ_k(m) are estimated with the extended maximum-likelihood method, described in the next section.

Connection to extended maximum-likelihood fit
There is a curious connection between Eq. (21) and the results of an extended maximum-likelihood fit in which ẑ and the shape parameters are maximum-likelihood estimates and the respective signal and background yields, y_s and y_b, are regarded as independent variables [2]. In such a fit, one maximises the extended log-likelihood function [5] which, without constant terms, is
ln L = Σ_i ln( y_s g_s(m_i) + y_b g_b(m_i) ) − (y_s + y_b).   (23)
The extremum is determined by solving the score functions
∂ ln L/∂y_k = Σ_i g_k(m_i) / ( y_s g_s(m_i) + y_b g_b(m_i) ) − 1 ≐ 0,   (24)
with k ∈ {s, b}. The maximum-likelihood estimates obtained from these score functions are ŷ_s = N ẑ and ŷ_b = N (1 − ẑ), where ẑ is the estimated signal fraction as before. The elements of the Hessian matrix of second derivatives of the log-likelihood function are given by
∂² ln L/∂y_k ∂y_l = −Σ_i g_k(m_i) g_l(m_i) / ( y_s g_s(m_i) + y_b g_b(m_i) )².   (25)
We note the similarity between Eq. (25) and Eq. (21) and evaluate the second derivative at the maximum of ln L to find
Ŵ_{kl} = −N ∂² ln L/∂y_k ∂y_l |_{ŷ_s, ŷ_b}.   (26)
This offers another opportunity to compute estimates of W, which was already pointed out in Ref. [2]. The second derivatives of the log-likelihood with respect to ŷ_s, ŷ_b, and the shape parameters φ_s, φ_b of ĝ_s(m; φ_s) and ĝ_b(m; φ_b) are routinely computed by MINUIT [6] by calling the routine HESSE to estimate the covariance matrix of the parameters once a minimum is found by the routine MIGRAD. The matrix obtained in this way is the negative inverse of the Hessian,
Ĉ = −Ĥ⁻¹.   (27)
The remaining parts of the matrix correspond to derivatives that involve one or two shape parameters of g_s or g_b. Variant C to compute the elements of Ŵ consists of the following steps:

• Invert the covariance matrix Ĉ of the fit of the yields y_s, y_b and the shape parameters φ_s, φ_b.
• Isolate the 2 × 2 sub-matrix of the Hessian which contains the derivatives with respect to the yields y_s, y_b.
• Use Eq. (26) on these matrix elements to obtain Ŵ.
It would be incorrect to switch steps 1 and 2, i.e. to isolate the 2 × 2 sub-matrix of Ĉ that contains the yields and invert it, because this does not recover the corresponding second derivatives.
An equivalent alternative is to do a second fit which leaves only the yields free while keeping the shape parameters fixed. In this case, the covariance matrix computed by MINUIT can be scaled to yield an estimate of the coefficient matrix from Eq. (11),
Â = Ĉ / N.
Mathematically, variants B and C should produce the exact same result. In practice, this is not exactly true, because the Hessian matrix is not calculated analytically with Eq. (25); the second derivatives in Eq. (27) are instead computed approximately by numerical differentiation of Eq. (23). The accuracy of a numerically computed second derivative is of the order of ε^{1/3}, where ε is the round-off error of floating-point arithmetic on a computer (ε ≈ 10⁻¹⁶ for double precision). The accuracy of the matrix computed with variant C is therefore much lower than one computed with variant B or A, so that Eq. (22) only holds approximately.
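Variant C can be mimicked without MINUIT by maximising the extended likelihood numerically and taking second derivatives by finite differences; everything below (shapes, sample, step size) is an illustrative stand-in for the MIGRAD/HESSE workflow described above:

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(3)

# Toy sample (illustrative shapes): Gaussian signal plus flat background.
m = np.concatenate([rng.normal(0.5, 0.05, 6_000), rng.uniform(0, 1, 14_000)])
gs, gb = stats.norm.pdf(m, 0.5, 0.05), np.ones_like(m)

def nll(p):  # negative extended log-likelihood in the yields (shapes fixed)
    ys, yb = p
    return (ys + yb) - np.sum(np.log(ys * gs + yb * gb))

def grad(p):
    ys, yb = p
    d = ys * gs + yb * gb
    return np.array([1 - np.sum(gs / d), 1 - np.sum(gb / d)])

res = optimize.minimize(nll, [5_000.0, 15_000.0], jac=grad, method="L-BFGS-B",
                        bounds=[(1.0, None), (1.0, None)])
ys, yb = res.x

def hessian(f, p, h=10.0):
    """Central finite-difference Hessian, standing in for MINUIT's HESSE."""
    p = np.asarray(p, dtype=float)
    k = len(p)
    def at(i, di, j, dj):
        q = p.copy()
        q[i] += di
        q[j] += dj
        return f(q)
    H = np.empty((k, k))
    for i in range(k):
        for j in range(k):
            H[i, j] = (at(i, h, j, h) - at(i, h, j, -h)
                       - at(i, -h, j, h) + at(i, -h, j, -h)) / (4 * h * h)
    return H

H = hessian(nll, res.x)   # Hessian of -ln L at the maximum
W = len(m) * H            # variant C: W_kl = -N * d2 lnL / (dyk dyl)
A = np.linalg.inv(W)
```

At the extended-likelihood optimum the sum of the yields equals the sample size and A_ss + A_sb reproduces the fitted signal fraction, up to the finite-difference accuracy discussed above.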

Custom orthogonal weight functions
So far we considered the special case where the p.d.f. is a mixture of two components that each factorise in the discriminant and control variables. We now generalise to an arbitrary number of factorising components, and also allow for a non-factorising function of frequency weights, ε(m, t), which is usually identified with an efficiency. In High Energy Physics applications, such a function may arise from a non-uniform acceptance introduced by the detector or from selection requirements applied to the data to improve the signal-to-background ratio. The p.d.f. for the observed data then is
f(m, t) = ε(m, t) ρ(m, t) / N_ε,   (29)
where the normalisation constant, N_ε, ensures that f(m, t) is a probability density. We expand the true density of the events, ρ(m, t), into factorising components,
ρ(m, t) = Σ_k z_k g_k(m) h_k(t).   (30)
The Kolmogorov–Arnold representation theorem [7,8] ensures that a finite sum of terms on the right-hand side can represent any two-dimensional function ρ(m, t). The g_k(m) and h_k(t) are normalised probability densities. Normalised Bernstein [9] or B-spline [10] basis polynomials can fill this role in general. For practical applications, it is beneficial if the expansion requires only a few terms, which can be achieved with g_k(m) and h_k(t) suitably chosen for the specific case. For a given expansion, we will assume that the first n_s terms pertain to the signal density while the others describe the background.
Each partial sum in general produces a non-factorising function. This makes it possible to generalise the previous results from Section 2.1 to non-factorising signal and background components. Any single function h_k(t) in ρ(m, t) can be isolated by a weight function
w_k(m) = Σ_l A_{kl} g_l(m) / I(m),
where the A_{kl} are again obtained by inverting the matrix with elements W_{kl} = ∫ g_k(m) g_l(m) / I(m) dm. The function I(m) is an arbitrary non-zero function in the considered range of m; this is another generalisation with respect to classic sWeights. We dub it the variance function, because the optimal choice of I(m) corresponds to the point-wise variance of a density estimate (as we will see later). It is straightforward to show, using Σ_l A_{kl} W_{lj} = δ_{kj}, that weight functions defined in this way are orthonormal to the component p.d.f.s g_l(m),
∫ w_k(m) g_l(m) dm = δ_{kl}.
The Custom Orthogonal Weight function (COW) to extract the density h_k(t) from data is then
w′_k(m, t) = w_k(m) / ε(m, t),
which is now a function of both m and t. The expectation value of the COW in an infinitesimal bin, dt, of the control variable is proportional to z_k h_k(t), so it provides an asymptotically unbiased estimate of the efficiency-corrected density, h_k(t), for any choice of I(m). The COWs that project out the entire signal or background component are given by the corresponding partial sums of the w′_k. By integrating the expectation over t, one sees that E[w′_k] = z_k. An estimate of z_k is therefore the sample mean, ẑ_k = (1/N) Σ_i w′_k(m_i, t_i).
The estimate for z_k used here is derived from Eq. (29). In analogy with Section 2, one also has to replace the true matrix W with an estimate. The estimates for COWs corresponding to variants A and B for sWeights are
Ŵ^A_{kl} = ∫ ĝ_k(m) ĝ_l(m) / I(m) dm  and  Ŵ^B_{kl} = (1/N) Σ_i ĝ_k(m_i) ĝ_l(m_i) / ( I(m_i) ĝ(m_i) ),
where all true functions are replaced with estimates and ĝ(m) is an estimate of the observed density, g(m) = ∫ f(m, t) dt, in the discriminant variable. Since Ŵ^B for COWs is more cumbersome to estimate, we consider only Ŵ^A in the following and omit the distinction. In contrast to sWeights, the sum of COWs over the components is not unity in general, even if ε(m, t) = 1. This is the case only if I(m) is a linear combination of the basis functions g_k(m); one finds
Σ_k w_k(m) = 1  if  I(m) = Σ_k c_k g_k(m)
for arbitrary constants c_k, as shown in Appendix E. When I(m) takes this form, every event contributes with a total weight of unity distributed over the possible states k. This property is not of practical relevance, however. A corollary of this result is that, with an increasing number of terms, the sum Σ_k w_k(m) will converge towards unity for any function I(m), since a linear combination of sufficiently many basis functions g_k(m) always allows for a good approximation of I(m).
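The COW construction can be sketched numerically for a hypothetical three-component basis of our own choosing, with the simplest valid variance function I(m) = 1:

```python
import numpy as np
from scipy import integrate, stats

# Hypothetical components on m in [0, 1]: one peaked "signal" term and two
# normalised linear "background" terms; variance function I(m) = 1, which
# is a valid (if not optimal) choice.
g0 = lambda m: stats.truncnorm.pdf(m, -10, 10, 0.5, 0.05)
g1 = lambda m: 2 * (1 - m)
g2 = lambda m: 2 * m
comps = [g0, g1, g2]
I = lambda m: 1.0

# W_kl = int g_k(m) g_l(m) / I(m) dm,  A = W^{-1},
# w_k(m) = sum_l A_kl g_l(m) / I(m)
W = np.array([[integrate.quad(lambda m: a(m) * b(m) / I(m), 0, 1)[0]
               for b in comps] for a in comps])
A = np.linalg.inv(W)

def w(k, m):
    return sum(A[k, l] * comps[l](m) for l in range(len(comps))) / I(m)

# orthonormality: int w_k g_l dm = delta_kl
D = np.array([[integrate.quad(lambda m: w(k, m) * comps[l](m), 0, 1)[0]
               for l in range(3)] for k in range(3)])
```

Incidentally, g1 + g2 = 2 here, so I(m) = 1 happens to be a linear combination of the basis functions, and the three weights sum to unity at every m, illustrating the Appendix E property quoted above.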
We now discuss optimal choices for the variance function I(m). We can find functions such that (I) the variances of the ẑ_k are minimal, or (II) the ẑ_k are maximum-likelihood estimates.
As shown in Appendix F, requirement I leads to a variance function I_I(m) given by the point-wise variance of the efficiency-corrected observed density in m. This choice minimises the variances of the ẑ_k and is therefore optimal. An estimate of I_I(m) can be obtained from a histogram of the 1/ε²(m, t)-weighted m-distribution, or by fitting a suitable parameterisation to it. The extreme case of a single-bin histogram is equivalent to I(m) = 1 (the scale of I(m) is irrelevant). The exact details of the binning are not important, since a coarse binning will merely increase the variance of ẑ_k above the minimum possible. Appendix G shows that the alternative requirement II leads to
I_II(m) = Σ_k ẑ_k g_k(m),
where the ẑ_k are estimates of the true fractions obtained from a 1/ε(m, t)-weighted unbinned maximum-likelihood fit. In general, this variance function is different from I_I(m) and not optimal.
In the special case of constant efficiency, equivalent to setting ε(m, t) = 1, the two variance functions converge asymptotically. Computing sWeights as described in Section 2 is equivalent to using I_II(m), which is asymptotically optimal in this case.

Mismodelled signal density
A common task is to extract a single signal component that factorises in m and t, contaminated with a background that does not factorise. To construct the signal-extracting COW, w′_0, one requires a model for the signal density, g_0(m), according to Eq. (32), a set of background p.d.f.s g_k(m) (where k = 1, …, n), and a variance function, I(m). Often, the signal shape g_0(m) is a non-trivial function containing a number of nuisance parameters. We now investigate what happens if g_0(m) is mismodelled and does not match the true signal density.
We distinguish between the true p.d.f.s, g_k(m), and their mismodelled proxy p.d.f.s, g̃_k(m), which are used in the calculation of the COWs, w̃_k(m). If we review the mathematical steps in the previous section, we find that g̃_k(m) = g_k(m) is not required for the construction of COWs. The construction steps and the properties of the W and A matrices remain the same if g̃_k(m) ≠ g_k(m). We only get a different result when we integrate over the product of the total density f(m, t) and the COWs.
We can write the expected signal weight in an infinitesimal bin of width dt in the control variable as
E[w̃′_0 | t] dt ∝ Σ_k z_k h_k(t) ∫ w̃_0(m) g_k(m) dm.
Since we only consider the signal p.d.f. to be mismodelled, we have g̃_k(m) = g_k(m) for the background components k ≥ 1. Using this and the symmetry of W, we find that the expectation is proportional to h_0(t): the signal term carries the constant factor ∫ w̃_0(m) g_0(m) dm, which is independent of t, and the sum over the background terms is zero, because w̃_0 remains orthogonal to the correctly modelled background p.d.f.s.
Therefore, a mismodelling of the signal component g_0(m) in the discriminating variable introduces no bias in the estimation of h_0(t), although the method is then no longer optimal in the previous sense. This offers an intriguing possibility in practice. If one is only interested in projecting out the normalised signal p.d.f., h_0(t), there is great freedom in the construction of the p.d.f. g_0(m) and the variance function I(m) in the discriminating variable. Any p.d.f. works for g_0(m) which is not a linear combination of the background p.d.f.s g_k(m) with k ≥ 1. The best results are nevertheless obtained with a proxy that is as close to the true g_0(m) as possible, and with as few basis polynomials for the background as possible. Poorly chosen functions I(m) and g_0(m) will increase the variance of the estimate ĥ_0(t).
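The no-bias property can be demonstrated numerically. Below, a COW is built from a deliberately mismodelled signal proxy (width 0.08 instead of the true 0.05); all shapes are illustrative choices of ours:

```python
import numpy as np
from scipy import integrate, stats

# True signal (width 0.05) versus a deliberately mismodelled proxy
# (width 0.08); two normalised linear background terms.
gs_true = lambda m: stats.truncnorm.pdf(m, -10, 10, 0.5, 0.05)
gs_prox = lambda m: stats.truncnorm.pdf(m, -6.25, 6.25, 0.5, 0.08)
g1 = lambda m: 2 * (1 - m)
g2 = lambda m: 2 * m
comps = [gs_prox, g1, g2]           # COW built from the *proxy* signal

W = np.array([[integrate.quad(lambda m: a(m) * b(m), 0, 1)[0] for b in comps]
              for a in comps])      # variance function I(m) = 1
A = np.linalg.inv(W)
w0 = lambda m: sum(A[0, l] * comps[l](m) for l in range(3))

# the correctly modelled backgrounds are still annihilated exactly ...
i_b1 = integrate.quad(lambda m: w0(m) * g1(m), 0, 1)[0]
i_b2 = integrate.quad(lambda m: w0(m) * g2(m), 0, 1)[0]
# ... while the true signal picks up a constant, t-independent factor
i_s = integrate.quad(lambda m: w0(m) * gs_true(m), 0, 1)[0]
```

The background projections vanish, so the extracted t-distribution remains proportional to h_0(t); only its normalisation is affected by the mismodelling, exactly as argued above.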

COWs in the wild
In this section, we remark on points that are important for the practical applications of COWs, which arose in discussions on this method.
Firstly, we emphasise the important consequence for analyses in particle physics that follows from the previous section. If the signal factorises in m and t, a COW for the signal can be computed without a fit. We refer to this variant as COWs lite, to contrast it with full COWs where both signal and background are non-factorising. A histogram of the simulated signal distribution in m can be used for g_0(m), the proxy p.d.f. for the signal density, and a histogram of the signal and background distribution g(m) in the simulation can be used for the variance function I(m). Even if the simulation does not describe the real experiment perfectly, using these proxies does not create a bias, and the substitutes will usually be close to optimal. The optimal variance function I(m) can also be estimated from data with a 1/ε²(m, t)-weighted histogram of the m-distribution. The details of the binning are not important provided no bins are empty. A non-optimal binning will increase the variance of the signal COW above the minimum possible, but the influence is very weak, and it is therefore not necessary to carefully optimise the binning. The extreme case of a single-bin histogram is equivalent to I(m) = 1 (the scale of I(m) is irrelevant) and is also a valid choice.
A good description of the background under the signal is, however, crucial to avoid bias. The background can be expanded generically into Bernstein or B-spline polynomials, which can approximate any p.d.f. with enough terms. A systematic bias related to the choice of the background model, i.e. one incurred by not having sufficient terms in the background description, can be probed by adding more terms and verifying with goodness-of-fit tests (see e.g. Ref. [11]) that the model is sufficient to describe the distribution. For a chi-square test statistic in a binned fit, the Fisher F-test indicates whether adding further terms improves the model significantly. When two nested models are tested on a histogram with N_bin bins, where model 1 has n_1 terms and model 2 has n_2 > n_1 terms, the statistic
F = [ (χ²_1 − χ²_2) / (n_2 − n_1) ] / [ χ²_2 / (N_bin − n_2) ]
is asymptotically F-distributed with (n_2 − n_1, N_bin − n_2) degrees of freedom. Cross-validation [12] is another option if the fit is unbinned.

Secondly, we want to give some insight into why it is beneficial to include the efficiency function ε(m, t) in the sWeights estimation, instead of trying to separate signal and background first and then correct for efficiency in a separate step. If the efficiency function is not directly included in the analysis, both signal and background are non-factorising because of the effect of the efficiency function. In the COWs framework, we can still extract sWeights by setting ε(m, t) = 1, but now sufficiently many terms in the signal and background parts of the data model are required to account for the factorisation-breaking effects. This reduces the statistical power of the method, and an additional complication arises. Once the signal density in the observed sample, h̃_s(t), has been determined, the efficiency correction must be done with the signal efficiency projected into the control variable, ε̄(t), to obtain the true density h_s(t) ∝ h̃_s(t)/ε̄(t). In weighted unbinned fits, or when generating a histogram estimate of h_s(t), one has to use the per-event weights w_s(m_i)/ε̄(t_i), where i indexes the data points in the sample.
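The F-test construction above is a one-liner in practice; the function name and signature below are ours, for illustration:

```python
import numpy as np
from scipy import stats

def fisher_f_test(chi2_1, n1, chi2_2, n2, n_bins):
    """F-test for nested chi-square fits: does model 2 (n2 > n1 terms)
    improve significantly on model 1?  Names and signature are ours."""
    f = ((chi2_1 - chi2_2) / (n2 - n1)) / (chi2_2 / (n_bins - n2))
    p_value = stats.f.sf(f, n2 - n1, n_bins - n2)
    return f, p_value
```

For instance, dropping the chi-square from 120 to 100 by adding one term to a 3-term model fitted over 100 bins gives F = 19.2, a highly significant improvement; adding yet another term that barely changes the chi-square would give an F value compatible with no improvement.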
Using w_s(m_i)/ε(m_i, t_i) as an event-by-event weight instead is wrong, since the m-dependence in the efficiency factor destroys the orthogonality relations for the COW, and the signal estimate in t becomes polluted by background. The projected efficiency is given by
ε̄(t) = ∫ ε(m, t) ρ_s(m, t) dm / ∫ ρ_s(m, t) dm,
where ρ_s(m, t) denotes the signal component of the true p.d.f., and it is not straightforward to estimate from the sample. One could take ε̄(t) from simulation, with the caveat that the simulation may differ from the real experiment. If the efficiency function factorises, ε(m, t) = ε_m(m) ε_t(t), we get ε̄(t) = c ε_t(t), with a constant c that is independent of t. This marginally simplifies the matter, since w_s(m_i)/ε_t(t_i) can be used instead of w_s(m_i)/ε̄(t_i) in this case, if only the shape of h_s(t) is of interest. Nevertheless, these additional complications make the two-step approach unfavourable.

Variance of estimates from weighted data
This section discusses how to correctly estimate parameter uncertainties in an unbinned fit of weighted data when using sWeights. The more complex binned fit for this case is described in Ref. [4]. We give explicit formulas for classic sWeights computed with variant B, as introduced in Section 2.2. The general approach discussed here also applies to fits of weighted data obtained from variant A or from the COWs method, but for brevity we do not give explicit formulas for those variants, since each variant requires its own modified uncertainty estimate. We point out how the formulas have to be adapted, but leave the explicit calculations to the reader.
Parameter estimation with weighted unbinned data sets can be performed by maximising the weighted likelihood [13], which is equivalent to solving the weighted score functions
Σ_i w_i ∂ ln h_s(t_i; θ) / ∂θ ≐ 0,   (50)
with w_i = w_s(m_i) in the case of sWeights or w_i = w′_s(m_i, t_i) in the case of COWs, and θ the shape parameters of the signal p.d.f. h_s(t; θ). The weighted likelihood is not a true likelihood (a product of probabilities), and so the inverse of the Hessian matrix [13] of the weighted likelihood does not asymptotically provide an estimate of the covariance matrix of the parameters. Eq. (50) is an example of an M-estimator [14]. A complete derivation of the asymptotic covariance matrix of the parameters can be found in the appendix of Ref. [4]; here we only summarise the main findings.
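A weighted unbinned fit of the form of Eq. (50) can be sketched as follows; the exponential model and the random stand-in weights are illustrative choices of ours, not per-event sWeights from a real fit:

```python
import numpy as np
from scipy import optimize

rng = np.random.default_rng(5)

# Weighted unbinned fit sketch: an exponential model h_s(t; tau) fitted to a
# weighted sample; the weights are random stand-ins for per-event sWeights.
t = rng.exponential(0.5, 5_000)
w = rng.normal(1.0, 0.3, 5_000)

def neg_weighted_loglik(tau):
    # -sum_i w_i ln h_s(t_i; tau) with h_s(t; tau) = exp(-t/tau)/tau
    return -np.sum(w * (-np.log(tau) - t / tau))

res = optimize.minimize_scalar(neg_weighted_loglik, bounds=(0.01, 5.0),
                               method="bounded")
tau_hat = res.x
```

For this particular model the weighted score equation has the closed-form solution τ̂ = Σ w_i t_i / Σ w_i, which the numerical optimum reproduces. The covariance of τ̂, however, must come from the sandwich construction of this section, not from the curvature of this pseudo-likelihood.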
A complication arises from the fact that the sWeights depend, via Eq. (17), on the elements of the Ŵ matrix, which is determined via Eq. (21). The estimates Ŵ_{kl} in turn depend on the estimates of the signal and background yields, ŷ_s and ŷ_b, usually determined from an extended maximum-likelihood fit. Problems of this type are described as two-step M-estimation in the statistical literature [15,16]. To account for the fact that these parameters are estimated from the same data sample and are not independent, one has to combine the estimating equations for the parameters of interest with those of the yields and the inverse covariance matrix elements in a single vector.
We construct the quasi-score function ψ(λ), where λ = {θ, y_s, y_b, W_ss, W_sb, W_bb, φ_s, φ_b} is the vector of all such parameters; φ_s and φ_b are themselves vectors of the shape parameters in g_s and g_b, respectively. The elements of ψ are the weighted score functions for θ from Eq. (50), the extended maximum-likelihood score functions for the yields y_s, y_b and the shape parameters φ_s, φ_b, and the estimating equations for the matrix elements,
ψ_{W_{kl}} = W_{kl} − (1/N) Σ_i ĝ_k(m_i) ĝ_l(m_i) / ĝ²(m_i),
where (kl) iterates over the three unique combinations {ss, sb, bb}, and the shape parameters of g_s and g_b run between {1 … dim φ_s} and {1 … dim φ_b}, respectively. For reference, these can be compared to the equivalent expressions in Eq. (21) and Eq. (24). One can show that E[ψ(λ_0)] = 0 if λ_0 is the vector of true parameter values [4]. Therefore, a consistent estimate λ̂ can be constructed as the solution of ψ(λ) ≐ 0. We note that the elements of ψ(λ) can be multiplied by arbitrary non-zero constants without changing these results.
The asymptotic covariance of λ̂, which includes the parameters of interest θ, is then given by [17–19]
C_λ = E[∂ψ/∂λ]⁻¹ E[ψ ψᵀ] ( E[∂ψ/∂λ]⁻¹ )ᵀ,   (52)
where ∂ψ/∂λ is the Jacobian matrix built from the derivatives of the quasi-score elements with respect to the parameters. We note that the inverse of the Jacobian ∂ψ/∂λ introduces correlations between the parameter uncertainties. In a finite sample, the expectation values in Eq. (52) can be estimated from the sample: the estimate for E[∂ψ/∂λ] is ∂ψ/∂λ evaluated at λ̂, while the elements of the middle matrix, E[ψψᵀ], are provided in Appendix H. In the literature, Eq. (52) is often referred to as the sandwich estimator, but in this case the variance of the score is modified because we consider fluctuations in the sample size.
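The basic structure of Eq. (52) is easy to code generically. The sketch below implements the plain sandwich (function names are ours, and the sample-size fluctuation term mentioned above is not included):

```python
import numpy as np

def sandwich_cov(psi, jac, lam_hat, data):
    """Plain sandwich covariance C = J^{-1} B J^{-T}, with
    J = sum_i d psi/d lam and B = sum_i psi psi^T, both evaluated at the
    estimate lam_hat.  `psi(lam, x)` returns the per-event quasi-score
    vector and `jac(lam, x)` its derivative matrix; names are ours."""
    scores = np.array([psi(lam_hat, x) for x in data])    # shape (N, k)
    B = scores.T @ scores
    J = np.sum([jac(lam_hat, x) for x in data], axis=0)
    Jinv = np.linalg.inv(J)
    return Jinv @ B @ Jinv.T
```

As a sanity check, for the trivial M-estimator ψ(μ, x) = x − μ this reduces to the usual (biased) sample variance of the mean.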
This general pattern repeats for sWeights obtained with variant A and for COWs. Eq. (52) is general and holds for all variants, and the elements ψ_θ of the quasi-score vector in Eq. (51) always remain the same, but the other parts of the quasi-score vector change. The vector has to include a quasi-score for each sample estimate that is used in the calculation of the weights.
• If sWeights are computed with variant A, the estimates of the W matrix given by Eq. (18) are not computed from the sample; they are a function of the other estimates, ẑ = ŷ_s/(ŷ_s + ŷ_b) and the shape-parameter estimates. The ψ_{W_{kl}} drop out of the quasi-score vector, since the weights w_s(m; z, φ_s, φ_b) in ψ_θ can then be expressed directly as a function of these parameters by inserting Eq. (18).
• If COWs lite are computed as described in Section 3.2 with a fixed signal model g_0(m), or full COWs with a generic expansion for the signal, a fixed variance function I(m) = 1 (or any other fixed function), and a fixed efficiency function, the quasi-score vector reduces to the ψ_θ. Only in this case are the weights independent of the sample; the sum of weights squared is then an asymptotically correct estimate of the weight variance, as described in Appendix B, and the signal p.d.f. h_s(t) can be estimated with a simple weighted histogram, as described in Appendix C.

• If COWs are computed and the optimal signal p.d.f. g_0(m) is estimated from the sample, the estimating equations ∂ ln L/∂φ for the shape parameters φ of g_0(m; φ) have to be included in the score vector, as in the case of classic sWeights. If the estimated variance function Î(m) = Î_II(m) is used, one needs in addition ∂ ln L/∂z_k, where z_k is the amplitude of component k, again in analogy to classic sWeights. If the optimal variance function Î(m) = Î_I(m) is estimated from the sample as a histogram, as described in Section 3.2, one has to include a quasi-score function for each bin of the histogram, because the value in each bin is a sample estimate. If the efficiency function is estimated from the sample as well, its score functions also need to be included.
We close with a discussion of an explicit result obtained for classic sWeights (signal and background each independent in the discriminating and control variables, and no factorisation-breaking efficiency correction) computed with variant B, when the shapes of g_s(m) and g_b(m) are fixed. Then, some simplifications of Eq. (52) are possible, as detailed in Ref. [4]. They result in a two-term covariance matrix, Eq. (53), for the parameters of interest θ, in which (kl) and (k′l′) iterate over {ss, sb, bb} and ŵ_i = w_s(m_i; ẑ). The asymptotically correct expression for the binned approach is also derived in Ref. [4]. The first term of Eq. (53) is the covariance for a weighted score function as described by Eq. (50) with independent weights w_i. The second term is specific to sWeights computed with variant B; since that term is always positive, it reduces the estimated covariance of θ̂. This reduction is caused by the fact that the sWeights are estimated from the same data sample. If the shapes of g_s(m) and g_b(m) are also estimated from the data sample, Eq. (53) has to be extended with further terms, see Appendix H.

Practical applications of COWs and sWeights
In this section we investigate the performance of sWeights and the new COW methods in various test-case scenarios. The studies presented in this section make use of the sweights Python package [20], developed by the authors. The software includes generic implementations for extracting classic sWeights (Section 2) and COWs (Section 3) with the variants detailed in this document, as well as a class which performs a correction to the covariance matrix when fitting unbinned weighted data (Section 4). The Python interface offers support for probability density functions defined in either scipy [21], ROOT (via TTrees) [22] or RooFit [23]. Classic sWeights computed with variant B are also implemented in the RooStats [24] package, but there is no implementation of the other variants or COWs.
We emphasise again that computing sWeights only requires sensible estimates of the signal and background shapes, ĝ_s(m) and ĝ_b(m), and the signal fraction ẑ. It does not require any special refitting or yield-only fitting, which has commonly been recommended in other sWeights discussions and is enforced in the RooStats software implementation.
Using the description for sWeights outlined in this article, one only needs to fit the discriminating variable(s) m (usually a candidate invariant mass) once to obtain ĝ_s(m), ĝ_b(m), and the signal fraction ẑ, with the freedom to float, fix or constrain (by means of penalty terms in the likelihood) any of the shape or yield parameters.
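The recipe above can be sketched in a few lines. The following is a minimal illustration with made-up toy shapes, in which the true shapes and yields stand in for the result of the single mass fit; it is not the sweights package API, just the variant-B formulas written out with numpy:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Toy sample in the discriminating variable m on [5.0, 5.6]:
# a Gaussian signal peak on a truncated-exponential background.
N_s, N_b = 5_000, 20_000
sig = stats.norm(5.28, 0.03)
bkg = stats.truncexpon(b=2.0, loc=5.0, scale=0.3)   # support [5.0, 5.6]
m = np.concatenate([sig.rvs(N_s, random_state=rng),
                    bkg.rvs(N_b, random_state=rng)])

# Component densities and the total fitted density sum_c N_c g_c(m_i).
g = np.stack([sig.pdf(m), bkg.pdf(m)])
yields = np.array([N_s, N_b])
denom = yields @ g

# Variant B: W_kl = sum_i g_k(m_i) g_l(m_i) / denom(m_i)^2, then A = W^{-1}.
W = (g / denom) @ (g / denom).T
A = np.linalg.inv(W)

# Per-event sWeights: w_k(m_i) = sum_l A_kl g_l(m_i) / denom(m_i).
w = (A @ g) / denom
w_s, w_b = w
print(w_s.sum(), w_b.sum())   # each sum reproduces the corresponding yield
```

The sum of each component's weights reproduces the yield used in the denominator by an algebraic identity of the variant-B construction, which makes a convenient sanity check.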
More generally, the sWeights formalism described in this article extracts the weight function w_k(m) for each component k, which is valid for any control variable that is independent of the discriminating variable. There can be several control variables, or the control variable can be multi-dimensional. There are other beneficial consequences. The range used to estimate the p.d.f.s and yields can be different from the range used to extract the weights. It is also possible to use a binned fit to obtain estimates of the p.d.f.s and still extract per-event sWeights. This is helpful if the sample size is very large, when an unbinned fit can take much more computation time than a binned fit.
In the case of extracting COWs for a factorising signal component (we call this case COWs lite), a fit never even needs to be performed. One needs an approximation g_s(m), which does not have to agree with the true signal p.d.f. (although it should be close, to minimise the variance of the weights), a generic expansion for the background, and the variance function I(m), which can be estimated as a histogram from the data, as described in Section 3.2.
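A COWs lite construction can be sketched as follows, under simplifying assumptions: the variance function is fixed to I(m) = 1, the background expansion uses plain normalised monomials, and the toy shapes are invented for illustration. No fit to the data is performed; only numerical integrals of known functions are needed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
lo, hi = 5.0, 5.6
N_s, N_b = 5_000, 20_000
sig = stats.norm(5.28, 0.03)
bkg = stats.truncexpon(b=2.0, loc=5.0, scale=0.3)   # true background, never fitted
m = np.concatenate([sig.rvs(N_s, random_state=rng),
                    bkg.rvs(N_b, random_state=rng)])

# Basis: the (approximate) signal p.d.f. plus monomials in m, each
# normalised to unit integral on [lo, hi], as a generic background expansion.
def basis(x):
    u = (x - lo) / (hi - lo)
    rows = [sig.pdf(x)]
    rows += [(j + 1) * u**j / (hi - lo) for j in range(5)]   # orders 0..4
    return np.stack(rows)

# W_kl = int g_k(m) g_l(m) / I(m) dm with I(m) = 1, by midpoint integration.
n_pts = 4000
dm = (hi - lo) / n_pts
grid = lo + (np.arange(n_pts) + 0.5) * dm
G = basis(grid)
W = (G * dm) @ G.T
A = np.linalg.inv(W)

# Signal COW w_s(m) = sum_l A_0l g_l(m); applied per event.
w_s = A[0] @ basis(m)
z_hat = w_s.mean()      # estimate of the signal fraction
print(z_hat)            # near the true value 5000/25000 = 0.2
```

Because the weight function is orthogonal to every basis function except the signal one, its sample mean estimates the signal fraction, up to the residual of the polynomial approximation of the background and statistical fluctuations.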
We demonstrate in the examples below that there are some pitfalls to be wary of. Generally, we recommend that each non-trivial use-case follows our approach here: produce ensembles of simulated events to check that biases are small and variances are as expected. We start off by explaining how a statistical test of independence can be helpful in deciding whether sWeights can be applied.

Statistical test of independence
A prerequisite for the extraction of sWeights, described in Section 2, is that the signal and background samples are each independent in the discriminant and control variables, so that the p.d.f. of the respective component factorises for the discriminant and control variables. If this is not the case then the extracted sWeights are biased in general. The COWs method, described in Section 3, overcomes this, but it is useful to know when the sWeights method can be applied and is sufficient. A statistical test of independence is useful here.
It is important to truly test for independence. In practice, it is common to compute the correlation coefficient of the discriminant and control variable and test whether it is compatible with zero, but this cannot detect a non-linear dependence that has no linear component: for example, the functional relationship t = m² will erroneously pass this test if m is distributed symmetrically around zero. We recommend the USP test of independence [25], which is applied to a two-dimensional histogram constructed from pairs of the discriminant and control variable. The test has proven optimality properties and high statistical power [25].
Implementations of the USP test are available in R [26] and Python [27]. The implementations use a histogram as input and compute a p-value. The binning of the histogram should be fine enough to capture the essential variation in both variables. Bins with a small number of entries, or even zero entries, are not an issue for this test. If the p-value is very small, evidence for a dependence has been detected. The test needs to be applied to the signal and background samples separately, which requires applying it to simulation, where signal and background can be disentangled. If the test passes for the signal but not for the background, one can use the COWs lite approach described in Section 3.2. If both components show dependence, one has to use the full COWs method described in Section 3.
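The general idea of a histogram-based permutation test of independence can be sketched as follows. This is not the USP test itself, only the same broad recipe: compare a dependence statistic of the observed 2D histogram with its distribution under random permutations of one variable, which destroy any dependence. The quadratic example below has zero linear correlation but is strongly dependent:

```python
import numpy as np

def perm_independence_test(x, y, bins=10, n_perm=200, rng=None):
    """Permutation test of independence on a 2D histogram (illustrative only)."""
    rng = rng if rng is not None else np.random.default_rng()

    def stat(a, b):
        # Squared deviation of the 2D histogram from the product of marginals.
        h, _, _ = np.histogram2d(a, b, bins=bins)
        expected = np.outer(h.sum(axis=1), h.sum(axis=0)) / h.sum()
        return np.sum((h - expected) ** 2)

    t_obs = stat(x, y)
    t_perm = np.array([stat(x, rng.permutation(y)) for _ in range(n_perm)])
    return (1 + np.sum(t_perm >= t_obs)) / (1 + n_perm)   # permutation p-value

rng = np.random.default_rng(3)
x = rng.normal(size=2000)
y_indep = rng.normal(size=2000)
y_dep = x**2 + 0.1 * rng.normal(size=2000)   # no linear correlation, strong dependence
p_indep = perm_independence_test(x, y_indep, rng=rng)
p_dep = perm_independence_test(x, y_dep, rng=rng)
print(p_indep, p_dep)
```

For real analyses the published USP implementations should be preferred; this sketch only illustrates why a permutation test on a 2D histogram catches the t = m² case that the correlation coefficient misses.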

A simple example comparing sWeights variants
We apply the sWeights method to a simple toy example and illustrate the small differences between the variants for computing the W matrix described in Section 2.2. A common application of sWeights in particle physics is to extract the lifetime of a particle using its invariant mass to isolate the signal from the background. We consider two independent variables: the invariant mass, m, and decay time, t, of a meson candidate. The simulated dataset contains a mixture of signal and background events. The signal is normally (exponentially) distributed in m (t), whilst the background is exponentially (normally) distributed in m (t). The p.d.f. used to generate events, which is the f(m, t) of Eq. (1), is shown in Fig. 1.
For each simulated dataset, the estimates ĝ_s(m) and ĝ_b(m) are obtained from an unbinned maximum-likelihood fit to the generated invariant mass distribution. The Ŵ matrix, with variants A to C, is computed with Eqs. (18), (21), and (26), respectively. Within variant C, we compute the Ŵ matrices from the covariance matrix obtained from the HESSE routine in the iminuit package [6,28] using both of the methods described in Section 2.3: (i) by inverting the covariance matrix obtained from the full fit and extracting the relevant components, and (ii) by fitting once to obtain maximum-likelihood estimates for all parameters, then fitting again with only the event yields as free parameters, while all other parameters are fixed to their previous values, and inverting the resulting 2 × 2 covariance matrix. Finally, the weight functions, ŵ_s(m) and ŵ_b(m), are computed for each variant using Eq. (17).
The weight functions, ŵ_s(m) and ŵ_b(m), from one pseudo-experiment containing 50 000 (200 000) signal (background) events are shown for the nominal variant B method in Fig. 2 (left). The other variants give very similar distributions. As expected, tiny differences can be seen when inspecting their relative differences, as shown in Fig. 2 (right), which vary from pseudo-experiment to pseudo-experiment. It is explained in Section 2.3 why the results obtained with variants Ci and Cii differ from B, although they should be mathematically identical. We evaluate the sum of weights and the sum of squared weights for all four methods in order to compare with the yield estimates and uncertainties extracted from the fit to the discriminant variable. The sum of squared weights is an approximate estimate of the variance of the sum of sWeights. As shown in Appendix B, it is an exact estimate only when the weights are independent of the sample; sWeights are not independent, since the component p.d.f.s have nuisance parameters that are estimated from the same sample. The deviation in this example is small, but that is not always the case. The results are shown in Table 1, along with those from maximum-likelihood fits to the distribution in m, in which firstly all parameters are estimated, and secondly only the event yields are estimated, while the other parameters are set to their respective true values. As expected, we find that the sum of squared weights is comparable to the variance obtained from a fit in which only the yields are estimated and all shape parameters are fixed. The sum of squared weights alone underestimates the variance of the fitted yields when the shape parameters are also estimated from the sample. This is important to keep in mind, and it is why in general one has to use the sandwich estimator described in Section 4 to fully propagate the uncertainty, which also takes into account that the weights are not independent of the sample.
The validity of Eq. (22) for variants A, B, and C is also demonstrated, i.e. that the fitted yield is exactly reproduced by the sum of weights. In the case of variants A and C, the small deviations originate from round-off errors and the comparatively low accuracy of a numerically computed second derivative. Finally, the sWeights are applied to the distribution in the control variable, t, which is then fitted with an exponential distribution. This accurately reproduces the true shape, h_s(t), and gives an estimate of the exponential slope parameter, λ, consistent with that obtained from a two-dimensional unbinned maximum-likelihood fit, as shown in Fig. 3 and Table 2. The estimated exponential slopes for each variant of sWeights and for the full two-dimensional fit are given in Table 2. In the case of the fits to weighted data, the uncertainties on the slopes are calculated with Eq. (53). The precision when fitting the weighted data is comparable to the full two-dimensional fit in this particular case, but in general this will not always be true.
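A weighted fit of the exponential slope can be sketched end-to-end with toy shapes. This is a hedged illustration, not the paper's exact configuration: the true shapes and yields are plugged in where a mass fit would normally go, and the exponential is taken as untruncated, so the weighted likelihood equation has a closed-form solution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Toy sample: mass m (discriminating) and decay time t (control).
N_s, N_b, lam_true = 5_000, 20_000, 2.0
sig_m = stats.norm(5.28, 0.03)
bkg_m = stats.truncexpon(b=2.0, loc=5.0, scale=0.3)   # support [5.0, 5.6]
m = np.concatenate([sig_m.rvs(N_s, random_state=rng),
                    bkg_m.rvs(N_b, random_state=rng)])
t = np.concatenate([rng.exponential(1 / lam_true, N_s),   # signal decay times
                    np.abs(rng.normal(1.5, 0.5, N_b))])   # background times

# Classic sWeights (variant B) with the true shapes and yields plugged in.
g = np.stack([sig_m.pdf(m), bkg_m.pdf(m)])
denom = np.array([N_s, N_b]) @ g
A = np.linalg.inv((g / denom) @ (g / denom).T)
w_s = (A[0] @ g) / denom

# Weighted ML estimate of the slope of f(t) = lam * exp(-lam * t):
# maximising sum_i w_i * ln f(t_i; lam) gives lam = sum(w) / sum(w * t).
lam_hat = w_s.sum() / (w_s @ t)
print(lam_hat)   # recovers a value near lam_true = 2.0
```

Note that the weights depend only on m, so the background decay times enter with weights that sum to approximately zero and drop out of the weighted likelihood equation on average.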
This study is repeated on ensembles containing 500 pseudo-experiments in order to ensure that the observed behaviour is not just a fluke of the specific dataset shown in this example. We further perform the same study on ensembles with smaller sample sizes and with different signal-to-background ratios. The results are shown in Figs. 4 and 5. We find that the four sWeights variants give very similar results, accurately reproduce the correct decay-time slope, and do so with a precision comparable to a two-dimensional fit.
We further see in Fig. 4 that the sum of weights (left two panels) for variant B exactly reproduces the fitted yield within numerical precision, as expected. Variant A also produces an unbiased estimate, but has a slightly larger spread (note the very small range on the y-axis). Variants Ci and Cii show a very small bias and tend to overestimate the yield by about 0.1 per mille. This is an artefact of computing the second derivative of the log-likelihood numerically and shows why we recommend variant B over C. When inspecting the variance properties using the sum of squared weights (right two panels of Fig. 4), we find that the sum of squared weights slightly overestimates the true variance of the yield, which is due to the aforementioned fact that sWeights are not independent of the sample.
The bias of the fitted slope parameters and their average variance estimates computed over the ensembles are shown for different scenarios in Fig. 5. We find that the variance estimates for the parameter λ obtained from the sWeights variants are comparable to that from a full two-dimensional fit when the sample size is large. For very small amounts of signal, from either a small overall sample size or a small signal-to-background ratio, we see slight biases and a smaller variance estimate when using the weights method, compared to the full two-dimensional fit. Inspection of the studentised residual distributions suggests a small level (∼ 10%) of under-coverage in these cases, which arises from the asymptotic assumptions made when computing the variance with the sandwich estimator. The importance of using the full sandwich estimator, which takes all sources of variance into account, is demonstrated by the points labelled ''Variant B No Correction'', which were computed with the first term of Eq. (53) only (which is sufficient for independent weights) and produce a variance estimate that is too small.

sWeights applied to a more complex example
In this section we show a more complex example with multiple factorising components within f(m, t), of which some may be signal and some may be backgrounds. Because each component factorises in m and t, we can still use the classic sWeights approach described in Section 2. In this example, we use the invariant mass of a reconstructed meson candidate as the discriminant variable once more, but now have six different components. Some even peak under or near the signal in a similar way to the signal, as shown in Fig. 6 (left). We label the components with a discrete integer, k ∈ [1, 6], where
• k = 1 represents the signal, which peaks in invariant mass, and is labelled ''Signal'',
• k = 2 and k = 3 represent backgrounds where one of the final state particles is mis-reconstructed, e.g. a neutral pion is reconstructed as a photon or a charged kaon is reconstructed as a charged pion. These are still peaking but are broader than the signal and have their peak position shifted with respect to the signal. They are labelled ''MisRec 1'' and ''MisRec 2'',
• k = 4 and k = 5 represent backgrounds which have been partially reconstructed, i.e. one final state particle has been missed (shifting the mass shape down) or one final state particle from another source is erroneously added (shifting the mass shape up). These are labelled ''PartRec 1'' and ''PartRec 2'',
• k = 6 represents the continuum background containing random combinations of particles not from the same decay. This is labelled ''Background''.
The control variables are two so-called Dalitz variables. We have assumed that the discriminant invariant mass variable is constructed from a three-body decay, in which case the Dalitz variables are the squared invariant masses of two of the two-body combinations of final-state particles. We generate a pseudo-experiment from the true underlying model, in which the Dalitz variables are uniformly distributed across the phase space for all components, apart from the signal, which has a resonance in one of the two-body invariant masses, and one of the backgrounds, which has a resonance in the other. These appear as horizontal and vertical bands in the Dalitz plot, shown in Fig. 6 (right).
The generated dataset is fitted in the invariant mass by maximising the extended unbinned likelihood with the shape parameters of the p.d.f.s, g_k(m), fixed to their true values, in order to obtain estimates for the event yields. The result of this fit is shown in Fig. 6 (left). We then use variant B to obtain the (in this case 6 × 6) Ŵ matrix and extract the corresponding weight functions, w_k(m). These weight functions are shown in Fig. 7 (left). As shown in Fig. 7 (right), each sum of weights, Σ_i w_k(m_i), projects out its own component and no others.
In contrast to the previous example, this more complex example exhibits rapidly oscillating weight functions. They oscillate much more quickly than the relevant component shapes themselves vary, because the weight is related to how the shapes overlap as well as to how they vary with mass. One can also see that the weight functions for competing (i.e. similar) shapes are anti-correlated, which is what we would expect, as their yields are anti-correlated. The sum of all component weights for any value of the discriminant variable is still unity.
The weighted Dalitz variables for each component are shown in Fig. 8. The weights project out the respective components, including the peaking ones. It is worth highlighting that the components with the smallest yields have the largest amplitudes of the weight function, as can be seen in Fig. 7 (left). This is a general feature of sWeights: a component with a small yield has larger uncertainties and correspondingly more oscillations in the weight function and a larger weight variance. This can lead to sizeable fluctuations when weights are visualised as in Fig. 8. When inspecting such plots, it is possible to mistake fluctuations for features in a distribution; for example, a band might seem to appear in a Dalitz distribution when in reality it is just large fluctuations around zero. Minimising the size of these fluctuations is prudent, as it is generally undesirable to have few events with large weights. Since the oscillations of the weight functions cannot be reduced (they are already minimal by construction), we recommend, for display purposes, increasing the bin size in plots like Fig. 8 for components that have a very small yield. Generally, we recommend proceeding with caution when trying to use sWeights for a component that is considerably smaller than the others.

COWs applied to an example with non-factorising background and efficiency
In the final example, we consider a case similar to the first example in Section 5.2, but which now contains a non-factorising background and a non-factorising efficiency. The signal still factorises and is normally (exponentially) distributed in m (t). The background is exponentially (normally) distributed in m (t), but with the shape parameters in t containing a dependence on m. The background p.d.f. contains a normalisation constant and parameters that encode the strength of the dependence of the shape in t on m. This emulates the more realistic use case in which the signal model is straightforward, but the background model is not, and where the efficiency loss is an additional complication. The nature of the true model used to generate ensembles of experiments is shown in Fig. 9, in which the non-factorising nature of the background is manifest: the exponential slope of the background in mass varies with decay time, and both the mean and the width of the normal distribution describing the background in decay time vary with mass. Projections of the integrated distributions, along with the projection of the efficiency model used, are also shown in Fig. 9.
To demonstrate the failure of classic sWeights in contrast to COWs in this case, we apply both methods to a single high-statistics pseudo-experiment containing 500 000 events. For classic sWeights, a fit to the invariant mass, shown in Fig. 10 (left), is used to estimate the signal and background p.d.f.s required to extract the sWeights. When constructing COWs for this case, no fit is performed, as described in Section 3.2. Therefore, only the signal function, the background functions (in this case polynomials up to order 4 are chosen), and the variance function, I(m) = 1, used to build the COWs are shown in Fig. 10 (right). The choice I(m) = 1 is not optimal, but it is a simple permitted choice.
The extracted weight functions for both methods are shown in Fig. 11. A comparison of the weighted events and true distributions in the control variable, t, is shown in Fig. 12. The sWeights method cannot handle the non-factorising nature of the background and the efficiency, and therefore the correct distributions are not recovered. The COWs method correctly reproduces the desired distribution and can be applied even without fitting the data.
We further perform an analysis on ensembles of simulated datasets using variant B of the sWeights procedure along with various implementations of the COW formalism presented in Section 3. The simulated sample size is 2 000, with equal amounts of signal and background. For the sWeights implementation, the signal, ĝ_s(m), and background, ĝ_b(m), distributions are estimated by fitting the simulated sample. To compute COWs, the same signal model is used, while the background expansion and the variance function I(m) are varied. We use these variants for the background:
• A single background component, the same as the fitted estimate of the background from the sWeights calculation, ĝ_b(m).
• Several background components, given by polynomial powers of m, up to 1st, 3rd and 5th order.
We do not expect COWs to perform well with a single background component, because the background is non-factorising, which requires several components. We use the following variance functions:
• Unity, I(m) = 1.
• The fitted density, I(m) = ĝ(m).
• A binned estimate Î(m) of the observed density in m, with the number of bins varied up to 50. This estimate approaches the optimal variance function as the binning becomes finer (general advice on an appropriate binning is given in Section 3.2).
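A binned estimate of the variance function can be sketched as a normalised histogram lookup. This is a minimal illustration with arbitrary toy shapes, not the sweights package API:

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy m sample: a narrow peak on a falling background (shapes arbitrary).
m = np.concatenate([rng.normal(5.28, 0.03, 5_000),
                    5.0 + rng.exponential(0.3, 20_000)])
m = m[(m >= 5.0) & (m <= 5.6)]

# Histogram-based variance function: a normalised step-function
# estimate of the observed density, evaluated per event by bin lookup.
counts, edges = np.histogram(m, bins=50, range=(5.0, 5.6))
density = counts / counts.sum() / np.diff(edges)

def I_hat(x):
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, len(counts) - 1)
    return density[idx]

# The step function integrates to one, as a density estimate should.
print((density * np.diff(edges)).sum())
```

Per-event evaluation of I_hat then enters the denominator of the COW construction; the number of bins trades statistical noise in the estimate against resolution of the density's variation.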
After the weights are constructed, we calculate an estimate of the slope parameter, λ, with an unbinned weighted maximum-likelihood fit to the signal-weighted decay-time distribution. For reference, we also include in the comparison a maximum-likelihood fit of the two-dimensional parametric model in which all parameters apart from the slope parameter are known, and a fit of the distributions in the discriminant variable in 25 slices of the control variable t. The slope parameter of the signal in the control variable is then estimated from the signal yield per slice.
Whether a method produces consistent results is indicated by the distribution of the so-called pull of the slope estimate computed over the ensemble. The pull is defined as the difference of the estimated and true values of λ divided by the estimated uncertainty of λ̂. For an unbiased estimate with the correct variance estimate, the pull distribution has a mean consistent with zero and a width consistent with one. As described in Section 4, the sandwich estimator is needed to obtain an asymptotically correct estimate of the slope uncertainty in the weighted unbinned maximum-likelihood fit. The standard HESSE routine in MINUIT [6] applied to a weighted likelihood computes the inverse matrix of second derivatives of the weighted likelihood function, but this does not give correct uncertainty estimates, since the weighted likelihood is not a true likelihood. We further assess the statistical power of all methods by computing the variance of the slope parameter relative to the ideal case in which each signal event can be correctly identified and the uncertainty is purely Poissonian.
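The difference between the naive inverse-Hessian uncertainty and a sandwich-type uncertainty can be sketched for the weighted exponential fit. This is a hedged illustration under simplifying assumptions: toy shapes, true yields plugged in, an untruncated exponential, and only the first sandwich term (weights treated as fixed; the extra terms for sample-dependent weights discussed in Section 4 are omitted):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Toy sample: mass m (discriminating) and decay time t (control).
N_s, N_b, lam_true = 5_000, 20_000, 2.0
sig_m = stats.norm(5.28, 0.03)
bkg_m = stats.truncexpon(b=2.0, loc=5.0, scale=0.3)
m = np.concatenate([sig_m.rvs(N_s, random_state=rng),
                    bkg_m.rvs(N_b, random_state=rng)])
t = np.concatenate([rng.exponential(1 / lam_true, N_s),
                    np.abs(rng.normal(1.5, 0.5, N_b))])

# Classic sWeights (variant B) with the true shapes and yields plugged in.
g = np.stack([sig_m.pdf(m), bkg_m.pdf(m)])
denom = np.array([N_s, N_b]) @ g
A = np.linalg.inv((g / denom) @ (g / denom).T)
w = (A[0] @ g) / denom

# Weighted MLE of the slope of f(t) = lam * exp(-lam * t).
lam = w.sum() / (w @ t)

# Per-event weighted score s_i = w_i * d(ln f)/d(lam) and Hessian H.
s = w * (1 / lam - t)
H = w.sum() / lam**2

err_naive = np.sqrt(1 / H)           # inverse weighted Hessian (too small)
err_sand = np.sqrt((s @ s) / H**2)   # first sandwich term: H^-1 (sum s_i^2) H^-1
print(err_naive, err_sand)
```

The naive inverse Hessian treats the weighted likelihood as a true likelihood and underestimates the uncertainty; the sandwich form accounts for the weight-inflated variance of the score.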
The results of this study are shown in Fig. 13. As expected, classic sWeights and COWs without sufficiently many background components yield a biased estimate in this factorisation-breaking example; this is indicated by a pull mean that deviates from zero. COWs with sufficiently many background components produce an unbiased result. Adding more background components than necessary reduces the statistical power of COWs. The optimal trade-off is case-specific, but can be found empirically using simulations or with goodness-of-fit tests on the data, as described in Section 3.2.
The binned variance function Î(m) performs better than I(m) = ĝ(m) or I(m) = 1; this is expected, as Î(m) approaches the optimal choice as the number of bins increases. The 2D fit has better statistical power than COWs, but it requires a parametric model of the background, which is often not known. Fitting in slices still does not produce an unbiased estimate of the slope, because the factorisation breaking is severe enough that the p.d.f.s do not factorise even within each slice.
We have further tested cases with different signal-to-background ratios and different sample sizes, and the conclusions are similar, although fewer orders of the background polynomial are required for the COWs to produce a minimal bias when the sample size is smaller: in small samples, the residual bias from using a finite number of orders is masked by the statistical variance. When using COWs there is thus, in general, a trade-off between systematic bias and statistical precision. It is also worth noting that for small samples (< 100 events) there are small biases due to the fact that the covariance computed with the sandwich estimator is only asymptotically valid.

Conclusions
This article provides a new perspective on the classic sWeights method by re-deriving the results in the context of orthonormal functions. This approach provides many new insights and allowed us to generalise the method to what we dub Custom Orthogonal Weight functions (COWs). COWs produce correct results in cases where sWeights fail, namely when the background does not factorise in the discriminant and control variable, or when the detection is affected by a finite efficiency that does not factorise. Neither of these ailments is uncommon in practice. We show how to obtain optimal (minimum-variance) COWs under these conditions. When only the shape of the signal distribution in the control variable is of interest, COWs can be constructed without performing a fit to the distribution in the discriminant variable. We further summarise the work presented elsewhere [4] on the correct application of the sandwich estimator to weighted maximum-likelihood fits that use sWeights.
For the specific toy examples that we studied in this paper, fits to sWeighted data had statistical precision comparable to a fully parametric two-dimensional maximum-likelihood fit of the distribution of the discriminant and control variables. This is remarkable, but we also note that weighted fits are in general less efficient than fully parametric maximum-likelihood fits [4]. In a toy example with non-factorising background and efficiency, the statistical power for COWs is lower than  what is obtained in a fully parametric fit, but better than using multiple fits in slices of the control variable.
A summary of our recommendations for the optimal use of sWeights and COWs is given in Fig. 14. When the signal, background, and efficiency all factorise in the discriminant and control variables, one should use classic sWeights computed with variant B, described in Section 2.2. This method minimises the variance of the weights and is the one previously described in the literature [1,2]. When only the signal factorises, one should use what we dub COWs lite, described in Section 3.2. In this case, an approximate signal p.d.f. is sufficient. The result is unbiased for any signal p.d.f., but the statistical power is maximised when the approximation is close to the true signal p.d.f.; it can, for example, be obtained as a histogram from simulation. The non-factorising background should be expanded into basis functions; we recommend Bernstein polynomials or basis splines. The optimal variance function, I(m), for this case can also be estimated from the data sample with a histogram, as described in Section 3.2. Finally, if the signal is also non-factorising, one has to use the full COWs approach, which is like COWs lite except that the signal is also expanded into basis functions, which must be linearly independent of the basis functions used in the expansion of the background.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability
Data will be made available on request.
According to the fundamental lemma of the calculus of variations, the equation is satisfied for any continuous variation only if the integrand inside the square brackets is zero, which yields the stated result.

Appendix B. Variance of a sum of weights
We compute the variance of a sum of independently and identically distributed weights, W = Σ_i w_i, where the sample size is a Poisson-distributed number. The latter changes the computation of the variance of W. sWeights and COWs are in general not independently distributed, so the simple formula derived here only applies in rare special cases (as pointed out in Appendix C).
We follow the derivation in Ref. [29]; the key insight is that the sampling of the number of events is independent of the sampling of the individual weights w_i. For a Poisson-distributed sample size with expectation μ, this gives Var(W) = μ E[w²]. An unbiased estimate of this variance is given by the sum of squared weights, Σ_i w_i².
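A quick Monte-Carlo cross-check of this statement can be written with an arbitrary fixed weight distribution (the specific choices below are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(5)

# Monte-Carlo check: for i.i.d. weights w_i and a Poisson sample size N,
# Var(sum_i w_i) = E[N] * E[w^2], which sum_i w_i^2 estimates without bias.
mu = 1000   # expected sample size

def one_toy():
    n = rng.poisson(mu)
    w = rng.normal(0.5, 1.0, n)   # an arbitrary fixed weight distribution
    return w.sum(), np.sum(w**2)

sums, sums_sq = np.array([one_toy() for _ in range(5_000)]).T
print(sums.var(), sums_sq.mean())   # the two numbers agree closely
```

Both quantities converge to μ E[w²] = 1000 · (0.5² + 1²) = 1250, confirming that the sum of squared weights is an unbiased variance estimate when the weights are independent of the sample.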

Appendix C. Covariance matrix of a weighted histogram
For sWeights and COWs, the sums of weights in bins of the control variable (a weighted histogram) are asymptotically unbiased estimates of the respective component p.d.f. The asymptotically correct covariance matrix of the bin contents is obtained with the sandwich estimator generally described in Section 4, which properly accounts for correlations between all estimates extracted from the same sample.
In special cases, the bins of the weighted histogram are uncorrelated and the variance of each bin is correctly estimated by the sum of the squares of the weights, which simplifies the analysis of the weighted histograms. This is the case if the efficiency is constant, the component p.d.f.s g_k(m) are fixed a priori, no component g_k(m) is misspecified, and one of the following conditions applies to the variance function I(m):
A. The variance function I(m) is fixed a priori.
B. The variance function is the estimated density, Î(m) = ĝ(m).
C. The variance function I(m) is a histogram of the observed distribution in m, a binned estimate of ĝ(m).
The weights obtained for these choices of I(m) are then optimal, or nearly optimal in the case of the histogram.
We will now prove these claims. Since the component p.d.f.s g_k(m) are complete and not misspecified, the true density is a linear combination of the components, g(m) = Σ_k z_k g_k(m). The efficiency is constant, which is equivalent to setting ε(m, t) = 1. For a variance function I(m), the weight function w_k(m) that projects out a single component is given by Eq. (32). We omit the component index in the following, w(m) := w_k(m). The content of the j-th bin of a weighted histogram in t for this component is given by the sum of w(m_i; θ) over the samples with t_j ≤ t_i < t_{j+1}, where w(m; θ) potentially depends on parameters θ that are estimated from the sample. The number of entries in bin j is a Poisson-distributed random variable with expectation value n_j. If E[w(m; θ)] equals the true fraction of the component, the bin content is an unbiased estimate of the expected bin content. We now compute the covariance matrix of the histogram. We introduce unspecified score functions whose roots determine all nuisance parameters θ in the calculation of the weights w(m; θ), which may include, for example, the component fractions and the shape parameters of the component p.d.f.s g_k(m; θ). The bin contents are defined by the roots of the score functions T_j, which subtract the sum of weights in bin j from the bin content. The combined estimate of all parameters is obtained from the roots of the joint score vector. Asymptotically, the covariance matrix of nuisance parameters and bin contents is given by the sandwich estimator, in which the partial derivatives represent Jacobian matrices of the score-function parts with respect to the nuisance parameters and the bin contents. We are only interested in the latter; the covariance matrix of the bin contents is the lower-right block of the matrix product. We compute the middle matrix of the sandwich under this condition, Eq. (C.5). Since the bins are non-overlapping, the sums over different bins are disjoint, and the expectation value for j ≠ j′ factorises.
The off-diagonal elements are zero. We calculate the diagonal elements by splitting the double sum into independent pairs and identical pairs, using E[n̂²] = n² + n for a Poisson-distributed variable. The variance of each bin is therefore estimated by the sum of weights squared, and different bins are uncorrelated. The same result can be derived from the fact that the weights in each bin are independently sampled; it also follows from this independence that different bins are uncorrelated and that the variance of the sum of weights per bin is given by Appendix B in this case. It remains to be shown that Eq. (C.2) holds and that the cross terms involving the nuisance-parameter scores have vanishing expectation values. Using that the A and W matrices are inverses of each other, one finds for each component that these expectation values are indeed zero, Eq. (C.10). Since the expectation value of the derivative of the bin scores with respect to the nuisance parameters is zero, the covariance matrix takes the same simple form as in the trivial case A.
We emphasise that this only holds for COWs and sWeights computed with variant A. If classic sWeights are computed with variant B, extra terms appear in the calculation of the covariance matrix, as discussed in Ref. [4].

Appendix D. Self-consistency of sWeights
Here, we prove Eq. (D.1). This proof does not make use of Eq. (D.2) and therefore holds more generally.

Appendix E. Sum of component weights obtained from COWs
When the variance function I(m) is a linear combination of the component p.d.f.s, the normalisation of the g_k(m) implies the following relation.

(E.2)
Inverting this matrix equation, it follows that the component fractions can be written as linear combinations of the expectation values of the weights. If the weight w(m) is to be such that the variance of the estimated fraction is minimal, it follows that the expectation value E[w²(m)/ε(m, t)] must be minimal, since the fraction estimate does not depend on the choice of weight function. The minimisation has to incorporate the constraints that the integrals of w(m) g_k(m) are either zero or one, which is done with Lagrange multipliers 2λ_k. The extremum condition then determines the optimal variance function.