Custom Orthogonal Weight functions (COWs) for Event Classification

A common problem in data analysis is the separation of signal and background. We revisit and generalise the so-called $sWeights$ method, which allows one to calculate an empirical estimate of the signal density of a control variable using a fit of a mixed signal and background model to a discriminating variable. We show that $sWeights$ are a special case of a larger class of Custom Orthogonal Weight functions (COWs), which can be applied to a more general class of problems in which the discriminating and control variables are not necessarily independent and still achieve close to optimal performance. We also investigate the properties of parameters estimated from fits of statistical models to $sWeights$ and provide closed formulas for the asymptotic covariance matrix of the fitted parameters. To illustrate our findings, we discuss several practical applications of these techniques.


I. INTRODUCTION
This article takes a fresh look at the sWeights (or sPlot) formalism discussed by Barlow [1] and popularised more recently by Pivk and Le Diberder [2]. The sWeights method is used to infer properties of a signal distribution in a mixed data set containing signal and background events. The signal distribution is extracted nonparametrically by applying weights to individual events. Inference is then done on the weighted data set. The method is applicable when individual points from the data distribution consist of one or more discriminating variables, here called m, and one or more statistically independent control variables, here called t, where m and t can both be vectors of different dimensions. By fitting parametric models to the signal and background in the discriminating variable(s) m, one can calculate weights whose distribution represents the signal density in the control variable(s) t. The advantage of this method, compared to a fully parametric fit to the (m, t) distribution, is that one avoids the need to parameterise the background density in the control variable(s) t, which is often challenging.
In Sec. II we re-derive the sWeights method from the starting point of orthonormal functions. We show several ways of calculating the weights, compare their trade-offs, and emphasise that sWeights can easily be computed without some of the restrictions seen previously. In Sec. III we then discuss a generalisation of the sWeights method called "Custom Orthogonal Weight functions" (COWs). COWs relax some of the requirements of the sWeights formalism and can be applied to a larger class of problems than traditional sWeights, at a small loss in precision.

* hans.dembinski@tu-dortmund.de
† matthew.kenzie@cern.ch
‡ christoph.langenbruch@cern.ch
§ michael.schmelling@mpi-hd.mpg.de
In Sec. IV we then discuss the properties of estimates obtained when fitting models to weighted data. We give an asymptotically correct formula for the covariance matrix of the parameters obtained from such a fit.
Finally, in Sec. V we perform a variety of studies on simulated (Monte Carlo) data sets which deploy sWeights and COWs in various applications and show comparisons of their performance.

II. SWEIGHTS AS ORTHONORMAL FUNCTIONS
To compute the weights for the signal distribution in the control variable t, we use a discriminant variable m (often the invariant mass of some particle's decay products). The signal and background density only need to be parameterised in the discriminant variable m. The variables m and t must be statistically independent in the classic sWeights formalism, so that the respective p.d.f.s of the variables factorise. In other words, we assume that the total p.d.f. has the following form

f(m, t) = z g_s(m) h_s(t) + (1 − z) g_b(m) h_b(t),   (1)

where z is the signal fraction, g_s(m) and h_s(t) are the signal p.d.f.s in the discriminating and control variables, respectively, and g_b(m) and h_b(t) the corresponding background p.d.f.s. The sWeights method allows one to obtain an asymptotically efficient non-parametric estimate of z h_s(t) while only requiring parametric models for g_s(m) and g_b(m).

arXiv:2112.04574v2 [stat.ME] 26 Jan 2022
We stress that the sWeights method is only applicable when the p.d.f.s in m and t factorise for both the signal and the background, which is the case exactly when m and t are independent within each component. Independence is a stronger condition than lack of correlation. Therefore, tests which demonstrate a lack of correlation between m and t provide necessary, but not sufficient, evidence for the applicability of the sWeights method. We come back to proper tests of independence in Sec. V.

A. Construction of an optimal weight function
We postulate that a weight function, w_s(m), exists which extracts the signal component, z h_s(t), when f(m, t) is multiplied by it and integrated over m:

∫ dm w_s(m) f(m, t) = z h_s(t).   (2)

The left- and right-hand sides of Eq. 2 are equal in general only if the following conditions hold:

∫ dm w_s(m) g_s(m) = 1,   (3)
∫ dm w_s(m) g_b(m) = 0.   (4)
If we regard ∫ dm φ(m) ψ(m) as the inner product of a vector space over functions, then these conditions define w_s(m) as the vector orthogonal to g_b(m) and normal to g_s(m). In other words, w_s(m) is an orthonormal function in this space. Since the vector space over m is infinite-dimensional, there are infinitely many orthonormal functions w_s(m) that satisfy these conditions. For example, the classic sideband subtraction method can be regarded as a special case where w_s(m) is a piece-wise constant function which is positive in the signal region and negative in the background region.
In order to obtain a unique solution for w_s(m) we can choose to minimise its variance. Since f(m, t) factorises and w_s(m) is only a function of m, we can obtain all information about w_s from the density g(m), computed by integrating Eq. 1 over t,

g(m) = z g_s(m) + (1 − z) g_b(m).   (5)

The expectation of w_s over g(m) is

E[w_s] = ∫ dm w_s(m) g(m) = z,   (6)

which follows from the conditions in Eqs. 3 and 4, and the variance of w_s over g(m) is given by

Var(w_s) = ∫ dm w_s²(m) g(m) − E[w_s]².   (7)

Minimising the variance Var(w_s) guarantees that the sample estimate ẑ = (1/N) Σ_i ŵ_s(m_i) asymptotically has minimum variance. As a byproduct, this choice also produces minimum variance for the estimated background fraction (1 − ẑ), and generally smooth functions, w_s(m), since oscillating solutions have larger variance.
To find the function w_s(m) which minimises Var(w_s), we have to solve a constrained minimisation problem. The solution, computed in Appendix A, is

w_s(m) = (α_s g_s(m) + α_b g_b(m)) / g(m).   (8)

The constants α_s,b are obtained by inserting Eq. 8 into Eq. 3 and Eq. 4 and solving the resulting system of linear equations. Before we continue with that, we note that the signal component plays no special role in the derivation so far. We could have equally postulated a weight function w_b(m) to extract the background, which leads to the conditions

∫ dm w_b(m) g_s(m) = 0,   (9)
∫ dm w_b(m) g_b(m) = 1,   (10)

and

w_b(m) = (β_s g_s(m) + β_b g_b(m)) / g(m).   (11)

The coefficients α_x and β_x with x ∈ {s, b} can be computed by solving

A = ( α_s  α_b ; β_s  β_b ) = W⁻¹,   (12)

with W = ( W_ss  W_sb ; W_sb  W_bb ) and

W_xy = ∫ dm g_x(m) g_y(m) / g(m).   (13)

In other words, the matrix A, formed by the coefficients to compute w_s(m) and w_b(m), is the inverse of the symmetric positive-definite matrix W.
With Cramer's rule, we get

α_s = W_bb / (W_ss W_bb − W_sb²),   α_b = −W_sb / (W_ss W_bb − W_sb²),   (14)
β_s = −W_sb / (W_ss W_bb − W_sb²),   β_b = W_ss / (W_ss W_bb − W_sb²).   (15)

One can further replace g(m) in the denominator of Eq. 8 (or Eq. 11) by inserting Eq. 8 into Eq. 6 to find that z = α_s + α_b, and similarly one finds 1 − z = β_s + β_b. With these ingredients, we obtain the final equations

w_s(m) = (α_s g_s(m) + α_b g_b(m)) / ((α_s + α_b) g_s(m) + (β_s + β_b) g_b(m)),   (16)
w_b(m) = (β_s g_s(m) + β_b g_b(m)) / ((α_s + α_b) g_s(m) + (β_s + β_b) g_b(m)).   (17)

In summary, to obtain w_s(m) or w_b(m) one has to compute the matrix elements W_ss, W_sb, W_bb, which depend only on g_s,b(m) and (implicitly) z.
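The construction above can be sketched numerically. The following is a minimal illustration, not the paper's implementation; the truncated-Gaussian signal, truncated-exponential background and the signal fraction z are all assumed toy choices:

```python
from scipy import integrate, stats

# Assumed toy shapes on m in [0, 1]
gs = stats.truncnorm(-2.5, 2.5, loc=0.5, scale=0.2).pdf  # signal p.d.f.
gb = stats.truncexpon(b=2.0, scale=0.5).pdf              # background p.d.f., support [0, 1]
z = 0.3                                                  # assumed signal fraction

def g(m):  # mixture density, Eq. 5
    return z * gs(m) + (1 - z) * gb(m)

def W(gx, gy):  # matrix elements, Eq. 13
    return integrate.quad(lambda m: gx(m) * gy(m) / g(m), 0, 1)[0]

Wss, Wsb, Wbb = W(gs, gs), W(gs, gb), W(gb, gb)
det = Wss * Wbb - Wsb ** 2

def ws(m):  # signal weight function, Eq. 16 with the Cramer's-rule coefficients
    return (Wbb * gs(m) - Wsb * gb(m)) / det / g(m)

# Orthonormality conditions, Eqs. 3 and 4:
print(integrate.quad(lambda m: ws(m) * gs(m), 0, 1)[0])  # ≈ 1
print(integrate.quad(lambda m: ws(m) * gb(m), 0, 1)[0])  # ≈ 0
```

The two integrals at the end verify Eqs. 3 and 4 directly for this choice of shapes.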

B. Application to finite samples
The calculations so far were carried out for the true p.d.f.s, g_s,b(m), and the true signal fraction, z, on which the matrix elements W_xy depend. In practice, these need to be replaced by sample estimates ĝ_s,b(m) and ẑ, typically obtained from a maximum-likelihood fit, although any kind of estimation can be used. The plug-in estimate [3] of Eq. 16 is

ŵ_s(m) = (α̂_s ĝ_s(m) + α̂_b ĝ_b(m)) / ((α̂_s + α̂_b) ĝ_s(m) + (β̂_s + β̂_b) ĝ_b(m)).   (18)

For the computation of the estimates Ŵ_xy with x, y ∈ {s, b} we face a choice between two possibilities.
• Variant A: We replace the true quantities in Eq. 13 with their plug-in estimates and compute the integral analytically or numerically,

Ŵ^A_xy = ∫ dm ĝ_x(m) ĝ_y(m) / ĝ(m).   (19)

• Variant B: We additionally replace the integral with a sum over the observations in the data sample. We note that an integral over a function φ(m) can be written as an expectation value over the p.d.f. g(m) (assuming that the expectation exists),

∫ dm φ(m) = ∫ dm (φ(m)/g(m)) g(m) = E[φ/g].   (20)

In a finite sample, the arithmetic mean is a consistent estimate of this expectation by the law of large numbers,

E[φ/g] ≈ (1/N) Σ_{i=1}^{N} φ(m_i)/g(m_i),   (21)

where m_i is the i-th observed value of m and N is the sample size. Applying this replacement to Eq. 13 yields

Ŵ^B_xy = (1/N) Σ_{i=1}^{N} ĝ_x(m_i) ĝ_y(m_i) / ĝ²(m_i).   (22)

Variant B has several attractive properties which make it the recommended method. The computation is straightforward from the fitted estimates ẑ and ĝ_s,b(m) and the data sample. The additional complexity of computing an integral (possibly numerically) is avoided. Furthermore, this choice is guaranteed to exactly reproduce the previously fitted signal yield N̂_s = N ẑ when the sWeights are summed:

Σ_{i=1}^{N} ŵ_s(m_i) = N̂_s,   (23)

where ŵ_s(m) is the estimate of w_s(m) computed from Ŵ^B_xy. The proof for this is provided in Appendix B.
In other words, Variant B produces self-consistent estimates ŵ_s for the sample at hand. This is not exactly true in general for Variant A. The matrix elements Ŵ^A_xy are numerically close to the elements Ŵ^B_xy but differ. We consider the self-consistency of Eq. 23 important: sWeights are computed from a fitted estimate ẑ, and so they should reproduce that estimate exactly.
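Variant B and the self-consistency of Eq. 23 can be sketched numerically. The shapes and sample below are assumed toys; the fit is reduced to solving the score of the signal fraction with the shapes held fixed:

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(2)

# Assumed toy sample in m ∈ [0, 1]: truncated-Gaussian signal,
# truncated-exponential background
sig = stats.truncnorm(-2.5, 2.5, loc=0.5, scale=0.2)
bkg = stats.truncexpon(b=2.0, scale=0.5)  # support is [0, 1]
m = np.concatenate([sig.rvs(300, random_state=rng), bkg.rvs(700, random_state=rng)])
N = len(m)
gs, gb = sig.pdf(m), bkg.pdf(m)

# Maximum-likelihood estimate of the signal fraction with fixed shapes:
# set the score of the likelihood in z to zero
score = lambda z: np.sum((gs - gb) / (z * gs + (1 - z) * gb))
z_hat = optimize.brentq(score, 1e-6, 1 - 1e-6)

g = z_hat * gs + (1 - z_hat) * gb
# Variant B, Eq. 22: matrix elements as sample averages
Wss = np.mean(gs * gs / g**2)
Wsb = np.mean(gs * gb / g**2)
Wbb = np.mean(gb * gb / g**2)
det = Wss * Wbb - Wsb**2

ws = (Wbb * gs - Wsb * gb) / det / g  # per-event sWeights

# Self-consistency, Eq. 23: the summed weights reproduce the fitted yield
print(ws.sum(), N * z_hat)  # the two numbers agree
```

Because the same ẑ and sample averages enter both the fit and the weights, the agreement in the last line is exact up to floating-point precision, as proven in Appendix B.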

C. Connection to extended maximum-likelihood fit
There is a curious connection between Eq. 22 and the results of an extended maximum-likelihood fit in which ĝ_s and ĝ_b are fixed to their maximum-likelihood estimates and the respective signal and background yields, N_s and N_b, are regarded as independent variables. In such a fit, one maximises the extended log-likelihood function [4], which is, without constant terms,

ln L = Σ_{i=1}^{N} ln( N_s ĝ_s(m_i) + N_b ĝ_b(m_i) ) − (N_s + N_b).   (24)

The extremum is determined by solving the score functions

∂ ln L / ∂N_x = Σ_{i=1}^{N} ĝ_x(m_i) / ( N_s ĝ_s(m_i) + N_b ĝ_b(m_i) ) − 1 = 0,   (25)

with x ∈ {s, b}. The maximum-likelihood estimates obtained from these score functions are N̂_s = N ẑ and N̂_b = N (1 − ẑ), where ẑ is the estimated signal fraction as before. The elements of the Hessian matrix of second derivatives of the log-likelihood function are given by

∂² ln L / ∂N_x ∂N_y = − Σ_{i=1}^{N} ĝ_x(m_i) ĝ_y(m_i) / ( N_s ĝ_s(m_i) + N_b ĝ_b(m_i) )².   (26)

We note the similarity between Eq. 26 and Eq. 22 and evaluate the second derivative at the maximum of ln L to find

− ∂² ln L / ∂N_x ∂N_y = Ŵ^B_xy / N.   (27)

This shows another opportunity to compute estimates of W_xy, since the second derivatives of the log-likelihood are routinely computed (for example, by the program MINUIT) as part of the fit for N̂_s, N̂_b, and the shape parameters θ_s,b of ĝ_s,b(m; θ_s,b), and are therefore readily available. The covariance matrix C returned by such a fitting program is the negative inverse of the full Hessian of the fit,

C = −H⁻¹,   (28)

where the parts of H involving θ_s,b correspond to derivatives that contain one or two shape parameters. Thus one can use Variant C to compute the elements of Ŵ^C_xy, which consists of the following steps:

• Invert the covariance matrix C of the fit of the yields N_s,b and shape parameters θ_s,b.
• Isolate the 2 × 2 sub-matrix of the Hessian which contains the derivatives with respect to the yields N s,b .
• Use Eq. 27 on these matrix elements to obtain Ŵ^C_xy.

It would be incorrect to switch steps 1 and 2, i.e. to isolate the 2 × 2 sub-matrix of C that contains the yields and invert it, because this does not restore the derivatives.
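The warning about the order of operations is pure linear algebra and is easy to demonstrate; the 3 × 3 matrix below is an arbitrary assumed stand-in for the covariance of a fit with two yields and one shape parameter:

```python
import numpy as np

# Toy "covariance" of (N_s, N_b, theta); the off-diagonal entries in the last
# row/column encode a correlation between yields and the shape parameter.
C = np.array([[ 4.0, -1.0, 0.5],
              [-1.0,  3.0, 0.3],
              [ 0.5,  0.3, 1.0]])

H = np.linalg.inv(C)          # step 1: invert the full covariance matrix
yield_block = H[:2, :2]       # step 2: isolate the 2x2 yield sub-matrix

wrong = np.linalg.inv(C[:2, :2])  # switched order: invert the isolated sub-matrix

print(np.allclose(yield_block, wrong))  # False: the two results differ
```

The two results coincide only when the yields are uncorrelated with the shape parameters, which is generally not the case.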
A close alternative is to do a second fit which leaves only the yields free while keeping the shape parameters fixed. In this case, the covariance matrix computed by MINUIT can be scaled to yield an estimate of the coefficient matrix from Eq. 12,

Â_xy = C_xy / N.   (29)

If the Hessian matrix was actually calculated with Eq. 26, Variants B and C would give identical results. In practice, however, the second derivatives in Eq. 28 are usually computed only approximately by numerical differentiation of Eq. 24, whose accuracy is several orders of magnitude worse than the machine precision. This means that Variant C produces a less accurate estimate than Variant B and that Eq. 23 only holds approximately for Variant C. In conclusion, Variant B is recommended over Variant C, since the computation is inexpensive and the result more accurate.

III. CUSTOM ORTHOGONAL WEIGHT FUNCTIONS
The discussion so far has focused on the restricted case where the p.d.f. is a mixture of two components that each factorise in the discriminant and control variables. We now generalise to an arbitrary number of factorising components, and also allow for a non-factorising function of frequency weights, ε(m, t), which in practical applications is often identified with an efficiency function. The total p.d.f. for the observed data then becomes

ρ(m, t) = ε(m, t) f(m, t) / D.   (30)

The normalisation term,

D = ∫ dm dt ε(m, t) f(m, t),   (31)

ensures that the observed density, ρ(m, t), is properly normalised. The true density of interest is

f(m, t) = Σ_k z_k g_k(m) h_k(t).   (32)

The Kolmogorov-Arnold representation theorem [5, 6] ensures that a finite sum of terms on the right-hand side can represent any two-dimensional function f(m, t). For practical applications it is beneficial if the expansion requires only a few terms, which can be achieved with g_k(m) and h_k(t) suitably chosen for the specific case. For a given expansion we will assume that the first s terms pertain to the signal density while the others describe the background.
If there are multiple terms, in either the signal or background part, that do not all share identical g_k(m) or h_k(t) components, then the respective p.d.f.s are non-factorising.
Generalising the insights obtained when identifying the sWeights as orthogonal functions (see Sec. II A), it is easy to show that any single function h_k(t) in f(m, t) can be isolated by a weight function

w_k(m) = (1/I(m)) Σ_l A_kl g_l(m),  with  A = W⁻¹  and  W_kl = ∫ dm g_k(m) g_l(m) / I(m).   (33)

Here I(m) is an arbitrary function (which we hereafter refer to as the "variance function") that is only required to be non-zero in the considered range of m, and A_kl is akin to the α, β matrix of Eq. 12. It follows that

∫ dm w_k(m) g_l(m) = δ_kl.   (34)

When considering a non-uniform efficiency, ε(m, t) ≠ 1, the appropriate weight to apply to the data, in order to extract the density h_k(t), is w_k(m)/ε(m, t). For a particular bin in the control variable, Δt, the expectation value of this weight is

E_k(Δt) = ∫ dm ∫_Δt dt ρ(m, t) w_k(m)/ε(m, t) = (z_k/D) ∫_Δt dt h_k(t),   (35)

i.e. an unbiased estimate for the integral of the efficiency-corrected density, h_k(t), over the bin Δt. This holds for any choice of I(m). The weights that project out the entire signal or background component are given by

w_s(m) = Σ_{k ≤ s} w_k(m)  and  w_b(m) = Σ_{k > s} w_k(m).   (36)

Integrating Eq. 35 over all t one sees that every expectation value, E_k, is proportional to z_k. Therefore, an estimate ẑ_k can be obtained from the corresponding sample average of w_k(m)/ε(m, t), with 1/D estimated by the sample average of 1/ε(m, t). Special properties hold when I(m) is a linear combination of the basis functions, g_k(m). As proven in Appendix C, for arbitrary constants a_k (where k ∈ [0, n]), the choice I(m) = Σ_k a_k g_k(m) leads to

Σ_k w_k(m) = 1,   (37)

i.e. every event contributes with a total weight of unity to the possible states k. One corollary of this result is that for every measured m_i, the COWs, w_k(m_i), sum to unity when one of the g_k(m) is constant. Another consequence is that with an increasing number of terms the sum Σ_k w_k(m) will converge towards unity for any function I(m), since a linear combination of sufficiently many basis functions g_k(m) always allows for a good approximation of I(m).
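The construction of Eq. 33 and the orthonormality of Eq. 34 can be sketched numerically. The basis below (one assumed peaking term and two polynomial terms) and the default I(m) = 1 are illustrative choices only:

```python
import numpy as np
from scipy import integrate, stats

# Assumed basis on m in [0, 1]
g = [
    stats.truncnorm(-2.5, 2.5, loc=0.5, scale=0.2).pdf,  # g_0: peaking "signal"
    lambda m: 1.0,                                       # g_1: flat background
    lambda m: 2.0 * m,                                   # g_2: linear background
]
I = lambda m: 1.0  # simple default variance function

n = len(g)
# W_kl = ∫ dm g_k(m) g_l(m) / I(m), then A = W^{-1} (Eq. 33)
W = np.array([[integrate.quad(lambda m: g[k](m) * g[l](m) / I(m), 0, 1)[0]
               for l in range(n)] for k in range(n)])
A = np.linalg.inv(W)

def w(k, m):  # COW for component k, Eq. 33
    return sum(A[k, l] * g[l](m) for l in range(n)) / I(m)

# Orthonormality, Eq. 34: ∫ dm w_k(m) g_l(m) = δ_kl
for k in range(n):
    for l in range(n):
        val = integrate.quad(lambda m: w(k, m) * g[l](m), 0, 1)[0]
        assert abs(val - (k == l)) < 1e-6

# Since I(m) = 1 = g_1(m) lies in the span of the basis, the COWs sum to unity:
assert abs(sum(w(k, 0.37) for k in range(n)) - 1.0) < 1e-6
```

The final assertion illustrates the corollary stated above: with a constant basis function and I(m) = 1, the weights sum to one at every m.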
It remains to select the variance function I(m). While I(m) = 1 may be a reasonable default, it certainly will not be optimal. Here we consider two options to choose a better function I(m), such that

1. the variances of the ẑ_k are minimal,
2. the ẑ_k are the maximum-likelihood estimates.
As shown in Appendix D, requirement (1) leads to

I⁽¹⁾(m) = q(m),  with  q(m) ∝ ∫ dt ρ(m, t) / ε²(m, t).   (38)

Numerically, q(m) can be obtained from a histogram of the 1/ε²(m, t) weighted m-distribution or a suitable parameterisation thereof. For the construction of the COWs the exact form of I(m) is uncritical; therefore a histogram approximation will usually be good enough.
The extreme case of a single-bin histogram is equivalent to I(m) = 1. Asymptotically, a sufficiently fine-binned histogram will be arbitrarily close to the ideal q(m). Appendix E shows that the alternative requirement (2) leads to

I⁽²⁾(m) = Σ_k ẑ_k g_k(m),   (39)

where the ẑ_k are estimates for the true fractions z_k obtained from a 1/ε(m, t) weighted unbinned maximum-likelihood fit. For ε(m, t) = 1 this is the sWeights solution. Numerically, the ẑ_k can be determined iteratively, starting with e.g. ẑ_k = 1/n and updating the values using sample averages of w_k(m)/ε(m, t), based on the resulting weight functions, w_k(m). Since any initial choice for I(m) yields unbiased estimates, ẑ_k, the iteration converges quickly. Numerical studies indicate that n steps, where n is the number of coefficients, are usually sufficient. In the case of non-uniform efficiencies, ε(m, t), the weight function, I(m), from the maximum-likelihood criterion is different from q(m). The fact that q(m) was derived from the requirement of minimum variance illustrates the known result that weighted maximum-likelihood estimates are in general not efficient. It is interesting to compare the two options discussed for I(m) in the case of uniform efficiency weights, ε(m, t) = 1. In this case the weight functions are

I⁽¹⁾(m) = Σ_k z_k g_k(m)  and  I⁽²⁾(m) = Σ_k ẑ_k g_k(m),   (40)

i.e. the sWeights solution, I⁽²⁾(m), is the maximum-likelihood estimate of the theoretically optimal weight function, I⁽¹⁾(m). Asymptotically, I⁽¹⁾(m) and I⁽²⁾(m) are the same. For a non-uniform efficiency function this will not generally be true. One also finds that the COWs, w_k(m), determined from Eq. 33 with I(m) = I⁽²⁾(m) satisfy the consistency condition (1/N) Σ_{i=1}^{N} w_k(m_i) = ẑ_k found for sWeights when using the respective sample averages for W_kl.
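The iterative determination of the ẑ_k can be sketched for ε(m, t) = 1; the two-component toy sample and its generated fractions (0.4 signal, 0.6 background) are assumed for illustration:

```python
import numpy as np
from scipy import integrate, stats

rng = np.random.default_rng(5)

# Assumed toy data on m ∈ [0, 1] with ε(m, t) = 1:
# truncated-Gaussian "signal" plus flat "background", fractions 0.4 / 0.6
sig = stats.truncnorm(-2.5, 2.5, loc=0.5, scale=0.2)
m = np.concatenate([sig.rvs(400, random_state=rng), rng.uniform(0.0, 1.0, 600)])
g = [sig.pdf, lambda x: np.ones_like(x)]
n = len(g)

z = np.full(n, 1.0 / n)  # start from equal fractions
for _ in range(5):
    I = lambda x: z[0] * g[0](x) + z[1] * g[1](x)   # I(m) = sum_k z_k g_k(m)
    W = np.array([[integrate.quad(lambda x: g[k](x) * g[l](x) / I(x), 0, 1)[0]
                   for l in range(n)] for k in range(n)])
    A = np.linalg.inv(W)
    wk = A @ np.vstack([gk(m) for gk in g]) / I(m)  # COWs on the data, Eq. 33
    z = wk.mean(axis=1)                             # update with sample averages

print(z)  # close to the generated fractions (0.4, 0.6)
```

Because I(m) lies in the span of the basis at every step, the updated fractions sum to one by construction (cf. the result above on linear combinations of the g_k).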

A. COWs in the Wild
The previous section covers the general framework regarding COWs. It shows how one can extract a true density, h_k(t), in the control variable, t, from efficiency-distorted data by using only p.d.f.s, g_k(m), in the discriminant variable, m. In the discussion above, these densities, g_k(m), are defined at the truth level. However, for practical applications these are usually unknown, and additional considerations come into play.
If the efficiency function is not sufficiently well known, it may be preferable to first separate signal and background and handle the efficiency corrections in a later step of the analysis. This case is covered in the COWs framework by simply setting ε(m, t) = 1. However, one has to keep in mind that even when the true signal density factorises in m and t, the efficiency function in general will not, and thus sufficiently many terms in the signal part of the data model are required to account for factorisation-breaking effects. Furthermore, once the signal density, h_s(t), has been determined, the efficiency correction must be done with the signal efficiency projected into just the control variable, ε̄(t). This can be obtained by averaging ε(m, t) over m using

ε̄(t) = ∫ dm ε(m, t) f_s(m, t) / ∫ dm f_s(m, t).   (41)

Here f_s(m, t) denotes the signal part of the true p.d.f. If the efficiency function factorises in m and t, ε(m, t) = η(m) ε_t(t), the m-averaged efficiency can be expressed as

ε̄(t) = ε_t(t) [ ∫ dm g_s(m)/η(m) ]⁻¹,   (42)

where g_s(m) is the observed signal p.d.f. in m. Normally one will get ε̄(t) from a Monte Carlo simulation of the signal. One can then also directly apply weights w_k(m)/ε̄(t) when filling the respective t-histogram. It should be noted that using w_k(m)/ε(m, t) as an event-by-event weight instead would be manifestly wrong, since the m-dependence in the efficiency factor destroys the orthogonality relations for the COWs, and the signal estimate in t becomes polluted by background.

Another use case is a signal component that can be assumed to factorise in m and t on top of a background that may be non-factorising. In the above formalism the signal p.d.f. is then g_0(m) h_0(t), and if one is only interested in projecting out the p.d.f., h_0(t), of the signal component, there is additional freedom in the construction of the respective COWs.
As shown in Appendix F, in this case not only can arbitrary non-zero variance functions I(m) be used, the assumed signal density can also be chosen freely, as long as it is not a linear combination of the background p.d.f.s g_k(m) (where k = 1, . . . , n). This may at first glance seem surprising, but it just reflects the fact that, in order to remove the background in the control variable, t, knowledge of the signal shape in the discriminant variable, m, is of secondary importance. However, a good description of the background under the signal is crucial.
To construct a signal-only COW, w(m), according to Eq. 33 one requires an input model for the signal density, p(m), a set of background p.d.f.s g_k(m) (where k = 1, . . . , n) and a variance function, I(m). Here, p(m) and the g_k(m) must be normalised, but the normalisation of I(m) can be arbitrary. The requirement that w(m) cancels all background contributions is

∫ dm w(m) g_k(m) = 0  for  k = 1, . . . , n.   (43)

An additional requirement is needed to fix the normalisation of w(m), which is conveniently chosen as

∫ dm w(m) p(m) = 1.   (44)

For p(m) = g_0(m) this is the same condition as before. For the case ε(m, t) = 1 one can show that p(m) = g_0(m) and p(m) = q(m) asymptotically give estimates for h_0(t) that have exactly the same statistical accuracy in terms of the number of equivalent events N_eq = (Σ w_i)² / Σ w_i². This suggests one should use

p(m) = q(m).   (45)

This offers an intriguing possibility to extract and estimate the signal p.d.f. in the control variable, h_0(t), from a set of N measurements {m_i, t_i}, i = 1, . . . , N. All one needs is a model for the background in m, and estimates, e.g. histograms, of p(m) and q(m). Formally the background can always be expanded into a complete set of functions, e.g. polynomials. With the conventions adopted above, a factorising background model on the interval m ∈ [0, 1] would be

f_b(m, t) = h(t) Σ_{k=1}^{∞} a_k k m^{k−1};   (46)

a non-factorising model would be obtained by

f_b(m, t) = Σ_{k=1}^{∞} z_k k m^{k−1} h_k(t).   (47)

For practical applications the above sums have to be truncated. If one imposes factorisation of the background, then estimates â_k need to be determined from the data in order to specify the background p.d.f. If one allows for factorisation breaking, then all one needs are the individual p.d.f.s g_k(m) = k m^{k−1}, p(m) in place of the actual signal component g_0(m), and I(m) = q(m) to determine w(m) = w_0(m) according to Eq. 33. The event-by-event weights w(m)/ε(m, t) for a histogram in t then produce an asymptotically efficient and unbiased estimate of the signal p.d.f. h_0(t).
At finite statistics the use of estimates from the data for p(m) and q(m) will give rise to a bias of order 1/N, which is negligible compared to the statistical uncertainties. A formal proof of this is still pending. However, any biases will be small, since using a priori fixed functions for p(m) and q(m) provides an unbiased estimate of h_0(t), although with less than optimal statistical precision. Systematic uncertainties related to the choice of the background model can be probed by adding terms and checking the stability of the result.
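The signal-only construction can be sketched directly. The polynomial background basis follows the conventions above; the deliberately imperfect signal model p(m) (wider than a typical peak) and the default I(m) = 1 are assumed choices:

```python
import numpy as np
from scipy import integrate, stats

# Assumed ingredients on m ∈ [0, 1]
p = stats.truncnorm(-5/3, 5/3, loc=0.5, scale=0.3).pdf   # imperfect signal model
gbkg = [lambda m, k=k: k * m ** (k - 1) for k in (1, 2, 3)]  # g_k(m) = k m^(k-1)
I = lambda m: 1.0                                        # simple variance function

basis = [p] + gbkg
n = len(basis)
W = np.array([[integrate.quad(lambda m: basis[i](m) * basis[j](m) / I(m), 0, 1)[0]
               for j in range(n)] for i in range(n)])
A = np.linalg.inv(W)

def w(m):  # signal-only COW: first row of A, as in Eq. 33
    return sum(A[0, l] * basis[l](m) for l in range(n)) / I(m)

# Eq. 44: normalisation against the assumed signal model
assert abs(integrate.quad(lambda m: w(m) * p(m), 0, 1)[0] - 1) < 1e-6
# Eq. 43: the weight cancels every background term
for gk in gbkg:
    assert abs(integrate.quad(lambda m: w(m) * gk(m), 0, 1)[0]) < 1e-6
```

The assertions verify that w(m) is orthogonal to the whole truncated background expansion even though p(m) is not the true signal shape, which is the freedom proven in Appendix F.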

IV. VARIANCE OF ESTIMATES FROM WEIGHTED DATA
Parameter estimation using weighted unbinned data sets can be performed by maximising the weighted likelihood [7], which is equivalent to solving the weighted score functions

Σ_{i=1}^{N} w_i ∂ ln h_s(t_i; θ)/∂θ = 0,   (48)

with sWeights w_i = w_s(m_i) or w_s(m_i, t_i) and shape parameters θ of the signal p.d.f. h_s(t; θ). The weighted likelihood is not a classic likelihood (a product of probabilities), and so the inverse of the Hessian matrix [7] of the weighted likelihood does not asymptotically provide an estimate of the covariance matrix of the parameters. Eq. 48 is an example of an M-estimator [8]. A complete derivation of the asymptotic covariance matrix for the parameters θ can be found in the appendix of Ref. [9]; here we only summarise the main findings.
A complication arises due to the fact that the sWeights depend, via Eq. 18, on the inverse covariance matrix elements W xy , which are usually determined via Eq. 22. The estimates W xy in turn depend on the estimates of the signal and background yields,N s andN b , usually determined from an extended maximum likelihood fit. Problems of this type are described as two-step M-estimation in the statistical literature [10,11]. To account for the fact that the parameters are estimated from the same data sample and are therefore not independent, one has to combine the estimating equations for the parameters of interest with those of the yields and the inverse covariance matrix elements in a single vector.
We construct the quasi-score function S(λ), where λ = {N_s, N_b, φ, W_ss, W_sb, W_bb, θ} is the vector of all such parameters, and φ and θ are vectors of the shape parameters in m and t, respectively. The elements of S are the estimating equations of the corresponding parameters: the extended maximum-likelihood score functions for the yields and the m-shape parameters, the defining relations of Eq. 22 for the matrix elements W_xy, and the weighted score functions of Eq. 48 for θ. The expectation value of S(λ) vanishes at the true parameter values [9]. Therefore, a consistent estimate λ̂ can be constructed as the solution to S(λ̂) = 0. We note that the elements of S(λ) can be multiplied by arbitrary non-zero constants without changing these results.
The asymptotic covariance of λ̂, which includes the parameters of interest θ, is then given by [12–14]

C_λ = E[∂S/∂λᵀ]⁻¹ C_S ( E[∂S/∂λᵀ]⁻¹ )ᵀ,   (50)

where ∂S/∂λᵀ is the Jacobian matrix built from the derivatives ∂S_k/∂λ_l and C_S = E[S Sᵀ]. We note that the inverse of the Jacobian ∂S/∂λᵀ introduces correlations between the parameter uncertainties. In a finite sample, the expectation values in Eq. 50 can be estimated from the sample: the estimate for E[∂S/∂λᵀ] is ∂S/∂λᵀ evaluated at λ̂, while the elements of the matrix C_S are provided in Appendix G. In the literature, Eq. 50 is often referred to as the sandwich estimator, but in our case the variance of the score is modified because we consider fluctuations in the sample size.
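The structure of the sandwich formula of Eq. 50 can be sketched in one dimension, treating the weights as fixed and independent (i.e. omitting the sWeights-specific correction term of Eq. 51); the exponential model and the stand-in weights below are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

# Weighted fit of h_s(t; λ) = λ exp(-λ t); assumed true λ = 2 and stand-in
# per-event weights that are independent of t
t = rng.exponential(1.0 / 2.0, 5000)
w = rng.uniform(0.2, 1.8, 5000)

lam = w.sum() / (w * t).sum()            # solves the weighted score, Eq. 48

score_i = w * (1.0 / lam - t)            # per-event score contributions
J = -np.sum(w) / lam**2                  # derivative of the summed score
var_sandwich = np.sum(score_i**2) / J**2 # J^{-1} C_S J^{-1}
var_naive = lam**2 / np.sum(w)           # inverse "Hessian" of the weighted likelihood

print(lam, var_sandwich, var_naive)      # the naive variance differs from the sandwich one
```

For these weights the naive inverse-Hessian variance underestimates the sandwich variance by roughly the factor E[w²]/E[w], illustrating why the weighted-likelihood Hessian alone is not a valid covariance estimate.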
In the case of classic sWeights, and when the shapes of g_s(m) and g_b(m) are known, some simplifications of the expressions in Eq. 50 are possible, as detailed in Ref. [9]. They result in a covariance matrix for the parameters of interest θ consisting of two terms (Eq. 51), with auxiliary sums in which the index pairs (xy) and (uv) iterate over {ss, sb, bb}, and with ŵ_s(m_i) = w_s(m_i; Ŵ_ss, Ŵ_sb, Ŵ_bb). The asymptotically correct expression for the binned approach is also derived in Ref. [9]. The first term of Eq. 51 is the covariance for a weighted score function as described by Eq. 48 with independent weights w_i. The second term is specific to sWeights and always reduces the covariance of the estimate θ̂. This reduction is caused by the fact that the sWeights are estimated from the same data sample. If the shapes of g_s(m) and g_b(m) are also estimated from the data sample, Eq. 51 has to be extended with further terms, see Appendix G.

V. PRACTICAL APPLICATIONS OF COWS AND SWEIGHTS
All of the studies in this section are available to view online at Ref. [15]. This includes generic implementations of extracting sWeights (Sec. II) and COWs (Sec. III) with the variants detailed in this document, as well as a class which performs a correction to the covariance matrix when fitting unbinned weighted data (Sec. IV). The interface is provided in python and offers support for probability distribution functions defined in either scipy [16], ROOT (via TTrees) [17] or RooFit [18]. We also point out that the RooStats [19] package implements what we here call sWeights Variant B but does not implement the other variants or COWs.
An important point to remember is that the derivation of the sWeights formalism in Sec. II simply requires a sensible estimate for the signal and background shapes, g(m). It does not require any special refitting or yield-only fitting, which has been commonly recommended in other sWeights discussions. Using the formalism outlined in this article, one only needs to fit the discriminant variable(s) (usually a candidate invariant mass) once, with the freedom to float, fix or constrain any parts of the shape or yields therein to obtain ĝ(m). One can then extract the sWeights for any component of ĝ(m) and need not be concerned about fixed or constrained yield parameters. Moreover, the range used to compute the weights can even be different from the one used in the fit, and indeed one could even use a binned fit (e.g. if the sample is large) to obtain estimates of the p.d.f.s and still extract per-event sWeights. This formalism also allows one to extract the pure weight function, i.e. one that is valid for any value of the mass, not just a weight per event. In the case of extracting COWs a fit never even needs to be performed; one simply needs estimates for ĝ_s(m), ĝ_b(m) and I(m). As described in Sec. III A these can be obtained from the data sample directly for ĝ_s(m) and I(m), and as a sum of polynomials for ĝ_b(m). As we will see in the practical examples below there are some pitfalls to be wary of, and we would always recommend that each use case follows a similar approach to that shown here: produce ensembles of simulated events to check that biases are small and variances are as expected.

A. Statistical test of independence
An important prerequisite for the extraction of sWeights is that the data samples for the discriminant and control variables are statistically independent for both signal and background, which means that the total p.d.f. factorises in the discriminant and control variables. If this is not the case then the extracted sWeights can be biased. The COWs formalism, described in Sec. III, allows one to overcome this by expanding the p.d.f. into a series of terms which do factorise. In order to check the independence in a data sample we recommend use of the Kendall rank correlation coefficient [20]. A simple function to compute the correlation coefficient, τ, is provided in Ref. [15]. It should be noted that the uncertainty on τ scales approximately as 1/√N, where N is the sample size.
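As a minimal illustration of such a test with synthetic numbers (all values below are assumed), `scipy.stats.kendalltau` can be applied directly:

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(11)

# Independent m and t: tau should be compatible with zero within ~1/sqrt(N)
m = rng.normal(5300.0, 15.0, 2000)
t = rng.exponential(1.5, 2000)
tau, pvalue = kendalltau(m, t)
print(tau, pvalue)  # |tau| of order 1/sqrt(2000)

# A dependent pair: t drawn with a mean that shifts with m
t_dep = rng.exponential(1.5, 2000) + 0.02 * (m - 5300.0)
tau_dep, p_dep = kendalltau(m, t_dep)
print(tau_dep, p_dep)  # clearly non-zero tau with a very small p-value
```

As noted above, the rank correlation is only a necessary check: a vanishing τ does not prove independence, so passing this test does not by itself validate the factorisation assumption.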

B. A simple example comparing sWeight variants
A simple example has been considered to demonstrate the method and illuminate some of the small differences between the variants described in Sec. II B. A common application of sWeights in flavour physics is to extract the lifetime of a candidate using its invariant mass to isolate it from the background. In this example we take two independent variables: the invariant mass m and decay time t of a B-meson candidate. Our observed dataset contains an arbitrary mixture of signal events, normally distributed in m and exponentially distributed in t, and background events, exponentially distributed in m and normally distributed in t. The m and t projections of the p.d.f., which is the f(m, t) of Eq. 1, used to generate simulated events are shown in Fig. 1.
For each simulated dataset, the estimates ĝ_s(m) and ĝ_b(m) are obtained by fitting back the generated mass distribution. We then compute the Ŵ_xy matrices of Eqs. 19, 22 and 27 for Variants A, B and C, respectively. Finally, the weight functions, both ŵ_s(m) and ŵ_b(m), are extracted for each variant using Eq. 18. Within Variant C we extract the weight functions using both of the methods described in Sec. II C: i) by twice inverting the covariance matrix and ii) by using Eq. 29 on the covariance of a fit in which only the yields float.
The distributions of the weight functions, ŵ_s(m) and ŵ_b(m), as a function of the discriminant variable, the invariant mass, are shown for the nominal Variant B method in Fig. 2 for one pseudo-experiment containing 5K (20K) signal (background) events. The other variants give very similar looking distributions, although small differences can be seen when inspecting their relative differences, as shown in Fig. 3. It is useful to confirm the formalism of Sec. II with a numerical evaluation of this example. Indeed we see, with all four of the methods inspected here, that ∫ ŵ_i(m) ĝ_j(m) dm = δ_ij, as well as Σ_i ŵ_i(m) = 1 for all m, to a high numerical precision. We also evaluate the sum of weights and the sum of squared weights in order to make a comparison with the yield estimates and uncertainties extracted from the discriminant variable fit (for a proof that the sum of squared weights provides an estimate for the asymptotic variance see Appendix H).
The results are shown in Table I, along with those from the free fit and a fit with only the yields floating. This demonstrates Eq. 23 for Variant B, i.e. that the fitted yield is exactly reproduced by the sum of weights. Whilst at first glance the sum of squared weights may appear to underestimate the variance of the fitted yield, one has to realise that the weights are agnostic of any variance in the shape parameters. Table I shows that the sum of squared weights accurately reproduces the variance of a fit in which only the yields float. Finally, we apply the signal weights to our dataset in the control dimension, t, and fit this with the expected exponential distribution. We subsequently find that we obtain an accurate estimate of the shape, ĥ_s(t), finding that the slope parameter has a very similar value to that which would have been obtained had we performed the fit in two dimensions to start with. The weighted and true distributions in the control variable t are shown in Fig. 4, and the fitted parameter values in Table II. Note that the uncertainties on these parameters are appropriately scaled according to the description given in Sec. IV, as we are now fitting weighted data. We then repeat this study on ensembles containing 500 pseudo-experiments in order to ensure that the behaviour seen is not just a fluke of the specific dataset shown in this example. We also perform the same study on ensembles with smaller sample sizes and with different signal-to-background ratios; the results are shown in Figs. 5 and 6. We find that each of the variants described here gives very similar results and can accurately reproduce the full two-dimensional fit with, at least in this case, a minimal loss in precision. Figure 5 shows that the sum of weights (left two panels) for Variant B accurately reproduces the fitted yield.
Variant A is also unbiased in this respect but has a slightly larger spread (note the very small y-axis range), whilst Variants Ci and Cii show a very small bias, tending to overestimate the yield by about 0.1 per mille. When inspecting the variance properties via the sum of squared weights (right two panels), we can see that all of the methods tend to very slightly overestimate the fit uncertainty. Variant B has a much larger spread of variances than the other methods, which are all similar. Figure 6 shows the importance of computing the covariance matrix correction using Eq. 51 (compare the brown points with the rest). For very small amounts of signal, i.e. either a small overall sample size or a small signal-to-background ratio, we see some slight biases and a much smaller average uncertainty when using the weights method, as compared to the full two-dimensional fit. Inspection of the studentised residual distributions suggests a small amount (∼ 10%) of undercoverage in these cases, which is most likely due to the asymptotic assumptions made when correcting the covariance matrix no longer being valid.
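The sum-of-weights properties discussed above can be illustrated on a toy sample. In the sketch below the true shapes and fractions stand in for fitted ones (a simplification: with fitted yields the sum of signal weights would reproduce the fitted yield exactly, per Eq. 23), so here the sum of signal weights only approximates the generated yield up to statistical fluctuations. All shapes, ranges and the seed are our own illustrative choices:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import trapezoid

rng = np.random.default_rng(7)

# Toy sample in the spirit of the example: 5K signal and 20K background events.
lo, hi, scale = 5000.0, 5600.0, 400.0
m_sig = rng.normal(5280.0, 15.0, 5000)
m_bkg = lo + rng.exponential(scale, 40000)
m_bkg = m_bkg[m_bkg < hi][:20000]
m = np.concatenate([m_sig, m_bkg])

lam = 1.0 / scale
gs = lambda x: norm.pdf(x, 5280.0, 15.0)
gb = lambda x: lam * np.exp(-lam * (x - lo)) / (1.0 - np.exp(-lam * (hi - lo)))

# Simplification: true fractions stand in for fitted ones.
zs, zb = 0.2, 0.8
grid = np.linspace(lo, hi, 60001)
G = np.vstack([gs(grid), gb(grid)])
fg = zs * gs(grid) + zb * gb(grid)
A = np.linalg.inv(trapezoid(G[:, None, :] * G[None, :, :] / fg, grid))

fm = zs * gs(m) + zb * gb(m)
ws = (A[0, 0] * gs(m) + A[0, 1] * gb(m)) / fm   # per-event signal weights
wb = (A[1, 0] * gs(m) + A[1, 1] * gb(m)) / fm   # per-event background weights

assert np.allclose(ws + wb, 1.0, atol=1e-4)     # weights sum to unity per event
n_sig, var_sig = ws.sum(), np.sum(ws**2)        # approx. yield and its variance estimate
```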

C. A more complex example with Variant B
In this section we test a more complex example, covering another common use case in flavour physics, in which there are multiple factorising components within $f(m,t)$, of which some may be signal and some background. In this example we again have an invariant mass as the discriminating variable, but now with six different components, each with a different p.d.f.; some even peak under or near the signal in a similar way. For the control variables we use a simple discrete integer which labels the true component, $c \in [1,6]$, as well as two "Dalitz" variables. We have assumed that the discriminating invariant mass variable is constructed from a three-body decay of the form $X \to ABC$, and in this case the Dalitz variables are the invariant masses squared of the $AB$ and $AC$ combinations. We generate a pseudo-experiment from the true underlying model in which the Dalitz variables are flat across the phase space for all components, apart from the signal, which has a resonance in the $AB$ invariant mass, and one of the backgrounds, which has a resonance in the $AC$ invariant mass; these appear as horizontal and vertical bands in the Dalitz plot. A visualisation of the generated dataset in the discriminating variable, $m$, is shown in Fig. 7. The control variable distributions are shown in Fig. 8, where events have been coloured according to their true event type. As in the previous example, the generated dataset is fitted to obtain estimates for $\hat{g}_i(m)$, and it is the result of this fit which is shown in Fig. 7. We then use the method of Variant B to obtain the $W_{xy}$ matrix (in this case a 6×6 matrix), after which the six weight functions, $w_i(m)$, are extracted. The distributions of these weight functions are shown in Fig. 9. We can then inspect the distributions of the control variables when the various weights have been applied. One can see a very good recovery of the "control" variable in Fig. 10 and of the Dalitz variables for the signal component in Fig. 11.
The weighted Dalitz plots for the other components show a similar level of agreement with the truth. As in Table I, we again find in this example that $\int w_i(m)\, g_j(m)\,\mathrm{d}m = \delta_{ij}$ and $\sum_i w_i(m) = 1$ for all $m$, and that the sum of weights and sum of squared weights accurately reproduce the corresponding fitted yield and variance.
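The construction generalises directly from two to $n$ components: the $W$ matrix becomes $n \times n$ and the $n$ weight functions are obtained from its inverse. A minimal sketch with three hypothetical components (the paper's example uses six; our shapes are illustrative, including a background peaking near the signal):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import trapezoid

def cow_weight_functions(shapes, fractions, grid):
    """Weight functions w_i(m) for n components, built with I(m) = f(m)."""
    G = np.vstack([g(grid) for g in shapes])
    f = np.asarray(fractions) @ G
    W = trapezoid(G[:, None, :] * G[None, :, :] / f, grid)
    return np.linalg.inv(W) @ G / f

# Hypothetical three-component illustration: a narrow signal peak, a broader
# background peaking nearby, and a flat background.
lo, hi = 5000.0, 5600.0
shapes = [
    lambda m: norm.pdf(m, 5280.0, 15.0),
    lambda m: norm.pdf(m, 5230.0, 40.0),
    lambda m: np.full_like(m, 1.0 / (hi - lo)),
]
z = [0.2, 0.3, 0.5]
grid = np.linspace(lo, hi, 60001)
w = cow_weight_functions(shapes, z, grid)

# the orthonormality and unitarity properties hold component by component
G = np.vstack([g(grid) for g in shapes])
ortho = trapezoid(w[:, None, :] * G[None, :, :], grid)
assert np.allclose(ortho, np.eye(3), atol=1e-6)
assert np.allclose(w.sum(axis=0), 1.0, atol=1e-3)
```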
This more complex example, in contrast to the previous simple case, exhibits rapidly oscillating weight functions (see Fig. 9), which oscillate much more quickly than the relevant component shapes themselves vary. This is because the weight is related to how the shapes overlap as well as to how they vary with mass. One can also see that competing (i.e. similar) shapes oscillate out of phase, which is what we would expect as their yields are anti-correlated. It is worth noting that the sum of all component weights for any value of the discriminating variable, in this case the invariant mass, is still unity. It is also worth highlighting that the components with the smallest yields have the largest weight-function amplitudes. Clearly, this is because small yields have large uncertainties and therefore require a large variance of the weights. This can lead to fairly sizeable fluctuations in the weights for small contributing samples when inspecting a relatively fine-grained phase space, like that of the Dalitz plot. When inspecting certain distributions it is possible to see artefacts of these fluctuations appearing as features, for example a band might seem to appear in a Dalitz distribution when in reality it is just large fluctuations around zero. Minimising the size of these fluctuations is prudent, as it is generally undesirable to have a few events with very large weights. However, this issue only arises when trying to project out control variables for components which have a very small yield in the discriminating variable. Our recommendation is therefore to proceed with caution when using the sWeights method for a fit component which is considerably smaller than others in the fit.
D. An example exploiting COWs with a non-factorising background and efficiency effects

The final example we investigate considers an extreme case which has similar features to the first example (Sec. V B) but contains a highly non-factorising background model and a non-factorising efficiency. This emulates use cases in which the signal efficiency is straightforward to estimate but the background efficiency is not. The true model used to generate ensembles of experiments is shown in Fig. 12, in which the non-factorising nature of the background is manifest: the exponential slope of the background in mass varies with decay time, and both the mean and width of the normal distribution describing the background in decay time vary with mass. Projections of the integrated distributions, along with the projection of the efficiency model used, are also shown.
For this set of tests we perform an analysis on ensembles of simulated datasets using Variant B of the sWeights procedure described above, along with various implementations of the COW formalism presented in Sec. III. For the sWeights implementation the signal, $\hat{g}_s(m)$, and background, $\hat{g}_b(m)$, distributions are estimated by fitting the simulated sample, as is done in the other examples above. For the COWs implementation the same signal model, $\hat{g}_s(m)$, is used and a variety of tests are performed using:

• the same estimate of the background, $\hat{g}_b(m)$, as in the sWeights case, as well as polynomial background models of various orders, combined with different choices of the variance function $I(m)$.

The results for this analysis are shown in Fig. 13, in which the simulated sample size is 2K events, with equal amounts of signal and background. We have also tested cases with different signal-to-background ratios and with different sample sizes, and the conclusions are rather similar, except that fewer polynomial orders are required to achieve a minimal bias when the sample size is smaller. It is also worth noting that for small samples (< 100 events) there are small biases due to the fact that the covariance correction of Sec. IV is only asymptotically valid.
It can be seen from Fig. 13 that in the case of a highly non-factorising background model the traditional sWeights method can have a severe bias (first panel of Fig. 13). This is also the case for the COW formalism when $I(m) = f(m)$ or $I(m) = 1$ (second and third panels of Fig. 13), neither of which contains the appropriate efficiency correction. This is overcome by using sums of polynomials, which can effectively mitigate the non-factorising efficiency and non-factorising background. One can see that higher polynomial orders achieve a smaller bias but reduce the statistical power of the method (the bottom panel of Fig. 13 shows the equivalent sample size from the sum of signal weights with respect to the generated number of signal candidates). When using $I(m) = q_B(m)$ these biases are significantly reduced, because in this case the estimate of $q_B(m)$ is suitably efficiency corrected. Small biases remain if the background description is not sufficient (in this case, for example, a first-order polynomial is not enough). Figure 13 shows that when using the polynomial expansions for the non-factorising backgrounds the COWs formalism performs well, even in this extreme case, depending on the order of polynomial used in the background modelling and the form of the $I(m)$ variance function. With suitable choices of these, the bias can be minimised, at a price in statistical precision: the higher the polynomial order, the worse the precision, and the fewer bins or the smaller the sample used for the $q_B(m)$ estimate, the worse the precision. This choice will clearly be analysis specific and should be carefully considered on an individual basis, trading off systematic bias against statistical precision.
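The effect of the variance function $I(m)$ on statistical power can be illustrated numerically: for a factorising toy model, the choice $I(m) = f(m)$ minimises the variance of the weighted signal yield among choices of $I(m)$. The shapes and fractions below are illustrative assumptions, not the paper's model:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import trapezoid

# Illustrative factorising toy model (our own choices):
lo, hi, lam = 5000.0, 5600.0, 1.0 / 400.0
gs = lambda m: norm.pdf(m, 5280.0, 15.0)
gb = lambda m: lam * np.exp(-lam * (m - lo)) / (1.0 - np.exp(-lam * (hi - lo)))
zs, zb = 0.2, 0.8

m = np.linspace(lo, hi, 60001)
G = np.vstack([gs(m), gb(m)])
f = zs * gs(m) + zb * gb(m)

def signal_variance_proxy(I):
    """int w_s(m)^2 f(m) dm for the COW built with variance function I(m)."""
    W = trapezoid(G[:, None, :] * G[None, :, :] / I, m)
    ws = (np.linalg.inv(W) @ G / I)[0]
    return trapezoid(ws**2 * f, m)

v_opt = signal_variance_proxy(f)                # I(m) = f(m)
v_one = signal_variance_proxy(np.ones_like(m))  # I(m) = 1
assert v_opt <= v_one                           # I(m) = f(m) gives the smaller variance
```

In the non-factorising, efficiency-distorted case studied in this section this ordering is no longer the whole story, since the bias of each choice must be weighed against its precision.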

VI. CONCLUSIONS
In summary, this article gives a fresh overview and review of the sWeights method before discussing a generalisation of it, which we dub "Custom Orthogonal Weight functions" (COWs). We demonstrate that COWs can handle a variety of different applications and achieve statistically robust results with minimal loss in precision. Indeed, COWs are applicable in situations where the special case of sWeights does not work.
In the last step, Eq. 22 was inserted and the products of sums expanded. We note that the denominators of the two terms in braces differ by a factor $g_i$, and convert the first term accordingly, using $\sum_{ik} s_i b_i b_k^2 = \sum_{ik} b_i^2 s_k b_k$ (which follows from relabelling the summation indices) in the last step. Recall that an estimate for the fraction $z_k$ is given by $\hat{z}_k = \hat{N}_k D / N$, and that $E[\hat{z}_k] = z_k$. Here the normalisation $D$ is an unknown constant, and for the following it is sufficient to simply assume that $D$ exists. As an aside, if one assumes a functional form of $I(m)$ which provides weights that sum to unity (Appendix H shows that any linear combination will satisfy this requirement), and noting that the estimates $\hat{z}_k$ also have to sum to unity, then $D$ can be estimated from the data as the harmonic average of the efficiencies, Eq. (D3). Assuming simply that $D$ exists, the variance of $\hat{z}_k$ follows from Eq. D1. If $I(m)$ is to be chosen such that the variance of $\hat{z}_k$ is minimal, it then follows that the expectation value $E[w_k^2(m)/\epsilon^2(m,t)]$ must be minimal. The minimisation has to incorporate the constraints that the integrals $\int w_k(m)\, g_l(m)\,\mathrm{d}m$ are either zero or one, which is done with Lagrange multipliers $2\lambda_l$, leading to the extremum condition of Eq. (E2). Inserting the estimates $\hat{z}_k = \hat{N}_k D / N$ then yields Eq. (E3). The solution of this system of non-linear equations requires that the right-hand side is the same for all $k$, namely $N/D$. Noticing the similarity with Eq. D1, one can choose $I(m)$ such that the sum in Eq. E3 becomes $N/D$. In this case the extremum condition is satisfied, and therefore
$$I(m) = \sum_{l=0}^{n} \hat{z}_l\, g_l(m). \qquad \text{(E5)}$$
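The variational step above can be restated compactly in the simplified case of unit efficiency, $\epsilon(m,t) \equiv 1$, writing $q(m)$ for the observed density. This is a schematic sketch of the Lagrange-multiplier argument, not the full derivation:

```latex
% Minimise E[w_k^2] subject to the orthonormality constraints,
% using Lagrange multipliers 2\lambda_l (unit efficiency assumed):
\begin{equation*}
  \frac{\delta}{\delta w_k(m)}
  \left[ \int w_k^2(m)\, q(m)\,\mathrm{d}m
    - 2 \sum_l \lambda_l \left( \int w_k(m)\, g_l(m)\,\mathrm{d}m
      - \delta_{kl} \right) \right] = 0
  \quad\Longrightarrow\quad
  w_k(m) = \frac{\sum_l \lambda_l\, g_l(m)}{q(m)} ,
\end{equation*}
% i.e. the optimal weight carries the observed density in its denominator,
% so the optimal variance function is I(m) = q(m) \approx \sum_l \hat{z}_l\, g_l(m),
% consistent with Eq. (E5).
```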