Handling uncertainties in background shapes: the discrete profiling method

A common problem in data analysis is that the functional form, as well as the parameter values, of the underlying model which should describe a dataset is not known a priori. In these cases some extra uncertainty must be assigned to the extracted parameters of interest due to lack of exact knowledge of the functional form of the model. A method for assigning an appropriate error is presented. The method is based on considering the choice of functional form as a discrete nuisance parameter which is profiled in an analogous way to continuous nuisance parameters. The bias and coverage of this method are shown to be good when applied to a realistic example.


Introduction
A very common problem in data analysis is to determine the parameters of a narrow signal which occurs on a wide, smoothly varying background. In high energy physics, this is often achieved using a maximum likelihood technique [1] in which separate parametric models for the signal and background processes are constructed and "fit" to the data. If, however, the shape of the background is not known a priori, then there will be some uncertainty in the signal parameters resulting from the uncertainty in the function used. This issue is exacerbated in the case of a small signal to background ratio.
A common approach is to try various different plausible functions and determine the spread of the signal parameters when using these functions. However, these methods tend to have some degree of arbitrariness and so a new approach is discussed in this paper. This new method was developed as part of the analysis of data at the CMS experiment following the discovery of the Higgs boson [2,3]. It was applied to the case where the Higgs decays to two photons, which results in a narrow signal on a large background [4].
This method tries to be less arbitrary and treats the uncertainty associated with the background parameterisation in a way which is more comparable with the treatment of other uncertainties associated with the measurement; the choice of background function is a systematic error which is handled as a nuisance parameter [5]. There are two major new parts to this approach, namely the method for treating the choice of function as a nuisance parameter, and how to compare functions with different numbers of parameters.
The concept of this approach is described in Section 2. The application of the method to functions with the same number of parameters is described in Section 3 and to functions with different numbers of parameters in Section 4. Further discussion on the method, namely its practical application to the real-world problem of the Higgs measurements, is given in Section 5.
Within this paper, twice the negative of the logarithm of the likelihood function is denoted by Λ. The fits discussed are binned and the likelihood used for each bin is the Poisson likelihood ratio to the best possible likelihood given the observed data, i.e.
where n i are the observed and ν i is the expected number of events given a particular background model in the i th bin. For the purposes of fitting and generating datasets, the statistical package "RooFit" is used throughout this paper [6].

Concept of the method
If a small signal sits on a large background, relatively small changes to the background shape can have significant effects on the apparent signal size and position. Unless there is a reliable theory or simulation which can constrain the background shape, there is an uncertainty about which function to use to fit the background. Hence, different choices of the form of the function will give different results for the signal parameters, meaning there is a systematic uncertainty associated with the choice of function. The basic concept discussed in this paper is to consider this choice as a source of systematic error which is modelled as a nuisance parameter. It is then consistently treated like other nuisance parameters as far as possible.
The two main parameters of a signal are the position (i.e. the mean) and size (i.e. the amplitude), although other quantities such as the width may also be of interest. This results in a multi-dimensional parameter space. For simplicity in this paper, the position and width of the signal are considered to be known and only the signal size is to be determined. However, the method is applicable to cases where more than one parameter is being estimated.

Continuous nuisance parameters
It is useful to consider briefly the usual way in which nuisance parameters are used to incorporate systematic errors. Consider measuring a signal property for a known background functional shape, but with unknown function parameters. The background parameters themselves are considered as nuisance parameters, since their actual values are not of interest. When fitting for a parameter of interest x, then twice the negative logarithm of the likelihood (Λ) is minimised with respect to x and all the nuisance parameters. It is usual to construct a profile likelihood such that Λ is minimised with respect to the nuisance parameters for each value of x over a range. The (for example) 68.3% confidence interval of x is then taken as the region for which Λ is less than Λ BF + 1, where Λ BF is the best-fit value at the overall minimum. Similarly, the 38.3%, 95.4% and 99.7% confidence intervals are defined by the regions in which Λ is less than Λ BF + 0.25, Λ BF + 4 and Λ BF + 9 respectively.
Consider this absolute minimum in Λ: if the nuisance parameters at that point were fixed, then the profile Λ curve would be steeper than for the original overall Λ curve. This directly reflects the fact that if the nuisance parameters had no uncertainty, the error on x would be reduced, i.e. there would be no systematic uncertainty arising from this source. The same is true for any other point on the original curve; if the nuisance parameters were not allowed to vary, each would result in a new steeper curve, in these cases with a minimum above the original curve. The width of these curves would again only be affected by the statistical power of the dataset being fitted, not by the systematic errors encompassed in the nuisance parameters. Finally, there are in general many other sets of nuisance parameters which do not result in a point on the original profile curve. Fixing the parameters to these values will also result in further curves which are enclosed by the original profile curve. The critical point is that the overall minimum as a function of x encloses the curves for all possible values of the nuisance parameters. Now consider finding the profile curve in a different way, illustrated in figure 1. Each set of nuisance parameter values will result in a curve, the steepness of which reflects the statistical error only. Consider picking many different sets of nuisance parameter values; each gives one such curve. Drawing an "envelope" around the minimum Λ value of all these curves will give an approximation to the original profile curve. Clearly, in practice, many such sets of nuisance parameter values would be needed to make this curve smooth. However, in principle, sampling the nuisance parameter space sufficiently and then finding the enclosing minimum envelope should give the profile curve required. Note, it would of course be possible to mix the two methods, i.e. choose sets of only some of the parameter values and fit for the others.

Background function uncertainty
The method of picking various sets of nuisance parameters and finding an envelope can conceptually be applied to the uncertainty on the background functional form. Each background function considered can be labelled by an integer. The integer is then treated as a (discrete) nuisance parameter for the profile calculation and hence this method has been termed "discrete profiling". The minimum envelope then gives the overall profile curve including all functions considered. This automatically includes any systematic error arising from the choice of function, as the envelope will in general be broader than any of the individual curves which contribute to the envelope.
In practical terms, most minimisation routines cannot easily handle a discrete parameter, particularly when the number and values of the other parameters of the fit (i.e. the parameters of the functions) change with the discrete parameter. Hence, in practice it has to be handled by mixing the methods as discussed in the previous section, i.e. choose each function and minimise the other parameters in the usual way to give a profile curve for each function in turn.
While the concept is straightforward, there are some statistical issues in applying it. Firstly, there is a question of whether the Λ minimum envelope covers, i.e. has the correct properties for a profile curve. This is discussed in Section 3. Secondly, to find the minimum requires a comparison between the absolute Λ values from fits to different functions. In general, these will The Λ profile curve for the nuisance parameters fixed to the best fit values is shown by the blue, solid line. The full profile curve allowing the nuisance parameters to be fitted for every value of x is shown by the black, solid line. The red dashed lines show the Λ curves for fixed nuisance parameter values other than those at the best fit, while the green dashed line is the envelope, i.e. the lowest value of any of the red dashed curves for each x. Even with such a coarse sampling of the nuisance parameters, it is seen that the envelope approximates the full profile curve.
include functions with different numbers of parameters. Hence, a way to meaningfully compare their Λ values must be found. This is discussed in Section 4.

Using functions with equal numbers of parameters
The simplest application of the envelope method is to the case where all functions used have the same number of parameters. This avoids several extra complications which are discussed in the next section.

Function definitions
The first study presented uses four functions, each of which has two parameters. These functions are chosen as they, and their higher order equivalents, are feasible representations of the background shape seen in the Higgs analysis. The functions are detailed below; in each case p 0 and p 1 are the two parameters.  Figure 2. Best fits of the four two-parameter functions (described in the text). The Laurent function is effectively identical to the power law function and so is hidden below the power law line. Note, for clarity in this plot, the data have been rebinned into 40 bins, although the fits were done with finer binning of 160 bins.

Example case
In the CMS Higgs to two photons analysis, the discrete profiling method is applied to the actual data taken by the CMS experiment; specifically the likelihood fit is performed using the invariant mass spectrum of the pairs of two photons simultaneously across different categories. However, for the purposes of this paper, the "data" used to illustrate the method are generated by a Monte Carlo method from a smooth background function which is similar in shape and magnitude to the real data sample used. (However, it is emphasised that this is not the actual data sample and so none of the detailed results presented below can be used to deduce any properties of the Higgs boson itself.) This original dataset was generated in 160 bins between 110 and 150 GeV in the mass variable used, m γγ . It included signal events generated according to a Gaussian distribution with a normalization of 50.8 events, a mean of 125 GeV and a width of 1.19 GeV. These values are taken from a single category of the Higgs analysis. In the following, the signal strength results are given in terms of the relative strength µ, meaning the ratio of the measured number of the signal events relative to the expected number.
The four two-parameter functions mentioned above were each separately fitted to this original dataset. A Gaussian signal component was also included, where the mean and width of the Gaussian were fixed to the same values as used to generate the events. The magnitude of the signal Gaussian and both parameters in the background function were determined from the fit. The results are shown in figure 2. It is clear the first order polynomial does not fit well at all, while the other three functions appear to give reasonable fits.
The profile scan as a function of the relative signal strength µ between −1.0 and 2.5 for the four functions is shown in figure 3. The absolute minimum occurs for the power law function at a relative signal strength of µ = 0.93. If just considering this function, then one would set a 68.3% confidence interval on µ of 0.43 < µ < 1.40, determined as the interval for which ∆Λ = Λ − Λ BF < 1. The measurement would therefore be reported with its standard error as µ = 0.93 +0.47 −0.50 . The Laurent function gives an identical result within the precision given. When fitting the exponential function a very similar Λ value is obtained at its best fit, but gives a best relative signal strength of µ = 0.72 and a 68.3% confidence interval of 0.27 < µ < 1.24, which would be quoted as Fitting with the straight line yields a very different result of µ = 0.01 +0.51 −0.47 although it is clear that this function does not describe the data well and it gives a much larger Λ value. The fact that the different functions can give different best fit values is a direct example of the systematic uncertainty associated with the choice of function.
The envelope around these functions is shown in figure 4. As it has to be, the best fit is still µ = 0.93 from the power law but now the standard error is enlarged by the exponential function contribution to the envelope on the lower side of the scan. Hence, taking all four functions into account, the 68.3% confidence interval on µ is 0.37 < µ < 1.40, i.e. the lower limit extended by the exponential fit and the upper limit from the power law fit, and the measured value of µ which would now be quoted is µ = 0.93 +0.47 −0.56 . The enlarged uncertainty is a direct reflection of the systematic error arising from the function uncertainty. As also shown in figure 4, the 95.4% confidence interval is −0.18 < µ < 1.92. There are two things to note. Firstly, although the Laurent fit is effectively identical to the power law, there is no issue with "double-counting" as the envelope just takes the lowest Λ. Also, it is clear that the poor fit of the polynomial means it plays no role in the envelope and so this function is "automatically" ignored by the method, without requiring any arbitrary criterion for including it or not.

Bias and coverage
The bias and coverage properties of the discrete profiling method were tested using a large ensemble of pseudo-experiments ("toys"). In each pseudo-experiment, a dataset including signal and background events was generated using a Monte Carlo technique. Ensembles were generated, and consequently the bias and coverage properties tested, under various different background hy-  potheses and for different values of the signal strength, µ. The background function from which the Monte Carlo events where generated was chosen to be one of the power law, exponential or Laurent two-parameter functions discussed above in section 3.1. The background parameters for generating the toys are set to their best fit values for the given value of µ.
In addition, a further ensemble of toy datasets was generated, for which the background function itself, as well as its parameters, was chosen according to the best fit for each value of µ. As can be seen from figure 4, this means that for values of µ < 0.55 the exponential background function will be used and it will be a power law function otherwise. Conceptually, this method again treats the choice of function as a discrete nuisance parameter and so picks the best fit values of all nuisance parameters for each µ value.
The resulting toy datasets are now treated identically to the original dataset. The bias and coverage are assessed by calculating the fitted signal strength, µ, and its error, σ , for each toy when using different background models. The background functions tested were the four twoparameter functions discussed above and additionally the result of forming the envelope around all four (in other words the result of treating the choice of function as a discrete nuisance parameter).
We define the bias as the mean of the pull distribution, where the pull for an individual toy is defined as where µ is the generated value of the signal strength,μ is the fitted value of µ per toy and σ is the positive error onμ ifμ < µ and is the negative error onμ ifμ > µ. The results of the mean pull as a function of the generated signal strength are shown in figure 5. It can be seen, as one would expect, that when fitting back with the same background function as used to generate the toy dataset, the bias is negligible. It is also shown that the bias is not dependent on the value of µ used to generate the signal. However, when fitting back with a different background function the bias can be large; here giving a mean pull up to 0.5. The discrete profiling method provides a medium between these two in which the bias is small (a mean pull of around 0.1) regardless of the generating function used. This is important given that this method is to be applied when the true underlying function is unknown.
The coverage was tested, using the same fits, by determining the frequency of toys in which the difference in Λ between the best fit and at the value of µ used to generate the signal was less than 0.25, 1, 4 and 9. These frequencies were compared with the expected frequencies, 38.3%, 68.3%, 95.4% and 99.7% respectively. The results for these are shown in figure 6. It can be seen that when the fitting function is different from the generating function the calculated confidence interval can undercover, whereas when using the discrete profiling method the coverage is good regardless of the generating function. The coverage is found to be good independently of the value of µ used to generate the signal.

Using functions with different numbers of parameters 4.1 Corrections to the likelihood
The Λ calculated from the fit of a function to a dataset is purely a measure of the agreement of the function and the data; the number of parameters used in the function has no impact. Hence, at least for nested families of functions (such as polynomials of varying order), then the lowest Λ will always be given by the highest order considered. Because this function also has the largest number of parameters, it will generally have the largest statistical error and hence widest profiled Λ curve. Hence, simply using the Λ values without any "penalty" for the number of parameters would effectively always result in the minimum envelope being mainly defined by the highest order function considered. It would also mean there is no "natural" way to know when to no longer consider yet more higher order functions. However, decisions based on quantities, such as an Ftest [7], are standardly used to determine when higher order functions in a family can be ignored. Hence, when using functions with different numbers of parameters, it seems necessary to have some correction to the Λ value to account for this difference in number. The idea is therefore to compare Λ values, correcting for the differing number of parameters (or equivalently differing number of degrees of freedom) in the fit functions. Specifically here, where applied, the correction is done so as to get a value equivalent to a function with no parameters, so the number of degrees of freedom is the number of bins used. Two obvious methods were considered, based on the χ 2 p-value and the Akaike information criterion [8].
1. For a binned fit using the expression for the Λ ratio for each bin specified in equation 1.1, then for the large statistics case, the Λ becomes equivalent to a χ 2 for the fit. In this case, it is meaningful to find the p-value of the χ 2 value. A new χ 2 value can now be obtained, namely that which would give the same p-value but with a different number of degrees of freedom, equal to the number of bins. The corrected Λ is then given by Besides being a function of the number of bins and parameters, the size of the correction χ 2 − χ 2 depends on the original fit quality (or equivalently p-value). Figure 7 shows examples of the size of the correction as a function of the fit p-value, when correcting for various numbers of parameters. The correction is monotonically increasing as the p-value gets smaller. Hence, when correcting the profile likelihood curve, the fits further away from the best fit minimum will get a larger correction. Hence, the profile curve becomes steeper, which could in principle affect the coverage, even if only considering one fit function. However, it can be seen that the correction is approximately given by for the central range of p-values. This approximation means the Λ corr shape, and hence coverage, is unchanged. Both the exact p-value correction and the N par approximate correction are used in this paper.

The basic Akaike formula for very large sample sizes is
so the correction is simply 2N par and hence is twice as big as the approximate p-value correction. For finite samples, then it is modified to where n is the sample size, i.e. the number of bins (for a binned likelihood fit) or events (for an unbinned likelihood fit). For large n, the original formula is clearly regained. This can be used for binned or unbinned fits. Also the correction does not depend on the Λ value and so is a simple shift of the whole profile curve, without changing its shape. Hence, it also has no effect on coverage.

Function definitions
The fit functions which were used are the two-parameter functions listed in Section 3.1 and higher order generalisations of these. Specifically, these are

Example case
The functions listed above were fit to the original dataset for values of 2 ≤ N par ≤ 6; this resulted in three fits for the power law and exponential functions, and five fits for the Laurent and polynomial functions. The profile curves from the fits of these functions are shown in figure 8, where results for three different corrections to Λ have been shown, namely no correction, the approximate p-value correction, and the Akaike correction.
Consider the example case of the approximate p-value correction, i.e. correcting by one unit per background function parameter, which is shown in the middle row of figure 8. For this case, the lowest corrected Λ value is still given by the two-parameter power law function and so gives an identical central value to that described in Section 3.2. The lowest corrected Λ value for µ < 0.55 is also again given by the two-parameter exponential function, also as for the previous case. Indeed, comparison to figure 3 shows that the minimum values of the corrected Λ for low µ are simply two units larger here than in that figure. However, for 1.48 < µ < 1.68, the N par = 5 polynomial is the lowest function in the profile, while for µ > 1.68, the N par = 6 polynomial is lowest. Hence, the envelope is formed from four different functions in this case. Figure 8 also shows the 68.3% and 95.4% confidence intervals which would be determined from the profile envelope. The 68.3% region is identical to that found using just the two-parameter functions (see Section 3.2), but when including higher order functions, the 95.4% confidence interval on µ is −0.18 < µ < 2.11, i.e. it is extended to higher µ due to the influence of the higher order polynomial functions providing a reasonable description of the data.
In contrast, the case with no correction to Λ, shown in the top row of figure 8, shows that the highest order polynomial function used, with N par = 6, defines the envelope across the whole range of µ. This domination of the functions with the highest numbers of parameters is exactly what the penalty correction to Λ is attemping to avoid. Finally, the lowest plot in figure 8, corresponding to the Akaike correction, shows that all functions with more than the minimum of two parameters get a large penalty and so do not contribute to the envelope. This hints that this correction may be too severe.
The three corrections shown in figure 8 can be considered to be examples of a continuous spectrum of corrections which can be written as Λ corr = Λ + cN par where c = 0, 1 or 2 in the cases shown. Other values of c are clearly also possible. Figure 9 shows the 68.3% and 95.4% intervals which would be derived from the envelopes for these and some other values of c, as well as for the p-value correction. It is seen that the best fit value and intervals change for c < 0.5 but for larger values of c they are quite stable. Hence, the result which would be derived from this method is effectively insensitive to the exact correction used, as long as c 1.

Bias and coverage dependence on correction
In a similar way to Section 3.3, toy datasets were generated using various individual functions and also the best fit functions for each µ value. The individual functions chosen were those which gave the lowest corrected Λ values in figure 8. Figure 10 shows the average pull vs µ for each choice of generating function, with the envelope based on the three correction methods considered, i.e. p-value, approximate p-value and Akaike. It is seen that the p-value and approximate p-value corrections give effectively identical results. It is also found that the bias for some generating functions can get up to a significant fraction of the error, but that the p-value corrections always give less bias than the Akaike correction. Similarly, figure 11 shows the fraction of times the envelope fit result was within the various Λ ranges, as used previously in figure 6. Again, the coverage is reasonably good for all cases. As for the bias, the two p-value corrections are very similar and always give better coverage than the Akaike correction. Hence, overall the two p-value correction methods seem to give effectively identical results and also perform better than the Akaike correction method. In practical terms, the approximate p-value correction method is easier to implement than the exact method and also can be applied to an unbinned fit (where the χ 2 is not approximately given by the Λ value). For these reasons, the approximate p-value correction was used throughout the CMS Higgs to two photon analysis [4]. Figure 12 shows the distribution of the 68.3% uncertainty on µ when considering the full set of functions using the three different correction schemes compared with considering a single first order function. The uncertainty (σ µ ) is defined as half of the 68.3% interval on µ. The distributions are obtained from fitting toy datasets generated using the first order power law function for the background, with signal generated assuming µ = 1. This quantifies the increase in the error due to the systematic contribution arising from the uncertainty in the choice of functional form. The tail of the distributions observed when considering the envelope of functions are from fits in which functions other than the best fitting function contribute to the 68.3% interval. As expected, correcting using a higher penalty reduces the size of this tail since in this case, it is less likely that a higher order function can contribute.

Application to a real case
The actual CMS Higgs to two photon analysis [4] is significantly more complex than the simplified version used here. In particular, the 2011 and 2012 data samples are split into eleven and fourteen categories, respectively, which have differing signal:background ratios. Because the categories (by definition) have different selection criteria, they can have different background shapes. There is no a priori reason to make any assumptions that the functions used in each category should be the same. Hence, each category should be tested with all functions, in a similar way to the above. A major complication then arises because there are common systematic effects across the categories, arising from nuisance parameters in the signal model. In the absence of these common nuisance parameters, the different categories could be profiled independently, each using the minimum envelope technique to produce an envelope curve per category. These could then be summed to give the overall profile curve. However, with common nuisance parameters, all categories must be profiled at the same time. Since minimisation code to handle the discrete nuisance parameter identifying the function seems difficult, in practical terms, this means that all possible combinations of each function in each category must be fitted. The minimum envelope made from the results of all these combinations would then be found. While this is conceptually straightforward, the actual implementation is prohibitive. For example, using the 16 functions discussed in this   paper, there would be 16 11 (∼ 10 13 ) combinations of functions to be fitted for the 2011 data sample and 16 14 (∼ 10 17 ) combinations for the 2012 data sample. Instead of considering all possible combinations of functions, the function which minimizes the likelihood and its parameters can be determined through an iterative procedure. The first step is to determine the set of functions which best fit the data, in each category, fixing the values of all nuisance parameters. Since the signal parameter, µ, is also fixed for a given point in the scan no free parameters are correlated between categories. Finding the functions which minimize the likelihood in this case then reduces to the simple case, namely finding the minimum in each category independently. Once these functions, and their parameters, are found, the choice of function is then fixed while the nuisance parameters are freed and a single fit of the nuisance parameters and chosen background parameters is performed. If no significant change in the likelihood is observed, the overall minimum is achieved. However, if this is not the case, the nuisances must be fixed again and the procedure to find the best fit functions given the new values of the nuisance parameters is repeated. This iterative procedure continues until the likelihood remains stable. This procedure dramatically reduces the number of fits to be performed to minimize the likelihood and is found to produce identical results when compared to trying all combinations in toy examples. The bias and coverage properties of the discrete profiling method were tested extensively in the CMS Higgs analysis using a similar procedure to that described in Section 4.4. It was found the method gave consistently good coverage and negligible bias.

Conclusions
A method of treating the uncertainty due to the choice of background function as a discrete nuisance parameter has been shown to give good coverage and small (potentially minimal) bias.
There is an uncertainty about how to penalise functions with higher numbers of parameters. The results indicate that a penalty based on the p-value is less biased and gives slightly better coverage than one based on the Akaike criterion. The approximate p-value correction is easier to implement and can be applied more widely and so was chosen as the method to use. However the actual results for the case presented seem to be relatively independent of the exact value of the penalty.