Hunting for bumps in the margins

Data driven modelling is vital to many analyses at collider experiments; however, the inference of physical properties then becomes sensitive to the details of the model fitting procedure. This work brings a principled Bayesian picture, based on the marginal likelihood, of both data modelling and signal extraction to a common collider physics scenario. First, the marginal likelihood based method is used to propose a more principled construction of the background process, systematically exploring a variety of candidate shapes. Second, the picture is extended to propose the marginal likelihood as a useful tool for anomaly detection challenges in particle physics. This proposal offers insight into precise background model determination and demonstrates a flexible method to extend signal determination beyond a simple bump hunt.


I. INTRODUCTION
The searches and measurements at the energy frontier of collider physics take place in a data rich environment. Theoretical calculations of the backgrounds in these analyses are compromised by the complexity of the full theory compared to what is computationally feasible. The problem of modelling these background processes is often further complicated by experimental systematics that are accounted for in situ from collider data [1]. These factors motivate the use of parametric models, fit to data, to perform the background estimation [2-5]. As a result, the search for (or measurement of) rarer signal processes amongst the background becomes sensitive to this data driven fit. There are two main sources of uncertainty on the background model in such an analysis: parametric and modelling uncertainties. Parametric uncertainty arises from the uncertainty on the input data points that the model is fit to, leading to variance in the parameters of a given model. Modelling uncertainty arises from the potential for model misspecification to bias the inferred functional form, and is generally much harder to estimate. Typically it is assessed by independently fitting multiple models and comparing the results of the individual fits; however, the resulting recipe to select the model can become rather ad hoc. Without quantifiable justification, the uncertainty from the model selection is difficult to represent precisely.
Bayesian inference offers a natural framework to discuss both modelling and parametric uncertainties [6]. Parameter uncertainty is naturally expressed in the posterior distribution of the parameters of interest. Model uncertainty is captured in the marginal likelihood, or evidence, of a Bayesian calculation. Comparison of evidences between models gives a concrete method of Bayesian model selection, and can be used to define Bayesian model averages which effectively ensemble the modelling uncertainty. Alternatively (but giving equivalent results), the posterior can be extended over hyperparameters governing the choice of the model itself, for example the family of functions composing the model or the number of basis functions. Such hyperparameters can be thought of as being discrete in nature, and understanding how to deal with discrete hyperparameters is one of the core goals of this study.
Numerically marginalising over discrete parameters is a computationally difficult task as discrete model choices induce discontinuous and multimodal likelihood surfaces. Nested Sampling (NS) [7] is a well established method to perform numerical marginalisation over such a target likelihood function. Bayesian sparse reconstruction (BSR) [8] has been demonstrated to perform the marginalisation over mixtures of discrete and continuous model parameters and hyperparameters in the context of astronomical imaging.
In this analysis the framework of BSR is applied to a toy problem representing a typical case of modelling a smoothly falling background process. The resulting composition of the model is examined and contrasted with existing maximum likelihood based approaches, motivating the utility of a Bayesian model in this context. To further highlight the unique opportunities afforded by NS in tackling multimodal sampling problems, a search for generic resonances, a bump hunt, is considered on top of the discrete marginalised background model. Signal analysis in particle physics has been cast as a Bayesian model comparison problem in plenty of previous work (see [9] and references therein); by extending these kinds of signal analyses over a discrete marginalised background model, a method with the potential to tackle more subtle shape discrimination is proposed. The result is an analysis pipeline defining a powerful, fully Bayesian method for anomaly detection with well-calibrated background modelling uncertainties.

II. MODELS FOR BINNED DATA
Measuring an observable that is dominated by a single background process with a smoothly falling rate as a function of the observable is a common scenario in many LHC physics analyses. The searches for resonances in dijet mass spectra at the ATLAS and CMS detectors [2, 3] are flagship searches at the LHC that center on modelling the smoothly falling QCD background process. These analyses are particularly sensitive to the issue of extrapolating the fitted model into the poorly constrained tail of the spectrum. Precision measurements of Higgs properties in the diphoton decay channel with the ATLAS and CMS detectors [4, 5] provide a complementary example, where the result hinges on the precision of the background model in a window of the observable, surrounded and constrained by high statistics measurements of the diphoton mass spectrum. This study blends features from both these sources, examining the precise construction of a background model for a Higgs diphoton style measurement, then using this to perform an anomaly detection task reminiscent of a search analysis. In the remainder of this section the likelihood for this type of observable is defined, and the basis functions used to build the background model are introduced.

A. Likelihoods from histograms
A single measurement channel consists of M histogram bins in a measured property of collision events, x (the dijet or diphoton mass for example). With bins indexed by i, data is collected in the i-th bin by counting events occurring in a range from the lower bin edge to the upper bin edge. For simplicity the bin can instead be characterized by its mid-point, x_i, and the degree of freedom associated with the bin width can be effectively integrated out. Each bin records an observed event count, n_i, and the background model predicts this count as a function of the measured property, b(x_i). The likelihood for the set of data observations n_i, given a background model parameterized by Θ, can be taken as a product of independent Poisson distributions,

L({n_i} | Θ) = ∏_{i=1}^{M} b(x_i; Θ)^{n_i} e^{-b(x_i; Θ)} / n_i! .    (1)

Additional constraint terms, nuisance parameters or parameters of interest on signal models can then be incorporated. This work initially explores the problem of inferring a background-only model in Section III to establish the challenges of surveying discrete parameters in models. In Section IV this likelihood is then extended to consider a signal model in the guise of an anomaly detection task for generic signal models.

The smoothly falling backgrounds in this problem can be modelled as a sum of simple basis functions φ,

b(x_i; Θ) = Σ_{m=1}^{N} φ(x_i; θ_m) .    (2)

The background model is dependent on three factors in addition to the input x_i, redefining the background model parameters as Θ = {θ, N, φ}. These three components can be identified as the number of basis functions N, the parameters defining the basis functions θ, and the family of basis functions φ. The parameters N and φ are identified as the discrete hyperparameters of the background model, and the inference can either be extended over these parameters or they can be specified and conditioned on.

A frequentist approach to inference extended over nuisance parameters is to profile them [10]. Profiling a parameter removes the associated degree of freedom by conditioning on the maximum likelihood estimator for the parameter. Profiling discrete parameters is more challenging; it is achieved by considering a model ensemble conditioned on the set of discrete model choices, which can be used to build an envelope of profiled curves over the continuous parameters. A method for constructing this ensemble is demonstrated in Section III A. A Bayesian interpretation of this problem instead marginalises over these discrete hyperparameters, inverting the likelihood using Bayes theorem to obtain a posterior distribution for the model parameters, P(θ, N, φ | n_i). By marginalising over all parameters and hyperparameters in the model, a background model can be constructed that removes dependence on these choices by sampling directly from this full posterior distribution. An example of this marginalisation is demonstrated in Section III B, emphasising the role of the evidence as an alternative tool for building model averages.
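As a concrete illustration of Eq. (1) and the basis-function expansion of Eq. (2), the following minimal sketch evaluates the binned Poisson log-likelihood for a background built from a sum of exponential basis functions. The specific functional form, parameter names and numerical values used here are illustrative assumptions rather than the exact definitions adopted later in Eq. (3).

```python
import numpy as np
from scipy.stats import poisson

def background(x, amplitudes, exponents):
    """Sum of N exponential basis functions, b(x) = sum_m theta_1m * exp(-theta_2m * x).
    An illustrative stand-in for one of the families of Eq. (3); x is assumed to be
    rescaled to [0, 1] as described in the text."""
    x = np.asarray(x, dtype=float)[:, None]
    return np.sum(np.asarray(amplitudes) * np.exp(-np.asarray(exponents) * x), axis=1)

def log_likelihood(counts, x, amplitudes, exponents):
    """Binned Poisson log-likelihood of Eq. (1): ln L = sum_i ln Pois(n_i | b(x_i))."""
    b = background(x, amplitudes, exponents)
    if np.any(b <= 0):
        return -np.inf  # guard against unphysical (non-positive) expected counts
    return poisson.logpmf(counts, b).sum()

# toy usage: 80 bin mid-points on [0, 1] and a two-component exponential background
x_mid = (np.arange(80) + 0.5) / 80
rng = np.random.default_rng(0)
counts = rng.poisson(background(x_mid, [200.0, 50.0], [4.0, 1.0]))
print(log_likelihood(counts, x_mid, [200.0, 50.0], [4.0, 1.0]))
```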
Methods involving Bayesian non-parametric models have received increased attention recently [11], giving strong motivation to consider the contrasting adaptive basis function method presented in this work. Much of the attraction of Gaussian Processes derives from the fact that the marginal likelihood of such models is directly calculable from the training data with simple linear algebra operations. A marginal likelihood calculated for a set of adaptive basis functions, as presented in this work, retains the strengths of promoting functional diversity and embodying automatic relevance determination encapsulated by the marginal likelihood, whilst avoiding the conceptual issues arising from using non-parametric methods in this context. When used for anomaly detection tasks, non-parametric models require the length scale over which the anomaly manifests to be well separated from the length scale of the background process. Whilst this is generally achievable for the types of narrow width signals considered in this demonstration, posing the modelling problem as a Bayes ratio enables the data driven background models to be extended to less well separated model comparison tasks.
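For reference, the closed-form expression alluded to above is the standard Gaussian Process log marginal likelihood, quoted here in its textbook form rather than from [11]: for n training targets y observed at inputs X, kernel matrix K evaluated on X, and observation noise variance σ_n², it reads

ln p(y | X) = −(1/2) yᵀ (K + σ_n² I)⁻¹ y − (1/2) ln |K + σ_n² I| − (n/2) ln 2π ,

which is computable with a single Cholesky decomposition of K + σ_n² I.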

B. Smoothly falling functions
In order to construct a suitable set of basis functions to examine, a weak prior belief that the background model should be both constantly decreasing and smooth is initially considered. Following similar analyses conducted at collider experiments, three distinct candidate families of basis function are considered: Exponential (exp), Power law (power), and Log Polynomial (logpoly) functions. A more complete set of such functions has been demonstrated for similar physical use cases in the context of radio astronomy [12], and applied to similar bump hunting style problems in the same field [13]. The three function families chosen are by no means an exhaustive list, being an arbitrary choice inspired by a priori reasonable candidates, leaving the task of selecting and weighting these candidate models to the data. These three choices of function family are defined in terms of the remaining background model parameters in Eq. (3). The data is transformed such that x_i ∈ [0, 1] and n_i is scaled such that it has unit variance. The factors and translations appearing in the function definitions are required to ensure that the three families admit similarly expressive functions. The x axes of all functions are shifted such that b(θ_{2,m}, x = 0, θ_{1,m} = 1, N = 1) = 1, and the exponents scaled such that b(θ_{2,m}, x = 1, θ_{1,m} = 1, N = 1) has a similar range for all families. This similarity in range is illustrated in Figure 1. The whitening of the data ensures that, given the chosen prior ranges on the functions, only sensible realisations of the background function are considered. When evaluating the likelihood, the functions are transformed back to the observed counts.
Each basis function in Eq. (3) is characterized by two parameters: an amplitude parameter θ_{1,m} and an exponent parameter θ_{2,m}. The exponent parameter is given a uniform prior on [0, 5]. The amplitude parameter is given a sorted (more formally, a forced identifiability [14]) uniform prior on [−10, 10]. Allowing negative amplitudes relaxes the smooth, constantly decreasing criterion in order to match the maximum likelihood implementation; this is trivial to reinstate for a model with stronger inductive bias (as done in Section IV). The discrete hyperparameter N is given a uniform prior over the integers [1, N_max], with φ effectively given a uniform prior over the integers [1, 3] indexing the family choice. The resulting sampling space is of dimension (N_max × 2 + 2), with previous applications of BSR finding it sufficient to mask inactive basis parameters when N < N_max, and to use the choice of φ as the key for a dictionary learning problem. Using a uniform prior on N removes the sparsity promotion central to BSR; this can easily be reinstated, for example by placing an exponential prior on N.
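The continuous priors above can be encoded as a transform from the unit hypercube, as is conventional for nested sampling. The sketch below shows one way to implement the sorted (forced identifiability) uniform prior on the amplitudes alongside the uniform prior on the exponents; how the discrete hyperparameters N and φ are packed into the sampling space is omitted here and would follow the masking strategy described above.

```python
import numpy as np

def sorted_uniform(hypercube, lo=-10.0, hi=10.0):
    """Forced-identifiability (sorted uniform) prior transform [14]: maps unit
    hypercube samples to an ascending set of amplitudes on [lo, hi], removing
    the labelling degeneracy between basis functions."""
    u = np.asarray(hypercube, dtype=float)
    n = len(u)
    t = np.empty(n)
    t[n - 1] = u[n - 1] ** (1.0 / n)      # distribution of the maximum of n uniforms
    for m in range(n - 2, -1, -1):        # recursively place the remaining order statistics
        t[m] = t[m + 1] * u[m] ** (1.0 / (m + 1))
    return lo + (hi - lo) * t

def prior_transform(hypercube, n_max=3):
    """Illustrative continuous-parameter prior: sorted amplitudes on [-10, 10]
    and uniform exponents on [0, 5]."""
    amplitudes = sorted_uniform(hypercube[:n_max])
    exponents = 5.0 * np.asarray(hypercube[n_max:2 * n_max], dtype=float)
    return np.concatenate([amplitudes, exponents])

print(prior_transform(np.random.default_rng(1).uniform(size=6)))
```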

III. DATA DRIVEN HIGGS BACKGROUND MODELS
A toy dataset illustrative of a diphoton background modelling problem can be constructed using event generators. The leading order diphoton MEGammaGamma built-in matrix element generator in Herwig 7 [15, 16] is used to generate particle level LHC events. Rivet [17] is used to analyse the events and produce a target histogram of the diphoton mass, m_γγ, in 80 bins uniformly spaced on m_γγ ∈ [100, 180] GeV. The events are simulated at √s = 13 TeV and projected to a dataset corresponding to an integrated luminosity of L = 10 fb⁻¹. The generator is run until the Monte Carlo error in each bin is negligibly small. In order to simulate an experimental set of observations, the simulated count in each bin is taken as the mean of a Poisson distribution from which a toy "observed" n_i is drawn. The model inference is performed over the parameters and hyperparameters of the candidate background models with ranges and priors as detailed in Section II B, with N_max set to 3. This section compares a maximum likelihood based method with the proposed Bayesian marginalisation method.
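A minimal sketch of this toy-data step: the high-statistics generator prediction in each bin is scaled to the target luminosity and used as the mean of a Poisson draw to produce a single "observed" dataset. The smooth shape used below is an illustrative stand-in for the Herwig/Rivet m_γγ prediction rather than the actual generator output.

```python
import numpy as np

rng = np.random.default_rng(42)
edges = np.linspace(100.0, 180.0, 81)                 # 80 bins on [100, 180] GeV
centres = 0.5 * (edges[:-1] + edges[1:])
expected = 2.0e4 * np.exp(-0.03 * (centres - 100.0))  # illustrative smoothly falling expectation
observed = rng.poisson(expected)                      # one toy realisation of the observed counts
```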

A. Maximum likelihood based methods
The standard toolkit to estimate both the point estimate of the model parameters and the various uncertainties is based on estimating the parameter values that maximise the likelihood given in Eq. (1). Conditioning on a choice of discrete hyperparameters, the values of the continuous basis parameters that maximise the likelihood (subject to the constraints and scaling applied in Section II B) can be found by a variety of optimization methods such as gradient descent. The values of the maximum log-likelihood for each set of discrete parameter choices are listed in Table I. Whilst the likelihood is a useful tool to pick the continuous parameters, there is an obvious flaw in using it to pick the value of N: larger values of N will always be preferred. A potential resolution that is often employed in maximum likelihood approaches to modelling uncertainties is to introduce a correction based on the dimensionality of the model to counterbalance this preference for overfit models. There are a number of ways to motivate a penalty term; previous exploration of the correction term in particle physics applications has considered the magnitude of this correction as a tunable parameter [10]. In this example the Akaike information criterion (AIC), one of the candidate corrections considered in previous work, is used as a benchmark. The AIC is derived from the maximum likelihood value, L, as

AIC = 2k − 2 ln L ,

where k is the number of free parameters in the model (equal to 2N for all models considered). The discrete profiling method uses this idea to correct the profiled values of parameters of interest. In a simpler background only fit, it is illustrative to consider another quantity that can be derived from the AIC of each model, the relative likelihood,

exp( (AIC_min − AIC_{N,φ}) / 2 ) ,

with the subscripts referring to the minimum AIC found across the entire range of {N, φ} and the specific choice in question. This is listed for the considered functions in Table I, giving an effective model composition based on the penalised maximum likelihood.
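The AIC penalty and the resulting relative-likelihood weights can be computed directly from the maximum log-likelihood values of each candidate model, as in the short sketch below; the numerical inputs are placeholders rather than the values of Table I.

```python
import numpy as np

def aic(max_loglike, k):
    """Akaike information criterion, AIC = 2k - 2 ln L."""
    return 2.0 * k - 2.0 * max_loglike

def relative_likelihoods(max_logls, n_params):
    """Relative likelihood exp((AIC_min - AIC_i) / 2) for a set of candidate models."""
    aics = np.array([aic(l, k) for l, k in zip(max_logls, n_params)])
    return np.exp((aics.min() - aics) / 2.0)

# three candidate models with increasing numbers of basis functions (k = 2N each)
weights = relative_likelihoods([-120.0, -115.5, -115.0], [2, 4, 6])
print(weights / weights.sum())  # an effective model composition under the AIC penalty
```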

B. Sampling based methods
When uncertainty quantification on parameter values is required alongside extracting the best fit choices, the background model can be equivalently constructed by sampling from the typical set of likely parameters. This Bayesian interpretation of the background modelling problem is achieved by sampling across a defined range of parameter values, as given in Section II B, and using the density weighted by the likelihood to define the regions of parameter space of interest. The result is a discrete marginalised model, with the sampling performed using the PolyChord implementation of Nested Sampling [18] to numerically marginalise over the multimodal likelihood. The resulting posterior for the discrete hyperparameters is listed in Table I; sampling from this distribution gives a weighted sample of background models which can be propagated through further inference machinery.
Using the same sample we can calculate the local evidence for each mode, noting that the problem could be equivalently formulated as a set of individual evidence calculations conditioned on each pair of discrete hyperparameter values. The local log-evidence of each mode can be calculated as

ln Z = ⟨ln L⟩_P − D_KL ,

where the log-evidence, ln Z, is equal to the posterior weighted mean of the log-likelihood, ⟨ln L⟩_P, minus the Kullback-Leibler divergence, D_KL, between the prior and the posterior [19]. This equation can be viewed in analogy to the maximum likelihood construction of Section III A, with the posterior mean likelihood playing the same role as the maximum likelihood in setting the "height" of each mode and the Kullback-Leibler divergence playing the role of the penalty term. The Kullback-Leibler divergence is defined as the relative entropy between the prior, π(θ), and the posterior, P(θ),

D_KL = ∫ P(θ) ln [ P(θ) / π(θ) ] dθ .

This quantifies the information gain between prior and posterior, and hence can be interpreted as an Occam penalty [19], punishing models based on a relative lack of information gain. The posterior weighted mean log-likelihood is listed in Table I and the corresponding Kullback-Leibler divergence calculated from the sample is shown in Fig. 2. As the sampling was extended over the discrete hyperparameters, the posterior distribution listed in Table I can be estimated directly from the samples, and this is the quantity reported there. Alternatively, it is equivalent to instead sample each choice of {N, φ} as a distinct model and then weight each model i according to Z_i / Z_sum.
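The equivalence noted above can be made concrete: from a single run extended over the discrete hyperparameters, the posterior probability of each (N, φ) choice is the posterior mass assigned to that mode, and the corresponding local evidence follows from ln Z_k = ln Z_total + ln P(k | data) − ln π(k), with π(k) the prior probability of the discrete choice. The sketch below assumes the sampler output provides posterior weights and a discrete label per sample; the numbers are placeholders, not the contents of Table I.

```python
import numpy as np

def local_evidences(weights, labels, logZ_total, prior_prob):
    """Posterior probability P(N, phi | data) and local log-evidence ln Z_k of each
    discrete model choice, recovered from one combined nested-sampling run."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    results = {}
    for lab in sorted(set(labels)):
        mask = np.array([l == lab for l in labels])
        p = w[mask].sum()  # posterior mass in this mode
        results[lab] = (p, logZ_total + np.log(p) - np.log(prior_prob))
    return results

# a uniform prior over the 3 x 3 discrete (N, phi) combinations gives pi(k) = 1/9
weights = [0.2, 0.5, 0.1, 0.8, 0.4]
labels = [(1, "exp"), (1, "exp"), (2, "power"), (2, "power"), (3, "logpoly")]
print(local_evidences(weights, labels, logZ_total=-120.0, prior_prob=1.0 / 9.0))
```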

C. A comparison of methods
The previous sections derived two alternative methods to estimate a model composition for a diphoton background model. Inspection of the two model compositions in Table I shows that the marginalised model spreads its weight over a more diverse set of candidates; the degree of diversity is dictated by the number of entries in Table I carrying a significant probability, where P(N, φ) > 0.1 is used as a rough visual guide. The diversity along the N axis of this table can be compensated for in the maximum likelihood method by tuning the penalty term used; the work of Dauncey et al. [10] notes the difficulty in making general recommendations for such a tuning due to its dependence on the studied application. It is equally possible to "tune" the sampling based method by introducing a sparsity promoting prior, for example an exponential prior on N. There is also increased diversity in the model composition along the φ axis in the marginalised case, particularly visible for the functions with N = 3. There is no clear mechanism to account for this effect in a discrete profiling framework. To summarise, this comparison suggests that inference built from a maximum likelihood based method needs problem specific tuning and is challenged when faced with subtle shape differences.
These two model compositions can be presented visually by examining how the inferred background models behave as a function of the physical observable, as shown in Fig. 3. Ultimately, the composition of the background model is best understood in terms of its influence on the inferred parameters of a hypothesised signal model. It is conceptually simple to introduce an explicit parameterized signal model into Eq. (1) without introducing any particularly challenging computational overhead to this method, and this is explored in Section IV.

IV. BUMP HUNTING WITH MARGINAL LIKELIHOODS
Performing a Bayesian analysis of typical particle physics parameterized signal models is a well examined topic, with nuanced analysis needed to navigate issues such as the impact of the choice of prior on physical parameters of interest [9]. The key quantity needed for a Bayesian analysis of a signal model has already been derived: the evidence for a hypothesis, as computed in Section III B. A generic parametric signal model can be introduced to the existing likelihood shown in Eq. (1) as

L({n_i} | Θ, ψ) = ∏_{i=1}^{M} [b(x_i; Θ) + s(x_i, ψ)]^{n_i} e^{−[b(x_i; Θ) + s(x_i, ψ)]} / n_i! ,

where the signal model, s(x_i, ψ), is a function of the input x data as well as a new set of parameter(s) of interest, ψ. Conditioning on a choice of ψ specifies the hypothesis in question, H_ψ, with the likelihood in Eq. (1) being identified as the null hypothesis, H_0.
Rather than further study the properties of derived intervals on signal models with strong priors, or validate the properties of frequentist versus Bayesian discovery metrics, a generic anomaly detection task is considered. This scenario employs weak priors on the signal parameters and considers a likelihood with only a weak signal injected. This is a deliberately challenging scenario, reflective of recent activity in the field trying to push the envelope of anomaly detection. Injecting a signal at around the noise threshold induces a signal model that is multimodal in its parameter space. This plays to a previously highlighted strength of a Nested Sampling based approach: the multimodal background model built earlier is still employed, and a multimodal signal model on top of it creates a sampling problem that many common numerical sampling tools would struggle with.
For the purposes of this study a generic Gaussian signal model is considered, defining the set ψ = {A, µ, σ} as the amplitude, mean and width of a Gaussian distribution. In this analysis all of these parameters are included in the marginalisation process, seeking to obtain the evidence for the hypothesis Z(H_ψ) = Z_ψ. In particular, a generic resonance signal is sought across the entire spectrum, in what is commonly recognised as a bump hunt in particle physics [20]. In the following sections the same resonance search is applied to a null dataset (i.e., the data used in Section III) and to a dataset with a true signal injected. The priors on the background model parameters are the same as described in Section II B, apart from the amplitude prior, which is now a sorted uniform prior on [0, 5].

A. Uncovering true positives
First a check is performed to see if, upon injecting a "true" signal into the null dataset, a correct inference of the signal can be recovered. To perform both of these searches the evidence for the signal model, Z_ψ, and the evidence for the null model, Z_0, are calculated. To allow a generic search across the spectrum the following uniform priors are placed on the components of ψ, • A - the signal amplitude is given a prior in the range [0, 500]; noting that the signal model has by construction an integral of 1, this corresponds to a signal model with cross section in [0, 0.05] fb. The true injected value is A = 150.
• σ - the signal width is given a prior in the range [0.5, 3]. With a unit bin width this restricts the search to a narrow resonance search, confining the signal to being contained mostly within [1, 6] bins. The true injected value is σ = 1.0.
• µ -the signal location is given a prior range covering [100, 180] GeV. This styles this search as a bump hunt across the full spectrum of data. The true injected value is µ = 125.
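A minimal sketch of this signal-plus-background likelihood is given below, with a unit-integral Gaussian scaled by the amplitude A added to the background expectation. The fixed smooth background and all numerical values are illustrative assumptions; in the full analysis the background is the marginalised model of Section III, and Z_ψ and Z_0 are obtained by nested sampling marginalisation of this likelihood, with the Bayes factor given by their ratio Z_ψ / Z_0.

```python
import numpy as np
from scipy.stats import norm, poisson

def gaussian_signal(x, A, mu, sigma):
    """Generic resonance: a unit-integral Gaussian scaled by the amplitude A."""
    return A * norm.pdf(x, loc=mu, scale=sigma)

def log_likelihood_sb(counts, centres, background, A, mu, sigma):
    """Binned Poisson log-likelihood with the signal added to the background,
    ln L = sum_i ln Pois(n_i | b(x_i) + s(x_i; A, mu, sigma))."""
    rate = background + gaussian_signal(centres, A, mu, sigma)
    return poisson.logpmf(counts, rate).sum()

# illustrative usage on a toy spectrum with a weak signal injected at 125 GeV
centres = np.linspace(100.5, 179.5, 80)                 # 1 GeV bins
background = 2.0e4 * np.exp(-0.03 * (centres - 100.0))  # fixed smooth background (illustrative)
rng = np.random.default_rng(3)
counts = rng.poisson(background + gaussian_signal(centres, 150.0, 125.0, 1.0))
print(log_likelihood_sb(counts, centres, background, A=150.0, mu=125.0, sigma=1.0))
```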
A summary of the results of calculating the marginal likelihood for the signal and null hypotheses is shown in Fig. 4. The four panels in this figure show (in descending order): the physical spectrum and background model (as previously displayed in the top panel of Fig. 3), the posterior predictive distribution for the physical observable under H_ψ, a set of 1000 candidate forms for the signal model itself sampled from the posterior, and lastly the data residuals normalised by the error. The key quantity that summarises the hunt for the signal hypothesis is the ratio of the evidences, Z_ψ / Z_0 = 9.29 ± 1.37. This can be interpreted as "betting odds" [21], implying that the data favour the signal hypothesis over the null with odds of over 9 to 1. This global hypothesis odds is complemented by the calculated per-bin posterior signal model probability. By drawing a set of posterior samples for the signal model parameters, ψ̂, a set of candidate signal models can be constructed, s(x, ψ̂). The binned weighted sum of these posterior predictive distributions for s gives the per-bin signal predictive posterior under H_ψ. In this example there is a clear preference for candidates around the true injected signal; however, another potential mode for the signal model is weakly identified. The bottom panel of Fig. 4 shows the residuals, the ratio of the data to the mean background model normalised to the variance of the background model. It is notable that the signal injected at m_γγ = 125 GeV consists of multiple adjacent bins where the data deviates from the model by over two standard deviations. The Bayesian significance implied by the ratio Z_ψ / Z_0 is noticeably lower than one would naively expect, and the tension between this and the equivalent frequentist global significance has been described as the Bayes effect [9]. This tension is a significant open problem that merits further exploration building on both the NS application and the discrete marginalisation presented in this work.
Examining the model prediction as a function of the physical observables is a useful diagnostic, but the most natural language in which to discuss signal models is the posterior distributions of the model parameters. The corner plot for ψ is shown in Fig. 5. In order to highlight the utility of the marginal background model described in Section III B, this corner plot shows the 95% credibility regions of the posterior for the signal parameters under four background hypotheses. The full discrete marginalised background model is shown as a shaded contour, and comparison is drawn to three background models with fixed hyperparameters, corresponding to each of the three families considered and N = 2. The dotted black lines show the true injected signal parameter values. Of the three fixed background models, φ = exp covers the true values well but offers a worse resolution of the location and amplitude parameters. Both φ = power and logpoly offer better resolution of the signal parameters but are close to failing to cover the true values. Marginalising over these models (and the other choices of N) results in a combined model that is able to retain the desirable features of the component backgrounds. In all cases the width of the signal model is poorly constrained, but this conclusion is a feature of the toy problem, wherein the signal was injected at a level only slightly larger than the Poisson noise.
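The per-bin signal posterior predictive described above is simply a weighted average of the candidate signal shapes over the posterior samples. A minimal sketch, assuming a set of (optionally importance-weighted) posterior samples of (A, µ, σ) is available from the sampler:

```python
import numpy as np
from scipy.stats import norm

def signal_posterior_predictive(centres, posterior_samples, weights=None):
    """Weighted average of the candidate signal shapes s(x, psi_hat) over the
    posterior samples, giving the per-bin signal predictive posterior."""
    w = np.ones(len(posterior_samples)) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()
    prediction = np.zeros(len(centres))
    for (A, mu, sigma), wi in zip(posterior_samples, w):
        prediction += wi * A * norm.pdf(centres, loc=mu, scale=sigma)
    return prediction

# usage with a handful of made-up posterior samples (A, mu, sigma)
centres = np.linspace(100.5, 179.5, 80)
samples = [(140.0, 125.1, 1.1), (160.0, 124.8, 0.9), (30.0, 160.0, 2.0)]
print(signal_posterior_predictive(centres, samples)[:10].round(2))
```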
Conclusions on the various choices of fixed background hyperparameters are largely dependent on the generated pseudo-data used, but the overall method of marginalising out these choices is robust. Further investigation, preferably on real data and incorporating a more exhaustive comparison with results obtained via discrete profiling, would be well motivated.
This example establishes a marginal likelihood calculation as a successful strategy to search for, and infer the parameters of, anomalous bumps in smoothly falling spectra. More in-depth study is needed to explore the threshold of signal magnitude (and the variety of signal shapes) required to establish the false negative error rate for a given dataset. The well calibrated, in-situ, data driven backgrounds composed in this work push the precision with which shape deviations can be extracted to new limits, opening up an attractive prospect for an unsupervised learning algorithm built on the marginal likelihood. This can be further extended by considering signal models with shapes that go beyond a simple Gaussian [22], combining the power of the functionally diverse data driven background models exhibited in this work with more exotic signal shapes.
Pushing beyond the bump hunting paradigm is a topic of renewed interest in the field [23], as many signal models remain undetectable when relying on a simple bump hunt. Unsupervised machine learning methods generally attempt to overcome this by extending inference over a higher dimensional input data space; however, in many cases these approaches rely on some variant of bump hunting in the latent space of a larger ML architecture. Adapting this work to such a scenario is an area with strong potential.

B. Checking for false positives
Alongside exploring the ability to retrieve true positive injected signals, it is important to balance this with a consideration of robustness in the face of false positive signal identification. This sort of false positive can be examined by running the proposed method on a spectrum with no signal injected. The summary of running a marginalisation of H_ψ in the case where there is no signal injected is shown in Fig. 6, mirroring the structure of Fig. 4. It is in examining this scenario that the strength of Bayesian anomaly detection really shines. Whilst there are locally significant deviations, the overall evidence disfavours the signal hypothesis relative to the null, by a probability ratio of Z_ψ / Z_0 = 0.59 ± 0.08. The fully Bayesian method proposed in this work demonstrates the advantages of the Bayesian Occam's razor end-to-end. In Section III B it was noted that the Bayesian Occam's razor optimally navigated background model complexity, punishing overly complex models based on a relative lack of information gain. The principles of Occam's razor are once again on display when the marginalisation is extended over signal models, allowing discrimination of signal models based on how economically they improve the description of the data. It is precisely this weighting of models based on their dimensional economy, in comparison to the null, that renders this a powerful yet well calibrated search method.

V. CONCLUSION
Bayesian inference, with a particular focus on evidence calculations, gives a principled framework to handle comparison between models. By including hyperparameters governing the choice of model directly in a Bayesian sampling framework, model comparison can be performed simultaneously with sampling appropriate values of the background model parameters. The predictions can be treated simply as a single band covering all the necessary uncertainty on this background model and propagated to further inference machinery.
A toy example of inferring a data driven background model for a Higgs measurement was used to demonstrate the machinery in a familiar collider physics use case. The marginalised Bayesian model gives a principled uncertainty in terms of all the nuisance parameters of the model, including the discrete hyperparameters, capturing both modelling and parameter uncertainty simultaneously. Precision measurements and searches stand to benefit from the precise uncertainty and improved weighting of functionally diverse candidate models.
The benefit for search analyses was demonstrated by extending the discrete marginalised background model to include a search for a generic anomaly using a Gaussian signal model. Calculation of evidence ratios in challenging scenarios reveals a well calibrated method for anomaly detection in particle physics spectra. This constructs a prototype analysis pipeline that can handle complex signal shapes alongside diverse data driven backgrounds. This pipeline presents a challenging sampling problem, but one that is handled well by the Nested Sampling tools employed in this analysis.
More broadly, evidence calculations and subsequent evidence weighted combinations of distinct models present a statistically justifiable method to combine discrete model choices into a single unified model. Long standing comparisons in collider phenomenology, such as the comparison of hadronization models or the choice of hard process scale [24], can be viewed as Bayesian model comparisons within the framework presented in this work.
The anesthetic package [25] was used to manipulate the NS chain files and construct the figures throughout this work. The fgivenx package [26] was used to produce the functional posterior plots of the background and signal models.