Null Hypothesis Test for Anomaly Detection

We extend the use of Classification Without Labels for anomaly detection with a hypothesis test designed to exclude the background-only hypothesis. By testing for statistical independence of the two discriminating dataset regions, we are able to exclude the background-only hypothesis without relying on fixed anomaly score cuts or extrapolations of background estimates between regions. The method relies on the assumption of conditional independence of anomaly score features and dataset regions, which can be ensured using existing decorrelation techniques. As a benchmark example, we consider the LHC Olympics dataset where we show that mutual information represents a suitable test for statistical independence and our method exhibits excellent and robust performance at different signal fractions even in presence of realistic feature correlations.


I. INTRODUCTION
The combination of increased experimental sensitivity and no clear leading theoretical guide for how physics beyond the standard model would manifest in current and future particle physics experiments has resulted in increased development of anomaly detection techniques for collider applications, see Ref. [1] for a living review with a continuously updated list of references. These techniques, which make use of state of the art unsupervised and/or weakly supervised algorithms, have the advantage of being sensitive to a large variety of signals at the expense of losing statistical power in comparison to dedicated searches. However, appropriately quantifying said sensitivity is still an open problem [2], with differing proposals, see e.g. Ref. [3]. An especially pressing question is how to evaluate the null hypothesis exclusion sensitivity of an anomaly detection method. The current strategy is to perform cuts using the anomalous score and extrapolate a background model from a control region. This can be problematic for several reasons. First, the use of the anomalous score itself to select events is not guaranteed to yield a robust method that disentangles the underlying processes, see e.g. Ref. [4] for a recent discussion of how ambiguities in the data representation can lead to different notions of anomalous events which vary in their discriminating power. Second, even if the anomaly score is an appropriate event selection tool, the use of cuts, which in an unsupervised search cannot be optimized on a targeted signal model, necessarily introduces a loss in sensitivity by discarding possible signal events. Finally, the use of a control region potentially introduces additional biases when assuming the absence of signal in the control region and/or employing interpolation methods such as the fit to a monotonic mass spectrum in a Bump Hunt.
In this work we aim to address some of the shortcomings outlined above. In particular, we propose a null hypothesis statistical test for anomaly detection which does not rely on fixed anomaly score cuts nor requires background model extrapolations from control regions. We apply it to a specific anomaly detection technique, Classification Without Labels (CWoLa) introduced as a quark/gluon tagger in Ref. [5] and as an anomaly detection technique in Refs. [6,7], and its extension introduced in Ref. [8] incorporating simulation assisted decorrelation of features. We show that by testing for independence between the set of features used in the anomaly score, and those used to define signal and control regions, we can obtain a p-value which avoids false signal-detection and is robust in presence of slight correlations between the two sets of features.
The work is structured as follows. In Section II we review CWoLa and introduce the proposed statistical test. In Section III we apply our method to a LHC Olympics benchmark to demonstrate its power and limitations. We conclude in Section IV where we also discuss possible future extensions and improvements. All the necessary code to reproduce our results is available at GitHub [9].

II. METHOD
Introduced in Ref. [5], CWoLa is a weakly-supervised technique for anomaly detection which aims to learn a monotonic function of the Likelihood Ratio between Signal S and Background B processes for a set of features of interest x, L S/B ( x) = p( x|S)/p( x|B), with the help of an additional feature y uncorrelated with x. The latter variable, often but not necessarily the invariant mass of the event, can be used to define two regions of interest: the signal region M 1 and the control (or side-band) region M 2 , where the signal-to-background ratio is assumed to be higher in M 1 than in M 2 . A weakly-supervised algorithm, CWoLa trains a classifier to distinguish between M 1 and M 2 . The obtained output function s( x) can then be mapped to L M1/M2 ( x) through the likelihood ratio trick. The orthogonality of y and x guarantees that L M1/M2 ( x) is a monotonous function of L S/B ( x) and thus possesses in principle optimal statistical power.
Usual applications of CWoLa use the learned optimal classifier s( x) to select events of interest and assign a certain significance to the difference in selected events in M 1 and M 2 . The difference in the resulting selection efficiencies M1,2 is a smoking-gun for the presence of signal in M 1 (and also M 2 ). However, this is only true in the limit of infinite statistics. In a realistic setting where the dataset is finite, quantifying the degree to which the difference in efficiencies relates to the presence of signal is non-trivial. One common strategy is to assume that there is no signal in M 2 and assess the agreement between the selected events in M 1 and a background extrapolation from M 2 .
Our method constitutes an alternative to assess how the learned output s( x) encodes differences between M 1 and M 2 caused by the presence of a signal. To introduce it, we focus on the density estimation framing of CWoLa, which clearly defines a background-only or null hypothesis. At its heart, CWoLa is a mixture model where x and y are assumed to be conditionally independent given the process label z = {S, B}. After defining M 1 and M 2 using y, the trained classifier output is a function s( x) that inherits the conditional independence with respect to y. The statistical model can be explicitly written as where π is the signal probability. The background-only hypothesis is explicitly written as p(s( x), y|π = 0) and corresponds to the case where the observed data shows independence between s( x) and y. This is the key observation for our strategy. For a given measured dataset of pairs {s( x i ), y i }, one can assess whether they are statistically independent. If statistical independence is ruled out, the background-only hypothesis is ruled out, provided conditional independence holds. Conversely, if statistical independence cannot be ruled out, one has a clear statement about the incapability of CWoLa to discern whether any difference between M 1 and M 2 originates from the presence of a signal or is due to statistical fluctuations in the data. Several tests of statistical independence exist for both discrete and continuous distributions, including mutual information [10], Hoeffding's D independence test [11] and distance correlation [12]. For simplicity, in the present work we focus on the use of the estimated mutual information (MI) I of the measured probability distribution. MI encodes the exact property we want to test as it measures the difference between the joint distribution and the marginals: = ds dy p(s, y) log p(s, y) p(s)p(y) , where D KL (p, q) is the Kullback-Leibler divergence between two probability distributions, capturing how much information is lost when approximating the distribution p with the distribution q. The MI thus captures how well one can approximate the joint distribution by the product of its marginals and it is trivial to show that it vanishes for independent variables. Conditional Independence can then be expressed as a vanishing MI conditioned on a given process I(s, y|z) = ds dy p(s, y|z) log p(s, y|z) p(s|z)p(y|z) = 0 .
On the other hand, for the full dataset the possible mixture between the two processes encoded in π ∈ [0, 1] results in with the equality achieved when there is only one process or the two processes have the same probability distributions. A very nice feature of the MI is that it has well behaved asymptotic properties in the limit of small MI and large sample size [13]. Thus, we can estimate it from the measured sample of N events and obtain the p-value of said estimatorÎ(s, y) under the null hypothesis I(s, y) = 0. Assuming a two dimensional binning of (s, y) with d s and d y the number of chosen bins per variable, the estimatorÎ is a random variable that under the null hypothesis I = 0 follows a Gamma distribution with shape parameter (ds−1)(dy−1) 2 and scale parameter N . To estimateÎ we need to estimatep(s, y), withp(s) andp(y) obtained by marginalizing. We estimatep(s, y) through two-dimensional histogram event counts with the aforementioned d s and d y chosen bins. Because we are dealing with continuous variables, the use of binning introduces additional hyperparameters. In this work we bin s and y in such a way that each bin has a relative statistical uncertainty equal or lower than 1%. Other criteria for statistical independence that deal explicitly with continuous variables such as Hoeffding's D independence test or distance correlation could be used to avoid the introduction of binning at the expense of increased computational cost. We choose MI as it is straight forward to implement with a general signal-blind binning criteria and it suffices to establish the relevance of the strategy detailed in this work.
We emphasize that the role of CWoLa is to provide a one-dimensional observable s( x) which can then be combined with y to test for statistical independence. Once s( x) is obtained, the rest of the test relies only on data without the need to introduce additional cuts or labels. If testing for statistical dependence between x and y directly was feasible, then one would not need to introduce any learnable function. However, this is often not the case. One in general needs a high-dimensional x to ensure discriminative power between possible signals and the background, which is in turn converted by our method into statistical power to exclude statistical independence. On the other hand working directly with a high-dimensional set of features renders any statistical test problematic either due to the test being designed for two variables, as is the case for Hoeffding's D independence test and distance correlation, or due to the necessary density estimation suffering from the course of dimensionality as is the case for the mutual information test presented in this work.
The method relies on the assumption of conditional independence between x and y. In a realistic application this is not ensured, specially when considering highly-discriminative variables between the background and potential signals. The presence of correlation between x and y will result in non-null MI for each process separately. Thus, the p-value obtained from the MI estimation will be merely testing for conditional independence, not the presence of a single process. In other words, the null hypothesis ceases to be equal to the background-only hypothesis. This challenge is already present in current implementations of CWoLa, with correlations resulting in loss of classification power.
One possible strategy introduced in Ref. [8] is to ensure that s( x) is agnostic to the correlation between x and y through the addition of a simulated background dataset during the training stage. In this approach, named Simulation Assisted Classification Without Labels (SA-CWoLa), the loss function is modified with an additional term that incorporates the simulation dataset. Following Ref. [8], we define the loss function as that inverts the labelling in the simulation so as to penalize learning background differences between M 1 and M 2 , with λ the hyper-parameter that controls the relative importance of said penalization 1 . Note that the specific choice of the loss function is not relevant as long as it ensures proper decorrelation. Similarly, the simulated dataset does not need to be perfect, it only needs to encode accurately enough the correlation between x and y for the background process. Learning to ignore the correlations in the background guarantees that excluding the null hypothesis I(s, y) = 0 corresponds to excluding the background-only hypothesis p(s, y) = p(s, y|π = 0) and not merely excluding conditional independence p(s, y|B) = p(s|B)p(y|B). The main drawback of introducing decorrelation is that the learned function s( x) ceases to be optimal and looses classification power. Thus, one should tune λ with a given criteria that balances learning to decorrelate between x and y for B and learning to distinguish between B and S through discriminating between M 1 and M 2 .
In this proof-of-principle, we are satisfied with presenting results for fixed λ that is large enough to ensure decorrelation in the simulation sample and at the same time small enough so that s( x) is sensitive to the presence of signal in the measured sample and improves over the naive significance estimation S/ √ B. To verify that decorrelation is enforced, we follow Ref. [8] and compute the Area-Under-Curve (AUC) for the s( x) classifier for the M 1 and M 2 samples in the simulation dataset. If the AUC is approximately 0.5, we have a classifier that is not better than a 1 We always reweigh the events from both data and simulation during training in such a way that the each of the four subset of events random classifier and thus it has learned to ignore any possible correlations between x and y. We emphasize that the sole purpose of the simulation dataset is to ensure decorrelation, and we never compare data to simulation to obtain a significance after training. This makes the test more robust against background mismodelling than other unsupervised methods for anomaly detection which avoid the use of anomaly cuts at the expense of performing data-to-simulation hypothesis tests such as Refs. [14][15][16][17][18][19]. This is only possible because we assume that the simulations are precise enough to capture qualitative correlations between features in data. The simulator precision needed for decorrelation to be effective is considerably less than what is needed for a full multivariate comparison with measurements which often suffers both from systematic biases in the simulations and from the computational cost of achieving a given statistical precision.

III. APPLICATION: LHC OLYMPICS
In order to demonstrate its power, we apply our method to the LHC Olympics R&D labelled dataset [20]. The dataset is comprised of dijet events from two different sources: SM quantum chromodynamics (QCD) processes (background B), and the production of a hypothetical new resonance W with mass m W = 3.5 TeV, decaying to two intermediate particles X and Y with masses m X = 500 GeV and m Y = 100 GeV, which in turn both decay promptly to pairs of quarks producing two large-radius jets with a two-prong substructure (signal S). Our variable of interest y is the reconstructed dijet invariant mass m jj of the two hardest (in p T ) jets in the event. Mimicking the selection criteria of Ref. [ For anomaly score input features x, we choose a set of variables based on the invariant masses and the first Nsubjettiness ratios [21,22] of the two selected jets. Ordering the jets by mass, with j = 1 being the heavier jet, our variables are The correlation between x and m jj is mostly concentrated in the correlations between {m 1 , m 2 } and m jj . To illustrate how important they are, we define where a, b ∈ {B, S} and c ∈ {1, 2}. ∆ ab c (m bin jj ) represents the relative difference between the average of m c in a given m jj bin m bin jj for process a and the average of the same observable over the whole m jj range for process b. If a = b, this observable conveys the presence of a correlation between m c and m jj for a given process. For a = b, this observable conveys the difference in the m c probability distributions for the two processes and its dependence on m jj . By comparing ∆ BB c with (S/B)∆ SB c we can check whether the correlations between m c and m jj can obscure the differences between signal and background that CWoLa aims to learn. The prefactor S/B is introduced to account for the fact that the data contains less signal than background and originates from comparing the full data distribution to the background-only distribution and separating the signal and the background contributions: We show in Fig. 1 the resulting distributions for m 1 and m 2 , with the largest S/B considered in this work, S/B = 0.01. We observe how the correlation between features for the background process, as evidenced by the monotonic increase of ∆ BB c from negative to positive values towards larger m jj , is sizable and crucially more pronounced compared to the S/B weighted difference between S and B as traced by ∆ SB c . In other words, the correlation between m c and m jj in the background can easily mask the presence of a small signal. For lower S/B, the correlations become even more dominant and can lead to a strongly biased anomaly score. In addition, in our approach the correlations will also induce statistical dependence between s( x) and M 1,2 even in absence of a signal and thus jeopardize the validity of the null-hypothesis test.
In order to address this crucial issue, we follow Ref. [8] and incorporate a set of simulated events into the training stage. As simulation, we consider the background provided in the labelled version of the Black Box 1 (BB1) dataset. We use BB1 as simulation to take advantage of the larger signal sample provided in the R&D dataset. Using the loss function defined in Eq. (6), the classifier is then trained to distinguish M 1 and M 2 in data but not in simulation, obtaining a s( x) that is agnostic to correlations between x and y for QCD. As a classifier, we use a very similar set-up as in Ref. [8]: we train a Neural Network composed of three hidden layers with 64 nodes each and ReLU activation function with a sigmoid function applied to the output to ensure s( x) ∈ [0, 1] for 20 epochs using the ADAM [23] optimizer. The Neural Network is implemented in PyTorch [24]. We have trained our classifier using k-fold crossvalidation with k = 10 to avoid overfitting by ensuring that every s( x n ) is obtained by combining the data point x n with a classifier which has not seen x n during training. We also perform several random weight initializations to ensure better convergence. At the end of training, we evaluate the AUC score between the M 1 and M 2 simulated samples to ensure that decorrelation is achieved.
We show in Fig. 2 the learned s( x) for different S/B with B = 250k and λ = {0.0, 1.0}. In each plot, the binning is chosen in such a way that each bin has a relative statistical uncertainty lower than or equal to 1%. This choice ensures good performance of the density estimation needed for the hypothesis test. We can appreciate how as S/B increases, s( x) goes from being mostly centered around s = 0.5 to yielding higher s values, indicating improved learning of signal features. However, the binning choice obscures somewhat how much the background and signal are separated (within the highest s bin), as the whole signal is grouped together with the necessary background events to obtain a 1% statistical uncertainty. In absence of correlation mitigation (for λ = 0), anomaly score bias causes clearly unbalanced classification of events in M 1 and M 2 even in absence of any signal. The introduction of λ = 1 causes the events to be even more centered around 0.5, specially for low S/B. However, more importantly, λ ≥ 0 forces the training to ignore possible correlations between x and y for (simulated) background and thus approaches a random classifier for S/B → 0.
As expected, the impact of λ is even more pronounced when testing for statistical independence. For each dataset of {s, m jj } values, we estimate mutual information and obtain the p-value associated with the null hypothesis as detailed in Section II. We show in Fig. 3 the resulting estimatedÎ data and their corresponding p-values. We also ran a series of pseudo-experiments to verify that the asymptotic limit is appropriate. We only present results for λ = 1 because for λ = 0 we are able exclude I(s, y) = 0 for all S/B with very high confidence (p-value < 10 −14 ). This, as detailed in Section II, is because we are excluding conditional independence in the background process due to correlations between x and y. This is specially important for S/B = 0, where the effect of correlations can mislead CWoLa to falsely exclude the background-only hypothesis.
We observe that for λ = 1 the proposed test has the required behavior: for S/B = 0 the test yields results consistent with statistical independence, while an increase of S/B leads to an increasingly strong exclusion of the null hypothesis. The use of SA-CWoLa thus ensures that we can identify the null hypothesis with the background-only hypothesis. For S/B > 0, we also compute the discovery significance Z = Φ −1 (1 − p), where Φ is the unit Gaussian cumulative distribution function, and compare it to the naive counting significance Z 0 = S/ √ B. We observe how our method presents an increased discovery significance even compared to the case of perfect (up to statistical fluctuation) knowledge of background yields.
Overall, Fig. 3 shows howÎ can be used to infer whether the data presents a deviation from the null hypothesis, defined as the case where a single process (or an m jj independent mixture of processes) is present for which s Anomaly score s( x) probability density function (PDF) after training for different event labellings. Each row corresponds to a given S/B and each column to a given λ. Non-uniform binning ensures that when considering the full dataset each bin has a relative statistical uncertainty equal or lower than 1%, resulting in between 15 and 25 bins per plot. The data is shown both labelled according to the (observable) values of mjj (defining M1 and M2) as well as according to the (unobservable) truth labels (background and signal).  3. PDFs of estimated mutual information, its numerical distribution under the null hypothesis estimated through resampling, and its asymptotic distribution under the null hypothesis, for different considered benchmark datasets. Each plot corresponds to a different choice of S/B with λ = 1. We combine the estimatedÎ with their asymptotic distribution to obtain the resulting p-values. When S/B > 0, we also compute the discovery significance Z = Φ −1 (1 − p), where Φ is the unit Gaussian cumulative distribution function, and compare it to Z0 = S/ √ B.
and m jj are independent. This is the analogous to the p-value obtained using the Bump Hunt in a "traditional" implementation of CWoLa. However, contrary to existing approaches, here there is no selection cut to be optimized, and no extrapolation of the background into the signal region is required.
In principle the method could learn the likelihood-ratio between signal and background, which is the optimal teststatistic. However, the presence of the decorrelation term in Eq. (6) reduces the optimality of s and consequentlyÎ. This can be seen by the decrease in Z/Z 0 as S/B increases. When S/B is large enough, s( x) will ignore the small correlations in each individual process even for λ = 0.0. A non-null λ will thus only worsen the performance of the algorithm. However, as we are interested in low S/B cases where anomaly detection is useful, a more conservative approach which is robust to correlations even when there is no signal present is warranted.
In the previous paragraphs, we have shown how the proposed method yields an appropriate test statistic which is different from existing approaches. In Table I we provide a significance comparison of the proposed method to two traditional implementation of CWoLa which we denote as "Anomaly cuts" and "Bump Hunt". The former, implemented e.g. in Ref. [25], assumes that no signal is present in the side-band region M 2 and estimates the total number of background events in the signal region M 1 through the use of cuts on the anomaly score. For a fixed efficiency in the side-band region 2 , the estimated background event yield in signal region is 2 N 1 . If the measured efficiency in the signal region 1 is larger than 2 , then there is an excess of events with a significance of Similarly, the Bump Hunt method also assumes no signal is present in M 2 but estimates the background event yield in M 1 differently. In this approach, the background m jj distribution is explicitly modelled and a Profile Likelihood Ratio fit of the number of signal events in M 1 is performed. We follow Ref. [8] and model the background distribution for m jj ∈ [3.1, 3.9] TeV as where √ s is the center-of-mass energy and p i are parameters to be fitted from the m jj distribution with the signal region masked. Once the fit is performed, the expected number of background events in M 1 b is estimated from the integral of the background distribution over the signal region.We define the Likelihood function in the signal region as where P is the Poisson probability mass function of measuring N 1 events, N is the Normal probability density, θ is a nuisance parameter for background mismodelling and σ is the background yield error propagated from the p i fit. From this Likelihood we build the usual test statistic whereŝ,θ are the maximum likelihood estimates of s and θ andθ is the maximum likelihood estimate of θ when keeping s fixed to 0. The resulting significance is Z = √ q 0 . We implement the Bump Hunt by itself and in conjunction with the use of cuts in the anomaly score to enhance the efficiency as would be done in a resonance search. From Table I we observe how in every case the introduction of λ > 0 reduces the significance. However, it does not imply resilience against spurious signals for all strategies. Both traditional methods are highly dependent on the arbitrary 2 choice, showing the appearance of spurious significance at S/B = 0 for certain values. In general, we observe that our method is better suited for smaller S/B than existing methods. This is mainly because it does not discard potential signal events. The Bump Hunt with no cuts also does this, but its significance is lower for non-null S/B (for the extreme S/B = 0.01, its significance is even lower than S/ √ B due to the presence of the nuisance parameter). From this comparison, we assess that without clear criteria for an optimal anomaly cut ( 2 ), the mutual information-based method performs better for low to null S/B whereas the cut and count based methods outperform the mutual information test for larger S/B.
Another benefit of our model compared to traditional searches is scalability with sample size. If decorrelation is ensured, the larger the sample size the more powerful the method. This is certainly not true for the Bump Hunt, where the background modelling is an inherent approximation which necessarily introduces bias for large enough sample sizes. Conversely, for smaller datasets our method takes advantage of the full dataset in a better way than through the use of fixed anomaly cuts. The asymptotic approximation for the mutual information CDF, which is vital for the test, has been shown numerically [13] to be valid for N > 50 events and true mutual information I ≤ 0.14, conditions that we expect to always be satisfied in a realistic anomaly search at the LHC.

IV. DISCUSSIONS AND OUTLOOK
In this work, we have presented a novel strategy to quantify the sensitivity of a specific anomaly detection technique, Simulation-Assisted Classification Without Labels, by testing for statistical independence of the learned {s( x), y} samples. We have shown that as long as one can rely on SA-CWoLa to enforce conditional independence of the background processes p(s, y|B) = p(s|B)p(y|B), the null hypothesis of statistical independence is equivalent to the background-only hypothesis. Thus, testing for statistical independence in the observed data corresponds to testing for the background-only hypothesis. As a proof of principle, we have considered mutual information as a test statistic. MI has a known asymptotic distribution under the null hypothesis for binned data and has low computational cost. We have tested our method with LHC Olympics datasets and have shown that the test statistic yields the expected behavior. Most importantly our proposed test statistic provides a clear statement on the presence of signal, i.e. is capable of correctly yielding a no-signal response. This opens the door to testing for new physics in LHC datasets without the need for anomaly score cuts, as well as reducing the need for accurate background modelling.
Possible extensions of the present work could consider other tests for statistical independence such as Hoeffding's D independence test or distance correlation, which can be applied on unbinned s( x) and y, at the expense of increased computational cost. Similarly, other methods for anomaly score training and decorrelation of features could be explored. Employing the most suitable classification and decorrelation methods can be model and dataset dependent, and has not been the main focus of this work.
Another possibility is to assume that no signal populates the side-bands and identify p(ŝ|M 2 ) = p(ŝ|z = 0). This opens the door for an optimal analysis since one can now perform a template fit in the signal region [26]. However, it also potentially introduces additional uncertainties and/or biases due to background modelling that the present method avoids. We leave a more complete study in this direction for future work.
Regarding other physics applications, we emphasize that by dispensing with the need for explicit functional background modelling, our test is especially useful for anomaly detection applications that do not search for predetermined (modelled) signal shapes [2], such as invariant mass resonances as in e.g. Refs. [25,27]. However, it could also be applied in supervised searches as an additional cross-check to control bias due to modelling at the expense of loss of optimality.