On new physics searches with multidimensional differential shapes

In the context of upcoming new physics searches at the LHC, we investigate the impact of multidimensional differential rates in typical LHC analyses. We discuss the properties of shape information and argue that multidimensional rates bring limited information in the scope of a discovery, but can have a large impact on model discrimination. We also point out subtleties concerning cancellations of systematic uncertainties and the Cauchy-Schwarz bound on interference terms.


I. INTRODUCTION
In modern High Energy Physics, the use of large datasets has become commonplace. In two areas in particular, Particle Physics and Cosmology, the forefront of discovery and characterization of new phenomena relies on the extraction of information from complex datasets produced by experiments like Planck [1] and the LHC [2]. In both fields, a precise theoretical paradigm is used to interpret the data (ΛCDM and the SM, respectively), and the search for new phenomena then depends on identifying subtle deviations within the data, often relying on machine learning techniques. For example, the discovery of rare SM processes, such as mono-top production [3] and Higgs decays to tau leptons [4], has been achieved using this methodology.
On the theoretical side, these multivariate techniques obscure the physical understanding of which variables drive the analysis, making the re-interpretation of results very difficult and in general hindering the public use of the data. Yet more detailed information, in particular differential rates, is required to advance the programme of searching for a new paradigm beyond the standard one. For example, the use of differential information on Higgs production [5] has proven key to pushing the limits of our understanding of the impact of possible new phenomena on the Higgs boson properties.
In this paper we investigate the advantages and limitations of multidimensional shape information in searching for new physics and present two case studies: the new physics search in the context of the SM Effective Field Theory (SMEFT), and the characterization of the quantum numbers of a new resonance. These case studies, together with the material collected in the Appendix, can readily be used as guidelines on how experiments could provide data and how theorists would use it.
* ff83@sussex.ac.uk † sylvain@ift.unesp.br ‡ v.sanz@sussex.ac.uk

II. INFORMATION CONTENT OF MULTIDIMENSIONAL DIFFERENTIAL RATES
In this section we study the information content of differential rates and their use in discovery and model comparison. The statistical formalism and details are provided in App. A.
The information content of a likelihood function with respect to a parameter θ is measured by the observed Fisher information I_θ[L] ≡ −∂²_θ log L [6]. The likelihood functions we focus on arise from event counting and can always be factored as

L(θ) = L_tot(θ) L_shape(θ) ,   (1)

where L_tot(θ) contains the information on the total rate and L_shape(θ) contains the information on the shape of the differential distributions [7]. The information contents from the total event number and from the shape (I_tot ≡ I[L_tot], I_shape ≡ I[L_shape]) are thus independent of each other. I_shape can therefore be arbitrarily large with respect to I_tot, i.e. the amount of information contained in the shape can dominate over that contained in the total rate. It is thus fully justified to systematically take the shape information into account on top of the total rate information.
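This factorisation can be made concrete with a minimal numerical sketch (a toy two-bin counting model with invented numbers, not one of the analyses of this paper): when the total expected rate is independent of θ, the total-rate Fisher information vanishes and all the information sits in the shape.

```python
import numpy as np

def log_like(theta, obs, expected):
    """Binned Poisson log-likelihood; theta-independent constants dropped."""
    n = expected(theta)
    return float(np.sum(obs * np.log(n) - n))

def observed_fisher(theta, obs, expected, eps=1e-4):
    """Observed Fisher information I = -d^2 log L / d theta^2 (finite differences)."""
    f = lambda t: log_like(t, obs, expected)
    return -(f(theta + eps) - 2.0 * f(theta) + f(theta - eps)) / eps**2

N = 1000.0
# Two bins whose contents migrate with theta while the total stays fixed at N:
expected = lambda th: np.array([N / 2 * (1 + th), N / 2 * (1 - th)])
obs = expected(0.2)                      # Asimov-style data at theta' = 0.2

I_full = observed_fisher(0.2, obs, expected)                    # = I_tot + I_shape
I_tot = observed_fisher(0.2, np.array([N]), lambda th: np.array([N]))
# Here the total rate carries no information (I_tot = 0), so I_shape = I_full,
# illustrating that shape information can dominate arbitrarily over rate information.
```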
The information content of L_shape with respect to the dimensionality of the differential rate distribution is slightly more subtle. For concreteness, let us consider the case of one kinematic variable ("1D") versus two kinematic variables ("2D"), with the variables labelled a and b. If the a and b variables are totally correlated, one has I^2D_shape = I^{1D,a}_shape = I^{1D,b}_shape, and there is no gain in going from 1D distributions to 2D information. At the other extreme, if the two variables a, b provide uncorrelated information, the likelihood factorises and the total information is given by I^{1D,a}_shape + I^{1D,b}_shape. This is the maximum information possible, thus one obtains

I^2D_shape ≤ I^{1D,a}_shape + I^{1D,b}_shape ,   (2)

so that the potential gain from 1D to 2D cannot be arbitrarily large.
The gain from 1D to 2D is maximal when the two 1D informations are of the same order of magnitude, I^{1D,a}_shape ∼ I^{1D,b}_shape ≡ I^{1D}_shape, in which case the maximal value of I^{2D}_shape is twice I^{1D}_shape. In the rest of this paper, we will refer to the information gain obtained when moving from a 1D differential distribution to a 2D differential distribution as the "1D/2D" gain.
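The two limits above can be checked on a toy 2×2 binned model (hypothetical probabilities, chosen only for illustration) that interpolates between fully correlated and independent variables a, b while keeping both marginals fixed:

```python
import numpy as np

def shape_fisher(p_of_theta, theta, n_obs, eps=1e-5):
    """Multinomial (shape) Fisher information, I = n_obs * sum_r p_r'^2 / p_r,
    with the derivative p' taken by central finite differences."""
    p = p_of_theta(theta)
    dp = (p_of_theta(theta + eps) - p_of_theta(theta - eps)) / (2 * eps)
    mask = p > 1e-12                       # skip empty bins (0/0)
    return n_obs * float(np.sum(dp[mask] ** 2 / p[mask]))

def p2d(theta, rho):
    """2x2 joint distribution of (a, b): rho = 1 fully correlated,
    rho = 0 independent; both marginals are (1 +/- theta)/2 in either case."""
    pa = np.array([(1 + theta) / 2, (1 - theta) / 2])
    return (rho * np.diag(pa) + (1 - rho) * np.outer(pa, pa)).ravel()

n_obs, theta = 1000, 0.1
p1d = lambda th: np.array([(1 + th) / 2, (1 - th) / 2])
I_1d = shape_fisher(p1d, theta, n_obs)                         # I_a = I_b by symmetry
I_corr = shape_fisher(lambda th: p2d(th, 1.0), theta, n_obs)   # no gain: equals I_1d
I_ind = shape_fisher(lambda th: p2d(th, 0.0), theta, n_obs)    # maximal: equals 2*I_1d
```

The fully correlated case reproduces the single 1D information, while the independent case saturates the bound I^2D ≤ I^{1D,a} + I^{1D,b} = 2 I^{1D}.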
Continuing our general considerations, let us evaluate the impact of the various pieces of information discussed above on the subsequent statistical analyses. In order to proceed, we need to consider statistical tests. We adopt the framework of Bayesian statistics, which allows a unified treatment of discovery and model comparison. For hypothesis testing, the relevant quantity is the Bayes factor (see App. A for definitions).
We assume that the likelihood for each hypothesis can be approximated by a Gaussian with respect to the parameter of interest θ. This limit tends to be reached once at least O(10) events are collected [8]. The likelihood function then takes the form

L(θ) = L_max exp( −I (θ − θ̂)² / 2 ) ,   (3)

where θ̂ is the value of θ preferred by the data, I is the Fisher information for θ, and the constant L_max encodes the information about the goodness-of-fit between the hypothesis and the data. In this Gaussian limit, the Bayes factors take simple expressions in terms of the Fisher information(s). Moreover, the Fisher information depends linearly on the observed total event number n_obs to a good approximation. Hence we obtain I^{1D}_shape = α_1D n_obs, I^{2D}_shape = α_2D n_obs, with α_2D ≤ α_{1D,a} + α_{1D,b}; the α_2D information coefficient can thus at best reach α_{1D,a} + α_{1D,b} ∼ 2α_1D when α_{1D,a} ∼ α_{1D,b}. This direct link of the Fisher information to the event number is crucial to concretely quantify the impact of the various likelihoods.
To characterize discovery, we introduce the discovery Bayes factor, which compares a model hypothesis with a free parameter θ to the same hypothesis restricted to θ = θ₀ (see App. A). In the Gaussian limit, with a flat prior of width Δ_π, the discovery Bayes factor is given by

log B₀ = (α n_obs / 2) (θ̂ − θ₀)² − log( Δ_π √(α n_obs / 2π) ) ,   (4)

with α = α_tot + α_shape. Note that the constant L_max does not appear in this expression. The first term in Eq. (4) encodes the comparison of central values, while the second term encodes prior information; this second term quickly becomes negligible relative to the first as n_obs increases. Comparing the discovery Bayes factor for 1D and 2D distributions, and assuming θ̂_1D ∼ θ̂_2D, we get

log B₀^2D ≲ log B₀^{1D,a} + log B₀^{1D,b} .   (5)

The bound is saturated when the 2D information is maximal, α^{2D}_shape = α^{1D,a}_shape + α^{1D,b}_shape, and for α_tot ≪ α_shape. Finally, the information gain from 1D to 2D is maximal when α^{1D,a}_shape ∼ α^{1D,b}_shape ≡ α^{1D}_shape, in which case log B₀^2D can be at most twice log B₀^1D. This bound on the 2D Bayes factor can easily be translated in terms of sample size and evidence strength. In terms of sample size, the 1D/2D gain amounts to at most doubling n_obs with respect to the 1D case. In terms of strength of evidence (see Jeffreys' scale), moving from 1D to 2D can lead to a shift of at most one step in evidence strength. For instance, if the 1D Bayes factor gives moderate evidence (log B₀ = 2.5), the 2D Bayes factor can at most reach strong evidence (log B₀ = 5).
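The two terms of the Gaussian-limit discovery Bayes factor can be coded directly (a sketch with a flat prior of width Δ_π; the α values, sample size and prior width below are invented for illustration, not taken from our fits). Doubling the information coefficient α doubles the central-value term but shifts the prior term by only ½ log 2:

```python
import numpy as np

def discovery_terms(alpha, n_obs, theta_hat, theta0, prior_width):
    """The two pieces of log B0 in the Gaussian limit, with I = alpha * n_obs
    and a flat prior of width prior_width: a central-value term linear in
    n_obs, and a prior (Occam) term growing only logarithmically."""
    I = alpha * n_obs
    central = 0.5 * I * (theta_hat - theta0) ** 2
    occam = -np.log(prior_width * np.sqrt(I / (2 * np.pi)))
    return central, occam

# Illustrative 1D analysis, and a maximally informative 2D analysis with
# alpha_2D = alpha_1D,a + alpha_1D,b = 2 * alpha_1D:
c1, o1 = discovery_terms(0.01, 1000, theta_hat=1.0, theta0=0.0, prior_width=10.0)
c2, o2 = discovery_terms(0.02, 1000, theta_hat=1.0, theta0=0.0, prior_width=10.0)
logB_1d, logB_2d = c1 + o1, c2 + o2      # 1D moderate vs 2D strong evidence
```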
So far we have discussed how the 1D/2D information gain is bounded in the scope of a discovery. But what about model discrimination? Approximating the likelihoods as Gaussians for both hypotheses H₁, H₂, the Bayes factor comparing H₁ to H₂ reads

log B₁₂ = log( L_max,1 / L_max,2 ) + (1/2) log( I₂ / I₁ ) .   (6)

Note that the structure of this Bayes factor is different from that of the discovery Bayes factor. The first term encodes the relative goodness-of-fit of the models with respect to the data, whereas the second term in Eq. (6) is a ratio of Fisher informations and should be understood as a measure of the relative fine-tuning of the two models, see [9]. This second, "naturalness" term is independent of n_obs. In contrast, the ratio of maximum likelihoods does in general depend on n_obs, as the goodness-of-fit is in general different for the two hypotheses. In fact, in the large sample limit one expects

log( L_max,1 / L_max,2 ) ≈ β n_obs ,   (7)

where β is a constant, positive or negative. The case β > 0 corresponds to the H₁ model being a better fit than H₂, and conversely. The absolute value of log(L_max,1/L_max,2) is thus expected to grow with n_obs, reducing the relative impact of the naturalness term.
We can now compare a Bayes factor based on a 1D distribution, log B₁₂^1D, with one based on a 2D distribution, log B₁₂^2D. Neglecting the naturalness terms (which are however different for 1D and 2D), we are left with comparing the goodness-of-fit terms of the 1D and 2D cases, roughly given by β_1D n_obs and β_2D n_obs. We have found no bound on the β_2D/β_1D ratio based on general information considerations. This suggests that the 1D/2D information gain can be arbitrarily large for model comparison. This is related to the fact that for model comparison the goodness-of-fit matters, while for discovery it does not, i.e. B₀ does not depend on L_max.
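The linear growth of the goodness-of-fit term, log(L_max,1/L_max,2) ≈ β n_obs, is easy to exhibit with Asimov shape data that one model fits exactly while a rigid competitor cannot (toy probabilities, invented for illustration): β is then the Kullback–Leibler divergence between the two best-fit shapes.

```python
import numpy as np

p_true = np.array([0.6, 0.4])   # shape present in the projected (Asimov) data
p_h2 = np.array([0.5, 0.5])     # rigid competing hypothesis, cannot reach p_true

def log_lmax_ratio(n_obs):
    """log(Lmax,1 / Lmax,2) for multinomial shape likelihoods: H1 is flexible
    enough to fit the data exactly, while H2 is stuck at p_h2."""
    counts = n_obs * p_true                      # no statistical fluctuations
    return float(np.sum(counts * np.log(p_true / p_h2)))

beta = log_lmax_ratio(1)        # per-event slope: the KL divergence D(p_true || p_h2)
r100, r200 = log_lmax_ratio(100), log_lmax_ratio(200)   # grows linearly with n_obs
```

Since β is set by how far apart the two best-fit shapes are, and nothing bounds how much closer or farther they can be in 2D than in 1D, the 1D/2D discrimination gain is not bounded by an argument of this type.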

III. CASE STUDIES FOR DISCOVERY AND CHARACTERIZATION
In this section we evaluate the impact of multidimensional differential information using realistic LHC simulations and apply the procedure to various hypotheses of physics beyond the SM (BSM).

A. Simulation and analysis setup
To simulate LHC conditions for the different model hypotheses, we use FeynRules to implement the BSM models, and the UFO [10] output to interface with the MadGraph5_aMC@NLO platform [11]. The parton-level events are then passed through Pythia [12] for parton showering and hadronization. Finally, jets are reconstructed from the hadrons via the anti-k_T algorithm [13] with an R-parameter set to 0.4, using the FastJet [14] interface of MadAnalysis 5 [15]. A jet is tagged as arising from a b-quark when a B-hadron is present within a cone of radius R = 0.4 centred on the jet momentum direction. A private pyROOT script has been developed in order to automate and monitor the whole analysis within the MadAnalysis framework.
As our focus is on the evaluation and comparison of analyses based on future data prospects, we introduce "projected" data in our likelihoods, see App. A for details. No statistical fluctuations are assumed in the projected data, which are thus directly given by the expected rates.

B. Case I: CP-violating and -conserving SMEFT
In the scenario where new particles are too heavy to be produced on-shell at the LHC, their observable effects are better described by a low-energy effective theory, the so-called SMEFT framework. For our case study we assume the presence of two characteristic dimension-six operators,

ℒ = ℒ_SM + c_HW O_HW + c̃_HW Õ_HW ,   (8)

where the operators O_HW and Õ_HW are defined as

O_HW = (i g / m_W²) (D^μ Φ)† σ^a (D^ν Φ) W^a_μν ,
Õ_HW = (i g / m_W²) (D^μ Φ)† σ^a (D^ν Φ) W̃^a_μν .   (9)

Here Φ is the Higgs doublet, W_μν the SU(2)_L field strength, and W̃_μν its dual. The Wilson coefficients c_i are normalized following the SILH basis conventions [16] and their current bounds can be found in Refs. [17,18]. Note that we use the implementation of these operators provided in Ref. [19].
[Fig. 1: differential distributions in p_T(γ), 0–500 GeV.]
As an example of the use of differential information, we consider Higgs production in the Vector Boson Fusion (VBF) mechanism, and generate samples of 450K events. Basic selection cuts require the presence of two jets with transverse momentum p_T^j > 20 GeV and pseudo-rapidity |η_j| < 4.5, as well as typical VBF cuts: the dijet invariant mass is required to be larger than 400 GeV and the jet separation in pseudo-rapidity to be above 2.8. The analysis selects two high-momentum jets j₁, j₂ and two photons γ₁, γ₂ from the Higgs decay, where the indices 1 and 2 denote the leading and sub-leading particles.
In order to determine the differential rates for arbitrary values of the effective operator coefficients, we use the reconstruction method described in [20–22], which has been dubbed "morphing" in experimental references. The optimal version of the reconstruction method, described in [22], is used in our analyses. The reconstruction provides estimates of the various components of the rate in a given bin, σ̂_r(c) = σ̂_r^SM + c σ̂_r^int + c² σ̂_r^BSM, with c = c_HW, c̃_HW. An important subtlety related to the estimation of the interference component in regions with low event rates is described in App. B. The projected data are directly given by the σ̂(c′) event rates, where c′ is the value of the operator coefficient assumed to be present in these projected data. The fact that we use the same rates σ̂(c) for both projected and expected rates leads to an interesting simplification: the main Monte Carlo uncertainties cancel out from the likelihood, leaving the maximum likelihood rigorously unchanged (see App. C). The uncertainties only modify the Fisher information part, and more generally the likelihood line-shape. This simplification implies that in practice, the number of Monte Carlo events needed to perform the simulation is only mildly larger than the nominal number of events. Having for example n_r^MC > 3 n_r^obs gives a systematic uncertainty of ∼33% on the Fisher information, and thus of ∼16.5% on the projected statistical uncertainty.
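A minimal sketch of the per-bin reconstruction is given below. Note this is a simplification: the actual morphing of Refs. [20–22] works with weighted event samples and optimized base points; here we simply solve for the quadratic dependence from rate estimates at three coupling values (the components and base points are toy numbers).

```python
import numpy as np

def morph_components(c_points, rates):
    """Solve sigma(c) = s_SM + c*s_int + c^2*s_BSM for one bin, given rate
    estimates at three values of the coupling c (a 3x3 Vandermonde system)."""
    V = np.vander(np.asarray(c_points, dtype=float), 3, increasing=True)  # [1, c, c^2]
    return np.linalg.solve(V, np.asarray(rates, dtype=float))  # (s_SM, s_int, s_BSM)

def rate(c, comps):
    """Evaluate the reconstructed rate at an arbitrary coupling value."""
    s_sm, s_int, s_bsm = comps
    return s_sm + c * s_int + c ** 2 * s_bsm

# Toy components for a single bin; rates "measured" at base points c = -1, 0, +1:
true = np.array([10.0, -2.0, 0.5])
comps = morph_components([-1.0, 0.0, 1.0], [rate(c, true) for c in (-1.0, 0.0, 1.0)])
```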
Having described all the aspects of our simulations and likelihoods, let us proceed with the data analysis, focusing on 1D and 2D differential distributions. [Figs. 2 and 3: the gray, blue, purple, red lines correspond respectively to the total rate (0D), the 1D differential rate in ∆φ_γγ, the 1D differential rate in p_T^{γ1}, and the 2D differential rate in (∆φ_γγ, p_T^{γ1}); a flat prior for c_HW, c̃_HW is assumed.] We have tested the constraining power of a set of basic kinematic variables including transverse momenta, azimuthal angles and longitudinal rapidity differences between final-state objects. Throughout our study we found that the pair of variables with the best 1D/2D gain for discovery is (p_T^{γ1}, ∆φ_γγ), hence we present the analysis with respect to these two variables, see Fig. 1. Note, though, that the analysis does not include detector effects, which could change the set of optimal variables.
We first compute the posterior distributions for L_tot, L_1D, L_2D, assuming c_HW = c̃_HW = 0 in the projected data. The preferred regions are shown in Fig. 2. Taking into account the 1D distributions is crucial in order to lift the degeneracy in the c_HW − c̃_HW plane. In contrast, the gain from 1D to 2D differential information turns out to be mild.
Assuming that the O_HW operator with c_HW = −0.01 is present in the data, we compute the discovery Bayes factor for O_HW as a function of the sample size, as shown in Fig. 3. A mild gain is observed between B₀^tot and B₀^1D, and between B₀^1D and B₀^2D. That the two 1D Bayes factors have almost the same value appears to be a mere coincidence. The 1D/2D gain for this pair of kinematic variables is the best we found among the kinematic variables considered. A positive result is also obtained when assuming the existence of the operator Õ_HW instead of O_HW.
Still assuming the presence of O_HW in the data, we then use a Bayes factor comparing the c_HW ≠ 0 hypothesis with the c̃_HW ≠ 0 hypothesis, see Eq. (6). The result is shown in Fig. 4. We observe that the 1D/2D gain is much larger in this case than for discovery. For example, the 1D/2D gain in sample size is about 90%, which corresponds to almost doubling the sample size [23]. For comparison, for discovery the gain in sample size is about 20%.
We should stress that certain kinematic variables, such as m_jj, have a better discriminating power than the variables we consider, but are not as good for discovery; hence we present results based on the (p_T^{γ1}, ∆φ_γγ) variables in order to allow a direct comparison with the discovery Bayes factor. Nevertheless, the large 1D/2D gain persists for other combinations of variables: the pair (m_jj, p_T^{γ1}), for instance, has a 1D/2D gain in sample size of about 100%.

C. Case II: Testing the spin of a resonance
We now consider the discovery of a new resonance with either spin zero or spin two, and how our analysis would help characterize the resonance, using a final state of a pair of Z bosons further decaying leptonically, pp → φ + X → 2l⁺2l⁻ + X. The spin-0 and spin-2 resonance behaviour is simulated using existing FeynRules models. Samples of 450K events were generated, following the hadronization procedure previously described. The two pairs of opposite-sign leptons are required to have an invariant mass close to the Z boson mass, 75 GeV < m_ll < 105 GeV, and are sorted by their transverse momentum, with l₁ being the hardest lepton. We choose (p_T^{l1}, ∆φ_ll) as the kinematic variables for the analysis. The widths and production rates of the two resonances are assumed to be the same, so that only differential distributions may be used to distinguish the spin of the resonance.
Following the same approach as for Case I, we assume that the projected data arise from a spin-2 resonance, and compute the Bayes factor B₂₀ comparing the spin-2 hypothesis to the spin-0 hypothesis. [Figure: Bayes factor for spin-2 versus spin-0, assuming spin-2 in the data and no statistical fluctuations. The blue, purple, red lines correspond respectively to the 1D differential rate in p_T^{l1}, the 1D differential rate in ∆φ_ll, and the 2D differential rate in (p_T^{l1}, ∆φ_ll).] The 1D/2D gain in sample size is found to be ∼50%. Similar results are obtained when assuming a spin-0 resonance and computing B₀₂. We thus again observe a substantial gain of information when using the 2D distribution instead of the individual 1D distributions.

IV. CONCLUSIONS
In view of future new physics searches and characterization at the LHC, we have investigated the impact of multidimensional differential rates in typical LHC analyses. Through general observations based on Bayes factors and Fisher information, we find that in the context of a discovery, the gain from using 2D differential distributions instead of 1D ones is fundamentally bounded. In contrast, for model discrimination no such bound is found, so the gain from 1D to 2D can be much higher. To illustrate these features and show realistic values of the 1D/2D information gain, we study two new physics scenarios: operators from the SMEFT, and bosonic resonances.
We carried out discovery and discrimination tests in the VBF channel in the presence of CP-even and CP-odd operators. The best 1D/2D gain for discovery is found for the combination of variables (p_T^{γ1}, ∆φ_γγ). As expected, the 1D/2D gain for CP discrimination is found to be much larger than for discovery. This observation also holds for various other choices of variables. In the presence of a heavy bosonic resonance, we evaluate the discrimination power of spin-0 versus spin-2 using the (p_T^{l1}, ∆φ_ll) variables, and observe a 1D/2D gain of about 50%. Note, though, that the procedure of adding more differential information saturates, as the kinematic information in a given final state is limited. Hence the gain from 2D to higher-dimensional distributions will be restricted by the correlations among the variables involved.
All details needed to reproduce our analysis are provided, and important subtleties generally present in these analyses are pointed out. First, in the reconstruction method of the differential rates in the SMEFT, we point out that in low-event regions of phase space the Cauchy-Schwarz bound on the interference can be violated because of the large systematic uncertainties, leading to unphysical rates. Second, we find that when using projected data, a cancellation between the leading uncertainties of the expected and projected rates naturally occurs, implying that the maximum likelihood remains unchanged; the Monte Carlo sample size then only has to be mildly larger than the nominal sample size to provide meaningful results.
We hope this study serves as a guide for experiments on how to provide differential information to theoretical collaborations, and for theorists on how to use this information for model discrimination.
Appendix A: Statistical framework

In this Appendix we set the notation for the statistical analysis. We denote phase space by D, and consider a binning of D in d dimensions. The bins are set along each dimension i ∈ (1 . . . d) and labelled by r_i, with the coordinates (r₁, . . . , r_d) of a bin denoted r, and the associated piece of phase space D_r.
The observed event number in bin r is denoted n̂_r, and the expected event number for a given value of the underlying parameter θ is denoted n_r(θ). The total number of observed events is n̂ = Σ_r n̂_r and the total number of expected events is n(θ) = Σ_r n_r(θ).
For further convenience one also introduces the d-dimensional density of expected events f_X(x), where X = (X_i) denotes the set of binned variables; f_X(x) is simply the differential event rate normalized by the total event rate. The expected event number in a bin r is then given by

n_r(θ) = n(θ) ∫_{D_r} f_X(x) dx .   (A1)

Likelihood
The likelihood function L is defined as the conditional probability of obtaining the observed data given a hypothesis, taken as a function of this hypothesis. For a hypothesis H with a set of parameters θ,

L(θ) ≡ Pr(data | H, θ) .   (A2)

The likelihood function is defined up to an overall constant factor. The events counted in each of the bins are statistically independent, hence the likelihood factorises as

L(θ) = ∏_r L_r(θ) .   (A3)

The event number in every bin follows Poisson statistics, so that the likelihood function in the bin r is given by

L_r(θ) = n_r(θ)^{n̂_r} e^{−n_r(θ)} / n̂_r! .   (A4)

For a given integrated luminosity L, n_r(θ) is given by the event rate in the bin, n_r(θ) = L σ_r(θ). The likelihood can be formally factored into a Poisson term L_tot containing the information about the total rate and a term L_shape containing the information about the shape of the differential distribution, L(θ) = L_tot(θ) L_shape(θ), with

L_tot(θ) = n(θ)^{n̂} e^{−n(θ)} ,   (A5)
L_shape(θ) = ∏_r ( n_r(θ) / n(θ) )^{n̂_r} .   (A6)
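The factorisation of the binned Poisson likelihood into total-rate and shape parts holds exactly, up to θ-independent constants, as a quick numerical check with toy counts confirms:

```python
import numpy as np

def log_like_bins(n_hat, n_exp):
    """Sum of per-bin Poisson log-likelihoods (the log n_hat_r! terms are
    dropped, as they do not depend on the hypothesis)."""
    return float(np.sum(n_hat * np.log(n_exp) - n_exp))

def log_like_tot(n_hat, n_exp):
    """Total-rate part: Poisson term in the total expected number n(theta)."""
    return float(np.sum(n_hat)) * np.log(np.sum(n_exp)) - float(np.sum(n_exp))

def log_like_shape(n_hat, n_exp):
    """Shape part: multinomial term in the normalized distribution n_r / n."""
    return float(np.sum(n_hat * np.log(n_exp / np.sum(n_exp))))

n_hat = np.array([12.0, 30.0, 8.0])     # observed counts (toy numbers)
n_exp = np.array([10.0, 33.0, 7.0])     # expected counts for some theta
lhs = log_like_bins(n_hat, n_exp)
rhs = log_like_tot(n_hat, n_exp) + log_like_shape(n_hat, n_exp)  # identical
```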

Credible regions and hypothesis testing
We adopt the framework of Bayesian statistics [24]. The model parameters are given an a-priori probability density π(θ), called the "prior", that can encode both subjective and objective information. The "posterior" density is defined as p(θ) ∝ L(θ)π(θ); it provides the preferred regions of θ once data are taken into account. The shape of the prior becomes irrelevant once enough data are accumulated, i.e. when the posterior is data-dominated.
Comparison between two hypotheses H₀ and H₁ is done by means of the Bayes factor

B₀₁ = ∫ dθ₀ L₀(θ₀) π₀(θ₀) / ∫ dθ₁ L₁(θ₁) π₁(θ₁) ,   (A7)

where the π₀,₁ are the priors for hypotheses H₀,₁ respectively. The Bayes factor is interpreted using Jeffreys' scale [26], which associates weak, moderate and strong evidence in favour of H₀ with the threshold values log B₀₁ ∼ 1, 2.5, 5 (i.e. B₀₁ ∼ 3, 12, 150). The Bayes factor framework can be used in the context of new physics searches. In order to assess whether the data favour a hypothesis where a parameter θ is different from a given value θ₀, one has to compare the H₁ hypothesis to H₀ ≡ H₁|_{θ=θ₀}. In the context of effective operators, H₁ can for instance be the SM deformed by higher-dimensional operators (the SMEFT), while H₀ is the SM. This comparison defines

B₀ = ∫ dθ L(θ) π(θ) / L(θ₀) ,   (A8)

that we refer to as the discovery Bayes factor. The test assesses that θ ≠ θ₀ for B₀ > 1, using the thresholds given above.
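The discovery Bayes factor can be implemented directly by quadrature for a one-bin Poisson counting experiment (toy background and signal rates, a flat prior, and a plain trapezoid rule; all numbers are illustrative):

```python
import numpy as np

def discovery_log_bayes(n_hat, n_of_theta, theta0, theta_grid):
    """log B0 = log[ integral of L(theta) pi(theta) dtheta / L(theta0) ] for a
    one-bin Poisson likelihood and a flat prior over theta_grid."""
    logL = lambda th: n_hat * np.log(n_of_theta(th)) - n_of_theta(th)
    vals = np.array([logL(t) for t in theta_grid])
    m = vals.max()                                    # stabilize the exponentials
    w = np.exp(vals - m)
    dt = theta_grid[1] - theta_grid[0]
    prior = 1.0 / (theta_grid[-1] - theta_grid[0])    # flat prior density
    marg = m + np.log(np.sum((w[1:] + w[:-1]) / 2) * dt * prior)  # trapezoid rule
    return marg - logL(theta0)

n_of_theta = lambda th: 100.0 + 50.0 * th    # background 100, signal strength theta
grid = np.linspace(0.0, 2.0, 2001)
logB0_null = discovery_log_bayes(100.0, n_of_theta, 0.0, grid)  # background-only data
logB0_sig = discovery_log_bayes(150.0, n_of_theta, 0.0, grid)   # data with theta' = 1
```

With background-only data the Bayes factor penalizes the extra parameter (log B₀ < 0), while injected signal drives it well past the strong-evidence threshold.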

Projected data
In order to evaluate the sensitivity of a future analysis, measurement, or experiment, one can rely on hypothetical data. That is, instead of introducing actual observed data into the likelihood of Eq. (A2), one can introduce simulated data, coming for instance from a simulation of the experiment. We refer to these as projected data.
An important subtlety, well discussed in [27], is that an assumption has to be made on the statistical fluctuations present in the projected data. Throughout this paper, we simply consider the case where no statistical fluctuations are present in the projected data. A dataset satisfying this condition is sometimes referred to as an "Asimov" dataset [27].
The event numbers in the projected dataset, assuming no statistical fluctuations and the presence of an operator with coefficient c′, are then simply given by L σ̂_r(c′). In practice, these rates have to be estimated by Monte Carlo simulations, just like the expected ones.

Appendix B: Violation of the interference's upper bound
The interference component in the true rate σ_r(c) = σ_r^SM + c σ_r^int + c² σ_r^BSM satisfies a bound based on the Cauchy-Schwarz inequality, |σ^int| ≤ 2√(σ^SM σ^BSM). Concretely, this bound prevents the cross-section from becoming negative for any value of c. In bins featuring few events, however, the uncertainty is large enough that there is a non-negligible probability for the Cauchy-Schwarz bound to be violated by the estimates, i.e. |σ̂^int| ≤ 2√(σ̂^SM σ̂^BSM) can be false. This then results in a negative rate in the bin for some interval of c.
If the simulation has been performed with enough events compared to the nominal event number, for any bin and any c, these low-rate regions have a negligible impact on the subsequent analysis. Violation of the Cauchy-Schwarz bound then constitutes only a practical obstruction, and it is convenient to simply remove the bins in which the bound is not respected [28].
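The corresponding sanity check on the reconstructed components is a one-liner (bin values invented for illustration):

```python
import numpy as np

def physical_bins(s_sm, s_int, s_bsm):
    """Mask of bins whose estimated components respect the Cauchy-Schwarz bound
    |sigma_int| <= 2*sqrt(sigma_SM * sigma_BSM). Bins violating it (typically
    low-statistics bins with large MC uncertainty) are dropped from the analysis."""
    s_sm, s_int, s_bsm = map(np.asarray, (s_sm, s_int, s_bsm))
    return np.abs(s_int) <= 2.0 * np.sqrt(s_sm * s_bsm)

# The second bin violates the bound: its reconstructed rate would turn negative
# for some value of the coupling c, so it gets masked out.
mask = physical_bins([4.0, 0.1], [-3.0, -2.0], [1.0, 0.4])
```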

Appendix C: Cancellation of systematic uncertainties
When confronting expected event numbers with observed event numbers, one has to make sure that the systematic uncertainty on the expected event numbers is negligible with respect to the statistical uncertainty, for every bin and for both the SM and BSM hypotheses (for any relevant value of c, for example). In practice, this means that the number of MC events has to be larger than the number of observed events in all of these situations.
However, when the analyses involve projected data instead of actual data, a useful feature appears: provided one uses the same estimates for the projected rates and for the expected ones, the respective systematic uncertainties present in these rates approximately cancel each other.
Let us detail how this occurs. The uncertainties on the reconstructed rates take the form

σ̂(c) = σ̂^SM (1 + δ_SM) + c σ̂^int (1 + δ_int) + c² σ̂^BSM (1 + δ_BSM) ,   (C1)

where the δ's are the nuisance parameters, which are in general correlated (see [22] for their correlation matrix). Let us first adopt the Gaussian limit for simplicity. The likelihood using projected data takes the form

L(c) ∝ exp( −L ( σ̂(c′) − σ̂(c) )² / 2σ̂(c) ) ,

in which it is clear that the SM uncertainty in the numerator cancels out exactly, while the other ones are suppressed by c − c′ and c² − (c′)². As a result, the maximum of the likelihood remains unchanged and the main effect of the uncertainty is to distort the Gaussian and the Fisher information.
These features turn out to be rigorously valid beyond the Gaussian limit, for Poisson likelihoods with any number of events. To see this, one first combines the three nuisance parameters into a single one. This operation is rigorously defined, and has already led to useful developments in the context of LHC analyses [29–31]. After combination, the event rate takes the form

σ̂(c) = σ̂₀(c) (1 + δ ∆(c)) ,

where δ is the nuisance parameter and ∆(c) controls the relative magnitude of the uncertainty. The marginal Poisson likelihood with projected data is

L̄(c) = ∫ dδ π(δ) e^{−n(c)} n(c)^{n(c′)} / Γ(n(c′) + 1) ,

where n(c) = L σ̂(c) and π(δ) is the prior for δ. Computing the derivative of L̄(c), it turns out that the maximum of L̄(c) still occurs at c = c′, in spite of the deformations induced by the uncertainties, and for any π(δ).
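This statement is easy to verify numerically for a one-bin toy model (all rates, uncertainty sizes, the prior width and the luminosity below are invented): marginalizing the Poisson likelihood over δ, with the same δ entering both the expected and the projected rates, leaves the maximum exactly at c = c′.

```python
import numpy as np
from math import lgamma

lumi = 100.0
sigma0 = lambda c: 1.0 + 0.3 * c + 0.2 * c ** 2    # central reconstructed rate
rel_unc = lambda c: 0.2 + 0.1 * abs(c)             # relative uncertainty Delta(c)

def marginal_like(c, c_prime, deltas, prior_w):
    """Marginal Poisson likelihood, with the projected data sharing the nuisance
    parameter delta with the expectation (quadrature over a delta grid)."""
    n_exp = lumi * sigma0(c) * (1 + deltas * rel_unc(c))
    n_obs = lumi * sigma0(c_prime) * (1 + deltas * rel_unc(c_prime))
    log_pois = -n_exp + n_obs * np.log(n_exp) - np.array([lgamma(x + 1) for x in n_obs])
    return float(np.sum(prior_w * np.exp(log_pois)))

deltas = np.linspace(-0.9, 0.9, 181)
prior_w = np.exp(-deltas ** 2 / (2 * 0.3 ** 2))
prior_w /= prior_w.sum()                           # discretized Gaussian prior pi(delta)
c_grid = np.linspace(-1.0, 1.0, 201)
vals = [marginal_like(c, 0.4, deltas, prior_w) for c in c_grid]
c_best = c_grid[int(np.argmax(vals))]              # maximum sits at c' = 0.4
```

For each value of δ the Poisson term peaks exactly where n(c) = n(c′), i.e. at c = c′, so the δ-average peaks there too, for any prior.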