Direct optimization of the discovery significance in machine learning for new physics searches in particle colliders

We introduce two new loss functions designed to directly optimize the statistical significance of the expected number of signal events when training neural networks and decision trees to classify events as signal or background. The loss functions are designed to directly maximize commonly used estimates of the statistical significance, s/s+b , and the so-called Asimov estimate, Za . We consider their use in a toy search for Supersymmetric particles with 30 fb−1 of 14 TeV data collected at the LHC. In the case that the search for this model is dominated by systematic uncertainties, it is found that the loss function based on Za can outperform the binary cross entropy in defining an optimal search region. The same approach is applied to a boosted decision tree by modifying the objective function used in gradient tree boosting.


Introduction
The optimal way to design a search for new particles is a typical problem in particle physics. With the widespread use of machine learning techniques, it is necessary to reconsider the common approaches to this problem. This is directly related to the statistical question how to claim a discovery which has been discussed in a seminal paper on likelihood-based tests of new physics [1]. Within this approach, it is possible to define the median experimental sensitivity of a search. Following the wording of the authors in [1], we call this the Asimov discovery significance.
The idea to use the Asimov discovery significance as a metric in machine learning (ML) is not new. For example, it has been used for the Higgs Boson Machine Learning Challenge [2], in this context named AMS. An important feature of this kind of significance measure is that it allows the inclusion of systematic uncertainties. The main difference of the approach presented here is the use of the Asimov significance as the loss function in the training of a classifier, in contrast to being used only as a metric. Starting with the concept that the trained classifier will cut out an area in phase space, where a certain amount of signal and background events are observed, we aim to find the best region such that the discovery significance for Poisson distributed events becomes maximal. For the case of Poisson distributed background (b) and signal (s) events with background uncertainty σ b , the approximated median discovery significance becomes: In the case where the background is known exactly (σ b = 0) this simplifies to: and for the case where s, σ 2 b b we are left with s/ b + σ 2 b . The systematic uncertainty σ 2 b is assumed to be proportional to b and is given as a percentage on the background for the results presented here. This is in contrast to the AMS, where a constant term b r = 10 is added. We note, that such a term would prevent us from finding a purely background free region. In addition, we use s/ √ s + b in the following, which can be understood as the exclusion significance in the large sample limit. To calculate the uncertainties on the significance, we propagate the uncertainties in the significance approximations and assume Poisson uncertainties for the number of selected events before applying physical event weights.
We would like to note that our approach can be related to other attempts to use the Neyman-Pearson Lemma [3] as starting point for a statistic aware classifier training [4] and [5].

Toy SUSY model
As a physics example, we consider a toy search for supersymmetric top quarks (stops) at the LHC with a single lepton in the final state. The search is designed for the case of direct stop pair production with subsequent decays of each squark into a top quark and a neutralino, the lightest supersymmetric particle (LSP) in this model. We assume that the top squark and the LSP are the only SUSY particles that have low enough masses to be accessible at 14 TeV and consider two model scenarios depending on the mass difference between stop and LSP. In the case that the stop is much heavier than the LSP, the SM decay products are produced with significant energy, making them easy to observe in the detector. These models are known as uncompressed. If the mass difference between the stop and LSP is small, the SM decay products are produced close to rest, making the signature difficult to distinguish from the background. These models are known as compressed. For our toy study, we assume two sets of mass parameters (table 1) that are expected to be on the border of discovery with about 30 fb −1 of 14 TeV LHC data.
We consider only the dominant background in the one-lepton channel, top-antitop (tt) production. A leading-order Pythia8 [6,7] simulation for signal and background is sufficient for our toy studies with next-to-leading order results for the total cross sections (844 fb for tt, details and references are given in [8,9]). Delphes3 [10] is used to model the detector response using the CMS detector model. Jets are clustered with the anti-k T algorithm [11] with a cone parameter of R = 0.4. For simplicity, we do not consider tau leptons and use lepton as a generic term for electrons and muons.
Signal and background events are pre-selected to reduce training times. We require at least one lepton with transverse momentum p T > 30 GeV within a pseudorapidity of |η| < 2.4. Each event must contain at least four jets with p T > 40 GeV, where the highest-p T (leading) jet is required to have p T > 80 GeV, and the sub-leading jet p T > 60 GeV. At least one of the jets must be tagged as originating from a bottom quark. The missing energy perpendicular to the  beam direction (E T / ) is required to be above 200 GeV, and the scalar sum over the transverse momenta of all preselected jets (H T ) to be above 300 GeV. The selected sample, which is used for training and testing, consists of 1.4×10 6 events with 50% signal and 50% background events. An independent sample of 6 × 10 5 events is used in the final evaluation step.
We include both low level and high level variables when training the classifiers. The low level variables consist of the multiplicities of jets (n jet ) and b-quark jets (n b−jet ) and the Lorentz vector components (E, p T , φ and η) of the reconstructed physics objects. A summary of all used variables are listed in table 1. More details about this stop toy search can be found in [8,9].

Neural network implementation
We consider two commonly used approximations of statistical significance, s/ √ s + b and the Asimov estimate, introduced in section 1. The two estimates are defined for a training batch, where s is the number of correctly classified signal events and b is the number of background events classified as signal. When used as a metric, a certain cut is applied to the classifier probability and the discrete number of events above this cut is counted. When used as a loss function, differentiability must be ensured. The new idea introduced here is to define the values  Table 2. SUSY toy model results for the neural network. All numbers are derived from an independent evaluation sample. The Asimov significance Z A is used as metric in all cases with a cut on the classifier output that maximizes Z A .
where N batch is the number of events in each batch and y true i is the true value, 1 for signal and 0 for background. W s and W b are the physical weights of the signal and background samples. They scale up the number of signal and background events per batch to the numbers expected to be observed in the experiment given a certain luminosity. The above definitions ensure that s and b are continuous w.r.t. the network parameters, as it is needed for backpropagation. It can be seen that in the limit of an absolute certainty in the value of y pred , either 1 or 0, s and b converge to a discrete definition. With this definition of s and b, two loss functions are defined as the squared inverse of the two approximations of significance that are considered here: where Z A is defined in (1). Minimizing one of these loss function corresponds to maximizing the corresponding significance. This approach had been implemented with the Keras python library [12]. To avoid strong statistical fluctuations of the estimated significance, a sufficient number of events in a single batch is necessary. It was found empirically that a batch size of 4096 works well. The results shown here are evaluated from a network with a single hidden layer of 23 neurons, as it had the least sensitivity to overtraining and adequate performance for the simple toy model.

Implementation in gradient boosting
The Asimov-loss approach can also be used to train a Boosted Decision Tree (BDT). XGBoost [13] uses a Taylor expansion on an event-wise objective function during tree building. The default loss function, cross-entropy, can be replaced with the Asimov loss and the "smoothing" trick (3). The Taylor expansion of 1/Z A forms a sum over events in this case which can be implemented to first order, replacing the gradient in XGBoost by: The gradient has been implemented with Autograd [14]. We note that this approach does not allow the implementation of the Hessian, which is therefore assumed to be a constant.

Results and conclusions
The specific behaviour of the new Asimov loss functions can be clearly seen in the classifier output distributions in figure 1: the default cross-entropy training (a) aims to minimize the classification error for signal and background equally, while the Asimov-loss training (c) maintains the purity of the signal classification at the expense of signal events which are misclassified as background. Significance is gained at the cost of reduced efficiency. The Asimov-loss training tries to find an almost background free region (classifier output > 0.5) in both implementations (c) and (d). It is important to note that the simulation statistics must be large enough to prevent the algorithm from just finding random holes in the trainings data due to low statisics. We account for this by large sample sizes, proper error calculation and independent evaluation samples. For s/ √ s + b, (b) in figure 1, the advantage to finding a background free region lower. This loss function finds a different optimum, which is also not identical to the optimum with the highest classification accuracy. Optimizing for discovery usually means finding an area in phase space with high signal-to-background ratio, even at the cost of efficiency, while optimizing for exclusion usually requires a better signal acceptance even with more background. Our algorithm supports both strategies.
Numerical results for the SUSY toy model with the neural network implementation are given in table 2. More results can be found in [8]. The results for the cross-entropy training are also optimized for highest Asimov significance in the conventional way of applying a cut to the classifier output. This approach forces the numerical results to be similar in most cases. Clear differences appear at high systematic uncertainties. For example, at 50% systematic uncertainty in the compressed case the Asimov-loss training performs better. This shows that it can be important to include systematic uncertainties into the training.
The validity of the approximations used in (1) for small s and b had been confirmed numerically in [1]. We intend to invest further work in the low s and b region (s, b < 2 − 3), to evaluate the performance of the classifier in such cases. Also the comparison with a full limit setting will be considered.

Acknowledgments
We would like to thank our DESY colleague Oleksii Turkot for his support during the BDT studies.