Adversarial domain adaptation to reduce sample bias of a high energy physics event classifier

We apply adversarial domain adaptation in unsupervised setting to reduce sample bias in a supervised high energy physics events classifier training. We make use of a neural network containing event and domain classifier with a gradient reversal layer to simultaneously enable signal versus background events classification on the one hand, while on the other hand minimizing the difference in response of the network to background samples originating from different Monte Carlo models via adversarial domain classification loss. We show the successful bias removal on the example of simulated events at the Large Hadron Collider with ttˉH signal versus ttˉbbˉ background classification and discuss implications and limitations of the method.


Introduction
Many measurements and searches for new phenomena performed by the experiments at the Large Hadron Collider (LHC) use a classification algorithm, such as Boosted Decision Trees or Neural Networks, to discriminate the physics process of interest (signal) from other physics processes with similar signature (background). The algorithms are optimized using supervised training on detailed Monte Carlo (MC) simulation data sets, containing samples labeled as signal or background. The resulting classifier is applied to unlabeled data to separate signal and background, and to measure the statistical significance of the signal or its strength, assuming that the simulated and the real data sets are identically distributed.
However, significant differences between domains of real and simulated data sets always exist and the learner may pick up those domain-specific discriminating features that perform well on classification task in one domain while being not suitable for classification in the other, introducing a bias via the source samples used for training when attempting to classify samples from target domain. This problem is similar to that of visual recognition where, for instance, training may be performed on artificially generated images, the source domain, and applied to real photographs, the target domain. In order to avoid training a model that is suitable for classification on the source domain only, while failing when employed on target domain, algorithms of domain adaptation have been developed.
In this paper we apply the method of domain adaptation to a problem of classification on high energy physics data using a Domain Adversarial Neural Network (DANN) [2] to classify events

The Deep Adversarial Neural Network
We follow the architecture presented in [2] with a feed-forward neural network composed of three parts as shown in Fig. 1: a feature extractor which splits into the label predictor, performing the signal-background classification, and the domain classifier. Domain adaptation is enabled via an adversarial interplay between domain classifier and feature extractor. For training and testing we have two data sets (domains): source and target, both containing signal and background events. The target domain is constructed as a representative pseudo-data, meaning that it is treated as unlabeled and it has a signal to background proportion similar to the one expected in a real data sample. For measuring our algorithm performance we make use of the target labels in the final test.
For the label classification we train the network only using events from the source domain. The gate layer stops the target events propagation making the label predictor loss being evaluated only on the source events. This allows training the network on mixed samples of both domains. The classification is adapted to the target domain by connecting the feature extractor with the domain classifier through a gradient reversal layer. This layer does nothing during the forward propagation but inverts the sign of the gradients flowing from domain classifier during the backpropagation. The domain classifier is trained to determine which domain the events belong to. Due to the gradient reversal, the feature extractor is pushed to form such feature representation that do not allow to distinguish between two different domains, thus avoiding the sample bias. As a result of such adversarial training, the features in the last layer of the feature extractor will both allow the classification between signal and background and become domain invariant, rendering classification model domain-independent. The gradients of the reversal layer are scaled by the parameter allowing to regularize their influence and hence tune the importance of the label classification versus the domain invariance. Figure 1: Domain-adversarial network as an alternative to reduce classification bias, adapted from [2].
In order to have balanced classes for each classification the event weights of the source domain are scaled as required according to class ratios. For the label predictor the weights are such that the effective number of signal and background events are the same. For the domain classifier, the weights are scaled to match the signal to background ratio existing in the target domain.

Data sets
The feature selection for the input of the network was inspired by the analysis presented in [1] to separate¯from the¯+¯background. In total 41 geometrical and kinematic quantities are used as input to the network, such as the angular distance between different jets and/or leptons, the mass of various jet and lepton systems and the event topology. The complete list of features, their correlations and the relative importance are given in [9].
We use MC samples provided by the HepSim Group [10]. The ttH signal sample containing 13 · 10 6 events was generated with MadGraph [11] matched to the Herwig6 parton shower [12]. Two background samples were generated, significantly differing in the theoretical predictions. One, used for the source domain, with 2 · 10 6 events of top quark pair production with additional light quarks using MadGraph matched to the Herwig6 merged with 10 7 top quark pair events with additional bottom-quarks using MadGraph matched to Pythia6 [13]. The other background sample, which is used for the target domain, contains 3 · 10 7 events of top quark pair production in association with bottom quark pairs, generated with the PowhegBox+OpenLoops [14] and is matched to Pythia8 for the full event generation including the prediction of additional light quarks.
The ATLAS detector response was simulated using Delphes simulation [15]. For this study, reconstructed leptons, jets and bottom quark iniated jets (called -jets in the following) are used. Jets are reconstructed using the anti-kT algorithm [16] with a radius of = 0.4. The identification efficiency of -jets was taken from [17], assuming the reconstructed -jets to have a 70% tagging probability with a corresponding light jet/c-jet rejection probability parameterization.
Events selected for the neural network training were required to fulfil the following criteria: • one electron or muon with transverse momentum ≥ 20 GeV • at least 5 jets with ≥ 25 GeV • at least 3 -jets.
With this selection applied the source and target data sets where constructed with 546 · 10 3 signal each and same amount of background events, using statistically independent events from the same simulation as signal but different background simulations for source and target. One half from each data set was left for testing purposes, the remaining were used for training. For the target domain only 14368 signal events were randomly selected for training, to match the 5:95 ratio of signal to background estimated in real data.

Network set-up and training
The network was implemented using the Keras v2.2.4 [18] with TensorFlow v1.12 [19] as back-end library. The training set-up is described in Sec. 4.1. A hyper-parameter scan was done to optimize the performance of the network, as described in Sec. 4.2. Some special considerations for the loss function and its optimization are described in Sec

Training set-up
The initial weights of the network were set by the Xavier initializer, as suggested in [20]. The number of training epochs was dynamically selected with the following condition applied: the training were stopped if the running average over 50 epochs in the total loss does not decrease more than 0.05% with respect to the previous 50 epochs. This number was restricted to the interval [200, 1000]. The lower limit was set to skip some random fluctuations at the beginning. The upper limit is just a big value that was never reached with the specified condition. After the training was stopped the weights of the network in the epoch with the lowest label predictor loss were selected. A batch size of 16384 was used. Each batch was composed by source and target events in a 1:1 proportion. The events were randomly shuffled at each epoch, resulting in a different batch selection each time.
The domain classifier and label predictor outputs were set to have two neurons each, using softmax activation function and cross-entropy loss in both (Sec. 4.3 describes an alternative). The RMSProp Keras optimizer was used, with the parameters: learning_rate =0.001 and rho=0.9.

Hyper-parameter optimization
The hyper-parameters of the network were chosen with the help of the Hyperopt library [21], using the Tree of Parzen Estimators (TPE) algorithm implemented on it. The number of layers in each part of the network was let vary from 1 to 8. Each layer could have a number of neurons between 5 and 100, but having a linear behavior in each part of the network (either decreasing or increasing). For the activation function of the hidden layers ReLU, tanh and ELU were tested. Each of this hyper-parameters were sampled from a uniform distribution. Additionally, the parameter was sampled from a log-uniform distribution in the range [1,1000], with this giving more priority to low values as these were found to give better results.
The additive inverse of the label label predictor area under the receiver operating characteristic curve for the target domain was used as the loss to minimize. Three independent optimizations where performed in parallel in order to have a better view of the hyper-parameter space. This also helps to detect if the global minimum of the loss is found. Approximately 1000 iterations where performed in each case.
By analyzing the sets of parameters with good performance and the decisions made by the sampling algorithm, we were able to draw the following conclusions: • The optimal number of layers in the label predictor is one: only the output layer. Two is also good in cases of a very complex feature extractor.
• Higher complexity in the feature extractor provides performance improvement but also makes the network more prone to over-training.
• The number of neurons in the last layer of the feature extractor should be at most the same that in the input. We think this number is also related to the correlations in the input features: a smaller number for high correlations could provide a better optimized feature extraction.
• An increase in the domain classifier complexity does not cause significant improvements, but it needs at least a similar complexity than the feature extractor in order to provide good corrections.
• The performance with ELU and tanh as activation function for the hidden layers was very similar. ReLU was significantly worse.
Finally the feature extractor was chosen to have four layers with 20, 16, 13 and 10 neurons respectively, the label predictor with only the output layer (2 neurons) and the domain classifier with four layers of 20, 35, 50 and 2 neurons respectively. The ELU activation function was used in all the hidden layers.
Note that due to the non-deterministic nature of the training process, results during the optimization were sometimes not representative of the behavior for each set of hyper-parameters tested. Set-ups with higher performance were found, but its results were not reproduced in further tests. Therefore, we chose a configuration with stable results instead of the best one reported by the optimization process. It also had the advantage of being not complex enough to be affected by over-training.

Loss and activation functions for the outputs
The total loss of the network ( ) is given by the sum of the individual losses of the label predictor ( ) and domain classifier ( ): The gradient reversal layer affect the backpropagation in such a way that the gradients of the total loss with respect to the feature extractor weights ( ) are computed as: Two alternatives were used for computing the loss: set-up A with a softmax activation and cross-entropy loss in both outputs, and set-up B with softmax and cross-entropy loss in the label predictor, and linear loss following eqn.4.4 in the domain classifier.
The cross-entropy loss for a single event is given by: where represents the network output for that event. Note that even though we have a two-neuron output we refer to as a single value since the second neuron behaves as 1 minus the first. Class 0 corresponds to background and class 1 to signal for the label predictor. A perfect classification yields a loss of 0, value toward which the loss is optimized. Set-up A also uses this loss for the domain classifier, with in equation 4.3 corresponding to the domain classifier output and classes 0 and 1 corresponding to target and source domains respectively. In this case, perfect separation also results in a loss of 0 but a separation between the domains is not intended. Instead, the network response should be the same for both classes of events which is provided as an additional restriction via the gradient reversal layer. The domain classifier loss is minimized but, under this restriction, the lowest achievable loss is when the response for both classes, i.e. source and target, is = 0.5, resulting in a loss of − ln 0.5 ≈ 0.693. This behavior is visible in Fig. 2b, where the predicted loss of 0.693 is reached in the first epochs and kept most of the training. It should be noted that this poses an extra requirement on the feature extractor, which besides providing domain independent features, is also optimised to provide features for which the domain classifier output are exactly 0.5 for all events.
We found that deviations in the output of the domain classifier from the optimal value of = 0.5 had severe influences on the classification in general. Analyzing at a lower level we found that these changes were driven by huge gradients back-propagated from the domain classifier loss, further amplified by as > 1 was used. To avoid the change in the gradients under deviations we tested a set-up where the derivatives of the domain classifier loss were independent of the values. To achieve this behaviour, we removed the activation function from the domain classifier output and changed the loss to a linear function, computed for a single event ( ) as: This new set-up has also the advantage that is not limited to 0.5 in the optimized case, since now, if the condition of no domain separation is met, this loss has a value of 0 for any value of the domain classifier output so the feature extractor has more freedom during the optimization.

Training of the neural network
The ADAM optimizer [22], being commonly used nowadays, was used as starting point. ADAM is an extension of RMSProp with SGD Momentum i.e. adding momentum terms defined as decaying average of the past gradients. The momentum terms should help to faster escape from highly suboptimal loss regions. However, when we used the default values of the momentum term ( = 0.9) we noticed severe oscillations of the label predictor loss, as shown in Fig. 2a. These oscillations seem to be caused by fluctuations in the domain classifier loss part on which the label predictor has then to react in the common effort to minimize the global loss. We switched to RMSProp, which does not use a momentum term, resulting in a more stable loss course during the training. We therefore did not attempt to further use ADAM.
Beside those small fluctuation described above, infrequent large spikes where found. One of them is shown in Fig. 2b, where minimizes smoothly for over 300 epochs but suddenly raise to huge values together with . Running 3000 independent trainings we found that these spikes appear in around a 0.7% of the cases for set-up A and 1.6% for set-up B.
Performing analyses we found that changing the weights of the network to the ones used 10 epochs before makes the spikes vanish. This indicates that the cause of the spikes involves initial random fluctuation related to the adversarial training with the gradient reversal layer. Backing up this assumption, we also found that the frequency of these spikes increases by increasing the value of .
Furthermore, comparing set-ups A and B, we found in A stronger dependence of the learning curves on the randomization (initial weights, data shuffling, etc.), which is demonstrated in Fig. 3. The learning curves for set-up B agree better indicating a more stable training. They also converge faster. The stopping criterion is reached in set-up A after around 600 epochs but after about 400 epochs in set-up B.

Tuning impact of the adversarial domain classification
The parameter controls the influence of the label predictor and domain classifier responses on the total loss. It determines how much the responses to source and target data input produced by feature extractor should agree. High values forces a strong agreement but may impair the ability of the feature extractor to provide useful features for the classification, low values give more freedom for the feature representation density distribution but might not be enough for obtaining a good agreement between the domains and thus introduce a bias for source domain samples. To give an example, Figure 4 shows the discriminant output for the set-up A for values of between 0 and 20. A large difference between source and target domain feature extractor response density can be observed for = 0, while with increasing values of the influence of the domain classifier on the density alignment and consequently also on label prediction increases and finally a very strong agreement between feature extractor responses to both background samples is reached at the highest value of , while label predictor performance deteriorates. The optimal lambda value is specific to the problem and the performance measure applied as will be discussed in the following.

Results
The performance of the network depends on the relative importance of the adversarial branch containing domain classifier steered by the parameter . As for any hyper-parameter, the values of are specific to the network architecture and data sets used and need to be determined for each particular use case. We consider three measures of performance, demonstrating the bias without the adversarial treatment and their improvement when the adversarial branch is included.
First we report AUC which is a common performance measure for binary classifiers. Since a good value for was still not selected we made a scan over a range of possible values (Fig. 5). We extend it with the Kolmogorov-Smirnov distance as a measure of agreement between the response of the two domains. This distance is given by the maximum absolute difference between the cumulative distributions of the normalized label predictor response for the two domains. The best choice of is the value for which the maximum source domain AUC is achieved among those with the lowest Kolmogorov-Smirnov distance. This criterion for the optimal has the advantage that it can be computed without using the target labels, i.e. using labeled source and unlabeled target data. To demonstrate that the criterion for selection leads to desired behavior on target data, we compute the AUC for the target domain, as in our study target labels were provided by the simulation. As depicted in Fig. 5, the closest match between source and target domain and highest AUC performance is achieved when using lambda values obtained from the criterion procedure.
With = 0, corresponding to absence of adversarial network, an AUC on the target domain of 0.657 is achieved. This value is improved to 0.756 using = 20 for set-up A, and 0.760 using = 10 −5 for set-up B. This improvements have the cost of reducing the AUC obtained for the source domain from 0.776 in the no adversarial case, to 0.757 and 0.760 for set-ups A and B respectively with the selected values. Increasing above those values only decreases the performance, but in the case of set-up B a plateau exist such that taking values up to 100 times the selected one keeps the same performance.
To further approximate the significance as reported in Higgs discovery searches as performance measure, we use the approximate median significance (AMS) as proposed in [23]. This definition corresponds to a test of the signal discovery versus background only hypothesis by taking systematic uncertainties into account. It is calculated as: where the sum is over the bins in the histogram of the response, and represents the signal and background counts in the bin for the source domain and 2 = 1 2 ( − . ) 2 + (0.1 ) 2 is an estimator of the variance on the background counts. The variance is computed from the difference between and the background count for the target domain in the same bin ( . ) plus a flat 10% uncertainty on the background, approximating the values of the reference analysis. The AMS is a valid simplification of the significance in the context of this paper as long as we consider only the qualitative behavior, not the absolute values. The AMS as a function of is shown in Fig. 6. A low significance is observed in the case when the response for both domains disagree. The significance increases with until reaching a maximum at similar positions of the maximal AUC where source and target values agree (Fig. 5). For higher values of the significance decreases, reflecting the loss of classification power.
Using the optimized setting we measure the performance in terms of signal purity, which is related to the sensitivity of the measurement. It is defined as the ratio between number of signal events ( ) and the total of events ( + ) that fall above a specific cut in the label predictor response: = + . Each possible cut corresponds to a signal efficiency ( = ), which is defined as the fraction of signal selected ( ) from the total number of signal events ( ). Figure 7 shows the whole profile of the signal purity as a function of the signal efficiency. The expected signal to To give a numerical example taking the signal purity as an approximation of the analysis sensitivity, we take the results for the source domain as the central value and the difference between the two domains as a 1 uncertainty. For = 0.5 we get: • no adversarial network: = 0.148 ± 0.069 • adversarial set-up A: = 0.137 ± 0.005 • adversarial set-up B: = 0.1369 ± 0.0004 The relative uncertainty due to the choice of the background model on the signal purity, ignoring other sources of uncertainty, can be improved from 47% to 4% (0.3%) by employing the adversarial network in set-up A (set-up B).

Extension of the method towards training with real collision data
In this study, no labels were used for computing loss of the domain classifier (except its alignment for signal and background ratio that was so far taken to be the same as for the source domain). One natural extension of the method would be to use real collision data to train domain classifier for adaptation to real data domain. However, in real collision data the ratio of signal to background is only known with limited precision obtained from previous measurements or theoretical predictions. background ratio was tested. It was found that a change in its value had no impact, as long as it is the same in both source and target (Fig. 8a). However, if there is a discrepancy in the signal to background fraction between the two domains, a small bias is introduced. This behavior is shown in Fig. 8b, where a fixed value of 5% was used for the source signal fraction, while varying signal fraction in the target domain. By varying the signal-to-background ratio by a factor of two away from the ratio in the source domain, a 1.4% bias was introduced on AUC which is however, still small compared to case without adversarial training. It becomes therefore important to get a good estimate for the signal to background ratio in the target domain when using unlabeled data and to properly account for the effect of this bias on the final result.
We hypothesized that the gap we observe between performance for classifying events in the source or in the target domain when signal and background ratio do not match across domains may be caused by the shift of label distribution. Following number of works attempting to address this issue in unsupervised setting [7], we applied one such approach to see whether gap issue can be tackled. The implicit alignment approach [25] points out that, among other issues, the differences in label distribution may provide a harmful shortcut to identify the respective domain just on the basis of differences in label frequencies. This may strongly impair domain adaptation by ignoring actual differences in data distributions and thus not handling properly data distribution shift. To circumvent that, authors propose creating re-balanced mini-batches for training domain classifier using pseudo-labels delivered by the label predictor for target inputs, arguing for removing label frequency differences between domains in this way. We saw however that generated pseudo labels have very low reliability, which in turn seems to strongly impair the composition of re-balanced mini batches and do not result in reduction of the classification gap between domains in our case -on the contrary, the gap falls back to the state observed without any domain adaptation. This is not suitable for use case of real collision data adaptation in unsupervised setting, and makes the method in the current form rely on faithful estimate of signal to background ratio in the real collision data as pointed out above.

Conclusion
We successfully built a feed-forward fully connected adversarial neural network for performing domain adaptation on high energy physics data to enable event classification in the search for thē ( →¯) process at the LHC. We demonstrate that adding a domain classifier sub-network with a gradient reversal layer helps removing training bias while retaining most of the nominal classification power. We analyzed the dependence on the hyper-parameters of the network. We studied the training stability issues that appear due to the addition of a gradient reversal layer. We demonstrated that by using linear activation and loss functions, stability and convergence can be significantly improved and better performance of the network can be achieved.
For the example use case of the ttH(bb) analysis, we demonstrate that the adversarial domain adaptation can produce a label predictor that is almost completely independent of the domain background model while preserving most of the classification power for target domain. We report the improvements using different measures. Taking the expected signal purity for a signal efficiency of 50% as a proxy measure for the sensitivity of the analysis, the uncertainty due to the choice of background model can be strongly reduced from a 47% to a 0.3% with the Monte Carlo samples used in this study. Significant improvements are also reached in the approximated median significance.
Although not demonstrated, we do not expect limitations when extending this approach to adapt to multiple alternative domains, i.e. sources of uncertainty, during training.
Application of our approach to target samples from real collision data was discussed where no explicit label information from target is required for training of the domain classifier. For the selection of optimal value for hyperparameter that controls the impact of adversarial domain classifier on label predictor, we designed a procedure that does not require labeled target data. However, while per input event example labels from the real target are not necessary for training procedure, we show that in absence of a faithful estimate of the signal to background ratio for the real data target domain, misalignment of the signal to background ratio between source and target domains may lead to a small bias in the classification. This small bias and its impact has then to be addressed in a further downstream analysis.
Using a different ratio of signal to background in source and target domains introduces label distribution shift to the original formulation of the problem, in addition to the already existing data distribution shift given by the different background models in the two domains. Handling both data and label distribution shift for domain adaptation is still a largely unresolved problem in machine learning. For the unsupervised domain adaptation setting we have worked with here, our observations with a recently introduced implicit alignment approach [25] that makes use of pseudo-labels suggests that quality of pseudo labels required for such an approach to cope with both data and label distribution shift is not sufficient for our case. Application of our method to real experimental collision data adaptation in unsupervised setting in its current form will have to therefore rely on fair estimates of signal to background ratio in the real data.
As discussed above, we see the differences in label frequency between the domains which provides a shortcut for domain identification [25] and harms domain adaptation, as one central issue hampering successful domain adaptation given unknown, different signal background ratios in our case. One potential solution that we envisage for the follow-up work may employ a network architecture that uses yet another adversarial branch dealing explicitly with the task to erase harmful signal-background label information from domain classifier. Given the current progress, we anticipate that this and other advanced approaches [26] will render our method also capable of handling label shift as well and enable successful adaptation to real collision data in fully unsupervised manner.