Towards recognizing the light facet of the Higgs Boson

The Higgs boson couplings to bottom and top quarks have been measured and agree well with the Standard Model predictions. Decays to lighter quarks and gluons, however, remain elusive. Observing these decays is essential to complete the picture of the Higgs boson interactions. In this work, we present the perspectives for the 14 TeV LHC to observe the Higgs boson decay to light jets assembling convolutional neural networks, trained to recognize abstract jet images constructed embodying particle flow information, and boosted decision trees with kinetic information from Higgs-strahlung $ZH\to \ell^+\ell^- + jj$ events. We show that this approach might be able to observe Higgs to light jet decays with a significance of around $2\sigma$ improving significantly previous prospects based on cut-and-count analysis. An upper bound of $BR(H\to jj)\leq 2\times BR^{SM}(H\to gg)$ at 95\% confidence level after 3000 fb$^{-1}$ of data is obtained using these machine learning techniques.


I. INTRODUCTION
The Standard Model (SM) Higgs boson established the spontaneous electroweak symmetry breaking (SSB) mechanism as the responsible to give the particles their masses at the same time that it preserves the gauge symmetry of the SM interactions [1][2][3][4][5]. The immediate consequence of the SSB mechanism is that the Higgs boson interactions scale with the particles masses m as m v , where v ∼ 246 GeV is the vacuum expectation value of the Higgs field. Thus, interactions with heavier particles are likely to be observed first and that expectation has been fulfilled in the LHC where Higgs bosons H interacting to W , Z, b, τ and t have already been observed [6]. The couplings to muons should be observed in the forthcoming years and so far just an upper bound of ∼ 2.8 times de SM prediction has been obtained [7]. The difficulty to observe a clean muon pairs at the Higgs pole is due the its diminished mass.
The couplings to heavy particles also permit the observation of 1-loop induced interactions of the Higgs boson to photons, gluons and Zγ through couplings to top quarks and weak gauge bosons in the loop. In the case of photons, despite its tiny branching ratio, a narrow peak in the photons invariant masses on the top of a smooth monotonically decreasing background spectrum makes the observation possible in the gluon fusion production mode [6].
On the other hand, the coupling to gluons can just be inferred from the initial state gluon fusion into Higgs bosons. Contrary to photons pairs, the gluons pair decay is completely buried beneath a huge background of jet pairs from leading order QCD interactions turning its observation practically impossible in the gluon fusion channel. This motivates the search for untagged jets in Higgs decays, H → gg and H → qq, q = u, d, s, and also b and c-jets, in cleaner production channels, for example, in the Higgs-strahlung process pp → V H, V = W, Z. Such analyses have been carried out in Refs. [8][9][10] and the prospects for the observation of the Higgs decay to light jets were found to be rather difficult, just an 1σ sensitivity after an integrated luminosity of 3000 fb −1 and an upper bound of BR(H → jj) < 4 × BR SM (H → gg), at 95% CL, according to Ref. [8]. The authors of that work, however, suggest that a multivariate analysis might improve the sensitivity of the LHC searches compared to their standard cut-and-count analysis.
Following that suggestion, we use machine learning (ML) techniques in order to improve the prospects to observe light quark and gluon jet pairs from Higgs boson decays in the Higgs-strahlung process pp → Z(→ + − )H(→ jj), = e, µ at the 14 TeV LHC. Our results can be improved by combining the one and two-lepton categories from ZH and W H channels just like in the previous works of Refs. [8][9][10] but, in the present study, it complicates the application of the ML techniques mainly due the increasing of signal and background categories for classification. In principle, this is not a major difficulty once the algorithms that we are going to use can handle this situation, however, the number of backgrounds increase considerably. Anyways, we will show that using just the two-lepton category is enough to considerably improve the prospects of the LHC to observe Higgs to light jets compared to previous, non-ML based analysis.
The discerning power of Convolutional Neural Networks (CNN) has been demonstrated in several studies where the separation of jet types is essential, especially in the gluon/quark-initiated jet separation and heavy quark tagging [11][12][13][14][15][16][17][18][19][20][21]. Many of these investigations were performed using the classic jet image, that is it, a mapping of the energy deposits of jet constituents in cells of the φ × η plane covered by the hadronic calorimeter (HCAL). In Ref. [22], however, an ingenious manner to embody the particle flow information in the images was proposed -representing leptons, photons and hadrons as geometrical figures whose sizes reflect their transverse energy deposits in the various segments of the detector. Including these extra pieces of information improves significantly the classification accuracy of the CNNs. Such an improvement showed itself essential in the present task of discerning jets from high p T Higgs bosons from the SM backgrounds in + − +jj events compared to the standard jet images approach.
The backgrounds to Z(→ + − )H(→ jj) involve tt and Z+jets events, among other subdominant sources, which far exceed the number of signal events at the LHC. Even after classifying QCD and non-QCD events with great accuracy with the help of the jet images, a large number of backgrounds still pollute the signals precluding an statistical significant observation of light jet decays. Instead of relying just on CNNs and jet image information, kinematic information from particles of the Higgs decays into leptons and jets can be used to further separate signals from backgrounds. The CNN information can be taken together with other available information in an ensemble of ML algorithms. In Ref. [23], stacking algorithms were shown to be very useful to classify Higgs bosons events at the LHC. We employed the same idea in this work with very good results, improving upon the previous results of Ref. [8].
Our paper is organized as follows, in Section II we describe the signal and backgrounds simulations and cuts employed to define a signal region; Section III is dedicated to explain how we construct the jet images that feed the CNN model; Section IV then contains the details of the construction of the CNN model and its training methodology. The performance of the ML algorithms are discussed in Section V, while Section VI presents our results. The conclusions can be found in Section VII.

II. SIGNAL AND BACKGROUNDS IN THE
The most promising production mechanism for the identification of a hadronic decaying Higgs is the associated production with a gauge boson (W /Z) decaying to both charged leptons and neutrinos, while the Higgs boson can be detected using the reconstructed invariant mass of the hadronic products. Including W bosons in the decay chain increases the number of signal events but it also makes the separation from huge backgrounds like tt and W +jets more challenging. We choose to allow just the leptonic Z decays to facilitate the identification of our signals. The Higgs branching ratio to gluons is 360 times larger than the three light quarks u, d, s combined for a 125 GeV Higgs boson. We thus perform simulations of the following signal processes The gg → ZH contributes to up to 20% to the total rate of ZH production. All the events for qq → ZH were generated using MadGraph [24] at the leading order, with the Higgs effective coupling model and NN23NLO PDFs [25]. For the gg → ZH process with quark loops, we used the Madspin module [26] to generate the decays of the Z boson into a pair of leptons and H into our light jets (H → gg) signal as well the H → bb and H → cc backgrounds. We apply overall rescaling QCD K-factors to the signal and background processes to match the total NNLO QCD and NLO EW cross section results taken from the Higgs cross section working group [27].
All the events were showered and hadronized using PYTHIA8 [28]. Hadronized events were then passed to DELPHES3 [29] to simulate detector effects with default settings. In order to build jet substructures for image classification and also to facilitate the vetoing of background jets, we search for the fat-jets reconstructed with radius of R = 1.5 and p min T = 75 GeV. In addition we turn on the computation of N-subjettiness variables using optimized (one-pass) anti-kt with β = 1. The backgrounds considered comprise the following irreducible and reducible ones: (1) Zj(jj) → + − + j(jj), (2) ZZ → + − + jj, (3) W Z → + − + jj, (4) tt → + − + ν ν + jj. We also included the decays of (5) H → bb and (6) H → cc in the Higgsstrahlung qq, gg → ZH processes to take into account the mistagging event contamination. There are six background classes in total.
In order to stay safely away from infrared and colinear divergences, we apply the basic cuts of Eq. (2) at the generation level  Beside the basic cuts we also imposed the following cuts to further eliminate backgrounds at least two same-flavour opposite-charge leptons with: at least one central jet with: where M j is the invariant mass of the leading jet, m H = 125 GeV is the Higgs mass, M is the mass of the two leading charged leptons, p 1 T and p 2 T are the transverse momentum of the leading and the subleading leptons, respectively, and E T is the missing transverse energy of the event.
In Table (I) we show the cut flow of the cross sections for each signal and background channel for these selections and the basic selection of Eq. (2). The Higgs mass window of Eq. (5) is very efficient in eliminating backgrounds without a Higgs boson while Eq. (7) is effective to veto tt events that contain neutrinos. Requiring a very hard leading jet, Eq. (4), helps to decrease the backgrounds of weak bosons with extra jets.
We see that these cuts reduce the Z+jets and tt backgrounds of more than ∼ 2 orders of magnitude, however, they are still 4 orders of magnitude larger than the signals. At this level of cuts only around 50 signal events are surviving after 3000 fb −1 . Hardening cuts will not make a better job, we need a better plan to select the signal events. This strategy must focus mainly in the most dangerous backgrounds of Z+jets and tt. We discuss next the selection tool that we are going to use to raise the prospects to identify light decays of the Higgs boson.

III. CONSTRUCTION OF ABSTRACT IMAGES -EMBODYING INFORMATION TO BOOST THE DISCRIMINATION POWER
The use of computer vision and jet-images [11,13] have proved their places among the most powerful and efficient techniques for classifying data according to different hypothesis [12,[14][15][16][17][18][19][20]30]. Although the use of jet-images is a very well established framework, in some cases the information contained in jet-images is not enough to deep convolutional neural networks to make reliable predictions. Cases where either the differences between signal and background are very subtle or the signal in question occur in a much smaller rate than the background events (i.e. rare events) might be difficult to disentangle. Different approaches to increase the amount of information contained in a jet-image have been proposed [16,21,31]. These methods employ a hybrid use of high-level features (i.e. kinematic observables) together with the information recovered from each detector section (electromagnetic calorimeter (ECAL), hadronic calorimeter (HCAL), Muon chambers, tracking system) encoded as image channels, greatly improving the amount of information, which subsequently increase the discriminant power of the algorithms. However, this improvement comes at the price of drastically increase of the model complexity which leads to overfitting and/or a slow training phase for the NN. A more simple approach was proposed in Ref. [22] to tackle this problem using abstract shapes as a method to encode the information from the particles detected in each event.
After the event generation, reconstruction and selection, we can construct the so called abstract images by following the guidelines of the Ref. [22]. Delphes uses a particle-flow algorithm [29] which produces two collections of 4-vectors -particle- flow tracks and particle-flow towers -that serve later as input for reconstructing high resolution jets and missing transverse energy. In order to construct the abstract images, we use the information stored in the particle-flow tracks. These tracks outputs from the Delphes particle-flow algorithm are stored in the EFlowTrack array which are used to construct the objects displayed as red circles in Fig. (1). EFlowPhoton and EFlowNeutralHadrons store the photon and neutral hadrons output, respectively, which, by their turn, are used to construct the objects displayed as as green squares and blue hexagons, respectively, in Fig. (1). All the shapes are centered at the η × φ coordinates of the object (charged particles, photons or neutral hadrons) and their radius are proportional to the logarithm of their transverse momentum. In Figure ( • tt → + − + ν ν + jj: 204 images with 224 x 224, 8-bit/color RGBA. After applying the selection criteria of Table (I), we get a very imbalanced data set which is reflected in the relative number of images of each class displayed right above. The low number of samples for some of the background channels are due to the selection criteria that we imposed, which drastically reduces the number of selected events and consequently the number of images. It is important to point out that without demanding hard cuts it would be virtually impossible to identify the signals even with a very powerful ML classifier. From Table (I), we see that the signal to background ratio is of order 10 −6 before cuts. In order to make sure that the CNN model will not overfit towards the class with more instances, we can either use class weights to overcome this issue or generate more images "artificially" for the classes where we have fewer samples. Although the first option is the less computationally expensive and straightforward approach, the former one is a well tested approach commonly used in image recognition tasks.
It is important to mention that we generate the artificial images only during the training phase and only for the classes with fewer images samples (W Z, ZZ, Zj(jj) and tt). For the evaluation/test phase we used the original images strictly. This method also helps the CNN model to avoid overfitting towards the classes where we have more image samples. The artificial images are generated according to the following data augmentation scheme: • randomly select an image sample from the class we want to augment, • resize and crop the image, • rotate the image in a range of -30 to 30 degrees, • horizontally flip, • apply random noise.
Each transformation has an uniform probability to be applied in a selected image, i.e. for a random selected original image, we generate a new artificial image by randomly selecting one of the 5 transformations listed above. In order to generate more artificial images without take the risk of over-saturate the data set with just copied images, we allow the selected images to have two or more transformations applied in the same image, with the caveat that each new transformation has a 50% chance to be applied. For example, for a given selected image, we have 100% chance of one of the 5 transformations be applied, a second transformation has 50% chance of applied in the same image, a third transformation has 25% chance, a fourth transformation will have 12.5% chance of be applied, and so on. We also ensure that the transformations are not applied two times consecutively in the same image, so that if an image has first an horizontal flip, the next transformation will not be flipped horizontally again.

IV. CNN ARCHITECTURE AND TRAINING METHODOLOGY
In this section, we describe in detail our training methodology. We tested several procedures and strategies to achieve robust and reliable results paying special attention in estimating the uncertainty of our results which will be provided in the next section. Some of these methods reflect the state-of-art in machine learning and our work serve as a good test in a particle physics application as well.
We want to classify whether a given abstract image belongs to one of the 7 classes: the signal class ZH(jj), and the background classes ZZ, W Z, Z + j(jj), ZH(bb), ZH(cc), tt. For this task we choose the Residual Network (ResNet) as the base architecture. ResNets were first proposed in Ref. [32] and consist of a deep neural network (DNN) built as blocks of convolutional layers together with short cut connections (or skip layers) that help the ResNet to avoid problems associated with DNN, in particular, the well known gradient vanishing/exploding problem [33]. In our analysis, we tested the discriminant power of the ResNet with an increasing number of layers: ResNet-18, ResNet-34, ResNet-50, and ResNet-101, using the abstract images data set described in the Section (III). The ResNet-50 presented the better results taking in consideration the training time, classification accuracy and signal significance.
The ResNet-50 consists of 50 convolutional (Conv2D) layers, in between each Conv2D layer we have a series of batch normalizations, average pooling and rectified activations (ReLU). For our task, we replace the last fully connected layers of the ResNet-50, responsible for the classification, with the following sequence of layers: • an adaptive concatenate pooling layer (AdaptiveConcatPool2d), • a flatten layer, • a block with batch normalization, dropout, linear, and ReLU layers, • a dense linear layer with 7 units as outputs, each unit corresponding to a class and a softmax activation function.
The AdaptiveConcatPool2d layer uses adaptive average pooling and adaptive max pooling and concatenates them both. Such procedure provides the model with the information of both methods and improves the overall performance of the CNN. We also make use of the label smoothing methodology [34] as well as the MixUp [35] training method, more details about these two techniques are presented in the Appendices (B) and (C).
One important aspect of the training of DNN models, and yet often not given the due attention, is the choice of the batch size. The use of large batch sizes helps the optimization algorithms to avoid overfitting [36][37][38] acting as a regularizer. However, the batch size is ultimately bounded by the amount of memory available in hardware. One way to work around this limitation is the use of mixed precision training [39], this method uses half-precision floating point numbers, without losing model accuracy or having to modify hyper-parameters. This nearly halves memory requirements and, on recent GPUs, speeds up arithmetic calculations.
The learning rate and weight decay are other two key hyperparameters to train DNNs. A good choice of these two parameters can greatly improve the model performance, in our particular case it means a high accuracy classification and good signal significance, and reduce drastically the training time. Instead of using a fixed value for the learning rate we opted to use the so called Cyclical Learning Rates (CLR) [37]. To use CLR, one must specify minimum and maximum learning rate boundaries and a step size. The step size is the number of iterations used for each step and a cycle consists of two such steps -one in which the learning rate increases and the other in which it decreases [37].
Following the guidelines from Ref. [40], we perform a scan over a selected range of values for learning rates and weight decays. According to Ref. [40], the best initial values for learning rates are the ones who give the steeper gradient towards the minimum loss value, which in our case, was found to be ∼ 3.0 × 10 −5 for the learning rate and 1.0 × 10 −5 for the weight decay, and for the maximum learning rate value we just multiply the initial value by 10. The particular architecture of the dense layers  of the ResNet-50 and the hyperparameters were found using a genetic algorithm. This genetic algorithm is capable to find the best combination of hyperparameters and neural networks modules (a.k.a. layers) which maximize the significance of our signal. The data set were split in 80% for training and 20% for testing. We trained our model in a 3-stage scheme: in the first stage, we trained the ResNet-50 all the way from the beginning to the end within 50 epochs without the use of transfer learning. At the end of the 50 epochs, we save the weights and bias of the trained model. Next, in the second stage, we loaded these weights from the previous trained model and "freeze" all layers up to the last 3 ones, so that during this training phase the backpropagation only takes effect on the parameters of the last 3 layers (this methodology is what we call transfer learning), we then trained these last 3 layers for 25 epochs. In the last stage, we loaded the weights and bias saved from the stage two and "freeze" all layers up to the last layer (the classification layer or header) and trained for 15 epochs. This 3-stage training scheme help us to find the most stable results while gradually increasing the performance of our model.
The CNN architecture, training methodology and all the state of art techniques employed, such as mixed precison training and MixUP, aswell the fining tuning of the hyperparameters were all done using Pytorch [41] and the Fast.AI [42] framework. The Fast.AI framework enabled us to easily implement all the techniques described above and made possible to modify all the aspects from the ResNet-50 with very few lines of codes.

A. Classification with ResNet-50
The performance of our training framework of the CNN with abstract jet images can be evaluated on the basis of signal efficiency and background rejection factors as shown at the left panel of Fig. (2) for each one of the seven event classes. The more rectangular is the Receiving-Operator-Characteristic (ROC) curve, the more efficient is the background rejection for a fixed signal acceptance. A simple figure to evaluate how good is the signal-background separation is the area under the ROC curve, AUC. The closer AUC is to one, the better we should expect the backgrounds can be cleaned up for a giving signal efficiency. In order to construct the ROC curves, we impose cuts on the CNN scores of the classes displayed at the right panel of Fig. (2) and compute the number of events left for the signal versus a chosen background for which we want to obtain the ROC curve. The ideal separation is having the CNN score distribution of backgrounds concentrated at the left and the signal at the right in such a way that a cut on the scores eliminates more backgrounds than signal events. The range of the CNN scores seen at the right panel of Fig. (2), roughly from -6 to +6, is a feature of the particular loss function used to train the ResNet, see Appendix B for more details.
From the ROC curve shown in Fig. (2), we can choose the point with the highest signal significance which depends on the integrated luminosity and also on the effect of systematic uncertainties which are often disregarded in machine learning studies. The Asimov estimate of significance [43], a well-established approach to evaluate likelihood-based tests of new physics taking into account the systematic uncertainty on the background normalization, can then be used for a more careful estimate of the signal significance at the training and testing phases of construction of the classifier. The formula of the Asimov signal significance is given by where, for a given integrated luminosity, s is the number of signal events, b is the number of background events, and the uncertainty associated with the number of background events is given by σ b . In Fig. (3), left panel, we estimate the Asimov significance for 3000 fb −1 with a fixed 5% systematic uncertainty, σ b /b. The shaded blue band corresponds to a ±1σ error band computed by propagating the Poisson uncertainty of the background and signal counts [44]. As one can see, the ResNet-50, although gives an AUC of 0.91 for our signal, does not show promising prospects in terms of statistical significance due mainly to the large remaining number of events from the Z+jets and tt backgrounds. On the other hand, ZZ, ZW , ZH(bb) and ZH(cc) backgrounds are efficiently reduced by using abstract jet images.
The number of signal and background events can be found in the table at the right panel of Fig. (3). Selecting events with scores larger than 4.2 gives the best signal significance at the expense of just two surviving signal events against more than 120 background events corresponding to a very small Asimov significance of 0.18. Looking at the right panel of Fig. (2) we can easily understand the cut on the CNN score, it is a very hard cut to eliminate the backgrounds, relaxing it allows just too much backgrounds to increase the significance. The CNN actually does an excellent job in reducing the number of backgrounds but it is just not enough to give an useful signal significance to constrain the light jet decays.

B. Classification with ResNet-50 and BDTs
In spite of the fact that the CNN was not able to deliver a good signal significance, the classification scores of the signal and backgrounds can be used as a new feature to compose the data representation for some other ML classifier. In order to further separate the signal samples from the backgrounds, we chose to stack the scores obtained from the CNN along with kinematic variables to construct another data representation to be classified by boosted decision trees (BDT) algorithms. This type of ensemble has already been used to improve the classification power of phenomenological analyses [23,45].
The representation of the data used to train the decision trees algorithm comprises the following variables: • two leptons invariant mass, M , two jets invariant mass, M jj , the invariant mass of the reconstructed leading jet, M j , the invariant mass of the leading jet and the two leading leptons, M j . The two leading leptons reconstruct the Z boson in the classes where it is present, except for the tt background, • transverse momenta: p T , p j1 T , p j2 T , p b1 T , p b2 T , p 1 T , p 2 T , • angular distributions: the separation between the leading leptons pair and the leading jet, ∆R ,j1 = (∆φ ,j1 ) 2 + (∆η ,j1 ) 2 , the separation in the azimuthal angle of the leading leptons, ∆φ , , the cosine of the angle between the transverse momentum of the leading leptons pair and the leading lepton, cos(∆φ , 1 ), and between the leading leptons pair and the leading jet, cos(∆φ ,j1 ),  • the score provided by the CNN classification for each one of the 7 classes.
After constructing our new data set with the kinematic variables and CNN scores of each channel we can now turn our attention into the selection process of our ensemble classifier. A plethora of multivariate classification models are available right out of box in the scikit-learn [46] packages. Boosted decision tress are well-established classifiers and can provide very accurate predictions for classification tasks where the data set contains multiple classes.
BDTs, as any other ML algorithm, should be tuned in order to achieve a good classification performance. A first approach is to use "brute force" to tune the hyperparameters by using a grid search but the number of combinations and the computational time to test each one of them increases exponentially. More efficient ways beyond grid search are random sampling or using gaussian process algorithms to learn the best hyperparameters. Another way to tackle this problem is to use genetic/evolutionary algorithms, as in Ref. [47]. To do so, we make use of the Python Evolutionary Algorithm toolkit DEAP [48], in conjunction with the the scikit-learn library. Such implementation cannot only provide the models with the highest accuracy, but also modify the "fitness" function of the evolutionary algorithm to search the best hyperparameters combination which returns the highest Asimov significance values. Just like the CNN, we randomly split the data set in 80% for training and 20% for testing.
In our analysis, we found that a multi-class AdaBoost classifier [49] with 700 base estimators, a maximum tree depth of 5 and a learning rate of 1.0, keeping other hyperparameters as default options, presented the higher Asimov significance. Just like in the case of the CNN, we found that adjusting the hyperparameters to get a high accuracy does not guarantee the higher signal significance.
In Fig. (4), we show, at the left panel, the ROC curves after the the BDT classification. Remember that the CNN classification was very efficient to eliminate all the backgrounds, except for Z+jets and tt backgrounds, but at the expense of cutting out too many signal events. Moreover, Z+jets and the contaminants ZH(bb) and tt amounted to around 120 events resulting in a very small significance. The BDT found another solution instead, trading ZZ and W Z for Z+jets and the tt at the same time it keeps a much larger number of signal events. Comparing the ROC curves of Figs. (2) and (4), we clearly see that the AUC of the Z+jets increases very much. At the right panel of Fig. (4), we show the BDT scores of the event classes. The cut on the BDT score is also very hard but and the signal region is defined for a BDT score larger than 0.3. The star markers show the point of cut and the corresponding signal efficiency and background rejection.
In Fig. (5), we show, at the left panel, the Asimov significance with its corresponding ±1σ error band. The increase of the mean Asimov significance, 1.76, is quite pronounced compared to the CNN classification. This could raise doubts whether the CNN classification is necessary after all. To confirm that the CNN scores are playing any role in helping the BDT to separate the signals, we performed a feature importance analysis of the BDT features and found that indeed the CNN scores are the most important features for the BDT classification followed by M , ∆φ and the missing transverse energy of the event. At the right panel of Fig. (5), we show the number of events for each class after the cut on the BDT scores. The number of signal events jumps to around 15 events and the backgrounds by half compared to the CNN case. The Z+jets, tt and the ZH contaminants are completely washed away, leaving just ZZ and W Z events in the backgrounds. We want to point out an important bonus in using ML algorithms to separate the signals from backgrounds in this processthe ZH(bb) and ZH(cc) events can be completely eliminated with hard cuts on the score outputs of the classifiers. This makes the statistical analysis simpler and less prone to tagging efficiencies as we are going to show in the next section.

VI. SIGNAL SIGNIFICANCE AND CONSTRAINTS ON THE LIGHT JET HIGGS BRANCHING RATIO
With the results obtained in our ML analysis, we are able now to constrain the branching ratio of the Higgs boson into light jets. To do so, we closely follow the statistical analysis presented in Ref. [8]. It is important to highlight the differences between our analysis and the one performed in Ref. [8]. First of all, and more importantly, we consider the two-lepton category only, while in that work, the one+two-lepton categories are considered once they also take W H into account in the analysis. Second, we include the signal contaminants ZH(bb) and ZH(cc) in the background category from the beginning. In Ref. [8], these categories are considered only in the statistical analysis to constrain the light jets branching ratio multiplying the gluons signal by suitable tagging and mistagging factors and, in this way, being able to discount for the bb and cc contamination. We, however, are able to increase the purity of the signal by eliminating the b-jet and c-jet Higgs decays using our ensemble classifier as discussed in the previous section. After a hard cut on the BDT score, the number of bottom and charm jet contaminants is negligibly small, and only ZZ and ZW background events survive as discussed in the previous section.
This efficient clearing up from ZH(bb) and ZH(cc) background events allows to place a direct upper bound on the Higgs to light jets branching ratio at the 95% confidence level (CL) where S j is the mean signal significance obtained after the BDT classification computed with the simple significance metrics where s = 15.37 and b = 60.9 from the table at the right panel of Fig. (5) assuming uncertainties in the background normalization. Note that these estimates might fluctuate in the blue band in Fig. (5). We use this simple significance metrics for a fair comparison with the results of Ref. [8] and because the results do not differ much compared with the Asimov formula. We can also follow the steps of Ref. [8] to combine the signal strength obtained in our light jet analysis with other estimates taking into account tagging and mistagging factors that consider mixings among the jet classes. In this case, our analysis should be interpreted as a bound on untagged jets, that is it, jets that are not tagged as b or c-jets. In this case, we define the signal strength for a decay channel H → ii as where we consider ii = bb, cc, and jj. Assuming each decay channel is statistically independent and following Gaussian statistics, we can get goodness of fit using the same chi-square function that we used in Eq. (10) above where S a is the significance from each category identified by experiments, ai are the tagging and mistagging factors, e ai = are the fractions of the Higgs decay channel i where both jets are tagged as a given in Table (II), and N prod sig is the number of signal events produced in the ZH associated production.
For the significance without systematic errors, we take (S b , S c , S j ) = (7.5, 1.35, 1.97), while for the significance with 1% of systematic errors we have (S b , S c , S j ) = (7.5, 1.35, 1.89); in this case, the significances for b and c channels barely change. This level of systematics seems to be adequate in view of the estimates of Ref. [8]. The b-channel significance S b = 7.5 comes from Table 12 in the ATLAS MC study [50] for the category "Two-lepton", while c-channel significance S c = 1.35 comes from Fig. 2(a) of Ref. [9] and computed in Ref. [8]. Contrary to S b , S c includes contributions from W H events then our estimate of the following signal strengths are approximated. The fully correlated signal strengths using only the + − + jj channel are plotted in Fig. (6).
We obtained the following 95% CL upper bound limit for the Higgs branching ratio into untagged jets with 0% and 1% systematic errors, in parenthesis, for 3000 fb −1 BR(H → j j ) 3.26(3.28) × BR SM (H → gg).
The bound for BR(H → cc) can also be obtained from Eq. (12). Actually, the contour plot in the µ j × µ c plane can be readily obtained as shown in Fig. (6) which displays the cut-based contours from Ref. [8] as dashed lines and our results as shaded areas. We see that the constraints get considerably tighter for µ j but not for µ c which is expected once we are taking exactly the same significances for bb and cc signals as in the cut-based analysis.
While at the left panel of Fig. (6) we show the case with no systematics included, at the right panel we show how the bounds degrade once systematics are taken into account. Even assuming a rather small systematics of 0.3%, the cut-based bounds loosens compared to the statistics dominance case. On the other hand, the ML results do not change for this level of systematics once the s/b ratio is much larger than that achieved with cut-and-count only.
In order to give an idea of the impact of systematic uncertainties, we compute the signal significance with our ML analysis with various levels of uncertainty and integrated luminosity. The results are shown in Table (III). First of all, we see that the results are rather robust against systematics levels up to 10%. Second, 1000 fb −1 should be enough to reach an 1σ significance, but only at the end of the HL-LHC run we should be able to probe Higgs to light jets in the ZH → + − + jj channel.

VII. CONCLUSIONS
Completing the SM picture in all its colors, shapes and nuances demands observing and studying all its predicted interactions. The LHC is confirming that the Higgs boson coupling scales with the mass of the SM particles. This makes it easier to probe the Higgs couplings to bottom and top quarks, the heavy gauge bosons and the tau lepton but it turns the observation of interactions to the light quarks, especially up, down and strange quarks, and also the lighter leptons, much more challenging. Previous attempts to probe the Higgs coupling to light jets, that is it, to gluons and u, d and s quarks, based on standard cut-and-count strategies, have shown how difficult is to select the jets coming from the Higgs decay apart from the SM backgrounds.
FIG. 6: Contours in µc-µj plane, for statistics only (left panel) and including systematic uncertainties (right panel) for the + − + j j channel, respectively, where j is a jet that could not be tagged as b or c-jet. The green and yellow regions are the 68.3% and 95% CL regions, respectively, given by the ML based analysis (comsbined BDT + ResNet-50 predictions), while the green and yellow dashed lines are the corresponding regions from the cut based analysis [8].
In this work, we used machine learning tools to make the classification of signal and background jets more efficient. We found that analyzing + − + jj events at the 14 TeV LHC, the decays from boosted ZH production can be used to put stronger constraints on the branching ratio of H → jj, j = g, u, d, s than those of Ref. [8]. These boosted events generate fat jets whose substructure can be "seen" by convolutional neural networks after subtle transformations that make the jet images more discernible. We employed several state-of-art ML techniques to improve the performance of the CNN algorithm in obtaining the highest signal significance possible. In spite of its power, the CNNs were not able to separate signal from backgrounds at the level we need, however, the output scores assigned by the CNNs to each event class is by themselves a very distinctive feature that can be combined with kinematic information of the particles of the event to train another ML algorithm -this is an ensemble learning.
Following this insight, we trained a boosted decision trees algorithm using the CNNs outputs and kinematic information of the events to further cleaning of Higgs to light jets signals. This time, we were able to double the signal significance compared to the cut-and-count analysis [8], reaching ∼ 2σ in the statistics dominance scenario after 3000 fb −1 , despite even a 10% systematics on the backgrounds normalization should not change these results considerably. Moreover, the ML algorithm was able to eliminate the Z(H → bb) and Z(H → cc) contaminants allowing us to derive the following 95% CL bound directly on the light jets branching ratio instead of a bound on the untagged jet class as in Ref. [8] BR(H → jj) ≤ 2(2.26) × BR SM (H → gg) , assuming a 0(10)% systematic uncertainty on the background normalization.
Combining the significance reached in this analysis with the ones in the search for H → bb and H → cc taking into account mixings of tagged and mistagged jet classes, it is possible to put bounds on the BR(H → j j ), where j is a jet that could be not associated to a b ot c-jet, it is an "untagged" jet. Following closely the analysis of Ref. [8], we found BR(H → j j ) ≤ 3.26(3.28) × BR SM (H → gg) , which improves the results obtained exclusively with a dedicated cut-and-count analysis.  x = λx i + (1 − λ)x j , (C1) y = λy i + (1 − λ)y j , where λ ∈ [0, 1]. Therefore, mixup extends the training distribution by incorporating the prior knowledge that linear interpolations of feature vectors should lead to linear interpolations of the associated targets. We show an example in Fig. (7).