1 Introduction

Since the discovery of a Higgs boson with a mass of about \(125\, \text {GeV}\) in 2012 [1, 2], a tremendous and ongoing effort has been enacted in order to gain insights into the properties and interactions of the detected state. Its couplings with third generation fermions and weak gauge bosons, as well as the loop-induced couplings with gluons and photons, have been investigated in detail, indicating agreement with the predictions of the Standard Model (SM) within the present experimental and theoretical uncertainties. In view of the plethora of possible connections of the detected Higgs boson to sectors of physics beyond the SM (BSM), probing the Higgs interactions with respect to possible effects of BSM physics will be of central importance at the present and future runs at the LHC and at any future collider.

In this context the Higgs boson self-couplings are of particular relevance, while experimentally these couplings are very difficult to access. Experimental information about the trilinear and quartic Higgs couplings is needed to gain insights about the shape of the Higgs potential, which will have implications for a better understanding of the electroweak phase transition in the early universe and may be instrumental for explaining the observed asymmetry between matter and anti-matter in the universe. In the SM the Higgs potential is given by

$$\begin{aligned} V(\Phi ) = \lambda (\Phi ^\dagger \Phi )^2 - \mu ^2 \Phi ^\dagger \Phi \end{aligned}$$
(1)

in terms of the single Higgs doublet field \(\Phi \). In extended scalar sectors the potential can have a much richer structure. While the cubic and quartic Higgs couplings arising from Eq. (1) are correlated in the SM and can be predicted in terms of the known experimental values of the mass of the detected Higgs boson and the vacuum expectation value, large deviations from the SM predictions for the Higgs self-couplings are possible even in scenarios where the other couplings of the Higgs boson at \(125\, \text {GeV}\) are very close to the SM predictions (see e.g. Ref. [3] for a recent discussion of this point for the case of the trilinear Higgs coupling). Experimental constraints on the trilinear and quartic Higgs self-couplings can be expressed in terms of the so-called \(\kappa \)-framework, where \(\kappa _3\) (\(\kappa _4\)) denotes the coupling modifier of the cubic (quartic) coupling from its SM value at lowest order, i.e. \(\kappa _i = g_i / g_i^\text {SM}\), where \(g_i\) denotes the value of the coupling and \(g_i^\text {SM}\) its lowest-order SM prediction, and \(i = 3, 4\).

The most direct probe of the trilinear Higgs coupling at the LHC is the production of a pair of Higgs bosons, where \(\kappa _3\) enters at leading order (LO). Both the ATLAS [4] and CMS [5] collaborations determine the limits on \(\kappa _3\) from both gluon fusion and weak boson fusion (WBF) from different decay channels of the Higgs boson. At next-to-leading order (NLO), the trilinear Higgs coupling contributes to the Higgs-boson self-energy and also enters in additional one-loop and two-loop diagrams in WBF and gluon fusion, respectively, enabling the possibility of an indirect measurement through single-Higgs production [6,7,8,9,10]. The inclusion of single-Higgs information by the ATLAS collaboration results in the most stringent bound to date on \(\kappa _3\): \(\left[ -0.4, 6.3 \right] \). Triple-Higgs production is known to suffer from very small cross sections, but yields additional information on \(\kappa _3\) which could be used in combination with the aforementioned searches. Furthermore, it can provide the first experimental constraints on the quartic Higgs coupling \(\kappa _4\).

The paper is structured as follows. In Sect. 2 we discuss the allowed values of \(\kappa _3\) and \(\kappa _4\) from the perspective of perturbative unitarity and show that sizeable contributions to \(\kappa _4\) can occur, especially if \(\kappa _3\) deviates from the SM value. We explore in Sect. 3 how well the HL-LHC will be able to constrain \(\kappa _3\) and \(\kappa _4\) from the 6b and \(4b2\tau \) channels. Lepton colliders are additionally explored in Sect. 4 before conclusions are presented in Sect. 5.

2 Current bounds, unitarity and theoretical motivation

Besides the experimental constraints from Higgs pair and triple Higgs production processes, which will be discussed below, theoretical bounds can be placed on the Higgs self-couplings from the requirement of perturbative unitarity. In our analysis we employ the unitarity constraints obtained at tree level.

A general matrix element for \(2\rightarrow 2\) scattering with initial and final states \(\vert i\rangle \) and \(\vert f\rangle \), respectively, can be decomposed in terms of partial waves through the Jacob-Wick expansion [11]

$$\begin{aligned} \mathcal{{M}}_{if} = 16 \pi \sum _J (2 J + 1) a_{fi}^J (\mathcal{{D}}_{\lambda _i, \lambda _f}^J (\theta , \phi ))^* \;, \end{aligned}$$
(2)

where J indicates the total angular momentum of the corresponding amplitude, and the \(\lambda _{i,f}\) denote the helicities of the initial and final states. The most relevant channel at tree level for constraining \(\kappa _3\) and \(\kappa _4\) is \(HH \rightarrow HH\) scattering, where the Wigner-D functions \(\mathcal{{D}}_{\lambda _i, \lambda _f}^J\) reduce to unity for the zeroth partial wave. Conservation of probability leads to the requirement that the perturbative expansion must satisfy the optical theorem, which can be used to obtain an upper bound on the zeroth partial wave of

$$\begin{aligned} |\text {Re}\left( a_{ii}^0\right) | \le \frac{1}{2}\;, \end{aligned}$$
(3)

which if violated indicates inconsistencies in the perturbative calculation.

The zeroth partial wave at tree level can be calculated asFootnote 1

(4)

In the limit where the centre of mass energy is high, \(a_{ii}^0\) solely depends on \(\kappa _4\), while at lower energies a sizeable contribution from \(\kappa _3\) can yield a peak in \(a_{ii}^0\) that surpasses the allowed limit. We have calculated the zeroth partial wave for different values of \(\kappa _3\) and \(\kappa _4\) for a large range of energies in order to identify the parameter regions that are allowed by tree-level perturbative unitarity. Figure 1 shows the bounds from perturbative unitarity along with the current experimental bounds on the trilinear coupling from ATLAS, \(\kappa _3 \in [-0.4, 6.3]\) at the 95% C.L. [4], and the 95% combined ATLAS and CMS HL-LHC projection under the SM hypothesis, \(\kappa _3 \in [0.1, 2.3]\) [19].

Fig. 1
figure 1

Bounds from perturbative unitarity on \(\kappa _3\) and \(\kappa _4\) as obtained from \(HH \rightarrow HH\) scattering. In addition, the current experimental bounds on \(\kappa _3\) are shown (black dashed lines), as well as the expected projections from the HL-LHC (black solid lines). The shaded light blue region indicates where the dimension-eight contributions to \(\kappa _4\) are smaller than the dimension-six ones, while the dotted blue line corresponds to the case where the dimension-eight contributions vanish, \(\kappa _4 - 1 \simeq 6 (\kappa _3 - 1)\)

Fig. 2
figure 2

The left plot shows the impact of an increasing splitting between the masses of the BSM Higgs bosons on the one-loop prediction for the trilinear Higgs coupling \(\kappa _3\), where \(M = m_H = 600\) GeV and \(m_A = M_{H^\pm }\) is varied, in agreement with the results of Ref. [3] for the quoted benchmark point. On the right, the respective plot for \(\kappa _4\) is shown. The shorthand notations \(s_\beta \), \(c_\beta \) and \(t_\beta \) denote \(\sin \beta , \cos \beta \) and \(\tan \beta \), respectively

The unitarity bounds on \(\kappa _4\) are significantly weaker than the ones on \(\kappa _3\). This feature can be understood from the Effective Field Theory (EFT) perspective, where effects from higher-dimensional operators to the potential are incorporated as an expansion in terms of inverse powers of a UV-scale \(\Lambda \) [20] (see also the discussion in Refs. [18, 21])

$$\begin{aligned} V_\text {BSM}\! = \!\frac{C_6}{\Lambda ^2} \left( \Phi ^\dagger \Phi \!-\! \frac{v^2}{2} \right) ^3 \!+ \frac{C_8}{\Lambda ^4} \left( \Phi ^\dagger \Phi - \frac{v^2}{2} \right) ^4 \!+ \mathcal{{O}}({\frac{1}{\Lambda ^6})}\;. \end{aligned}$$
(5)

We use the convention where in the unitary gauge \(\Phi = \left( 0, (v + H)/\sqrt{2}\right) \), where v denotes the electroweak vacuum expectation value (VEV), and H is the 125 GeV Higgs boson. The benefit of this parameterisation of the higher-dimensional operators is that \(\kappa _3\) receives corrections purely from dimension-six operators, while \(\kappa _4\) only from dimension-six and dimension-eight operators (interaction vertices with more Higgs legs would additionally receive corrections from \(\mathcal{{O}}(1/\Lambda ^6)\) terms and so on). With the definitions of \(\kappa _i\) as before, the coupling modifiers receive corrections

$$\begin{aligned} (\kappa _3 - 1)&= \frac{C_6 v^2}{\lambda \Lambda ^2},\nonumber \\ (\kappa _4 - 1)&= \frac{6 C_6 v^2}{\lambda \Lambda ^2} + \frac{4 C_8 v^4}{\lambda \Lambda ^4}\nonumber \\&\simeq 6 (\kappa _3 - 1) + \mathcal{{O}}\left( \frac{1}{\Lambda ^4}\right) . \end{aligned}$$
(6)

Thus, if a small correction is induced in \(\kappa _3\), one should expect that in an EFT theory with high scale cutoff where the dimension-eight terms are negligible, the deviation in \(\kappa _4\) from the SM expectation would be six times larger. Even if the higher-dimensional contributions are relevant, \(|(\kappa _4 - 1) - 6 (\kappa _3 - 1) |< 6 |\kappa _3 - 1|\) needs to be satisfied in order to maintain a well-behaved expansion in powers of \(\Lambda \). Although in this work we choose to work in all generality without any EFT assumptions on the \(\kappa _3\) and \(\kappa _4\) modifiers, we indicate the region where this condition is fulfilled in Fig. 1.

Fig. 3
figure 3

Correlation of \(\kappa _3\) and \(\kappa _4\) for different values of \(M = m_H\) and \(m_A = M_{H^\pm }\). The light purple area indicates the range of 2HDM one-loop predictions for the correlation between \(\kappa _3\) and \(\kappa _4\) for \(m_A, M \in [0.3, 10]\) TeV. We show specific choices of M as solid lines with varying \(m_A\), in order to demonstrate the effect on \(\kappa _3\) and \(\kappa _4\). The gray (dark) shaded regions are the areas excluded by tree-level perturbative unitarity according to the analysis of Fig. 1, and the shaded light blue region is defined in the same way as in Fig. 1

In order to present an example where Eq. (6) can be realised, we consider the Two-Higgs Doublet Model (2HDM), where beyond tree level, the cubic and quartic self-couplings can receive significant contributions, as shown in Ref. [3] (see also Ref. [22]). A review of the 2HDM can be found in Ref. [23]. We work in the alignment limit with the lightest scalar identified as the 125 GeV Higgs boson (after Electroweak Symmetry Breaking (EWSB) and rotation to the Higgs basis) and perform a one-loop calculationFootnote 2 of the trilinear and quartic couplings employing the on-shell renormalisation scheme. As a motivated example we pick a benchmark point from Refs. [3, 26] which is compatible with the latest experimental results while also receiving sizeable trilinear-coupling corrections. We reproduce the one-loop result of Refs. [3, 26] and also show the quartic coupling in Fig. 2. As expected, the prediction for the quartic coupling quickly rises to values even beyond what is allowed by tree-level perturbative unitarity in the \(\kappa \)-framework if the splitting between the mass of the CP-odd Higgs boson in the 2HDM, \(m_A\), and the BSM Higgs scale \(M = m_{12}/(c_\beta s_\beta )\) increases. In the displayed example the unitarity bound is violated if \(m_A\) surpasses \(\sim 1100\) GeV, and further two-loop contributions would tighten the bound on \(m_A\).

In Fig. 3 the linear relation between \(\kappa _3\) and \(\kappa _4\) in the 2HDM is shown for variations of the scale M and for masses \(m_A \le 1.5\) TeV. Varying the values of \(m_A\) and M shifts the relation between the self-couplings while maintaining a linear correlation between them. For \(\kappa _3 =6\) the corresponding results for \(\kappa _4\) vary between \(\kappa _4 \approx 22\) and \(\kappa _4 \approx 31\) for the displayed scenarios. Thus, the largest allowed values for \(\kappa _3\) according to the present bounds are correlated in the 2HDM with very large shifts in \(\kappa _4\). As indicated by the shaded light blue region in the plot these predictions for \(\kappa _3\) and \(\kappa _4\) are associated with a well-behaved power expansion within an EFT framework. While it would also be of interest to explore which models can induce an even larger deviation of \(\kappa _4\) for relatively small values of \(\kappa _3\), potentially resulting in regions that require a non-linear effective prescription (for instance the Electroweak Chiral Lagrangian), we leave such an investigation for future work.

3 Triple Higgs production at the HL-LHC

The production of three Higgs scalars at the LHC and future colliders is highly suppressed compared to single and double Higgs production, severely limiting the available final states that can be explored at the LHC. In order to obtain the highest values for the product of cross section and branching ratios, one needs to consider the dominant production mode through gluon fusion, but also the main decay channel to a b-quark pair. The latter is difficult in hadron collisions due to the sizeable multi-jet background from QCD processes. It can be problematic for typical cut-and-count analyses to sufficiently suppress the background while at the same time avoiding a large reduction of signal events in order to maximise significance. In this work we resort to Machine Learning (ML) techniques for appropriately selecting the signal region of the considered channels.

In order to identify which of the decay channels of the on-shell Higgs bosons can be utilised for the analysis at the LHC, we start with an optimistic estimate of the number of events for the 6b, \(4 b 2 \tau \), \(2 b 4 \tau \) and \(4 b 2 \gamma \) final states.Footnote 3 Within the SM the involved branching ratios are given as

$$\begin{aligned}{} & {} \text {BR}(H \rightarrow b \bar{b}) = 0.584, \nonumber \\{} & {} \text {BR}(H \rightarrow \tau ^+ \tau ^-) = 6.627 \times 10^{-2} \nonumber \\{} & {} \text {BR}(H \rightarrow \gamma \gamma ) = 2.26 \times 10^{-3} \;. \end{aligned}$$
(7)

We note that the \(4 b 2 \gamma \) and \(2 b 4 \tau \) final states only produce a few events at 3/ab, even at relatively large coupling modifiers \(\kappa _3 > rsim 4.5\), \(\kappa _4 > rsim 30\) (taking into account K-factors of 1.7 [28] and tagging efficiencies of all taus and all-but-one b-quarks). It is therefore unlikely that these channels will be statistically significant at the HL-LHC, even though they can be highly relevant for colliders utilising higher energies, as shown in Refs. [27, 29,30,31].Footnote 4 We therefore will not consider these channels further, and instead focus on the 6b and \(4 b 2 \tau \) channels.

The background processes for the 6b final state have been thoroughly discussed in Ref. [34] (see also Ref. [35]), and it is expected that the dominant contribution arises from multi-jet QCD 6b events. This is the only background that is taken into consideration for this final state in this work, and we neglect subdominant channels.

In the \(4b2\tau \) channel,Footnote 5 the dominant backgrounds arise from the production of four b-quarks along with two W bosons (WWbbbb) or one Z boson (Zbbbb). The former includes the production of a top and bottom pair (ttbb) with subsequent decays \(t \rightarrow W b\). The production of a top pair associated with a Higgs (ttH) or a Z boson (ttZ) also yields noteworthy contributions. Here the ttH channel is particularly problematic if a reconstructed resonance close to the 125 GeV mass is required during an analysis to isolate the triple-Higgs signal. The final background included in our analysis is the four top production (tttt).

3.1 Analysis

3.1.1 Event generation and pre-selection

We use Madgraph5_MC@NLO [37, 38] for event generation and modify the provided SM model file in the UFO [39] format to introduce the modifications of the trilinear and quartic Higgs couplings \(\kappa _3\) and \(\kappa _4\), respectively.Footnote 6 Signal events are generated for \(p p \rightarrow h h h\) and are subsequently decayed on-shell with Madspin [42] in order to obtain the cross section rates. Due to the complexity of the multi-particle final states we generate events with a minimum transverse momentum for the b-quarks of \(p_T(b) > 28\) GeV and within the pseudorapidity region \(|\eta | < 2.5\), while we will later impose stricter cuts during the analysis. Additionally, since the signal consists of three on-shell Higgs bosons, we impose a cut on the invariant mass of the process of \(\sqrt{\hat{s}} > 350\) GeV at generation level.

While in principle one could explore different cuts in order to efficiently identify the signal region, the complexity of the final states would render this a cumbersome and difficult procedure, possibly requiring the use of complicated observables. Instead, we resort to Graph Neural Networks (GNNs) for an efficient discrimination between signal and background events. This requires an appropriate embedding of particle events to graphs. Before we address the ML aspects of the analysis it is appropriate to define pre-selection conditions required to be satisfied by each event that gets passed to the network.

Showering and hadronisation is performed with Pythia8 [43] saving the resulting events as HepMC files [44]. FastJet [45, 46] is interfaced through Rivet [47, 48], and jets are clustered using the anti-kT algorithm [49] of radius 0.4 and requiring a transverse momentum of \(p_T(j) > 30\) GeV. We use Rivet to calculate the events that will pass the pre-selection using a b-tagging efficiency (independent of \(p_T\)) of 0.8. For the 6b (\(4b2\tau \)) channel, at least five (three) b-quarks are required, satisfying the conditions \(p_T(b) > 30\) GeV and \(|\eta (b)| < 2.5\). For the \(4b 2\tau \) channel two \(\tau \) particles must also be identified in the central part of the detector, \(|\eta (\tau )| < 2.5\), with \(p_T(\tau ) > 10\) GeV. The \(\tau \) particles are identified with the TauFinder class of Rivet, and at least one \(\tau \) particle must decay hadronically.Footnote 7 We apply an efficiency of 0.8 for both leptonic and hadronic taus.Footnote 8 The invariant mass of the sum of the four-momenta of the above final states should exceed 350 GeV, otherwise the event is vetoed. Finally, we form combinations of b-quark pairs, and at least one pair is required to have an invariant mass close to the mass of the Higgs boson, \(m_{b \bar{b}} \in \left[ 110, 140\right] \) (in GeV). In the case of the \(4b 2 \tau \) channel the event passes the pre-selection also if the invariant mass criterion is satisfied by the invariant mass of the di-tau state, \(m_{\tau \tau }\).

3.1.2 Graph embedding and neural network architecture

GNNs, stemming from the idea that certain types of data can be efficiently represented as graphs, have been increasingly utilised in particle physics. Various works have indicated their applicability for BSM-relevant tasks such as event classification [52, 53], jet-tagging [54, 55], particle reconstruction [56], identifying anomalies in data arising from BSM interactions [57, 58] and obtaining constraints on parameters in SMEFT or the \(\kappa \)-framework [59, 60].Footnote 9 The latter is what we aim to achieve by performing a fit within the \(\kappa _3\)\(\kappa _4\) plane after the efficient selection of a signal region from the GNN. Similar architectures using graphs have also been recently utilised by experiments at the LHC, see e.g. Ref. [62].

The generated events need to be embedded in graphs before they are passed to the neural network. We explore two different pathsFootnote 10:

  1. 1.

    Fully connected (FC) Add nodes for all the considered final states (i.e. b quarks and \(\tau \) leptons, denoted as \(b_i\) and \(\tau _i\) according to their \(p_T\) values) and edges connecting all the nodes. We use the transverse momentum, pseudorapidity, azimuthal angle, energy, mass and PDG identification number as node features, \(\left[ p_T, \eta , \phi , E, m, \text {PDGID}\right] \), while no edge features are introduced. A node is also added for the missing momentum of the event.

  2. 2.

    Reconstructed nodes (RN) Add fully connected nodes for b quarks (and \(\tau \) leptons for the \(4 b 2\tau \) final state) as before, but additionally add nodes \(H_i\) for reconstructed pairs of particles i, j that are (relatively well) compatible with the Higgs-boson mass, \(m_{ij} = 125 \pm 25\) GeV. This is achieved by forming combinations between all the b-quarks and (if applicable) the \(\tau \)-pair. The \(H_i\) nodes correspond to the four-momentum and mass of the reconstructed pair, ordered according to which is closest to the Higgs-boson mass of 125 GeV. All the nodes have \(\left[ p_T, \eta , \phi , E, m, \text {PDGID}\right] \) as associated features, where the PDGID for \(H_i\) is zero.

Such physics-inspired approaches according to the expected chain of the event have been shown to improve results in semi-leptonic top decays [59] and are actively being explored [63].

GNNs operate by calculating messages using node features (and edge features if these exist) and iteratively updating the node features for each message passing layer. We rely on the EdgeConv [64] operation for message passing, where the message of node i at the l-th message passing layer is calculated with

$$\begin{aligned} m_{ij}^{(l)} = \text {{ReLU}} \left( \Theta (\textbf{x}_j^{\,(l)} - \textbf{x}_i^{\,(l)}) + \Phi (\textbf{x}_i^{\,(l)})\right) \;, \end{aligned}$$
(8)

where \(\Theta \) and \(\Phi \) indicate linear layers. The node features for \(l=0\) are the kinematical quantities we have defined as inputs, and the updated node features are obtained from the messages by averaging over the neighbouring nodes,

$$\begin{aligned} \textbf{x}_i^{(l+1)} = \frac{1}{|\mathcal{{N}}(i)|} \sum _{j \in \mathcal{{N}}(i)} m_{ij}^{(l)}\;. \end{aligned}$$
(9)

The final node features after all EdgeConv operations are aggregated to a single vector using a ‘mean’ graph readout operation. In principle, it is possible to additionally include further (non-graph related) layers at this stage. The final network score is obtained with a linear layer with the SoftMax activation [65] that reduces the resulting features to a two-dimensional vector, with each entry representing the probability that the event was signal or background. The amount of EdgeConv and the following linear layers need to be optimised to achieve high performance while at the same time avoiding overfitting. After experimenting with different setups, we settled on using two EdgeConv layers with hidden features of size 96 before the output layer for both channels.

3.1.3 Training and comparison of graph embeddings

The data is split into subsamples of \(\sim 56\%\), 19 and \(25\%\) for training, validation and testing, respectively,Footnote 11 and we minimise the cross-entropy loss function in order to train the network using the Adam optimiser [66]. The learning rate is one of the hyperparameters requiring tuning, and for our case the value of 0.001 (0.01) performs best for \(3b2\tau \) (5b). If for three epochs in a row the loss has not decreased, then the learning rate is reduced by a factor of 0.1. In principle the training can run for up to 200 epochs, although we impose early stopping conditions if the loss has not improved for ten consecutive epochs. A batch size of 128 is used for every update of the loss function.

The GNN for the 5b analysis is trained on two classes (signal and background). The situation is more involved for the \(3b 2 \tau \) case where the analysis benefits significantly from a multi-class classification which allows identifying different thresholds for the different background scores. In particular we choose to train on the WWbbbb, Zbbbb and \(t \bar{t} (H \rightarrow \tau ^+ \tau ^-)\) contributions. The signal events used for training are always for the \((\kappa _3, \kappa _4) = (1,1)\) point (using different values does not significantly alter the performance of the network).

We use the EdgeConv implementation from the Deep Graph Library [67] with PyTorch [68] as backend. The graph embedding relies on PyLHE [69] to extract events from the Les Houches Events (LHE) files [70]. In order to compare the different graph embeddings, we use functionality from scikit-learn [71] to calculate the true and false positive rates at different thresholds,Footnote 12 and we show the corresponding Receiver Operating Characteristic (ROC) curves for both channels in Fig. 4.

The ROC curves and the distributions allow one to conclude that the RN embedding utilising the reconstructed Higgs boson mass can lead to significant improvements. This is not unexpected as additional information (available at detector level) is passed to the network to aid classification. While in principle a sufficiently deep neural network with fully-connected graphs could also eventually learn to map the input features of the b-jets (and taus) to the masses of the reconstructed Higgs bosons, including the information in the graph embedding allows easier optimisation and quick convergence with a shallow network. We therefore utilise only the RN embedding for performing the final signal region selection.

Fig. 4
figure 4

ROC curves for the 5b and \(3b2\tau \) analyses are displayed on the left and right, respectively, showing the performance of the two embedding cases. For the \(3b2\tau \) case we binarise in a one-vs-rest scheme, grouping together the background classes. The areas under the ‘RN’ and ‘FC’ curves for the 5b case are 0.862 and 0.823, respectively. For the \(3b2\tau \) analysis the ‘RN’ area is 0.909 and the ‘FC’ area is 0.833

3.1.4 HL-LHC results

For simplicity we will use the signal efficiency of the network for the \((\kappa _3, \kappa _4) = (1,1)\) point and assume that it will be mostly the same irrespective of the coupling modifier values as our analysis is largely dependent on the cross section rates. Ideally one could train and optimise a network on each point, or alternatively train on event samples from topologies that depend differently on \(\kappa _3\) and \(\kappa _4\).Footnote 13 For the 5b analysis we optimise the signal selection to reduce the false positive rate to \(\sim 0.6\%\). In the \(3 b 2 \tau \) channel we require the following conditions to be satisfied on the background network scores:

  • \(\text {P}[WWbbbb] < 3\%\),

  • \(\text {P}[Zbbbb] < 10\%\),

  • \(\text {P}[t\bar{t}(H \rightarrow \tau ^+ \tau ^-)] < 30\%\).

It should be noted that even though the network was trained only on a subset of possible background contributions, it still performs well, as discussed below, and manages to remove background contributions from other sources as well. We calculate the efficiencies for each contribution and show the reduction of cross sections in Table 1. Our results for both channels include a K-factor for the signal of 1.7 [28] and a conservative estimate of the higher-order contributions to the background processes in terms of a K-factor of 2.

Table 1 Background contributions included in the \(3b2\tau \) analysis and reduction of the generated cross sections (labelled as “gen.”) after pre-selection cuts (“sel.”) and GNN selection (“NN”). W-bosons arising from tops are allowed to decay hadronically and c-jets can be mis-tagged as b-jets with a probability of 0.2
Fig. 5
figure 5

Projected contours indicating the \(1\sigma \) and \(2\sigma \) bounds in the \(\kappa _3\)\(\kappa _4\) plane from the 5b (left) and the \(3 b 2 \tau \) (right) analysis, including effects from showering, hadronisation and reconstruction

We define the significance for our analysis according to [72]

$$\begin{aligned} Z = \sqrt{2 \bigg ( (S + B) \ln {(1+ \frac{S}{B})} - S\bigg )} \;, \end{aligned}$$
(10)

where S and B denote the signal and background events, respectively. This allows us to obtain \(1\sigma \) and \(2\sigma \) bounds within the \(\kappa _3\)\(\kappa _4\) plane (which roughly correspond to \(68\%\) and \(95\%\) CL, respectively), as shown in Fig. 5 for the 5b and \(3 b 2 \tau \) analyses. We assume an integrated luminosity at the HL-LHC of 3/ab and a combined ATLAS and CMS luminosity of 6/ab.

Overall, we observe that the \(3 b 2 \tau \) analysis is more sensitive than the 5b analysis, and the latter will additionally suffer from further subdominant electroweak contributions to the background that have not been included.Footnote 14

Fig. 6
figure 6

The left plot shows the projected contours indicating the \(1\sigma \) and \(2\sigma \) bounds in the \(\kappa _3\)\(\kappa _4\) plane obtained from a combination of the 5b and \(3 b 2 \tau \) channels under the assumption that there are no correlations. The right plot shows the corresponding result where the backgrounds for both channels are increased by 50%

However both channels should be utilised in combinations to maximise the significance. Assuming for simplicity zero correlations between the channels, we combine the significances as \(Z_\text {comb} = \sqrt{Z_{5 b}^2 + Z_{3 b 2 \tau }^2}\), giving rise to the contours shown in Fig. 6 (left). While the projected bounds of about \(\pm 20\) times the predicted value for the quartic Higgs self-coupling in the SM may appear to be quite weak, in view of our discussion above we emphasise that such bounds go much beyond the existing theoretical bounds. Furthermore, deviations of this size in \(\kappa _4\) are well compatible with the existing experimental bounds on \(\kappa _3\) according to the correlations between \(\kappa _3\) and \(\kappa _4\) that are present in the BSM scenarios analysed above. Regarding the sensitivity to \(\kappa _3\) from triple Higgs boson production at the HL-LHC, Fig. 6 shows that the expected sensitivity in this channel at the HL-LHC is weaker than the present experimental limits that have been derived from di-Higgs production. Combining this independent set of experimental information on \(\kappa _3\) with the experimental results from di-Higgs production may nevertheless turn out to be useful. While our analysis may be optimistic in some respects (e.g. we neglect fake taus), on the other hand we note that further developments of the triggers, tagging and reconstruction algorithms of final states could result in higher efficiencies than the values that we have adopted in our analysis, enhancing the significance. The ability to discriminate between jet flavours is highly important for HHH studies (as well as HH studies) and could also allow experiments to study fully hadronic final states where H decays to W bosons. On the other hand, we note that even in the case that the backgrounds are increased by \(50\%\), the resulting constraints on \(\kappa _3\) and \(\kappa _4\) degrade only slightly, as shown in Fig. 6 (right).

3.2 Interpretability of NN scores

Understandably, NN techniques are often viewed as “black boxes”, due to their inability to indicate the input features that are most important for determining their predicted scores. In order to address this shortcoming, various approaches have been explored in the recent years with the goal to yield interpretability, allow efficient debugging of the network, better understand the mapping between input and output, and ultimately allow the identification of ways to improve it. These methods gained traction in particle physics in the recent years to obtain a better insight for various different tasks such as jet- and top-tagging and detector triggers [73,74,75,76,77,78,79].

Fig. 7
figure 7

Attributes for the features of the b jets, \(\tau \) leptons and the reconstructed invariant Higgs-boson mass that is closest to the 125 GeV resonance. The height of the attribution value indicates to what extent the network is using the particular feature in order to discriminate between signal and background. In the figure we denote collectively all the background classes as ‘B’ and the signal as ‘S’

There are various techniques for gaining interpretability in ML, but in general they can be separated into two categories: intrinsically interpretable models that are specifically designed to increase transparency providing intuition and post-hoc explanation methods that were developed to enhance our understanding of generic ML models. The latter is what applies to the case of this work. However, many post-hoc techniques lack certain properties that are beneficial to maintain; for example one could directly use the product of the gradients computed during backpropagation and the input in order to attribute the most relevant features [80, 81]. As the gradients of the network hold information regarding variations of the inputs, it should be possible to use them to quantify the dependence of the score on features. It is known, though, that gradient methods can yield the same attribution for an input and a baseline that differ from each other and have different outputs (for an example see Ref. [82]), due to the gradient becoming flat (this is often the case as NNs are trained until the loss is saturated).

Shapley [83] values (originating from Game Theory), are formulated based on certain axioms to distribute the attributions amongst the participating variables in a ML approach and have been applied for obtaining interpretations [84] (for an application in particle physics, see Refs. [85,86,87]). Their attractiveness stems from the fact that they follow axiomatic principles unlike earlier methods (e.g. DeepLift [88] or Layer-wise relevance propagation [89]). However, their evaluation is often computationally expensive and requires multiple calls of the neural network.

Integrated Gradients (IGs) is an alternative approach, designed in Ref. [82] using axiomatic considerations, which requires significantly fewer calls to the network function. The trade-off is the requirement that the ML technique must be differentiable, which is the case for NNs optimised through gradient descent approaches, and the application of IGs also requires access to the gradient of the model.Footnote 15 Let a generic classification NN denoted as \(F: \mathbb {R}^{n} \rightarrow \left[ 0, 1\right] \) for input features \(x \in \mathbb {R}^n\) and \(x^\prime \in \mathbb {R}^n\) denote an appropriate baseline (e.g. a zero vector). Integrating over all the gradients of F in a straight path from \(x^\prime \) to x defines IGs as

$$\begin{aligned} \mathcal{{I}}_i(x) = (x_i - x_i^\prime ) \int _{0}^{1} d\alpha \frac{\partial F(x^\prime + \alpha (x - x^\prime ))}{\partial x_i}\;. \end{aligned}$$
(11)
Fig. 8
figure 8

Histograms showing the attribution for different events against the value of the reconstructed mass for the (true) signal and backgrounds. The plot on the left (right) shows the reconstructed mass of the Higgs boson that is closest (second-closest) to the 125 GeV resonance. A positive attribution close to 1 indicates events with a high output score (i.e. identified as signal), while lower values of the attribution imply a low output score

We thus utilise IGs, implemented in the Captum [90] library, in order to obtain attributions for our predictions and identify the most relevant inputs for our processes.

The attributions obtained from IGs allow us to interpret the results of the network in terms of the input parameters for each node, as shown in Fig. 7, although some care is necessary when interpreting such results. Quite intuitively the transverse momenta and the energy of the b-jets are relevant parameters that receive high attributions. This is expected since restricting to higher values of \(p_T\) can help in the discrimination between signal and background (this was also the reason for applying a pre-selection momentum cut). Angular momenta are not so helpful for discrimination; this is not unexpected as we are dealing with scalars. The network additionally utilises the PID of the tau leptons more than the identification of the b quarks; this is likely due to the fact that the di-tau state is correlated with the highly discriminative reconstructed invariant mass of the Higgs boson. We clearly see that the introduction of the reconstructed masses significantly boosts the performance of the network, being the most important observable for the signal events. We note that as a reconstructed particle, the Higgs node has been assigned a PID of zero which as required by the ‘dummy’ axiomFootnote 16 has no attribution and thus zero contribution to the network results.

Taking a closer look at the reconstructed masses and their attributions, we see in Fig. 8 that the node with a reconstructed mass that is closest to 125 GeV receives a sizeable attribution.Footnote 17 The attributions from the mass of the \(H_1^\text {reco}\) node indicate that due to the similarity of the \(t\bar{t}(H\rightarrow \tau ^+ \tau ^-)\) background with the signal, the network is unable to clearly discriminate the two classes based on this feature alone. The inclusion of the mass of the second reconstructed Higgs boson, however, helps the network as indicated by the higher attributions assigned to the signal events as compared to the other sources of backgrounds. This implies that the inclusion of reconstructed observables can enhance the performance of GNNs in certain analyses, as also expected from the discussion in Sect. 3.1.3.

We stress that while the IG attributions provide an indication of the most important variables, our approach does not yield detailed information on how the specific correlations between the input features can impact the network score. While in many cases this would be desired, this is beyond the scope of our work where we use IGs as a method to verify that the introduction of the reconstructed Higgs-boson mass is indeed the most relevant variable. We leave explorations of alternative techniques (also specific to GNNs) pinpointing to important connections between input features and nodes for future work.

In our work we utilised interpretation methods mostly to ensure that the GNN works as expected and in order to identify potential issues during the implementation of the network. However, the usefulness of such techniques extends well beyond this. For example, in the case of limited computing resources one could check which features are irrelevant and remove them from the analysis before scaling the network up. Indeed, in our analysis we checked that if we remove the seemingly unused angular information, we obtain similar results as before (resulting in no visible changes in Fig. 5 for \(3b2\tau \)). Additionally for analyses with multiple final states the most practical observable that can be exploited is not always straightforward to identify. Interpretation techniques could therefore be used as a first step to identifying the most relevant observables before optimising the analysis to enhance its significance.

4 Reach assessment for lepton colliders and comparison with the HL-LHC

For comparison with the prospects of the HL-LHC, we finally consider the expected upper limits on \(\kappa _3\) and \(\kappa _4\) from possible future lepton colliders.Footnote 18 We consider an inclusive analysis of \(\ell \ell \rightarrow H H H + \text {other}\) which includes both the associated ZHHH production and the production through WBF. In principle one could consider dedicated analyses for each channel, optimising the selection of final states; however, we choose to perform an inclusive analysis to avoid further assumptions on the identification of other states which could vary depending on the collider concept and the detector. We will consider the decay \(H \rightarrow b \bar{b}\) of the Higgs boson, which yields the largest possible cross section for the signal, and assume throughout that the b-tagging efficiencies will be 0.8. Our analysis relies solely on identifying 5b jets in the clean environment provided by lepton collisions. We apply an additional \(\sim 0.83\) efficiency which arises from requiring the \(p_T\) of the b jets to be larger than 30 GeV. We note that in practice the results for an electron or muon collider would be similar, i.e. the obtained contours for the limits in the \(\kappa _3\)\(\kappa _4\) plane for a given collider c.m. energy and integrated luminosity would not be expected to significantly differ for the two collider types. Therefore we will refer to generic lepton colliders in the following, although we use the centre-of-mass energies of 1 and 3 TeV envisaged for the ILC and CLIC, as well as 10 TeV collisions that could be realised at a muon collider. We scan over different values of \(\kappa _3\) and \(\kappa _4\) for the aforementioned energies and subsequently apply the relevant tagging efficiencies.

An important limitation of high-energy lepton collisions in this case, however, arises from the region where the detectors can tag b jets. While for energies \(\sim 1\) TeV the b quarks are in the central part of the detector, the situation is significantly different for 10 TeV collisions, as shown in Fig. 9. It is thus necessary to explore possibilities for extending the tagging capabilities of future detectors to even \(|\eta | \sim 4\) in order to avoid a significant loss of events.

Fig. 9
figure 9

Pseudorapidity distribution for the leading b quark for different collision energies

Fig. 10
figure 10

On the left, the projected 95% CL contours for lepton colliders at different energies and integrated luminosities are shown, mainly focusing on the energies of ILC, CLIC and a possible muon collider. The SM value is shown as a black dot. The plot on the right shows a zoomed-in version

Fig. 11
figure 11

Comparison of the projected 95% CL contours for the 5b and \(3 b 2 \tau \) analyses at the HL-LHC with the projected 95% CL sensitivities at lepton colliders with different energies (indicated by the different coloured regions). The shaded gray area indicates the region that is excluded by the bound from tree-level perturbative unitarity

The leptonic collisions deliver clean signals avoiding the large background contamination from QCD that is present at hadron colliders. We checked the leading order QCD background of the signature \((5 b + \text {other})\) and found that the cross sections of these background processes are small. Assuming that the selection of the signal region will enforce \(p_T(b) > 30\) GeV, the requirement that one di-bottom pair should be compatible with the mass of the observed Higgs boson and a cut ensuring that the total invariant mass of the final state particles produced in the process is at least 350 GeV would result in no remaining background events (even with more relaxed cuts, the number of events is negligible when taking b-tagging efficiencies into account). However, similar to Refs. [21, 91, 92] we do not take into account electroweak backgrounds which could be dominant and deserve a dedicated study. We turn to a Poissonian analysis as described in Ref. [93], where n corresponds to the number of events expected from the SM, i.e. for \((\kappa _3, \kappa _4) = (1,1)\). Upper limits on the mean value of the Poisson distribution \(\mu \) are then calculated with

$$\begin{aligned} \mu _\text {up} = \frac{1}{2} F^{-1}_{\chi ^2} \bigg [ 2 (n+1) ; \text {CL} \bigg ]\;, \end{aligned}$$
(12)

where \(F^{-1}_{\chi ^2}\) denotes the inverse of the cumulative distribution of the \(\chi ^2\) distribution, and CL is the confidence level (e.g. 95%).

The resulting bounds at \(95\%\) CL are shown in Fig 10 for different centre-of-mass energies and integrated luminosities. The plots show that the large luminosities expected to be utilised by colliders at 3 TeV and 10 TeV (as envisaged for CLIC and muon-colliders, see Refs. [94, 95]) provide significant constraining power via the triple Higgs production process for \(\kappa _4\) and \(\kappa _3\).

The lepton collider projections are compared with our results for the HL-LHC in Fig. 11. We find that the HL-LHC sensitivity for \(\kappa _4\) is competitive with the one achievable at a 1 TeV lepton collider such as the ILC. In particular the comparison shows that for negative \(\kappa _4\) the HL-LHC is expected to have a better sensitivity than a 1 TeV lepton collider.

As discussed above further developments in ML could increase both the tagging and selection efficiencies beyond our assumptions, and additional channels will provide additional information.

5 Conclusions

Our investigation of the prospects at the HL-LHC shows that even though triple-Higgs production is limited by low rates at the LHC, its exploration provides interesting information even if it does not receive additional contributions from new scalar resonances. Bounds can be placed on \(\kappa _4\) significantly beyond the theoretical constraints from perturbative unitarity.

While as expected the bounds on \(\kappa _3\) will be much weaker than the ones from double-Higgs production, they should be useful for improving the sensitivity through combinations. Additionally, if deviations from the SM are found, the correlation between the Higgs self-couplings can shed light on the possible scenarios of physics beyond the SM.

If an excess in the triple Higgs production process is observed, the correlation with the result for double-Higgs production will be immensely informative. On the one hand, if no deviation from the SM value is identified in \(\kappa _3\) from other channels, an indication for a large deviation in \(\kappa _4\) would likely imply the presence of non-linear effects that cannot be described consistently within an effective field theory approach via the expansion in terms of a heavy scale. On the other hand, a deviation in both coupling modifiers could indicate a correlation between \(\kappa _3\) and \(\kappa _4\) that can be confronted with predictions of specific models such as the 2HDM and of effective field theories.

The physics gain that can be achieved via the statistically limited channel of triple Higgs production at the HL-LHC crucially depends on an efficient signal–background discrimination. For this purpose we have employed in our analysis the use of GNNs. It is already evident from current experimental searches that such ML techniques will be the centerpiece of future studies. However, it is especially important in particle physics to be able to identify the relevant kinematical features that contribute to the identification of the signal. An unintuitive behaviour (e.g. a high-attribute quantity that is already known to be irrelevant) could indicate a possible issue in the learning framework. Alternatively, potentially interesting quantities could be identified that could provide discriminative power even in simpler analyses that do not use ML algorithms. We have explored interpretability within GNNs using IGs which satisfy necessary axioms. We have shown that, as expected, the invariant mass of bottom and tau pairs is the most important feature in the data that is utilised for discrimination. We expect that such techniques will play an important role not only for the development of analyses for BSM searches but also for further applications in particle physics.

Our comparison of the prospects at the HL-LHC with future lepton colliders shows that the sensitivity to \(\kappa _4\) at the HL-LHC should be competitive with a 1 TeV lepton collider such as ILC. While the sensitivities of lepton colliders at 3 and 10 TeV (e.g. CLIC or a possible muon-collider) are expected to be considerably higher, these results will presumably become available only on a longer time scale, such as the one for a future higher-energetic hadron collider. Thus, it can be expected that the HL-LHC will be able to establish the first bounds on \(\kappa _4\) beyond theoretical considerations.

Note added Shortly after our paper, Ref. [96] appeared on the arXiv which studies triple Higgs production in the 6b final state at HL-LHC and FCC, including coupling modifications beyond the Higgs self-couplings.