Lorentz Boost Networks: Autonomous Physics-Inspired Feature Engineering

We present a two-stage neural network architecture that enables a fully autonomous and comprehensive characterization of collision events by exclusively exploiting the four-momenta of final-state particles. We refer to the first stage of the architecture as the Lorentz Boost Network (LBN). The LBN allows the creation of particle combinations representing rest frames. The LBN also enables the formation of further composite particles, which are then transformed into said rest frames by Lorentz transformation. The properties of the composite, transformed particles are compiled in the form of characteristic variables that serve as input for a subsequent network. This second network has to be configured for a specific analysis task such as the separation of signal and background events. Using the example of the classification of ttH and tt+bb events, we compare the separation power of the LBN approach with that of domain-unspecific deep neural networks (DNN). We observe leading performance with the LBN, even though we provide the DNNs with extensive additional input information beyond the particle four-momenta. Furthermore, we demonstrate that the LBN forms physically meaningful particle combinations and autonomously generates suitable characteristic variables.

Keywords: Analysis and statistical methods; Computing (architecture, farms, GRID for recording, storage, archiving, and distribution of data); Pattern recognition, cluster finding, calibration and fitting methods
ArXiv ePrint: 1812.09722
In particle physics, characteristic features are used to identify particles, to separate signal from background processes, or to perform calibrations. These features are usually expressed in manually engineered variables based on physics reasoning. As input for machine learning methods in particle physics, these so-called high-level variables were in many cases superior to the direct use of particle four-momenta, often referred to as low-level variables. Recently, several working groups have shown that deep neural networks can perform better if they are trained with both specifically constructed variables and low-level variables [13][14][15][16][17][18][19][20][21][22][23][24][25]. This observation suggests that the networks extract additional useful information from the training data.
In this paper, we attempt to autonomize the process of finding suitable variables that describe the main characteristics of a particle physics task. Through a training process, a neural network should learn to construct these variables from raw particle information. This requires an architecture tailored towards the structure of particle collision events. We will present such an architecture here and investigate its impact on the separation power for signal and background processes in comparison to domain-unspecific deep neural networks. Furthermore, we will uncover characteristic features which are identified by the network as particularly suitable.
Particle collisions at high energies produce many particles which are often short-lived. These short-lived particles decay into particles of the final state, sometimes through a cascade of multiple decays. Intermediate particles can be reconstructed by using energy-momentum conservation when a parent particle decays into its daughter particles. By assigning to each particle a four-vector defined by the particle energy and its momentum vector, the sum of the four-vectors of the daughter particles gives the four-vector of the parent particle. For low-energy particle collisions, bubble chamber images of parents and their daughter particles were recorded in textbook quality. Evidently, here the decay angular distributions of the daughter particles are distorted by the movement of the parent particle, but can be recovered in the rest frame of the parent particle after Lorentz transformation. The same principles of particle cascades apply to high-energy particle collisions at colliders. Here, the particles of interest are, for example, top quarks or Higgs bosons, which remain invisible in the detector due to their short lifetimes, but can be reconstructed from their decay products which often form collimated particle jets. Exploiting the properties of these particles or jets of particles in appropriate rest frames is a key to the search for high-level variables characterizing a physics process. Throughout this work, the term particle is interchangeably used to describe jets as only four-momentum components are employed. We consider particles at reconstruction level, where energies and momenta are measured with limited resolution and acceptance.
For the autonomous search for sensitive variables, we propose a two-stage network, composed of a so-called Lorentz Boost Network (LBN) followed by an application-specific deep neural network (NN). The LBN takes only the four-vectors of final-state particles as input. In the LBN there are two ways of combining particles, one to create composite particles, the other to form appropriate rest frames. Using Lorentz transformations, the composite particles are then boosted into the rest frames. Thus, the characteristics of a particle decay can be accessed directly through kinematic variables evaluated in that frame.
Finally, characteristic features are derived from the boosted composite particles: masses, angles, etc. The second network stage (NN) then takes these variables as input to solve a specific problem, e.g. the separation of signal and background processes. While the first stage constitutes a novel network architecture, the latter network is interchangeable and can be adapted depending on the analysis task.
This paper is structured as follows. First, we explain the network architecture in detail. Second, we present the simulated dataset we use to investigate the performance of our architecture in comparison to typical deep neural networks. Thereafter, we review the particles, rest frames, and characteristic variables created by the network to gain insight into what is learned in the training process, before finally presenting our conclusions.

Network architecture
In this section we explain the structural concept on which the Lorentz Boost Network (LBN) is based and introduce the network architecture.
The measured final state of a collision event is typically rather complex owing to the high energies of the particles involved. Except for detector effects, all available information is encoded in the particles' four-vectors, but the comprehensive extraction of relevant properties poses a significant challenge. To this end, physicists engineer high-level variables to decipher and factorize the probability distributions inherent to the underlying physics processes. These high-level variables are often fed into machine learning algorithms to efficiently combine their descriptive power in the context of a specific research question. However, the consistent result of [13][14][15][16][17][18][19][20][21][22][23][24][25] is that the combination of both low- and high-level variables tends to provide superior performance. This observation suggests that low-level variables potentially contain additional, useful information that is absent in hand-crafted high-level variables.
The aim of the LBN is, given only low-level variables as input, to autonomously determine a comprehensive set of physics-motivated variables that maximizes the relevant information for solving the physics task in the subsequent neural network application. Figure 1 shows the proposed two-stage network architecture (LBN+NN) in detail. The first stage is the LBN and constitutes the novel contribution. It consists of several parts, namely the combination of input four-vectors to particles and rest frames, subsequent Lorentz transformations, and the extraction of suitable high-level variables. The second stage can be some form of deep neural network (NN) with an objective function depending on the specific research question.
In the LBN, the input four-vectors ($E$, $p_x$, $p_y$, $p_z$) are combined in two independent ways before each of the combined particles is boosted into its particular rest frame, which is formed from a different particle combination. The boosted particles are characterized by variables which can be, e.g., invariant masses, transverse momenta, pseudorapidities, and angular distances between them. These features serve as input to the second network designed to accomplish a particular analysis task.
The LBN combines N input four-vectors, consisting of energies $E$ and momentum components $p_x$, $p_y$, $p_z$, to create M particles and M corresponding rest frames according to weighted sums using trainable weights. Through Lorentz transformation, each combined particle is boosted into its dedicated rest frame. After that, a generic set of features is extracted from the properties of these M boosted particles.
Examples of variables that can be reconstructed with this structure include spin-dependent angular distributions, such as those observed during the decay of a top quark with subsequent leptonic decay of the W boson. By boosting the charged lepton into the rest frame of the W boson, its decay angular distribution can be investigated. In more sophisticated scenarios, the LBN is also capable of accessing further properties that rely on the characteristics of two different rest frames. An example is a variable usually referred to as cos(θ*), which is defined by the angular difference between the direction of the charged lepton in the W boson's rest frame and the direction of the W boson in the rest frame of the top quark. The procedure of how this variable can be reconstructed in the LBN is depicted in figure 2.

JINST 14 P06006

Figure 2. Example of a possible feature engineering in top quark decays addressing the angular distance between the direction of the W boson in the top quark rest frame and the direction of the lepton in the W boson rest frame, commonly referred to as cos(θ*). Diagram labels: Input vectors, Particles, Rest frames, Boosted particles, Features, Trainable.
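To make the definition of cos(θ*) concrete, the following minimal sketch evaluates it for a single decay chain with NumPy. The function names `boost` and `cos_theta_star` and the toy four-vectors are hypothetical and only serve as an illustration; they are not taken from the LBN implementation.

```python
import numpy as np

def boost(p, frame):
    """Boost four-vector p = (E, px, py, pz) into the rest frame of `frame`."""
    p = np.asarray(p, dtype=float)
    frame = np.asarray(frame, dtype=float)
    beta = frame[1:] / frame[0]            # velocity vector of the frame
    beta2 = beta @ beta
    if beta2 == 0.0:                       # frame is already at rest
        return p.copy()
    gamma = 1.0 / np.sqrt(1.0 - beta2)
    bp = beta @ p[1:]
    energy = gamma * (p[0] - bp)
    momentum = p[1:] + ((gamma - 1.0) * bp / beta2 - gamma * p[0]) * beta
    return np.concatenate([[energy], momentum])

def cos_theta_star(lepton, w_boson, top):
    """Cosine of the angle between the lepton direction in the W rest frame
    and the W direction in the top quark rest frame."""
    lep_dir = boost(lepton, w_boson)[1:]
    w_dir = boost(w_boson, top)[1:]
    norm = np.linalg.norm(lep_dir) * np.linalg.norm(w_dir)
    return float(lep_dir @ w_dir / norm)

# Toy configuration (hypothetical numbers): top at rest, W boson along +z.
m_w, p_w = 80.4, 50.0
top = [173.0, 0.0, 0.0, 0.0]
w_boson = [np.hypot(m_w, p_w), 0.0, 0.0, p_w]
lepton_fwd = [30.0, 0.0, 0.0, 30.0]    # massless lepton along the W direction
lepton_bwd = [30.0, 0.0, 0.0, -30.0]   # massless lepton against the W direction

print(cos_theta_star(lepton_fwd, w_boson, top))
print(cos_theta_star(lepton_bwd, w_boson, top))
```

In the LBN, the two boosts are not hard-coded in this way; the network has to learn which input four-vectors to combine into the lepton, the W boson, and the top quark.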
The number N of input particles is to be chosen according to the research question. The number of matching combinations M is a hyperparameter to be adjusted. In this paper we introduce a specific version of the LBN which constructs M particle combinations, and each combination will have its own suitable rest system. Other variants are conceivable and will be mentioned below.
In the following paragraphs we describe the network architecture in detail.
Combinations. The purpose of the first LBN layer is to construct arbitrary particles and suitable rest frames for subsequent boosting. The construction is realized via linear combinations of the N input four-vectors to a number of M particles and rest frames, which are represented by four-vectors accordingly. Here, M is a hyperparameter of the LBN and its choice is related to the respective physics application. The coefficients W of all linear combinations C,

$$C_m = \sum_{n=1}^{N} W_{mn} \, x_n ,$$

with m ∈ [1, M] and $x_n$ denoting the n-th input four-vector, are free parameters and subject to optimization within the scope of the training process. In the following, combined particles and rest frames are referred to as $C^P$ and $C^R$, respectively. Taking both into consideration, this amounts to a total of 2 · N · M degrees of freedom in the LBN. In order to prevent the construction of objects which would lead to unphysical implications when applying Lorentz transformations, i.e., four-vectors not fulfilling E > m > 0, all parameters $W_{mn}$ are restricted to positive values. We initialize the weights randomly according to a half-normal distribution with mean 0 and standard deviation 1/M. It should be noted that, in order to maintain essential physical properties and relations between input four-vectors, feature normalization is not applied at this point.
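The combination step can be illustrated with a short NumPy sketch. It is a simplified stand-in under stated assumptions, not the authors' implementation: the helper names `init_weights`, `combine`, and `mass` are hypothetical, the weights are fixed random numbers rather than trained parameters, and the half-normal initialization is emulated by taking the absolute value of normally distributed numbers with scale 1/M.

```python
import numpy as np

def init_weights(n_inputs, n_combinations, rng):
    """Half-normal initialization: |N(0, 1/M)|, so that all weights are >= 0."""
    sigma = 1.0 / n_combinations
    return np.abs(rng.normal(0.0, sigma, size=(n_combinations, n_inputs)))

def combine(weights, four_vectors):
    """Weighted sums of the input four-vectors.

    weights:      (M, N) non-negative coefficients W_mn
    four_vectors: (batch, N, 4) inputs ordered as (E, px, py, pz)
    returns:      (batch, M, 4) combined four-vectors
    """
    return np.einsum("mn,bnk->bmk", weights, four_vectors)

def mass(four_vectors):
    """Invariant mass, clipped at zero to absorb rounding noise."""
    e, p = four_vectors[..., 0], four_vectors[..., 1:]
    return np.sqrt(np.maximum(e**2 - np.sum(p**2, axis=-1), 0.0))

rng = np.random.default_rng(0)
N, M = 8, 13

# Toy inputs (hypothetical): random momenta with a mass of 5 so that E > |p|.
p = rng.normal(0.0, 50.0, size=(4, N, 3))
E = np.sqrt(np.sum(p**2, axis=-1) + 5.0**2)
x = np.concatenate([E[..., None], p], axis=-1)

W_particles = init_weights(N, M, rng)    # would be trainable in the network
W_restframes = init_weights(N, M, rng)   # independent second set of weights
C_P = combine(W_particles, x)            # combined particles
C_R = combine(W_restframes, x)           # combined rest frames
```

Because all weights are non-negative and every input satisfies E > |p|, each combination is again a four-vector with E > m > 0, which is the property the positivity constraint is meant to protect.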
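Lorentz transformation. The boosting layer performs a Lorentz transformation of the combined particles into their associated rest frames. The generic transformation of a four-vector q is defined as $q^* = \Lambda \cdot q$ with the boost matrix

$$\Lambda = \begin{pmatrix} \gamma & -\gamma\beta n_x & -\gamma\beta n_y & -\gamma\beta n_z \\ -\gamma\beta n_x & 1 + (\gamma - 1) n_x^2 & (\gamma - 1) n_x n_y & (\gamma - 1) n_x n_z \\ -\gamma\beta n_y & (\gamma - 1) n_y n_x & 1 + (\gamma - 1) n_y^2 & (\gamma - 1) n_y n_z \\ -\gamma\beta n_z & (\gamma - 1) n_z n_x & (\gamma - 1) n_z n_y & 1 + (\gamma - 1) n_z^2 \end{pmatrix} . \tag{2.3}$$

The relativistic parameters $\gamma = E/m$, $\vec{\beta} = \vec{p}/E$, and $\vec{n} = \vec{\beta}/\beta$ are to be derived per rest-frame four-vector $C^R_m$. The technical implementation within deep learning algorithms requires a vectorized representation of the Lorentz transformation. To this end, we rewrite the boost matrix in equation (2.3) as

$$\Lambda = I + (U \oplus \gamma) \odot \big( (U \oplus 1) \cdot \beta - U \big) \odot (e \cdot e^T)$$

with

$$U = \begin{pmatrix} -1^{1 \times 1} & 0^{1 \times 3} \\ 0^{3 \times 1} & -1^{3 \times 3} \end{pmatrix} , \qquad e = \begin{pmatrix} 1^{1 \times 1} \\ -\vec{n}^{\,3 \times 1} \end{pmatrix} ,$$

and the 4 × 4 unit matrix I. The operators ⊕ and ⊙ denote elementwise addition and multiplication, respectively. This notation also allows for the extension by additional dimensions to account for the number of combinations M and an arbitrary batch size. As a result, all boosted four-vectors B are efficiently computed by a single, broadcasted matrix multiplication, $B = \Lambda^R \cdot C^P$.
While the description above focuses on a pairwise mapping approach, i.e., particle $C^P_m$ is boosted into rest frame $C^R_m$, other configurations are conceivable:
• The input four-vectors are combined to build M particles and K rest frames. Each combined particle is transformed into all rest frames, resulting in K · M boosted four-vectors.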
• The input four-vectors are combined only to build M particles, which simultaneously serve as rest frames. Each combined particle is transformed into all rest frames derived from the other particles, resulting in $M^2 - M$ boosted four-vectors.
The specific advantages of these configurations can, in general, depend on aspects of the respective physics application. Results of these variants will be the target of future investigations.
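A vectorized boost of the pairwise kind described above can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the reference implementation: the function names are hypothetical, and the helper matrix `U` and the vector `e` are written out here in one form that reproduces the standard boost matrix, which should be treated as an assumption of this sketch.

```python
import numpy as np

def boost_matrices(rest_frames):
    """Boost matrices for rest-frame four-vectors (E, px, py, pz), shape (..., 4).

    Assumes massive frames with non-zero momentum (m > 0 and |p| > 0).
    """
    E = rest_frames[..., 0]
    p = rest_frames[..., 1:]
    p_abs = np.linalg.norm(p, axis=-1)
    m = np.sqrt(E**2 - p_abs**2)
    gamma = (E / m)[..., None, None]       # broadcast against 4 x 4 matrices
    beta = (p_abs / E)[..., None, None]
    n = p / p_abs[..., None]               # unit vector of the boost direction

    I = np.eye(4)
    U = -np.ones((4, 4))                   # helper matrix (assumption, see text)
    U[0, 1:] = 0.0
    U[1:, 0] = 0.0
    e = np.concatenate([np.ones_like(E)[..., None], -n], axis=-1)[..., None]
    e_eT = e @ np.swapaxes(e, -1, -2)      # outer products, shape (..., 4, 4)

    # Elementwise sums and products, evaluated for all frames at once.
    return I + (U + gamma) * ((U + 1.0) * beta - U) * e_eT

def boost(particles, rest_frames):
    """Boost particles[..., m, :] into rest_frames[..., m, :] (pairwise mapping)."""
    return np.einsum("...ij,...j->...i", boost_matrices(rest_frames), particles)

rng = np.random.default_rng(1)

def toy_vectors(shape, mass):
    """Random four-vectors with the given mass (hypothetical toy data)."""
    p = rng.normal(0.0, 10.0, size=shape + (3,))
    E = np.sqrt(np.sum(p**2, axis=-1) + mass**2)
    return np.concatenate([E[..., None], p], axis=-1)

C_R = toy_vectors((5, 13), mass=2.0)   # 5 events, M = 13 rest frames
C_P = toy_vectors((5, 13), mass=1.0)   # 5 events, M = 13 combined particles
B = boost(C_P, C_R)                    # boosted particles, shape (5, 13, 4)
```

Two properties serve as quick sanity checks: the invariant mass of each particle is unchanged by the boost, and boosting a rest frame into itself yields (m, 0, 0, 0).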
Feature extraction. Following the boosting layer, features are extracted from the Lorentz transformed four-vectors, which can then be utilized in a subsequent neural network. For the projection of M × 4 particle properties into F × 1 features, we employ a distinct yet customizable set of generic mappings. The autonomy of the network is not about finding entirely new features, but rather about factorizing probability densities in the most effective way possible to answer a scientific question.

Therefore, we let the LBN work autonomously to find suitable particle combinations and rest frames which enable this factorization, but then use established particle characterizations. We differentiate between two types of generic mappings:
1. Single features are extracted per boosted four-vector. Besides the vector elements ($E$, $p_x$, $p_y$, $p_z$) themselves, features such as mass, transverse and longitudinal momentum, pseudorapidity, and azimuth can be derived.
2. Pairwise features are extracted for all pairs of boosted four-vectors. Examples are the cosine of their spatial angular difference, their distance in the η-φ plane, or their distance in Minkowski space. In contrast to single features, pairwise features introduce close connections among boosted four-vectors and, by means of backpropagation, between trainable combinations of particles and rest frames.
The set of extracted features is concatenated to a single output vector. Provided that the employed batch size is sufficiently large, batch normalization with floating averages adjusted during training can be performed after this layer [26].
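The two types of mappings can be sketched with NumPy as shown below. The concrete feature choice (E, pT, η, φ, m per vector plus the cosine of the opening angle per pair) and all function names are assumptions made for this illustration; the set of features is described in the text as customizable.

```python
import numpy as np

def single_features(b):
    """Per-vector features (E, pT, eta, phi, m) for b of shape (batch, M, 4).

    Assumes pT > 0 so that the pseudorapidity is finite.
    """
    E, px, py, pz = (b[..., i] for i in range(4))
    pt = np.hypot(px, py)
    p = np.hypot(pt, pz)
    eta = np.arctanh(pz / p)                     # pseudorapidity
    phi = np.arctan2(py, px)                     # azimuth
    m = np.sqrt(np.maximum(E**2 - p**2, 0.0))    # clip rounding noise
    return np.stack([E, pt, eta, phi, m], axis=-1)

def pairwise_cosines(b):
    """Cosine of the spatial angle for all M(M-1)/2 pairs of vectors."""
    p = b[..., 1:]
    unit = p / np.linalg.norm(p, axis=-1, keepdims=True)
    cos = unit @ np.swapaxes(unit, -1, -2)       # (batch, M, M)
    i, j = np.triu_indices(b.shape[-2], k=1)     # each pair once
    return cos[..., i, j]

def extract_features(b):
    """Concatenate single and pairwise features to one vector per event."""
    batch = b.shape[0]
    single = single_features(b).reshape(batch, -1)
    return np.concatenate([single, pairwise_cosines(b)], axis=-1)

# One event with two boosted four-vectors (hypothetical numbers).
b_simple = np.array([[[5.0, 3.0, 0.0, 0.0],
                      [5.0, 0.0, 4.0, 0.0]]])
features = extract_features(b_simple)            # shape (1, 2 * 5 + 1)
```

With this particular choice the output has 5 · M + M(M − 1)/2 entries per event, for example 143 values for M = 13.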

Subsequent problem-specific application.
In the previous sections, we described the LBN as shown in the left box in figure 1, namely how input vectors are combined and boosted into dedicated rest frames, followed by how features of these transformed four-vectors are compiled. These features are intended to maximize the information content to be exploited in a subsequent, problem-specific deep neural network NN as shown in the right box in figure 1.
The objective function of the NN defines the type of learning process, such as a regression or classification task. Weight updates are performed as usual through backpropagation. These updates apply to the weights of the subsequent network as well as to the trainable weights of the combination layer in the LBN.
In the following, we evaluate how well the autonomous feature engineering of the compound LBN+NN network operates. We compare its performance with only low-level information to that of typical, fully-connected deep neural networks (DNNs). We alternatively supply the DNNs with low-level information, sophisticated high-level variables (see appendix A), and the combination of both.

Simulated datasets
The Pythia 8.2.26 program package [27] was used to simulate ttH and tt+bb events. Examples of corresponding Feynman diagrams are shown in figure 3. The matrix elements contain angular correlations of the decay products from heavy resonances. Beam conditions correspond to LHC proton-proton collisions at √s = 13 TeV. Only the dominant gluon-gluon process is enabled and Higgs boson decays into bottom quark pairs are favored. Hadronization is performed with the Lund string fragmentation model.
[...] quark pair production with an additional bottom quark pair induced by the splitting of a gluon. Despite differences caused by the spin and color-charge of the gluon, the process is very similar to the ttH (H → bb) signal process in terms of final state content and kinematic topology.
The theoretical description of the production mechanisms of these tt+hf contributions is ambiguous. For example, processes with additional bottom quarks can be modeled via gg → ttg matrix element calculations in (N)LO with collinear g → bb splittings in the parton shower, gb → ttb production using 4FS and 5FS calculations, or solely via collinear gluon splittings in the parton shower [29][30][31][87]. Currently neither higher-order theoretical calculations nor experimental methods can constrain expected rates with an accuracy better than 35% [28][29][30][31].
However, a clear and robust process definition is required in the scope of this analysis for two particular reasons. First, production processes might be subject to different theoretical uncertainties which must be properly assigned to establish a reasonable modeling of recorded collision data with simulated events. Second, common analysis strategies are based on the separation of signal and background contributions to define specific signal- and background-enriched phase space regions in which the measurement is performed. In the presented analysis, the most relevant background contributions are also separated from each other (cf. Sections 4 and 7). In case of inadequate process definitions, the efficiency of this separation approach is impaired which in turn reduces the measurement sensitivity.
The definition of the particular tt subprocesses is prescribed in the following. After the simulation of events using matrix element generators, parton showering, and phenomenological hadronization processes (cf. Section 2.2.1), final state particles are clustered into jets using the anti-$k_T$ algorithm with a distance parameter of ∆R = 0.4. If a jet can be associated to the hard scattering process, i.e., if it originates from the top quark or hadronic W boson decays, or if it exhibits a transverse momentum below 20 GeV, it is rejected in subsequent considerations. The clustering history of the remaining jets is traversed backwards up to the hadronization stage and comprises, in particular, the decays of unstable hadrons. Based on this, five tt subprocesses are defined by counting the numbers of additional jets and contained b and c hadrons.
• "tt+bb": The event has at least two additional jets which each contain at least one b hadron.
To analyze a typical final state as observed in an LHC detector, we use the Delphes package [28]. Delphes provides a modular framework designed to parameterize the simulation of a multi-purpose detector. In our study, we chose to simulate the CMS detector response. All important effects such as pile-up, deflection of charged particles in magnetic fields, electromagnetic and hadronic calorimetry, and muon detection systems are covered. The simulated output consists of muons, electrons, tau leptons, photons, jets, and missing transverse momentum from a particle flow algorithm. As the neutrino leaves the detector without interacting, we identify its transverse momentum components from the measurement of missing transverse energy, and we reconstruct its longitudinal momentum by constraining the mass of the leptonically decaying W boson to 80.4 GeV.
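The W mass constraint leads to a quadratic equation for the longitudinal neutrino momentum, which the following sketch solves for massless leptons. The text does not state how the two solutions or a negative discriminant are handled, so the choices made here (dropping the imaginary part and taking the solution with the smaller absolute value) are assumptions that follow a common convention, and the function names are hypothetical.

```python
import math

M_W = 80.4  # GeV, W boson mass used as constraint

def neutrino_pz_solutions(lep, met_x, met_y, m_w=M_W):
    """Both solutions for the neutrino pz.

    lep:          (px, py, pz) of the charged lepton, treated as massless
    met_x, met_y: missing transverse momentum, taken as the neutrino pT
    """
    lx, ly, lz = lep
    lep_e = math.sqrt(lx**2 + ly**2 + lz**2)
    lep_pt2 = lx**2 + ly**2
    met2 = met_x**2 + met_y**2
    mu = 0.5 * m_w**2 + lx * met_x + ly * met_y
    disc = mu**2 - lep_pt2 * met2
    # Assumption: for a negative discriminant only the real part is kept.
    root = lep_e * math.sqrt(max(disc, 0.0))
    a = mu * lz / lep_pt2
    return (a + root / lep_pt2, a - root / lep_pt2)

def neutrino_pz(lep, met_x, met_y, m_w=M_W):
    """Assumption: choose the solution with the smaller |pz|."""
    return min(neutrino_pz_solutions(lep, met_x, met_y, m_w), key=abs)

# Hypothetical lepton momentum and missing transverse momentum in GeV.
pz = neutrino_pz((30.0, 10.0, 20.0), -20.0, 25.0)
print(pz)
```

Both solutions reproduce the imposed W mass whenever the discriminant is non-negative; which of them is closer to the true neutrino momentum cannot be decided from the constraint alone.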
Following the detector simulation, an event selection with the following criteria was carried out: events must have at least six jets clustered with the anti-$k_T$ algorithm, implemented in the FastJet package [29], with a radius parameter of ∆R = 0.4. A jet is accepted if its transverse momentum fulfills $p_T$ > 25 GeV and its absolute pseudorapidity is within |η| < 2.4. Furthermore, exactly one electron or muon with $p_T$ > 20 GeV and |η| < 2.1 is required. Events with further leptons are rejected to focus solely on semi-leptonic decays of the tt system. Finally, a matching is performed between the final-state partons of the generator and the jets of the detector simulation with the maximum distance ∆R = 0.3, whereby the matching must be unambiguously successful for all partons. For ttH events, the combined selection efficiency was measured as 2.3%.
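The kinematic part of this selection can be written as a small predicate. The sketch below is a simplified illustration: the `Candidate` container and the function name are hypothetical, additional leptons are counted only if they pass the same thresholds (an assumption, since the text does not specify the veto criteria), and the parton-to-jet matching step is omitted.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    kind: str    # "jet", "electron" or "muon"
    pt: float    # transverse momentum in GeV
    eta: float   # pseudorapidity

def passes_selection(candidates):
    """At least six accepted jets and exactly one accepted electron or muon."""
    jets = [c for c in candidates
            if c.kind == "jet" and c.pt > 25.0 and abs(c.eta) < 2.4]
    leptons = [c for c in candidates
               if c.kind in ("electron", "muon") and c.pt > 20.0 and abs(c.eta) < 2.1]
    return len(jets) >= 6 and len(leptons) == 1

# Hypothetical events for illustration.
six_jets = [Candidate("jet", 40.0 + i, 0.3 * i) for i in range(6)]
good_event = six_jets + [Candidate("muon", 35.0, -1.0)]
dilepton_event = good_event + [Candidate("electron", 28.0, 0.5)]
soft_jet_event = six_jets[:5] + [Candidate("jet", 22.0, 0.1), Candidate("muon", 35.0, -1.0)]

print(passes_selection(good_event))
```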
For tt+bb processes, further measures are taken to identify the two additional bottom quark jets. The definition is an adaptation of [25]. At generator level, the anti-$k_T$ algorithm with a radius parameter of ∆R = 0.4 is used to cluster all stable final-state particles. If a generator jet is found to originate from one of the top quarks or W boson decays, or if its transverse momentum is below a threshold of 20 GeV, it is excluded from further considerations. For each remaining jet, we then count the number of contained bottom hadrons using the available hadronization and decay history. An event is a tt+bb candidate if at least two generator jets contain one or more distinct bottom hadrons. Finally, a matching is performed between the four quarks resulting from the tt decay and the two identified generator jets on the one hand, and the selected, reconstructed jets on the other hand. Similar to ttH, a tt+bb event is accepted if all six generator-level objects are unambiguously matched to a selected jet. This leaves a fraction of 0.02% of the generated tt+bb events.
A total of $10^6$ events remain, consisting of an evenly distributed number of ttH and tt+bb events, which are then divided into training and validation datasets at a ratio of 80 : 20.
For further analysis, a distinct naming scheme for reconstructed jets is introduced that is inspired by the semi-leptonic decay characteristics of the tt system. The two light quark jets of the hadronically decaying W boson are named $q_1$ and $q_2$, where the index 1 refers to the jet with the greater transverse momentum. The bottom jet that is associated to this W boson within a top quark decay is referred to as $b_{had}$. Accordingly, the bottom quark jet related to the leptonically decaying W boson is referred to as $b_{lep}$. The remaining jets are named $b_1$ and $b_2$, and transparently refer to the decay products of the Higgs boson for ttH, or to the additional b quark jets for tt+bb events.
In figure 4 we show, by way of example, distributions of the generated ttH and tt+bb events. In figure 4a we compare the transverse momentum of the jet $b_1$, and in figure 4b the largest difference in pseudorapidity between a jet and the charged lepton. In figure 4c we show the invariant mass of the closest jet pair, and in figure 4d the event sphericity.

Benchmarks of network performance
As a benchmark, the LBN is utilized in the classification task of distinguishing a signal process from a background process. In this example, we use the production of a top quark pair in association with a Higgs boson decaying into bottom quarks (ttH) as a signal process, and top quark pairs produced with two additional bottom jets (tt+bb) as a background process. In both processes, the final state consists of eight four-vectors, four of which represent bottom quark jets, two describe light quark jets, one denotes a charged lepton, and one a neutrino. Within the benchmark, the LBN competes against other frequently used deep neural network setups, which we refer to as DNN here. For a meaningful comparison, we perform extensive searches for the optimal hyperparameters in each setup. As explained above, the LBN consists of only very few parameters to be trained in addition to those of the following neural network (LBN+NN). Still, by varying the number M of particle combinations to be constructed from the eight input four-vectors, and by varying the number F and types of generated features, we performed a total of 346 training runs of the LBN+NN setup, and a similar number for the competing DNN. The best performing architectures are listed in appendix A.
The LBN exclusively receives the four-vectors of the eight particles introduced above, which shall be denoted by 'LBN-low' in the following. For the networks marked with DNN we use three variants of inputs. In the first variant, the DNN receives only the four-vectors of the eight particles ('DNN-low'). For the second variant, physics knowledge is applied in the form of 26 sophisticated high-level variables which are inspired by [25] and incorporate comprehensive event information ('DNN-high'). A list of these variables, such as the event sphericity [30] and Fox-Wolfram moments [31], can be found in appendix A. In the third variant, we provide the DNNs with both low-level and high-level variables as input ('DNN-combined').
In the first set of our benchmark tests, the task of input particle identification is bypassed by using the generator information in order to exclusively measure the performance of the LBN. Jets associated to certain quarks through matching are consistently placed at the same position of the N input four-vectors. We quantify how well the event classification of the networks performs with the integral of the receiver operating characteristic curve (ROC AUC). It summarizes the ability of the binary classifier to maximize the rate of true positive events while simultaneously minimizing the rate of false positive events. Figure 5a shows the performance of all training runs involved in the hyperparameter search. The best training is marked with a horizontal line, while the distribution of results is denoted by the orange and blue shapes which can be read as vertical histograms that manifestly depend on the structure of the scanned hyperparameter space. For the LBN, the training runs achieve stable results since there are only minor variations in the resulting ROC AUC.
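The ROC AUC has a simple probabilistic reading: it equals the probability that a randomly chosen signal event receives a higher classifier output than a randomly chosen background event, with ties counted as one half. The following sketch computes it directly from this definition; the function name and the example scores are hypothetical, and the quadratic loop is chosen for clarity rather than speed.

```python
def roc_auc(signal_scores, background_scores):
    """ROC AUC as the fraction of correctly ordered signal/background pairs."""
    wins = 0.0
    for s in signal_scores:
        for b in background_scores:
            if s > b:
                wins += 1.0
            elif s == b:
                wins += 0.5   # ties count as half
    return wins / (len(signal_scores) * len(background_scores))

# Hypothetical classifier outputs.
signal = [0.9, 0.8, 0.4]
background = [0.7, 0.3, 0.2]
auc = roc_auc(signal, background)   # 8 of 9 pairs are ordered correctly
```

A value of 0.5 corresponds to random guessing and a value of 1.0 to perfect separation.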
Also shown are the training runs of the three input variants for the DNNs. If the eight input four-vectors are ordered according to the generator information, the 'DNN-low' setup already achieves results equivalent to the combination of low-level and high-level variables ('DNN-combined'). However, both setups are unable to match the 'LBN-low' result. Note that high-level variables are mostly independent of the input ordering by construction, and hence contain reduced information compared to the ordered four-vectors. Consequently, the weaker performance of the 'DNN-high' setup is well understood.
In the second set of our benchmark test, we disregard generator information and instead sort the six jets according to their transverse momenta, followed by the charged lepton and the neutrino. This order can be easily established in measured data. Figure 5b shows that this alternative order consistently reduces the ROC AUC score of all networks.
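The ordering used in this second benchmark is straightforward to establish from measured quantities, as the following sketch shows. The function name and the toy four-vectors are hypothetical; four-vectors are assumed to be given as (E, px, py, pz) tuples.

```python
import math

def order_inputs(jets, lepton, neutrino):
    """Return the eight inputs: jets by descending pT, then lepton, then neutrino."""
    def pt(v):
        return math.hypot(v[1], v[2])
    return sorted(jets, key=pt, reverse=True) + [lepton, neutrino]

# Hypothetical four-vectors (E, px, py, pz) in GeV.
jets = [
    (60.0, 30.0, 0.0, 50.0),
    (120.0, 0.0, 100.0, 60.0),
    (45.0, -40.0, 5.0, 10.0),
    (90.0, 50.0, 50.0, 40.0),
    (35.0, 0.0, -26.0, 20.0),
    (80.0, -60.0, 30.0, 40.0),
]
lepton = (40.0, 25.0, -20.0, 24.0)
neutrino = (55.0, -30.0, 35.0, 30.0)

inputs = order_inputs(jets, lepton, neutrino)
```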
The training runs of the LBN again show stable results with few exceptions and achieve the best performance in separating events from signal and background processes. For the DNNs, the above-mentioned hierarchy emerges, i.e., the best results are achieved using the combination of high- and low-level variables, followed by only high-level variables with reasonably good results, whereas the DNN exhibits the weakest performance with only low-level information.
Overall, the performance of the compound LBN+NN structure based on the four-vector inputs is quite convincing in this demanding benchmark test.

Visualizations of network predictions
Several components of the LBN architecture have a direct physics interpretation. Their investigation can provide insights into what the network is learning. In the following, we investigate only the best network trained on the generator-ordered input in order to establish which of the quark and lepton four-vectors are combined.
The number of particle combinations to create is a hyperparameter of the architecture. For the generator ordering, we obtain the best results for 13 particles. This matches well with our intuition since the Feynman diagram also contains 13 particles in the cascade. In total we extract F = 143 features in the LBN that serve as input to the subsequent deep neural network (figure 1). For details, refer to appendix A.
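The feature count can be reproduced with a short calculation. The sketch assumes that the five single-particle quantities listed in appendix A (E, p_T, η, φ, m) are extracted for each of the 13 combined particles, and that one pairwise cosine is extracted for each unordered pair of particles. This decomposition is our reading of the setup; it reproduces the stated F = 143.

```python
def lbn_feature_count(n_particles, n_single_features=5):
    """Number of extracted features under the assumed decomposition.

    n_single_features per particle (E, pT, eta, phi, m) plus one
    pairwise cosine for each unordered pair of particles.
    """
    n_pairs = n_particles * (n_particles - 1) // 2
    return n_single_features * n_particles + n_pairs


# 5 * 13 + 78 = 143
n_features = lbn_feature_count(13)
```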
In our particular setup, each combined particle has a separate corresponding rest frame. Because the second network is required to solve a specific research question, the decisions about which four-vectors are combined into the composite particle and which into its rest frame are strongly correlated. A correlation also exists among all 13 systems of combined particles and their rest frames through the pairwise angular distances exploited by the LBN as additional properties.
We start by looking at the weights used to combine the input four-vectors to form the particle combinations and rest frames.
Weight matrices of particle and rest-frame combinations. Figure 6 shows the weight matrices of the LBN. Each column of the two matrices for combined particles (red) and rest frames (green) is normalized so that the weight of each input four-vector (bottom quark jet b_1 of the Higgs, etc.) is shown as a percentage. The color code reflects the sum of the particle weights for the Higgs boson (or bb system for tt+bb) and the two top quarks, respectively.
Figure 6. Particle combinations (red) and their rest frames (green). The numbers represent the relative weights of the input particles for each combination i. The color code reflects the sum of the particle weights for the Higgs boson (ttH), the bb system (tt+bb), or the two top quarks. For better clarity of the presentation, the zero weights are kept white.
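The column normalization used for the presentation in figure 6, and the weighted combination of input four-vectors that the weights encode, can be sketched as follows. The sketch assumes non-negative weights with a non-zero sum in every column, a weight layout `weights[n][m]` (input four-vector n, combination m), and four-vectors stored as (E, px, py, pz). The function names are hypothetical and do not refer to the published implementation.

```python
def column_percentages(weights):
    """Normalize each column of weights[n][m] so that it sums to 100.

    Assumes non-negative weights and a non-zero sum in every column.
    """
    n_in, n_comb = len(weights), len(weights[0])
    col_sums = [sum(weights[n][m] for n in range(n_in)) for m in range(n_comb)]
    return [
        [100.0 * weights[n][m] / col_sums[m] for m in range(n_comb)]
        for n in range(n_in)
    ]


def combine(four_vectors, weights):
    """Weighted sums of the input four-vectors, one result per column."""
    n_in, n_comb = len(weights), len(weights[0])
    return [
        tuple(
            sum(weights[n][m] * four_vectors[n][k] for n in range(n_in))
            for k in range(4)
        )
        for m in range(n_comb)
    ]


# Two hypothetical inputs and two combinations.
w = [[1.0, 0.0], [3.0, 2.0]]
vectors = [(10.0, 1.0, 0.0, 0.0), (20.0, 0.0, 2.0, 0.0)]
percent = column_percentages(w)
combined = combine(vectors, w)
```

Because the combinations are weighted sums, quantities derived from them, such as the invariant mass, are in arbitrary units rather than GeV.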
Note that combined particles and rest frames are complementary. Extreme examples of this can be seen in columns 2 and 6, in which one of the two bottom quark jets from the Higgs boson decay was selected as the rest frame and a broad combination of all other four-vectors was formed whose properties are exploited for the event classification.
Conversely, in column 0 a Higgs-like combination is transformed into the rest frame of all particles, to which the hadronic top quark makes the main contribution. Similarly, in columns 1 and 7, top quark-like combinations are formed and also transformed into rest frames formed by many four-vectors. These types of combinations allow for the Lorentz boost of the center-of-mass system of the scattering process to be nearly compensated.
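Transforming a combined particle into a rest frame amounts to a Lorentz boost with the velocity of the rest-frame four-vector. The following is a minimal scalar sketch of such a boost using the standard textbook formula. It is not the vectorized implementation of the LBN, and it assumes four-vectors stored as (E, px, py, pz) and a time-like rest-frame four-vector (|p| < E). The function name is hypothetical.

```python
import math


def boost_to_rest_frame(particle, frame):
    """Boost `particle` into the rest frame of `frame`.

    Both arguments are four-vectors (E, px, py, pz); `frame` must be
    time-like so that its velocity is below the speed of light.
    """
    E, px, py, pz = particle
    Ef, fx, fy, fz = frame
    bx, by, bz = fx / Ef, fy / Ef, fz / Ef  # velocity of the frame
    b2 = bx * bx + by * by + bz * bz
    if b2 == 0.0:
        return particle  # frame is already at rest
    gamma = 1.0 / math.sqrt(1.0 - b2)
    bp = bx * px + by * py + bz * pz  # beta . p
    coeff = (gamma - 1.0) * bp / b2 - gamma * E
    return (
        gamma * (E - bp),
        px + coeff * bx,
        py + coeff * by,
        pz + coeff * bz,
    )


# Boosting a four-vector into its own rest frame leaves only its mass.
at_rest = boost_to_rest_frame((5.0, 3.0, 0.0, 0.0), (5.0, 3.0, 0.0, 0.0))
```

A useful check of any such implementation is that the invariant mass of the boosted particle is unchanged and that the rest-frame four-vector itself is mapped to (m, 0, 0, 0).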
It is striking that four-vector combinations forming a top quark are often selected. In figure 7, we quantitatively assess the 13 possible combinations of the combined particles (red) and the 13 rest frames (green) in order to determine which combinations are typically formed between the eight input four-vectors. In order to build rest frames (green), the LBN recognizes combinations leading to the top quarks and treats the bottom quark jets from the Higgs individually. This appears to be an appropriate choice as the top quark pair is the same in tt+bb and ttH events while the two bottom quarks typically originate from gluon splitting or the Higgs decay, respectively. For the hadronic top quark, the W boson (75%) is often accounted for as well. With the leptonic top, the LBN considers the bottom quark jet and the lepton (91%). The lepton and neutrino are related to form the W (68%).
To form the particle combinations (red), the LBN combines the light quark jets into the W boson (78%) and the hadronic top quark. In the leptonic top quark, the LBN combines bottom quark jets and leptons (61%). Sometimes the lepton and the neutrino are also correlated to build the W boson (22%). The LBN rarely forms the Higgs boson combination (−4%).
The positive and negative correlations show that the LBN forms physics-motivated particle combinations from the training data. In the next step, we investigate the usefulness of these combinations for the separation of ttH and tt+bb events.
Distributions resulting from combined particle properties. The performance of particle combinations and their transformations into reference frames in the LBN is evaluated through the extracted features which support the classification of signal and background events. Here, we present the distributions of different variables calculated from the combined particles.
Figure 8a shows the invariant mass m distributions of all 13 combined particles for the ttH signal dataset. Figure 8b shows the mass distributions for the tt+bb background dataset. The difference between the two histograms is shown in figure 8c, where combined particle 0 of the first column shows the largest difference between signal and background. Figure 6 shows that this combination 0 approximates a Higgs-like configuration (red). For this Higgs-like configuration (combined particle 0), we show mass distributions in figure 9a. For ttH, the invariant mass exhibits a narrow distribution around m = 22 a.u., reflecting the Higgs-like combined particle, while the tt+bb background is widely distributed. Note that because of the weighted four-vector combinations, arbitrary units are used, so that the Higgs boson mass is not stated in GeV.
Analogously, we determine the transverse momentum p_T and pseudorapidity η distributions of the 13 particles for signal and background events, and then compute their difference to visualize distinctions between ttH and tt+bb. While the invariant mass distribution can also be formed without Lorentz transformation, both the p_T and η distributions are determined in the respective rest frames. Clear differences in the distributions are apparent in figure 10a for the transverse momenta of the combined particles 11, 8, 5, 10 (in descending order of importance), while the combined particles 5, 9, 12, 10 are most prominent for the pseudorapidities (figure 10b).
As a selection of the many possible distributions, in figure 9 we show the invariant mass m, the transverse momentum p_T, and the pseudorapidity η for two combined particles. Figures 9a, 9c, and 9e present the distributions of the combined Higgs-like particle 0. In addition to the prominent difference in the mass distributions, for ttH events its p_T is smaller while its direction is more central in η compared to tt+bb events.
Furthermore, in figures 9b, 9d, and 9f we show combined particle 5 because of its separation power in pseudorapidity and in transverse momentum (figure 10). Combination 5 is essentially determined by the bottom quark jet b_1 boosted into a rest frame of all other four-vectors (figure 6). Here, for ttH events the mass m distribution is shifted to smaller values, p_T is larger and more defined, and η is more central than for tt+bb events.
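The three single-particle quantities discussed here follow from a four-vector through standard kinematic relations. The sketch below assumes four-vectors stored as (E, px, py, pz), where the vector passed in is the boosted one when rest-frame quantities are wanted. It also assumes a non-zero transverse momentum for the pseudorapidity, and it clips slightly negative squared masses from rounding to zero. The function names are hypothetical.

```python
import math


def invariant_mass(p):
    """m = sqrt(E^2 - |p|^2), using the Minkowski metric."""
    E, px, py, pz = p
    m2 = E * E - px * px - py * py - pz * pz
    return math.sqrt(max(m2, 0.0))  # guard against rounding below zero


def transverse_momentum(p):
    """pT = sqrt(px^2 + py^2)."""
    return math.hypot(p[1], p[2])


def pseudorapidity(p):
    """eta = asinh(pz / pT); requires pT > 0."""
    return math.asinh(p[3] / transverse_momentum(p))


# Hypothetical four-vector with mass 4, pT 3, and eta 0.
p = (5.0, 3.0, 0.0, 0.0)
features = (invariant_mass(p), transverse_momentum(p), pseudorapidity(p))
```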
In the example of the characteristic angular distribution of the charged lepton in top quark decays (figure 2, cos θ*), two different reference systems are combined to determine the opening angle θ* between the W boson in the top quark rest frame and the lepton in the W boson rest frame. As mentioned above, the LBN is capable of calculating these complex angular correlations involving four-vectors evaluated in different rest frames.
As an example, we select distributions for angular distances of combined particles 2 and 5 (figure 6). Particle 2 is a combination of all four-vectors boosted into the rest frame of the bottom quark jet b_2, whereas particle 5 corresponds to the bottom quark jet b_1 boosted into a combination of all four-vectors. Figure 11a shows that, for tt+bb background events, combined particles 2 and 5 predominantly propagate back-to-back, while the ttH signal distribution scatters broadly around 90°.
The opposite holds true for the angles between combined particles 2 and 6 (figure 6). Here, the bottom quark jet b_1, or alternatively b_2, serves as a rest frame for characterizing combinations of both top quarks together with the other bottom quark jet. Figure 11b shows that, for tt+bb, combined particles 2 and 6 preferably propagate in the same direction, whereas for ttH signal events this is less pronounced.
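The pairwise angular feature behind these distributions is the cosine of the opening angle between the momentum directions of two boosted particles. A minimal sketch is given below. It assumes four-vectors stored as (E, px, py, pz), each already boosted into its own rest frame, and non-zero three-momenta; the function name is hypothetical.

```python
import math


def cos_opening_angle(p, q):
    """Cosine of the angle between the three-momenta of p and q."""
    dot = p[1] * q[1] + p[2] * q[2] + p[3] * q[3]
    norm_p = math.sqrt(p[1] ** 2 + p[2] ** 2 + p[3] ** 2)
    norm_q = math.sqrt(q[1] ** 2 + q[2] ** 2 + q[3] ** 2)
    return dot / (norm_p * norm_q)


# Back-to-back, perpendicular, and collinear toy configurations.
back_to_back = cos_opening_angle((1.0, 1.0, 0.0, 0.0), (2.0, -2.0, 0.0, 0.0))
perpendicular = cos_opening_angle((1.0, 1.0, 0.0, 0.0), (1.0, 0.0, 1.0, 0.0))
collinear = cos_opening_angle((1.0, 0.0, 0.0, 1.0), (3.0, 0.0, 0.0, 3.0))
```

In this convention, values near −1 correspond to the back-to-back topology and values near +1 to particles propagating in the same direction.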
Overall, it can be concluded that the LBN, when given only low-level variables as input, seems to autonomously build and transform particle combinations leading to extracted features that are suitable for the task of separating signal and background events.
Figure 11. Cosine of the opening angles between combined particle 2 (tt + b_1) in its rest frame b_2 and a) combined particle 5 (b_1) in its rest frame (tt + b_1 + b_2), and b) combined particle 6 (tt + b_2) in its rest frame (b_1).

Conclusions
The various physics processes from which the events of high-energy particle collisions originate lead to complex particle final states. One reason for the complexity is the high energy that leads to Lorentz boosts of the particles and their decay products. For an analysis task such as the identification of processes, we reconstruct aspects of the probability distributions from which the particles were generated. How well the probability distributions of individual physical processes are reflected is captured in a series of characteristic variables. To compute these variables, both the Minkowski metric for the correct calculation of energy-momentum conservation or invariant masses and the Lorentz transformation into the rest frames of parent particles are necessary. So far, such characteristic variables have been engineered by physicists.
In this paper, we presented a general two-stage neural network architecture. The first stage, the novel Lorentz Boost Network, contains at its core an efficient and fully vectorized implementation for performing Lorentz transformations. The aim of this stage is to autonomously generate variables suitable for the characterization of collision events when using only particle four-vectors. For this purpose, the input four-vectors are combined separately into composite particles and rest frames. The composite particles are boosted into corresponding rest frames and a generic set of variables is extracted from the boosted particles. The second stage of the network then uses these autonomously generated variables to solve a particular physics question. Both the weights used to create the combinations of input four-vectors and the parameters of the second stage are learned together in a supervised training process.
To assess the performance of the LBN, we investigated a benchmark task of distinguishing ttH and tt+bb processes. We demonstrated the improved separation power of our model compared to domain-unspecific deep neural networks, even though the latter used sophisticated high-level variables as input in addition to the four-vectors. Furthermore, we developed visualizations to gain insight into the training process and discovered that the LBN learns to identify physics-motivated particle combinations from the data used for training. Examples are top-quark-like combinations or the approximate compensation of the Lorentz boost of the center-of-mass system of the scattering process.

JINST 14 P06006
The LBN is a multipurpose method that uses Lorentz transformations to exploit and uncover structures in particle collision events. It is part of an ongoing comparison of methods to identify boosted top quark jets and has already been successfully applied at the IML workshop challenge at CERN [32]. The source code of our implementation in TensorFlow [33] is publicly available under the BSD license [34].

A Network parameters and variables
Network parameters. The network parameters of the best performing networks are listed in table 1.
Table 1. All deep neural networks use a fully connected architecture with n_layers and n_nodes. ELU is used as the activation function. The Adam optimizer [35] is employed with decay parameters β_1 = 0.9, β_2 = 0.999, and a learning rate of 10^−4. L2 regularization is applied with a factor of 10^−4. Batch normalization between layers was utilized in every configuration. The generic mappings in the LBN feature extraction layer create E, p_T, η, φ, m, and pairwise cos(ϕ).
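To make the optimizer settings concrete, the following sketch performs a single Adam update for one scalar parameter with the stated learning rate and decay parameters. Several points are assumptions of this illustration and are not taken from the paper: the L2 term is modeled as a penalty l2 · w² added to the loss, the stabilizing constant `eps` is set to the commonly used 10^−8, and the function name is hypothetical. The actual training relies on the TensorFlow implementation.

```python
import math


def adam_step(w, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999,
              eps=1e-8, l2=1e-4):
    """One Adam update for a scalar parameter w at step t (t >= 1).

    `grad` is the gradient of the data loss; the gradient of the assumed
    L2 penalty l2 * w**2 is added before the moment updates.
    """
    g = grad + 2.0 * l2 * w
    m = beta1 * m + (1.0 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * g * g      # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)             # bias corrections
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# First update from w = 1 with a hypothetical gradient of 0.5.
w, m, v = adam_step(1.0, 0.5, 0.0, 0.0, 1)
```

In the first step the bias-corrected update has a magnitude close to the learning rate, independent of the gradient scale.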