Parameterized neural networks for high-energy physics

We investigate a new structure for machine learning classifiers built with neural networks and applied to problems in high-energy physics by expanding the inputs to include not only measured features but also physics parameters. The physics parameters represent a smoothly varying learning task, and the resulting parameterized classifier can smoothly interpolate between them and replace sets of classifiers trained at individual values. This simplifies the training process and gives improved performance at intermediate values, even for complex problems requiring deep learning. Applications include tools parameterized in terms of theoretical model parameters, such as the mass of a particle, which allow for a single network to provide improved discrimination across a range of masses. This concept is simple to implement and allows for optimized interpolatable results.


INTRODUCTION
Neural networks have been applied to a wide variety of problems in high-energy physics [1,2], from event classification [3,4] to object reconstruction [5,6] and triggering [7,8].Typically, however, these networks are applied to solve a specific isolated problem, even when this problem is part of a set of closely related problems.An illustrative example is the signal-background classification problem for a particle with a range of possible masses.The classification tasks at different masses are related, but distinct.Current approaches require the training of a set of isolated networks [9,10], each of which are ignorant of the larger context and lack the ability to smoothly interpolate, or the use of a single signal sample in training [11,12], sacrificing performance at other values.
In this paper, we describe the application of the ideas in Ref. [13] to a new neural network strategy, a parameterized neural network in which a single network tackles the full set of related tasks.This is done by simply extending the list of input features to include not only the traditional set of event-level features but also one or more parameters that describe the larger scope of the problem such as a new particle's mass.The approach can be applied to any classification algorithm; however, neural networks provide a smooth interpolation, while tree-based methods may not.
A single parameterized network can replace a set of individual networks trained for specific cases, as well as smoothly interpolate to cases where it has not been trained.In the case of a search for a hypothetical new particle, this greatly simplifies the task -by requiring only one network -as well as making the results more powerful -by allowing them to be interpolated between specific values.In addition, they may outperform isolated networks by generalizing from the full parameter-dependent dataset.
In the following, we describe the network structure needed to apply a single parameterized network to a set of smoothly related problems and demonstrate the application for theoretical model parameters (such as new particle masses) in a set of examples of increasing complexity.

NETWORK STRUCTURE & TRAINING
A typical network takes as input a vector of features, x, where the features are based on event-level quantities.
After training, the resulting network is then a function of these features, f (x).In the case that the task at hand is part of a larger context, described by one or more parameters, θ.It is straightforward to construct a network that uses both sets of inputs, x and θ, and operates as a function of both: f (x, θ).For a given set of inputs x0 , a traditional network evaluates to a real number f (x 0 ).A parameterized network, however, provides a result that is parameterized in terms of θ: f (x 0 , θ), yielding different output values for different choices of the parameters θ; see Fig. 1.
Training data for the parameterized network has the form (x, θ, y) i , where y is a label for the target class.The addition of θ introduces additional considerations in the training procedure.While traditionally the training only requires the conditional distribution of x given θ (which is predicted by the theory and detector simulation), now the training data has some implicit prior distribution over θ as well (which is arbitrary).When the network is used in practice it will be to predict y conditional on both x and θ, so the distribution of θ used for training is only relevant in how it affects the quality of the resulting parameterized network -it does not imply that the resulting inference is Bayesian.In the studies presented below, we simply use equal sized samples for a few discrete values of θ.Another issue is that some or all of the components of θ may not be meaningful for a particular target class.For instance, the mass

TOY EXAMPLE
As a demonstration for a simple toy problem, we construct a parameterized network, which has a single input feature x and a single parameter θ.The network is trained using labeled examples where examples with label 0 are drawn from a uniform background and examples with label 1 are drawn from a Gaussian with mean θ and width σ = 0.25.Training samples are generated with θ = −2, −1, 0, 1, 2; see Fig. 2a.
As shown in Fig. 2, this network generalizes the solution and provides reasonable output even for values of the parameter where it was given no examples.Note that the response function has the same shape for these values (θ = −1.5, −0.5, 0.5, 1.5) as for values where training data was provided, indicating that the network has successfully parameterized the solution.The signalbackground classification accuracy is as good for values where training data exist as it is for values where training data does not.

1D PHYSICAL EXAMPLE
A natural physical case is the application to the search for new particle of unknown mass.As an example, we consider the search for a new particle X which decays to t t.We treat the most powerful decay mode, in which t t → W + bW −b → qq b ν b.The dominant background is standard model t t production, which is identical in final state but distinct in kinematics due to the lack of an intermediate resonance.Figure 3 shows diagrams for both the signal and background processes.
We first explore the performance in a one-dimensional case.The single event-level feature of the network is m W W bb , the reconstructed resonance mass, calculated using standard techniques identical to those described in Ref. [14].Specifically, we assume resolved top quarks in each case, for simplicity.Events are are simulated at parton level with madgraph5 [15], using pythia [16] for showering and hadronization and delphes [17] with 3: Feynman diagrams showing the production and decay of the hypothetical particle X → t t, as well as the dominant standard model background process of top quark pair production.In both cases, the t t pair decay to a single charged lepton ( ), a neutrino (ν) and several quarks (q, b).
the ATLAS-style configuration for detector simulation.Figure 4a shows the distribution of reconstructed masses for the background process as well as several values of m X , the mass of the hypothetical X particle.Clearly the nature of the discrimination problem is distinct at each mass, though similar to those at other masses.
In a typical application of neural networks, one might consider various options: • Train a single neural network at one intermediate value of the mass and use it for all other mass values as was done in Refs.[11,12].This approach gives the best performance at the mass used in the training sample, but performance degrades at other masses.
• Train a single neural network using an unlabeled mixture of signal samples and use it for all other mass values.This approach may reduce the loss in performance away from the single mass value used in the previous approach, but it also degrades the performance near that mass point, as the signal is smeared.
• Train a set of neural networks for a set of mass values as done in Refs.[9,10].This approach gives the best signal-background classification performance at each of the trained mass values, degrades for mass values away from the ones used in training, and leads to discontinuous performance as one switches between networks.
In contrast, we train a single neural network with an additional parameter, the true mass, as an input feature.For a learning task with n event-level features and m parameters, one can trivially reconcieve this as a learning task with n+m features.Evaluating the network requires supplying the set of event-level features as well as the desired values of the parameters.
These neural networks are implemented using the multi-layer perceptron in PyLearn2, with outputs treated with a regressor method and logistic activation function.Input and output data are subject to preprocessing via a scikit-learn pipeline (i.e.MinMaxScaler transformation to inputs/outputs with a minimum and maximum of zero and one, respectively).Each neural network is trained with 3 hidden layers and using Nesterov's method for stochastic gradient descent.Learning rates were initiated at 0.01, learning momentum was set to 0.9, and minibatch size is set to treat each point individually (i.e.minibatch size of 1).The training samples have approximately 100k examples per mass point.
The critical test is the signal-background classification performance.To measure the ability of the network to perform well at interpolated values of the parametervalues at which it has seen no training data -we compare the performance of a single fixed network trained at a specific value of m 0 X to a parameterized network trained at the other available values other than m 0 X .For example, Fig. 4 compares a single network trained at m 0 X = 750 GeV to a parameterized network trained with data at m X = 500, 1000, 1250, 1500 GeV.The parameterized network's input parameter is set to the true value of the mass (m 0 X , and it is applied to data generated at that mass; recall that it saw no examples at this value of m 0 X in training.Its performance matches that of the single network trained at that value, validating the ability of the single parameterized network to interpolate between mass values without any appreciable loss of performance.

HIGH-DIMENSIONAL PHYSICAL EXAMPLE
The preceding examples serve to demonstrate the concept in one-dimensional cases where the variation of the output on both the parameters and features can be easily visualized.In this section, we demonstrate that the parameterization of the problem and the interpolation power that it provides can be achieved also in highdimensional cases.
We consider the same hypothetical signal and background process as above, but now expand the set of features to include both low-level kinematic features which correspond to the result of standard reconstruction algorithms, and high-level features, which benefit from the application of physics domain knowledge.The low-level features are roughly the four-vectors of the reconstructed events, namely: • the leading lepton momenta, • the momenta of the four leading jets, • the b-tagging information for each jets • the mass (m ν ) of the W → ν, • the mass (m jj ) of the W → qq , • the mass (m jjj ) of the t → W b → bqq , • the mass (m j ν ) of the t → W b → νb, • the mass (m W W bb ) of the hypothetical X → t t, for a total of five high-level features; see Fig 6.
The parameterized deep neural network models were trained on GPUs using the Blocks framework [18][19][20].
Seven million examples were used for training and one million were used for testing, with 50% background and 50% signal.The architectures contain five hidden layers of 500 hidden rectified linear units with a logistic output unit.Parameters were initialized from N (0, 0.1), and updated using stochastic gradient descent with minibatches of size 100 and 0.5 momentum.The learning rate was initialized to 0.1 and decayed by a factor of 0.89 every epoch.Training was stopped after 200 epochs.
The high dimensionality of this problem makes it difficult to visually explore the dependence of the neural network output on the parameter m X .However, we can test the performance in signal-background classification for the decay of X → t t with two choices of mX as well as the dominant background process.
[GeV] work is capable of generalizing the solution even in a high-dimensional example.
Conversely, Fig 8 compares the performance of the parameterized network to a single network trained at m X = 1000 GeV when applied across the mass range of interest, which is a common application case.This demonstrates the loss of performance incurred by traditional approaches and recovered in this approach.Similarly, we see that a single network trained an unlabeled mixture of signal samples from all masses has reduced performance at each mass value tested.
In previous work, we have shown that deep networks such as these do not require the additional of high-level features [21,22] but are capable of learning the necessary functions directly from the low-level four-vectors.Here we extend that by repeating the study above without the use of the high-level features; see Fig 7 .Using only the low-level features, the parameterized deep network is achieves essentially indistinguishable performance for this particular problem and training sets of this size.

DISCUSSION
We have presented a novel structure for neural networks that allows for a simplified and more powerful solution to a common use case in high-energy physics and demonstrated improved performance in a set of examples with increasing dimensionality for the input feature space.While these example use a single parameter θ, the Comparison of the performance in the signalbackground discrimination for the parameterized network, which learns the entire problem as a function of mass, and a single network trained only at mX = 1000 GeV.As expected, the AUC score decreases for the single network as the mass deviates from the value in the training sample.The parameterized network shows improvement over this performance; the trend of improving AUC versus mass reflects the increasing separation between the signal and background samples with mass, see Figs. 5 and 6.For comparison, also shown in the performance a single network trained with an unlabeled mixture of signal samples at all masses.
technique is easily applied to higher dimensional parameter spaces.
Parameterized networks can also provide optimized performance as a function of nuisance parameters that describe systematic uncertainties, where typical networks are optimal only for a single specific value used during training.This allows statistical procedures that make use of profile likelihood ratio tests [23] to select the network corresponding to the profiled values of the nuisance parameters [13].
Datasets used in this paper containing millions of simulated collisions can be found in the UCI Machine Learning Repository [24] at archive.ics.uci.edu/ml/datasets/HEPMASS.

FIG. 1 :
FIG. 1: Left, individual networks with input features (x1, x2), each trained with examples with a single value of some parameter θ = θa, θ b .The individual networks are purely functions of the input features.Performance for intermediate values of θ is not optimal nor does it necessarily vary smoothly between the networks.Right, a single network trained with input features (x1, x2) as well as an input parameter θ; such a network is trained with examples at several values of the parameter θ.
of a new particle is not meaningful for the background training examples.In what follows, we randomly assign values to those components of θ according to the same distribution used for the signal class.In the examples studied below the networks have enough generalization capacity and the training sets are large enough that the resulting parameterized classifier performs well without any tuning of the training procedure.However, the robustness of the resulting parameterized classifier to the implicit distribution of θ in the training sample will in general depend on the generalization capacity of the classifier, the number of training examples, the physics encoded in the distributions p(x| θ, y), and how much those distributions change with θ.

FIG. 2 :
FIG. 2: Top, training samples in which the signal is drawn from a Gaussian and the background is uniform.Bottom, neural network response as a function of the value of the input feature x, for various choices of the input parameter θ; note that the single parameterized network has seen no training examples for θ = −1.5, −0.5, 0.5, 1.5.

FIG. 4 :
FIG. 4: Top, distributions of neural network input m W W bb for the background and two signal cases.Bottom, ROC curves for individual fixed networks as well as the parameterized network evaluated at the true mass, but trained only at other masses.

•
the missing transverse momentum magnitude and angle for a total of 21 low-level features; see Fig 5.The highlevel features strictly combine the low-level information to form approximate values of the invariant masses of the intermediate objects.These are:

FIG. 5 :
FIG.5: Distributions of some of the low-level event features for the decay of X → t t with two choices of mX as well as the dominant background process.

FIG. 6 :
FIG.6: Distributions of high-level event features for the decay of X → t t with two choices of mX as well as the dominant background process; see text for definitions.
FIG. 8:Comparison of the performance in the signalbackground discrimination for the parameterized network, which learns the entire problem as a function of mass, and a single network trained only at mX = 1000 GeV.As expected, the AUC score decreases for the single network as the mass deviates from the value in the training sample.The parameterized network shows improvement over this performance; the trend of improving AUC versus mass reflects the increasing separation between the signal and background samples with mass, see Figs.5 and 6.For comparison, also shown in the performance a single network trained with an unlabeled mixture of signal samples at all masses.