Improving Parametric Neural Networks for High-Energy Physics (and Beyond)

Signal-background classification is a central problem in High-Energy Physics (HEP), playing a major role in the discovery of new fundamental particles. A recent method -- the Parametric Neural Network (pNN) -- leverages multiple signal mass hypotheses as an additional input feature to effectively replace a whole set of individual classifiers, each providing (in principle) the best response for its corresponding mass hypothesis. In this work we aim at deepening the understanding of pNNs in light of real-world usage. We discovered several peculiarities of parametric networks, and provide intuition, metrics, and guidelines for them. We further propose an alternative parametrization scheme, resulting in a new parametrized neural network architecture, the AffinePNN, along with many other generally applicable improvements, like the balanced training procedure. Finally, we extensively and empirically evaluate our models on the HEPMASS dataset, along with its imbalanced version (called HEPMASS-IMB), which we provide here for the first time to further validate our approach. Results are provided in terms of the impact of the proposed design decisions, classification performance, and interpolation capability.


Introduction
Selecting events that contain interesting processes is a fundamental requirement of HEP experiments and one of the most established application areas of advanced computing techniques, i.e. Machine and Deep Learning [1]. Physicists are interested in rare events (yielded by the collision of known particles), following their theoretical assumptions, measuring the fraction of events that contain a specific decay channel. These rare events -- the so-called signal -- must be separated out from the background, i.e. anything else originating from already known processes. The usual way of doing that relies on building an event selection algorithm, estimating its efficiency at selecting signal and rejecting background, and measuring the count of events passing it. Being able to effectively separate background events from the signal is a central problem in HEP that, with further analysis, can help the discovery of new fundamental particles.

Related Work

Parametric Neural Networks
A parametric neural network (pNN) [8,7,6] is a neural network architecture that leverages an additional input (in our case the mass of the hypothetical particle) to replace many individual classifiers, and potentially even improve their classification performance. Let x be the input features, m the generated mass of the signal (i.e. the signal mass hypothesis), and θ a set of learnable weights (or parameters). A pNN can be denoted as f_θ(x, m), i.e. as a learnable function of both the input features x and the mass feature m. A canonical neural network, instead, would be denoted as f_θ(x), depending only on the input features. Baldi's pNN [8] first concatenates x with m, then applies five dense layers, each with 500 units and activated by ReLU; a final dense layer with sigmoid non-linearity outputs the predicted class label. Such an architecture results in about 1M learnable parameters.
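The architecture just described can be sketched in plain NumPy; this is an illustrative forward pass only (the original work trains a standard deep network, and the initialization scale here is an arbitrary choice of ours), but it reproduces the layer sizes and the ~1M parameter count:

```python
import numpy as np

def dense(x, W, b, act=None):
    """One dense layer: x @ W + b, with an optional non-linearity."""
    z = x @ W + b
    if act == "relu":
        return np.maximum(z, 0.0)
    if act == "sigmoid":
        return 1.0 / (1.0 + np.exp(-z))
    return z

def init_pnn(d_x, hidden=500, n_hidden=5, seed=0):
    """Weights for Baldi's pNN: the input is [x, m], hence d_x + 1 features."""
    rng = np.random.default_rng(seed)
    sizes = [d_x + 1] + [hidden] * n_hidden + [1]
    return [(rng.normal(0.0, 0.05, (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def pnn_forward(x, m, weights):
    """f_theta(x, m): concatenate the mass feature, then the dense stack."""
    h = np.concatenate([x, m], axis=-1)
    for W, b in weights[:-1]:
        h = dense(h, W, b, act="relu")
    W, b = weights[-1]
    return dense(h, W, b, act="sigmoid")   # predicted P(signal | x, m)

# e.g. HEPMASS: 27 features + 1 mass feature -> about 1M learnable parameters
weights = init_pnn(d_x=27)
```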
Indeed, the idea of a "parametric" network is not new: in other fields of machine learning, like imitation learning [11], multi-task and meta-learning [12], unsupervised reinforcement learning [13], and deep generative models [14], it is commonly called "conditioning". The general idea is to condition the learning (i.e. the output) of a neural network on some additional representation z, in order to let the network's output change as z varies. The vector z -- called the task representation -- can take various forms: ranging from a one-hot encoding to a dense embedding, or even a single discrete or continuous variable. In our case, the mass feature (i.e. the task representation) is a single scalar m belonging to a finite set M = {m_1, m_2, ..., m_M} of mass hypotheses about the signal process we're interested in.
This idea, whether called conditioning or parametrization, is promising in HEP since it may enable replacing (potentially many) individual classifiers with a unique classifier trained on all mass hypotheses, thus leveraging weight sharing for more efficient learning, and distributed representations shared among mass hypotheses for improved predictions; we also found this to be beneficial in low-data regimes, i.e. when the data corresponding to certain masses is imbalanced compared to the most represented masses 2 . The authors also claim that, since the additional input defines a smoothly varying learning task, a pNN would be able to smoothly interpolate between such learning tasks. Ideally, this means that a pNN would be able to generalize (i.e. correctly classify events) beyond the mass hypotheses it was trained on, thanks to the additional mass feature.
In this case, our supervised dataset D has the form {(x, m, y)_i}_{i=1}^N, in which we have input features x (i.e. the variables associated to each event), their mass hypothesis m, and the target class labels y we aim to predict. For each signal mass hypothesis m_i ∈ M, we can slice our dataset such that only the features and targets corresponding to mass m_i are retained, i.e. D_{m_i} = {(x, y)_j : m_j = m_i, ∀j = 1, ..., N}, where N is the total number of events 3 . In this way, we obtain |M| datasets, each of which can be used to train an individual classifier (but also to evaluate our pNN at single mass points). The pNN can replace all of them and, if trained on a subset of the mass hypotheses M̂ ⊂ M, it is, in principle, able to automatically account for the missing masses (m_j ∈ M \ M̂) thanks to its interpolation capability (discussed in section 5.1), which should work on novel intermediate mass points as well.
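The per-mass slicing D_{m_i} can be written in a few lines; `slice_dataset` is a hypothetical helper name, assuming the dataset is held as NumPy arrays:

```python
import numpy as np

def slice_dataset(X, m, y, masses):
    """Build D_{m_i} for each m_i in `masses`: the (features, labels) pairs
    of events whose mass feature equals m_i."""
    return {mi: (X[m == mi], y[m == mi]) for mi in masses}
```

Each slice can then train an individual classifier, or evaluate the pNN at a single mass point.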

Conditioning Mechanisms
The authors of the original pNN [8] utilize a simple conditioning mechanism: they just concatenate the features with the mass (or task representation, in general), obtaining a new set of features x̃ = [x, m], and going back to a standard feed-forward neural network formulation, i.e. f_θ(x̃), that learns from the extended set of features x̃. Indeed, many arbitrarily-complex conditioning mechanisms exist; two of them (figure 1) are simple yet powerful [15]:
• Concatenation-based conditioning: the task or conditioning representation m (our mass) is first concatenated along the last dimension (axis) of the input features x, and then the result is passed through a linear layer. Notice that in the original pNN, the linear layer after concatenation is missing.
• Conditional scaling: a linear layer first maps the conditioning representation m to a scaling vector s, which is then combined with the input features x by an element-wise multiplication (Hadamard product), i.e. x ⊙ s.
These two conditioning mechanisms are widely applicable, although it is not yet clear in which cases one mechanism is preferable to the other(s). Nevertheless, the two can be combined into a third one 4 : a conditional affine transformation [15], which motivates our new parametric architecture (refer to section 4.1).
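Both mechanisms reduce to a couple of matrix operations. A minimal NumPy sketch (function names are ours, not from the original work):

```python
import numpy as np

def concat_conditioning(x, m, W, b):
    """Concatenation-based conditioning: concatenate m along the last
    axis of x, then apply a linear layer (the one the vanilla pNN omits)."""
    return np.concatenate([x, m], axis=-1) @ W + b

def conditional_scaling(x, m, W_s, b_s):
    """Conditional scaling: a linear layer maps m to a scaling vector s,
    followed by the Hadamard product x * s."""
    s = m @ W_s + b_s              # (batch, D_x) scaling vector
    return x * s
```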
(a) Concatenation-based conditioning (b) Conditional scaling
Figure 1: Two popular conditioning mechanisms, which turned out to be complementary. Reproduced from [15].

Datasets
In this section, we provide details about the two datasets we used to conduct our study.

Figure 2: Feynman diagrams depicting the hypothetical particle decay. Reproduced from [8].

HEPMASS
The HEPMASS dataset [10,8] was utilized by the pNN's authors to demonstrate their novel idea. The dataset contains 7M training samples and 3.5M test samples. The physical case under consideration is the search for a new particle X with an unknown mass. This particle decays into a tt̄ pair, and the final state consists of the most probable decay products: tt̄ → W⁺b W⁻b̄ → qq̄ b ℓν b̄. The dominant background considered for this specific signal is the Standard Model tt̄ production, identical in the final state but different in kinematics due to the absence of the X resonance. The Feynman diagrams showing the signal and background processes are shown in Figure 2. There are a total of 27 features (without considering the 28th mass feature), already normalized to have approximately zero mean and unit variance. Each datapoint x^(i) ∈ D can belong either to a signal process, with a mass hypothesis (in GeV) m_X ∈ {500, 750, 1000, 1250, 1500}, or to a background process, where the mass feature is randomly sampled from m_X. Signal (background) samples are then labeled with class 1 (0). Moreover, the two classes are perfectly balanced, and each D_{m_i} is balanced as well, containing the same amount of events for each m_i ∈ m_X. As discussed in the previous section, when training pNNs we want to pay attention to the balance of classes as well as the balance of each m_i: this dataset avoids such issues. For further details about the data, refer to [8].
By studying the distribution of each feature we can deduce three major things:
1. The background is unique and partially covers the signal (figure 3a).
2. Consequently, the signal events at m_X = 500 and 750 GeV are the most difficult to separate from the background, since their feature distributions almost completely overlap with the background's. This explains why, in figure 3b, the AUC is considerably lower at 500 GeV, while being almost perfect at 1500 GeV.
3. By only considering some features (figure 4), even a simple classifier can easily tell whether an event belongs to the signal class or not, thanks to these features being highly correlated with the class label: figure 5.

HEPMASS-IMB
Since the HEPMASS dataset is rather simple, leaving almost no room for improvement, we decided to imbalance the dataset by hand in order to be able to demonstrate novel methods for improving pNNs: we call this new dataset, derived from it, HEPMASS-IMB [9]. In particular, the dataset is doubly imbalanced: there is class imbalance with respect to the class label, and mass imbalance with respect to the theoretical parameter (m_X) as well. A comparison between the two datasets is depicted in table 1.

Figure 3: Plot (a) shows the reconstructed mass (m_WWbb) of the simulated decay (X → tt̄ → W⁺bW⁻b̄) depicted in HEPMASS. We observe that the background spans the entire mass range, mostly overlapping the signal at m_X = 500 and 750 GeV. This fact also motivates why the authors' results (presented in plot b), and also our own, exhibit a clear loss in AUC below 750 GeV, especially at 500 GeV.
The way we imbalance the dataset is as follows. We first take all the background events (without any change), and sub-sample (without replacement) only the signal, differently at each m_i ∈ M. In particular, we select: 350k events for m_X = 500, 140k for m_X = 750, 35k for m_X = 1000, 7k for m_X = 1250, and lastly 2k events for m_X = 1500; for a total of almost 534k signal events. Indeed, we only imbalance the train-set of HEPMASS, leaving its test-set as it was provided by the authors [10]. In this way we are able to simulate a double imbalance of both class labels and signal mass hypotheses (figure 6), which more closely resembles the imbalance found in real-world datasets of Monte-Carlo simulated particle decays.

Figure 6: (a) Resulting distribution of the reconstructed mass after having imbalanced HEPMASS. Compared to figure 3a, we can see how much smaller the signal is; the background has been weighted by 1/5 for visualization purposes. (b) Imbalance of all the m_i ∈ M signal mass hypotheses (log-scale). We can see that the imbalance ratio grows as m_i increases, for a maximum difference of a few orders of magnitude.

Having the extra mass feature as input gives us an additional degree of freedom for the design of classifiers. In this section we study different design decisions about network architecture, background distribution, and training procedure.
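The sub-sampling procedure described above can be sketched as follows; `imbalance_signal` is a hypothetical helper of ours, and the per-mass counts are those reported in the text:

```python
import numpy as np

# Per-mass signal counts used to build HEPMASS-IMB (from the text above).
SIGNAL_COUNTS = {500: 350_000, 750: 140_000, 1000: 35_000,
                 1250: 7_000, 1500: 2_000}

def imbalance_signal(X, m, y, counts=SIGNAL_COUNTS, seed=0):
    """Keep every background event; sub-sample the signal (without
    replacement) down to counts[m_i] events per mass hypothesis."""
    rng = np.random.default_rng(seed)
    keep = np.flatnonzero(y == 0).tolist()            # all background
    for mass, n in counts.items():
        sig = np.flatnonzero((y == 1) & (m == mass))
        keep.extend(rng.choice(sig, size=min(n, sig.size), replace=False))
    keep = np.sort(np.array(keep))
    return X[keep], m[keep], y[keep]
```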

The Affine Architecture
Baldi et al. [8] utilized a regular feed-forward design for their parametric network, which we will refer to as the (vanilla) pNN: it just concatenates the mass feature m with the input features x right after the input layer. Here we propose a novel conditioning (or parametrization) scheme to better exploit m, based on the two conditioning mechanisms described in section 2.2, namely conditional scaling and conditional biasing (equivalent to concatenation-based conditioning). There are two inputs, h and m. The dimensionality of m is expanded to match the dimensionality of h through linear combinations that yield scaling (s) and biasing (b) vectors, which are, respectively, multiplied and added to h in an element-wise fashion, resulting in the conditioned representation z.
We propose an Affine Parametric Neural Network (AffinePNN) architecture, which relies on multiple affine-conditioning layers (figure 7) instead of simply concatenating the mass at the beginning of the network. Such a layer takes two vectors h and m as input, where h can be the features x (if the layer is directly applied to the inputs) or the previous layer's output, and m is the mass feature. Assuming they have dimensionality D_h and D_m (which in our case is just one), respectively, the layer applies an element-wise affine transformation (scaling and bias addition) on h, such that the output z is a function of m. Considering vectors at a generic index i of the input batch, we have:

z^(i) = s_φ(m^(i)) ⊙ h^(i) + b_ψ(m^(i)),    (1)

where the dimensionality of z^(i) is the same as h^(i), i.e. D_h. The scaling and biasing operations are defined as linear functions over the mass 5 : s_φ = W_φ m^(i), and b_ψ = W_ψ m^(i), where the learned
5 In practice, these are implemented as two distinct Dense layers with linear activation.
weight matrices W_ψ and W_φ both have shape D_h × D_m, since the number of linear units has to match the dimensionality of h^(i). An AffinePNN interleaves such layers with ReLU-activated 6 dense layers: an overview of the architecture is shown in figure 8 and table 2. In principle, the affine layers can be further generalized by introducing non-linear activation functions (f and g) on the scaling s_φ and biasing b_ψ, such that: z^(i) = f(s_φ) ⊙ h^(i) + g(b_ψ). In our preliminary experiments, we also evaluated some modifications of the network design shown in table 2. In particular, we tested various combinations of activation function and weight initializer, finding that the ReLU activation paired with the default initialization scheme achieves the best performance: refer to section 6.3 for more details about the hyper-parameters. The last variation we tried was the application of batch normalization [16] after each affine-conditioning layer: the resulting trained network had similar classification performance, but at the cost of a slightly longer training (due to the overhead of the additional operations). One possible explanation is that, since the network is not so deep (there are just four hidden dense layers), the gradient flow is not affected by either vanishing or exploding gradients, thus making batch normalization not strictly necessary. This was confirmed by tracking the magnitude (l2-norm) of the gradients during training. The same modifications were also tried on the pNN architecture, showing similar behavior. Indeed, the choice of architecture-related hyper-parameters, like the activation function, weight initialization, number of layers (or blocks), and number of units, is usually dataset- and problem-dependent: what we found to work here is not guaranteed to be equally good in other circumstances.
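The affine-conditioning layer of Eq. 1 can be sketched in NumPy as below (an illustrative forward pass, not the paper's Dense-layer implementation; the function name is ours):

```python
import numpy as np

def affine_conditioning(h, m, W_phi, W_psi):
    """Eq. 1: z = s_phi(m) * h + b_psi(m), element-wise, with the scaling
    and biasing vectors linear in the mass. W_phi and W_psi have shape
    (D_h, D_m), so each maps m (batch, D_m) to a (batch, D_h) vector."""
    s = m @ W_phi.T            # scaling vector s_phi
    b = m @ W_psi.T            # biasing vector b_psi
    return s * h + b           # conditioned representation z
```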

Background's Mass Distribution
In our reference work [8], the background's mass feature is identically distributed as the signal's mass. In other analyses, background events receive a mass that is either: (1) randomly assigned from the distribution of signal masses (only during training) [17], or (2) uniform in the interval of considered mass hypotheses. In general, we can analyze two situations: identical and different (uniform) distribution of m for background events only. In the uniform case, the i-th background event will be assigned a mass feature by sampling randomly from a uniform distribution U over the considered mass range, i.e. m^(i) ∼ U.
In both cases, we can fix the values of the mass feature (only for the background) in the dataset (e.g. by sampling m only once and writing the values to disk), or we can sample them during training, repeatedly and differently at each mini-batch. So, we have two degrees of freedom (distribution type and assignment strategy) that lead to a total of four unique combinations: (1) identical fixed, (2) identical sampled, (3) uniform fixed, and lastly (4) uniform sampled. In our experiments we noticed that, without proper regularization, having a uniform mass feature for the background allowed the network to almost perfectly fit both the training and validation sets: this may be due to the introduction of an artificial correlation between the mass feature and the class label, which is exploited during training. Nevertheless, by regularizing the model enough, generalization is still ensured.
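The two distribution types can be sketched as a single sampler (our own hypothetical helper; the mass values are those of HEPMASS). Calling it once and storing the result gives the "fixed" variants; re-drawing at every mini-batch gives the "sampled" variants:

```python
import numpy as np

MASSES = np.array([500.0, 750.0, 1000.0, 1250.0, 1500.0])

def background_mass(n, kind="identical", masses=MASSES, seed=0):
    """Draw mass features for n background events.
    kind='identical': sampled from the signal mass set M;
    kind='uniform':   uniform over the considered mass range."""
    rng = np.random.default_rng(seed)
    if kind == "identical":
        return rng.choice(masses, size=n)
    return rng.uniform(masses.min(), masses.max(), size=n)
```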
We may further discuss an additional assignment strategy, in which the mass feature m for the background is determined by means of mass intervals (or bins), based on the underlying reconstructed mass of the selected decay products. For example, if we consider (150, 250) to be a specific mass interval centered around 200 GeV, we can assign m = 200 as the mass feature for each background event x whose reconstructed mass falls within that interval. In this work, such a third assignment strategy has not been taken into account.

Training Procedure
Beyond the architecture and regularization of the parametric network, as well as the distribution of the mass feature m, we can make some further considerations about how to properly train such kind of neural networks in light of what we already know about the structure of our own data. In general, we know that the signal is arranged in |M| groups (one for each m_i), and that the background is possibly composed of different processes. Therefore, beyond class labels, our data is naturally divided into sub-classes: in terms of the signal generating mass (i.e. the various m_i ∈ M), and of background processes 7 . We can exploit such domain knowledge to design a training procedure that embeds these inductive biases.
In particular, we can further notice that each sub-class may have its own unique frequency, in terms of how many data samples fall into it, e.g. due to data imbalance, as shown in figures 3a and 6b. Such frequencies may bias the (parametric) network towards certain sub-classes, resulting in an overall sub-optimal fitting of the data.
We propose to mitigate this simply by balancing each sub-class so that an equal number of events belongs to each of them. We call this approach balanced training; without discarding or generating new data, it can be easily implemented by balancing each mini-batch during training, e.g. by sampling each sub-class in equal proportion. Specifically, we have:
• No balance: The usual training procedure, in which the network is trained by experiencing the data as it is. This will be our baseline for comparison.
• Class-only balance: Only the class labels are balanced within each mini-batch. Considering two classes, we can balance them in two ways: (1) by associating sample or class weights to each event, resulting in a weighted loss function, or (2) by sampling the events for a mini-batch such that half the batch is populated with samples belonging to the positive class (i.e. signal), and the other half with background samples (the negative class). We implement class balancing by following the second option, as if the class weights were implicitly provided to weight the loss function.
• Signal balance: Since the signal is generated at different values of m_X, we can build mini-batches such that there is an equal number of events for each m_i ∈ M. In practice, we take half the size B of a batch, i.e. B_s = B/2, and then split that into even parts such that each m_X = m_i is represented by exactly B_s/|M| events. This implies that, at each mini-batch, all the signal mass hypotheses are always represented: this may not occur, especially if the batch size is small, in the case of no class and background balancing.
• Background balance: Similarly to signal balancing, mini-batches are divided into equally-sized parts such that each part contains events belonging to a certain background process. Of course, such a balancing strategy is only meaningful if the background comprises more than one process, which is not the case for HEPMASS-IMB. Also in this case, each mini-batch will contain samples coming from all background processes.
• Full balance 8 : A combination of class, signal, and background balance. A batch is divided into two halves (each of size B/2) to ensure class balance. Then, one half is signal-balanced, and the other is balanced according to the background. In this way, the network will always experience mini-batches that comprise all signal hypotheses, as well as all background processes.
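A fully balanced mini-batch sampler can be sketched as below (our own illustrative helper, assuming a single-process background as in HEPMASS-IMB, where full balance coincides with signal balance on the signal half):

```python
import numpy as np

def fully_balanced_batch(signal_idx, background_idx, batch_size, seed=0):
    """Indices for one fully balanced mini-batch: half background, half
    signal, with the signal half split evenly across mass hypotheses.
    `signal_idx` maps each mass m_i to the indices of its signal events."""
    rng = np.random.default_rng(seed)
    half = batch_size // 2
    per_mass = half // len(signal_idx)
    parts = [rng.choice(idx, size=per_mass, replace=True)
             for idx in signal_idx.values()]           # B/2 signal events
    parts.append(rng.choice(background_idx, size=half, replace=True))
    return np.concatenate(parts)
```

Sampling with replacement keeps the batch composition valid even for rare sub-classes, without discarding data.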
Model Selection. For every balancing procedure (even when the training is not balanced) and regardless of the background's mass feature distribution, we always perform validation in the same way, on a 25% split of the original (not batch-balanced) data. The AUC of the ROC is used as the validation metric. The way we perform validation resembles the way the model is evaluated on the respective test-set (see section 6). In particular, for each m_i ∈ M we take the corresponding signal samples s, and all the background b; for the latter only, we sample the mass feature m from M, as in the identical (sampled) option described in the previous section. Thus, the mass feature will be m^(s) = m_i for the signal events, and m^(b) ∼ M for the background events. This is important since: (1) balancing during validation would alter the value of the validation metric(s), as the original distribution of the data would be changed, and (2) regarding the mass feature distribution for the background, validating in a different way would lead to sub-optimal generalization performance of the selected model.

Preprocessing and Regularization
In general, we found regularization to be beneficial for improved performance, and also (as we will see later) crucial for good interpolation. We utilize two well-known regularization techniques: dropout [18] and l2 weight decay. For the affine architecture (section 4.1), we insert a Dropout layer after each affine-conditioning layer, thus zeroing random elements of the conditioned internal representation z: refer to Eq. 1. For the standard pNN architecture, instead, the Dropout layer is inserted after each ReLU activation. In both cases, we use a drop probability of 25%. Lastly, we apply l2-regularization on all learnable parameters of the network, but with different coefficients for weights and biases, respectively. Classification performance can usually be further boosted by properly normalizing the data input to the network. Recall that our data has the form (x, m, y), in which: x is a multi-dimensional vector of features; m (the mass feature) corresponds to a 1-dimensional vector of values that makes the network "parametric"; lastly, y is a vector of class labels (either 1 for signal or 0 for background, in our case). Discarding y, we can normalize (or preprocess, in general) both x and m in the same way, as done by [8] by means of min-max normalization, or differently. In our case, as HEPMASS is already standardized, we only normalize m by dividing it by 1000.

Properties of PNNs
We believe that parametric networks have many interesting properties beyond interpolation. In this section we attempt a first characterization of them, trying to better understand such kind of models.

8 In our case, with only one background process, the signal-only and fully balanced training options (in which half the batch is reserved for signal events, and the other half for background events) are equivalent; this is not true in general.

Interpolation
The interpolation capability of a parametric neural network is its ability to generalize towards novel mass points that lie between two known mass hypotheses, resulting in effective interpolation of events between them; we also use the term extrapolation when the missing or novel mass points lie at the two extremes of the mass range, or even beyond them. In this case, the generalization capability should be twofold: the network is requested (1) to perform well on novel samples belonging to the known masses, and also (2) to correctly classify new events that belong to the missing hypotheses. This means that a pNN capable of good interpolation should provide more accurate outcomes than those obtained by interpolating the results of M individual classifiers instead. We want our pNN to perform well even if some hypotheses are missing. So, how can we ensure that our model has acquired such a capability?
Factors. We investigated several potential factors that may affect the interpolation capability of pNNs, including per-mass feature distributions, mass imbalance, background distribution, network regularization, batch size, and training procedure. Here we describe the most impactful:
• Per-mass features distribution: Recalling figure 4, a shift in the feature distribution has been observed. This tells us that some mass points are more difficult to classify than others.
In fact, such behavior is directly reflected in single-mass interpolation and extrapolation (i.e. when we train our model with just one mass left out). In figure 9 we can observe how extrapolating D_500 (plot d) is much more difficult than for the other masses. This also suggests that evaluating our model with only one mass left out does not necessarily imply that our network will correctly interpolate or extrapolate everywhere, in general.
Another way to understand how much the similarity among masses helps (or hinders) our network at interpolating or extrapolating them is to stress the model at extrapolation: we train a pNN on just one mass hypothesis, requesting it to extrapolate to all the remaining ones.
In figures 10a and 10b, we observe how much easier it is to predict the missing masses when the model is trained only on m_X = 750 or m_X = 1000. This fact seems to be (at least partially) independent of the AUC achieved on such mass points: although the highest AUC is obtained on D_1500 (plot 10e), its average extrapolation performance is not the highest among the mass points.
• Background distribution and Regularization: the impact of the background distribution (section 4.2) goes beyond classification performance, as it may also affect interpolation. Figure 14c denotes a pNN that hardly interpolates; such a network was trained on a uniformly distributed background, without regularization. During training on all the mass points, we noticed that the same network was able to classify almost perfectly both the training and validation sets: clearly overfitting them. As discussed previously, having a uniformly distributed mass feature for the background introduces an additional correlation with the class label, making training "easy". Indeed, by regularizing the model enough and increasing the batch size, generalization as well as interpolation can be achieved with success.
Select mass hypotheses. Another practical aspect to consider is how to select the mass hypotheses to drop for a fair measure of interpolation. As described earlier, the goodness of the training data can mislead us when quantifying interpolation: so, we suggest dropping almost half of the mass hypotheses.

Figure 10: We can clearly notice how different mass 500 is from all the others; in fact, its extrapolation performance is really poor: the (mean) AUC decreases by more than 30%. The other plots present a mean loss in AUC of at most 8%.

In the single-mass setting of figure 10, the network is expected to perform well on its only training mass, not much worse on the immediately close hypotheses, while also maintaining a reasonable accuracy on far masses. This can be an easy way to assess the similarity of features among masses (which is an intrinsic property of the training data): if masses are similar, the network should perform almost the same on each unseen mass.
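Building the training split for such an interpolation measurement amounts to masking out the signal at the held-out hypotheses; `drop_signal_masses` is a hypothetical helper of ours:

```python
import numpy as np

def drop_signal_masses(X, m, y, held_out):
    """Training split for measuring interpolation: remove every signal
    event whose generating mass is in `held_out` (kept for evaluation),
    while all background events are retained."""
    keep = ~((y == 1) & np.isin(m, list(held_out)))
    return X[keep], m[keep], y[keep]
```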

Learned Mass Representation
From section 1, in our signal-background classification problem we actually consider M mass hypotheses for the signal. This means that our original task can be broken down into |M| smaller classification problems, each considering only a specific mass m_i ∈ M. In fact, approaches preceding the parametric network [4] used to solve each sub-task by training a neural network solely on D_{m_i} (i.e. a slice of the original dataset D that selects events whose mass feature is m_i), thus obtaining a (disjoint) set of M individual networks, which we will call g_{m_i} (or g_i for short). Somehow, each individual network g_i, despite being trained solely on one mass, is able to implicitly relate the input features to the signal hypothesis m_i they truly belong to (or even to the underlying invariant mass). This fact seems to be confirmed by the visualization in figure 11, in which each intermediate representation h_i (of network g_{m_i}, for all m_i ∈ M) has a precise and nicely clustered spatial arrangement, which also relates well to the learned class label.

Figure 11: Visualization [19,20] of the intermediate representations of the individual networks on the HEPMASS dataset. By coloring the points by the mass label (left plot), we notice that representations related to the same mass are clustered together, meaning that each network g_i has indirectly acquired knowledge about its underlying signal mass hypothesis m_i (although not given as input). Also, a structured part of them clearly depicts the learned class label (right plot).
Such visualizations may provide further insights into the relation between a parametric network and a set of individual networks. Intuitively, we may want the intermediate representation of our pNN to be disentangled along the mass "axis" (in the underlying manifold), as seen in figure 11 for the individually trained neural networks (considered as a whole). The situation for the pNN is similarly structured (figure 12): some masses are well clustered (e.g. at 500 GeV), while for others we can observe a smooth "shading" among them. This means that the pNN has partially recovered the underlying structure of the individual masses, but there is still some confusion in representing datapoints coming from higher values of the signal mass hypotheses.

Results
Since the datasets we have for signal-background classification can be divided into |M| groups (in order to be able to "parametrize" a neural network), evaluation metrics also have to be considered in terms of the available mass hypotheses for the signal. In particular, the models are evaluated on each m_i separately. So we consider the signal events generated at a certain m_i ∈ M, along with the whole background: i.e. the background events that span the entire mass range. Indeed, the results provided in this section were all computed only on the test-set of the respective datasets. Moreover, we weight both signal and background samples (only for evaluation) such that the weighted count of signal events is equal to the weighted count of background events, i.e.:

w_s · N_s = w_b · N_b.

In particular, for both datasets we set the signal weight to one, w_s = 1, and the background weight to w_b = 1/5, since we have five mass hypotheses for the signal. For such reason, each time we test a particular mass hypothesis m_i ∈ M, we select the corresponding signal (i.e. all the signal events that have m_i as mass feature) and the whole background, whose mass feature is set equal to m_i; thus, m^(b) = m_i for all background samples b. This is to account for the fact that, in HEPMASS, the original background mass is assigned randomly, being sampled from the set M (i.e. the identical fixed strategy discussed in section 4.2).
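Building the per-mass evaluation split can be sketched as below (our own illustrative helper; the weights w_s = 1, w_b = 1/5 are those stated above):

```python
import numpy as np

def eval_set_for_mass(X, m, y, mi, w_s=1.0, w_b=1.0 / 5.0):
    """Test split for hypothesis m_i: all signal generated at m_i plus the
    whole background, whose mass feature is overwritten with m_i. Sample
    weights equalize the weighted signal and background counts."""
    idx = ((y == 1) & (m == mi)) | (y == 0)
    m_eval = np.full(idx.sum(), mi)        # m^(b) = m_i for every event
    w = np.where(y[idx] == 1, w_s, w_b)
    return X[idx], m_eval, y[idx], w
```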

Metrics
We consider standard evaluation metrics for classification tasks, such as the AUC (area under the curve) of the ROC (receiver operating characteristic) and Precision-Recall curves. In particular, the ROC curve can be interpreted for HEP as comparing the signal efficiency (y-axis) against the background efficiency (x-axis): how much signal is retained when accepting a certain fraction of the background. Equivalently, we can consider the background rejection (i.e. 1 − background efficiency): how much signal is retained at a certain discard of background. The Precision-Recall curve, instead, compares the signal efficiency (recall) with what we call the purity (precision): the number of true signal events divided by the number of events classified as signal (which also contains misclassified background). Along with the usual classification metrics, we also consider the Approximate Median Significance (AMS) [21,22], but in the following simplified form: AMS(t) = s t / √(s t + b t), in which s t and b t are the (weighted 9 ) numbers of true signal and true background events, respectively, that pass the classification threshold t. This quantity is useful to determine an optimal classification threshold for our networks, called the best cut t⋆, that is the threshold that maximizes the significance: t⋆ = arg max t AMS(t). Since the value of the AMS depends on the number of events from which it is calculated, we propose a new metric, the significance ratio (σ ratio), which is normalized in [0, 1] and thus very intuitive to interpret. The significance ratio is defined as the ratio between the best (maximum) AMS and the largest possible significance, √(s max), achievable only ideally by means of a perfect classification (when s t = s max, i.e. equal to all the true signal, and b t = 0): σ ratio = AMS(t⋆) / √(s max) = s / (√(s + b) · √(s max)), where s = s t⋆ and b = b t⋆.
Such a metric can also be used to compare how well the same model classifies different mass hypotheses: this is now possible since the number of events belonging to a certain m i no longer affects the scale of the metric (as instead happens for the regular AMS).
We can further say that the best cut t⋆, apart from telling us which classification threshold best determines the positive class, is also a useful quantity to monitor because it can provide additional information about the goodness of the classification. In particular, by plotting the best cut versus the mass we may observe failure cases in which t⋆ is either 0 or 1, depicting a situation in which the network is unable to correctly separate out the background (t⋆ = 0) or to retain a significant amount of signal (t⋆ = 1). A special failure case can be observed when s = s max and s = b, i.e. the classified signal is equal in number to all the true signal, which is also equal to the classified true background (e.g. due to the applied weights). In this case we would obtain σ ratio = 1/√2 ≈ 0.707, since: σ ratio = s max / (√(s max + s max) · √(s max)) = √(s max) / (√2 · √(s max)) = 1/√2. Indeed, measuring σ ratio ≈ 0.707 also implies having an AUC of 0.5, corresponding to nonsense classification. Moreover, if the best cut t⋆ is such that s = s max but b > 0, then σ ratio = √(s max) / √(s max + b) decreases toward zero (in the limit) as b approaches b max (i.e. the weighted count of all true background events).
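A minimal implementation of this threshold scan, assuming per-event weights and the simplified AMS(t) = s t / √(s t + b t), could look as follows:

```python
import numpy as np

def significance_ratio(scores, y, weights, num_thresholds=200):
    """Scan AMS(t) = s_t / sqrt(s_t + b_t) over classification thresholds;
    return the best cut t* and sigma_ratio = AMS(t*) / sqrt(s_max), which
    is 1 for a perfect classifier and ~0.707 for a nonsense one."""
    s_max = weights[y == 1].sum()             # weighted count of all true signal
    best_t, best_ams = 0.0, 0.0
    for t in np.linspace(0.0, 1.0, num_thresholds):
        passed = scores >= t
        s = weights[(y == 1) & passed].sum()  # weighted signal passing the cut
        b = weights[(y == 0) & passed].sum()  # weighted background passing it
        if s + b > 0 and s / np.sqrt(s + b) > best_ams:
            best_ams, best_t = s / np.sqrt(s + b), t
    return best_t, best_ams / np.sqrt(s_max)
```

For a perfectly separating score the function returns σ ratio = 1, while a constant (uninformative) score yields exactly the 1/√2 ≈ 0.707 failure value discussed above.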

Baselines
To first assess the advantages brought by parametric neural networks, we compare them to their "non-parametric" counterparts, namely the single and individual neural networks. What we call a single-NN is just a neural network that is trained without the mass feature (m) as input but on all M hypotheses at the same time; so it has only one input: the features x. The individual networks are instead, as the name suggests, a set of single networks, g φi , each of them trained to target a specific mass hypothesis m i ∈ M. Since each g φi is trained in isolation on one m i against the whole background, we expect the individual networks to easily beat the single network, which faces a harder learning problem. Moreover, both kinds of networks provide baseline performance for interpolation: the single network is trained to fit all data regardless of the mass (not given as input), whereas each individual network g φi interpolates by means of the similarity between the hypothesis m i it was trained on and the m j to interpolate. Finally, to assess the effectiveness of the various design choices described in section 4, we apply them (where possible) to the non-parametric baselines, and of course also to the parametric baseline: a vanilla pNN, as intended by Baldi et al. [8]. Results for classification performance are shown in figure 13.

Figure 13: Comparison among (tuned) baselines on HEPMASS-IMB. As we can see, the best classification performance (in terms of AUC of both ROC and PR curves, as well as σ ratio, for each m i ∈ M) is achieved by the individual NNs and by the parametric baseline (with identical-fixed distribution for the background's mass feature).
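The baselines differ mostly in their inputs. A Keras sketch (the helper `make_net` is hypothetical; we assume the 27 low-level features and the five mass hypotheses of HEPMASS):

```python
import tensorflow as tf

def make_net(num_features=27, parametric=False):
    """Sketch of the baselines: a single-NN sees only the features x,
    while a pNN also receives the mass m, concatenated to x."""
    x_in = tf.keras.Input(shape=(num_features,), name="features")
    inputs, h = [x_in], x_in
    if parametric:
        m_in = tf.keras.Input(shape=(1,), name="mass")
        inputs.append(m_in)
        h = tf.keras.layers.Concatenate()([x_in, m_in])
    for units in (300, 150, 100, 50):
        h = tf.keras.layers.Dense(units, activation="relu")(h)
    out = tf.keras.layers.Dense(1, activation="sigmoid")(h)
    return tf.keras.Model(inputs, out)

single_nn = make_net()                  # one net, no mass input
pnn = make_net(parametric=True)         # one net, mass as extra feature
# Individual networks: one single-NN per mass hypothesis m_i (GeV)
individual = {m: make_net() for m in (500, 750, 1000, 1250, 1500)}
```

Each individual network is then trained only on events whose signal was generated at its own m i, against the whole background.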

Evaluation
Hyperparameters. All the neural networks used for comparison are built with the TensorFlow 2.X [23] framework along with the Keras [24] library for Python. To improve the reproducibility of our results, we fix the random seed to 42. All the networks use the same hyperparameters: ReLU activation, [300, 150, 100, 50] units for the hidden layers, sigmoid output, binary cross-entropy loss, Adam optimizer [25], batch size of 1024 (unless stated otherwise), and default initialization: Glorot uniform [26] for weights, and constant zero for biases. This results in about 70k learnable parameters. The learning rate is set to 3 × 10 −4 and never decayed. In general, we always regularize by means of both dropout (with a drop rate of 25%) and l2 weight decay. The weight decay is applied differently, with a strength of 10 −4 for weights (or 10 −5 ), and 10 −5 for biases (or 10 −6 ). The same hyperparameters are kept for both datasets, as well as the training budget, fixed at 25 epochs. In general, the hyperparameters we use were initially tuned for the vanilla pNN architecture on HEPMASS.
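Putting the stated hyperparameters together, a sketch of the resulting model might be (assuming the parametric input of 27 features plus the mass, and the first of the two quoted weight-decay strengths; `hidden` is a hypothetical helper):

```python
import tensorflow as tf

tf.keras.utils.set_random_seed(42)           # fixed seed, as stated (TF >= 2.7)

def hidden(units, h):
    """Dense + ReLU with l2 weight decay and 25% dropout."""
    h = tf.keras.layers.Dense(
        units, activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(1e-4),
        bias_regularizer=tf.keras.regularizers.l2(1e-5))(h)
    return tf.keras.layers.Dropout(0.25)(h)

inp = tf.keras.Input(shape=(28,))            # 27 features + mass
h = inp
for units in (300, 150, 100, 50):
    h = hidden(units, h)
out = tf.keras.layers.Dense(1, activation="sigmoid")(h)
model = tf.keras.Model(inp, out)
model.compile(optimizer=tf.keras.optimizers.Adam(3e-4),
              loss="binary_crossentropy")
```

With these layer sizes the model indeed has roughly 70k learnable parameters, matching the figure quoted above.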
HEPMASS. Results about classification performance (with baseline models), background's mass feature distribution, and model architecture are presented in . We have also included a third baseline, a parametric network with linear activations: as we can see, it underperforms even the single-NN, despite leveraging the additional information provided by the input mass feature.
HEPMASS-IMB. Results on interpolation are presented in both table 4 and figure 14c. Furthermore, an exhaustive comparison among baseline models, parametric and affine architectures, background's mass distribution, and training procedure is detailed in tables 5 and 6. Finally, outcomes about classification performance in terms of class separation are detailed in figure 15.

Discussion
From our empirical comparison among network architectures, background distributions, and training procedures, we can conclude that: 1. The affine-conditioning mechanism better exploits the information brought by the mass feature, resulting in improved classification performance.
2. The balanced training procedure, which yields balanced mini-batches, can further improve performance, even without changing the network architecture.
3. The way the mass feature is distributed has a profound impact on how the model classifies and interpolates the missing masses. In general, the uniform distribution makes the model prone to overfitting, resulting in lower performance.
4. Finally, the right combination of network architecture, background distribution, and balanced training allowed us to greatly improve on both imbalanced classification and interpolation, almost recovering the classification performance achieved on the original, full, non-imbalanced dataset.
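To illustrate point 1: affine conditioning (in the spirit of FiLM-style layers) scales and shifts the hidden features as a function of the mass, instead of merely concatenating it to the input. A minimal NumPy sketch, with hypothetical weight matrices standing in for learned parameters:

```python
import numpy as np

def affine_condition(h, m, Wg, bg, Wb, bb):
    """Affine conditioning of hidden features h on the mass m:
        h' = gamma(m) * h + beta(m)
    where gamma and beta are (here, linear) functions of the mass.
    With Wg = Wb = 0, bg = 1, bb = 0 this reduces to the identity."""
    gamma = m @ Wg + bg        # (batch, units) scales from the mass
    beta = m @ Wb + bb         # (batch, units) shifts from the mass
    return gamma * h + beta
```

Because every hidden unit is modulated by the mass, the conditioning signal can reshape the learned representation per hypothesis, rather than acting as just one extra input feature.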

Conclusions
In this study we first discussed the concept of "parametrization", which is really a rebranding of the conditioning mechanisms widely used in deep learning. Establishing such a connection allows us to bring ideas and methods from that area to improve pNNs in HEP. Another proposed intuition concerns the structure of the data we use for signal-background classification: we know the contribution of the background(s), and at which mass the signal is generated. In fact, we leverage the latter information to build a mass feature that parametrizes a neural network, allowing the model to replace a set of individual classifiers, as well as to interpolate beyond the events seen during training. By studying the general structure of the data, we can exploit the inductive biases it provides by embedding them in the network design and training. Lastly, we demonstrated that pNNs are able to interpolate under real-world assumptions. We hope the ideas proposed here will inspire further work on parametric networks, and also prove useful in other fields beyond HEP with similar problem settings and requirements.
Open Questions. Our work is a first step towards a full understanding of parametric networks. We believe there are more properties of, and extensions to, what is presented here. In particular, we suggest further research directions:
• Real-world datasets are imbalanced, so either self-supervised learning, class- or mass-specific data augmentation, or (parametric) generative models may provide a major improvement in classification performance.
• The signal is generated at a few discrete mass hypotheses: what about parametrizing on the whole, continuous mass range?
• The output of a pNN is a single number: why not let the network output or discover a classification rule that can be easily interpreted by physicists, to further increase their knowledge about a certain phenomenon?

Data Availability Statement
The data that support the findings of this study are openly available at the following URL/DOI: https://zenodo.org/record/6453048. The code used to produce our experiments is openly available on GitHub: https://github.com/Luca96/affine-parametric-networks.

Figure 15: Comparison of classification performance on HEPMASS-IMB, between a pNN (uniform, sampled) and an affine model (identical, sampled). (a) Weighted class separation (histograms), significance ratio (solid green curve), and best cut (vertical dashed green line), at mX = 500 GeV; the value of the best cut (classification threshold) and the corresponding value of the significance ratio are shown in the legend at the top. (b) Comparison of ROC and PR curves at mX = 500 GeV. We can notice how the two models opt for rather different classification thresholds.