A Neural-Network-defined Gaussian Mixture Model for particle identification applied to the LHCb fixed-target programme

Particle identification in large high-energy physics experiments typically relies on classifiers obtained by combining many experimental observables. Predicting the probability density function (pdf) of such classifiers in the multivariate space covering the relevant experimental features is usually challenging. The detailed simulation of the detector response from first principles cannot provide the reliability needed for the most precise physics measurements. Data-driven modelling is usually preferred, though sometimes limited by the available data size and different coverage of the feature space by the control channels. In this paper, we discuss a novel approach to the modelling of particle identification classifiers using machine-learning techniques. The marginal pdf of the classifiers is described with a Gaussian Mixture Model, whose parameters are predicted by Multi Layer Perceptrons trained on calibration data. As a proof of principle, the method is applied to the data acquired by the LHCb experiment in its fixed-target configuration. The model is trained on a data sample of proton-neon collisions and applied to smaller data samples of proton-helium and proton-argon collisions collected at different centre-of-mass energies. The method is shown to perform better than a detailed simulation-based approach, to be fast and suitable to be applied to a large variety of use cases.


Introduction
In large high-energy physics experiments, particles deposit energy in several detectors, some of which are specialized in charged or neutral particle identification (PID). Usually a combination of techniques among Cherenkov and transition radiations, ionization loss, time-of-flight measurements and calorimetry are simultaneously employed to guarantee redundancy and a wide kinematic coverage. To exploit all the available information, multivariate classification techniques are typically used, combining the response of all involved detectors into variables, called PID classifiers in the following, optimizing the separation among different particle species. Once a PID classifier has been built, predicting its performance is crucial to physics analysis. In general, this implies the knowledge of its probability density function (pdf ) for each particle species in the multivariate space spanning all the features that can affect the detector response. These include the kinematic properties of the particle under scrutiny, but also the underlying event, i.e. the signals due to other particles in the same event. As the number of these features increases, the problem quickly becomes intractable with simple methods. The detailed simulation of the detector response for the physical process of interest provides a way to predict the pdf from first principles. However, the amount of simulated data needed to accurately cover the multivariate feature space is often prohibitive and the imperfection of the simulation needs to be determined through a comparison to real events. Data-driven modelling is thus preferred: calibration channels where the particle species is determined by their kinematics, independently on the response of the detectors used for PID, are used. An example is the Λ → pπ − decay, where a high-purity sample of events can be selected thanks to the isolated decay vertex and the narrow mass peak, by which the proton and the pion can be distinguished by their kinematic properties. For a given decay mode, the distribution of the selected particles in the feature space will differ from that of the calibration channels. Ideally one would like to determine from the control sample, for each particle species, the marginal pdf of the PID classifier(s) as a function of the relevant experimental features, in order to predict it for the physics channel under study. In this paper, we propose a method to approach this problem based on machine-learning techniques. Control channels in data are used to identify the experimental features affecting the PID classifier and to derive a statistical model of its pdf . The method is conceived to be generic, not relying on a specific set of experimental feature variables, to be trainable in a relatively short time, allowing the analyst to easily modify the feature space and tune the hyperparameters for best performance, and to use a state-of-the-art neural network (NN) architecture. After discussing the general idea of this approach, a concrete implementation is discussed as a benchmark case, namely the modelling of charged particle identification in the fixed-target data collected by the LHCb experiment. We conclude by summarizing the performance and general applicability of the proposed method.

The method
The basic idea of the proposed approach is to empirically describe the marginal pdf of a PID classifier x, depending on a set of features θ, through a Gaussian Mixture Model (GMM) as x p ∼ Ng,p j=1 α j,p (θ)G(x, µ j,p (θ), σ j,p (θ)) (1) where the index p refers to the particle species, G is a Gaussian distribution with mean µ and standard deviation σ, and N g is the number of Gaussian functions in the model. The parameters α, µ, σ, whose values can have a non-linear dependence on θ, are estimated from a maximum likelihood fit to a set of n p training events for each particle species. The initialization is performed using a set of Multi Layer Perceptron (MLP) neural networks, where the loss function is the negative logarithm of the likelihood function. This definition is part of the novelty of the presented approach and, in contrast to other generative models used in similar applications such as GANs and VAEs, aims to explicitly model a probability density function. Training events from the calibration channels can be contaminated by unwanted background, which can be subtracted using the sPlot method [1], evaluating a weight w for each event to quantify its probability to belong to the considered signal. Weighting the training events can also be useful to modify the distribution in feature space of the training sample, when there are large differences with respect to the target physics sample. In this way the model will be more accurate in the feature space regions where it is more useful to the application. When weights are used, the loss function becomes [2][3][4] The model can be readily extended, as in the case of our benchmark example, to a set x of (approximately) linearly correlated target variables, by replacing the one-dimensional Gaussian functions with multivariate normal distributions. With N g large enough, the model can reproduce any smooth function with exponential-like tails, which is the typical behaviour of PID classifiers. The main advantage of the chosen functional form is the speed of normalization when performing the fit. This offers the possibility to use a simple NN design with relatively short training time and to tune the hyperparameters of the model in a reasonable time. We make use of state-of-the-art open-source machinelearning libraries to implement the model, namely scikit-learn [5] and Keras [6] with TensorFlow [7] backend. These offer support for the NN backward error propagation and auto-differentiation and can exploit GPU acceleration. Input variables are preprocessed to ease the training phase. A linear transformation is applied to the target variables x, to map the range of their values to the interval [0, 1), using the MinMaxScaler algorithm in scikit-learn. For the other features, whose pdf is not aimed to be reproduced by the model, a transformation is applied converting their pdf to a Gaussian distribution with µ = 0 and σ = 1 implemented by the QuantileTransformer algorithm in scikit-learn. Equalizing the range and, for the features, the functional form of the pdf , allows the NN layer for a standardized initialization and speeds up the numerical estimation of the derivatives in the loss function minimization procedure. The chosen NN architecture, implemented using Keras/TensorFlow, consists in three tanh-activated layers with a constant number of nodes and an output layer matched to the shape of the predicted parameters. Each training requires a sensible choice for the following hyperparameters: • the number of nodes in the NN hidden layers; • the number of epochs, namely the iterations in the training process; • the batch size, namely the number of training events to be considered for each iteration of the minimization procedure; • the learning rate, namely the initial size of the step on the parameter values to be applied on each epoch (which is then linearly decreased with the epoch to ease the convergence).
To speed up the training, it is performed in two steps and a crude estimation of the parameters is firstly obtained by considering them independent on the features θ. This first fit is performed with a reduced number of epochs, and its result is used to initialize the second one. During this second step, the parameters are monitored along the iterations to check that their values converge smoothly. This is indeed a first cross-check against a possible inappropriate value of the learning rate, leading to an oscillating behaviour, or to effects of overtraining, leading to abrupt variations of the parameters. The training dataset is then split in ranges of the features with similar population and the corresponding x histograms are compared to the fitted pdf , as obtained separately for each feature interval. A good agreement between the data and predicted distributions for all the feature values validates the trained model. A concrete implementation of the proposed approach on data collected by the LHCb experiment is presented in the following Section and available on GitLab [8].

Charged PID in the LHCb fixed-target data
The LHCb detector, described in detail in Refs. [9,10], is a single-arm forward spectrometer conceived for heavy-flavour physics in pp collisions at the CERN LHC. Charged particles reconstructed by the spectrometer are dominated by three hadron species: pions, kaons and protons 1 . Charged particle identification is provided by two Ring Imaging Cherenkov (RICH) detectors [11] for momenta ranging from 2 to 100 GeV/c. The RICH1 detector is optimized for the lower momentum region, covering the angular acceptance of [25,300] mrad, while the RICH2 was designed for the forward particles ([15,120] mrad) with a higher momentum. The momentum thresholds for the Cherenkov light emission are 2.5, 9.3 and 17.7 GeV/c in RICH1 and 4.4, 15.6 and 29.6 GeV/c in RICH2 for pions, kaons and protons, respectively. For each track, the yields and positions of the recorded photons, whose emission angle depends on the incident particle velocity and thus on the particle mass for a given momentum, are checked against the three particle species hypotheses. Two PID classifiers are built as logarithms of the likelihood ratio between the proton and pion (DLL p,π ), the proton and kaon (DLL p,K ) or the kaon and pion (DLL K,π ) hypotheses. A finely segmented scintillating detector (SPD) is positioned downstream of the RICH2 to provide a measurement of the detector occupancy already at the earlier stages of the reconstruction procedure. Since 2015, the LHCb experiment also operates in fixed-target mode, recording collisions between the LHC beams and gas targets injected into the LHC beam pipe [12]. Fixed-target runs of limited duration with different beam energy and target gas species have been collected, as summarized in Fig. 1. The first physics result   Figure 2: Two-dimensional template fit to the PID distribution of negatively charged tracks for a particular bin (21.4 < p < 24.4 GeV/c, 1.2 < p T < 1.5 GeV/c). The (DLL p,K , DLL p,π ) distribution, shown in the left plot, is fitted to determine the relative contribution of π − , K − and p particles, using simulation to determine the template distributions and the fraction of fake tracks (which are barely visible). In the right plot, the result of the fit is projected into the variable arg (DLL p,K + i DLL p,π ) [14]. obtained in fixed-target configuration was the measurement of antiproton production from a proton-helium (pHe) sample acquired in 2016 [14]. The antiprotons were distinguished from the negatively charged pions and kaons through a template fit to the two-dimensional space (DLL p,π , DLL p,K ), as illustrated in Fig. 2. The templates were obtained from a detailed simulation of the detector and from the Λ → pπ + , K 0 S → π − π + and φ → K − K + calibration channels in data for antiprotons, pions and kaons, respectively. The data available statistics was limited, while the abundant samples recorded in pp collisions, which are routinely used to determine the PID performance in LHCb results [15], only presented a small overlap in feature space with pHe data. Indeed, the RICH detectors response is strongly affected by the detector occupancy, which is much larger on average in pp than pHe collisions and by the particle kinematics, which also differs between the two samples. Moreover, while the origin of pp collision is defined within a few centimetres, the recorded beam-gas collisions spread over about one metre along the beam line. The limited accuracy of the PID response modelling turned out to be one of the dominant uncertainties in the antiproton production measurement [14]. A larger sample of proton-neon (pNe) collisions with a nucleon-nucleon centre-of-mass energy of √ s N N = 68 GeV was recorded in 2017, providing a source of calibration events acquired in conditions more similar to the other fixed-target samples. We apply here the method proposed in Section 2 to model the PID response using the pNe dataset and to predict the classifier pdf for track selections performed on smaller pHe and pAr data samples, taking into account the different feature distributions due to the different particle selection criteria and experimental conditions such as the gas target and beam energy. The underlying hypothesis is that the response of the RICH detectors is stable across the different data takings.

Calibration channels
Three decays of light hadrons abundantly produced in the recorded collisions are used to model the PID response to particles of known species: Λ → pπ + for antiprotons, K 0 S → π − π + for pions, φ → K − K + for kaons. Antiprotons in Λ decays are identified solely from kinematics, as they always have higher momentum than the pions. In all cases, the control decays are distinguished from the background using a set of requirements on the quality and the kinematics of the reconstructed tracks and requiring that the reconstructed invariant mass of the two-track combination is compatible with the decaying particle mass. The association of the final-state particle with signals in the RICH subdetectors is also imposed. For φ → K − K + decays, where background contamination is more abundant, one of the two candidate final-state particles is required to be positively identified as a kaon, and only the other one is used in the control sample. The invariant mass distributions for the three channels after selection are shown in Fig. 3. The purity for the Λ and K 0 S decay samples is measured to be larger than 99%, while a significant residual background is present in the φ sample. The sPlot method is used to compute a weight for each event representing the probability it belongs to the signal category based on a fit describing the invariant mass distribution as the sum of a signal and a background component, as illustrated on the left plot of Fig. 4. The signal component is parametrized as a convolution of a Breit-Wigner and a Gaussian distribution, while a firstorder polynomial function is used to model the background. The weights obtained with the sPlot technique can be then applied to all variables uncorrelated with the invariant mass, as is the case, to an excellent approximation, of the PID classifiers, in order to remove the background. To verify the procedure, the variable arg (DLL p,π + i DLL p,K ), which, as shown in the right plot of Fig. 4, exhibits three distinct peaks corresponding to the particle species, is plotted before and after the sPlot weighting. The pion contamination from background events is clearly suppressed.

Feature selection
The RICH performance strongly depends on the particle kinematics, with fast variations when crossing the Cherenkov threshold values or the angular acceptance boundaries of the two detectors. The number and topology of the other tracks radiating in the detectors is also critical to the PID performance. The track reconstruction quality is also relevant, as it affects the determination of the centre of its Cherenkov ring. The choice of the features   Figure 4: Application of the sPlot technique to the φ → K − K + decay. The left plot shows the fit of the φ candidates mass distribution; the right one the validation of the evaluated weights by comparing a combination of the PID variables where the three particle species can be easily distinguished before (in blue) and after (in red) the weights application.
to be considered is a key point in the application of the method: enough features should be used to describe all the variance in the detector response, while their numbers should be limited to achieve a reasonable training time. To the purpose of this study, we consider those describing: • the particle momentum p, its longitudinal (p z ) and transverse (p T ) components and the pseudorapidity η; • the track position inside the PID detectors: the three coordinates of the track position closest to the beam and the two slopes with respect to the z axis; • the occupancy in the detector: the number of all reconstructed tracks through the spectrometer (nT racks) and the number of energy deposits (hits) in the two RICH (nRICH1Hits and nRICH2Hits) and in the SPD detectors (nSP DHits);   • the quality achieved in the reconstruction of the track: the fit χ 2 per degree of freedom (track χ 2 /ndf) or the number of hits in the tracking detectors consistent with the track (track ndf ).
As a first assessment of the relevance of these features to the description of the PID response, the distribution of the classifier is studied for pions of the pHe dataset for different ranges of each feature. After applying the requirement arg(DLL K,π + iDLL p,π ) < −1 to reject kaons and protons, for each considered feature the dataset is split in subsets of similar population. The Kolmogorov-Smirnov distance (KS) between the x distributions is computed for each pair of subsets, taking the maximal distance as a measure of the dependence of the x variables on the considered feature. An example is shown in Fig. 5, where the dependence on the pion pseudorapidity (KS = 0.54) is found to be clearly more important than that on the y position of the pHe collision (KS = 0.01). The most relevant features, with KS larger than 0.20, are listed in Tab. 1. Some of these features are known to be redundant or highly correlated with each other. The longitudinal and transverse momentum p z and p T will not be considered, as these features are determined from η and p, which exhibit a larger KS. The SPD occupancy is also dropped, as it is highly correlated with the occupancy in the RICH detectors, which are causally connected with the RICH response. On the other hand, variables with low KS value can still be important for explaining differences between the training and validation samples. As an example, the distribution of the longitudinal position of the tracks origin in the training sample, originating from decays, strongly differs from tracks promptly produced at the collision vertex. For this reason, all coordinates of the track position of closest approach to the beam (denoted with "poca" in the following text and figures) are included, despite a KS value of 0.06, 0.01 and 0.10 for the x, y and z coordinates, respectively.

Model training
The two PID classifiers DLL p,π and DLL p,K in the training samples show a linear correlation to a good extent. Therefore, we model their distribution according to Eq. 1 replacing the Gaussian with a two-dimensional multivariate normal distribution: where the vectors x represent the two PID classifiers (DLL p,π , DLL p,K ) and Σ is their covariance matrix For each component in the GMM, the free parameters are thus the weight α, the central values µ, the standard deviations (σ 1 , σ 2 ) and the correlation coefficient ρ. The model is trained separately for each of the three particle species following the procedure outlined in Section 2 and with the hyperparameters listed in Tab. 2. The choice of the input parameters reflects a higher complexity for the K 0 S → π − π + and the φ → K − K + calibration channels. Indeed, as a consequence of the many threshold effects involved in the process, the distributions of these classifiers may result significantly different from a multivariate normal distribution. For example, in Fig. 6 the distribution of the DLL p,π classifier computed on the calibration pions is shown. Comparing the distributions in four different intervals of the pion momentum, a second minor peak, arising at lower momentum, can be clearly seen. For the φ line, where the weight for the signal hypothesis obtained with the sPlot technique is applied, only a fraction of events gives a large variation of the loss function, resulting in a training more prone to statistical fluctuations. To compensate for this effect, a higher batch size is chosen. To differentiate the components in the GMM, the parameters are firstly randomly initialized from uniform distributions in the ranges: x 2 − x 2 for the mean values µ, being · the average. For the φ calibration channel, the background-subtracted distributions are considered; • [0, 2π] for the correlation angle; x 2 − x 2 for the widths σ 1,2 ;   The training is performed using a mini-batches gradient descent optimized with the RMSProp algorithm. Gaussian weights are left free to fluctuate towards negative values in order to enhance the stability of the training procedure. The evolution of the loss function with the number of epochs is monitored and Fig. 7 shows the resulting curves for the three calibration channels, all presenting a first steep decrease followed by a slow one and finally gentle oscillations near the minimum. The trained models present O(10 5 ) parameters, depending on the chosen complexity for the NN and the GMM. For the K 0 S calibration line and the parameters listed in Tab. 2, this corresponds to a measured training time of approximately 10 hours on a NVIDIA K80 GPU. The dependence of the Gaussian parameters as a function of the features is verified at the end of each training procedure to check for the presence of possible overtraining effects, which would manifest as their rapid oscillations to adapt to statistical fluctuations in the training sample. Fig. 8 shows an example of this for the Λ → pπ + calibration channel, where the expected smooth and monotonic behaviour can be observed. The generalization from the training to the application samples described in Sec. 3.5 definitively excludes overtraining effects.

Validation
To verify that the model properly learns the non-trivial correlations between the PID classifiers and the considered features, its predictions are compared to the actual distributions for control data in two-dimensional intervals of all possible pairs of features θ. The interval boundaries for each feature are set to roughly contain the same number of calibration entries. For each interval, a prediction is obtained by randomly choosing calibration entries in the selected interval and generating values of the target variables according to the fitted model for the corresponding values of θ. Fig. 9 shows an example of this validation for the K 0 S → π − π + calibration channel in intervals of the π − particle momentum and transverse momentum, Figs. 10 and 11 for the proton and kaon ones considering other feature pairs, respectively. The curves illustrate the non-trivial correlations of the target variables with the features and the excellent agreement between the data and the predictions demonstrates that the correlations are correctly learned by the model. For the pion calibration channel, the validation is also repeated considering an interval with low pNe statistics and Fig. 12 shows that, also in this bin, the trained model is able to generate a smooth template based on the parametric interpolation of the available data.

Application
The PID model, trained separately for each calibration channel selected from the pNe data sample, is applied to two independent lower-statistics samples of pHe and proton-Argon for the classifiers pdf is produced with the same method adopted for the validation in Sec. 3.4, but considering the differently-distributed pHe or pAr features as an input. To evaluate the model performance and the possible improvement over the PID model adopted so far in antiproton production studies in fixed-target data [14], the data-driven templates for the DLL p,π and DLL p,K variables are compared with those obtained from the detailed simulation of the detector. For the pAr sample, the application is performed on the DLL p,π and DLL K,π variables. Simulated data samples are generated for fixed-target collisions with EPOS-LHC [16], the interaction of the generated particles with the detector and its response are implemented using the Geant4 toolkit [17] as described in Ref. [18]. The simulation provides an overall good description of the input features and the small observed residual disagreements with respect to data can be corrected with reweighting techniques. For the pHe sample, the same approach of Ref. [14] is followed and simulated events are reweighted in two steps, according to the detector occupancy (using nSP DHits) and the track transverse momentum. To verify that the simulation-based PID model      Figure 11: Comparison for the φ → K − K + calibration channel in the track fit χ 2 /ndf ∈ [1.5, 4) -η ∈ [1.8, 3.7) bin between the bidimensional x distributions in (red) the pNe data and (blue) predicted with the trained model (top plot), its projections onto the two axes (second row) and onto the DLL p,K axis in intervals of the DLL p,π variable (third row).
is not limited by this simple reweighting technique, a more sophisticated approach is adopted for the pAr sample, where a boosted decision tree algorithm [19] is used to determine a single event weight depending on p T and the number of hits in the SPD and RICH1 subdetectors. After the reweighting, simulation-based binned templates for the three particle species are obtained. A fourth particle category for tracks reconstructed from hits belonging to different simulated particles and accounting for a few per cent of the candidate tracks, called ghost in the following, is considered. The corresponding template can only be obtained from simulation and is used also in the data-driven model. Template fits are finally performed on the pHe and pAr data, using the simulation-based or the data-driven templates. Binned maximum likelihood fits to the bidimensional target variable distribution are done in kinematic intervals, leaving the relative abundances of the particle species as free parameters of the fit. Examples of the fit projections to the pHe data are shown in the same kinematic intervals in Fig. 13 for the simulation-based templates and in Fig. 14 for the data-driven templates and directly compared to data in Fig. 15. Similar plots of fit projections in five other kinematic intervals are shown for the pAr sample in Figs. 16-18. In general, the data-based method is found to provide a more accurate prediction of the PID classifiers distribution than the full-simulation one. As expected from the larger size of the training dataset and from the smooth function assumed in the model, it is less affected by statistical fluctuations and also appears to be less biased. To better quantify the comparison between the two sets of templates, the fit quality is measured from the two-dimensional KS distance between the fitted and the actual data distribution, in kinematic bins. The difference between the values obtained with the simulation-based and data-driven templates is shown in Figs. 19 for the pHe (top) and pAr (bottom) data samples, respectively. In both cases, the observation of a positive difference in most of the bins demonstrates that the templates produced with the method presented in this work better describe the data than those based on a detailed simulation.

Systematic uncertainties
Data-driven methods for modelling the detector response such as the one proposed in this paper, while unaffected by biases related to the unavoidable imperfections of the simulation, depend on the selection and peculiarities of the calibration samples employed. We show in this section how the proposed model can be used to investigate these systematic effects. In our benchmark example, the PID response to kaons is the one which is expected to be more difficult to model reliably using the φ → K − K + decay because of the sizeable background contamination and the presence of the tag kaon, which is typically produced within a small angle with respect to the probe track and can thus bias the PID response. Indeed, for pp LHCb data, kaon PID calibration typically relies on the D * + → D 0 (→ K − π + )π + channel [15], whose statistics is too limited to be useful in fixed-target data. To evaluate the possible bias due to the correlation between the tag and probe kaons in the φ channel, a comparison between the PID responses of kaons from the two calibration channels in 2017 pp data is performed in momentum and pseudorapidity bins. A clear difference, notably in the DLL K,π variable, is observed in some kinematic bins. We then train our model for the kaon response using the D * + → D 0 (→ K − π + )π + decay and use the model    to predict the templates for kaons in the φ channel. The result, shown in Fig. 20 for the kinematic bin featuring the largest difference among the two samples, demonstrates that the difference is mostly explained by our model, taking into account the residual difference in track and event topology between the two decays within the kinematic bin. As the model trained on the D * + channels ignores the presence of the companion kaon and is not affected by the background of the φ channel, the systematic effects that have been investigated are found to be minor. This exercise shows that the proposed model can accurately take into account the different distributions of all the relevant experimental features among different data samples, well beyond the simple equalization of kinematic bins. In general, training the model on different calibration channels, also resorting to data recorded at different energy and collision systems, provides a way to evaluate the systematic effects related to the choice of a particular calibration dataset.

Conclusions
In summary, we presented a novel approach based on machine-learning techniques to model particle identification classifiers in high energy physics experiments. The GMM model, fitted to calibration channels using a set of MLP NNs, can be used to identify the relevant experimental features to be considered and to provide a smooth multivariate parametric representation of the PID response, making optimal use of the available training data. State-of-the-art machine-learning software libraries and computing resources provide the possibility to train models with O(10 5 ) parameters in a relatively short time, depending on the chosen complexity for the NN and the GMM. The method was demonstrated on a concrete case, namely the modelling of the particle identification response in the LHCb experiment, where it was shown to be able to predict non-trivial correlations among experimental features and to describe data more accurately than a detailed first-principle simulation of the detector. The method is expected to be employable on a larger variety of use cases dealing with experimental observables, not necessarily related to particle identification, depending on a sizeable number of experimental features.  Figure 19: Difference of the Kolmogorov-Smirnov distance for the modelling of the bidimensional (top) pHe or (bottom) pAr target variables distribution composing templates produced with a full simulation approach or with the machine-learning method employing the pNe calibration data discussed in this document. In all bins presenting a positive value a more precise description of the data is achieved with the latter approach.