Machine-learning-based particle identification with missing data

In this work, we introduce a novel method for Particle Identification (PID) within the scope of the ALICE experiment at the Large Hadron Collider at CERN. Identifying products of ultrarelativistic collisions delivered by the LHC is one of the crucial objectives of ALICE. Typically employed PID methods rely on hand-crafted selections, which compare experimental data to theoretical simulations. To improve on the performance of these baseline methods, novel approaches use machine learning models that learn the proper assignment in a classification task. However, because of the various detection techniques used by different subdetectors, as well as the limited detector efficiency and acceptance, produced particles do not always yield signals in all of the ALICE components. This results in data with missing values. Standard machine learning techniques cannot be trained on such examples, so a significant part of the data is skipped during training. In this work, we propose the first method for PID that can be trained with all of the available data examples, including incomplete ones. Our approach improves the PID purity and efficiency of the selected sample for all investigated particle species.

1 Introduction
ALICE (A Large Ion Collider Experiment) [1] is one of the four major detectors located at the Large Hadron Collider at CERN [2]. The main goal of ALICE is to study the properties of quark-gluon plasma (QGP), a hot and dense state of matter, and the strong force that binds quarks together inside hadrons [3]. The key requirement for detailed studies of QGP that distinguishes ALICE from the other Large Hadron Collider (LHC) experiments is its capability for very precise particle identification (PID), i.e., the ability to discriminate between different particle species produced during the collision. This allows for selecting the subset of particles required for a specific analysis.
The ALICE experiment is composed of several sub-detectors, some of which measure particle properties that can be used for identification. Figure 1 presents a schematic of the detector as configured in Run 2, the previous LHC data-taking period.
The three detectors particularly useful for PID are: TPC, TOF, and TRD. The Time Projection Chamber [5] (TPC) is one of the most important ALICE detectors as it records 3D information of the trajectory of charged particles, as well as their specific energy loss due to ionization, which is essential for particle identification. The Time-of-Flight [6] (TOF) detector measures particle travel times from the collision vertex to the detector, from which the particle velocity and mass are calculated. The Transition Radiation Detector [7] (TRD) records transition radiation, that is, the emission of photons by electrons traversing the boundaries of a radiator, which helps in distinguishing electrons from other charged particles. All the detectors mentioned above detect particles carrying a non-zero electric charge. Therefore, this article focuses on the identification of charged particles.
With the signals recorded by the detectors described above, particles are chosen using a set of selection criteria. Traditionally, the particle identification is based on hand-crafted selection criteria, for instance, based on how much the detector response deviates from the expected value. Particles falling outside the selection region are rejected. One of the most common selection methods is described in detail in Section 2.3.
However, when the characteristics of different particle species overlap, combining information from multiple detectors becomes difficult. Choosing selection regions by trial and error is less effective in this case, lowering the purity of the selected sample.
These shortcomings can be addressed with more advanced classification methods such as Bayesian models [8] or neural-network-based (NN) approaches [9][10][11]. However, the more sophisticated methods can only be trained or employed with complete data instances; just as a function value cannot be computed without all the inputs, NNs cannot process examples with missing values. As the detectors used for particle identification work independently, particles may be measured by a subset of detectors while not being recorded by others. This can happen for various reasons; for instance, a detector may malfunction or be switched off, or the particle can have characteristics (e.g. too low transverse momentum) that do not match the detector specification. Collected data may therefore contain incomplete measurements, with information from only some detectors. As a result, standard neural-network-based approaches cannot be used for PID without altering either the incomplete data or the model architecture. The former approach is usually implemented by imputing the missing values with averages or model predictions. This introduces artificial information into the data that can change the predictions of a model in unpredictable ways. Such an approach can often lead to biased outputs, which is especially problematic for real-life applications such as PID, where model predictions may lead to physically impossible outcomes. Therefore, we focus on altering the model architecture, as this approach allows us to avoid making any assumptions about the missing data and ensures that none of the recorded information is modified.
In this work, we introduce a new method for PID with missing data. In particular, we propose an adaptation of an attention-based multi-instance learning approach [12]. The resulting model can predict and learn from examples containing any number of missing measurements and allows for using knowledge learned from complete examples when classifying incomplete ones, as well as the other way around.
Our tests show that this additional knowledge improves the performance of neural-network-based models. We evaluate different methods for coping with incomplete data and show that our approach achieves state-of-the-art performance in PID with neural networks. We attribute this performance gain to the fact that the proposed method learns from all available data, including examples with missing values.

Related works
2.1 Missing data sources
The incomplete data problem in statistical and machine learning (ML) models has been widely studied over decades [13,14]. In [15], the authors highlight three main categories of missing data according to the source of the information gaps. The first case occurs when examples are missing completely at random (MCAR), where the probability of a value being missing is independent of all parameters. If the probability of a value being missing depends only on the values of the observed attributes, we call it missing at random (MAR). Finally, if this probability depends on the value of the missing parameter itself, we refer to this case as missing not at random (MNAR).
In the data used for particle identification at the ALICE experiment, we can distinguish two main sources of missing values. In some cases, a particular value (the signal from one of the detectors) is missing because of a temporary malfunction of that detector, independent of the measured particle. In this case, the signal is missing completely at random (MCAR). In other cases, we lack the measurement because the particle falls outside the effective region of a particular detector. For example, the efficiency of the TOF detector drops significantly for particles with transverse momentum below 0.5 GeV/c. In such a case, the data is missing not at random (MNAR).
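The two sources of missing values described above can be illustrated with a small sketch. It is not the simulation used in this work: the track dictionaries, the `mask_tof` helper, the failure probability `p_fail`, and the hard momentum cut (the text only states that the TOF efficiency drops significantly below 0.5 GeV/c) are all assumptions made for illustration.

```python
import random

def mask_tof(tracks, p_fail=0.1, pt_threshold=0.5, seed=0):
    """Return copies of the tracks with the TOF signal removed by two mechanisms."""
    rng = random.Random(seed)
    masked = []
    for track in tracks:
        track = dict(track)  # do not modify the caller's data
        # Detector malfunction: independent of the particle (MCAR).
        detector_down = rng.random() < p_fail
        # Track outside the effective TOF region. A hard cut is a
        # simplification of the efficiency drop described in the text.
        out_of_range = track["pt"] < pt_threshold
        if detector_down or out_of_range:
            track["tof"] = None
        masked.append(track)
    return masked

# Hypothetical tracks: transverse momentum in GeV/c and an arbitrary TOF signal.
tracks = [{"pt": 0.2, "tof": 1.1}, {"pt": 0.8, "tof": 0.9}, {"pt": 1.5, "tof": 1.0}]
masked = mask_tof(tracks, p_fail=0.0)  # only the momentum-dependent mechanism
```

With `p_fail=0.0`, only the low-momentum track loses its TOF signal; increasing `p_fail` removes additional signals regardless of the track properties.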

Machine learning with incomplete data
Several popular machine learning algorithms, such as neural networks, cannot be directly trained with missing data. Therefore, there exist different solutions to that problem that can be roughly divided into two groups: (1) those that modify the training dataset [16,17] and (2) those that adapt ML architectures [18][19][20][21]. Data treatment methods transform datasets with missing values into complete ones that can be used with standard machine learning architectures. Adapted architectures are capable of processing incomplete data without any alteration.
The simplest method for data transformation is known as case deletion, where examples without full data availability are simply removed from the training set. This limits the potential training capabilities of the model and can result in bias induced by missing part of the original data distribution. The alternative technique known as imputation fills the missing data with artificial values calculated as a mean, median, or value predicted with a simple model, e.g. linear regression. This idea is further extended in [22], where the imputation model is optimized jointly with the regressor. Similarly to the case deletion approach, imputation methods can also significantly disturb the predictions of the neural model, which might result in unrealistic behavior.
To overcome those shortcomings, methods based on adjusting the model were proposed. The most straightforward approach, suitable when missing data affects only a few attributes, is an ensemble of different classifiers. In such a case, different models are trained on subsets of the training dataset without missing data. In particular, in neural network reduction [21], the authors propose to split the dataset into the largest possible complete subsets.
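The two data treatment strategies described above can be sketched in a few lines. This is an illustrative toy, not the preprocessing used in this work: missing values are represented by `None`, the helper names are hypothetical, and `mean_imputation` assumes that every column has at least one observed value.

```python
def case_deletion(rows):
    """Keep only the rows without missing (None) values."""
    return [row for row in rows if None not in row]

def mean_imputation(rows):
    """Replace each missing value with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))  # assumes a non-empty column
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]

rows = [
    [1.0, 2.0, 3.0],
    [2.0, None, 5.0],
    [3.0, 4.0, None],
]
complete_only = case_deletion(rows)  # one row survives
imputed = mean_imputation(rows)      # all rows survive, two values are artificial
```

In this toy dataset, case deletion discards two of the three examples, while mean imputation keeps all of them at the cost of inserting values that were never measured.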
There are several domains where machine learning with incomplete data is already employed. In [23], the authors use a recurrent neural network to model healthcare data with missing values. A similar idea was employed in medical applications such as breast cancer prediction [24]. Machine learning algorithms with missing data are also used in other domains such as traffic prediction [25,26].

Particle identification techniques
As mentioned in the introduction, the currently employed PID techniques depend on human-defined selection criteria applied to the response signal from the detectors used in the analysis. In most cases, the so-called "$n_\sigma$" method is used, whose main parameter is the number of standard deviations by which the signal left by a given particle in the detector differs from the expected value. For example, if both the TPC and TOF detectors are used, a common approach is to define a PID selection as
$$\sqrt{n_{\sigma,\mathrm{TPC}}^2 + n_{\sigma,\mathrm{TOF}}^2} < \lambda.$$
The actual cut-off value $\lambda$, however, depends on the specific analysis and on the requirements for purity and efficiency of the sample. Typically, this value is set between 2 and 3. This method can be further modified by adding additional conditions, for instance, on rejection of other types of particles. In such a case, if one analyzes, e.g., charged kaons, one may provide the acceptance criterion for the kaon hypothesis (i.e. how far in terms of $n_\sigma$ the signal can be from the expected value for kaons) as well as the rejection criterion for the pion, proton, and electron hypotheses (i.e. what is the minimum $n_\sigma$ that the signal must have compared to the expected values for pions, protons, and electrons).
Outside of the ALICE experiment, several attempts have been made to use neural networks for PID. In the LHCb experiment [9], shallow neural networks are used to classify calorimeter signals. Similarly, in ATLAS, neural models are used to identify electrons [10], while the CMS Collaboration identifies τ lepton decays with deep neural networks [11].
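A minimal sketch of such a selection is given below. It assumes that the TPC and TOF deviations are combined in quadrature and mirrors the momentum-dependent variant used later for the baseline comparison (TPC only below 0.5 GeV/c, TPC and TOF together above it, with the TOF signal required). The function names and the representation of a missing TOF signal as `None` are illustrative assumptions, not part of the ALICE software.

```python
import math

def n_sigma(signal, expected, resolution):
    """Deviation of a detector signal from its expected value, in units of the resolution."""
    return (signal - expected) / resolution

def passes_selection(n_tpc, n_tof, pt, cut=3.0, pt_threshold=0.5):
    """Accept a track under one particle hypothesis.

    Below pt_threshold only the TPC deviation is used; above it the TPC and
    TOF deviations are combined in quadrature and the TOF signal is required.
    """
    if pt < pt_threshold:
        return abs(n_tpc) < cut
    if n_tof is None:  # TOF required but not available
        return False
    return math.hypot(n_tpc, n_tof) < cut

accepted_low_pt = passes_selection(n_tpc=2.0, n_tof=None, pt=0.3)
accepted_high_pt = passes_selection(n_tpc=2.0, n_tof=2.0, pt=1.0)
rejected_high_pt = passes_selection(n_tpc=2.5, n_tof=2.5, pt=1.0)
```

Note that a track above the threshold with no TOF signal is rejected outright, which is one way in which hand-crafted selections lose incomplete examples.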

Attention-based neural network architecture for particle identification
In this work, we focus on the problem of PID in which a type of particle is assigned to a data sample based on its characteristics recorded by a set of detectors. We treat particle identification as a set of binary classification tasks in a so-called one vs. all approach. Given a set of examples, each labeled with one of k particle types, the task is to find k binary classifiers corresponding to each particle type. Each classifier is tasked with identifying whether or not a particle, represented by a given example, belongs to the specific particle type. Every example consists of a set of real-valued features, based on measurements from a detector, out of which some might be missing.
With this formulation, we introduce a novel method for PID that can make use of examples with missing data. We base our approach on the attention mechanism, similar to the method introduced for a medical use-case in AMI-Net [12]. Our system is composed of several steps. We first encode all data examples into a set of feature-value pairs [27], independently of their characteristics regarding missing data, and transform them into embeddings. These embeddings are further processed by the Transformer's [28] encoder module with multi-head attention. The encoder output is combined into a single feature vector that is an input to the final classifier. The overview of our system is shown in Fig. 2, while in the following subsections, we describe its building blocks in detail.
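The one vs. all formulation can be illustrated by how the training targets are built: one binary label vector per particle type. The helper name and the string labels below are hypothetical and serve only as an example.

```python
def one_vs_all_labels(particle_types, classes):
    """Build one binary target vector per class from multi-class labels."""
    return {c: [1 if t == c else 0 for t in particle_types] for c in classes}

particle_types = ["pion", "kaon", "pion", "proton"]
targets = one_vs_all_labels(particle_types, classes=["pion", "kaon", "proton"])
# targets["kaon"] marks which examples the kaon classifier should accept.
```

Each of the k classifiers is then trained on the same examples but with its own target vector.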

Feature Value Pairs
We prepare the incomplete data for the model by creating sets of feature-value pairs from the incomplete vectors. Each pair consists of a non-missing value in the vectors and the index of that value. We apply one-hot encoding to these indices to aid the model in their processing. An example is shown in Table 1.
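The preprocessing step can be sketched as follows. The sketch assumes that missing values are marked with `None` and that each pair is stored as a single vector made of the one-hot encoded feature index followed by the value; the exact layout used in our implementation may differ.

```python
def to_feature_value_pairs(example):
    """Turn a vector with missing entries into a set of (one-hot index, value) vectors."""
    n_features = len(example)
    pairs = []
    for index, value in enumerate(example):
        if value is None:  # missing measurements produce no pair at all
            continue
        one_hot = [1.0 if i == index else 0.0 for i in range(n_features)]
        pairs.append(one_hot + [value])
    return pairs

example = [0.7, None, 1.3, None]  # 4 attributes, 2 of them missing
pairs = to_feature_value_pairs(example)
```

The number of resulting vectors equals the number of available measurements, so complete and incomplete examples differ only in the size of the set, not in the format of its elements.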

Embedding
Embedding is a continuous vector representation of discrete data. Embedded feature indices can use relations between features more effectively by placing similar features close in the embedded space. We create embeddings by applying a neural network with a single hidden layer to the feature-value vectors.

Transformer Encoder
The Transformer is an attention-based architecture commonly used for sequence transduction. As the Transformer is used to process sequences of arbitrary lengths, it can naturally be adapted to learn relations between available features regardless of the number of missing values. We apply the Transformer's encoding module to the set of embedding vectors to connect the different features each vector represents.
The encoder consists of a stack of N identical layers. Each layer is made of two sub-layers: a multi-head attention layer and a dense neural network. The multi-head attention is applied to the input set as a whole, while the NN is applied to each vector in a set separately. To ease the training of the encoder, we apply a residual connection around each sub-layer, followed by layer normalization.
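The structure of a single encoder layer can be sketched as below. The sketch is schematic: the attention and dense sub-layers are passed in as plain functions, and the layer normalization omits the learnable gain and bias that a full implementation would typically include. All names are illustrative.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learnable gain or bias)."""
    mean = sum(vector) / len(vector)
    var = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(var + eps) for x in vector]

def residual_sublayer(vectors, sublayer):
    """Apply a set-level sub-layer, add the residual connection, then normalize."""
    outputs = sublayer(vectors)
    return [layer_norm([x + y for x, y in zip(v, out)])
            for v, out in zip(vectors, outputs)]

def encoder_layer(vectors, attention, feed_forward):
    """One encoder layer: attention over the whole set, then a dense NN per vector."""
    vectors = residual_sublayer(vectors, attention)
    return residual_sublayer(vectors, lambda vs: [feed_forward(v) for v in vs])

# Stand-in sub-layers for illustration: identity attention and a ReLU "network".
out = encoder_layer(
    [[1.0, 2.0, 3.0], [0.0, 0.0, 6.0]],
    attention=lambda vs: vs,
    feed_forward=lambda v: [max(0.0, x) for x in v],
)
```

The layer maps a set of n vectors to a set of n vectors of the same dimension, which is what allows N such layers to be stacked for any number of available features.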

Multi-head attention
To connect pairs of vectors in the input set, each input vector is linearly transformed into h query, key, and value vectors, and scaled dot-product attention is applied to each of the h triples of query, key, value sets, where h is the number of heads. This allows specific patterns to be found in pairs of input vectors. For example, a measurement from a specific detector could be used if the momentum is within a particular range. As each of the N layers in the encoder contains a multi-head attention sub-layer, and each sub-layer connects pairs of vectors, the full encoder can theoretically connect subsets of $2^N$ vectors.

Scaled dot-product attention
Given sets of query, key, and value vectors of dimension $d_k$, scaled dot-product attention calculates a set of weighted averages of the value vectors based on the similarities between the corresponding keys and queries. The similarities are obtained by computing the dot product of the query and key vectors. The dot products are scaled by a factor of $1/\sqrt{d_k}$ to counteract the increase of the dot product with an increasing dimension of the query, key, and value vectors. Finally, the softmax function is applied to the scaled dot products to obtain the weights used to calculate the averages of value vectors. The whole computation can be described as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V,$$
where $Q, K, V \in \mathbb{R}^{n \times d_k}$ are sets of n query, key and value vectors, respectively.
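As an illustration, scaled dot-product attention can be written directly from this description using plain Python lists. This is a didactic sketch rather than the implementation used in our experiments, which would rely on batched tensor operations; the function names are illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """For each query, return the softmax-weighted average of the value vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by 1 / sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# A query equally similar to both keys yields the plain average of the values.
attended = scaled_dot_product_attention(
    queries=[[0.0, 0.0]],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[1.0, 2.0], [3.0, 4.0]],
)
```

Nothing in the computation depends on the number of vectors n, which is why the same weights can process examples with any number of available measurements.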

Attention pooling
As the classifier neural network cannot process an unordered, variable-size set of vectors, merging the vectors obtained from the Transformer's encoder into a single vector is necessary. Therefore, we pool the output vectors together by using the self-attention technique that was also used in the Transformer, where each vector is assigned weights based on its own values. These weights are then used to calculate the weighted average of all the vectors in the set.

Self attention
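Given a set of vectors $\{v_1, v_2, \ldots, v_n\}$, where $v_i \in \mathbb{R}^{d_{model}}$, we obtain the pooled vector as follows:
$$E_i = \mathrm{NN}(v_i), \qquad A_{ij} = \frac{\exp(E_{ij})}{\sum_{k=1}^{n} \exp(E_{kj})}, \qquad o_j = \sum_{i=1}^{n} A_{ij} v_{ij}.$$
Here, $E, A \in \mathbb{R}^{n \times d_{model}}$ are respectively the self-attention values and weights, $o \in \mathbb{R}^{d_{model}}$ is the pooled output vector, and $\mathrm{NN}$ is a neural network whose input and output dimensions are both equal to $d_{model}$.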
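A sketch of the pooling step is given below. It assumes that the attention weights are obtained with a softmax taken over the set dimension, separately for each of the d_model coordinates, so that each output coordinate is a weighted average over the set. The scoring network is passed in as a function `score_fn`, a hypothetical stand-in for the self-attention neural network.

```python
import math

def attention_pool(vectors, score_fn):
    """Pool a variable-size set of vectors into one vector of the same dimension."""
    scores = [score_fn(v) for v in vectors]  # one score vector per input vector
    d_model = len(vectors[0])
    pooled = []
    for j in range(d_model):
        # Softmax over the set for coordinate j (assumed normalization).
        column = [e[j] for e in scores]
        shift = max(column)
        exps = [math.exp(c - shift) for c in column]
        total = sum(exps)
        weights = [x / total for x in exps]
        pooled.append(sum(w * v[j] for w, v in zip(weights, vectors)))
    return pooled

# With constant scores the weights are uniform and pooling reduces to the mean.
pooled = attention_pool([[1.0, 2.0], [3.0, 6.0]], score_fn=lambda v: [0.0] * len(v))
```

The output always has d_model coordinates regardless of how many vectors are in the set, which gives the classifier a fixed-size input.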

Classifier
We obtain the final prediction score for the PID task by applying a simple neural network with a single-value output to the pooled vector. We normalize the score to the range (0, 1) by applying the logistic function:
$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$
The obtained score approximates the probability of the example corresponding to a specific particle type.

Evaluation
We evaluate our model on the PID task by comparing it against a standard $n_\sigma$-based selection technique, an ensemble of neural networks [21], as well as two standard techniques for ML with missing data: mean imputation and linear regression imputation. Additionally, in a scenario with no missing data in the test set, we compare our approach to the case deletion procedure.

Dataset
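The data comes from a Monte Carlo simulation of proton-proton collisions at $\sqrt{s} = 13$ TeV with a realistic simulation of the time evolution of the detector conditions in the LHC Run 2 data-taking period. The simulation was performed with PYTHIA 8 [29], the GEANT [30] particle transport model, and general-purpose settings. The dataset consists of 2 751 934 examples with track transverse momentum $p_T \geq 0.1$ GeV/c. 95% of examples fall into the range [0.12, 1.76] GeV/c. Each example contains 19 features:
• detector signals: 1 TPC feature, 2 features per TRD and TOF,
• number of shared TPC clusters,
• spatial coordinates of a track reconstruction starting point (x, y, z in the local coordinate system, and the rotation angle α between local and global coordinate systems),
• track momentum p and its $p_T$, $p_x$, $p_y$, $p_z$ components,
• track charge and type (propagated/non-propagated Run 3 track, Run 2 track),
• the distances of closest approach (DCA) of the track trajectory to the collision primary vertex measured in the xy plane ($d_{xy}$) and the z direction ($d_z$).
The examples are labeled with ten different particle types. The particle species distribution is shown in Table 2.
There are four combinations of missing values, as measurements from the TOF and TRD detectors are missing for some examples. Figure 3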

Models
For each compared method, we train identical binary classification models for the six most abundant particle types: pions, protons, kaons, and their respective antiparticles. All trained models are identical for imputation methods and the neural network ensemble. They have three hidden layers of sizes 64, 32 and 16, and a single output. Between the layers, we use the Rectified Linear Unit (ReLU) activation function $f(x) = \max(0, x)$. Dropout regularization with a rate of 0.1 is applied after each activation layer. The input dimension depends on the method used. For imputation techniques, all models process inputs of size 19, as all missing features are imputed. For the neural network ensemble, the four networks have input sizes 19, 17, 17, and 15, depending on the combination of missing values from the TRD and TOF detectors (each detector has two features associated with it).
Our proposed architecture consists of the embedding neural network, the Transformer's encoder, the self-attention neural network, and the classifier network. ReLU activation is also used between neural network layers, and dropout regularization with a rate of 0.1 is applied to the output of the embedding layer and the output of each sub-layer in the encoder. The parameters of all the layers are shown in Table 3.
Model parameters are selected using a hyperparameter sweep for a fair comparison of different network architectures. Given a set of parameters, various combinations of their values are used to train different models, from which the model achieving the best results on the validation dataset is chosen. In this way, each of the compared methods may achieve the best possible results. The hyperparameter sweep is a computationally expensive procedure, hence we perform it with a model trained to detect kaons, the most challenging particles to identify in our dataset. This might bias our results slightly towards kaons, so before the full integration of our solution with the ALICE system, we will perform independent sweeps for all particles.
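The input sizes of the four ensemble networks follow directly from the feature counts given above, as the following short sketch shows. The constant and function names are illustrative.

```python
from itertools import product

N_FEATURES = 19                               # features in a complete example
FEATURES_PER_DETECTOR = {"TOF": 2, "TRD": 2}  # detectors whose signals may be missing

def ensemble_input_sizes():
    """Map each combination of missing detectors to the input size of its network."""
    sizes = {}
    for availability in product([True, False], repeat=len(FEATURES_PER_DETECTOR)):
        missing = tuple(name for name, present in zip(FEATURES_PER_DETECTOR, availability)
                        if not present)
        n_missing = sum(FEATURES_PER_DETECTOR[name] for name in missing)
        sizes[missing] = N_FEATURES - n_missing
    return sizes

sizes = ensemble_input_sizes()
```

The empty tuple corresponds to complete examples (19 inputs), a single missing detector leaves 17 inputs, and examples missing both TOF and TRD signals leave 15.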

Results
Each model is trained using all complete and incomplete examples and tested in two cases: (1) with all available test examples and (2) with the complete ones only. For case deletion, no results are available for testing on the dataset with missing data, since it is not possible to directly apply the appropriate models to data with missing values. We evaluate our models on the test set with standard metrics: precision (purity) and recall (efficiency). Since precision and recall are antagonistic, we also include the $F_1$ metric, which is a combination of the first two according to the formula:
$$F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.$$
Numerical results for the pion, proton, and kaon particles are shown in Table 4, and for their antiparticles in Table 5. We highlight the best-performing method for each particle in bold.
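For reference, the three metrics can be computed from binary labels and predictions as in the sketch below. The helper name and the toy data are illustrative, and returning 0.0 when a denominator vanishes is a convention chosen for this example.

```python
def precision_recall_f1(labels, predictions):
    """Precision (purity), recall (efficiency) and F1 for binary labels and predictions."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

labels      = [1, 1, 1, 0, 0, 1]  # 1 = particle of the selected type
predictions = [1, 1, 0, 1, 0, 0]  # 1 = accepted by the classifier
precision, recall, f1 = precision_recall_f1(labels, predictions)
```

In this toy case two of the three accepted examples are correct (precision 2/3) and two of the four true examples are found (recall 1/2), giving an $F_1$ of 4/7.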
Table 4: Classification results for the three most common particle species. (a) Pion identification on complete data only.
For the case of incomplete examples, we compare our machine learning solutions with the standard technique described in Section 2.3. For our comparison we used the following selections: $|n_{\sigma,\mathrm{TPC}}| < 3$ for particles with transverse momenta below 0.5 GeV/c and $\sqrt{n_{\sigma,\mathrm{TPC}}^2 + n_{\sigma,\mathrm{TOF}}^2} < 3$ for particles with $p_T \geq 0.5$ GeV/c (in this case the TOF signal was required). Tables 4 and 5 show that machine learning approaches, in general, outperform standard $n_\sigma$-based techniques, providing significantly higher recall for similar or higher precision. Moreover, the proposed architecture is comparable with the other tested techniques. We can also observe that training with additional examples with missing data still results in good quality of the PID on complete examples, as measured by $F_1$ scores. On the other hand, synthetic imputation of mean or predicted missing values can disturb the learned function and might result in lower performance. The cases of pion and kaon identification on complete examples are the only two where the ensemble achieves slightly better $F_1$ results than the proposed architecture.

Detailed analysis of the results
To further highlight the benefits of our approach, we provide a detailed analysis of its performance when compared to the baseline approaches. In Figure 4, we present precision-recall curves of different methods applied to all available test data (including incomplete examples) in the most challenging kaon detection task. This plot provides a detailed comparison between methods without the need for threshold selection. Additionally, in Figure 5, we analyze the differences in performance for different ranges of particle momentum $p_T$. We can observe that our approach yields a significantly higher area under the curve and degrades less for particles with high momentum values. In Figures 6 and 7, we perform a similar analysis using only complete test data examples. Detailed evaluations for all other particles are included in the supplementary material.

Computational time comparison
In Tables 6 and 7, we present the computational cost of training and inference with our method compared to the baselines. For training time, we report the average time needed (in milliseconds) to update the model's weights, from loading data, through forward and backward propagation, up to the final update. For inference time, we report the average time of a forward pass through the network (in microseconds). Since our model requires calculation of the attention between features coming from different detectors, both its training and inference times are around 3 times longer than for the standard methods. However, all methods can be easily parallelized (up to the GPU memory size), which can be seen in the nearly constant times across different batch sizes. All of the calculations are performed on a small local machine with an Nvidia GTX 1660Ti 6GB GPU, an Intel Core i7-9750H CPU, and 16 GB of 2400 MHz (MT/s) RAM.

Conclusion
This work considers the real-world scenario of classification from incomplete data in the particle identification task, where due to the nature of physical processes, not all of the information is always recorded by all of the detectors. To solve this problem, we propose a novel method based on the attention mechanism. We verify that our approach is able to learn from both complete and incomplete data, improving on the performance of models trained solely on the complete examples. Moreover, our method performs no worse than other techniques while avoiding their drawbacks, such as the insertion of artificial data (imputation methods) and a potentially overly complex architecture (neural network ensemble).
Supplementary material: Machine-learning-based particle identification with missing data
1 Detailed evaluations for all particles
In the following sections, we show results of the detailed comparison between our approach and the benchmark approaches for all of the particles we consider in our PID task.
Fig. 2: The proposed model architecture. Layered blocks are applied separately to each vector in a set. Single blocks are applied to their input as a whole.

(a) 3 data samples with 4 attributes with different amounts of missing values.

Fig. 6: Precision-recall curves for different ML-based approaches without missing data.

Fig. 1: Performance of different PID methods in the kaon selection task with missing data as a function of particle momentum.

Table 1: An example of preprocessing data samples into sets of feature-value pairs.
particle transport model, and general-purpose settings.The dataset consists of 2 751 934 examples with track transverse momentum p T ≥ 0.1 GeV/c.95% of examples fall into the range [0.12, 1.76] GeV/c.Each example contains 19 features: • detector signals: 1 TPC feature, 2 features per TRD and TOF • number of shared TPC clusters • spatial coordinates of a track reconstruction starting point (x, y, z in the local coordinate system, and the rotation angle α between local and global coordinate systems), • track momentum p and its p T , p x , p y , p z components, • track charge and type (propagated/non-propagated Run 3 track, Run 2 track) • the distances of closest approach (DCA) of the track trajectory to the collision primary vertex measured in the xy plane (d xy ) and the z direction (d z )

Table 5: Classification results for the three most common antiparticle species. (a) Antipion identification on complete data only.

Table 6: Average training time of our method compared to the baselines [ms].

Table 7: Average inference time of our method compared to the baselines [µs].