Particle identification with machine learning from incomplete data in the ALICE experiment

The ALICE experiment at the LHC measures properties of the strongly interacting matter formed in ultrarelativistic heavy-ion collisions. Such studies require accurate particle identification (PID). ALICE provides PID information via several detectors for particles with momentum from about 100 MeV/c up to 20 GeV/c. Traditionally, particles are selected with rectangular cuts. A much better performance can be achieved with machine learning (ML) methods. Our solution uses multiple neural networks (NN) serving as binary classifiers. Moreover, we extended our particle classifier with Feature Set Embedding and attention in order to train on data with incomplete samples. We also present the integration of the ML project with the ALICE analysis software, and we discuss domain adaptation, the ML technique needed to transfer the knowledge between simulated and real experimental data.


Introduction
ALICE (A Large Ion Collider Experiment) [1] is one of the four major detectors located at the Large Hadron Collider (LHC) at CERN [2]. The main goal of ALICE is to measure the properties of the quark-gluon plasma (QGP), a deconfined state of quarks and gluons, theorized to exist in the early Universe [3]. Detailed studies of QGP require very precise particle identification (PID), i.e., the ability to discriminate between different particle species produced during the collision. High PID accuracy distinguishes ALICE from other LHC experiments. It also allows for selecting a subset of particles required for a specific analysis. Thanks to several detectors operating concurrently, various types of particles can be separated over a wide momentum range, from around 100 MeV/c up to around 20 GeV/c. Figure 1 presents a scheme of the ALICE detectors as used during the Run 1 and Run 2 LHC data-taking periods.
In particular, particle identification over the full azimuthal angle uses information from three detectors: the Time Projection Chamber [5] (TPC), the Time-of-Flight detector [6] (TOF), and the Transition Radiation Detector [7] (TRD). The TPC is one of the most important ALICE detectors as it records the 3D trajectory of charged particles. It also measures particle-specific ionization energy loss, which is essential for PID. The TOF measures the particle time of flight from the collision vertex to the detector. Particle velocity and mass can then be computed from the time of flight and the track information. The TRD records transition radiation, the emission of photons by electrons traversing the boundaries of a radiator. This makes it possible to distinguish electrons from other charged particles. Since all the aforementioned detectors detect particles carrying a non-zero electric charge, our work focuses on the identification of charged particles.
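The TOF-based mass determination mentioned above follows from standard relativistic kinematics, $\beta = L/(ct)$ and $m = p\sqrt{1/\beta^2 - 1}$ in natural units. The sketch below illustrates it; the function names and the example track values are assumptions made for this illustration and are not taken from the ALICE reconstruction code.

```python
import math

SPEED_OF_LIGHT_CM_PER_PS = 0.0299792458  # c expressed in cm/ps


def beta_from_tof(track_length_cm, time_of_flight_ps):
    """Velocity in units of c from the path length and the measured flight time."""
    return track_length_cm / (SPEED_OF_LIGHT_CM_PER_PS * time_of_flight_ps)


def mass_from_tof(momentum_gev, track_length_cm, time_of_flight_ps):
    """Mass in GeV/c^2 from m = p * sqrt(1/beta^2 - 1); None for unphysical beta."""
    beta = beta_from_tof(track_length_cm, time_of_flight_ps)
    if not 0.0 < beta < 1.0:
        return None  # resolution effects can push beta to 1 or above
    return momentum_gev * math.sqrt(1.0 / beta**2 - 1.0)


# Illustrative track only: 1 GeV/c momentum, 3.8 m path length, 14.1 ns flight time.
mass = mass_from_tof(momentum_gev=1.0, track_length_cm=380.0, time_of_flight_ps=14100.0)
```

With these illustrative numbers the reconstructed mass falls in the kaon mass region. A real analysis would also propagate the timing and momentum resolutions.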
The particles are identified based on observables derived from signals recorded by the detectors, using selection criteria that compare them with theoretical calculations such as the Bethe-Bloch formula in the case of TPC signals. The traditional approach, the so-called $n\sigma$ method, is to compute, for each detector-specific observable, the number of standard deviations by which the measured value differs from the expected one. If a particle has associated PID observables exceeding a certain number of standard deviations, it is rejected. For instance, if one intends to use both TPC and TOF signals, the PID selection would be defined as $\sqrt{n_{\sigma,\mathrm{TPC}}^2 + n_{\sigma,\mathrm{TOF}}^2} < \Lambda$, where $\Lambda$ depends on the desired balance between purity and efficiency and is typically in the range of 2 to 3. Such an approach is justified when the separation between various particle species is sufficiently large. However, in reality, the characteristics of different particle species can overlap, and combining information from multiple detectors can become very complex.
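As a minimal illustration of the combined TPC and TOF selection described above, the following sketch computes the number of standard deviations for each detector and applies the quadrature cut. The function names, signal values, and resolutions are assumptions chosen for the example and are not ALICE calibration values.

```python
import math


def n_sigma(measured, expected, resolution):
    """Signed distance of a measurement from its expected value, in units of the resolution."""
    return (measured - expected) / resolution


def passes_combined_cut(n_sigma_tpc, n_sigma_tof, max_n_sigma=3.0):
    """Accept a track if the TPC and TOF deviations, added in quadrature, stay below the cut."""
    return math.hypot(n_sigma_tpc, n_sigma_tof) < max_n_sigma


# Illustrative numbers only: a TPC dE/dx signal and a TOF time tested against one hypothesis.
tpc = n_sigma(measured=52.0, expected=50.0, resolution=2.5)
tof = n_sigma(measured=12150.0, expected=12000.0, resolution=100.0)
accepted = passes_combined_cut(tpc, tof, max_n_sigma=2.0)
```

Tightening `max_n_sigma` raises purity at the cost of efficiency, which is the trade-off controlled by $\Lambda$ in the text.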

PID with machine learning
A natural response to the difficulties with optimal particle selection is to use machine learning (ML) algorithms, particularly the Bayesian method and neural networks (NN). In this view, particle identification is a standard classification problem. Compared to a human analyzer, ML can utilize more input particle features and learn more complex relationships between the variables. The Bayesian approach [8] is available in the new ALICE software framework O², but its flexibility is limited. For this reason, we focused on neural networks, whose usage for particle identification was explored only for very specific analysis cases in other experiments [9][10][11].

Neural network approach
The simplest model, and also the starting point of our analysis, is a single feed-forward network trained and applied on Monte Carlo simulated data. We implemented a binary classifier, with one instance for each (anti)particle species and each combination of detector signals. The network outputs a single value normalized to the range (0, 1) by applying the logistic function $\sigma(x) = \frac{1}{1 + e^{-x}}$. The output value is interpreted as the probability that the example belongs to a specific particle type, given the selected set of detector measurements (TPC only, TPC+TOF, TPC+TOF+TRD). It is not possible to process different combinations of detector signals by a single network because the choice of detector set impacts the size of the network input vector. The networks for each (anti)particle species and detector setup are trained independently. Initial results reported in Ref. [12] are promising: a simple network using combined information from TPC and TOF detectors can improve efficiency compared to the traditional method while not losing purity.
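The one-classifier-per-species-and-detector-set layout described above can be sketched as follows. This is an untrained toy network in plain Python, not the model from Ref. [12]; the class name, layer sizes, and feature names are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Logistic function mapping a real number to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


class TinyBinaryClassifier:
    """One hidden ReLU layer followed by a single sigmoid output (toy, untrained)."""

    def __init__(self, n_inputs, n_hidden=8, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.gauss(0.0, 0.5) for _ in range(n_inputs)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.gauss(0.0, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def predict(self, features):
        if len(features) != len(self.w1[0]):
            raise ValueError("input size does not match this detector setup")
        hidden = [max(0.0, sum(w * x for w, x in zip(row, features)) + b)
                  for row, b in zip(self.w1, self.b1)]
        return sigmoid(sum(w * h for w, h in zip(self.w2, hidden)) + self.b2)


# Hypothetical feature lists: the input size depends on the detector set.
FEATURES = {
    "TPC": ["p", "tpc_signal"],
    "TPC+TOF": ["p", "tpc_signal", "tof_signal"],
}
# One independent model per (species, detector set) pair.
models = {(species, setup): TinyBinaryClassifier(len(columns))
          for species in ("pion", "kaon") for setup, columns in FEATURES.items()}

score = models[("kaon", "TPC+TOF")].predict([0.9, 0.3, -0.2])
```

Because each model has a fixed input size, passing a TPC+TOF feature vector to a TPC-only model raises an error, which mirrors the limitation discussed in the text.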
Nevertheless, it must be considered that it is more difficult to estimate the systematic uncertainties of machine learning models. An example solution is to use dropout, a standard method for reducing overfitting in neural networks, to approximate Bayesian uncertainty as shown in [13].
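The idea of using dropout for uncertainty estimation can be illustrated by keeping dropout active at inference time and looking at the spread of repeated predictions. The sketch below is a simplified illustration under stated assumptions: dropout is applied to the inputs of a single-layer model purely for brevity, and the weights, dropout probability, and number of passes are arbitrary. It does not reproduce the exact procedure of Ref. [13].

```python
import math
import random
import statistics


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward_with_dropout(features, weights, bias, drop_prob, rng):
    """Single-layer classifier with dropout kept active at inference time."""
    keep = 1.0 - drop_prob
    total = bias
    for w, x in zip(weights, features):
        if rng.random() < keep:
            total += w * x / keep  # inverted dropout keeps the expected sum unchanged
    return sigmoid(total)


def mc_dropout_predict(features, weights, bias, drop_prob=0.2, n_passes=200, seed=1):
    """Mean and standard deviation of the output over repeated stochastic passes."""
    rng = random.Random(seed)
    outputs = [forward_with_dropout(features, weights, bias, drop_prob, rng)
               for _ in range(n_passes)]
    return statistics.mean(outputs), statistics.stdev(outputs)


mean, spread = mc_dropout_predict([0.9, 0.3, -0.2], weights=[1.5, -0.7, 2.0], bias=0.1)
```

The spread can serve as a rough per-track uncertainty estimate. Whether it is a good proxy for the systematic uncertainty of a PID selection would need to be validated separately.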

Integration of PID ML with the O² framework
The ALICE analysis framework, O² [14], is written in C++ and heavily utilizes the ROOT [15] library, while most popular machine learning frameworks are implemented in Python. Therefore, we needed to build a universal interface between Python-based machine learning projects and the C++-based analysis framework. Our solution makes use of the ONNX (Open Neural Network Exchange) standard [16], which defines a common file format for storing machine learning models developed in various frameworks such as TensorFlow [17] and PyTorch [18]. The ONNXRuntime [19] can then be used for ONNX model inference in C++.
The current PID ML workflow in O² is presented in Figure 2. The processing starts from the analysis input data: reconstructed collisions and tracks in AO2D files. For training, simulated Monte Carlo data is used as it contains particle species labels. The PID ML producer task filters the input with a few rough preselections to exclude meaningless tracks and produces skimmed data that contains only the track properties used by a neural network. The skimmed data is used for model training. Trained models are stored in the ONNX format in the Condition and Calibration Data Base (CCDB), which is accessible by analyses running on the Worldwide LHC Computing Grid. In an analysis task, a PID ML model for a single particle species and detector setup is represented by an instance of the PidOnnxModel class. In more complex use cases, one can use PidOnnxModelInterface instead of dealing with a collection of PidOnnxModel instances. PidOnnxModelInterface provides convenient management of multiple ML models, each with a different target particle species, detector setup, and acceptance threshold.
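The idea of managing several models, each tied to a particle species, a detector setup, and an acceptance threshold, can be sketched in Python as below. This is only a conceptual analogy: the class and method names are hypothetical and do not reproduce the C++ API of PidOnnxModel or PidOnnxModelInterface, and the scoring functions are dummies standing in for ONNX inference.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple


@dataclass
class PidModel:
    """Stand-in for one trained binary classifier (hypothetical, not the O2 class)."""
    pdg_code: int
    detectors: str
    threshold: float
    score: Callable[[Sequence[float]], float]

    def accepts(self, features):
        return self.score(features) >= self.threshold


class PidModelCollection:
    """Looks up the model matching a requested species and detector setup."""

    def __init__(self):
        self._models: Dict[Tuple[int, str], PidModel] = {}

    def register(self, model):
        self._models[(model.pdg_code, model.detectors)] = model

    def accepts(self, pdg_code, detectors, features):
        return self._models[(pdg_code, detectors)].accepts(features)


collection = PidModelCollection()
# Dummy scores replace real inference; 321 and 2212 are the PDG codes of K+ and the proton.
collection.register(PidModel(321, "TPC+TOF", 0.8, lambda features: 0.9))
collection.register(PidModel(2212, "TPC", 0.5, lambda features: 0.2))
kaon_ok = collection.accepts(321, "TPC+TOF", [0.9, 0.3, -0.2])
proton_ok = collection.accepts(2212, "TPC", [0.9, 0.3])
```

Keying the collection by species and detector setup keeps the analysis code free of per-model bookkeeping, which is the convenience the interface class is described as providing.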
Utilization of ONNXRuntime comes at the cost of creating additional memory copies. The input data is stored on disk in the ROOT file format and read as Apache Arrow tables. Unfortunately, no direct conversion exists between Arrow tables and ONNXRuntime tensors. Therefore, the data of interest needs to be copied into C++ STL vectors and then converted to a tensor. This requires hardcoding of inputs for each ML model, which results in substantial code repetition in all analyses using ONNXRuntime and prevents the full development of a universal ONNXRuntime interface in O². The aforementioned problems are visible in the inference example in Listing 1. There exists at least one alternative to ONNXRuntime, ROOT TMVA SOFIE. It is being developed as part of the ROOT library and enables inference of neural networks saved in the ONNX file format. SOFIE is naturally adapted to the ROOT file format, making it easier to fit into the ROOT-Arrow O² data pipeline. Nevertheless, SOFIE is still in the development stage. The framework lacks the implementation of many advanced neural network operators as well as the implementation of other ML models such as Boosted Decision Trees used by various groups in ALICE.
Therefore, the main effort is still organized around a more analysis-friendly adaptation of ONNXRuntime. Once a direct, preferably copy-free Arrow-ONNXRuntime conversion is developed, it will be possible to use the so-called IO Binding to map inputs between data and an ML model. It will also enable encoding model inputs as, for example, strings in a JSON configuration file. This will result in a universal, analysis-independent inference code as depicted in Listing 2.
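The benefit of naming model inputs in a configuration file, rather than hardcoding them per model, can be illustrated with a small Python sketch. The JSON keys, the model name, and the column names are hypothetical, and a plain dictionary of lists stands in for an Arrow table. Unlike the copy-free conversion discussed above, this sketch does copy the data; it only illustrates the name-based input mapping.

```python
import json

# Hypothetical configuration: the model lists its inputs by column name.
CONFIG = json.loads("""
{
  "model": "kaon_TPC_TOF",
  "inputs": ["p", "tpc_signal", "tof_signal"]
}
""")


def build_input_rows(columns, input_names):
    """Gather the named columns into row-major model input without per-model code."""
    missing = [name for name in input_names if name not in columns]
    if missing:
        raise KeyError(f"columns missing from table: {missing}")
    return [list(row) for row in zip(*(columns[name] for name in input_names))]


# Column-oriented stand-in for a table of tracks; "eta" is present but not requested.
table = {
    "p": [0.5, 1.2],
    "tpc_signal": [60.1, 48.3],
    "tof_signal": [0.97, 1.01],
    "eta": [0.1, -0.4],
}
rows = build_input_rows(table, CONFIG["inputs"])
```

Switching to a model with different inputs then only requires a different configuration, while `build_input_rows` stays unchanged.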
Presently, the same integration design that PID ML has, with ONNXRuntime and a C++ class to wrap management of a single ONNX model, has been applied in other machine learning projects in O². A common interface that further unifies various implementations of the ONNX integration was developed based on PidOnnxModel. It is already used for supporting calculations for TPC PID and selection of $\Lambda_c^+$ in the $\mathrm{p}K^-\pi^+$ decay channel. A specialization of the unified interface is being propagated to all heavy-flavor analyses. All of these developments encountered the same difficulties with input hardcoding and code repetition, and they would benefit from the code simplification brought by a direct Arrow-ONNXRuntime conversion.

Feature Set Embedding and the attention mechanism
Even though an analyzer can choose their own detector set, this does not ensure that all input samples contain all needed detector signals. Particle properties are measured independently by each ALICE detector. Therefore, a particle can be recorded by a subset of detectors while not being measured by others. Possible reasons can be random, like detector malfunction or switching off, and systematic, like particle properties falling outside the detector acceptance region, e.g., too low transverse momentum, $p_\mathrm{T}$, to reach outer detectors like TOF and TRD. One can simply remove all incomplete samples from the input. However, this does not allow the identification of samples with missing information, which can form the majority of data. Another method is imputation, which introduces artificial bias to the initial dataset, which can disturb the predictions of the ML algorithm. It is also possible to alter the neural network architecture. For example, one can use a neural network ensemble, a set of classifiers, one for each subset of the training dataset without missing data. The major drawback of this approach is computational complexity, especially with a growing set of attributes with missing values.
To overcome the aforementioned shortcomings, in Ref. [21], we introduce a novel method based on the attention mechanism, similar to the method introduced for a medical use-case in AMI-Net [22]. The system overview is shown in Figure 3.
The first module is based on the Feature Set Embedding strategy proposed in Ref. [23]. Input data samples are encoded into a set of feature-value pairs. Each pair represents a non-missing value in the input sample and a one-hot encoded index of the feature corresponding to this value. Then, a neural network with a single hidden layer computes an embedding for each feature-value pair. The embeddings place similar features close in the embedded space.
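The encoding step described above can be sketched as follows: every non-missing value becomes one row made of a one-hot feature index followed by the value, so the number of rows varies from track to track. The feature names and the use of `None` or NaN to mark missing signals are assumptions made for this example; the subsequent embedding network is omitted.

```python
import math

FEATURE_NAMES = ["p", "tpc_signal", "tof_signal", "trd_signal"]  # hypothetical names


def to_feature_value_pairs(sample):
    """Encode the non-missing entries of a sample as [one-hot feature index..., value] rows."""
    pairs = []
    for index, name in enumerate(FEATURE_NAMES):
        value = sample.get(name)
        if value is None or math.isnan(value):
            continue  # missing measurements are skipped, not imputed
        one_hot = [1.0 if i == index else 0.0 for i in range(len(FEATURE_NAMES))]
        pairs.append(one_hot + [value])
    return pairs


# A track that reached the TPC but has no TOF or TRD signal.
track = {"p": 0.4, "tpc_signal": 75.2, "tof_signal": float("nan"), "trd_signal": None}
pairs = to_feature_value_pairs(track)
```

For this track, only two rows are produced, one for `p` and one for `tpc_signal`; no placeholder values are invented for the missing detectors.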
The Transformer [24] encoding module connects different features represented by a set of embedding vectors and finds input patterns. For example, a detector signal has meaning only if the momentum is within a particular range. The softmax function is applied to the attention output, a variable-size set of vectors. An additional self-attention layer is used to merge these vectors into one. Finally, the pooled vector is processed by the simple classifier described in Section 2.1.
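How a variable-size set of vectors can be merged into a single fixed-size vector is illustrated by the simplified attention pooling below. It assumes a single query vector (which would be learned in practice) and dot-product scores; this is a sketch of the general technique, not the exact pooling layer of Ref. [21].

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, query):
    """Weighted average of the vectors, with weights from softmax(dot(vector, query))."""
    scores = [sum(v_i * q_i for v_i, q_i in zip(vector, query)) for vector in vectors]
    weights = softmax(scores)
    return [sum(w * vector[d] for w, vector in zip(weights, vectors))
            for d in range(len(query))]


query = [0.5, -0.25]  # hypothetical; a trained model would learn this vector
pooled_two = attention_pool([[1.0, 0.0], [0.0, 1.0]], query)
pooled_three = attention_pool([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], query)
```

Both calls return a vector of the same length regardless of how many input vectors were supplied, which is what lets a fixed-size classifier follow the pooling step.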
The attention architecture was tested on data from a Monte Carlo simulation of proton-proton collisions at $\sqrt{s} = 13$ TeV with a realistic simulation of the time evolution of the detector conditions in the LHC Run 2 data-taking period. The simulation was performed with Pythia8 [25], the Geant4 [26] particle transport model, and general-purpose settings. The six most abundant particle species were considered for comparison: pions, kaons, protons, and their antiparticles.
The results reported in Ref. [21] clearly show that machine learning algorithms easily outperform the standard method as measured by the $F_1$ metric. The proposed attention architecture achieves very high scores of $F_1$, precision (purity), and recall (efficiency), comparable with other analyzed ML models. At the same time, our model avoids the flaws of other solutions: artificial bias in imputed and case-deleted data and potentially larger complexity of the neural network ensemble.
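For reference, the three metrics named above can be computed from binary labels as in the sketch below. The labels are made-up illustrative values, not results from Ref. [21].

```python
def precision_recall_f1(true_labels, predicted_labels):
    """Precision (purity), recall (efficiency), and F1 for binary labels (1 = target species)."""
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    denominator = precision + recall
    f1 = 2.0 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Illustrative labels only: 1 marks the target species, 0 marks everything else.
truth = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 1, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(truth, predicted)
```

$F_1$ is the harmonic mean of precision and recall, so it is high only when both purity and efficiency are high.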

Domain Adversarial Neural Networks
Particle identification is used to discriminate particle species in both real experimental data and Monte Carlo simulations. In particular, machine learning techniques presented in this article learn on labeled simulated data but can also be applied to unlabeled experimental data. However, the simulations often result in distributions of particle features shifted as compared to values registered at the experiment. To mitigate this effect, standard PID methods utilize automated processes for data domain alignment. For example, ALICE implemented a tuning method of simulated signals, which shifts back simulated distributions to reproduce, on average, the collected data distributions of selected variables.
Naturally, this is a limited solution, which does not allow for full domain alignment of all particle features. Therefore, we will make use of a known machine learning technique called domain adaptation, which aims to learn the discrepancies between two data domains, the labeled source and the unlabeled target, and translate those to a single hyperspace. The desired classifier is trained and applied to features from the combined latent space. Since the classifier works independently of the initial data domains, it achieves similar performance on both the labeled and unlabeled data (simulated and experimental data in our case).
In the world of neural networks, the Domain Adversarial Neural Network (DANN) [27] is a realization of the domain adaptation technique. As depicted in Figure 4, DANN is composed of three neural networks. The feature mapping module maps the original input into domain-invariant features, which are provided to the particle classifier that outputs the particle type. At the same time, the domain classifier enforces domain invariance of extracted features through adversarial training. Adversarial training requires DANN training to be split into two steps. First, the domain classifier takes current features from the feature mapping network and assigns them domain labels, indicating whether the data comes from a real or a simulated source. Then, the weights of the domain classifier are frozen, and the particle classifier and the feature mapper learn jointly to predict accurate particle species. The weights of the particle classifier and the domain classifier are updated with their respective gradients, while the feature mapper weights are updated with a gradient from the particle classifier and a reversed gradient from the domain classifier. As a result, the trained model maximizes particle identification scores while minimizing domain classifier scores to ensure that the domains are hardly distinguishable for the particle classifier. Overall, the training procedure is more complex than in a simple neural network, but the application performance is similar and depends on the complexity of the particle classifier and the feature mapper.
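The weight updates described above can be written compactly in the notation commonly used for gradient reversal. The symbols are assumptions introduced here for illustration: $\theta_f$, $\theta_y$, and $\theta_d$ denote the weights of the feature mapper, the particle classifier, and the domain classifier, $L_y$ and $L_d$ are the particle and domain classification losses, $\mu$ is the learning rate, and $\lambda$ controls the strength of the reversed gradient.

```latex
\begin{align}
  \theta_y &\leftarrow \theta_y - \mu \, \frac{\partial L_y}{\partial \theta_y}, \\
  \theta_d &\leftarrow \theta_d - \mu \, \frac{\partial L_d}{\partial \theta_d}, \\
  \theta_f &\leftarrow \theta_f - \mu \left( \frac{\partial L_y}{\partial \theta_f}
              - \lambda \, \frac{\partial L_d}{\partial \theta_f} \right).
\end{align}
```

The minus sign in front of $\lambda$ is the reversed gradient: the feature mapper is pushed to make the domain classifier's task harder while still supporting the particle classifier.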
Domain adaptation is widely used in natural language processing [28,29] and computer vision [30,31]. In high-energy physics, the author of Ref. [32] presents how this method can improve the quality of automatic jet tagging on real experimental data. The initial tests with DANN [12] show that this technique improves the classification of particles in experimental data.

Conclusions and outlook
Using machine learning for particle identification can result in higher purity and efficiency than standard methods. The current priority is to test PID ML in a real-world analysis of simulation data from the present Run 3 data-taking period. Afterward, we will extend the attention model with domain adaptation and test it on the new real data. We will check what levels of disagreement between experimental and Monte Carlo data DANN can handle, and we will estimate model uncertainties. Finally, we will be able to achieve regular production of models for Run 3.
In our project, we also solved a technical problem of integration of Python machine learning projects with the C++ ALICE software. Once a convenient data format conversion into ONNXRuntime tensors is developed, O² ML projects will get an even more unified, user-friendly interface.

Figure 1. Components of the ALICE detector in its Run 2 configuration [4].

Figure 2. Scheme of the current ML particle identification workflow in O². More information about the workflow topology in the O² Analysis Framework can be found in [20].

Figure 3. The proposed model architecture. Layered blocks are applied separately to each vector in a set. Single blocks are applied to their input as a whole.

Figure 4. Architecture of Domain Adversarial Neural Network.