Machine learning method for single trajectory characterization

In order to study transport in complex environments, it is extremely important to determine the physical mechanism underlying diffusion, and precisely characterize its nature and parameters. Often, this task is strongly impacted by data consisting of trajectories with short length and limited localization precision. In this paper, we propose a machine learning method based on a random forest architecture, which is able to associate even very short trajectories to the underlying diffusion mechanism with a high accuracy. In addition, the method is able to classify the motion according to normal or anomalous diffusion, and determine its anomalous exponent with a small error. The method provides highly accurate outputs even when working with very short trajectories and in the presence of experimental noise. We further demonstrate the application of transfer learning to experimental and simulated data not included in the training/testing dataset. This allows for a full, high-accuracy characterization of experimental trajectories without the need of any prior information.

In order to study transport in complex environments, it is extremely important to determine the physical mechanism underlying diffusion and precisely characterize its nature and parameters. Often, this task is strongly impacted by data consisting of trajectories with short length (either due to brief recordings or previous trajectory segmentation) and limited localization precision. In this paper, we propose a machine learning method based on a random forest architecture, which is able to associate single trajectories to the underlying diffusion mechanism with high accuracy. In addition, the algorithm is able to determine the anomalous exponent with a small error, thus inherently providing a classification of the motion as normal or anomalous (sub-or super-diffusion). The method provides highly accurate outputs even when working with very short trajectories and in the presence of experimental noise. We further demonstrate the application of transfer learning to experimental and simulated data not included in the training/test dataset. This allows for a full, high-accuracy characterization of experimental trajectories without the need of any prior information.
In the last decades, the research in biophysics has conveyed large efforts toward the development of experimental techniques allowing the visualization of biological processes one molecule at a time [1][2][3][4]. These efforts have been mainly driven by the concept that ensemble-averaging hides important features that are relevant for cellular function. Somehow expectedly, experiments performed by means of these techniques have shown a large heterogeneity in the behavior of biological molecules, thus fully justifying the use of these raffinate tools. Besides, experiments performed using single particle tracking [3] have revealed that even chemicallyidentical molecules in biological media can display very different behaviors, as a consequence of the complex environment where diffusion takes place. By way of example, this heterogeneity is reflected in the broad distribution of dynamic parameters of distinct individual trajectories corresponding to the same molecular species, such as the diffusion coefficient, well above stochastic indetermination. Typically, the trajectories are analyzed by quantifying the (time-averaged) mean square displacement (tMSD) as a function of the time lag τ [5]: The calculation of this quantity -expected to scale linearly for a Brownian walker in a homogeneous environment -has provided a ubiquitous evidence of anomalous behaviors in biological systems, characterized by an asymptotic nonlinear scaling of the tMSD curve δ 2 ∼ τ α . More experiments have shown that the anomalous exponent can vary from particle to particle ( Fig. 1(a)) as a consequence of molecular interactions and that these changes can be experienced by the same particle in space/time [6]. Several methods have been proposed to accurately estimate this exponent for single trajectories [7,8] in the presence of experimental limitations, such as optical diffraction and the finite length of the trajectory.
In some cases, this heterogeneity leads to exotic effects, such as the breaking of ergodicity observed in several cellular systems [9][10][11][12]. Nonergodicity implies the nonequivalence of time and ensemble averages of the mean squared displacement (MSD). In the nonergodic case, δ 2 (τ ) remains random even in the long measurement times, i.e., the diffusion coefficients are irreproducible but the distribution of the tMSD is universal [13]. For nonergodic models, the determination of the anomalous exponent at the single trajectory level does not fully characterize the model, since the tMSD can have different scaling when averaged with respect to the time or to the ensemble ( Fig. 1(b)). Therefore, it requires the calculation of the ensemble-averaged MSD (eMSD) over a rather large number of trajectories [5].
The emergence of anomalous behavior has also been widely studied from the theoretical point of view and conceptually-different models have been proposed [5]. However, the fact that models with different physical properties can produce the same tMSD exponent, strongly limits the unambiguous determination of the underlying dynamics, based only on the evaluation of the tMSD ( Fig. 1(a)). In order to solve this ambiguity, a large effort has been made to classify experimental data showing anomalous transport. As an example, the use of alternative estimators [14,15] has been proposed FIG. 1. Examples of single molecule trajectory heterogeneity (a) Time-averaged mean squared displacement (tMSD) calculated from single trajectories. In the upper panel, we show the tMSD of two trajectories corresponding to molecules that display different anomalous exponents α in spite of belonging to the same physical process. In the lower panel, we show the tMSD of two trajectories generated from different diffusion processes but producing a similar α. In this case, the exponent determined from the tMSD cannot be used to discriminate among models. (b) The upper panels display representative trajectories generated from FBM (left) and CTRW (right) models. The two bottom panels show the corresponding tMSD curves (continuous lines) using the same color coding. Since the subdiffusive FBM is ergodic, the tMSD of a single trajectory can be used to extract the anomalous exponent of the model. However, for nonergodic processes such as the CTRW, the tMSD of a single trajectory is a random variable and its anomalous exponent can be different from the one associated to the model, which calculation requires the use of ensemble averaging. In both cases, the RF algorithm is able to determine the correct anomalous exponent associated with the model, even when it does not simply correspond to the exponent of the single trajectory tMSD, as in the nonergodic case. The values of exponent provided by the RF algorithm are schematically represented by the dashed lines, and are in good agreement with the ground truth values, which for the CTRW were also calculated by the eMSD (dotted lines).
to determine whether the pioneering results of Golding and Cox [17] were arising from a continuous-time random walk (CTRW) [18] or fractional Brownian motion (FBM) [19]. This search for a better classification between CTRW and FBM often relied in the determination of the (non)ergodicity of the data [15,[20][21][22], since CTRW is consistent with weak ergodicity breaking [23]. The experimental evidence of nonergodic behavior has boosted the proposal of new theoretical frameworks consistent with these features [24,25].
In this scenario, determining whether a single-molecule trajectory (or a short segment of it) displays normal or anomalous behavior by its tMSD scaling exponent and associating the motion to a specific physical model are elements of paramount importance to gain insight about the biophysical mechanism underlying anomalous diffusion, thus providing a detailed picture of a variety of phenomena. Recent works in this direction have focused on classification schemes based on optimization procedures [8], Power Spectral Density [26], or Bayesian approaches [27,28].
Surprisingly, in spite of the fast rise of Machine Learn-ing methods, little efforts have been made to characterize single trajectories. A few attempts in this sense have been mainly directed to qualitatively discriminate among confined, anomalous, normal or directed motion [29,30], without extracting quantitative information of classifying with respect to the underlying physical model. This paper presents a Machine Learning algorithm based on the Random Forest (RF) architecture that efficiently and robustly characterizes single trajectories at different levels: first, obtaining the discrimination among several diffusion models; then, providing the estimation of the exponent that characterizes the anomalous diffusion, thus inherently classifying between normal and anomalous (sub-and super-) diffusion. The algorithm allows to accurately tackle these challenging problems even when dealing with short and noisy trajectories.
In comparison to previously reported methods, the algorithm we present in this paper allows for the characterization of a trajectory without prior knowledge about the process from which the single trajectory has been extracted. Most of the works existent in the literature focus their studies on anomalous ergodic trajectories (usually FBM related processes), whereas our method is robust against the appearance of nonergodicity. While showing similar accuracy as others on ergodic trajectories [8,28], to the best of our knowledge, our method represents the first attempt to extract the anomalous exponent for nonergodic processes through single-trajectory characterization.

I. MACHINE LEARNING METHOD
In this section, we outline the main parts of the trajectory characterization algorithm, consisting of: the Machine Learning algorithm that takes the form of a RF architecture; the simulated dataset; and the preprocessing applied to the dataset before being analyzed by the RF. Figure 2 shows a schematic representation of the pipeline.
Random Forest RF is an architecture based on Decision Trees. A decision tree is an efficient non-parametric method widely used for classification and regression problems [31]. The basic idea consists in producing recursive binary splits of the input space, so that the samples with the same label are grouped together. The criterion to produce the splits is based on a homogeneity measure (usually, the information entropy) of the target variable within each of the obtained groups. In regression problems, a commonly used criterion is to select the split that minimizes the Mean Squared Error (MSE); this recursive process continues until some stopping rule is satisfied, e.g., a common one is to consider that a tree node can be split if it contains more than a given number of samples; therefore, the minimum number of samples required to split a tree node should be adjusted in order to control the size of the tree, thus preventing overfitting. Once a decision tree is obtained, the output for unseen samples is computed just passing them through the nodes of the tree, where a decision is made with respect to which direction to take. Finally, a terminal tree node is reached, where the output is obtained.
A RF is a tree-based ensemble method, which builds several decision tree models independently and then computes a final prediction by combining the outputs of the different individual trees [32]. In particular, the ensemble is produced with single trees built from samples drawn randomly with replacement (bootstrap) from the training set. An additional randomness is added when splitting a tree node because the split is chosen among a random subset of the input variables, selected in this case without replacement, instead of the greedy approach of considering all the input variables. Due to this randomization, the bias of the ensemble is slightly higher than that of a single tree, but the variance is decreased and the model is more robust to variations in the dataset.
RF is a very powerful, state-of-the-art technique for both regression and classification problems, usually outperforming not only single decision trees but also sophisticated models, as shown in a thorough comparison study [33].
Training and test datasets The training dataset is built out of numerical simulations of trajectories from various kinds of theoretical models. As a natural choice, we included three of the best-known and used models that can give rise to anomalous diffusion: CTRW [18], FBM [19] and Lévy walks (LW) [34]. In addition, we included trajectories from the annealed transient time model (ATTM) [24]. In the ATTM, a diffuser performs a random walk but stochastically changes the diffusion coefficient at random times. Both the diffusion coefficient and the time at which the diffusivity changes are drawn from distributions with a power law behavior [24].
Its time-averaged MSD shows a linear scaling, but the model has a regime in which it displays weak ergodicity breaking. The ATTM has been shown to reproduce the features observed for the diffusion of a cell membrane receptor [12], one of the experimental datasets analyzed.
Preprocessing Our aim is to design a method that can be used to accurately characterize heterogeneous trajectories without having to calculate other parameters or using a priori knowledge. In order to be able to analyze data coming from any possible spatiotemporal scale, we designed a preprocessing procedure that properly rescale the data. We implemented the following procedure, chosen among other plausible normalization techniques as it gives rise to the best results in terms of accuracy: 1. We use one of the models above to simulate the trajectory of a particle during t max time steps. The result is a vector of positions, X = (x 1 , x 2 , ..., x tmax ).
2. This vector is transformed into a vector of distances traveled in an interval of time T lag , i.e., W = (∆x 1 , ∆x 2 , ..., ∆x J−1 ),, where J = t max /T lag . We define ∆x i as 3. To normalize the data, we divide W by its standard deviation (STD) to get a new vectorŴ .
4. Then, we do a cumulative sum ofŴ to construct a normalized trajectoryX.
Summarizing, the previous procedure generates a new trajectory which is constructed via the normalized displacements of the original trajectory. This makes that the magnitudes of the resulting trajectories are comparable, no matter what were their original values. While the RF could be trained usingŴ , our results show that training withX gives indeed much better results. The same preprocessing is applied to both the simulated and experimental trajectories used in Sections II and III.

II. TRAJECTORY CHARACTERIZATION AS A MACHINE LEARNING PROBLEM
We will use our method to characterize single trajectories according to two different schemes: (A) discrimination among diffusion models; (B) prediction of the anomalous exponent α, that inherently implies classification as normal or anomalous diffusion. For each of these problems, we created a dataset of 1, 2 · 10 5 trajectories with t max = 10 3 , divided into a training and test set with ratio 0.8/0.2, respectively. The results presented in all the figures and the values of the accuracy discussed in the text correspond to the ones measured in the test set, ensuring that the RF does not present overfitting in any of the problems considered. The different classes considered in each problem have an equal number of trajectories, hence allowing us to use the accuracy as a measure of the goodness of the RF. For technical details and an practical example of the implementation, we refer the reader to the Supplementary Material and to the code repository Ref. [35].
A. Discrimination among diffusion models In order to predict the diffusion model underlying a certain trajectory, we construct a RF whose input is the normalized trajectoryX, and the output is a number between 0 and N − 1 corresponding to the different models, with N the total number of models used in the training. Figure 3(a) shows the accuracy of the RF. Each line corresponds to a training dataset made up of different models. In the absence of data preprocessing (point marked as 'Raw' in the x-axis), the RF shows large accuracy. The accuracy drops significantly as T lag increases, likely as a consequence of the removal of microscopical properties of the model, such as short-time correlations, hence preventing the RF from learning important features of them. This might lead to the conclusion that the filtering introduced by the preprocessing steps only limits the time resolution. This is obviously true for simulated data, obtained at the same scale, for which preprocessing is unnecessary. However, when dealing with experimental data of unknown spatiotemporal scale, such a preprocessing is of fundamental importance to be able to apply the same architecture and training dataset, in spite of the little loss of performance.
In addition, the accuracy heavily depends on similarities among the models to be discriminated. For example, the accuracy obtained with a RF trained only with trajectories reproducing conceptually different models such as FBM and CTRW (triangular markers in Fig. 3(a)) is higher than the one obtained when including in the training models with similar characteristics, such as CTRW and ATTM, independently of T lag (red circles and yellow squares in Fig. 3(a)).
B. Anomalous exponent estimation A first approximation toward the characterization of the anomalous exponent can be based on a regression problem, in which the output of the RF is the value of the anomalous exponent α. The nature of the regression algorithm makes that the output of the RF is the continuous value which better satisfies the constraints learn during training.
To characterize the performance of the method, we calculate the prediction error ε of a trajectory as the absolute distance between the predicted exponent and the ground truth value. The percentage of trajecto-riesN (ε) with a given error ε is represented in the bar plots of Fig. 3(c) for three different cases and a subdiffu- sive dataset including trajectories obtained from FBM, CTRW and ATTM. The case (i) considers trajectories with t max = 10 3 without noise, while the case (ii) and (iii) show results for shorter and noisy trajectories (see discussion below). For case (i) the calculated mean absolute error (MAE) of the prediction of the anomalous exponent gives a value of 0.11. Moreover, the histogram showed in Fig. 3(c)(i) shows that for ∼ 80% of the trajectories, the output exponent lies within 0.1 from the true value.
C. Experimental scenario: short and noisy trajectories A remarkable feature of the method is the possibility to correctly characterize very short trajectories. In Fig 3(b) and (d), we show the ability of the RF to characterize short trajectories. In Fig 3(b), we plot the accuracy in model discrimination as a function of the length of the trajectories, t max . In Fig 3(d), a similar study is done, now tracking the MAE of the RF trained to predict α. Although we observe an expected decrease of performance for short trajectories, both plots show that the RF is able to characterize trajectories as short as only 10 points. Quantitatively, when comparing trajectories of 10 points with larger ones, of 1000 points, the model discrimination accuracy only decreases by a factor of 8.2%, while the MAE decreases by a factor of 18%. Panel (ii) in Fig 3(e) shows the error distribution when predicting α for t max = 100.
Importantly, one has also to take into account that the experimental trajectories have a limited localization precision, that results into Gaussian noise. Therefore, it is necessary to test the robustness to noise of the RF. To this end, we trained the RF with trajectories sim-ulated as described before and then we try to predict the anomalous exponent of trajectories belonging to the same dataset, but whose positions X were perturbed with noise to obtain the dataset X (n) where µ i (µ, σ n ) is a random number retrieved from a Normal distribution with mean µ = 0 and variance σ n . The results obtained for training with FBM, CTRW and ATTMs are presented in Fig. 3(d). The RF shows a great robustness against noise. For σ n < 1, the MAE appears almost unaffected. When increasing σ n , we see that the MAE increases, as expected, but even for large σ n the MAE is still reasonable.

III. TRANSFER LEARNING IN SIMULATED AND EXPERIMENTAL DATA
To further show the advantages of our Machine Learning algorithm, we applied it to three sets of trajectories different from those included in the training/test dataset. This is often referred as transfer learning, as certain architecture is trained in one setting and then applied to a different one. For this, we will consider three datasets: (i) Simulated data coming from a recently presented model [36], describing the movement of a diffuser in a network of compartments of random size and random permeability, both drawn from universal distributions. This model shares the same subordination as the quenched trap model, i.e. a CTRW and (iii) refer to the datasets discussed in Section III. For dataset (i), we plot the percentage of trajectoriesN where the predicted value of α has an absolute error ε. As it is a simulated dataset, exponents from 0.2 to 1 are considered. The input to the RF were the raw trajectories, with no preprocessing. For datasets (ii) and (iii), we present the percentage of trajectories predicted to have an anomalous exponent α. The trajectories were preprocessed with T lag = 1. For dataset (ii), we present results two training datasets: dark blue for a mixed dataset and light blue for a FBM dataset.
with power-law distributed trapping times and recapitulates the complexity and heterogeneity found in some biological environments. This choice allows to test the algorithm over a conceptually different model with respect to the training dataset, while having the advantage of tuning the value of anomalous exponents.
(ii) Experiment 1, reporting the motion of individual mRNA molecules inside live bacterial cells [17]. The tMSD shows anomalous diffusion with α ∼ 0.7; this behavior has been associated to FBM [14,37].
(iii) Experiment 2, corresponding to a set of trajectories obtained for the diffusion of a membrane receptor in living cells [12]. Although the time-averaged MSD shows a nearly linear behavior, the data present features of ergodicity breaking due to changes of diffusivity [38] and have been associated to the ATTM model.
Following the scheme presented in Fig. 2, first we train the RF with simulated trajectories obtained with different theoretical frameworks. It should be noted that for this section, since we deal with trajectories that do not show superdiffusive behavior, we do not include the Lévy walks process in the training dataset.
Results Following the same structure of the previous section, we start by discriminating the diffusion model that can be associated to datasets (i)-(iii). The results are reported in Table I, showing a high rate of correct classification for the dataset (i). For the experimental data in datasets (ii) and (iii), we do not dispose of ground truth values, thus we compare our results with those of previous analysis, performed with alternative methods. For the trajectories of Experiment 1, we found that the algorithm largely assign them to the FBM, in strong agreement with previously reported results based on the concept of variation [14]. The data of Experiment 2 are mainly assigned to the ATTM model. This model was shown to reproduce features observed in these data, such as subdiffusion and weak ergodicity breaking [12]. Moreover, a little fraction of trajectories are classified as CTRW. As previously mentioned, CTRW and ATTM share similar features (such as time subordination), increasing the difficulty in discriminating between them. This appears to be the main source of error in the results. To obtain further insights on the study of the diffusion, we used the RF to extract the anomalous exponents. For the first dataset (i), based on simulations, we generated trajectories having a broad range of subdiffusive trajectories, namely α ∈ [0.2, 1]. Then, we used the trained RF to predict the value of the anomalous exponent and evaluated the error as the absolute value of the difference between the actual and predicted α. The results are reported in the histogram of Fig. 4 (i) and display a distribution similar to the one obtained for the training/testing data. Thus, we run the same procedure on the experimental data. For the two datasets, in Fig. 4(ii)-(iii) we report the values obtained for the anomalous exponent α. The histogram of the α obtained for the trajectories of Experiment 1 (dark blue) shows mainly subdiffusive values, peaked in the range 0.6 − 0.8. This is in good agreement with the original paper [17], where α was estimated by means of two different approaches as 0.7 and 0.77. However, the method also classifies a percentage of the trajectories as having α = 1. Importantly, the performance of the method can be further improved by taking advantage of the results of the model discrimination discussed above and shown in Table I. In fact, when the latter classification indicates that most of the trajectories follow a specific diffusive model, one can train the algorithm with a dataset composed only of trajectories simulated with that model. This kind of training produces exponent values in the same range, but largely reduce the fraction of those associated to α = 1, as shown in Fig. 4(ii) (light blue).
Last, in Fig. 4(iii) we plot the distribution of exponents obtained for the Experiment 2. The subdiffusive values show a large number of occurrences in the 0.8−0.9 range, compatible with the exponent 0.84 calculated in previous studies [12]. Noteworthy, due to the nonergodic nature of the data, in the original paper α could only be calculated from the ensemble-averaged MSD, whereas the RF is able to determine this exponent from single trajectories.

IV. CONCLUSIONS
We have presented a machine learning method, based on a Random Forest architecture, which is capable to analyze a single trajectory and to determine the theoretical model that describes it at best. Moreover the same method is used for predicting its anomalous exponent with high accuracy, and thus classify the motion as normal or anomalous. The method does not need any prior information over the nature of the system from which the trajectory is obtained. It acts as a blackbox, which we train with a dataset of simulated trajectories, and then it is used to characterize the trajectory of interest. In particular, its spatial scale is not of any relevance, as we devised a preprocessing strategy which rescales trajectories to obtain comparable estimators from very different systems. The method requires a minimal amount of information. First, because it performs extremely well even with surprisingly short trajectories. Second, because it is robust with respect to the presence of a large amount of thermal noise, and can thus be applied even with low localization precision. We showcase the suitability of our method by applying it to two experimental datasets by means of transfer learning. Overall, this method can be of large application to characterize experiments from several research areas. In contrast to other methods, it can determine the type of diffusion and the anomalous exponent also for nonergodic models, without the need of performing ensemble averages. We note that recent works show how other Machine Learning architectures, such as convolutional neural networks and longshort term memory neural networks, are also capable of doing single trajectory characterization [30,39]. The development of these methods and of other deep learning architectures may help to avoid the preprocessing procedure and could lead to increase the accuracy on the problems described in this work.

SUPPLEMENTARY MATERIALS
Hyperparameters. We summarize here the different parameters used to train the RF models whose results are presented in this work. Common to all are the number of ensembled trees (100), the train/test dataset ratio (0.8/0.2) and the size of training set (1, 2 · 10 5 trajectories). For the details of each figure, see Table II.

Models in dataset
Range of α's tmax Figure 3 TABLE II. Hyperparameters for the training of the different models exposed in this work.