Leveraging large-deviation statistics to decipher the stochastic properties of measured trajectories

Extensive time-series encoding the position of particles such as viruses, vesicles, or individual proteins are routinely garnered in single-particle tracking experiments or supercomputing studies. They contain vital clues on how viruses spread or drugs may be delivered in biological cells. Similar time-series are being recorded of stock values in financial markets and of climate data. Such time-series are most typically evaluated in terms of time-averaged mean-squared displacements (TAMSDs), which remain random variables for finite measurement times. Their statistical properties are different for different physical stochastic processes, thus allowing us to extract valuable information on the stochastic process itself. To exploit the full potential of the statistical information encoded in measured time-series we here propose an easy-to-implement and computationally inexpensive new methodology, based on deviations of the TAMSD from its ensemble average counterpart. Specifically, we use the upper bound of these deviations for Brownian motion (BM) to check the applicability of this approach to simulated and real data sets. By comparing the probability of deviations for different data sets, we demonstrate how the theoretical bound for BM reveals additional information about observed stochastic processes. We apply the large-deviation method to data sets of tracer beads tracked in aqueous solution, tracer beads measured in mucin hydrogels, and of geographic surface temperature anomalies. Our analysis shows how the large-deviation properties can be efficiently used as a simple yet effective routine test to reject the BM hypothesis and unveil relevant information on statistical properties such as ergodicity breaking and short-time correlations.


I. INTRODUCTION
Brownian Motion (BM) is characterized by the linear scaling with time of the mean squared displacement (MSD), r 2 (t) = 2dDt in d dimensions, where D is the diffusion coefficient and angular brackets denote the ensemble average over a large number of particles.In many biological and soft-matter systems, this linear scaling has been reported to be violated [1][2][3][4].Instead, anomalous diffusion with the power-law scaling r 2 (t) ≃ t α of the MSD is observed.The anomalous diffusion exponent α characterizes subdiffusion when 0 < α < 1 and superdiffusion when α > 1 [5][6][7].
Passive and actively-driven diffusive motion are key to the spreading of viruses, vesicles, or proteins in living biological cells [8][9][10].Pinpointing the precise details of their dynamics will ultimately pave the way for improved strategies in drug delivery, or lead to better understanding of molecular signaling used in gene silencing techniques.Similarly, improved analyses of the stochastic dynamics of financial or climate time series will allow us to find better comprehension of economic markets or climate impact.
The most-used observable in the analysis of time-series r(t) garnered for the position of viruses or vesicles by modern single-particle tracking setups in biological cells or for the key quantities in financial or climate dynamics, such as price or temperature, is the time-averaged MSD (TAMSD) [5,7] (1) expressed as function of the lag time ∆.For BM at sufficiently long T , the TAMSD (1) converges to the MSD, formally lim T →∞ δ 2 (∆) = r 2 (∆) = 2dD∆, reflecting the ergodicity of this process in the Boltzmann-Khinchin sense [11].Anomalous diffusion processes may be MSDergodic, with a TAMSD of the form δ 2 (∆) ≃ ∆ β with β = α, e.g., fractional Brownian motion (FBM), or they may be "weakly non-ergodic", e.g., β = 1 for continuous time random walks (CTRWs) with scale-free waiting times [5,7,11].
Due to the random nature of the process, the TAMSD is inherently irreproducible from one trajectory to another, even for BM.The emerging amplitude spread is quantified in terms of the dimensionless variable ξ = δ 2 (∆)/ δ 2 (∆) , where δ 2 (∆) is the average of the TAMSD over many trajectories [7,11].The variance of ξ is the ergodicity breaking parameter EB(∆) = ξ 2 − 1. Together with the full distribution φ(ξ), EB provides valuable information on the underlying stochastic process [7].For BM, in the limit of large T , each realization leads to the same result, φ(ξ) = δ(ξ −1) and EB = 0.For scale-free CTRWs, even in the limit T → ∞ EB retains a finite value and the TAMSD remains a random variable, albeit with a known distribution φ(ξ) [5,7,11].
The MSD and TAMSD or, alternatively, the power spectrum and its single trajectory analog [12,13], are insufficient to fully characterize a measured stochastic process.A TAMSD of the form δ 2 (∆) ≃ ∆, e.g., may represent BM or weakly non-ergodic anomalous diffusion.Similarly, the linearity of the MSD, r 2 (t) ≃ t is the same for BM and for random-diffusivity models with non-Gaussian distribution (see below).For the identification of a random process from data, additional observables need to be considered which may then be used to build a decision tree [14].Recent work targeted at objective ranking of the most likely process behind the data is based on Bayesian-maximum likelihood approaches or on machine learning applications [15][16][17][18].The disadvantage of these methods is that they are often technically involved and thus require particular skills, plus computationally expensive.Here we provide an easyto-implement reliable method based on large-deviation properties encoded in the TAMSD.As we will see, this method is very delicate and able to identify important properties of the physical process underlying the measured data.Moreover, it detects correlations in the data and has significantly sharper bounds than the well known Chebyshev inequality [19,20] widely used in different applications [21][22][23].In the following we report analytical results for the large-deviation statistic of the TAMSD and demonstrate the efficacy of this approach for various data sets ranging from microscopic tracer motion to climate statistics.

II. LARGE DEVIATIONS OF THE TAMSD
Large-deviation theory is concerned with the asymptotic behavior of large fluctuations of random variables [24][25][26][27].It finds applications in a wide range of fields such as information theory [28], risk management [29], or the development of sampling algorithms for rare events [30].In thermodynamics and statistical mechanics, largedeviation theory finds prominent applications as described in [31].More recently large-deviations for a variety of random variables have been analyzed for different stochastic processes [32][33][34][35][36][37].In fact large-deviation the-ory is closely related to extreme value statistics [38][39][40] (see also Appendix E).
An intuitive definition of the large-deviation principle can be given as follows.Let A N be a random variable indexed by the integer N , and let P (A N ∈ B) be the probability that A N takes a value from the set B. We say that A N satisfies a large-deviation principle with rate function [31].The exact definition operates with supremum and infimum of the above probability and the rate function [27].However, sometimes it is difficult or even impossible to find explicit formulas for the rate function or the large-deviation principle.Still, in such cases one may be able to find an upper bound for the probability P (A N ∈ B), i.e., the function I B (N ) N ) .This is exactly the case we consider here.
When the TAMSD is a random variable and we have expressions for I B corresponding to specific processes, we arrive at upper bounds on the probability, P ((ξ − 1) > ε) that a given realization of the TAMSD deviates from the expected mean by a preset amount ε: P ((ξ − 1) > ε) ≤ e −I(ε,∆,N ) .Here, I is a function of the deviation ε, the lag time ∆, and the number N of points in the trajectory.

Chebyshev's inequality
Before we come to large-deviations, we recall the (onesided) Chebyshev inequality for any random variable X with mean µ and finite variance.For BM, Chebyshev's inequality for the TAMSD reads (see Appendix B for details) While this inequality is useful for a first analysis and will serve as a reference below, we will show that the largedeviation result presented here has significantly sharper bounds.
For the special choices ∆ = 1 and ∆ = 2 the eigenvalues of Σ(∆) can be calculated explicitly.This is relevant because for such low values of ∆, the conclusions drawn from the TAMSD analysis of sufficiently long T are statistically significant.For ∆ = 1, the eigenvalues λ j (∆ = 1) = 2D and therefore λ = 4D.Using this in (3) we get For ∆ = 2 the eigenvalues are given as (see Appendix C) This expression can then be used to obtain λ and thus P ((ξ − 1) > ε) for ∆ = 2.For other values of ∆, the eigenvalues are obtained numerically [42].
Simulated data serve as benchmarks for the experimental data below.We simulate 100 trajectories each for different processes (Fig. 1 A-D).This number of trajectories is of the same order as in the experimental data sets.A larger set of 10,000 analyzed trajectories is presented in Fig. 3.In addition to BM, we simulate FBM, scaled Brownian motion (SBM), CTRW, superstatistical process, and diffusing-diffusivity (DD) process, see Appendix F for their exact definition.FBM [43] is governed by the Langevin equation, driven by power-law correlated fractional Gaussian noise (FGN) η H (t) with Hurst index H (0 < H < 1), related to the anomalous diffusion exponent by α = 2H.SBM is characterized by the standard Langevin equation but with time-dependent diffusivity D(t) ∝ t α−1 [7,44].CTRW is a renewal process with Gaussian jump lengths and long-tailed distribution ψ(τ ) ≃ τ (−1−α) (0 < α < 1) of sojourn times between jumps [45,46].For the simulated superstatistical process [47,48] the diffusivity for each trajectory is drawn from a Rayleigh distribution.Finally, the DD process is governed by the Langevin equation with white Gaussian noise, but with a time-dependent, stochastic diffusivity, evolving as the square of an Ornstein-Uhlenbeck process with correlation time τ c [49].

B. Beads tracked in aqueous solution
This data set (labeled "BM, x-dim" and "BM, y-dim" for the two directions) consists of 150 two-dimensional trajectories from single particle tracking of 1.2 µm-sized polystyrene beads in aqueous solution [12].The time resolution of the data is 0.01 sec.

D. Climate data
We also use daily temperature records over a 100 year period, after removing the annual cycle (these "anomalies" represent deviations from the corresponding mean daily temperature) [51].This data set consists of uninterrupted daily temperature recordings starting 1 January 1893 and are validated by the German Weather Service [Deutscher Wetterdienst (DWD), 2016].The records were taken at the meteorological station at Potsdam Telegraphenberg (52.3813 latitude, 13.0622 longitude, 81 m above sea level).with a large H exponent, subdiffusive SBM with a small α, CTRW, and random-diffusivity models (superstatistical and DD) clearly exceed the bound (3).Thus, non-Gaussianity (as realized for CTRW and the randomdiffusivity processes) is not a unique criterion for the violation of the large-deviation bound.But according to these results the large-deviation method, for a given value of the scaling exponent α, allows to distinguish FBM and SBM that both have a Gaussian PDF.At longer ∆, in general, the theoretical bound (3) increases and thus P ((ξ − 1) > ε) is below this bound for a larger range of ε.The large deviation method is surprisingly robust with respect to the number of analyzed trajectories, as can be seen from the marginal improvement of the results based on 10,000 trajectories in Fig. 3. Figs. 4 and 5 further analyze FBM and demonstrate the validity of the theoretical bound (C3) derived for FBM.For further analysis of SBM see Fig. 6.

A. Large deviations in simulated data sets
Chebyshev's inequality (2) essentially provides the same bound as the one from large-deviation theory for short ∆.However, at longer ∆ it provides a much higher estimate than large-deviation theory, and it is unable to distinguish subdiffusive SBM with α = 0.3 from BM, as both lie below this bound.Moreover, for long ∆ the probability P ((ξ − 1) > ε) for all simulated processes lie either below or very close to the bound of Chebyshev's inequality, rendering it ineffective in discerning different processes.Chebyshev's inequality (2) lies above the large deviation bound (3), except for the cases ∆ = 1 and 2 with small ε, when it is slightly below but still quite close to the bound set by (3).

Beads tracked in aqueous solution
Polysterene beads tracked in aqueous solution were analyzed in [12] using single-trajectory power spectral analysis, concluding that the data are consistent with BM.From Fig. 1 E-H it can be seen that the estimated probability P ((ξ − 1) > ε) somewhat exceeds the theoretical bound (3) for BM.To understand this non-BM-like behavior shown in the large-deviation analysis we closely examined the motion of individual beads.Indeed, the displacement distributions of some beads showed non-Gaussian behavior, that we could attribute to beadbead collisions as well as to imprecise localization of the bead center when the recorded tracks suffered from non-localized brightness.We removed the non-Gaussian trajectories using the JB test component-wise (see Appendix G).From the filtered data set (M = 129 in xdirection and M = 125 in y-direction) we see that the large-deviation analysis within the error bars is now consistent with BM (especially for ∆ = 1, see Fig. 11).The large-deviation analysis is thus more sensitive to non-BM-like behavior than other methods [12].We also note that the analysis based on Chebyshev's inequality could not distinguish these features.

Beads tracked in mucin hydrogels
The data sets (M = 131 at pH=2 and M = 50 at pH=7) consisting of beads tracked in mucin hydrogels show different trends of P ((ξ − 1) > ε) depending on the pH values, as seen for N = 300 in Fig. 1 E-H.Notably, for the beads tracked at pH=2 P ((ξ − 1) > ε) remains significantly above the bound set by (3), particularly at short ∆.This implies that the spread of the TAMSD is inconsistent with BM and hence the dynamics cannot be explained solely by BM.The data sets at pH=7 show significantly different behavior.We observe a clear distinction in the trend of P ((ξ − 1) > ε) along the two directions of motion.Along the direction (labeled "xdim") P ((ξ − 1) > ε) remains slightly above the theoret-ical bound for BM from large-deviation theory for most of the range of ε at ∆ = 1 and ∆ = 2, while it remains below the theoretical bound for the motion along the y-direction.As for the beads in aqueous solution, Chebyshev's inequality provides a looser bound.
The mucin data sets were analyzed extensively in terms of Bayesian and other standard data analysis methods in [52].The MSD and TAMSD exponents for the data at pH = 2 and 7 correspond to α = 0.46 and 0.36 and β = 1.09 and 0.94, respectively.The angular bracket for β denotes that these exponents were determined from the ensemble-averaged TAMSD.The discrepancy between the α and β values suggest ergodicity breaking and hence a contribution from a model such as CTRW.For CTRW, the ensemble-averaged TAMSD scales with the total measurement time T as a power-law [7].However, as shown in [52], the ensemble-averaged TAMSD for the data sets at pH = 7 showed no dependence on T , while the data sets at pH = 2 showed a very weak dependence, ruling out CTRW as a model of diffusion.Moreover in the Bayesian analysis carried out in [52] BM, FBM and DD models were compared and relative probabilities were assigned to each of them, based on the likelihood for each trajectory to be consistent with a given process.It was observed that for both pH=2 and pH=7, and for most of the trajectories, both BM and FBM had high probabilities.On comparing the estimated Hurst index H for the FBM, it was seen that for pH=7, H ≈ 0.5 with a very small spread from trajectory to trajectory.In this sense, the pH=7 data seemed to be very close to BM.This was also confirmed independently by looking at β extracted from the TAMSD.In contrast, the estimated H for the pH=2 data showed a large spread in the range 0.3 ≤ H ≤ 0.7.These observations are now clearly supported by the results for P ((ξ − 1) > ε), demonstrating that the data sets at pH = 7 are close to BM while the data sets at pH = 2 cannot be explained (solely) by BM.Thus, for this data set the large-deviation analysis again demonstrates its effectiveness in unveiling the physical origin of the stochastic time series.

Climate data
The climate data were successfully modeled by an autoregressive fractionally integrated moving average model, more specifically, ARFIMA(1,d,0) with d ≈ 0.15 [51,53].ARFIMA(0,d,0) corresponds to FGN with H = d + 0.5.It was found that the data, in addition to long-range correlations characteristic of FGN, exhibited short range correlations due to which ARFIMA(1,d,0) fitted the data better than ARFIMA(0, d,0).These short-range correlations could be explained by the average atmospheric circulation period of 4-5 days [51].For our tests of deviations of the TAMSD from the ensemble-averaged TAMSD, we construct FBM trajectories (M = 100) of length N = 300 by taking a cumulative sum of FGN.If the temperature anomalies could be described by ARFIMA(0,d,0), or, equivalently, by FGN, the cumulative sum would be FBM and hence should show a similar trend of P ((ξ − 1) > ε), as seen for the simulated FBM processes in Fig. 1 A-D.That means that it remains below the theoretical upper bounds (2) and (3) for FBM, as long as the scaling exponent does not become too large.Alternatively, deviations of P ((ξ − 1) > ε) from the trend exhibited by simulated FBM, particularly at α = 1.3 corresponding to d = 0.15 reported in [51], would support the result in [51] that ARFIMA(0,d,0) does not completely explain the data of surface temperature anomalies.This indeed turns out to be the case for N = 300 in Fig. 1 E-H where we observe that P ((ξ − 1) > ε) remains above the theoretical upper bound for BM from large-deviation theory, especially at short ∆.Moreover, comparing with Fig. 1 A-D we clearly observe that P ((ξ − 1) > ε) remains well above the theoretical upper bound (3) for BM for the climate data at sufficiently large values of ε, while it always remains below the bound for simulated FBM with α = 1.3 for all lag times.This corroborates the finding in [51] that ARFIMA(0,d,0) (or equivalently FBM for the data constructed by taking the cumulative sum) cannot completely explain the climate data.In comparison, Chebyshev's inequality (2) provides the same information for short lag times but fails to distinguish the climate data from corresponding simulated FBM for long ∆, as this bound lies above the empirical probability P ((ξ − 1) > ε) for both corresponding simulated FBM and climate data.In order to check whether the short-term correlations are indeed relevant, we create an artificially correlated process in the form of an integrated Ornstein-Uhlenbeck (OU) process, the results of which are shown in Fig. 2. With a correlation length of five steps the result of this OU process indeed leads to an ε dependence of P ((ξ − 1) > ε) that is very similar to the climate data's.Conversely, as soon as we remove the correlations in the climate data by random reshuffling of the temperature anomalies, the large-deviation behavior becomes BM-like.

V. DISCUSSION AND CONCLUSION
It is the purpose of time series analysis to detect the underlying physical process encoded in a measured trajectory, and thus to unveil the mechanisms governing the spreading of, e.g., viruses, vesicles, or signaling proteins in living cells or tissues.Recently considerable work has been directed to the characterization of stochastic trajectories using Bayesian analysis [15-17, 54, 55] and machine learning [18,56,57].Most of these methods are technically involved and expensive computationally.Moreover, the associated algorithms often heavily rely on data pre-processing [18].To avoid overly expensive computations, it is highly advantageous to first go through a decision tree, to narrow down the possible families of physical stochastic mechanisms.For instance, one can eliminate ergodic versus non-ergodic or Gaussian versus non-Gaussian processes, etc.Here we analyze a new method based on large-deviation theory, concluding that it is a highly efficient and easy-to-use tool for such a characterization.We show how we can straightforwardly infer relevant information on the underlying physical process based on the theoretical bounds of the deviations of the TAMSD-routinely measured in single-particletracking experiments and supercomputing studies and easy to construct for any time-series such as daily temperature data-from the corresponding trajectory-average.Specifically, we demonstrate that this tool is able to detect the short-time correlations which effect non-FBM behavior in daily temperature anomalies, as well as the crossover from BM-like behavior at pH=7 to non-ergodic, non-BM-like at pH=2 for the mucin data, and the delicate sensitivity to non-Gaussian trajectories for beads in aqueous solution.We conclude from our analyses here that the large-deviation method would be an excellent basis for a first efficient screening of measured trajectories, before, if necessary, more refined methods are applied.
There are two seeming limitations to the largedeviation tool.First, it is easy to formulate this tool for one-dimensional trajectories, while the generalization to higher dimensions is not straightforward.However, as we demonstrated it can be used component-wise and, remarkably, can be used to probe the degree of isotropy of the data.In fact, from Fig. 1 E and F we concluded that the tracer bead motion in mucin at pH=7 was nonisotropic.In this sense, the one-dimensional definition of the large deviation tool is in fact an advantage.Second, it is not trivial to derive similar expressions as (3) for other stochastic processes.Here, numerical evaluations can be used instead.Moreover, in this case we can also use Chebyshev's inequality, with the caveat that it works best at short lag times ∆.Generally, the bound provided by the large-deviation theory is considerably more stringent than Chebyshev's inequality, as demonstrated here.
We demonstrated that superdiffusive FBM with large H values is outside the large-deviation bound.Superdiffusive FBM applied in mathematical finance are indeed in this range of H values [58][59][60], and our large-deviation tool is therefore well suited for the analysis of such processes.We also showed that the large-deviation tool is able to uncover subtle correlations in the data, similarly to ARFIMA analyses applied mainly in mathematical finance and time series analysis.This similarity between the two methods strengthens the connections to physical models recently worked out between random coefficient autoregressive models and random-diffusivity models [61].
The large-deviation test investigated here is a highly useful tool serving as an easy-to-implement and to-apply initial test in the decision tree for the classification of the physical mechanisms underlying measured time series from single particle trajectories.
Superstatistical process: By a superstatistical process [47,48] we mean a process which is defined by Eq. (F1) where the diffusion coefficient is a random variable, that is, there exists a distribution of diffusivities over the tracers in a single particle tracking experiment.The convolution of such distributions of diffusivities with a Gaussian distribution can give rise to non-Gaussian displacement distributions routinely observed in many experiments [50,[91][92][93][94][95][96][97][98].As in many of these experiments, the diffusivity has a Rayleigh-like distribution, for our simulated superstatistical process we applied the Rayleigh distribution for the diffusivity, where D 0 is the scale parameter of the Rayleigh distribution and is related to the mean D = D 0 (π/2).
Diffusivity (DD): The minimal DD model can be expressed as the set of stochastic differential equations [49] dX where the time dependent diffusion coefficient is defined as the square of the Ornstein-Uhlenbeck process Y (t) and τ c is the relaxation time to the stationary limit [99].η 1 (t) and η 2 (t) are independent white Gaussian noise with zero mean and unit variance.In the long time, stationary limit the diffusion coefficients are distributed roughly exponentially [49], F6) where D ⋆ = σ 2 τ c .The TAMSD for this DD model grows linearly with lag time but the PDF of the process is non-Gaussian (Laplacian) for times less than the relaxation time τ c , and it crosses over to a Gaussian PDF for t ≫ τ c .This behavior was seen in a number of experiments [91,93].

Fig. 1 AFIG. 2 :
Fig. 1 A-D shows the comparison of the simulated data with the theoretical upper bounds (2) and (3) for BM, as function of the deviation ε ∈ [0.1, 1].Here each of the M = 100 simulated trajectories was of length N = 300.We see that, particularly at short ∆, the theoretical bound (3) from large-deviation theory clearly distinguishes model classes and/or diffusive regimes.BM, subdiffusive FBM, and superdiffusive SBM clearly lie below the bound (3).In contrast, superdiffusive FBM