Building general Langevin models from discrete data sets

Living systems, from cell colonies to bird flocks and insect swarms, display second order emergent dynamics. Disentangling the contributions of inertial terms, dissipation and noise to the dynamical behavior is crucial to describe their cohesion and motion. Reliable methods are needed to extract that information from experimental data. We develop a systematic inference approach to retrieve second order effective models from discrete trajectories of finite length. We first show, using the analytically tractable stochastic harmonic oscillator, that a naive inference approach based on measuring velocities through finite differences does not retrieve the effective parameters of the underlying dynamics, regardless of the time resolution and the length of the sampled trajectories. We then derive a higher order discretization framework that provides excellent results even with moderately long trajectories. We apply our method to second order models of collective motion and show that our results also hold in the presence of interactions.


I. INTRODUCTION
Recent experimental findings on a variety of living systems, from cell migration [1], bacterial propulsion [2], and worm dynamics [3], to the larger scale of animal groups on the move [4][5][6][7], indicate that the observed behavior cannot be explained by a first order dynamical process, but requires a higher order description. For bird flocks and insect swarms, the case which interests us most, data show that propagating directional information during collective turns in flocks requires rotational inertia, i.e. a reversible dynamical term, to account for the measured dispersion law [6]. The shape of the velocity-velocity correlation function in swarms, which flattens at short times, also points to a second order dynamics for these systems, as suggested by the value of the dynamical critical exponent [7]. Overall, data indicate that second order dynamics is required to explain how animal groups behave on their natural size and time scales, even though overdamping might theoretically occur for very large systems and on very large time scales.
Broadly speaking, the emergent dynamics of all the above examples share three fundamental ingredients: an effective inertia, dissipation, and a stochastic contribution. Disentangling these contributions is often crucial to understand the processes at stake, and reliable methods are required to extract that information from available data. From the theoretical perspective, the problem to be addressed is the following: imagine that we are given finite length experimental trajectories of an unknown process, sampled at discrete time intervals. How can we optimally extract from these data the parameters of a reliable microscopic model? The general context of such a question is that of statistical inference, and much work has been done in this field in recent years [8][9][10]. Most analyses, however, apply to processes that are - or can effectively be modelled as - first order in time [11][12][13][14][15][16][17]. No systematic work exists on second order inference (see however [1,3,18,19]).
The example of animal groups, which motivates the present work, is also helpful to discuss the theoretical objectives and experimental constraints of the inference procedure. Ideally, we would like to build the simplest continuous second order model consistent with experimental findings. We seek a continuous time model for two reasons: i) it allows computations to be performed; ii) it is a reasonable assumption for systems where microscopic update times are much smaller than observational scales (cognitive processes occur on tenths of milliseconds, whereas behavioral changes occur on scales of seconds). Experimental data, on the other hand, come in the form of discrete time series, where the discretization interval is set by the time resolution of the experimental apparatus. For animal groups, the data consist of the 3D reconstructed trajectories of the individuals in the group, sampled at the rate of a 3D stereo-tracking video acquisition system (a time resolution of 1/170 s for state of the art experiments [6]). These trajectories are of finite length, typically not too long, as groups might leave the field of view of the cameras (in the field) or reach the walls of the confinement tanks (in the lab). The same kind of data, and analogous constraints, are common to many experimental situations - even though the time resolution and the quality of the data might of course change from case to case.
In the presence of noise, the nature of the data poses two major problems. First of all, if the dynamics is of second order, all signals (including initial condition and noise) are propagated in time with a memory kernel, making the relation between the coarse grained data that we observe and the underlying process far more complex than in the first order case. The memory kernel arises from the contraction of the dynamical description of the second order stochastic process from the full phase space to a lower dimensional subspace - usually that of measurable degrees of freedom [20][21][22]. For example, were we able to experimentally measure with the same accuracy a pair of conjugate variables, e.g. positions and velocities of moving individuals, we could seek a model for their joint evolution. But in common experiments that is not the case, as one typically measures one degree of freedom (e.g., positions) and must derive the other. To confront the data, we therefore need to work in a reduced space. Secondly, the goal of the inference procedure is to retrieve a continuous stochastic model (defined on scales dt → 0) from a collection of discrete sample paths occurring on observational time scales ∆t ≫ dt. In the absence of an explicit solution of the stochastic process, the most reasonable strategy is to transform the stochastic differential equation (SDE) into an approximate difference equation. Such discretization must be performed very carefully, since the resulting equation should correctly represent the underlying stochastic process both at the scale of the sampled data (at which inference operates) and in the microscopic limit of vanishing increments.
These two problems are quite general and do not depend on the presence of interactions in the system, but rather on the nature of the dynamics. For this reason, while we are ultimately interested in applying our method to strongly interacting systems, it is convenient to investigate first a simple well-known non-interacting case, where nevertheless all the above problems appear. In this spirit, the paper is organized in the following way. In Sec. II we formalize the problem and discuss in detail how to build an appropriate dynamical inference strategy for the stochastic harmonic oscillator. Leaving out irrelevant details, the simplicity of the model will allow us to elucidate the main conceptual issues and understand how to solve them. We will show that, in order to get accurate results (inferred parameters equal to the true ones), a sufficiently high order of approximation is required in the adopted discretization scheme. In particular, the simplest Euler-like scheme, which works perfectly well with first order processes, fails when dealing with second or higher order ones, so that one needs to go to the next order of approximation. In Sec. II F we compare theoretical predictions with numerical data to consolidate our results. In Sec. III we address the case of a strongly interacting system: we apply the inference procedure to synthetic data of the Inertial Spin Model, a model of self-propelled particles that correctly describes the phenomenology of natural flocks of birds [6]. We show that, even in this case, dynamical inference gives excellent results. Finally, in Sec. IV we summarize our results, discuss their conceptual relevance, and outline their potential for applications to real data.

A. Problem definition
Let us assume that the available experimental data are sequences of points (x_0, x_1, …, x_L), uniformly separated in time by ∆t, and that the underlying dynamics is described by the complete Langevin equation of a damped one-dimensional stochastic harmonic oscillator:

ẍ(t) = −(µ/m) ẋ(t) − (k/m) x(t) + σ ξ(t), (1)

where µ is the damping coefficient, m the mass, k the elastic constant, and T the effective temperature, having the dimension of a standard temperature rescaled by a mass, so that σ² = 2Tµ/m; ξ is a standard white noise, with ⟨ξ(t)⟩ = 0 and ⟨ξ(t)ξ(t′)⟩ = δ(t − t′). Since the noise is additive, it is unnecessary to distinguish between Itô and Stratonovich integration. Let us call λ the irreducible set of parameters that enter Eq. (1), namely the effective damping coefficient η = µ/m, the squared pulsation ω_0² = k/m, and the effective temperature T. The aim of dynamical statistical inference is to provide an estimate of their values. Following a Bayesian approach, the posterior distribution of the parameters given the data reads:

P(λ | {(x_0, …, x_L)^α}) ∝ P({(x_0, …, x_L)^α} | λ) ρ(λ), (2)

where each Greek index labels a different experimental sample. By choosing a uniform prior ρ(λ), the maximum of Eq. (2) corresponds to the maximum likelihood estimator. The conceptual and technical difficulty of the whole inference problem then lies entirely in finding a tractable expression for the dynamical likelihood.
The theory of stochastic processes provides us with an explicit but formal expression for the probability density functional P[x(t) | x_0, ẋ_0], involving, in general, integro-differential operators. A closed form solution of the stochastic process may be unknown or complicated [43], especially for many body or off-equilibrium systems, but finely time-resolved data may be available. What we look for is a (possibly approximate) expression for the probability of the discrete trajectory, for which a practical connection with the data can be established.
A straightforward general strategy is the following: 1. As a preliminary step, Eq. (1) can be conveniently rewritten as a set of two first order equations:

ẋ(t) = v(t),
v̇(t) = −η v(t) − ω_0² x(t) + σ ξ(t). (3)

This setup is the starting point for the development of the standard canonical description of any dynamical system.
2. Since the dynamics is Markovian when parametrized by the vector variable q = (x, v), the probability of a discrete trajectory in this space, given the initial condition q_0 = (x_0, v_0), can be split into a product of propagators:

P(q_L, …, q_1 | q_0) = ∏_{n=0}^{L−1} P(q_{n+1} | q_n). (4)

3. Following [23], one can exploit any update rule based on a Taylor expansion to approximate, within a certain order of accuracy, the propagator over a small time interval ∆t (Eq. (5)). Eq. (5) can be replaced into Eq. (4) to get an approximated expression for the probability density of the sequence of points in phase space (Eq. (6)). 4. Marginalizing over the velocity-like degrees of freedom, one gets a probability distribution depending on the x's only. This projection onto the subspace of x variables is where the original Markovian property of Eq. (4) is generally lost. The procedure does not simply consist of removing the intermediate variables v_1, …, v_L, but also of eliminating the initial condition v_0, which is a difficult step in the context of stochastic dynamics (see App. D for a broader discussion).
Alternatively, one could start from Eq. (1), discretize it over intervals of length ∆t, and compute P({(x_0, …, x_L)^α}) from the noise distribution directly. The two procedures are of course equivalent, but both are helpful to understand how the non-Markovian character of the x dynamics mingles with the discretization. We shall therefore refer to both of them in what follows.
The first thing we need is a discrete integration scheme for Eq. (1) or Eq. (3) (for a review of the discretization procedure we refer to App. A). Although the naive intuition is that any convergent - even if slowly convergent - discretization scheme should work for small ∆t, in fact the order of approximation of the temporal discretization affects the mathematical properties of the discrete path integral measure and, consequently, of the inference procedure. Nonetheless, it is worth starting with the simplest possible construction, i.e. the Euler-Maruyama scheme, in order to show in which respect its accuracy is not sufficient to build a good likelihood, and why higher order schemes are needed. For the sake of clarity, we classify the discretization schemes adopted in this paper according to what is known in the literature as their strong convergence order [24], even if this is not what truly matters in our analysis: as we shall see, what matters is the order at which fluctuations stop being correctly reproduced.
In the case of the harmonic oscillator, the well-known Euler-Maruyama (EM) scheme [24] reads:

x_{n+1} = x_n + v_n ∆t,
v_{n+1} = v_n − (η v_n + ω_0² x_n) ∆t + σ ∆t^{1/2} r_n, (7)

with r_n i.i.d. random variables of normal distribution N(0, 1), for n = 0, …, L − 1. This scheme implements the simplest possible discretization procedure - approximating the derivatives of both x and v with finite differences - and can be obtained from the (formal) solution of Eq. (3) by expanding to first order in ∆t (see App. A). Since we know the distribution of r_n, a simple change of variables from r_n to v_{n+1} immediately gives the discrete propagator in (x, v) space:

P(x_{n+1} | x_n, v_n) = δ(x_{n+1} − x_n − v_n ∆t), (8)
P(v_{n+1} | x_n, v_n) = (2πσ²∆t)^{−1/2} exp{ −[v_{n+1} − v_n + (η v_n + ω_0² x_n) ∆t]² / (2σ²∆t) }. (9)

The first neglected terms in Eq. (8) and Eq. (9) are O(∆t^{3/2}).
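As an illustration, the EM update rule for the first order system can be turned into a few lines of code. The following sketch is ours, not part of the original analysis; the parameter values are illustrative, chosen so that η⁻¹ and ω_0⁻¹ are of order one, as in the simulations discussed below:

```python
import numpy as np

def euler_maruyama(eta, omega0_sq, sigma, dt, L, x0=0.0, v0=0.0, seed=0):
    """Euler-Maruyama integration of the stochastic harmonic oscillator:
    the O(dt^{1/2}) scheme of Eq. (7)."""
    rng = np.random.default_rng(seed)
    x = np.empty(L + 1)
    v = np.empty(L + 1)
    x[0], v[0] = x0, v0
    for n in range(L):
        r = rng.standard_normal()
        x[n + 1] = x[n] + v[n] * dt
        v[n + 1] = v[n] - (eta * v[n] + omega0_sq * x[n]) * dt \
                   + sigma * np.sqrt(dt) * r
    return x, v

# illustrative parameters: T = 2, omega0 = 1, eta = 1.5  =>  sigma^2 = 2*T*eta
x, v = euler_maruyama(eta=1.5, omega0_sq=1.0, sigma=np.sqrt(6.0),
                      dt=0.005, L=20000)
```

In the stationary regime the position variance should approach T/ω_0² = 2 for these values, up to the O(∆t^{1/2}) bias of the scheme itself.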
The scheme provides a deterministic update for the x variables, which manifests itself through the δ-functions. In this case, we can explicitly marginalize over the v degrees of freedom - including the initial condition v_0 - which leads to a distribution for the sequence of positions alone (Eq. (10)). Regarding the dependence on v_0, it is clear that, to this order of approximation, information on v_0 is fully equivalent to information on x_1 (see App. D), so that:

P(x_L, …, x_2 | x_1, x_0) = ∏_{n=1}^{L−1} P(x_{n+1} | x_n, x_{n−1}). (11)

The same conclusion is obtained by rewriting the system of update equations Eq. (7) in terms of x variables only:

x_{n+1} = x_n + (1 − η∆t)(x_n − x_{n−1}) − ω_0² ∆t² x_{n−1} + σ ∆t^{3/2} r_{n−1}, (12)

by means of a change of variables from the random sequence {σ∆t^{3/2} r_n}_{n=0…L−1} to the sequence of positions {x_{n+1}}_{n=0…L−1}. This transformation enables us to approximate the process as a second order master equation with transition probabilities:

P(x_{n+1} | x_n, x_{n−1}) ∝ e^{−S_n}, (13)

where:

S_n = (x_{n+1} − m_n)² / (2σ²∆t³), (14)
m_n = x_n + (1 − η∆t)(x_n − x_{n−1}) − ω_0² ∆t² x_{n−1}, (15)

giving back Eq. (11) for the full sequence probability.
A factorization of P(x_L, …, x_2 | x_1, x_0) into a product of transition probabilities of this kind is possible because the random variables r_n appearing in Eq. (12) are independent. This is a crucial feature occurring at this level of approximation only: more accurate discretization procedures produce an effective noise for the x variables which is correlated in time [44]. As we shall discuss, this is the main reason why the discretization scheme above leads to a wrong inference of the parameters. To see this, we compute the dynamical likelihood, as defined in Eq. (2).
Using Eqs. (13)-(15), a simple expression for the likelihood as a product of transition probabilities for a second order master equation is obtained (Eq. (16)). This corresponds to the discrete path probability one would obtain by adopting a maximum caliber approach [8] with equal-time, one-time-step and two-time-step correlations taken as fixed observables. Indeed, rearranging the S_n's, the reduced minus-log-likelihood can be written (see App. B 1) in terms of the experimental temporal correlation functions evaluated at time distances of 0, ∆t and 2∆t:

Ĉ_0 = ⟨x_n x_n⟩, Ĉ_1 = ⟨x_n x_{n+1}⟩, Ĉ_2 = ⟨x_n x_{n+2}⟩, (17)

where the averages run over the points of the trajectory and over experimental samples. Minimization of the resulting quantity with respect to η, T and ω_0² yields the inference formulas for the parameters of the harmonic oscillator. We report here only the formula for the damping coefficient η, while the remaining expressions can be found in App. B 1:

η* = [2(Ĉ_0² − Ĉ_1²) − Ĉ_1(Ĉ_0 − Ĉ_2)] / [∆t (Ĉ_0² − Ĉ_1²)]. (18)

We applied Eq. (18) to synthetic trajectories obtained from numerical simulations of the stochastic harmonic oscillator in several damping conditions. Data were generated with a high precision algorithm [25] and with a numerical update time τ_sim = 0.005 (to be compared to the time scales of the process, η⁻¹ = ω_0⁻¹ = 1). We applied the inference formulas to discrete data sets {x_n} sampled from the synthetic trajectories at time intervals ∆t ≥ τ_sim. This choice mimics real experiments, where the time resolution is fixed by the acquisition apparatus, while the true microscopic time scale of the dynamics is unknown.
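The bias of the EM-based estimator can be checked directly. The snippet below implements a least-squares estimator of η as a function of C(0), C(∆t) and C(2∆t) (our reconstruction of the logic behind Eq. (18); the published formula may be arranged differently) and feeds it the exact stationary correlation function of the underdamped oscillator rather than noisy data. Even with perfect correlations, the estimate converges to (2/3)η, not η, as ∆t → 0:

```python
import numpy as np

T, ETA, OMEGA0 = 2.0, 1.5, 1.0   # illustrative parameter values

def C_exact(t):
    # stationary position autocorrelation of the underdamped oscillator (t >= 0)
    g = ETA / 2.0
    wb = np.sqrt(OMEGA0**2 - g**2)
    return (T / OMEGA0**2) * np.exp(-g * t) * (np.cos(wb * t)
                                               + (g / wb) * np.sin(wb * t))

def eta_star(dt):
    # EM-based least-squares estimate of eta from C(0), C(dt), C(2*dt)
    C0, C1, C2 = C_exact(0.0), C_exact(dt), C_exact(2.0 * dt)
    return (2.0 * (C0**2 - C1**2) - C1 * (C0 - C2)) / (dt * (C0**2 - C1**2))

for dt in (0.1, 0.01, 0.001):
    print(dt, eta_star(dt))      # approaches (2/3)*ETA = 1.0, not ETA = 1.5
```

The ∆t-independence of the bias is manifest: refining the sampling only drives the estimate toward (2/3)η more precisely.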
Results are reported in Fig. 1a for a sample value of the simulated parameter η_sim (gray line), as a function of the sampling time ∆t. The main feature that stands out is that the inferred value η* (blue squares) is systematically different from the true value of the damping coefficient. Further analysis of the numerical results hints at the emergence of a constant rescaling factor close to 2/3 for η*, as compared to η_sim (Fig. 1b). The remaining inferred parameters are in agreement with the parameter values used in the simulations, as shown in Figs. 1c-1d. It is worth remarking that this rescaling is independent of ∆t, so increasing the resolution of the acquisition instruments is of no help in improving the estimation of the damping coefficient. In App. B 1 we show that the same problem also occurs when using other O(∆t^{1/2}) variants of the EM scheme, as also displayed in Figs. 1a-1b. We conclude that there is a serious flaw in the adopted order of approximation, and that a correct inference procedure requires higher discretization accuracy.
A simple argument can help us understand what is missing in the O(∆t^{1/2}) scheme, and why the parameter η is the one affected by the approximation. Assuming that the experimental correlations perfectly reproduce the true ones, we can replace into Eq. (18) the known expression for the temporal correlation function of the harmonic oscillator C(t) in the stationary regime:

C(t) = (T/ω_0²) e^{−γ|t|} [ cos(ω̄ t) + (γ/ω̄) sin(ω̄ |t|) ], ω̄² = ω_0² − γ², (19)

where γ = η/2. At this point, we perform a Taylor expansion around t = 0 of the obtained expression for η* as a combination of C(0), C(∆t) and C(2∆t), since the underlying assumption of the whole procedure is that the time lag ∆t between subsequent points is small compared to the typical time scales of the dynamics. As a result, Eq. (18) is transformed into a function depending only on the derivatives of C(t) at t = 0:

η* = −(2/3) C⃛(0⁺)/C̈(0) + O(∆t). (20)

Knowing Eq. (19), one can compute the desired derivatives in the simple case of the harmonic oscillator. In particular, we have:

C(0) = T/ω_0², Ċ(0⁺) = 0, C̈(0) = −T, C⃛(0⁺) = ηT. (21)

It is evident that proper combinations of these quantities allow us to extract all the parameters of the model. The importance of the first derivative as a quantity to discriminate between first and second order dynamics in oscillator-like models has already been stressed by Cavagna et al. [26,27], with explicit reference to interacting systems. Our point is that we can go beyond the binary answer provided by Ċ(0)/C(0) - proportional, through a time scale factor, to 1 or 0 for first or second order dynamics respectively - and give a quantitative estimation of the damping regime in which a system operates, employing all the derivatives at t = 0 up to the third one.
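These identities are easy to verify numerically. The sketch below (ours, assuming the standard underdamped stationary correlation function of the oscillator, with illustrative parameters) approximates the one-sided derivatives at t = 0⁺ by finite differences:

```python
import numpy as np

T, ETA, OMEGA0 = 2.0, 1.5, 1.0
g = ETA / 2.0
wb = np.sqrt(OMEGA0**2 - g**2)

def C(t):
    # stationary correlation function of the stochastic harmonic oscillator (t >= 0)
    return (T / OMEGA0**2) * np.exp(-g * t) * (np.cos(wb * t)
                                               + (g / wb) * np.sin(wb * t))

h = 1e-2
d1 = (C(h) - C(0.0)) / h                                   # ~ 0  (flat at the origin)
d2 = (C(h) - 2 * C(0.0) + C(h)) / h**2                     # uses C(-t) = C(t); ~ -T
d3 = (C(3 * h) - 3 * C(2 * h) + 3 * C(h) - C(0.0)) / h**3  # one-sided; ~ ETA * T

print(C(0.0), d1, d2, d3)   # ~ T/OMEGA0**2, ~0, ~ -T, ~ ETA*T
```

The vanishing first derivative distinguishes second order from first order dynamics, while the third derivative carries the damping coefficient, as stated in the text.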
We find then, at the leading order, a rescaling factor of 2/3, as observed in the numerical tests. No rescaling factors appear for the other inferred parameters: performing the same replacement and expansion of the analytical correlation functions, the effective temperature T and the pulsation ω_0 are correctly retrieved from proper combinations of C(0) and C̈(0). This result gives us a clue to understanding the origin of the ∆t-independent rescaling factor for η. Looking back at Eq. (12), one realizes that terms of order O(∆t^{3/2}) and O(∆t²) appear, even if the starting point is an update scheme coming from an O(∆t) Taylor expansion. This means that Eq. (12) has been inconsistently derived from Eq. (7), retaining only some of the O(∆t^{3/2}) and higher order terms. In turn, this results in missing O(∆t³) contributions to the x-noise correlator and to the reconstructed fluctuations at the lowest order. Since η is the only parameter requiring a third order Taylor expansion, while a second order expansion is sufficient for T and ω_0, this explains why Eq. (18) is incorrect, and shows the need for higher order discretization schemes for stochastic second order dynamics. These schemes are discussed in Sec. II D.
We want to stress here that adopting a Euler-Maruyama scheme is the same as making the simplest and most common deterministic extrapolation of the velocity-like degrees of freedom when only position-like degrees of freedom are experimentally accessible. This approximate estimation of the velocity works if one observes the system in the overdamped regime, i.e. when η∆t ≫ 1 and ω_0/η < ∞, and the effective dynamics can be described by a first order equation. In this case, EM-based inference schemes do provide excellent results [13,14]. However, we have just shown that they are destined to fail in a more general context.
Finally, we comment on the generality of this 2/3 rescaling factor: it is not a specific feature of the stochastic harmonic oscillator, but a recurrent trait in stochastic models embodied by second order equations that contain, in addition to white noise, a viscous friction term and a general conservative force associated to any velocity-independent potential. The discussion developed in the next sections validates this claim, which can be explicitly inspected at the analytical level, following the procedure above, only in simple models, like the stochastic harmonic oscillator (linear force) or the free Brownian particle (no force). Despite the widespread use of the latter as the simplest stochastic model for motility with persistence, careful attention to the distortions introduced in test statistics by time-lapse recording was paid only quite recently, by Pedersen et al. [18]. As far as we know, that paper is the first to address - albeit with tools other than statistical inference - the problem of reconstructing quantitative relations between observables of the system, as predicted by the continuous-time model, from time-lapse recordings of the stochastic trajectory. They observed a 2/3 rescaling factor as an effect of discretization errors in two different equations: first, in the relation between the secant velocity and the conditional average of the secant acceleration [45], and, second, in the variance-covariance matrix of equal-time secant acceleration fluctuations [18]. The high technical accuracy of that analysis is not complemented by an explanation or interpretation of the results, which is what we attempt to provide in this work.
The lowest order of convergence required to develop any reasonable dynamical maximum likelihood scheme is O(∆t^{3/2}) - corresponding to a second order Taylor expansion in Eq. (3). Independently of the details of the discretization (e.g. the point around which the Taylor expansion is performed), step 3 of the procedure outlined in Sec. II A reduces to a sequence of intertwined Gaussian integrals for the marginalization of v_1, …, v_L, which is hard to compute for an arbitrary length of the trajectory. Therefore, it is convenient to work again with update equations in x space. There are two main ways to obtain such equations, depending on the order in which the two crucial steps (variable elimination and temporal discretization) are performed. One can either first eliminate the variable v(t), obtaining a generalized Langevin equation in continuous time [20], and then perform a discrete time approximation; or first discretize both equations of motion Eq. (3), and then retrieve a discrete update equation in x space through the elimination of the discrete set of velocity variables v_n. Even though these approaches are well known in the literature, we describe them in more detail in App. A for self-consistency.
In general, as explicitly derived in App. A, any O(∆t^{3/2}) scheme in x space for the stochastic harmonic oscillator takes the form of a linear difference equation [46]:

x_{n+1} + α x_n + β x_{n−1} = ζ_n, (23)

where the inhomogeneous terms ζ_n are still Gaussian random variables of zero mean, but they are no longer independent. This is the crucial difference from the Euler-Maruyama scheme. Their statistical properties are fully described by the (non-diagonal) covariance matrix C_nm = ⟨ζ_n ζ_m⟩. Eq. (23) defines an affine map between the noise vector and the sequence of positions, with matrix M_ij = δ_{i,j} + α δ_{i,j−1} + β δ_{i,j−2} and an inhomogeneous term fixed by the initial conditions x_0 = (x_0, x_1, 0, …, 0) (Eq. (24)); the map can be generalized to a nonlinear transformation when anharmonic forces are present. This map can be exploited, when the covariance matrix C and its inverse are known, to write the dynamical likelihood:

P(x_L, …, x_2 | x_1, x_0) = (1/Z) exp[ −(1/2) ζ(x)ᵀ C⁻¹ ζ(x) ], (25)

where Z is the normalization constant, and α and β, as well as the entries of the covariance matrix, are known combinations of the parameters of the model, whose details depend on the adopted discretization scheme. The normalization constant reads:

Z = |J_{ζ(x)}|⁻¹ ∏_{k=1}^{L−1} (2π λ_k)^{1/2}, (26)

where λ_k is the k-th eigenvalue of the covariance matrix C, and J_{ζ(x)} is the Jacobian of the change of variables (equal to the identity if we take an explicit scheme like that expressed by Eq. (23)). For second order processes, we show (see App. A) that C has a 'nearest-neighbour' structure:

C_nm = a δ_{n,m} + b (δ_{n,m+1} + δ_{n,m−1}). (27)

Hence the covariance matrix has the form of a symmetric tridiagonal Toeplitz matrix of order L − 1. To order O(∆t³), the entries of this matrix are:

a = (2/3) σ² ∆t³, b = (1/6) σ² ∆t³. (28)

These mathematical features carry a deep physical meaning: first of all, the presence of non-vanishing off-diagonal elements is the signature of colored noise. Secondly, the fact that the matrix is banded means that the correlation of the noise variables is short-ranged, i.e. that the associated memory kernel, in a continuous-time description, decays fast [21]. Finally, the Toeplitz structure is synonymous with shift invariance.
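The spectral statements above are easy to check numerically. A minimal sketch of ours (with illustrative values of σ, ∆t and L, and with diagonal entry a = (2/3)σ²∆t³ and nearest-neighbour entry b = (1/6)σ²∆t³, the O(∆t³) values appropriate for pure position noise) builds the tridiagonal Toeplitz covariance, compares its spectrum with the closed-form eigenvalues of such matrices, and shows that its inverse is dense:

```python
import numpy as np

sigma, dt, L = 1.0, 0.1, 8          # illustrative values
a = 2.0 / 3.0 * sigma**2 * dt**3    # diagonal entry, to order dt^3
b = 1.0 / 6.0 * sigma**2 * dt**3    # nearest-neighbour entry

N = L - 1                           # the covariance matrix has order L - 1
C = a * np.eye(N) + b * (np.eye(N, k=1) + np.eye(N, k=-1))

# eigenvalues of a symmetric tridiagonal Toeplitz matrix are known in closed form
k = np.arange(1, N + 1)
lam = a + 2.0 * b * np.cos(k * np.pi / (N + 1))
assert np.allclose(np.sort(np.linalg.eigvalsh(C)), np.sort(lam))

# the inverse is NOT banded: noise correlations are local in time, yet the
# minus-log-likelihood couples every pair of points along the trajectory
C_inv = np.linalg.inv(C)
print(C_inv[0, -1])                 # nonzero corner element
```

The nonzero corner element of C⁻¹ is the numerical counterpart of the statement that Eq. (25) cannot be factorized into local transition probabilities.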
A more careful derivation of the update equations in x space would require shift invariance not to hold, with the first entry of the covariance matrix, C_11, different from the other elements of the main diagonal. Eq. (23) is in fact not valid for the first integration step, where the initial conditions intervene. In this respect, the structure of the data also poses the problem of the elimination of the initial condition v_0 in favour of x_0 and x_1. Even if we are not able to perform it explicitly, we can argue (see App. A) that it has the effect of modifying the covariance matrix Eq. (27) in the following way:

C_11 = ã ≠ a, C_nn = a for n > 1, C_{n,n±1} = b, (29)

so that the shift invariance expressed by the Toeplitz structure of the previously considered covariance matrix is broken at the beginning of the time series (a → ã). Despite that, the error we make by replacing ã with a in the quasi-Toeplitz matrix Eq. (29) is negligible in the limit of long trajectories, as discussed in Appendix D and numerically checked in Sec. II F (see, in particular, Fig. 3). Intuitively, since the breaking of shift invariance occurs only at the first step, the longer the trajectory, the closer the situation is to a truly shift invariant one. Notice that what matters is not the total length (L + 1)∆t of the trajectory in units of the physical time scales of the process, but just the number of points L + 1 of which the trajectory is made up [47]. Apart from the difficulty in determining ã correctly, the advantage of replacing the true covariance matrix Eq. (29) with a Toeplitz matrix is that the inverse of the Toeplitz matrix is explicitly known, as well as its eigenvalues [28,29]:

λ_k = a + 2b cos(kπ/L), k = 1, …, L − 1. (30)

Let us highlight that the inverse of the covariance matrix does not preserve a banded structure. This means that, even if noise correlations are local in time, products between every pair of points of the trajectory enter the minus-log-likelihood. Hence Eq. (25) cannot be factorized.
Factorization would indeed correspond to a block structure for C⁻¹, which implies a block structure for C. This is incompatible with the tridiagonal Toeplitz or quasi-Toeplitz nature of the covariance matrix, where off-diagonal elements are of the same order as the diagonal ones. Nonetheless, having built an explicit discrete path integral measure, a maximum likelihood approach is practicable, and it reduces to minimizing the quantity L = −ln P(x_L, …, x_2 | x_1, x_0) with respect to the parameters of the model. Thanks to the regularities of the likelihood Eq. (25), the minimization of L can be performed analytically in the case of the harmonic oscillator and, in general, of simple single-particle systems, for which a generalization of Eq. (23) can be obtained. The generalized update equation reads:

x_{n+1} = F(x_n, x_{n−1}; µ_1, …, µ_M) + ζ_n, (31)

with F an arbitrary function of x_n and x_{n−1}, and a polynomial function of the effective parameters µ_1, …, µ_M. The optimization procedure can be performed semi-analytically also for many-particle systems representing a microscopic version of field theories. In this case, an additional parameter is typically the interaction range of effective pair-wise potentials, which may depend on different (measurable) variables than the field-like observable x. In general, once an expression for L is given, a large number of optimization algorithms are available to minimize it with respect to all the extra parameters that do not allow for a fully analytical approach.
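In practice, the minimization can always be carried out numerically. The sketch below is ours, with hypothetical coefficients (the mapping between (α, β, a, b) and the physical parameters depends on the chosen scheme): it evaluates the Gaussian minus-log-likelihood of Eq. (25) for a trajectory generated by a linear update rule with nearest-neighbour correlated noise, and checks that it is lower at the true value of α than at a perturbed one:

```python
import numpy as np

def minus_log_likelihood(x, alpha, beta, a, b):
    """Gaussian minus-log-likelihood of a trajectory under the update rule
    x_{n+1} + alpha*x_n + beta*x_{n-1} = zeta_n, with tridiagonal Toeplitz
    noise covariance (diagonal a, off-diagonal b)."""
    zeta = x[2:] + alpha * x[1:-1] + beta * x[:-2]
    N = len(zeta)
    C = a * np.eye(N) + b * (np.eye(N, k=1) + np.eye(N, k=-1))
    _, logdet = np.linalg.slogdet(C)
    return 0.5 * (zeta @ np.linalg.solve(C, zeta) + logdet + N * np.log(2 * np.pi))

# --- synthetic data with known (hypothetical) parameters ---
rng = np.random.default_rng(1)
alpha_t, beta_t, a_t, b_t = -1.6, 0.7, 1.0, 0.25
n_pts = 400
# nearest-neighbour correlated Gaussian noise as an MA(1) process:
# c0^2 + c1^2 ~ a_t and c0*c1 ~ b_t
c0, c1 = 0.9659, 0.2588
e = rng.standard_normal(n_pts + 1)
z = c0 * e[1:] + c1 * e[:-1]
x = np.zeros(n_pts)
for n in range(2, n_pts):
    x[n] = -alpha_t * x[n - 1] - beta_t * x[n - 2] + z[n]

L_true = minus_log_likelihood(x, alpha_t, beta_t, a_t, b_t)
L_off = minus_log_likelihood(x, alpha_t + 0.2, beta_t, a_t, b_t)
print(L_true < L_off)
```

For models where the parameters enter F nonlinearly, the same function can simply be handed to a generic numerical optimizer.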
Complete inference formulas for the simple one-dimensional harmonic oscillator and for a system of many coupled harmonic oscillators with parameter-dependent connectivity matrices are reported in Apps. B 2-B 3. In both cases, the optimal parameter values are given by combinations of all the two-time functions up to the length of the trajectory, and not only of those computed at temporal distances of 0, 1 and 2 time steps.

E. Alternative non-Bayesian approach
An alternative, non-Bayesian approach to the maximum likelihood method is also viable. It consists of deriving, from the update equations in position space, relations between experimental correlation functions and model parameters. The starting point is a difference equation in x space of the kind of Eq. (23). To fix ideas, let us consider the continuation rule of the Langevin Impulse integrator (LI) [30], which is the explicit scheme we used for our numerical analysis (Eq. (33)). The Gaussian random variables ζ_n are characterized by the same covariance matrix as before (Eq. (27)). Multiplying both sides of Eq. (33) by x_m, for m ∈ {n − 1, n, n + 1}, and self-consistently averaging over the noise distribution, yields a set of three independent equations, from which all the parameters of the dynamical model Eq. (1) can be extracted (explicit formulas are derived in Appendix B 4). Notice that, in contrast to the other O(∆t^{3/2}) inference method, the relations obtained in this way involve only three types of temporal correlation functions: equal-time, one-time-step and two-time-step correlations. Even if we are not using all the exploitable information carried by an N-point trajectory (the operation outlined above could in principle be performed for all x_m), this is the optimal minimal choice. Indeed, the shape of the temporal correlation function at small times contains substantial dynamical information. Moreover, due to the finite length of the trajectories, two-time quantities, like correlation functions, are typically better estimated at small time differences than at large ones.
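To see how the moment-matching works, consider a generic update rule of the form of Eq. (23), x_{n+1} + α x_n + β x_{n−1} = ζ_n, with ⟨ζ_n²⟩ = a and ⟨ζ_n ζ_{n±1}⟩ = b. Writing C_0, C_1, C_2 for the stationary correlators at lags 0, 1 and 2, multiplying by x_{n−1}, x_n and x_{n+1} and averaging gives C_2 + αC_1 + βC_0 = 0, C_1 + αC_0 + βC_1 = b, and C_0 + αC_1 + βC_2 = a − αb. The sketch below (ours, with hypothetical coefficients; the actual LI relations are derived in Appendix B 4) verifies these identities on a long synthetic trajectory:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = -1.0, 0.5          # hypothetical stable update coefficients
a, b = 1.0, 0.25                 # noise variance and nearest-neighbour covariance

# nearest-neighbour correlated Gaussian noise as an MA(1) process:
# c0^2 + c1^2 ~ a and c0*c1 ~ b
c0, c1 = 0.9659, 0.2588
n_pts = 200_000
e = rng.standard_normal(n_pts + 1)
z = c0 * e[1:] + c1 * e[:-1]

x = np.zeros(n_pts)
for n in range(2, n_pts):
    x[n] = -alpha * x[n - 1] - beta * x[n - 2] + z[n]

xs = x[1000:]                    # discard the transient
C0 = np.mean(xs[:-2] * xs[:-2])
C1 = np.mean(xs[:-2] * xs[1:-1])
C2 = np.mean(xs[:-2] * xs[2:])

print(C2 + alpha * C1 + beta * C0)                    # ~ 0
print(C1 + alpha * C0 + beta * C1 - b)                # ~ 0
print(C0 + alpha * C1 + beta * C2 - (a - alpha * b))  # ~ 0
```

Inverting the three relations for (α, β) and the noise entries is the non-Bayesian route to the model parameters.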
Remarkably, the inference formulas obtained in this way do not require any rescaling factor, for any length of the trajectory. Unfortunately, however, the approach does not apply to interacting systems, where an interaction range is needed to parametrize the potential. Since these formulas do not come from any optimization procedure in a Bayesian or information-theoretic framework, there is no numerical strategy to find the best parameters of the interaction potential in the absence of an analytical approach.
Finally, it is important to specify the probability density function with respect to which the averages in Eq. (33) are taken. Since, in order to compute ⟨x_n ζ_n⟩ and ⟨x_{n+1} ζ_n⟩, we self-consistently used the same update rule and the same shift-invariant noise statistics, we implicitly introduced a stationarity assumption. As a result, averages are performed over the stationary joint probability distribution of three subsequent points, P_s(x_{n+1}, x_n, x_{n−1}). The quantity we deal with in the Bayesian method is instead a conditional probability, which we approximate, for any finite L, by making use of Eq. (30). Hence we refer to different ensembles of stochastic trajectories in the two cases: stochastic trajectories with a fixed initial condition for the Bayesian Toeplitz-based method (Fig. 2a), and stochastic trajectories with random initial conditions sampled from the stationary distribution for the non-Bayesian approach (Fig. 2b).

F. Numerical results
At this point we have two valid O(∆t^{3/2}) inference schemes for the stochastic harmonic oscillator, which can be tested against synthetic time series. We generated the trajectories of a stochastic harmonic oscillator as explained in App. E 1 and subsampled them by progressively increasing the time separation ∆t between subsequently observed points.
[Fig. 2 caption fragment: The corresponding ensemble is referred to as the non-Bayesian inference approach. In both cases: T = 2, ω_0 = 1, η = 1.5, ∆t = 0.005.]
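A minimal sketch of how such synthetic data can be produced (App. E 1 is not reproduced here; a plain Euler–Maruyama integrator at a fine step, followed by subsampling, is our own simplifying assumption):

```python
import numpy as np

def simulate_oscillator(T=2.0, w0=1.0, eta=1.5, dt_fine=0.001,
                        n_fine=200_000, seed=0):
    """Euler-Maruyama integration of  x'' = -w0^2 x - eta x' + xi,
    with <xi(t) xi(t')> = 2 T eta delta(t - t')."""
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(2 * T * eta)
    x = np.empty(n_fine)
    # start from the stationary Gaussian distribution of the oscillator
    x[0] = rng.normal(0.0, np.sqrt(T) / w0)
    v = rng.normal(0.0, np.sqrt(T))
    for n in range(n_fine - 1):
        x[n + 1] = x[n] + v * dt_fine
        v += (-w0**2 * x[n] - eta * v) * dt_fine \
             + sigma * np.sqrt(dt_fine) * rng.normal()
    return x

def subsample(x, stride):
    """Keep one point every `stride`, mimicking a larger experimental dt."""
    return x[::stride]
```

Progressively increasing `stride` reproduces the progressive increase of the observation lag ∆t described above.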
Filtering the synthetic trajectories in time is a good blind inspection tool to check the robustness of the continuous description given by the inferred parameters, without prior knowledge of the time scales of the process. Of course, we expect all these approximate dynamical inference schemes to work only in a limited time window, since, in discretizing the equations of motion, the underlying assumption is that ∆t is much smaller than the typical time scales of the process. Tests on numerical simulations also tell us how wide this window is and when the method starts to fail.
We studied the performance of both O(∆t^{1/2}) and O(∆t^{3/2}) schemes in several damping regimes: results are shown in Fig. 1. We observed that methods of the same order give almost indistinguishable results, for long enough trajectories and for an experimental time lag ∆t sufficiently smaller than the two characteristic time scales of the system: η^{−1} and ω_0^{−1}. Indeed, η∆t and ω_0^2∆t^2 are the dimensionless parameters controlling the Taylor expansion upon which the discretization schemes are based. In the vicinity of the region where the Taylor series in η∆t and ω_0^2∆t^2 becomes non-asymptotic, the methods start to behave differently and inference results start to depart from their expected values. Of course, O(∆t^{3/2}) methods remain accurate and stable over a larger ∆t window than O(∆t^{1/2}) methods. This numerical analysis can also be used to identify heuristic bounds on their applicability.
Results from all the developed schemes are shown in Fig. 1a for a particular value of the damping coefficient, η_sim = 3. It is evident that, in contrast to the lower order ones, O(∆t^{3/2}) inference schemes do not require any rescaling factor, and the inferred value η* coincides with the true one. This is confirmed by Fig. 1b: there is a clear separation between O(∆t^{1/2}) and O(∆t^{3/2}) methods, which respectively fall on the two lines of slope 2/3 and 1 in the (η_sim, η*) plane, where η* is the inferred value of the damping coefficient and η_sim is the parameter value used in the simulation. The inferred ω_0^2 and T parameters reproduce the values used in the simulations (Figs. 1c-1d), for both the O(∆t^{1/2}) and O(∆t^{3/2}) methods.
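The 2/3 rescaling can be reproduced in a few lines: estimating velocities and accelerations by finite differences and fitting the drift by ordinary least squares (the simplest O(∆t^{1/2})-type estimator, equivalent to the factorized Euler-forward likelihood) recovers ω_0^2 essentially correctly but biases η towards 2η/3. This is an illustrative sketch of ours, not one of the exact schemes of App. B:

```python
import numpy as np

def naive_inference(T=2.0, w0=1.0, eta=1.5, dt=0.01, n_traj=200,
                    n_steps=4000, fine=5, seed=1):
    """Simulate many stationary trajectories (fine-step Euler-Maruyama,
    subsampled to spacing dt), then fit  a_n = -w2* x_n - eta* v_n
    using finite-difference velocities and accelerations."""
    rng = np.random.default_rng(seed)
    dtf = dt / fine
    sigma = np.sqrt(2 * T * eta)
    x = rng.normal(0.0, np.sqrt(T) / w0, size=n_traj)   # stationary start
    v = rng.normal(0.0, np.sqrt(T), size=n_traj)
    traj = np.empty((n_steps, n_traj))
    for n in range(n_steps):
        for _ in range(fine):
            x = x + v * dtf
            v = v + (-w0**2 * x - eta * v) * dtf \
                  + sigma * np.sqrt(dtf) * rng.normal(size=n_traj)
        traj[n] = x
    vel = (traj[1:] - traj[:-1]) / dt          # forward differences
    acc = (vel[1:] - vel[:-1]) / dt
    X = np.column_stack([traj[:-2].ravel(), vel[:-1].ravel()])
    coef, *_ = np.linalg.lstsq(X, -acc.ravel(), rcond=None)
    w2_star, eta_star = coef
    return w2_star, eta_star
```

With η = 1.5 the fit returns η* close to 1.0 = (2/3)·1.5, while ω_0^2 stays close to its true value, in line with Figs. 1b-1c.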
As a final comment, let us focus only on the O(∆t^{3/2}) class and recall that the non-Bayesian method involves only two-time quantities up to 2∆t, whereas the Toeplitz-based one involves two-time correlation functions that extend over the whole trajectory. This tells us that the former (non-Bayesian) method applies even to disconnected triplets of points (or, in general, to disconnected small sequences), if a fragmented observation of the system is the only one achievable. On the contrary, the latter (Toeplitz-based) method is exact only in the infinite trajectory limit, so the smaller the number of subsequent points, the less accurate the inference scheme becomes, even if multiple acquisitions separated by a small time lag are made. In other words, what matters in this case is not only the total number of points (which, for statistical reasons, is the only concern in all the other developed schemes) but also their succession in time. We checked this in numerical simulations, keeping constant the total number of points used in the inference procedure, (L + 1)n_S, and adapting the number of samples n_S as the length L + 1 of the sample trajectories is varied. Large deviations of the inferred value from the simulated one are visible for small values of L (Fig. 3). For small L it is also possible to make an approximate analytical estimate of the error introduced by replacing the quasi-Toeplitz covariance matrix with a Toeplitz one; numerical results and analytical estimates agree. These results confirm that the role of the initial condition on the marginalized velocity variable is crucial and cannot be replaced by a combination of conditions on the position variable. This is the origin of the incorrect estimate of the damping coefficient that one obtains from short trajectories. The exact result is only retrieved in the infinite L limit, yet Fig.
3 shows that the convergence is quite fast: 10-point trajectories already confine the error to 5%, and 15-point trajectories to 3%; these lengths are orders of magnitude below the typical number of recorded frames in common motility observation experiments.
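Under the Toeplitz approximation, the minus-log-likelihood of a trajectory is a standard Gaussian form in the noise residuals ζ_n = x_{n+1} − αx_n − βx_{n−1}, with tridiagonal covariance (a on the diagonal, b off-diagonal). A sketch (parameter names follow Eqs. (23) and (27)-(28); the function name is ours):

```python
import numpy as np

def toeplitz_minus_log_likelihood(x, alpha, beta, a, b):
    """-log P(x_2..x_L | x_0, x_1) for the update rule
    x_{n+1} = alpha x_n + beta x_{n-1} + zeta_n, with the noise
    covariance approximated by the symmetric tridiagonal Toeplitz
    matrix C_nm = a delta_{n,m} + b delta_{n,m+-1}."""
    z = x[2:] - alpha * x[1:-1] - beta * x[:-2]
    L = len(z)
    C = a * np.eye(L) + b * (np.eye(L, k=1) + np.eye(L, k=-1))
    sign, logdet = np.linalg.slogdet(C)
    return 0.5 * (z @ np.linalg.solve(C, z)
                  + logdet + L * np.log(2 * np.pi))
```

Setting b = 0 recovers the factorized (diagonal-noise) likelihood, which makes the role of the off-diagonal correlations explicit.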

III. INTERACTING CASE
Following our original objective to develop an inference strategy for natural flocks of birds, we generalized the inference equations of Sec. II and performed numerical simulations of the topological inertial spin model (ISM) on a non-evolving random lattice at low temperature. The model, introduced to account for experimentally observed features that could not be explained within the framework of first order processes [6,31], represents a second order generalization of the well-known Vicsek model. The stochastic equations of motion in three dimensions read: where the indexes i, j = 1, . . . , N label different individuals, v_0 is the constant modulus of each velocity vector v_i, and ξ_{i⊥} is the orthogonal projection to v_i of a three-dimensional white noise of parameters T and η: Motivated by the findings of [32], we choose to parametrize the coupling constants as J_ij = J n_ij, where n_ij = 1 if bird j is among the first n_c nearest neighbours of bird i, and n_ij = 0 otherwise.
In the ordered phase, the spin-wave expansion of the equations of motion of the inertial spin model linearizes the force terms, and Eq. (34) takes the form of a set of SDEs for N coupled harmonic oscillators [13]: Here π_i are the birds' normalized velocity fluctuations, lying on the plane orthogonal to the direction of collective motion, Λ_ij = n_c δ_ij − n_ij is the discrete Laplacian of the birds' network, and ξ_{i⊥} is now a two-dimensional white noise living on the same plane as π_i. To leading order, it is described by the parameters T and η appearing in Eq. (34). For a full derivation of the equations of motion in the spin-wave approximation see App. C. Let us just notice here that, thanks to the linearity of Eq. (35), the inference strategy we developed for the interacting harmonic case also applies to the inertial spin model in the highly polarized phase.
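The topological adjacency matrix n_ij and the Laplacian Λ_ij = n_c δ_ij − n_ij can be built directly from the birds' positions; a sketch (function name is ours):

```python
import numpy as np

def topological_laplacian(pos, nc):
    """n_ij = 1 if j is among the nc nearest neighbours of i
    (neighbours ranked by metric distance, counted topologically);
    Lambda = nc * I - n.  Note that n is in general NOT symmetric."""
    N = len(pos)
    d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude self-neighbourhood
    nn = np.argsort(d, axis=1)[:, :nc]   # indices of nc nearest neighbours
    n = np.zeros((N, N))
    n[np.arange(N)[:, None], nn] = 1.0
    return nc * np.eye(N) - n
```

Each row of Λ sums to zero (n_c on the diagonal minus n_c unit entries), which is the defining property of a discrete Laplacian.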
For the sake of simplicity, in our simulations we discarded the first equation of Eq. (34) and kept the birds' reciprocal positions fixed. The dynamical maximum likelihood approach, however, should work even when the birds' positions are reshuffled and static approaches fail, since at each time step it is possible to reconstruct the neighborhood of each individual and compute the associated time-dependent observables [13,14,33].
We applied our inference strategies to the synthetic trajectories. Results are in qualitative agreement with those of the harmonic oscillator. In particular, the expected rescaling factor of 2/3 for the damping coefficient is retrieved using any O(∆t^{1/2}) scheme, as shown in Fig. 4a. This corroborates the idea that the emergence of the 2/3 factor is a universal feature of second order stochastic processes, arising from the interplay between the terms containing second and first order time derivatives rather than from the particular conservative forces applied to the system. Again, O(∆t^{3/2}) schemes do not require any rescaling, at least for sufficiently long trajectories.
As already mentioned, however, there are some relevant differences with respect to the simple non-interacting case. First of all, the additional difficulty we must face in the case of N-body dynamics is that of estimating the interaction range. Since an explicit analytical minimization of the minus-log-likelihood is not feasible, for both O(∆t^{1/2}) and O(∆t^{3/2}) likelihood-based methods, a numerical approach is needed. The problem is however algorithmically tractable, since it consists of a one-dimensional optimization. Moreover, if the parametrization of the n_ij matrix discussed above is adopted, n_c is a discrete parameter, so the exact minimum can always be found (see Fig. 4b). Incorrect estimates of the topological interaction range can be due to a distorted reconstruction of the likelihood from the data. As the number of birds N or the number of trajectory points L is increased, the improved statistics smooths the rugged reconstructed likelihood and the real minimum becomes easier to detect. Another parameter playing a relevant role here is the time lag ∆t: when the separation between subsequent data points is very small compared to the time scales of the system, increments are also very small. Smaller increments correspond to smaller quantities to minimize, which are then subject to larger relative errors, as clearly observable in Fig. 4b.
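Because n_c is discrete, the optimization reduces to a brute-force scan over integer candidates; a sketch, where `nll` is a placeholder for the reconstructed minus-log-likelihood of App. B 3 (not implemented here):

```python
def best_nc(nll, nc_min=1, nc_max=30):
    """One-dimensional discrete minimization: evaluate the reconstructed
    minus-log-likelihood on every candidate interaction range and keep
    the exact argmin."""
    values = {nc: nll(nc) for nc in range(nc_min, nc_max + 1)}
    return min(values, key=values.get)
```

The scan guarantees the exact minimum over the search window, so any failure to recover the true n_c must come from distortions of the likelihood itself, as discussed above.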
Once the optimal value of n_c is recovered, it is then used to compute the spatially structured correlation functions which enter into the formulas for the remaining parameters (Apps. B 3-B 4). Non-Bayesian methods are not based on any likelihood definition and, as a result, do not allow us to infer n_c. Nevertheless, we show in App. B 4 that an approximate estimation of the effective temperature T/χ and of the damping coefficient η/χ is possible, adopting, for instance, the discretization scheme of the LI integrator. On the contrary, the parameters associated with the interaction potential, n_c and J, cannot be evaluated by this approach. Results for η/χ (Fig. 4a) and T/χ (Fig. 4c) are in good agreement with those obtained by applying the Toeplitz method, even for relatively short trajectories. Taking, for instance, trajectories of length L = 200 (far below the length required for the harmonic oscillator schemes to work well) for systems of N = 1000 particles already enables us to achieve good accuracy, with indistinguishable results between the two methods (see Fig. 4).
The need for very long trajectories in the likelihood-based O(∆t^{3/2}) inference scheme, for both single particle and many particle problems, stems from two different facts. Firstly, the shift-invariance approximation introduced by enforcing a Toeplitz structure on the noise covariance matrix performs better for longer trajectories. Secondly, the empirical reconstruction of two-time correlations, which are the quantities that enter into the definition of the likelihood, improves when obtained from longer trajectories. The advantage of moving from the single oscillator to the many-body interacting case is that a restricted number of "local" quantities turn out to dominate and to self-average in sufficiently large systems. In other words, the statistical issue in the computation of these quantities can be at least partially mitigated by averaging over the system size, rather than relying only on temporal averages as we are compelled to do in the case of the harmonic oscillator.

IV. CONCLUSIONS
We proposed several strategies to tackle the problem of learning the best continuous inertial stochastic dynamical model from time-lapse recordings of moving objects. The problems arising in this context are general, as they stem from the combination of three ingredients: the second (or higher) order nature of the process, when described in terms of the directly measurable degrees of freedom; the stochastic nature of the description; and the use of discrete sequences of data points. Contrary to first order processes, connecting the continuous-time description to a discrete-time one is not a straightforward task in the case of second order dynamics. Careful attention must be paid to the mathematical peculiarities of white noise, and to the fact that the lowest order of approximation in ∆t required to consistently take into account the stochastic nature of the evolution varies with the order of the original continuous-time SDE. This issue is well recognized in numerical analysis, where a plethora of integration schemes has been introduced and their stability and convergence have been extensively studied. From a different point of view, the problem has also been investigated by Drozdov and Morillo [23], who highlighted how the non-invertibility of the diffusion matrix requires a Taylor expansion of the dynamical equations that is accurate at least up to O(∆t^{3/2}) for a correct discretization of the path integral. Here, we combined this knowledge with that of the classical problem of variable elimination and the resulting loss of Markovianity in stochastic processes, and exported it to the field of statistical inference.
As a first result, this allowed us to detect and explain the emergence of a 2/3 rescaling factor for the inferred damping coefficient when O(∆t^{1/2}) discretization schemes are used. Secondly, we managed to build consistent inference schemes, based on the Bayesian maximum likelihood principle, that work well for sufficiently long trajectories. This is obtained through a higher order discretization, where the color of the noise and the loss of Markovian character due to the variable elimination are properly included. The only inaccuracy we were forced to introduce, in order to have an explicit expression for the likelihood that could be easily manipulated, is the Toeplitz approximation of the covariance matrix of the noise. Physically, this approximation corresponds to neglecting the breaking of shift invariance associated with the initial condition on the conjugate non-measurable variable. However, the error made with this approximation rapidly decreases as the number of points of the trajectory increases, and it is typically negligible for trajectories of a few tens of points.
Although we focused only on the simple case of linear systems, the procedure we followed is applicable to any kind of second order SDE driven by white noise, and the obtained formulas could be adapted to different kinds of conservative forces. In this framework it is also frequently possible to implement an analytical optimization with respect to the parameters of the model. The parameters that are too hard to extract analytically can be obtained by numerical maximization of the defined likelihood. We also developed, outside the Bayesian context, an alternative strategy, which exploits relations between two-time correlation functions and the parameters of the model that can be derived from the discrete update equations. In this approach there is no inconsistency in assuming shift invariance. Unfortunately, however, it cannot be applied to interacting models with a complicated, non-algebraic dependence on the interaction range parameter, since, in the absence of a likelihood definition, no numerical optimization can be performed.
All of the proposed methods provide reliable, controlled inference procedures (possibly combinable) for second order stochastic processes. In particular, we want to highlight that the simplest possible route is to adopt, even if incorrectly, one of the naive methods and introduce an a posteriori correction of the wrongly estimated parameter, to take into account the effect of the lowest order discretization. This approach is conceptually the same as giving a non-Markovian process an effective description with uncorrelated noise and rescaled effective parameters, for which Einstein's relation still holds. When not too cumbersome, however, it is preferable to adopt a higher order inference method, especially when a numerical optimization must be performed, because the higher order of accuracy makes the method stable over a larger ∆t interval, and the factorized structure of the likelihood that one retrieves in O(∆t^{1/2}) schemes is intrinsically incorrect.
We want to stress that, independently of the order of approximation, one of the advantages of this general framework is that it does not require an explicit analytical solution of the process; it simply exploits the local dynamical information carried by the differential equation. This is the pivotal aspect that makes the whole procedure so strikingly robust and widely applicable.
Following the same paradigm, one can try to generalize the approach to higher order processes, if motivated by experimental evidence. The caveats emerging from our analysis can be summarized as follows: given an n-th order SDE in the x variable driven by white noise, the elimination of each auxiliary variable among the first n − 1 time derivatives of the measurable degree of freedom x implies a progressive modification of the spectrum of the effective noise, as well as nested convolutions with memory kernels in the deterministic part of the equation. At the discrete level, this corresponds to a k-diagonal quasi-Toeplitz structure for the covariance matrix of the random terms appearing in the update equations, with k = 1 + 2(n − 1). At this stage, one can exploit the Toeplitz approximation, in the limit of long trajectories, to write the dynamical likelihood, the essential ingredient of dynamical statistical inference.
Another important remark is that we did not include measurement errors, but assumed that the data are given with infinite experimental precision. An extension of the developed method would be to model the measurement error and include it in the inference procedure, in order to avoid distortions in the reconstruction of the real dynamical parameters.
The natural use of the developed framework is application to real data. Technical specifications of acquisition systems have remarkably improved in recent years, and it is now possible to collect well resolved trajectories over long enough time windows. This is also true for animal groups on the move, where experiments are performed in the field and strong limitations on the acquisition length are usually set by the global motion of the group. For example, we know from previous work that the emergent dynamics of flocks of birds is dominated by an effective rotational inertia [6]. This inertia allows information to propagate linearly, and in an almost undamped way, allowing flocks to turn coherently. Retrieving the effective damping coefficient in this case will allow us to predict the scales at which damping becomes relevant, setting a size limit for groups able to collectively change direction. In the context of swarm dynamics, recent theoretical findings [34,35] suggest that the value of the damping coefficient sets, again, a size crossover for groups displaying different critical behavior on large scales. Understanding the interplay between size, information propagation and response is a key issue in collective behavior, and a reliable inference approach is crucial to provide well grounded answers to these questions.

FF thanks Marco Baldovin for helpful discussions and suggestions. IG and FF also thank Andrea Cavagna, Angelo Vulpiani and Massimiliano Viale.
Appendix A: Discretization procedure
Let us briefly summarize two possible systematic strategies to obtain a discretized equation in the space of the x variables up to arbitrary order. Following [20], we can consider Eq. (3) and formally solve the second equation: Plugging this expression back into the equation for x, we get a closed equation in x space: where now K(t) = ω_0^2 e^{−ηt} and the effective noise is given by ζ(t) = ∫_0^t ds e^{−η(t−s)} ξ(s). This formalism shows very clearly that, when projecting from the full phase space onto the x space, the dynamics acquires a memory, described by the kernel K(t), and the noise acquires color: ⟨ζ(t)ζ(t′)⟩ ∝ K(t − t′) extends over the whole trajectory. Discrete update equations on the scale ∆t can now be obtained by integrating Eq. (A2) between t_n and t_{n+1}. These equations share the characteristics of Eq. (A2): both the noise and the initial conditions are propagated in time. Luckily, however, the memory kernel decays exponentially, and by an appropriate reweighting we can get rid of both effects. Indeed the combination x_{n+1} − x_n − e^{−η∆t}(x_n − x_{n−1}) does not contain v_0 and has a short-range correlated effective noise: We can check that ⟨ζ_n ζ_m⟩ has the nearest neighbor structure of Eq. (27): ⟨ζ_n ζ_m⟩ = C_{nm} = a δ_{n,m} + b δ_{n,m±1}.
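The nearest-neighbour structure, with a = 4b, can be checked numerically. In the free limit (ω_0 → 0, η → 0, a simplifying choice of ours) the combination above reduces to the plain second difference x_{n+1} − 2x_n + x_{n−1}, whose exact covariances are a = (2/3)σ²∆t³ and b = (1/6)σ²∆t³; the sketch below verifies this by brute-force simulation at a finer step:

```python
import numpy as np

def noise_covariance_check(sigma=1.0, dt=0.1, K=50, n_coarse=20_000, seed=2):
    """Integrate  x'' = sigma * xi  at fine step dt/K, subsample to
    spacing dt, and measure the covariance of the second differences
    zeta_n = x_{n+1} - 2 x_n + x_{n-1}."""
    rng = np.random.default_rng(seed)
    dtf = dt / K
    dv = sigma * np.sqrt(dtf) * rng.normal(size=n_coarse * K)
    v = np.cumsum(dv)                    # free inertial velocity
    x = np.cumsum(v) * dtf               # position
    xc = x[::K]                          # coarse-grained trajectory
    zeta = xc[2:] - 2 * xc[1:-1] + xc[:-2]
    a = np.mean(zeta * zeta)             # diagonal element
    b = np.mean(zeta[:-1] * zeta[1:])    # off-diagonal element
    return a, b
```

The measured ratio a/b ≈ 4 illustrates why the noise covariance of the x-space update equation is tridiagonal rather than diagonal.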
So far, these equations are exact. Some approximation is needed at this stage to evaluate the integral of the force. Various methods have been investigated in the literature; among the simplest is the Langevin Impulse (LI) method [30], which approximates the integral with the value of the integrand at the midpoint, leading the coefficients a and b of the covariance matrix to assume the expressions reported in Eq. (28).
An alternative approach, followed in [25] (see also [36]), consists of considering the full system of equations in the (x, v) space in integral form: where f(x) = −ω_0^2 x for the harmonic oscillator. Again, approximations have to be made to compute the integrals. This can be done through an iterative Taylor expansion. To first order one gets back the EM scheme. Following [25], to second order we get: where D_n is defined as follows: (A8) and ξ_n and θ_n are i.i.d. Gaussian variables sampled from N(0, 1). Eliminating the velocity variables v_n and v_{n−1}, we find a difference equation of the form of Eq. (23): with α and β coinciding, up to O(∆t^2), with the Taylor expansion of the coefficients in Eq. (A5). The noise variable ζ_n is defined from Eq. (A7) as a linear combination of ξ_n, ξ_{n−1}, θ_n, θ_{n−1}. As a result, correlations between subsequent noise extractions emerge due to overlapping Wiener processes; these correlations are still described by Eq. (27). This second derivation is helpful in justifying the quasi-Toeplitz structure of the covariance matrix discussed in the main text. Indeed, fixing x_1 implies fixing the first random increment responsible for the position update in the integration scheme Eq. (A7), when the known initial conditions are (x_0, v_0). Since this stochastic increment enters into the definition of ζ_1 but not into that of ζ_2, the true covariance matrix must have a different entry C_{11} from the other elements on the main diagonal, as in Eq. (29). Several O(∆t^{1/2}) schemes for the numerical integration of second order stochastic differential equations can be defined. From each of them, by inconsistently retaining only the diagonal O(∆t^{3/2}) stochastic terms when writing the update equations in x space, we can extract a factorized expression for the dynamical likelihood, such as Eq. (16).
Let us focus on three particular examples: the standard explicit Euler-Maruyama scheme (called Euler-forward or EM-fwd), its implicit variant (called Euler-backward or EM-bkd), and the symmetric BBK scheme [37]. All three may be obtained from the second order SDE Eq. (1) by approximating first and second time derivatives as: • Euler-forward: • BBK: To find the three schemes, we expand the integrals in Eq. (A7) around different points. The resulting update equations in the three cases read: with σ = √(2Tη) and {r_n} a sequence of L − 1 i.i.d. Gaussian random variables of zero mean and unit variance. Notice that Eq. (B8) is not exactly what we obtain by applying the prescription given by Eqs. (B3) and (B4). We chose to modify the point where the force is computed because this greatly simplifies the normalization of the propagator in the interacting case.
Thanks to the independence of the random variables appearing in Eqs. (B7)-(B9), the discrete propagator takes an approximate factorized form, which we can generally write as: (B10) The reduced minus-log-likelihood, defined as corresponds in the factorized case to the temporal average of the quantity (S_n + ln Z_n). This quantity is defined in a slightly different way in the three cases above; consequently, in each of these cases the reduced minus-log-likelihood will read: We recall the notation used in the main text to indicate the equal-time, one-step and two-step experimental correlation functions: Minimization of Eqs. (B12)-(B14) with respect to the parameters of the model yields the following optimal values, according to the adopted scheme: • Euler-forward: • Euler-backward: • BBK: (B22) All of the schemes above are derived from O(∆t^{1/2}) integrators and consequently give a 2/3 rescaling factor for the inferred damping coefficient η*. This can be checked using the procedure outlined to derive Eq. (20), which consists of replacing the experimental two-time quantities with the known correlation functions of the harmonic oscillator and performing a Taylor expansion around zero temporal distance. In the same way, the exactness of the inference formulas for T* and ω_0^{2*} can be checked for the three methods.

Shift-invariant O(∆t^{3/2}) Bayesian approach
We argued that the joint probability of sequences of points in real space does not factorize into a chain of conditional probabilities. This happens because the dynamics of the harmonic oscillator, when projected onto the x space, is governed by evolution equations containing a colored noise, as explained in the main text and in App. A. The right scheme to adopt is then of the form of Eq. (23): as discussed in the main text, this exhibits strong convergence of order O(∆t^{3/2}) and requires correlations between subsequently extracted random variables to be taken into account, resulting in a covariance matrix with a symmetric tridiagonal (quasi-)Toeplitz structure (cf. Eqs. (29) and (27)). We pursue a maximum likelihood approach taking Eq. (25) as the function of the parameters of the model to maximize: The partition function is specified by Eq. (26) and Eq. (31), whereas the relation between the parameters α and β of the difference equation and the parameters of the dynamical model depends on the details of the adopted discretization scheme.
Thanks to the peculiar structure of this likelihood, one can go quite far with simple algebra in the optimization problem. First of all, it is convenient to reformulate the issue as a minimization problem for the minus-log-likelihood: (B26) As usual, the temperature appears just as a prefactor of the effective action, without affecting its actual dynamical structure. The optimal value is given by: Replacing it into Eq. (B25) and discarding additive constants and common factors, we obtain a reduced minus-log-likelihood: (B28) One can now split all the terms appearing in the sum and differentiate with respect to α and β, which play the role of independent parameters of the model. Adopting the Langevin Impulse integrator (see App. A), they correspond to: Adopting a second order Taylor expansion around the prepoint, they correspond to: As required for consistency, the two variants are equivalent up to O(∆t^3). The numerical results shown in this paper are obtained using Eq. (B29).
Imposing that the derivatives of L with respect to α and β vanish leads to: This procedure can be applied when dealing with a one-dimensional stochastic harmonic oscillator, and can be generalized to any one-dimensional system described by a Kramers process with velocity-independent forces: f = f(x). The only difference is the replacement of linear update equations with non-linear ones, which results in a new set of observables appearing in the effective minus-log-likelihood. All of these observables are still experimentally accessible, provided that the dataset consists of sample trajectories (x_0, x_1, . . . , x_L).
As one moves from single to many particle systems, extra parameters are needed: position and velocity variables are conveniently represented as N-component vectors, N being the number of constituents of the group, and model parameters become matrices. For the equations of motion of the three-dimensional ISM on a fixed lattice in the spin-wave approximation, Eq. (35), the update rule becomes: with where Λ_ij is the discrete Laplacian, and sums over the j index are implicit. The definitions of α_0, α_1 and β depend on the details of the discretization. Using, for instance, the LI integrator for Langevin equations: Using instead a second order Taylor expansion, we get: The three parameters α_0, α_1 and β are in any case not independent, since the extra independent parameters of the interacting problem are hidden in the adjacency matrix. In both of the cases considered above (second order Taylor expansion around the prepoint, Eq. (B39), and Langevin Impulse integrator, Eq. (B38)), α_0 and β are linked by the same relation: α_0 = −β − 1. With this replacement, and renaming α_1 = α, the minus-log-likelihood reads: Again, one can proceed with an analytic minimization with respect to T, α and β, giving: with K_0, . . . , K_5 the generalization to the many-particle case of the combinations of experimental observables T_1, . . . , T_5 defined above.

4. Non-Bayesian approach: inference formulas without defining the likelihood

We build in this section an alternative approach to the Bayesian one, as outlined in Section II E of the main text. To be explicit, we need to choose a discrete update equation in x space: let us choose again the one corresponding to the usual continuation rule of the LI: and multiply its r.h.s. and l.h.s. by x_n, x_{n+1} and x_{n−1} and take the average over the noise distribution. The resulting equations are: Using again Eq.
(B49), combined with the covariance matrix of the Gaussian variables, to compute ⟨ζ_n x_n⟩ and ⟨ζ_n x_{n+1}⟩, the relations we find are: In order to find Eqs. (B53)-(B55), we identified the actual correlation functions with the empirical ones, denoted by the symbols C, G and F, and we assumed stationarity in order to compute them explicitly. More precisely: After proper manipulation, one can extract "inference relations" for b, e^{−η∆t} and ω_0^2∆t, and derive from them the physical parameters of the model. In order, e^{−η∆t} is given as the solution of a second-degree polynomial equation: then b and ω_0^2∆t are computed as follows: Notice that these inference equations are not unique. Combining the starting equations in a different way would result in slightly different inference formulas, which, however, should provide the same result if the experimental correlation functions faithfully reproduce ensemble averages at the steady state.
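The structure of these relations can be illustrated on the three exact stationary identities obtained by multiplying the update rule x_{n+1} = αx_n + βx_{n−1} + ζ_n by x_{n−1}, x_n and x_{n+1} and averaging, using ⟨ζ_n x_{n−1}⟩ = 0, ⟨ζ_n x_n⟩ = b, ⟨ζ_n x_{n+1}⟩ = αb + a and a = 4b. The sketch below (our own parametrization in terms of α, β, b, not the explicit formulas of Eqs. (B56)-(B58)) solves them forwards for (C, G, F) and backwards for (α, β, b), the inversion reducing, as in the text, to a second-degree polynomial equation:

```python
import numpy as np

def stationary_correlators(alpha, beta, b):
    """Solve the three stationary identities (linear in C, G, F):
         F = alpha G + beta C
         G = alpha C + beta G + b
         C = alpha G + beta F + (alpha + 4) b    (using a = 4b)."""
    A = np.array([[beta, alpha, -1.0],
                  [alpha, beta - 1.0, 0.0],
                  [1.0, -alpha, -beta]])
    rhs = np.array([0.0, -b, (alpha + 4.0) * b])
    C, G, F = np.linalg.solve(A, rhs)
    return C, G, F

def infer_parameters(C, G, F):
    """Invert the same identities: (alpha, beta) are affine in b, so the
    third identity becomes a quadratic in b; keep the small (physical) root."""
    M = np.array([[G, C], [C, G]])
    a0, b0 = np.linalg.solve(M, np.array([F, G]))    # alpha, beta at b = 0
    a1, b1 = np.linalg.solve(M, np.array([0.0, -1.0]))  # d(alpha,beta)/db
    qa = a1
    qb = a1 * G + b1 * F + a0 + 4.0
    qc = a0 * G + b0 * F - C
    roots = np.roots([qa, qb, qc])
    b = float(np.real(min(roots, key=abs)))
    return a0 + a1 * b, b0 + b1 * b, b
```

The round trip (parameters to correlators and back) is exact, which mirrors the remark above: different algebraic combinations of the starting equations give the same answer when the correlators are exact.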
This strategy cannot be adapted to interacting problems outside of the mean field approximation. The obstacle comes from the parametrization of the interaction matrix, which is the discrete counterpart of introducing an interaction range in the corresponding field theory. Without an a priori parametrization, the issue of sufficient statistics arises: one could think of repeating the same procedure in the multi-particle case for each particle pair and looking for independent inference formulas for every matrix element JΛ_ij. Even bypassing the technical difficulties related to solving the resulting system of N^2 + 2 second-degree equations for the unknowns b, e^{−η∆t} and {JΛ_ij}_{i,j=1...N}, we would have many more parameters to infer than points in each frame. This problem becomes completely intractable if one also allows Λ_ij to evolve in time, as in active animal groups [14,38].
Assumptions about the structure of the matrix Λ_ij dramatically reduce the number of parameters and alleviate the problem of insufficient statistics, but they require an alternative strategy to estimate the interaction range, since this physically motivated parametrization does not allow us to find closed-form equations.
It is nevertheless possible to approximately estimate the damping coefficient and the effective temperature of the system of interacting particles, assuming that they are all immersed in the same uniform thermal bath. Under this assumption, Eqs. (B50)-(B52) can be adapted to the interacting case and properly manipulated to find the following relations: where we have used the third independent equation to eliminate J/χ and exploited the fact that a = 4b, with b = (1/6)(2Tη/χ^2)∆t^3. Let us define the empirical spatio-temporal correlation functions involved in these inference formulas: • Equal-time correlations: • One-step correlations: • Two-step correlations: The observables appearing in Eqs. (B59)-(B60) are defined from Eqs. (B61)-(B66) as follows. We can distinguish the contribution of self-correlations, encoded by: and that of correlations between directly interacting birds, encoded by the quantities: where Λ_ij = n_c δ_ij − n_ij. Notice that all of them are by definition self-averaging quantities, which become more and more stable as the size of the system increases.
As already stressed, in the absence of a proper likelihood the functions denoted with an 'int' subscript cannot be handled directly; however, the manipulation we carried out to derive Eqs. (B59)-(B60) confines them to sub-leading terms. This can be checked by inspecting the corresponding combinations of correlation functions. Under the working hypothesis that ∆t is sufficiently small, we can neglect these terms and find usable relations to extract the effective parameters of the thermal bath (η/χ, T/χ) from the experimental self-correlations only. Precisely, η/χ is found as the solution of an equation involving these self-correlations, whereas the effective temperature is extracted from b. Notice that this formula is exactly equivalent to Eq. (B57), since we defined the effective damping coefficient of the harmonic oscillator as η = µ/m, whereas the corresponding quantity, having the dimensions of an inverse time scale, is η/χ for the ISM. These formulas have been applied to obtain the results shown in Fig. 4.
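The empirical spatiotemporal correlations entering these inference formulas can be estimated directly from the sampled frames. The following is a minimal sketch of that estimation; function and variable names are illustrative and not taken from the paper, which works with the specific combinations of Eqs. (B61)-(B66).

```python
import numpy as np

def self_correlations(x):
    """Empirical self-correlation functions from sampled trajectories.

    x: array of shape (frames, N, d) holding the degrees of freedom of N
    particles in d dimensions.  Returns the equal-time, one-step and
    two-step self-correlations, averaged over particles and over all
    frames for which the shifted product is defined.
    """
    c_eq  = np.mean(np.sum(x[:-2] * x[:-2],  axis=-1))  # <x_n . x_n>
    c_one = np.mean(np.sum(x[:-2] * x[1:-1], axis=-1))  # <x_n . x_{n+1}>
    c_two = np.mean(np.sum(x[:-2] * x[2:],   axis=-1))  # <x_n . x_{n+2}>
    return c_eq, c_one, c_two
```

Being averages over all particles and frames, these estimators are self-averaging, consistently with the remark above.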
Appendix C: Equations of motion of the ISM in the spin wave approximation (SWA)

We derive in this appendix the equations of motion of the inertial spin model (ISM) in the so-called spin wave approximation (SWA). The name comes from the analogy with ideal Heisenberg ferromagnets, which at very low temperatures can be studied using an approximate theory whose basic idea is that the lowest energy excitations in a ferromagnet are those produced by a single reversed spin over a large number of otherwise aligned spins in a crystal lattice. Dyson showed that an interaction between spin-wave states arises from this approximation and should be taken into account to consistently work out the spin wave expansion [39]. In a similar way, since natural flocks of starlings are in a deeply ordered phase, we can perform an expansion around the perfectly ordered state of the flock, where all of the birds' velocities are aligned along the same direction.
Let us denote by n the collective direction of motion of the flock. Each vector v_i can be decomposed into its longitudinal and transverse components with respect to n. In the case of bird flocks, the spin-wave approximation amounts to expanding the longitudinal components to lowest order in the transverse fluctuations, since each v_i has unit length. The equations of motion of the ISM (with fixed interaction network) can be written in the form of a set of second order SDEs for the velocity variables, where the ⊥ symbol indicates the projection onto the plane orthogonal to the direction of motion of the i-th bird, v_i. This projection operator and the last term of Eq. (C3) are the required ingredients to ensure individual speed conservation: |v_i(t)| = v_0 = 1 ∀ i, t. Thanks to this property, Eq. (C3) further simplifies. Using Eqs. (C1)-(C2), one can evaluate all the terms appearing in Eq. (C4) at the desired order of approximation. Let us focus first on time derivatives: we notice that, in principle, they also produce terms containing dn/dt and d²n/dt². In the following we will assume that the direction of collective motion n is constant. This is legitimate in the limit N → ∞, when the wandering of the order parameter is suppressed, or at least when it is very slow compared to the relaxational dynamics of the degrees of freedom. If, on the contrary, one wants to take this effect into account, apparent forces emerge because the chosen reference frame is non-inertial. Neglecting apparent forces makes it possible to separate on-plane (i.e. perpendicular to n) and off-plane (i.e. parallel to n) contributions, and to completely disentangle the corresponding equations.
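The decomposition above can be sketched numerically. Here π_i is the transverse fluctuation and the longitudinal component is approximated by the standard spin-wave expansion √(1 − |π_i|²) ≈ 1 − |π_i|²/2, valid because |v_i| = 1; names are illustrative, not the paper's.

```python
import numpy as np

def spin_wave_decompose(v, n):
    """Decompose unit velocity vectors w.r.t. the collective direction n.

    v: array (N, 3) of unit velocities; n: collective direction of motion.
    Returns the transverse fluctuations pi_i = v_i - (v_i . n) n and the
    spin-wave estimate of the longitudinal components,
    sqrt(1 - |pi_i|^2) ~ 1 - |pi_i|^2 / 2.
    """
    n = np.asarray(n, dtype=float)
    n = n / np.linalg.norm(n)
    v_long = v @ n                        # exact longitudinal components
    pi = v - np.outer(v_long, n)          # transverse (pi-plane) fluctuations
    v_long_swa = 1.0 - 0.5 * np.sum(pi**2, axis=1)  # SWA expansion
    return pi, v_long_swa
```

For a deeply ordered flock the fluctuations π_i are small and the quadratic expansion of the longitudinal component is accurate to O(|π|⁴).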
One can then consider the equations in the π-plane only, where Λ_ij = n_c δ_ij − n_ij and P̂ is the projection operator onto the plane perpendicular to the collective velocity. The velocity fluctuations π_i play in this case the same role as spin excitations in Dyson's SWA, both becoming the new degrees of freedom and displaying a linear interaction.
At this stage, what remains to be explicitly evaluated is only P̂ ξ_i⊥. We know that ξ_i⊥ lives in the plane perpendicular to v_i, so that the component perpendicular to the plane spanned by V and v_i is left unchanged by this projection operator, while the other one is contracted by a factor cos θ_i, with θ_i the angle between v_i and n. As a result, the second moment of each noise term is rescaled, with respect to the original one, by a factor Φ_i. In order to let the fluctuation-dissipation theorem hold, this rescaling can be reabsorbed into the temperature parameter T/χ, which is in principle different for each bird. At an averaged level, we can define a new spin wave temperature that differs from the original temperature of the inertial spin model by this average factor. In the low temperature case, where |π| ≪ 1, we have Φ = 1 + O(|π|²); the first correction to the temperature parameter is then of lower order with respect to the terms neglected in the deterministic part of Eq. (C6), and can correctly be included through this simple effective rescaling.
As long as the experimental or statistical errors are large enough and the system sufficiently ordered, this SWA-related correction is negligible. Thanks to the large statistics and high accuracy achieved with our simulations and inference machinery, we are able to detect it in Fig. 4c, where points lie systematically below the line of slope 1, especially for higher values of the temperature, which in turn correspond to lower polarization values. A comparison between the two panels of Fig. 7 confirms that this is truly the origin of the observed trend, and not an intrinsic defect of the inference procedure.

Appendix D: The role of the initial condition
What emerges from the discussion carried out in the main text is that the ultimate problem in developing a Bayesian inference procedure for second order stochastic dynamical systems is the elimination of the initial condition on the first derivative of the measured degrees of freedom. Suppose that the alleged model for some dynamical phenomenon of interest is described by a given set of stochastic differential equations. If one wants to adopt a Bayesian inference strategy based on the maximum likelihood principle to learn the parameters of the model from experimental data, the first unavoidable task is to find a usable expression for the probability of the observed sequence of points, given the parameters of the model λ. For first order stochastic processes, i.e. when all the degrees of freedom allowing for a Markovian description of the dynamics are experimentally accessible, it is clear how this likelihood should be computed (see, for example, [13]). For second order stochastic processes where only position variables can be measured, the inference problem may turn out to be ill-defined.
Indeed, for a simple first order model with initial condition y(0) = y_0, what one usually refers to is the propagator P(y_L, ..., y_1 | λ; y_0), where y_0 represents, together with the structure of the model, some 'shared information' appearing as a hidden conditional in the posterior, the likelihood and the prior. Conventionally, we will put after the semicolon any piece of this shared information that does not explicitly intervene in Bayes' formula.
In the case of a second order stochastic process, the initial condition is given by the pair x(0) = x_0, ẋ(0) = v_0. We can then define a propagator P(x_L, ..., x_1 | λ, x_0, v_0). However, contrary to x_0, the initial condition on the velocity is not empirically known, so we cannot identify this propagator with a likelihood of the form P(x_L, ..., x_1 | λ; x_0, v_0). Two alternatives are formally possible:
• Starting from P(x_L, ..., x_1 | λ; x_0, v_0), one can try to replace v_0 with a second condition on the x variable (since this is directly measurable), formally obtaining: P(x_L, ..., x_2 | λ; x_0, x_1) = ∫ dv_0 P(x_L, ..., x_2 | λ; x_0, v_0) · P(v_0 | x_0, x_1).
Replacing the original pair of initial conditions (x_0, v_0) on the degree of freedom and its derivative with a pair of conditions (x_0, x_1) on the non-derived variable is a standard, faultless procedure in deterministic systems, under proper existence and uniqueness conditions for the solution. In contrast, the same cannot easily be done for stochastic systems, because stochasticity implies that information on the initial velocity is not equivalent to information on a subsequent position.
• The second option is to include v_0 as a parameter of the model. In general, when one takes the dynamical description of the process in x space, as in Eq. (A2), the initial condition on the velocity acts as an external source. Hence, from the viewpoint of statistical inference, v_0 becomes a parameter to be estimated. Exploiting Bayes' theorem: P(λ, v_0 | x_L, ..., x_0) ∝ P(x_L, ..., x_1 | λ, v_0; x_0) ρ(λ, v_0), (D4) with ρ(λ, v_0) the prior on all the parameters of the model. In this scenario, the assumption of a uniform prior leading to the maximum likelihood principle is not as easily justified as before, and more sophisticated approaches might be preferable.
Although naturally suggested by the theoretical picture of the problem, both of these alternatives are too involved to be carried out explicitly, except in the naïve O(∆t^{1/2}) approach, where P(v_0 | x_0, x_1) corresponds, within the considered order of approximation, to δ(v_0 − (x_1 − x_0)/∆t), and a fully factorized closed form for the discrete path integral is found from Eq. (D3).
In the approach followed in Sec. II D (details in App. B 2) we dealt with the initial condition problem in the following way. Firstly, the choice of the LI discretization scheme confines the initial condition problem to the first step only, independently of the total number of points in the trajectory and of how the decay time of the memory kernel compares to ∆t. This can be clearly understood from the derivation in App. A. We start from update equations in x space containing v_0 as a parameter at every time step. A linear combination of properly weighted update equations at different steps enables one to eliminate v_0, but this cannot be done at the initial step, where the required shift invariance is broken. Secondly, we neglect this breaking of shift invariance, which gives the noise covariance matrix a quasi-Toeplitz structure, and introduce a Toeplitz approximation for it. This approximation works well for long trajectories (i.e. made up of a large number of points), whereas it fails for very short ones. The convergence is however quite fast (see Fig. 3 and the related discussion in the main text). Numerical results on the ISM confirm that, for the typical length of available data sets on starling flocks [40], the error introduced by this approximation is by far negligible.
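The Toeplitz approximation described above can be sketched as follows: given a covariance matrix that is Toeplitz except near the first step, each diagonal is replaced by its bulk value, discarding the first-step anomaly. This is an illustrative sketch under that generic structure, not the paper's exact covariance matrix from App. B 2.

```python
import numpy as np

def toeplitz_approximation(C):
    """Toeplitz approximation of a quasi-Toeplitz noise covariance matrix.

    C (L x L) is assumed Toeplitz except near the first step, where shift
    invariance is broken by the initial condition.  We build a genuinely
    Toeplitz matrix whose k-th diagonal takes the bulk value read off from
    the last row of C.
    """
    L = C.shape[0]
    bulk = np.array([C[L - 1, L - 1 - k] for k in range(L)])
    i, j = np.indices((L, L))
    return bulk[np.abs(i - j)]
```

For long trajectories the first-step anomaly affects a vanishing fraction of the matrix entries, which is why the approximation converges quickly with trajectory length.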
To obtain the analytical prediction for the rescaling coefficients shown in Fig. 3, we followed the same line of reasoning that led to the prediction of the 2/3 rescaling factor for the inferred value of the damping parameter in the O(∆t^{1/2}) schemes. We took the explicit inference formula for β* in Eq. (B31), replaced the two-time products x_n x_m with the self-correlation function C(|n − m| ∆t) and performed a Taylor expansion up to third order for small ∆t. What we obtained, for any L, reads: β* = e^{−η* ∆t} [1 + (L + 1) ...]
Appendix E: Numerical methods

Stochastic harmonic oscillator
The numerical simulations of the single stochastic harmonic oscillator were performed using the simplest algorithm proposed in [25]. The update equations we used correspond to those reported in Eq. (A7). We chose a constant integration time step equal to τ_sim = 0.005 for all the sampled values of η, T and ω_0² used to produce Figs. 1 and 5; the maximum value of η∆t we worked with is 2.5 · 10⁻².
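To make the simulation setup concrete, here is a minimal sketch of generating synthetic trajectories of the underdamped stochastic harmonic oscillator. Note that this uses a plain Euler-Maruyama step, not the stronger O(∆t^{3/2}) scheme of Eq. (A7)/[25] actually employed in the paper; parameter names and defaults are illustrative.

```python
import numpy as np

def simulate_sho(eta=1.0, w2=1.0, T=0.5, tau=0.005, steps=200_000, seed=0):
    """Euler-Maruyama sketch of the underdamped stochastic harmonic oscillator

        dx = v dt,    dv = (-eta * v - w2 * x) dt + sqrt(2 * eta * T) dW.

    Returns the positions x_n sampled at the integration step tau; an
    inference procedure would then sub-sample them with a time lag dt >= tau.
    """
    rng = np.random.default_rng(seed)
    x = np.empty(steps + 1)
    x[0], v = 0.0, 0.0
    amp = np.sqrt(2.0 * eta * T * tau)   # noise amplitude per step
    for n in range(steps):
        v += (-eta * v - w2 * x[n]) * tau + amp * rng.standard_normal()
        x[n + 1] = x[n] + v * tau
    return x
```

A quick sanity check is equipartition: after the transient, ⟨x²⟩ should approach T/ω_0².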
No extra filtering of the produced synthetic trajectories was performed before applying the inference procedure: the minimum value of the time lag ∆t we took corresponds to τ_sim itself. Possible differences in behaviour at small ∆t across inference schemes of the same order may be due to the discrete, explicit or implicit nature of the integration algorithm resembling one scheme more closely than another. Nonetheless, since the scheme is strongly convergent [24, 41] at order O(∆t^{3/2}), it is robust enough to reproduce, for our purposes, an approximately correct probability distribution of the stochastic trajectories of the harmonic oscillator, even at the scale of τ_sim. Since the observables of interest depend on fluctuations rather than on averages, the choice of a good-for-the-purpose numerical integrator is crucial, for the single one-dimensional harmonic oscillator as well as for the N-particle system simulated in the interacting case. Sophisticated algorithms widely used for MD simulations do not necessarily match our requirements.

ISM simulations
We implemented a numerical integrator for the ISM in d = 3 that combines the leapfrog method with Boris's trick to ensure speed conservation [42]. We performed simulations on fixed Poisson random lattices (i.e. sites are randomly chosen points with uniform distribution), discarding the update of particle positions and the consequent reshuffling effects. As a result, the adjacency matrix of the graph associated to the interacting particle system is time-independent and the constant speed v_0 of each bird does not play any role. The numerical integrator we used thus consists of a set of update equations with t_n = −(∆t/2χ) s_{n+1/2} and u_n = 2 t_n / (1 + |t_n|²), where Ξ_i^n is a three-dimensional isotropic Gaussian variable of zero mean and fixed variance. The adjacency matrix is defined in terms of r_ij, the rank of bird j as a neighbour of bird i (excluding the bird itself, to which we conventionally associate rank r_ii = 0). In all of our simulations we worked with periodic boundary conditions. We tried to ensure that the system was sampled in a stationary regime by starting from microscopic configurations corresponding to polarization values close to the equilibrium ones. The polarization is the macroscopic order parameter of the system and is defined, in perfect analogy with the magnetization of a 3-dimensional Heisenberg model, as the modulus of the mean velocity. Flocks of N = 1000 birds were simulated to obtain the results shown in this paper, with topological range of interaction n_c = 6 (except for the data in Fig. 4b), alignment strength J/χ = 5 and effective temperature T/χ in the range [0.2, 1.2]. When not explicitly indicated, we took T/χ = 0.4, approximately corresponding to a polarization of 0.97 (for n_c = 6). We chose an integration time step of τ_sim = 0.0005/(J n_c) for all the simulations. Different damping regimes were explored, and the performance of the inference method was tested in each of them, and for various choices of the time lag ∆t.
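The speed-conserving step can be sketched with the quantities defined above. Using t_n = −(∆t/2χ) s_{n+1/2} and u_n = 2 t_n / (1 + |t_n|²), the velocity is rotated as in the classic Boris pusher, v' = v + (v + v × t_n) × u_n, which leaves |v| exactly unchanged. This is a sketch of the rotation step only, not the full leapfrog integrator, and the function name is ours.

```python
import numpy as np

def boris_velocity_update(v, s_half, dt, chi=1.0):
    """Speed-conserving velocity rotation in the spirit of Boris's trick.

    v: velocity (3,); s_half: half-step spin (3,); dt: time step;
    chi: moment-of-inertia parameter of the ISM.  Returns the rotated
    velocity, with |v| preserved to machine precision.
    """
    t = -(dt / (2.0 * chi)) * np.asarray(s_half, dtype=float)
    u = 2.0 * t / (1.0 + np.dot(t, t))
    return v + np.cross(v + np.cross(v, t), u)
```

Exact norm conservation is the reason the constant speed v_0 plays no role in these fixed-lattice simulations: the update is a pure rotation of each velocity vector.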
In order to disentangle the effects of the discrete nature of the simulation from genuine malfunctioning of the inference schemes, the minimum inference time step ∆t displayed in Figs. 4b and 6 is 5 τ_sim.