Statistical inference across time scales

We investigate statistical inference across time scales. As a toy model, we take the estimation of the intensity of a discretely observed compound Poisson process with symmetric Bernoulli jumps. We have data at different time scales: microscopic, intermediate and macroscopic. We quantify the smooth statistical transition from a microscopic Poissonian regime to a macroscopic Gaussian regime. The classical quadratic variation estimator is efficient at both microscopic and macroscopic scales but, surprisingly, exhibits a substantial loss of information at the intermediate scale, a loss that can be related explicitly to the sampling rate. We discuss the implications of these findings beyond this idealised framework.


The jumps are i.i.d. symmetric Bernoulli random variables, independent of the standard homogeneous Poisson process (N t ) with intensity ϑ ∈ Θ = (0, ∞). Suppose we have discrete data over [0, T ] at times i∆. This means that we observe the vector X defined by (2), and we obtain a statistical experiment by taking P ϑ as the law of X when (X t ) is governed by (1). This toy model is central to several application fields, e.g. financial econometrics or traffic networks (see the discussion in Section 3 and the references therein). Moreover, it already exhibits several interesting properties that illuminate a tentative notion of statistical inference across scales, which is the topic of this paper.
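For concreteness, the sampled model can be simulated in a few lines of Python. This is a sketch with names of our own choosing; the Poisson sampler is the naive arrival-counting one, and each increment is a sum of N ~ Poisson(ϑ∆) symmetric ±1 jumps:

```python
import random

def poisson(lam, rng):
    """Draw N ~ Poisson(lam) as the number of unit-rate exponential
    arrivals falling before time lam."""
    t, k = rng.expovariate(1.0), 0
    while t <= lam:
        k += 1
        t += rng.expovariate(1.0)
    return k

def sample_increments(theta, delta, n, rng):
    """n i.i.d. copies of X_{i*delta} - X_{(i-1)*delta}: each is a sum of
    N ~ Poisson(theta*delta) symmetric +/-1 Bernoulli jumps."""
    return [sum(rng.choice((-1, 1)) for _ in range(poisson(theta * delta, rng)))
            for _ in range(n)]

rng = random.Random(1)
incs = sample_increments(theta=2.0, delta=0.5, n=20000, rng=rng)
# since the jumps are +/-1, E[(X_delta)^2] = theta * delta = 1.0 here
print(sum(x * x for x in incs) / len(incs))
```

The empirical second moment of the increments recovers ϑ∆, which is the moment identity underlying the quadratic variation estimator discussed below.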
On the one hand, if we observe (X t ) microscopically, that is if ∆ = ∆ T → 0 as T → ∞, then asymptotically we can (essentially) locate the jumps of (N t ), which convey all the relevant information about the parameter ϑ. In that case, X is "close" to the continuous path (X t , t ∈ [0, T ]). On the other hand, if we observe (X t ) macroscopically, that is if ∆ T → ∞ under the constraint T /∆ T → ∞, we have a completely different picture: the diffusive approximation (3) becomes valid, where (W t ) is a standard Wiener process. We elaborate on the approximation (3) in the Appendix. Inference on ϑ essentially transfers into a Gaussian variance estimation problem; in that case, the state space rather becomes R ⌊T ∆ −1 ⌋+1 . Finally, if we observe (X t ) at the intermediate scale 0 < lim inf ∆ T ≤ lim sup ∆ T < ∞, we observe a process with too many jumps to be located accurately from the data, and too few to validate the Gaussian approximation (3). Therefore, depending on the scale parameter ∆ T , the state space may vary, and this has an impact on the underlying probability measures P ϑ , although the interpretation of the parameter of interest ϑ remains the same at all scales. What we have is rather a family of experiments, where P T,∆ ϑ denotes the law of X given by (2), and these experiments E T,∆ may exhibit different behaviours at different scales ∆. Heuristically, we would like to say that at the microscopic scale ∆ T → 0, the measure P T,∆T ϑ conveys the same information about ϑ as if the jump times of (X t ) were observed. At the other extreme, at the macroscopic scale ∆ T → ∞ with T /∆ T → ∞, the measure P T,∆T ϑ should convey the same information about ϑ as the law of 0, √ ϑW ∆T , . . . , that is, as if the data were drawn from a Brownian motion with variance ϑ.

C. Duval and M. Hoffmann
The following questions naturally arise: i) How does the model formulated in (4) interpolate, from a statistical inference perspective, between microscopic (when ∆ = ∆ T → 0) and macroscopic scales (when ∆ = ∆ T → ∞)? In particular, how do intrinsic statistical information indices (such as the Fisher information) evolve as ∆ = ∆ T varies? ii) Does any nontrivial phenomenon occur in the intermediate regime? iii) Given i) and ii), if a statistical procedure is optimal at a given scale ∆, how does it perform at another scale? Is it possible to construct a single procedure that automatically adapts to each scale ∆, in the sense that it is efficient simultaneously over different time scales?

Main results
In this paper, we systematically explore questions i), ii) and iii) in the simplified context of the experiments E T,∆ built upon the continuous-time random walk model (1), for transparency. Some extensions to non-homogeneous compound Poisson processes are given, and the generalisation to more general compound laws is also discussed. As for i), we prove in Theorems 1, 2 and 3 that the LAN (Local Asymptotic Normality) condition holds at all scales ∆. This means that P T,∆ ϑ can be approximated, in an appropriate sense, by the law of a Gaussian shift. We derive in particular the Fisher information of E T,∆ and observe that it depends smoothly on the scale ∆. We shall see that the answer to ii) is positive. More precisely, we first prove in Theorem 4 that the normalised quadratic variation estimator is asymptotically efficient (it is asymptotically normal and its asymptotic variance is equivalent to the inverse of the Fisher information) in both the microscopic and macroscopic regimes. In the microscopic regime, this stems from the fact that the approximation (5) becomes valid, as the jumps are ±1, and the efficiency is then a consequence of N T /T being the maximum likelihood estimator in the approximation experiment (5). In the macroscopic regime, thanks to the diffusive approximation (3), we recover precisely the maximum likelihood estimator of the macroscopic approximation experiment (6). Surprisingly, ϑ QV T fails to be efficient in the intermediate regime. More precisely, we show in Theorem 5 that, although rate optimal, ϑ QV T misses the optimal variance by a non-negligible factor, depending on ∆ ∞ , that can reach up to 23%. This phenomenon is due to the fact that in the intermediate regime (7), the process (X t ) is sampled at a rate of the same order as the intensity of its jumps. On the one hand, (X i∆T − X (i−1)∆T ) 2 gives no accurate information as to whether a jump has occurred or not during the period [(i − 1)∆ T , i∆ T ], contrary to the case ∆ T → 0.
On the other hand, there are not enough jumps to validate the approximation of X i∆T − X (i−1)∆T by a Gaussian random variable, contrary to the case ∆ T → ∞. Finally, we construct in Theorem 6 a one-step correction of ϑ QV T that yields an estimator efficient at all scales, giving a positive answer to iii). This paper is organised as follows. We first propose in Section 2.1 a canonical framework for different time scales by considering the family of experiments (E T,∆T ) T >0 . The way the scale parameter depends on T defines the terms microscopic, intermediate and macroscopic scales rigorously. Specialising to model (1) for transparency, the results about the structure of the corresponding (E T,∆T ) T >0 are stated in Section 2.2. We show in Theorems 1, 2 and 3 that the LAN (Local Asymptotic Normality) property holds simultaneously over all scales and provides an explicit expression for the Fisher information. The proof follows the classical route of [9] and boils down to obtaining accurate approximations of the distribution in the limit ∆ T → 0 or ∞. Note that f ∆T (ϑ, k) does not depend on i since (X t ) has stationary increments. Although explicit, the intricate form of f ∆T (ϑ, k) requires asymptotic expansions of modified Bessel functions of the first kind. In the macroscopic regime, however, we were unable to obtain such expansions. We take another route instead, proving directly the asymptotic equivalence in the Le Cam sense, a stronger result, at the expense of requiring a rate of convergence of ∆ T to ∞ that is presumably superfluous. We show in Theorems 4 and 5 of Section 2.3 that the quadratic variation estimator ϑ QV T is rate optimal and efficient in both the microscopic and macroscopic regimes, but not at the intermediate scales (7). This negative result is, however, complemented by the construction of an estimator, based on a one-step correction of ϑ QV T , that is efficient over all scales (Theorem 6).
Moreover, this estimator has the advantage of being computationally implementable, in contrast to the theoretically optimal maximum likelihood estimator. Section 3 gives some extensions to the case of a non-homogeneous compound Poisson process (Theorem 7) and addresses the generalisation to more general compound laws. A comparison with related work on estimating Lévy processes from discrete data is also given. Section 4 is devoted to the proofs.

Building up statistical experiments across time scales
Let T > 0 and ∆ > 0 be such that ∆ ≤ T . On a rich enough probability space (Ω, F , P), we observe the process (X t ) defined in (1) at frequency ∆ −1 over the period [0, T ]. Thus we observe X defined in (2), and with no loss of generality we take X 0 = 0. We obtain a family of statistical experiments, where P T,∆ ϑ denotes the law of X when (X t ) has the form (1), and Θ ⊆ (0, ∞) is a parameter set with non-empty interior. The experiment E T,∆ is dominated by the counting measure µ T on Z ⌊T ∆ −1 ⌋ . Abusing notation slightly, we may (and will) identify X with the canonical observation in E T,∆ . Since (X t ) has stationary and independent increments under P T,∆ ϑ , we obtain the following expression for the likelihood, where we have set, for k ∈ Z, the increment probability f ∆ (ϑ, k). We shall repeatedly use the terms microscopic, intermediate and macroscopic scale (or regime). In order to define these terms precisely, we let ∆ = ∆ T depend on T with 0 < ∆ T ≤ T , and we adopt the following terminology.

The regularity of (E T,∆T ) T >0 across time scales
Let us recall that the family of experiments (E T,∆T ) T >0 satisfies the Local Asymptotic Normality (LAN) property at the point ϑ ∈ Θ with a given normalisation if (9), (10) and (11) hold; we then informally say that (E T,∆T ) T >0 is regular with information I T,∆T (ϑ). This means that, locally around ϑ, the law of X can be approximated by the law of a Gaussian shift experiment, in which one observes a single random variable, with ξ T approximately distributed as a standard Gaussian random variable under P T,∆T ϑ as T → ∞. In particular, I T,∆T (ϑ) is the Fisher information of the Gaussian shift experiment: the optimal rate of convergence for recovering ϑ, up to constants, from X is the same as the one obtained from Y and is given by I T,∆T (ϑ) −1/2 , provided I T,∆T (ϑ) → ∞ as T → ∞. Note also that if the convergence of the remainder term r T = r T (v) in (11) holds locally uniformly in v, then I T,∆T (ϑ) can be replaced by any function J T,∆T (ϑ) such that J T,∆T (ϑ) ∼ I T,∆T (ϑ) as T → ∞ without affecting the LAN property. Hereafter, the symbol ∼ means asymptotic equivalence. Our first result states the LAN property for the experiments (E T,∆ ) T >0 at every scale ∆ ∈ (0, ∞).
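In its standard textbook form (our transcription of the classical definition, cf. [9]), the LAN expansion referred to by (9), (10) and (11) reads:

```latex
\log \frac{\mathrm{d}P^{T,\Delta_T}_{\vartheta + I_{T,\Delta_T}(\vartheta)^{-1/2}\,v}}
          {\mathrm{d}P^{T,\Delta_T}_{\vartheta}}(X)
  = v\,\xi_T - \frac{v^2}{2} + r_T(v), \qquad v \in \mathbb{R},
```

with ξ T converging in distribution to a standard Gaussian under P T,∆T ϑ and r T (v) → 0 in P T,∆T ϑ -probability as T → ∞.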
Then the family (E T,∆T ) T >0 is regular and we have an explicit expression for I T,∆T (ϑ), where, for x ∈ R and ν ∈ N, I ν (x) = Σ m≥0 (x/2) 2m+ν /(m! Γ(m + ν + 1)) denotes the modified Bessel function of the first kind.
Remark 1. By taking ∆ T = ∆ ∞ ∈ (0, ∞) constant, we include the case of a fixed ∆; therefore the same regularity result holds for (E T,∆ ) T >0 . An inspection of the proof of Theorem 1 reveals that the mapping ∆ ↦ I T,∆ (ϑ) is smooth. Our next result shows that we can formally let ∆ ∞ → 0 in the expression of I T,∆∞ (ϑ) given by Theorem 1 in the microscopic case. Moreover, we obtain a simplified expression for the information rate.
Theorem 2 (The microscopic case). Assume ∆ T → 0 as T → ∞. Then the family (E T,∆T ) T >0 is regular and the information takes a simplified form. The macroscopic case is more involved: in that case, we cannot formally let ∆ ∞ → ∞ in the expression of I T,∆∞ (ϑ) given by Theorem 1. However, we have the following simplification.
Then the family (E T,∆T ) T >0 is regular and the information simplifies accordingly. The condition on T /∆ T is technical but quite stringent; it is satisfied, for example, if ∆ T = T β with 4/5 < β < 1, and it stems from our method of proof, see Section 4.3. It is presumably superfluous, but we do not know how to relax it.

The distortion of information across time scales
At each scale ∆ > 0, let us introduce the empirical quadratic variation estimator, which mimics the behaviour of the maximum likelihood estimator in both the macroscopic and microscopic regimes (see Section 1). More precisely, we have the following asymptotic normality result.
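In our reading, the estimator in question is the quadratic variation of the sampled path normalised by T; since the jumps are ±1, each squared increment has expectation ϑ∆, so the estimator is unbiased for ϑ at every scale. A minimal sketch (the simulation helpers and all names are ours, not the paper's):

```python
import random

def poisson(lam, rng):
    """Draw N ~ Poisson(lam) by counting unit-rate exponential arrivals."""
    t, k = rng.expovariate(1.0), 0
    while t <= lam:
        k += 1
        t += rng.expovariate(1.0)
    return k

def simulate(theta, delta, n, rng):
    """n i.i.d. increments, each a sum of Poisson(theta*delta) +/-1 jumps."""
    return [sum(rng.choice((-1, 1)) for _ in range(poisson(theta * delta, rng)))
            for _ in range(n)]

def qv_estimator(increments, delta):
    """theta_QV = (1/T) sum_i (X_{i delta} - X_{(i-1) delta})^2,
    with T = len(increments) * delta."""
    return sum(d * d for d in increments) / (len(increments) * delta)

rng = random.Random(7)
for delta in (0.01, 1.0, 25.0):   # microscopic / intermediate / macroscopic
    incs = simulate(theta=3.0, delta=delta, n=int(2000.0 / delta), rng=rng)
    print(delta, qv_estimator(incs, delta))
```

All three estimates concentrate around ϑ = 3, illustrating that the estimator is rate optimal at every scale; its variance, however, behaves differently across regimes, which is the subject of Theorems 4 and 5.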
where ξ T → N (0, 1) in distribution under P T,∆ ϑ , and I T,0 (ϑ) and I T,∞ (ϑ) are the Fisher informations of the microscopic and macroscopic experiments given in Theorems 2 and 3, respectively.
On a microscopic scale ∆ T → 0, and, on the contrary, on a macroscopic scale ∆ T → ∞ with T /∆ T → ∞, we obtain the two corresponding expansions. As a consequence, we readily see that ϑ QV T,∆T is asymptotically normal and that its asymptotic variance is equivalent to I T,0 (ϑ) −1 on a microscopic scale and to I T,∞ (ϑ) −1 on a macroscopic scale. At a heuristic level, this phenomenon can be explained directly by the form of the empirical quadratic variation estimator, as we already did in Section 1. At intermediate scales, however, this is no longer true.
Theorem 5 (Loss of efficiency in the intermediate regime). Assume that (12) holds. Then the lim inf of the ratio between the asymptotic variance of ϑ QV T,∆T and I T,∆T (ϑ) −1 is strictly greater than one, where I T,∆T (ϑ) is defined in Theorem 1.

Remark 2.
For technical reasons, we are unable to prove that Theorem 5 remains valid beyond the restriction lim sup ∆ T ≤ 1/(4ϑ). Numerical simulations suggest, however, that Theorem 5 is valid whenever lim sup ∆ T < ∞; see Figure 1.
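The distortion can also be probed numerically. For ±1 jumps, a cumulant computation gives Var((X ∆ ) 2 ) = ϑ∆ + 2(ϑ∆) 2 , so the asymptotic variance of ϑ QV T,∆ is ϑ(1 + 2ϑ∆)/T; dividing by the information bound yields a ratio that equals 1 exactly when the estimator is efficient. The sketch below is our own back-of-the-envelope check (not the paper's Figure 1), with a hand-rolled Bessel series and function names of our own, using f ∆ (ϑ, k) = e −ϑ∆ I |k| (ϑ∆) and the identity I ν ′ (x) = I ν+1 (x) + (ν/x)I ν (x):

```python
import math

def bessel_i(nu, x, terms=200):
    """I_nu(x) for integer nu >= 0, power series with recursive terms
    (avoids factorial overflow)."""
    term = (x / 2.0) ** nu / math.factorial(nu)
    total = term
    for m in range(1, terms):
        term *= (x / 2.0) ** 2 / (m * (m + nu))
        total += term
    return total

def score(theta, delta, k):
    """d/dtheta log f_delta(theta, k) for f = exp(-theta*delta) I_|k|(theta*delta)."""
    x = theta * delta
    h = bessel_i(abs(k) + 1, x) / bessel_i(abs(k), x)
    return delta * (h - 1.0) + abs(k) / theta

def fisher_info(theta, delta, kmax=80):
    """Fisher information of a single increment, E[score^2]."""
    x = theta * delta
    return sum(math.exp(-x) * bessel_i(abs(k), x) * score(theta, delta, k) ** 2
               for k in range(-kmax, kmax + 1))

def qv_variance_ratio(theta, delta):
    """Asymptotic variance of theta_QV, theta*(1 + 2*theta*delta) per unit time,
    divided by the information bound delta / I_delta(theta); 1 means efficient."""
    return theta * (1.0 + 2.0 * theta * delta) * fisher_info(theta, delta) / delta

for delta in (0.01, 0.25, 1.0, 4.0):
    print(delta, round(qv_variance_ratio(1.0, delta), 4))
```

The ratio is close to 1 for ϑ∆ small, rises visibly above 1 at intermediate values of ϑ∆, and comes back towards 1 as ϑ∆ grows, in line with the regimes described in Theorems 4 and 5.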
Consider the squared-error loss of the quadratic variation estimator. By Theorems 1, 2 and 3, the family (E T,∆T ) T >0 is regular in all regimes and we may apply the classical minimax lower bound of Hajek (see for instance Theorem 12.1 in [9]): we have (13) for any ϑ 0 ∈ Θ and δ > 0 such that the δ-neighbourhood of ϑ 0 lies in Θ. On the one hand, Theorem 4 suggests that the lower bound (13) can be achieved in the microscopic and macroscopic regimes; this is actually true, as the uniform integrability of ϑ QV T,∆T under P T,∆ ϑ , locally uniformly in ϑ, can easily be obtained (we leave the details to the reader). On the other hand, Theorem 5 shows that inequality (13) is strict in the intermediate case, whenever the restriction (12) is satisfied, thus revealing a loss of efficiency in this sense. Define the quantity ϕ(ϑ, ∆), where h ∆ (ϑ, k) is defined in Theorem 1. An inspection of the proof of Theorem 5 shows that ϕ(ϑ, ∆) = ψ(ϑ∆) for some univariate function ψ. The maximal loss of information is obtained, as T → ∞, at a particular value of ϑ∆; numerical simulations show that the maximum loss of efficiency is close to 23%. Since (E T,∆ ) T >0 is regular for every ∆ > 0, an asymptotically normal estimator with asymptotic variance equivalent to I T,∆ (ϑ) −1 is given by the maximum likelihood estimator. However, due to the absence of a closed form for the likelihood ratio, which involves the intricate function f ∆ (ϑ, k) defined in (8) (see also Section 4.1.1), it is easier to start from ϑ QV T,∆ , which is already rate optimal by Theorem 4, and to correct it by a classical one-step iteration based on the Newton-Raphson method; see for instance the textbook [15], pp. 71-75. To that end, define the one-step estimator ϑ OS T,∆ . Theorem 6. In all three regimes (microscopic, intermediate and macroscopic), ϑ OS T,∆T is efficient. Proof. In essence, the regularity of f ∆ (ϑ QV T,∆ , X i∆ − X (i−1)∆ ) enables us to apply Theorem 5.45 of van der Vaart [15].
Theorem 6 expresses the fact that ϑ OS T,∆T automatically adapts to I T,∆T and is therefore optimal across scales.
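A sketch of the one-step correction, under our reading of the scheme: starting from the rate-optimal ϑ QV, take a single Newton-Raphson step on the log-likelihood built from f ∆ (ϑ, k) = e −ϑ∆ I |k| (ϑ∆), differentiating via I ν ′ (x) = I ν+1 (x) + (ν/x)I ν (x). All names, and the choice of expected rather than observed information in the step, are ours, not the paper's:

```python
import math

def bessel_i(nu, x, terms=200):
    """I_nu(x), integer nu >= 0, by its power series; terms built
    recursively to avoid factorial overflow."""
    term = (x / 2.0) ** nu / math.factorial(nu)
    total = term
    for m in range(1, terms):
        term *= (x / 2.0) ** 2 / (m * (m + nu))
        total += term
    return total

def score(theta, delta, k):
    """d/dtheta log f_delta(theta, k): equals delta*(h - 1) + |k|/theta,
    with h = I_{|k|+1}(theta*delta) / I_|k|(theta*delta)."""
    x = theta * delta
    h = bessel_i(abs(k) + 1, x) / bessel_i(abs(k), x)
    return delta * (h - 1.0) + abs(k) / theta

def fisher_info(theta, delta, kmax=80):
    """Fisher information of one increment: E[score(theta, delta, X_delta)^2]."""
    x = theta * delta
    return sum(math.exp(-x) * bessel_i(abs(k), x) * score(theta, delta, k) ** 2
               for k in range(-kmax, kmax + 1))

def one_step(theta_qv, delta, increments):
    """One Newton-Raphson step from the rate-optimal starting value theta_qv."""
    n = len(increments)
    s = sum(score(theta_qv, delta, k) for k in increments)
    return theta_qv + s / (n * fisher_info(theta_qv, delta))

# sanity check: the score has mean zero under the model
theta, delta = 2.0, 1.0
x = theta * delta
mean_score = sum(math.exp(-x) * bessel_i(abs(k), x) * score(theta, delta, k)
                 for k in range(-80, 81))
print(abs(mean_score) < 1e-8, fisher_info(theta, delta) > 0.0)
```

The zero-mean property of the score and the positivity of the information are the two ingredients that make the one-step update well defined.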

Discussion
The compound Poisson process (X t ) with symmetric Bernoulli jumps defined in (1) is the simplest model of a continuous-time symmetric random walk on a lattice that diffuses to a Brownian motion on a macroscopic scale. The intensity ϑ of the Poisson arrivals on a microscopic scale is transferred into the variance ϑ of the Brownian motion on a macroscopic scale: in distribution as T → ∞, where (W t ) is a standard Brownian motion (see the Appendix). The statistical inference programme we have developed across time scales on the toy model given by (X t ) can be useful in several applied fields. For instance, in financial econometrics, (X t ) may be viewed as a toy model for a price process (last traded price, mid-price or best bid/ask price) observed at the level of the order book, see e.g. [2] or [10]. The parameter ϑ can be interpreted as a trading intensity on microscopic scales that transfers into a macroscopic volatility in the diffusive regime. Our results convey the message that if a practitioner samples (X t ) at high frequency at the same rate as the price changes, which is customary in practice, then the realised volatility estimator ϑ QV T,∆T is not efficient, and a modified estimator such as ϑ OS T,∆T should be used instead. However, this framework is a bit too simple and needs to be generalised in order to be more realistic in practice. Two directions can be explored in a relatively straightforward manner: i) The extension to a Poisson process with non-homogeneous intensity.
ii) The extension to an arbitrary compound law on a discrete lattice.
Extension to the non-homogeneous case. Theorems 1, 2 and 3 extend to the non-homogeneous case, in which the intensity of the jumps is allowed to depend on time. In this setting, the counting process (N t ) defined in (1) is defined on [0, T ] and has intensity λ(ϑ, t), where λ(ϑ, ·) is a nonvanishing (integrable) intensity function, so that the compensated process is a martingale. The homogeneous case is recovered by setting λ(ϑ, t) = ϑ for every t ∈ [0, 1]. In this context, the macroscopic approximation (15) becomes a non-homogeneous analogue, in distribution as T → ∞. We state, without proof, an extension of Theorems 1, 2 and 3 for the associated family of experiments (E T,∆T ) T >0 across scales.
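A non-homogeneous Poisson process is easy to simulate by Lewis-Shedler thinning against a constant dominating rate. The sketch below uses an intensity function that is purely illustrative and not taken from the paper; all names are ours:

```python
import math
import random

def sample_jump_times(theta, T, rate, rate_max, rng):
    """Jump times of a non-homogeneous Poisson process on [0, T] with
    intensity t -> rate(theta, t), by Lewis-Shedler thinning against
    the constant dominating rate rate_max (must satisfy rate <= rate_max)."""
    times, t = [], 0.0
    while True:
        t += rng.expovariate(rate_max)      # candidate arrival at rate rate_max
        if t > T:
            return times
        if rng.random() * rate_max <= rate(theta, t):   # accept with prob rate/rate_max
            times.append(t)

# illustrative intensity (our choice, not from the paper): theta * (1 + 0.5 sin t)
rate = lambda th, t: th * (1.0 + 0.5 * math.sin(t))

rng = random.Random(3)
jumps = sample_jump_times(theta=2.0, T=1000.0, rate=rate, rate_max=3.0, rng=rng)
# the expected number of jumps is the integrated intensity, about 2 * 1000 here
print(len(jumps))
```

Attaching an independent ±1 mark to each accepted jump time then produces a sample path of the non-homogeneous model discussed above.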
Theorem 7. Theorems 1, 2 and 3 hold with the following generalisations: 1. In the microscopic case ∆ T → 0,

2. In the intermediate regime
3. In the macroscopic case ∆ T → ∞ with T /∆ T → ∞ and the additional rate condition on T /∆ T . The proof of Theorem 7 relies on an approximation in which r T → 0 as T → ∞ in all three regimes. Assumption 1 ensures that the convergence of the remainder is uniform in i and ϑ. This reduction enables us to transfer the proof of Theorems 1, 2 and 3, substituting independent identically distributed random variables with independent but non-identically distributed ones. This is not essentially more difficult, and the regularity of λ enables us to piece together the local information given by each increment X i∆T − X (i−1)∆T in order to obtain the formulae of Theorem 7.
An analogous programme to that of Section 2.3 for the distortion of information could presumably be carried out, with appropriate modifications. For instance, one can show convergence in probability in all three regimes. Then, in order to estimate ϑ efficiently, one should rather consider a contrast estimator maximising a criterion U T,∆T built from a suitable function g T,∆T , and make further assumptions on the existence of a unique maximum for the limit, whenever it exists, of U T,∆T under P T,∆T ϑ as T → ∞. We do not pursue this here.

Extension to more general compound laws
The situation is more delicate when one tries to generalise Theorems 1, 2 and 3 to an arbitrary compound law (ζ(ϑ, k), k ∈ Z), for every ϑ ∈ Θ, with 0 ≤ ζ(ϑ, k) ≤ 1 for k ∈ Z, Σ k∈Z ζ(ϑ, k) = 1, and ζ(ϑ, 0) = 0 for obvious identifiability reasons. We then observe a process (X t ) of the form (1), except that the jumps (ε i ) are now distributed according to ζ(ϑ, ·). To parallel the preceding case, we normalise the compound law by imposing Σ k∈Z k ζ(ϑ, k) = 0 and Σ k∈Z k 2 ζ(ϑ, k) = 1.
First, in the microscopic case, we approximately observe over the period [0, T ] a random number of jumps, namely N T , which is of order ϑT . Second, conditionally on N T , the jump sizes form a sequence of independent and identically distributed random variables with law ζ(ϑ, ·). On the other hand, in the macroscopic limit, the effect of the jump sizes is tracked only through their second moment, which is normalised to 1 by (16); it therefore gives no additional information about ϑ. The situation is rather different from the case of symmetric Bernoulli jumps: here, the extra information about ϑ lies in the effect of the jumps, which is recovered in the microscopic regime and lost in the macroscopic one. There is, however, one way to reconcile this with our initial setting: assume that the compound law ζ(k) does not depend on ϑ and, for simplicity, is known. Then, for k ∈ Z, we have a representation in which ζ ⋆m (k) is the probability that a random walk with step law ζ(k) started at 0 reaches k in exactly m steps. In the symmetric Bernoulli case, we have G k (x) = I |k| (x), where I ν (x) is the modified Bessel function of the first kind. In view of the proof of Theorems 1, 2 and 3, analogous results could presumably be obtained for an arbitrary compound law ζ(k) satisfying (16), provided accurate asymptotic expansions of G k (x) are available in the vicinity of 0 and ∞. The same subsequent results about the distortion of information developed in Section 2.3 would presumably follow, with the same estimators ϑ QV T,∆T and ϑ OS T,∆T , and the appropriate changes for f ∆ (ϑ, k) in (14).

Relation to other works
Concerning the estimation of the law of the jumps, say ζ, we face an inverse problem: one tries to recover ζ from observations of a compound Poisson process, the link between ζ and the law of the process being given by (17). In the setting of positive compound laws, Buchmann and Grübel [5, 6] succeed in inverting that relation and give an estimator of ζ in both the discrete and the continuous case. This method, which consists in inverting (17), is called decompounding. It was generalised by Bøgsted and Pitts [4] to renewal reward processes when the law of the holding times is known, in restriction to the case of positive jumps only.
The compound Poisson process is a pure jump Lévy process and can be studied accordingly. Using the Lévy-Khintchine formula, it is possible to estimate nonparametrically its Lévy measure, which is given by the product ϑ × ζ in that case. This strategy is exploited by van Es et al. [16] for a known intensity. This estimation procedure is not restricted to compound Poisson processes and includes pure jump Lévy processes in general. Nonparametric estimation of the Lévy measure from high-frequency data (corresponding to our microscopic case ∆ T → 0) is thoroughly studied in Comte and Genon-Catalot [7], as well as in the intermediate regime (with ∆ T = ∆ ∞ fixed) in [8]. In the latter case, we also have the results of Neumann and Reiß [13].

Some estimates for f ∆ (ϑ, k)
We have, for k ∈ Z, f ∆ (ϑ, k) = Σ m≥0 e −ϑ∆ (ϑ∆) m /m! φ m (k), where φ m (k) is the probability that a symmetric random walk in Z started from 0 has value k after exactly m steps: φ m (k) = 2 −m C(m, (m + k)/2) if m + k is even and |k| ≤ m, and φ m (k) = 0 otherwise, with C(m, j) the binomial coefficient.
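The identity behind this representation can be checked numerically: mixing the random walk probabilities φ m (k) with Poisson(ϑ∆) weights reproduces e −ϑ∆ I |k| (ϑ∆). A self-contained sketch (all names are ours; the Bessel series is computed recursively):

```python
import math

def phi(m, k):
    """phi_m(k): probability that a symmetric +/-1 walk started at 0
    sits at k after exactly m steps."""
    if (m + k) % 2 != 0 or abs(k) > m:
        return 0.0
    return math.comb(m, (m + k) // 2) / 2.0 ** m

def f_delta(theta, delta, k, terms=80):
    """Poisson mixture: sum_m e^{-theta*delta} (theta*delta)^m / m! * phi_m(k)."""
    lam = theta * delta
    return sum(math.exp(-lam) * lam ** m / math.factorial(m) * phi(m, k)
               for m in range(terms))

def bessel_i(nu, x, terms=80):
    """I_nu(x) for integer nu >= 0, power series with recursive terms."""
    term = (x / 2.0) ** nu / math.factorial(nu)
    total = term
    for m in range(1, terms):
        term *= (x / 2.0) ** 2 / (m * (m + nu))
        total += term
    return total

theta, delta = 2.0, 1.5
for k in range(-6, 7):
    assert abs(f_delta(theta, delta, k)
               - math.exp(-theta * delta) * bessel_i(abs(k), theta * delta)) < 1e-12
print("f_delta(theta, k) matches exp(-theta*delta) * I_|k|(theta*delta)")
```

The agreement to floating-point precision is exactly the closed form f ∆ (ϑ, k) = e −ϑ∆ I |k| (ϑ∆) used throughout the proofs.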

Let us introduce the modified Bessel function of the first kind, I ν (x) = Σ m≥0 (x/2) 2m+ν /(m! Γ(m + ν + 1)) for every x, ν ∈ R, where Γ denotes the Gamma function; the function x ↦ I ν (x) can equivalently be defined as a solution of the modified Bessel differential equation x 2 y ′′ + xy ′ − (x 2 + ν 2 )y = 0. Straightforward computations show that f ∆ (ϑ, k) = e −ϑ∆ I |k| (ϑ∆); see for instance [14], p. 21, Example 4.7. We gather below some technically useful properties of the function I ν (x) that we will use repeatedly in the sequel.
1. For every x ∈ R \ {0} and ν ∈ R, we have (19). 2. For every µ > ν > −1/2 and x > 0, we have (20). Proof. Property 1 can be found in the textbook of Watson [17] and readily follows from the fact that x ↦ I ν (x) is analytic with an infinite radius of convergence. Property 2 is less obvious and follows from Nasell [12].

The Fisher information of E T,∆
For i = 1, . . . , ⌊T ∆ −1 ⌋, let E T,∆ i denote the experiment generated by the observation of the increment X i∆ − X (i−1)∆ . Since (X t ) has independent stationary increments, we have, for k ∈ Z, the increment law (18). Using that X 0 = 0, it follows that E T,∆ factorises as a product of independent observations given by the increments X i∆ − X (i−1)∆ , each experiment E T,∆ i being dominated by the counting measure on Z with density f ∆ (ϑ, k) given by (18), which does not depend on i. Moreover, E T,∆ i has (possibly infinite) Fisher information I ∆ (ϑ), which does not depend on i either. We study the regularity of E T,∆ in the classical sense of Ibragimov and Hasminskii (see [9], p. 65): i) The mapping ϑ ↦ f ∆ (ϑ, k) is continuous on Θ for every k ∈ Z. ii) The Fisher information is finite: I ∆ (ϑ) < +∞ for every ϑ ∈ Θ.

Lemma 2.
The experiments E T,∆ i are regular.
Proof. For every k ∈ Z, f ∆ (ϑ, k) = exp(−ϑ∆)I |k| (ϑ∆), therefore i) is readily satisfied since ϑ ∈ Θ ⊂ (0, ∞) and ∆ > 0. We also have f ∆ (ϑ, k) > 0 for every k ∈ Z, so I ∆ (ϑ) is well defined, though possibly infinite. In order to prove ii), we write the score explicitly, where we have set, for every k ∈ Z, h ∆ (ϑ, k) = I |k|+1 (ϑ∆)/I |k| (ϑ∆), and used Property (19). Moreover, the mapping ν ↦ I ν (ϑ∆) is decreasing (see (20)); thus, and since X ∆ has moments of all orders under P T,∆ ϑ , we obtain ii). We proceed similarly for iii). First, for any ϑ ∈ Θ and ε such that ϑ + ε ∈ Θ, the corresponding experiments are equivalent. Moreover, E T,∆T and Q T,∆T live on the same state space R ⌊T ∆ −1 T ⌋ and have smooth densities with respect to the Lebesgue measure. The proof of Theorem 3 is therefore implied by a total variation bound, locally uniformly in ϑ, where ∥ · ∥ T V denotes the total variation norm. This bound is implied in turn by a bound on a single increment, locally uniformly in ϑ, since each experiment is the ⌊T ∆ −1 T ⌋-fold product of independent and identically distributed random variables. Set η = η T = κ ∆ T log(T /∆ T ). We claim that for κ 2 > 2ϑ, the terms I, II and III are o((T /∆ T ) −1 ), hence (36) and the result, for an appropriate choice of κ so that the convergence holds locally uniformly in ϑ. Since q ϑ,∆T (x) is the density of the Gaussian law N (0, ϑ∆ T ), we readily obtain the bound for the first term using κ 2 > 2ϑ. For the term II, we observe that since |U 1 | ≤ 1/2, we have a representation where the ε i ∈ {−1, 1} are independent and symmetric. By Hoeffding's inequality, this term is further bounded for every κ ′ > 0. If κ ′ < κ 2 /2, one readily checks that it has the required order. Moreover, if κ ′ > ϑ, we have, by Chernoff's inequality, that this term is also o((T /∆ T ) −1 ). Thus II and III have the right order and it remains to bound the main term I. By the Plancherel equality we obtain an explicit expression, valid for any ρ ≥ 0.
By a first-order expansion, we have that IV is bounded, for some bounded function ξ ↦ α(ξ). Set α ⋆ = sup x |α(x)|. We thus obtain that IV is less than a constant times