Semi-parametric estimation of shifts

We observe a large number of functions differing from each other only by a translation parameter. While the main pattern is unknown, we propose to estimate the shift parameters using $M$-estimators. Fourier transform enables to transform this statistical problem into a semi-parametric framework. We study the convergence of the estimator and provide its asymptotic behavior. Moreover, we use the method in the applied case of velocity curve forecasting.


Introduction
A main issue in data mining is the feature extraction of a large set of curves. Indeed, classification methods enable to split the data into different homogeneous groups, each representing a specific mass behavior. But, within one group, the observations differ slightly the one from another. Such variations take into account the variability of the individuals inside one group. More precisely, there is a mean pattern such that, each observation curve is warped from this archetype by a warping function, see for examples [18].
In this work, we focus on the particular case where the individuals usually experience similar events, which are explained by a common pattern, but the starting time of the event occurs sooner or later. Classification methods, like repeated measures ANOVA or Principal Components Analysis of curves, see for instance [17], ignore this type of variability. Hence, computing a representative curve, for each group, severely distorts the analysis of the data. Indeed, the average curve (usually the mean or the median) oversmooths the studied phenomenon, and is not a good description of reality.
In our work, we restrict ourselves to the case where all the curves can be deduced the one from another by a shift parameter. Hence, we consider the following model: for j = 1, . . . , J and i = 1, . . . , n j , we observe where, J stands for the number of curves C j , while n j is the number of observations for the j-th individual. Values t ij are observation times, which are assumed to be known. The unobserved warping effects θ * j , j = 1, . . . , J, are shift parameters which translate the unknown function f . We also choose t ij = t i and n j = n, which means that all curves are observed at the same time with the same occurrence. The errors ε ij for (i, j) ∈ {1, . . . , n} × {1, . . . , J} are i.i.d. with distribution N (0, 1). Moreover, without loss of generality, we assume in the following that σ = 1 (see Remark 3.2). We aim at estimating the shift parameters θ * j , j = 1, . . . , J, in order to find a good representative of the feature f . A more general problem has been tackled in the literature and some work has been done to find a representative of a large sample of close enough functions f j , j = 1, . . . , J, see for examples [18]. Indeed, in a general case, we observe realizations y ij , j = 1, . . . , J, i = 1, . . . , n j , from model where, ε ij , j = 1, . . . , J, i = 1, . . . , n j , are i.i.d. random variables, representing the observation noise. Hence, such functions f j , j = 1, . . . , J, are close from each other in the sense that there exists an unknown archetype f and unknown warping functions h j , j = 1, . . . , J, such that, for all j = 1, . . . , J, Examples of such data might be growth curves, longitudinal data in medicine, speech signals, traffic data or expenditure curves for some goods in the econometric domain. Our main motivation in this paper is the analysis of the vehicle speed evolution on a motorway. The data are curves, describing the evolution, on observation cells, of the daily vehicle speed. After performing classification procedures (see for instance [14] for a complete study), we obtain clusters of functions, each one representing a typical common behavior. Indeed, all the curves can be deduced one from another by a shift parameter. This kind of issue led several statisticians to apply transformations to functions in order to get rid of the shifts and to align the curves. If a parametric model would be available a priori, the analysis would be made easier. But, if the data are numerous, there is not generally enough knowledge to build such a model. Thus, they turn into a non parametric framework. When the pattern is known, the problem turns to align a noisy observation with a fixed feature. Piccioni, Scarlatti and Trouvé in [15], Kneip, Li, MacGibbon and Ramsay in [11], or Ramsay and Li in [16] proposed curve registration methods. Their main idea is to align each curve on a target curve f 0 , which means finding, for all j ∈ {1, . . . , J}, the warping function h j minimizing where h j belongs to a particular smooth monotone family defined by the solution of the differential equation D 2 h j = w j Dh j . Hence, w j is simply D 2 h j /Dh j , the relative curvature of h j . Thus, penalizing w j yields both smoothness and monotonicity of h j (see [16] for more details). The main drawback of such methods is that they assume that the archetype f 0 is known, which is a reasonable assumption in pattern recognition, but which is unrealistic when the observed phenomenon is not well known as in our study. Alternatively, in a non parametric point of view, the pattern is replaced by its estimate. In this case, the issue is a matter of synchronizing sample curves. Wang and Gasser in [21] use kernel estimators. In another work, Gasser and Kneip, in [5], align the curves by aligning the local extrema of the functions, which are estimated as zeroes of the non parametric estimate of the derivative. In all cases, the issue of estimating the shifts is blurred by the estimation of the curves, which leads to non parametric rates of convergence. Hence, it seems natural to study our regression problem (1) in a semi-parametric framework: the shifts are the parameters of interest to be estimated, while the pattern stands for an unknown nuisance functional parameter. A very general semi-parametric regression model called Self-modelling regression (SEMOR) has been considered in [10]. The model is f j (·) = f (·, θ * j ), j ∈ {1, . . . , J}, and a general backfitting algorithm is studied. Roughly speaking, after initializing an estimate of f by a first guess (using for example a kernel method), this algorithm is based on two recursive steps. In the first step, the estimation of θ * j , j = 1, . . . , J, is performed. In the second step, the estimate of f is updated. In both steps, estimations are performed using a least squares criterion. In [10] a complete study, including the asymptotic normality of the estimates, is performed for the Shape-invariant model (SIM) introduced in [12]. See also [13], [8] and [9] for related works. Actually, the model studied in our paper (regression model (1)) falls in the SIM frame, so that, the methods studied in [10] may be applied. Nevertheless, the estimation procedure developed here is new, structurally simpler and computationaly easier to implement than the complicated backfitting algorithms.
The difficulty of the work is that the estimation must not rely on the pattern, even if the quantities are deeply linked. That is the reason why we will use an M -estimator built on the Fourier series of the data. Under identifiability assumptions, we provide a consistent method (Theorem 2.1) to estimate at the parametric rate of convergence the shifts θ * j , j = 1, . . . , J, when f is unknown, and we show that fluctuations of the estimates are asymptotically Gaussian (Theorem 3.1). Further, our estimation method leads to a fast algorithm to align shifted curves without any prior assumption on the feature, due to semiparametric techniques. We point out that this study can be linked first with the study of Golubev in [7], dealing with the semi-parametric efficiency in the estimation of shifts in a continuous observation scheme, and also with the study of Gassiat and Lévy-Leduc in [6], dealing with the estimation of the periodicity of a signal. Further, the mixed effects model (1) with random shifts is studied in [1] (see also [2]). We outline the fact that the method we propose handles a large variety of curves with minimal smoothness properties, namely we only require L 1 conditions. The present paper falls into six parts. Section 2 is devoted to the definition of the model and to the description of the estimation method. In Section 3, we provide asymptotic properties of the estimators. As a matter of fact, we show that the estimators are convergent and asymptotically Gaussian. The estimating method is effectively performed in Section 4, on some simulated data, and then used to analyze road traffic data. We compare our results to another existing method. The technical lemmas and the proofs are gathered in Section 5 and Section 6.

Model
For the j-th curve (j = 1, . . . , J), we get n observations y ij , i = 1, . . . , n, measured at equispaced times t i = i−1 n T ∈ [0, T [, with T ∈ R * + . We model these observations in the following way: where, f : R → R is an unknown T -periodic function, θ * = (θ * 1 , . . . , θ * J ) ∈ R J is an unknown shift parameter, θ * j is the shift of the j-th curve, and, ε ij , i = 1, . . . , n, is a Gaussian white noise, with variance 1. For sake of simplicity, we consider an unitary variance, but all our results are still valid for a general variance.
Our aim is to estimate the translation factors (θ * j ) without the knowledge of the pattern f . Due to the special structure of the model, Fourier analysis is well suited to conduct such a study, since the Fourier basis diagonalizes any translation. Then, using a Discrete Fourier Transform we may transform the model (3) into the following one (supposing n is odd): where, c l (f ) = 1 n n m=1 f (t m )e −i2π ml n , l = −(n − 1)/2, . . . , (n − 1)/2, are the discrete Fourier coefficients and α * j = 2π T θ * j ∈ R, j = 1, . . . , J, are the phase factors, and, for all j ∈ {1, . . . , J}, w jl , l = −(n − 1)/2, . . . , (n − 1)/2, is a complex Gaussian white noise, with complex variance 1/n, and with independent real and imaginary parts. As previously, our goal is to estimate the phase factors α * j , j = 1, . . . , J, without the knowledge of the Fourier coefficients of function f . Stricto sensu, the discrete Fourier coefficients are not the Fourier coefficients of the functions, but the bias induced is similar to the bias induced by any discretization in regression, which vanishes under some regularity assumptions, as shown in [4]. Hence, from now, we will consider the model (4) T dt. Observe that in this last equation we only have to assume that f is integrable. Hence, if we only consider the discretized version given in Model 2.2, only a minimal smoothness conditions (f ∈ L 1 (R)) is necessary, contrary to other methods in statistics, which require stronger conditions.
We point out that we are facing a semi-parametric model. As a matter of fact, we aim at estimating the parameter α * = (α * 1 , · · · , α * J ) which depends on an unknown nuisance functional parameter (c l (f )) l∈Z , the Fourier coefficients of the unknown function f .

Identifiability
We notice that the model (4) is not identifiable for all translation parameters. Indeed, replacing α * by and replacing f (·) by f (· − c), let invariant the equation (4). So, in order to ensure identifiability of the model, we restrict the parameter space A: In this paper, we will mainly consider, in the Theorem 3.1, the parameter set (5) the constant c must be equal to 0.
Hence, we get J different solutions in [−π, π[ J ⊂ R J , and a unique solution with the additional condition α 1 ∈ 0, 2π J .

Estimation
Since we want to estimate the shifts without prior knowledge of the function f , we will consider a semi-parametric method, relying on an M -estimation procedure. Hence, the functional parameter is a nuisance parameter that does not play a role in the rate of converge of the estimates of the parameters, regardless of smoothness conditions for f . For this, define, for any α = (α 1 , . . . , α J ) ∈ A, the rephased coefficients c jl (α) = e ilαj d jl , j = 1, . . . , J, l = −(n − 1)/2, . . . , (n − 1)/2, and the mean of these rephased Fourier coefficientŝ Hence, |c jl (α) −ĉ l (α)| 2 should be small when α is close to α * . Now, consider a bounded measure µ on [0, T ] and set Obviously, the sequence (δ l ) is bounded. Without loss of generality we will assume that δ 0 = 0. Assume further that l |δ l | 2 |c l (f )| 2 < +∞. So that f * µ is a well defined square integrable function: Consider the following empirical contrast function: In the sequel, we will always assume that: The set {l : δ l c l = 0} contains at least two integers which are coprime. (8) The random function M n is non negative. Furthermore, its minimum value should be reached close to the true parameter α * .Then, the following theorem provides the consistency of the M -estimator, defined bŷ α n = arg min α∈A M n (α).
Theorem 2.1 Under the following assumptions on f and on the weight sequence (δ l ) l∈Z : we have thatα n We point out that we only assume that f * µ and f * µ * µ are square integrable and yet are able to build estimates of the shifts in the model (4). The computation of the estimator is quick since only a Fast Fourier Transform algorithm and a minimization algorithm of a quadratic functional are needed.
Proof 2.2 (Proof of Theorem 2.1) The proof of this theorem follows the classical guidelines of the convergence of M -estimators (see for example [19]). Indeed, the contrast is split into two parts, a determinist and a random one. Then, it suffices to show that the following conditions hold for the criterion function to ensure consistency ofα n .
i) Convergence to a contrast function: where K(·) has a unique minimum at α * . ii) Set the modulus of uniformly continuity W , defined by There exists two sequences (η k ) k∈N and (ǫ k ) k∈N , decreasing to zero, such that for a large enough k, we have These two conditions are fulfilled, as it is proved in Section 6. Notice that we chose to privilege the uniform convergence of the modulus of continuity of the contrast and not the uniform convergence of the criterion itself. Nevertheless, the two proofs use the same kind of arguments, i.e proving the uniform convergence of empirical processes of Gaussian variables.

Asymptotic normality
In this section, we prove that the estimator built in the previous section is asymptotically Gaussian, and we give its asymptotic covariance matrix. In general, the asymptotic covariance matrix hardly depends on the geometric structure of the parameter space A. So, for sake of simplicity, we study the asymptotic normality for the parameter space A 1 . Hence, the parameter space has dimension J − 1, and we rewrite this set asÃ 1 = [−π, π[ J−1 and, any element inÃ 1 as α = (α 2 , . . . , α J ). Also, for sake of simplicity, in this section and in the proofs of Theorem 3.1, we will write M n (α) instead of M n (0, α 2 , . . . , α J ). So, we consider any estimator defined byα n = arg min α∈Ã1 M n (α).
Theorem 3.1 Under the following assumptions (δ l ) l∈Z : we get that with , where, I J−1 is the identity matrix of dimension J − 1, and U J−1 is the square matrix of dimension J − 1 whose all entries are equal to one. (3) has a variance equal to σ 2 , then the limit distribution in the previous theorem has a covariance matrix equal to σ 2 Γ. where ∇ is the gradient operator. A second order decomposition leads to: there existsᾱ n in a neighborhood of α * such that
Observe that the extra terms (δ l ) l∈Z used in the definition (7) smooth the criterion function M n (α). Indeed, without this term, i.e under the choice δ l = 1, the random part of the derivative of the criterion function does not converge towards a determinist function but to a random process, which prevents the study of the asymptotic distribution. The weights enable to get rid of this part, smoothing the contrast to zero.
Moreover, the convergence of the criterion is speeded up by using these smoothing weights. We illustrate this purpose on Section 4 by comparing a weighted criterion with a non weighted one (see Figure 3). Moreover, this result will be highlighted in the proof of Theorem 3.1.

Practical choice of the δ l 's
The problem of choosing the weights (δ l ) l∈Z in the definition of the criterion function is important. If we work with L 2 functions, the assumption (12) is satisfied for example as soon as |δ l | = O |l| −2−ν , for some ν > 0. In the simulations our functions are much more regular. Hence, we have taken δ l = 1/|l| 1.3 . This choice guarantees consistency and good numerical results. Moreover, in order to illustrate the importance of the weight sequence, we have also taken δ l ≡ 1 and δ l = 1/|l| 2 in Figure 3. Indeed, when looking at the asymptotic variance, we can see that there is a trade-off which leads to a lower bound for the smoothing sequence, the smaller the weights, the larger the variance. Since the function f is unknown and so the sequence does not depend on the Fourier coefficients, hence the optimal choice for (δ l ) l∈Z should be given by semi-parametric efficiency. Using Cauchy-Schwarz's inequality, we get that This case, corresponding to the least favorable case in the semi-parametric efficiency framework, is obtained for the optimal choice of coefficients δ l = 1. If an asymptotic fluctuation results would hold for this sequence, we would obtain: Nevertheless, for the choice of the weight sequence δ l = 1 the asymptotic normality result does not hold. Non optimality as regards asymptotic efficiency is the price to pay both to deal with a discretized version of the regression model and to handle simultaneous estimation for all the unknown functions.
Maybe, a different way of estimation could get rid of this drawback. Yet, another choice could have been done to smooth the contrast by restricting the number of Fourier coefficients, as it is done in [7] for example. Some links could also be established between the estimator we consider and a Bayesian penalized maximum likelihood estimator, where the weights (δ l ) l∈Z stand for a particular choice of a prior over the unknown function f . This Bayesian point of view is tackled in [3]. However, the optimal choice of the smoothing parameter to obtain efficiency is a very interesting issue in the semi-parametric framework (i.e when the weights are not allowed to depend on the Fourier coefficients of the functions). Quite posterior to the first submission of this work, this problem has been solved in a [20]. (3) is Gaussian. Nevertheless, we could get rid of this assumption with moment conditions on the errors.

Applications and simulations
In this section, we present some numerical applications of the method. The first one gives results on simulated data. The second one is based on an experiment on human fingers force. The last one is carried out with traffic data. The optimization algorithm used in any resolution is based on a Krylov method (the conjugate gradient method). Indeed, minimizing an L 2 criterion function with a conjugate gradient algorithm yields a reduced step number, and hence, a small complexity.

Simulated data
Simulated data are carried out as follows:  Figure 1. The target function f is considered as a 2π-periodic function (T = 2π), hence α * = θ * . The function f is plotted by a solid line in Figure 1 (d). Figure 1 (a) shows simulated data y ij , j = 1, . . . , J, i = 1, . . . , n. The cross-sectional mean curve of these data is given on Figure 1 (d) by the dotted line. We can see that this mean function is representativeness of data. Indeed, the amplitude of higher optimum is reduced, and smallest ones disappeared. Figure 1 (c) shows curves unshifted by the estimated parameters. The mean function of these unshifted curves is given on Figure 1 (d) by dashed line. Figure 1 (b) plots θ * j on abscissa axis againstθ j on ordinate axis, j = 1, . . . , J. Estimations are very close to true parameters. Comparison between mean curves, before and after the shift estimation, is straightforward. We now compare our estimations with those obtained with an existing method: curve registration by landmarks. This method aims at aligning curves by, first, estimating landmarks of curves (here, the maximum) and by, secondly, aligning these landmarks. For more details on this procedure, see [5]. In Figure 2, we show the results on our simulated data. These results are not as good as those we obtain with our method. It can be explained by the fact that we need first to estimate each curve maximum by a non parametric method (a kernel estimation), which leads to estimation errors. Moreover, our method uses all information given by the data, not only that given by landmarks.
In order to illustrate the importance of the weight sequence, we compare now the criterion function for various values of (δ l ) l∈Z . For this purpose, simulated data sets are carried out, with J = 2, θ * 1 = 0 and θ * 2 = π/3. Figure 3 shows the obtained results. The first column of this figure presents these simulated data sets, with respectively, σ = 1 in Figure 3 (a,1), σ = 3 in Figure 3 (a,2), σ = 5 in Figure 3 (a,3) and σ = 7 in Figure 3 (a,4). The second column presents the unweighted criterion functions, i.e with δ l ≡ 1, associated respectively with (a,1), (a,2), (a,3) and (a,4). The third and fourth columns present the associated weighted criterion functions M n (·) with, respectively, δ l = 1/|l| 1.3 and δ l = 1/|l| 2 . In all these figures, the vertical dashed line represents the value π/3 where the minimum value of our criterion function is achieved. It clearly appears that without the weight sequence, i.e with δ l ≡ 1, the criterion function converges to a random process. Moreover, the variance of this random process is proportional to the noise variance σ 2 . We also see that even with an important noise variance, as in Figure 3 (a,4), our weighted criterion functions, (c,4) and (d,4), are smooth, with a unique minimum around π/3. This shows that our procedure is quite robust to the SNR. Moreover, it appears that the impact of the exponent β > 1.25 of the weight sequence δ l = 1/|l| β is only on the amplitude of the Mfunction. These numerical results emphasize the fact that the weight sequence (δ l ) l∈Z is important, but that its value can be easily chosen.

Pinch force data
Data presented here are extracted from an experiment described in [16] with a Curve Registration methodology. Data represent the force exerted by the thumb and forefinger on a force meter during 20 brief pinches. These 20 force measurements having arbitrary beginning, Ramsay and Li in [16] begin their study by a landmark alignment of curve maxima (with single shifts). These aligned data are shown in Figure 4 (a).
Our purpose is to study these data with our shift estimation methodology. Shift estimations and unshifted curves are respectively shown in Figure 4 Figure 4 (b), we only show a boxplot of the estimated parameters because, obviously, we do not know the real parameters. We note that shift parameters are almost all close to zero, between −10 −3 and 3 × 10 −3 . In this case, landmark alignment unshift quite well the data. Nevertheless, comparing Figure 4 (a) and Figure 4 (c), we can see that curves are slightly better aligned after our shift estimation methodology. In Figure 4 (d), the cross-sectional mean curves of unshifted curves (solid line) and of primary curves (dotted line) are almost the same ones.

Application to road traffic forecasting
Most of the Parisian road traffic network is equipped with a traffic road measurement infrastructure. The main elements of this infrastructure are counting stations. These sensors are situated approximately every 500 meters on main trunk roads (motorways and speedways principally). Every counting station measures, daily, the average speed of vehicle flow on 6 minutes periods. We consider measurements from 5 AM to 11 PM, hence, the length of the daily measurement is 180. We note y ij the speed measurement of day j ∈ {1, . . . , J} and of period i ∈ {1, . . . , n}, with n = 180. Our purpose is to improve, with our shift estimation methodology, an existing forecasting methodology. This forecasting methodology is described in [14]. This  procedure is based on a classification method. We dispose of a sample of J speed curves and we want to summarize it by a small number N of standard profiles, representatives of each cluster. Consider several clusters of J speed curves. Indeed, we note frequently that many subgroups are composed by curves describing the same behavior. For example, we observe a speed curve subgroup with a same traffic jam or speed reduction, but with different starting times for each curve. Thus, Figure 5 (a) represent a particular cluster on a particular counting station. Figure 5 (b) is a boxplot of the estimated shiftsθ j , j = 1, . . . , J. Unshifted curves are plotted on Figure 5 (c). So, in this homogeneous cluster, where only a shift phenomenon appear, the mean curves in Figure 5 (d) of unshifted curves (solid line) and of primary curves (dotted line) aren't the same. The shift estimated mean is clearly more representative of individual pattern.

Technical Lemmas
The two following propositions, Proposition 5.1 and Proposition 5.2, are used in the proof of asymptotic normality (Theorem 3.1). Their proofs are postponed to the appendix.

Appendix
Let z be a complex number andz its conjugate. We write Re(z) = 1 2 (z +z) (the real part of z) and Im(z) = 1 2i (z −z) (the imaginary part of z). Proof 6.1 (Proof of Theorem 2.1) In the sequel we assume without loss of generality that all the α * j are equal to 0. Consider the following notation: for all j = 1, . . . , J, for all l = −(n−1)/2, . . . , (n−1)/2, w jl = 1 √ n ξ jl = 1 √ n ξ x jl + iξ y jl . Here, ξ x jl and ξ y jl are independent Gaussian sequences, with law N n (0, I n ). Also, set Using the following decompositions lead to the following expression of the criterion function M n (α) We have split the criterion function into four different terms: (22), (23), (24), and (25). We aim at proving the convergence of these terms to a determinist contrast function and the uniform convergence of their increments.
• The term (22) is a deterministic one. Using Parseval theorem, we have that • The term (23) is a pure noise term composed of terms of the type |δ l | 2 cos(l(α j − α k ))ξ x jl ξ y kl , k = j.
For |h| ≤ 1/n 2 the inequality | cos lh − 1| ≤ 1/n leads to |U n (α j − α k + h) − U n (α j − α k )| ≤ Hence the first term in (24) goes to 0 in probability. • The remaining term in (24) is similar to the term (25), which has the same asymptotic behavior as Here we only give the proof of the uniform convergence of (25) which holds under slight modifications of the second term of (24). As the noise is Gaussian we get |V n (α j − α k )| ≥ 16 log n +∞ l=−∞ |δ l | 2 |c l (f )| 2 n   ≤ Kn 2 exp(−4 log n).
Using again previous bound for h ≤ 1/n 2 , we obtain In conclusion, we have that sup α∈[−π,π] J |M n (α) − K(α) − | This ensures that (11) is fulfilled and that the convergence property in (10) is ensured. It remains to be seen that the asymptotic contrast enables to identify the α j 's, which concludes the proof of Condition (10). Cauchy-Schwartz inequality yields that hence, K(·) ≥ 0, and the minimum value is reached for In a matrix way, we get the equation (