Weighted least squares methods for prediction in the functional data linear model

The problem of prediction in functional linear regression is conventionally addressed by reducing dimension via the standard principal component basis. In this paper we show that an alternative basis chosen through weighted least-squares, or weighted least-squares itself, can be more effective when the experimental errors are heteroscedastic. We give a concise theoretical result which demonstrates the effectiveness of this approach, even when the model for the variance is inaccurate, and we explore the numerical properties of the method. We show too that the advantages of the suggested adaptive techniques are not found only in low-dimensional aspects of the problem; rather, they accrue almost equally among all dimensions.


Introduction
The functional linear model has the appearance of being rather conventional. It involves representing a scalar response, Y , as where X denotes the function-valued explanatory variable, α is a scalar, β is the function-valued slope parameter, and I is a known compact interval. However, estimation of β is generally a nonparametric problem, and the level of complexity implicit in that property can carry over to the problem of prediction, in which we wish to estimate α + I β x for a given function x. Sometimes α + I β x can be estimated root-n consistently, where n denotes sample size, but more commonly, estimators converge at strictly slower rates. Cai and Hall (2006) discuss these issues, and Faraway (1997), Ferraty and Vieu (2000), Cuevas et al. (2002), Ramsay and Silverman (2005, Chapter 12),  and  address functional linear regression in more general terms.
A standard approach to estimating α and β is to first estimate the principal component basis from a sample of observations of (X, Y ), and then construct an estimator of µ(x) = α + I β x in terms of that basis, using least squares. However, in practice the distribution of the error in (1) can be a functional of the distribution of X, and the optimal choice of basis can depend significantly on x. To address these challenges we could construct the basis so that it gave greater emphasis to observations of X that were relatively close to x. For example, we could restrict attention to X for which X − x ≤ δ, where · was a suitable distance measure and δ played the role of bandwidth, although δ would not necessarily be chosen to converge to zero as n increased. More subtly, the basis could be constructed by applying kernel weights to each observation. See Mas (2008) for theoretical results addressing problems of this type.
Although this approach is attractive, practical difficulties can arise from the implicit reduction in sample size that is involved. An alternative method is to estimate the variance, σ(X) 2 say, of the distribution of the error in (1) conditional on X, and adapt prediction to the level of variability there. We suggest solving this problem by modelling σ(x) 2 as a function of α + I β x, and using its inverse, with x replaced by a data value X, as a weight in the basic least-squares problem. We then show that calculations can be simplified by computing a new principal component basis, adapted to heteroscedasticity. While our approach has some similarities with the weighted least squares method used for finite dimensional data, it differs significantly due to the intrinsic nonparametric, and infinite dimensional, characters of functional linear regression; we quantify these issues in theoretical terms.
In summary, this paper makes three main contributions. First, we show in sec- The main theoretical result in section 3 gives a concise account of the way adaptive methods can improve the performance of estimators in functional linear regression. In particular, we show that the advantages accrue almost equally among all dimensions; they are not principally to be found in low-dimensional aspects of the problem.
Previous developments of principal components analysis for functional data play a central role in our work. Early contributions include those of Besse and Ramsay (1986), Ramsay and Dalzell (1991) and Rice and Silverman (1991). From that point a very substantial literature has developed, including but by no means limited to the work of Silverman (1995Silverman ( , 1996, Brumback and Rice (1998), Cardot et al. (1999Cardot et al. ( , 2000Cardot et al. ( , 2003, Cardot (2000), Girard (2000), James et al. (2000), Boente and Fraiman (2002), He, Müller and Wang (2003), Ramsay and Silverman (2005 where α is a scalar, β and X are functions defined on the compact interval I, and E(ǫ | X) = 0. Square-bracketed subscripts here distinguish the ith observation of X, X [i] , from the ith principal component score, which is conventionally represented (2), where x denotes a particular value of X and µ is a scalar functional.
A standard approach to estimating µ(x) is to introduce an orthonormal basis, say ψ 1 , ψ 2 , . . ., and argue that β and x admit convergent expansions with respect to this sequence, i.e.
where b j = I β ψ j and x j = I x ψ j . Estimatorsα of α andb j of b j , for j ≥ 1, are then constructed from the data by minimising where X ij = X [i] ψ j and r denotes the frequency cut-off, a smoothing parameter.
These definitions ofα andb 1 , . . . ,b r reflect the definitions of α and β at (2) and, for appropriate choice of r, ensure consistency. The resulting estimator of µ is A thresholding method could also be used instead of "cut-off smoothing," but the difficulty of estimating the variance ofb j makes that approach unattractive.

Principal component basis
It is common to take ψ 1 , ψ 2 , . . . to be the principal component basis, ordered so that the corresponding eigenvalues form a decreasing sequence. Specifically, define K(s, t) = cov{X(s), X(t)} to be the covariance function of X, and construct the spectral decomposition of K, where θ 1 ≥ θ 2 ≥ . . . ≥ 0 and (θ j , ψ j ) are the (eigenvalue, eigenfunction) pairs of the transformation that takes ψ to Kψ, defined by (Kψ)(t) = I K(s, t) ψ(s) ds.
Then the orthonormal functions ψ j make up the principal component basis. The In practice the principal component basis is unknown, and needs to be estimated from data. To this end we define is an estimator of K(s, t), (θ j , ψ j ) are (eigenvalue, eigenfunction) pairs for the transformation represented by K, and the order of the indices j is chosen to ensure thatθ 1 ≥θ 2 ≥ . . . Thenθ j andψ j are our estimators of θ j and ψ j , respectively, and we would replace (4) bŷ where X ij = I X [i]ψj , giving the obvious estimator µ(x) of µ(x). Equivalently, sinceα =Ȳ − IβX then, writing X j = n −1 i X ij , we can minimisê over b 1 , . . . , b r , obtaining the same numerical valuesb 1 , . . . ,b r as we do when minimising (7). Then, definingx j = I xψ j , we take to be our estimator of µ(x). In a slight abuse of notation, when discussing practical implementation we shall writeα andb j for the quantities that minimise (7) rather than (4).

Adapting to the variance of Y
The estimator µ at (9) is conventional, but does not take into account the fact that the errors at (2) are often heteroscedastic. When a significant amount of variability is explained by that aspect of the problem, we should replaceŜ r (α, b 1 , . . . , b r ) at (7) by its form where a weight, equal to an approximation to the inverse of the variance , is incorporated into the series at (7).
In conventional parametric regression, the conditional variance of the regression errors is often modelled as a power of the assumed parametric form of E(Y |X). See for example Carroll and Ruppert (1988). In the functional data context we propose with g a univariate function and α, b 1 , . . . , b r and r as in (7) and where X j is the jth principal component score. An adaptive form of g that is often suitable is the "power of the mean" model where c 1 and c 2 are constants, or the version of it which includes an intercept term.
In principle, a nonparametric estimator of g could be used. However, in applications to real functional data, where sample sizes are generally only moderate in size, we found that the increase in variance resulting from the nonparametric fit significantly outweighed any improvements in performance that might be expected using that approach. Bear in mind that it is not necessary to have consistent estimators of the variances in order to enjoy improved statistical performance, even in the asymptotic limit. We shall take this point up again in section 3; see point (ii) below the Theorem there.
To estimate σ(X) 2 we interpret the unweighted estimatorsα andβ = j≤rb jψj as pilot estimators of α and β = j b j ψ j , respectively, and use them to calculate Since these quantities are already centred, we and chooseĉ 1 andĉ 2 to minimise T (c 1 , c 2 ). In this notation our estimator of var(Y − Next we incorporate these weights into the objective function at (8), obtaining: is given by the following analogue of (5), based on the new coefficient estimators: A computational advantage of defining estimators by minimisingŜ (13), is that the "ex transpose ex" matrix in the former case is simple to invert. Indeed, by definition of X ij in terms of the orthogonal The fact that this does not hold in the case of the objective function U t (b 1 , . . . , b t ) reflects the fact that the orthonormal basis functionsψ j are not necessarily, in this case, the natural ones. Instead we could replace U t (b 1 , . . . , b t ) by )X ij , and where the orthonormal functionsφ 1 ,φ 2 , . . ., with corresponding eigenvalues ω 1 ≥ ω 2 ≥ . . ., are defined by the following spectral decomposition Takingb w1 , . . . ,b ws to minimise V s (b 1 , . . . , b s ), a competitor with µ w (x) at (14) is given byμ The numerical differences between µ w andμ w are generally very small.

Practical choice of smoothing parameters
The methodology outlined in sections 2.2 and 2.3 involves two smoothing parameters: r, in the equivalent objective functionsŜ r andŜ equiv r at (7) and (8), and t, in U t at (13), or s, in V s at (15). We propose selecting these parameters by crossvalidation, as follows. Omit the data pair (X [i] , Y i ) from the sample, and, using the remaining n − 1 pairs, construct the predictorμ w (x) at (16) for a general r and s; The same approach is used to select r and t for the predictor µ w (x) at (14).

Theoretical properties
We shall suppose that we model the variance var(ǫ | X) = σ(X) 2 as τ (X) 2 , where the function τ is known but may not equal σ. That is, our model may not actually be correct. We shall make three simplifying assumptions: (a) The principal components I X ψ j of X are independent, rather than merely uncorrelated; (b) the principal component basis ψ 1 , ψ 2 , . . . is known; and (c): where δ is stochastically independent of X, E(δ) = 0, E(δ 2 ) = 1, the functional σ is bounded, τ is bounded above zero, and σ(X) and τ (X) depend on only a finite number of the principal component scores X j = I X ψ j ; that is, for some t ≥ 1 and positive integers j 1 , . . . , j t we can write σ(X) 2 = var(ǫ | X) = h(X j 1 , . . . , X jt ), where h is a positive, t-variate function which is bounded away from zero and infinity, and τ (X) 2 can be represented in the same way.
Assumption (a) can be relaxed to a mixing condition, and (b) can be removed by using the estimated ψ j s rather than their true forms. The latter approach requires a regularity condition on the spacings of the eigenvalues θ j , and in that setting a complex technical argument, similar to those given by Hall and Hosseini-Nasab (2008), is needed. Condition (c) can be relaxed by noting that if σ(·) is sufficiently regular, and if the scores X j are independent, then σ(X) can be approximated by a sequence of functions σ t (X 1 , . . . , X t ), for t ≥ 1, where σ(X) − σ t (X 1 , . . . , X t ) converges to zero as t → ∞, with a similar constraint imposed on τ (X).
Recall that the pair (X, Y ) is generated by the model at (2), where the error, ǫ, has zero mean, and we wish to estimate µ(x) = α+ I β x for a particular function x.
Our estimator, which is equivalent to that given at (14) Since we centre the principal component scores X ij at their respective meansX j,w then we may, and do, assume without loss of generality that E{X τ (X) −2 } = 0.
The eigenfunctions ψ j and eigenvalues θ j are defined by (6), and, in addition to (17), we assume that: the principal components I X ψ j are independent , Write AMSE µ w (x) − µ(x)} for the asymptotic mean squared error of the estimator. The following theorem, which is derived in section 5, describes asymptotic properties of this quantity.
(ii) If the model is essentially incorrect, i.e. if τ does not equal a constant multiple of σ, then the estimator remains consistent and enjoys the same convergence rate as before, but with an inflated constant multiplier. More generally, if the variance functional σ 2 is not constant, and if the model is wrong but approximately correct (in particular, if τ (X) is sufficiently close to σ(X) for sufficiently many values of X), then ρ 2 is reduced relative to the value it would have if we were to simply take τ ≡ 1.
These properties point to the fact that it is not essential to use a nonparametric estimator of variance in order to obtain an improvement in performance over standard least-squares. As noted in section 2.3, for the small to moderate sample sizes commonly encountered with functional data, weighted least-squares based on nonparametric estimators of variance generally perform poorly because the estimated weights introduce too much extra variability.
(iii) The factor ρ 2 , defined in (i) above, is applied to each and every term in the series j≤r θ −1 j x 2 j in (23); it does not reduce in size as j increases. Therefore the advantages of correcting for heteroscedasticity are valid with equal force for arbitrarily large dimension; they do not relate just to low-dimensional aspects of the problem.
(iv) As is to be expected, the effect of weighting has an impact only on the variance contribution to asymptotic mean squared error, not on the bias component. However, even if the problem is finite-dimensional the impact of the variance component persists even in the asymptotic limit, and so there is always something to be gained, in asymptotic terms, by adapting the estimator appropriately to heteroscedasticity.
(v) The first part of (20) determines that the estimator µ w (x) has nonparametric, rather than parametric, convergence rates. It holds if we treat x as a realisation of X, and average over all such realisations. In particular, if x is distributed like X then E(x 2 j ) = θ j , and so E( j≥1 θ −1 j x 2 j ) = j≥1 1 = ∞, implying that the first part of (20) holds "on average." 4 Numerical illustrations

Real data example
We applied our method to Australian rainfall data available at http://www.world weather.org. The data consist of rainfall measurements at each of 206 Australian weather stations, averaged over the 40 to 100 year periods for which the weather stations had been in use. For any given station we took Y to equal the average yearly rainfall divided by the average number of days on which it had rained, which we refer to as intensity. Our predictor X(t) equalled the rainfall at time t, where t represented the fraction of the year that had passed at the time of measurement, and rainfall at that time was averaged over the years for which the station had been operating and was obtained by passing a local polynomial smoother through discrete observations.
The majority of weather stations fall into one of two classes, which respectively comprise most stations in southern parts of the continent, which tend to follow a "European" rainfall pattern where the majority of rain comes in cooler months and summer is relatively dry; and most stations in northern regions, which exhibit a "tropical" pattern where most rain falls in mid to late summer and the cooler months are generally dry. Only a small number of weather stations have more complex rainfall patterns that are not of one of these two types, although some northern stations reflect rainfall southern patterns, and vice versa. These features suggest that most of the data might reasonably be assumed to come from a mixture of two populations. Those populations might produce different error variances in the linear model, leading to heteroscedasticity.
We removed two weather stations, Ipswich and Katherine, which appeared to be outliers, in the sense that the functional linear model explained their rainfall relatively poorly. Then we applied our method to the n = 204 remaining stations.
To test the method we generated B = 500 samples, each of size n = 102, by randomly removing half of the 204 stations. For each of the B samples we then applied our method to predict the value of Y for each of the 102 removed stations. Note that, since these were real data, we did not know the true value of the target µ, and so we compared the predicted value with the true value of Y . For each of the B samples we calculated the mean squared errors for the 102 predicted stations, that is, for b = 1, . . . , B, we calculated For each of the B samples, we also calculated the proportion of the 102 predicted stations that were better predicted by the weighted methods, that is, for In Figure 1 we present graphs of the resulting B = 500 ordered values of log( MSE w,b /MSE b ), of log(M SE w,b /MSE b ), ofp w,b and ofp w,b . We see that both weighted methods gave very similar results, and that both strongly bettered the unweighted predictorμ: for most of the 500 samples, the MSE of the weighted methods was inferior to that of the unweighted method. Moreover, in more than 90% of the cases, the proportionsp w,b andp w,b were higher than 0.5, meaning that for most of the B = 500 samples, more than half of the 102 predicted values were closer to the true Y i when using the weighted method rather than the unweighted method.

Simulations
Using the same 204 functions X(t) as in section 3.1, we also tested the weighted methods on some generated Y data. The advantage of simulated data is that we know the target µ(x) = E(Y |X = x), and thus it is easier to assess the quality of the predictors. We generated 204 Y -values according to the model at (2), where, we took U ∼ U[−3/4, 3/4]. We tried two models for f (X) 2 : (i) f (X) 2 = 0.1·{ β(t)X(t) dt} 2 and (ii) f (X) 2 = 0.1 · { β(t)X(t) dt} 2 + 0.2 · | β(t)X(t) dt| 1/2 . Note that the function β as defined above is a smoothed and simplified version of the estimated β of the real data of section 3.1.
We proceeded as in section 3.1 and, by randomly splitting the data (X [i] , Y i ) into two parts, constructing B = 500 samples of size n = 102, and each time applying the method to the 102 remaining data points. In both cases (i) and (ii) we took the function g at (10) equal to g(u) = |c 1 u| c 2 . Even though this form is only an approximation to the real g for case (ii), we shall see below that the weighted methods worked well there too.
In this case since we knew the target µ, we replaced Y i by µ(X [i] ) in the definitions of MSE w,b , MSE b ,M SE w,b ,p w,b andp w,b . Figures 2 and 3 show, respectively, the results for cases (i) and (ii). We show only the predictorμ w ; the results forμ w were similar. The figures illustrate the improvement that can be gained by using a weighted version of the predictor.

uniformly in
Combining these results we deduce that the right hand side of (28) equals O p c r r 2 n k r n −1 = O p c r r 2k+1 n −(k+1) .
Next we outline the argument that extends this result to the heteroscedastic setting. First we discuss a version of the theorem in an artificial problem where the error variance is a function of Z, say, which is independent of (X, Y ) but is observed along with that pair. That is, the model (2) now has the form Y = α + I β X + σ(Z) δ, where the perturbation δ is independent of X and Z and has zero mean and unit variance. The appropriately weighted criterion function is that at (18) but with τ (X [i] ) replaced by τ (Z i ). In this case the proof above is easily re-worked, in particular with the factor τ (Z i ) −2 included in both series at (26) and in subsequent series, to show that the asymptotic mean squared error ofμ w (x) continues to be given by (23)  This result continues to hold when σ and τ are functions of X rather than Z, and depend on only a finite number of principal component scores. The proof proceeds by noting first that if var(ǫ | X) = h(X j 1 , . . . , X jt ), as in (17), and if we assume that the components with indices j 1 , . . . , j t are known and therefore do not need to be estimated, then we are in exactly the position addressed in the previous paragraph; we can take Z to be (X j 1 , . . . , X jt ) and replace X by the expansion ′ j X j ψ j where the summation ′ j is over only those indices j not included among j 1 , . . . , j t . Moreover, the asymptotic mean squared error formula is unaffected if we eliminate the components corresponding to j = j 1 , . . . , j t , or if we take those components to be known.