A powerful test based on tapering for use in functional data analysis

A test based on tapering is proposed for use in testing a global linear hypothesis under a functional linear model. The test statistic is constructed as a weighted sum of squared linear combinations of Fourier coefficients, a tapered quadratic form, in which higher Fourier frequencies are down-weighted so as to emphasize the smooth attributes of the model. A formula is $Q_n^{OPT}=n\sum_{j=1}^{p_n}j^{-1/2}\|\boldsymbol{Y}_{n,j}\|^2$. Down-weighting by $j^{-1/2}$ is selected to achieve adaptive optimality among tests based on tapering with respect to its ``rates of testing,'' an asymptotic framework for measuring a test's retention of power in high dimensions under smoothness constraints. Existing tests based on truncation or thresholding are known to have superior asymptotic power in comparison with any test based on tapering; however, it is shown here that high-order effects can be substantial, and that a test based on $Q_n^{OPT}$ exhibits better (non-asymptotic) power against the sort of alternatives that would typically be of concern in functional data analysis applications. The proposed test is developed for use in practice, and demonstrated in an example application.


Introduction
The subject of this article is the functional linear model and functional linear hypothesis, both cornerstones functional data analysis (FDA) methodologies. This model is described as a sample of independent random functions (sometimes called curves or profiles), which here will be taken to have a common one-dimensional domain, (a, b], and real-valued response. To facilitate study of asymptotic properties, the present investigation will adopt a replicated functional linear model, in which each of N response points of dimension P is replicated n times. In matrix form, the i'th replication is (1.1) T is a functional vector of responses on t ∈ (a, b], and X = [x 1 , . . . , x N ] T is a N × P "essence" regressor matrix, assumed of full column-rank, which stores the values of explanatory variables at which the response functions are measured. The vector β(t) = [β 1 (t), . . . , β P (t)] T is a P × 1 functional vector of regression coefficients, and the ǫ i are independent and identically distributed functional error vectors; each ǫ i (t) = [ǫ i,1 (t), . . . , ǫ i,N (t)] T is a vector of independent and identically distributed error functions, for which E[ǫ i,k (t)] = 0 and V [ǫ i,k (t)] = 1 for each (i, k) and t, and which are stationary on t ∈ (a, b]. The functional linear hypothesis is H 0 : L T β(t) = 0 for all t ∈ (a, b] against a general alternative, where L is a P × ν hypothesis matrix of full column-rank. (Use of the symbol ν is to reflect the degrees of freedom in a standard test of such a hypothesis in the analogous univariate situation.) After some initial preprocessing, functional data may typically be represented by a discrete, high-dimensional model. Such representations are made here using Fourier decomposition, whose advantage in FDA is not only to discretize the model but also to decorrelate the error structure and offer meaningful descriptive summarization. This is demonstrated on an existing data set in Section 2. (See also Fan and Lin, 1998, Spitzner, Marron, and Essick, 1998, Spitzner and Woodall, 2003 for further demonstrations in FDA.) Also laid out in that section is how, taking into consideration the specific L, data collected under the model (1.1) may be translated to that of the model Y n,j = θ j + n −1/2 e n,j , (1.2) for j = 1, . . . , p n , where p n represents some (high) maximum number of dimensions to be accounted for at a given n. The statistics Y n,j = [Y n,j1 , . . . , Y n,jν ] T are ν-dimensional discrete data vectors, each entry of which a linear combination Fourier coefficients (defined by a distinct column of L) that partly summarizes the information relevant to the specific null hypothesis. The θ j = [θ j1 , . . . , θ jν ] T are mean vectors, and the e n,j = [e n,j1 , . . . , e n,jν ] T are zeromean, unit-covariance error-vectors such that the e n,jk are independent across k. (Across j, however, small correlations among the e n,jk are possible.) The functional linear hypothesis translates to H 0 : θ j = 0 for j = 1, . . . , p n versus H 1 : not H 0 , (1.3) in which the matrix L has been absorbed into the transformation from (1.1) to (1.2). A typical assumption made in FDA is that the functional parameter β(t) in (1.1) is somehow "smooth." From an intuitive standpoint, this means that the β 1 (t), . . . , β P (t) are each taken to be a conglomeration of mainly large-scale, sweeping shapes, which are represented by low-frequency Fourier components. For rigorous analysis, smoothness is expressed more technically in Section 3 by restricting β(t) to a Sobolev class. At any rate, a key issue in testing is how to exploit the smoothness assumption, so as not to waste statistical power attempting to distinguish the "rougher" aspects of the model (i.e., the smallscale wiggly shapes). This is especially important for testing in high-dimensions, where the vastness of the parameter space requires a careful management of power.
The transformation to the discrete model (1.2) typically assigns smaller j to model-components associated with smoother functional attributes. Noting this, one would want a test that focuses power primarily on the model's lower-indexed components. One class of tests that do this is defined by test statistics of the form Q n = n pn j=1 w n,j Y n,j 2 , (1.4) where each 0 < w n,j ≤ 1 and w n,j → 0 as j → ∞ to emphasize the Y n,j with smaller j. The particular test of interest in this article is defined by the weight setting w n,j = j −1/2 , which rewrites (1.4) as Q OP T n = n pn j=1 j −1/2 Y n,j 2 . This technique of managing power by direct down-weighting, as in (1.4), shall be referred to as "tapering." Detailed asymptotic power properties of tests based on tapering are deduced in Spitzner (2008A), some of whose results are reproduced in Section 3. There, it is shown that Q OP T n manages asymptotic power in an optimal way among test statistics of the form (1.4).
The asymptotic performance criteria used in Spitzner (2008A) and adopted here are taken from "rates of testing" theory, a framework articulating the rate at which power is retained in high-dimensional testing problems under geometric smoothness constraints. Its basic components are laid out in Ingster (1993), Spokoiny (1996), Lepski and Spokoiny (1999), Horowitz and Spokoiny (2001), and Gayraud and Pouet (2005), among others. Relevant criteria are described fully in Section 3. Within this context, the problem of selecting the w n,j in (1.4) for good asymptotic power would be aptly described as constrained rateoptimization among tests based on tapering, for which Q OP T n is a solution.
There are, however, two types of asymptotic optimality to consider, and it is known that any test based on tapering is suboptimal with respect to the type that is more relevant for consideration in practice. Specifically, Ingster (1993) and Spokoiny (1996) respectively deduce "minimax" and "adaptive-minimax" rates, which bound the performance of any test with respect to an adopted smoothness geometry. The former rate defines a stronger form of asymptotic optimality, but the latter "adaptive" type is the one more practically relevant. It is well established that tests based on tapering can achieve Ingster's minimax rate of testing (e.g., Ingster (1993) and Fan, Zhang, and Zhang (2001) provide two distinct examples), but Spitzner (2008A) deduces the suboptimality of any test based on tapering, including that based on Q OP T n , with respect to Spokoiny's adaptive-minimax criterion.
Existing tests that are known to achieve adaptive minimaxity are based on the alternative test-construction techniques of "truncation" or "thresholding." Spokoiny (1996) provides an adaptive-minimax test based on thresholding, and Fan, Zhang, and Zhang (2001) establish the adaptive minimaxity of Fan's (1996) "adaptive Neyman test," a test based on truncation. Details of these tests are provided in Section 4. Both have been developed for use in FDA in Fan and Lin (1998) and Abramovich et al. (2002).
Despite its asymptotic suboptimality, the test based on Q OP T n is demonstrated in Section 4 to retain good and even superior power over the adaptiveminimax tests above in non-asymptotic, practically realistic FDA settings, illustrating that the improvements offered by thresholding or truncation, though guaranteed, may arise quite slowly asymptotically. That is, the evidence of this article suggests that, among tapering, truncation, and thresholding, the most powerful tests in typical FDA applications are constructed by tapering! A novel aspect of this article is that it highlights a distinction between the asymptotic setup of FDA and that of other high-dimensional problems involving smoothness constraints, such as goodness-of-fit testing, within which many of the existing high-dimensional tests were first developed. The particular distinction has to do with the dependence of the dimensionality parameter, p n , on the sample-size parameter n. In non-FDA scenarios, there is typically an explicit connection between these two parameters (often p n = n), whereas in FDA the connection is largely hypothetical, and one typically has little or no control over the rate at which p n increases. Accordingly, an important concern in FDA is the sensitivity of test performance on the rate of p n → ∞. It shall be seen that the optimality property of the test based on Q OP T n is robust in this regard.

Organization
This paper is organized as follows. Section 2 presents an applied data example, which demonstrates the transformation from (1.1) to (1.2) and the test based on Q OP T n . This will provide grounding and intuition for subsequent discussion. The setup and relevance of asymptotic analysis in FDA is discussed in that section as well. Section 3 defines rates-of-testing criteria and presents the paper's main theoretical results. Section 4 gives details of several existing high-dimensional tests, and reports on empirical comparisons carried out by simulation. Concluding discussion appears in Section 5, in which the importance of studying tests based on tapering in FDA is further elaborated.

Functional data analysis and its asymptotic framework
Let us begin discussion with an analysis of the Canadian temperature data of Ramsay and Silverman (2005, ch. 13) under a functional linear model. (The data are available in a supplemental website to the book.) Subsequently, a general asymptotic setup for functional data analysis will be laid out.

An example data set
The Canadian temperature data consist of daily mean-temperature profiles across the year at 31 weather stations in three regions of Canada: there are M 1 = 14 "Atlantic" stations, M 2 = 5 "Pacific" stations and M 3 = 12 "Continental" stations. The raw measurements are displayed in the top panels of Figure  1. (The original data set analyzed by Ramsay and Silverman also includes three stations in an additional "Arctic" region, and one additional Atlantic station, at "Schefferville." For the present analysis it makes sense to have set aside the Arctic stations since there are so few of them, and the Schefferville station since its location is of unusually high latitude relative to the other Atlantic stations.) Ramsay and Silverman remark that the region-effects seen in these data are "more complex than the constant or even sinusoidal effects that one might expect," but note specifically that the Pacific stations tend to have warmer winter temperatures and Continental stations tend to have colder winter temperatures, while all regions' summer temperatures tend close to the average. The latter observations refer to large-scale attributes of the temperature profiles, and it will be presumed that these and other large-scale attributes are of primary interest. Nevertheless, smaller-scale, "more complex" attributes are not to be ignored; we want a systematic way to explore essentially all aspects of the data. Fourier decomposition provides just such a technique, one that is furthermore most appropriate for data such as yearly temperature profiles whose attributes tend to be periodic.

Fourier decomposition
The notation used in this analysis will parallel that of Section 1, but, for simplicity, without the subscript n, thereby ignoring any replication concepts until further ideas are laid down. Set M = M 1 + M 2 + M 3 = 31 and denote by Y k (t) the measurement of the k'th station at time t within a typical year: the measurement times are t 1 , . . . , t r , where t 1 is January 1, r = 365, and t l = l, so that t l+1 − t l is one day. Next define the Fourier basis functions on (0, 365] according to ψ 1 (t) = 1, ψ 2j (t) = sin(jπ(2t/r − 1)), and ψ 2j+1 (t) = cos(jπ(2t/r − 1)) for j = 1, 2, . . . The j'th Fourier coefficient of the k'th station is then Y * jk = r −1 r l=1 Y k (t l )ψ j (t l ), which gives the j'th coefficient of a multiple regression of Y k (t l ) across t 1 , . . . , t r onto any finite set of regressors ψ j (t l ) across t 1 , . . . , t r . Coefficients associated with the first few ψ j are shown in the bottom panels of Figure 1, centered and scaled for each j so that the mean and sum of squares of the displayed Y * jk match common values. Consider, for instance, that the shapes of ψ 1 (t) = 1 and ψ 3 (t) = cos(π(2t/r − 1)) convey interpretations whereby Y * 1k measures the k'th station's yearly average temperature and Y * 3k measures its differential between the winter and summer temperatures. With this in mind, observe from the bottom panels of Figure 1 that the Y * 1k tend to be larger for the Pacific region and smaller for the Continental region, while the Y * 3k tend to be smaller for the Pacific region and larger for the Continental region. This reflects the observations made by Ramsay and Silverman: between these two regions, the yearly average temperature of the Pacific region is warmer (as reflected in the Y * 1k ) and there is a smaller differential between the winter and summer temperatures (as reflected in the Y * 3k ). Each of the remaining sets of coefficients describe a distinct attribute of these data. For instance, the Y * 2k describe asymmetries between the spring and fall transition periods, and the Y * jk with larger j summarize finer periodic attributes of the stations' yearly profiles.
Fourier decomposition of these data is not discussed in Ramsay and Silverman (2005) itself, but it is in the book's supplemental internet materials, within which there appears the remark: "it was decided that 65 basis functions captured enough of the detail in the temperature data . . . ," referring to Fourier basis functions. The present analysis will follow this guideline and consider just the Fourier coefficients Y * jk with j = 1, . . . , 65.  Ramsay and Silverman (2005) treat the temperature profiles by an elementary multi-group functional model, Y k(g,l) (t) = µ g (t) + ǫ k(g,j) (t), where k = k(g, l) indexes the l'th station of region g, µ g (t) is the mean temperature profile for the g'th region, and ǫ k(g,j) (t) are random errors. Their analysis concludes that there are indeed vast differences in temperature profiles among the regions. Consider, however, that one would expect latitude to explain some substantial portion of the variation in temperature from station to station, and moreover that it is surely possible for the dependency of temperature on latitude to change from region to region. Accordingly, the model considered here is an extension of the multi-group model that incorporates latitude as a covariate in such a way as to allow for differences in both intercept and slope across regions. Denoting by x k the latitude of the k'th station, and writingx

Linear models and testing
, a functional linear model with P = 6 regression parameters, µ g (t) and β g (t) for g = 1, 2, 3. From this, an analogous model is implied for each set of Fourier coefficients, which for the j'th set is Component-specific estimates and tests may be carried out under the model (2.1) using standard linear-regression methodology. To illustrate, Figure 2 displays scatterplots of Fourier coefficients corresponding to j = 1 in the left panel and j = 41 in the right panel, with estimated region-specific fitted regression lines drawn in. The former panel depicts patterns one would more-or-less expect to see in these data. There, the Y * 1k are seen to generally decrease as latitude increases, as would be expected of measurements of yearly average-temperature. Observe also that the Y * 1k of the Atlantic and Continental stations follow a common trend fairly consistently, whereas those of the Pacific stations fall above Table 1 P-values for component-specific and global tests comparing the relationships between temperature and latitude across regions. The headings "falls on common trend" and "same slope" refers to the respective null hypotheses H 0 : is taken across all pairs of distinct regions consistent with the row label. The headings Y * 1k and Y * 41k indicate component-specific tests at j = 1 and j = 41, respectively, each with ν numerator degrees of freedom. "Global" p-values are simulated from the null distribution of F global . that trend. The fitted slopes of all three regions are roughly the same. A very different pattern is seen in the scatterplot of the Y * 41k . Among those coefficients it is seen that the Pacific and Continental stations follow roughly the same trends, though shifted slightly, and that those trends are very different from the trend followed by the Atlantic stations: the Y * 41k tend to increase with latitude for the former stations but decrease for the latter.

Falls on common trend
The pattern associated with the Y * 41k within a yearly temperature profile is symmetric with a period about two-and-a-half weeks. Its magnitude is quite small, accounting for relative temperature differences of less than one half of one degree. Yet our analysis suggests its relationship with latitude may distinguish the Atlantic region from the others.
Formal tests of the patterns noted above may be formulated in terms of the parameters of the model (2.1). Corresponding p-values are listed in Table 1. To describe the relevant hypotheses, it will be convenient to define an "overall" regression line at each j, Referring to the headings of Table 1, the null hypothesis labeled "falls on common trend" is H 0 : , where the index-set G consists of all pairs of distinct indices among regions indicated in the table's corresponding row-label. The null hypothesis labeled "same slope" is H 0 : β g1j = β g2j for (g 1 , g 2 ) ∈ G.
Each null hypothesis above may be written as a linear hypothesis H 0 : and L is a matrix of full column-rank that is determined by the specific hypothesis. The associated test statistic is a ratio of independent "mean-square" statistics F j = MS j (L)/MSE j , which follows an F distribution whose non-centrality parameter is zero only under H 0 , assuming the errors in (2.1) are independent, homoscedastic, and Gaussian. (Explicit formulas are provided in Section 2.4; see also, Seber, 2003.) Associated degrees of freedom are ν = rank L, whose values are indicated in Table 1, and M − P = 25.
The p-values in Table 1 reflect the observations made above on Figure 2. Comparing against the standard 0.05 level, the p-values in the column labeled Y * 1k under "falls on common trend" separate out the Pacific region as falling off of a common trend that is followed by the other regions. Those in the Y * 1k column under "same slope" indicate no evidence for differences in slope. Similarly, the column labeled Y * 41k under "same slope" separates out the Atlantic region as having a different relationship with latitude than the others.
There remains the question of how to carry out these tests "globally." That is, can the same hypotheses be tested on the whole of the temperature profiles, inasmuch as they are represented by 65 sets of Fourier coefficients? This type of question is the central concern of this article. A starting point is to consider the plots in Figure 3, which charts the p-values from component-specific tests of "same slope" across all sets of Fourier coefficients, j = 1, . . . , 65, each panel corresponding to a separate pair of regions. The most significant differences in slope are reflected in the very small p-values displayed in the two leftmost panels, at j = 41. Other p-values are "small" as well, in the sense of falling below the 0.05 level, even in the rightmost panel, but nowhere near as small as these two at j = 41. However, in light of there being 65 test results to examine per panel, it is no surprise to find at least a handful of small p-values. Of interest, then, is to deduce a single assessment for each panel which combines the p-values across all j = 1, . . . , 65. Well suited to this task is a test statistic constructed by tapering, in a manner similar to (1.4). For the present situation, let us define this statistic as F global = 65 j=1 w j F j , and set the weights to w j = j −1/2 , paralleling the construction of Q OP T n . By combining the individual test statistics F j this way, F global is globally sensitive, but it down-weights the influence of the Y * jk with larger j, as is desired to reflect primary interest in the larger-scale shapes.
Independence shall be assumed among the Y * jk across j (which is justified in in Section 2.4), so that the null distribution of F global is fully defined. Simulated p-values (using one million iterations) for global versions of the null hypotheses discussed above are listed in the columns of Table 1 labeled "global." Comparing against 0.05, those in the "same slope" portion of the table again separate out the Atlantic region as having a different relationship with latitude than the others, but this conclusion now accounts for essentially all aspects of the temperature profiles.

An asymptotic framework for functional data analysis
The sort of analysis carried out on the Canadian temperature data is now cast in the general context described in Section 1. The transformation from the continuous model (1.1) to the discrete model (1.2) will be laid out in detail, and its properties discussed. Hereafter, the notation will revert back to that of the replicated model, in which there is an explicit reference to a sample-size parameter, n.
Regarding the setup for asymptotic analysis, there is some imprecision in defining replication strictly according to the model (1.1), for the presence of a continuous covariate would make it unrealistic to expect exact duplication of the regressor matrix, X, except in carefully designed experiments. Absent a continuous covariate, however, many multi-group experiments do allow the parameterization M g = N g n, where M g are group-specific sample sizes, and the "sample size" n is some common divisor among them. In such cases the matrix X would consist entirely of zeros and ones, and N = g N g . The notation adopted in the Canadian temperature example is intended to suggest such a multi-group experiment; yet, the corresponding replication concept is unrealistic, for if any additional weather stations were sampled, one would not expect the latitudes of those stations to match any of those in the current data set.
Such complications notwithstanding, the purpose of introducing the samplesize parameter, n, is to manifest the notion that an increase in the amount of data collected is coupled with a decrease in error variability. This is apparent in expression (1.2), which shows the magnitude of error in the discrete model to shrink at the rate n −1/2 . Some readers might prefer to reinterpret the asymptotic formulation used in this article to one in terms of shrinking errors, in which case the results presented here would directly translate. Otherwise, a "pure" interpretation of replication in the model (1.1) might take n = 1 in an analysis of current data, and treat any future replication as entirely hypothetical; or, one might employ various conceptual devices, such as sampling covariates from a distribution, to modify the model (1.1) and its translation to (1.2) so as to make replication more realistic. Despite these possibilities, the perspective taken here is that the model (1.1) is entirely adequate for illustrating the key ideas of present interest, and any modification would only add technical complications that are tangential to them.
Regarding data collection and Fourier decomposition, the functional measurements are assumed to have been taken along a dense, finite grid that is common to all Y i (t), as in the Canadian temperature data. For t in the domain (a, b], the points of the grid are taken to be t l = a + (b − a)l/r for l = 1, . . . , r and some fixed, large r (which may not be p n ). The data associated with the curve Y i,k (t) are Y i,k (t 1 ), . . . , Y i,k (t r ). (In more general situations the grid may change from curve to curve, but to avoid additional complication it will be assumed a good approximation to the present setup is available, e.g., by interpolating measurements onto a fixed grid.) Set ψ 1 To make the final step to the discrete model (1.2), a linear transformation involving L is used to tailor the statistics (2.2) to the linear hypothesis. Set H = {L T (X T X) −1 L} −1/2 and define the Y n,j in (1.2) according to The remaining objects defining (1.2) are An often-appropriate assumption has each ǫ i a stationary process such that is, across integer m, a mean-zero independent and identically distributed sequence with finite fourth moment, and ∞ m=−∞ |γ m | < ∞. When this assumption is valid, Theorem 10.3.2.i of Brockwell and Davis (1991) implies that each Cov(Y * 2 i,jl , Y * 2 i,kl ) → 0 as r → ∞, for j = k. (See also Corollary 3.1.1.i in Section 3, below.) Thus, Fourier decomposition provides a means to decorrelate the functional linear model, while the statistics (2.2) or (2.3) capture its core structure.
In the typical case where the σ 2 j are unknown, the test statistic Q n in (1.4) would be replaced byQ n = n pn j=1 w j Ŷ n,j 2 , whereŶ n,j is defined as in the right side of (2.3) but with an estimateσ 2 n,j substituting for σ 2 j . For instance, theσ 2 n,j may be the usual unbiased estimateŝ This modification led to the statistic F global in the Canadian temperature example, for which F j = MS j (L)/MSE j has so that F global = pn j=1 w j F j =Q n /ν. Fan and Lin (1998, sec. 3.4) discuss other estimates of σ 2 j based on smoothing across j, which are specialized for the Fourier decomposition technique.
Regarding the relationship between p n and n, the reader should notice that p n never appears in the transformation from (1.1) to (1.2), and so these two parameters are never actually connected. The parameter p n is constrained by the resolution of the grid t 1 , . . . , t r , which to avoid numerical error requires p n ≤ r. Moreover, it is possible that p n would be set according to subjective modeling assumptions such as an analyst's determination of the number of basis functions needed to describe the data, as was made in the Canadian temperature example. In Spitzner, Marron, and Essick (1998), p n is set subjectively to avoid observed defects in the ability of the ψ j to decorrelate the model at larger j. At any rate, the justification for taking n → ∞ and p n → ∞ is that it forms an appropriate abstract conceptualization for repeated measurement of functional data in accordance with a global point of view. In particular, p n → ∞ represents a situation where the grid t 1 , . . . , t r is to become increasingly dense, and if n → ∞ as well then potentially all available information about the curve model will be captured in the limit.

Rates of testing for the tapering mechanism
The discussion now turns to rates-of-testing theory and the asymptotic optimality of Q OP T n among tests based on tapering. The more technical aspects of this discussion have been omitted, but can be found in Spitzner (2008A).
Smoothness constraints are formally defined within rates-of-testing theory as a restriction of the functional parameter β(t) in (1.1) to a smooth-function class. In the most general settings, this would be a Besov class, but here it is taken to be a Sobolev class, a special case, which is appropriate when working with Fourier decompositions. Such constraints may be expressed as a restriction of the mean vectors of the discrete model (1.2) to the geometry a Sobolev ellipsoid of radius M in infinite-dimensional discrete space, where M > 0 and s > 1/2 are fixed constants. The notations = 4s + 1 > 3 shall also be used. The bound on the norm in (3.1) models smoothness by restricting expression of the higher-indexed θ j , with larger s making the restriction stronger. Moreover, Parseval's identity implies that (θ 1 , θ 2 , . . .) ∈ B s,M is equivalent to the assumption that the corresponding β(t) in (1.1), assuming the θ j arise through (2.4), is an element of a Sobolev ellipsoid in continuous space, (t)dt ≤ M c } for some s c = 1, 2, . . . and M c > 0, which are easily determined. (For details and further discussion, see Adams and Fournier, 2003).
The "rates" in rates-of-testing theory, which characterize test performance, are described as follows. Fix s > 1/2, M > 0, and for each n let φ n = φ n (Y n,1 , . . . , Y n,pn ) be a test of (1.3) such that lim n P 0 [φ n = 1] ≤ α, for a fixed level α ∈ (0, 1). The notation P θ is here used to denote probabilities under the model (1.2) for a specific θ = (θ 1 , θ 2 , . . .) and fixed n. Rates-of-testing criteria are formulated from sequences δ n → 0 satisfying inf θ∈H1(δn/δ * n ;s,M) The criterion (3.2) describes the rate at which a gap may shrink between the null hypothesis and a class of "distinguishable" alternatives, those the test would be able to detect with high power, asymptotically. The better-performing tests allow this gap to shrink faster: if for some δ n → 0 the criterion (3.2) is satisfied for one test, but not another, the former test is preferred. Ingster's (1993) minimax performance bound states that for no test does any δ n = o(n −2s/s ) satisfy (3.2), but there is a test (based on tapering) for which δ n = n −2s/s satisfies (3.2). This identifies the rateδ M n (s) = n −2s/s as minimax for the geometry B s,M at a specific s. Suppose now that fixed bounds s * < s * are given, and for each s * < s < s * one is to consider a separate sequence (δ n (s)), and setδ AM n (s) = {n 2 (log log n) −1 } −s/s . Spokoiny (1996) establishes that for no test is (3.2) satisfied across s * < s < s * if δ n (s) = o(δ AM n (s)) for some such s. It is also shown there is a test (based on thresholding) for which δ n (s) =δ AM n (s) does satisfy (3.2) across s * < s < s * . This identifies the rateŝ δ AM n (s) as adaptive-minimax for B s,M across s * < s < s * . (The optimal tests alluded here are the same mentioned in Section 1 and are described later in Section 4.) The main technical result for evaluating tests based on tapering rewrites the criterion (3.2) in terms of the parameters of the test statistic (1.4).
Theorem 3.1. Assume the model (1.2) and suppose (Q n ) is a sequence of test statistics with each Q n as in (1.4) for associated sequences (w n,j ) and (p n ) such that each 0 < w n,j ≤ 1 and p n → ∞ as n → ∞. Set S n (p) = w 2 n,1 + · · · + w 2 n,p , W n (p) = min{w 2 n,j : j ≤ p}, U n (p, q) = qW n (q)/S n (p), and U n (p) = U n (p, p). Suppose at each n the e n,jk are independent across k, and Suppose further that each e n,jk is such that E[e n,jk ] = 0, V [e n,jk ] = 1, E[e 4 n,jk ] ≍ 1 uniformly across j, k, and n, and P [e n,jk ≤ −t] > 0 for each t > 0. Let (δ n ) be some positive sequence for which δ n → 0. Fix α and, for each n, let φ Q n denote the size-α test which rejects the null hypothesis in (1.3) when Q n exceeds some critical value. For fixed s > 1/2, M > 0, and φ n = φ Q n the criterion (3.2) holds if, and only if, both where q n = {δ n /M } −1/s . The same conclusion holds if the Y n,jk in Q n are replaced with Y n,jk (1 + o p (1)), provided Cov(Y 2 n,jk (1 + o p (1)), Y 2 n,jl (1 + o p (1))) → Cov(Y 2 n,jk , Y 2 n,jl ) for each j, k, and l. Proof. This is Theorem 1 of Spitzner (2008A).
An important corollary of this theorem establishes its validity under assumptions on the covariance structure of the continuous model (1.1) that would typically be made in practice. The case of unknown variances is also treated.

Corollary 3.1.1. Suppose the model (1.2) derives from the functional linear model (1.1) via the transformation (2.3), and assume the notation of Section 2.4. The conclusions of Theorem 3.1 hold true under the following statement (i) and remain true under statement (ii). (i) Each
, for which the coefficients γ m satisfy ∞ m=−∞ |γ m ||m| 1/2 < ∞, the η i,k (a+ (b − a)m/r) are independent and identically distributed across integer m with E[{η i,k (a + (b − a)m/r)} 4 ] < ∞, and the weight sequence (w n,j ) does not increase in j for each n.
In Spitzner (2008A), Theorem 3.1 is applied to characterize and bound the rates-of-testing performance of tests based on tapering. The key results most relevant to present purposes are summarized as follows.
Corollary 3.1.2. Assume the notation and conditions of Theorem 3.1.
(i) Suppose (3.5.i) holds andδ n → 0 is such that n 2 U n (p n ,q n )q −s n ≍ 1, whereq n = {δ n /M } −1/s . Then, the sequence (δ n ) defines a "boundary rate" of the test φ Q n in the sense that for any sequence (δ n ) the criterion (3.2) holds ifδ n = O(δ n ) but not if δ n = o(δ n ).
(ii) Setδ Q n (s) = {n 2 (log n) −1 } −s/s and suppose 1/2 < s * < s * . For no test φ Q n is (3.2) satisfied for δ n = o(δ Q n (s)) across s * < s < s * . Moreover, if w n,j = j −1/2 , as in Q OP T n , and p n is such that {n 2 (log n) −1 } 1/3 /p n is bounded and log p n ≍ log n, then φ Q n satisfies (3.2) for δ n =δ Q n (s) across s * < s < s * . Thus,δ Q n (s) is an "optimal adaptive rate of testing for the tapering mechanism." Proof. Statement (i) and the first assertion of (ii) follow immediately from Theorems 2 and 3, respectively, of Spitzner (2008A). To prove the second assertion of (ii), first observe that the specified settings imply {n 2 (log n) −1 } 1/s /p n is bounded, sinces > 3, and S n (p n ) ≍ log p n . It follows that the sequence in (3.5.i) has which is bounded. Next set δ n =δ Q n (s), so that q n = M 1/s {n 2 (log n) −1 )} 1/s , and observe n 2 U n (p n , q n )q −s n ≍ n 2 q −s n / log p n = M −s/s (log n)/(log p n ).
Thus the conditions of statement (i) hold, andδ Q n (s) is a boundary rate for this test.
The first statement of Corollary 3.1.2 is useful to deduce the δ n → 0 such that (3.2) holds for a specific test, as is demonstrated in the proof of the second statement. The second statement is particularly important in that it establishes the concept of adaptive optimality among tests based on tapering, and characterizes the associated performance bound via the sequenceδ Q n (s). Observe that the optimal adaptive rate identified in that statement is slower than Spokoiny's adaptive-minimax rate,δ AM n (s) = o(δ Q n (s)). This of course means there are tests that would asymptotically outperform any test based on tapering in the adaptive context, such as Spokoiny's (1996) test based on thresholding or Fan's (1996) adaptive Neyman test. Nevertheless, within the class of tests based on tapering, the test based on Q OP T n , with p n as in Corollary 3.1.2.ii, is adaptively optimal. Moreover, although the condition log p n ≍ log n does not allow p n → ∞ at an arbitrarily fast rate, it nevertheless gives the dimensionality parameter fairly wide leeway. This property makes such a test particularly suited for use in FDA.
Let us also remark that adaptive-optimality can be established in the same manner as in the proof of Corollary 3.1.2.ii for the class of test statistics (1.4) with w n,j = {j(log j) γ } −1/2 such that γ < 1. The setting γ = 0, the special case that defines Q OP T n , is preferred for simplicity.

Comparisons among high-dimensional tests
In this section, the various high-dimensional tests alluded to in this article are described in detail and evaluated by simulation, with the goal of comparing the power-properties of tapering, truncation, and thresholding mechanisms.
The tests are defined here in the context of the discrete model (1.2), taking ν = 1, and are compared assuming independent, Gaussian errors with known variances. Indeed, most of the available theoretical results are derived in this context, including those of Ingster (1993) and Spokoiny (1996). (Yet Fan andLin, 1998, andFan andHuang, 2001, establish robustness of the good power-properties of Fan's, 1996, adaptive Neyman test under weaker assumptions.) To reflect the setting ν = 1, this section will revise notation to rewrite Y n,j = Y n,j , θ n,j = θ n,j , and e n,j = e n,j in (1.2). Each test is defined below by stating a test statistic; it should be assumed the corresponding test itself rejects the null hypothesis when the test statistic exceeds a fixed cutoff.

Tests based on tapering
A test based on tapering that is known to achieve Ingster's minimax rate is defined by the test statistic F ZZ n = n pn j=1 w n,j Y 2 n,j , with weights given by w n,j = 1 − j 4s ξ 2 n /(1 + j 2s ξ n ) 2 and ξ n = n −4s/s . Optimal performance of this test was first deduced in Fan, Zhang, and Zhang (2001) for a slight variation in which the test statistic is expressed as an infinite quadratic form, F ZZ ∞ n = n ∞ j=1 w n,j Y 2 n,j . However, Spitzner (2008A, ex. 2) clarifies that such performance is retained for the finite-sum version, provided n 2/s = O(p n ).
A minimax test based on a more simple quadratic form is studied in Ingster (1993). It is defined by the unweighted statistic U W Q n = n pn j=1 Y 2 n,j , and haŝ δ M n (s) = n −2s/s as a boundary rate, provided p n ≍ n 2/s . A pitfall of this test is that it is extremely sensitive to the rate at which p n increases, which puts it at a serious disadvantage in the FDA context. For instance, if p n ≍ n 2/γ for γ =s it is possible to find sequences δ n = n −(2−t)s/s with t > 0 that fail to satisfy the rates-of-testing criterion (3.2). This is a drastic degradation in performance, well beyond that incurred to achieve adaptive-minimaxity or optimal adaptive performance among tests based on tapering. The power of the test based on U W Q n is invariant on the contours of pn j=1 θ 2 n,j , a property that is sometimes touted as a practical advantage; for present purposes this property will be exploited to calibrate the simulation design.
Note that tests based on F ZZ n , F ZZ ∞ n , and U W Q n each require a precise specification of the parameter s to achieve minimax performance. Consequently, since s can rarely be specified exactly, each is difficult to implement in practice. In the present exercise, these particular tests should be regarded not as candidates for practical usage, but as conceptual references against which to compare the properties of other tests.
Two other tests based on tapering will also be evaluated. One is, of course, that based on Q OP T n , an adaptively optimal test of this class. The other is defined by the test statistic CV M n = n pn j=1 j −2 Y 2 n,j , which is motivated by the classical Cramér-von Mises statistic of goodness-of-fit testing. The Cramérvon Mises statistic is traditionally expressed as in integrated squared-difference between empirical and hypothesized distribution functions, but it is studied in Eubank and LaRiccia (1992) via the representation n pn/2 j=1 j −2 (Y 2 n,2j−1 + Y 2 n,2j ), giving rise to the statistic CV M n studied here. This test is included as an example of a widely-used high-dimensional test that exhibits rather poor performance in an FDA context. For instance, it can be shown that, due to its strong down-weighting of higher-indexed components, its best possible boundary rates areδ n = n −(2−t)s/s for t = 2{1 −s/(s + 3)} > 0.

Adaptive-minimax tests based on truncation and thresholding
Two tests that achieve Spokoiny's adaptive-minimax rate shall be considered, one based on truncation and the other on thresholding. The test based on truncation is Fan's (1996) adaptive Neyman test, whose test statistic is AN n = max k=1,...,pn (N n,k − k)/ √ k, for which N n,k = n k j=1 Y 2 n,j . Sometimes this test is interpreted via the scheme known as "Neyman's truncation," which describes the test statistic by N n,kn , viewingk n as a data-driven diagnostic; to yield AN n ,k n is that k which maximizes the standardized sums (N n,k −k)/ √ k. Other choices ofk n are discussed in Raynor and Best (1989), Eubank and Hart (1992), Eubank and LaRiccia (1992), Inglot and Ledwina (1996), Eubank (2000), Aerts, Claeskens, and Hart (2000), Claeskens and Hjort (2004), and elsewhere. Among them, the test based on AN n has the most well established rates-of-testing properties and compares the most favorably in the empirical-power investigation of Spitzner (2006), which is similar to the investigation carried out in what follows. It is shown in Fan, Zhang, and Zhang (2001) to achieve Spokoiny's adaptiveminimax rate, at least for the case where p n = n.
The test studied here based on thresholding is introduced in Spokoiny (1996). This test is readily applicable under the discrete model (1.2), but its formulation takes the continuous model (1.1) to have been translated to (1.2) using wavelet decomposition rather than Fourier decomposition. For present purposes it is unnecessary to describe the complete details of that translation, except to note that its organization of component-subscripts uses "wavelet indexing:" j = j(k, l) denotes the l'th component at the k'th level of resolution: k = 0, 1, . . . and l = 1, . . . , 2 k ; j(0, 1) = 1, j(k+1, l) = j(k, l)+2 k , and j(k, l+1) = j(k, l)+1.
The test statistic is constructed in two stages. In the first stage, a class of tests indexed by s > 1/2 is constructed as follows. Define for parameters k * n (s), k * n , and a "hard-thresholding" parameter ξ n,k (s), where µ HT (ξ) = E[η 2 I{|η| > ξ}] with η following a standard-normal distribution. The settings applied here take k * n = ⌈log 2 n⌉, k * n (s) = ⌈(2s + 1/2) −1 log 2 n⌉, and ξ n,k (s) = (k − k * n (s) + 8) log 2, where ⌈x⌉ denotes the smallest integer still to exceed x. In the second stage, a range s * < s < s * is assumed to have been specified, over which adaptively-optimal performance is to be achieved. The HT n (s) corresponding to s in that range are combined into the test statistic is HT n = max s * <s<s * HT n (s)/c n,α (s), where c n,α (s) is the cutoff for a size-α test based on HT n (s). (Note there is only a finite number of distinct HT n (s) across s * < s < s * .) The settings for k * n (s) and ξ n,k (s) given above are not those originally specified in Spokoiny (1996), but are modified versions suitable for the nonasymptotic context considered here. The original settings are k * n (s) = ⌈(2s + 1/2) −1 log 2 (M n)⌉, where M is as in B s,M , and the hard-thresholding parameter is ξ n,k (s) = 4 (k − k * n (s) + 8) log 2. With these, Spokoiny (1996) establishes that each test based on HT n (s) achieves Ingster's minimax rate, and the test based on HT n achieves Spokoiny's adaptive-minimax rate. However, Abramovich et al. (2002) remark that in the original setting for ξ n,k (s) the leading constant of four is "unreasonably high" for "finite sample situations," presumably referring to data configurations one would tend to work with in practice, and proposed it be replaced with one, as is done here. The original setting for k * n (s) is problematic in that it depends on M . Abramovich et al. (2002) handle this with application-specific, ad hoc adjustments; the setting here matches the ratio k * n /k * n (s) to the limit of what it would be under the original setting.

An empirical comparison
The power of the tests described in Sections 4.1 and 4.2 are now compared in a simulation exercise involving two classes of high-dimensional alternatives. The first class consists of "spiked" alternatives, for which θ j is taken to have θ j = √ λ for j = j 0 and θ j = 0 otherwise, where j 0 is a subscript that indexes the class, and λ is set to calibrate each alternative so that the power of the test based on U W Q n is 0.4. The second class consists of "smooth" alternatives, which are parameterized according to θ spiked class of alternatives. This formula is derived from the inverse-CDF of the beta distributions, and is such that the partial sum J j=1 θ 2 j is approximately 80% of its value at J = p n when J is approximately 100b% of p n . The class is indexed by the parameter b. Examples of each type of alternative are displayed as upper-right insets in the panels of Figure 4.
Each spiked-alternative satisfies the technical property of maximally compressing non-zero components into higher indices among alternatives with constant ∞ j=1 j 2s θ 2 j . Mathematical proofs of minimaxity single out these alternatives as yielding the minimum power, which is to be maximized (cf. th. 4 of Fan, Zhang, andZhang, 2001, andth. 1 of Spitzner, 2008A). In other words, the class of spiked alternatives represent those alternatives that are the hardest to distinguish. These may be of primary interest in some specialized FDA applications, such as those involving PET-fMRI images (cf. Abramovich et al., 2002). On the other hand, alternatives in the smooth class are idealized representations of those of primary concern in more typical FDA testing problems. For instance, in the Canadian temperature example, interest centers on such large-scale attributes as the average yearly temperature and the differential between winter and summer temperatures, attributes that are expressed in the lower-indexed θ j . The smooth alternatives reflect this and similar situations by expressing the departure from the null hypothesis mainly through these lowindexed components.
The simulation design is such that the models examined each have dimensionality p n = 127, which was selected to accommodate seven levels of resolution in the wavelet-indexing scheme, at k = 0, 1, . . . , 6. (This value is not atypical in FDA applications; e.g., p n = 100 in Fan and Lin, 1998, and p n = 124 in Spitzner, Marron, and Essick, 1998.) The sample-size parameter is set to n = 64, which was chosen since the statistic HT n then has k * n = 6, so that the test is sensitive to all available levels of resolution.
Simulated power is calculated at twenty alternatives from each of the spikedand smooth-alternative classes. The specific alternatives evaluated are such that their corresponding class indices are roughly evenly-spaced across the ranges j 0 = 1, . . . , p n for the spiked class and 0.01 ≤ b ≤ 0.80 for the smooth class, including the stated endpoints.
Some intricacy is required to deduce reasonable specifications for the parameters of the Sobolev geometry, B s,M , which are needed to construct the statistics F ZZ n and HT n . For spiked alternatives, the bound in B s,M is set to M = λ(p n +1), so that the Sobolev norm of the alternative indexed at j 0 = p n +1 would have ∞ j=1 j 2s θ 2 j = M at s = 1/2, if such a setting for s were allowed. Then fixing M at this value, the parameter s is set at different values for different alternatives: for j 0 = 2, . . . , p n , the parameter is s = log(p n + 1)/{2 log j 0 }, which solves ∞ j=1 j 2s θ 2 j = M ; for j 0 = 1, s is set to its value at j 0 = 2. For smooth alternatives, M is first set to the value pn j=1 j 2s θ 2 j obtained with s = 1/2 and the θ j defined by indexing at b = 0.81. The parameter s is then set numerically at each alternative evaluated, for which b ≤ 0.80, to the value that solves ∞ j=1 j 2s θ 2 j = M . The bounds s * and s * , which are required to construct HT n , are taken as the lower and upper values of s calculated in the scheme above, treating the two classes of alternatives separately. The selected spiked alternatives yield the bounds s * = 0.5008 (at j 0 = 127) and s * = 1.1667 (at j 0 = 1); the selected smooth alternatives yield s * = 0.5017 (at b = 0.80) and s * = 2.4680 (at b = 0.01). Under these settings, the parameter k * n (s) varies within a very narrow range: k * n (s) = 4, 5 for spiked alternatives, and k * n (s) = 3, 4, 5 for smooth alternatives. Consequently, the simulated power of the test based on HT n is nearly identical to that of each test based on HT n (s).
Simulated power curves are displayed in the panels of Figure 4 for tests based on F ZZ n , U W Q n , Q OP T n , CV M n , AN n , and HT n . Test-statistic cutoffs and the quantities µ HT (ξ n,k (s)) of the statistic HT n were calculated by simulation, and all tests were carried out at the α = 0.05 level. Every simulation used a minimum of 250,000 iterations. The left panel of Figure 4 displays results for the spiked alternatives, and the right panel those of the smoothed alternatives. One should find that the test based on U W Q n serves well as a basis of comparison for the other tests, observing that its simulated power curve is near-constant at the value 0.4.
Examining first the results for spiked alternatives displayed in the left panel, the benefit of the test based on HT n against spiked alternatives is glaring. Whereas the simulated power curve of every other test follows the same pattern of starting out high at low values of j 0 then dropping sharply and later evening out well below 0.4, that of HT n is nearly flat at 0.6, representing a consistent 50% increase in power above that of the test based on U W Q n . Among the remaining tests, the simulated power curves of the tests based on F ZZ n , Q OP T n , and AN n appear quite similar, and reflect performance that is not altogether poor, especially at the lower values of j 0 . The test based on CV M n , however, exhibits exceptionally poor performance.
The picture changes drastically for the smooth alternatives. In the right panel, the considerable advantage of the test based on HT n against spiked alternatives disappears; its simulated power, in fact, appears generally low relative to the other tests, and the same can be said of the test based on AN n , although to a lesser degree. To put it another way, it is the tests based on tapering that exhibit superior performance in this context, specifically those based on F ZZ n and Q OP T n . The test based on F ZZ n outperforms that based on Q OP T n , and this is not altogether surprising given that the former satisfies a stronger asymptotic property. Simulated power of the test based on CV M n is still quite poor, although not as dismal as it is for spiked alternatives.
Though the tests based on truncation and thresholding are known to have superior asymptotic performance, these empirical results suggest quite clearly that the higher-order asymptotic factors wash out slowly, and may very well have a substantial influence in data configurations one is likely to encounter in practice. All of the tests are shown to be sensitive to the shape of the alternative, and the test based on thresholding appears especially sensitive in this regard. These observations substantiate a recommendation to the analyst that when choosing among high-dimensional test in FDA it is prudent to consider the types of departures from the null hypothesis one is most interested in detecting. The test based on Q OP T n may best suit the goals of the analysis, despite its asymptotic inferiority to truncation and thresholding, and it likely will in typical FDA applications where the primary interest is in the large-scale attributes of the model. Let us briefly return to an observation made in the Canadian temperature example. As has been suggested, the focus on large-scale attributes puts the goals of the analysis in line with detecting smooth-shaped alternatives, and, in light of the current simulation results, the test based on F global is best suited to that purpose. However, recall that the test rejected H 0 on the example data, and exploratory analysis indicated the presence of a spiked-shaped alternative. Thus we have a practical illustration of a test based on tapering evidently detecting an alternative among those hardest to distinguish, thereby further demonstrating the global power of the test.

Concluding discussion
Through an application example and both theoretical and empirical power investigations, this article has demonstrated the benefits of the tapering mechanism in FDA, and the test based on Q OP T n in particular. It has been shown how tests based on tapering may be constructed on a functional linear model, using Fourier decomposition to first translate to a high-dimensional discrete model. Intuition for the discrete model and tests based on tapering have been discussed through an example analysis of the Canadian temperature data. Applying criteria from rates-of-testing theory, it has been shown that the weight setting w n,j = j −1/2 in (1.4), which defines Q OP T n , represents an adaptively optimal configuration among tests based on tapering. Moreover, the test based on Q OP T n retains optimality over a wide range of rates at which p n → ∞, a property that is particularly advantageous in the FDA context. The discussion reiterates that any test based on tapering can be outperformed asymptotically by tests based on truncation or thresholding. Nevertheless, an empirical investigation in a non-asymptotic context has demonstrated that high-order effects may be non-negligible in practice, and that the test based on Q OP T n has superior power against a class of alternatives that would be of natural interest in many FDA applications.
While this article argues in support of the tapering mechanism on the basis of its power properties, there are a number of reasons completely separate from power why an analyst would, at the outset, want to restrict himself or herself to tests based on tapering. The first is that the test statistic (1.4) may arise through a formal Bayesian construction, as a monotone transformation of a posterior null probability. This is shown in Spitzner (2008B), where the rates-of-testing framework is developed entirely within a Bayesian context. It should be noted that there are existing Bayesian constructions of the thresholding mechanism for use in estimation, e.g., in Abramovich et al. (2007), which suggest the statistic HT n might be viewed as a summary metric applied a Bayesian estimator. This falls short of producing a formal Bayesian test based on HT n , however, since it fails to represent HT n as a monotone transformation of a posterior null probability.
Another reason that the tapering mechanism is attractive is because of its straightforward intuition. The non-expert, once appreciating the goal to exploit smoothness but still retain global power, would more quickly embrace the intuition underlying the use of tapering before truncation or thresholding. Tapering is easy to understand as it directly incorporates all components of the model, while explicitly down-weighting the high-indexed ones. The latter mechanisms are more complicated, harder to connect to the goal, and easy to apply incorrectly. For these reasons, tapering may be the preferred recommendation in consulting or interdisciplinary situations, and may offer the best insurance for continued correct implementation once the statistician's involvement in a project wanes.
Finally, the test statistic (1.4) is convenient in that it may be treated through well-known analytical approximations (see, e.g., Mathai and Provost, 1992, sec. 4.6), rather than simulation. This may be a trivial advantage in straightforward testing problems, given modern computing power, but it can help enormously when developing more complicated high-dimensional statistical procedures. For instance, Spitzner and Marshall (2009) make critical use of distributional approximations of quadratic forms to develop a high-dimensional sequential monitoring procedure based on tapering.
The results of this article encourage the use of the test based on Q OP T n in FDA, and establish a role for the tapering mechanism in rates-of-testing theory. It is hoped that the ideas presented here are also found helpful in clarifying the role of smoothness constraints in high-dimensional problems, and in providing an accessible methodology with which to exploit them.