A partial overview of the theory of statistics with functional data
Introduction
The problems of statistical inference can be primarily classified according to the nature of the sample space (where the available data live) and that of the parameter space, where the target “parameter” is supposed to belong. To a certain extent, the progress of mathematical statistics can be described in terms of the conquest of new, broader, more sophisticated structures for both spaces, in particular those corresponding to infinite-dimensional spaces. The denomination Abstract Inference was used by Grenander (1981) to provide a particularly insightful view of this progress towards generality in statistical theory.
From this perspective, the theory of statistics with functional data, often denoted Functional Data Analysis (FDA), corresponds to a last-generation statistics where the sample space (and, in many cases, also the parameter space) is an infinite-dimensional function space. So, according to the above-mentioned classification, FDA could be placed in the general development of statistical theory as indicated in the following informal sketch (where n denotes the sample size):
Hence, in simple words, we might say that FDA usually refers to those statistical problems where the available data consist of a sample of n functions defined on a compact interval of the real line, say [0, 1]. Additional (real or functional) variables are often incorporated, for example in regression models.
Other more sophisticated models are possible, where [0, 1] is replaced by a d-dimensional interval or the sample functions are vector-valued. Also, FDA bears some affinity with those statistical problems, often referred to as “inference in stochastic processes”, where the sample information is given by a partial trajectory x(t), t ∈ [0, T], of a stochastic process. In this case, the length T of the observation interval plays the role of the sample size n.
Of course, FDA theory has incorporated many standard tools of classical parametric or multivariate statistics, for example the dimension reduction tools. However, the infinite-dimensional nature of the sample space poses special problems which allow us to classify FDA as a genuinely new branch of statistical theory.
The book by Ramsay and Silverman (2005), whose first edition was published in 1997, must be cited as a major landmark in the history of FDA. This book has a practical orientation, targeted to a wide scientific audience. It has played a crucial role in the popularization of FDA. The associated software, freely provided by the authors, soon became an effective toolbox for an increasing number of researchers, flooded by a new abundance of experimental data coming from on-line monitoring of different experiments. Another book by the same authors, Ramsay and Silverman (2002), is focussed on the illustration of the main FDA techniques through the analysis of specific case studies with real data.
The book by Ferraty and Vieu (2006) represented a second-generation view of the subject. It incorporates further mathematical insights, including a more detailed treatment of the non-trivial asymptotic issues involved in FDA, together with some discussion of several relevant issues such as the use of semi-metrics and the so-called “small ball probabilities” phenomenon, which is at the root of many theoretical difficulties in FDA. However, again, the practical aspects played a major role among the aims of this book.
The French school of FDA includes many other references deserving mention: some researchers are grouped in the STAPH team (www.math.univtoulouse.fr/staph/), whose earliest contribution to the topic is perhaps the paper by Dauxois et al. (1982), a pioneering contribution to the study of principal components for functional data.
The paper by Bosq (1991) is another path-breaking reference on the topic (not considered here) of functional-valued time series. The corresponding general theory of auto-regressive functional processes is given in Bosq (2000). The monograph by Bosq and Blanke (2007) deals mainly with the use of nonparametric approaches in statistical functional problems. Whereas the orientation of these two books is mostly theoretical, they are both, in a way, extremely practical as they jointly provide a fascinating account of the main mathematical tools involved in FDA.
The book by Horváth and Kokoszka (2012) is a fresh addition to the current general literature on FDA. It offers a well-balanced mixture of theoretical aspects (e.g., the useful Chapter 2 on Hilbert space theory) and applications (in particular, detailed discussions of real data examples and up-to-date information on software). About 40% of the book length (from Chapters 13 to 18) is devoted to the analysis of dependent functional data, including functional time series, change point detection and spatial statistics with functional data.
Special issues devoted to FDA topics have been published by different journals, including Statistica Sinica, 14, 3 (2004), Computational Statistics, 22, 3 (2007), Computational Statistics & Data Analysis, 51, 10 (2007), and Journal of Multivariate Analysis, 101, 2 (2010).
Among the survey and overview papers, let us mention, e.g., Rice (2004), Müller (2005), González-Manteiga and Vieu (2011) (which includes an extensive bibliography), and Delsol et al. (2011) (especially oriented to practical issues and real-data applications).
The recent collective book Ferraty and Romain (2011) consists of 16 chapters, from different authors, with up-to-date surveys of the main topics of FDA.
The aim is to provide a personal perspective of the current theory and practice of FDA. Such a view is necessarily limited by several obvious constraints, including the space limitations and the author's awareness of the different subjects. This accounts for the use of the term “partial” in the title. In particular, no claim of bibliographical completeness is made (I apologize for any omissions). In fact, since FDA is a vast topic with many different facets, I realize that there would probably be room for another paper, almost disjoint from this one, written under a similar title by another author with different interests and experience.
This paper is a sort of mixture of a tutorial and an up-to-date survey of FDA: it is somewhat of a tutorial since it is not targeted to a readership of specialists in FDA but rather to a broader audience of statisticians, not necessarily familiar with the subject. This places readability and a certain degree of self-containedness as major priorities of this paper. In the survey aspect, an effort has been made to provide a reasonably wide range of topics and, as a rule, the most recent references have been preferred over the older ones.
The organization of the paper tries to evoke the familiar structure of so many classical books of statistics: thus, some “descriptive” aspects (not related to inference notions), concerning the structure and representation of the data, are discussed in Section 2; the probabilistic basis of FDA theory is briefly outlined in Section 3; the location and centrality parameters (mean, median and mode) are discussed in Section 4; regression, classification and dimension reduction techniques are analyzed in Sections 5, 6 and 7, respectively. The functional resampling methodologies are reviewed in Section 8. Some remarks on the available software are made in Section 9.
The data
Unlike classical statistics (where there is usually little discussion on the nature of the data), in FDA one might reasonably ask: Does such a thing as functional data really exist? The question has some relevance as, in practice, when a process is monitored, one usually records the values on a discrete grid. So, in the end, one always has a, possibly high-dimensional, vector observation. There are at least two reasons that could lead us to consider this as a
Some probability background
Many classical books of mathematical statistics used to include some introductory chapters devoted to the probability foundations of the subject. Indeed there are some good reasons for this, since statistical inference can be seen as a second step after proposing a probability model for a random experiment. In the case of FDA, since the data are functions, the observed “random variables” are in fact stochastic processes, i.e., random elements taking values in a function space. Therefore some
The definitions
Regarding the mean, similar to the real-valued case, the integral-based definition of the “functional” expectation (see Section 3) allows for the usual motivation in terms of projections: If X is a random element taking values in a Hilbert space $H$ and $E\|X\|^2<\infty$, then the mean of X, $m=E(X)$, fulfills $E\|X-m\|^2=\min_{a\in H}E\|X-a\|^2$; see, e.g., Bosq (2000, pp. 38–41), for a broader discussion of this, including the extension to conditional expectations. This property is behind the interpretation of m in terms
Functional regression
According to the standard setup, a typical (random design) functional regression model has the form $Y = g(X) + \varepsilon$, where X is the explanatory (functional) variable, Y is the output (response) variable, g is a (usually unknown) function and $\varepsilon$ is the error, which is often assumed to fulfill $E(\varepsilon \mid X) = 0$.
The aim is to estimate g from a random sample $(X_i, Y_i)$, i=1,…,n.
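To make the estimation problem concrete, the following is a minimal sketch of one possible nonparametric approach: a kernel (Nadaraya-Watson type) estimator based on the L2 distance between curves. It is not taken from the paper; the function names, the Gaussian kernel, the bandwidth value and the simulated data are all illustrative assumptions, and the curves are assumed to be observed on a common uniform grid of [0, 1].

```python
import numpy as np

def l2_distance(x, curves, grid):
    """Riemann-sum approximation of the L2 distance between the curve x
    and each row of `curves`, all sampled on the same uniform grid."""
    step = grid[1] - grid[0]
    return np.sqrt(np.sum((curves - x) ** 2, axis=1) * step)

def kernel_regression(x_new, X, Y, grid, bandwidth):
    """Kernel-weighted average of the responses Y, with weights that decay
    with the L2 distance between x_new and the sample curves X."""
    d = l2_distance(x_new, X, grid)
    weights = np.exp(-0.5 * (d / bandwidth) ** 2)  # Gaussian kernel (assumed choice)
    return np.sum(weights * Y) / np.sum(weights)

# Simulated sample (hypothetical): X_i(t) = a_i sin(2 pi t), Y_i = int X_i(t)^2 dt + noise.
rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 101)
n = 200
a = rng.uniform(-1.0, 1.0, size=n)
X = a[:, None] * np.sin(2 * np.pi * grid)[None, :]
g_true = np.sum(X ** 2, axis=1) * (grid[1] - grid[0])
Y = g_true + 0.05 * rng.normal(size=n)

# Estimate g at a new curve; here the noiseless value is 0.125.
x_new = 0.5 * np.sin(2 * np.pi * grid)
estimate = kernel_regression(x_new, X, Y, grid, bandwidth=0.05)
print(estimate)
```

In practice the bandwidth would be chosen by some data-driven rule (for instance cross-validation), and the L2 distance could be replaced by a semi-metric better suited to the data.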
Supervised and unsupervised functional classification
In statistics, the word classification has the same double meaning as in ordinary language, where this term stands both for “to assign (an element) to a particular class or category” and for “to arrange (a group of elements) in classes according to shared characteristics”. The first meaning would correspond to the statistical methodology called supervised classification or discriminant analysis. The second one would better fit the clustering methodology which roughly corresponds
On dimension reduction techniques in FDA
A natural idea in high-dimensional or functional data is to transform the sample data into elements of low-dimensional spaces, thus allowing for a simpler statistical treatment. Of course the use of projections via linear functionals, either systematic or random (see Sections 3.6 and 4.3), is a possible, sometimes very useful, option. Let us now consider here other more standard versions of this idea (still relying on linear
The bootstrap in the functional setup
Today the bootstrap is perhaps the most popular among the resampling methodologies. The basic ideas behind the bootstrap are now commonplace in the statistical community. However, in order to present here a survey of results on functional bootstrap we need to briefly recall some fundamentals.
It is well-known that, essentially, the bootstrap is a tool aimed at approximating the sampling distribution of a (usually re-scaled) statistic. The goal of such approximations is often to construct confidence
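As a rough illustration of the basic resampling idea recalled above, the sketch below applies the naive bootstrap (resampling whole curves with replacement) to the re-scaled statistic sqrt(n) times the supremum distance between the bootstrap sample mean and the original sample mean, and uses the resulting quantile to draw a band around the mean curve. This is only a sketch under stated assumptions: the function name, the choice of the sup norm, the simulated data and the number of replicates are hypothetical, and the theoretical validity of such a band depends on conditions not checked here.

```python
import numpy as np

def bootstrap_sup_quantile(X, n_boot=500, level=0.95, seed=0):
    """Return the sample mean curve and the bootstrap `level`-quantile of
    sqrt(n) * max_t |bootstrap mean(t) - sample mean(t)|."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    mean_hat = X.mean(axis=0)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample curves with replacement
        stats[b] = np.sqrt(n) * np.max(np.abs(X[idx].mean(axis=0) - mean_hat))
    return mean_hat, np.quantile(stats, level)

# Simulated curves on a grid of [0, 1] (hypothetical data).
rng = np.random.default_rng(1)
grid = np.linspace(0.0, 1.0, 51)
n = 100
X = (np.sin(2 * np.pi * grid)
     + rng.normal(size=(n, 1)) * np.cos(np.pi * grid)
     + 0.1 * rng.normal(size=(n, grid.size)))

mean_hat, q = bootstrap_sup_quantile(X)
half_width = q / np.sqrt(n)  # undo the sqrt(n) re-scaling
lower, upper = mean_hat - half_width, mean_hat + half_width
```

The band has constant width because the sup norm treats all points of the grid equally; a pointwise-studentized statistic would give a band of varying width.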
Software for FDA: the R package fda.usc
The practical use of FDA methodologies, with almost no exception, relies heavily on the availability of friendly, reasonably comprehensive software. The R package fda.usc, prepared by Manuel Febrero and Manuel Oviedo (University of Santiago de Compostela, Spain), is a recent valuable contribution in this regard. This software incorporates and extends the previous R package fda (see Ramsay et al., 2009) and the R functions provided by Ferraty and Vieu (2006) as a complement for their book. The
Acknowledgments
This work was partially supported by Spanish Grant MTM2010-17366.
The corrections, insights and additional references provided by an anonymous referee led to a much improved manuscript.
I am deeply indebted to my co-workers in FDA subjects: A. Baíllo, J.R. Berrendero, J. Cuesta-Albertos, M. Febrero-Bande and R. Fraiman. This paper is dedicated to them.
References
- et al. Using principal components for estimating logistic regression with high dimensional multicollinear data. Computational Statistics & Data Analysis (2006)
- et al. Estimation and inference in functional mixed-effects models. Computational Statistics & Data Analysis (2007)
- et al. Kernel-based functional principal components. Statistics & Probability Letters (2000)
- et al. The random projection method in goodness of fit for functional data. Computational Statistics & Data Analysis (2007)
- et al. Impartial trimmed k-means for functional data. Computational Statistics & Data Analysis (2007)
- et al. The random Tukey depth. Computational Statistics & Data Analysis (2008)
- et al. An ANOVA test for functional data. Computational Statistics & Data Analysis (2004)
- et al. On the use of the bootstrap for estimating functions with functional data. Computational Statistics & Data Analysis (2006)
- et al. On depth measures and dual statistics. A methodology for dealing with general data. Journal of Multivariate Analysis (2009)
- et al. On the using of modal curves for radar waveforms classification. Computational Statistics & Data Analysis (2007)
- Asymptotic theory for the principal component analysis of a vector random function: some applications to statistical inference. Journal of Multivariate Analysis
- Functional PLS logit regression model. Computational Statistics & Data Analysis
- Measures of influence for the functional linear model with scalar response. Journal of Multivariate Analysis
- Regression when both response and predictor are functions. Journal of Multivariate Analysis
- Quantiles for finite and infinite dimensional data. Journal of Multivariate Analysis
- Bootstrap in functional linear regression. Journal of Statistical Planning and Inference
- Best approximations to random variables based on trimming procedures. Journal of Approximation Theory
- Testing the stability of the functional autoregressive process. Journal of Multivariate Analysis
- Detecting changes in functional linear models. Journal of Multivariate Analysis
- On the kernel rule for function classification. Annals of the Institute of Statistical Mathematics
- Optimal testing in a fixed-effects functional analysis of variance model. International Journal of Wavelets, Multiresolution and Information Processing
- Testing in mixed-effects FANOVA Models. Journal of Statistical Planning and Inference
- Topics in Stochastic Processes
- Supervised classification for a family of Gaussian functional models. Scandinavian Journal of Statistics
- Classification methods for functional data
- Partial least squares for discrimination. Journal of Chemometrics
- Detecting changes in the mean of functional observations. Journal of the Royal Statistical Society B
- Rates of convergence of the functional k-nearest neighbor estimate. IEEE Transactions on Information Theory
- On the performance of clustering in Hilbert spaces. IEEE Transactions on Information Theory
- Some asymptotic theory for the bootstrap. Annals of Statistics
- Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics
- Modelization, nonparametric estimation and prediction for continuous time processes
- Inference and Prediction in Large Dimensions
- Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics
- Adaptive inference for the mean of a Gaussian process in functional data. Journal of the Royal Statistical Society B
- Convergent estimators for the L1 median of a Banach valued random variable. Statistics
- Prediction in functional linear regression. Annals of Statistics
- Optimal estimation of the mean function based on discretely sampled functional data: phase transition. Annals of Statistics
- The Dantzig selector: statistical estimation when p is much larger than n (with discussion). Annals of Statistics
- Functional linear regression
- Nearest neighbor classification in infinite dimension. ESAIM: Probability and Statistics
- On a geometric notion of quantiles for multivariate data. Journal of the American Statistical Association
- Nonlinear manifold representations for functional data. Annals of Statistics
- Functional response models. Statistica Sinica
- Smoothing splines estimators for functional linear regression. Annals of Statistics
- Multiway ANOVA for functional data. Test
- A sharp form of the Cramér–Wold theorem. Journal of Theoretical Probability
- Impartial trimmed means for functional data