Utilities for Statistical Computing in Functional Data Analysis : The R Package fda . usc

This paper is devoted the R package fda.usc which includes some utilities for functional data analysis. This package carries out exploratory and descriptive analysis of functional data analyzing its most important features such as depth measurements or functional outliers detection, among others. The R package fda.usc also includes functions to compute functional regression models, with a scalar response and a functional explanatory data via nonparametric functional regression, basis representation or functional principal components analysis. There are natural extensions such as functional linear models and semi-functional partial linear models, which allow non functional covariates and factors and make predictions. The functions of this package complement and incorporate the two main references of functional data analysis: The R package fda and the functions implemented by [Ferraty and Vieu (2006)].


Introduction
The technological progress has led to the development of new, quick and accurate measurement procedures.As a consequence, further possibilities for obtaining experimental data are now available and some classical paradigms must be revised.For example, it is now possible (and even frequent) to find problems where the number of data points is greater than the number of variables.In many areas it is common to work with large databases, which often increase these observations of a random variable taken over a continuous interval (or in increasingly larger discretizations of a continuous interval).
For example, in fields such as spectroscopy, the measurement result is a curve that, at least, has been evaluated in 100 points.This type of data, which we call functional data arise naturally in many disciplines.In economics, one could consider curves intra-day stock quotes.
In environmental studies, one could find continuous measurements of atmospheric monitoring networks.The importance of functional data in image recognition or spatio-temporal information is also well-known.
Undoubtedly, package fda (Ramsay, Wickham, Graves, and Hooker 2012) is a basic reference to work in R programming environment (R Development Core Team 2012) with functional data.The books by Ramsay and Silverman (2005) and Ramsay and Silverman (2002) are well-known references in this field.In both cases all the techniques included are restricted to the space of L 2 functions (the Hilbert space of all square integrable functions over a certain interval).The book by Ferraty and Vieu (2006) is another important reference incorporating non-parametric approaches as well as the use of other theoretical tools such as semi-norms and small ball probabilities that allow us to deal with normed or metric spaces.These authors are part of the French group STAPH maintaining the page http://www.lsp.ups-tlse.fr/staphwhere R software can be downloaded.
Other R software provides implementation of functional data analysis (FDA) in different areas.The package rainbow (Shang and Hyndman 2012) for functional data representation, the package ftsa (Hyndman and Shang 2012) for functional time series analysis, the package refund (Crainiceanu, Reiss, Goldsmith, Huang, Huo, and Scheipl 2012) for functional penalized regression, the package fpca (Peng and Paul 2011) implements the restricted maximum likelihood estimation for functional principal components analysis (FPCA), the package MFDF (Dou 2009) for modeling functional data in finance via generalized linear models (GLM), the package fdaMixed (Markussen 2011), for FDA in a mixed model framework and the package geofd (Giraldo, Delicado, and Mateu 2010) for spatial prediction for function value data.In MATLAB PACE package (see http://anson.ucdavis.edu/~ntyang/PACE/)provides implementation of various methods using FPCA as generalized functional linear models (GFLM) for sparsely sampled functional data, this package is also in R (Liu 2011).Finally, from a Bayesian perspective, Crainiceanu and Goldsmith (2010) have developed tools for GFLM using WinBUGS.
The aim of the package fda.usc is to provide a broader, flexible tool for the analysis of functional data.Therefore, we propose an integration of the work on non-parametric functional methods implemented by Ferraty and Vieu (2006) with those from the group of the University of Santiago de Compostela (USC), thus complementing and extending some of the functions of package fda.The fda.usc package is available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=fda.uscand it has the greedy objective of being an integrate framework for all the functional data methods available.
In Section 2.1 we introduce a new R class for functional data: "fdata".Section 2.2 describes the functional representation of an object of class "fdata" by basis representation or kernel smoothing.Section 2.3 is focused in normed and semi-normed functional spaces and in how the metrics and semimetrics can be used as measures of proximity between functional data.Section 2.4 is concerned with the concept of statistical depth for functional data see Cuevas, Febrero-Bande, and Fraiman (2007).These depth measures are useful to define location and dispersion measures, for classification and for outlier detection (Febrero-Bande, Galeano, and González-Manteiga 2008) (Section 2.6).Section 3 is devoted to functional regression models with scalar response from different perspectives: Parametric, using basis representation (Section 3.1, 3.2 and 3.3), non-parametric, using kernel smoothing (Section 3.4) and semi-parametric which combines both (Section 3.5).The Section 3.6 summarizes the functional regression models using a classical example to show the capabilities of the package fda.usc.Finally, Section 4 discusses the most important features of the package.

Functional data: Definition and descriptive analysis
After a brief introduction of the state of the art in functional data, we describe the most important features, procedures and utilities of the new package.This article is not intended to enter into many methodological details because most of the functions of the package are adaptations of original functions studied in previous publications.

Functional data definition: The new R class "fdata"
A functional variable X is just a random variable taking values in a functional space E. Thus, a functional data set is just a sample {X 1 , . . ., X n } (also denoted X 1 (t), . . ., X n (t) when convenient) drawn from a functional variable X .Usually, E is assumed to be a normed or seminormed metric space.If E is a Hilbert space, functional data can be represented using a basis.This is the approach of the fda package for its methodological developments and it is also used by refund, geofd or MFDF packages, among others.In practice, this representation projects an infinite dimensional space into a finite one and has the drawback of the approximation of the original data.However, the functional space E may not be Hilbertian and in this case, a more flexible representation of functional data is needed.For this reason, this package introduces the new class "fdata".This new class uses only the values evaluated on a grid of discretization points {t 1 , . . ., t m } that can be observed in a non-equispaced way.The drawback now is that every calculation (distance, norms,...) must be made through numerical approximations and the design or the density of this grid could affect the accuracy of the calculations.If functional data is sparsely sampled or there are many missing data points, probably the representation in a basis is mandatory and some other alternatives, as for example PACE package, should be considered.
The fda.usc package defines an object called fdata as a list of the following components: data: Typically a matrix of (n, m) dimension which contains a set of n curves discretized in m points or argvals.argvals: Locations of the discretization points, by default: {t 1 = 1, . . ., t m = m}.rangeval: Range of discretization points, by default: range(argvals).names: Optional list with three components: main, an overall title, xlab, a title for the x axis and ylab, a title for the y axis.
All the methods in the package can work with the new class "fdata".Some basic operations to handle this new class are introduced.For example, generic methods for the math group abs, sqrt, round, exp, log, cumsum; for the OPS group "+", "-", "*", "/", "==", "<", '>", "<=", '>="; the for summary group all, any, sum, min, max; and other operations is.fdata(), c(), dim(), ncol(), nrow(), plot(), inprod.fdata()and norm.fdata().To calculate the derivative of functional data the fdata.deriv()function has been implemented, which computes the derivative by different methods, some purely numerical and others based on a The function fdata() converts a data object of class: "fd", "fds", "fts", "sfts" or other formats ("vector", "matrix", "data.frame") to an object of class "fdata".We have included in the "fdata" format some popular datasets of functional data: tecator, phoneme, poblenou and aemet.As an illustration we applied some of above methods to the tecator dataset (data("tecator")) that has two components.The first component, tecator$absorp.fdata,includes the curves of absorbance for the analysis of pieces of meat stored in the format of class "fdata".Each curve consists of a m=100-channel absorbance spectrum measured along the discretized points: argvals = {t 1 = 850, . . ., t m = 1050}.The second component, tecator$y, is a dataframe including Fat, Water and Protein contents of each spectrometric curve obtained by an analytical chemical processing.The tecator dataset presents interesting features and it is a well known example in functional data analysis (available at http:// lib.stat.cmu.edu/datasets/tecator).The following code displays the absorbance curves colored by the fat content, see Figure 1.
R> library("fda.usc")R> data("tecator") R> names(tecator) [1] "absorp.fdata" "y" R> absorp <-tecator$absorp.fdataR> Fat20 <-ifelse(tecator$y$Fat < 20, 0, 1) * 2 + 2 R> plot(tecator$absorp.fdata, col = Fat20) R> absorp.d1<-fdata.deriv(absorp,nderiv = 1) R> plot(absorp.d1,col = Fat20) The representation of a functional dataset should be consistent with the physical interpretation of the phenomenon being described and so it is important to choose the space where the data must be considered.For example, in left panel of Figure 1, the spectrometric curves are plotted showing with different colors two levels of fat content.This representation which implicitly assumes a L 2 space, is not related with the information of fat content.In other words, the vertical shift of these curves has no special relation with the fat content.So, if we are interested in the relationship between the spectrometric curve and the fat content, probably another choice of functional space would be more advisable as shown, for example, in the right panel of Figure 1 with the first derivative of the spectrometric curves.

Smoothing
If we assume that our functional data Y (t) is observed through the model: where the residuals (t) are independent from X(t), we can get back the original signal X(t) using a linear smoother, x = n i=1 s ij y i or x = Sy where s ij is the weight that the point t j gives to the point t i and y i = Y (t i ).In the package two procedures have been implemented for this purpose.The first one is the representation in a L 2 basis (or penalized basis) and the second one is based on the smoothing kernel methods.

Basis representation
A curve can be represented by a basis when the data is assumed to belong to L 2 space.A basis is a set of known functions {φ k } k∈N that any function could be arbitrarily approximated by a linear combination of a sufficiently large number k n of these functions, see Ramsay and Silverman (2005, p. 43 and 44).
The procedure recovers the function X(t) by using a fixed truncated basis expansion in terms of k n known basis elements, The projection (or smoothing) matrix is given by: S = Φ(Φ Φ) −1 Φ , with degrees of freedom of the fit df = tr(S ν ) = k n .
If smoothing penalization is required, a parameter λ will also be provided and in this case the projection matrix S is: S = Φ(Φ Φ + λR) −1 Φ , where R is the penalization matrix.Typically R penalizes the integral of the square of the derivative of order 2.
There are several different types of basis (Fourier, B-spline, Exponential, power, polynomial, . . . ) that can be created using the function create.fdata.basis()or directly through the create.basis-name.basisfrom the fda package.This package also includes the basis functions based on the data: create.pc.basis() and create.pls.basis() that provides a functional principal component and partial least square basis, respectively.The choice of the basis should be based on the objective of the analysis and on the data.For example, it is common to use the Fourier basis for periodic data, and B-spline basis for non-recurrent data.Also, the function fdata2fd() converts an object of class "fdata" into an object of class "fd" using the basis representation shown in (1).

Kernel smoothing
The non-parametric methodology and in particular, the kernel smoothing method, can also be used to represent functional data.Now, the non-parametric smoothing of functional data is given by the smoothing matrix S = (s ij ).For example, for the Nadaraya-Watson estimator (function S.NW()): where K(•) is the kernel function.Other possibilities for S are the k nearest neighbors estimator (function S.KNN()) or the local linear regression estimator (function S.LLR()), see Wasserman (2006), Ferraty and Vieu (2006) for futher details.Many different types of kernels are considered in the package: Gaussian, Epanechnikov, triweight, uniform, cosine or any user-defined.

Validation criterion
The choice of the smoothing parameter is crucial and, in principle, there is no universal rule that would enable an optimal choice.Among the different selection criterion to select the parameter ν, we have implemented two: Cross-validation (CV) and generalized crossvalidation (GCV).

Cross-validation : CV
where ŷν i(−i) indicates the estimator based on leaving out the i pair (t i , y i ) and w i is the weight at point t i .This criterion is implemented by the function CV.S().
The purpose of the function min.basis() is to represent the functional data in terms of a (truncated) expansion with respect to a given basis of functions in the corresponding space.Such expansions depend on a parameter ν = (k n , λ), where k n is the number of basis elements used in the truncated expansion and λ is the penalization parameter.In the following example (see below R code), phoneme$learn is smoothed using min.basis()function.

Measuring distances
Sometimes, it is difficult to find the best plot to summarize a particular functional dataset because the shape of the graphics depends strongly on the chosen proximity measure.As shown in Figure 1, the plot of X(t) against t is not necessarily the most informative and perhaps, considering another distance between curves we can obtain better results.This package collects several metric and semi-metric functions which allow us to extract as much information possible from the functional variable.
The most general spaces for functional data are the complete metric spaces where only the notion of distance between elements of the space is given.If the metric d(•, •) is associated with a norm (so that d(X(t), Y (t)) = X(t) − Y (t) ) we have a normed space (or a Banach space).In some important cases the norm . is associated with an inner product ., . in the sense that X = X, X 1/2 .A complete normed space (Banach space) whose norm derives from an inner product is called a Hilbert space.The best known example is the space L 2 [a, b] of real square-integrable functions defined on [a, b] with f, g = b a f g.Utilities for computing distances, norms and inner products are included in the package.If we focused on L p spaces (the set of functions whose absolute value raised to the p-th power has finite integral), the function metric.lp()uses Simpson's rule to compute distances between elements: where w are the weight and the observed points on each curve t are equally spaced (by default) or not.norm.fdata()computes the norm and, specifically for L 2 , inprod.fdata()calculates the inner product between elements of the space.
In the next example, the distances between some training sample curves of phoneme data (phoneme$learn) are calculated by the metric.lp()function.Figure 4 shows the dendrogram for a selection of 11 curves of class (3) (corresponding to the index from 110 to 120 curves) and 11 curves of class (5) (from 220 to 230).This example can be understood as a classification problem in which the goal is to classify the curves in 2 classes.Other semi-metric functions are included in the fda.usc package to compute distances between "fdata" objects under the assumption that these objects belongs to a metric o semi-metric space.semimetric.basis()function calculates the semi-metric L 2 of the derivatives of order q (by default q = 0),

R> mdist <-metric.lp(learn)
) 2 dt for functional data of two "fdata" (or "fd") class objects.An alternative, without using the basis function of fda package, is to use the semi-metrics proposed by Ferraty and Vieu (2006); semimetric.deriv() and semimetric.fourier()functions calculate distances between the derivative of order nderiv using B-spline approximation and Fourier approximation, respectively.Other semi-metric functions have been included in the package, see help(semimetric.NPFDA).The functions semimetric.pca() and semimetric.mplsr()compute proximities between curves based on the functional principal components analysis (FPCA) method and functional partial leastsquared (FPLS) method, respectively.The FPCA and FPLS semi-metric can be used only if the curves are observed at the same discretized points and in a fine enough grid.Finally, semimetric.hshift()computes proximities between curves taking into account a horizontal shift effect of two functional objects.
The procedures of the package fda.usc which include the argument metric allow us the use of metric or semi-metrics functions implemented or other user defined metric with the only restriction that the first two arguments belong to the class "fdata".

Exploring functional data
Exploratory functional data analysis can help the user to analyze datasets by summarizing their main characteristics (identify features), detect possible errors or unexpected variability and resume the data.
In fda.usc, the usual tools for summarizing functional data are included: func.mean(),func.var(),fdata2pc() and fdata2pls() for computing the mean, the marginal variance, the principal eigenfunctions and the partial least squared, respectively.The output of these tools is always an object of class "fdata".Different depth notions have been proposed in the literature, with the aim of measuring how deep a data point is in the sample.In univariate data, the median would typically be the deepest point of a cloud of points.In functional data, although there are more depth measures, this package includes those that are contained in the work of Cuevas et al. (2007) as depth.FM() that is based on the median (proposed by Fraiman and Muniz 2001), depth.RP() that is calculated through random projections (RP) based on the Tukey depth or depth.RPD() that is calculated through random projections of the curves and their derivatives and depth.mode()that is based on how surrounded the curves are with respect to a metric or a semi-metric distance, selecting the trajectory most densely surrounded by other trajectories of the process.The population h-depth of a datum z is given by the function: where X is the random element describing the population, d(•, •) is a suitable distance and K h (t) is a re-scaled kernel with tuning parameter h.Given a random sample X 1 , . . ., X n of X, the empirical h-depth is defined as: All depth functions return: The deepest curve median, the index of the deepest curve lmed, the mean of (1 − α)% deepest curves mtrim, the index of (1 − α)% deepest curves ltrim and the depth of each curve dep.
An interesting application of the proposed depth measures is their use as measures of location and dispersion.In the following example we use data("poblenou") where each curve represents the evolution of NO x levels in 1 day had measured every hour {t i } 23 i=0 by a control station in Poblenou in Barcelona (Spain).We split the whole sample into working and non-working days.

Bootstrap replications as dispersion measures
The dispersion of a location statistic for functional data can be estimated by smoothed Let X 1 (t), . . ., X n (t) the original data and T = T (X 1 (t), . . ., X n (t)) the sample statistic.
Calculate the nb bootstrap resamples {X * 1 (t), . . ., X * n (t)}, using the following scheme X * i (t) = X i (t) + Z(t) where Z(t) is normally distributed with mean 0 and covariance matrix γΣ x , where Σ x is the covariance matrix of {X 1 (t), . . ., X n (t)} and γ is the smoothing parameter.
Compute d(T, T * j ), j = 1, . . ., nb.Define the bootstrap confidence ball of level 1 − α as CB (α) = X ∈ E such that d(T, X) ≤ d α being d α the quantile (1 − α) of the distances between the bootstrap resamples and the sample estimate.
The fdata.bootstrap() function allows us to define a statistic calculated on the nb resamples, control the degree of smoothing by smo argument and represent the confidence ball with level 1−α as those resamples that fulfill the condition of belonging to CB (α).The statistic used by default is the mean (func.mean())but also other depth-based functions can be used (see help(Descriptive)).As an example, in next two code lines the confidence ball for location statistic 25%-trimmed mean is computed and plotted in Figure 6 using the random project depth.

Functional outlier detection
In order to identify outliers in functional datasets, Febrero-Bande et al. (2008) make use of the fact that depth and outlyingness are inverse notions, so that if an outlier is in the dataset, the corresponding curve will have a significantly low depth.Therefore, a way to detect the presence of functional outliers is to look for curves with lower depths.Two procedures for detecting outliers are implemented: The first one is based on weighting outliers.depth.pond()and the second one is based on trimming outliers.depth.trim().Both methods are based on bootstrap and therefore the CPU time can be high.
As an illustration of the outliers detection procedures we replicate the example used by Febrero-Bande et al. (2008).Figure 7 shows the result of applying the outliers detection method based on trimming with FM depth for the case of working days and non-working days.The curves in red correspond to those identified as outliers in the original paper.

Functional regression models
A regression model is said to be "functional" when at least one of the involved variables (either a predictor variable or response variable) is functional.This section is devoted to all the functional regression models where the response variable is scalar and at least, there is one functional covariate.For illustration, we will use the tecator dataset to predict the fat contents from the absorbance as a functional covariate X(t) = X and the water contents as a non-functional covariate (Z = Water).We will use the first 165 curves to fit the model.The last 50 records will be used to check the predictions.The explanatory variables to introduce in the models are: The curves of absorbance X as functional data or one of its two first derivatives (X.d1,X.d2)and water content (Water) as non-functional variable.
R> X.d1 <-fdata.deriv(X,nbasis = 19, nderiv = 1) R> X.d2 <-fdata.deriv(X,nbasis = 19, nderiv = 2) In the following sections, regression methods implemented in the package are presented and illustrated with examples for estimating the Fat content of the tecator dataset.Finally, we show in Section 3.6 a summary focused in prediction of the presented methods.

Functional linear model with basis representation: fregre.basis()
In this section we will assume a functional linear model of type: where •, • denotes the inner product on L 2 and are random errors with mean zero and finite variance σ 2 .Ramsay and Silverman (2005) model the relationship between the scalar response and the functional covariate by basis representation of the observed functional data X(t) and the unknown functional parameter β(t) = b k φ k (t).The functional linear model in Equation 4 is estimated by the expression: where X(t) = C ψ(t)φ (t), and b = ( X X) −1 X y and so, ŷ = Xb = X( X X) −1 X y = Hy where H is the hat matrix with degrees of freedom: df = tr(H).
If we want to incorporate a roughness penalty λ > 0 then the above expression is now: ŷ = Xb = X( X X + λR 0 ) −1 X y = H λ y where R 0 is the penalty matrix.
fregre.basis() function computes functional regression between functional explanatory variable and scalar response using basis representation.This function allows covariates of class "fdata", "matrix", "data.frame"or directly covariates of class "fd".The function also gives default values to arguments basis.xand basis.bfor representation on the basis of functional data X(t) and the functional parameter β(t), respectively.In addition, the function fregre.basis.cv()uses validation criterion defined in Section 2.2 by argument type.CV to estimate the number of basis elements and/or the penalized parameter (λ) that best predicts the response.
Going on with the example, the next code illustrates how to estimate the fat contents (y = Fat) using a training sample of 1st derivative absorbance curves X.d1.

Functional linear model with functional PC basis: fregre.pc()
Similarly, Cardot, Ferraty, and Sarda (1999) used a {ν k } ∞ k=1 orthonormal basis of functional principal components to represent the functional data as X i (t) = ∞ k=1 γ ik ν k and the functional parameter as β(t) = ∞ k=1 β k ν k , where γ ik = X i (t), ν k and β k = β, ν k .Now, the estimation of β(t) can be made by a few principal components (PCs) under the assumption that β k = 0, for k > k n and the integral can be approximated by: ŷ = X, β ≈ We have implemented the functional principal regression in the function fregre.pc().The call for tecator example is shown below: R> res.pc1 <-fregre.pc(X.d1,y, l = 1:6) For the fitted object of fregre.pc()function, the function summary.fregre.fd()also shows: Variability of explicative variables explained by principal components.
Variability for each principal component.
How to select k n : fregre.pc.cv() The novelty of the procedure used in fregre.pc.cv() is that the algorithm selects the principal components that best estimated the response.The selection is done by cross-validation (CV) or model selection criteria (MSC).
Predictive cross-validation: where criteria is an argument of the function fregre.pc.cv() that controls the type of validation used in the selection of the smoothing parameter k n .
In the following example the regression model is fitted using the best selection of first 7 functional principal components (FPCs).The minimum MSC (MSC = 2.128167) is achieved with the FPC 1, 2, 7, 3 and 6 using the SIC criterion.
R> res.pc2 <-fregre.pc.cv(X.d2,y, kmax = 7) R> res.pc2$pc.opt 3) PC( 4) PC( 5) PC( 6) PC(7) rn=0 2.852951 2.258292 2.144443 2.13227 2.128167 2.159107 2.190052 Functional linear model with functional PLS basis: fregre.pls() In the previous section, the dimension reduction by FPCs is a good solution for the estimation of the functional linear model.Another good alternative is the use of the criterion that maximizes the covariance between X(t) and the scalar response Y via the partial least squares (PLS) components.Let {ν k } ∞ k=1 the functional PLS components (Preda and Saporta 2005) and before X i (t) = ∞ k=1 γik νk and β(t) = ∞ k=1 βk νk .The functional linear model is estimated by: We have implemented the FPLS regression in the function fregre.pls()where the PLS factors have been calculated using an alternative formulation of the NIPALS algorithm proposed by Krämer and Sugiyama (2011).The call for tecator example and the print of the fitted object are shown below: R> fregre.pls(X.d1,y, l = 1:5) -Call: fregre.pls(fdataobj= X.d1, y = y, l = 1:5) Functional influence measures: influence.fdata() This section focuses on how to identify influential observations in the FLM discussed in the previous sections.Febrero-Bande, Galeano, and González-Manteiga (2010) studied three statistics for measuring the influence: Cook prediction distance (CP i ), Cook estimation distance (CE i ) and Peña's distance (P i ), respectively.1.The functional Cook's measure for prediction (CP i ) detects observations whose deletion may entail important changes in the prediction of the rest of the data.It is defined as where ŷ−i is the prediction of the response y excluding the i -th observation (X i , y i ).
2. The functional Cook's measure for estimation (CE i ) detects observations whose deletion may entail important changes in the estimation.
where β−i is the estimator of the parameter β excluding the i -th observation (X i , y i ) in the process.
3. The functional Peña's measure for prediction (P i ) detects observations whose prediction is most affected by the deletion of other data.
S 2 R H ii where ŷ(−h,i) is the i -th component of the prediction vector ŷ(−h) for h = 1, ..., n.
Once estimated the functional regression model with scalar response (by fregre.basis(),fregre.pc()or fregre.pls()), the influence.fdata()function is used to obtain the influence measures, see Figure 9. Febrero-Bande et al. (2010) propose to approximate quantiles of the above statistics by means of a procedure which uses smoothed bootstrap samples of the set of observations of the above statistics.This package includes the above procedure in influence.quan()function.When the goal is to make inferences about the functional parameter and not on influence measures, one advantage derived from this procedure is that the calculation takes into account the mue.boot curves of the functional parameter β(t) in the functional Cook's measure for estimation.However, this procedure has a very high computational cost.In order to reduce substantially the computational time we have created the fregre.bootstrap()function which is explained below.

Bootstrap for functional linear model
We can study the variability of functional beta parameter β, which is estimated by the previously functional regression models (fregre.basis(),fregre.pc()and fregre.pls()),using the confidence ball (CB) defined as follows, where D α such that P r β ∈ CB ( βν ) = 1 − α.To do this, we have followed the smoothed bootstrap procedure proposed by Febrero-Bande et al. (2010).Given n scalar responses Y and n functional data X.
1. Fit the functional linear model in Equation 4 and compute the residuals e. 3. Smooth the two previous samples as follows: i is a Gaussian process with zero mean and the covariance operator γ X Γ X , where γ X is a smoothed bootstrap parameter.In the function fregre.bootstrap(), the smoothed bootstrap parameters γ X and γ e are the arguments smoX and smo, respectively.In step 4, the argument kmax.fixcan be selected by an appropriated cutoff ν in each b iteration or fixed its value in the step 1.
In the example, we have used a small resample nb = 1000 and we have fixed the parameter kmax.fix(5 Fourier basis, 6 FPC and 5 FPLS) to have a reasonable computing time.Still, the calculation procedure is computationally high demanding.
The functional linear model ( 9) is estimated by the expression: The arguments are as follows: formula: A symbolic description of the model to be fitted.
data: List containing the variables in the model.The first item in the data list is a "data.frame"called df with the response and non-functional explanatory covariates.Functional covariates ("fdata" or "fd" class) are introduced in the following items in the data list.
basis.x:List with a basis object for every functional covariate.
basis.b:List with a basis object for estimating the functional parameter β(t).
For the tecator data example, the content of Fat is estimated from the second derivative of absorbances curves X.d2 and the content of Water by fregre.lm()function.
This procedure is implemented in the fregre.np()function and some of its arguments are: Ker: Type of asymmetric kernel function, by default asymmetric normal kernel.
Again, the function fregre.np.cv() is used to choose the smoothing parameter h by the validation criterion described in Section 2.2.
The code for the tecator example is: R> fregre.np(X.d1,y) -Call: fregre.np(fdataobj= X.d1, y = y) -Bandwidth (h): 0.006498587 -R squared: 0.9873025 -Residual variance: 3.036685 3.5.Semi-functional partially linear model: fregre.plm() An extension of the non-parametric functional regression models is the semi-functional partial linear model proposed in Aneiros-Pérez and Vieu (2006).This model uses a non-parametric kernel procedure as that described in Section 3.4.The output y is scalar.A functional covariate X and a multivariate non-functional covariate Z are considered.
The unknown smooth real function r is estimated by means of rh where W h is the weight function: w n,h (X, X i ) = K(d(X,X i )/h) n j=1 K(d(X,X j )/h) with smoothing parameter h, an asymmetric kernel K(•) and a metric or semi-metric d(•, •).In fregre.plm()by default W h is a functional version of the Nadaraya-Watson-type weights (type.S = S.NW) with asymmetric normal kernel (Ker = AKer.norm) in L 2 (metric = metric.lpwith p = 2).The unknown parameters β j for the multivariate non-functional covariates are estimated by means of βj = ( Z h Zh ) −1 Z h Zh where Zh = (I − W h )Z with the smoothing parameter h.The errors are independent, with zero mean, finite variance σ 2 and E[ |Z 1 , . . ., Z p , X(t)] = 0. Coming back to the example of Section 3.3, the fitted model for the case of a real variable Z = Water and the second derivative of the absorbance curves (X.d2) as functional covariate can be obtained by: R> res.plm2 <-fregre.plm(Fat ~Water + X.d2, ldata) -Call: fregre.plm(formula= Fat ~Water + X.d2, data = ldata) R> pred.lm1<-predict.fregre.lm(res.lm1,newldata) R> res.plm1 <-fregre.plm(f1,ldata) R> pred.plm1<-predict.fregre.plm(res.plm1,newldata) We make predictions for the rest of models fitted in this paper (code not shown).Following the ideas by Aneiros-Pérez and Vieu (2006), we calculated the mean square error of prediction (MEP ): MEP = n i=1 (y i − ŷi ) 2 /n / (Var (y)), which is used for comparing the predictions of different fitted models.Table 1 resumes the statistics of the fitted models and their predictions.

Conclusion
The package fda.uscpresented in this paper is the result of the integration of our codes with other procedures from different authors, such as the package fda or the functions from the STAPH group.One major advantage of this software is that it avoids the need of a basis representation of functional data.Using the new class "fdata", the proposed methods can represent the functional data using discretized versions in a given grid of points.
The fda.usc package includes most of the methods recently developed for exploratory functional data analysis and for functional regression with scalar response.This package also incorporates other utilities for statistical computing within the field of functional data analysis (FDA).Some useful additions are: Generalized functional linear models (FGLM): fregre.glm().
Other utilities and auxiliary functions, as the cond.F() function that calculates the conditional distribution function of a scalar response with functional data.
The fda.usc package is an attempt to obtain an integrated framework for FDA.It is under continuous development therefore updates will be available in CRAN.We are working to incorporate new depth measures and methods for functional regression and classification.Further information, as script files and a beta version of package, is also available at the project website whose URL http://eio.usc.es/pub/gi1914/, also given in the DESCRIPTION file.
) and (3) shown above.As an illustration, in the right panel of Figure2, the GCV criterion is drawn in function of the bandwidth h using the function min.np()and Figure3shows smooth representation of 11th curve of phoneme$learn by Nadaraya-Watson (S.NW()), local linear regression (S.LLR()) method and K nearest neighbors (S.KNN()) method.

Figure 2 :
Figure 2: GCV criterion as a function of the number of B-spline basis elements and the penalizing parameter λ (left panel).GCV criterion as a function of bandwidth parameter (right panel): Normal kernel and local linear smoothing matrix (blue line); Normal kernel and Nadaraya-Watson smoothing matrix (green line) and uniform kernel and K nearest neighbors smoothing matrix (red line).

Figure 3 :
Figure 3: Eleventh phoneme curve: Observed (grey dashed line), smoothed by 27 B-spline basis elements and λ = 81.28(black line), smoothed by normal kernel and Nadaraya-Watson smoothing matrix with bandwidth h = 3.28 (green line), smoothed by normal kernel and local linear smoothing matrix with bandwidth h = 7.85 (blue line) and smoothed by uniform kernel and K nearest neighbors smoothing matrix with bandwidth h = 7 (red line).

Figure 5 :
Figure 5: Descriptive statistics for poblenou dataset based on depth: trend measures for working days (top left) and non-working days (top right), dispersion measures for working days (bottom left) and for non-working days (bottom right).

Figure 6 :
Figure 6: Bootstrap replications of poblenou curves: Using the 25%-trimmed mean statistic for working days curves (left) and for non-working days curves (right).

Figure 7 :
Figure 7: NO x levels measured by a control station split into two groups (gray lines): Working days (left) and non-working days (right).The red lines correspond to days detected as outliers.
kn) is the (n,k n ) matrix with k n principal components estimation of β scores and λ i the eigenvalues of the PC.The model of Equation 6 is expressed as: ŷ = Hy where H = γ .1 γ .1 y nλ 1 , . . ., γ .knγ .kny nλ kn with degrees of freedom: df = tr(H) = k n .

Figure 9 :
Figure 9: Influence measures for tecator dataset calculated from fitted model (res.basis0):Matrix of scatter plots for Cook prediction distance (CP i ), Cook estimation distance (CE i ) and Peña's distance (P i ).

2
. Obtain B standard bootstrap samples of size n of the residuals: e b 1 , . . ., e b n , for b = 1, . . ., B. Similarly, obtain B standard bootstrap samples of the functional data: X b 1 , . . ., X b n .

Table 1 :
Prediction methods for functional regression model fitsOnce the model is estimated we can obtain predictions by means of: predict.fregre.fd()corresponds to the model fitted from fregre.basis(),fregre.pc(),fregre.pls()orfregre.np()functions.predict.fregre.lm()corresponds to the model fitted from fregre.lm()function.predict.fregre.plm()corresponds to the model fitted from fregre.plm()function.A sample test of the last 50 curves of absorbance (or one of the two first derivatives) and Water content can be used to predict the Fat content.We provide below the complete code for the prediction process.Results for functional regression models.df degrees of freedom, R 2 , S 2 R residual variance, and mean square error of prediction (MEP).