Adapted Variational Bayes for Functional Data Registration, Smoothing, and Prediction

We propose a model for functional data registration that compares favorably to the best methods of functional data registration currently available. It also extends current inferential capabilities for unregistered data by providing a flexible probabilistic framework that 1) allows for functional prediction in the context of registration and 2) can be adapted to include smoothing and registration in one model. The proposed inferential framework is a Bayesian hierarchical model in which the registered functions are modeled as Gaussian processes. To address the computational demands of inference in high-dimensional Bayesian models, we propose an adapted form of the variational Bayes algorithm for approximate inference that performs similarly to MCMC sampling methods for well-defined problems. The efficiency of the adapted variational Bayes (AVB) algorithm allows variability in the predicted registered, warping, and unregistered functions to be depicted separately via bootstrapping. Temperature data related to the El Niño phenomenon are used to demonstrate the unique inferential capabilities for prediction provided by this model.


INTRODUCTION
This paper introduces a novel approach to functional data registration within a Bayesian hierarchical model. Our model for registration compares favorably with the best registration procedures currently available and also extends inferential procedures to include both 1) functional prediction in the context of registration and 2) registration and smoothing in one model. The primary advantage of our proposed registration model over current registration methods is that it provides a probabilistic framework in which new observations can be considered. Assuming a new unregistered function has been partially recorded, within this framework we can obtain estimates of the registered partial function and also the corresponding partial warping function. Using these estimates, the complete registered function, the complete warping function, and the complete unregistered function can be predicted. Details of the prediction model can be found in Section 5. To the authors' knowledge, this is the first time functional prediction has been considered in the context of registration.
Additionally, our model can be extended to allow for noisy observations. Most current registration methods treat functional regularization as a pre-processing step, with the exception of Raket et al. [20]. However, in the paper by Raket et al. [20], the authors' maximum likelihood approach to registering latent functions does not provide an easy way to quantify variability in the estimates of the registered functions. In Section 6.3, we provide an illustration of how smoothing functions in a pre-processing step can significantly underestimate the variability in the estimates of the registered functions.
The simplest definition of functional data registration is any algorithm that aligns functions in a way that eliminates all phase variability between functions. Without registration, basic summary statistics such as the sample mean and covariance are less interpretable, as time variation between significant features in the functions tends to dampen the amplitude variation in these features. Furthermore, the average timing of significant features may also be of interest and is difficult to obtain under traditional methods of analyzing functional data. There has been much recent interest in proper ways to define and measure registration as well as in developing registration methods with desirable statistical properties.
The evolution of registration dates back to Sakoe and Chiba [28], where the authors use a dynamic programming algorithm for landmark registration. Landmark registration was again considered by Kneip and Gasser [12] and Kneip and Gasser [13]. In 1997, a new cost function for functional registration was introduced by Wang and Gasser [35]. A significant advancement in the registration literature can be traced to Silverman [30] and Ramsay and Li [24], where the authors introduce global registration procedures, and Ramsay considers the use of a flexible family of monotone warping functions. Parametric and B-spline-based warping functions are considered by Brumback and Lindstrom [4] and Gervini and Gasser [8], respectively. Nonparametric maximum likelihood approaches to registration are considered by both Ronn [27] and Gervini and Gasser [9]. A moments-based approach to registration is introduced by James [11]. Tang and Müller [32] propose pairwise curve synchronization. The first Bayesian approach to registration can be found in Telesca and Inoue [33]. Registration to principal components is considered by Kneip and Ramsay [14]. Finally, with regard to improvements in registration, the recent work by Srivastava et al. [31] offers the most comprehensive framework for registration to date.
Much of the focus in combining registration with other types of inference in one model has been in the area of functional data clustering and registration. Current work in this area can be found in Liu and Yang [18], Sangalli et al. [29], and also a Bayesian approach in Zhang and Telesca [37]. Recent work by Raket et al. [20] includes a model for functional smoothing and registration. While these extensions to registration procedures offer additional tools for functional data analysis, they tend to focus less on high-quality registration.
In this paper, we develop Bayesian hierarchical models that address both areas of development in registration procedures. First, a model is proposed to register functional data that gives estimates that compare favorably with those from the best current registration methods available, notably Srivastava et al. [31]. Then, we demonstrate how this model can be extended to incorporate other inferential procedures. The two examples provided in this paper are extensions for both a functional prediction model and a model for simultaneous registration and smoothing. This paper also addresses the computational issues associated with high-dimensional Bayesian hierarchical models. To this end, we propose an alternative algorithm to variational Bayes approximation that can be used for models in which the full conditional distributions of a subset of the parameters are not from a known parametric family. We call this the Adapted Variational Bayes (AVB) algorithm. This paper is organized as follows. Section 2 presents our basic registration model. The Adapted Variational Bayes algorithm is discussed in detail in Section 3. A comparison of results from our model to current methods can be found in Section 4.1. Additionally, a comparison of results obtained using AVB and those given by MCMC can be found in Section 4.2. The prediction model is presented in Section 5. In Section 5.4, the prediction model is used to forecast the future trajectory of sea-surface temperatures that are associated with the El Niño phenomenon.
An adaptation to our model that allows for noisy data is found in Section 6. Finally, a discussion is found in Section 7.

GAUSSIAN PROCESS MODELS FOR REGISTRATION
The functional registration models proposed in this paper are foremost designed to extend and improve on the minimum eigenvalue criterion for continuous registration first introduced by Ramsay and Li [24]. Accordingly, we will consider two functions perfectly registered if the variation between the two functions can be described entirely in terms of one functional direction: the target function. Our method of registration improves on Ramsay and Li's Procrustes method by implicitly accounting for vertical shifts between registered functions and by allowing the target curve to evolve throughout the registration procedure. In Section 4, we will demonstrate how using the minimum eigenvalue criterion under these conditions provides a more complete curve registration. Our results are comparable to those of Srivastava et al. [31].
The theoretical basis for modeling functional data as Gaussian processes in a hierarchical Bayesian environment is established in Earls and Hooker [5]. In this model, each registered function, X_i(h_i(t)), i = 1, . . ., N, is the composition of an observed unregistered function, X_i(t), with an unknown warping function, h_i(t), over some fixed time domain T. We model each registered function, X_i(h_i(t)), i = 1, . . ., N, as a Gaussian process such that

X_i(h_i(t)) ~ GP(z_{0i} + z_{1i} f(t), γ_R^{-1} Σ(s, t)).

Here, f(t) is the target function. The target function serves as the primary functional direction in which the registered functions vary. Accordingly, the above covariance function, γ_R^{-1} Σ(s, t), is defined to penalize all variation in the registered functions beyond a scaling and vertical shifting of the target function (the function-specific mean). In these models we will define γ_R as a registration parameter that determines the severity of this penalty.
Given a sample of unregistered functions, X_i(t), i = 1, . . ., N, defined over the interval T = [t_1, t_p], we are interested in estimating the warping functions, h_i(t), the shifting and scaling parameters, z_{0i} and z_{1i}, the target curve, f(t), and the registered functions.
For now, we will assume the functions are recorded without noise. If the functions are recorded with noise, it is common practice in the current literature to first perform a pre-processing smoothing step. An undesirable result of this pre-processing step is that the subsequent inference procedure is unable to capture the extra variability associated with the smoothing process. In Section 6, we will show how our model can be modified to both smooth and register functions.
Inference is accomplished through a Bayesian hierarchical model, with the distributional assumptions and prior specifications given below. Note, for this model, the distribution on z_1 = (z_{11}, . . ., z_{1N})' can be replaced by a Dirichlet distribution on z_1/N. The result is a slightly more complicated model that has the nice effect of scaling the target function to the empirical mean of the estimated registered functions. Priors (2) and (3) would then be omitted.
Given the above model, in practice we will proceed by using finite approximations to each functional distribution. In Earls and Hooker [5] we establish some theoretical properties of these types of approximations. The following finite-dimensional distributions are used in the final model in lieu of their infinite-dimensional counterparts above.
In the above priors, the matrices P_1 and P_2 together form a basis for R^p. The matrix P_2 is constructed from L, a second finite difference matrix of rank p − 2; the eigenvectors of L approximately span the space of all functions that are neither constant nor linear. The matrix P_1 is constructed from B, an approximated squared constraint operator; by this definition, B is an approximate basis for constant and linear functions. B and L together span all of R^p, and the eigenvectors of B are orthogonal to the eigenvectors of L. The matrices P_1 and P_2 serve two purposes in this registration model. Considered together, they define the matrix Σ = P_1 + P_2 that penalizes any variation from a given mean function. Considered separately, as in Σ_f, P_2 provides a penalty on roughness with associated smoothing parameter λ_f. P_1 is only necessary in Σ_f to ensure this covariance matrix is non-singular, and consequently η_f need only be large enough to provide stability in this matrix. Here, η_f and λ_f are estimated within the model. For a more extensive discussion of selecting smoothing parameters in this way, see Earls and Hooker [5]. Finally, the matrix P_w is present only to provide extra smoothness in the warping functions if necessary. It can be defined as either a squared first or second derivative penalty on the base functions and is sometimes excluded altogether.
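As an illustration, the second finite difference operator underlying a roughness penalty like the one built from L can be constructed in a few lines of numpy. This is our own minimal sketch: the function name and scaling are illustrative, and the paper's actual covariance matrices additionally involve the smoothing parameter λ_f and the stabilizing term in η_f.

```python
import numpy as np

def second_diff_penalty(p):
    """Second finite difference matrix D (rank p - 2) and the induced
    roughness penalty D'D, analogous to the matrix L in the text."""
    D = np.zeros((p - 2, p))
    for j in range(p - 2):
        # Row j computes x[j] - 2*x[j+1] + x[j+2]
        D[j, j:j + 3] = [1.0, -2.0, 1.0]
    return D, D.T @ D

# Constant and linear functions are annihilated by D, so the penalty
# D'D leaves them unpenalized -- exactly the null space described above.
```

Because constant and linear vectors lie in the null space of D, the penalty D'D penalizes only curvature, which is why a separate matrix (the paper's P_1) is needed to make the full covariance non-singular.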
In this paper, we will refer to the functions, w_i(t), t ∈ T, from which the warping functions, h_i(t), t ∈ T, are derived, as the base functions. The base functions are non-parametrically specified for optimal registration. We, however, impose the following restrictions on the warping functions: restrictions (1) and (3) are built into the definition of h_i(t), while restriction (2) is imposed through the characteristic function in the expression for the prior defined for each base function, w_i(t).
Note that w_i(t) = 0 corresponds to the identity warping, h_i(t) = t. An important feature of our definition of the warping functions is that it defines an identifiable relationship between h_i(t_j) and w_i(t), which is necessary for predicting future outcomes of curves only partially observed. A more thorough discussion of the prediction model appears in Section 5. Shifting these by an additional registration so that the warped times average to the original time does not affect our prediction model, but it will then allow an explicit comparison of h_i(t_j) to t_j to tell us whether the process is running ahead of or behind "standard" time. Srivastava et al. [31] use a similar "correction" to determine their target function.

Registration and Warping Parameter Selection
Registration in this model is controlled by three parameters, γ_R, γ_w, and λ_w. γ_R determines the extent to which the registered functions will be penalized for varying from a shifting and scaling of the target function. This penalty for lack of registration is tempered by penalties for roughness in the warping functions. γ_w determines how far the warping functions can deviate from the identity warping, while λ_w controls the smoothness of the warping functions. For a given statistical analysis, γ_R, γ_w, and λ_w are chosen to allow for a desirable amount of warping. This model can also be adapted to allow for function-specific warping penalties. In Section 5.4, we will give an example where function-specific penalties for the base functions have been utilized to preserve significant covariance relationships in the estimated registered functions.
While warping procedures perform best when all functions are a scaling of and vertical shift from a target function, it is common in practice for unregistered datasets to include functions that vary significantly in other directions. In this model, a large registration penalty, γ_R, in comparison to the penalty on warping, γ_w, will result in registered functions that no longer retain significant features in the data. Alternatively, a registration parameter that is too small will not properly align features. Desirable values of these parameters can be determined using short runs of the adapted variational Bayes algorithm described in Section 3.1. In practice, we have found these penalties should be adjusted by powers of ten to see a significant change in estimates of the registered functions. Once determined, γ_R, γ_w, and λ_w are fixed and can be used with the adapted variational Bayes estimates to initialize an MCMC sampler.

VARIATIONAL APPROXIMATION FOR BAYESIAN REGISTRATION

Adapted Variational Bayes
For this hierarchical Bayesian model, it is appropriate to use Markov chain Monte Carlo (MCMC) methods to sample from the joint posterior distribution of all unknown parameters. The advantage of using an MCMC model for registration is that it is clear upon inspection whether the chain has been run long enough to provide good estimates. However, for most applications, the dimensionality of the parameter space will require exceptionally long chains that are impractical and expensive to obtain. Here, we suggest a variational Bayes alternative to MCMC sampling to, at the very least, obtain good starting values for an MCMC sampler. Alternatively, we will show in Section 4.2 that differences in the estimated parameters obtained through adapted variational Bayes and MCMC sampling tend to be small, and estimation via adapted variational Bayes alone is likely sufficient for many inferential procedures.
The variational Bayes procedure described here is based on the variational methods proposed by Ormerod and Wand [19] and Bishop [2]. Their proposed method optimizes a lower bound on the marginal likelihood, which results in finding an approximate joint posterior density that has the smallest Kullback-Leibler (KL) divergence from the true joint posterior density. In Goldsmith et al. [10], the authors utilize variational Bayes for a functional regression model and provide a clear explanation of the procedure and how convergence of the parameter estimates is monitored.
Suppose q(θ) is the approximated posterior joint distribution. The variational Bayes algorithm assumes, for some partition θ = {θ_1, . . ., θ_d}, that q(θ) = ∏_{k=1}^{d} q_k(θ_k), where each distribution q_k is of a known parametric form. Traditionally this requirement is satisfied through the use of conditionally conjugate priors.
In our model, the Gaussian process priors for the base functions, w_i(t), i = 1, . . ., N, are not conditionally conjugate to the likelihood function. Therefore, the traditional variational Bayes optimization method does not apply directly, since q_k(w_i), i = 1, . . ., N, are not known parametric distributions.
Under these conditions, a natural extra step in the traditional variational Bayes algorithm is to first maximize each q_k(w_i) in w_i for i = 1, . . ., N, where all other unknown parameters are held constant at their values in the current iteration of the algorithm. Then, assuming the w_i, i = 1, . . ., N, are known, update all other parameters using the traditional variational Bayes procedure. This results in the following adapted variational Bayes (AVB) algorithm:

1. Initialize θ.

2. For each iteration, m, and each k, k = 1, . . ., N, update the estimate of w_k by maximizing in w_k with all other parameters held at their current values.

3. For each iteration, m, and each k, k = (N+1), . . ., d, update q_k so that q_k^{(m)}(θ_k) ∝ exp{E_{(θ_{-k})}[log p(θ_k | rest)]}, where the expectation is taken with respect to the distributions q_j^{(m-1)}(θ_j), j = 1, . . ., d, j ≠ k.

4. Repeat steps (2) and (3) until the desired convergence criterion is met.

Here the notation E_{(θ_{-k})} denotes the expected value over all parameters except θ_k. In the next section, we will drop the subscript k, and E_{(θ_{-θ_k})} will represent the expectation over all parameters except for θ_k (e.g., E_{(θ_{-η_f})} will represent the expectation taken over all parameters except for η_f).
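The four steps above can be sketched schematically as follows. All names here are illustrative placeholders, not code from the paper: `maximize_w` stands for the numerical maximization of each non-conjugate factor (step 2), and `update_q` for the conjugate variational updates of the remaining factors (step 3).

```python
import numpy as np

def avb(w_init, q_init, maximize_w, update_q, max_iter=100, tol=1e-6):
    """Schematic AVB loop.

    maximize_w(k, w, q) -> new w_k       (step 2: maximize each
                                          non-conjugate factor in w_k)
    update_q(name, w, q) -> new q[name]  (step 3: conjugate VB update)
    """
    w = [wk.copy() for wk in w_init]     # step 1: initialize
    q = dict(q_init)
    for m in range(max_iter):
        w_old = [wk.copy() for wk in w]
        for k in range(len(w)):          # step 2
            w[k] = maximize_w(k, w, q)
        for name in q:                   # step 3
            q[name] = update_q(name, w, q)
        delta = max(np.max(np.abs(a - b)) for a, b in zip(w, w_old))
        if delta < tol:                  # step 4: convergence check
            break
    return w, q
```

In the paper, convergence is monitored through the variational lower bound (Appendix B); the simple change-in-w criterion above is only a stand-in.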
Convergence of the AVB algorithm is guaranteed. However, convergence to a global maximum is not guaranteed, and in practice it is sometimes necessary to adjust the registration and warping penalties as the functions become registered. An unregistered function that requires a substantial amount of warping can cause convergence to a local maximum due to the small penalty on warping.
The flexibility in warping allowed by this small penalty can cause the function to deform rather than register. This can be remedied in two ways. The first option is to perform a simple initial warping for this function that prevents the optimization from falling into a local mode. The second option is to adjust the registration and warping parameters over time. Initially, a stronger warping penalty is employed to prevent function deformation. Then, as the functions register, the warping penalty can be reduced to allow for a more complete registration. When initializing an MCMC sampler, the final penalties on warping and registration from the adapted variational Bayes algorithm should be used. Appendix B includes a proof of the convergence properties of the AVB algorithm in addition to the details required to monitor convergence.

Comparison to Other Registration Procedures
Our registration criterion minimizes all variation in the warped functions that is not in the direction of the target function (allowing for vertical shifts). In this respect, the underlying registration principle driving our model is similar to that proposed by Ramsay and Li [24]. Here we will compare our registration results to those using Ramsay and Li's method as well as the registration procedure proposed by Srivastava et al. [31]. Srivastava et al. propose a geometric framework for functional data registration using the Fisher-Rao Riemannian metric (Rao [26]). In this paper we will refer to Ramsay and Li's registration procedure as ME (minimum eigenvalue), Srivastava's procedure as F-R (Fisher-Rao), and the model proposed here as GP (Gaussian processes). In the paper by Srivastava et al., several comparisons of registration under the F-R framework to the registration methods proposed by Gervini and Gasser [8], James [11], Liu and Müller [17], and Tang and Müller [32] are provided.

We have chosen to use the Sobolev Least Squares (sls) criterion to compare the three registration methods for each dataset, as advocated by Srivastava et al. [31]. The Sobolev Least Squares criterion compares the total cross-sectional variance of the first derivatives of the registered functions to that of the original functions. Explicitly,

sls = [ Σ_{i=1}^{N} ∫ ( X'_i(h_i(t)) − (1/N) Σ_{j=1}^{N} X'_j(h_j(t)) )² dt ] / [ Σ_{i=1}^{N} ∫ ( X'_i(t) − (1/N) Σ_{j=1}^{N} X'_j(t) )² dt ].

Lower values of sls correspond to better function alignment.

Berkeley Boys Growth Velocity Data

Figure 2 contains 39 velocity of growth functions for boys from the Berkeley Growth Study, Tuddenham and Snyder [34]. For this analysis, the original data are slightly changed to eliminate some erratic behavior at the beginning of each function. [Figure 2, lower right: boys' velocity functions registered by the GP model (sls = .02).] Here, GP and F-R yield similar registration results. However, the GP algorithm results in the lowest sls. ME registers the most significant peak in growth velocity but does not align lesser features as well as GP. Note: the ME registered functions are based on 2 complete runs of the ME algorithm, where in each run the previous run's results were used as the 'unregistered' functions.
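A plausible implementation of the sls criterion on a common sampling grid might look like the following. This is our own sketch; the exact normalization used by Srivastava et al. may differ.

```python
import numpy as np

def sls(F_registered, F_original, t):
    """Sobolev least squares: total cross-sectional variance of the
    first derivatives after registration, relative to before.
    Rows of F_registered / F_original are functions on the grid t."""
    def deriv_var(F):
        dF = np.gradient(F, t, axis=1)      # first derivatives
        resid = dF - dF.mean(axis=0)        # cross-sectional residuals
        return np.trapz((resid ** 2).sum(axis=0), t)
    return deriv_var(F_registered) / deriv_var(F_original)
```

Curves that are perfectly aligned and differ only by vertical shifts have identical first derivatives, so sls = 0; values near 1 indicate no improvement over the unregistered sample.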

Running this algorithm more than twice resulted in function distortion due to over-warping and a larger sls.

Simulated Data Set
While the GP and F-R methods result in a similar alignment of functions, these results are achieved in very different environments that are specialized to satisfy specific inferential preferences. The F-R registration method is convenient (using the R package 'fdasrvf') and provides fast, high-quality estimates. On the other hand, while providing comparable registration results, our method expands inferential capability by providing 1) variability estimates for all unknown parameters and 2) a probability framework in which future partially observed unregistered functions are considered. In contrast to traditional functional prediction methods, our model not only provides an estimate of the complete unregistered function, but also estimates the complete warping function and the complete registered function. Details of the prediction model are found in Section 5.

Comparison to MCMC Results
To establish the utility of the adapted variational Bayes algorithm, here we compare the estimates of registered functions using adapted variational Bayes versus those obtained through MCMC sampling. For this exposition, the simulated data set and the Boys Growth Velocity data set described in Section 4.1 are used to look at the discrepancies between the estimated registered functions from MCMC sampling versus those determined by the AVB algorithm.
The squared L2 norm of the difference between the AVB and MCMC estimates of a registered function is used to quantify the differences between these estimates. Figure 3 illustrates for the simulated data set how closely the AVB estimates follow the MCMC estimates. Even the largest squared L2 norms of the differences between these two estimates correspond to minor changes in the estimates. These simulated data represent rather ideal conditions for registration, where there is almost no variation in the registered functions beyond a scaling and vertical shift of the target function. Consequently, as we might expect, the MCMC and AVB estimation procedures are primarily in agreement. Figure 4 is a more realistic look at the differences between the MCMC and AVB registration results for data that have significant variation in the registered functions beyond a scaling and vertical shift of the target function. However, even here we see the AVB algorithm performs well. Of the 39 observations, in only 2 or 3 are there notable discrepancies between the AVB and MCMC estimated registered functions.
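The squared L2 discrepancies between the AVB and MCMC estimates can be computed and ranked with a few lines of numpy (an illustrative sketch; rows are functions sampled on a common grid t):

```python
import numpy as np

def rank_discrepancies(F_avb, F_mcmc, t):
    """Squared L2 norm of the AVB-MCMC difference for each registered
    function, with indices ordered from largest to smallest."""
    d = np.trapz((F_avb - F_mcmc) ** 2, t, axis=1)
    return np.argsort(d)[::-1], d
```

Plotting the curves for the top-ranked indices reproduces the kind of worst-case comparison shown in Figures 3 and 4.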

VARIATIONAL APPROXIMATION FOR FUNCTIONAL PREDICTION

Functional Prediction Algorithm
The probabilistic framework of our registration model provides a natural structure in which we can consider new observations. Functional prediction has previously been considered by Ferraty and Vieu [6]. Here we extend current methods by taking into account the phase variability of a partially observed function.
We will make the following assumptions:

1. We have a sample of approximated unregistered functions, X_i = (X_i(t_1), . . ., X_i(t_p))', i = 1, . . ., N.

2. The X_i are registered using the registration method outlined in Section 2 via an MCMC sampler or adapted variational Bayes.

[Figure caption: lower plots show the next three observations with the highest squared L2 norms of the difference between the MCMC and AVB estimates. The squared L2 norm associated with the lower right plot is about .64. As can be seen in this illustration, at this level there are only small differences between the MCMC and AVB estimates.]

4. A new function, X_{N+1}(t), has been observed at the time points (t_1, . . ., t_r)', r < p.

5. (X(h(t_1)), . . ., X(h(t_p)))' ≈ N_p(μ_{X(h)}, Σ_{X(h)}); that is, the distribution of the registered functions can be approximated by a multivariate normal distribution using the sample mean, μ_{X(h)}, and sample covariance matrix, Σ_{X(h)}, of the estimated registered functions obtained in (2).

6. (w(t_1), . . ., w(t_{p-1}))' ≈ N_{p-1}(μ_w, Σ_w); that is, the distribution of the base functions can be approximated by a multivariate normal distribution using the sample mean, μ_w, and sample covariance matrix, Σ_w, of the estimated base functions obtained in (2).
Under these assumptions, we will proceed as follows:

1. Register the partially observed function, X^P_{N+1} = (X_{N+1}(t_1), . . ., X_{N+1}(t_r))', to the estimated target function, f(t), truncated to an appropriate registration time, t_f, f ∈ {1, . . ., p}.

2. Using the distributions from assumptions (5) and (6) above, together with the estimate of the partial registered function, X^P_{N+1}(ĥ_{N+1}) = (X^P_{N+1}(ĥ_{N+1}(t_1)), . . ., X^P_{N+1}(ĥ_{N+1}(t_f)))', and the estimate of the partial base function, ŵ^P_{N+1} = (ŵ^P_{N+1}(t_1), . . ., ŵ^P_{N+1}(t_{f-1}))', extend the registered and base functions to times t_p and t_{p-1}, respectively, using the conditional expectation of the multivariate normal distribution. Accordingly, denoting the future registered observations and future warping function values by X^F_{N+1}(h_{N+1}) and w^F_{N+1}, respectively, the estimates of these future values are given by the corresponding conditional means.

3. Estimate the complete unregistered function, X_{N+1}(t), using the inverse of the estimated warping function and the estimated registered function.
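Step 2 relies on the standard conditional mean and covariance of a multivariate normal. A self-contained sketch (our own helper, not code from the paper; the jitter term is an added numerical safeguard):

```python
import numpy as np

def conditional_mvn(mu, Sigma, x_obs, obs_idx, fut_idx, jitter=1e-8):
    """Conditional mean and covariance of the 'future' coordinates of a
    multivariate normal given the 'observed' ones -- the device used to
    extend the partial registered and base functions to the full grid."""
    mu_o, mu_f = mu[obs_idx], mu[fut_idx]
    S_oo = Sigma[np.ix_(obs_idx, obs_idx)] + jitter * np.eye(len(obs_idx))
    S_fo = Sigma[np.ix_(fut_idx, obs_idx)]
    S_ff = Sigma[np.ix_(fut_idx, fut_idx)]
    K = S_fo @ np.linalg.inv(S_oo)
    return mu_f + K @ (x_obs - mu_o), S_ff - K @ S_fo.T
```

Applied with the sample mean and covariance of the estimated registered functions (assumption 5), this extends the partial registered function to t_p; applied with those of the base functions (assumption 6), it extends the partial base function to t_{p-1}.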

Determining the Last Registered Time
An additional random element in the prediction model is the last registered time of the truncated target function, t_f, used to register the partial observation. To obtain the best possible registration of the partial observation, a range of final registration times is considered over a finer domain.
The efficiency of the adapted variational Bayes algorithm makes it possible to consider several possible partial registrations as follows:

1. For each of the time points t_j, j ∈ {m, . . ., (m + k − 1)}, t_{m+k−1} < t_p, the partially observed function, X^P_{N+1}(t), is registered to the estimated target function, f(t), truncated to time t_j, so that ĥ_{N+1(j)}(t_j) = t_r, where ĥ_{N+1(j)}(t) is the estimated warping function determined by registering X^P_{N+1}(t) to the proposed final registration time t_j. Note, the first and last times considered in this interval are chosen by plotting the partial unregistered function and the target function together and determining a generous interval that contains the appropriate final registration time. This interval is subsequently made finer to allow this time to fall between two of the original time points.

2. For each j, calculate d_j, the squared L2 norm between the partially recorded unregistered function, X^P_{N+1}, and f_{U(j)}, the target function evaluated at the inverse of the estimated warping function.

3. Choose the final registered time, t_f, as the candidate t_j that minimizes d_j.

This algorithm determines the final registered time, t_f, that results in the minimum L2 norm between the partially recorded unregistered function and the target function evaluated at the inverse of the warping function estimated using that final time. Note, for all j, f_{U(j)} shares the same domain as the partially recorded unregistered function, X^P_{N+1}.
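The grid search over candidate final registration times can be sketched as below, with the partial registration step left as a placeholder (`register` here is purely illustrative; in the paper it is the AVB registration of the partial function to the truncated target):

```python
import numpy as np

def best_final_time(register, x_partial, target, candidate_idx, t):
    """Choose the truncation index minimizing the squared L2 distance
    between the partial observation and the fitted truncated target."""
    best_j, best_d = None, np.inf
    for j in candidate_idx:
        fitted = register(x_partial, target[:j + 1])   # partial registration
        d = np.trapz((x_partial - fitted) ** 2, t[:len(x_partial)])
        if d < best_d:
            best_j, best_d = j, d
    return best_j, best_d
```

Because each candidate requires only a short AVB run, the search over a fine grid of truncation times remains computationally cheap.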

Credible Intervals for Predicted Functions
5. Draw a sample of size S from the distribution of X^F(h) determined by μ_{X(h)} and Σ_{X(h)}. Combine each of these samples of future values of the registered function with the estimated partially registered function determined in (4) to get a sample of S estimated registered functions.
6. Draw a sample of size S from the distribution of w. Combine each of these samples of future values of the base function with the estimated partial base function determined in (4) to get a sample of S estimated base functions.
7. Determine the S warping functions that result from step (6).
8. Determine the S unregistered functions that result from combining each of the S registered functions drawn in (5) with the corresponding inverse warping function from (7).

This process results in M × S bootstrapped samples of the registered function, warping function, and unregistered function. From these samples, pointwise bootstrapped confidence intervals can be determined for each function.
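Steps 5-8 can be sketched as follows, assuming the conditional mean and covariance of the future segment have already been computed (function and variable names are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_completions(x_part, mean_f, cov_f, S):
    """Draw S completions of the future segment from its (approximate)
    normal distribution and prepend the fixed partial estimate."""
    fut = rng.multivariate_normal(mean_f, cov_f, size=S)
    return np.hstack([np.tile(x_part, (S, 1)), fut])   # S complete curves

def pointwise_ci(samples, alpha=0.05):
    """Pointwise bootstrap interval from the sampled curves."""
    lo = np.quantile(samples, alpha / 2, axis=0)
    hi = np.quantile(samples, 1 - alpha / 2, axis=0)
    return lo, hi
```

The same two helpers apply to the registered functions (step 5) and the base functions (step 6); the warping and unregistered functions of steps 7 and 8 are then derived deterministically from each sampled curve.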
Here we have approximated the distributions of the base and registered functions by fitting a multivariate normal distribution. However, given the small sample size, approximating these distributions by a multivariate t distribution, as suggested in Lange [16], may provide confidence intervals with better coverage properties. Beran [1] also provides a method for constructing robust confidence intervals in the context of univariate prediction problems.

Functional Prediction - El Niño Data
The El Niño data consist of weekly readings of sea surface temperature, with the first observation in June of 1950. Complete data can be found at NOAA's Climate Prediction Center website (http://www.cpc.ncep.noaa.gov/data/indices/). The data that we are using for this analysis are found through Professor Frederic Ferraty's (Mathematics; University of Toulouse, France) website (http://www.math.univ-toulouse.fr/ferraty/SOFTWARES/NPFDA/npfda-datasets.html). These data are a subset of the original data, with monthly sea surface temperature records from June of 1950 to May of 2004. For this analysis, bi-monthly observations are added to the data to prevent significant changes to the shape of a given function due to interpolation error. Also, light smoothing is applied to all functions.
The goal of our study is to predict, based on the first seven months of temperature recordings measured from the lowest temperature recording of the previous year, both how high temperatures will remain for the rest of the year and how long temperatures will drop before rising again.
For this purpose, the data are restructured to define a "year" as the period of time between the lowest temperatures in consecutive calendar years. For example, the first year in our data set ranged from September 1950 to September 1951. Note, these "years" will not all be 12 months in length, and our final data had "years" that ranged from 11 to 14 months. For our analysis, we will concentrate on a subset of this group of restructured functions where the previous year's lowest temperature for each selected temperature profile is between 19.5°C and 21°C. Twenty-nine of the original temperature profiles fit this criterion. We will use the first 28 functions to predict the remaining portion of the 29th function based on the first 7 months of sea surface temperature observed in that year.
For the purpose of registration, all functions need to be recorded over the same interval of time.
As mentioned above, in this particular case our data are recorded over time periods that range from 11 to 14 months. An easy remedy is to perform a simple initial warping of each function that rescales every observation to an 11-month time frame. In our final analysis, this initial warping is accounted for when determining the final base functions used for the prediction algorithm.
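This initial rescaling amounts to linear interpolation onto a common grid; a minimal sketch (the function name and grid size are illustrative, not from the paper):

```python
import numpy as np

def rescale_to_common_grid(x, p=45):
    """Linearly rescale a function observed over 11-14 months onto a
    common grid of p points (the simple initial warping)."""
    src = np.linspace(0.0, 1.0, len(x))   # normalized original time
    dst = np.linspace(0.0, 1.0, p)        # normalized common time
    return np.interp(dst, src, x)
```

The linear map applied here is itself a warping function, which is why it must be composed with the estimated base functions when the final warping functions are reported.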
The original unregistered functions and the functions registered using the GP model described in Section 2 are plotted in Figure 5. For this data set, to register significant features in the sample while retaining function variation beyond a scaling and vertical shift of the target function, individual warping parameters, γ_{w_i}, i = 1, . . ., 28, were utilized instead of the single γ_w in (5). Significant differences in the amplitude variation of the original functions, unassociated with temporal variation, prevented the use of a global parameter. However, only 3 unique warping parameter values were necessary in total.
Using the empirical mean of the 28 original registered functions as the target function, the first 7 months of sea surface temperature records from observation 29 are registered to a piece of the target function, where the final registered time is allowed to vary from 6.5 to 7.5 months. Between these months, a finer time interval corresponding to weekly records is used to allow for additional flexibility in determining the final registered time. The partially recorded function is plotted with the target function in the lower right panel of Figure 5. The grey shaded area includes the time points considered for the final time of the partial registration. After the optimal registration of the partially recorded observation is determined, estimates of the entire registered function, warping function, and unregistered function are determined using the model outlined in Section 5.1.
One hundred bootstrapped samples were used to estimate the variability in the predictions of all three estimated functions. Figure 6 plots the initial estimates with the 95% bootstrapped confidence intervals. In addition, the plot of the estimated unregistered function also includes the true value of this function.
The primary advantage of registering the partially recorded observation before estimating future values is that we can capture variation in amplitude and timing separately. In Figure 6, the first plot captures the variability in the future level of sea surface temperature (amplitude variation) and answers the question, "How high can we expect sea surface temperatures to stay?"
The second plot captures the variability in the timing of future observations (temporal variation), which addresses the question, "When can we expect sea surface temperatures to begin rising again?" The confidence interval for the unregistered function, seen in the last frame of Figure 6, combines both amplitude and temporal variation to estimate the future trajectory of sea surface temperature for this year. In this illustration, it can be seen that the main difference between the estimated and actual temperature profiles lies in the timing of the lowest observation. However, for this observation, the sea surface temperature at 12 months is not much different from the sea surface temperature at 11.5 months. The predicted timing of the lowest temperature was 11.1 months.
One of the most notable features of this analysis is that there is little uncertainty in the registration of the first 7 months of sea surface temperatures. The most prominent feature in the data is the peak temperature, which occurs anywhere from 4 to 8 months into a year in the original data. In our partially recorded observation, as seen in Figure 5, the peaks of the target function and the partially recorded observation are already closely aligned. Additionally, this observation happens to be similar in shape to the target function. The combination of these features resulted in only a minimal amount of variation in the estimated registered and warping functions in the first 7 months. We note, however, that this phenomenon is an artifact of these particular data; in other analyses, more variation in the registered timing of the partially recorded observation would be expected.
The El-Niño data set provides a challenging registration problem. The registered functions vary significantly in directions beyond the target function. Choosing curve-specific registration parameters enabled features common to all functions to be registered while retaining prominent features in each individual curve. This is just one example of the difficulties that can arise in registering functional data and, in turn, of how these challenges can be addressed to analyze data that do not fit the "ideal" registration problem.

FUNCTIONAL DATA REGULARIZATION AND REGISTRATION
6.1 Combining Registration and Smoothing
If, instead of the function itself, noisy observations of each unregistered function X_i(t) are observed over a finite number of time points, t = (t_1, . . ., t_p)', we will additionally assume that the observations Y_i(t_j), j = 1, . . ., p, are iid N(X_i(t_j), σ²_Y). Incorporating registration and smoothing into a single model has also been considered recently by Rakêt et al. [20]. In their paper, each registered noisy function Y_i(h_i(t)) at time point t_j, j = 1, . . ., p, is composed as

Y_i(h_i(t_j)) = f(t_j) + r_i(t_j) + ε_i(t_j),

where f(t) is similar to our target function, r_i(t) is a function-specific random effect that accounts for variation in individual noiseless functions beyond the target function, and the ε_i(t_j) are iid Gaussian noise.
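The additive composition above can be simulated directly. This is a sketch under stated assumptions: the bump-shaped target, the sinusoidal random effect, and the noise level are all illustrative choices, not values from either paper:

```python
import numpy as np

rng = np.random.default_rng(0)
p, N = 50, 5
t = np.linspace(0.0, 1.0, p)

f = np.exp(-((t - 0.5) ** 2) / 0.02)          # shared target function f(t)
sigma_eps = 0.1                                # noise sd (illustrative)

Y_reg = np.empty((N, p))
for i in range(N):
    # function-specific random effect r_i(t): a smooth random deviation
    r_i = 0.2 * np.sin(2 * np.pi * t + rng.uniform(0, 2 * np.pi))
    eps = rng.normal(0.0, sigma_eps, size=p)   # iid Gaussian noise
    Y_reg[i] = f + r_i + eps                   # registered noisy function
```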
The advantage of our model is that incorporating individual random effects is unnecessary.
Because it is the observations that are noisy, not the registered functions, smoothing in our model is applied to the observations rather than to the functions after registration. Under these conditions, variability in the estimated unregistered, smoothed functions, X_i(t), can be examined separately from variability in the estimated registered functions, X_i(h_i(t)). Section 6.3 provides an example of how treating smoothing as a pre-processing step underestimates variability in the posterior distributions of the registered functions.
In the presence of noisy observations, the following distributions are either altered or added to the registration model presented in Section 2.
The most significant change to the model is that we now include a smoothing penalty in the covariance specification for the registered functions. Here, specifying P_2 (a roughness penalty; see Section 2 for more details) in the prior distribution for the registered functions establishes regularization in these functions. The associated smoothing parameter is λ_X. As mentioned previously, P_1 is required to define Σ_X as a proper covariance matrix. More details on these matrices can be found in Appendix A. As can be seen above, η_X and λ_X are considered as additional unknown parameters in this hierarchical model.
In the prior specifications of this model, equation (8) incorporates penalties for both smoothing and registration within the prior for the registered functions. The full conditional distribution for each approximated registered function, X_i(h_i), when data are noisily observed is the joint full conditional of the unregistered function and the warping function.
Instead of drawing from this joint full conditional directly, we will proceed by first drawing from f(X_i | rest) and then, given X_i, drawing from f(w_i | X_i, rest).
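This two-block scheme can be sketched on a toy scalar problem. The distributions below (the tanh link inside `log_post_w`, the prior variances) are hypothetical stand-ins chosen only so that the block structure, a conjugate Gaussian draw followed by a Metropolis step, is runnable; they are not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in: a scalar "function value" x with a Gaussian full
# conditional, and a scalar "warping coefficient" w whose full
# conditional is known only up to proportionality.
y, sigma2 = 1.5, 0.25

def log_post_w(w, x):
    # hypothetical unnormalized log full conditional f(w | x, rest):
    # x ~ N(tanh(w), 1) likelihood term plus a N(0, 1) prior on w
    return -0.5 * (x - np.tanh(w)) ** 2 - 0.5 * w ** 2

x, w = 0.0, 0.0
draws = []
for _ in range(2000):
    # Gibbs step: x | w, y is conjugate Gaussian (precision-weighted mean)
    prec = 1.0 / sigma2 + 1.0
    mean = (y / sigma2 + np.tanh(w)) / prec
    x = rng.normal(mean, 1.0 / np.sqrt(prec))
    # Metropolis step: random-walk proposal, accept with the usual ratio
    w_prop = w + rng.normal(0.0, 0.5)
    if np.log(rng.uniform()) < log_post_w(w_prop, x) - log_post_w(w, x):
        w = w_prop
    draws.append((x, w))
```

The same alternation, an exact draw for the block with a known full conditional followed by a Metropolis update for the block without one, is what a Metropolis-within-Gibbs sampler does for X_i and w_i in the paper's model.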
These full conditional distributions are determined in the standard way, recognizing that the prior distribution for a registered function can be factored into two components. One component penalizes lack of registration given the approximated unregistered function, X_i; the other component penalizes roughness in the registered function, which implicitly penalizes roughness in the unregistered function. The roughness penalty is independent of the warping function and therefore also of w_i. The prior distribution (8) for each X_i(h_i), i = 1, . . ., N, factors in this way. Accordingly, the components of the joint distribution of the data and all unknown parameters that depend on w_i are equations (15) and (9). The resulting full conditional distribution for the approximated functions w_i does not have a known distributional form and can be sampled via a Metropolis step in an MCMC sampler.
The components of the joint distribution of the data and all unknown parameters that depend on X_i are the sampling distribution (14) and equation (10). The resulting full conditional distribution is also not of a known distributional form and can be sampled using a Metropolis step. However, significant features of the unregistered function, X_i, should be unchanged by the registration, and smoothness in the registered function, X_i(h_i), implies the same level of smoothness in the unregistered function, X_i. For ease of sampling, we therefore re-write (10) in terms of the unregistered function, X_i, which results in a multivariate normal full conditional distribution for X_i.
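The resulting multivariate normal full conditional follows the standard precision-weighted combination of a Gaussian likelihood with a Gaussian prior. A sketch, with illustrative penalty matrices standing in for the paper's P_1 and P_2 and hypothetical values for η_X and λ_X:

```python
import numpy as np

p = 30
sigma2_Y = 0.25
t = np.linspace(0.0, 1.0, p)

# Illustrative penalty matrices (not the paper's exact P1, P2): P1 a
# ridge making the prior precision proper, P2 a second-difference
# roughness penalty, with hypothetical weights eta_X and lambda_X.
D2 = np.diff(np.eye(p), n=2, axis=0)
P1, P2 = np.eye(p), D2.T @ D2
eta_X, lambda_X = 0.01, 10.0
prior_prec = eta_X * P1 + lambda_X * P2

# noisy observations of one function
y = np.sin(2 * np.pi * t) + np.random.default_rng(2).normal(0, 0.5, p)

# Multivariate normal full conditional for X_i: combine the Gaussian
# likelihood precision with the prior precision in the standard way.
post_prec = prior_prec + np.eye(p) / sigma2_Y
post_cov = np.linalg.inv(post_prec)
post_mean = post_cov @ (y / sigma2_Y)
```

The posterior mean is a smoothed version of the noisy data: the roughness penalty in the prior precision shrinks the second differences of the estimate relative to those of y.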
When noisy observations, Y_i, i = 1, . . ., N, are recorded, the approximation we make in (11), while preserving conjugacy, prevents exact variational Bayes updates from being performed on the approximate posterior distributions of the following parameters: X_i, i = 1, . . ., N; σ²_Y; η_X; and λ_X. Hence, the adapted variational Bayes procedure proposed here requires special handling under this data assumption.

Adapted Variational Bayes For Noisy Functional Data
When the functional data are recorded with noise, the adapted variational Bayes algorithm requires further adjustments to perform approximate inference. With these adjustments, the convergence properties of the adapted variational Bayes algorithm no longer hold. However, we have found in practice that the adjusted algorithm still yields useful estimates for initializing an MCMC sampler.
Here we examine why the approximate posterior distributions for X_i, i = 1, . . ., N, η_X, and λ_X cannot be updated exactly using the adapted variational Bayes algorithm. In the m-th iteration, an update should be made to log q(X_i) for each i = 1, . . ., N; taking the expectation over the q distributions of all other parameters yields the updated parameters of q(X_i) = N_p(µ_{q(X_i)}, Σ_{q(X_i)}). In (12), however, the expectation of f(h_i^{-1}) is unknown, so a first approximation must be substituted for this term. The update for η_X has the same difficulty: taking the expectation over the q distributions of all other parameters yields the updated parameters of q(η_X), but the expression for d_{q(η_X)} contains expectations that are unknown and must likewise be approximated. The variational Bayes update for λ_X is similar to that of η_X and requires the same approximations, yielding q(λ_X) = G(c_{q(λ_X)}, d_{q(λ_X)}). Due to these modifications, when noisy observations are recorded, the convergence properties of the adapted variational Bayes algorithm are not guaranteed to hold, and log[f(Y, w; q)] cannot be monitored. However, we can monitor convergence for this model as follows. Taking advantage of the fact that functional smoothing converges more quickly than functional registration, fix the approximated unregistered functions, X_i, i = 1, . . ., N, after a small number of iterations and proceed as if they are known. Then, as in the model where the observations are recorded without noise, log[f(X, w; q)] can be monitored.

The Berkeley Growth Data
We refer back to the Berkeley Boys Growth Velocity dataset from Section 4.1, where these data were smoothed prior to registration. Here, we again consider these functions with the added assumption that they are corrupted by simulated mean-zero iid Gaussian noise, where the true noise variance, σ²_Y, is 0.25. While it is common in statistical analysis to perform preprocessing steps before applying a particular inference procedure, failing to account for the variability in parameter estimates due to the preprocessing step leads to overly narrow confidence (or credible) regions. In some cases, the effect may be fairly small, and not much is lost in this oversight. However, as we show here, the underestimated variability can be substantial when uncertainty in the preprocessing steps is ignored.
Section 4 illustrates how closely the AVB and MCMC estimates of the registered functions adhere to one another. Not only do these estimates tend to be fairly similar when the functions are recorded without noise, but the uncertainty in these estimates is minimal. Figure 7 contains the credible bands for two of the 39 pre-smoothed Boys Growth Velocity functions.
These bands are so narrow that the gap between them cannot be seen. Keep in mind that the posterior distributions of the registered functions are certainly multi-modal. These credible bands result from imposing the restriction that the mean value of the warping functions at each time point over the sample must equal that time point. Even with this restriction, the posterior distributions can be multi-modal. However, these narrow credible bands reflect that our estimates lie in a highly probable area of the posterior distribution with minimal local variance. Figure 8 contains credible bands for both the unregistered and registered functions for the same two functions used in Figure 7, after noise has been added to the data and accounted for in the model. The variability due to noise is substantial. The solid line in each plot is the noiseless version of these estimates (or observations, in the case of the unregistered functions).
In addition to providing more accurate credible intervals, this model estimates the noise variance to be 0.258 (the actual noise variance is 0.25). This estimate is obtained using uninformative priors for both the noise variance, σ²_Y, and the associated smoothing parameter, λ_X. This analysis illustrates how regularizing the data prior to statistical analysis severely limits inference for registration models. If significant noise is present in the data, it is prudent to account for the variability in the registration process due to the noise. Our proposed hierarchical model is one way to account for this variability.

DISCUSSION
In this paper we have shown that our Bayesian model for functional data registration compares favorably to the best registration methods currently available in terms of the quality of the estimates of aligned functions. Furthermore, the hierarchical structure of this model allows multiple inferential procedures to be included in one analysis. Here, we give an example where functional regularization and registration are performed in one model. However, these models will accommodate any combination of inferential procedures for which an appropriate prior can be defined. For instance, the models proposed here can easily be extended to a functional linear regression model where the registered functions or registered latent functions are considered as covariates.
While our registration algorithm provides high-quality estimates in a highly flexible model, the associated computational costs of running a MCMC sampling scheme for a high-dimensional model are considerable. To address these costs, we have proposed the adapted variational Bayes algorithm. This algorithm has been shown to converge in a similar way to traditional variational Bayes, but it may not converge to a global maximum. However, if the algorithm converges to a local maximum, an initial warping can be performed to move a function closer to its optimal registered value.
The bijective relationship between the base and warping functions in this model makes it possible to perform functional prediction in the context of registration. Using the estimated values of the base and registered functions from an initial registration of N functions, we use approximate distributions for these functions to predict future values of the registered, warping, and unregistered functions of a new function that is only partially observed. Furthermore, the AVB algorithm makes it possible to re-sample from these distributions to bootstrap confidence intervals for these predicted values.
While the AVB algorithm provides estimates that are similar to their MCMC counterparts, for some models MCMC sampling remains the optimal inferential procedure. For example, in the model that combines smoothing and registration, AVB estimates are approximate and are preferably used only to initialize an MCMC sampler. While we have used Metropolis-within-Gibbs samplers, these are likely not optimal. Determining the most efficient method of sampling from these joint posterior distributions is left for future work.
For simplicity, we have used inverse-Gamma or Gamma priors for the variance components in these models. These priors can be problematic when a variance component is small or when there is very little data to estimate it; a better choice is a uniform prior on the square root of each variance component, as suggested by Gelman [7]. For the models presented here, Gamma priors for the smoothing parameters are sufficient, as small changes in these parameters do not significantly affect the model. However, in the analysis of the noisy boys' growth velocity data in Section 6.3, we saw some evidence that the "uninformative" inverse-Gamma prior resulted in a slightly upward-biased estimate of the noise variance.
Modeling functions as Gaussian processes in a Bayesian hierarchical model offers a unified approach to performing multiple inference procedures for functional data within one model. In Earls and Hooker [5], we established the properties of functional data estimates determined by approximating functional distributions over a finite subset of observed time points and estimating functions at unobserved time points with linear interpolation. Furthermore, we demonstrated that providing smoothing information in scale or covariance matrices results in regularized functional estimates. Here we extend this work by using informative covariance matrices to register functions, where the registered functions are modeled as Gaussian processes. Future work includes adapting these models to other areas of inference for functional data and continuing to improve on the models we have proposed here.

APPENDIX A
Below, in detail, are the specifications for the hierarchical Bayesian registration model discussed in this paper.The first section includes the basic model for functional data registration also found in Section 2. Section A.2 describes the MCMC sampling scheme for this model.

A.1 Functional Registration
As discussed in Section 2, the initial assumption of this model is that we are interested in registering functional data, X_i(t), i = 1, . . ., N, where these data are either observed directly over a set of time points, t = (t_1, . . ., t_p)', or are estimated from noisy observations, Y_i = (Y_i(t_1), . . ., Y_i(t_p))'.
We assume a Gaussian noise process such that, for each observation Y_i, f(Y_i(t_j) | X_i(t_j), σ²_Y) = N(X_i(t_j), σ²_Y) for i = 1, . . ., N and j = 1, . . ., p, which results in the joint distribution of all observations, where Y is the matrix such that the observation for function X_i(t) at time point t_j is in the i-th row and j-th column, X is the matrix of the corresponding means for each entry in Y, and X_i = (X_i(t_1), . . ., X_i(t_p))' is the vector of evaluations of the function X_i(t) at time points t = (t_1, . . ., t_p)'.
When the observations are recorded with noise, the registered functions and noise variance are characterized by the following prior distributions. If, however, each function X_i(t) is observed directly over t, (13) assumes the role of the distribution of the observed data, and the covariance matrix Σ_X, designed to penalize roughness in the unregistered functions, is excluded. This results in the following data distribution.
In both scenarios, we assume the following additional priors, where a, b, c, and d are fixed hyperparameters.

APPENDIX B
B.1 Adapted Variational Bayes
Below are the approximate posterior distributions for the adapted variational Bayes estimation procedure outlined in Section 3.1. For a more thorough discussion and illustration of how the optimal q distributions are derived, see Goldsmith et al. [10].
As the variational Bayes procedure described in Section 3.1 requires conditionally conjugate distributions for all parameters except the base functions, we will use the alternate expression for the smoothing piece of the distribution on the registered functions found in (11) to provide the means to apply adapted variational Bayes to the model where the data are recorded with noise.
This allows us to approximate the appropriate q distributions for X_i, η_X, and λ_X. All other q distributions are determined in the usual way; for details on the approximated q distributions, see Section 3.1. Based on the full conditional distributions above, the following approximate posterior distributions are updated in each iteration.
If the observations are recorded without noise, i.e., we have observations X_i, i = 1, . . ., N, as described in (14), the approximate posterior distributions of all parameters except the base functions follow in closed form. Note that these updates contain terms that cannot be evaluated. For instance, E_{(θ_{-η_X})}[f(h_i^{-1}) f(h_i^{-1})'] cannot be determined because the approximate distribution of f(h_i^{-1}) is unknown. These terms can, however, be approximated; Section ?? provides details of the approximated values used for this analysis.

B.2 Convergence Criterion for Adapted Variational Bayes
The objective of the traditional variational Bayes algorithm is to find an approximate joint posterior distribution, given certain independence constraints, that has the minimum Kullback-Leibler distance from the true joint posterior distribution. It can be shown that minimizing the distance between the approximate posterior distribution, q(θ), and the true joint posterior distribution, f(θ | data = X), is equivalent to maximizing a q-specific lower bound on the marginal distribution of X defined as

f(X; q) := exp{ ∫ q(θ) log[ f(X, θ) / q(θ) ] dθ }.

This result is based on the definition of the Kullback-Leibler distance (Kullback and Leibler [15]); details can be found in Goldsmith et al. [10].
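This bound can be checked numerically on a two-point toy problem: for any q, log f(X; q) sits below log f(X), with equality when q is the exact posterior. The joint probabilities below are arbitrary illustrative numbers:

```python
import numpy as np

# Discrete toy: theta in {0, 1}, joint f(X, theta) given by two numbers.
joint = np.array([0.3, 0.1])           # f(X, theta=0), f(X, theta=1)
evidence = joint.sum()                 # f(X), the marginal of the data
posterior = joint / evidence           # f(theta | X)

def log_lower_bound(q):
    # log f(X; q) = E_q[ log f(X, theta) - log q(theta) ]
    return float(np.sum(q * (np.log(joint) - np.log(q))))

q_bad = np.array([0.5, 0.5])
lb_bad = log_lower_bound(q_bad)        # strictly below log f(X)
lb_opt = log_lower_bound(posterior)    # equals log f(X)
```

This is exactly why monitoring log[f(X; q)] works as a convergence criterion: each coordinate update can only move the bound upward toward log f(X).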
This algorithm is guaranteed to converge. Furthermore, when q(θ) is defined as a product of exponential family distributions, the algorithm converges to a global maximum of f(X; q). Traditional variational Bayes monitors log[f(X; q)] until changes in this value fall below a set threshold.
Here we show that, when the functional data are assumed to be observed without noise, our proposed adapted variational Bayes algorithm is also guaranteed to converge, and convergence can be monitored in a similar fashion to traditional variational Bayes.
If the algorithm instead converges to a local maximum, this is easily discovered by inspection for a well-defined registration problem and can be overcome by initializing the troublesome w_i to a value that more closely represents the true warping.
The lower bound of the marginal distribution of X and w can be monitored until changes in this function are under some threshold.The specific form of this function can be found in the next section.

B.3 Convergence Criterion
When the functional observations, X_i, i = 1, . . ., N, are recorded without noise, E_{q(θ_{-w})}[ log f(X, w, θ_{-w}) − log q(θ_{-w}) ] is monitored until the desired threshold is met.

Figure 1: Simulated Data Set 2. Top Left Unregistered functions. Top Right Registered functions using the minimum eigenvalue criteria (R package 'fda'). Lower Left Functions registered by F-R (R package 'fdasrvf'). Lower Right Functions registered by the GP model.

Figure 2: Registered Boys Growth Velocity. Top Left Original unregistered boys velocity data functions. Top Right Boys velocity functions registered using the minimum eigenvalue criteria (R package 'fda'). Lower Left Boys velocity functions registered by F-R (R package 'fdasrvf').

Figures 1 and 2 contain the datasets used for this analysis. Each figure includes the original unregistered data along with plots of the functions registered using the three proposed methods. For all three registration methods, a range of parameter values was explored for optimal registration.

Figure 1 contains the functions of a simulated data set. These data consist of 20 unregistered scaled mixtures of three Gaussian probability density functions. All three registration procedures result in similar alignments. However, the GP method does a better job of recovering the original shape of the functions and results in the lowest sls. Note: the ME registered functions are based on 5 complete runs of the ME algorithm, where in each run the previous run's results were used as the 'unregistered' functions.

Figure 3: Simulated Data - Difference Between MCMC and AVB Estimates. Left Plot of the squared L2 norm of the difference between the MCMC and AVB estimates for each observation, in decreasing order of magnitude. Center and Right The original unregistered function plotted with the MCMC and AVB estimates of the registered functions for the observations with the two largest discrepancies between the MCMC and AVB estimates.

Figure 4: Registered Boys Growth Velocity - Differences Between MCMC and AVB Estimates. Top Left Plot of the squared L2 norm of the difference between the MCMC and AVB estimates for each observation, in decreasing order of magnitude. Top Center and Right The original unregistered function plotted with the MCMC and AVB estimates of the registered functions for the observations with the two largest discrepancies between the MCMC and AVB estimates.
The efficiency of adapted variational Bayes for prediction also makes it possible to characterize variability in the estimates of the complete registered function, unregistered function, and base function via bootstrapping. To capture variability in the predicted functions, for each of m = 1, . . ., M iterations:
1. Draw a new sample of N registered functions from N_p(μ_{X(h)}, Σ_{X(h)}).
2. Draw a new sample of N base functions from N_{p−1}(μ_w, Σ_w).

3. From the bootstrapped samples determined in (1) and (2), compute the sample mean and covariance matrix for the approximated registered functions, μ^(m)_{X(h)} and Σ^(m)_{X(h)}, and also for the approximated base functions, μ^(m)_w and Σ^(m)_w.
4. Register the partially observed function as in prediction step (1) in Section 5.1 above by setting the approximated target function, f, equal to μ^(m)_{X(h)}.
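The resampling core of the steps above can be sketched as follows. The fitted mean and covariance are illustrative placeholders (not the paper's AVB output), and only the registered-function resampling and the pointwise interval over the M bootstrap means are shown:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, M = 28, 20, 100

# Stand-ins for the AVB output: fitted mean and covariance of the
# registered functions (illustrative values, not the paper's fit).
mu_hat = np.sin(np.linspace(0, np.pi, p))
idx = np.arange(p)
Sigma_hat = 0.05 * np.exp(-np.abs(np.subtract.outer(idx, idx)) / 3.0)

boot_means = np.empty((M, p))
for m in range(M):
    # step 1: draw a new sample of N registered functions
    sample = rng.multivariate_normal(mu_hat, Sigma_hat, size=N)
    # step 3: recompute the sample mean (target function for step 4)
    boot_means[m] = sample.mean(axis=0)

# pointwise 95% bootstrap interval for the target function
lower, upper = np.percentile(boot_means, [2.5, 97.5], axis=0)
```

In the paper's procedure, each bootstrap mean would additionally drive a re-registration of the partially observed function (step 4), so the resulting intervals reflect variability in the prediction, not just in the target function.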

Figure 5: El-Niño Data. Top Left Original 28 profiles of sea surface temperature. Top Right Estimated warping functions. As can be seen here, the time period of the original data ranged from 11 to 14 months. Lower Left Estimated registered temperature profiles. Lower Right The solid line is observation 29, recorded for 7 months. The dashed line is the estimated target function. The grey shaded area spans the 5 time points that are considered for the final time of the partial registration.

Figure 6: Estimates and Bootstrapped Confidence Intervals. Left Estimated registered function with 95% bootstrapped confidence interval. Note: in parts of the domain, the estimated confidence interval and estimated registered function overlap. This is largely due to a bimodal distribution of the last registered time. Center Estimated warping function with 95% bootstrapped confidence interval. Right Estimated unregistered function with 95% bootstrapped confidence interval. The dashed-and-dotted line is the true unregistered function.

Figure 7: Examples of Credible Intervals for Noiseless Observations. These are two examples, from the Boys Growth Velocity Data, of the tight credible bands that result from registering functions that are pre-smoothed.

Figure 8: Examples of Credible Bands for the Unregistered and Registered Functions when the Noise Process is Included in the Model. Top and Lower Left 95% credible bands for the unregistered functions are plotted with the original noiseless functions for subjects 8 and 11. Top and Lower Right For subjects 8 and 11, 95% credible bands for the registered functions are plotted with the estimate of the registered 'true' functions.
E_{q(θ_{-w})}[ log f(X, w^(m), θ^(m)_{-w}) − log q(θ^(m)_{-w}) ] ≤ E_{q(θ_{-w})}[ log f(X, w^(m+1), θ^(m)_{-w}) − log q(θ^(m)_{-w}) ]

Section 4 provides several examples that illustrate how allowing the target function to be estimated within the model results in a more complete functional registration in comparison to the Procrustes method. However, the Gaussian process model does not constrain the timing of a feature in the target function to occur at the average time of the corresponding unregistered features. Although the model for the w_i(t) is centered on zero, it is still possible that the average of the estimated warped time points, h_•(t_1), . . ., h_•(t_p), does not correspond to the original time points.