A Geometric Approach to Pairwise Bayesian Alignment of Functional Data Using Importance Sampling

We present a Bayesian model for pairwise nonlinear registration of functional data. We use the Riemannian geometry of the space of warping functions to define appropriate prior distributions and sample from the posterior using importance sampling. A simple square-root transformation is used to simplify the geometry of the space of warping functions, which allows for computation of sample statistics, such as the mean and median, and a fast implementation of a $k$-means clustering algorithm. These tools allow for efficient posterior inference, where multiple modes of the posterior distribution corresponding to multiple plausible alignments of the given functions are found. We also show pointwise $95\%$ credible intervals to assess the uncertainty of the alignment in different clusters. We validate this model using simulations and present multiple examples on real data from different application domains including biometrics and medicine.


Introduction
The problem of registration of functional data is important in many branches of science. In simple terms, it deals with deciding how points on one function match in some optimal way with points on another function. In contrast to landmark-based matching, such an approach matches the entire domains of the functions in a general registration problem. The study of registration problems is popular in image analysis, where pixels or voxels across images are matched, and in shape analysis of objects, where points across shapes are matched. One can broadly classify registration problems into two main groups: (1) pairwise registration and (2) groupwise registration. In pairwise registration, one solves for an optimal matching between two objects, while in groupwise registration multiple (> 2) objects are simultaneously registered. In this paper, we focus on the problem of pairwise registration. This problem has been referred to in many different ways, some of which are alignment, warping, deformation matching, amplitude-phase separation, and so on. While registration can be studied for many types of objects, from simple functions to complex high-dimensional structures, the fundamental issues in registration are often similar. We will focus on perhaps the simplest objects for studying registration problems, real-valued functions on $[0, 1]$. More specifically, we will take a Bayesian approach to this problem, motivated by geometrical considerations; the method will be characterized by the definition of a geometric prior on a suitable function space, representing the parameter space of interest. We also compare the proposed method to past ideas that often take an optimization-based approach.
To motivate the function alignment problem, consider the example shown in Figure 1. In panel (a), we display an example of a PQRST complex with labeled structures (P wave, QRS complex, T wave). This function represents a full heartbeat cycle and can be extracted from long electrocardiogram (ECG) signals for the purposes of diagnosing heart diseases such as myocardial infarction. The difficulty with using such objects for diagnosis is highlighted in panel (b). As given, the P wave and QRS complex on the red function occur earlier than on the blue one. This is usually due to natural variability in nonlinear heartbeat dynamics. In general, given two PQRST complexes, their important salient features are often not in correspondence. This presents a major challenge when modeling these functions. Even simple statistics such as the cross-sectional mean can be meaningless (see Figure 8 and row 3 in Figure 13). Aligning the functions prior to subsequent statistical analyses is thus required. The purpose of pairwise alignment is to estimate a warping function, and additionally the uncertainty in this estimate, that aligns the prominent features across two functions. In panel (c), we display the estimated warping function in red, and in panel (d) we show the resulting alignment of the two PQRST complexes. Now, the P wave, QRS complex and T wave occur at the same time across both functions. There exists a large literature on statistical analysis of functions, in part due to the pioneering efforts of Ramsay and Kneip [25,9], and several others [17,34]. When restricting to the analysis of elastic functions (functions that are temporally aligned) the literature is relatively recent [24,7,17,8,34,10,21]. 
The general approach in most of these methods is to use an energy function to compute optimal registrations and perform subsequent analysis on the aligned functions using standard tools from functional data analysis such as the cross-sectional mean, covariance and functional Principal Component Analysis (fPCA). The importance of registration in functional data is undeniable as evidenced in a recent Special Section of the Electronic Journal of Statistics titled Statistics of Time Warpings and Phase Variations [19]; this section contained a set of applied papers that analyzed four different datasets, including mass spectrometry functions [11], neural spike trains [42], juggling trajectories [23] and internal carotid arteries [27].
Recently, it has been argued that a Bayesian approach, rather than pure optimization, is a better option in many situations. The advantages of a model-based Bayesian approach include: 1. A comprehensive exploration of the warping variable space, resulting in potential multimodal solutions to the registration problem; 2. Assessment of uncertainty, via credible intervals, associated with the computed estimates.
The literature on registration methods that are based on Bayesian principles is fairly limited. Telesca and Inoue [35] proposed a semi-parametric model for groupwise alignment of functional data. These models were further extended in the context of analyzing microarray data in [36]. A nonparametric approach to the groupwise registration problem was also proposed recently in [33]. A different Bayesian model was proposed for registering liquid chromatography-mass spectrometry data in [37]. The main difficulty in specifying Bayesian registration models lies in defining an appropriate prior on the space of warping functions, or some relevant subset, to enable efficient inference. In [30], Srivastava and Jermyn defined a Gaussian-type prior distribution on the space of warping functions, via the geodesic distance, in the context of detecting shapes in two-dimensional point clouds. The recent model of Cheng et al. [2] used the square-root slope function (SRSF) representation of functional data and utilized the fact that the derivative of a warping function is a probability density function. In this way, they constructed a Dirichlet process to impose a prior model implicitly on the space of warping functions, and sampled from the posterior distribution using Markov chain Monte Carlo (MCMC) techniques. The SRSF representation of functional data has many desirable properties related to the registration problem, which we emphasize in Section 2.
In the current paper, we describe a convenient geometric structure, a unit sphere, using the square-root density (SRD) representation of warping functions and use its geometry to impose the prior. In this setup, we develop a Bayesian registration model and utilize importance sampling from the posterior to compute posterior functionals such as the mean, median or maximum a posteriori (MAP) estimate. We also provide pointwise standard deviations and credible intervals to assess alignment uncertainty. We show that these tools are especially effective when two or more registrations are plausible. Thus, the main contributions of this paper are the following: 1. We use the spherical geometry of the space of warping functions to define a class of truncated wrapped normal prior distributions for the purpose of Bayesian alignment of functional data; 2. We define a sampling importance re-sampling approach to sample from the marginal posterior distribution of warping functions; 3. We use the Riemannian geometry of the space of warping functions to define an efficient k-means clustering algorithm, which can be used to identify multiple modes in the posterior representing different plausible alignments of the observed functions.
The rest of this paper is organized as follows. In Section 2, we give a detailed description of the registration problem and describe tools for statistical analysis on the space of warping functions. In Section 3, we introduce our registration model, and in Section 4 we describe an importance sampling approach for sampling from the posterior distribution of warping functions. In Sections 5 and 6, we present simulation studies and different applications of the proposed framework, emphasizing examples where the posterior distribution is multimodal. Finally, we close with a brief summary and directions for future work in Section 7.

Problem Background
Before we describe our Bayesian framework, we first set up the registration problem mathematically. Let $F$ be an appropriate subset (made precise later) of real-valued functions on the interval $[0, 1]$, and let $\Gamma$ denote the set of warping functions, i.e., orientation-preserving diffeomorphisms of $[0, 1]$ that fix the boundary. For any two functions $f_1, f_2 \in F$, the registration problem is defined as finding the mapping $\gamma \in \Gamma$ such that the point $t \in [0, 1]$ on the domain of $f_1$ is matched to the point $\gamma(t) \in [0, 1]$ on the domain of $f_2$. In other words, the functions $f_1(t)$ and $f_2(\gamma(t))$ are optimally matched under the chosen optimality criterion. The main question that arises is: What should be the criterion for optimal registration? A natural tendency is to choose an $L^p$-norm between $f_1$ and $f_2 \circ \gamma$, but there are some known limitations of that approach. For instance, if we choose the $L^2$ norm, defined as $\|f\| = \left(\int_0^1 f(t)^2\, dt\right)^{1/2}$ (the standard Euclidean norm on functions), we obtain the following optimization problem:
$$\inf_{\gamma \in \Gamma} \|f_1 - f_2 \circ \gamma\|^2. \quad (1)$$
This setup can lead to a degenerate solution, termed the pinching effect, demonstrated in [20]. In this case, one can pinch the entire function $f_2$ to get arbitrarily close to $f_1$ in the $L^2$ norm. To avoid this situation, one often adds a roughness penalty on $\gamma$, denoted by $R(\gamma)$, leading to the optimization problem given by
$$\inf_{\gamma \in \Gamma} \left\{ \|f_1 - f_2 \circ \gamma\|^2 + \lambda R(\gamma) \right\}.$$
Although this avoids the pinching effect, it introduces some other issues. First, the choice of $\lambda$ is not obvious in general cases. Second, and more important, is the fact that this solution is not symmetric. That is, the optimal registration of $f_1$ to $f_2$ can be quite different from that of $f_2$ to $f_1$. Another related issue is that this criterion is not a proper metric, which leads to additional problems in later analysis. Most papers on registration of functional data involve this setup and inherit the above-mentioned limitations.
For registration under this approach, each $f \in F$ is represented by its SRSF $q$, defined as $q(t) = \mathrm{sign}(\dot f(t))\sqrt{|\dot f(t)|}$. One sets $F$ to be the space of all absolutely continuous functions, and the resulting space of all SRSFs is $L^2([0, 1], \mathbb{R})$, henceforth referred to simply as $L^2$. For every $q \in L^2$ there exists a function $f$ (unique up to a constant) such that the given $q$ is the SRSF of that $f$. In fact, this function can be obtained precisely using $f(t) = f(0) + \int_0^t q(s)|q(s)|\, ds$. Note that if a function $f$ is warped by $\gamma$ to $f \circ \gamma$, its SRSF changes from $q$ to $(q, \gamma) = (q \circ \gamma)\sqrt{\dot\gamma}$; this last term involving $\sqrt{\dot\gamma}$ is an important departure from previous solutions. To set up the registration problem, we define an equivalence class of an SRSF as $[q] = \{(q, \gamma) \mid \gamma \in \Gamma\}$. Finally, the pairwise registration between any two functions $f_1$ and $f_2$ is performed by solving an optimization problem over equivalence classes of their SRSF representations:
$$\gamma_{DP} = \operatorname*{arginf}_{\gamma \in \Gamma} \|q_1 - (q_2, \gamma)\|^2.$$
The solution to this problem is computed using the dynamic programming (DP) algorithm. The resulting distance between the aligned $f_1$ and $f_2$ is given by $d([q_1], [q_2]) = \|q_1 - (q_2, \gamma_{DP})\|$. As described in [32], this framework has many advantages: it avoids the pinching problem, its registration solution is symmetric, it does not require an additional regularization term and the choice of $\lambda$ that goes with it, and it is actually a proper metric on the quotient space $F/\Gamma$, which provides important tools for ensuing analysis. The most important reason why this setup avoids many problems of Equation 1 is that $\|q_1 - q_2\| = \|(q_1, \gamma) - (q_2, \gamma)\|$ for any $\gamma \in \Gamma$. In mathematical terms, it means that the action of $\Gamma$ on $L^2$ is by isometries. The original method was later extended to apply to statistical analysis of cyclostationary biosignals [15], and was shown to perform well in different applications [38,39,43,19].
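As a concrete illustration of the SRSF representation and the isometry property above, the following sketch (in Python with NumPy; the grid size and test functions are arbitrary choices for illustration, not from the paper) computes discrete SRSFs and verifies numerically that warping both functions by the same $\gamma$ preserves the $L^2$ distance:

```python
import numpy as np

def srsf(f, t):
    """Compute the SRSF q(t) = sign(f'(t)) * sqrt(|f'(t)|) on a discrete grid."""
    df = np.gradient(f, t)
    return np.sign(df) * np.sqrt(np.abs(df))

def group_action(q, gamma, t):
    """Compute (q, gamma) = (q o gamma) * sqrt(gamma'), the SRSF of f o gamma."""
    dgamma = np.gradient(gamma, t)
    return np.interp(gamma, t, q) * np.sqrt(dgamma)

def l2_norm(g, t):
    return np.sqrt(np.trapz(g ** 2, t))

t = np.linspace(0, 1, 1001)
f1 = np.sin(2 * np.pi * t)
f2 = np.cos(2 * np.pi * t) - 1.0
q1, q2 = srsf(f1, t), srsf(f2, t)
gamma = t + 0.15 * t * (1 - t)

# Isometry: warping both SRSFs by the same gamma preserves the L2 distance
# (up to discretization error).
d_before = l2_norm(q1 - q2, t)
d_after = l2_norm(group_action(q1, gamma, t) - group_action(q2, gamma, t), t)
```

The near-equality of `d_before` and `d_after` is exactly the isometry property that Equation 1 lacks.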
While the framework of Srivastava et al. [32] is precise in mathematically defining the function registration problem, it solves for optimal warping functions via energy optimization. In this paper, we argue that a model-based Bayesian approach has many additional advantages. Thus, to preserve the nice properties, such as the isometric action of Γ under the L 2 metric, we build our Bayesian model using the SRSF representation of functional data.

Representation Space of Warping Functions
The proposed Bayesian model defines prior distributions and importance functions on the space of warping functions $\Gamma$. Thus, we are faced with defining statistics and probability distributions on this space. In order to do this, we use the Fisher-Rao Riemannian metric on $\Gamma$, which is given by (for $w_1, w_2 \in T_\gamma(\Gamma)$ and $\gamma \in \Gamma$) [29,31,12]:
$$\langle\langle w_1, w_2 \rangle\rangle_\gamma = \int_0^1 \dot w_1(t)\, \dot w_2(t)\, \frac{1}{\dot\gamma(t)}\, dt,$$
where $\dot w$ and $\dot\gamma$ represent derivatives. An important property of the Fisher-Rao metric is that it is invariant to re-parameterizations of probability density functions [41]. While this is not the only metric that achieves this property, it is important to note that there is no invariant metric that does not include derivatives. It is possible to define statistics and probability models directly on $\Gamma$ under the Fisher-Rao metric, but this proves to be very complicated due to the non-trivial Riemannian geometry of this space. We use the Fisher-Rao Riemannian geometry in our Bayesian setup because the desirable properties of this metric (i.e., parameterization invariance) will naturally translate to the prior distributions on $\Gamma$. Inference on $\Gamma$ is greatly simplified using a convenient transformation, which is similar to the definition of the SRSF for general functions [1].
Definition 1. Define the mapping $\phi: \Gamma \to \Psi$ as follows: given an element $\gamma \in \Gamma$, define a new representation $\psi: [0, 1] \to \mathbb{R}_{>0}$ using the square-root of its derivative, $\phi(\gamma) = \psi = \sqrt{\dot\gamma}$.
This is the same as the SRSF defined earlier for functions and takes this form because $\dot\gamma(t) > 0$ for all $t$. For simplicity, and to distinguish it from the SRSF representation of observed functions, we refer to this representation as the square-root density (SRD). The identity map $\gamma_{id}(t) = t$ maps to a constant function with value $\psi_{id}(t) = 1$. An important advantage of this transformation is that the $L^2$ norm of a function $\psi$ is 1, since $\int_0^1 \psi(t)^2\, dt = \int_0^1 \dot\gamma(t)\, dt = \gamma(1) - \gamma(0) = 1$. Thus, the set of all such $\psi$s, denoted by $\Psi$, is a subset of the unit sphere in $L^2$. Furthermore, as shown in [1,29,31,12], the Fisher-Rao metric on the space of warping functions simplifies to the $L^2$ metric on $\Psi$, which in turn greatly simplifies all computation. Given a function $\psi$, one can easily compute the corresponding warping function via integration using $\gamma(t) = \int_0^t \psi(s)^2\, ds$; this provides the inverse mapping $\phi^{-1}: \Psi \to \Gamma$. Thus, the geodesic path between two warping functions $\gamma_1, \gamma_2 \in \Gamma$, represented using their SRDs $\psi_1, \psi_2 \in \Psi$, is simply the great circle connecting them, $\alpha(\tau) = \frac{1}{\sin\theta}\left[\sin(\theta(1-\tau))\psi_1 + \sin(\theta\tau)\psi_2\right]$, where $\theta$ represents the length of this path (geodesic distance between warping functions $\gamma_1$ and $\gamma_2$ under the Fisher-Rao metric) and is simply the arc-length between $\psi_1$ and $\psi_2$:
$$\theta = d(\psi_1, \psi_2) = \cos^{-1}\left(\langle \psi_1, \psi_2 \rangle\right), \quad (4)$$
where $\langle \cdot, \cdot \rangle$ is the standard $L^2$ inner product.
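The SRD transformation, its inverse, the arc-length distance, and the great-circle geodesic above can be sketched in a few lines of NumPy (grid size and the example warping are arbitrary illustration choices):

```python
import numpy as np

t = np.linspace(0, 1, 1001)
dt = t[1] - t[0]

def to_srd(gamma):
    """phi(gamma) = sqrt(gamma'), a point on the unit sphere in L2."""
    return np.sqrt(np.gradient(gamma, t))

def from_srd(psi):
    """phi^{-1}(psi)(t) = integral_0^t psi(s)^2 ds, via cumulative trapezoid."""
    g = np.concatenate(([0.0], np.cumsum((psi[:-1] ** 2 + psi[1:] ** 2) / 2) * dt))
    return g / g[-1]  # enforce gamma(1) = 1 against discretization error

def fr_dist(psi1, psi2):
    """Fisher-Rao distance = arc length on the sphere (Equation 4)."""
    return np.arccos(np.clip(np.trapz(psi1 * psi2, t), -1.0, 1.0))

def geodesic(psi1, psi2, tau):
    """Great-circle path alpha(tau) between two SRDs."""
    th = fr_dist(psi1, psi2)
    return (np.sin((1 - tau) * th) * psi1 + np.sin(tau * th) * psi2) / np.sin(th)

gamma = t + 0.15 * t * (1 - t)
psi = to_srd(gamma)
psi_id = np.ones_like(t)   # SRD of the identity warping
theta = fr_dist(psi, psi_id)
```

Round-tripping through `to_srd`/`from_srd` recovers the warping, and points along the geodesic remain (numerically) on the unit sphere.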
Since the differential geometry of the sphere is well known, this transformation also simplifies the problem of defining probability distributions of warping functions. The general approach will be to define wrapped probability distributions, and perform random sampling and probability calculations on tangent spaces of $\Psi$; the tangent space at any $\psi \in \Psi$ is defined as $T_\psi(\Psi) = \{v: [0, 1] \to \mathbb{R} \mid \langle v, \psi \rangle = 0\}$. In order to achieve this goal, we must first define some standard tools from differential geometry for this space: 1. Exponential map: For $\psi \in \Psi$ and $v \in T_\psi(\Psi)$, the exponential map $\exp: T_\psi(\Psi) \to \Psi$ is defined as $\exp_\psi(v) = \cos(\|v\|)\psi + \sin(\|v\|)\frac{v}{\|v\|}$. 2. Inverse exponential map: For $\psi_1, \psi_2 \in \Psi$, the inverse exponential map $\exp_{\psi_1}^{-1}: \Psi \to T_{\psi_1}(\Psi)$ is given by $\exp_{\psi_1}^{-1}(\psi_2) = \frac{\theta}{\sin\theta}(\psi_2 - \cos(\theta)\psi_1)$, where $\theta = d(\psi_1, \psi_2)$. 3. Parallel transport: For $\psi_1, \psi_2 \in \Psi$, the shortest geodesic path $\alpha: [0, 1] \to \Psi$ such that $\alpha(0) = \psi_1$ and $\alpha(1) = \psi_2$, and a vector $v \in T_{\psi_1}(\Psi)$, its parallel transport along $\alpha$ to $\psi_2$ is defined as $\kappa: T_{\psi_1}(\Psi) \to T_{\psi_2}(\Psi)$, $\kappa(v) = v - \frac{2\langle v, \psi_2 \rangle}{\|\psi_1 + \psi_2\|^2}(\psi_1 + \psi_2)$. The exponential and inverse exponential maps provide a natural way of mapping points from the representation space $\Psi$ to the tangent space (at a particular element of $\Psi$) and vice versa. Parallel transport along geodesic paths allows translation of tangent vectors from one tangent space to another. An important property of parallel transport is that the mapping $\kappa$ is an isometry between the two tangent spaces, i.e., for $v_1, v_2 \in T_{\psi_1}(\Psi)$, $\langle \kappa(v_1), \kappa(v_2) \rangle = \langle v_1, v_2 \rangle$. This tool from differential geometry is useful in defining probability models on the space of warping functions. In particular, we define an orthonormal basis in the tangent space at any point on $\Psi$ by transporting a standard basis defined on the tangent space at the identity element, $T_1(\Psi)$.
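The three sphere operations above translate directly into code; the following sketch (illustrative grid and tangent vector, not from the paper) checks that the exponential map lands on the sphere, that the inverse exponential map recovers the tangent vector, and that parallel transport is an isometry producing a vector tangent at the destination:

```python
import numpy as np

t = np.linspace(0, 1, 1001)
ip = lambda a, b: np.trapz(a * b, t)  # L2 inner product on [0, 1]

def exp_map(psi, v):
    """Exponential map: exp_psi(v) = cos(||v||) psi + sin(||v||) v / ||v||."""
    nv = np.sqrt(ip(v, v))
    if nv < 1e-12:
        return psi.copy()
    return np.cos(nv) * psi + np.sin(nv) * v / nv

def inv_exp_map(psi1, psi2):
    """Inverse exponential map: tangent vector at psi1 pointing to psi2."""
    theta = np.arccos(np.clip(ip(psi1, psi2), -1.0, 1.0))
    if theta < 1e-12:
        return np.zeros_like(psi1)
    return (theta / np.sin(theta)) * (psi2 - np.cos(theta) * psi1)

def parallel_transport(v, psi1, psi2):
    """Transport v in T_psi1 along the shortest geodesic to T_psi2."""
    u = psi1 + psi2
    return v - (2 * ip(v, psi2) / ip(u, u)) * u

psi1 = np.ones_like(t)                        # identity element psi_id = 1
v = 0.3 * np.sqrt(2) * np.sin(2 * np.pi * t)  # a tangent vector at psi1
psi2 = exp_map(psi1, v)
w = parallel_transport(v, psi1, psi2)
```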
Summary statistics on Ψ: In addition to defining prior distributions on the space of warping functions, we would like to be able to compute summary statistics such as the mean or median. These tools are especially useful in inference based on samples generated from the posterior distribution. Suppose that we have a sample of warping functions $\gamma_1, \ldots, \gamma_p$, and that we are interested in defining a mean and median of these functions. To do this, we again exploit the geometry of $\Psi$. We begin by mapping each $\gamma_i$ to its corresponding SRD representation, resulting in $\psi_1, \ldots, \psi_p$. Once this is done, all of our data lie on a subset of the unit sphere, where the geodesic distance is used to compute their intrinsic mean and median as follows. The sample Karcher mean is given by $\bar\psi = \operatorname*{argmin}_{\psi \in \Psi} \sum_{i=1}^p d(\psi, \psi_i)^2$, while the sample geometric median is defined as $\tilde\psi = \operatorname*{argmin}_{\psi \in \Psi} \sum_{i=1}^p d(\psi, \psi_i)$. Gradient-based approaches for finding the Karcher mean and geometric median are given in several places [16,5,6,14] and are omitted here for brevity.
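A minimal sketch of one such gradient-based Karcher mean (a fixed-step gradient descent on the sphere; the step size, tolerance, and the three quadratic test warpings are illustrative assumptions, not the cited algorithms):

```python
import numpy as np

t = np.linspace(0, 1, 501)
ip = lambda a, b: np.trapz(a * b, t)

def exp_map(psi, v):
    nv = np.sqrt(ip(v, v))
    return psi if nv < 1e-12 else np.cos(nv) * psi + np.sin(nv) * v / nv

def inv_exp_map(psi1, psi2):
    th = np.arccos(np.clip(ip(psi1, psi2), -1.0, 1.0))
    return np.zeros_like(psi1) if th < 1e-12 else (th / np.sin(th)) * (psi2 - np.cos(th) * psi1)

def fr_dist(psi1, psi2):
    return np.arccos(np.clip(ip(psi1, psi2), -1.0, 1.0))

def karcher_mean(psis, step=0.5, tol=1e-8, max_iter=200):
    """Shoot along the average of the log-maps until the gradient vanishes."""
    mu = psis[0].copy()
    for _ in range(max_iter):
        v = np.mean([inv_exp_map(mu, p) for p in psis], axis=0)
        if ip(v, v) < tol:
            break
        mu = exp_map(mu, step * v)
    return mu

# SRDs of three quadratic warpings, as in the simulations of Section 5
psis = [np.sqrt(np.gradient(t + a * t * (1 - t), t)) for a in (-0.3, 0.0, 0.3)]
mu = karcher_mean(psis)
sum_sq = lambda c: sum(fr_dist(c, p) ** 2 for p in psis)
```

By definition, the estimated mean should attain a sum of squared distances no larger than any of the sample points themselves.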
K-means clustering on Ψ: One of the motivations behind this work is the discovery and analysis of multiple modes in the posterior distribution of warping functions. For this purpose, we introduce a k-means clustering approach on $\Psi$. In the previous section, we defined a procedure to compute the Karcher mean of warping functions, and we will use it to specify the k-means clustering algorithm. Let $\gamma_1, \ldots, \gamma_p$ be a sample from the posterior distribution and $\psi_1, \ldots, \psi_p$ be their corresponding SRDs. The k-means clustering approach computes a partition of the sample space such that the within-cluster sum of squared distances is minimized. This is achieved using the following standard algorithm [18], initialized with cluster means $\bar\psi_{1,0}, \ldots, \bar\psi_{k,0}$ and $j = 0$: (1) Compute the geodesic distance $d(\psi_i, \bar\psi_{c,j})$ (Equation 4) between each sample and each cluster mean. (2) Assign each $\psi_i$ to the cluster with the closest mean. (3) Update cluster means $\bar\psi_{1,j+1}, \ldots, \bar\psi_{k,j+1}$ using the Karcher mean. (4) Set $j = j + 1$. (5) Repeat Steps 1-4 until cluster assignments remain unchanged.
A major benefit of this algorithm is its flexibility. One can easily replace the k-means formulation by, for example, k-medians. This is especially useful when the mean may not be a good estimate of the posterior mode of interest.
There are two main limitations of this algorithm: (1) the solution strongly depends on the initialization of the k cluster means, and (2) the number of clusters k must be specified a priori (usually the expected number of posterior modes is unknown). To overcome limitation (1), we compute all pairwise distances between the given samples using Equation 4 and perform hierarchical clustering using the maximum (complete) linkage criterion. We then initialize the k-means clustering algorithm using the k clusters provided by hierarchical clustering. To address the second issue, we use the following procedure to determine the "correct" number of clusters or posterior modes k. First, we compute the pooled total variance across all clusters for k = 1 and k = 2. To decide whether the posterior has multiple modes, we examine the percentage decrease in the pooled variance due to the additional second cluster. If the percentage decrease is greater than 30%, we proceed to cluster the posterior samples. While this cutoff value seems ad hoc, we have found through many simulations and real data examples that it works well in practice. Then, to decide on the final number of clusters, we use the silhouette measure of Rousseeuw [26]. To construct the silhouette for warping function i, we require the following two values: (1) a(i), which is the average dissimilarity of warping function i to all other warping functions in the same cluster, and (2) b(i), which is the minimum average dissimilarity of warping function i to the warping functions in any of the other clusters; we use the Fisher-Rao distance (Equation 4) as the dissimilarity measure. Then, the silhouette can be calculated as
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}.$$
The silhouette for a given warping function measures the appropriateness of its cluster assignment.
The average of the silhouette measures over all posterior warping function samples can take values between -1 and 1, which represent very poor and very good clusterings, respectively. The number of modes in the posterior is chosen as the number of clusters, which maximizes the average silhouette measure.
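The average silhouette criterion described above can be computed directly from the matrix of pairwise Fisher-Rao distances; the following sketch uses a synthetic one-dimensional distance matrix (not posterior samples) purely to illustrate that well-separated clusters score near 1 and a shuffled labeling scores much lower:

```python
import numpy as np

def average_silhouette(D, labels):
    """Mean of s(i) = (b(i) - a(i)) / max(a(i), b(i)) over all samples.

    D is a symmetric matrix of pairwise dissimilarities (here standing in for
    Fisher-Rao distances) and labels assigns each sample to one of >= 2 clusters."""
    labels = np.asarray(labels)
    p = len(labels)
    s = np.zeros(p)
    for i in range(p):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself from a(i)
        a = D[i, same].mean() if same.any() else 0.0
        b = min(D[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()

# Toy example: two well-separated groups on a line
x = np.array([0.00, 0.05, 0.10, 2.00, 2.05, 2.10])
D = np.abs(x[:, None] - x[None, :])
good = average_silhouette(D, [0, 0, 0, 1, 1, 1])
bad = average_silhouette(D, [0, 1, 0, 1, 0, 1])
```

The number of posterior modes is then selected as the k whose clustering maximizes this average.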
Discretization: To define the Bayesian registration model, we first discretize the observed functional data using a dense sampling of N points, $[t] = \{t_1, \ldots, t_N\} \subset [0, 1]$, where N depends on the application of interest. We study the effects of different values of N on the posterior inference in Section 5.1. This allows us to model differences between SRSFs using multivariate normal distributions. Note that the function f evaluated at the N discrete points is denoted by $f([t])$ (similarly $q([t])$ for the SRSFs). As will be seen later, the warping functions do not require an explicit discretization in the given model. But, in order to compute the action of $\Gamma$ on the observed functions (SRSFs), we also discretize them with the same N points in the implementation. Finally, we use discrete approximations to compute the quantities defined in this section.

Bayesian Registration Model
Given two functions $f_1, f_2$ and their corresponding SRSFs $q_1, q_2$, we introduce a novel Bayesian model for function registration. Let $q_2^*$ denote $(q_2 \circ \phi^{-1}(\psi))\psi$. At the first stage, we model the difference $q_1 - q_2^* \mid \psi$ using a zero-mean Gaussian process. After discretization of the observed functions, we model the N differences $q_1([t]) - q_2^*([t]) \mid \psi$ using the multivariate normal distribution as follows:
$$q_1([t]) - q_2^*([t]) \mid \psi, \kappa \sim N\left(0_N, \kappa^{-1} I_N\right),$$
where $\kappa$ is a concentration (inverse variance) parameter and $I_N$ is the $N \times N$ identity matrix. This part of our model is exactly the same as that proposed in [2]. The second stage of our model places a truncated wrapped normal (TWN) prior distribution on the space of warping functions $\Gamma$ by using their SRD representation. We set the mean of the prior to be the identity mapping, $\mu_\psi = 1$, which provides natural regularization toward $\gamma_{id}$ (i.e., no warping). We also truncate the support of the prior to the valid space of warping functions given by $\Psi$. Thus, the prior distribution $\pi_\psi$ is a truncated wrapped normal distribution defined and evaluated in $T_1(\Psi)$. This definition is similar to that presented in Kurtek et al. [12]; an alternative construction of Gaussian distributions on high-dimensional spheres is given in [4].
To define the covariance structure in the prior on warping functions, we require an orthonormal basis in the tangent space $T_1(\Psi)$. We begin by defining a set of basis elements that are orthogonal to the constant function 1 and have unit $L^2$ norm, $\{\sqrt{2}\sin(2\pi j t), \sqrt{2}\cos(2\pi j t) \mid j = 1, \ldots, n\}$, together with an additional linear element. Then, to form an orthonormal basis for the tangent space $T_1(\Psi)$, we use the Gram-Schmidt procedure under the $L^2$ metric. Notice that this orthonormal basis, denoted by $\tilde B$, is truncated by choosing a maximum degree $l = n$, which yields $2n + 1$ basis elements denoted by $\tilde b_j$. The truncation of the basis is important for additional regularization (smoothness of the warping functions) and computational efficiency. Given an orthonormal basis in the tangent space $T_1(\Psi)$, one can approximate any warping function using a set of basis coefficients given by $\psi \approx c = \{c_j = \langle \exp_1^{-1}(\psi), \tilde b_j \rangle, j = 1, \ldots, 2n + 1\}$. Using this notation, we can write the truncated wrapped normal prior on warping functions as follows:
$$\pi_\psi(\psi) \propto \exp\left(-\frac{1}{2}\, c^\top K^{-1} c\right) \mathbf{1}_{\{\psi \in \Psi\}},$$
where $\mathbf{1}$ is the indicator function. We specify K as a diagonal covariance matrix with $\sigma^2/j^4$ as the jth diagonal element with a large value for $\sigma^2 = 1000$. Thus, we assume a weakly informative prior distribution on the directions given by the basis $\tilde B$. We choose quadratic decay of the standard deviation with respect to the degree of the basis functions based on simulations presented in Section 5.1. We require at least a linear decay for the eigenvalues of the covariance operator to be summable [3]. In practice, we want to favor smoother warping functions; thus, we weigh the low frequency basis elements (corresponding to low values of j) higher than the high frequency basis elements; the variance of the additional linear basis element is not penalized.
To model the concentration parameter in the likelihood, κ, we use a vague gamma prior with parameters α = 1 and β = 0.01 (E(κ) = 100, V (κ) = 10000). This prior is denoted by π κ . We assume that the registration variable, ψ, and the concentration in the likelihood, κ, are independent. This is a reasonable assumption due to the fact that the alignment of two functions does not depend on their scale as shown in [32].
Under this specification of the model, the marginal posterior distribution of $\psi$, with $\kappa$ integrated out against its gamma prior, becomes:
$$p(\psi \mid q_1, q_2) \propto \pi_\psi(\psi)\, \frac{\Gamma(\alpha + N/2)}{\left(\beta + \frac{1}{2}\|q_1([t]) - q_2^*([t])\|^2\right)^{\alpha + N/2}}, \quad (8)$$
where $\Gamma(\cdot)$ denotes the gamma function. We will use importance sampling to sample from the posterior distribution and perform Bayesian inference.
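The integrated-likelihood factor of the marginal posterior can be evaluated stably on the log scale; a minimal sketch, assuming the Gaussian likelihood with precision $\kappa$ and the Gamma($\alpha$, $\beta$) prior stated above ($\psi$-independent constants dropped):

```python
import numpy as np
from math import lgamma

def log_marginal_lik(q1, q2_star, alpha=1.0, beta=0.01):
    """log of Gamma(alpha + N/2) / (beta + ||q1 - q2*||^2 / 2)^(alpha + N/2):
    the N(0, kappa^{-1} I_N) likelihood with kappa integrated against its
    Gamma(alpha, beta) prior, up to constants not depending on psi."""
    N = len(q1)
    sse = float(np.sum((q1 - q2_star) ** 2))
    return lgamma(alpha + N / 2) - (alpha + N / 2) * np.log(beta + sse / 2)

q1 = np.linspace(-1, 1, 50)  # illustrative discretized SRSF
```

As expected, a warping that brings $q_2^*$ closer to $q_1$ (smaller sum of squared differences) receives a larger log marginal likelihood.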

Model Justification
Here, we give a brief justification for each component of the proposed Bayesian registration model. In particular, we focus on the advantages of the given model over other possible choices.
Likelihood: In the current work, we specify the likelihood as a multivariate normal distribution on the pointwise differences between two SRSFs representing the observed functions. An alternative approach that is common in the current literature is to model the pointwise differences between the observed functions themselves. Unfortunately, this suffers from the drawbacks discussed in detail in Section 2. In particular, it is clear that, under that setup, the likelihood changes depending on whether one is aligning $f_2$ to $f_1$ or vice versa. This is a direct result of the lack of isometry of the $L^2$ metric under the action of $\Gamma$, i.e., in general, $\|f_1 - f_2\| \neq \|f_1 \circ \gamma - f_2 \circ \gamma\|$. We refer the reader to [2] for further justification of the given likelihood.
Prior on Γ: We model the warping functions using a truncated wrapped normal distribution on the SRD space. This allows us to avoid discretizing the warping functions in the specification of the model (we only discretize at the final implementation stage), which is in contrast to the method presented by Cheng et al. [2]. In that work, the authors observe that warping functions are akin to cumulative distribution functions. Thus, they place a Dirichlet prior on increments of the discretized warping functions. In contrast, we use a basis on the SRD tangent space, which allows us to model the full warping function up to the level of basis truncation (the warping function can be easily evaluated at any point on the domain [0, 1] using the given basis). The proposed approach also permits one to easily incorporate prior knowledge into the model. First, the prior can be defined in a tangent space centered at any warping function. The given basis can be parallel translated using the simple expression in Section 2.1 to aid in this definition (see the next section for details). This can be especially useful if the observed functions are annotated with landmarks. Second, the prior knowledge about smoothness of the warping functions can be incorporated through the level of basis truncation. For smooth warping functions, the basis can be truncated at a relatively small number (and vice versa for "rougher" warpings). Finally, we are able to control the variance and decay in the diagonal covariance K, allowing further flexibility in the model.
Prior on κ: We choose the standard gamma prior on the concentration parameter κ. The main advantage of this choice is that we are able to analytically marginalize the posterior over this parameter. This simplifies the importance sampling approach discussed in the next section.

Importance Sampling
We begin by briefly introducing the concept of importance sampling and then provide some details of how this can be applied to our problem. Importance sampling is a variance reduction technique in Monte Carlo estimation where instead of directly sampling from a distribution of interest, which may be inefficient, one first samples from an importance function and then re-samples based on appropriate weights.
Suppose that we are interested in estimating the value of the integral $\theta = \int_X g(x) p(x)\, dx$, where p is a probability density function. The classical Monte Carlo estimate of this integral is given by $\hat\theta = \frac{1}{S}\sum_{i=1}^S g(x_i)$, where $\{x_1, \ldots, x_S\}$ are iid samples from p. If the variance of the classical Monte Carlo estimate is large, it may be beneficial to introduce a new function h, termed the importance function, which can be used to generate the samples instead of p. One can then rewrite the integral as $\theta = \int_X g(x) \frac{p(x)}{h(x)} h(x)\, dx$. We use this idea to generate samples from the posterior distribution represented by p as follows. Given a large sample $\{x_1, \ldots, x_S\}$ from h, we compute the associated weights as $\left\{\frac{p(x_i)}{h(x_i)}, i = 1, \ldots, S\right\}$. Then, to obtain s samples from p (where $s \ll S$), we re-sample the set $\{x_1, \ldots, x_S\}$ with the corresponding (normalized) weights. This provides a flexible and efficient method for sampling from the posterior distribution. This process is also called sampling importance re-sampling (SIR). In the current work, we use an improved SIR method without replacement given in [28].
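A minimal SIR sketch on a toy one-dimensional target (a simplified stand-in for the refined without-replacement scheme of [28]; the target, importance function, and sample sizes are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def sir(samples, log_w, s):
    """Sampling importance re-sampling: draw s of the S samples, without
    replacement, with probability proportional to the importance weights."""
    w = np.exp(log_w - np.max(log_w))  # numerically stable normalization
    w = w / w.sum()
    idx = rng.choice(len(samples), size=s, replace=False, p=w)
    return samples[idx]

# Toy target N(2, 1) sampled via the importance function h = N(0, 3^2)
S, s = 50000, 500
x = rng.normal(0.0, 3.0, size=S)
log_w = -0.5 * (x - 2.0) ** 2 + 0.5 * x ** 2 / 9.0  # log p - log h, constants dropped
y = sir(x, log_w, s)
```

The re-sampled set `y` behaves approximately like draws from the target, as the moment checks below confirm.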
For our problem, we are faced with defining an importance function h that allows us to efficiently sample from the posterior p. The main requirement on h is that its support is the same as that of p. One option is to use the prior as the importance function directly, and generate the weights using the likelihood. But, in other cases, one may want to "upsample" a different part of the space, e.g., near the dynamic programming solution. Thus, we provide a general recipe for constructing wrapped normal importance functions similar to the definition of the prior on Ψ.
In order to do this, we require a method for defining an orthonormal basis in the tangent space at any point on $\Psi$. Given the truncated basis $\tilde{B}$ in $T_1(\Psi)$ defined in the previous section, we can define an orthonormal basis in the tangent space at an arbitrary point $T_{\mu_\psi}(\Psi)$ using parallel transport, which was defined in Section 2.1. Parallel transport defines an isometric mapping between tangent spaces, and thus preserves the lengths of the basis vectors and the angles between them. We refer to the orthonormal basis in $T_{\mu_\psi}(\Psi)$ as $B$ (with elements $\{b_k,\ k = 1, \ldots, m\}$), and use it to define a coordinate system in that space. Thus, we can again approximate any warping function using a set of basis coefficients given by $\psi \approx d = \{d_k = \langle \exp^{-1}_{\mu_\psi}(\psi), b_k \rangle,\ k = 1, \ldots, m\}$. In this way, we define wrapped normal importance functions in the tangent space at a pre-specified mean: one can generate random samples from these models on the tangent space and then use the exponential map to obtain a random warping function. The general version of the importance function takes the form $h(\psi) \propto \exp\left(-\frac{1}{2} d^\top K_h^{-1} d\right)$, where $K_h$ is a diagonal matrix ($K_h$ can be specified in the same way as in the prior). Note that there is no need to truncate the importance function. Sampling from $h$ proceeds as follows: (1) draw a coefficient vector $d_{rnd} \sim N(0, K_h)$; (2) form the tangent vector $v_{rnd} = \sum_{k=1}^m d_{rnd,k} b_k$; (3) map $v_{rnd}$ to $\Psi$ using $\psi_{rnd} = \exp_{\mu_\psi}(\psi_{rnd})$ evaluated at $v_{rnd}$, i.e., $\psi_{rnd} = \exp_{\mu_\psi}(v_{rnd})$; (4) compute the random warping function using $\gamma_{rnd} = \phi^{-1}(\psi_{rnd})$.
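The sampling steps above can be sketched on a discrete grid, centering the wrapped normal at the identity element. The Fourier basis, coefficient scales, and grid size below are illustrative assumptions rather than the paper's exact specification.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 101
t = np.linspace(0.0, 1.0, T)

def l2_norm(v):
    # L2 norm on [0, 1] via trapezoidal integration
    return np.sqrt(np.trapz(v * v, t))

def exp_map(mu, v):
    """Exponential map on the unit sphere of square-root densities."""
    nv = l2_norm(v)
    if nv < 1e-12:
        return mu.copy()
    return np.cos(nv) * mu + np.sin(nv) * v / nv

def random_warping(m=10, sigma0=0.2, decay=2.0):
    """Draw gamma_rnd: sample tangent coefficients with decaying standard
    deviations, map to Psi via exp, then to Gamma via phi^{-1} (basis
    choice and scales here are illustrative, not the paper's)."""
    mu = np.ones(T)                              # psi of the identity warping
    B = [np.sqrt(2) * np.sin(2 * np.pi * k * t) for k in range(1, m // 2 + 1)]
    B += [np.sqrt(2) * np.cos(2 * np.pi * k * t) for k in range(1, m // 2 + 1)]
    sd = sigma0 / np.arange(1, len(B) + 1) ** decay
    v = sum(c * b for c, b in zip(rng.normal(0.0, sd), B))
    psi = exp_map(mu, v)
    # gamma(t) = integral_0^t psi(u)^2 du, computed by trapezoidal rule
    gamma = np.concatenate(([0.0], np.cumsum(
        0.5 * (psi[1:] ** 2 + psi[:-1] ** 2) * np.diff(t))))
    return gamma / gamma[-1]                     # enforce gamma(1) = 1 numerically

g = random_warping()
print(g[0], g[-1])  # 0.0 1.0: a valid boundary-preserving warping
```

Because $\gamma$ is the cumulative integral of $\psi^2 \ge 0$, the sampled warping is automatically non-decreasing.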
Using the idea of importance sampling, we can re-write the posterior distribution in Equation 8 in terms of the importance function $h$ and the weights $p/h$ (Equation 10). It is obvious from this expression that, in the special case when the importance function is the same as the prior, one can simply sample from the prior distribution and weight each sample using the integrated likelihood. Thus, our approach is to generate a large sample $\{\psi_1, \ldots, \psi_S\}$ from $h$ and evaluate a weight for each sampled warping function as $w_i \propto p(\psi_i)/h(\psi_i)$, with the tangent-space coordinates of each $\psi_i$ computed as $d_{ij} = \langle \exp^{-1}_{\mu_\psi}(\psi_i), b_j \rangle$, $b_j \in B$, $j = 1, \ldots, m$, as before. Once all of the weights have been computed, we re-sample a small number $s$ of $\psi$s from the original set using the methods proposed in [28]. The re-sampled functions $\psi_1, \ldots, \psi_s$ are samples from the posterior distribution $p$, and can be mapped to their corresponding warping functions using $\phi^{-1}$. Posterior functionals can be mapped to $\Gamma$ in the same way.
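Once posterior samples of warping functions are in hand, pointwise functionals such as the credible bands used later in the paper are straightforward to compute. In the sketch below, the cross-sectional mean is a simple stand-in for the geometric mean computed on $\Psi$, and the toy samples are illustrative.

```python
import numpy as np

def pointwise_summary(gammas, level=0.95):
    """Pointwise mean and credible band from posterior warping samples.
    gammas: array of shape (s, T), each row a warping on a common grid."""
    alpha = (1.0 - level) / 2.0
    mean = gammas.mean(axis=0)
    lo = np.quantile(gammas, alpha, axis=0)      # lower band, pointwise
    hi = np.quantile(gammas, 1.0 - alpha, axis=0)  # upper band, pointwise
    return mean, lo, hi

# Toy posterior sample: identity warping plus small parabolic perturbations.
rng = np.random.default_rng(2)
t = np.linspace(0.0, 1.0, 101)
gammas = t + rng.normal(0.0, 0.05, (200, 1)) * t * (1 - t)
mean, lo, hi = pointwise_summary(gammas)
```

The band collapses at $t = 0$ and $t = 1$ since every warping fixes the endpoints, mirroring the credible intervals shown in the application figures.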

Simulation Studies
In this section, we present warping results using simulated scenarios. In all examples, we fix the original sample size to S = 500000, the posterior sample size to s = 200, and the number of basis elements in the prior and importance function to N − 1, where N is the sampling density of the observed functions. The importance function used throughout the simulation studies and the real applications is a wrapped normal centered at the identity element with the same covariance structure as the prior.

Simulation 1
In the first simulation study, we consider the effects of function sampling density and the order of decay of the standard deviation in the prior distribution. For this purpose, we simulated three different warping functions, $\gamma_1 = t + 0.15t(1-t)$, $\gamma_2 = t + 0.70t(1-t)$, $\gamma_3 = t + 0.1\sin(2\pi t)$, $t \in [0, 1]$, and applied them to a function with two modes denoted by $f$. We display the original function $f$ in Figure 3(a) in blue and the same function under the three warpings, $f \circ \gamma_1$, $f \circ \gamma_2$ and $f \circ \gamma_3$, in red, green and black, respectively.
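The three simulated warpings are easy to reproduce on a grid. The bimodal $f$ below is a stand-in, since the paper's exact test function is not specified here.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 100)

# The three simulated warpings from the text.
g1 = t + 0.15 * t * (1 - t)
g2 = t + 0.70 * t * (1 - t)
g3 = t + 0.10 * np.sin(2 * np.pi * t)

# A bimodal stand-in for f (the paper's exact test function is not given).
f = np.exp(-((t - 0.3) / 0.08) ** 2) + 0.8 * np.exp(-((t - 0.7) / 0.08) ** 2)

def warp(f_vals, gamma):
    """Compose f with a warping, (f o gamma)(t), by linear interpolation."""
    return np.interp(gamma, t, f_vals)

# Each gamma fixes the endpoints and is strictly increasing, so applying it
# only reparameterizes (shifts the features of) f without changing its values.
fw = warp(f, g2)
```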
We apply the proposed model to perform pairwise Bayesian alignment for each example using 100 replicates, and report the detailed results in Table 1 for quadratic decay of the prior standard deviations and sampling densities of 50, 100 and 150 points. For each example, we report the average Fisher-Rao distance between the true warping function and the estimated posterior mean $\bar{\gamma}$ in panel (a), the Fisher-Rao distance between the true warping function and the dynamic programming solution $\gamma_{DP}$ in panel (b), and the average Fisher-Rao distance between the true warping function and the estimated posterior mean when using a Dirichlet prior, $\bar{\gamma}_{DIR}$, in panel (c). (Table 1: Simulation results for correct warping recovery for three different warping functions and sampling densities (SD) under quadratic decay of the prior standard deviations.) In all of the presented results, we set the parameters of the Dirichlet distribution to $\alpha_1 = \cdots = \alpha_{40} = 1$ (i.e., a uniform prior on warping functions specified in the same way as in [2]), and use importance sampling to sample from the posterior. The standard deviations of the distances are also provided in parentheses. We highlight the best performance for each example and sampling density in bold. In all examples, the proposed geometric Bayesian model outperforms a model with a Dirichlet prior on the warping functions. Furthermore, the performance of the proposed method is comparable to, and often better than, the commonly used dynamic programming algorithm. In panels (d)-(f), we report the average percentage decrease in the distance between the two functions being registered, i.e., $PD = \frac{\|q_1 - q_2\| - \|q_1 - (q_2, \gamma)\|}{\|q_1 - q_2\|} \times 100$. Again, the proposed model performs very well according to this metric. It is important to note that the gains in performance are small when the sampling density is increased from 100 to 150 points.
Thus, for fairly smooth functions, as is the case in this simulation and the applications presented in the subsequent section, we sample the functions with 100 points for computational efficiency. The replicate posterior means for the proposed method are displayed in red in Figure 3(b)-(d), with the true warping in blue. It is clear from this figure that there is little variation across replicates and that we are able to recover the true warping very well. Table 2 reports the same set of results for linear and no decay of the prior standard deviations for the proposed method across the three sampling densities. Linear decay performs comparably to quadratic decay, while, as expected, no decay performs poorly. Throughout the rest of the paper we utilize quadratic decay, as indicated by these simulation results.
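The percentage-decrease metric reported in Table 1, panels (d)-(f), can be sketched using the standard square-root slope function (SRSF) representation from the elastic functional data analysis literature on which the paper builds. The grid size and test functions below are illustrative.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 200)

def srsf(f):
    """Square-root slope function q = sign(f') sqrt(|f'|)."""
    df = np.gradient(f, t)
    return np.sign(df) * np.sqrt(np.abs(df))

def warp_q(q, gamma):
    """Action of a warping on an SRSF: (q, gamma) = (q o gamma) sqrt(gamma')."""
    dg = np.gradient(gamma, t)
    return np.interp(gamma, t, q) * np.sqrt(dg)

def percent_decrease(f1, f2, gamma):
    """Percentage decrease in SRSF distance after aligning f2 to f1 via gamma."""
    q1, q2 = srsf(f1), srsf(f2)
    d0 = np.sqrt(np.trapz((q1 - q2) ** 2, t))                  # before alignment
    d1 = np.sqrt(np.trapz((q1 - warp_q(q2, gamma)) ** 2, t))   # after alignment
    return 100.0 * (d0 - d1) / d0

# Toy check: align a warped bump with the warping that generated it.
f1 = np.exp(-((t - 0.5) / 0.1) ** 2)
gamma = t + 0.3 * t * (1 - t)
f2 = np.interp(np.interp(t, gamma, t), t, f1)   # f2 = f1 o gamma^{-1}
pd = percent_decrease(f1, f2, gamma)
```

With the true warping, the post-alignment distance is nearly zero, so `pd` is close to 100 up to discretization error.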

Simulation 2
In the second simulation, we explore the performance of the proposed alignment model when two modes are present in the posterior distribution. The two functions to be aligned, f 1 and f 2 , are shown in Figure 4(a) in blue and red, respectively. In the same panel, we show the alignment results, across 100 replicates, using the mean of each posterior cluster in green and black. For comparison, we also display the dynamic programming result in magenta. Note that in this simulation we have treated the number of clusters as known (k = 2) and applied the k-means clustering algorithm as described in Section 2.1. In panel (b), we display the two clusters of warping functions representing the two posterior modes (again in green and black) as well as the dynamic programming result (in magenta). The clusterwise posterior mean warping functions are much smoother than the dynamic programming solution and achieve essentially the same level of alignment between the two functions.
In Table 3, we provide a few summaries for each posterior cluster. In particular, we report the average cluster size, and the average distance between the two functions based on clusterwise posterior mean, median and MAP alignment. We expect the clusters to be balanced as the peaks in the bimodal function are approximately equidistant from the peak in the unimodal function. This should also be reflected in the post-alignment, clusterwise distances between the two functions. The original distance between them is 2.6668, and the distance after dynamic programming alignment is 1.4221. The reported clusterwise distances are comparable to the dynamic programming solution when using mean warping, and better when using median and MAP warping. This shows that in addition to being able to discover multiple plausible alignments as modes of the posterior distribution, we are able to better explore the full space of warping functions than the deterministic dynamic programming algorithm.

Applications
Next, we consider pairwise alignment of functions using the proposed Bayesian model for various types of real data. We start with three types of biomedical signals: gait pressure functions, PQRST complexes extracted from an ECG and respiration functions. For a detailed description of these datasets please see [15]. We proceed to show examples on growth velocity functions for boys and girls obtained from the Berkeley Growth Dataset (BGD) [40]. Finally, we show two examples on signature (tangential) acceleration functions from a subset of the data described in [44]. In each example, we first determine whether multiple modes exist in the posterior distribution of warping functions. If this is the case, we cluster the posterior samples using k-means clustering, where k is selected based on the average silhouette measure. Finally, we show the clusterwise alignment results and assess registration uncertainty in each cluster.
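Selecting $k$ by the average silhouette can be sketched as follows, assuming scikit-learn is available. Euclidean $k$-means on tangent-space coordinates of the posterior samples is shown as a stand-in for the sphere-based algorithm of Section 2.1, and the blob data are a toy example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def choose_k(X, k_max=5, seed=0):
    """Pick k by the average silhouette over k = 2..k_max.
    X: posterior samples as rows of feature vectors (e.g., tangent-space
    coordinates of the sampled warping functions)."""
    best_k, best_score = 2, -1.0
    for k in range(2, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        score = silhouette_score(X, labels)   # mean silhouette over all samples
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Toy check: two well-separated blobs should yield k = 2.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-3.0, 0.3, (50, 2)), rng.normal(3.0, 0.3, (50, 2))])
print(choose_k(X))  # 2
```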

Biomedical Signals
We describe several alignment examples for biomedical signals. In all of the presented datasets, the functions must first be properly registered to align important features across the functional observations. At times, due to significant structural differences, registration ambiguities result in multiple plausible alignments, which cannot be detected using optimization-based registration algorithms. This is especially seen in the gait pressure functional data, which we consider in the first set of examples.
Gait pressure functions: We begin with three examples of pairwise alignment of gait pressure functions. In the first example, shown in Figure 5, we discover three modes in the posterior distribution. Panel (a) displays the registration results using the mean warping function in each cluster. The aligned functions emphasize both pressure modes as well as the large dip toward the midpoint of the gait cycle. Panels (d)-(i) show the uncertainty in each cluster using two displays: (1) the pointwise standard deviation as a color (blue to red = low to high) on the mean warping, along with the pointwise 95% credible interval in black, and (2) the pointwise standard deviation as a color on the warped version of the second function. We generally observe lower standard deviation along pronounced features, such as the steep increase and decrease in pressure at the beginning and end of the gait cycle. On the other hand, the standard deviation is inflated in flat regions, where many types of warping provide a satisfactory solution.
The second example is displayed in Figure 6. In this case, we find two modes in the posterior distribution and display the same set of results as for the first example. The results are similar to those in the previous case, with different modes of the pressure functions emphasized in each cluster. Again, the cluster 2 mean is very similar to the dynamic programming solution. Importantly, the result based on the proposed Bayesian model is always much smoother while achieving very similar alignment. Finally, in Figure 7, we display an example where the posterior distribution of warping functions is unimodal. In this case, the two functions to be aligned have two very clear gait pressure modes, and thus there is little uncertainty in the registration.
PQRST complexes: The PQRST complex in ECG refers to the first peak (P wave), the sharp second peak (QRS complex), and the third peak (T wave). These functions have very pronounced features, and thus, most of the pairwise alignment results on this data yield a unimodal posterior distribution. We display one example of such an alignment in Figure 8. The posterior mean warping is very similar to the dynamic programming solution, albeit smoother. Also, there is very little registration uncertainty around the QRS complex. Alignment uncertainty is also low at the T wave, which is much more pronounced than the P wave in this example. The red (no warping) pointwise average of the two PQRST complexes displayed in panel (c) is clearly not a valid PQRST complex. As a result, warping in this application is necessary to obtain reasonable functional summaries.
Respiration data: Each function in this dataset represents lung volume during a breathing cycle. Respiration cycle alignment is important for understanding breathing variation as well as radiotherapy in lung cancer [15]. In this application, the posterior distribution of warping functions is also almost always unimodal due to the very simple structure of each breathing cycle. Figure  9 displays one example of pairwise alignment of two such respiration functions. Again, the produced posterior mean alignment is very good, with little uncertainty in the area of the peak of the breathing cycle.

Berkeley Growth Velocity Data
A major goal in studying the growth velocity functions of children is to characterize the number and timing of growth spurts in boys and girls. The BGD has been studied for these purposes before [22]. In the current paper, we emphasize that there may be multiple plausible time warpings that align growth spurts across children. In the first example, presented in the top part of Figure 10, we examine two growth velocity functions for boys. The resulting posterior distribution on the space of warping functions is bimodal. The mean warping in both clusters nicely aligns the large growth spurt. However, the average growth velocity patterns, as seen in panel (c), are quite different depending on which alignment is used. The cluster 1 alignment (green) results in a long constant-velocity growth period in the average, while cluster 2 (black) results in a decreasing velocity (at an approximately constant rate) during the same period. This suggests two very different growth mechanisms, which are useful for characterizing growth functions. The second example, shown in the bottom portion of Figure 10, considers alignment of two growth velocity curves for girls. Again, we discover two modes in the posterior distribution. As seen in panel (c), the mean warping in cluster 1 (green) emphasizes the first growth spurt, which is then followed by a smaller second spurt in the average. On the other hand, the mean alignment in cluster 2 results in an average growth pattern where the two growth spurts are approximately of the same size.

Signature Acceleration Functions
The final application considers alignment of signature acceleration functions. As described in [25, 38], each planar signature curve is first summarized using its tangential acceleration. Comparison and modeling of such functions are important for understanding inter- and intra-class signature variability, and for signature classification. A major difficulty that arises in the analysis pipeline is that the signature acceleration functions contain natural warping variability, which must be accounted for in order to obtain satisfactory results. We present two different pairwise registration results in Figures 11 and 12.
In the first example, the posterior distribution of warping functions contains two different modes. The posterior mean alignment agrees for close to half of the time interval at which point the mean warping in cluster 2 (black) diverges from the identity warping. This results in two drastically different alignments between the two signatures (and potentially different inferences depending on which alignment is used). Another interesting feature is that there is a large amount of uncertainty in the region where the mean warping in cluster 2 diverges from the identity element; this indicates that the corresponding region of the two acceleration functions is difficult to align. The posterior distribution in the second example is unimodal, and the posterior mean is very close to the dynamic programming solution. Furthermore, perhaps surprisingly, there is very little uncertainty in the alignment.

Groupwise MAP Alignment to a Known Template
We close the applications section with several examples of groupwise function alignment to a known template. For each of the datasets described above (and an additional simulated dataset), we randomly select one of the functions in the data as a template and align all functions in a pairwise manner to this template.
In these examples, we do not account for multimodality in the posterior distribution and use the MAP warping ($\gamma_{MAP}$) for alignment. The results are presented in Figures 13 and 14. For each example, we display the full original dataset with the template highlighted in black in panel (a). We show the aligned data in panel (b), the pointwise function averages before (red) and after (green) alignment in panel (c), and the estimated warping functions in panel (d).
In all examples, we see a drastic improvement in function alignment using the proposed method, which directly translates to better data summaries such as the pointwise function averages. As a specific example, consider the PQRST complexes in row (3). The MAP alignment is able to correctly match the P waves, QRS complexes and T waves across all of the given data. This results in an accurate representation of the pointwise average, which shares all of the features present in the original data. Without alignment, on the other hand, the QRS complex is poorly represented in the pointwise average; a similar improvement is seen for the signature acceleration data.

Summary and Future Work
We have presented a Bayesian model for pairwise registration of functional data. This model utilizes a convenient geometric representation of warping functions, called the square-root density, which allows for efficient sampling from the posterior distribution via importance sampling. A main advantage of the proposed approach over previous optimization-based approaches is that it can discover multiple plausible registrations, given by different modes of the posterior distribution. We presented several simulated and real data examples that highlight these advantages, and used simulations to compare the results obtained using the proposed model to those obtained using a similar model with a Dirichlet prior on the warping functions (which does not exploit the geometry of the space of warping functions). There are multiple directions for future work. First, we will extend these methods to a groupwise registration model where the template function and the warping functions are estimated jointly. Second, we will extend these methods to a setting where soft landmark information is provided on the functions of interest; in such a case, one can incorporate this information into the prior distribution of the Bayesian model. Third, we will consider the more general problem of curve alignment for shape analysis, where the curves are functions from a unit interval (open curves) or unit circle (closed curves) to $\mathbb{R}^n$, $n > 1$. Shapes of objects are invariant to translation, scale, rotation and re-parameterization, and thus the prior distributions in our Bayesian model will be defined on product spaces, whose geometric structure will play an important role. Finally, a major question relates to propagating registration uncertainty to subsequent statistical inference problems; one example is template estimation in the presence of multiple plausible warping solutions.