Extrinsic Gaussian processes for regression and classification on manifolds

Gaussian processes (GPs) are very widely used for modeling of unknown functions or surfaces in applications ranging from regression to classification to spatial processes. Although there is an increasingly vast literature on applications, methods, theory and algorithms related to GPs, the overwhelming majority of this literature focuses on the case in which the input domain corresponds to a Euclidean space. However, particularly in recent years with the increasing collection of complex data, it is commonly the case that the input domain does not have such a simple form. For example, it is common for the inputs to be restricted to a non-Euclidean manifold, a case which forms the motivation for this article. In particular, we propose a general extrinsic framework for GP modeling on manifolds, which relies on embedding of the manifold into a Euclidean space and then constructing extrinsic kernels for GPs on their images. These extrinsic Gaussian processes (eGPs) are used as prior distributions for unknown functions in Bayesian inferences. Our approach is simple and general, and we show that the eGPs inherit fine theoretical properties from GP models in Euclidean spaces. We consider applications of our models to regression and classification problems with predictors lying in a large class of manifolds, including spheres, planar shape spaces, a space of positive definite matrices, and Grassmannians. Our models can be readily used by practitioners in biological sciences for various regression and classification problems, such as disease diagnosis or detection. Our work is also likely to have impact in spatial statistics when spatial locations are on the sphere or other geometric spaces.

One of the paramount challenges in developing GP models on manifolds is constructing valid covariance kernels. Castillo et al. (2014) develop an elegant framework for intrinsic GP models on Riemannian manifolds by rescaling solutions of heat equations, but the constructed intrinsic kernels are often impractical to implement. We provide a general and simple solution by first embedding manifolds into Euclidean spaces via equivariant embeddings, which are embeddings that preserve a great deal of the geometry of the manifolds, and then constructing extrinsic kernels on the image manifold. We refer to the resulting GPs as extrinsic GPs (eGPs). eGPs are shown to inherit appealing properties of GPs defined on Euclidean spaces, and they adapt to the dimension of the manifold rather than the dimension of the ambient space in which the manifold is embedded.
Another appealing feature of eGPs is their ease of implementation for inference.
One of the motivations for developing GP models on manifolds is the ubiquity of modern data that are represented in various non-conventional forms. In neuroimaging, the diffusion matrices in diffusion tensor imaging (DTI) are 3 × 3 positive definite matrices (Alexander et al., 2007).
In engineering and machine learning, pictures or images are often preprocessed or reduced to a collection of subspaces (Ho et al., 2004; Teja and Ravi, 2012). In machine vision and medical diagnostics, a digital image can also be represented by a set of k landmarks, the collection of which forms a landmark-based shape space (Kendall, 1984). Other common examples include orthonormal frames (Downs et al., 1971), surfaces, curves, and networks. Most of the above examples can be described as manifolds, which are locally Euclidean spaces with smooth structures.
There are growing needs and practical motivations for studying regression and classification with predictors on known manifolds. For instance, in medical imaging, a common goal is to reliably predict disease status using DTI data or landmark-based digital images. This can be viewed as a classification problem with manifold-valued inputs or predictors. One example is diagnosis of Attention Deficit Hyperactivity Disorder (ADHD) in children based on DTI. There are also many applications in which it is of interest to relate manifold-valued predictors to quantitative traits. One such case is the study of how intelligence quotient relates to the shape contours of certain brain areas (such as the Hippocampus (Bartsch, 2012)). The shape can be represented by a set of landmarks on the boundary of the contours, the collection of which form a shape manifold. Without valid models and appropriate inferential methods for regression and classification on manifolds, making accurate inferences and predictions in the above applications and related settings will remain difficult.
There is already a rich literature on statistical inference for manifold-valued data consisting of i.i.d. measurements. Much of this literature focuses on inference on the location and spread of manifold-valued data (Bhattacharya and Patrangenaru, 2003, 2005; Bhattacharya and Lin, 2017). Some model-based methods have also been proposed (Bhattacharya and Dunson, 2010b; Lin et al., 2017; Pelletier, 2005). However, regression or classification problems with predictors on manifolds have received much less attention. Bhattacharya and Dunson (2010a) proposed a framework for regression and classification on manifolds by modeling the joint distribution of covariate and response variables (x, y) using a Dirichlet process mixture of product kernels. This joint model induces a nonparametric model for the conditional distribution of y given x, from which one can infer the regression or classification function. However, the practical performance of these models is often unsatisfactory because the cluster allocations are driven too much by the marginal distribution of x, which is a nuisance parameter.
Our work focuses on regression and classification on known manifolds. There is, however, an important line of work in manifold learning, where the predictors concentrate around some unknown lower-dimensional manifold but are observed in an often higher-dimensional ambient space. The lower-dimensional geometry is often learnt first via dimension reduction tools, based on which a regression model is built (see, e.g., Cheng and Wu (2013)). An interesting exception is due to Yang and Dunson (2016) in which they show that by imposing a Gaussian process prior on the regression function with a covariance kernel defined directly on the ambient space, the posterior distribution yields a posterior contraction rate depending on the intrinsic dimension of the manifold.
They assume that the unknown lower-dimensional space around which the predictors concentrate belongs to a class of submanifolds of Euclidean space. Many interesting manifolds do not naturally arise as submanifolds, in particular those given as quotient manifolds: projective shape spaces, planar shapes, 3-D shapes, affine shapes and many other manifolds arising as quotient spaces of spheres.
Our framework first embeds the manifold into a Euclidean space via some often non-trivial embedding and then defines eGPs on the image of the manifold (including submanifolds as special cases, with the embedding given by the identity map).
The paper is organized as follows. Section 2 introduces eGP models. In Section 3, we illustrate the broad utility of eGP models by applying them to a large class of regression and classification problems with predictors lying on various manifolds. Section 4 is devoted to studying the properties of eGP models in terms of mean square differentiability and posterior contraction rates. Our paper ends with a discussion.

Regression and classification on manifolds
Let $M$ be a smooth manifold on which the predictors lie. Given data $(x_i, y_i)$ with $x_i \in M$ and $y_i \in \mathbb{R}$ ($i = 1, \ldots, n$), assume the following regression model
$$y_i = F(x_i) + \epsilon_i, \qquad (2.1)$$
where $F : M \to \mathbb{R}$ is the regression function on $M$. Here the $\epsilon_i$'s are independent errors which determine the likelihood of the regression model. The goal is to develop statistical models for inference on the regression function $F(x)$. If $y$ is categorical or binary (0 or 1), then $F$ is called a classification map.
We focus on Bayesian inference on F . Let Π(F ) be a prior distribution for F , which updates with the data to produce a posterior distribution, based on which inference is carried out. We denote the posterior distribution by Π(F |D), where D = {(x 1 , y 1 ), . . . , (x n , y n )} is the data. A Gaussian process (GP), which can be viewed as a probability distribution on the space of functions, is one of the most popular candidates for a nonparametric prior for the regression function. The popularity of GP is due to its simple representation, tractability, flexibility for modeling and appealing theoretical properties. We proceed to propose a general extrinsic framework for constructing GPs on manifolds.
The usual definition of a GP in a Euclidean space generalizes to a manifold $M$. A stochastic process $w(x)$ indexed by $x \in M$ is a Gaussian process on $M$ if its evaluation at any finite number of points on $M$ follows a multivariate Gaussian distribution. Specifically, we say $w(x)$ is a GP with mean function $\mu(x)$ and covariance kernel $K(\cdot, \cdot)$ if for any $x_1, \ldots, x_n \in M$,
$$(w(x_1), \ldots, w(x_n))^\top \sim N\big((\mu(x_1), \ldots, \mu(x_n))^\top, \Sigma\big), \qquad \Sigma_{ij} = K(x_i, x_j).$$
Notice that $K : M \times M \to \mathbb{R}$ is a positive semi-definite kernel on $M$. Namely, for any points $x_1, \ldots, x_n$ on $M$ and real numbers $a_1, \ldots, a_n$,
$$\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j K(x_i, x_j) \geq 0. \qquad (2.2)$$
The fundamental difficulty in imposing a GP prior on a manifold stems from the highly challenging task of constructing a valid covariance kernel $K(\cdot, \cdot)$. Below we describe a simple recipe for constructing valid covariance kernels using an extrinsic approach.
Let $J : M \to \mathbb{R}^D$ be an embedding of $M$ into a Euclidean space $\mathbb{R}^D$, with image $\tilde{M} = J(M)$. That is, $J$ is a smooth map such that its differential at each point $x \in M$ is an injective map (from the tangent space of $M$ at $x$ to the tangent space of $\mathbb{R}^D$ at $J(x)$), and $J$ is a homeomorphism between $M$ and its image $\tilde{M}$. Given a positive semi-definite kernel $\tilde{K}$ on $\mathbb{R}^D$, we can then define a positive semi-definite kernel (and hence the covariance kernel of a GP) on $M$ by
$$K_{ext}(x, y) = \tilde{K}(J(x), J(y)). \qquad (2.3)$$
Indeed, $K_{ext}$ satisfies condition (2.2) on $M$ because $\tilde{K}$ satisfies the same condition on $\mathbb{R}^D$, hence in particular on $\tilde{M} \subset \mathbb{R}^D$. We call the Gaussian process with the covariance kernel $K_{ext}(\cdot, \cdot)$ defined above an extrinsic Gaussian process (eGP).
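The pullback construction can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes a squared exponential kernel on the ambient space and uses the circle embedded in the plane as a hypothetical example manifold. The function names `se_kernel_euclid`, `extrinsic_gram` and `circle_embedding` are illustrative only.

```python
import numpy as np

def se_kernel_euclid(u, v, alpha=1.0, beta=1.0):
    """Squared exponential kernel on R^D, a known positive semi-definite kernel."""
    return alpha * np.exp(-beta * np.sum((u - v) ** 2))

def extrinsic_gram(points, embed, base_kernel):
    """Gram matrix of K_ext(x, y) = base_kernel(embed(x), embed(y))."""
    images = [embed(p) for p in points]
    n = len(images)
    gram = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            gram[i, j] = base_kernel(images[i], images[j])
    return gram

def circle_embedding(theta):
    """Hypothetical example: embed the circle S^1 (angle theta) into R^2."""
    return np.array([np.cos(theta), np.sin(theta)])

rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, size=20)
gram = extrinsic_gram(angles, circle_embedding, se_kernel_euclid)

# Positive semi-definiteness shows up as non-negative eigenvalues,
# up to floating point round-off.
min_eig = np.linalg.eigvalsh(gram).min()
```

Any other positive semi-definite kernel on the ambient space could be passed as `base_kernel`; the argument in the text does not depend on the particular choice.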
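Remark 2.1. Let $\|\cdot\|$ be the Euclidean norm. We define the extrinsic distance on the manifold $M$ as
$$\rho(x, y) = \|J(x) - J(y)\|, \qquad x, y \in M. \qquad (2.4)$$
One can immediately generalize the popular squared exponential kernel in Euclidean spaces to manifolds by letting
$$K_{ext}(x_1, x_2) = \alpha \exp\{-\beta \rho^2(x_1, x_2)\}, \qquad (2.5)$$
where $\rho(x_1, x_2)$ is the extrinsic distance given in (2.4). One can also generalize the class of Matérn covariance kernels to manifolds by letting, in one standard parameterization,
$$K_{ext}(x_1, x_2) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\,\rho(x_1, x_2)}{\kappa}\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\,\rho(x_1, x_2)}{\kappa}\right),$$
where $\Gamma(\nu)$ is the Gamma function, $K_\nu$ is the modified Bessel function of the second kind, and $\kappa$ and $\nu$ are non-negative parameters of the covariance. Matérn covariance kernels are often used in spatial statistics, since the smoothness of the sample paths is easily controlled through the parameter $\nu$. Both kernels are valid covariance kernels on $M$, since each is the pullback under $J$ of a positive semi-definite kernel on $\mathbb{R}^D$.
An embedding $J : M \to \mathbb{R}^D$ with image $\tilde{M} = J(M)$ is said to be equivariant if there exist a Lie group $H$ acting on $M$ and a group homomorphism $\phi$ from $H$ into $GL(D, \mathbb{R})$, the general linear group of degree $D$ acting on $\tilde{M}$, such that
$$J(hp) = \phi(h) J(p)$$
for any $h \in H$ and $p \in M$. The definition seems technical, but the intuition is clear: if a large group $H$ acts on the manifold, for example by rotation, before embedding, then such an action is preserved via $\phi$ on the image $\tilde{M}$. Therefore, the embedding is geometry-preserving in this sense.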
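As a small illustration of Remark 2.1, the sketch below evaluates both kernel families as functions of the extrinsic distance. It is written under stated assumptions: the Matérn kernel is shown only for ν = 3/2, where the common parameterization with length-scale κ reduces to a closed form without Bessel functions, and the function names are hypothetical.

```python
import numpy as np

def extrinsic_distance(jx, jy):
    """Extrinsic distance: Euclidean norm between embedded points J(x), J(y)."""
    return np.linalg.norm(np.asarray(jx, dtype=float) - np.asarray(jy, dtype=float))

def se_ext(rho, alpha=1.0, beta=1.0):
    """Extrinsic squared exponential kernel alpha * exp(-beta * rho^2)."""
    return alpha * np.exp(-beta * rho ** 2)

def matern32_ext(rho, kappa=1.0):
    """Matern kernel with nu = 3/2 (assumed parameterization):
    (1 + sqrt(3) rho / kappa) * exp(-sqrt(3) rho / kappa)."""
    s = np.sqrt(3.0) * rho / kappa
    return (1.0 + s) * np.exp(-s)

# Two points on the unit sphere S^2, embedded in R^3 by inclusion.
north_pole = [0.0, 0.0, 1.0]
on_equator = [1.0, 0.0, 0.0]
rho = extrinsic_distance(north_pole, on_equator)   # chord length sqrt(2)
k_se = se_ext(rho)
k_matern = matern32_ext(rho)
```

Both functions depend on the manifold only through `rho`, so switching manifolds amounts to switching the embedding used to compute the distance.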
Remark 2.3. The extrinsic method described above has some advantages over using intrinsically defined covariance kernels. In particular, intrinsic kernels are difficult to construct in general.
For example, the squared exponential kernel $\alpha \exp(-\beta \rho_g^2(x_1, x_2))$, with $\rho_g$ given by the geodesic or intrinsic distance, is in general not a valid kernel. Valid intrinsic kernels have been found explicitly for very special manifolds only, such as spheres. At the same time, simulation tests have shown that there is no significant difference in statistical performance between certain extrinsic and intrinsic models, at least for the example of spheres. However, intrinsic methods are often much more computationally complex and expensive.
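With a valid covariance kernel on $M$, one can specify an eGP as a prior $\Pi(F)$ and carry out inference in a Bayesian framework. Given the regression model in (2.1), we assume that $\epsilon_i \sim N(0, \sigma^2)$, where the parameter $\sigma^2$ has a prior distribution $\pi_{\sigma^2}$ such as the inverse gamma distribution. The prior distribution for the regression function $\Pi(F)$ will be given by the eGP with the covariance kernel in (2.3). The posterior distribution is given by
$$\Pi(U \mid D) = \frac{\int_U \prod_{i=1}^{n} N(y_i; F(x_i), \sigma^2)\, \Pi(dF)\, \pi_{\sigma^2}(d\sigma^2)}{\int_{\mathcal{M} \times (0, \infty)} \prod_{i=1}^{n} N(y_i; F(x_i), \sigma^2)\, \Pi(dF)\, \pi_{\sigma^2}(d\sigma^2)},$$
where $N(y; m, \sigma^2)$ denotes the normal density with mean $m$ and variance $\sigma^2$ evaluated at $y$, and $U$ is a measurable set in the product space $\mathcal{M} \times (0, \infty)$, with $\mathcal{M}$ denoting the space of all regression functions $M \to \mathbb{R}$.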
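For fixed hyperparameters the posterior of the regression function at new inputs is Gaussian and available in closed form. The sketch below illustrates this conditional step only; it assumes a zero prior mean, the extrinsic squared exponential kernel, a fixed noise variance, and a toy regression function on the unit sphere. It does not sample σ² or the kernel parameters, and all function names are hypothetical.

```python
import numpy as np

def se_ext_gram(A, B, alpha, beta):
    """Extrinsic SE kernel between rows of A and B (embedded points in R^D)."""
    sq = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return alpha * np.exp(-beta * np.maximum(sq, 0.0))

def egp_posterior(J_train, y, J_test, alpha, beta, sigma2):
    """Posterior mean and covariance of F at test points (zero prior mean)."""
    K = se_ext_gram(J_train, J_train, alpha, beta)
    Ks = se_ext_gram(J_test, J_train, alpha, beta)
    Kss = se_ext_gram(J_test, J_test, alpha, beta)
    L = np.linalg.cholesky(K + sigma2 * np.eye(len(y)))
    weights = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = Ks @ weights
    V = np.linalg.solve(L, Ks.T)
    cov = Kss - V.T @ V
    return mean, cov

rng = np.random.default_rng(1)

def sample_sphere(n):
    """Uniform points on S^2, represented by their coordinates in R^3."""
    v = rng.standard_normal((n, 3))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def toy_function(X):
    """Toy regression function: sum of the ambient coordinates."""
    return X.sum(axis=1)

X_train, X_test = sample_sphere(200), sample_sphere(50)
y = toy_function(X_train) + 0.1 * rng.standard_normal(200)
mean, cov = egp_posterior(X_train, y, X_test, alpha=1.0, beta=1.0, sigma2=0.01)
rmse = np.sqrt(np.mean((mean - toy_function(X_test)) ** 2))
```

In a full Bayesian analysis this conditional computation would sit inside a sampler over the kernel parameters and σ².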
Another important class of problems is classification, in which one is generally interested in predicting a categorical (e.g., binary as a special case) outcome given the predictors. Denote the responses or outcomes as 1 or 0 for the binary case, and let F(x) be the probability of observing 1 at predictor level x. One can impose a prior distribution on F by imposing an eGP on a latent process w(x), such that F(x) = L(w(x)) and L is a fixed link function, for example the probit or logistic link. Properties of F(x) can be derived from those of w(x), as L provides a smooth one-to-one monotone transformation of w(x) into F(x). Extensions to categorical outcomes beyond binary are straightforward.

Examples
To illustrate the broad utility of eGP models, we consider a large class of examples with predictors lying on manifolds including spheres, planar shapes, positive definite matrices, and Grassmannians.
All details of the embeddings are provided for constructing the extrinsic kernels for eGPs. Embedding manifolds into Euclidean spaces or other manifolds has been applied in different settings.
In St. Thomas et al. (2014), for example, the manifold of the parameters of a statistical model is embedded into a big sphere, while Lin et al. (2016) embed the response manifold of a regression model into a Euclidean space for inference. In Section 3.1, a simulation study is carried out to compare the performance of an eGP model with that of an intrinsic one in a regression model with predictors on a sphere. In Section 3.2, an eGP model is applied to classify the gender of gorillas based on skull images. In this case, the predictor space is the 2-d landmark-based shape space, i.e., the planar shape space. In Section 3.3, we consider a classification problem whose predictors are positive definite matrices; this problem has important applications in neuroimaging. We apply the eGP model to an HIV study to identify the most sensitive sites for disease detection or diagnosis.
Lastly in section 3.4, we apply our eGP model to a regression problem with predictors lying on a Grassmannian manifold in a simulation study.
3.1. Spheres. Modeling on the sphere has received particular attention due to applications in spatial statistics; for example, global models for climate or satellite data (Jun and Stein, 2008; Huang et al., 2011). We consider eGP models for regression with the predictors lying on a sphere $S^d$. The model is illustrated with predictors on $S^2$. Note that for the particular case of spheres, there is a somewhat extensive literature studying valid positive-definite functions or covariance functions on spheres for various purposes (see, e.g., Gneiting (2013) and Du et al. (2013)).
To construct a valid extrinsic covariance kernel on S d , first note that S d is a submanifold of R d+1 , so that the inclusion map J serves as a natural embedding of S d into R d+1 . It is easy to check that J is an equivariant embedding with respect to the Lie group H = SO(d + 1), the group of d + 1 by d + 1 special orthogonal matrices. Intuitively speaking, this embedding preserves a lot of symmetries of the sphere.
One can adopt the extrinsic squared exponential kernel (2.3) on $S^d$ for an eGP model, with
$$K_{ext}(x_1, x_2) = \alpha \exp\{-\beta \|x_1 - x_2\|^2\}.$$
We now consider a simulation study in which the performance of an eGP model is compared with that of a GP model using an intrinsic kernel. Computationally friendly intrinsic kernels are only available for some special cases such as $S^1$ and $S^2$. We compare our extrinsic model to a GP model with an intrinsic kernel $K_{int}$ that is a valid covariance kernel on a sphere (see, e.g., Section 3 of Huang et al. (2011)).
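On the unit sphere, the extrinsic (chordal) distance and the great-circle distance θ are related by ρ = 2 sin(θ/2), which makes the difference between the two kinds of kernel concrete. The short sketch below checks this relation numerically; the function names are illustrative only.

```python
import numpy as np

def chordal_distance(x1, x2):
    """Extrinsic distance on the unit sphere under the inclusion map."""
    return np.linalg.norm(x1 - x2)

def great_circle_distance(x1, x2):
    """Geodesic (intrinsic) distance on the unit sphere."""
    return np.arccos(np.clip(np.dot(x1, x2), -1.0, 1.0))

north_pole = np.array([0.0, 0.0, 1.0])
on_equator = np.array([1.0, 0.0, 0.0])

rho = chordal_distance(north_pole, on_equator)        # sqrt(2)
theta = great_circle_distance(north_pole, on_equator) # pi / 2

# Extrinsic squared exponential kernel value for alpha = beta = 1.
k_ext = np.exp(-rho ** 2)
```

For nearby points the two distances nearly coincide, so the extrinsic kernel differs from a geodesic-based one mainly at large separations.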
Data are simulated from the regression model
$$y = F(x) + \epsilon, \qquad F(x) = x_1 + x_2 + x_3,$$
where $x$ is a point on the unit sphere, $x_{1:3}$ are the coordinates of $x$ in three-dimensional Euclidean space, the true regression function $F$ is taken to be the sum of $x_{1:3}$, and $\epsilon$ is a zero-mean Gaussian noise term. We fit GP models with covariance kernels $K_{int}$ and $K_{ext}$. Since the kernel parameters ($\theta = \{\alpha, \beta\}$) are correlated (Rasmussen, 2004), standard Markov chain Monte Carlo (MCMC) sampling traverses the parameter space slowly. Instead, we use Hamiltonian Monte Carlo (HMC) for inference on the kernel parameters, which improves efficiency by producing relatively distant proposals that are accepted with high probability (Duane et al., 1987). Here are some details on the priors and the HMC chains: both the length-scale and magnitude hyperparameters of the covariance kernels of the eGP are given gamma(10,10) priors; $\pi_{\sigma^2}$ is given by gamma(1,10 …
3.2. Landmark-based shape spaces $\Sigma_2^k$. We now apply eGP models to regression and classification on planar shapes. Planar shape spaces are one of the most important classes of landmark-based shape spaces, with wide applications in biology and medical imaging. Such spaces were first studied in Kendall (1977), and in the pioneering work of Bookstein (1978), motivated by applications to biological shapes.
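Let $z = (z_1, \ldots, z_k)$, with $z_1, \ldots, z_k \in \mathbb{R}^2$ not all equal, be a set of $k$ landmarks. The planar shape space $\Sigma_2^k$ is the collection of such $z$'s modulo the Euclidean motions, including translation, scaling and rotation. Removing the effects of translation and scaling maps $z$ to a point $u$ on the sphere $S^{2k-3}$, and one has $\Sigma_2^k = S^{2k-3}/SO(2)$, the quotient of the sphere by the action of $SO(2)$, the group of 2 × 2 special orthogonal matrices (that is, modulo the effect of rotation). A point in $\Sigma_2^k$ can be identified with the orbit of some $u \in S^{2k-3}$, which we denote as $\sigma(z)$. Viewing $z$ as a vector of complex numbers, one can embed $\Sigma_2^k$ into $S(k, \mathbb{C})$, the space of $k \times k$ complex Hermitian matrices, via the Veronese-Whitney embedding (see, e.g., Bhattacharya and Bhattacharya (2012)):
$$J(\sigma(z)) = u u^*,$$
where $u^*$ denotes the conjugate transpose of $u$. One can verify that $J$ is equivariant (see Kendall (1984)) with respect to the Lie group
$$H = SU(k) = \{A \in GL(k, \mathbb{C}) : AA^* = I, \ \det(A) = 1\},$$
with its action on $\Sigma_2^k$ induced by left multiplication. This embedding $J$ will be used to construct covariance kernels for eGPs on $\Sigma_2^k$.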
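The embedding can be illustrated with a few lines of NumPy. This is a sketch under the assumption that landmarks are stored as a complex vector and that translation and scale are removed by centering and normalizing; the function names `preshape`, `veronese_whitney` and `shape_distance` are hypothetical.

```python
import numpy as np

def preshape(landmarks):
    """Remove translation (centering) and scale (unit norm) from k planar
    landmarks stored as a complex vector."""
    z = np.asarray(landmarks, dtype=complex)
    centered = z - z.mean()
    return centered / np.linalg.norm(centered)

def veronese_whitney(landmarks):
    """Embed a planar shape as the k x k Hermitian matrix u u^*."""
    u = preshape(landmarks)
    return np.outer(u, u.conj())

def shape_distance(z1, z2):
    """Extrinsic distance: Frobenius norm between the embedded shapes."""
    return np.linalg.norm(veronese_whitney(z1) - veronese_whitney(z2))

rng = np.random.default_rng(0)
z = rng.standard_normal(8) + 1j * rng.standard_normal(8)

# Same configuration after scaling by 2.5, rotating by 0.7 radians
# and translating by 1 + 2i: the shape is unchanged.
z_moved = 2.5 * np.exp(0.7j) * z + (1.0 + 2.0j)
z_other = rng.standard_normal(8) + 1j * rng.standard_normal(8)

same = shape_distance(z, z_moved)
different = shape_distance(z, z_other)
```

Because a rotation multiplies `u` by a unit complex number, it cancels in `u u^*`, which is why the embedded matrix depends only on the shape.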
As an example, we apply an eGP to a classification problem with predictors on Σ k 2 . We aim to classify the gorilla skull images from Dryden and Mardia (1998), which are represented as planar shapes with 8 landmarks, by gender. A binary GP classification model is developed using 59 gorilla skull images. We take y i ∈ {0, 1}, where 0 represents a female and 1 a male.
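We have the following model:
$$y_i \mid x_i \sim \mathrm{Bernoulli}(F(x_i)), \qquad F(x) = \Phi(w(x)),$$
where $w$ is a latent process with an eGP prior and $\Phi$ is the standard normal cdf.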
Following Williams and Rasmussen (1996) and Neal (2012), we use the Hamiltonian Monte Carlo (HMC) method for posterior computation. The likelihood is approximated using Laplace's method as in Williams and Barber (1998). Gamma priors are used on the kernel hyperparameters, with Gamma(0.5,2) for the length-scale and Gamma(50,1) for the magnitude parameter. The number of MCMC iterations is 10,000 with a burn-in of 3,000. The HMC estimates of the kernel parameters are shown in Figure 3.
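To make the Laplace step concrete, the following sketch finds the posterior mode of the latent values under a probit likelihood by Newton iterations and returns approximate class probabilities. It is only an illustration under stated assumptions: kernel hyperparameters are held fixed (no HMC), the prior mean is zero, and the toy data live on a circle rather than on a shape space. All function names are hypothetical.

```python
import numpy as np
from math import erf, sqrt, pi

def norm_cdf(t):
    """Standard normal cdf, applied elementwise."""
    return 0.5 * (1.0 + np.vectorize(erf)(np.asarray(t, dtype=float) / sqrt(2.0)))

def norm_pdf(t):
    return np.exp(-0.5 * np.asarray(t, dtype=float) ** 2) / sqrt(2.0 * pi)

def probit_terms(w, s):
    """Gradient and negative Hessian diagonal of sum_i log Phi(s_i w_i)."""
    ratio = norm_pdf(w) / np.clip(norm_cdf(s * w), 1e-12, None)
    grad = s * ratio
    neg_hess = np.maximum(ratio ** 2 + s * w * ratio, 0.0)
    return grad, neg_hess

def laplace_probit(K, y, K_star, k_star_diag, n_iter=100, tol=1e-9):
    """Laplace approximation for a probit GP classifier with labels in {0, 1}."""
    s = 2.0 * np.asarray(y, dtype=float) - 1.0      # recode labels as -1 / +1
    n = len(s)
    w = np.zeros(n)
    for _ in range(n_iter):                          # Newton iterations
        grad, W = probit_terms(w, s)
        sw = np.sqrt(W)
        L = np.linalg.cholesky(np.eye(n) + sw[:, None] * K * sw[None, :])
        b = W * w + grad
        a = b - sw * np.linalg.solve(L.T, np.linalg.solve(L, sw * (K @ b)))
        w_new = K @ a
        converged = np.max(np.abs(w_new - w)) < tol
        w = w_new
        if converged:
            break
    grad, W = probit_terms(w, s)                     # quantities at the mode
    sw = np.sqrt(W)
    L = np.linalg.cholesky(np.eye(n) + sw[:, None] * K * sw[None, :])
    mean = K_star @ grad
    v = np.linalg.solve(L, sw[:, None] * K_star.T)
    var = k_star_diag - np.sum(v ** 2, axis=0)
    return norm_cdf(mean / np.sqrt(1.0 + var))

def se_gram(A, B, alpha=1.0, beta=1.0):
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=2)
    return alpha * np.exp(-beta * sq)

# Toy data: points on the unit circle, labelled by the sign of the second coordinate.
angles = 2.0 * np.pi * (np.arange(40) + 0.5) / 40
train = np.column_stack([np.cos(angles), np.sin(angles)])
labels = (np.sin(angles) > 0).astype(int)
test = np.array([[0.0, 1.0], [0.0, -1.0]])

probs = laplace_probit(se_gram(train, train), labels,
                       se_gram(test, train), np.ones(2))
```

Only Gram matrices enter the computation, so the same routine applies once the kernel is evaluated through an embedding of the predictors.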
We use eight skull images as testing data, and all of these images are correctly classified by our GP classifier. The classification probabilities are provided in Table 1. The results are compared with a naive GP on the preshape data (modulo the effects of translation and scaling) without any embedding; the latter completely failed at classification, returning classification probabilities of 0.5 throughout. The results indicate that naive GPs are not suitable for complex manifolds that do not arise as submanifolds of a Euclidean space, or for which a simple representation of the space using Euclidean coordinates is not available. In particular, for complex manifolds such as planar shapes, the naive …
3.3. Positive definite matrices. Diffusion tensor imaging (DTI) is designed to measure the diffusion of water molecules in the brain; diffusion tends to be directional along white matter tracts or fibers, corresponding to structural connections between brain regions along which substantial brain activity and communication occur. DTI data are now collected routinely in human studies, and there is abundant interest in using DTI to build better predictive models of cognitive traits and neuropsychiatric disorders. The diffusion anisotropy is characterized in terms of diffusion matrices, which are 3 × 3 positive definite matrices measured at each voxel in the brain. We denote the space of all such matrices as SPD(3).
The space SPD(3) belongs to an important class of manifolds that possesses particular geometric structures, which should be taken into account in statistical analyses. Our goal is to study the regression relationship between DTI-valued covariates and patient outcomes.
In order to carry out regression and classification on SPD(3) using our eGP models, we need a suitable embedding with which to construct the extrinsic kernels. There are a few natural embeddings of SPD(3) into Euclidean spaces. In particular, one can embed it into the space Sym(3) of 3 × 3 symmetric matrices via the embedding $J$ in (3.5). Given $A_1, A_2 \in$ SPD(3), their extrinsic distance under the embedding (3.5) is given by
$$\rho(A_1, A_2) = \|J(A_1) - J(A_2)\|,$$
where $\|\cdot\|$ denotes the Frobenius norm of matrices (i.e., $\|A\| = \mathrm{Tr}(AA^\top)^{1/2}$). This extrinsic distance will be used to construct an eGP kernel as in (2.5).
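As an illustration of an extrinsic distance on SPD(3), the sketch below uses the matrix logarithm as the embedding into Sym(3). This choice is an assumption made for the example, being one natural embedding of positive definite matrices into symmetric matrices; the function names are hypothetical.

```python
import numpy as np

def spd_log(A):
    """Matrix logarithm of a symmetric positive definite matrix,
    computed from its eigendecomposition."""
    vals, vecs = np.linalg.eigh(A)
    return (vecs * np.log(vals)) @ vecs.T

def extrinsic_distance_spd(A1, A2):
    """Frobenius norm between the embedded matrices (log map assumed)."""
    return np.linalg.norm(spd_log(A1) - spd_log(A2), ord="fro")

identity = np.eye(3)
scaled = np.e * np.eye(3)            # log(e * I) = I
d_scaled = extrinsic_distance_spd(identity, scaled)   # ||I||_F = sqrt(3)

# Conjugating both matrices by the same orthogonal matrix leaves the distance unchanged.
rng = np.random.default_rng(0)
B1, B2 = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
A1, A2 = B1 @ B1.T + np.eye(3), B2 @ B2.T + np.eye(3)
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
d_before = extrinsic_distance_spd(A1, A2)
d_after = extrinsic_distance_spd(Q @ A1 @ Q.T, Q @ A2 @ Q.T)
```

Plugging this distance into the squared exponential kernel gives a Gram matrix that can be passed to any standard GP routine.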
We now consider a diffusion tensor imaging (DTI) data set consisting of 46 subjects, with 28 HIV+ subjects and 18 healthy controls. Diffusion tensors were extracted along one atlas fiber tract of the splenium of the corpus callosum. The DTI data for all the subjects are registered in the same atlas space based on arc lengths, with 75 tensors obtained along the fiber tract of each subject. This data set has been studied in a regression setting in Yuan et al. (2012) and in the context of two-sample testing (Bhattacharya and Lin, 2017). A GP classifier discriminating between the control group and the HIV+ group is fit at each of the 75 sites along the fiber tract, so 75 classifiers are run in total. We aim to find out which sites of the splenium of the corpus callosum are most sensitive to the influence of HIV.
Fourteen subjects (six controls and eight HIV+) are used to test the HIV status classifiers (0 for healthy and 1 for HIV+) using eGP models. A similar binary GP classification model is applied to the DTI data at each of the prespecified 75 locations along the chosen tract. We have identified the top ten most sensitive sites, indexed by the arc length (location on the brain). The results are recorded in Table 2, which shows the total number of correct GP predictions of HIV status for the 14 tested subjects at each of the top ten sites. Again, HMC with the Laplace approximation is used for model parameter inference. The posterior distribution of the kernel hyperparameters for the GP classifier at one of the 75 sites along the fiber tract is shown in Figure 4. A Gamma(0.5,2) prior is used for the kernel length-scale and a Gamma(2.5,2) prior for the kernel magnitude. The number of Monte Carlo iterations is 10,000 with a burn-in of 3,000.
3.4. Stiefel manifolds and Grassmann manifolds (Grassmannians). We now consider regression and classification problems whose predictors lie on Stiefel or Grassmann manifolds. Given integers $m \geq k \geq 0$, the Stiefel manifold $V_k(\mathbb{R}^m)$ is the collection of all k-tuples of orthonormal vectors in $\mathbb{R}^m$, and the Grassmann manifold $Gr_k(\mathbb{R}^m)$ is the collection of all k-dimensional subspaces in $\mathbb{R}^m$. Every k-tuple of orthonormal (hence linearly independent) vectors spans a k-dimensional subspace, and every k-dimensional subspace is spanned by some k-tuple of orthonormal vectors.
This means there is a surjective map V k (R m ) → Gr k (R m ). There is a natural action of O(k), the group of k × k orthogonal matrices, on V k (R m ) and any two k-tuples of orthonormal vectors span the same subspace precisely if they differ by an action of O(k), which provides the identification V k (R m )/O(k) = Gr k (R m ). Grassmann manifolds have many applications in signal processing and machine learning (Kutyniok et al., 2009).
There is an equivariant embedding of $Gr_k(\mathbb{R}^m)$ into a Euclidean space (Chikuse, 2003). Let $X \in V_k(\mathbb{R}^m)$, written as an $m \times k$ matrix with orthonormal columns, and let $\sigma(X) = X \cdot O(k)$ be the $O(k)$-orbit of $X$ in $Gr_k(\mathbb{R}^m) = V_k(\mathbb{R}^m)/O(k)$. Note that $J(\sigma(X)) = XX^\top$ defines an embedding $J$ of $Gr_k(\mathbb{R}^m)$ into the space of $m \times m$ matrices, which may be identified with $\mathbb{R}^{m^2}$. Also, it is equivariant with respect to the group $H = O(m)$ acting on $Gr_k(\mathbb{R}^m)$ via left multiplication on $\mathbb{R}^m$ and on $m \times m$ matrices by conjugation. Indeed, for $h \in H$, one has $J(h\sigma(X)) = hXX^\top h^\top = \phi(h)J(\sigma(X))$, where $\phi(h)$ stands for conjugation by $h$. Now the extrinsic distance between two points in $Gr_k(\mathbb{R}^m)$ is given by
$$\rho(\sigma(X_1), \sigma(X_2)) = \|X_1X_1^\top - X_2X_2^\top\|,$$
where $\|\cdot\|$ is the Frobenius norm on matrices. We use the kernel (2.5).
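The two properties used above, that the embedding is well defined on O(k)-orbits and that it is equivariant under O(m), can be verified numerically. The sketch below does so for m = 10 and k = 5; the function names are illustrative only.

```python
import numpy as np

def grassmann_embed(X):
    """Embed the subspace spanned by the orthonormal columns of X as X X^T."""
    return X @ X.T

def grassmann_distance(X1, X2):
    """Extrinsic distance: Frobenius norm between the embedded subspaces."""
    return np.linalg.norm(grassmann_embed(X1) - grassmann_embed(X2), ord="fro")

def random_orthonormal(rows, cols, rng):
    """Matrix with orthonormal columns from a QR factorization."""
    Q, _ = np.linalg.qr(rng.standard_normal((rows, cols)))
    return Q

rng = np.random.default_rng(0)
m, k = 10, 5
X = random_orthonormal(m, k, rng)
Y = random_orthonormal(m, k, rng)
R = random_orthonormal(k, k, rng)     # element of O(k): change of basis
h = random_orthonormal(m, m, rng)     # element of O(m): acts on R^m

same_subspace = grassmann_distance(X, X @ R)   # X and X R span the same subspace
d_xy = grassmann_distance(X, Y)
d_xy_moved = grassmann_distance(h @ X, h @ Y)  # O(m) acts by isometries
```

The extrinsic squared exponential kernel is then obtained by applying `alpha * exp(-beta * d ** 2)` to these distances.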
Remark 3.1. The Stiefel manifold V k (R m ) is naturally a submanifold of R m×k and the inclusion map is an equivariant embedding.
We now apply the eGP model to data simulated from $y = F(XX^\top) + \epsilon$, where $X$ is an $m \times k$ matrix with $m = 10$ the ambient dimension and $k = 5$ the subspace dimension. The regression function is $F(XX^\top) = \beta^\top XX^\top \beta$, where $\beta$ is some known vector. We simulated 100, 200 and 300 training data points and an additional 50 points for testing, at different signal-to-noise ratio levels. Table 3 records the RMSE values. As expected, the RMSE decreases with increasing training size and signal-to-noise ratio.

Properties of eGPs
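In this section, we first study the properties of an eGP in terms of mean square differentiability.
The smoothness of a stochastic process captures and quantifies the intuition that inputs that are close (on a manifold) are likely to produce similar output values. Therefore, understanding the smoothness property is important for interpolation and prediction. In addition, we study the posterior contraction rates of eGP models.
Definition 4.1. Let $w$ be a stochastic process on $M$, let $x \in M$ and $v$ be a tangent vector of $M$ at $x$, and let $\gamma : (-\epsilon, \epsilon) \to M$ be a smooth path with $\gamma(0) = x$ and $\gamma'(0) = v$.
(a) We say $w$ is mean square (MS) differentiable at $x$ with respect to $v$ if, as $a \to 0$, the random variable $\frac{w(\gamma(a)) - w(x)}{a}$ converges to some limit $D_v w$ in mean squares, i.e.
$$\lim_{a \to 0} E\left[\left(\frac{w(\gamma(a)) - w(x)}{a} - D_v w\right)^2\right] = 0.$$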
In this case, D v w is called the MS derivative of w at x with respect to v.
(b) If w is MS differentiable at x with respect to every tangent vector at that point, then we simply say that w is MS differentiable at x. This notion of MS differentiability generalizes the existing one in Euclidean spaces.
Proposition 4.2. If the mean function µ is differentiable at x and the covariance function K is of class C 2 at (x, x), then the stochastic process w is MS differentiable at x.
Proof. Since µ is differentiable at x, the statement will hold for w if it also holds for w − µ, whose mean function is 0. Hence we may assume µ = 0 without loss of generality.
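For $a \in (-\epsilon, 0) \cup (0, \epsilon)$, consider the random variable
$$D_a = \frac{w(\gamma(a)) - w(x)}{a},$$
where $\gamma$ is a smooth path with $\gamma(0) = x$ and $\gamma'(0) = v$. It suffices to show that $D_a$ has a limit in mean squares (i.e. in $L^2$) as $a \to 0$. Notice that, since $\mu = 0$,
$$E[D_a D_b] = \frac{K(\gamma(a), \gamma(b)) - K(\gamma(a), x) - K(x, \gamma(b)) + K(x, x)}{ab} \to \frac{\partial^2}{\partial a\, \partial b} K(\gamma(a), \gamma(b)) \Big|_{a = b = 0}$$
as $a, b \to 0$, because $K$ is of class $C^2$ at $(x, x)$. It follows that, under the same limit,
$$E[(D_a - D_b)^2] = E[D_a^2] - 2E[D_a D_b] + E[D_b^2] \to 0.$$
Therefore, as $a \to 0$, $D_a$ satisfies the Cauchy condition with respect to the $L^2$ norm and, by completeness, admits an $L^2$ limit.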
Proof. The first statement is immediate from Proposition 4.2. For $i = 1, 2$, let $x_i \in M$ and $\gamma_i : (-\epsilon, \epsilon) \to M$ be a smooth path with $\gamma_i(0) = x_i$ and $\gamma_i'(0) = V_{x_i}$. By the Cauchy-Schwarz inequality and the MS differentiability of $w$, we have … Similarly as above, we have … Similarly again, we also have … This completes the proof.
Corollary 4.4. If µ is of class C n and K is of class C 2n , then w is n-times MS differentiable.
Example 4.5. Suppose $J : M \to \mathbb{R}^D$ is an embedding of $M$ into a (higher-dimensional) Euclidean space $\mathbb{R}^D$. Given a stochastic process $w$ in $\mathbb{R}^D$, we can pull it back to a stochastic process $J^*w$ in $M$ by
$$(J^*w)(x) = w(J(x)), \qquad x \in M.$$
Clearly, if the mean and covariance functions of $w$ are $\mu$ and $K$, then the mean and covariance functions of $J^*w$ are $J^*\mu$ and $(J \times J)^*K$. Also, if $\mu$ is $C^n$, $K$ is $C^{2n}$ and $J$ is $C^{2n}$ as well, then $J^*\mu$ is $C^n$ and $(J \times J)^*K$ is $C^{2n}$; hence by Corollary 4.4, $J^*w$ is n-times MS differentiable.
For example, if w is a Gaussian process in R D with a Matérn-ν covariance function (and zero mean), then J * w is an ν−1 2 -times MS differentiable Gaussian process in M ; and if w is a Gaussian process in R D with a squared-exponential covariance function, then J * w is an infinitely MS differentiable Gaussian process in M .

4.2. Posterior contraction rates of eGPs. In this short subsection, we explore the posterior contraction rates of a regression model on a manifold with an eGP as the prior for the regression function. Posterior contraction rates measure how fast the posterior concentrates in small neighborhoods of the true regression function, providing frequentist asymptotic guarantees on the behavior of the eGP posterior. Given data $(x_i, y_i)$ with $x_i \in M$ and $y_i \in \mathbb{R}$ ($i = 1, \ldots, n$), assume the regression model (2.1) with $\epsilon_i \sim N(0, \sigma^2)$. The prior distribution $\Pi(F)$ will be given by the eGP with the covariance kernel (2.5) (with a fixed magnitude). The length-scale parameter $\beta$ is assigned a prior $\pi_\beta$ such that $\beta^d$ follows a gamma distribution Gamma($a_0$, $b_0$), where $d$ is the dimension of the manifold. For simplicity of exposition, assume $\sigma$ is known, though the results are straightforward to generalize to unknown $\sigma$. The posterior distribution of $F$ is then given by
$$\Pi(U \mid (x_1, y_1), \ldots, (x_n, y_n)) = \frac{\int_U \prod_{i=1}^{n} N(y_i; F(x_i), \sigma^2)\, \Pi(dF)}{\int \prod_{i=1}^{n} N(y_i; F(x_i), \sigma^2)\, \Pi(dF)},$$
where $U$ is a measurable set in the space of regression functions. Let $F_0$ be the true regression function. We say the eGP posterior contracts to $F_0$ at a rate of $\epsilon_n$ if
$$\Pi\big(U_n(F_0)^C \mid (x_1, y_1), \ldots, (x_n, y_n)\big) \to 0 \quad \text{a.s. } P^n_{F_0}, \qquad (4.2)$$
as $n \to \infty$, where $U_n(F_0)^C = \{F : d_M(F, F_0) > C\epsilon_n\}$ for some large constant $C$ and distance $d_M$.
We have the following proposition.
Proposition 4.6. Assume the regression model (2.1) with an eGP prior with covariance kernel (2.5). Then the following holds.
There is a one-to-one correspondence (a bijection) between $\tilde{F}$ and $F$, and one has $U_n(F_0) = \{F :$ … $(x_1, y_1), \ldots, (x_n, y_n)) \to 0$, where $\epsilon_n$ is given in part (a).
(b) Similar proofs follow from part (a), noting that there is a one-to-one correspondence between $\{\tilde{F} : \int_{\tilde{M}} (\tilde{F}(x) - \tilde{F}_0(x))^2\, \tilde{g}(dx) < \epsilon_n\}$ and $\{F : \int_M (F(x) - F_0(x))^2 g(x)\,dx < \epsilon_n\}$, where $\tilde{g}(x)$ is the density on $\tilde{M}$ induced by the embedding $J$ and the density $g(x)$ on $M$.

Discussion and conclusion
We propose a general extrinsic framework for constructing Gaussian processes on manifolds for regression and classification with manifold-valued predictors. Such models are general, easy to implement and shown to inherit good properties from Gaussian processes on Euclidean spaces.
We apply eGP models to regression and classification problems with predictors on a large class of manifolds, ranging from spheres and landmark-based shape spaces to the spaces of positive definite matrices and Grassmannians. Our work will likely help practitioners make more accurate predictions or diagnoses based on medical imaging. Although the work focuses on regression and classification, eGPs can be used in much broader settings, such as exponential family models for the response y i given x i , which allow Poisson regression, among others. In addition, eGPs can certainly be used for spatial modeling where the spatial domain is a geometric space such as the sphere. Future work will be devoted to constructing applicable covariance kernels employing the intrinsic Riemannian geometry of manifolds, which are only available now for a very limited class of manifolds, and also to constructing valid GP models for spaces beyond manifolds, such as stratified spaces of interest.