Estimation of the mean for spatially dependent data belonging to a Riemannian manifold

Abstract: The statistical analysis of data belonging to Riemannian manifolds is becoming increasingly important in many applications. The aim of this work is to introduce models for spatial dependence among Riemannian data, with a special focus on the case of positive definite symmetric matrices. First, the Riemannian semivariogram of a field of positive definite symmetric matrices is defined. Then, we propose an estimator for the mean which considers both the non Euclidean nature of the data and their spatial correlation. Simulated data are used to evaluate the performance of the proposed estimator: taking into account spatial dependence leads to better estimates when observations are irregularly spaced in the region of interest. Finally, we address a meteorological problem, namely, the estimation of the covariance matrix between temperature and precipitation for the province of Quebec in Canada.


Introduction
In recent years, attention to the statistical analysis of non Euclidean data has been growing. The conceptual framework is that of Object Oriented Data Analysis, as defined in Wang and Marron (2007). Indeed, non Euclidean data are mathematical objects more complex than numbers or vectors and they do not belong to a linear space. Thus, even the simplest statistical operations, such as finding a center point for the data distribution or evaluating variability about this center, represent a challenge. Statistical analysis needs to carefully consider the mathematical properties of the data at hand and, consequently, to reformulate traditional methods in this new setting.
Data belonging to a Riemannian manifold are particularly interesting both from a mathematical and from a practical point of view. Studies in this field have been motivated by many applications: for example, Shape Analysis (see, e.g., Jung et al., 2011), Diffusion Tensor Imaging (see Dryden et al., 2009, and references therein) and the estimation of covariance structures. The general aim of these studies is the extension to Riemannian data of traditional statistical methods developed for Euclidean data, such as point estimation of mean and variance (Dryden et al., 2009), exploratory data analysis, dimensionality reduction (Fletcher et al., 2004), hypothesis testing among different populations (Schwartzman et al., 2010) and smoothing (Yuan et al., 2012).
This work focuses on the development of spatial statistical methods for non Euclidean data. Little attention has been paid to this problem so far, even though in many applications data are spatially distributed. In the general context of complex data, this issue has recently received much attention within the field of functional data analysis (see Delicado et al., 2010; Gromenko et al., 2012; Menafoglio et al., 2012), but the extension to non Euclidean data is an even greater challenge because such data do not belong to a vector space.
Our final goal is the development of a complete spatial statistics theory for data belonging to a Riemannian manifold. Here we take the first steps in this direction by proposing a tool for the description of spatial dependence and by addressing the problem of estimating the mean in the presence of spatial dependence. The methods introduced here rely only on the definition of a distance among data and on the locally Euclidean structure of the manifold. Thus, application to any Riemannian manifold is possible, once the appropriate distance to compare two elements of the manifold has been chosen. However, in the present work we focus on the notable case of positive definite symmetric matrices (PD data), whose Riemannian distance and properties are illustrated in Section 2. A semivariogram for PD data is proposed in Section 3 and its properties are discussed. In Section 4, we describe an estimator for the mean from a sample of spatially correlated PD data. A model for generating samples from a random field of spatially correlated positive definite matrices is proposed in Section 5. Simulated data are used to evaluate the performance of the proposed estimator of the mean: if observations are spatially located on an irregular grid, this method provides better estimates than those obtained ignoring spatial dependence. Finally, in Section 6 we address the problem of estimating the covariance matrix between temperature and precipitation in the province of Quebec, Canada.

Statistical analysis of positive definite symmetric matrices
Positive definite symmetric matrices are an important instance of data belonging to a Riemannian manifold. In this section, we introduce notation and a few metrics, together with their properties, that we deem useful when dealing with data that are positive definite symmetric matrices. A broad introduction to the statistical analysis of this kind of data can be found, e.g., in Pennec et al. (2006) or Dryden et al. (2009).
Let PD(p) indicate the Riemannian manifold of positive definite symmetric matrices of dimension p. It is a convex subset of R^{p(p+1)/2} but it is not a linear space: in general, a linear combination of elements of PD(p) does not belong to PD(p). Moreover, the Euclidean distance in R^{p(p+1)/2} is not suitable to compare positive definite symmetric matrices (see Moakher, 2005, for details). Thus, more appropriate metrics need to be used for statistical analysis. A good choice is the Riemannian distance, i.e. the length of the shortest path between two points on the manifold. A description of the properties of Riemannian manifolds in general, and of PD(p) in particular, can be found in Moakher and Zéraï (2011) and references therein. For the scopes of the present work, it is enough to recall that the Riemannian distance between elements P_1, P_2 ∈ PD(p) is

d_R(P_1, P_2) = || log(P_1^{-1/2} P_2 P_1^{-1/2}) ||_F = sqrt( Σ_{i=1}^p (log σ_i)^2 ),

where the σ_i are the eigenvalues of the matrix P_1^{-1} P_2 and ||·||_F is the Frobenius norm for matrices, defined as ||A||_F = sqrt(trace(A^T A)). This distance is also called trace metric, for instance in Yuan et al. (2012).
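As an illustration, the Riemannian distance admits a direct numerical implementation with standard linear-algebra tools. The following sketch (in Python with NumPy/SciPy; not part of the original paper) computes d_R both from the matrix logarithm and, equivalently, from the eigenvalues of P_1^{-1} P_2:

```python
import numpy as np
from scipy.linalg import sqrtm, logm, eigvalsh

def riemannian_distance(P1, P2):
    """d_R(P1, P2) = || log(P1^{-1/2} P2 P1^{-1/2}) ||_F."""
    P1_inv_sqrt = np.linalg.inv(sqrtm(P1))
    return np.linalg.norm(logm(P1_inv_sqrt @ P2 @ P1_inv_sqrt), 'fro')

def riemannian_distance_eig(P1, P2):
    """Equivalent form: sqrt(sum_i log(sigma_i)^2), with sigma_i the
    eigenvalues of P1^{-1} P2 (a generalized symmetric eigenproblem)."""
    sigma = eigvalsh(P2, P1)   # solves P2 v = sigma P1 v
    return np.sqrt(np.sum(np.log(sigma) ** 2))
```

Either form may be used; the eigenvalue version avoids one matrix square root and is typically cheaper for small p.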
Once a metric has been introduced in PD(p), we can address the problem of estimating the mean given a sample of positive definite symmetric matrices. In recent years, many authors (Fletcher et al., 2004; Pennec et al., 2006; Dryden et al., 2009) proposed to use the Fréchet mean for a more coherent approach in dealing with data belonging to a Riemannian manifold. The Fréchet mean of a random element S, with probability distribution µ on a Riemannian manifold, is defined as

Σ_R = arginf_{P ∈ PD(p)} ∫ d_R(S, P)^2 µ(dS),

and it can be estimated with the sample Fréchet mean

Σ̂_R = arginf_{P ∈ PD(p)} Σ_{i=1}^n d_R(S_i, P)^2,

where S_i, i = 1, . . . , n, is a sample from µ. For the PD(p) case, both the Fréchet mean and the sample Fréchet mean exist and are unique (see, e.g., Moakher and Zéraï, 2011). By means of extensive comparisons, Dryden et al. (2009) show that using estimators based on the Riemannian distance, or its approximation, gives better results than the estimator based on the Euclidean metric.
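The sample Fréchet mean under the Riemannian metric has no closed form. The paper does not prescribe a specific algorithm, but a standard choice is a fixed-point iteration that averages the data in the tangent space at the current estimate (via the log map) and maps the average back to the manifold (via the exp map); the sketch below is one such possibility, not the authors' implementation:

```python
import numpy as np
from scipy.linalg import sqrtm, logm, expm

def frechet_mean_spd(sample, n_iter=100, tol=1e-10):
    """Fixed-point iteration for the sample Frechet mean on PD(p) under
    the Riemannian metric: log-map the data to the tangent space at the
    current estimate, average there, and exp-map back."""
    P = sum(sample) / len(sample)   # Euclidean mean as a starting point
    for _ in range(n_iter):
        P_sqrt = sqrtm(P)
        P_inv_sqrt = np.linalg.inv(P_sqrt)
        # average of the log-mapped data: a tangent vector at P
        T = sum(logm(P_inv_sqrt @ S @ P_inv_sqrt) for S in sample) / len(sample)
        P_new = P_sqrt @ expm(T) @ P_sqrt
        if np.linalg.norm(P_new - P, 'fro') < tol:
            return P_new
        P = P_new
    return P
```

For two commuting matrices the iteration converges to the geodesic midpoint, i.e. the matrix geometric mean.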
Analogously, the variance of S can be defined as σ² = Var(S) = E[d_R(S, Σ_R)²] and estimated with the sample variance σ̂² = (1/n) Σ_{i=1}^n d_R(S_i, Σ̂_R)². In practical applications, using the Riemannian distance could be computationally expensive. For this reason, other distances have been proposed to compare two positive definite symmetric matrices. For example, we may consider the Cholesky decomposition of the positive definite symmetric matrix P, i.e. the lower triangular matrix with positive diagonal entries L = chol(P) such that P = L L^T. Then, Wang et al. (2004) defined a Cholesky distance between two positive definite symmetric matrices as

d_C(P_1, P_2) = || chol(P_1) − chol(P_2) ||_F.

Using the Cholesky distance, the sample Fréchet mean for a sample S_i, i = 1, . . . , n, is easily computed:

Σ̂_C = ( (1/n) Σ_{i=1}^n chol(S_i) ) ( (1/n) Σ_{i=1}^n chol(S_i) )^T.

Another possibility is to resort to the square root distance (Dryden et al., 2009):

d_S(P_1, P_2) = || P_1^{1/2} − P_2^{1/2} ||_F,

where P^{1/2} is the matrix square root of P. Also in this case a simple formula exists for the sample Fréchet mean which minimizes the sum of square root distances from a sample S_1, . . . , S_n of positive definite symmetric matrices:

Σ̂_S = ( (1/n) Σ_{i=1}^n S_i^{1/2} )².

It is worth noticing that the square root distance is also defined for non negative definite matrices. Thus, it is to be preferred in applications where matrix data may have zero eigenvalues, or very small eigenvalues which lead to instability in the computation of the Riemannian distance or the Cholesky decomposition.
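Both of these alternative distances come with closed-form sample Fréchet means, so they can be coded in a few lines. The sketch below (NumPy/SciPy, illustrative only) implements the two distances and their closed-form means:

```python
import numpy as np
from scipy.linalg import cholesky, sqrtm

def cholesky_distance(P1, P2):
    """d_C(P1, P2) = || chol(P1) - chol(P2) ||_F (lower-triangular factors)."""
    return np.linalg.norm(cholesky(P1, lower=True) - cholesky(P2, lower=True), 'fro')

def sqrt_distance(P1, P2):
    """d_S(P1, P2) = || P1^{1/2} - P2^{1/2} ||_F."""
    return np.linalg.norm(sqrtm(P1) - sqrtm(P2), 'fro')

def frechet_mean_cholesky(sample):
    """Closed form: average the Cholesky factors, then multiply back."""
    L_bar = sum(cholesky(S, lower=True) for S in sample) / len(sample)
    return L_bar @ L_bar.T

def frechet_mean_sqrt(sample):
    """Closed form: (average of the matrix square roots) squared."""
    R_bar = sum(sqrtm(S) for S in sample) / len(sample)
    return R_bar @ R_bar
```

Note that the square-root mean only needs matrix square roots, which remain well defined for singular non negative definite inputs.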
In the following, we propose methods that are based on a general distance d(., .) on the manifold. In practice, the appropriate distance has to be chosen by looking at the problem at hand while weighing computational efficiency.

Semivariogram for positive definite symmetric matrices
Let us consider the random field

{S(s), s ∈ D},    (1)

where D is a subset of R^d and E[S(s)] = Σ ∈ PD(p) for every s ∈ D. Since our aim is to perform the statistical analysis from a single incomplete realization of the random field, we ask the spatial dependence between S(s_1) and S(s_2) to be a function only of h = s_1 − s_2, for s_1, s_2 ∈ D. This can be formally stated using the notion of joint probability measure on the manifold (see Pennec, 2006, for more details about probability measures on manifolds). For s_1, . . . , s_n ∈ D, consider the finite-dimensional measure

µ_{s_1,...,s_n}(Γ_1, . . . , Γ_n) = P(S(s_1) ∈ Γ_1, . . . , S(s_n) ∈ Γ_n),

for all possible Γ_1, . . . , Γ_n in the Borelian σ-field of PD(p). We require the random field to be strictly stationary, i.e. for every finite set s_1, . . . , s_n ∈ D and every lag h,

µ_{s_1+h,...,s_n+h} = µ_{s_1,...,s_n}.

In general, the definition of a covariance between two random elements on a Riemannian manifold is not straightforward, but in this particular setting a natural extension of the variogram seems to be available. Indeed, in the one dimensional Euclidean setting the variogram is defined as

2γ(s_1, s_2) = E[|Z(s_1) − Z(s_2)|²] − |E[Z(s_1)] − E[Z(s_2)]|²,

i.e., the expected value of the squared Euclidean distance between the random variables minus the squared Euclidean distance between their expected values. Hence, we may generalize the notion of variogram by substituting the Euclidean distance with a more appropriate distance, based on the geometry of the Riemannian manifold. By analogy with its definition in spatial statistics for Euclidean data (see, e.g., Cressie, 1993), we define the variogram for a positive definite matrix field as

2γ(s_1, s_2) = E[d_R(S(s_1), S(s_2))²] − d_R(E[S(s_1)], E[S(s_2)])²,

and consequently the semivariogram γ(h) = γ(s + h, s), when the expectations involved exist. Since we assume E[S(s)] = Σ for every s ∈ D, the Riemannian semivariogram simply becomes

γ(h) = (1/2) E[d_R(S(s + h), S(s))²].

In practice, we require that spatial correlation depends only on the length of the distance between two points s_1 and s_2, thus restricting to the case of an isotropic semivariogram, where γ(h) = γ(||h||).
This assumption is useful for estimation, but it can be removed in applications where information about the anisotropic structure of the field generating the data is available. Thus, in the presence of a sample (S(s_1), . . . , S(s_n)) generated by the random field (1), the isotropic semivariogram γ can be estimated from the empirical Riemannian distances, for instance by means of the classical estimator illustrated in Cressie (1993):

γ̂(h) = (1 / (2|N(h)|)) Σ_{(s_i, s_j) ∈ N(h)} d_R(S(s_i), S(s_j))²,   N(h) = { (s_i, s_j) : h − ∆ ≤ ||s_i − s_j|| ≤ h + ∆ },

where ∆ is a positive (small) quantity acting as a smoothing parameter, h = ||h|| and |N(h)| is the number of couples (s_i, s_j) belonging to N(h). Finally, a model semivariogram can be fitted to the empirical semivariogram, via least squares. As happens in the Euclidean setting, an accurate estimation of the semivariogram is crucial for the subsequent analysis. All the guidelines and methods developed for vector data can also be easily applied to the estimation of γ.
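The classical method-of-moments estimator of Cressie (1993) translates directly into code. In the sketch below (illustrative, not from the original paper) the manifold metric is passed as a squared-distance function, so any of the distances of Section 2 can be plugged in:

```python
import numpy as np

def empirical_semivariogram(locations, sq_dist, data, h_grid, delta):
    """For each lag h, average the squared manifold distances over the
    pairs N(h) whose spatial separation lies in [h - delta, h + delta],
    then divide by two (semivariogram)."""
    n = len(locations)
    sep = np.array([[np.linalg.norm(locations[i] - locations[j]) for j in range(n)]
                    for i in range(n)])
    d2 = np.array([[sq_dist(data[i], data[j]) for j in range(n)] for i in range(n)])
    gamma = []
    for h in h_grid:
        mask = np.triu(np.abs(sep - h) <= delta, k=1)   # each pair counted once
        n_pairs = mask.sum()
        gamma.append(d2[mask].sum() / (2.0 * n_pairs) if n_pairs > 0 else np.nan)
    return np.array(gamma)
```

With a Euclidean scalar field and sq_dist(a, b) = (a − b)², this reduces to the usual empirical semivariogram, which gives a quick sanity check.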

Stochastic dependence in non Euclidean spaces
The development of statistical methods for the analysis of samples generated by random fields of positive definite matrices asks for a definition of stochastic dependence between two random elements taking values on a Riemannian manifold. In Euclidean spaces, the covariance is a measure of linear dependence between two random variables. However, in a non Euclidean framework linear dependence cannot be properly captured. The definition proposed in the previous section is based on the difference between the common variance of the random elements and half the expected value of their square distance. In this section, we explore properties and limits of this definition to fully understand the peculiarity of a non linear space for what concerns stochastic dependence.
Let (A, B) be a random vector whose components are positive definite matrices. The spatial model proposed in the previous section leads to a covariance between the random matrices A and B defined as

Cov(A, B) = (1/2) ( Var(A) + Var(B) ) − (1/2) ( E[d(A, B)²] − d(E[A], E[B])² ).    (4)

In the Euclidean setting, covariance is a measure of how close to a linear subspace observations are expected to lie, e.g. a straight line in R². No linear subspaces exist on a Riemannian manifold, unless locally. In this framework the covariance measures how "near" observations are expected to be, with respect to their variability (i.e., the variance of the individual random elements A and B). A negative covariance indicates that A and B are expected to be farther apart than what is to be expected by looking only at their marginal means and variances.

D. Pigoli and P. Secchi
To better understand (4), we may focus on the special case E[A] = E[B] = Σ and Var(A) = Var(B) = σ², in which (4) reduces to Cov(A, B) = σ² − (1/2) E[d(A, B)²]. The maximum value is attained for A = B, with Cov(A, A) = σ². Therefore the covariance defined in (4) has an upper limit that is reached when the two random elements coincide.

Estimation of the mean from a spatially correlated sample on a Riemannian manifold
This section addresses the problem of estimating the mean given a sample of spatially correlated positive definite symmetric matrices. The influence of spatial correlation on estimation and prediction is well known in the traditional Euclidean setting (see, e.g., Cressie, 1993) and it has recently been highlighted also for the case of functional data (Gromenko and Kokoszka, 2011). In particular, in the presence of strong spatial correlation, the sample mean can be an inefficient estimator for the mean of the population, having larger variance than estimators that take spatial dependence into account; see Cressie (1993, Section 1.3) for a proof in the case of real valued random variables and Gromenko and Kokoszka (2011) for extensive simulation studies on functional data. Indeed, in the presence of highly irregular spatial designs, the sample may contain a great amount of data coming from close by locations, together with a few isolated and distant observations. If spatial correlation is strong, data from close by locations are expected to provide similar information; their influence on the estimate should be mitigated with respect to the few data coming from distant locations. We propose an estimator for the mean Σ of a random field S ∈ PD(p) which generalizes the estimator proposed by Gromenko and Kokoszka (2011) for a linear space. It is defined as the weighted sample Fréchet mean

W = arginf_{P ∈ PD(p)} Σ_{i=1}^n λ_i d(S(s_i), P)²,   with Σ_{i=1}^n λ_i = 1,    (5)

where S(s_i) is the observation of the random field S at location s_i ∈ D. The weights λ_i have to be chosen taking into account the spatial dependence among observations. Analogously to Section 2, we add a subscript to indicate the distance that has been used in the estimation procedure: W_R is the weighted sample Fréchet mean using the Riemannian distance and W_S is the weighted sample Fréchet mean using the square root distance. Minimization of a weighted sum of square distances to estimate the Fréchet mean on the manifold was first proposed in Dryden et al. (2009) in the context of smoothing for Diffusion Tensor Imaging fields. Their aim is to estimate the diffusion tensor for each point of the domain, starting from noisy and discrete observations. Thus, weights are chosen as a function of the distance of each observation from the point where the estimate is needed. This approach has been recently developed in Yuan et al. (2012), where a local polynomial regression estimator for the conditional mean E[S(s)|s = s_0] is introduced. Unlike these previous works, we want here to estimate the unconditional mean of the random field (1), starting from a spatially correlated sample of data belonging to the manifold. This leads to a different choice of the weights λ_i, which should now take into account the dependence among the random elements of which the data are realizations. Following the analogy with the Euclidean setting, we ask the weights λ_i to solve the quadratic constrained minimization problem

min_{λ_1,...,λ_n} Σ_{i=1}^n Σ_{j=1}^n λ_i λ_j Cov(S(s_i), S(s_j))   subject to Σ_{i=1}^n λ_i = 1, λ_i ≥ 0.    (6)

In the Euclidean case, (6) is equivalent to the minimization of the mean square error, but this need not be true for a general Riemannian manifold. However, choosing the weights λ_i as the solution of problem (6) meets the qualitative requirement of attributing less influence to subsets of data which are strongly correlated. We also ask the weights to be non negative to avoid instability in the minimization on the manifold since, in any case, the solution of the minimization problem would not result in a linear combination of the data. Many numerical methods exist to solve the quadratic programming problem set in (6); we resort to the one proposed in Goldfarb and Idnani (1983). The covariance structure Cov(S(s_i), S(s_j)) is obtained from the model semivariogram estimated with the procedure illustrated in the previous section.
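Problem (6) is a small quadratic program. The paper uses the Goldfarb and Idnani (1983) dual algorithm; purely as an illustration, the sketch below solves (6) with a generic SLSQP solver and then evaluates the weighted sample Fréchet mean under the square root distance, which has a closed form:

```python
import numpy as np
from scipy.linalg import sqrtm
from scipy.optimize import minimize

def spatial_weights(C):
    """Solve min lambda' C lambda  s.t.  sum(lambda) = 1, lambda >= 0,
    where C[i, j] approximates Cov(S(s_i), S(s_j)) from the fitted
    semivariogram model.  (SLSQP here; Goldfarb-Idnani in the paper.)"""
    n = C.shape[0]
    res = minimize(lambda lam: lam @ C @ lam,
                   x0=np.full(n, 1.0 / n),
                   jac=lambda lam: 2.0 * C @ lam,
                   bounds=[(0.0, None)] * n,
                   constraints=[{'type': 'eq', 'fun': lambda lam: lam.sum() - 1.0}],
                   method='SLSQP', options={'ftol': 1e-12})
    return res.x

def weighted_frechet_mean_sqrt(sample, weights):
    """Weighted sample Frechet mean W_S under the square root distance:
    (sum_i lambda_i S_i^{1/2})^2."""
    R = sum(w * sqrtm(S) for w, S in zip(weights, sample))
    return R @ R
```

As a sanity check, with two perfectly correlated observations and one independent one, the solver splits half of the total weight between the correlated pair and gives the other half to the isolated observation.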

Simulation studies
In this section we present a simulation study to assess the performance of the proposed mean estimator. To this end, we introduce a simple method for simulating a random field of spatially correlated positive definite matrices. Then, we use the simulated field to compare the weighted sample Fréchet mean W_S with the usual sample Fréchet mean Σ̂_S, for different experimental designs. Here, we choose the square root distance to compare two positive definite matrices, for computational savings and to avoid problems with nearly singular matrices.

Simulation of a random field in P D(2)
We want to simulate a positive definite symmetric matrix field S(s) ∈ PD(2) with mean Σ and a spatial correlation structure. This is obtained through the sample covariance matrices of the realizations of a Gaussian random vector field v.
Let s ∈ D ⊂ R² indicate spatial coordinates, and let x(s), y(s) be two independent Gaussian random fields with zero mean, unit variance and a common isotropic spatial covariance function depending on a parameter q. Given a 2 × 2 matrix A, say A = (1, 1; 0, 1), the random vector field v(s) = A (x(s), y(s))^T has covariance matrix Σ = A A^T = (2, 1; 1, 1). We generate N independent realizations v_1(s), . . . , v_N(s) of the random vector field v(s) and compute, at each location, the sample covariance matrix

S(s) = (1/N) Σ_{k=1}^N v_k(s) v_k(s)^T.

The positive definite symmetric matrix field S(s) thus has mean Σ and a spatial dependence structure inherited from the spatial correlation of the underlying vector field v(s). The law of the random field S(s) depends on the parameters q and N, which determine respectively the spatial dependence and the variability. In Fig. 1 some realizations of the matrix random field are reported for different values of q and N, using ellipses to represent 2 × 2 positive definite symmetric matrices. It can be seen that larger values of q correspond to lower spatial dependence and larger values of N to lower variability. By inspecting the realizations of the field, we can guess that when q and N are both small, taking into account spatial dependence improves the estimate of the unconditional mean Σ. Indeed, for large values of N the variability of the field is so small that every single observation is a good representative of the mean, so that any estimation technique is adequate. Of course, when q is large, no spatial dependence is present, observations are independent and thus the sample Fréchet mean is the proper estimator. Hereafter we focus on the case when q and N are both small.

Fig. 2. First row: three datasets obtained in the first simulation for the three experimental designs: regular grid (left), irregular (middle) and clustered (right). Second row: weights λ_i assigned to each location, rounded down to the third decimal digit, for the first simulated field and the three experimental designs. Third row: empirical semivariograms obtained from the three experimental designs in the first simulation. A fitted Gaussian model is superimposed on the empirical semivariogram (solid line).
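The simulation scheme can be sketched in a few lines. The exact covariance function of the underlying Gaussian fields x(s) and y(s) is not specified here, so the exponential model exp(−q||h||) used below is an assumption, chosen so that larger q gives weaker spatial dependence, in agreement with the behavior described above:

```python
import numpy as np

def simulate_pd2_field(locations, q, N, A=np.array([[1.0, 1.0], [0.0, 1.0]]), seed=None):
    """Simulate S(s_i) at the given locations as the sample covariance of
    N realizations of v(s) = A (x(s), y(s))', with x, y independent
    zero-mean, unit-variance Gaussian fields.  The exponential spatial
    covariance exp(-q * ||h||) is an assumption."""
    rng = np.random.default_rng(seed)
    locations = np.asarray(locations, dtype=float)
    n = len(locations)
    sep = np.linalg.norm(locations[:, None, :] - locations[None, :, :], axis=-1)
    C = np.exp(-q * sep)                     # covariance of x (and of y) across sites
    L = np.linalg.cholesky(C + 1e-10 * np.eye(n))
    S = np.zeros((n, 2, 2))
    for _ in range(N):
        x = L @ rng.standard_normal(n)       # one realization of the field x(s)
        y = L @ rng.standard_normal(n)
        v = (A @ np.vstack([x, y])).T        # v(s_i) for i = 1, ..., n
        S += v[:, :, None] * v[:, None, :]   # accumulate outer products
    return S / N                             # E[S(s)] = A A'
```

For large N each S(s_i) concentrates around Σ = A A^T = (2, 1; 1, 1), matching the observation that large N means low variability.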

Estimation of the mean Σ of the simulated field
We now compare the proposed estimator with the sample Fréchet mean for three different experimental designs. We simulate 20 realizations of the random field S(s) on a rectangular grid, setting q = 0.01 and N = 4, which is a case of high variability and high spatial dependence. We then subsample each realization at different points s_i, obtaining a different set of observations for each experimental design. In Fig. 2, each observation S(s_i) (a 2 × 2 positive definite symmetric matrix) is represented as an ellipse centered in s_i with axes √σ_j e_j, where S(s_i) e_j = σ_j e_j for j = 1, 2.
The first experimental design corresponds to a regular grid, the second to an irregular grid, while the third presents a cluster of spatial locations. The same figure shows the empirical semivariogram obtained for each dataset, with a superimposed Gaussian semivariogram fitted via least squares. For each realization of the random field S, we estimated the mean for the three experimental designs both with the sample Fréchet mean Σ̂_S and with the weighted sample Fréchet mean W_S. Fig. 3 shows the boxplots of the distances d_S(Σ̂_S, Σ) and d_S(W_S, Σ) for the three experimental designs. The weighted estimator behaves better, especially in the case of clustered data, where it is able to disregard some of the redundant information coming from points in the cluster. Fig. 2 also shows the weights λ_i obtained in the first simulation for the three experimental designs.

Applications to the estimation of mean covariance structure for meteorological variables
The simulation studies of the previous section support the tenet that the estimate of the mean covariance can be improved by taking into account the spatial dependence of the data. As an illustrative application, we consider the problem of estimating the mean covariance between different meteorological variables, namely temperature and precipitation. Temperature and precipitation are two very important climatic variables. Their co-variability is also of interest: a better understanding of their relationship can provide insights on the precipitation-forming process or improve weather forecasting methods. Moreover, the relative behavior of temperature and precipitation affects agricultural production (see Lobell and Burke, 2008). For a broader introduction to the importance of the temperature-precipitation relationship and its estimation see, e.g., Trenberth and Shea (2005) and references therein.

Fig. 4. For each meteorological station an ellipse is plotted, representing the 2 × 2 covariance matrix between temperature and precipitation in January.
We focus on the Quebec province, Canada. Data from Canadian meteorological stations are made available by Environment Canada on the website http://climate.weatheroffice.gc.ca. Different measurement stations provide meteorological data along time, and a first idea could be to bundle all data together in order to estimate the covariance between meteorological variables. This procedure is questionable, since it does not take into proper account the spatial distribution of the measurement stations, which can be far from a regular grid on the region of interest. Analyzing similar data, also coming from Canadian meteorological stations, Gromenko and Kokoszka (2011) point out the relevance of spatial dependence between measurement stations when estimating the monthly mean temperature function. Fig. 4 shows the map of Quebec and the meteorological stations for which monthly data for temperature and precipitation are available from 1983 to 1992. We assume that the monthly variation of the mean covariance between temperature and precipitation stays unchanged along the years of this short time period. The goal is to estimate the mean covariance between temperature and precipitation for each month of the year. Thus, for each meteorological station, we use the 10-year measures of temperature and precipitation to estimate a 2 × 2 sample covariance matrix for every month from January to December. Fig. 4 shows the ellipse representation of these covariance matrices for January.
Locations of the meteorological stations form an irregular pattern within Quebec. Thus, we expect that taking into account spatial dependence leads to a more accurate estimate of the mean covariance between temperature and precipitation in Quebec. We also assume that spatial dependence is constant along time. This allows us to have more data for variogram estimation, which is a crucial point in the analysis. Fig. 5 shows the empirical semivariogram estimated with the method proposed in Section 3, with a superimposed fitted Gaussian variogram, and the weights for each station obtained by solving (6). It is interesting to notice that three stations are associated with almost zero weights: this means that they bring redundant information for the estimation of the mean covariance structure. An ellipse representation of the estimates obtained with the sample Fréchet mean Σ̂_S and the weighted sample Fréchet mean W_S for the 12 months of the year appears in Fig. 6. The two estimators provide similar estimates for the winter period, from October to February, where a positive correlation exists between temperature and precipitation in the coldest months of the year. This is in agreement with Isaac and Stuart (1991), where the correlation between daily temperature and precipitation is considered for the whole of Canada by looking at the temperature precipitation index, i.e. the percentage of precipitation occurring at temperatures colder than the median daily temperature. They found that in January more precipitation is observed on relatively warm days.
The weighted sample Fréchet mean W_S provides a quite different estimate for the beginning of spring (March and April), where no correlation seems to be present, while the sample Fréchet mean Σ̂_S would suggest a positive correlation. For April, Isaac and Stuart (1991) found great variability of the temperature precipitation index across the Canadian provinces. For Quebec, however, it is around 50%, thus suggesting no correlation. The two estimates agree again for May and June (negative correlation). For the summer months, the estimates provided by the weighted sample Fréchet mean W_S suggest a different total variation for these covariance matrices (lower in July and September, greater in August), but both estimators indicate that there is no correlation between temperature and precipitation, thus agreeing with Isaac and Stuart (1991), who report a temperature precipitation index around 50% for Quebec and Ontario in July, contrary to the trend of all the other Canadian provinces.
In conclusion, estimates provided by the proposed estimator W_S are in full agreement with previous analyses of the Canadian climate, while ignoring spatial dependence among measurement stations leads to anomalous results for March and April. Moreover, dealing with the covariance matrix, rather than with the temperature precipitation index, also supplies information about temperature and precipitation variability. Differences between the estimates of total variability provided by W_S and Σ̂_S are concentrated in the summer months. In August, our method estimates a greater total variability, while in July and September a lower one.

Choice of a different design for meteorological stations
As shown in Section 6, the spatial correlation among the meteorological stations of Quebec implies that three of them bring no significant information for the estimation of the mean covariance between temperature and precipitation. We now imagine having the possibility to add a new meteorological station. We assume that the spatial dependence between data generated by the meteorological stations is described by the Gaussian semivariogram represented in Fig. 5, estimated via least squares from the empirical semivariogram. We aim at finding a site for the new station that makes the weights λ_i, given to the stations for mean covariance estimation, as close as possible to 1/n, n being the total number of meteorological stations, the new one included. This would provide an estimator (5) with a smaller variance.
Let us superimpose a fine grid of points on the region of interest and indicate with x and y the latitude and the longitude of a point on the grid. For each grid point (x, y), we solve problem (6) pretending that the new station is located in (x, y). We thus obtain a new weight λ_i(x, y) for each of the n meteorological stations. The utility of positioning the new station in (x, y) is then defined as

U(x, y) = − Σ_{i=1}^n ( λ_i(x, y) − 1/n )²,

and we look for the site (x, y) where the utility U(x, y) is maximized. Fig. 7 shows the utility function on the Quebec province and the site for the new station maximizing it. Of course, the exercise considers only the problem of estimation of the mean covariance between temperature and precipitation, disregarding other quantities of interest for meteorological analyses.
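A possible implementation of this design exercise is sketched below. The quadratic form of the utility (negative squared deviation of the weights from 1/n) is a reconstruction consistent with the description above, not necessarily the exact definition used for Fig. 7, and the SLSQP solver again stands in for Goldfarb-Idnani:

```python
import numpy as np
from scipy.optimize import minimize

def spatial_weights(C):
    """Weights solving problem (6): min lambda' C lambda, sum = 1, lambda >= 0."""
    n = C.shape[0]
    res = minimize(lambda lam: lam @ C @ lam, np.full(n, 1.0 / n),
                   jac=lambda lam: 2.0 * C @ lam,
                   bounds=[(0.0, None)] * n,
                   constraints=[{'type': 'eq', 'fun': lambda lam: lam.sum() - 1.0}],
                   method='SLSQP', options={'ftol': 1e-12})
    return res.x

def best_new_station(grid_points, station_locs, cov_fn):
    """Score each candidate site by U = -sum_i (lambda_i - 1/n)^2 after
    adding it to the design, and return the maximizer.  cov_fn maps a
    matrix of spatial distances to covariances (from the fitted model)."""
    best_site, best_u = None, -np.inf
    for p in grid_points:
        locs = np.vstack([station_locs, p])
        n = len(locs)
        sep = np.linalg.norm(locs[:, None, :] - locs[None, :, :], axis=-1)
        lam = spatial_weights(cov_fn(sep))
        u = -np.sum((lam - 1.0 / n) ** 2)
        if u > best_u:
            best_site, best_u = np.asarray(p, dtype=float), u
    return best_site, best_u
```

As expected, a candidate site far from an existing cluster of stations yields more balanced weights, hence a higher utility, than a site inside the cluster.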

Conclusions and further development
This work is in the framework of statistical analysis for non Euclidean data. In particular, we introduced spatial statistics methods which take into account the specific nature of the non Euclidean data at hand. We introduced a semivariogram whose definition relies only on the notion of distance between two elements of the space the data belong to. This allows us to tackle the problem of estimating the mean from a spatially correlated sample of non Euclidean data. Possible developments include the generalization to the case where a drift is also present, and therefore the solution of more advanced spatial problems - e.g. ordinary or universal kriging - as well as allowing for spatial dependence in smoothing procedures, such as those proposed in Dryden et al. (2009) or Yuan et al. (2012). The proposed methods rely only on the notion of distance between non Euclidean data and therefore they can be applied to any Riemannian manifold. Here we have focused on the notable case of positive definite matrices to show the effectiveness of our approach, both with simulations and with a significant real data application. However, our work can be easily adapted to other kinds of non Euclidean data. For example, Aston et al. (2010) focus on covariance functions as objects of interest for linguistic and phonetic analysis; taking into account spatial dependence could generate interesting analyses for these kinds of problems. The proposed approach can be easily generalized to the infinite dimensional case once a proper definition of distance between covariance functions has been chosen. Some proposals in this direction can be found in Pigoli et al. (2012).
In Section 6 we apply our method to a meteorological problem, the estimation of the covariance matrix between temperature and precipitation in the province of Quebec (Canada). We show that taking into account spatial dependence provides estimates that are in a better agreement with known meteorological information.