Distribution-on-Distribution Regression with Wasserstein Metric: Multivariate Gaussian Case

Distribution data refers to a data set where each sample is represented as a probability distribution, a subject area receiving burgeoning interest in the field of statistics. Although several studies have developed distribution-to-distribution regression models for univariate variables, the multivariate scenario remains under-explored due to technical complexities. In this study, we introduce models for regression from one Gaussian distribution to another, utilizing the Wasserstein metric. These models are constructed using the geometry of the Wasserstein space, which enables the transformation of Gaussian distributions into components of a linear matrix space. Owing to their linear regression frameworks, our models are intuitively understandable, and their implementation is simplified because of the optimal transport problem's analytical solution between Gaussian distributions. We also explore a generalization of our models to encompass non-Gaussian scenarios. We establish the convergence rates of in-sample prediction errors for the empirical risk minimizations in our models. In comparative simulation experiments, our models demonstrate superior performance over a simpler alternative method that transforms Gaussian distributions into matrices. We present an application of our methodology using weather data for illustration purposes.


Introduction
The analysis of distribution data has gained significant attention in the field of statistics. Distribution data refers to data in which each sample is given in the form of a probability distribution or an empirical distribution generated from it. Examples include age-at-death distributions across different countries, house price distributions of different years, and distributions of voxel-voxel correlations of functional magnetic resonance imaging signals. A distinctive feature of distribution data is that they take values in general metric spaces that lack a vector space structure. Existing methods for complex data, such as functional or manifold data analysis methods, are inadequate for effectively handling distribution data due to its infinite dimensionality and non-linearity, which pose significant challenges in processing. Developing methods and theories for analyzing distribution data is an important and challenging problem for contemporary statistical practice. Refer to [19] for a review of this topic.
A common approach to handling distribution data involves the application of the Wasserstein metric to a set of distributions. The resulting metric space is known as the Wasserstein space ([14]), and distribution data are considered as its elements. There are several advantages to using the Wasserstein metric: it gives more intuitive interpretations of means and geodesics compared to other metrics, and it reduces errors by rigorously treating the constraints that distribution functions must satisfy. Based on this approach, numerous methods have been proposed for the analysis of distribution data ([2,18,17,7,4,9,26]).
This paper focuses on the problem of distribution-on-distribution regression, that is, the regression of one probability distribution onto another. This task involves defining a regression map between non-linear spaces, which makes the problem technically challenging. The problem arises, for example, in comparing the temporal evolution of age-at-death distributions among different countries ([4], [9]) and in predicting house price distributions in the United States ([4]). For univariate distributions, several studies have investigated distribution-on-distribution regression models using the Wasserstein metric: [4] proposed a model utilizing geometric properties of the Wasserstein space, [26] presented an autoregressive model for distributional time series data, and [9] introduced a model incorporating the optimal transport map associated with the Wasserstein space. However, few studies have proposed distribution-on-distribution regression models for the multivariate case with the Wasserstein metric. Please refer to Section 3.3 for a comprehensive overview.
In this paper, we propose models for regressing one Gaussian distribution onto another. To define our models, we consider the space of Gaussian distributions equipped with the Wasserstein metric and use its tangent bundle structure to transform Gaussian distributions into matrices. Using this transformation to the tangent bundle, we boil down Gaussian distribution-on-distribution regression to matrix-on-matrix linear regression. Based on the transformation, we propose two models: a basic model for the case where the predictor and response Gaussian distributions are low-dimensional, and a low-rank model incorporating a low-rank structure in the parameter tensor to address high-dimensional Gaussian distributions. Additionally, we explore the extension of our proposed models to encompass non-Gaussian scenarios.
Our strategy and the model give several advantages: (i) the strategy enables the explicit construction of regression maps using the closed-form expression for the optimal transport problem between Gaussian distributions, (ii) it boils down the distribution-on-distribution regression problem to an easy-to-handle linear model while maintaining the constraint of distributions, and (iii) we can solve the linear model without computational difficulties.
We compare our method to another natural approach, which regresses the mean vector and covariance matrix of the covariate Gaussian on those of the response Gaussian. However, this approach degrades accuracy in predicting distributions, since it does not exploit the structure of distributions captured by the Wasserstein metric. In the simulation studies in Section 6, we compare our proposed models with this alternative approach and find that our models perform better.
The remaining sections of the paper are organized as follows. In Section 2, we provide some background on optimal transport and the Wasserstein space. In Section 3, we introduce Gaussian distribution-on-distribution regression models and discuss their potential generalizations to accommodate non-Gaussian cases. We show empirical risk minimization algorithms in our models in Section 4, and analyze their in-sample prediction errors in Section 5. We investigate the finite-sample performance of the proposed methods through simulation studies in Section 6, and illustrate the application of the proposed method using weather data in Section 7. Section 8 concludes. Proofs of theorems and additional theoretical results are provided in the Appendix.
1.1. Related Studies. There are several approaches to deal with distribution data apart from the Wasserstein metric approach. [16] introduced the log quantile density transformation, enabling the utilization of functional data methods for distribution data. The Bayes space approach has also been proposed as a viable solution for handling distribution data ( [6,21,20]).
Within the framework of the Wasserstein metric approach, significant developments have been made in methods and theories for analyzing distribution data. [25] considered the estimation for the Fréchet mean, a notion of mean in the Wasserstein space, from distribution samples. [3] established the minimax rates of convergence for these estimators. [18] proposed the Wasserstein covariance measure for dependent density data. [2] developed the method of geodesic principal component analysis on the Wasserstein space.
Various regression models utilizing the Wasserstein metric have been proposed for distribution data. [17] developed regression models for coupled vector predictors and univariate random distributions as responses. [7] developed regression models for multivariate response distributions. [4] and [9] proposed regression models for scenarios where both regressors and responses are random distributions, and [10] studies its extension to the multivariate case. [26] developed autoregressive models for density time series data.
1.2. Notation. For d ≥ 1, we denote the identity matrix of size d × d as I_d. Sym(d) is the set of all symmetric matrices of size d × d. For a positive semidefinite matrix A, we denote its positive square root as A^{1/2}. id(⋅) is the identity map. For a Borel measurable function f ∶ R^d → R^d and a Borel probability measure µ on R^d, f#µ is the push-forward measure defined by f#µ(Ω) = µ(f^{−1}(Ω)) for any Borel set Ω in R^d. ‖⋅‖ denotes the Euclidean norm.
For a Borel probability measure µ on R^d, L²_µ(R^d) denotes the space of Borel measurable functions f ∶ R^d → R^d that are square-integrable with respect to µ, and is a Hilbert space with an inner product ⟨⋅, ⋅⟩_µ defined as ⟨f, g⟩_µ = ∫ f(x)^⊤ g(x) dµ(x). We denote the norm induced by this inner product as ‖⋅‖_µ.
For a matrix A ∈ R^{d_1×d_2}, we denote its elements as A[p, q] for 1 ≤ p ≤ d_1 and 1 ≤ q ≤ d_2. For a tensor A ∈ R^{d_1×d_2×d_3×d_4}, we denote its elements as A[p, q, r, s] for 1 ≤ p ≤ d_1, 1 ≤ q ≤ d_2, 1 ≤ r ≤ d_3, and 1 ≤ s ≤ d_4. For a tensor B ∈ R^{d_1×d_2×d_3×d_4} and a matrix X ∈ R^{d_1×d_2}, the contraction ⟨X, B⟩_2 ∈ R^{d_3×d_4} is defined element-wise by ⟨X, B⟩_2[r, s] = Σ_{p=1}^{d_1} Σ_{q=1}^{d_2} B[p, q, r, s] X[p, q].

Background
In this section, we provide some background on optimal transport, the Wasserstein space, and its tangent space. For more background, see e.g., [23], [1] and [14].
2.1. Optimal Transport. Let W(R^d) be the set of Borel probability distributions on R^d with finite second moments. The 2-Wasserstein distance between µ_1, µ_2 ∈ W(R^d) is defined by

d_W(µ_1, µ_2) = ( inf_{π ∈ Π(µ_1,µ_2)} ∫_{R^d×R^d} ‖x − y‖² dπ(x, y) )^{1/2}.    (1)

Here, Π(µ_1, µ_2) is the set of couplings of µ_1 and µ_2, that is, the set of joint distributions on R^d × R^d with marginal distributions µ_1 and µ_2. In our setting, the minimizer π in (1) always exists (Theorem 4.1 in [23]) and is called an optimal coupling. When µ_1 is absolutely continuous with respect to the Lebesgue measure, there exists a map T ∶ R^d → R^d such that the joint distribution of (W, T(W)), where W ∼ µ_1, is an optimal coupling in (1), and such a map T is uniquely determined µ_1-almost everywhere (Theorem 1.6.2 in [14]). The map T is called the optimal transport map between µ_1 and µ_2, and we denote it as T^{µ_2}_{µ_1}. When d = 1, the optimal transport map has the following closed-form expression (Section 1.5 in [14]):

T^{µ_2}_{µ_1} = F^{−1}_{µ_2} ○ F_{µ_1},    (2)

where F_{µ_1} is the cumulative distribution function of µ_1, and F^{−1}_{µ_2} is the quantile function of µ_2.
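The univariate closed form (2) is easy to check numerically. The following sketch (using Python's standard-library NormalDist; the helper name ot_map_1d is ours, not from the paper) composes a quantile function with a cumulative distribution function and verifies that, for Gaussians, the resulting map is affine:

```python
from statistics import NormalDist

# Closed-form optimal transport map for d = 1: T = F2^{-1} o F1, cf. (2).
def ot_map_1d(cdf1, quantile2):
    return lambda x: quantile2(cdf1(x))

mu1 = NormalDist(mu=0.0, sigma=1.0)   # source N(0, 1)
mu2 = NormalDist(mu=2.0, sigma=2.0)   # target N(2, 4)
T = ot_map_1d(mu1.cdf, mu2.inv_cdf)

# For Gaussians the transport map is affine: T(x) = m2 + (s2/s1)(x - m1) = 2 + 2x.
for x in [-1.5, 0.0, 0.7]:
    assert abs(T(x) - (2 + 2 * x)) < 1e-9
print("ok")
```

The same composition works for any pair of continuous univariate distributions with invertible CDFs, which is exactly why the univariate case is computationally benign.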
2.2. The Wasserstein Space and its Tangent Space. The Wasserstein distance d_W is a metric on W(R^d) (Chapter 6 in [23]), and the metric space (W(R^d), d_W) is called the Wasserstein space. We give a notion of a linear space induced from the Wasserstein space by applying the basic concepts of Riemannian manifolds, as shown in [1], [2] and [14].
Let us arbitrarily fix a reference measure µ* ∈ W(R^d) that is absolutely continuous with respect to the Lebesgue measure. For any µ ∈ W(R^d), the geodesic from µ* to µ is the curve γ_{µ*,µ}(t) = (id + t(T^µ_{µ*} − id))#µ*, t ∈ [0, 1]. The tangent space of the Wasserstein space at µ* is defined by

T_{µ*} = cl{ λ(T^µ_{µ*} − id) ∶ µ ∈ W(R^d), λ > 0 },    (3)

where cl denotes the closure in terms of the norm ‖⋅‖_{µ*} in the space L²_{µ*}(R^d). The space T_{µ*} is a subspace of L²_{µ*}(R^d) (Theorem 8.5.1 in [1]). The exponential map Exp_{µ*} ∶ T_{µ*} → W(R^d) is then defined by Exp_{µ*} g = (g + id)#µ*, g ∈ T_{µ*}, and, as its right inverse, the logarithmic map Log_{µ*} ∶ W(R^d) → T_{µ*} is given by

Log_{µ*} µ = T^µ_{µ*} − id.    (4)

When d = 1, the logarithmic map is isometric in the sense that

‖Log_{µ*} µ_1 − Log_{µ*} µ_2‖_{µ*} = d_W(µ_1, µ_2)    (5)

for all µ_1, µ_2 ∈ W(R) (Section 2.3.2 in [14]). Recall that ‖⋅‖_{µ*} is the norm of L²_{µ*}(R^d) with the reference measure µ*, as defined in Section 1.2.
2.3. Specification to the Gaussian Case. We restrict our attention to Gaussian measures. Let G(R^d) be the set of Gaussian distributions on R^d, and we call the metric space (G(R^d), d_W) the Gaussian space.
For two Gaussian measures µ_1 = N(m_1, Σ_1), µ_2 = N(m_2, Σ_2) ∈ G(R^d) with mean vectors m_1, m_2 ∈ R^d and covariance matrices Σ_1, Σ_2 ∈ R^{d×d}, the 2-Wasserstein distance between them has the following closed-form expression (Section 1.6.3 in [14]):

d_W²(µ_1, µ_2) = ‖m_1 − m_2‖² + tr(Σ_1 + Σ_2 − 2(Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2}).    (6)

When Σ_1 is non-singular, the optimal transport map between µ_1 and µ_2 also has the following closed-form expression (Section 1.6.3 in [14]):

T^{µ_2}_{µ_1}(x) = m_2 + S(Σ_1, Σ_2)(x − m_1),    (7)

where we define S(Σ_1, Σ_2) = Σ_1^{−1/2}(Σ_1^{1/2} Σ_2 Σ_1^{1/2})^{1/2} Σ_1^{−1/2} for two covariance matrices Σ_1, Σ_2. We introduce a tangent space of the Gaussian space. Fix a Gaussian measure µ* = N(m*, Σ*) ∈ G(R^d) as a reference measure with a non-singular covariance matrix Σ*. Replacing W(R^d) with G(R^d) in the definition of the tangent space (3), we obtain the tangent space T^G_{µ*} as a function space. Using the form of the optimal transport map (7), a function g = Log_{µ*} µ in the tangent space T^G_{µ*} with µ = N(m, Σ) has the following form:

g(x) = (m − S(Σ*, Σ)m*) + (S(Σ*, Σ) − I_d)x.    (8)

This form implies that the function space T^G_{µ*} is a set of affine functions of x ∈ R^d. Note that Exp_{µ*} g ∈ G(R^d) holds for any g ∈ T^G_{µ*}, and also Log_{µ*} µ ∈ T^G_{µ*} holds for any µ ∈ G(R^d).

Model
In this section, we define regression models between Gaussian spaces using the above notion of tangent spaces. We first present our key idea of modeling and then develop two models.
3.1. Idea: Near Isometry between the Gaussian Space and a Linear Matrix Space. As our key idea, we give a nearly isometric map from the Gaussian space G(R^d) to a linear matrix space. For d ≥ 1, we define a set of matrices

Ξ_d = {(a, V) ∈ R^{d×(d+1)} ∶ a ∈ R^d, V ∈ Sym(d)},

where (a, V) denotes the d × (d+1) matrix whose first column is a and whose remaining d columns form the symmetric matrix V; this is obviously a linear space. We will give a map from G(R^d) to Ξ_d and show that this map has certain isometric properties. This map plays a critical role in our regression model, given in the next subsection. We fix a non-singular Gaussian measure µ* = N(m*, Σ*) ∈ G(R^d) as a reference measure.
Preliminarily, we introduce an inner product on the space Ξ_d. For (a_1, V_1), (a_2, V_2) ∈ Ξ_d, we define

⟨(a_1, V_1), (a_2, V_2)⟩_{(m*,Σ*)} = (a_1 + V_1 m*)^⊤(a_2 + V_2 m*) + tr(V_1 Σ* V_2).

Then we can easily check that ⟨⋅, ⋅⟩_{(m*,Σ*)} satisfies the conditions of an inner product. This design follows an inner product for a space of affine functions. Rigorously, for a ∈ R^d and V ∈ Sym(d), we define an affine function f_{a,V}(x) = a + V x and its space F_aff = {f_{a,V} ∶ a ∈ R^d, V ∈ Sym(d)}. Note that T^G_{µ*} ⊂ F_aff holds from (8). Then we consider the inner product between f_{a_1,V_1} and f_{a_2,V_2} in L²_{µ*}(R^d):

⟨f_{a_1,V_1}, f_{a_2,V_2}⟩_{µ*} = ∫ (a_1 + V_1 x)^⊤(a_2 + V_2 x) dµ*(x) = (a_1 + V_1 m*)^⊤(a_2 + V_2 m*) + tr(V_1 Σ* V_2).

Inspired by this design, we obtain an inner product space (Ξ_d, ⟨⋅, ⋅⟩_{(m*,Σ*)}). The norm ‖⋅‖_{(m*,Σ*)} induced by this inner product is specified as

‖(a, V)‖²_{(m*,Σ*)} = ‖a + V m*‖² + tr(V Σ* V).    (9)

We construct a nearly isometric map ϕ_{µ*} from (G(R^d), d_W) to (Ξ_d, ‖⋅‖_{(m*,Σ*)}) as

ϕ_{µ*} = π ○ ψ_{µ*}.    (10)

We specify the maps ψ_{µ*} ∶ G(R^d) → T^G_{µ*} and π ∶ F_aff → Ξ_d as follows. First, ψ_{µ*} is the logarithmic map Log_{µ*}(⋅) as in (4) with restriction to G(R^d). That is, for µ = N(m, Σ) ∈ G(R^d), ψ_{µ*}µ is the affine function of the form (8). Second, for an affine function f_{a,V} ∈ F_aff, we define πf_{a,V} = (a, V). In summary, the map ϕ_{µ*} ∶ G(R^d) → Ξ_d in (10) is specified as

ϕ_{µ*} N(m, Σ) = (m − S(Σ*, Σ)m*, S(Σ*, Σ) − I_d).    (11)

We also define a map ξ_{µ*} ∶ ϕ_{µ*}G(R^d) → G(R^d) as the left inverse of the map ϕ_{µ*} by

ξ_{µ*}(a, V) = Exp_{µ*} f_{a,V} = N(a + (V + I_d)m*, (V + I_d)Σ*(V + I_d)).

Here, the range of the map (11) with the domain G(R^d) is written as

ϕ_{µ*}G(R^d) = {(a, V) ∈ Ξ_d ∶ V + I_d is positive semidefinite},

which is obviously a subset of Ξ_d. We obtain results on the distance-preserving property of the map ϕ_{µ*}. As a preparation, for a d × d orthogonal matrix U, we define a class of Gaussian measures C_U ⊂ G(R^d) as

C_U = {N(m, UΛU^⊤) ∶ m ∈ R^d, Λ diagonal and positive semidefinite}.

Here, we give a formal statement.
Proposition 1. For any µ ∈ G(R^d), we have ‖ϕ_{µ*}µ − ϕ_{µ*}µ*‖_{(m*,Σ*)} = d_W(µ, µ*). Moreover, if µ* ∈ C_U holds, we have the following for any µ_1, µ_2 ∈ C_U:

‖ϕ_{µ*}µ_1 − ϕ_{µ*}µ_2‖_{(m*,Σ*)} = d_W(µ_1, µ_2).

Note that, since ϕ_{µ*}µ* = 0 holds, the first claim shows that the Wasserstein distance between any Gaussian measure µ and the reference Gaussian measure µ* is equal to the distance between the corresponding elements in the space (Ξ_d, ‖⋅‖_{(m*,Σ*)}). The second claim shows that if we choose a class of Gaussian measures appropriately, the map ϕ_{µ*} is isometric on that class. This isometric property is essentially illustrated in Section 2.3.2 in [14] for the case of centered Gaussian distributions. Our claim can be understood as its generalization to the non-centered case.
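The first claim of the proposition can be checked numerically. The following sketch (helper names are ours, assuming the forms (9) and (11) above) verifies that the norm of ϕ_{µ*}µ in Ξ_d matches the closed-form Wasserstein distance between µ and the reference µ*:

```python
import numpy as np

def sqrtm_psd(A):
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0, None))) @ V.T

def transport_matrix(S1, S2):
    R = sqrtm_psd(S1); Rinv = np.linalg.inv(R)
    return Rinv @ sqrtm_psd(R @ S2 @ R) @ Rinv

def phi(m, S, m_star, S_star):
    # phi_{mu*} N(m, S) = (m - S(S*,S) m*, S(S*,S) - I), cf. (11).
    Smap = transport_matrix(S_star, S)
    return m - Smap @ m_star, Smap - np.eye(len(m))

def norm_sq(a, V, m_star, S_star):
    # ||(a, V)||^2 = ||a + V m*||^2 + tr(V S* V), cf. (9).
    return np.sum((a + V @ m_star) ** 2) + np.trace(V @ S_star @ V)

def w2_sq(m1, S1, m2, S2):
    # Squared closed-form Wasserstein distance (6).
    R = sqrtm_psd(S1)
    return np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2 * sqrtm_psd(R @ S2 @ R))

m_star, S_star = np.array([0.5, -0.2]), np.array([[1.5, 0.2], [0.2, 0.8]])
m, S = np.array([1.0, 2.0]), np.array([[2.0, -0.4], [-0.4, 1.0]])
a, V = phi(m, S, m_star, S_star)
# First claim: ||phi(mu) - phi(mu*)|| = d_W(mu, mu*), with phi(mu*) = 0.
print(np.isclose(norm_sq(a, V, m_star, S_star), w2_sq(m, S, m_star, S_star)))
```

Running this with other positive definite choices of Σ and Σ* gives the same agreement, as the first claim holds for every Gaussian µ, not only for the commuting class C_U.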

3.2. Regression Model. In this section, we develop our regression models for Gaussian-to-Gaussian distribution regression. Our strategy is to map Gaussian distributions to the linear matrix spaces using the nearly isometric maps and then conduct linear regression between the matrix spaces. Figure 1 illustrates the strategy.

Figure 1. Illustration of the structure of the proposed regression model between the Gaussian spaces G(R^{d_1}) and G(R^{d_2}). The Gaussian distributions ν_1 and ν_2 are transformed to the random elements X and Y in the linear matrix spaces Ξ_{d_1} and Ξ_{d_2} by the nearly isometric maps ϕ_{ν_{1⊕}} and ϕ_{ν_{2⊕}}, respectively. Then, a linear regression model with regression map Γ_{B_0} is assumed between X and Y.

Specifically, we develop the following two models: (i) a basic model, and (ii) a low-rank model. See Section 1.2 for the notation regarding matrices and tensors.
We review the setup of the regression problem. Let d_1 and d_2 be positive integers and F be a joint distribution on G(R^{d_1}) × G(R^{d_2}). Let (ν_1, ν_2) be a pair of random elements generated by F, where we write ν_1 = N(m_1, Σ_1) and ν_2 = N(m_2, Σ_2). We assume ν_1 and ν_2 are square integrable in the sense that max{E[d_W²(ν_1, µ_1)], E[d_W²(ν_2, µ_2)]} < ∞ for some (and thus for all) µ_1 ∈ G(R^{d_1}) and µ_2 ∈ G(R^{d_2}). In the following, we give models for dealing with this joint distribution F.
3.2.1. Basic model. The first step is to define reference measures to introduce the nearly isometric maps. For j ∈ {1, 2}, we define the Fréchet mean of the random Gaussian distribution ν_j as

ν_{j⊕} = argmin_{µ ∈ G(R^{d_j})} E[d_W²(ν_j, µ)] = N(m_{j⊕}, Σ_{j⊕}),

with the mean vector m_{j⊕} ∈ R^{d_j} and the covariance matrix Σ_{j⊕} ∈ R^{d_j×d_j}. Note that the Fréchet means ν_{1⊕} and ν_{2⊕} are also Gaussian, and we assume they uniquely exist and are non-singular. Using the Fréchet means ν_{1⊕} and ν_{2⊕} as reference measures, we transform the random Gaussian distributions ν_1 and ν_2 to random elements X ∈ Ξ_{d_1} and Y ∈ Ξ_{d_2} by

X = ϕ_{ν_{1⊕}} ν_1,  Y = ϕ_{ν_{2⊕}} ν_2,

where ϕ_{ν_{1⊕}} and ϕ_{ν_{2⊕}} are the nearly isometric maps in (11).
For the random matrices X and Y transformed from the random distributions ν_1 and ν_2 as above, we perform matrix-to-matrix linear regression. To this aim, we consider a coefficient tensor B ∈ R^{d_1×(d_1+1)×d_2×(d_2+1)} and define its associated linear map Γ_B by Γ_B(A) = ⟨A, B⟩_2. Recall that ⟨⋅, ⋅⟩_2 is the contraction for tensors defined in Section 1.2. To deal with the symmetry of matrices in Ξ_{d_1} and Ξ_{d_2}, we define the following class of coefficient tensors:

B = {B ∈ R^{d_1×(d_1+1)×d_2×(d_2+1)} ∶ B[p, q, r, s+1] = B[p, q, s, r+1] for all 1 ≤ p ≤ d_1, 1 ≤ q ≤ d_1+1, 1 ≤ r, s ≤ d_2}.    (12)

This definition guarantees that ⟨A, B⟩_2 ∈ Ξ_{d_2} holds for any B ∈ B and A ∈ Ξ_{d_1}.
We now give the linear regression model. We assume that the (Ξ_{d_1} × Ξ_{d_2})-valued random element (X, Y), which is obtained by the transform of the random pair of distributions (ν_1, ν_2), follows the following linear model with some B_0 ∈ B:

Y = ⟨X, B_0⟩_2 + E,    (13)

where E is a Ξ_{d_2}-valued random element serving as an error term. Note that B_0 is not necessarily unique. We can rewrite this model into an element-wise representation such that

Y[r, s] = Σ_{p,q} B_0[p, q, r, s] X[p, q] + E[r, s],    (14)

for 1 ≤ r ≤ d_2 and 1 ≤ s ≤ d_2 + 1. Furthermore, we impose the following assumption on the data-generating process in this model:

E[E ∣ X] = 0.    (15)

In summary, we consider a regression map Γ_{G,B_0} between the Gaussian spaces G(R^{d_1}) and G(R^{d_2}) as

Γ_{G,B_0} = ξ_{ν_{2⊕}} ○ Γ_{B_0} ○ ϕ_{ν_{1⊕}}.    (16)

Note that our model satisfies Γ_{G,B_0}(ν_{1⊕}) = ν_{2⊕}. In other words, the regression map Γ_{G,B_0} maps the Fréchet mean of ν_1 to that of ν_2.
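In practice, the contraction ⟨X, B⟩_2 underlying the model is a single einsum. The sketch below uses illustrative shapes only and does not impose the symmetry class (12); it shows the map from a predictor X = (a, V) ∈ Ξ_{d_1} to a d_2 × (d_2+1) output:

```python
import numpy as np

d1, d2 = 2, 3
rng = np.random.default_rng(0)

# Coefficient tensor B in R^{d1 x (d1+1) x d2 x (d2+1)} (unconstrained for
# illustration) and a predictor X = (a, V) in Xi_{d1}.
B = rng.standard_normal((d1, d1 + 1, d2, d2 + 1))
a = rng.standard_normal(d1)
V = rng.standard_normal((d1, d1))
V = (V + V.T) / 2                            # symmetric part
X = np.concatenate([a[:, None], V], axis=1)  # d1 x (d1+1) matrix (a, V)

# <X, B>_2[r, s] = sum_{p,q} B[p, q, r, s] X[p, q], cf. (14).
Y = np.einsum('pq,pqrs->rs', X, B)
print(Y.shape)  # (3, 4)
```

With B restricted to the class (12), the last d_2 columns of Y would form a symmetric matrix, so that Y stays in Ξ_{d_2}.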
Remark 1 (Scalar response model). A variant of the proposed basic model is the pairing of Gaussian distributions with scalar responses. In this case, the regression comes down to matrix-to-scalar linear regression. Let (ν_1, Z) be a pair of random elements with a joint distribution on G(R^{d_1}) × R, and let ν_{1⊕} = N(m_{1⊕}, Σ_{1⊕}) be the Fréchet mean of ν_1 in G(R^{d_1}). A Gaussian distribution-to-scalar regression model is

Z = ⟨ϕ_{ν_{1⊕}} ν_1, B_0⟩_{(m_{1⊕},Σ_{1⊕})} + ε.

Here, B_0 ∈ Ξ_{d_1} is the regression parameter and ε is a real-valued error term.

3.2.2. Low-Rank Model. We consider the case where the coefficient tensor B is assumed to have low rank, as an extension of the basic model. The issue with the basic model (13) is that the number of elements in B is d_1(d_1+1)d_2(d_2+1), which is high-dimensional and far exceeds the usual sample size when d_1 and d_2 are not small. A natural way to handle this issue is to approximate B with fewer parameters, and we employ the low-rank CP decomposition of tensors for that purpose. This approach was employed by [27] for a tensor regression model with scalar outcomes, and by [13] for a tensor-on-tensor regression model. We define the low-rank coefficient tensor. Let K be a positive integer such that K ≤ min{d_1, d_2}. Then the tensor B ∈ R^{d_1×(d_1+1)×d_2×(d_2+1)} admits a rank-K decomposition (e.g., [11]) if it holds that

B = Σ_{k=1}^K a^{(k)} ∘ b^{(k)} ∘ c^{(k)} ∘ d^{(k)},    (18)

where a^{(k)} ∈ R^{d_1}, b^{(k)} ∈ R^{d_1+1}, c^{(k)} ∈ R^{d_2}, d^{(k)} ∈ R^{d_2+1} (k = 1, ..., K) are all column vectors and ∘ denotes the outer product. For convenience, the decomposition (18) is often represented by the shorthand B = [[A_1, A_2, A_3, A_4]], where A_1 = [a^{(1)}, ..., a^{(K)}], A_2 = [b^{(1)}, ..., b^{(K)}], A_3 = [c^{(1)}, ..., c^{(K)}], and A_4 = [d^{(1)}, ..., d^{(K)}]. Based on this decomposition, we propose a rank-K model for Gaussian distribution-to-distribution regression, in which the regression parameter B in (13) admits the rank-K decomposition (18). In the rank-K model, we restrict the factors so that the symmetry condition in (12) holds; in other words, when B is represented as [[A_1, A_2, A_3, A_4]], we assume the matrices A_3 and A_4 have the forms in (20), determined by some scalars α_k, β_k, γ_k, 1 ≤ k ≤ K. Under this assumption, the symmetry condition in (12) holds, so that we have ⟨A, B⟩_2 ∈ Ξ_{d_2} for any A ∈ Ξ_{d_1}. We denote the resulting parameter space for the rank-K model as B_low. Finally, we consider the regression model (16) with B_0 ∈ B_low. In practice, the appropriate rank K is unknown, and it can be selected via cross-validation.
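The parameter saving of the CP decomposition is easy to quantify. The sketch below (illustrative shapes; it does not impose the additional structure (20), which would shrink the factor count further) builds a rank-K tensor from factor matrices and compares the full element count with the unconstrained factor count:

```python
import numpy as np

d1, d2, K = 4, 5, 2
rng = np.random.default_rng(1)

# Factor matrices of the rank-K CP decomposition B = [[A1, A2, A3, A4]], cf. (18).
A1 = rng.standard_normal((d1, K))
A2 = rng.standard_normal((d1 + 1, K))
A3 = rng.standard_normal((d2, K))
A4 = rng.standard_normal((d2 + 1, K))

# B[p, q, r, s] = sum_k A1[p, k] A2[q, k] A3[r, k] A4[s, k]
B = np.einsum('pk,qk,rk,sk->pqrs', A1, A2, A3, A4)

full = d1 * (d1 + 1) * d2 * (d2 + 1)      # elements of an unconstrained B
low_rank = K * (2 * d1 + 2 * d2 + 2)      # entries in the four factor matrices
print(B.shape, full, low_rank)  # (4, 5, 5, 6) 600 40
```

Even at these small dimensions the factor parameterization is an order of magnitude smaller, which is the motivation for the rank-K model when d_1 and d_2 grow.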

3.3. Comparison with Existing Models in Terms of Generalization to the Multivariate Case. For the univariate case where d_1 = d_2 = 1, regression models applying the Wasserstein metric to distribution-on-distribution regression were introduced by [4,26,9]. [4] and [26] transformed distributions in the Wasserstein space W(R) to elements in its tangent space (3) by the logarithmic map (4), and boiled down distribution-on-distribution regression to function-on-function linear regression. Because the logarithmic map (4) is isometric in the univariate case, their methods fully utilize the geometric properties of the Wasserstein space. [9] modeled the regression operator from W(R) to W(R) by using the optimal transport map. This approach makes it possible to interpret the regression effect directly at the level of probability distributions through a re-arrangement of probability mass. Despite the effectiveness of these models for univariate distribution-on-distribution regression, their extension to the multivariate scenario remains non-trivial. This challenge primarily arises for two reasons. The first is that the explicit solution of the optimal transport problem for univariate distributions (2) is not available in the multivariate case. This brings numerical difficulties in the computation of optimal transport maps, which is required to transform distributions into unconstrained functions in the model by [4]. The derivation of optimal transport maps also becomes essential when devising estimators for the regression map within [9]'s model. The second is that the flatness of the Wasserstein space, that is, the isometric property of the logarithmic map (5), does not hold in the multivariate case in general. This means the transformation method by [4] lacks theoretical support for preserving the geometric properties of the Wasserstein space in the multivariate case. Moreover, the identifiability result for the regression map in the model by [9], which depends on the flatness of the Wasserstein space, is difficult to generalize to the multivariate case.
Another study [10] analyzes the multivariate case and reveals several theoretical properties such as the sample complexity.
We addressed these challenges by limiting the class of distributions to Gaussian distributions. In our model, we transform Gaussian distributions to unconstrained matrices via the map (11). Consequently, we simplify the regression of Gaussian distribution-on-Gaussian distribution to matrix-on-matrix linear regression. Given the explicit expression of the optimal transport map between Gaussian distributions as (7), our transformation avoids computational difficulties. Although our transformation is not isometric in general, it has certain isometric properties as shown in Proposition 1. This guarantees that our transformation method partially utilizes the geometric properties of the Gaussian space.
3.4. Generalization to Elliptically Symmetric Distributions. Our proposed regression models extend to scenarios where distributions ν 1 and ν 2 belong to the class of elliptically symmetric distributions, a broader category than Gaussian distributions. This is because, as shown in [8], the closed-form expression of the Wasserstein distance (6) holds if two distributions are in the same class of elliptically symmetric distributions.
We give a more rigorous description. Let d ≥ 1 and let f ∶ [0, ∞) → [0, ∞) be a measurable function that is not almost everywhere zero and satisfies

0 < ∫_0^∞ t^{d−1} f(t) dt < ∞,    (22)

so that the densities below are well-defined. Given such a function f, for a positive definite matrix A ∈ R^{d×d} and a vector v ∈ R^d, one can consider a density function of the form

f_{A,v}(x) = c_A^{−1} f(‖A^{−1/2}(x − v)‖),

with c_A = ∫_{R^d} f(‖A^{−1/2}x‖) dx as the normalizing constant. Then, we can consider the class of distributions on R^d whose elements have a density f_{A,v} for some positive definite matrix A ∈ R^{d×d} and vector v ∈ R^d. We denote this class by P_f(R^d) and call it the class of elliptically symmetric distributions with function f. For example, if we set f(t) = e^{−t²}, we obtain the set of Gaussian distributions with positive definite covariance matrices as P_f(R^d). Furthermore, by setting f(t) = I_{[0,1]}(t), we obtain the set of uniform distributions on ellipsoids of the form {x ∈ R^d ∶ (x − v)^⊤ A^{−1}(x − v) ≤ 1}. According to Theorem 2.4 of [8], the closed forms of the Wasserstein distance (6) and optimal transport map (7) are valid for any two measures µ_1, µ_2 in the same class of elliptically symmetric distributions P_f(R^d). Since our models rely only on the forms (6) and (7), our results extend to the case in which (ν_1, ν_2) are P_{f_1}(R^{d_1}) × P_{f_2}(R^{d_2})-valued random elements. Note that f_1, f_2 ∶ [0, ∞) → [0, ∞) should be non-vanishing and satisfy the condition (22) for d = d_1 and d = d_2, respectively.

Empirical Risk Minimization Algorithms
In this section, we propose empirical risk minimization procedures for constructing a prediction model following the regression map Γ_{G,B_0} (16) based on observed data. Specifically, we consider two cases: (i) we directly observe the random distributions (Section 4.1), and (ii) we observe only samples from the random distributions (Section 4.2). We defer the estimation of the coefficient tensor B_0 itself and related topics to the Appendix.
As the block relaxation algorithm monotonically decreases the objective function ([5]), the convergence of the objective values is guaranteed whenever the objective function is bounded from below.
4.2. Observation of Samples from Distributions. Next, we consider the case where the distributions ν_ij themselves are not observed; instead, for each i = 1, ..., n and j ∈ {1, 2}, sample vectors W_ij1, ..., W_ijM are generated from the distributions, and then we observe the sample vectors. For each fixed (i, j), the W_ijm are independent and identically distributed.
At the beginning, we develop a proxy for each Gaussian distribution ν_ij = N(µ_ij, Σ_ij). For i = 1, ..., n and j ∈ {1, 2}, we consider the empirical mean and covariance of W_ijm,

µ̂_ij = (1/M) Σ_{m=1}^M W_ijm,  Σ̂_ij = (1/M) Σ_{m=1}^M (W_ijm − µ̂_ij)(W_ijm − µ̂_ij)^⊤,

as estimators of µ_ij and Σ_ij, respectively. We define ν̂_ij = N(µ̂_ij, Σ̂_ij) and use it as a proxy of ν_ij = N(µ_ij, Σ_ij). Based on these proxies, we compute the empirical Fréchet means for j ∈ {1, 2}:

ν̂_{j⊕} = argmin_{µ ∈ G(R^{d_j})} (1/n) Σ_{i=1}^n d_W²(ν̂_ij, µ),

where we write ν̂_{1⊕} = N(m̂_{1⊕}, Σ̂_{1⊕}) and ν̂_{2⊕} = N(m̂_{2⊕}, Σ̂_{2⊕}). As in the directly observed case, we can use the steepest descent algorithm to solve this optimization. Then, we transform the Gaussian distributions ν̂_ij into matrices by X̂_i = ϕ_{ν̂_{1⊕}} ν̂_{i1} and Ŷ_i = ϕ_{ν̂_{2⊕}} ν̂_{i2}. In the basic model, we solve the following least squares problem:

min_{B ∈ B} (1/n) Σ_{i=1}^n ‖Ŷ_i − ⟨X̂_i, B⟩_2‖²_{(m̂_{2⊕},Σ̂_{2⊕})},    (24)

where ‖⋅‖_{(m̂_{2⊕},Σ̂_{2⊕})} denotes the norm defined by (9) for m* = m̂_{2⊕} and Σ* = Σ̂_{2⊕}. In the rank-K model, we solve the same least squares problem over the low-rank class:

min_{B ∈ B_low} (1/n) Σ_{i=1}^n ‖Ŷ_i − ⟨X̂_i, B⟩_2‖²_{(m̂_{2⊕},Σ̂_{2⊕})}.    (25)

In either case, writing B̂ for the minimizer, we use Γ_{G,B̂} = ξ_{ν̂_{2⊕}} ○ Γ_{B̂} ○ ϕ_{ν̂_{1⊕}} as the prediction map. As in the directly observed case, we can use the block relaxation algorithm to solve the optimization (25) in a manner similar to Algorithm 1.
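The empirical Fréchet mean in the Gaussian case is a Wasserstein barycenter. The paper uses a steepest descent algorithm for this step; as an alternative illustration, the sketch below (our own helper names) combines the empirical moment proxies with the standard fixed-point iteration for the barycenter covariance (Álvarez-Esteban et al.):

```python
import numpy as np

def sqrtm_psd(A):
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0, None))) @ V.T

def gaussian_frechet_mean(means, covs, iters=50):
    """Wasserstein barycenter of Gaussians: the mean vector is the average of the
    means; the covariance solves a fixed-point equation, iterated here via
    S <- S^{-1/2} (mean_i (S^{1/2} C_i S^{1/2})^{1/2})^2 S^{-1/2}."""
    m_bar = np.mean(means, axis=0)
    S = np.mean(covs, axis=0)  # initial guess
    for _ in range(iters):
        R = sqrtm_psd(S)
        Rinv = np.linalg.inv(R)
        M = np.mean([sqrtm_psd(R @ C @ R) for C in covs], axis=0)
        S = Rinv @ M @ M @ Rinv
    return m_bar, S

# Proxies from samples: empirical means/covariances of observed vectors W_ijm.
rng = np.random.default_rng(2)
samples = [rng.multivariate_normal(np.zeros(2), C, size=500)
           for C in (np.diag([1.0, 2.0]), np.diag([2.0, 1.0]))]
means = [s.mean(axis=0) for s in samples]
covs = [np.cov(s, rowvar=False) for s in samples]
m_hat, S_hat = gaussian_frechet_mean(means, covs)
```

At convergence, S_hat satisfies the barycenter fixed-point equation Σ = (1/n) Σ_i (Σ^{1/2} Σ_i Σ^{1/2})^{1/2} up to numerical tolerance.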

Analysis of in-sample prediction error
In this section, we analyze the prediction error of the proposed models and algorithms. We especially focus on the in-sample prediction error measured on the observations, which naturally extends to the out-of-sample prediction error. Here, suppose that we directly observe the pairs of Gaussian distributions (ν_1i, ν_2i), i = 1, ..., n, from the model (13), as in the case of Section 4.1. For simplicity, we assume that the true values of the Fréchet means ν_{1⊕} and ν_{2⊕} are known. In addition, we treat the predictors {ν_1i}^n_{i=1} as fixed in this analysis. Based on the sample (ν_1i, ν_2i), i = 1, ..., n, we solve the following least squares problem for B̃ = B or B̃ = B_low:

B̂ = argmin_{B ∈ B̃} (1/n) Σ_{i=1}^n ‖Y_i − ⟨X_i, B⟩_2‖²_{(m_{2⊕},Σ_{2⊕})},    (26)

where X_i = ϕ_{ν_{1⊕}} ν_1i and Y_i = ϕ_{ν_{2⊕}} ν_2i. Then, we define the prediction map Γ_{G,B̂} = ξ_{ν_{2⊕}} ○ Γ_{B̂} ○ ϕ_{ν_{1⊕}}. Moreover, under the assumption that Γ_{B̂}(X_i) ∈ ϕ_{ν_{2⊕}} G(R^{d_2}) (i = 1, ..., n), we define the in-sample prediction error with the Wasserstein metric in terms of the empirical measure by

(1/n) Σ_{i=1}^n d_W²(Γ_{G,B̂}(ν_1i), Γ_{G,B_0}(ν_1i)),

which is an analogue of the empirical L²-norm. We also assume that the Ξ_{d_2}-valued random variable E in the linear model (17) is Gaussian, that is, for any A ∈ Ξ_{d_2}, ⟨E, A⟩_{(m_{2⊕},Σ_{2⊕})} is a real Gaussian random variable. In the following, we measure the in-sample prediction error of the basic model in terms of the Wasserstein distance. Note that this is unique to our distribution-on-distribution regression problem, and deriving the convergence rate of the in-sample prediction error in this setting is not a trivial problem.
This result shows that our method achieves optimal convergence rates. That is, the convergence rates in Theorem 1 achieve the parametric rate n^{−1/2} with respect to the sample size n. This rate comes from our parametric assumption of Gaussianity on the distributions. In contrast, existing distribution-on-distribution regression models do not impose parametric assumptions, which results in slower convergence rates of estimators for regression parameters. For example, in the regression model proposed by [4], an estimator for the regression operator achieves the same rate as the minimax rate for function-to-function linear regression in a certain case (Theorem 1 in [4]), which is generally slower than the parametric rate. In the regression model proposed by [9], an estimator for the regression map achieves the rate n^{−1/3} (Theorem 3.8 in [9]), which is slower than the parametric rate.
Next, we study the in-sample prediction error of the rank-K model. This analysis provides an effect of the number of ranks K, in addition to the results of the basic model in Theorem 1.
Theorem 2 states an advantage of the low-rank model, in addition to the result that the model achieves the optimal parametric rate. The constant part of the rate is √(K d_1) in the rank-K model, while it is d_1 d_2 in the basic model. This implies that when the dimensions of the distributions ν_1, ν_2 are large, the regression map in the rank-K model is better approximated than that in the basic model. In the rate of the rank-K model, the dimension of the output distribution, d_2, does not appear in the constant part. This is due to the specific forms of the matrices (20) imposed on tensors in B_low.
We add some discussion on the observation of distributions. Recall that we assume the true Fréchet means ν_{1⊕}, ν_{2⊕} are known and the distributions (ν_1i, ν_2i) are directly observed. Relaxing these assumptions presents additional challenges for theoretical analysis. Specifically, if we estimate the Fréchet mean of ν_2i with the empirical Fréchet mean ν̂_{2⊕}, we solve the least squares problem (26) with Y_i = Log_{ν_{2⊕}} ν_2i replaced by Ỹ_i = Log_{ν̂_{2⊕}} ν_2i. Since Ỹ_1, ..., Ỹ_n are not independent, the standard theory for analyzing the error of empirical risk minimization is not directly applicable in this setting. Moreover, if the distributions are not directly observed and only samples from them are available, we need to tackle the discrepancy between the estimated distributions based on the samples and the actual distributions in the analysis. As for the estimation of the Fréchet mean, [12] derives the rates of convergence of the empirical Fréchet mean on the Gaussian space (Corollary 17 in [12]), which may be helpful for further theoretical analysis.
Finally, we prove the consistency and asymptotic normality of an estimator for identified regression parameters in the Appendix.

Simulation Studies
In this section, we investigate the finite-sample performance of the proposed methods together with another method through simulation studies.
6.1. Settings. We generate pairs of Gaussian distributions {(ν_1i, ν_2i)}_{i=1}^n from the basic model as follows. First, for each i = 1, ..., n, we generate independent random variables G^(1) and set a matrix X_i ∈ S_d accordingly. Here, Exp(1) is the exponential distribution with rate parameter 1. Then we obtain a predictor Gaussian distribution ν_1i. Note that under this setting, the random distribution ν_1i has the Fréchet mean ν_1⊕. Next, we set the coefficient tensor B for 1 ≤ r ≤ d, and set the other elements to zero. Additionally, for each i = 1, ..., n, we generate independent random variables U ∼ U(−1/2, 1/2) and set the error matrix E_i ∈ S_d accordingly. Here, U(−1/2, 1/2) is the uniform distribution on the interval (−1/2, 1/2). We set Y_i = ⟨X_i, B⟩_2 + E_i and obtain a response Gaussian distribution ν_2i = ξ_{ν_2⊕}(Y_i) ∈ G(R^d), where ν_2⊕ is the d-dimensional standard Gaussian distribution. Note that under this setting, condition (15) holds and the random distribution ν_2i has the Fréchet mean ν_2⊕. From the above procedure, we obtain pairs of Gaussian distributions {(ν_1i, ν_2i)}_{i=1}^n. Finally, we draw N independent sample vectors from each of the distributions {ν_1i}_{i=1}^n and {ν_2i}_{i=1}^n.
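The generation procedure above can be sketched as follows. Since the displayed formulas for X_i, the coefficient tensor, and E_i are not reproduced here, the concrete constructions below (entry distributions and the shape of the coefficient map) are hypothetical placeholders that only mirror the overall pipeline: symmetric predictor matrices built from Exp(1) variables, a fixed linear map, and symmetric Uniform(−1/2, 1/2) errors:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 2   # sample size and dimension (illustrative values)

# Predictors: random symmetric matrices built from Exp(1) entries
# (a stand-in for the paper's omitted construction of X_i).
G = rng.exponential(1.0, size=(n, d, d))
X = (G + np.transpose(G, (0, 2, 1))) / 2

# Hypothetical coefficient tensor, realized as a linear map on vec(X_i).
B = rng.standard_normal((d * d, d * d)) / d

# Errors: symmetric matrices with Uniform(-1/2, 1/2) entries.
U = rng.uniform(-0.5, 0.5, size=(n, d, d))
E = (U + np.transpose(U, (0, 2, 1))) / 2

# Responses: Y_i = <X_i, B>_2 + E_i, as a matricized linear map plus noise.
Y = (X.reshape(n, -1) @ B.T).reshape(n, d, d) + E
```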
6.2. Methods and Performance Criterion. As an alternative approach, we consider the following model between ν_1i and ν_2i: Here, Z_i = (m_1i, Σ_1i) ∈ S_{d_1} and W_i = (m_2i, Σ_2i) ∈ S_{d_2} are the matrices obtained from the Gaussian distributions ν_1i = N(m_1i, Σ_1i) and ν_2i = N(m_2i, Σ_2i), respectively. A_0 ∈ B is the regression parameter and E_i ∈ S_{d_2} is the error matrix in this model. Note that this alternative model does not consider the Wasserstein metric. For the proposed models, we construct estimators B̂ as described in Section 4.2. For the alternative model (27), we construct an estimator Â by solving the corresponding least squares problem. To investigate the performance of the proposed and alternative methods, following the simulations in [4], we generate 200 new predictors {ν_1i}_{i=n+1}^{n+200} and compute the out-of-sample average Wasserstein discrepancy (AWD). Denoting the true response distributions by ν*_2i and the fitted response distributions by ν̂_2i, the out-of-sample AWD is given by (28). In the proposed model, when the fit of the response in the space Ξ_{d_2} does not fall in the range of the map ϕ_{ν_2⊕}, that is, when event (29) happens, we need to modify the fit to calculate the fitted response distribution. To handle this problem, we use a boundary projection method similar to the one proposed by [4]. Specifically, for d ≥ 1, let g_d : R^{d×(d+1)} → R^{d×d} be the map such that g_d((a, V)) = V for (a, V) ∈ R^{d×(d+1)}. If event (29) happens, we calculate a constant η_i such that the rescaled fit lies in the range, and update the original fit to η_i Γ_B̂(X_i). Conceptually, we update the original fit by a projection onto the boundary of ϕ_{ν_2⊕}(G(R^{d_2})) along the line segment between the origin 0 and the fit Γ_B̂(X_i). In the alternative method, if g_{d_2}(⟨X_i, Â⟩_2) is not positive semidefinite, we update it to arg min_{C ∈ Sym^+(d_2)} ‖C − g_{d_2}(⟨X_i, Â⟩_2)‖_F.
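The last update rule, the Frobenius-norm projection onto Sym^+(d_2), has a well-known closed-form solution: eigendecompose the symmetric matrix and clip its negative eigenvalues at zero. A minimal sketch (function name our own):

```python
import numpy as np

def project_psd(C):
    """Frobenius-nearest positive semidefinite matrix to a symmetric C.

    For symmetric C = Q diag(w) Q^T, the minimizer of ||M - C||_F over
    PSD matrices M is Q diag(max(w, 0)) Q^T.
    """
    w, Q = np.linalg.eigh((C + C.T) / 2)   # symmetrize defensively
    return Q @ np.diag(np.clip(w, 0, None)) @ Q.T
```

For example, `project_psd(np.diag([1.0, -1.0]))` returns `diag(1, 0)`: the negative eigenvalue is clipped and the positive one is kept.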

6.3. Results. First, we set d = 2 and consider four scenarios with n ∈ {20, 200} and N ∈ {50, 500}. We simulate 500 runs for each (n, N) pair, and for each Monte Carlo run, we compute the AWD (28) based on 200 new predictors. The results of the proposed and alternative methods are summarized in the boxplots of Figure 2. In all four scenarios, the proposed method outperforms the alternative method. This is because the proposed method takes into account the geometry of the Wasserstein metric, while the alternative method does not. In this setting, event (29) seldom happened even when the number of distributions n is small. Next, we set d = 6, n = 200, N = 500 and fit the proposed and alternative models whose regression tensors have rank K ∈ {2, 3, 4}. As with the previous experiment, we simulate 500 runs, and for each Monte Carlo run, we compute the AWD (28) based on 200 new predictors.

Figure 2. Boxplots of the out-of-sample AWDs defined in (28) for the four scenarios with n ∈ {20, 200} and N ∈ {50, 500}. "proposed" denotes the proposed method and "alternative" denotes the alternative method. The number in brackets "[ ]" below the boxplots for the proposed method indicates in how many runs event (29) happened and boundary projection was needed.
The results are summarized in the boxplots of Figure 3. In all cases, the proposed method outperforms the alternative method. In this setting, event (29) happened more frequently than in the previous experiment.
Finally, to see the performance of the methods under model misspecification, we generate pairs of multivariate t-distributions {(t_1i, t_2i)}_{i=1}^n and fit the Gaussian-on-Gaussian regression models. Specifically, we first generate pairs of Gaussian distributions {(ν_1i, ν_2i)}_{i=1}^n from the basic model as described in Section 3. Denoting these Gaussian distributions by ν_1i = N(m_1i, Σ_1i) and ν_2i = N(m_2i, Σ_2i), we set the multivariate t-distributions t_1i and t_2i to have locations m_1i, m_2i, scale matrices Σ_1i, Σ_2i, and a common degree of freedom. We draw i.i.d. observations of size N from each of the distributions {t_1i}_{i=1}^n and {t_2i}_{i=1}^n, and construct estimators for the proposed and alternative models, respectively. Finally, we generate 200 new predictors {t_1i}_{i=n+1}^{n+200} and calculate the out-of-sample AWD, where the true response is the t-distribution whose location and scale are given by the data-generating process, and the fit is the response Gaussian distribution produced by each model. We set d = 2, n = 200, N = 500 and consider three scenarios with the degree of freedom in {5, 10, 15}. As with the previous experiments, we simulate 500 runs, and for each Monte Carlo run, we compute the AWD (28) based on 200 new predictors. The results of the proposed and alternative methods are summarized in the boxplots of Figure 4. In all three scenarios, the proposed method outperforms the alternative method. In addition, the prediction performance improves as the degree of freedom increases. This is because the t-distribution becomes closer to the Gaussian distribution as the degree of freedom increases, so there is less model misspecification.

Figure 3. Boxplots of the out-of-sample AWDs defined in (28) for the low-rank methods with rank K ∈ {2, 3, 4}. "proposed" denotes the proposed method and "alternative" denotes the alternative method. The number in brackets "[ ]" below the boxplots for the proposed method indicates in how many runs event (29) happened and boundary projection was needed.

Figure 4. Boxplots of the out-of-sample AWDs defined in (28) for the three scenarios with the degree of freedom in {5, 10, 15}. "proposed" denotes the proposed method and "alternative" denotes the alternative method. The number in brackets "[ ]" below the boxplots for the proposed method indicates in how many runs event (29) happened and boundary projection was needed.
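The Wasserstein discrepancies reported throughout the experiments rely on the closed-form 2-Wasserstein distance between Gaussians. A minimal sketch (function name our own):

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_gaussian(m1, S1, m2, S2):
    """2-Wasserstein distance between N(m1, S1) and N(m2, S2):
    W2^2 = ||m1 - m2||^2 + tr(S1 + S2 - 2 (S1^{1/2} S2 S1^{1/2})^{1/2})."""
    root = sqrtm(S1)
    cross = sqrtm(root @ S2 @ root)
    d2 = np.sum((m1 - m2) ** 2) + np.trace(S1 + S2 - 2 * np.real(cross))
    return np.sqrt(max(d2, 0.0))   # guard against tiny negative round-off
```

The out-of-sample AWD is then the average of `w2_gaussian` over the 200 true/fitted response pairs. For equal covariances, the distance reduces to the Euclidean distance between the means, e.g. 5 for means (0, 0) and (3, 4) with identity covariances.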

Applications
In this section, we employ the proposed regression model to examine the relationship between daily weather in spring (March, April, and May) and that in summer (June, July, and August) in Calgary, Alberta. We obtain data from https://calgary.weatherstats.ca. This dataset contains the temperature and humidity for each day in Calgary from 1953 to 2021. We consider the joint distribution of the daily average temperature and the daily average relative humidity, regarding each pair of daily values as one observation from a two-dimensional Gaussian distribution. As examples, Figure 5 illustrates the observations and estimated Gaussian densities for spring and summer in each year from 1953 to 1956. We apply the proposed (13) and alternative (27) regression models with the distributions for spring as the predictor and those for summer as the response. The models are trained on data up to 1988, and predictions are computed for the remaining period, where we predict the distribution of summer based on that of spring for each year. Table 1 shows the fitting and prediction results of the proposed method for the training and prediction periods. Additionally, Table 2 shows the results of the alternative method. In these tables, we report summaries of the Wasserstein discrepancies between observed and fitted distributions in the training period, and those between observed and predicted distributions in the prediction period. We also show the prediction results of both methods from 2017 to 2019 in Figure 6. We find that the fitting and prediction by the proposed model are generally better than those by the alternative model. This can be explained by the fact that the proposed model takes the geometry of the Wasserstein space into consideration while the alternative model does not.
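Estimating each seasonal two-dimensional Gaussian amounts to computing a sample mean vector and covariance matrix from the daily (temperature, humidity) pairs. A minimal sketch, with hypothetical array inputs standing in for the actual dataset columns:

```python
import numpy as np

def fit_gaussian(temps, hums):
    """Fit a bivariate Gaussian to daily (temperature, humidity) pairs.

    The column semantics are hypothetical; any two 1-d arrays of equal
    length work. Returns the sample mean vector and 2x2 covariance.
    """
    data = np.column_stack([temps, hums])   # shape (days, 2)
    m = data.mean(axis=0)
    S = np.cov(data, rowvar=False)          # unbiased sample covariance
    return m, S
```

In the application, one such pair (m, S) is computed per season and year, giving the predictor and response Gaussian distributions for the regression.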

Conclusion
In this paper, we propose distribution-on-distribution regression models for multivariate Gaussians with the Wasserstein metric. In the proposed regression models, Gaussian distributions are transformed into elements of linear matrix spaces by the proposed nearly isometric maps, and the regression problem comes down to matrix-on-matrix linear regression. This has the advantage that distribution-on-distribution regression is reduced to a linear regression while preserving the properties of distributions. Also, owing to the linear regression structure, the models are easy to implement and interpret. We incorporate a low-rank structure in the parameter tensor to address high-dimensional Gaussian distributions, and we also discuss the generalization of our models to the class of elliptically symmetric distributions. In the simulation studies, we find that our models perform better than an alternative approach that transforms Gaussian distributions into matrices without considering the Wasserstein metric.
Appendix A. Proofs

To prove Theorems 1 and 2, we employ the following general result on the in-sample prediction error of least squares regression, which is shown by [15]. We refer to Section A.2 in [15] for Gaussian random variables in Hilbert spaces.
Theorem 3 ([15]). Consider the regression model y_i = g_0(x_i) + ε_i, i = 1, ..., n, with responses y_i taking values in a Hilbert space Y. Here, ε_i are independent Gaussian noise terms with zero mean and covariance trace 1, and g_0 : X → Y is an unknown function in a class G. Define the empirical norm ‖g‖_n = (n^{-1} ∑_{i=1}^n ‖g(x_i)‖_Y^2)^{1/2} for g ∈ G, and define J(δ) = ∫_0^δ √(log N_n(t, B_n(δ; G), ‖·‖_n)) dt for δ > 0, where N_n(t, B_n(δ; G), ‖·‖_n) is the t-covering number of the ball B_n(δ; G) = {g ∈ G : ‖g‖_n ≤ δ}. Then, if there exist a real sequence {δ_n} and a constant C > 0 such that J(δ_n) ≤ C √n δ_n^2, the least squares estimator ĝ_n = arg min_{g ∈ G} ∑_{i=1}^n ‖y_i − g(x_i)‖_Y^2 satisfies ‖ĝ_n − g_0‖_n = O_P(δ_n). Using this result, we prove Theorems 1 and 2. Throughout the proofs, we write a ≲ b when there exists a constant C > 0, not depending on n, d_1, d_2, K, such that a ≤ Cb.
Proof of Theorem 1. First, we bound the in-sample prediction error regarding the map Γ_{B_0}, which is given by ‖Γ_B̂ − Γ_{B_0}‖_n. Our strategy is to bound the metric entropy of the function space F = {Γ_B : B ∈ B} and employ Theorem 3. We define the δ-ball of the space F as B_n(δ; F) = {Γ_B ∈ F : ‖Γ_B‖_n ≤ δ} and denote its t-covering number by N_n(t, B_n(δ; F), ‖·‖_n). By defining ‖B‖′ = ‖Γ_B‖_n for B ∈ B, the set B_n(δ; F) is isometric to the δ-ball within the space (B, ‖·‖′). Since the space (B, ‖·‖′) has dimension d_1(d_1 + 1)d_2(d_2 + 3)/2, by a volume ratio argument (Example 5.8 in [24]), we have log N_n(t, B_n(δ; F), ‖·‖_n) ≲ d_1^2 d_2^2 log(1 + 2δ/t).
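For completeness, the entropy-integral condition of Theorem 3 can be checked from this covering-number bound as follows (a sketch):

```latex
J(\delta)
= \int_0^{\delta} \sqrt{\log N_n\bigl(t, B_n(\delta; \mathcal{F}), \|\cdot\|_n\bigr)}\, dt
\lesssim d_1 d_2 \int_0^{\delta} \sqrt{\log\bigl(1 + 2\delta/t\bigr)}\, dt
\lesssim d_1 d_2\, \delta,
\qquad\text{so } J(\delta_n) \lesssim d_1 d_2\, \delta_n = \sqrt{n}\, \delta_n^2
\ \text{ for } \delta_n = d_1 d_2/\sqrt{n}.
```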
Using this upper bound, we have J(δ) ≲ δ d_1 d_2. This implies we can apply Theorem 3 with δ_n = d_1 d_2/√n and obtain ‖Γ_B̂ − Γ_{B_0}‖_n = O_P(d_1 d_2/√n). Next, we bound the in-sample prediction error R_n(Γ_{G,B̂}, Γ_{G,B_0}). Because the Wasserstein space has nonnegative sectional curvature at any reference measure (e.g., Section 2.3.2 in [14]), the Gaussian space, which is the restriction of the Wasserstein space to Gaussian measures, also has this property. In other words, the Wasserstein distance between any μ_1, μ_2 ∈ G(R^{d_2}) is bounded by the distance between their images under the log map at the reference measure. This implies that R_n(Γ_{G,B̂}, Γ_{G,B_0}) ≤ ‖Γ_B̂ − Γ_{B_0}‖_n holds, and combining this fact with ‖Γ_B̂ − Γ_{B_0}‖_n = O_P(d_1 d_2/√n) completes the proof. □

Proof of Theorem 2. As in the proof of Theorem 1, we first bound the in-sample prediction error regarding the map Γ_{B_0}. We define the function space F_low = {Γ_B : B ∈ B_low}, define its δ-ball as B_n(δ; F_low) = {Γ_B ∈ F_low : ‖Γ_B‖_n ≤ δ}, and denote its t-covering number by N_n(t, B_n(δ; F_low), ‖·‖_n). By defining ‖B‖″ = ‖Γ_B‖_n for B ∈ B_low, the set B_n(δ; F_low) is isometric to the δ-ball within the space (B_low, ‖·‖″). Recall that if a tensor B = ⟦A_1, A_2, A_3, A_4⟧ is in B_low, the matrices A_3 and A_4 have the forms (20). Based on this fact, writing α = (α_1, ..., α_K), β = (β_1, ..., β_K), and γ = (γ_1, ..., γ_K), we consider a corresponding norm ‖·‖‴ on R^{2Kd_1+4K}. Since the δ-ball within the space (B_low, ‖·‖″) is isometric to the δ-ball within (R^{2Kd_1+4K}, ‖·‖‴), the set B_n(δ; F_low) is isometric to the δ-ball within the space (R^{2Kd_1+4K}, ‖·‖‴). Therefore, by a volume ratio argument, we have log N_n(t, B_n(δ; F_low), ‖·‖_n) ≲ Kd_1 log(1 + 2δ/t).
Using this upper bound, as in the proof of Theorem 1, we have J(δ) = ∫_0^δ √(log N_n(t, B_n(δ; F_low), ‖·‖_n)) dt ≲ δ √(Kd_1). Applying Theorem 3 with δ_n = √(Kd_1/n) and arguing as in the proof of Theorem 1 yields the conclusion. □

Appendix B. Parameter Identification
In this section, we deal with the identification of the regression parameter B in our proposed models. Although the parameter B does not need to be identified in the empirical risk minimization problems in the main article, it must be identified when we consider estimation or inference for the regression parameter.
B.1. Basic Model. Recall that assuming the linear regression model (13) is equivalent to assuming the model (14) for each 1 ≤ r ≤ d_2 and 1 ≤ s ≤ d_2 + 1. Fix indices 1 ≤ r ≤ d_2 and 1 ≤ s ≤ d_2 + 1 and consider the identification of the parameter B[·, ·, r, s] ∈ R^{d_1×(d_1+1)} in (14). In order to deal with the identifiability issue coming from the symmetry of the matrix X ∈ Ξ_{d_1}, we impose the following condition on the parameter B[·, ·, r, s]: B[p, q, r, s] = 0 for 1 ≤ p ≤ d_1, p + 2 ≤ q ≤ d_1 + 1. (32) In other words, the matrix B[·, ·, r, s] has a lower triangular form. In summary, by adding condition (32) to the existing parameter space, we define the following modified parameter space for the basic model: B* = {B ∈ B : condition (32) holds for each 1 ≤ r ≤ d_2 and 1 ≤ s ≤ d_2 + 1}.
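As a sketch with d_1 = 2 (so that, assuming the column index of B[·, ·, r, s] ∈ R^{d_1×(d_1+1)} runs over 1 ≤ q ≤ d_1 + 1 = 3), condition (32) forces the pattern:

```latex
B[\cdot,\cdot,r,s] =
\begin{pmatrix}
b_{11} & b_{12} & 0 \\
b_{21} & b_{22} & b_{23}
\end{pmatrix}
\in \mathbb{R}^{2 \times 3},
```

where only the entry (1, 3) is constrained to zero, since the range p + 2 ≤ q ≤ d_1 + 1 is nonempty only for p = 1.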
Then, the parameter B is uniquely identified in B * .

B.2. Low-Rank Model. Next, we consider the identification of the regression parameters in the low-rank model. Let B admit the rank-K decomposition B = ⟦A_1, A_2, A_3, A_4⟧. Following an identification strategy used in [27] for tensor regression models, we adopt the following constrained parametrization to fix the scaling and permutation indeterminacy of the tensor decomposition.
• To fix the scaling indeterminacy, we assume A_1, A_2, A_3 are scaled such that their first rows are vectors of ones. Since A_3 is assumed to have the form in (20), this implies that all elements of A_3 are ones. This scaling of A_1, A_2, A_3 determines the first row of A_4 and fixes the scaling indeterminacy (Section 4.2 in [27]).
• To fix the permutation indeterminacy, we assume that the first-row elements of A_4 are distinct and arranged in descending order. This fixes the permutation indeterminacy (Section 4.2 in [27]).
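The two constraints above can be sketched as a normalization routine: rescale the columns of A_1, A_2, A_3 so their first rows are ones (pushing the scales into A_4), then sort the components by the first row of A_4 in descending order. The function and its interface are our own illustration, not code from [27]; the first-row scales are assumed nonzero:

```python
import numpy as np

def normalize_cp_factors(A1, A2, A3, A4):
    """Fix scaling and permutation indeterminacy of a rank-K CP
    decomposition, as in the constraints of Section B.2 (a sketch)."""
    A1, A2, A3, A4 = (A.astype(float).copy() for A in (A1, A2, A3, A4))
    for A in (A1, A2, A3):
        s = A[0].copy()            # first-row scales, assumed nonzero
        A /= s                     # first row of this factor becomes ones
        A4 *= s                    # absorb the scale into A4
    order = np.argsort(-A4[0])     # sort by first row of A4, descending
    return A1[:, order], A2[:, order], A3[:, order], A4[:, order]
```

Since each component k is only rescaled by s_k and 1/s_k in complementary factors, the tensor ⟦A_1, A_2, A_3, A_4⟧ itself is unchanged by this normalization.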
Adding these constraints to the existing parameter space, we define the modified parameter space B*_low for the rank-K model. If the tensor B = ⟦A_1, A_2, A_3, A_4⟧ ∈ B*_low satisfies the condition rank A_1 + rank A_2 + rank A_3 + rank A_4 ≥ 2K + 3, then Proposition 3 in [27] implies that B is uniquely identified in B*_low.
Theorem 5 (Asymptotic Normality of Estimator). In addition to the assumptions in Theorem 4, suppose that θ_0 is an interior point of Θ_0 and that the map θ ↦ E[m_θ(vech*(X_i), vech*(Y_i))] has a nonsingular Hessian matrix V_{θ_0} at θ_0. Then, √n(θ̂_n − θ_0) converges in distribution to a normal distribution with mean zero and a covariance matrix of sandwich form.

Remark 2. When the norm ‖·‖_{(m_2⊕, Σ_2⊕)} is equal to the Frobenius norm, that is, when m_2⊕ = 0 and Σ_2⊕ = I, the second-derivative matrix V_{θ_0} is determined by E[vech*(X_i) vech*(X_i)^⊤]. Therefore, V_{θ_0} is nonsingular if and only if the matrix E[vech*(X_i) vech*(X_i)^⊤] is nonsingular.
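For concreteness, the standard M-estimation result invoked here (Theorem 5.23 in [22]) gives the limiting covariance in sandwich form; the following is a sketch in this section's notation, not a restatement of the paper's display:

```latex
\sqrt{n}\,\bigl(\hat{\theta}_n - \theta_0\bigr)
\rightsquigarrow
N\!\Bigl(0,\;
V_{\theta_0}^{-1}\,
\mathbb{E}\bigl[\dot{m}_{\theta_0}\dot{m}_{\theta_0}^{\top}\bigr]\,
V_{\theta_0}^{-1}\Bigr),
```

where ṁ_{θ_0} denotes the gradient of θ ↦ m_θ(vech*(X_i), vech*(Y_i)) at θ_0.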
Finally, the map θ ↦ E[m_θ(vech*(X_i), vech*(Y_i))] is assumed to have the nonsingular Hessian matrix V_{θ_0} at θ_0. Then, the conditions of Theorem 5.23 in [22] are fulfilled, and the conclusion follows from that theorem.