Procrustes Analysis on the Manifold of SPSD Matrices for Data Sets Alignment

In contemporary high-dimensional data analysis, intrinsically similar and related data sets are often significantly different due to various undesired factors that could arise from different acquisition equipment, calibration, environmental conditions, and many other sources of batch effects. Therefore, the task of aligning such data sets has become ubiquitous. In this work, we present a method for the alignment of different, but related, sets of Symmetric Positive Semidefinite (SPSD) matrices, which constitute a commonly-used family of features, e.g., covariance and correlation matrices, various kernels, and prototypical graph and network representations. Our method does not require any a-priori correspondence, and it is based on non-Euclidean Procrustes Analysis (PA) using a particular Riemannian geometry of SPSD matrices. While the derivation is focused on the manifold of SPSD matrices, we show that our alignment method can be applied directly in the original high-dimensional data space, when considering SPSD features that are sample covariance matrices. We demonstrate the advantage of our approach over competing methods in simulations and in an application to Brain-Computer Interface (BCI) with electroencephalographic (EEG) recordings.


I. INTRODUCTION
A LONGSTANDING problem in data science is how to represent, analyse, and process a union of different, but related, data sets. Often, due to the inherent heterogeneity of many types of data sets, useful representations usually cannot be achieved simply by merging multiple data sets into one big data set. Furthermore, a model learned from one data set is typically not appropriate as-is for the analysis of another. The density of the data sets could be different as a result of a broad variety of application-dependent differences, e.g., the recording conditions, the technology of the data acquisition, the production process and calibration of the recording device, the population of subjects, the physical conditions, etc. In order to exploit the model learned from one data set for analysis tasks in another data set, or to jointly process and analyze them, prior alignment of the data sets is required. One approach for alignment, which is primarily based on geometric considerations, is Procrustes Analysis (PA) [1]. Originated in shape analysis, PA aims to adjust two or more shapes in order to facilitate a meaningful comparison between them. Given pairs of landmarks from both shapes, the adjustment is typically performed by applying three transformations: centering, scaling, and rotation, such that the distances between the landmarks are minimized. Extending this idea from shapes to high-dimensional point clouds facilitated an appealing data alignment approach, which is simple, efficient, and mathematically tractable, and it does not require any rigid a-priori model assumptions or estimates of the whole distribution of the data. Indeed, data alignment using PA has been successfully applied to various fields, including Brain-Computer Interface (BCI) [2], genetics and bioinformatics [3], [4], [5], indoor navigation [6], face recognition [7], and hierarchical representation [5], [8], to name but a few.
The application of PA to high-dimensional data sets poses challenges since such data typically do not live in a Euclidean space. The recent common practice for high-dimensional data analysis is based on the manifold assumption, i.e., to assume the existence of an intrinsic low-dimensional manifold underlying the data [9], [10], [11], [12]. However, learning the manifold from observations might be impractical in some real-world problems. An alternative approach is to use informative features with a-priori known and useful geometry. One natural example of such features is covariance matrices that embody multivariate associations. If the covariance matrices are full rank, they live on a Riemannian manifold, also known as the Symmetric Positive Definite (SPD) cone [13], [14]. Rodrigues et al. [2] presented PA based on the Riemannian geometry of SPD matrices, with application to BCI involving electroencephalographic (EEG) recordings of different subjects. Yet, many types of data sets, such as gene expression data [15] and hyper-spectral imaging data [16], [17], have a low dimensional structure, and therefore, the covariance matrices stemming from such data are rank deficient, i.e., they are Symmetric Positive Semidefinite (SPSD) matrices. Other examples of SPSD features are kernels, similarity matrices, and graph Laplacians, where the multiplicity of the zero eigenvalue equals the number of connected components of the associated graph [18].
In this work, we address the problem of aligning highdimensional data sets by solving a new PA problem designed specifically for SPSD matrices with a fixed rank [19]. sets of SPSD matrices, we aim to align the sets by matching their statistical moments and available labels. For this purpose, we derive the three geometric operations composing PA: centering, scaling, and rotation, by using the Riemannian framework of SPSD matrices presented in [19], [20]. Broadly, our proposed centering is based on Riemannian mean subtraction derived using parallel transport. The scaling is based on matching the Riemannian dispersion, which is carried out with respect to the geodesic path. Lastly, the rotation makes use of a small number of landmarks and is designed to match the first-and second-order statistical features of the sets. Conceptually, our work could be viewed as an extension of [2] from SPD geometry to SPSD geometry. Seemingly, the difference between SPD geometry and SPSD geometry is subtle and could be technically solved by an appropriate regularization, yet, it is in fact fundamental and involves a completely different geometry.
We demonstrate the usefulness of our approach in simulations and in an application to BCI involving EEG recordings. Recently, the spatial covariance matrix of EEG recordings has been shown to be an informative feature for BCI applications, leading to state-of-the-art performance [21], [22], [23], [24]. Here, we show that using our method to align sets of covariance matrices of EEG recordings acquired from different subjects is beneficial, facilitating improved BCI performance compared to other alignment methods.
Seemingly, one limitation of the proposed approach, when it is applied to aligning data through the alignment of the SPSD features of the data, is that every downstream task is also restricted to using these aligned SPSD features. We show that if the SPSD features are computed as the sample covariance matrices, then the centering and rotation steps of the proposed PA, which are derived following SPSD geometry considerations, can be applied directly in the original data space. Consequently, in this case, we are not restricted to using the SPSD features in downstream tasks following the proposed alignment, in contrast to other related algorithms for data alignment [2], [20], [25].
The remainder of the paper is organized as follows. In Section II, we present the relevant mathematical background. Section III presents the problem formulation. The proposed method is presented in Section IV. In Section V, we show how our method can be applied directly to high-dimensional data sets with a low-dimensional structure. In Section VI, we showcase the performance of the proposed algorithm in applications to simulated data and to real EEG recordings. Section VII concludes the paper.

II. MATHEMATICAL BACKGROUND
We present a brief mathematical background on the three Riemannian manifolds considered in this paper. For more details, we refer the readers to [14], [19], [20], [26].

A. The Manifold of SPD Matrices
We denote the set of all r × r SPD matrices by The tangent space T P P r at a point P ∈ P r is the set of all symmetric matrices For S 1 , S 2 ∈ T P P r , the affine invariant metric is defined by [13] where A, B = Tr{A T B}. The set P r in (1) with the inner product (2) constitutes a Riemannian manifold. For P 1 , P 2 ∈ P r , the map of P 2 to the tangent space T P 1 P r is given by the following explicit expression of the Logarithmic map (3) Its inverse is given by the following explicit expression of the Exponential map The geodesic path from P 1 ∈ P r to P 2 ∈ P r is given by Intuitively, the logarithmic map could be interpreted as follows. If we move from a base point P 1 with initial velocity Log P 1 (P 2 ) and maintain constant speed along the geodesic, then after one time step, we arrive at P 2 . The square of the arc length of the geodesic path in (5) is given by where λ i are the eigenvalues of P −1 1 P 2 . The arc length in (6) defines an affine-invariant distance, to which we will refer as the Riemannian distance in P r . We note that the affineinvariant property of this distance can be explicitly written by In the context of PA, we will consider the Fréchet mean, the dispersion, and the following transport map.
Definition 1 (Fréchet mean and dispersion): where d R is the Riemannian distance on M.
2) The dispersion (second order moment) of a set {x i ∈ M} N i=1 is defined by Definition 2 (Isometric Transport Map [20]):  1) It satisfies: 2) It is an isometry, i.e., it preserves pairwise distances: Note that these definitions of the Fréchet mean, the dispersion, and the isometric transport map are not specific to the SPD manifold, and they will be used in the context of other manifolds as well.
Let {P i ∈ P r } be a set with mean M ({P i }) = P . Yair et al. showed in [27] that the map where T p = (P 0 P −1 ) 1/2 , is an isometric transport map on P r .
The map in (9) is tightly related to Parallel Transport (PT) with one important distinction: it maps points from the manifold to the manifold, while PT maps points from tangent space to tangent space. Broadly, PT is the process of transporting a vector along a curve of the manifold such that the covariant derivative of the transported vector along the curve is zero (see [28] for more details). The map in (9) is equivalent to mapping P i ∈ P r to the tangent space at P , parallel transporting the mapped vector along the geodesic path from P to P 0 , and mapping the result back to manifold P r .

B. The Grassmann Manifold
Let V d,r be the set of all d × r matrices, r < d, whose r columns are orthonormal vectors in R d . Let O r be the set of all r × r orthogonal matrices, such that any O ∈ O r satisfies OO T = O T O = I. Following [26], we view the Grassmann manifold as the quotient space G d,r = V d,r O r . A point on G d,r is an r-dimensional subspace of R d , and it is represented by an equivalence class We consider the inner product proposed in [26] Note that G ⊥ and B are not uniquely identified, but any choice results in the same inner product. The geodesic path from [G 1 ] ∈ G d,r to [G 2 ] ∈ G d,r is given by where Log and Exp denote the Logarithmic and Exponential maps on the Grassmann manifold (for their explicit expressions, see [20]). The arc length of the geodesic path in (10) specifies a Riemannian distance. The square of the arc length is given by where Θ represents the angles between the subspaces and can be computed by the following Singular Value Decomposition In [20], it was shown that the following map is an isometric transport map on G d,r where . For simplicity, we denote the isometric transport map on G d,r by

C. The Manifold of SPSD Matrices
We denote the set of all d × d SPSD matrices with a fixed rank r < d by We follow the work of Bonnabel and Sepulchre [19], which considers the following quotient manifold representation Since any C ∈ S + d,r can be decomposed as where U ∈ V d,r and R ∈ P r , C can be represented by where V d,r × P r is termed the structure space representation [19]. The representation in the structure space (13) is not unique because any orthogonal matrix O ∈ O r satisfies and therefore (13) is considered as a point on the Grassmann manifold G d,r .
The fact that the structure space is not unique makes the practical use of a set of SPSD matrices challenging. Recently in [20], a canonical representation of a set of SPSD matrices in the structure space was proposed, such that every matrix in the set has a unique structure space representation, and it was shown to be useful when SPSD matrices are considered as data features. Given a set of SPSD matrices C = {C i }, we denote the canonical structure space representation by where the computation of (G i , P i ) is given in Algorithm 1.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
By [19,Thm. 1], the space S + d,r equipped with the following metric which is the sum of the metrics in G d,r and P r with some parameter k > 0, is a Riemannian manifold with horizontal space We note that there exist in the literature other metrics for S + d,r , e.g., those presented in [29], [30]. We choose this particular representation and its associated metric for three reasons. First, it provides a metric, which generalizes the Affine Invariant Riemannian Metric (AIRM) for SPD matrices (2), since it is invariant to orthogonal transformations, scaling, and pseudoinversion [19]. Second, it was shown in [20] that this metric provides better empirical results compared to the metric proposed in [30]. Third, this geometry enables the computation of essential components, such as the Fréchet mean, and as far as we know, there is no algorithm for computing the Fréchet mean using the geometry proposed in [29].
In [19], a special path of interest between two points in S + d,r , , was proposed. This path is based on a horizontal geodesic in the structure space, and therefore, we first find two representatives of C 1 and C 2 in the structure space that are connected by horizontal geodesics. Define the rotation of U 2 with respect to U 1 by where Then, by [19,Thm. 2], the path between C 1 and C 2 , given by where , admits the following properties. First, it connects C 1 and C 2 , i.e., and it is a geodesic in the structure space. Third, the squared total length of γ C 1 →C 2 (t) is given by The path γ C 1 →C 2 (t) is not necessarily a geodesic path in the Riemannian manifold S + d,r with the metric in (15). In addition, its length in (19) is not a distance since it does not satisfy the triangle inequality. Nevertheless, it was shown theoretically in [19] and empirically in [20] that the length in (19) is a meaningful measure of dissimilarity.
We consider a map of C 2 ∼ = ( U 2 , R 2 ) to the horizontal space T (U 1 ,R 1 ) S + d,r by the corresponding logarithmic maps in the structure space as follows Informally, in the remainder of this paper, we view the horizontal space in (16) as the tangent space, the path γ C 1 →C 2 (t) in Algorithm 1: Canonical Representation for a Set of SPSD Matrices [20]. (18) as an approximate geodesic path, and the map in (20) as an approximate logarithmic map in S + d,r .

III. PROBLEM FORMULATION
Consider two data sets: the source set and the target set where X i and Y i are d × d SPSD matrices with rank r < d, which are viewed as data points. Suppose that each set consists of L classes. We consider a semi-supervised setting, where the labels of only a small subset X l ⊂ X and the labels of only a small subset Y l ⊂ Y are known, and they are used for the proposed data alignment. We assume that the statistical distributions of X and Y are different due to various extrinsic factors, yet the sets are intrinsically related.
To formulate our goal, we follow [2] and parametrize the statistical distributions of X and Y using their first-and secondorder moments, which are defined in the Riemannian sense as follows.
Let M be the SPSD manifold. Since the length in (19) is not a distance, the Fréchet mean in (7) cannot be used as is. Instead, we use the mean proposed in [31] which is computed as follows.
The mean of C is given by For computing the dispersion of C, we use (19) as a measure of dissimilarity between C and C i instead of d R in (22) because d R is not defined, giving rise to the following alternative dispersion Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply. Now, equipped with the mean and dispersion of a set, we parametrize the statistical distributions of X and Y as follows where X and Y are the means of X and Y, respectively, M x i and M y i are the means of the i th class in the sets X and Y, respectively, and σ x and σ y are the dispersions of X and Y, respectively.
Our goal is to align X and Y such that the above parametrization of their statistical distributions coincide. After alignment, a model learned from one set, say X , can be exploited for an analysis or learning task applied to the other set, say Y, despite their heterogeneity. For example, let f : X → {1, 2, . . ., L} be a classification function defined on the set X . A good alignment of X and Y facilitates good classification results simply by applying f to points in Y after alignment.

IV. PROPOSED METHOD
Following standard PA [1], the proposed alignment of the two sets X and Y is composed of three operations on the manifold S + d,r : centering, scaling, and rotation. In this section, we present their proposed implementation on S + d,r .

A. Centering
We center the sets X and Y by subtracting their respective means using a transport map defined as follows.
Rigid transport map if the following two conditions hold.
1) It satisfies: 2) It preserve the length of the curve between C and C i ∀i. i.e.: Note that a rigid transform map (Definition 3) differs from an isometric transform map (Definition 2) by requiring geodesic length rather than distance preservation.
We define the following map: where the matrices T g and T p are given in Sections II-A and II-B, respectively.
is a rigid transport map. See Appendix A for the proof of Proposition 1.
) denote the origin of the space. We center X and Y by applying Γ + We denote the sets after centering by whose means according to Proposition 1 are C 0 .

B. Scaling
We propose to scale the centered source set X (ctr) such that its dispersion after scaling matches the dispersion of the centered target set Y (ctr) , namely σ y . This operation is carried out by sampling the geodesic path (18) between the origin C 0 and X , whose length is given by Consider the structure space representation of X (ctr) and Y (ctr) defined in (25). Let σ g and σ v be the dispersion of {G } on P r , respectively. We define the scaling of X such that the scaled source set is given by The dispersion after scaling is given by

Proposition 2:
Then the dispersion of the set X (scl) is equal to the dispersion of the set Y (ctr) , i.e., See Appendix B for the proof of Proposition 2.

C. Rotation
According to the problem formulation presented in Section III, the sets X and Y are composed of L classes. In this step, we propose to match the means of the classes such that the classes in both sets coincide.
Definition 4 (Rotation): 1) R preserves the length of the path between all pairs of points: 2) R maps C to itself, i.e., R(C) = C.
3) R preserves the length of the path to C: Let SO(d) be the set of all special orthogonal d × d matrices, i.e., orthogonal matrices with determinant 1. The following map is a rotation about the origin C 0 for any C ∼ = (G, P ): where O p ∈ SO(r), and O g ∈ SO(d) is of the form where O g 1 ∈ SO(r) and O g 2 ∈ SO(d − r). In the Supplementary Materials, we show that the map in (27) indeed satisfies all the three properties of Definition 4. To find the best R, which describes the relation between X (scl) and Y (ctr) , we search for the rotation which matches the classes means of X (scl) to those of Y (ctr) by minimizing the following criterion where M x c ∼ = (G x c , P x c ) and M y c ∼ = (G y c , P y c ) are the mean of the c-th class of X (scl) and Y (ctr) , respectively. We note that in contrast to the previous steps (centering and scaling), which are completely label-free, this step requires subsets of X and Y that are sufficiently large and allow for accurate estimations of M x c and M y c , respectively. Alternatively, an unsupervised rotation by matching the second-order moments of the data sets as in [5], [27], can be considered.
By using the arc length in (19) and the form of R in (27), the criterion in (28) can be decoupled to two minimization problems We solve (29) and (30) by using the steepest descent algorithm on the Riemannian manifold of all rotation matrices, which was implemented in [32]. This algorithm computes the Riemannian gradient by mapping the Euclidean gradient to the tangent space of the manifold. The Euclidean gradient of (30) is given in [2] by For the Euclidean gradient computation of (29), see Supplementary Materials. We note that the convergence of the steepest descent algorithm implemented in [32] to a global minimum is not guaranteed. However, we will show experimentally that it performs well on both simulated and real problems. We note that the matrices O g and O p , as well as T g and T p , are invertible matrices. This gives rise to the following insight. Suppose X and Y are represented canonically in the structure space and have the same dispersion, i.e. σ x = σ y , such that the discrepancies between the statistics of X and Y are indeed expressed by transportations, T g and T p , and rotations, O g and O p . Consequently, the centering step in Section IV-A and the rotation step in this section, can be viewed as the estimation of the inverse of T g , T p , O g and O p , and the proposed PA can potentially match the statistical distributions.

V. DA IN THE ORIGINAL DATA SPACE
In this work, we consider the SPSD matrices as data features. Despite being quite broad (e.g., low-rank covariance matrices, graph Laplacians, kernel and similarity matrices), the focus on SPSD matrices applies not only to the current context of data alignment, but also to every subsequent downstream task that is restricted to working with particular SPSD features.
As a remedy, in this section, we show that the centering and rotation steps can be applied equivalently in the original data space rather than to the SPSD sample covariance matrices. We remark that we did not find such an equivalent implementation of the scaling step.
Consider two sets of data sets: , where the columns of D x i and D y i are samples in R d with zero mean. The sample covariance matrices of D x i and D y i are given by respectively. We assume that Σ X i , Σ Y i ∈ S + d,r , and denote the sets of the sample covariance matrices computed from D x and D y , by . Let (G i , P i ) be the canonical representation of Σ X i , obtained by applying Algorithm 1 to the set X . To apply alignment directly to D x we first compute the centering matrices of the set X , T p x and T g x , and the centering matrices of the set Y, T p y and T g y , as described in Section IV-A. Second, we compute the rotation matrices O g and O p as described in Section IV-C. Then, we apply three consecutive steps: centering, rotation, and transportation to the mean of Y, which are given by where X and Y are the means of X and Y, respectively. Fig. 1. Two tori x(θ, ϕ) and y(θ, ϕ) defined by the parametrization given in (34).
The proof appears in Appendix E. Proposition 3 implies that when we apply the transformation in (33) to D x i , the relation between the sample covariance of the data before and after the transformation is given by centering, rotation about the origin, and transportation to the mean Y . Consequently, after the transformation in (33), the covariance matrices characterizing the data from both sets D x and D y are aligned.
The implication of Proposition 3 is that we are not restricted to using SPSD features in downstream tasks following the proposed alignment, in contrast to other related algorithms for data alignment, such as [2], [20], [25]. Section VI-B illustrates the application of this result.
For each torus, we generate a set of covariance matrices as follows. Let [θ, ϕ] T be a random variable with a uniform distribution U [0, π/2] × [0, π/2]. Each realization [ϕ i , θ i ] T of [ϕ, θ] T defines a point on the torus. We collect N = 500 points from x(θ, ϕ): For each point x i , we compute the sample covariance where The above procedure is repeated for the other torus y(θ, ϕ), resulting in another set of N = 500 points [ϕ i , θ i ] T with their associated covariance matrices Y i . The obtained two sets of covariance matrices are denoted by Since the considered torus is a two-dimensional surface embedded in R 3 , the sample covariance matrices X i and Y i computed based on a small neighborhood are approximately in S + 3,2 . Our goal in this context could be stated as follows. We wish to align X and Y, such that the covariance matrices X i and Y j of any two close points in the (hidden) parameter space [θ i , ϕ i ] T and [θ j , ϕ j ] T , where each lies on a different torus, will be close.
We apply our method and get a set of covariance matrices after each of the three steps: centering, scaling, and rotation. For the rotation step, we divide the covariance matrices into four classes according to their respective angles θ i and ϕ i . Each class corresponds to one of the four combinations θ i ≶ π/4 and ϕ i ≶ π/4. We use 50 labels from the target set Y, which is 10% of the set size.
To visualize the resulting sets after adaptation, we use the approximate logarithmic map in (20) to map the matrices to the tangent space. After mapping to the tangent space, the matrices are viewed as vectors in a linear space, where we can apply Principal Components Analysis (PCA) and display the Principal Components (PC). Fig. 2 shows the PC before and after each of the steps of the proposed method: (a) X and Y before alignment, (b) X (ctr) and Y (ctr) after centering, (c) X (scl) and Y (ctr) after scaling, and (d) X (rot) and Y (ctr) after rotation. Each point in Fig. 2 represents a covariance matrix, and it is colored according to its respective (hidden) angle ϕ. We observe that the proposed method indeed aligns X and Y. Furthermore, the correspondence between the colors of the points from both sets (representing the angle ϕ) illustrates that the obtained alignment respects the intrinsic low-dimensional structure of the data (in this case, the hidden parametrization by ϕ and θ).

B. Toy Problem -Denoising in the Original Data Space
In Section V, we showed that the proposed alignment can be applied in the original feature space, in which the data is given, thereby not restricting downstream tasks to the use of SPSD features. In this section, we demonstrate this capability using a subsequent denoising task following the alignment.
where σ θ i (t n ) and ϕ θ i (t n ) are realizations of a Gaussian variable with zero mean and variance σ 2 = 4 · 10 −4 , such that the input Signal-to-Noise Ratio (SNR) (before denoising) is 26.2 dB. Fig. 4 presents examples for: (a) a set of samples from the clean signal D y i , and (b) a set of samples from the noisy signal D (1) x i . Following the notation in Section V, we denote D y = {D y i } and D (j) x , we first apply the transformation in (33) to adapt D (j) x i to D y i and get D (j) x i , and then, we average over all the noisy (and adapted) signals: x i . To illustrate the importance of data alignment for denoising, we compare between the averaged signal without alignment and the averaged signal after alignment. Fig. 5 shows the denoising results with and without the proposed alignment. The clean signal D x i (t) and the filtered signal D y i (t) are depicted by the blue curve and the red curve, respectively. Each row in the figure presents a different trajectory of D y i (t) and D x i (t). The left and right columns show the denoising results without alignment and the denoising results by using the proposed alignment, respectively.
We observed that by using the proposed alignment prior to averaging (denoising), we obtain a better noise attenuation. Furthermore, when using alignment, the structure of the signal is preserved while the averaging of the signals D (j) x i (t) without a prior alignment hinders the structure of the signal. To quantitatively measure the denoising results, we compute the SNR of the resulting signal. The SNR obtained by the denoising after alignment is 42 dB, whereas the SNR obtained by the denoising without alignment is only 20.6 dB.

C. Application to BCI
One of the paradigms of BCI is the Steady State Visually Evoked Potential (SSVEP), which is based on the fact that when a user is looking at a flickering object at some frequency, this frequency manifests itself in the spectrum of the EEG signal. In typical SSVEP applications, a user is exposed to several visual stimuli with different frequencies, where each stimulus corresponds to a different command. The user chooses the desired command by looking at the associated stimulus, and the Fig. 5. Denoising results. The clean signal D x i (t) and the filtered signal D y i (t) are represented by the blue curve and the red curve, respectively. Each row presents a different trajectory of D x i (t) and D y i (t). The left and the right columns, show the denoising results without alignment and the denoising results by using alignment, respectively. system executes that command by identifying the corresponding frequency in the EEG signal.
We apply our method to adapt the EEG recordings of one subject to another. In our experiment, we use the EEG data from the first SSVEP experiment in the MAMEM database [33], [34], which consists of the recordings of 11 subjects. Each subject was exposed in each trial to one of five flickering lights with a different frequency {f k } 5 k=1 . The EEG signals were recorded during the experiment in 256 channels. However, we use only the 13 most informative channels according to the performance analysis reported in [33]. The task in this experiment is to classify the EEG signals of each trial according to the flickering frequencies. We note that in the technical report published with the MAMEM database [33], subjects 3, 5, and 8 were marked as outliers.
In the context of this paper, we will show that there is a significant difference between the data recorded from different subjects, and therefore, it could be beneficial to view each subject as a domain. Subsequently, we will use our method for the purpose of adapting data from two subjects (i.e., two domains), which allows for high-quality classification of EEG signals of one subject by a classifier trained on data from another subject. We note that for fair comparisons with the competing methods, Fig. 6. The covariance matrix of EEG signals recorded in five trials. In each trial the subject was exposed to a flickering light with a different frequency f k .
we assume that all the labels of X are given, rather than the labels of a subset X l ⊂ X .
Let D ∈ R N c ×N t be the EEG signals recorded in one trial after mean subtraction, where N c = 13 is the number of channels and N t = 1250 is the number of time samples. For preprocessing, we use the mean subtraction and denoising algorithm AMUSE [35], which are implemented in the SSVEP toolbox [36] released as a part of the MAMEM database. Next, we filter D with five band-pass filters centered around the flickering frequency f k . Let D k ∈ R N c ×N t be the EEG signals filtered with the k-th filter. We use the covariance matrix proposed in [37], which is block diagonal with the sample covariance The dimension of Σ is 65 × 65 since there are 5 flickering frequencies, and therefore we have 5 · N c = 65 filtered signals. Fig. 6 shows the covariance matrix of the EEG signals recorded in five trials. In each trial, the subject was exposed to a flickering light with a different frequency f k . It can be observed that in each covariance matrix, a different Σ k is dominant. Fig. 7. The eigenvalues decay of three different covariance matrices computed for three different trials of the SSVEP experiment.
In Fig. 7, we plot the eigenvalues decay of three different covariance matrices Σ (36), where each matrix is computed based on a different trial. We report that such decays are prototypical. We see that the eigenvalues (sorted in descending order) have small values, i.e., λ i ≈ 0, for i > 12, implying that the rank of the matrices is approximately r = 12. As a result, we consider the covariance matrices as SPSD matrices of fixed rank r = 12.
We compute the covariance matrix given in (36) for each trial of each subject, and denote the covariance matrix of the i-th trial and the l-th subject by X i ) be the canonical representation. Throughout this experiment, we set k in (15) i )} i multiplied by k, balancing the distances in G d,r and the distances in P r .
To illustrate the advantage of using the Riemannian geometry of SPSD matrices, we compare it to the Riemannian geometry of SPD matrices. We apply two logarithmic maps to the set X (6) = {X (6) i }: (a) the logarithmic map of the SPD manifold given in (3), and (b) the approximation for the logarithmic map of the SPSD manifold given in (20). Then, we compute the two PCs in the obtained tangent space. Note that in the tangent space, the matrices are viewed as vectors in a linear space, where the standard PCA can be applied. Fig. 8(a) and (b) present the resulting PC computed in the tangent space of the SPD manifold and the SPSD manifold, respectively. Each point represents a matrix of a single trial, which is colored by its respective label, i.e., the stimulus frequency. One common practice to counter the fact that the matrices are rank deficient is the Ledoit-Wolf estimator for high-dimensional covariance matrices [38]. To compare the SPSD geometry to the SPD geometry after applying the Ledoit-Wolf estimator, we use the implementation of [38] in [39] for the estimation of the covariance matrices Σ k , k = 1, . . ., 5 in (36). Fig. 8(c) presents the PCs obtained in the tangent space of the SPD manifold when the covariance matrices were estimated by the Ledoit-Wolf algorithm. Fig. 8(d) presents the PC obtained by stacking each matrix into one vector and then applying PCA, namely, using Euclidean geometry.
Each point in Fig. 8 represents a covariance X (6) i , and it is colored according to the corresponding flickering frequency. We observe that the Riemannian geometry of SPSD matrices provides better separation between the frequencies compared to the separation obtained by the Riemannian geometry of SPD matrices or by the standard Euclidean Geometry. Fig. 8. The PC of the covariance matrices in the set X (6) after mapping to the tangent space by using: (a) the logarithmic map of the SPD manifold given in (3), (b) the approximation for the logarithmic map of the SPSD manifold proposed in [20], (c) the logarithmic map of the SPD manifold given in (3), where the covariance matrices were estimated using [38]. (d) presents the PC obtained by stacking each matrix into one vector and then applying PCA. Points are colored according to the flickering frequency classes. To objectively measure the degree of separation between the frequencies for each of the subjects (except the outlier subjects 3, 5, and 8), we train a Minimum-Distance to Mean (MDM) classifier [21] with the labels of 80% of the trials, and then classify the rest of the trials. The MDM classifier associates a sample to the class with the closest mean. We repeat this experiment 20 times for each subject and compute the mean Classification Accuracy (ACC). Table I shows the mean ACC obtained by using each of the different geometries: Euclidean, SPD with sample covariance matrices, SPD with covariance matrices estimated using [38], and the SPSD geometry proposed by [19]. We see that the MDM classifier obtains the highest mean ACC when using the SPSD geometry.
We note that Fig. 8 visualizes covariance matrices of only one subject. To illustrate the need for data alignment, we visualize the covariance matrices of two different subjects. Fig. 9 presents the tSNE representation [40] computed in the tangent space of data from subjects 2 and 4, which were arbitrarily chosen. Each point represents a covariance matrix of one trial, and it is colored according to the flickering frequency. The covariance matrices of subject 2 are marked by asterisks and the covariance matrices of subject 4 are marked by circles. We observe that trials of the We apply the proposed alignment algorithm to align the covariance matrices of all pairs of subjects, where one subject is the source and the other is the target. For the rotation step, we use the labels of 5 trials from each class in the source set. To quantitatively evaluate the alignment of each pair of subjects, we train a MDM classifier [21] on the target set and test it on the source set. We compare the obtained ACC of four algorithms for data alignment: (a) PA on P d [2], (b) a naive union of SPSD matrices without alignment, (c) Transportation on S + d,r [20] and (d) PA on S + d,r (ours). To compare the algorithms numerically, we compute the average accuracy of all pairs excluding the outlier subjects 3, 5 and 8, where our algorithm is applied both with and without the scaling step. Table II shows the average accuracy obtained by the four algorithms. Our algorithm obtains a comparable ACC to the ACC obtained by [20], and in contrast to the experiment in Section VI-A, it obtains the highest accuracy without the scaling step. Fig. 10 presents the confusion matrix for each of the four algorithms consisting of the ACC of all pairs of subjects, excluding subjects 3, 5, and 8. Table III presents the average precision, recall, and F1 measure of each class, obtained by the tested algorithms. Fig. 6 implies that a specific classifier for the SSVEP application could be implemented as follows:   where · F is the Frobenius norm, and Σ k is the k-th block of Σ defined in (36). We consider the classifier in (37) as a baseline classifier for this application. To compare our method with this baseline, we designate the recordings of subject 4 as the target set and train the MDM classifier on this set. Then we classify the recordings of all the other subjects. Table IV compares the obtained ACC with the ACC obtained by the baseline classifier (37). We see that for most of the subjects, the proposed alignment improves the classification results. The mean ACC is improved by 8%. In practice, a subject should be considered as a (good) target if the intra-subject covariance matrices are well separated into the different classes. Specifically, when using the MDM classifier, this separation can be evaluated by the percentage of the covariance matrices for which the closest mean is the mean of the true class. For subject 4 (and subject 11 as well), 100% of the covariance matrices satisfy this condition, and therefore, this is a good choice of a target set. We repeat the above experiment, where each time the recordings of a different subject are designated as the target set. We report that the mean ACC is improved by the alignment for all the subjects, except subjects 3, 5 and 8, which are the outlier subjects. Same as in Fig. 9, in Fig. 11 we present the tSNE representation of covariance matrices from three pairs of subjects before and after alignment in the left and right columns, respectively. Each point represents a covariance matrix of one trial, and it is colored according to the flickering frequency. Covariance matrices from the source set are marked by asterisks and covariance matrices from the target set are marked by circles. It can be observed that before alignment, points from the source set and points from the target set of the same class, reside in different regions. In contrast, after applying the proposed alignment, the classes of the source set coincide with the classes of the target set.

VII. CONCLUSION
In this work, we propose an algorithm for data alignment based on the Riemannian geometry of SPSD matrices. We show that this algorithm can be applied not only to SPSD matrices but also to high-dimensional data sets with a low-dimensional structure, by performing the centering and rotation steps in the data space using geometric considerations stemming from their corresponding sample covariance matrices. The advantage of the proposed algorithm is illustrated in both simulated and real data.
Our method relies on the assumption that the rank and the dimension of the SPSD matrices in a given set are fixed. In future work, we plan to address the case where SPSD matrices have different ranks or different dimensions [25], corresponding for example to a different number of electrodes in two different EEG experiments. In addition, our algorithm only takes into account the first and the second moments of the statistical distribution of a given data set, while in future work, we will investigate the incorporation of higher moments.

APPENDIX A PROOF OF PROPOSITION 1
Proof: For the proof of property 1 in Definition 3 see [20]. Next, we prove property 2 in Definition 3. Let {C i ∼ = (G i , P i )} ⊂ S + d,r be the canonical representation of a set with a mean M ({C i }) = C ∼ = (G, P ), and let C 0 ∼ = (G 0 , P 0 ) be another point on S + d,r such that G 0 = Π G (G 0 ) and P 0 = G T 0 C 0 G 0 . The length of the curve γ C→C i is given by (19) Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
Next, we show that The structure space representation of the transported set Γ + C→C 0 (C i ) is given by (T g G i , T p P i T T p ), where the matrices T g and T p are give in Sections II-A and II-B, respectively. To use (19) for the computation of the right hand side of (38), we need to insure that T g G i satisfies By definition, 2 is an SVD. Since G 0 = T g G (see [20]), we have Therefore, O 1 and O 2 are given by the left and the right orthogonal matrices of the SVD: G T G i = OΣO T i , respectively. Hence The last equality is due to the fact that Π G (G i ) = G i OO T i = G i , since (G i , P i ) is the canonical representation of C i . Now, we use (19) to compute the right hand side of (38) The third equality holds since d P is an affine-invariant distance and d G is invariant to orthogonal matrix multiplication.

APPENDIX B PROOF OF PROPOSITION 2
To show that σ 2 (X (scl) ) = σ 2 y we use the following lemmas. Lemma 1: Let G(t) be a geodesic path between G 1 ∈ G d,r and G 2 ∈ G d,r , where G 2 = Π G 1 (G 2 ). The geodesic path G(t) satisfies See Appendix C for the proof. According to Lemma 1, G ). Therefore, the length of the geodesic path between C 0 and X (scl) i is given by Lemma 2: Let G(t) be a geodesic path between two points G 1 , G 2 ∈ G d,r . The distance between G 1 and G(t) is given by where G T 1 G 2 = O 1 (cos Θ)O T 2 is an SVD. See Appendix D for the proof.
We note that an analogous result to Lemma 2 in the SPD manifold can be obtained by substituting (5) in (6) d 2 P (P 1 , P (t)) = t 2 where λ i are the eigenvalues of P −1 1 P 2 . By using Lemmas 1 and 2, we now prove Proposition 2. Proof: The second equality is due to Lemma 1 and (41), and the third equality is due to Lemma 2.

APPENDIX C PROOF OF LEMMA 1
Proof: Let G 1,⊥ be the orthogonal complement of G 1 , such that [G 1 G 1,⊥ ] ∈ O d . The geodesic path G(t) is given by [26] G(t) = G 1 V cos(Σt)V T + U sin(Σt)V T , where U ΣV T is the compact SVD of G 1,⊥ B, B ∈ R (d−r)×r . Recall that (17)). To find the SVD of G T 1 G(t), we multiply (42) by G T 1 from the left G T 1 G(t) = G T 1 G 1 V cos(Σt)V T + G T 1 U sin(Σt)V T = V cos(Σt)V T + G T 1 U sin(Σt)V T . The second term equals zero since the columns of U are in the subspace of G 1,⊥ . Therefore, the SVD of G T 1 G(t) is given by G T 1 G(t) = V cos(Σt)V T , and Π G 1 (G(t)) = G(t)V V T = G(t).

APPENDIX D PROOF OF LEMMA 2
Proof: Let H be the direction of G(t) at t = 0, i.e.,Ġ(0) = H. The path length between G 1 and G(t) is given by [26] where H = U ΣV T is the compact SVD of H. As noted in [26],