Tangent space spatial filters for interpretable and efficient Riemannian classification

Methods based on Riemannian geometry have proven themselves to be good models for decoding in brain-computer interfacing (BCI). However, one major drawback of these methods is that it is not possible to determine what aspect of the signal the classifier is built on, leaving open the possibility that artifacts drive classification performance. In areas where artifactual control is problematic, specifically neurofeedback and BCIs in patient populations, this has led people to continue to rely on spatial filters as a way of generating features that are provably brain-related. Furthermore, these methods also suffer from the curse of dimensionality and are almost infeasible in high-density online BCI systems. To tackle these drawbacks, we introduce here a method for computing spatial filters from any linear function in the Riemannian tangent space, which allows for more efficient classification as well as the removal of artifact sources from classifiers built on Riemannian methods. We first prove a fundamental relationship between certain tangent spaces and spatial filtering methods, including an explanation of common spatial patterns within this framework, and then validate our proposed approach using an open-access BCI analysis framework.

Abstract-Methods based on Riemannian geometry have proven themselves to be good models for decoding in braincomputer interfacing (BCI). However, one major drawback of these methods is that it is not possible to determine what aspect of the signal the classifier is built on, leaving open the possibility that artifacts drive classification performance. In areas where artifactual control is problematic, specifically neurofeedback and BCIs in patient populations, this has led people to continue to rely on spatial filters as a way of generating features that are provably brain-related. Furthermore, these methods also suffer from the curse of dimensionality and are almost infeasible in high-density online BCI systems. To tackle these drawbacks, we introduce here a method for computing spatial filters from any linear function in the Riemannian tangent space, which allows for more efficient classification as well as the removal of artifact sources from classifiers built on Riemannian methods. We first prove a fundamental relationship between certain tangent spaces and spatial filtering methods, including an explanation of common spatial patterns within this framework, and then validate our proposed approach using an open-access BCI analysis framework.

I. INTRODUCTION
Brain-computer interfaces (BCIs) are well known, if not infamous, for their sensitivity to noise and their low signal-tonoise ratio. Over the past decades, many methods have been invented in order to derive features from the raw signal data that are predictive of user intention. However, as the electroencephalogram (EEG) is highly sensitive to both neural and nonneural signals, optimizing setups for predictive accuracy was insufficient. Rather, it was necessary to be able to confirm that any classifier was both predictive and based purely on brain-derived features. These divergent requirements spurred the field to develop in two different directions: spatial filtering and Riemannian manifold techniques.
Spatial filters are linear combinations of channel activity that reconstruct a single (neural or non-neural) source with certain desired properties. Initially, these weightings were computed via physical or neurophysiological models [1]. However, it was quickly discovered that data-driven spatial filters could lead to features that reflect robust differences in brain Asterisk indicates corresponding author.
* J. Xu  activity. By optimizing for variance [2,3] or independence [4,5], or even searching for filters that maximize the difference between multiple types of intention [6,7], many different sorts of spatial filters can be computed. In order to verify that the reconstructed signal comes from the brain, it was possible to plot the spatial patterns corresponding to those filters on the scalp.
Beginning with common spatial patterns (CSP), there has been a large body of literature dedicated to finding algorithms that optimally reconstruct source activity based on a given criterion. For differences between two classes, CSP has proven itself to be robust and easy to implement (for a more exhaustive review, see [8]). More recently, methods have been developed to find sources that track a continuous variable of interest [9]. One major difficulty that was recognized early on is that, while it is relatively simple to generate appropriate spatial filters for data that is already recorded, the application of these filters to new data is often confounded by the highly non-stationary nature of the EEG signal. Filters that persist across multiple recording sessions, or filters that work on multiple subjects, are both open areas of research. Some groups look at different criteria to derive robust filters [10,11], others use more probabilistic techniques [12,13], and still others consider options like sparsity [14,15] or looking at patches of channels [16]. Each of the aforementioned techniques has shown its value in solving an aspect of the spatial filtering problem, but they often require very different approaches to solve the ensuing optimization problems, and it is hard to decide which one may be most appropriate for a given situation. A further issue is the inefficient use of data. Spatial filters themselves can only be used to generate features. To fit classification or regression models requires re-using the data for the prediction model fitting.
Nevertheless, spatial filtering remains a crucial method in applications where artifacts are of great concern. In particular, it is crucial for neurofeedback studies. When the goal is to give feedback on neural activity, there must be a way of ensuring that the model which reconstructs a source of interest from the original signal only uses brain data to do so. This requirement invalidates many black-box machine learning methods, such as random forests [17], and often results in methodologies in which spatial filters are retrained in each recording session and validated by hand before neurofeedback can take place.
Outside of this sphere, methods based on Riemannian geometry have been gaining momentum as a model for robust classification for performance-optimized BCIs. Thanks to work in differential geometry, metrics for computing the distance between sensor covariance matrices have been discovered that arXiv:1909.10567v1 [eess.SP] 23 Sep 2019 are invariant to many common sorts of noise found in the electroencephalogram [18]. These methods can be translated into algorithms for finding classifiers that are far more robust to noise across a variety of contexts [19]. In particular, the approaches that use tangent space projection [20] have been shown to out-perform most other conventional methods in a recent meta-analysis [21]. Two major downsides, however, are their high computational complexity and their interpretation. Because these methods work in the space of sensor covariance matrices, their size scales quadratically with the number of sensors. Further, the issue of interpretation is a significant problem. As of now, it is not possible to determine what parts of a signal are being used to build a tangent space classifier, and therefore these can only be used in artifact-sensitive contexts when paired with an artifact detection pipeline or other artifact cleaning methods.
In our paper, we show the following contributions: that it is possible to find sets of spatial filters that describe a linear function in the Riemannian tangent space, and further that this space has a fundamental relationship to common spatial patterns. Via this approach, the full literature of linear machine learning methods can be used for spatial filtering, instead of requiring a different optimization for each regularization of interest. Using this connection, it is possible to visualize the sources that a tangent space classifier uses, and thereby to identify artifact sources used for classification and remove them via orthogonal projection. Finally, we show in offline comparison that using spatial filters derived via this approach significantly out-performs common spatial patterns and can even, in low-data situations, out-perform the tangent space function they are derived from. A preliminary version of this work has been reported at [22].

II. BACKGROUND
Riemannian manifold-based classification methods (hereafter abbreviated Riemannian methods) can often seem difficult to understand. For convenience, we include this section that reviews our notation and the basic operations of Riemannian methods, as well as a short review of the mathematics behind spatial filtering.

A. Preliminary and Notations
We notate the the raw sensor data as X ∈ C×N ×T , where C, N and T represents the number of channels (electrodes), samples (length of each trial) and trials respectively. We represent the data of channel c (with c ∈ {1, · · · , C}) as X c . In addition, the data from the t-th trial (with t ∈ {1, · · · , T }) are expressed as X t . Similarly, we use (·) t to express the variables derived from X t . Moreover, the covariance matrices computed from X, i.e., the points lying on the manifold, are denoted as C ∈ C×C×T . The Fréchet mean, a generalization of the standard arithmetic mean to other spaces, of the manifold points set C is expressed as C m . In the following section, we use A to denote any symmetric positive definite (SPD) matrix, for which the following property holds true: v T Av > 0, ∀v = 0. We next describe some common operations for manipulating points on the symmetric positive definite (SPD) manifold. Firstly, λ (A) is used to express the vector of eigenvalues of A. Next, the logarithm for an SPD matrix is defined as: where D is the diagonal eigenvalue matrix of A, i.e., A = VDV T , log (·) represents taking the logarithm elementwise for a matrix, and V is the orthogonal matrix of eigenvectors. The exponential and powers of a SPD matrix are defined in similar fashion, i.e.: Expm (A) = Vexp (D) V T A p = VD p V T , p ∈ and p = 0, (II. 2) where exp (·) and (·) p represent taking logarithm and power of p elementwise within a matrix. Please note that p can also be a fraction, e.g., p = 1 2 means the square root and p = − 1 2 denotes the inverse square root. At last, since the vectorization of an SPD matrix is also frequently employed to reduce the computational complexity, it is defined as below: vec (A) = [α 1,1 A 1,1 , · · · , α i,j A i,j , · · · , α C,C A C,C ] An overview of these notations is shown in Table II.1.

TABLE II.1: The List of Notations
Basic Variables C− #channels c− c-th channel T − #trials t− t-th trial N − #samples K− #spatial filters D − Diagonal matrix I − Identity matrix A − Any SPD matrix B − Any SPD matrix Data Related Variables X ∈ C×N ×T − Bandpass filtered trialwise data C ∈ C×C×T − Covariance matrices on the manifold C w ∈ C×C − Weight covariance matrix on the manifold C ref ∈ C×C − Reference point for constructing tangent space S ∈ C×C×T − Covariance matrices on the tangent space S w ∈ C×C − Weight matrix on the tangent space F ∈ C×C − Spatial filters with full rank − → s t ∈ C(C+1) 2 ×1 − Tangent vector of t-th trial − → f ∈ C×1 − Single spatial filter component − → w ∈

B. Riemannian Manifold based Methods
Most BCIs use classifiers built on features extracted from the bandpower in physiologically relevant ranges from the recorded channels, often approximated via the variance after spectral filtering. One limitation of this information is that it cannot take correlations between channels into account; in order to overcome this, the Riemannian classification framework is based on the sample covariance matrices [19], which encode both cross-channel information and variance information, and are also theoretically symmetric positive definite. Therefore, in this section, we will introduce several fundamental procedures utilized in Riemannian methods. For a more mathematically exhaustive treatment of Riemannian manifolds, please refer to [23].
1) Riemannian Metric and Distance: Most feature extraction or classification algorithms concentrate on maximally increasing the discriminability of data points. A linear classifier, for example, can be thought of as a one-dimensional projection of a dataset in which the two classes are as far from each other as possible according to a given criterion. One convenient proxy of measuring the discriminability of a set of points is therefore via the inter-point distances. A classifiable dataset corresponds to a dataset in which inter-point distances are low within a class and high between classes. Therefore, a good metric can lead to good models in machine learning tasks. In standard vector algebra, the metric function is usually the standard Euclidean metric, i.e., the squared Euclidean distance between two matrices, and is usually measured by the Frobenius norm of their difference, as shown in the below: where A and B are two matrices of the same size and · F represents the Frobenius norm. While this metric can also be used with SPD matrices, it is incapable of adequately capturing the structure of SPD matrices, leading to certain undesirable effects such as the swelling effect [24]. What this means is that naively attempting to use covariance matrices as features in a linear classifier by simply vectorizing the input points often works very poorly.
In order to take advantage of the structure inherent to covariance matrices, it is desirable to have a metric that generalizes the properties of the Euclidean metric in standard vector spaces to the SPD manifold. One importance of such property is the idea of geodesic distances -that the distance between two points is equivalent to the length of the shortest path to get from point A to point B. In vector spaces, this is simply equivalent to the magnitude of the difference between two points, but this is not necessarily true for manifolds. The affine-invariant Riemannian metric is proposed [25] and defined as Eq. (II. 5), and has the property of preserving geodesic distances, which is to say that the distance between two points is the length of the shortest path between them upon the SPD manifold.
where · 2 represents the L2 norm. Based on the chosen metric, the expression for the mean of a set of matrices is defined as: AIRM A, C t , (II. 6) where A ∈ C×C and C = C 1 , · · · , C T ∈ C×C×T . If C m is globally unique, then it is named as the Fréchet mean of the set of SPD matrices C. Given the metric and a set of points, it is possible to implement purely distancebased classifiers such as k-Nearest Neighbors or Minimum Distance to Riemannian Mean (MDRM) [26]. Classifiers based on this metric have shown themselves to be highly effective in particular for BCI data [19,27,28].
2) Tangent Space: When observing the explicit Riemannian metric defined in Eq. (II.5), one inconvenient issue is that the length of the shortest path between two manifold points (the geodesic) cannot be derived via simple subtraction and norm computation, as it can with the Euclidean metric. In order to treat SPD matrices in a manner identical to traditional feature vectors, we adopt the tangent space mapping.
The tangent space is defined as a finite-dimensional Euclidean space which exists at each point on the manifold and linearizes the curvature of the manifold around that point, which is called the reference point. Simply speaking, it is a way of transforming manifold points such that we can now treat them as standard vectors. However, the distances and angles derived from the tangent space representation of points are only valid within a small neighborhood around the reference point, which means that this transformation only works in a small volume of the manifold. Therefore, to ensure the approximation error is minimized, the Fréchet mean of a set C is adopted as the reference point for that set.
To transform points from the manifold to the tangent space at a point and vice versa, the so-called logarithmic and exponential maps are used. Under the affine-invariant Riemannian metric, the logarithm and exponential function pair at a point A are formulated as following: where S t is the projected point lying on the tangent space, C t is the original manifold point and the operation of Logm B (A) and Expm B (A), i.e., the logarithm and exponential of A based on another SPD matrix B, is defined as [25,19]: To further simplify operations on the tangent space, the projected points are usually vectorized. Note that this procedure does not alter the location or norm of the points, it simply makes them easier to notate and use. We denote these vectors as tangent vectors and formulate them as follows: where − → s t is the tangent vectors of t-th trial.
After obtaining the set of tangent vectors s, standard machine learning algorithms can be applied.
3) Pros and Cons of Riemannian Methods: As an emerging technique, Riemannian methods have seen an upsurge of interest in the BCI field recently [27,28] due to their rich feature space and robustness to outliers. In particular, Jayaram et al. [21] have compared Riemannian methods and standard processing pipelines over more than 200 subjects and showed that Riemannian methods are, on average, superior to many other conventional methods.
One major pitfall of these methods, however, is their sensitivity to the number of channels. As shown in Eq. (II.9), the dimension of the tangent vectors increases quadratically with the number of channels C. In addition, the computational complexity of the eigenvalue decomposition for matrices grows cubically. Due to these reasons, it becomes infeasible to apply Riemannian methods on data sets with a large number of channels. In addition, since the full covariance matrix is utilized for classification, interpreting the contribution from each channel that is used by the classifier can be a challenge. Therefore, the application of Riemannian methods is still restricted to low-channel situations where interpretability is of lesser importance.

C. Spatial Filtering
The novelty of Riemannian methods is not only the adoption of the new metric function but also the extraction of covariance matrix based features instead of variance-based features. Although in traditional EEG-based BCI systems, power (variance)-based methods are much more commonly adopted, they are, unfortunately, significantly affected by poor signal quality. To remove artifacts and noise while reducing the computational complexity, spatial filtering techniques are often used. Since the projection of the underlying neuronal sources to the EEG electrodes can be modeled as a linear transformation [29], with the appropriate projection, it is possible to recover the activity of specific parts of the brain. This both increases signal quality and provides a convenient signal for neuro-feedback.
Based on the way the filters are extracted, they can be categorized into fixed weight and data-driven [30]. Among the latter, one of the most popular methods is Common Spatial Patterns (CSP) [6,7,31], which has had a great impact on BCIs in the past two decades. In addition to increase signal quality, spatial filters are also important in ensuring that features used for machine learning in a BCI are brainbased. Since spatial filters represent a decoding model of brain activity, in which the time-series of interest is distilled from the recorded data, it is possible under some assumptions to recover an encoding model, which shows how the desired source projects to the sensors. This projection, called a spatial pattern, can then be visually validated to confirm that it represents a current source within the brain. While patterns are typically recovered by inverting the filtering matrix, this is only exact when the spatial filtering matrix is full rank. Therefore, for the computation of spatial patterns, we adopted the method proposed in [29], which is a more general way to derive the spatial patterns from linear filters. We next briefly review the mathematics behind CSP, as it is one of the most common and simple methods for generating spatial filters. As formulated below, CSP aims at extracting the signal sources which maximize the variance ratio between two conditions: where − → f CSP represents the optimal CSP filter component, and C (+) and C (−) represents the arithmetic mean covariance matrix of each condition. Obviously, Eq. (II.10) can be solved by the Generalized Eigenvalue Decomposition (GED) between C (+) and C (−) . The spatial filter matrix with shape of C×K is extracted by selecting the eigenvetors correpsonding to the first K largest GED eigenvalues.
While CSP is usually extracted by solving GED(C (+) , C (−) ), it can also be interepreted in a discriminative view, as described by Blankertz et al [32]: Here the solution of CSP is obtained by maximizing the variance ratios between discriminative and common activity: Thus, the spatial filter matrix of CSP, i.e., where − → f CSP,c is the c-th spatial filter component of the matrix.

III. METHODS
Utilizing the least dimension to achieve the highest discriminability is always the ideal when designing a feature extraction algorithm. Although the features extracted from standard Riemannian methods are of high quality, they are hamstrung by the curse of dimensionality and a lack of interpretability. It is striking that, when reviewing these two factors which impede the application of Riemannian methods, spatial filtering techniques seem to be the remedy. The arguments are two-fold: First, reducing the dimensionality of the covariance matrices decreases computation time drastically. Second, thanks to the associated spatial patterns of spatial filtering, it is possible to verify what aspects of the recorded signal are being used by the classifier. Hence, how to leverage the spatial filtering technique in the standard Riemannian methods becomes an interesting question.
Inspired by this idea, in this section, we first propose a novel spatial filter extraction algorithm in which we approximate a linear function on the Riemannian tangent space points by a set of spatial filters, which render that function much less computationally intensive and also more understandable. We support the proposed algorithm by rigorous mathematical proofs. Moreover, by adopting this approximation idea, a simplified regression-like classification method is also put forward. Subsequently, CSP is proven to be a special case of the proposed tangent space spatial filtering. We validate our theoretical findings experimentally via the validation setup proposed in [21].
Mathematically-oriented readers are invited to begin below; readers more interested in a practical understanding may refer to Section III-D.

A. The Approximation of Standard Riemannian Methods via Spatial Filtering
For the tangent space based Riemannian methods, the decision function (or decision boundary) on the tangent space completely determines the classification accuracy, because the predicted labels are entirely based on the output of decision function. In order to unify spatial filtering and tangent spacebased methods, one option is to attempt to find filters that can preserve this function. For simplicity, we consider linear functions in the tangent space: where − → w is the weight vector on the tangent space andŷ t represents the predicted label from the decision function for t-th trial. One thing that should be noted is that as a constant value, the bias term can be ignored in the above equation since the later proof of equivalence still hold if adding same bias to both sides. Moreover, based on the definition and property of AIRM [25], the inner product on the tangent space can be expressed as the function of manifold points as derived below: where S w is defined as the weight covariance matrix generated via the reshaping of the tangent space weight vector − → w into a symmetric matrix, and C w is the weight covariance matrix re-projected onto the manifold via the exponential map Expm C m (S w ). In addition, we use (·) | Cref to represent the variable lying on the tangent space which is computed based on the reference point C ref and the operation < S 1 , S 2 >| Cref is defined in below lemma: Lemma 1: Inner products between tangent vectors are invariant to affine transformation where S 1 , S 2 are two matrix-formatted tangent vectors on the tangent space computed at reference point C ref . For a full proof please refer to Appendix B-A.
Similarly, the approximated predicted labels from all the manifold points which are passed through a spatial filtering matrix F are as expressed below: thereafter for the convenience of notation and (·) ⊥F represents the matrix after filtering, e.g., A ⊥F = F T AF. Note that a property of the AIRM is that, for full-rank filtering matrices F, the Fréchet mean of the filtered matrices is the filtered mean of the original matrices, i.e. C m ⊥F = F T C m F. After explicitly formulating the true and approximated decision function, the next problem remained to resolve is the extraction of the spatial filter matrix F. The optimal scenario from the perspective of consequence is that this spatial filter matrix F can perfectly reconstruct the decision function. Hence, in the next subsection, we provide mathematically rigorous derivation and proof to find the optimal solution of F.

B. Optimal Spatial Filter Extraction from the Tangent Space
Naively, the goal of spatial filter extraction is to find a filtering matrix that maximally reconstructs the tangent space function, which is shown as follows: where F * K is the optimal filter matrix composed of K spatial filter components from full filter matrix F.
After substitutingŷ t (Eq. (III.2)) and y approx t (Eq. (III.4)) into Eq. (III.5), the objective function of the optimization becomes rather complicated and intractable, even if the cost function in Eq. (III.5) is only the squared loss.
Clearly, the major obstacle for solving this optimization problem lies in the complex formulation of y approx t | F K . Considering that F K is a subset of F, we first focus on the structure of y approx t to see whether it can be simplified in the case that F K is full rank.
By leveraging lemma 1, y approx t is solved by substituting into lemma 1 and we can obtain: (III.7) After explicitly presenting the solution of y approx t , it seems rather sophisticated to compute y approx t . However, we can also notice that in the substitution of Eq. (III.6), S 1 and C ref are constant once a tangent space function is found. Furthermore, if C m and C w can be jointly diagonalized by a properly chosen F, the matrix multiplications in Eq. (III.7) will be remarkably simplified, and an eigenvalue decomposition is no longer needed to compute C , for test points. If the spatial filters F are extracted in such a manner, then y approx t (Eq. (III.4)) are simplified as: where D m is adopted to represent the filtered reference point, which is now diagonal, and D w is the filtered weight matrix.
In addition, if we can further whiten the filtered reference point, i.e., F T C m F = D m ⇒ I, then we can not only simplify Therefore, by properly choosing F, y approx t can be simplified as: From here, we make one major assumption that the filtering matrix F approximately diagonalizes all C t . If this assumption holds , i.e., F T C t F is a diagonally dominant matrix for all t, based on the Gershgorin circle theorem [33] we know that (III.10) where D t represents the diagonal matrix which only contains the diagonal elements of F T C t F. Moreover, since F T C t F is diagonally dominant, then following approximation can be inferred: Therefore, we know: After applying the approximation in Eq. (III.12) into Eq. (III.9), y approx t can be simplified as: represents the diagonal vector of D (·) . We reiterate that one primary assumption in the above simplification is that all C t are roughly jointly diagonal, which is a very strong assumption. However, there is evidence for this in the fact that the projection of the physiological sources in the EEG signal to the electrodes is linear: Since the head moves very little with respect to the electrodes within a session, we can assume that the mixing (and hence unmixing) matrices stay relatively constant, even if the actual variances are non-stationary.
The key step that enables the simplification from y approx is the simultaneous diagonalization of the weight covariance matrix and the whitening of C m . The generalized eigenvalue decomposition (GED) conveniently solves both goals: where the F is named as tangent space spatial filter (TSSF) and D is the corresponding eigenvalues. Importantly, to ensure that Eq. (III.14) will hold, the order of C m and C w in the GED equation cannot be switched. Now, since y approx t can be drastically simplified as long as F are extracted with the GED manner, when looking back to the objective function for extracting optimal filters, i.e., F * K = arg max the last remaining obstacle is the true predicted labelŷ t . We next prove that the equivalence betweenŷ t and y approx t will hold under some conditions, as seen in theorem 1.
Theorem 1: Equivalence between true and approximated decision function and full rank. (III.15) Proof: For convenience, we first list some properties of linear algebra in the tangent space that will be convenient in the proof. All properties are proved in Appendix A.
Furthermore, we would like to quote one significant lemma in which the following equivalence holds based on the property of affine-invariance Riemannian metric. The corresponding proof is shown in Appendix B-B.
Lemma 2: Equivalence of logarithm mapping Since spatial filter F and the eigenvalue matrix D are extracted via the GED(C w , C m ), we have: On the one hand, Eq. (III.8) and (III.9) show that the approximated decision function can be reformulated as: On the other hand, based on the previous proved and quoted lemmas as well as the definitions, the true decision function can be expressed as: (III.19) As indicated by point 2) of the list at the beginning of this proof (refer as List 2. in the following context), we know that Therefore, the results from Eq. (III. 19) can be further derived as: Therefore, as a summary,ŷ t ≡ y approx t , if and only if when F is extracted via GED(C w , C m ) and F is with full rank.
Q.E.D By leveraging this equivalence, the objective function in Eq. (III.5) can be reformulated as: Since the F K is known as the subset of F which is extracted from the GED(C w , C m ), the optimization problem in Eq. (III.21) is then equivalent to the problem of ordering the columns of F.
This problem can be tackled by observing the result in Eq. (III.13), which states that as long as the filtered input data C t ⊥F is roughly diagonal, the linear functions in the Riemannian tangent space can be approximated by linear functions of the log-variances of the filtered data. More importantly, the coefficients of this approximated linear function are simply the log-eigenvalues after the GED is solved. i.e., log( − → d ). Thus, standard techniques for determining the most important variables in a linear regression problem can be used. For simplicity's sake, we use the absolute values of the regression coefficients as markers of their importance to the function. a) Intuitive Explanation: One very common and effective technique across domains is whitening data. By decorrelating the different channels, constructed features are often more distinct and predictive. However, whitening has a fundamental flaw, in that there are arbitrarily many whitening matrices that are possible since the covariance of whitened data is invariant to rotations. One explanation for the finding above is that the GED can be decomposed into a whitening transform and a subsequent rotation. The whitening is with respect to the data, and the rotation is chosen based on the weight matrix. Therefore this technique can be considered a particular choice of data whitening that simultaneously preserves the information of a function in the tangent space.

C. The Classification based on the TSSF
As a feature extraction method, spatial filtering always requires a classifier to deal with the processed features, which often requires an extra optimization step. One method to use the proposed TSSF is like any other feature reduction technique, fitting a classifier after the spatial filtering step. However, another advantage brought by the linear approximation function of TSSF is that this secondary training can be skipped, which means it is possible to further reduce the computational time for TSSF.
From Eq. (III.13), we notice that this function is actually a linear regressor using the log-eigenvalues of the GED as the regression weights. Therefore, we can directly input the filtered data into this regressor to obtain the predicted value. This method is named as one-step classification in our paper, and the ordinary way to classify the data is named as two-steps classification, i.e., filtering and classifying.
Example 1: Tangent Space Spatial Filter -The Generalization of CSP As a data-driven spatial filter, it is inevitable to compare the performance of TSSF and CSP, especially considering the great impact of the latter in the BCI field. Instead of merely comparing the performance of both filters, we also prove that CSP is a special case of TSSF. To begin with the proof of their relationship, let us first review the TSSF.
The solutions of TSSF are obtained via solving the GED(C w , C m ) as described in Section III-B. Moreover, the equivalent solution of eigenvectors can also be extracted by solving GED(S w , C m ), i.e.,: the proof of which can be referred in Section A. Furthermore, when the classifier on the tangent space is specified as the Fisher LDA classifier [34], the weight vector on the tangent space is as expressed in Eq.(III.23). The corresponding proof can be found in [34].
− → s t and µ (a) , a ∈ {+, −} are the within class mean for the tangent vectors and S within is the within scatter matrix.
Under the special case that the S within is equal to the identity matrix I, the weight vector of LDA classifier is simplified as: Based on the reverse operation of vec (·) (Eq. (II.3)), the equivalent formulation of Eq. (III.24) in matrix format is: where S (·) is the arithmetic within-class mean for project points on the tangent space. Moreover, assuming the special situation holds in which the between-class Euclidean mean difference of the covariances is the exponential transform of the between-class Euclidean mean difference of tangent space points, i.e., combining the special LDA classifier with the conclusion drawn from equation(III.22), the solution of TSSF can be further formulated as: In addition, if we further replace the Fréchet mean C m in Eq. (III.27) with arithmetic mean C m , we will have: (III.28) By combining the definition of CSP from the discriminative perspective as described in Eq. (II.11) and the equivalence as shown in Eq. (III.28), we are able to conclude the relationship between CSP and TSSF as: Namely, CSP is the representation of TSSF when LDA is chosen as the classifier on the tangent space, and the within-class scatter matrix is assumed to be the identity. One important caveat is the exponential relationship of class mean subtraction, as shown in the Eq. (III.27), which is not necessarily true.
One related work we would like to mention in this example is [35], in which Barachant et al replaced the arithmetic mean with the Fréchet mean in CSP. One crucial component of our equivalence in Section III-C is the relationship between C (·) and S (·) . We assume that they are related by the exponential transform, but that is likely not true if C (·) is computed as the arithmetic mean of the covariance matrices in a given class, due to the swelling effect [24]. Since the Fréchet mean is a much better proxy of common activities across trials, the proposed Riemannian CSP is a far better approximation of LDA in the tangent space, and Barachant et al also show increased performance and robustness with this alteration.

D. Summary of the extraction and application of TSSF
For practitioners interested in using the proposed TSSF framework, we summarize its procedures in this subsection. Generally, the usage of TSSF can be divided into two stages: how to extract spatial filters and how to use the spatially filtered signals for BCIs. Therefore, the corresponding algorithms are introduced below and summarized into pseudocode separately. To link each algorithm's description with its pseudocode, we adopt the abbreviation that A1-1 denotes the Step-1 of Algorithm 1.
1) Extraction of TSSF: To extract the TSSF, the input data should be bandpass filtered data already epoched into trials, and the choice of the linear model on the tangent space is supposed to be defined beforehand. Subsequently, the covariance matrices are estimated based on the input trialwise EEG signal and their Fréchet mean is computed to use as the reference point for the tangent space projection (A1-1 and A1-2). After finding the Fréchet mean, all covariance matrices are projected onto the tangent space and vectorized into tangent vectors (A1-3 and A1-4). Afterward, by using these tangent vectors the linear model is trained, the weights are obtained (A1-5) and reshaped into a symmetric matrix, and the equivalent weight covariance matrix on the manifold is computed via the exponential transform (A1-6). Next, the fullrank filter matrix of TSSF, as well as the regression coefficients for one-step classification, are obtained by solving the GED problem (A1-7) and sorting based on the absolute value of the logarithm of the eigenvalues in descending order (A1-8 and A1-9). At last, based on the predefined parameter that how many filter components are needed, the first K components of the full and sorted filter matrix are extracted, and the same is done with the regression coefficients (A1-10).

Algorithm 1 Extraction of Tangent Space Spatial Filter (TSSF)
Data: Bandpass filtered trialwise data X ∈ C×N ×T , loss function for linear model L Result: TSSF and regression coefficients with K components: 1. Compute the covariance matrices:

Compute the Fréchet mean:
6. Project weights onto manifold: Obtain the sorted TSSF and regression coefficients: ) ∈ C×1 10. Extract the first K components: 2) Application of TSSF: Once the TSSF are extracted, there are some options regarding how to generate features and use the trained linear model. The first step is to apply the extracted spatial filters onto the trialwise data (A2-1). Subsequently, there are three types of features which can be generated from the filtered data: the log-variance of filtered data (A2-2.a)), the diagonal vector of the logarithm of filtered covariance matrices (A2-2.b)) and the full tangent vector computed based on filtered covariance matrices (A2-2.c)). These three types of features and their descriptions, as well as corresponding abbreviations, are summarized in Table III.1.

Formulation
Description Abbreviation Diagonal of logarithm of covariance matrices Diag. log-cov Logarithm of covariance matrices Log-cov After obtaining the classifiable features, the last step is to classify them. As described in Section III-C, two possible classification algorithms can be applied: one-step classification and two-steps classification. One thing that should be noted is that one-step classification can only be applied to the diagonal elements based features, namely features from (A2-2.b)) and (A2-2.c)). For the one-step classification, the inner product between regression coefficients and features are computed, and the label is taken as the sign of the result in binary classification problems. For the two-steps classification, a second classifier is chosen and fitted with the features from the training set (A2-2.b).i)). After that, test data to be classified can be classified by this trained second classifier.

Algorithm 2 Feature generation and classification
Data: Test trialwise data X ∈ C×N ×T , Second classifier Clf 2 if needed Result: TSSF and regression coefficients: 1. Filter the test data: 3. Return label (several options are provided, only choose one): a). One-step classificaiton (only applicaple for features from 2.a) or 2.b)): . Two-steps classificaiton (applicaple for all features): i). Use a set of − → e from training datasets to fit a second classifier Clf 2 ii). Use the fitted classifier to classify the testing datasets and obtain the predicted label. return predicted label end

E. Experimental Setup
Now that we have shown the theoretical validity of tangent space spatial filtering, we move on to our empirical results. We base our experimental setup on a recently released opensource benchmark, which is known as Mother of all BCI Benchmark (MOABB) [21]. After that, we first fix the experimental paradigm as left-hand versus right-hand motor imagery because the corresponding neurophysiological knowledge, as well as the activated neuronal sources, are well studied. Furthermore, the analysis is restricted to the αand β-bands (8Hz ∼ 32Hz) based on neurophysiological knowledge. Also, all channels are utilized except for the electrooculography (EOG) channel. Based the chosen paradigm, we tried to adopt all eight available datasets in the MOABB, as summarized in Table III.2; however, as indicated in the Table III.2, the dataset BNCI 2014-004 is excluded from the analysis (marked in red in Table III.2) due to having only three electrodes. After the bandpass filtering, the covariance matrices are first estimated from the trial-wise data via the empirical covariance estimator. Subsequently, three algorithms are employed to generate feature: CSP, TSSF, and standard Riemannian tangent space methods. TSSF based features are further subdivided into three types depending on the degree of approximation as summarized in Table III.1, and two methods, namely one-step and two-steps classification as described in Section III-C. The difference between them is the choice of the second classifier: either fitting a new classifier after spatial filter generation (twosteps) or employing the eigenvalues from the GED solution as linear regression coefficients. These classification methods are summed up in Table III.3. For CSP and standard Riemannian features, the L2-regularized SVM classifier is used as a classifier.

Name
First classifier Second Classifier

One-step L2 Regularized SVM
Linear regression based on the eigenvalue of GED (refer to section III-C) Two-steps L2 Regularized SVM L2 Regularized SVM TABLE III.3: Summary of classifiers. For all regularized SVM listed above, the parameters are found by grid search [36].
The motivation of selecting a regularized SVM as the first classifier to generate weight vectors on the tangent space is inspired by the results from [21], in which the combination of regularized SVM and Riemannian methods has been validated as the best among all benchmarked pipelines. For choosing hyperparameters, a grid search [36] is employed to find the optimal value within the range from 0.01 to 100.
For better understanding the difference among the multiple variants of TSSF based methods, CSP, and standard Riemannian methods, we summarize all the above steps into a flowchart ( Fig. III.1). After the prediction, the scoring metric chosen by us is the ROC-AUC (receiver operating characteristic -area under the curve) metric, and these scores are computed via five-fold cross-validation within each session of every data set.
After obtaining scores from different pipelines, the next step is to compare and analyze their performance statistically. In our work, two statistics, the p-value and the effect size, are adopted to compare the proposed TSSF against CSP as well as the full Riemannian approach. The p-value for the onesided test is computed across sessions and subjects but within each data set, the null hypothesis of which is that the median accuracy of using one pipeline is not larger than using another pipeline. The effect size is measured by the standardized mean  difference (SMD) between the accuracies of the two compared methods. Further details about these statistical tests can be found in [21].

IV. RESULTS
To comprehensively assess the performance of the proposed TSSF, three aspects are considered in this paper: the quality of the filtered feature, the interpretability revealed from associated spatial patterns, and the computational time.
Of these three perspectives, feature quality is the only indicator which can be analyzed in a purely quantitative manner through the classification accuracy. Therefore, in this section, we exclusively analyze the performance of feature quality. Although it can be argued that the results of computational time can be analyzed in a quantitive way, i.e., by exhaustively comparing the simulation results of computational time, we more concerned with the theoretical, computational complexity analysis since the latter is more general than the simulation results. Therefore, interpretability and computational time are both left until the discussion, as the results are more qualitative and require more context to be properly interpreted.

Number of filters: 12
(c) Applying the first twelve filters Parameters: effect size t (standardized mean difference) and p-value p are computed within each dataset. In each block, the statistics are computed based on the null-hypothesis that the median accuracy of row method is not larger than the column method. The green block means there exists an overall significant result across all datasets. The red block means there exist contradictory results, i.e., the overall one-tailed results show significance, but the effect size is not positive. Furthermore, the number in parentheses next to the p-values represents that the percentage of datasets in which significance is reported. The meaning of each label can be referred from Table. III As a typical indicator of feature quality, the classification accuracies are chosen to be compared as a way of assessing which features are most informative. In subsequent subsections, we begin with a comparison of all proposed classification pipelines over all the datasets, to see whether any of them consistently outperform the rest. The results are shown in Figure

A. Statistical performance across datasets
We select three typical cases of applying spatial filters: two, six, or twelve spatial filters. By observing Fig. IV.1a we can notice that even when only applying two filters, the p-value of comparison between all TSSF-based pipelines and CSP highly significant, and the effect sizes are moderate. Moreover, in the comparisons with the full Riemannian method, the TSSF Cov 2 step even significantly outperforms the full Riemannian method, albeit with a small effect size (0.23).
When increasing the filter number to 6, as shown in Fig. IV.1b, the performance of all TSSF-based pipelines continues to surpass CSP. Surprisingly, TSSF Var 2 step also shows significantly better results than the full Riemannian method TS AIRM, though again with a rather small effect size (0.08). In addition, performance begins to differ among the different TSSF-based pipelines. After increasing the number to 12 (Fig. IV.1c), although one more TSSF-based pipeline significantly outperforms TS AIRM, the differences among the TSSF-based pipelines also further enlarge.
Observing and comparing these figures from a macro perspective, we can discover several trends: First, CSP is constantly outperformed by all Riemannian based methods. Second, the performances of TSSF-based methods tend to differ from each other, only at large numbers of filters. Third, the difference in performance between one-step and two-steps methods also enlarges as the filter number increases.

B. Statistical performance within each data set
In order to go into more detail on how the various algorithms perform, in Fig. IV.2, we show meta-analyses between some individual algorithms which describe the per-dataset performance.
We first compare CSP against TSSF Var 2 step, as both fit the same secondary classifier. In order to see the benefit of the matrix logarithm-based features as compared to the filtered log-variance features, we further provide the comparison between TSSF Var 2 step and TSSF Cov 2 step. Lastly, we are interested in seeing how the one-step classifier works, and so we also include TSSF Var 1 step. As some datasets only contain 20 or fewer channels, we choose to always use 6 filters. The results are as shown in Fig. IV.2. In the comparison of different spatial filter extraction method ( Fig. IV.2a), the TSSF-based method overwhelms CSP with only one exception. In addition, as shown in the Fig. IV.2b, the features types does not seem to have a significant influence on the feature quality within each data set, even if the log-variance features is with dimension dim = K = 6 and logarithm of covariance matrices features is dim = K(K+1) 2 = 21. Although there exists a controversial fact that an overall significance appears in the comparison across datasets with p = 0.0436, this chance-level p-value is not supported by a significant difference within any individual dataset. In the last sub- figure, Fig. IV.2c, we can see that while two-step classification is significantly better than onestep classification across datasets, this is heavily influenced by only one data set.
C. Performance w.r.t. the number of applied filters in single data set Last but not least, we look at how performance changes as a function of the number of applied filters. As a meta-analysis here results in an enormous number of statistical tests on not very much data, we focus on this section of the analysis on a single dataset. For better reflecting the relationship between accuracy and number of applied filters, we choose the data set Munich Motor Imagery, which has the highest channels numbers (128). The chosen data set is Munich Motor Imagery and the null-hypothesis is that the median accuracy of TSSF-based methods is not larger than CSP. Significance threshold is set as 0.05, as indicated by the black straight line.
From Fig. IV.3 we first notice that the accuracies of all TSSF-based features converge to the performance of the standard Riemannian method with merely four filters while CSP needs around 20 filters to reach a stable plateau. Second, except for TSSF Var 1 step, all other TSSF-based methods constantly significantly outperform the CSP whatever number of filters is used, as shown in Fig. IV.4. Lastly, for all log-variance based TSSF pipelines, their accuracy usually decreases when the number of filters continues to increase. Moreover, this fact can also be observed in other datasets, as indicated in Appendix C-A.
In this section, we have comprehensively compared the quality of the features extracted from various ways, and confirmed two things: that Riemannian methods reliably outperform CSP, and that TSSF can approximate-and sometimes even outperform-standard Riemannian methods. As a spatial filtering method, however, the interpretability is always of the highest significance, especially for online purposes, because it is the only way that we can know whether reasonable underlying neuronal sources are utilized. Moreover, the computational efficiency of the spatial filtering method is also vital because the online BCI system usually has a strict requirement for its computational complexity. Therefore, in the next section, we will further discuss these two aspects.

V. DISCUSSION
We have shown that spatial filters can be extracted from linear functions in the Riemannian tangent space, and further that CSP can be seen as a special case of this general framework. This can be used to render Riemannian methods suitable for online use even in cases of over 100 channels. Moreover, we validate our approaches using over 220 subjects via an open-access toolkit [21] and show that as far as classification accuracy is concerned, this method is statistically indistinguishable from the full tangent space approach on average, and in some cases can significantly improve on it. All in all, the proposed framework allows for the possibility of offthe-shelf algorithms made to work on vector data being used in the case of EEG data to generate spatial filters, eliminating the need for complicated optimization frameworks.
One notable contribution of this paper is the proposal of one-step classification, which further reduces the computational time of the testing stage significantly, and so we begin our discussion there. Subsequently, we analyze the associated spatial patterns of both CSP and TSSF. Afterward, we discuss the signal sources as well as the robustness reflected from these patterns. We end our discussion with several suggestions regarding its usage and finally, a look towards the future work this result implies.

A. One-step classification
While one-step classification relies strongly on the assumption that the input points are roughly jointly diagonalizable, and hence that the proposed approximation holds, we have shown in practice that this appears to be the case for sufficiently small numbers of filters. What this suggests is that certain underlying sources can be extracted by static spatial filters, while others do not correspond to static eigenvectors of the covariance matrices. If few enough filters are chosen, the resulting classifier is very close to the tangent space function, but as more are added, the approximation quality degrades. This explains the results in Fig. IV.3 in which the only classifier whose quality degraded as a function of filter number was the single-step log-variance classifier. Another major benefit to using one-step classification is that it is a better use of training data. Current spatial filtering-based approaches to classifications need to either re-use data for both spatial filter and classifier fitting, or partition training data into disjoint sets, which reduces the quality of both solutions. When the approximation holds, one-step classification is a much more data-efficient solution.
B. One-step classification: Computational complexity analysis and simulation results M H X a I 6 u k Y N 1 E I U P a J n 9 I r e r C f r x X q 3 P u a t O W s x c 4 z + w P r 8 A c A 3 n K 0 = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " 5 g G z R B 9 t N / I 9 M H X a I 6 u k Y N 1 E I U P a J n 9 I r e r C f r x X q 3 P u a t O W s x c 4 z + w P r 8 A c A 3 n K 0 = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " 5 g G z R B 9 t N / I 9 M H X a I 6 u k Y N 1 E I U P a J n 9 I r e r C f r x X q 3 P u a t O W s x c 4 z + w P r 8 A c A 3 n K 0 = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " 5 g G z R B 9 t N / I 9 o q h q h I L / H p T W k m J J 2 3 B p I h R w l A s r G F f C 7 k r 5 n C n G 0 X 5 B 3 4 b g r z 5 5 X Z w e j n 1 v 7 H 9 6 P T w + 6 u L o k T 3 y i o y I T 9 6 Q Y / K e T M i U c P K F f C M / y E / n 1 v n u / H K 7 V t f p x E v y X 7 l b 9 4 6 J t f 4 = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " d G d 2 Q + n w m k 3 z s P 9 O y a i Z Q o q h q h I L / H p T W k m J J 2 3 B p I h R w l A s r G F f C 7 k r 5 n C n G 0 X 5 B 3 4 b g r z 5 5 X Z w e j n 1 v 7 H 9 6 P T w + 6 u L o k T 3 y i o y I T 9 6 Q Y / K e T M i U c P K F f C M / y E / n 1 v n u / H K 7 V t f p x E v y X 7 l b 9 4 6 J t f 4 = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " d G d 2 Q + n w m k 3 z s P 9 O y a i Z Q o q h q h I L / H p T W k m J J 2 3 B p I h R w l A s r G F f C 7 k r 5 n C n G 0 X 5 B 3 4 b g r z 5 5 X Z w e j n 1 v 7 H 9 6 P T w + 6 u L o k T 3 y i o y I T 9 6 Q Y / K e T M i U c P K F f C M / y E / n 1 v n u / H K 7 V t f p x E v y X 7 l b 9 4 6 J t f 4 = < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " d G d 2 Q + n w m k 3 z s P 9 O y a i Z Q

q F D T D d j g P + h x E S 9 A k y z g f B T d s n I k i B Y 1 C c W v 7 U Z j j o O Q G p V B Q N V h
h I e f i m k + g 7 6 D m b s 6 g n J t U 0 X 3 H j G m S G X c 0 0 j n 7 s K P k q b W z N H a V 9 d r 2 q V a T L 2 n 9 A p P j Q S l 1 X i B o s R i U F I p i R m v H 6 V g a E K h m D n B h p N u V i i k 3 X K D 7 l 4 Y z I X r 6 5 O e g 9 6 k d h e 3 o + + f m y f H S j n X y g e y R F o n I F 3 J C T s k 5 6 R J B / n p r X u D t e H f + p r / r v 1 + U + t 6 y Z 5 c 8 C v / j P b b 0 t 4 U = < / l a t e x i t > Considering that one major critique of Riemannian methods in online practice is their inability to scale to high numbers of channels, the computational complexity comparisons between the one-step classification framework and the full Riemannian tangent space method are provided from both the perspectives of theoretical analysis and simulation results. For a better understanding of simulation results, we first start with the theoretical analysis. As seen from ). For high numbers of channels, this can be difficult to do for real-time feedback, and to verify that we ran a theoretical runtime analysis using the Munich Motor Imagery data set, as it has over 100 channels. The simulation results are shown in Fig. V.2. As seen from these results, standard Riemannian methods are slower than both CSP and TSSF based methods. In particular, the full Riemannian methods is 25 times slower than TSSF based methods with similar performance when observing the results from the data set with 128 channels. As for the accuracy comparison, the superiority of TSSF based methods is already validated, as shown in Fig. IV.4 in which TSSF Var 1 step has an overall significantly better classification performance than CSP when using four or six filters.

Time (s) Accuracy Time (s) Accuracy
In summary, by adopting the one-classification, it is no longer impossible to enjoy the robustness and excellent performance of Riemannian methods in an online BCI system with high-dimensional data. One additional advantage is that, by employing fewer features, the model suffers less risk from overfitting.
C. How robust is spatial filter order to artifacts?
Another important aspect of our work is the observation that this procedure allows one to easily validate the relevance of the features that a Riemannian classifier is using. By visualizing the spatial filters, it is easy to ensure that artifactual sources are not included in the classifier, which is of crucial importance when a BCI is used for neurofeedback.
As shown in Fig. V.3, in which only two filters are applied, the TSSF based methods are clearly better than CSP based, especially for S1, S2, S7, S8, S9, and S10. Correspondingly, we can notice that for these subjects, the patterns based on CSP look much more patchy than TSSF based. As for the subjects that both methods have tied performance, i.e., S4 and S6, their spatial patterns seem almost identical to each other.
We would, however, like to better understand where and how components that are not brain-related enter the spatial filters in these two methods. Therefore, we increase the filter number to ten and compare the spatial patterns from two contrasting subjects in order to verify how the methods are robust to artifacts. The chosen subjects are S2 and S4 because as described in [2], they reflect two extremes of artifact contamination. In S2 55% of all trials are contaminated by artifacts while only 6.3% of all trials are affected by artifacts for S4. Therefore, their associated patterns are as shown in Fig. V.4.
Observing the Fig. V.4, S2 Comp 0 and S2 Comp 1 of the spatial pattern of TSSF seems similar to S2 Comp 4 and S2 Comp 6 of CSP. As for the rest CSP patterns of S2, most of them appear to be artifacts while for TSSF patterns of S2, only Comp 8 and Comp 9 look slightly patchy while the rest of the patterns show strong activity around the sensorimotor cortex. Moreover, in S4's patterns, from Comp 0 to Comp 6, the results of CSP and TSSF reflect similar neuronal sources with a slightly different order. However, when looking at the last three patterns of both filters, artifactual sources appear. Overall, the ordering in TSSF is much more informative than that in CSP, although in very low artifact scenarios they are similar.
Therefore, we would like to conclude that the associated spatial patterns reflect more neurophysiologically explainable neural sources in TSSF. In contrast, CSP often gets distracted by artifacts, especially when processing data suffering plenty of contamination, e.g., S2 of data set Munich Motor Imagery. When considering CSP as a simplified LDA based TSSF, this susceptibility to noise directions makes much more sense. When considering the patterns from the low-artifact subject S4, the patterns from both spatial filtering methods are almost identical with each other, in particular for the first several patterns.
D. How many filters lead to optimal performance? By enlarging the feature space, classification accuracy only increases when useful information is encoded within the additional features. For the case of spatial filtering, the most informative features are usually from the first several spatial filters and afterward, the features are no longer as informative as before, as indicated by the spatial patterns shown in Fig. V.4. Therefore, when applying only a few spatial filters, there always exists a positive relationship between the performance and number of applied filters. However, this positive relationship turns into a plateau when further increasing the filter numbers because the additional features are no longer as informative as before, and can even turn negative in cases of overfitting.
This initial positive relationship and the subsequent plateau always raises the question of the optimal number of spatial filters, as the optimal scenario for applying spatial filtering techniques is to use the least number of filters to achieve a given level of performance. We argue that TSSF reliably requires less spatial filters than CSP in order to achieve the same level of performance. The arguments are three-fold: First, CSP usually needs more filters than TSSF to reach the plateau in classification accuracy. For instance, as described in Fig. IV.3, TSSF based methods merely require 3 or 4 filters to reach 95% of its best performance while CSP needs at least 20 components to reach the same level. Although in other datasets, the comparison is not as evident as in Munich Motor Imagery, it is still clear that TSSF based methods converge faster as shown in Appendix C-A. Second, the optimal minimum filter number for TSSF appears to be independent of the number of channels in the dataset, possibly reflecting a true biological set of sources conserved across all the data. For all seven datasets, after using six filters or even less, TSSF already reaches the level of its best performance, while for CSP, this number varies from 6 to 20 and shows a roughly positive correlation with the number of channels. Last but not the least, as the figures in Appendix C-A indicate, the best classification performance of the TSSF methods is higher than of the CSP, and so even a suboptimal number of filters can compare with it.
For a given task, the truly activated neural area should be conserved across subjects and datasets as indicated by the neurophysiological knowledge, and so it is likely that TSSF more reliably extracts the neurophysiological signals independent of variables like setup and channel number. More specifically speaking, as shown in Fig. V.4, the associated spatial patterns when using ten filters of both methods looks pretty similar when processing high-quality data (only 6.3% trials are contaminated for S4). However, for S2 in which 55% trials    The associated spatial patterns of data set Munich Motor Imagery when applying the first ten spatial filters for S2 and S4. The major conclusion drawn from these two figures is that the TSSF based patterns present a better ordering comparing to the CSP which means the CSP tends to be affected by artifacts.
are contaminated, CSP ranks the two most important neural sources as the fifth and seventh components, while TSSF ranks them at the top two components. From a machine learning point of view, this result is unsurprising if one considers CSP to be a simplified LDA, which does not take second-order information into account, and that our TSSF is done using an SVM in the original tangent space. However, what is surprising is the conservation of the number of necessary sources across the various datasets -an SVM is sparse in terms of the number of support vectors it uses to build its projection, but these support vectors live in the tangent space. There is no obvious reason that the spatial filters derived from them should also be only informative in 6 out of 128 dimensions.
E. The origin of the robustness and discriminability of Riemannian methods As shown in Appendix C-A, for all datasets, the accuracies of log-variance based methods usually first rise until reaching a peak and begin to drop after that as the filter number increases. Practically speaking, this is a confirmation of a well-known fact within spatial filtering: more filters are not always good. When approximating the Riemannian tangent space function, we find a similar trend: with very few filters, it is possible to outperform the full Riemannian tangent space classifier sometimes, as evidenced by Sections IV-A and Fig. IV.1. However, in contrast to CSP, increasing the number of filters does not degrade performance as visibly. This robustness to the number of filters is yet another benefit of TSSF.
We might, however, ask why this particular feature exists. Intuitively, in the sense of accuracy, the superiority of covariance-based features should be thanks to the crosschannel covariances. This argument is indeed correct; however, not in the usual way. We normally expect cross-channel power terms to contribute significantly to the decision function on the tangent space such that the Riemannian methods are rather robust. However, this intuitive inference cannot explain the success of diagonal elements-based TSSF features, i.e., TSSF Cov 1 step, TSSF Var 1 step and TSSF Var 2 step, which support the hypothesis that the off-diagonal terms (correlations between spatially filtered signals) are neglectable in comparison with the diagonal entries.
After excluding the possibility that off-diagonal terms influence the robustness through direct contribution to the decision function, we might ask how they affect the matrix logarithm operation. Luckily, this can be investigated via comparing the accuracies from TS Cov 2 step, TS Cov 1 step and TS Var 1 step. We begin by reviewing the three compared feature spaces: 1) TS Cov 2 step: vec Logm C m ⊥F ( C t ⊥F ) ∈ C(C+1) 2
By observing the , TS Cov 2 step seems a bit better than TS Cov 1 step, it is still impossible to assert that one pipeline is consistently superior to another, in particular considering the poor data quality of these two datasets. This phenomenon also partly validates our previous claim that the contribution from offdiagonal elements to the decision function is neglectable.
When comparing between the accuracy of TS Cov 1 step and TS Var 1 step, the differences arise solely from the distinct way the features are computed, as the linear regression coefficients are both the GED log-eigenvalues in both cases. The discrepancy between diag Logm I ( C t ⊥F ) and log diag( C t ⊥F ) comes from whether the C t ⊥F is diagonal, i.e., the filtered cross-channel power terms are all with zero or not. If they are, both features are equivalent to each other. If not, log diag( C t ⊥F ) is the approximation of diag Logm I ( C t ⊥F ) and the approximation error depends on how close the diagonal elements of C t ⊥F are to its eigenvalues.
Therefore, combined with the Gershgorin circle theorem [33], we know that the differences in classification accuracy and robustness between the two features are related to the filtered cross-channel power terms. In other words, the smaller the cross-channel power of C t ⊥F are, the more discriminative and robust the log-variance based features log diag( C t ⊥F ) will be.
In conclusion, the off-diagonal terms influence the robustness and discriminability of Riemannian methods via affecting the logarithm operation, instead of via a direct contribution to the decision function. In other words, if the correlations between filtered signals are non-zero, then source power is not equal to the log-variance, because the eigenvectors of C t ⊥F are no longer the standard basis.

F. Suggestions for the usage of TSSF
After exhaustively benchmarking the TSSF based methods against conventional algorithms, we provide several suggestions to the reader who would like to use the TSSF method: 1) Use the empirical covariance estimator when possible: Many Riemannian methods recommend regularized covariance estimators such as the Ledoit-Wolfe estimator. However, since diagonal loading cannot be added during online use (as the spatially filtered variances are used), high regularization runs the risk of degrading the approximation. 2) Choose the Riemannian metric carefully: Most of the theoretical analysis is based on the assumption that the affine-invariant Riemannian metric is utilized. A similar property remains to be validated for other Riemannian metrics. 3) Choice of features: as discussed before, there are three types of features to be adopted, vec Logm C m ⊥F ( C t ⊥F ) , diag Logm I ( C t ⊥F ) and log diag( C t ⊥F ) . As computational complexity of the feature decreases, the signal to noise ratio is affected negatively. Therefore, the concrete choice of features highly depends on the experimental environment, e.g., number of channels, sufficient computing sources, etc. But based on our experience, diag Logm I ( C t ⊥F ) is a good candidate for a task which demands high accuracy, while log diag( C t ⊥F ) might be a better choice for a strict real-time requirement.

G. Future Work
This paper covers the fundamental concept and proof of spatial filtering via the tangent space. However, there are still many interesting directions worthy of being explored later on. We will discuss these possible directions from two levels, the extension of the scientific idea and the extension of these proposed algorithms: 1) Unsupervised dimensionality reduction and multi-class TSSF: TSSF based methods usually perform rather well with few components, which implies there exists an optimal subspace of low dimensionality where the brain projects. In this paper, this optimal subspace is found in a supervised fashion. Whether we can leverage this idea to find an unsupervised dimensionality reduction will be an interesting open question. Moreover, it will also be fascinating to further investigate whether the TSSF can be applied to multi-class classification considering the potential of TSSF in practical application. 2) TSSF in comodulation manner: The proposed TSSF is currently extracted in a regression-like manner because it is derived from the inner product between a weight vector and a data vector. Whether we can also leverage continuous information encoded within the target variables, just like the source power comodulation (SPoC) method [9], will also be very worthwhile to explore. 3) Other choices of the first classifier: Although the SVM based TSSF methods have indeed achieved a satisfying performance as presented in this paper, no matter from the perspective of classification accuracy or the interpretability, we will not assert that regularized SVM represents a global optimum. On the contrary, it will be very interesting to explore the influence brought by the selection of the classification algorithm on the tangent space. 4) Multiple frequency bands: In this paper, the features are extracted from the joint µ and β band, i.e., from 8Hz to 32Hz. Hence, it remains a mystery whether the ampler information induced by the filter bank TSSF will outperform current TSSF based methods, just like the enhancement of filter bank CSP [38] comparing to CSP.

VI. CONCLUSION
Thanks to its impressive performance, the Riemannian manifold classification framework has seen an upsurge in interest in recent years. Historically, it has been hampered by various issues, namely that Riemannian methods scale poorly to highdensity setups and are somewhat difficult to introspect.
To tackle these obstacles, we have proposed a set of methods based on the combination of spatial filtering techniques with Riemannian methods, because the former possess nice properties which Riemannian methods are lacking, such as low dimensionality and the visualization of signal sources (or associated spatial patterns). In order to further simplify the computation of the proposed idea, we have proved the rationality behind several variants of Riemannian features based on the approximation of the decision function on the tangent space. Moreover, we have also put forward one-step classification in order to simultaneously find a classifier and spatial filters. We hope that this work will allow for the expansion of Riemannian geometry-based methods into more BCI applications, and that it might spur further development in both application and theory for this sort of interface.

APPENDIX A EIGENVECTOR INVARIANCE AFTER LOGARITHM MAPPING
The generalized eigenvalue decomposition (GED) is defined as below: where F represents the generalized eigenvectors and D is the generalized eigenvalue matrix. For the convenience, we denote the solutions of this GED problem as: Since C m is a SPD matrix which means it is invertible, equation (A.1) could be expressed as below by left multiplying C −m for both sides: By leveraging the SPD property, we can decompose the C −m into C − m 2 C − m 2 , which is as proved below: can be proved as equivalent to identity matrix I. Hence, based on these ingredients derived by SPD property, equation (A.3) is further written as: Note that as C −m C w is not symmetric, its eigenvalue decomposition is not constrained to be orthogonal. Moreover, since C − m 2 C w C − m 2 is symmetric, the eigenvectors are orthogonal basis, which could be formulated as below: On the other hand, ED(C − m 2 S w C − m 2 ) can be solved as: Therefore, as a conclusion, the eigenvectors of GED problem keep invariant after logarithm mapping as long as the projecting matrix is not reference matrix.

APPENDIX B PROOF OF LEMMAS
A. Lemma 1 Lemma 1: Invariant inner product between tangent vectors after affine transformation (B.1) Proof: < S 1 , S 2 >| Cref represents the inner product between tangent vector S 1 and S 2 . Besides, these tangent vectors are projected on the tangent space based on the reference point C ref . For more details about the proof, please refer to section 4 of [39] and section 3 of [25].
Q.E.D. Proof: Based on the SPD property of matrix A, we can apply ED to A and obtain: Moreover, combining above results with V T V = I, we have