On robust parameter estimation in brain–computer interfacing

Objective. The reliable estimation of parameters such as mean or covariance matrix from noisy and high-dimensional observations is a prerequisite for successful application of signal processing and machine learning algorithms in brain–computer interfacing (BCI). This challenging task becomes significantly more difficult if the data set contains outliers, e.g. due to subject movements, eye blinks or loose electrodes, as they may heavily bias the estimation and the subsequent statistical analysis. Although various robust estimators have been developed to tackle the outlier problem, they ignore important structural information in the data and thus may not be optimal. Typical structural elements in BCI data are the trials consisting of a few hundred EEG samples and indicating the start and end of a task. Approach. This work discusses the parameter estimation problem in BCI and introduces a novel hierarchical view on robustness which naturally comprises different types of outlierness occurring in structured data. Furthermore, the class of minimum divergence estimators is reviewed and a robust mean and covariance estimator for structured data is derived and evaluated with simulations and on a benchmark data set. Main results. The results show that state-of-the-art BCI algorithms benefit from robustly estimated parameters. Significance. Since parameter estimation is an integral part of various machine learning algorithms, the presented techniques are applicable to many problems beyond BCI.


Topical Review
Original content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI. the application to structured data, i.e. data that can be divided into meaningful units consisting of groups of samples.
One example of structured data are pooled data sets. Here samples coming from different sources (e.g. recordings sites, subjects) make up meaningful units which may largely vary in quality. In this case not only individual samples may be regarded as outliers, but all data from a specific recording site or subject may need to be discarded if this data source systematically biases the parameter estimation.
Another example of structured data are EEG recordings from a brain-computer interfacing (BCI) (e.g. [5,6]) experiment. In these experiments subjects are asked to perform certain tasks such as motor imagination over a limited period of time (i.e. a trial) and the mental state present in the EEG is decoded in realtime. Each trial consists of a few hundred EEG samples and is a meaningful unit in the data, because it indicates the start and end of a task. This structure naturally leads to a multi-scale definition of outlierness which is illustrated in figure 1. Two types of outliers can be identified in this example. First, the sample at the beginning of trial 1 has a significantly larger value than all other samples in the data, thus it can be easily identified as outlier sample. Second, also trial 4 can be regarded as outlier, because it lacks the high-variance response in the middle of the trial which is present in all other trials. However, without the structural information one can not identify this trial as outlier, because its samples are not different from the other samples in the data set. This example shows that by grouping samples into larger units one can identify outliers which are fundamentally different from individual outlier samples. Such outlier trials can only be identified after aggregating the information within each trial and comparing the trial distributions (or resp. summary statistics). As shown later sample-based estimators such as [1][2][3][4] are not robust 6 to these second-level outliers.
A proper and robust analysis of EEG data in general and BCI data in particular is a necessary prerequisite from the data analysis side for moving experiments out of a lab environment for general tasks of man-machine interaction [7][8][9][10][11][12][13][14]. While this review is driven by the general motivation to set the scenes for BCI or more general neuroscience experiments beyond the lab, we will focus here on a selected subset of mathematical aspects of this challenge, hoping that it will be of use for the community. Specifically, we will focus on the class of minimum divergence parameter estimators and will derive novel mean and covariance matrix estimators which are tailored to structured data, demonstrating that a multi-level treatment of outliers is important when estimating parameters in structured data. Naturally, we will draw from own work, hoping that we have appropriately maintained the balance to the literature, providing the necessary pointers for further reading.
The remaining paper is organized as follows: at first, we briefly review robust methods in BCI. Then we formulate parameter estimation as divergence minimization problem and show that using particular classes of divergences results in robust estimates. In section 4, we present the minimum βdivergence estimators for the Gaussian and Wishart distribution model and discuss the application to structured data. In section 5 we investigate the advantages and limitations of our estimators using simulations and discuss their relations to state-of-the-art estimators. We evaluate and compare the estimators on a large BCI data set with 80 subjects. Finally, section 6 concludes this paper with a discussion and an outlook on future work.

Robust methods in brain-computer interfacing
EEG signals are often affected by electrical sources which are unrelated to brain activity, but produce very larger potential differences. These artifacts can be of physiological origin, e.g. eye blinks or muscle contractions, or be due to technical reasons such as loose electrodes or power grid noise. In either case artifacts contaminate the recorded EEG signals and heavily bias the estimation of parameters at all stages of the BCI processing pipeline. Since poorly estimated parameters do not well represent the underlying neural processes, artifacts negatively affect BCI performance. This section provides an overview of recent techniques which tackle the outlier problem in BCI.

Artifact removal
Removing artifactual components from the recorded data is a common strategy to robustify BCI systems. Most of the artifact removal methods exploit the assumption that artifactual and neural activity are independent. These algorithms first decompose the EEG signal into independent source comp onents (ICs) and then discard the artifactual ICs. Especially, ICs related to ocular artifacts can be easily identified with this approach. Other artifact types produce ICs that are often less consistent, but can be distinguished from neural activity. Various methods have been proposed for automatic or semi-automatic identification of ICs representing EEG artifacts [15][16][17][18][19][20][21]. An automatic and adaptive artifact detection method based on Riemannian geometry has been proposed in [22]. Surveys on artifact removal can be found in [23,24] and the effects of different artifact types on motor imagery BCI are investigated in [25,26].

Trial and channel selection
An alternative to discarding artifactual ICs is to focus on the identification and removal of artifactual trials and channels. This approach is advantageous when individual channels are unreliable (e.g. loose electrode) or when only few trials are contaminated by artifacts (e.g. by subject movements). The authors of [27] proposed a sparsity-aware method to eliminate low-quality trials from a BCI data set. Other researchers used M-estimators [28,29], robust divergences [30] and Riemannian geometry [31] for robust estimation of covariance matrices. Since these methods implicitly perform some kind of trial weighting, they also reduce the impact of artifactual trials. The identification of reliable and informative channels is a topic which has received a lot of attention in the BCI community. Various techniques have been suggested to identify the optimal channel configuration [32][33][34][35]. Sparsity enforcing methods (e.g. [36,37]) have also been successfully applied in this context.

Robust spatial filtering
Common spatial patterns (CSP) [38,39] is a popular algorithm for optimizing spatial filters in motor imaginary BCIs. Since the algorithm relies on class covariance matrices which have to be estimated on the calibration data, it can be severely affected by artifacts. Trial and channel selection can significantly improve the performance of CSP in the presence of outliers. Another approach to increase robustness, especially in small-sample settings, is based on regularization of the covariance matrices [40][41][42][43][44]. Researchers have also formulated CSP as a maxmin optimization problem [45], proposed a CSP variant based on Student's t-distribution [46], applied trial pruning [47] and used generalized norms [48] to robustify the algorithm. Other work increases robustness by adding regularization to the CSP objective [44,49,50] or by formulating the algorithm as divergence maximization problem [51,52] and utilizing the robustness property of particular divergences.

Nonstationarity & robust classification
A significant fraction of errors in BCI can be attributed to the nonstationarity of the EEG [53][54][55] which leads to a changing feature distributions and compounds the classification problem. Various adaptation strategies (e.g. [56,57]) have been proposed to cope with nonstationarity in BCI. Researchers have also tackled the nonstationarity problem by regularizing the spatial filters [50,58] towards stationarity, i.e. by tradingoff the discriminativity and stationarity of the extracted features, and by applying an importance-weighted covariance estimator [59]. Other approaches project the signals into a stationary subspace prior to spatial filter computation by using the stationary subspace analysis (SSA) algorithm [60,61], it's geometry-aware extension [62] or a methods based on principal components [63]. The authors of [43,64] jointly perform feature extraction and classification, others [65][66][67] directly perform classification on the manifold of covariance matrices. Shrinkage is a common approach to robustify the covariance matrix estimation and the linear discriminant analysis (LDA) classifier [68][69][70]. Recent works apply deep neural networks [71][72][73][74] for classification of EEG motor imagery signals. These powerful nonlinear methods have recently become interpretable [74][75][76], which is an important aspect in BCI research because it allows to make sure that the model relies on neurophysiological features.

Parameter estimation based on divergence minimization
In the following we will lay out the basis for robust estimation methods that implements a further direction beyond nonstationarity and regularization, essentially providing the basis for all discussed algorithmic directions: if the parameters (e.g. mean, covariance matrix) are estimated better, many of the variants discussed above can improve.

Minimum divergence estimator
In parameter estimation a common assumption is that the observations D = {x i : i = 1 . . . n} come from an underlying statistical model q with unknown parameter θ * . A standard procedure to estimate this parameter is to maximize the log-likelihood function L(θ | D) of the parameter given observations In an alternative formulation of parameter estimation, also known as the minimum disparity estimation or minimum divergence estimation (MDE) [4], one aims to minimize the divergence D( p q θ ) between the empirical distribution p of the observations and the model distribution q θ . Note that a divergence [77] is a non-negative measure from information geometry (see [78,79]) used to quantify the disparity between two probability distributions. A divergence is in general asymmetric and has value zero iff the distributions coincide. When using a specific divergence, namely the Kullback-Leibler divergence 7 , the minimum divergence estimator coincides with the maximum likelihood estimator (MLE) [4], i.e.
The formulation of the parameter estimation problem in terms of divergence minimization has one important advantage, namely it allows to impose additional properties such as robustness on the estimator by using specific divergences [52,80]. For instance, in the case of β-divergence (with β > 0) one can prove [81] increased robustness of the estimator, i.e. reduced vulnerability to outliers in the data. Note that for β −→ 0 the β-divergence coincides with the KL-divergence and loses its robustness property. Beta divergence is an instance of a general class of divergences termed Bregmann divergences [82] and has many interesting properties (see [77] for more details). Minimum divergence estimators have also been applied with other measures of disparity such as Hellinger distance [83], power divergences [84] or γ-divergence [85]. Note that although we limit our discussion to β-divergence in this paper, the proposed ideas are applicable to all other measures of disparity in probability distributions.

Iterative algorithm
The minimization of a divergence D( p q θ ) between the empirical and model distribution with respect to parameter θ is in general a non-convex problem and can be solved (up to local optimality) iteratively by using a fixed point algorithm. For a specific class of divergences, termed Ψ-divergences, the estimating equation reduces to (see [81,86] denotes the expectation over the whole input space. Ψ is assumed to be monotonically increasing, convex and differentiable scalar function. Note that equation (3) is related to the update equation of M-estimators [1] and the parameter θ (k+1) is determined iteratively as solution of equation (3) starting from an initial value θ (0) . For the function the fix point equation (3) minimizes the β-divergence between the empirical and model distribution as Ψ-divergence reduces to β-divergence (see [81] for more details). The steps of the MDE can be summarized as follows. First, we initialize the param eter θ (0) by either a random value or the value obtained when applying the MLE. Then, we use equation (3) to iteratively compute the parameter θ (k+1) . Note that fix point equation (3) will be different for different underlying models q θ and Ψ-functions. In this work we use the Gaussian and Wishart model for q θ and the Ψ-function defined in equation (4). We stop the iteration after k max steps or when a stopping criterion, e.g. relative change in estimate below a threshold, has been reached. Note that this algorithm converges to a local optimum [81], thus several restarts may be required in practice. Among the several restarts one has to select the solution, e.g. by preferring parameters with a specific property such as minimum determinant, by using crossvalidation, or by applying stability related selection criteria. The joint estimation of several parameters such as mean and covariance matrix can be performed easily by keeping one parameter fix at a time.

Robustness property
The goal of robust estimators is to reliably estimate the parameter of the underlying model when data is heavily contaminated. Formally, we aim to estimate the density qθ(x) with θ ≈ θ * while observing data generated from where (x) is the contamination probability density and q θ * (x) is the true probability density. A common assumption is that an outlier sample x out has a very low probability to be generated by the true model, i.e. q θ * (x out ) ≈ 0. This implies that the log-likelihood term becomes extremely small for this sample, i.e. (x out ; θ * ) ≈ −∞, consequently the maximum likelihood method will not estimate the true parameter θ * , but a parameter θ with a significantly larger log-likelihood term (x out ; θ * ) << (x out ;θ) for the outlier sample. Instead of ignoring the outlier x out , the maximum likelihood estimator tries to compensate for its very small log-likelihood term. Thus, the outlier introduces a huge bias in the estimation. The authors of [81] have shown that minimizing βdivergence is equivalent (up to a normalizing constant) to max- ). Since Ψ β is an exponential function (for β > 0), it reduces the influence of outliers to zero, i.e. Ψ β ( (x out ; θ * )) ≈ 0. This property makes the minimum β-divergence estimator very robust, because extreme unlikely samples are being effectively discarded. For more formal discussion of robustness we refer to [1,4,81,87].

Robust parameter estimators for structured data
Samples in a data set are often naturally grouped into meaningful units. This type of structured data can be analyzed in 7 The Kullback-Leibler divergence between distribution p and q is defined as two ways, namely with respect to the individual samples or the groups. Consequently, we can define robustness with respect to both levels of analysis. Figure 2 visualizes the difference between sample-level and group-level estimation. In the former case parameters are estimated directly from the samples, whereas in the latter approach, parameters are estimated from summary statistics that have been computed from the groups. In the following we present robust mean and covariance matrix estimators for both the sample-and the group-level view. We assume that all samples except outliers are generated from a Gaussian distribution.

Sample-level estimators
When estimating parameters on sample-level we discard structural information in the data and assume that all observations come from a Gaussian distribution q with unknown parameters µ * and Σ * . For minimum β-divergence estimator in such data generation model, the iteration formula in equation (3) reduces to Note that µ (k) and Σ (k) stand for the parameter estimates in kth iteration step and is a factor downweighting the influence of outlier samples x i . From the formula we can see that if the sample x i is an outlier, i.e. it is very unlikely that it has been generated by a Gaussian with parameters µ (k) and Σ (k) , then its influence on the update of the parameters is very small due to vanishing ψ β ( (x i ; µ (k) , Σ (k) )). Since ψ β ( (x i ; µ (k) , Σ (k) )) is a monotonically decreasing function (for β > 0), it limits the influence of extreme outliers. Note that for β → 0 these estimators reduce to the maximum likelihood estimators μ = 1 , Σ (k) )) = 1. As mentioned before we can estimate both parameters simultaneously by alternating equations (6) and (7). Note that this estimator has been proposed in [81]. We refer to it as Gaussian-MDE or G -MDE.
The middle panel of figure 3 visualizes the downweighting effect of G -MDE. We sample 10 trials with 250 samples each (circles) from a distribution with (almost) the same covariance matrix and 1 trial from a distribution with completely different covariance matrix (crosses). The thick black line represents the estimated covariance matrix at each iteration of the algorithm and the color represents the weight ψ β (see equation (8)) assigned to each sample. Note that red color stands for small weights and gray color represents weights close to 1. One can see that as the outlier samples are more and more downweighted the estimated covariance matrix captures the structure of the clean data. The speed of this downweighting depends on the initialization value and the β parameter. The top panel of figure 3 visualizes the estimation error (log scale), i.e. Frobenius norm between estimated and true covariance matrix, at different iterations. It should be noted that some of the samples generated by the outlier trial (crosses) are not downweighted by G -MDE at convergence point 8 , i.e. still have gray color. In our experiment the G -MDE with β = 0.1 requires 25 iterations on average (100 repetition) to reach this point.

Group-level estimators
In structured data sets it may be advantageous to downweight groups of samples, e.g. outliers trials, rather than individual samples. To this end, we first compute summary statistics for each group, and treat them as second-level samples. Summary statistics should be chosen based on the assumption on the distribution of each trial. Since we assume that non-outliers are i.i.d. Gaussian, a natural choice is the sufficient statistics for Gaussian, i.e. sample average and sample covariance. To obtain group-level robust estimators, we apply equation (3) to each of the summary statistics. The sample mean again follows a Gaussian distribution, and therefore, we can directly apply equation (6) to estimate the mean parameter from the second-level samples where m is the number of groups and z j is the average of all samples in group j. Two ways of robustly estimating parameters from data: (i) direct estimation of parameters from the samples and (ii) estimation of parameters from summary statistics that have been computed from the samples and capture the structural information in the data. 8 We assume that the estimator converges if Σ (k+1) −Σ (k) On the other hand, we cannot use equation (7) to estimate the covariance parameter from the second-level sample covariances. In the following we derive an update rule for the grouplevel estimator for the covariance parameter. It is known that the sample covariance follows the Wishart distribution [88], defined as is the scatter matrix 9 and Γ D is the multivariate gamma function defined as Thus, in order to robustly (wrt group-level outliers) estimate a covariance matrix we compute the scatter matrices S j ∈ R D×D : j = 1 . . . m for each group and treat them as samples of an unknown Wishart distribution with parameters Σ and ν. Note that Σ denotes the true covariance matrix which we want to estimate and ν (under the assumptions that the samples are i.i.d.) equals the number N (or N − 1 if mean is subtracted) of samples within a group (which we assume to be fix). The maximum likelihood estimator for the Wishart distribution iŝ or equivalently it is the average covariance matrix. We robustly estimate a covariance matrix Σ from the scatter matrices S j of trials j = 1 . . . m by minimizing β-divergence using the following iterative formula is a factor downweighting the influence of outlier trials and to indicate that these scatter matrices depend on µ (k) and µ (k+1) , respectively. For β → 0 this estimator gives the maximum likelihood solution in equation (13). As before we may simultaneously estimate the mean and covariance matrix parameter by alternating equation (9) (which depends on Σ (k) ) and equation (14)  (which depends on µ (k+1) through the estimation of the scatter matrices S (k+1) j ). We refer to this novel minimum divergence estimator as Wishart-MDE or W -MDE. The derivation of the estimator can be found in the appendix.
The lower panel of figure 3 visualizes the downweighting effect of W -MDE and color encodes the value of ψ β from equation (15). As before our algorithm arrives at a good estimate of the true covariance matrix by downweighting the outlier trial. In contrast to G -MDE the proposed estimator downweights all samples from the outlier trial (crosses). Furthermore, it needs only 9 iterations on average with β = 0.001 (G -MDE requires 25 iterations) to reach the convergence point, i.e. the point where the relative change in Frobenius norm between the two consecutively estimated covariance matrices is less than 10 −8 . At the point of convergence the covariance matrix estimated by W -MDE is significantly closer to the true covariance matrix (in terms of Frobenius norm) than the covariance estimate provided by G -MDE (see bottom right panel in figure 3). We recomputed the results for a range of β values, but did not observe any qualitative change; in all cases W -MDE was the more efficient and more precise estimator.
Finally, we would like to note that other approaches to robust group level parameter estimation exist. For instance, instead of the formula in equation (9) one could also rely on trimming and use the median in order to compute a robust average. Also the covariance matrices could be naturally fitted into a Karcher averaging if a covariance matrix is estimated for each epoch and the Riemannian geometry is concerned. An empirical evaluation of such averaging has been performed in [89].

Combined estimator
The W -MDE uses scatter matrices S which are computed for each trial (see equation (11)). Of course outlier samples may negatively affect the estimation of these matrices. A robust estimator which considers both, sample and trial outliers, can be obtained by applying G -MDE for computing of the scatter matrices for each trial and W -MDE for computing the final covariance matrix. We will refer to such estimator as WG-MDE. For simplicity we use the same β parameter for both estimation steps (G -MDE and W -MDE), however, a separate parametrization of both steps may be beneficial in practice.

Use in practice
In the discussion so far we have assumed that the parameter of the Wishart distribution ν equals the number of samples used to estimate the scatter matrix (sample size). However, this assumption only holds if the samples are i.i.d. which is often not the case in real-world data sets, e.g. in EEG recordings. If samples are correlated then the parameter ν should be set to the effective sample size [90,91] which is smaller than the number of samples within a group.
Another practical recommendation concerns the computation of the ratio of Gamma functions in γ. A naive computation of this ratio leads to numerical problems because of the very high values of the Gamma function. A common trick which is applied to stabilize the computation is to log-transform the ratio, compute the difference of the log terms and transform back via an exponential map.
Depending on how the scatter matrices and mean parameter are computed, there exist multiple variants of the W -MDE. If data is sampled from a zero mean distribution, then there is no need to perform mean estimation at all. In this case µ should be set to 0 and equation (14) should be applied iteratively. If the mean parameter is assumed to be same across all groups, then we recommend to compute the scatter matrices of groups j as describe above, namely . If different groups have different means, then one should compute the scatter matrices of group j as where z j denotes the group average. In both cases we may alternate between equation (9) and equation (14) or only apply the latter formula iteratively. The choice of the right W -MDE variant largely depends on the problem.

Experimental evaluation
In the following we compare the performance of four parameter estimators, namely sample estimator (SE), minimum covariance determinant estimator (MCDE) 10 [2], G -MDE and W -MDE, using simulations and evaluate these estimators on two motor imaginary BCI datasets. Our results show that state-of-the-art BCI algorithms largely benefit from robustly estimated parameters and that, in some cases, group-level estimation is clearly advantageous.

Single outlier simulation.
In the first experiment we evaluate the downweighting effect of the three robust estimators, MCDE, G -MDE and W -MDE, when adding one outlier trial to a data set consisting of 20 clean trials with 100 samples each. Note that the clean data is generated from the same distribution as the circles in figure 3. After application of the three robust estimators we display the ratio i.e. the weights assigned to the samples x out of the outlier trial relative to the weights assigned to the samples of the clean trials x clean . Note that for MCDE these weights are either 1 (if the sample is among the selected h samples) or 0 (otherwise). For W -MDE we show the corresponding trial weights.
Note that a small ρ value means that the outlier trial has been effectively discarded. In the experiment we add outliers by (i) rotating and (ii) uniformly scaling the true covariance matrix. Figure 4 visualizes ρ(α, σ) for a range of rotation α and 10 MCDE finds h n samples which have a covariance matrix with the lowest possible determinant, thus MCDE resists (n − h) outliers. We refer the reader to [92] for a critical discussion on MCDE.
scaling σ parameters. Note that black color represents small and white color large ρ values. The optimal robust estimator is shown in the left panel. It assigns ρ = 0 (discards outlier trial) to all scale and angle combinations except the true distribution (scale 1 and angle 0). One can see that all three estimators downweight the outlier trial if its covariance matrix is much larger than the true covariance matrix or if it is rotated. However, only the estimator using the Wishart model identifies the trial with significantly smaller variance (scale < 1) as outlier. Both MCDE and G -MDE do not identify this trial (or more precisely its samples) as outliers even when the covariance matrix is largely rotated. Since the variance of samples coming from this outlier trial is significantly smaller than the variance of the clean data, the samples are within the range of the clean data (even when the outlier trial has a rotated covariance matrix), thus they are not identified as outliers on the sample-level. Only the trial information makes these samples distinguishable from the clean data. This effect of lying within the region of clean data is also responsible for the fluctuations (smearing effect) in the maps produced with MCDE and G -MDE. Since some proportion of samples (coming from the outlier trial) will always lie in the range of clean data, their ψ β values will be relatively high as they are no outliers according to the sample-level view. Therefore the corresponding ρ value will be significantly larger than zero. This effect can be also seen in figure 3 where the samples which come from the outlier trial (crosses) but lie within the range of the clean data stay gray, i.e. are not downweighted by the estimator. On the other hand when applying W -MDE the outlier trial will have very small ψ β value irrespectively of whether its samples lie within the range of clean data or not. Thus, the corresponding ρ value will be close to zero. This results can be observed for a range of β parameters.

Multiple outliers simulation.
In a second experiment we sample 50 trials with 100 samples per trial from a 10-dimensional zero-mean normal distribution. The samples come either from a clean distribution or from an outlier distribution which is a scaled and rotated version of the former. We investigate two ways of adding outliers (i) we add outliers sample by sample (sample-level), i.e. for each sample we throw the dice whether it comes from the clean or outlier distribution, or (ii) we sample a whole trial from the outlier distribution (group-level), i.e. the decision whether a trial comes from the clean or outlier distribution affects all samples of the trial. Furthermore, we use two different scales for the outlier distribution, namely scale σ = 0.01 and scale σ = 100. The clean covariance matrix is Σ clean = V clean D clean V clean with V being a random rotation matrix and D being a diagonal matrix sampled from the uniform distribution. The covariance matrix of the outlier distribution Σ outlier = σV outlier D outlier V outlier has the same form but it is scaled by σ. Note that Σ outlier is sampled independently of Σ clean and is not fix across trials, i.e. outliers from different trials may come from different outlier distributions. Figure 5 displays the results for the different estimators. The y-axis shows the log scaled error measure which is the distance (Frobenius norm) to the true covariance matrix Σ clean . The different lines represent the median error over 50 repetitions when selecting the best parameter (among several parameters which have been tested) for each method, repetition and experimental setting. The first row shows the results for the small scale experiment. One can see that MCDE performs slightly worse than G -MDE and the sample estimator. Since MCDE favours covariance matrices with small determinant it naturally focuses on the outlier samples (coming from trials with small variance), therefore the estimate is worse than when applying the other estimators. The heuristic used by MCDE fails as the small variance samples are the outlier samples in this example. G -MDE slightly outperforms the W -MDE in the sample-level experiment but the difference to SE is not large because the small variance samples do not affect the sample covariance estimator very much. For the group-level experiment W -MDE demonstrates its advantages. It gives much better estimates than the other three estimators even when the probability of outlier is very high. Note that the solid line stands for the results when initializing the algorithm with the sample covariance matrix, whereas the dashed line shows the results for random initialization (with scale σ = 1). Since the β-divergence model depends on the initialization in the sense that all samples/trials are downweighted which are outliers (wrt the model used for the kth iteration), we can Comparison of three robust estimators with respect to their ability to discard outlier trials. The clean data consists of 20 trials with 100 samples per trial coming from a normal distribution with (almost) fix covariance matrix. We add one outlier trial to these data. The outlier samples come from a distribution with a covariance matrix which is the scaled and/or rotated version of the true underlying covariance matrix. The maps represent the ρ value defined in equation (16). Black color stands for small ρ values, white color represents large ρ values. One can see that all estimators downweight the outliers for large scales and angles, however, only the Wishart model identifies the outlier trial when small scales are applied. Furthermore, the Wishart estimator provides much cleaner ρ values.
improve the performance when applying random initialization. In the case of 90% outliers random initialization of the algorithm performs significantly better than initialization with the sample covariance matrix because in the former case the initialized matrix has the same scale as the clean data (thus the outlier trials are being discarded), whereas in the latter case, the initialized matrix has the scale of the outlier trials (thus the clean data is discarded as outlier). Note that our estimator resists the presence of 90% outliers (i) because it is modeldriven, i.e. penalizes the outliers based on their likelihood of being generated by the model and not by using heuristics, and thus is very robust if initialized with a parameter close to the true solution, and (ii) because in this simulation each outlier trial was sampled from a distribution with a different covariance parameter, thus no common outlier model exists. Note that other initialization strategies can be applied in practice and may positively affect the results.
The bottom row shows the results for the large scale artifacts. Here MCDE shows its advantages (preference of covariance matrices with small determinant) until the probability of outlier exceeds 50% (breaking point). After that its error largely increases. For G -MDE the performance is quite stable until 80% of outliers. Note that this method performs so well because it iteratively discards the extreme outlier samples even when initialized with the average covariance matrix. In other words even when the probability of outlier is very high many samples will lie in the range of the clean data and extreme outlier samples will be rapidly downweighted because the sample covariance matrix, which is used for initialization, does not provide enough support for them. Thus, iteratively these samples will have less and less influence and the final estimate will be better than the sample covariance matrix. Note that the positive effect of random initialization is limited in the Gaussian model as the extreme outliers are downweighted anyway. In the group-level scenario W -MDE performs very well until the point where more than 20% of the trials become outliers 11 . Beyond this point the error largely increases as the initialization with the sample covariance matrix prefers the outlier trials over the clean trials (which are treated as outliers). However, when using random initialization the initial covariance matrix has the same scale as the clean data and W -MDE downweights the influence of outlier trials until the probability of outlier exceeds 90%. We would like to stress that in practice it is often impossible to correctly estimate parameters in a 90% outlier setting, because the outlier model is much more complex and prior information about the scale of the correct parameter is not available.

Motor imagery BCI
This section investigates the impact of robust parameter estimation on spatial filtering algorithms in BCI, in particular CSP and its variants. These methods are well suited to discriminate between two motor imagery classes because they enhance the differences in band power (ERD/ERS 12 [5,93]) between the conditions. Mathematically, a CSP spatial filter w ∈ R D maximizes / minimizes the Rayleigh quotient where Σ 1 , Σ 2 ∈ R D×D are the estimated (average) covariance matrices of the two conditions, e.g. left hand and right hand motor imagery. If these covariance matrices are not well estimated, e.g. due to artifacts in the EEG, then the spatial filters will not be physiologically meaningful, thus will not extract the BCI related neural activity. Robust covariance matrix estimators downweight artifactual samples / trials and thus result in better spatial filters.

Data sets and setup.
We use two data sets, namely the Vital BCI data set [94] consisting of EEG recordings from 80 subjects and the BCI Competition III dataset IVa [95] consisting of five users, for the experimental evaluation.
In the Vital BCI data set the experiment started with a calibration session in which participants were asked to perform motor imagery tasks with the left and right hand or with the feet. After recording 75 trials for each condition, the best binary combination of motor imagery tasks was selected and the BCI system was trained. Subsequently, the feedback session started in which the system decoded the imagined movement. Visual feedback was provided to the user. The decoding efficiency of the system is measured in terms of classification accuracy. The signals were recorded from 118 Ag/AgCl electrodes, from which we manually select 62 electrodes densely covering the motor cortex. We downsample the signal to 100 Hz and apply a 5th order Butterworth filter with pass-band 8-30 Hz. We use a fix time segment from 750 to 3500 ms after the trial start for feature extraction.
The BCI Competition III dataset IVa contains EEG signals from five healthy subjects performing right hand and foot motor imagery without feedback. Two types of visual cues, a letter appearing behind a fixation cross and a randomly moving object, shown for 3.5 s were used to indicate the target class. The presentation of target cues were sandwiched between periods of random length, 1.75 to 2.25 s, in which the subject could relax. The EEG signal was recorded from 118 Ag/AgCl electrodes, band-pass filtered between 0.05 and 200 Hz and downsampled to 100 Hz, so that 280 trials are available for each subject. We manually select 68 electrodes densely covering the motor cortex and band-pass filter the signal in the frequency range 8-30 Hz using a 5th order Butterworth filter.
For both data sets we compute CSP (3 filters per class) with the class-covariance matrices estimated with SE, MCDE (with parameters h = 0.5:0.05:1), G -MDE (with parameters β = 2 −15:1:0 ) and W -MDE and WG-MDE (both with parameters β = 2 −15:0.5:−8 ). We select the parameters by minimizing cross-validation error on the training data. Following spatial filtering log-variance features are computed and the LDA classifier is applied [39]. Table  1 displays the error rates obtained with the different covariance matrix estimators on both datasets. One can see that all robust estimators clearly outperform the SE baseline and that the lowest error rates are obtained for the combined estimator WG-MDE. This result shows that both types of outliers (i.e. sample and trial) occur in the BCI datasets and negatively affect the spatial filter computation. Since the amount of outliers in the data vary from subject to subject, not all users benefit from robust estimation (e.g. subjects A2 and A5). Figure 6 provides an overview of the Vital BCI results using scatter plots. For individual subjects the improvement over the SE baseline is quite large. For instance, subject 21 has chance-level performance (i.e. error rate 46%) when computing spatial filters by using SE, but the error rate decreases to 17% when applying WG-MDE. Similar error rate decreases are obtained for the other robust estimators. The overall performance improvement of WG-MDE over SE, MCDE and G -MDE is significant with p < 0.05 and over W -MDE with p < 0.01 according to the one-sided Wilcoxon sign-rank test. Also when considering the best parameter for each subject WG-MDE leads to an average error rate of 23.5% and clearly outperforms MCDE, G -MDE and W -MDE with error rates of 27.4%, 25.1% and 25.1%, respectively. This result indicates that some subjects benefit more from the trial-level robustification performed by W -MDE, whereas for others robust sample-level estimation performed by G -MDE (and MCDE) suffices. For instance, subject 74 has an error rate of 50%, 49%, 37% when computing spatial filters by using SE, MCDE and G -MDE, respectively, whereas the error rate decreases to 29% when using WG-MDE. The error rate of this subject can even be lowered to 21% when applying W -MDE with the best parameter, but a comparable performance can not be obtained with the sample-level estimators. Similarly, the error rate of subject 10 can be lowered from 49% to 23% when using G -MDE instead of SE, but the improvement obtained with W -MDE is much smaller (even for the best parameter the error rate stays above 32%).

Robust estimation improves BCI performance.
Since WG-MDE combines the advantages of the trial-and sample-level estimator, it leads to the best overall performance in the Vital BCI dataset. For BCI Competition dataset the advantages of the combined estimator over the group-level one are only marginal, i.e. both estimators lead almost to the same solution. There are two potential explanations for this result. First, subjects A1-A5 may be not affected by samplelevel outliers (or at least much less affected than subjects in the Vital BCI dataset), thus applying a robust estimator in the first step of WG-MDE does not have any advantage over applying the standard estimator. This seems not to be the case as G -MDE clearly outperforms the SE baseline. Second, the parameter β used for the robust estimator in the first step of WG-MDE may be too small, so that the robust sample-level estimator 'behaves' like the standard estimator. Note that we use only one parameter for WG-MDE in our experiments, i.e. the same β is used for estimating the trial-wise scatter matrices with G -MDE and for estimating the final covariance matrix with W -MDE. This brings the risk that the parameter may be of optimal scale for the second step of WG-MDE, but too small for the first one; or of optimal scale for the first step, but too large for the second one. The two datasets may be differently affected by this scale effect. By using two separate β parameters (and also adapting ν) the WG-MDE results can be  potentially further improved, but we leave the investigation of more complex parameterizations for future work.

Robust estimation decreases the condition number.
In the following we discuss why robustly estimated parameters lead to lower error rates in BCI applications. Figure 7 provides an intuitive explanation for the large performance improvement of subject 21. It shows the right hand motor imagery patterns computed from the CSP filters with SE and W -MDE. The pattern obtained by using SE shows activity in the right hemisphere and at the left temporal electrodes. This activity is due to artifacts and does not have neurophysiological origin. The signal recorded at C6 electrode shows strong artifacts which negatively influence the covariance estimation and the spatial filter computation for this subject. The W -MDE (and also the other robust estimators) downweights these artifactual trials (bottom row) and allows to extract the true motor imagery related neural activity. Mathematically, we can show that the robustly estimated covariance matrices have a significantly smaller condition number compared to the covariance matrices estimated by SE (t-test, p 0.001). The condition number of a matrix is defined as the ratio of the largest and smallest eigenvalue. Also we can show that the decrease of the condition number is correlated to the error rate decrease (r = 0.2927, p < 0.01). This result provides one explanation why robustly estimated parameters lead to lower error rates in BCI applications. The artifacts in the data (if not removed) lead to large condition number of the class-covariance matrix which negatively affects the stability of the CSP solution. Since CSP is a greedy algorithm (i.e. computes a maximum likelihood solution), it is largely affected by the over-and underestimated eigenvalues and thus focuses on the artifacts instead of the true neural activity.

Sample and trial robustness revisited.
This section analyzes the relation between the proposed sample-level and trial-level estimators in more detail. In particular, we show that the trials considered as outliers by G -MDE and W -MDE only partially overlap. The left upper panel of figure 8 shows the normalized weights ψ β assigned to each trial by G -MDE (computed as average over sample weights) and W -MDE for subject 1. Although the correlation between the weights is quite high, r = 0.7, some trials are downweighted by W -MDE but not by G -MDE and vice versa. The two trials marked by the red and green circles in figure 8 are almost completely discarded (weight close to zero) by W -MDE but are not downweighted by G -MDE. The time courses of these trials are shown in the bottom panel of figure 8. Obviously channel CPz has a problem (recorded values exactly zero) which negatively affects the spatial filter computation when using the SE (i.e. top CSP filter does not capture BCI related neural activity). Furthermore, G -MDE does not identify these trials as outliers, i.e. it does not downweight them. There exist also the opposite case where trials (right bottom corner of the figure) are downweighted by G -MDE but not penalized by W -MDE. In general it is not only important to downweight the outliers trials correctly, but also to assign high weights to representative, non-outlier trials.
The right panel of figure 8 demonstrates the superiority of W -MDE in identifying representative trials. The dashed line represents the average (over all subjects) test performance of SE when computing the spatial filters (one per class) using all 75 calibration trials per class. The three solid lines stand for the average performance of SE when using the 2-20 best trials according to G -MDE and W -MDE weighting or random selection. More precisely, we select 2-20 trials per class with the highest G -MDE and W -MDE weights or by random selection (50 repetitions) and compute the spatial filters and the classifier using these trials. Note that the weights ψ β were computed on all calibration trials. Values above the dashed line stand for an error rate increase relative to the 75 trials baseline. One clearly sees that the weights computed by W -MDE belong to 'better', i.e. more informative, trials than the G -MDE weights or random selection. For instance, when selecting the six best trials according to the W -MDE weighting the average error rate only increase by 4% compared to the 75 trials baseline, whereas the performance loss is twice as large when using the G -MDE weights. One intuitive explanation for this result is that it is easier to identify good trials when looking at the data from trial-level than from samplelevel perspective because the average sample weight of a trial (sample-level perspective) does not well reflect it's overall quality. The proposed W -MDE provides this trial-level view and performs better on this task. Figure 9 displays the C3 channel signal of the eight best trials according to the G -MDE and W -MDE weighting of subject 19. A good weighting would select trials with a typical motor imagery effect over the C3 channel, i.e. high variance in the right hand condition and low variance in the left hand condition. One can see that W -MDE selects 'better' trials and produces a neurophysiologically more meaningful CSP pattern when maximizing the 'right hand' condition than G -MDE.

Do CSP variants also benefit from robust estimation?
In the following analysis we show that robust param eter estimation is also important for more advanced CSP variants. Table 2 displays the average error rate of the 80 Vital BCI subjects (median error rate is shown in the brackets). The asterisks indicate significance (0.05, 0.01 and 0.001 level) when comparing the error rate of the given method to the corre sponding (non-robust and robust) CSP baseline using the Wilcoxon sign-rank test. For simplicity, we apply TRCSP [44] and sCSP [50] with a fixed parameter of λ = 0.02 and β-CSP [51] with β = 0.1, i.e. we do not select these parameters for each subject separately. The first row shows the error rates when estimating the covariance matrices with SE, whereas the second row shows the results when using the covariance matrices computed by WG-MDE. Also here we rely on the β parameter which was selected in the above analysis, i.e. we refrain from selecting the optimal parameters for each of the CSP variants separately.
The results show a significant error rate decrease relative to the CSP baselines. All methods perform significantly better than CSP in the non-robust as well as in the robust setting (for β-CSP the error rate decrease is only significant in the non-robust setting). Also in all cases the results improve when robustly estimating the covariance matrices (i.e. comparing the first and second row). This result shows that even algorithms such as β-CSP which are robust by design or TRCSP which stabilize the CSP algorithm (e.g. in the case of large condition numbers) by restricting the norm of the filters benefit from robust parameter estimation.

Robust estimation and nonstationarity.
Nonstationarity in EEG is a critical issue, especially because it aggravates the session-to-session transfer in BCI. The development of techniques which robustify the signal analysis against nonstationarity is therefore of large interest to researchers as well as practicians. Data from BCI experiments can be contaminated with artifacts or be free of outliers; it can be stationary or it's distribution can change significantly over time. The robust estimator proposed in this work is designed to tackle the outlier problem, however, the results in section 5.2.2 suggest that it also has a positive effect (because it clearly outperforms the standard estimator) on datasets, which are known to exhibit nonstationarity between training (calibration data) and test (feedback data) phase.  Here are two potential explanations for this observation.
(1) The smaller condition number of the proposed estimator (see section 5.2.3) acts as a regularizer and thus prevents the estimator to overfit on the training data. Thus, the estimated parameters are less training data specific and better capture the global properties of the signal which we believe are often more stationary. This reduces the vulnerability to nonstationarity. (2) The robust estimator implicitly focuses more on the stationary part of the data, because it downweights the impact of trials which exhibit the most changes (i.e. outliers) in the training data (see equation (14)). Discarding these trials does not completely remove the nonstationarity from the data, but it reduces it by a significant amount.
Thus, robust parameters estimation implicitly robustifies the signal analysis against nonstationarities in the training data, which often also helps in session-to-session transfer. However, it does so only to a certain extend. The results in table 2 show that the combination of robust estimation and an explicit minimization of nonstationarity (by applying sCSP) gives further improvement over mere robust estimation. Depending on the data, robustness against outliers or the minimization of nonstationarity may have a larger effect on performance.

Conclusion
This work introduced a novel robust covariance matrix estimator based on the minimum divergence principle and a Wishart distribution model. We demonstrated the advantages of this estimator for structured data in simulations and on real data sets.
In future work we will consider the use of alternative disparity measures, e.g. optimal transport [96] or γ-divergence [85], and models, e.g. multimodal distribution or Dirichlet distribution, for robust estimation. Furthermore, we aim to provide a Bayesian interpretation for multi-scale robustness and apply the proposed estimator in the medical domain for clinical multi-site studies. Furthermore, we will investigate the advantages of robust parameter estimation for multi-modal data [97,98] and multi-subject BCI settings [99]. Finally, we will also study the impact on outliers on BCI performance prediction methods (e.g. [100][101][102]  Let Σ (k+1) = 1 β+1 Σ (k+1) , ν = (β + 1)ν − βD − β and α = Note that if splitting the integral on the right hand side into two integrals then the first one gives the first moment of the Wishart distribution and the second one is the zeroth moment times νΣ (k+1) . Thus, we obtain Assuming |Σ (k+1) | = |Σ (k) | at convergence point this is which implicitly depends on µ parameter through the estimation of the scatter matrices S j . Note that