Independent Vector Analysis for Feature Extraction in Motor Imagery Classification

Independent vector analysis (IVA) can be viewed as an extension of independent component analysis (ICA) to multiple datasets. It exploits the statistical dependency between different datasets through mutual information. In the context of motor imagery classification based on electroencephalogram (EEG) signals for brain–computer interfaces (BCIs), several methods have been proposed to extract features efficiently, mainly based on common spatial patterns, filter banks, and deep learning. However, most methods use only one dataset at a time, which may not be sufficient for dealing with a multi-source retrieval problem in certain scenarios. From this perspective, this paper proposes an original approach to feature extraction across multiple datasets based on IVA to improve the classification of EEG-based motor imagery movements. The IVA components were used as features to classify imagined movements using consolidated classifiers (support vector machines and K-nearest neighbors) and deep classifiers (EEGNet and EEG-Inception). The results show promising performance concerning the clustering of MI-based BCI patients, and the proposed method reached an average accuracy of 86.7%.


Introduction
The brain-computer interface (BCI) enables a direct connection between the brain and the external world. Electroencephalography is a technique capable of translating brain activity into commands based on scalp-recorded measurements [1,2]. Some benefits of this BCI modality are its non-invasive nature, safety, high temporal resolution, and relatively low cost. All these advantages have attracted the interest of the scientific community, and as a result, electroencephalogram (EEG) signals have been used in the study of several areas, such as epilepsy and brain tumor detection [3], alternative communication channels for disabled patients [4], emotion recognition [5], and neuromuscular disorders [6]. Among all EEG-based applications, the motor imagery (MI) paradigm is probably one of the most popular. It refers to the imagination or mental rehearsal of a motor movement without any real motor execution [7,8].
The MI paradigm has been employed in cognitive psychology and cognitive neuroscience to explore the unconscious processes that anticipate a movement's execution. In some contexts, such as the medical, athletic, and musical areas [9-11], mental rehearsal can be as effective as actual physical execution, which makes it a promising therapeutic tool for improving motor function in patients with damage to the central nervous system [9].
Several methods have been proposed to identify MI movements from EEG signals as accurately as possible, most of them focused on feature selection or classification algorithms. However, feature extraction from EEG signals for BCI systems has proven to be a crucial stage for classification performance.
BCI Competition III Dataset 4a (DS4a) is a widely recognized motor imagery dataset that has been extensively studied, leading to the development of various techniques. For instance, Na Lu et al. [12] introduced a method known as structure-constrained semi-nonnegative matrix factorization (SCS-NMF), which extracts key EEG patterns in the time domain by enforcing mean envelopes of event-related potentials (ERPs) as constraints; it achieved an accuracy of 68.94%. Rasool Ameri et al. [13] developed a dictionary pair learning (DPL) method for EEG classification, using L0- and L1-norm calculations to obtain sparse coefficients via linear projection, resulting in an accuracy of around 80%. More recently, the researchers in [14] proposed a framework combining bispectrum, entropy, and common spatial patterns (BECSP) for feature extraction from MI-EEG signals, achieving an accuracy of 84.91% after selecting the most relevant features through a tree-based method. Additionally, the study in [15] compared three popular signal decomposition techniques (empirical mode decomposition, discrete wavelet transform, and wavelet packet decomposition) for EEG classification, with wavelet packet decomposition (WPD) sub-bands yielding an average accuracy of 92.8% for DS4a. Despite these consistent results, DS4a classification remains a challenge. Considering the previous methods, working with only one subject at a time may not be enough to deal with a multi-source retrieval problem in some cases. Assuming that the motor imagery data are collected from several subjects executing the same task, the problem can be extended to a multi-dataset approach, such as independent vector analysis (IVA), which can explore the dependencies across subjects.
IVA was first proposed for the separation of convolutive mixtures in the frequency domain [16], considering the joint blind source separation (JBSS) problem [17]. Since then, it has been applied to fMRI (functional magnetic resonance imaging) [18,19], HD-sEMG (high-density surface electromyography) [20], multimodal neuroimaging data fusion [21], and EEG muscle artifact removal [22]. However, the application of IVA to MI-based BCI as a feature extraction technique is a pioneering approach. This work proposes a novel perspective on MI classification through a new feature extraction method: a combination of IVA and autoregressive (AR) models. The method is applied to the motor imagery data from BCI Competition III Dataset 4a. Classification is performed with different methods, namely support vector machines (SVM), K-nearest neighbors (KNN), EEGNet, and EEG-Inception, in order to evaluate the quality of the obtained features. Compared with results in the literature, this novel approach showed homogeneous accuracy performance.
In Section 2, we describe the JBSS problem and the IVA method. Section 3 presents the AR model and the classifiers applied in this work. Section 4 presents BCI Competition III Dataset 4a and the data preprocessing. The simulation results are presented and analyzed in Section 5. Finally, we conclude the paper in Section 6.

Joint Blind Source Separation
In certain applications, such as neurodiagnostic applications [23], dealing with multiple datasets is a necessity, which leads to multisubject/multimodal data fusion. In these cases, the task of blind source separation may be extended to JBSS, which exploits correlations across datasets (inter-set dependence) while still seeking to recover independent latent sources within each dataset (intra-set independence) [19]. The general JBSS problem involves K datasets, each containing M independent sources and N samples. The mixing process can be modeled by the following equation:

x^[k](n) = A^[k] s^[k](n),   k = 1, ..., K,   (1)

where s^[k](n) = [s_1^[k](n), ..., s_M^[k](n)]^T ∈ R^M is the source vector of the k-th dataset, (·)^T denotes the vector transpose, A^[k] ∈ R^{M×M} is the k-th invertible mixing matrix, both assumed unknown, and x^[k](n) ∈ R^M is the observed mixture vector. Following the recommendation of [17], it is important to perform data whitening prior to source separation. Considering V^[k] as the whitening matrix of the k-th dataset, z^[k](n) = V^[k] x^[k](n) is the whitened mixture signal.
The demixing process aims to find matrices W^[k] and the corresponding source vector estimates y^[k](n) for each of the K datasets. Hence, the separation system is given by

y^[k](n) = W^[k] z^[k](n),   k = 1, ..., K.   (2)

The mixing matrices are potentially distinct for each dataset and are not necessarily related, admitting permutation and/or scale ambiguities.
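The mixing, whitening, and separation steps above can be sketched numerically. The following is a minimal illustration (with arbitrary sizes K = 3, M = 4, N = 1000 and simulated Laplacian sources, all chosen for demonstration only, not taken from the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, N = 3, 4, 1000  # datasets, sources, samples (illustrative sizes)

# Simulate K datasets following the JBSS model: x[k](n) = A[k] s[k](n)
S = rng.laplace(size=(K, M, N))          # latent sources per dataset
A = rng.standard_normal((K, M, M))       # unknown invertible mixing matrices
X = np.einsum('kij,kjn->kin', A, S)      # observed mixtures

# Whitening per dataset: z[k](n) = V[k] x[k](n), so that cov(z[k]) = I
Z = np.empty_like(X)
V = np.empty((K, M, M))
for k in range(K):
    C = np.cov(X[k])                     # M x M sample covariance
    d, E = np.linalg.eigh(C)
    V[k] = E @ np.diag(d ** -0.5) @ E.T  # symmetric whitening matrix
    Z[k] = V[k] @ X[k]

# Separation: y[k](n) = W[k] z[k](n); with W[k] = I, y is the whitened mix
W = np.stack([np.eye(M)] * K)
Y = np.einsum('kij,kjn->kin', W, Z)
```

Whitening against the sample covariance makes each dataset's covariance exactly the identity, which is the precondition assumed before running IVA.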

Independent Vector Analysis
Independent vector analysis is a powerful approach for solving the JBSS problem; it extends independent component analysis (ICA) to multiple datasets by leveraging the dependence across datasets [16,17]. In [19], the role of diversity, i.e., different statistical properties, is explained for both ICA and IVA, and the application of both to medical image analysis is discussed. Group ICA (GICA) is a widely used approach for multiple datasets [24,25]; however, it has limitations compared to IVA in terms of common group-level spatial maps and the preservation of inter-subject variability [26,27]. The statistical dependence modeled through a multivariate probability density model provides full interaction across the datasets, making IVA an attractive method for problems such as subgroup identification with multisubject data [19]. Ref. [28] showed encouraging results for IVA applied to subgroup identification, revealing significant differences between the identified subgroups. Such studies demonstrate the ability of IVA to deal with subject variability and subgroup identification, highlighting its advantages in handling multiple datasets.
In IVA, the components from a particular dataset are assumed to be statistically independent of each other, as in ICA methods. However, in contrast to ICA, IVA also exploits the dependence between correlated components from different datasets. These correlated components are grouped into so-called source component vectors (SCVs). The mth SCV can be written as

y_m = [y_m^[1], ..., y_m^[K]]^T,

which is statistically independent of all the other SCVs [17]. The IVA cost function is given by

J_IVA = Σ_{m=1}^{M} ( Σ_{k=1}^{K} H[y_m^[k]] − I[y_m] ) − Σ_{k=1}^{K} log |det(W^[k])| − C_1,   (3)

where H[·] is the entropy function, I[·] is the mutual information, det(W^[k]) is the determinant of matrix W^[k], and C_1 is a constant term that depends only on x^[k]. The mutual information part of the IVA cost function is responsible for solving the permutation ambiguity that occurs in the JBSS problem [17,19]. Furthermore, the minimization of the cost function (3) simultaneously minimizes the entropy of all components and maximizes the mutual information within each estimated SCV [17]. In addition, IVA has shown, in most cases, good performance in capturing variability in spatial components across datasets [25,27]. In this paper, we work with IVA-G [17], which only takes second-order statistical information into account, assuming multivariate Gaussian distributions for the SCVs.
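Under the IVA-G assumption of Gaussian SCVs, the entropy of each SCV reduces to a log-determinant of its K x K covariance, so the cost above can be evaluated from second-order statistics alone. A minimal sketch of such an evaluation (the function name and data layout are our own; the constant C_1 is dropped):

```python
import numpy as np

def iva_g_cost(W, Z):
    """Evaluate the IVA-G cost (up to the constant C_1) for demixing
    matrices W of shape (K, M, M) and whitened data Z of shape (K, M, N).
    Under the multivariate Gaussian SCV model, the entropy of the mth SCV
    is H[y_m] = 0.5 * log det(2*pi*e * Sigma_m), where Sigma_m is the
    K x K covariance of y_m across datasets."""
    K, M, N = Z.shape
    Y = np.einsum('kij,kjn->kin', W, Z)           # y[k](n) = W[k] z[k](n)
    cost = 0.0
    for m in range(M):
        Ym = Y[:, m, :]                           # mth SCV, shape (K, N)
        Sigma = np.atleast_2d(np.cov(Ym))         # K x K SCV covariance
        _, logdet = np.linalg.slogdet(2 * np.pi * np.e * Sigma)
        cost += 0.5 * logdet                      # Gaussian SCV entropy
    for k in range(K):
        cost -= np.log(abs(np.linalg.det(W[k])))  # log|det W[k]| term
    return cost
```

For already whitened, mutually uncorrelated data and identity demixing matrices, the cost reduces to the sum of the Gaussian entropies of the SCVs.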

IVA Using Vector Gradient Descent
In [17], the authors describe four algorithms for minimizing the IVA cost function given by (3). Among these alternatives, the vector gradient descent algorithm was chosen in this paper due to its decoupling method, which enables the tailoring of the step size for each direction, resulting in faster convergence per iteration than that achieved by traditional methods [29]. In this approach, the IVA cost function (3) is differentiated with respect to w_m^[k], the mth row of W^[k] [17]:

∂J_IVA/∂w_m^[k] = E[ −(∂ log p(y_m)/∂y_m^[k]) z^[k] ] − h_m^[k] / ((h_m^[k])^T w_m^[k]),   (4)

where p(y_m) is the pdf (probability density function) of y_m, and h_m^[k] is a unit-length vector orthogonal to the (M − 1) rows of the (M − 1) × M matrix obtained by removing the mth row of the demixing matrix W^[k] [17,29].
The gradient obtained by (4) is used to iteratively adapt each demixing row of W^[k]:

w_m^[k](i + 1) = w_m^[k](i) − μ ∂J_IVA/∂w_m^[k],   (5)

followed by a normalization step:

w_m^[k](i + 1) ← w_m^[k](i + 1) / ||w_m^[k](i + 1)||,   (6)

where μ is the adaptation step size and i is the iteration index.
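The update and normalization steps above can be sketched as a one-line helper; the decoupled gradient itself is assumed to be supplied by the routine that evaluates the cost (the function name is our own):

```python
import numpy as np

def update_demixing_row(w, grad, mu=1.0):
    """One vector-gradient-descent step on the mth row of W[k]:
    w <- w - mu * grad, followed by normalization to unit length.
    A simplified sketch; the decoupled gradient is assumed given."""
    w_new = w - mu * grad
    return w_new / np.linalg.norm(w_new)
```

The normalization keeps each demixing row on the unit sphere, which resolves the scale ambiguity inherent to the JBSS model.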

Classification Algorithms
Having extracted the features through IVA, dimensionality reduction is highly recommended before classification. Thus, an autoregressive (AR) model is used [30,31], and its coefficients are extracted for the classification step. This step is detailed in the sequel, followed by a brief description of the classification methods: SVM, KNN, EEGNet, and EEG-Inception.

Autoregressive Model
The AR model is frequently used to represent a random process, preserving its important attributes while reducing the data dimension. This is possible due to the model structure, in which the output variable depends linearly on its own previous values [32]. Thus, an autoregressive model of order q describes the signal u as follows:

u(n) = Σ_{i=1}^{q} a_i u(n − i) + ν(n),   (7)

where ν(n) is white noise with zero mean and variance σ_ν^2, and {a_1, ..., a_q} are the AR parameters, which can also be written as a = [a_1, ..., a_q]^T.
Based on the Yule-Walker equations [33], the coefficients of the AR model can be estimated by

a = R^{−1} r,   (8)

where R ∈ R^{q×q} is the autocorrelation matrix of u, with entries R_{ij} = r_u(i − j), and r = [r_u(1), ..., r_u(q)]^T is the autocorrelation vector. In this paper, an AR model is obtained for each estimated source y_m in each k-th dataset.
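The Yule-Walker estimate above can be sketched directly with biased autocorrelation estimates; the check at the end uses a hypothetical simulated AR(2) process (coefficients 0.6 and -0.3, chosen only for illustration) to confirm the estimator recovers its parameters:

```python
import numpy as np

def yule_walker(u, q):
    """Estimate AR(q) coefficients of signal u via the Yule-Walker
    equations: solve R a = r, where R is the q x q autocorrelation
    matrix and r the autocorrelation vector at lags 1..q."""
    u = np.asarray(u, dtype=float)
    u = u - u.mean()
    N = len(u)
    # biased autocorrelation estimates at lags 0..q
    r = np.array([u[: N - lag] @ u[lag:] for lag in range(q + 1)]) / N
    R = np.array([[r[abs(i - j)] for j in range(q)] for i in range(q)])
    return np.linalg.solve(R, r[1 : q + 1])

# Hypothetical sanity check: recover the coefficients of a simulated AR(2)
rng = np.random.default_rng(1)
a_true = np.array([0.6, -0.3])
u = np.zeros(20000)
for n in range(2, len(u)):
    u[n] = a_true @ u[n - 2 : n][::-1] + rng.standard_normal()
a_hat = yule_walker(u, 2)
```

In the paper's pipeline, such a q-dimensional coefficient vector would be computed per channel and per subject and fed to the classifier as a compact feature.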

Classifiers
SVM is an efficient supervised algorithm based on statistical learning theory that can be used for classification or regression problems [34]. While other methods typically assume a particular distribution for the class-conditioned data when defining the separation hyperplane, SVM seeks the separation hyperplane with the largest margin between classes.
The KNN classifier is one of the most popular neighborhood classifiers in pattern recognition [35]. It is a nonparametric supervised learning classifier that uses the majority vote among the K closest training examples to classify or predict an individual data point. The previously described algorithms are well known in the machine learning field.
Nevertheless, deep learning approaches have been attracting the attention of the scientific community and have presented promising results in biomedical engineering applications. EEGNet [36] is a compact convolutional neural network for EEG-based BCIs. The method uses depthwise and separable convolutions to construct an EEG-specific network that encapsulates several well-known EEG feature extraction concepts, such as optimal spatial filtering and filter-bank construction, while reducing the number of trainable parameters compared to other networks. More recently, a deep learning model called EEG-Inception was proposed by E. Santamaria-Vazquez et al. [37]. This method integrates inception modules for event-related potential (ERP) detection, can be efficiently combined with other structures in a light architecture, and requires very few calibration trials.
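The KNN majority-vote rule described above can be sketched in a few lines. This is a bare-bones illustration for classifying AR-coefficient feature vectors, not the paper's actual implementation (a library such as scikit-learn would normally be used):

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Minimal K-nearest-neighbors classifier: each test point receives
    the majority label among its k closest (Euclidean) training points."""
    preds = []
    for x in np.atleast_2d(X_test):
        d = np.linalg.norm(X_train - x, axis=1)   # distances to all trains
        nearest = y_train[np.argsort(d)[:k]]      # labels of k closest
        vals, counts = np.unique(nearest, return_counts=True)
        preds.append(vals[np.argmax(counts)])     # majority vote
    return np.array(preds)
```

With well-separated feature clusters, the vote trivially assigns each test point to its surrounding class, which is the behavior exploited when the AR features are discriminative.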

Experimental Setup
In the previous section, we described a method for feature extraction based on IVA. In order to better understand the proposed method, in this section, we investigate the performance of the algorithm by applying it to a real EEG dataset of motor imagery movements. In the following, we describe this dataset and the preprocessing stages.

Dataset Description-BCI Competition III Dataset 4a
Dataset 4a from BCI Competition III (DS4a), provided by B. Blankertz et al. [38], contains data recorded from 5 subjects, identified as "aa", "al", "av", "aw", and "ay", using 118 channels sampled at 1000 Hz and downsampled to 100 Hz. The cue-based BCI paradigm involves two motor imagery tasks, imagining right-hand movement and right-foot movement, totaling 280 trials per subject.
During each trial, a fixation cross was shown to the subject, followed by a short acoustic warning tone indicating the beginning of the trial. Two seconds later, a cue arrow pointing either right or down appeared on the screen for 3.5 s, instructing the subject to perform the corresponding motor imagery task. Subjects were to continue the motor imagery task until the arrow disappeared, after which a brief black screen signaled a short break.

Proposed Method
In order to evaluate the proposed method, we initially split the dataset into training and test data using k-fold cross-validation with k_f = 10. However, since the IVA matrix initialization (e.g., based on a Gaussian distribution) is a relevant stage for feature extraction, and in order to keep the test data unseen, we also held out 10% of the training data as a validation set to investigate the effect of the IVA initialization on the performance of the method. In addition, each time series given by the EEG signal was segmented into 4 s windows according to each motor imagery class.
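The splitting protocol described above can be sketched as index bookkeeping. This is an illustrative reconstruction (function name, seed, and trial count of 280 are our own choices based on the dataset description, not the paper's code):

```python
import numpy as np

def kfold_with_holdout(n_trials, k_folds=10, holdout=0.1, seed=0):
    """Sketch of the evaluation protocol: split n_trials into k_folds
    train/test partitions, then carve a holdout validation subset out
    of each training split (used to assess the IVA initialization while
    keeping the test data unseen)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_trials)
    folds = np.array_split(idx, k_folds)
    splits = []
    for f in range(k_folds):
        test = folds[f]
        train = np.concatenate([folds[g] for g in range(k_folds) if g != f])
        n_val = max(1, int(round(holdout * len(train))))
        valid, train = train[:n_val], train[n_val:]
        splits.append((train, valid, test))
    return splits
```

Each of the ten splits yields disjoint train, validation, and test index sets that jointly cover all trials.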

Training Stage
Firstly, in the training stage, the data of each subject were whitened separately, as recommended in [17]. The EEG signal collected from each subject was considered to be one dataset. IVA was applied to the training data of each class separately to obtain the demixing matrices W_1^[k] and W_2^[k]. The resulting SCV estimates for the two classes were stacked, and AR modeling was applied to each extracted feature (corresponding to each EEG channel) in order to reduce and adjust the data dimension. The resulting AR parameters were used as the classifier inputs. Finally, the data of each subject were classified according to the two considered classes. This second step is exemplified in Figure 2.
Optimization of the IVA cost function, given by (3), is not an easy task. As usually occurs with gradient-descent-based algorithms, initialization plays a crucial role. In order to better explore the method's potential, a search for a good initialization of W_c is implemented using the validation data; the selected initialization is denoted W_c^selected.
To select the appropriate W_c^selected, the SVM and KNN classifiers were chosen (block diagram of Figure 2), since both are low-cost, well-established algorithms and can provide a feasible direction for finding a suitable W_c^selected. More details are discussed in Section 5.1. When using the deep learning approaches, given by EEGNet and EEG-Inception, the AR parameter extraction step is not necessary; thus, W_c^selected and the training data were used directly.

Test Stage
After selecting the suitable initialization W_c^selected, the procedure described in Figure 2 is reapplied using the test data, where the estimated SCV components are represented by y_c^test. The whole method is summarized in Algorithm 1. The IVA and classifier weights obtained in the training stage are kept constant and applied to the test data. This procedure was used to classify the motor imagery movements between the right hand (RH) and right foot (RF) for DS4a.
In the following, the methods are named after the classifier used: IVAS for SVM, IVAK for KNN, IVAEN for EEGNet, and IVAEI for EEG-Inception.

Results and Discussion
In order to evaluate the algorithm's performance, in this section, we analyze the effect of the IVA initialization, the number of EEG channels considered, and the cross-subject correlations for dataset DS4a. For these analyses, we fixed the IVA adaptation step size to μ = 1 and the number of AR coefficients to q = 4, based on our previous work [39].

IVA Initialization
In Section 4.2, we described the IVA matrix selection methodology, which is grounded on a random search over IVA initializations. Concerning the number of initialization trials, for the sake of computational efficiency and based on a preliminary analysis, 100 trials were used to select the appropriate W_c^selected. That is, IVA was randomly initialized 100 times using a Gaussian distribution with zero mean and unit variance; for each initialization, accuracy was computed based on the classifier parameters and the validation dataset, considering the same IVA matrix initialization for all subjects. Figure 3 shows the kernel density estimation (KDE) for DS4a and two algorithms: IVAS and IVAK. In both cases, it is possible to verify the occurrence of an initialization that maximizes accuracy, even if it may be a rare event. In Figure 3a, the subjects "aw" and "ay" show a longer tail and a similar pattern, reaching initializations with accuracies over 90%, and subject "al" has the highest accuracy probability. On the other hand, in Figure 3b, three out of five subjects present a greater likelihood of an accuracy around 80%, while the curves obtained for subjects "aa" and "av" present a lower mean accuracy than the other subjects, indicating a probably higher classification complexity. Based on these analyses, the initialization matrix that leads to the greatest accuracy (measured on the validation set) is chosen for the dataset and subjects, and applied to the test dataset.
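The random-initialization search described above can be sketched as a simple loop. Here `score_fn` is a hypothetical stand-in for the full "run IVA, extract AR features, classify on the validation set" pipeline, which is far too heavy to reproduce inline:

```python
import numpy as np

def select_initialization(score_fn, M, K, n_inits=100, seed=0):
    """Sketch of the IVA initialization search: draw n_inits random
    Gaussian initializations of the demixing matrices, score each on
    the validation set via score_fn, and keep the best. The same
    initialization is shared by all subjects/datasets."""
    rng = np.random.default_rng(seed)
    best_acc, best_W0 = -np.inf, None
    for _ in range(n_inits):
        W0 = rng.standard_normal((K, M, M))  # zero mean, unit variance
        acc = score_fn(W0)                   # validation accuracy proxy
        if acc > best_acc:
            best_acc, best_W0 = acc, W0
    return best_W0, best_acc
```

The returned matrix plays the role of W_c^selected: it is the only initialization carried forward to the test stage.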

Number of EEG Channels
Another interesting analysis is to investigate the algorithm's performance with respect to the number of EEG channels used as input for each IVA dataset. Given that the DS4a EEG signals have 118 channels, the performance was analyzed using 13, 21, 37, 80, and 118 channels. To reduce the number of channels, those located in brain regions known for higher activity during motor imagery tasks were chosen [40]. The results are shown in Figure 4. They show that using all the available channels leads to a decrease in performance. The best results for DS4a were obtained by IVAS with 37 EEG channels (FAF5, FAF1, FAF2, FAF6, F7, F5, F3, F1, Fz, F2, F4, F6, F8, FFC7, FFC5, FFC1, FFC2, FFC4, FFC6, FFC8, FT9, FT7, FC3, FC1, FCz, FC2, FC4, FC6, FT8, FT10, CFC7, CFC5, CFC3, CFC1, CFC2, CFC4, CFC6). Ideally, as the number of channels increases, more information is available; however, we hypothesize that the amount of noise also increases and could degrade the feature extraction process. Among the algorithms tested, SVM performed best, while IVAK presented the lowest performance when all 118 EEG channels were used.

Correlation Cross-Subjects
In Section 2.1, we mentioned the SCVs and how they are extracted through IVA. In this section, we relate the results obtained from the SCV covariance matrices, computed from the estimated sources y_m^[k], to the DS4a cross-subject structure. Figure 5 shows two examples of SCV covariance matrices extracted from IVA components (IVA Cp.). These results are based on the use of 37 channels, which achieved the best outcome in the previous analysis. In Figure 5a, we present the covariance matrix obtained from the SCVs for the right hand movement and IVA component 6 as an example. As can be seen, two cases present a higher cross-correlation: "aa" with "av", which reaches a value of 0.909, and "aa" with "aw", which reaches 0.708. The second example is based on the right foot movement and IVA component 23, shown in Figure 5b, where the highest correlation of 0.936 was again achieved between subjects "aa" and "av", but other relevant correlations were found between subjects "av" and "aw", with a value of 0.887, and subjects "al" and "aw", with a value of 0.749. Based on the covariance measure obtained from the SCVs, Tables 1 and 2 show the five highest cross-subject correlations for each MI class (the same ones considered in the discussion above), IVA component, and subject. For this reason, the same IVA component or subject may appear more than once, meaning that it contributed again to one of the highest-correlation situations. The correlation values shown were computed as an average over the 10 folds.
In Section 4.2, we computed the KDE and investigated the IVA initialization for the five subjects, noting similar distributions between subjects "aa" and "av" and between subjects "aw" and "ay" in Figure 3. Comparing these results with the ones derived from Tables 1 and 2, we can observe analogous behavior: subjects "aa" and "av" present a high cross-correlation when considering IVA components 9 and 13 for right hand movement, and components 24 and 3 for right foot MI. Additionally, for subject "av", four of the five selected correlations were related to subject "aa" in Table 2, which represents a strong relation between them. In the second case, for subjects "aw" and "ay", higher correlations emerged from IVA components 9 and 2 for the right hand and right foot, respectively. These results present a significant correlation across subjects (around 0.45) and an intriguing perspective, since the KDE distributions of the subjects could lead to a potential clustering of MI-based BCI patients, from which similar features could be exploited to improve classification performance or aid the development of a global model.
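The cross-subject matrices underlying this analysis can be sketched directly from the estimated sources: for a fixed component index m, collect that component's time series from every subject and correlate them pairwise (the function name and the (K, M, N) layout are our own conventions):

```python
import numpy as np

def scv_correlation(Y, m):
    """Cross-subject correlation matrix of the mth SCV. Y has shape
    (K, M, N): K subjects/datasets, M components, N samples. Returns
    the K x K correlation matrix of component m across subjects, the
    quantity visualized in the covariance-matrix figures."""
    return np.corrcoef(Y[:, m, :])
```

A high off-diagonal entry, such as the 0.909 reported between subjects "aa" and "av", indicates that the same IVA component captures strongly shared activity in those two subjects.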

Deep Learning Approaches
Thus far, we have combined IVA feature extraction with two different classification algorithms. IVAS presented the best performance and provided valuable features using the parameters q = 4 and μ = 1. In order to explore the deep learning approach and evaluate the influence of the components extracted from IVA, the obtained independent components were applied to the EEGNet model (IVAEN) and the EEG-Inception model (IVAEI). It is important to note that the AR step was not applied in this case. These methods were implemented and trained using the BRAINDECODE library [41]. Moreover, we applied a data augmentation technique based on Gaussian white noise and/or replication [42]. Having chosen the parameters for each algorithm, Table 3 presents the final results for the DS4a dataset. The second-best result for subjects "aa" and "al" was obtained using IVAEN. IVAEI achieved 92.1% and 91.4% for subjects "aw" and "ay", respectively; the latter matched the WPD method's performance, while the former showed a slight difference of 3.3%. The IVAS algorithm found the third-best result for subject "av" (the most difficult subject to classify) compared to other results in the literature. On average, the IVAEI results showed the second-best average accuracy of 86.7%, with a standard deviation of 9.4.

Conclusions and Future Perspectives
In this work, we have presented a feature extraction method for motor imagery classification from EEG signals. The approach minimizes mutual information to achieve independent vector analysis across multiple datasets. The proposed method was evaluated using BCI Competition III Dataset 4a with five subjects. Although there are limitations in terms of the datasets tested, concentrating on a single well-established one allowed us to conduct a more in-depth and focused analysis, using this dataset as a proof of concept. For DS4a, the IVAEI algorithm obtained the best results, reaching an average accuracy of 86.7% over all subjects. Moreover, we showed how the selection of algorithm parameters, such as the step size, the number of AR coefficients, and the number of EEG channels, affects the algorithm's performance, and how the component correlations identified by IVA could lead to another interesting result concerning the clustering of MI-based BCI patients. In the future, we intend to investigate a generalization of the method, extending the work to more complex datasets, exploring the clustering problem, or even integrating multimodal data for enhanced feature extraction. Furthermore, IVA can be incorporated into a number of remaining challenging tasks in the EEG context, such as online analysis, transfer learning, and feedback systems. While IVA has traditionally been used as an offline method, new extensions allow for the regression of previous results using a new subject's data without the need to perform a complete decomposition [43,44], which would provide the required flexibility. A transfer learning approach is another interesting concept that could be exploited, considering that the correlation is naturally incorporated into the process to enhance the model's performance and robustness.
Concerning the deep learning approaches, we focused on comparing different feature extraction methods, with the deep learning algorithms used only for the final classification stage, since the feature extraction stage plays an important role in the classification of EEG signals for BCI systems. Additionally, a comparison between methods with and without feature extraction could yield insightful results regarding the use of IVA to obtain network weights, something that was not in the scope of this paper. In this sense, IVA might be better suited to a separate stage, but incorporating a feedback system could be a promising approach for future developments.
Finally, despite IVA's strong identifiability properties, the direct interpretation of sensor-domain components remains challenging. Future research could enhance interpretability by transforming the data to the spectral domain or by using features such as event-related potentials (ERPs), as proposed in previous studies [21,45,46].

The matrices W_c^[k] correspond to the extraction of the main features for the c-th class and k-th subject, with k ∈ {1, ..., 5} for DS4a and c = 1, 2. This procedure is presented in Figure 1. Then, considering the k-th subject, the estimated SCV components were obtained by multiplying the training and validation data by each class matrix W_1^[k] and W_2^[k], together with each corresponding whitening matrix V_1^[k] and V_2^[k], resulting in y_c^[k]_train and y_c^[k]_valid, as shown in Algorithm 1. Using both matrices at this point is necessary, considering that the validation and test data are assumed to be completely unknown. Obtaining IVA matrices for each class can leverage the feature extraction process, leading to a possible classification improvement.

Figure 1. Procedure description for obtaining the IVA matrices W_c^[k] for each class based on the training data.
Figure 2. Procedure for selecting the suitable initialization W_c^selected for each W_c^[k], implemented using the validation data.

Figure 3. Performance analysis of IVAS and IVAK concerning the IVA initialization, based on the KDE for subjects from Dataset 4a. (a) Dataset 4a with IVAS; (b) Dataset 4a with IVAK.

Figure 4. IVAS and IVAK performance analysis with respect to the number of EEG channels.

Figure 5. Examples of SCV covariance matrices obtained through IVA for DS4a, considering right hand and right foot movements.

Table 1. Main IVA component correlations per subject for the right hand (DS4a) and the similarities between subjects compared with the KDE analysis.

Table 2. Main IVA component correlations per subject for the right foot (DS4a) and the similarities between subjects compared with the KDE analysis.

Table 3. Accuracy and standard deviation obtained in classifying BCI Competition III Dataset 4a.