A novel multimodal approach for hybrid brain-computer interface

Brain-computer interface (BCI) technologies have been widely used in many areas. In particular, non-invasive technologies such as electroencephalography (EEG) or near-infrared spectroscopy (NIRS) have been used to detect motor imagery, disease, or mental state. It has been already shown in literature that the hybrid of EEG and NIRS has better results than their respective individual signals. The fusion algorithm for EEG and NIRS sources is the key to implement them in real-life applications. In this research, we propose three fusion methods for the hybrid of the EEG and NIRS-based brain-computer interface system: linear fusion, tensor fusion, and $p$th-order polynomial fusion. Firstly, our results prove that the hybrid BCI system is more accurate, as expected. Secondly, the $p$th-order polynomial fusion has the best classification results out of the three methods, and also shows improvements compared with previous studies. For a motion imagery task and a mental arithmetic task, the best detection accuracy in previous papers were 74.20\% and 88.1\%, whereas our accuracy achieved was 77.53\% and 90.19\% . Furthermore, unlike complex artificial neural network methods, our proposed methods are not as computationally demanding.


I. INTRODUCTION
Brain-computer interface (BCI) is an important tool for the detection of signal patterns in brain activity. It has been used widely in many areas, such as in robotics control, workload detection or brain-disease detection [1]- [7]. BCI technology is usually divided into invasive BCI and non-invasive BCI. Invasive BCI requires the sensors to record brain activities from within the skull. Non-invasive BCI, on the other hand, records brain signals using sensors placed on the scalp, and it is undoubtedly a much safer technology and easier to use [8]. However, signals from non-invasive BCI sensors are usually full of noise, from the subjects' unconscious eye movement or ambient noise, for example. Finding a way to use noninvasive sensors to detect brain signals accurately is still a big challenge [9].
Transitional non-invasive BCI systems use either electroencephalography (EEG) sensors, near-infrared spectroscopy (NIRS) sensors or Magnetic Resonance Imaging (MRI) to record brain activities [10]- [13]. Compared with the MRI equipment, EEG and NIRS equipments are lowcost and smaller in size. Both EEG and NIRS have been implemented in many real time BCI applications [12].
EEG equipment consists of metal electrodes placed on the scalp to record electrical signals [14]. The electrodes record the activity of the surrounding neurons. For the EEG-based BCI system, classical feature extraction methods such as common spatial patterns (CSP) [15], power spectrum density (PSD) and auto-regressive modeling (AR) [16] have been previously proposed to analyze and localize the EEG patterns and activated brain area of motor imagery (MI) and mental arithmetic (MA) [17]. Then, in regards to classifiers, several machine learning classifiers such as k-nearest neighbours (KNN) [18], support vector machine (SVM) [19], [20], and linear discriminant analysis (LDA) [21] are used in classifying these proposed EEG features. In recent years, with the wide application of deep learning, many studies have already shown that convolution neural network (CNN) exhibits good performance for EEG processing or NIRS processing [22], VOLUME [23]. CNN and CNN-based models, such as EEGNet [24] or convolutional neural networks-stacked autoencoders (CNN-SAE) [25], have obtained remarkable MI classification results. Moreover, some researchers have developed an LSTMbased framework [26] to extract essential features of timevarying EEG signals to classify motor imagery signals.
NIRS-based BCI systems, generally used to measure the hemodynamic signals from target regions of the brain [27], can also analyze oxy-hemoglobin NIRS (oxy-NIRS) or deoxy-hemoglobin NIRS (deoxy-NIRS) concentrations in order to localize and classify the brain activity [28]. Generally, when a specific brain area becomes more active, energy metabolism increases, leading to an increased oxygen consumption and increased levels of carbon dioxide in the area. Then, oxy-NIRS signals will decrease while deoxy-NIRS signals will increase. For the signal processing or pattern classification of NIRS, transitional algorithms such as SVM or artificial neural networks are used [29].
Instead of using EEG or NIRS mehods individually, using them simultaneously provides a multimodal hybrid BCI. It has been repeatedly proven in literature that multimodal BCI systems have a better accuracy and stability than the BCIs based on a single modality [30]- [32]. For instance, Jaeyoung Shin et al. presented an open source benchmark for a hybrid EEG-NIRS fusion BCI system [12], [30] which also uses a linear discriminant analysis classifier for the multimodal data classifications.
Non-linear feature fusion is commonly used in deep learning, and there are many ways to achieve it. For instance, in [33] a bilinear method was used to fuse two feature vectors, which can be thought of as the 2-order tensor product fusion. Even further, a trilinear method was used in [34] to fuse three modality features. In addition, some tensor-based methods have been proposed to solve multimodal fusion problems such as Visual Question Answering: for example, [35] proposed a Deep Attention Neural Tensor Network which can discover the joint correlations over images, questions and answers with tensor-based representations; [36] proposed a low-rank multimodal fusion method which performs multimodal fusion using low-rank tensors to improve efficiency; finally, [37] proposed a polynomial tensor pooling block for integrating multimodal features by considering high-order moments.
In this study, a novel approach for the hybrid BCI system is proposed. Firstly, CNN was used to get the feature vectors from the EEG, oxy-NIRS, and deoxy-NIRS signals. A singlemodal data was calculated to compare with hybrid-system based classification results. Then, three fusion methods were used to process the multimodal feature vectors: linear fusion classification method, tensor fusion classification method and the pth-order polynomial fusion classification method. Jaeyoung Shin et al's benchmark dataset [30] was used for comparison purposes to validate the methods proposed in this paper.
The classification results for both the motor imagery (MI) experiment data and the mental arithmetic (MA) experiment data were compared with results from previous works. In both these two different tasks, the pth-order polynomial fusion classification shows the best classification performance. In the MI task, the accuracy achieved was 77.53%, while the accuracy achieved in the MA task was 90.19%. Furthermore, compared with the deep neural networks method, our algorithm is simple and uses less computing resources. It can be used for real-time hybrid BCI systems.
The rest of the paper is organized as follows: hybrid BCI benchmark dataset and fusion algorithm are introduced in Section II. Experiments and pre-processing, as well as their results, are presented and discussed in Section III. The discussion and conclusions are given in the Section IV and Section V respectively.

A. DATASETS
Subjects 29 subjects, consisting of 14 males and 15 females, participated in the data collection procedures. The average age was 28.5 ± 3.7 (mean± standard deviation).
Data Acquisition EEG data was collected by a thirtychannel BrainAmp EEG amplifier (Brain Products GmbH, Gilching, Germany), with a sampling rate of 1000 Hz. These electrodes were placed according to the international 10-5 system. NIRS data was collected by NIRScout (NIRx GmbH, Berlin, Germany) with a sampling rate of 12.5 Hz and containing thirty-six channels. More details about the electrodes' position can be found in [30].
Motor imagery (MI) experiment Subjects were required to conduct kinesthetic MI. Namely, they needed to imagine  opening or closing their hand while they were grabbing a ball, so that the MI is actual MI rather than visual. Each trial consisted of three parts: instruction (2s), task (10s) and rest (15-17s). Each session consisted of a pre-rest (60s) , 20 repetitions of the aforementioned trial, and a post-rest (60s). The MI experiment dataset contains the data from 3 sessions. Mental arithmetic (MA) experiment Subjects were required to conduct subtraction in the form of a three-digit number minus a one-digit number (e.g., 384-8). As in the MI experiment, the MA experiment dataset also contains data on 3 sessions where each session consists of 20 repetitions of the trial. Subjects were given an initial subtraction calculation, and completed this subtraction as their first trial of their total 20. For the remaining 19, the new subtraction was the result from the last calculation minus the initial one-digit number, which thus remained constant throughout the trials.
Data pre-processing The EEG data was first re-referenced by using common average reference and filtered with the 4th-order Chebyshev type 2 filter with a bandpass of 0.5-50 Hz. Then, independent component analysis (ICA)-based EOG rejection was used to remove artifacts. After that, the EEG data was downsampled to 200 Hz. For each trial, 35s of data was extracted, containing a segment of the last rest (10s), the task (10s) and the final rest (15s). Then, a 3s time window with 1s step was employed to collect data. Eventually, 33 segments were obtained from each trial. Each segment has a shape of 30 × 600 (channels×times). For the NIRS data, the concentration changes of oxy-and deoxy-hemoglobin (oxy-NIRS and deoxy-NIRS) were first calculated by using the modified Beer-Lambert law. The oxy-NIRS and dexoy-NIRS were then filtered by a 6th-order zero-phase Butterworth filter with a bandpass of 0.01-0.1 Hz. After that, a 3s time window like the EEG was used to segment the NIRS data. After this process, a segment ends up with a shape of 36 × 30. Thus, for both the MI and MA experiments, each subject completed 60 trials, and each trial has 33 segments. Fig. 3 shows the segments of one of the subjects for three modal data from 0s to 3s.

B. MULTIMODAL CLASSIFICATION METHOD
In this section, we will first show how the classification model is built by using single modal data. Following this, two common methods to fuse features in deep learning are introduced to allow for comparison with our method and a better understanding of it. After that, our pth-order polynomial fusion method is introduced. In the end, in order to  tackle the unacceptably large amount of parameters, tensor decomposition is used.
Notation Focusing on the tri-modal task, assume x 1 , x 2 and x 3 denote the EEG, oxy-hemoglobin NIRS (oxy-NIRS) and deoxy-hemoglobin NIRS (deoxy-NIRS), respectively. Also assume that z 1 , z 2 and z 3 denote their respective feature vectors obtained from three CNNs. To express the formula concisely, Einstein notation [38] is applied to describe multiplication between tensors or between a tensor and a vector. In particular, if we assume x i to denote a vector and W ijk to denote a 3rd-order tensor, their product can be written as y jk = x i W ijk , which means y jk = I i x i W ijk . Also, when we concatenate z 1 , z 2 and z 3 , the result is then written as z 1,2,3 . Given a vector x i1 , its first copy is written as x i2 , and its (N − 1)th copy is x i N .
Single modal classification We have data from three different modalities, and each has a shape of channels×times. Specifically, these are EEG, oxy-NIRS and deoxy-NIRS, with shapes 30 × 600, 36 × 30 and 36 × 30, respectively. For the single-modal classification, we conduct 1D-CNN and then each convolutional layer is followed by a batch-norm layer and a ReLU layer [39]. Since oxy-NIRS and deoxy-NIRS have same shape, we use the same CNN structure for them. Table 1 and 2 show the CNNs used on EEG and NIRS in detail. Notice that the feature vectors z 1 , z 2 and z 3 are the results of "AvgPool1d".
Considering the triple modal fusion problem, there are linear and multilinear methods. Fig. 4, 5, and 6 show three ways to fuse multimodal data.
Linear fusion (LF) classification It is commonly used to simply concatenate all feature vectors obtained from the output of "AvgPool1d". Thus, the linear fusion product can be written as where W i,o is a fusion matrix, and y o is the fused feature vector with a length of o. y o is then put in a fully connected network (FCN) in order to generate a classification result. Fig. 4 shows the flow of LF classification.  the outer product is usually introduced. It is a natural way to obtain a feature containing the interaction amongst multiple feature vectors. In our task, the outer product result can be written as where Z abc is a 3rd-order feature tensor, and a, b, c are the lengths of three modal feature vectors. After that, a 4th-order weight tensor W abco is used to obtain the fused feature vector y o : Then, L2 normalization is applied to y o . Finally, FNC is used for classification. This operation is the same as the one shown above in the LF classification section. Fig. 5 shows the flowchart of TF classification. pth-order polynomial fusion (pth-PF) classification Tensor fusion considers the interaction between multiple feature vectors. However, the interactions within each feature vector or between two of the feature vectors are not present in the fusion. To tackle this problem, polynomial fusion is introduced. As in with linear fusion, we firstly obtain z 1,2,3 by concatenating all feature vectors. Then, p − 1 copies of z 1,2,3 are made and the outer product is calculated: Also, as in with tensor fusion, a (p + 1)th-order weight tensor W i1i2...ipo is employed for fusion: The subsequent operations are the same as shown in the TF classification section. Fig. 6 shows the flowchart of PF classification.
Rethink three fusion methods As it turns out, the fusion weight matrix/tensor in these fusion methods can be "decomposed". It should be noticed that the "decomposition" here is just to decompose the aforementioned formulas in order to make them easier to understand. For LF, Eq. 1 can be written as: where  We can see that LF actually only considers the single modal feature and simply adds them together. For TF, the W abco in Eq. 3 cannot be "decomposed" since three modal vectors are entangled together. This may raise the problem of overfitting in the model, because the noise in the sample may be amplified during feature entanglement. For pth-PF, LF is a special case of pth-PF when p = 1. To be concise with the formula, if we let p = 2, then Eq. 5 can be written as: where and We can therefore see that all interactions between modes are taken into account. Compared with TF, this kind of entanglement may also introduce more noise, but at the same time it also generates more fruitful features. It can be observed, from the experimental results, that this operation has indeed a positive effect.
Dimension reduction by tensor decomposition In deep learning, the usually large amount of parameters can lead to difficult training or overfitting [9], [40]. The number of parameters required by TF increases exponentially along with the increase of the mode, and the number of parameters required by PF increases exponentially along with the increase of the order. In order to tackle this issue, canonical/polyadic (CP) [41] decomposition is used. Its main idea is based on using several core tensors to represent the original weight tensor. Given a weight tensor W i1i2...ipo , this tensor can be represented by: where r is called rank, and is used to control the size of all core tensors. With the increase of r, the core tensors can better approximate the weight tensor (i.e. increasing the ability of representation). Thus, Eq. 5 can be written as: Considering that all of z 1,2,3 ip are the same, we can further reduce the parameters by assuming symmetric structure of the core tensors, namely, W imro = W inro , m = n.
Complexity analysis Assume that three feature vectors have lengths of A, B and C, and the fusion vector has a length of O. Let CP rank be R. Then, the computational and storage complexity for LF, TF and pth-order PF will be O((A + B + C)O), O(ABCO) and O((ABC) p O), respectively. By conducting CP decomposition, the computational and storage complexity for TF and pth-order PF will be O((A+B +C)RO) and O(p(A+B +C)RO). By assuming symmetric structure of the core tensors, the complexity for pth-order PF is reduced to O((A + B + C)RO). We can thus see that CP decomposition significantly reduces the amount of parameters, and also that symmetric structure can further reduce parameters on pth-order PF.

A. NETWORK CONFIGURATION
For the single-modal model, the network details are shown in Table 1 and Table 2. For the triple-modal model, the CNN model for extracting feature vectors is the same as the one before "Avgpool1d" in the Table 1 and Table 2. The fused feature vector has a length of 128. TF employs CP decomposition, and pth-order PF employs CP decomposition as well as symmetric structure. After that, a FCN with one layer is employed for classification.
In the training phase, we applied cross-entropy as the loss VOLUME 4, 2016  For the triple-modal data, we have 29 subjects, with each subject carrying out 60 trials. A trial lasts 35 seconds, where resting goes from -10s to 0s, the task goes from 0s to 10s, followed by another resting period from 10s to 25s. We only used 3s data as the input for the model. For each subject, we conducted a 5-fold cross validation and took the average. Finally, all the subjects' results were averaged.

B. RESULTS
We firstly compared the classification accuracy results with the best results from the following published paper [30]. Our methods show significant improvement.
The results of the classification accuracies of our methods are displayed in Fig. 8, where x-axis indicates the left edge of the moving time window and the y-axis shows the accuracy.
We can see that all the tri-modal fusion models perform better than single-modal models. Furthermore, in both tasks, PF achieves better results than LF and TF.

A. COMBINING EEG AND NIRS SIGNALS
In recent years, a large number of new technologies have been developed for the analysis of brain activity from singlemodal BCI equipment. New algorithms are always being developed over time that improve the reliability of the current BCI system. However, a single-model BCI system still has many limitations. For instance, a multimodal BCI has better anti-noise capabilities than a single-model BCI [42]- [44], [44]. Multimodality implies a multi-view, and thus a different representation for the same brain activity patterns [45]. It helps the algorithm to detect specific patterns from the signals. Many previous studies are in line with our results having also indicated the improvement by fusion of multisignals [46].
We selected a NIRS and EEG-based hybrid BCI fusion system because both of these two systems are real-time and low-cost [31], [47]. Additionally, compared with MRI, these two technologies do not require a medical license. We believe EEG and NIRS-based hybrid BCI methods can and will be used as a critical tool in many applications.

B. HYBRID BCI SIGNAL PROCESSING BASED ON TENSOR FUSION
According to recent literature, deep learning methods or artificial neural networks have been applied for the detection of patterns in brain activity. Several artificial neural networks have also been applied to this dataset. However, almost all of the neural networks result in an over-fitting problem since the BCI dataset includes small samples. Although tuning the neural structures or parameters could slightly improve the accuracy, the method's stability leaves much to be desired [22], [48].
A tensor is a higher-order array that represents signals from different types of sensors [49]- [51]. For example, EEG data or NIRS data can be represented by time ÃŮ frequency ÃŮ electrode, and functional MRI data can be represented by voxels ÃŮ scans ÃŮ subjects [52]- [54]. In addition, tensor decomposition has been widely used to capture brain signal patterns. In this work, the classification accuracy of our proposed methods is much better than neural networks, and our system is also very robust.

V. CONCLUSIONS
In this paper we have focused on using tensor fusion methods to construct a hybrid BCI system. The hybrid BCI system includes simultaneous EEG signals and NIRS signals. It also includes two tasks: a MI task and a MA task. We used a shallow CNN neural network to detect the EEG and NIRS feature vectors. Then, we used LF, TF and pth-order PF fusion methods to integrate the feature vectors. The results show that the pth-order PF fusion method has the best classification accuracy out of the three fusion methods and the single-modal classification. Furthermore, when comparing the performance of our method with published literature, pthorder shows better results. We believe our method could be useful for hybrid BCI systems.