A Decomposition-Based Hybrid Ensemble CNN Framework for Driver Fatigue Recognition

Electroencephalogram (EEG) has become increasingly popular in driver fatigue monitoring systems. Several decomposition methods have been used to analyze EEG signals, which are complex, nonlinear and non-stationary, and to improve the EEG decoding performance in different applications. However, it remains challenging to extract more distinguishable features from different decomposed components for driver fatigue recognition. In this work, we propose a novel decomposition-based hybrid ensemble convolutional neural network (CNN) framework to enhance the capability of decoding EEG signals. Four decomposition methods are employed to disassemble the EEG signals into components of different complexity. Instead of handcrafted features, the CNNs in this framework directly learn from the decomposed components. In addition, a component-specific batch normalization layer is employed to reduce subject variability. Moreover, we employ two ensemble modes to integrate the outputs of all CNNs, comprehensively exploiting the diverse information of the decomposed components. On the challenging cross-subject driver fatigue recognition task, the models under the framework all showed superior performance to the strong baselines. The performance of different decomposition methods and ensemble modes was further compared. The results indicated that the discrete wavelet transform-based ensemble CNN achieved the highest average classification accuracy of 83.48% among the compared methods. The proposed framework can be extended to any CNN architecture and applied to any EEG-related task, opening the possibility of extracting more beneficial features from complex EEG data.


Introduction
Driver fatigue is one of the most common and serious hazards on the road and thus a threat to human life worldwide. It often produces drowsiness or even causes short sleep episodes, making drivers less able to recognize potential hazards. According to statistics on the causes of traffic accidents in various countries, fatigued driving accounts for a large proportion [1]. Hence, effectively and efficiently monitoring driver fatigue to avoid traffic accidents has become one of the most important research topics in automotive safety engineering. In recent years, many researchers have attempted to exploit physiological signals for fatigue detection, including but not limited to eye movement signals [15], heart rate [12], and facial expression [2]. Among the different physiological signals, the electroencephalogram (EEG), which directly reflects brain activity, has become increasingly popular in driver fatigue monitoring systems [19,9]. Specifically, applying EEG can help to improve driving safety through direct and passive communication between the brain and the road environment. However, due to its complexity and subject variability, EEG decoding performance in cross-subject recognition tasks is limited [18]. In this work, we address these problems by reducing the complexity of the original EEG signals with signal decomposition methods. The original EEG signals are decomposed into components of different frequency bands.
Decomposition helps the model focus on specific frequency bands and extract features that are more beneficial for driver fatigue recognition, because the less useful frequency bands have less impact. As a result, it eases the task of learning from the EEG signals.
Various decomposition methods have been applied to EEG signals to improve the decoding performance [38]. There are mainly two ways of exploiting the decomposed components. The first is to perform hand-engineered feature extraction based on prior knowledge of the signals, followed by classifier training. For instance, Anuragi et al. [4] first decomposed the EEG signals by using the empirical wavelet transform (EWT) [13].
Then, entropy-based features were computed from the Euclidean distances of the 3D phase-space representation (PSR) of the decomposed components. After the feature extraction based on the Kruskal-Wallis statistical test, ensemble learning was performed to recognize epileptic seizure states.
Sairamya et al. [29] used relaxed local neighbor difference pattern (RLNDiP) features from the time-frequency domain, consisting of five brain rhythms obtained by discrete wavelet transform (DWT) [26]. Artificial neural networks (ANN) were employed for the automatic diagnosis of Schizophrenia.
Gu et al. [14] exploited non-negative matrix factorization (NMF) and empirical mode decomposition (EMD) to decompose the EEG signal into a set of intrinsic mode functions (IMFs). Statistical features were extracted from the de-noised components and inputted to classifiers such as support vector machine (SVM). PrakashYadav et al. [37] applied variational mode decomposition (VMD) [11] to decompose the EEG signals of seizure into 14 IMFs.
Normalized energy, fractal dimension, number of peaks, and prominence parameters were employed as the extracted features. A Bayesian regularized shallow neural network was then proposed to perform the classification task.
Interestingly, Sadiq et al. [28] considered the decomposed components as a feature vector and further applied methods such as neighborhood component analysis (NCA) to reduce the huge dimension of the feature matrix.
Although this study tried to exploit raw data, it suffered from information loss. The second way of exploiting decomposed components is to use decomposition to reduce the noise in the signals. Feature selection is performed on each component to reduce the noise more meticulously. For instance, Aliyu et al. [3] computed a series of features from the components decomposed by DWT. Then, correlation and p-value analysis were used to select the optimal features. Instead of feature-wise selection, other researchers directly selected the most effective components and discarded the rest. For example, Wang et al. [35] computed the sample entropy of each IMF decomposed by VMD and selected the IMF with the highest sample entropy to train the classifier. In summary, simpler patterns can be obtained after decomposition, making the signal easier to analyze. However, less attention has been paid to learning patterns directly from the raw signals of decomposed components, which has the potential to extract more distinguishable patterns of the recognized states. In addition, although some feature selection methods can help reduce the noise, they also cause information loss. Thus, properly capturing more distinguishable patterns while reducing the impact of redundant information from different components on model learning is still a research gap.
Furthermore, deep convolutional neural networks (CNNs) have shown outstanding performance in decoding raw EEG signals end-to-end. Schirrmeister et al. [30] were the first to comprehensively study the design and training of CNNs using raw EEG data. A shallow and a deep CNN architecture were designed to outperform the filter-bank common spatial pattern (FBCSP) on motor imagery classification. CNNs with different architectures have been tried for various EEG-based recognition tasks. For instance, a multi-scale EEGWaveNet [33] was proposed by Punnawish et al. to address epileptic seizure detection. Khare et al. [20] exploited a CNN with the input of transformed time-frequency signals to perform emotion recognition. However, these models were only analyzed in specific domains. In 2017, Lawhern et al. [21] proposed a compact CNN named EEGNet, which can be applied to different brain-computer interface (BCI) paradigms and achieved good performance in subject-dependent and subject-independent settings. Moreover, Cui et al. [9] proposed an InterpretableCNN which performed spatial and temporal convolution operations. It improved the subject-independent recognition accuracy compared with EEGNet and showed its superiority in the drowsiness detection task. Apart from the commonly used two-dimensional CNN, other architectures, such as long short-term memory (LSTM) networks [34] and 3D CNN [39], have presented promising results for the corresponding BCI paradigms as well. Overall, previous studies have shown that end-to-end CNN models can improve recognition performance.
However, signal complexity and the uncertainty of a single model remain constraints on capturing the basic patterns and achieving accurate classification. Although decomposition could be applied together with a CNN to perform recognition tasks, a single CNN may tend to overfit and suffer from local optima when data of a higher dimension are used. Hence, it is necessary to apply CNNs to the raw signals of decomposed components, and an ensemble model is more robust in learning from diverse input data than a single model.
To this end, we propose a novel decomposition-based ensemble CNN framework to improve the cross-subject driver fatigue recognition performance. Specifically, four decomposition methods are used to reduce signal complexity. Then, an individual CNN is employed to automatically learn the beneficial patterns from each component, which has a lower complexity than the original signals. Finally, two ensemble output modes are exploited to combine the classification scores before or after the Softmax layer. In addition, a component-specific batch normalization (CSBN) layer is added to reduce the subject variability of EEG signals. In the proposed framework, the diversity of the trunks in the ensemble model is provided by feeding the decomposed components with different complexity, increasing the generalization ability of the framework. We tested the framework on a challenging cross-subject driver fatigue recognition task. Results indicated that the proposed models under our framework outperformed the state-of-the-art (SOTA) methods. In particular, the DWT-based ensemble CNN boosted the performance by more than 5% compared with the SOTA.
The contributions of this work can be summarized as follows:
• A novel decomposition-based ensemble CNN framework for driver fatigue recognition is proposed. Using decomposition methods reduces signal complexity and enriches the input diversity by disassembling the raw EEG signals into individual components with different characteristics. Then, an individual CNN is utilized to extract the beneficial features from each component. Finally, the ensemble of all the CNN outputs ensures strong generalization.
• This paper compares the use of different decomposition methods on the challenging cross-subject driver fatigue recognition task for the first time. The results demonstrated that DWT is more suitable for EEG decoding in the driver fatigue recognition task.
• A CSBN layer is employed to reduce subject variability.
• Using different architectures of CNN in the proposed framework is investigated. The framework boosts the performance of both shallow and deep CNNs. In particular, deep CNN has better recognition accuracy and is found to be more suitable for decoding decomposed components.

Preliminary study: Signal decomposition
There are two categories of signal decomposition, namely mode-based and wavelet-based methods. EMD and VMD are two representative mode-based methods. EMD [16] is the first adaptive signal decomposition method that can analyze non-stationary and nonlinear signals. However, EMD has the limitation of lacking mathematical theory support. VMD [11] has relatively stricter filter bank boundaries than EMD and avoids the cumulative error of envelope estimation caused by recursive mode decomposition. Two representative wavelet-based methods are also selected, namely EWT and DWT.
EWT [13] also addressed the limitation of EMD which lacks mathematical formulations by building adaptive wavelet filter banks. Lastly, DWT [10] is a classical wavelet-based method with solid theory and is also suitable for analyzing nonlinear and non-stationary signals. These four methods are fundamental among different signal decomposition methods and have been widely used in different applications. Therefore, these four methods are selected to perform the data pre-processing in the proposed framework. In the following parts, these four decomposition methods will be introduced in chronological order.
where $\psi_{j,k}(t) = 2^{j/2}\psi(2^{j}t - k)$, $j, k \in \mathbb{Z}$, is the wavelet function in DWT. In practical time series classification problems, the signal $x(t)$ and $\psi_{j,k}(t)$ are both discrete, as $t$ is a discrete-time index. Finite-length time series $x(t) \in L^2(\mathbb{R})$ are all applicable to DWT.
This paper employs the famous Mallat algorithm. In theory, the approximation $A_j(t)$ or detail $D_j(t)$ at scale $j$ is calculated through the inner product between the scaling function $\varphi_{j,k}(t)$ or wavelet function $\psi_{j,k}(t)$ and the time series $x(t)$ by Equations 2 and 3.
To get rid of the heavy computation herein, the coefficients $\{c_j[k], k \in \mathbb{Z}\}$ and .
Viewing $\{h_n, n \in \mathbb{Z}\}$ and $\{g_n, n \in \mathbb{Z}\}$ as a pair of low-pass and high-pass filters and $c_{j+1}[i]$ as the input signal, Equation 4 is implemented as shown in Figure 1, which is the famous Mallat algorithm [26].
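The filter-and-downsample step described above can be illustrated with a minimal sketch. It assumes the Haar filter pair (the text does not state which wavelet was used), and the names `mallat_step` and `dwt` are hypothetical. The sketch returns coefficient sequences whose length halves at each level, and an odd trailing sample is simply dropped; it is not the authors' implementation.

```python
import math

# Haar analysis filters: an assumption, the paper does not name the wavelet here.
H = [1 / math.sqrt(2), 1 / math.sqrt(2)]   # low-pass filter h_n
G = [1 / math.sqrt(2), -1 / math.sqrt(2)]  # high-pass filter g_n


def mallat_step(c_next, h=H, g=G):
    """Filter c_{j+1} with h and g and keep every second output sample."""
    approx, detail = [], []
    for k in range(len(c_next) // 2):
        window = c_next[2 * k: 2 * k + len(h)]
        approx.append(sum(hn * v for hn, v in zip(h, window)))
        detail.append(sum(gn * v for gn, v in zip(g, window)))
    return approx, detail


def dwt(signal, levels):
    """Return [D_1, ..., D_levels, A_levels] as lists of coefficients."""
    components, approx = [], list(signal)
    for _ in range(levels):
        approx, detail = mallat_step(approx)
        components.append(detail)
    components.append(approx)
    return components
```

With an orthonormal filter pair such as Haar and an even-length input, the total energy of the coefficients equals the energy of the input, which is a convenient sanity check for any implementation of this step.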
where $k$ is the number of IMFs and $r_k(t)$ is the final residual value.
The set of IMFs serves as a complete, adaptive and nearly orthogonal basis for the original signal. The algorithm of EMD is described as follows:
EWT overcomes the mode mixing problem caused by the discontinuity of the time-frequency scale of the original signal. It has a complete and reliable mathematical theoretical foundation with low computational complexity.
In the EWT, limited freedom is provided for selecting wavelets. The algorithm employs Littlewood-Paley and Meyer's wavelets because of the analytic accessibility of the Fourier domain's closed-form formulations [32].
The band-pass filters are formulated in Equations 6 and 7, with a transition band width parameter $\gamma$ satisfying $\gamma \le \min_n \frac{\omega_{n+1} - \omega_n}{\omega_{n+1} + \omega_n}$. The most common function $\zeta(x)$ in Equations 6 and 7 is presented in Equation 8. This makes the formulated empirical scaling and wavelet functions a tight frame of $L^2(\mathbb{R})$ [8].
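As a small worked example of the constraint on $\gamma$, the sketch below computes the bound for a list of spectrum boundaries. The boundary values in the test are hypothetical, and `zeta` assumes that Equation 8 is the polynomial commonly used for Meyer-type filters in the EWT literature, which the text here does not confirm.

```python
def gamma_upper_bound(boundaries):
    """min over n of (w_{n+1} - w_n) / (w_{n+1} + w_n).

    `boundaries` must be positive and strictly increasing.
    """
    pairs = zip(boundaries, boundaries[1:])
    return min((hi - lo) / (hi + lo) for lo, hi in pairs)


def zeta(x):
    """Assumed transition polynomial: rises smoothly from 0 to 1 on [0, 1]."""
    if x <= 0:
        return 0.0
    if x >= 1:
        return 1.0
    return x ** 4 * (35 - 84 * x + 70 * x ** 2 - 20 * x ** 3)
```

The ratio is scale-invariant, so the bound is the same whether the boundaries are given in Hz or as normalized angular frequencies.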

Variational mode decomposition
VMD [11] can adaptively find the optimal center frequency and the limited bandwidth based on the determined number of decomposed modes.

VMD provides improvements over wavelet transform and EMD in that there is no modal aliasing effect and the method is not sensitive to noise.
VMD can reduce the non-stationarity of time series which have high complexity and strong nonlinearity, and decompose the signals to obtain multiple IMFs with different frequency bands.
VMD can be considered to solve a constrained variational problem, where $m_k$ is mode $k$, $\omega_k$ is the central frequency of $m_k$, and $K$ is the number of modes. The alternating direction method of multipliers (ADMM) algorithm is utilized to solve the above problem in VMD. Then, the modes $m_k$ and $\omega_k$ are obtained during the shifting process. According to the ADMM algorithm, $m_k$ and $\omega_k$ can be computed by Equations 12 and 13, where $n$ represents the number of iterations, and $\hat{y}(\omega)$, $\hat{m}_k(\omega)$, $\hat{\lambda}(\omega)$ and $\hat{m}_k^{n+1}$ represent the Fourier transforms of $x(t)$, $m_k(t)$, $\lambda(t)$ and $m_k^{n+1}$, respectively.
After decomposition, sub-band signals are obtained by EWT and DWT, and IMFs are obtained by EMD and VMD. Both are referred to as "components" in the following sections.

Methodology
This section presents the detailed hybrid ensemble CNN framework, which is divided into three subsections, i.e., data pre-processing and model learning, the component-specific batch normalization (CSBN) layer, and the ensemble output modes.

Data pre-processing and model learning
Using the four decomposition methods described in Section 2, the pre-processing of EEG signals is introduced as follows. Let $X = \{x_1, x_2, \ldots, x_n, \ldots, x_N\}$ be the input space (i.e., EEG signals) and $Y = \{y_1, y_2, \ldots, y_n, \ldots, y_N\}$ be the output space (i.e., EEG signal categories). After applying a decomposition method with $D$ as the predefined number of components, each input signal is decomposed into $D$ components, and each component is fed to its own CNN. For the CNN of the $d$-th component, the feature map of layer $l$ is denoted by $h_l^d$, where $C_l$ is the number of channels produced by the convolution and $H_l \times W_l$ is the size of the 2D feature map. After propagating the input through the convolutional and CSBN layers, which will be further introduced in Section 3.2, the final feature map $h_L^d$ is flattened and fed into a fully connected (FC) layer that transforms the features into two classification scores, $s_1^d$ and $s_2^d$, corresponding to the two classes, alert and fatigue, respectively. Finally, two ensemble output modes (E1 and E2), which will be described in detail in Section 3.3, are used to integrate the classification scores. The CNNs of all components are combined as an ensemble model, which is denoted as "ECNN".
Overall, in the proposed framework, there are four decomposition methods and two ensemble output modes, which give rise to eight models under our framework as listed in Table 1. The framework can be extended to any backbone CNN model that has adequate capability to decode EEG signals.

Component specific batch normalization (CSBN) layer
In the framework, a CSBN layer is employed to decrease subject variability. Batch normalization (BN) [17] is usually exploited to alleviate the issue of internal covariate shift. A standard BN layer first standardizes the hidden features, and then two affine parameters $\gamma$ and $\beta$ are used to transform the inherent mean and variance into trainable variables. Therefore, for a channel of hidden features, $h_{l,c}^d$, the operation of BN is expressed as:
$$\mathrm{BN}(h_{l,c}^d) = \gamma\,\hat{h}_{l,c}^d + \beta,$$
where $c$ represents the $c$-th channel of hidden features and $\hat{h}_{l,c}^d$ is the result of standardizing the hidden features, which is given by:
$$\hat{h}_{l,c}^d = \frac{h_{l,c}^d - \mu}{\sqrt{\sigma^2 + \epsilon}},$$
where $\epsilon > 0$, and $\mu$ and $\sigma^2$ are the mean and variance of a channel of hidden features within a mini-batch. For convenience, $h_{l,c}^d$ is rewritten as $h_{d,l,c}$ here, and Equations 15-17 are reformulated with the statistics indexed by $t$, where $t$ represents the source or target domain.
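The idea of normalizing with domain-specific statistics can be sketched as follows. This is an illustration under stated assumptions rather than the authors' CSBN layer: it handles a single channel of a single component's trunk, computes the mean and variance separately for each domain $t$, and shares the affine parameters $\gamma$ and $\beta$ across domains. The function names are hypothetical.

```python
def standardize(values, eps=1e-5):
    """Subtract the mean and divide by sqrt(variance + eps)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


def domain_specific_bn(batch, domains, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one channel using separate statistics per domain label.

    batch:   activations of one channel, one value per sample
    domains: domain label of each sample, e.g. "source" or "target"
    """
    out = [0.0] * len(batch)
    for t in set(domains):
        idx = [i for i, dom in enumerate(domains) if dom == t]
        normed = standardize([batch[i] for i in idx], eps)
        for i, v in zip(idx, normed):
            out[i] = gamma * v + beta  # affine parameters shared (assumption)
    return out
```

In this sketch, two domains whose activations differ only by a constant offset map to the same normalized values, which is the kind of alignment between training and testing subjects that such a layer aims for.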

Ensemble output modes
Two ensemble modes, E1 and E2, are employed to integrate the classification scores $s_1^d$ and $s_2^d$ in different trunks of the ensemble model, which are described as follows.
1) E1: To force the automatic learning of the proper patterns for integration, the parameters of the models in different trunks are updated simultaneously. Specifically, the output scores are averaged by
$$\bar{s}_i = \frac{1}{D}\sum_{d=1}^{D} s_i^d,$$
where $i$ represents the index of the score.
Then, the average score is transformed to probabilities by the Softmax layer, which is described as:
$$p_i = \frac{\exp(\bar{s}_i)}{\sum_{j=1}^{2}\exp(\bar{s}_j)}.$$
Finally, denoting $F(\cdot)$ as the output of a forward pass in the integrated model, the parameters in separate trunks are updated together in back-propagation by using the cross-entropy loss.
2) E2: The classification scores of the trunks are combined after the Softmax layer, which is known as soft voting. Finally, the predicted class can be obtained based on the output probabilities.
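The difference between combining before and after the Softmax layer can be made concrete with a small sketch. It covers only the combination step at inference time, not the training procedure. The reading of E2 as averaging per-trunk probabilities is an assumption based on the term "soft voting", and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable Softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_e1(trunk_scores):
    """E1-style: average the scores over trunks, then apply Softmax."""
    n = len(trunk_scores)
    n_classes = len(trunk_scores[0])
    avg = [sum(s[i] for s in trunk_scores) / n for i in range(n_classes)]
    return softmax(avg)


def ensemble_e2(trunk_scores):
    """E2-style (assumed): Softmax per trunk, then average the probabilities."""
    probs = [softmax(s) for s in trunk_scores]
    n = len(probs)
    return [sum(p[i] for p in probs) / n for i in range(len(probs[0]))]
```

The two modes can disagree: a single trunk with a large score can dominate the averaged scores in E1, whereas in E2 each trunk contributes at most a probability of 1 to the vote.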

Introduction of public driving dataset
This work uses the public driving dataset recorded from a sustained-attention driving task [7]. Regarding sample extraction, in practice, short-interval EEG data prior to the deviation onset can be adopted to perform fatigue recognition. In this work, to evaluate the proposed framework and to perform a fair comparison with the SOTA works, the 3-second EEG data prior to the deviation onset, which was commonly exploited in previous methods [36] [9], was adopted to classify 'alert' vs. 'fatigue' for the upcoming lane-departure event.
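The sample extraction can be sketched as a simple slicing step. The sampling rate of 128 Hz is an assumption inferred from the reported sample size (384 sample points for 3 seconds), and the function name and data layout (a list of channels, each a list of samples) are hypothetical.

```python
def extract_pre_onset_window(eeg, onset_index, fs=128, seconds=3):
    """Return the samples in [onset - fs * seconds, onset) for every channel.

    eeg:         list of channels, each a list of samples
    onset_index: sample index of the deviation onset
    fs:          sampling rate in Hz (assumed, see the note above)
    """
    n_samples = fs * seconds
    start = onset_index - n_samples
    if start < 0:
        raise ValueError("not enough data before the deviation onset")
    return [channel[start:onset_index] for channel in eeg]
```

For a 30-channel recording this yields a 30 × 384 sample that ends immediately before the deviation onset.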
Then, we followed Wei et al. [36] Table 2. The data size of one sample is 30 (channels) × 384 (sample points). Specifically, the first category of methods for comparison included the support vector machine (SVM) classifier [36] with the input of extracted power spectral density (PSD) features, EEGNet [21], ShallowCNN [30] and InterpretableCNN (ICNN) [9]. Among them, SVM+PSD represented the methods that were based on hand-engineered features, while EEGNet and ShallowCNN were the commonly used end-to-end EEG decoding baselines with the input of raw EEG data. Lastly, ICNN was the SOTA method on the used public driving task. ICNN was also an end-to-end model with the input of raw EEG data. Furthermore, the second category of methods included the SOTA transfer learning techniques for driver fatigue recognition, which are TCA [24] and MIDA [24]. These two methods used the extracted PSD features as the inputs. The SM model [22], which performed end-to-end EEG decoding, was also included for comparison. To demonstrate the effective-

Comparison results
The leave-one-subject-out (LOSO) average accuracy (denoted as avg. acc. in tables) of the eight proposed ensemble models under the decomposition-based hybrid ensemble CNN framework and the baseline algorithms in the cross-subject driver fatigue recognition task is compared in Table 3. Overall, the eight proposed models under the framework all outperformed the baseline methods. Compared with the SOTA results of ICNN, the increase in LOSO average accuracy ranged from 0.15% to 3.76%. In particular, the DWT-based ensemble CNN in E2 mode (DWT+EICNN(E2)) achieved the best LOSO average accuracy of 82.11% among the proposed models. The better performance of the proposed models demonstrated the effectiveness of the framework for fatigue recognition in the driving context. As for the classification performance on individual subject data, the decomposition-based hybrid ensemble models also presented the highest accuracy among all compared methods for all subjects except subjects 7 and 10. This could be due to the much lower decoding capability of the backbone network, ICNN, for these subjects, as shown by the lower accuracy obtained by ICNN for subjects 7 and 10.
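The LOSO protocol behind these numbers can be sketched as follows. The sketch assumes that the reported average is the unweighted mean of the per-subject accuracies. `fit_predict` is a hypothetical callback that trains on the training indices and returns predictions for the test indices.

```python
def loso_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) per subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


def loso_average_accuracy(subject_ids, labels, fit_predict):
    """Mean of the per-subject test accuracies (unweighted, an assumption)."""
    accuracies = []
    for _, train, test in loso_splits(subject_ids):
        predictions = fit_predict(train, test)
        correct = sum(1 for i, p in zip(test, predictions) if p == labels[i])
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)
```

Because the held-out subject never appears in the training indices, the accuracy of each fold measures cross-subject generalization rather than within-subject fit.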
To better understand the capability of the proposed framework for driver fatigue classification, Precision, Sensitivity, Specificity and F1-score of the proposed models and the baseline methods were compared in Table 4. The class 'alert' was set as positive, while the class 'fatigue' was set as negative in the calculation of these metrics. It was observed that all eight proposed mod- We further discuss the effect of the two ensemble modes and the four decomposition methods in the framework.
1) Two ensemble modes: From Table 3, for each decomposition method, the ensemble model achieved a higher LOSO average accuracy in E2 mode than in E1 mode. One possible reason is that during each simultaneous update in E1 mode, the low-quality patterns obtained may disturb the feature learning of the proper components for classification. Since the E2 mode performed better than the E1 mode, the ablation study in Section 4.3 was mainly conducted in E2 mode.
2) Four decomposition methods: With the same setting of 10 decomposed components for VMD and EWT and in the same E2 mode, the ensemble model based on EWT presented a better LOSO average accuracy than the one based on VMD. Notably, the EMD- and DWT-based ensemble models in E2 mode obtained an even higher LOSO average accuracy than the VMD- and EWT-based ones. This could be because the signals were decomposed into the simplest components in EMD and DWT. Hence, the models can capture beneficial features more easily.
To demonstrate the superiority of the proposed models, the one-tailed Wilcoxon paired signed-rank test was conducted as shown in Table 5. Only the p-values with significance (p < 0.05) were listed in the table. We can observe that the DWT-based models in E2 mode consistently showed a significantly stronger capability of classification than all baseline methods, while other proposed models showed significant improvements in LOSO average accuracy over most of the baselines.
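A one-tailed paired signed-rank test of this kind can be sketched in a few lines for small samples. The sketch computes an exact p-value by enumerating all sign assignments, drops zero differences, and gives tied absolute differences their average rank. These are common conventions and are assumptions here, since the text does not describe how the test was configured.

```python
from itertools import product


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def wilcoxon_one_tailed(proposed, baseline):
    """Return (W+, exact p) for the alternative 'proposed > baseline'."""
    diffs = [p - b for p, b in zip(proposed, baseline) if p != b]
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under the null hypothesis every sign pattern is equally likely.
    patterns = list(product((0, 1), repeat=len(ranks)))
    extreme = sum(
        1 for signs in patterns
        if sum(r for r, s in zip(ranks, signs) if s) >= w_plus
    )
    return w_plus, extreme / len(patterns)
```

Enumeration is exponential in the number of paired observations, so this form is only practical for a small number of subjects.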
According to the above results, the proposed decomposition-based hybrid ensemble CNN framework achieved a higher average accuracy than the compared methods, and the differences were statistically significant in most comparisons. Therefore, the proposed framework is effective in boosting the performance of backbone models on the EEG-based cross-subject driver fatigue recognition task.
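For reference, the four metrics reported in Table 4 can be computed from the confusion counts as sketched below, with 'alert' as the positive class as stated above. The function name is hypothetical, and returning 0.0 for an undefined ratio is a choice made for this sketch.

```python
def binary_metrics(y_true, y_pred, positive="alert"):
    """Precision, sensitivity, specificity and F1-score for two classes."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    def ratio(num, den):
        # Undefined ratios are reported as 0.0 in this sketch.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }
```

With this convention, sensitivity measures how many alert samples are recognized as alert, and specificity measures how many fatigue samples are recognized as fatigue.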

Sensitivity test of four decomposition methods
After decomposition, the total number of decomposed components fed into the ensemble models will affect the classification performance. In Section 4.2, we first set the total number of decomposed components fed into the ensemble models as 10, 4, 10, and 6 for the four decomposition methods, VMD, EMD, EWT, and DWT, respectively. Total numbers of components in the range of 3 to 10 were then selected for further investigation using the ensemble ICNN in E2 mode. The LOSO average accuracy obtained with each total number of components is listed in Table 6. For EMD and DWT, since the maximum number of decomposed components that can be obtained was 4 and 6, respectively, only numbers of components up to this maximum were investigated.
The results showed the best performance in the VMD- and DWT-based models trained with 5 decomposed components. For the EMD- and EWT-based models, using 4 and 9 decomposed components, respectively, obtained This may be attributed to the characteristic of VMD. According to the original work of VMD [11], the performance of VMD may be compromised when it comes to non-stationary signals such as long-term EEG signals, as drastic change and global overlapping may occur in the spectral bands of modes.
In other words, possibly due to the impact of the nature of EEG signals on VMD, the performance of the VMD-based model may be very sensitive to the total number of decomposed components. This may explain why the VMD-based model only showed high LOSO average accuracy at certain total numbers of components.

Investigation on the use of different backbone networks in proposed framework
To demonstrate that any backbone CNN with adequate decoding capability can be utilized in the proposed framework, the comparison results of using two different networks, a deep CNN (EEGNet-8,2 [21]) and a shallow CNN (ShallowCNN [30]), as the backbone are presented in Table 7. Since This allows the feature alignment between the training data and the testing data in different complexity levels. As a result, the effectiveness of the CSBN layer designed for reducing subject variability and boosting the performance of the proposed ensemble framework was illustrated.
On top of that, based on Table 8, it is worth emphasizing the effectiveness of decomposition and ensemble learning in the proposed framework on performance improvement again, as highlighted by the higher LOSO average accuracy of the ensemble models that used decomposed components with CSBN than that of ICNN that used the original EEG signals with AdaBN.

Analysis of ensemble learning in proposed framework
The effectiveness of ensemble learning is investigated in this section.
Specifically, we compared the performance of three groups, namely the model It is worth noting that while some components of less useful frequency bands may substantially deteriorate the performance when individually trained,

Visualization
To better understand the responsible regions and frequency bands for classification in the proposed framework, a class activation mapping (CAM)-based model interpretation technique [9] was employed to perform an in- Based on the visualization results, it was observed that the occipital lobe and the frontal lobe in the beta and delta frequency bands contributed more to alert recognition. Moreover, the regions of the centroparietal and occipital EEG channels in the theta, alpha and beta frequency bands mainly contributed to fatigue recognition. The finding on the responsible areas was compatible with previous studies [9] [27]. Regarding the responsible frequency bands, the theta and alpha frequency bands have been found to be strong indicators of early fatigue and have been used in various driving simulator studies to identify driver fatigue [6] [31]. The delta frequency band has also been found to be evidence of alertness [9]. Interestingly, the beta frequency band, which is frequently associated with active concentration [5], was shown to be responsible for both alert and fatigue recognition. Overall, the visualization demonstrated that the proposed framework could focus on useful areas and useful frequency bands, illustrating its effectiveness.

Conclusion
Against the challenge of decoding non-stationary and high-complexity EEG signals for driver fatigue recognition, we propose a decomposition-based hybrid ensemble CNN framework that can capture more beneficial features from EEG signals while not compromising on the amount of information This can provide more directions for driver fatigue monitoring systems. In future work, more decomposition methods will be further investigated for driver fatigue recognition tasks.