Deep learning for electroencephalogram (EEG) classification tasks: a review

Objective. Electroencephalography (EEG) analysis has been an important tool in neuroscience, with applications in basic research, neural engineering (e.g. brain–computer interfaces, BCI's), and even commercial applications. Many of the analytical tools used in EEG studies have used machine learning to uncover relevant information for neural classification and neuroimaging. Recently, the availability of large EEG data sets and advances in machine learning have both led to the deployment of deep learning architectures, especially in the analysis of EEG signals and in understanding the information they may contain about brain function. The robust automatic classification of these signals is an important step towards making the use of EEG more practical in many applications and less reliant on trained professionals. Towards this goal, a systematic review of the literature on deep learning applications to EEG classification was performed to address the following critical questions: (1) Which EEG classification tasks have been explored with deep learning? (2) What input formulations have been used for training the deep networks? (3) Are there specific deep learning network structures suitable for specific types of tasks? Approach. A systematic literature review of EEG classification using deep learning was performed on the Web of Science and PubMed databases, resulting in 90 identified studies. Those studies were analyzed based on type of task, EEG preprocessing methods, input type, and deep learning architecture. Main results. For EEG classification tasks, convolutional neural networks, recurrent neural networks, and deep belief networks outperform stacked auto-encoders and multi-layer perceptron neural networks in classification accuracy. The tasks that used deep learning fell into six general groups: emotion recognition, motor imagery, mental workload, seizure detection, event related potential detection, and sleep scoring. For each type of task, we describe the specific input formulation, major characteristics, and end classifier recommendations found through this review. Significance. This review summarizes the current practices and performance outcomes in the use of deep learning for EEG classification. Practical suggestions on the selection of many hyperparameters are provided in the hope that they will promote or guide the deployment of deep learning to EEG datasets in future research.


Introduction
Electroencephalography (EEG) is widely used in research involving neural engineering, neuroscience, and biomedical engineering (e.g. brain–computer interfaces (BCI) [1], sleep analysis [2], and seizure detection [3]) because of its high temporal resolution, non-invasiveness, and relatively low financial cost. The automatic classification of these signals is an important step towards making the use of EEG more practical in application and less reliant on trained professionals. The typical EEG classification pipeline includes artifact removal, feature extraction, and classification. On the most basic level, an EEG dataset consists of a 2D (time and channel) matrix of real values that represent brain-generated potentials recorded on the scalp associated with specific task conditions [4]. This highly structured form makes EEG data suitable for machine learning. A great number of traditional machine learning and pattern recognition algorithms have been applied to EEG data. For example, independent component analysis (ICA) is commonly used for artifact removal [5]; principal component analysis (PCA) and local Fisher's discriminant analysis (LFDA) are typically used to reduce the dimensionality of the features [5]; classic supervised learning methods such as linear discriminant analysis (LDA), support vector machines (SVM), and decision trees are common in neural classification [6,7]; and canonical correlation analysis (CCA) is frequently used to identify steady-state visual evoked potentials (SSVEPs). Neural networks did not immediately receive the high attention seen today in neural classification applications because of practical issues, such as very long computation times and problems with vanishing/exploding gradients [8]. Fortunately, the availability of large datasets and the recent development of graphic processing units (GPU's) brought neural network researchers an inexpensive and powerful solution to their hardware bottleneck [9], allowing them to investigate deep learning architectures (neural network architectures containing at least two hidden layers). These innovations have led to an exponential increase in interest in and applications of deep learning in the past decade. Indeed, deep learning has significantly improved performance in a wide range of traditionally challenging domains, such as images [10], videos [11], speech [12], and text [13]. Because neural networks iteratively and automatically optimize their parameters, they are generally believed to require less prior expert knowledge about the dataset to perform well [9]. This advantage led to early adoption in the realm of medical imaging [14], which usually involves large datasets that are otherwise difficult to interpret, even for experts. Recently, due to the increasing availability of large EEG datasets, deep learning frameworks have been applied to the decoding and classification of EEG signals, which usually are associated with low signal-to-noise ratios (SNRs) and high data dimensionality.
This systematic review of the literature on deep learning applications to EEG classification attempts to address critical questions: which EEG classification tasks have been explored with deep learning (section 3.3)? What input formulations have been used for training the deep networks (section 3.4)? Are there specific deep learning network structures suitable for specific types of tasks (section 4.1)? To address these questions, we compiled all peer-reviewed published EEG classification strategies using deep learning and reviewed their EEG preprocessing methods, network structures, and performance. By analyzing the overall trends together with the architectural comparisons made in individual studies, different EEG classification tasks were found to be classified more effectively with specific architecture design choices. This information was compiled into a recommendation workflow diagram, which can serve as a starting point for the initial architecture design phase in future applications of deep learning to EEG classification.

Search methods for identification of studies
PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses), a systematic review and meta-analysis procedure [15], was used to identify studies and narrow down the collection for this review of deep learning applications to EEG signal classification, as shown in figure 1. The search was conducted on 22 December 2018 within both the Web of Science and PubMed databases using a predefined group of search keywords. Duplicates between the two databases and studies not meeting the inclusion criteria (outlined below) were excluded. Full texts of the remaining studies were then screened.

Figure 1. PRISMA study selection diagram. The diagram moves through the four stages of PRISMA study selection: identification, screening, eligibility, and included. This process led this review from an original count of 349 studies to a final count of 90 studies.
The following criteria were used to exclude unqualified studies:
• Electroencephalography only. To reduce variability across studies, studies with multi-modal datasets, e.g. EEG analysis combined with other physiological signals (electrooculography, electromyography) or videos, were excluded.
• Task classification. This review focused solely on the classification of tasks performed by humans using their EEG signals. Other studies, such as power analyses, non-human studies, and feature selection with no end classification, were excluded.
• Deep learning. In this review, deep learning is defined as neural networks with at least two hidden layers.
• Time. Given the fast progress of research on this topic, only studies published within the past five years were included in this review.

Data extraction and presentation
The data categories collected from each study are detailed in the appendix.

Results
This section first details the preprocessing methods found within this review. Then, the general types of tasks, the input formulations, and architecture trends are analyzed. The results section finishes with case studies on shared public datasets, which allow for comparisons of specific deep learning design choices.
Emotion recognition tasks.
Emotion recognition tasks typically involve having subjects watch video clips which have been assigned specific emotions by experts prior to the viewing [27]. EEG was measured during these viewings and an emotion self-assessment typically followed. The self-assessment and the original emotion class were then converted into a pair of valence and arousal values, a widely used system to describe emotions [28]. The primary drive for emotion recognition studies is the eventual application in brain-machine interfaces, as understanding a patient's emotion will help the underlying algorithm decide whether a selected movement was the intended movement. More generally, emotion recognition studies help computers better understand the current emotional state of the user.

Motor imagery tasks.
Motor imagery tasks involve having the subject imagine movements of certain limbs and/or the tongue [29]. Their applications are mostly BMI-related, as eventual BMI applications will need to accurately classify a user's intended movement.

Mental workload tasks.
Mental workload tasks involve measuring EEG data while the subject is under varying degrees of mental task complexity. Many methods were used to vary the level of mental workload, including driving simulation studies [30], live pilot studies [31], and responsibility tasks [32]. For driver and pilot studies, mental workload was classified based on statistics of the subject's behavior, such as reaction time and path deviation. For responsibility studies, workload was classified based on the increasing number of actions the subject was responsible for. This kind of task may be applied in two general areas: cognitive stress monitoring or BMI performance monitoring.

Seizure detection tasks.
For seizure detection studies, EEG signals of epileptic patients are recorded during periods of seizure and during seizure-free periods [33]. For some datasets, EEG signals were also recorded from non-epileptic patients as a control class. These studies were designed with the eventual goal of detecting upcoming seizures and preemptively notifying the epileptic patient.

Sleep stage scoring tasks.
Sleep stage scoring studies, the task type with the fewest studies, record the EEG signals of subjects overnight. These signals are scored by experts and classified into sleep stages 1, 2, 3, 4, and the rapid eye movement stage. The eventual application of research of this kind focuses on reducing the reliance on trained personnel in the analysis and understanding of patient sleep stages.
Event related potential tasks.
Studies that focused on the detection and classification of event related potentials typically record EEG from subjects undergoing a visual presentation task. In these tasks, a subject watches a rapid sequence of pictures or letters with the aim of focusing attention on specific indicators. Once a specific letter or image appears, a stereotypical response is seen in the EEG data, typically in the form of a P300 response. These tasks are useful in research due to the relatively clean signal (minimization of artifacts) and high signal-to-noise ratio [34], characteristics not typically seen in EEG data. Research into event related potential tasks will help lead to improved non-verbal communication systems using EEG.

Preprocessing methods
EEG data is inherently noisy because EEG electrodes also pick up unwanted electrical physiological signals, such as the electromyogram (EMG) from eye blinks and muscles on the neck. There are also concerns about motion artifacts arising from cable movement and electrode displacement when the subject moves. The identification and removal of EEG artifacts have been extensively studied in the literature [35–38] and will not be repeated in this review. Outside of the 41% of studies that did not address any specific artifact removal process, studies approached artifact removal with one of the following three strategies (shown in figure 2): (1) manual removal (29%), (2) automatic removal (8%), and (3) no cleaning/removal (22%). Surprisingly, more than a quarter of the studies (26 of 90) removed artifacts manually. It is indeed easy to visually identify abrupt outliers, for example when signals are lost or when intense EMG artifacts are present. However, it is difficult to identify persistently noisy channels or noise that is sparsely present in multi-channel recordings. More importantly, manual data processing is highly subjective, rendering it difficult for other researchers to reproduce the procedures. Together with the 22% of studies that did not take any action to remove artifacts, 63% of the studies reviewed did not systematically remove EEG artifacts. The most frequent artifact-removal algorithms used in the remaining 8% of studies were independent component analysis (ICA) and discrete wavelet transformation (DWT).
Most studies used frequency domain filters to limit the bandwidth of the EEG to be analyzed. This is useful when there is a certain frequency range of interest so that the rest can be safely discarded. About half of the studies low-pass filtered the signal at or below 40 Hz, which is in or below the typical low gamma band. The filtered frequency ranges, organized by task type, and artifact removal strategies are shown in figure 2; most studies employed some form of artifact removal strategy in addition to reducing the frequency range to be analyzed.
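As a concrete illustration of this band-limiting step (a minimal sketch, not drawn from any specific reviewed study), the following Python snippet applies a zero-phase Butterworth low-pass filter at 40 Hz to a multi-channel recording; the sampling rate, filter order, and cutoff are illustrative assumptions.

import numpy as np
from scipy.signal import butter, filtfilt

def lowpass_eeg(eeg, fs=250.0, cutoff=40.0, order=4):
    # Zero-phase low-pass filter applied channel-wise.
    # eeg: array of shape (n_channels, n_samples); fs and cutoff in Hz.
    b, a = butter(order, cutoff / (fs / 2.0), btype="low")
    return filtfilt(b, a, eeg, axis=-1)

# Example: 32 channels, 10 s of simulated data sampled at 250 Hz.
filtered = lowpass_eeg(np.random.randn(32, 2500))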
An interesting question yet to be answered is whether it is still necessary to filter the data if deep learning is capable of pulling such information from unfiltered data. To the authors' best knowledge, there are no studies that specifically analyze whether deep learning can achieve comparable results without any artifact cleaning or removal process.

What input formulations have been used for training the deep networks?
EEG signals are intrinsically noisy and suffer from channel crosstalk [27]. In typical scalp EEG recording settings, each EEG electrode picks up signals from the area nearby, making the spatial resolution coarse (several centimeters). Unmixing the signals is not trivial because of the anisotropic volume conduction characteristics in human brain tissues, skull, scalp, and hair. Therefore, one of the standing challenges in EEG data analysis is how to formulate inputs. So, what input formulations have been used and how effective have they been shown to be in the context of deep learning?
Studies fell into three types of input formulation categories: calculated features (41%), images (20%), and the signal values (39%) (figure 3(A)). The selection of input formulation relied heavily on the task and deep learning architecture, so these decisions will also be described in the respective sections below.
Unsurprisingly, a wide range of methods have been developed to extract features from the EEG, accounting for 41% of all identified studies (figure 3(A)). EEG data are commonly analyzed in the frequency domain because their spectral content is often found to be associated with behavioral patterns [4]. Power spectral density (PSD), wavelet decomposition, and statistical measures of the signal (e.g. mean, standard deviation) are the three most common input formulations used in the reviewed studies.
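As an illustration of this kind of feature calculation (a minimal sketch under assumed band boundaries and window length, not the procedure of any cited study), the function below computes mean band power from Welch PSD estimates together with simple per-channel statistics.

import numpy as np
from scipy.signal import welch

BANDS = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}  # example bands in Hz

def psd_band_features(eeg, fs=250.0):
    # eeg: array of shape (n_channels, n_samples); returns a flat feature vector
    # of per-channel band powers plus per-channel mean and standard deviation.
    freqs, psd = welch(eeg, fs=fs, nperseg=min(eeg.shape[-1], 512), axis=-1)
    feats = []
    for lo, hi in BANDS.values():
        mask = (freqs >= lo) & (freqs < hi)
        feats.append(psd[:, mask].mean(axis=-1))  # mean band power per channel
    feats.append(eeg.mean(axis=-1))               # per-channel mean
    feats.append(eeg.std(axis=-1))                # per-channel standard deviation
    return np.concatenate(feats)

features = psd_band_features(np.random.randn(32, 2500))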
Signal values in the time domain are also used directly as inputs to the neural networks (39%). Traditionally, EEG classification has relied on particular hand-engineered features, such as power spectral density features [50]. Neural networks promise to automatically learn complicated features from large amounts of data, prompting the idea of end-to-end learning [51]. This machine learning philosophy encourages researchers to feed raw signal values directly into the neural network without hand-designed features, which may explain the practice of directly analyzing raw EEG data with deep learning.
When it comes to task-specific input formulations, three tasks (emotion recognition, mental workload, and motor imagery tasks) relied heavily on calculated features. The most prevalent input formulation tactic for the remaining three types of tasks (seizure detection, sleep stage scoring, and event related potential analysis) was to use the signal values as inputs. Seizure detection studies showed both the highest proportion of studies using signal values as inputs and the smallest proportion using calculated features, which suggests that calculated features are less suitable for that task. Generating spectrogram images as inputs has seen success in all tasks, but event related potential and sleep stage scoring tasks each had only a single study that attempted to use images as inputs, so further research into effective inputs for these tasks is needed. The proportions of input formulations by task are shown in figure 3(B).

Deep learning architecture trends
Architecture design choices.
This section of the review focuses on understanding the trends in the formation of specific deep learning architectures, namely, the primary characteristic and end classifier. This aggregated information is displayed in figure 4.
The most prevalent architecture design framework, CNN's (43%), involves alternating convolutional layers with pooling layers (typically maximum pooling layers). The primary design features for CNN's were the number of convolutional layers and the type of end classifier. DBN's followed CNN's at 18% as the second most prevalent choice. DBN's are composed of a number of stacked restricted Boltzmann machines followed by an end classifier, which is typically a number of fully-connected layers. Hybrid architectures made up the next group at 12% of the total number of studies, divided, as shown in figure 4, into two divisions: hybrid CNN's and hybrid MLP's. Hybrid CNN's, in addition to convolutional and pooling layers, include an additional architecture type, such as a number of recurrent layers or restricted Boltzmann machines. Hybrid MLP's are composed of a number of dense layers and the inclusion of another type of deep learning algorithm. The next group of architectures by proportion of the total study count was RNN's (10%), which are composed of a number of recurrent layers (each layer containing a study-specific number of recurrent units) followed by a number of fully-connected layers. MLPNN's, whose number of hidden layers was the only reviewed design characteristic, followed with 9% of the study count. The final group of studies used an SAE (8%), whose primary design feature was the total number of fully connected (autoencoder) layers, followed, in all cases, by a single fully connected classifier layer.
The rectified linear unit (ReLU) was by far the most prevalent activation function for convolutional layers among the total number of studies reporting activation functions. Less prevalent activation functions include the exponential linear unit (ELU) (8%), leaky rectified linear unit (leaky ReLU) (8%), and hyperbolic tangent (tanh) (5%). Additionally, there were single studies incorporating the following types of activation functions: parametric ReLU (PReLU), scaled exponential linear unit (SELU), and split tanh. Further analysis of convolutional activation functions is detailed in the discussion section.
Activation functions for fully-connected layers can be grouped into two divisions: non-classifier fully-connected layers and classifier fully-connected layers. The vast majority of classifier fully-connected layers employed a softmax activation function, whereas non-classifier fully-connected layers used the sigmoid activation function. Only three SAE studies discussed activation functions, and these did not form a consensus: [52,53] employed sigmoid activation functions for non-classifier AE layers, while [54] instead used ReLU. More research is needed in order to better understand the most effective activation function for SAE architectures.

Task specific deep learning trends.
Emotion recognition, motor imagery, and sleep stage scoring tasks did not show any consensus on the selection of deep learning algorithms. Seizure detection studies were essentially split between CNN's and RNN's, showing the highest percentage of studies using RNN's compared to other tasks; only a single seizure detection study chose to use either an SAE or MLPNN, and no seizure detection study used DBN's. Sleep stage scoring tasks had the highest percentage of studies using hybrid formulations, which were evenly represented when compared to studies using CNN's. ERP studies had a clear preference for CNN's (the highest percentage of CNN studies compared to all other tasks). Task-specific deep learning strategy decisions are shown in figure 5.

Input formulation by deep learning architecture.
The specific input formulation strategies varied significantly as a function of the type of deep learning architecture (figure 6). There was no instance among DBN, MLPNN, or SAE studies of images being used as inputs, which is unsurprising as image processing is considered to be in the domain of CNN's. For CNN studies that used images as inputs, the average accuracy was essentially the same as for CNN studies that used calculated features as inputs, with both input formulation strategies achieving an average accuracy of 84%. This is compared to the average accuracy of 87% for CNN studies that used signal values as inputs. This goes against the intuition that the more effort applied to the pre-processing stages, the more accurate the classification will be. It points to the surprising conclusion that future research may actually improve results by sending signal values directly into the deep learning framework rather than investing additional effort in pre-processing.
DBN's showed a trend similar to that of CNN studies. DBN studies were split between using signal values and calculated features, with calculated features being the most prevalent decision. Studies that used calculated features achieved an average accuracy of 85% while signal value studies achieved an average accuracy of 86%. There were no studies that compared the classification accuracies between using signal values and calculated features, thus indicating more research is needed.
RNN's, on the other hand, did not share the trend found with CNN's and DBN's. RNN studies had a relatively even number of instances of all three types of input formulations. RNN studies that used signal values as inputs achieved an average accuracy of 85%, which is less than the accuracies for studies that used calculated features (89%) and images (100%). However, there are significantly fewer studies that used RNN's, so more research is needed to determine the most effective input formulation strategy for RNN's.
Calculated features seemed to be the only practical choice for MLPNN and SAE studies, since only a single study for each architecture chose instead to use signal values. The sole MLPNN study that chose to use signal values [55] achieved an accuracy of 75%. The single SAE study using signal values as inputs [53] only had a single channel for analysis, but was able to achieve 96% accuracy.

Case studies on a shared dataset
The majority of studies covered by this review used EEG datasets that were not publicly available. A simple accuracy comparison between studies analyzing different datasets is not valid as they have widely different tasks, subjects, and data procurement procedures. On the other hand, studies that examined identical datasets using different approaches provide a more meaningful comparison. The datasets that are used in more than one study within this review include an emotion recognition dataset ([56], various subsets used in ten studies), a mental workload dataset ([57], used in four studies), two motor imagery datasets from the same BCI competition ([58], one dataset used in six studies and the second dataset used in two studies), a seizure detection dataset ([33], used in three studies), and two sleep stage scoring datasets ([59,60], which were used in two studies and three studies, respectively).
The common emotion recognition dataset (DEAP) [56], the dataset analyzed by the highest number of studies, is a collection of EEG and peripheral signals from 32 subjects participating in a human affective state task. Each subject watched 40 one-minute music videos. After each one-minute viewing, the subject performed a self-assessment to classify their levels of arousal, valence, dominance, and liking. Based on the self-assessment and the original classification of the music video, each one-minute segment was labeled with numeric values for valence and arousal level. EOG artifacts were manually removed by the original researchers. Table 1 shows the prevailing input and architecture choices along with the highest achieved accuracy by study. This collection shows a wide range of specific architecture choices applied to the common emotion recognition task. SAE's did not seem to be a good choice for this task, as [48] reported an accuracy of 54% after exploring various EEG features and neural network designs. Jirayucharoensak et al [48] tried three different combinations of calculated features, but found that even the combination of PCA, CSA, and PSD features was not able to rectify the shortcomings of SAE's for this dataset. Xu and Plataniotis [50] compared accuracies between an SAE and several DBN's and found that a DBN with three restricted Boltzmann machines performed significantly better than SAE's and DBN's with different numbers of RBM's. Convolutional layers were employed by five of these studies. Two of the five were hybrid architectures with the CNN outputting into LSTM RNN modules, but neither attempt breached 75% accuracy. The difference in accuracy in the three standard CNN studies is likely due to the differences in input formulation. Yanagimoto and Sugimoto [63] used signal values as inputs into the neural network, while [66,67] instead converted the data into Fourier feature maps and 3D grids and achieved accuracies of 87% and 88%, respectively. These CNN architectures were composed of two convolutional layers, each followed by one or two dense layers. The sole MLPNN architecture applied to this dataset [63] achieved an 82% accuracy, which is comparable to the standard CNN that used signal values, but the input formulation may have been the largest factor in the accuracy difference. Whereas the deep CNN used signal values, which require significantly less pre-processing, the MLPNN required extensive effort in input pre-processing with PSD features and prefrontal asymmetry channel selection.
The sole architecture that employed an RNN alone (no convolutional layers) within this group [65] was composed of two LSTM layers and a single dense layer. This study was able to achieve an accuracy of 87% while only using signal values as the input. This was unexpected as intuition assumes architectures with greater complexity, such as the two hybrid convolutional recurrent architectures, will lead to higher classification rates. In the case of this dataset, a deep learning recurrent architecture with no convolutional layers was able to outperform the architectures that included both recurrent and convolutional aspects.
For this dataset, the most effective architectures reported were DBN's, CNN's, and RNN's. The choice comes down to input formulation: images were better suited to CNN's, signal values to RNN's, and calculated features, specifically PSD features, worked better for DBN's. The high accuracy and relative shallowness of the algorithms that used Fourier feature maps and 3D grids may point to the necessity of future research on EEG image generation methods.

Discussion
In this section, recommendations for design choices on non-hybrid deep learning architectures based on the type of task are given. These recommendations are based on the prevalent general design choices across the entire set of reviewed studies, with support from studies that varied specific design features. Recommendations are given by task: mental workload, emotion recognition, motor imagery, event related potential, and seizure detection. Sleep scoring applications are not included as the number of studies was too low and there was no consensus on architecture design choices. The recommendation section is followed by a discussion of hybrid architecture types.

Are there specific deep learning network structures suitable for specific types of tasks?
Outside of the common datasets used in some of the studies (see section 3.5), it is difficult to compare classification accuracies achieved by different architecture design choices across tasks and EEG datasets. Moreover, most studies varied different aspects of the algorithm design or input processing. Nevertheless, the analysis provided above can help direct future research using deep learning networks for the range of tasks reviewed herein. Figure 7 depicts a workflow diagram that represents the recommendations formulated from this review.

Mental workload tasks.
Several comparative studies motivated the recommendation that either DBN's or CNN's be used for this type of task. Yin et al published three studies all based on one mental workload dataset: [32,68] reported better accuracy using an SAE compared to an SVM or MLPNN; subsequently, [64] reported that a DBN consistently outperformed the SAE architecture. Hajinoroozi et al [69] reported that the standard DBN significantly outperformed the standard CNN (although less accurately than hybrid models), whereas [41] found several variations of CNN's able to outperform a standard DBN.

Emotion recognition tasks.
The application of deep learning networks to this task had the highest number of studies and included the repeated dataset described in the results section. References [50,70] found that DBN's were more capable for this task type when compared against SAE's and MLPNN's, respectively. In the shared dataset, the four highest performing studies [50, 65–67] achieved accuracies ranging between 87% and 89%. These studies employed a DBN [50], CNN's [66,67], and an RNN [65]. As the accuracy differences between these four studies are small, more research is likely needed to better understand the most effective deep learning algorithm for this type of task. Taken together, these studies led to the recommendation of DBN's, CNN's, or RNN's as good candidate architectures.

Motor imagery tasks.
There were no direct comparisons between DBN and CNN architectures among motor imagery tasks, and further research is needed to identify the more effective architecture. The hybrid architecture does shed some insight into whether SAE is a valid option. In [43], a hybrid CNN/SAE architecture was compared to a standard CNN and a standard SAE. While the hybrid architecture outperformed the others, the standard CNN far exceeded the SAE. Wang et al [45] compared the performance of a CNN against a pure LSTM RNN network, finding the CNN architecture to significantly outperform the RNN. For these reasons, the recommendation for motor imagery tasks is to use either a CNN or DBN network.

Seizure detection tasks.
The majority of seizure detection studies used either CNN's or RNN's and, unlike the previous three groups, there were no instances of studies using DBN's. The sole standard MLPNN architecture study [71] varied the number of hidden layers, but experienced large drops in accuracy with more than two hidden layers, indicating that further improvements must come from outside MLPNN architectures. On a shared seizure detection dataset from the University of Bonn [33], two studies achieved near-perfect classification performance, with [72] reaching 99% accuracy with a CNN and [73] reaching 100% with an RNN. More research into the possibility of using DBN's is needed before disregarding that option for this task, but both CNN's and RNN's are good candidate architectures for this type of task.

ERP tasks.
Of the ERP task studies, three different types of deep learning architectures were used: SAE's, DBN's, and CNN's. The two SAE studies, [54,74], compared performances between SAE architectures and MLPNN's, both finding SAE's to perform best. Kulasingham et al [75] specifically compared performances between DBN's and SAE's and found that DBN's had a slight advantage in classification accuracy. The remaining five studies used different variations of CNN's, but did not offer any direct comparisons between other architecture types. For these reasons, both CNN's and DBN's have been selected as good candidate architectures for ERP tasks.

Architecture design and input formulation
Figure 7 also depicts the specific input formulation and architecture parameter recommendations. DBN's saw no use of images as inputs, so the options for DBN's in the Input Formulation stage are restricted to calculated features or signal values. Based on [26], which found that channel-wise DBN's outperformed a combined-channel approach, DBN inputs should be fed into the deep learning architecture in a channel-wise manner, whereby the signal values from each channel are fed into individual architectures before outputting into a common classifier.
In the Calculated Features branch of the DBN path, we recommend using either differential entropy features or PSD features. Zheng and Lu [76] compared the classification accuracy when using differential entropy features against PSD features, finding that differential entropy features outperformed PSD features over all frequency sub-bands. PSD features, the most prevalent choice for DBN input formulation, were used in [50], which was the highest performing study within the common emotion recognition dataset. Regardless of the chosen feature type, this review's recommendation is to calculate features from manually selected critical channels. Zheng and Lu [76] found that classification accuracies were higher when calculating features from a set of manually selected critical channels rather than from all channels. Manual selection of channels requires significant task-specific prior knowledge, which may not always be feasible. Figure 7 also includes suggestions for the major characteristic section of the DBN path; three studies motivated the recommendation that, regardless of input formulation, inputs should be sent through three RBM's. References [29,67,75] all achieved higher classification accuracies when using three RBM's as opposed to one, two, or four RBM's. In the final stage of the diagram, we recommend a single dense layer as the end classifier, which is the choice in most DBN studies. For the final fully-connected layer in all paths, the softmax activation function is recommended, while for fully-connected non-classifier layers, the sigmoid activation function is recommended.
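To make the DBN path concrete, the sketch below stacks three greedily pre-trained restricted Boltzmann machines on calculated features and finishes with a single classifier layer. It is only a rough approximation of the reviewed DBN designs, assuming scikit-learn's BernoulliRBM for the unsupervised pre-training and logistic regression in place of a jointly fine-tuned softmax dense layer; all layer sizes and training parameters are illustrative assumptions.

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression

# DBN-like stack: three greedily pre-trained RBM's followed by a single
# classifier layer. BernoulliRBM expects inputs scaled to [0, 1], hence the
# scaler. A full DBN would additionally fine-tune all layers jointly.
dbn_like = Pipeline([
    ("scale", MinMaxScaler()),
    ("rbm1", BernoulliRBM(n_components=256, learning_rate=0.05, n_iter=20, random_state=0)),
    ("rbm2", BernoulliRBM(n_components=128, learning_rate=0.05, n_iter=20, random_state=0)),
    ("rbm3", BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# X: (n_trials, n_features) calculated features (e.g. PSD per selected channel); y: labels.
X, y = np.random.rand(100, 320), np.random.randint(0, 2, 100)
dbn_like.fit(X, y)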
In the CNN branch of figure 7, the input recommendation splits into two branches: signal values or images. CNN studies had the greatest proportion of studies using signal values as inputs, and the majority of those studies did not limit the number of channels, indicating that CNN's are more capable of handling the high dimensionality and size of EEG signal value datasets when compared to other deep learning algorithms. For example, [63] achieved higher accuracy with raw signal values from all channels than other studies that required extensive effort creating inputs from the same dataset (table 1). Spectrograms were the most prevalent choice when images were used as CNN inputs. Within the common emotion recognition dataset group, the second and third highest accuracies were achieved by [66,67], respectively, both choosing to form images as inputs. Qiao et al [66] used spectrograms created from a set of manually selected critical channels to generate EEG images, whereas [67] instead created a 3D grid of electrode values. As both studies achieved accuracies within a single percentage point, both image formulation strategies seem like viable options for future research.
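As an illustration of the spectrogram-image formulation (a minimal sketch under assumed window parameters, not the procedure of any cited study), the function below converts a multi-channel recording into a stack of log-power spectrograms that could serve as a multi-channel image input to a CNN.

import numpy as np
from scipy.signal import spectrogram

def eeg_to_spectrogram_stack(eeg, fs=250.0, nperseg=128, noverlap=64):
    # eeg: (n_channels, n_samples) -> (n_channels, n_freqs, n_times) stack of
    # log-power spectrograms; window length and overlap are illustrative.
    images = []
    for channel in eeg:
        _, _, sxx = spectrogram(channel, fs=fs, nperseg=nperseg, noverlap=noverlap)
        images.append(np.log(sxx + 1e-12))  # log power for numerical stability
    return np.stack(images)

image_stack = eeg_to_spectrogram_stack(np.random.randn(32, 2500))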
The two CNN input paths do not share the same major characteristic recommendation. For signal values, four studies compared accuracies while varying the number of convolutional layers. Yanagimoto and Sugimoto [63], an emotion recognition study, and [78], a seizure detection study, found that five convolutional layers achieved the best accuracies. Antoniades et al [79] found that accuracy peaked with four convolutional layers and trended downwards as convolutional layers increased. Schirrmeister et al [80] compared a shallow two-convolutional-layer CNN against a deep four-convolutional-layer CNN and found that the deep CNN consistently outperformed the shallow CNN. While there were no studies specifically comparing different numbers of classifier layers, the identified studies mostly used one or two fully connected layers. This review therefore recommends four to five convolutional layers feeding into one or two fully connected classifier layers. For the image branch of the CNN path, there were no studies that specifically compared different numbers of convolutional layers or end classifier layers. We therefore recommend following [66,67] with two convolutional layers feeding into one or two fully connected layers, while cautioning that more research is needed to optimize the strategy of using images as CNN inputs.
When assessing trends in activation functions for convolutional layers, two studies analyzed the performance differences between activation functions. In [45], a CNN was compared with ReLU, ELU, and SELU activations, and SELU performed best. Abbas and Khan [44] did a similar analysis and assessed performance differences of a CNN when using ReLU, sigmoid, and tanh activation functions, finding that ReLU performed best. Due to the high number of studies employing ReLU (70% of convolutional architecture studies employed a ReLU activation), it is this review's recommendation that convolutional layer construction begin with ReLU activation before investigating the performance changes due to different activation functions.
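Putting the signal-value CNN recommendations together, the sketch below is a minimal PyTorch example with four temporal convolution blocks, ReLU activations, and a single fully connected classifier layer (softmax applied through the loss function). Kernel sizes, channel counts, and pooling factors are illustrative assumptions rather than values reported in the reviewed studies.

import torch
import torch.nn as nn

class EEGConvNet(nn.Module):
    # Four convolutional layers with ReLU and max pooling, followed by a
    # single fully connected classifier layer, as recommended for signal values.
    def __init__(self, n_channels=32, n_samples=1000, n_classes=2):
        super().__init__()
        blocks, in_ch = [], n_channels
        for out_ch in (32, 64, 64, 128):
            blocks += [nn.Conv1d(in_ch, out_ch, kernel_size=7, padding=3),
                       nn.ReLU(),
                       nn.MaxPool1d(2)]
            in_ch = out_ch
        self.features = nn.Sequential(*blocks)
        with torch.no_grad():  # infer the flattened feature size from a dummy input
            n_feat = self.features(torch.zeros(1, n_channels, n_samples)).numel()
        self.classifier = nn.Linear(n_feat, n_classes)

    def forward(self, x):  # x: (batch, channels, samples)
        return self.classifier(self.features(x).flatten(1))

logits = EEGConvNet()(torch.randn(8, 32, 1000))  # e.g. 8 trials, 32 channels, 4 s at 250 Hz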
The RNN branch of figure 7 also splits into two branches for the input formulation section: signal values and images. While no study directly compared performances between different input formulations, this recommendation is based on the lack of consensus for feature selection, high accuracy reports from image studies, and the high performing raw signal value study within the shared dataset, [65]. Unlike the image branch for CNN's, the image recommendation for RNN's instead points towards the use of 2D or 3D grids as opposed to spectrograms based on two well performing seizure detection studies [47,49]. These input recommendations are given with caution as more research is needed to know the most effective EEG input formulation for RNN architectures.
Regardless of the input formulation, the RNN branch merges into one for the major characteristic section of figure 7. The majority of RNN studies used two LSTM layers, with two studies specifically varying this parameter. In [81], the authors first compared performances between LSTM and GRU recurrent layers and found that LSTM layers performed better. Next, the authors varied the number of recurrent layers between one and eight, concluding that two layers led to the highest classification accuracy. In [82], the authors also compared performances while varying the number of LSTM layers and found that there was significant improvement when using two layers versus one layer, whereas accuracy remained essentially unchanged with the introduction of a third LSTM layer. Based on these two studies, the recommendation outlined in figure 7 is that future research use two layers of LSTM recurrent units for pure RNN architectures. Finally, the RNN branch concludes with the recommendation for end classifier layers. In the case of RNN's, no study specifically varied the number of classifier layers, with all RNN studies choosing either one or two fully-connected layers. Due to this lack of research, this review recommends that future research use either one or two fully-connected layers for classification.
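A minimal PyTorch sketch of this pure-RNN recommendation, assuming raw signal values as input, two stacked LSTM layers, and one fully connected classifier layer, is given below; the hidden size is an illustrative assumption.

import torch
import torch.nn as nn

class EEGLSTMNet(nn.Module):
    # Two stacked LSTM layers followed by a single fully connected classifier layer.
    def __init__(self, n_channels=32, hidden_size=64, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_channels, hidden_size=hidden_size,
                            num_layers=2, batch_first=True)
        self.classifier = nn.Linear(hidden_size, n_classes)

    def forward(self, x):  # x: (batch, time, channels), e.g. raw signal values
        out, _ = self.lstm(x)
        return self.classifier(out[:, -1, :])  # classify from the last time step

logits = EEGLSTMNet()(torch.randn(8, 1000, 32))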

Hybrid architectures
There were ten studies that applied hybrid architectures, which are architectures that use a combination of two or more standard deep learning algorithms. Among these studies, several performed accuracy comparisons between the proposed hybrid architectures and architectures based on the component standard deep learning algorithms. Nine of the ten hybrid studies combined convolutional layers with another type of layer, with eight of those nine employing one or more RNN layers. While there were no task-specific trends for hybrid designs, results from these hybrid studies suggest that LSTM RNN modules outperform both standard and GRU RNN modules and that there are consistent improvements with the addition of non-convolutional layers to standard CNN's.
LSTM is a popular extension of the standard RNN module [83]. In [62], the researchers designed two hybrid architectures: a two convolutional layer CNN feeding into a dense layer that fed into either an LSTM RNN module or a standard RNN module. The LSTM CNN/RNN hybrid architecture consistently outperformed the standard CNN/RNN architecture. Bresch et al [81] also compared performance differences between different types of recurrent units. In this study, the classification performance when using LSTM units was compared against performance when employing a GRU unit, finding LSTM units to be the more effective recurrent unit. The hybrid architectures varied significantly as to the number of convolutional or recurrent layers as well as to the order of these different architecture types. Further structured research is needed in order to better understand how varying different aspects of hybrid architectures affects performance.
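As a rough sketch of this kind of hybrid (assumed layer sizes, not the configuration of [62] or any other cited study), the PyTorch example below uses two convolutional blocks to extract short-time features that an LSTM module then reads as a sequence before a single classifier layer.

import torch
import torch.nn as nn

class HybridConvLSTM(nn.Module):
    # Two convolutional blocks followed by an LSTM module and one classifier layer.
    def __init__(self, n_channels=32, hidden_size=64, n_classes=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=7, padding=3), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=7, padding=3), nn.ReLU(), nn.MaxPool1d(2))
        self.lstm = nn.LSTM(input_size=64, hidden_size=hidden_size, batch_first=True)
        self.classifier = nn.Linear(hidden_size, n_classes)

    def forward(self, x):  # x: (batch, channels, samples)
        feats = self.conv(x)                       # (batch, 64, samples // 4)
        out, _ = self.lstm(feats.transpose(1, 2))  # treat feature maps as a time sequence
        return self.classifier(out[:, -1, :])

logits = HybridConvLSTM()(torch.randn(8, 32, 1000))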
Generally, the hybrid architectures that used a CNN as part of the design benefited from the addition of non-convolutional layers. Tabar and Halici [43] compared the accuracies of a CNN/SAE architecture against a CNN and an SAE and found that the hybrid architecture significantly outperformed the rest. Jiao et al [41] designed channel-wise architectures, a standard DBN, fused CNN systems (where two architectures feed into a common fully connected layer), and a fused CNN/DBN hybrid, which outperformed all other architectures. Hajinoroozi et al [30,69] analyzed the same collections of mental workload task datasets, comparing channel-wise CNN's with an RBM in place of the first convolutional layer against a channel-wise CNN, a standard CNN, an MLPNN, and several types of standard DBN's. The channel-wise hybrid significantly outperformed the rest for cross-subject classification. This is significant because it indicates that this particular architecture type is effective for transfer learning in EEG analysis. Hybrid CNN architecture designs show promise for EEG classification, while further research is needed to adequately assess the effectiveness of each design.

Conclusions
Deep learning classification has been successfully applied to many EEG tasks, including motor imagery, seizure detection, mental workload, sleep stage scoring, event related potential, and emotion recognition tasks. The reviewed studies varied significantly in input formulation and network design. Several public datasets were analyzed in multiple studies, which allowed us to directly compare classification performances based on their design. Generally, CNN's, RNN's, and DBN's outperformed other types of deep networks, such as SAE's and MLPNN's. Additionally, CNN's performed best when using signal values or (spectrogram) images as inputs, whereas DBN's performed best when using signal values or calculated features as inputs. We further discussed deep network recommendations for each specific type of task. This recommendation diagram has been provided in the hope that it will guide the deployment of deep learning to EEG datasets in future research. Hybrid designs incorporating convolutional layers with recurrent layers or restricted Boltzmann machines showed promise in classification accuracy and transfer learning when compared against standard designs. We recommend more in-depth research of these combinations, particularly the number and arrangement of different layers including RBM's, recurrent layers, convolutional layers, and fully connected layers. Outside of network design, we also encourage further research to compare how deep networks interpret raw versus denoised EEG, as this has not yet been specifically assessed.

Competing interests
The authors declare that there are no conflicts of interest regarding the publication of this paper.

Funding
This work has been supported in part by the CACDS (Core facility for Advanced Computing and Data Science) at the University of Houston.