Deep learning-based electroencephalography analysis: a systematic review

Electroencephalography (EEG) is a complex signal and can require several years of training to be correctly interpreted. Recently, deep learning (DL) has shown great promise in helping make sense of EEG signals due to its capacity to learn good feature representations from raw data. Whether DL truly presents advantages as compared to more traditional EEG processing approaches, however, remains an open question. In this work, we review 156 papers that apply DL to EEG, published between January 2010 and July 2018, and spanning different application domains such as epilepsy, sleep, brain-computer interfacing, and cognitive and affective monitoring. We extract trends and highlight interesting approaches in order to inform future research and formulate recommendations. Various data items were extracted for each study pertaining to 1) the data, 2) the preprocessing methodology, 3) the DL design choices, 4) the results, and 5) the reproducibility of the experiments. Our analysis reveals that the amount of EEG data used across studies varies from less than ten minutes to thousands of hours. As for the model, 40% of the studies used convolutional neural networks (CNNs), while 14% used recurrent neural networks (RNNs), most often with a total of 3 to 10 layers. Moreover, almost one-half of the studies trained their models on raw or preprocessed EEG time series. Finally, the median gain in accuracy of DL approaches over traditional baselines was 5.4% across all relevant studies. More importantly, however, we noticed studies often suffer from poor reproducibility: a majority of papers would be hard or impossible to reproduce given the unavailability of their data and code. To help the field progress, we provide a list of recommendations for future studies and we make our summary table of DL and EEG papers available and invite the community to contribute.

Results. Our analysis reveals that the amount of EEG data used across studies varies from less than ten minutes to thousands of hours, while the number of samples seen during training by a network varies from a few dozen to several million, depending on how epochs are extracted. Interestingly, we saw that more than half the studies used publicly available data and that there has also been a clear shift from intra-subject to inter-subject approaches over the last few years. About 40% of the studies used convolutional neural networks (CNNs), while 14% used recurrent neural networks (RNNs), most often with a total of 3 to 10 layers. Moreover, almost one-half of the studies trained their models on raw or preprocessed EEG time series. Finally, the median gain in accuracy of DL approaches over traditional baselines was 5.4% across all relevant studies. More importantly, however, we noticed studies often suffer from poor reproducibility: a majority of papers would be hard or impossible to reproduce given the unavailability of their data and code.
1 Introduction

1.1 Measuring brain activity with EEG
Electroencephalography (EEG), the measure of the electrical fields produced by the active brain, is a neuroimaging technique widely used inside and outside the clinical domain. Specifically, EEG picks up the electric potential differences, on the order of tens of µV, that reach the scalp when tiny excitatory post-synaptic potentials produced by pyramidal neurons in the cortical layers of the brain sum together. The potentials measured therefore reflect neuronal activity and can be used to study a wide array of brain processes.
Thanks to the great speed at which electric fields propagate, EEG has an excellent temporal resolution: events occurring at millisecond timescales can typically be captured. However, EEG suffers from low spatial resolution, as the electric fields generated by the brain are smeared by the tissues, such as the skull, situated between the sources and the sensors. As a result, EEG channels are often highly correlated spatially. The source localization problem, or inverse problem, is an active area of research in which algorithms are developed to reconstruct brain sources given EEG recordings [69].
There are many applications for EEG. For example, in clinical settings, EEG is often used to study sleep patterns [1] or epilepsy [3]. Various conditions have also been linked to changes in electrical brain activity, and can therefore be monitored to various extents using EEG. These include attention deficit hyperactivity disorder (ADHD) [10], disorders of consciousness [46,41] and depth of anaesthesia [60], among others. EEG is also widely used in neuroscience and psychology research, as it is an excellent tool for studying the brain and its functioning. Applications such as cognitive and affective monitoring are very promising as they could allow unbiased measures of, for example, an individual's level of fatigue, mental workload [19,176], mood, or emotions [5]. Finally, EEG is widely used in brain-computer interfaces (BCIs), communication channels that bypass the natural output pathways of the brain, to allow brain activity to be directly translated into directives that affect the user's environment [106].

Current challenges in EEG processing
Although EEG has proven to be a critical tool in many domains, it still suffers from a few limitations that hinder its effective analysis or processing. First, EEG has a low signal-to-noise ratio (SNR) [20,77], as the brain activity measured is often buried under multiple sources of environmental, physiological and activity-specific noise of similar or greater amplitude, called "artifacts". Various filtering and noise reduction techniques therefore have to be used to minimize the impact of these noise sources and extract true brain activity from the recorded signals.
EEG is also a non-stationary signal [30,57], that is, its statistics vary over time. As a result, a classifier trained on a temporally-limited amount of user data might generalize poorly to data recorded at a different time on the same individual. This is an important challenge for real-life applications of EEG, which often need to work with limited amounts of data.
Finally, high inter-subject variability also limits the usefulness of EEG applications. This phenomenon arises due to physiological differences between individuals, which vary in magnitude but can severely affect the performance of models that are meant to generalize across subjects [29]. Since the ability to generalize from a first set of individuals to a second, unseen set is key to many practical applications of EEG, a lot of effort is being put into developing methods that can handle inter-subject variability.
To solve some of the above-mentioned problems, processing pipelines with domain-specific approaches are often used. A significant amount of research has been put into developing processing pipelines to clean, extract relevant features, and classify EEG data. State-of-the-art techniques, such as Riemannian geometry-based classifiers and adaptive classifiers [105], can handle these problems with varying levels of success.
Additionally, a wide variety of tasks would benefit from a higher level of automated processing. For example, sleep scoring, the process of annotating sleep recordings by categorizing windows of a few seconds into sleep stages, currently requires a lot of time, being done manually by trained technicians. More sophisticated automated EEG processing could make this process much faster and more flexible. Similarly, real-time detection or prediction of the onset of an epileptic seizure would be very beneficial to epileptic individuals, but also requires automated EEG processing.
For each of these applications, most common implementations require domain-specific processing pipelines, which further reduces the flexibility and generalization capability of current EEG-based technologies.
Fig. 1 presents an overview of how EEG data (and similar multivariate time series) can be formatted to be fed into a DL model, along with some important terminology (see Section 1.4), as well as an illustration of a generic neural network architecture. Usually, when $c$ channels are available and a window has a length of $l$ samples, the input of a neural network for EEG processing consists of a multidimensional array $X_i \in \mathbb{R}^{c \times l}$ containing the $l$ samples corresponding to a window for all channels. This multidimensional array can be used as an example for training a neural network, as shown in Fig. 1b. Variations of this end-to-end formulation can be imagined where the window $X_i$ is first passed through a preprocessing and feature extraction pipeline (e.g., time-frequency transform), yielding an example $X'_i$ which is then used as input to the neural network instead.
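To make the input format concrete, the following is a minimal sketch of how a continuous recording could be cut into windows of shape c × l. The function name `extract_windows`, the synthetic recording, and the chosen values (4 channels, 100 Hz, 2 s non-overlapping windows) are illustrative assumptions, not settings taken from any reviewed study.

```python
import numpy as np


def extract_windows(eeg, window_len, stride):
    """Cut a (channels, samples) recording into windows of shape (c, l).

    Returns an array of shape (n_windows, c, l). Hypothetical helper.
    """
    n_channels, n_samples = eeg.shape
    if n_samples < window_len:
        return np.empty((0, n_channels, window_len), dtype=eeg.dtype)
    starts = range(0, n_samples - window_len + 1, stride)
    return np.stack([eeg[:, s:s + window_len] for s in starts])


rng = np.random.default_rng(0)
c, fs = 4, 100                                  # assumed: 4 channels at 100 Hz
recording = rng.standard_normal((c, 10 * fs))   # 10 s of synthetic data
# Non-overlapping 2 s windows: stride equals the window length.
X = extract_windows(recording, window_len=2 * fs, stride=2 * fs)
print(X.shape)  # (5, 4, 200): five examples X_i, each in R^(c x l)
```

Setting `stride` smaller than `window_len` would yield overlapping windows, which increases the number of examples extracted from the same recording.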
Different types of layers are used as building blocks in neural networks. Most commonly, those are fully-connected (FC), convolutional or recurrent layers. We refer to models using these types of layers as FC networks, convolutional neural networks (CNNs) [89] and recurrent neural networks (RNNs) [145], respectively. Here, we provide a quick overview of the main architectures and types of models. The interested reader is referred to the relevant literature for more in-depth descriptions of DL methodology [88,53,151].
FC layers are composed of fully-connected neurons, i.e., where each neuron receives as input the activations of every single neuron of the preceding layer. Convolutional layers, on the other hand, impose a particular structure where neurons in a given layer only see a subset of the activations of the preceding one. This structure, akin to convolutions in signal or image processing from which it gets its name, encourages the model to learn invariant representations of the data. This property stems from another fundamental characteristic of convolutional layers, which is that parameters are shared across different neurons; this can be interpreted as if there were filters looking for the same information across patches of the input. In addition, pooling layers can be introduced, such that the representations learned by the model become invariant to slight translations of the input. This is often a desirable property: for instance, in an object recognition task, translating the content of an image should not affect the prediction of the model. Imposing these kinds of priors thus works exceptionally well on data with spatial structure. In contrast to convolutional layers, recurrent layers impose a structure by which, in its most basic form, a layer receives as input both the preceding layer's current activations and its own activations from a previous time step. Models composed of recurrent layers are thus encouraged to make use of the temporal structure of data and have shown high performance in natural language processing (NLP) tasks [230,210].
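The effect of weight sharing and pooling described above can be illustrated with a small NumPy sketch. It assumes a single hand-picked 1D filter and uses global max pooling, which is the extreme case of pooling; typical pooling layers use small local regions and only give invariance to slight shifts. All names and values are illustrative.

```python
import numpy as np


def conv1d_valid(x, kernel):
    """Slide one shared kernel over x (cross-correlation, no padding)."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])


kernel = np.array([1.0, -2.0, 1.0])            # hypothetical filter
pattern = np.array([0.0, 1.0, 3.0, 1.0, 0.0])  # hypothetical waveform

x = np.zeros(20)
x[3:8] = pattern
x_shifted = np.zeros(20)
x_shifted[9:14] = pattern                      # same waveform, 6 samples later

fmap = conv1d_valid(x, kernel)
fmap_shifted = conv1d_valid(x_shifted, kernel)

# Shared weights: the feature map is the same, only moved by 6 positions.
print(np.allclose(fmap[:-6], fmap_shifted[6:]))
# Global max pooling discards the position, so both inputs give one value.
print(fmap.max() == fmap_shifted.max())
```

Because the same three weights are applied at every position, the layer has three parameters regardless of the input length, whereas an FC layer producing the same number of outputs would need one weight per input-output pair.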
Additionally, outside of purely supervised tasks, other architectures and learning strategies can be built to train models when no labels are available. For example, autoencoders (AEs) learn a representation of the input data by trying to reproduce their input given some constraints, such as sparsity or the introduction of artificial noise [53]. Generative adversarial networks (GANs) [54] are trained by opposing a generator (G), that tries to generate fake examples from an unknown distribution of interest, to a discriminator (D), that tries to identify whether the input it receives has been artificially generated by G or is an example from the unknown distribution of interest. This dynamic can be compared to the one between a thief (G) making fake money and the police (D) trying to distinguish fake money from real money. Both agents push one another to get better, up to a point where the fake money looks exactly like real money. The training of G and D can thus be interpreted as a two-player zero-sum minimax game. When equilibrium is reached, the probability distribution approximated by G converges to the real data distribution [54].
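For reference, the two-player game described above is usually written as the following value function, as introduced in [54]. The notation is an assumption on our part, since the symbols are not defined elsewhere in this review: $p_{\text{data}}$ denotes the unknown data distribution, $p_z$ a noise prior from which the generator's input $z$ is drawn, and $D(x)$ the probability that D assigns to $x$ being a real example.

```latex
$$
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\left[\log\left(1 - D(G(z))\right)\right]
$$
```

D maximizes both terms by scoring real examples close to 1 and generated ones close to 0, while G minimizes the second term by producing examples that D scores as real.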
Table 1: Disambiguation of common terms used in this review.
Term | Definition used in this review
Point or sample | A measure of the instantaneous electric potential picked up by the EEG sensors, typically in µV.
Example | An instantiation of the data received by a model as input, typically denoted by $x_i$ in the machine learning literature.
Trial | A realization of the task under study, e.g., the presentation of one image in a visual ERP paradigm.
Window or segment | A group of consecutive EEG samples extracted for further analysis, typically between 0.5 and 30 seconds.
Epoch | A window extracted around a specific trial.
Overall, there are multiple ways in which DL could improve and extend existing EEG processing methods. First, the hierarchical nature of DNNs means features could potentially be learned on raw or minimally preprocessed data, reducing the need for domain-specific processing and feature extraction pipelines. Features learned through a DNN might also be more effective or expressive than the ones engineered by humans. Second, as is the case in the multiple domains where DL has surpassed the previous state-of-the-art, it has the potential to produce higher levels of performance on different analysis tasks. Third, DL facilitates the development of tasks that are less often attempted on EEG data, such as generative modelling [52] and transfer learning [129]. Indeed, generative models can be leveraged to learn intermediate representations or for data augmentation [52]. In transfer learning, the model parameters can also be transferred from one subject to another or from task A to task B. This might drastically widen or change the applicability of several EEG-based technologies.
On the other hand, there are various reasons why DL might not be optimal for EEG processing and that may justify the skepticism of some of the EEG community. First and foremost, the datasets typically available in EEG research contain far fewer examples than what has led to the current state-of-the-art in DL-heavy domains such as computer vision (CV) and NLP. Because data collection is relatively expensive and data accessibility is often hindered by privacy concerns, especially with clinical data, openly available datasets of similar sizes are not common. Some initiatives have tried to tackle this problem, though [65]. Second, the peculiarities of EEG, such as its low SNR, make EEG data very different from other types of data (e.g., images, text and speech) for which DL has been most successful. Therefore, the architectures and practices that are currently used in DL might not be readily applicable to EEG processing.

Terminology used in this review
Some terms are used with different meanings across the fields of machine learning, deep learning, statistics, EEG and signal processing. For example, in machine learning, "sample" usually refers to one example of the input received by a model, whereas in statistics, it can be used to refer to a group of examples taken from a population. It can also refer to the measure of a single time point in signal processing and EEG. Similarly, in deep learning, the term "epoch" refers to one pass through the whole training set during training; in EEG, an epoch is instead a grouping of consecutive EEG time points extracted around a specific marker. To avoid confusion, we include in Table 1 definitions for a few terms as used in this review. Fig. 1 gives a visual example of what these terms refer to.

Objectives of the review
This systematic review covers the current state-of-the-art in DL-based EEG processing by analyzing a large number of recent publications. It provides an overview of the field for researchers familiar with traditional EEG processing techniques and who are interested in applying DL to their data. At the same time, it aims to introduce the field of applying DL to EEG to DL researchers interested in expanding the types of data they benchmark their algorithms with, or who want to contribute to EEG research. For readers in any of these scenarios, this review also provides detailed methodological information on the various components of a DL-EEG pipeline to inform their own implementation. In addition to reporting trends and highlighting interesting approaches, we distill our analysis into a few recommendations in the hope of fostering reproducible and efficient research in the field.
[Figure residue: labels "Amplitude (e.g., μV)" and "Window or epoch or trial".]
• Papers focusing solely on software tools.

Organization of the review
The review is organized as follows: Section 1 briefly introduces key concepts in EEG and DL, and details the aims of the review; Section 2 describes how the systematic review was conducted, and how the studies were selected, assessed and analyzed; Section 3 focuses on the most important characteristics of the studies selected and describes trends and promising approaches; Section 4 discusses critical topics and challenges in DL-EEG, and provides recommendations for future studies; and Section 5 concludes by suggesting future avenues of research in DL-EEG. Finally, supplementary material containing our full data collection table, as well as the code used to produce the graphs, tables and results reported in this review, is made available online.

Methods
English journal and conference papers, as well as electronic preprints, published between January 2010 and July 2018, were chosen as the target of this review. PubMed, Google Scholar and arXiv were queried to collect an initial list of papers to be reviewed. Additional papers were identified by scanning the reference sections of these papers. The databases were queried for the last time on July 2, 2018.
To assess the eligibility of the selected papers, the titles were read first. If the title did not clearly indicate whether the inclusion and exclusion criteria were met, the abstract was read as well. Finally, when reading the full text during the data collection process, papers that were found to be misaligned with the criteria were rejected.
Non-peer-reviewed papers, such as arXiv electronic preprints, are a valuable source of state-of-the-art information as their release cycle is typically shorter than that of peer-reviewed publications. Moreover, unconventional research ideas are more likely to be shared in such repositories, which improves the diversity of the reviewed work and reduces the bias possibly introduced by the peer-review process [126]. Therefore, non-peer-reviewed preprints were also included in our review. However, whenever a peer-reviewed publication followed a preprint submission, the peer-reviewed version was used instead.
A data extraction table was designed containing different data items relevant to our research questions, based on previous reviews with similar scopes and the authors' prior knowledge of the field. Following a first inspection of the papers with the data extraction sheet, data items were added, removed and refined. Each paper was initially reviewed by a single author, and then reviewed by a second if needed. For each article selected, around 70 data items were extracted covering the following categories: origin of the article, rationale, data used, EEG processing methodology, DL methodology, reported results and reproducibility. Table 3 lists and defines the different items included in each of these categories. We make this data extraction table openly available for interested readers to reproduce our results and dive deeper into the data collected. We also invite authors of published work in the field of DL and EEG to contribute to the table by verifying its content or by adding their articles to it.

Table 3 (partial):
Category | Data item | Description
[...] | [...] | Output of the feature extraction procedure, which aims to better represent the information of interest contained in the preprocessed data.
Deep learning methodology | Architecture | Structure of the neural network in terms of types of layers (e.g., fully-connected, convolutional).
Deep learning methodology | Number of layers | Measure of architecture depth.
Deep learning methodology | EEG-specific design choices | Particular architecture choices made with the aim of processing EEG data specifically.
Deep learning methodology | Training procedure | Method applied to train the neural network (e.g., standard optimization, unsupervised pre-training followed by supervised fine-tuning, etc.).
Deep learning methodology | Regularization | Constraint on the hypothesis class intended to improve a learning algorithm's generalization performance (e.g., weight decay, dropout).
Deep learning methodology | Optimization | Parameter update rule.
Deep learning methodology | Hyperparameter search | Whether a specific method was employed in order to tune the hyperparameter set.
Deep learning methodology | Subject handling | Intra- vs. inter-subject analysis.
Deep learning methodology | Inspection of trained models | Method used to inspect a trained DL model.
Results | Type of baseline | Whether the study included baseline models that used traditional processing pipelines, DL baseline models, or a combination of the two.
Results | Performance metrics | Metrics used by the study to report performance (e.g., accuracy, f1-score, etc.).
Results | Validation procedure | Methodology used to validate the performance of the trained models, including cross-validation and data split.
Results | Statistical testing | Types of statistical tests used to assess the performance of the trained models.
Results | Comparison of results | Reported results of the study, both for the trained DL models and for the baseline models.
Reproducibility | Dataset | Whether the data used for the experiment comes from private recordings or from a publicly available dataset.
Reproducibility | Code | Whether the code used for the experiment is available online or not, and if so, where.
The first category covers the origin of the article, that is whether it comes from a journal, a conference publication or a preprint repository, as well as the country of the first author's affiliation. This gives a quick overview of the types of publication included in this review and of the main actors in the field. Second, the rationale category focuses on the domains of application of the selected studies. This is valuable information to understand the extent of the research in the field, and also enables us to identify trends across and within domains in our analysis. Third, the data category includes all relevant information on the data used by the selected papers. This comprises both the origin of the data and the data collection parameters, in addition to the amount of data that was available in each study. Through this section, we aim to clarify the data requirements for using DL on EEG. Fourth, the EEG processing parameters category highlights the typical transformations required to apply DL to EEG, and covers preprocessing steps, artifact handling methodology, as well as feature extraction. Fifth, details of the DL methodology, including architecture design, training procedures and inspection methods, are reported to guide the interested reader through state-of-the-art techniques. Sixth, the reported results category reviews the results of the selected articles, as well as how they were reported, and aims to clarify how DL fares against traditional processing pipelines performance-wise. Finally, the reproducibility of the selected articles is quantified by looking at the availability of the data and code. The results of this section support the critical component of our discussion.

Results
The database queries yielded 553 different results that matched the search terms (see Fig. 2). 49 additional papers were then identified using the reference sections of the initial papers. Based on our inclusion and exclusion criteria, 446 papers were excluded. Therefore, 156 papers were selected for inclusion in the analysis.

Origin of the selected studies
Our search methodology returned 49 journal papers, 58 conference and workshop papers, 48 preprints and 1 journal paper supplement ([201], included in the "Journal" category in our analysis) that met our criteria. A total of 23 journal and conference papers had initially been made available as preprints on arXiv or bioRxiv. Popular journals included Neurocomputing, Journal of Neural Engineering and Biomedical Signal Processing and Control, each with three publications contained in our selected studies. We also looked at the location of the first author's affiliation to get a sense of the geographical distribution of research on DL-EEG. We found that most contributions came from the USA, China and Australia (see Fig. 3).

Domains
The selected studies applied DL to EEG in various ways (see Fig. 4 and Table 4). Most studies (86%) focused on using DL for the classification of EEG data, most notably for sleep staging, seizure detection and prediction, brain-computer interfaces (BCIs), as well as for cognitive and affective monitoring. Around 9% of the studies focused instead on the improvement of processing tools, such as learning features from EEG, handling artifacts, or visualizing trained models. The remaining papers (5%) explored ways of generating data from EEG, e.g., augmenting data, or generating images conditioned on EEG.
Figure 4: Focus of the studies. The number of papers that fit in a category is shown in brackets for each category. Studies that covered more than one topic were categorized based on their main focus.
Despite the absolute number of DL-EEG publications being relatively small as compared to other DL applications such as computer vision [88], there is clearly a growing interest in the field. Fig. 5 shows the growth of the DL-EEG literature since 2010. The first seven months of 2018 alone count more publications than 2010 to 2016 combined, hence the relevance of this review. It is, however, still too early to draw conclusions about trends concerning the application domains, given the relatively small number of publications to date.

Data
The availability of large datasets containing unprecedented numbers of examples is often mentioned as one of the main enablers of deep learning research in the early 2010s [53]. It is thus crucial to understand what the equivalent is in EEG research, given the relatively high cost of collecting EEG data. Given the high dimensionality of EEG signals [105], one would assume that a considerable amount of data is required. Although our analysis cannot answer that question fully, we seek to cover as many dimensions of the answer as possible to give the reader a complete view of what has been done so far.
[Table 4 fragment: recovered row labels "Reduce effect of confounders", "Signal cleaning" and "Artifact handling"; recovered citations [165], [197], [202,203], [193], [130]; caption ends "... Fig. 4 have been grouped together."]

Quantity of data
We make use of two different measures to report the amount of data used in the reviewed studies: 1) the number of examples available to the deep learning network and 2) the total duration of the EEG recordings used in the study, in minutes. Both measures include the EEG data used across training, validation and test phases. For an in-depth analysis of the amount of data, please see the data items table which contains more detailed information.
The left column of Fig. 6 shows the amount of EEG data, in minutes, used in the analysis of each study, including training, validation and/or testing. Therefore, the time reported here does not necessarily correspond to the total recording time of the experiment(s). For example, many studies recorded a baseline at the beginning and/or at the end but did not use it in their analysis. Moreover, some studies recorded more classes than they used in their analysis. Also, some studies used sub-windows of recorded epochs (e.g. in a motor imagery BCI, using 3 s of a 7 s epoch). The amount of data in minutes used across the studies ranges from 2 up to 4,800,000 (mean = 62,602; median = 360).
The center column of Fig. 6 [48]). The wide range of windowing approaches (see Section 3.3.4) indicates that a better understanding of its impact is still required. The number of examples used ranged from 62 up to 9,750,000 (mean = 251,532; median = 14,000).
The right column of Fig. 6 shows the ratio between the number of examples and the amount of data in minutes (i.e., examples per minute). This ratio was never mentioned specifically in the papers reviewed, but we nonetheless wanted to see if there were any trends or standards across domains. We found that in sleep studies, for example, this ratio tends to be two, as most studies use 30 s non-overlapping windows. Brain-computer interfacing shows the most sparsity, perhaps indicating a lack of best practices for sliding windows. It is important to note that the BCI field is also the one in which the exact relevant time measures were hardest to obtain, since most of the recorded data is not used (e.g., baseline, in-between epochs). Therefore, some of the sparsity on the graph could come from our best-effort attempts to understand and calculate the amount of data used (i.e., seen by the model). In the remaining categories (generation of data, improvement of processing tools and others), this ratio has little to no value, as the trends would be difficult to interpret.
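The relationship between windowing and the examples-per-minute ratio can be sketched with a few lines of arithmetic. The helper names and the 8-hour recording below are illustrative assumptions; only the 30 s non-overlapping window comes from the text above.

```python
def n_windows(duration_s, window_s, stride_s):
    """Number of full windows that fit in a recording (all values in seconds)."""
    if duration_s < window_s:
        return 0
    return (duration_s - window_s) // stride_s + 1


def examples_per_minute(duration_s, window_s, stride_s):
    """Ratio between the number of examples and the data duration in minutes."""
    return n_windows(duration_s, window_s, stride_s) / (duration_s / 60)


night_s = 8 * 60 * 60  # assumed: one 8-hour sleep recording

# 30 s non-overlapping windows, as is common in sleep staging.
print(n_windows(night_s, 30, 30))            # 960
print(examples_per_minute(night_s, 30, 30))  # 2.0

# Hypothetical sliding window: 3 s windows moved by 1 s over 60 s of data.
print(n_windows(60, 3, 1))                   # 58
```

The second case shows how overlapping windows inflate the ratio, which is one reason the values can vary widely between studies that use different sliding-window settings.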
The amount of data across different domains varies significantly. In domains like sleep and epilepsy, EEG recordings last many hours (e.g., a full night), but in domains like affective and cognitive monitoring, the data usually comes from lab experiments on the scale of a few hours or even a few minutes.

Subjects
Often correlated with the amount of data, the number of subjects also varies significantly across studies (see Fig. 7).
Half of the datasets used in the selected studies contained fewer than 13 subjects. Six studies, in particular, used datasets with a much greater number of subjects: [132,160,188,149] all used datasets with at least 250 subjects, while [22] and [49] used datasets with 10,000 and 16,000 subjects, respectively. As explained in Section 3.7.4, the untapped potential of DL-EEG might reside in combining data coming from many different subjects and/or datasets to train a model that captures common underlying features and generalizes better. In [202], for example, the authors trained their model using an existing public dataset and also recorded their own EEG data to test the generalization on new subjects. In [191], an increase in performance was observed when using more subjects during training before testing on new subjects. The authors tested using from 1 to 30 subjects with a leave-one-subject-out cross-validation scheme, and reported an increase in performance with noticeable diminishing returns above 15 subjects.
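A leave-one-subject-out scheme such as the one mentioned above can be sketched as follows. This is a generic illustration, not the code used in [191]; the function name and the toy subject labels are assumptions. scikit-learn's `LeaveOneGroupOut` provides equivalent splits.

```python
import numpy as np


def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject."""
    subject_ids = np.asarray(subject_ids)
    for held_out in np.unique(subject_ids):
        test_mask = subject_ids == held_out
        yield held_out, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


# Toy example: six windows recorded from three subjects.
subjects = np.array([1, 1, 2, 2, 2, 3])
folds = list(leave_one_subject_out(subjects))
for held_out, train_idx, test_idx in folds:
    print(held_out, train_idx, test_idx)
```

Each fold tests on windows from a subject that contributed nothing to training, so the averaged score estimates inter-subject generalization rather than intra-subject performance.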

Recording parameters
As shown later in Section 3.8, 42% of reported results came from private recordings. We look at the type of EEG device that was used by the selected studies to collect their data, and additionally highlight low-cost, often called "consumer" EEG devices, as compared to traditional "research" or "medical" EEG devices (see Fig. 8a). We loosely defined low-cost EEG devices as devices under the USD 1,000 threshold (excluding software, licenses and accessories). Among these devices, the Emotiv EPOC was used the most, followed by the OpenBCI, Muse and Neurosky devices. As for the research-grade EEG devices, the BioSemi ActiveTwo was used the most, followed by BrainVision products.
The EEG data used in the selected studies was recorded with 1 to 256 electrodes, with half of the studies using between 8 and 62 electrodes (see Fig. 8b). The number of electrodes required for a specific task or analysis is usually arbitrarily defined, as no fundamental rules have been established. In most cases, adding electrodes will improve possible analyses by increasing spatial resolution. However, adding an electrode close to other electrodes might not provide significantly different information, while increasing the preparation time and the participant's discomfort and requiring a more costly device. Higher density EEG devices are popular in research but hardly ecological. In [153], the authors explored the impact of the number of channels on the specificity and sensitivity for seizure detection. They showed that increasing the number of channels from 4 up to 22 (including two referential channels) resulted in an increase in sensitivity from 31% to 39% and in specificity from 40% to 90%. They concluded, however, that the position of the referential channels is very important as well, making it difficult to compare across datasets coming from different neurologists and recording sites using different locations for the reference channel(s).
Similarly, in [26], the impact of different electrode configurations was assessed on a sleep staging task. The authors found that increasing the number of electrodes from two to six produced the highest increase in performance, while adding additional sensors, up to 22 in total, also improved the performance but not as much. The placement of the electrodes in a 2-channel montage also impacted the performance, with central and frontal montages leading to better performance than posterior ones on the sleep staging task.
Furthermore, EEG sampling rates varied mostly between 100 and 1000 Hz in the selected studies (the sampling rate reported here is the one used to record the EEG data and not after downsampling, as described in Section 3.4). Around 50% of studies used sampling rates of 250 Hz or less and the highest sampling rate used was 5000 Hz [67].

Data augmentation
Data augmentation is a technique by which new data examples are artificially generated from the existing training data. Data augmentation has proven efficient in other fields such as computer vision, where data manipulations including rotations, translations, cropping and flipping can be applied to generate more training examples [134]. Adding more training examples allows the use of more complex models comprising more parameters while reducing overfitting. When done properly, data augmentation increases accuracy and stability, offering a better generalization on new data [215].
Out of the 156 papers reviewed, three papers explicitly explored the impact of data augmentation on DL-EEG [192,219,152]. Interestingly, each one looked at it from the perspective of a different domain: sleep, affective monitoring and BCI. Also, all three are from 2018, perhaps showing an emerging interest in data augmentation. First, in [192], Gaussian noise was added to the training data to obtain new examples. This approach was tested on two different public datasets for emotion classification (SEED [227] and MAHNOB-HCI [159]). They improved their accuracy on the SEED dataset using LeNet [90] from 49.6% (without augmentation) to 74.3% (with augmentation), from 34.2% (without) to 75.0% (with) using ResNet [70] and from 40.8% (without) to 45.4% (with) on the MAHNOB-HCI dataset using ResNet. Their best accuracy was obtained with a standard deviation of 0.2 and by augmenting the data to 30 times its original size. Despite impressive results, it is important to note that they also compared LeNet and ResNet to an SVM which had an accuracy of 74.2% (without) and 73.4% (with) on the SEED dataset. This might indicate that the initial amount of data was insufficient for LeNet or ResNet but adding data clearly helped bring the performance up to par with the SVM. Second, in [219], a conditional deep convolutional generative adversarial network (cDCGAN) was used to generate artificial EEG signals on one of the BCI Competition motor imagery datasets. Using a CNN, it was shown that data augmentation helped improve motor imagery classification accuracy from 83% to around 86%.
Third, in [152], the authors explicitly targeted the class imbalance problem of under-represented sleep stages by generating Fourier transform (FT) surrogates of raw EEG data on the CAPSLPDB dataset. They improved their accuracy by up to 24% on some classes.
An additional 30 papers explicitly used data augmentation in one form or another but only a handful investigated the impact it has on performance. In [82,15], noise was added to 2D feature images, although it did not improve results in [15]. In [76], artifacts such as eye blinks and muscle activity, as well as Gaussian white noise, were used to augment the data and improve robustness. In [209] and [208], Gaussian noise was added to the input feature vector. This approach increased the accuracy of the SDAE model from around 76.5% (without augmentation) to 85.5% (with).
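The Gaussian-noise augmentation discussed above can be sketched in a few lines. This is an illustration under our own assumptions and not the implementation of [192] or [209,208]: the function name, the toy feature vectors and the fixed random seed are hypothetical, and only the standard deviation of 0.2 and the 30-fold increase are taken from the values reported in [192].

```python
import random


def augment_with_gaussian_noise(examples, labels, n_copies, std, seed=0):
    """Return the original examples plus n_copies noisy versions of each one."""
    rng = random.Random(seed)
    aug_examples = [list(x) for x in examples]
    aug_labels = list(labels)
    for x, y in zip(examples, labels):
        for _ in range(n_copies):
            # Add independent zero-mean Gaussian noise to every value.
            aug_examples.append([v + rng.gauss(0.0, std) for v in x])
            aug_labels.append(y)  # the label is unchanged by the noise
    return aug_examples, aug_labels


# Hypothetical toy data: two 3-dimensional feature vectors with binary labels.
examples = [[0.1, -0.3, 0.5], [0.0, 0.2, -0.1]]
labels = [0, 1]
# 29 noisy copies plus the original give 30 times the original size.
aug_examples, aug_labels = augment_with_gaussian_noise(
    examples, labels, n_copies=29, std=0.2
)
print(len(examples), "->", len(aug_examples))
```

Such augmentation would normally be applied to the training set only, so that validation and test performance are still measured on unmodified recordings.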
Multiple studies also used overlapping windows as a way to augment their data, although many did not explicitly frame this as data augmentation. In [185,123], overlapping windows were explicitly used as a data augmentation technique. In [83], different shift lengths between overlapping windows (from 10 ms to 60 ms out of a 2-s window) were compared, showing that by generating more training samples with smaller shifts, performance improved significantly. In [150], the concept of overlapping windows was pushed even further: 1) redundant computations due to EEG samples being in more than one window were simplified thanks to "cropped training", which ensured these computations were only done once, thereby speeding up training and 2) the fact that overlapping windows share information was used to design an additional term to the cost function, which further regularizes the models by penalizing decisions that are not the same while being close in time.
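As a minimal sketch of how overlapping windows increase the number of training examples, the snippet below extracts fixed-length windows from a one-dimensional signal with a configurable shift. The function name and the toy signal are hypothetical; window length and shift are expressed in samples rather than in the milliseconds used in [83].

```python
def sliding_windows(signal, window_len, shift):
    """Cut a 1D signal into windows of window_len samples, moving by shift samples."""
    if window_len <= 0 or shift <= 0:
        raise ValueError("window_len and shift must be positive")
    last_start = len(signal) - window_len
    return [signal[start:start + window_len]
            for start in range(0, last_start + 1, shift)]


signal = list(range(10))  # toy signal with 10 samples

# Shift equal to the window length: no overlap between windows.
no_overlap = sliding_windows(signal, window_len=4, shift=4)
# Shift of one sample: consecutive windows share three samples.
overlap = sliding_windows(signal, window_len=4, shift=1)

print(len(no_overlap), "windows without overlap")
print(len(overlap), "windows with a one-sample shift")
```

With this toy signal, reducing the shift from four samples to one raises the number of windows from 2 to 7, although the additional windows are highly correlated with each other.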
Other procedures used the inherent spatial and temporal characteristics of EEG to augment their data. In [34], the authors doubled their data by swapping the right and left side electrodes, claiming that as the task was a symmetrical problem, which side of the brain expresses the response would not affect classification. In [17], the authors augmented their multimodal (EEG and EMG) data by duplicating samples and keeping the values from one modality only, while setting the other modality values to 0 and vice-versa. In [42], the authors made use of the data that is usually thrown away when downsampling EEG in the preprocessing stage. It is common to downsample a signal acquired at higher sampling rate to 256 Hz or less. In their case, they reused the data thrown away during that step as new samples: a downsampling by a factor of N would therefore allow an augmentation of N times.
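One plausible reading of the downsampling-based augmentation of [42] is sketched here: instead of keeping only every N-th sample, each of the N possible starting offsets yields its own downsampled copy. The function name is hypothetical, and the sketch assumes that the signal has already been low-pass filtered so that plain decimation does not introduce aliasing; the exact procedure of [42] may differ.

```python
def downsample_with_offsets(signal, factor):
    """Return `factor` downsampled copies of the signal, one per starting offset."""
    # signal[offset::factor] keeps every factor-th sample starting at `offset`,
    # so samples normally discarded by downsampling end up in the other copies.
    return [signal[offset::factor] for offset in range(factor)]


signal = list(range(12))  # toy signal standing in for one EEG channel
versions = downsample_with_offsets(signal, factor=4)
for offset, version in enumerate(versions):
    print(offset, version)
```

Every original sample appears in exactly one of the copies, which is why a downsampling factor of N gives N times as many examples.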
Finally, classification of rare events where the number of available samples is orders of magnitude smaller than for their counterpart classes [152] is another motivation for data augmentation. In EEG classification, epileptic seizures or transitional sleep stages (e.g. S1 and S3) often lead to such unbalanced classes. In [190], the class imbalance problem was addressed by randomly balancing all classes while sampling for each training epoch. Similarly, in [26], balanced accuracy was maximized by using a balanced sampling strategy. In [183], EEG segments from the interictal class were split into smaller subgroups of equal size to the preictal class. In [160], cost-sensitive learning and oversampling were used to solve the class imbalance problem for sleep staging but the overall performance using these approaches did not improve. In [144], the authors randomly replicated subjects from the minority class to balance classes. Similarly, in [167,38,39,109], oversampling of the minority class was used to balance classes. Conversely, in [175,154], the majority class was subsampled. In [181], an overlapping window with a subject-specific overlap was used to match classes. Similar work by the same group [180] showed that when training a GAN on individual subjects, augmenting data with an overlapping window increased accuracy from 60.91% to 74.33%. For more on imbalanced learning, we refer the interested reader to [155].
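Random oversampling of the minority class, as used in several of the studies above, can be illustrated as follows. The sketch is generic and does not reproduce any specific paper: the function name, the class names and the toy data are hypothetical, and minority examples are simply duplicated at random until all classes reach the size of the largest one.

```python
import random
from collections import defaultdict


def oversample_to_balance(examples, labels, seed=0):
    """Duplicate randomly chosen minority examples until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(examples, labels):
        by_class[y].append(x)
    target = max(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        # Draw extra examples with replacement from the same class.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        balanced.extend((x, y) for x in xs + extra)
    rng.shuffle(balanced)
    return balanced


# Hypothetical imbalanced set: 8 interictal windows and 2 preictal windows.
examples = list(range(10))
labels = ["interictal"] * 8 + ["preictal"] * 2
balanced = oversample_to_balance(examples, labels)
print(len(balanced), "examples after balancing")
```

Subsampling the majority class, as in [175,154], would follow the same pattern with the smallest class size as the target and sampling without replacement.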

EEG processing
One of the oft-claimed motivations for using deep learning for EEG processing is automatic feature learning [132,76,45,68,114,11,213]. This can be explained by the fact that feature engineering is a time-consuming task [98]. Additionally, preprocessing and cleaning EEG signals from artifacts is a demanding step of the usual EEG processing pipeline. Hence, in this section, we look at aspects related to data preparation, such as preprocessing, artifact handling and feature extraction. This analysis is critical to clarify what level of preprocessing EEG data requires to be successfully used with deep neural networks.

Preprocessing
Preprocessing EEG data usually comprises a few general steps, such as downsampling, band-pass filtering, and windowing. Throughout the process of reviewing papers, we found that a different number of preprocessing steps were employed in the studies. In [71], it is mentioned that "a substantial amount of preprocessing was required" for assessing cognitive workload using DL. More specifically, it was necessary to trim the EEG trials, downsample the data to 512 Hz and 64 electrodes, identify and interpolate bad channels, calculate the average reference, remove line noise, and high-pass filter the data starting at 1 Hz. On the other hand, Stober et al. [164] applied a single preprocessing step by removing the bad channels for each subject. In studies focusing on emotion recognition using the DEAP dataset [81], the same preprocessing methodology proposed by the researchers that collected the dataset was typically used, i.e., re-referencing to the common average, downsampling to 256 Hz, and high-pass filtering at 2 Hz.
We separated the papers into three categories based on whether or not they used preprocessing steps: "Yes", in cases where preprocessing was employed; "No", when the authors explicitly mentioned that no preprocessing was necessary; and not mentioned ("N/M") when no information was provided. The results are shown in Fig. 9.  A considerable proportion of the reviewed articles (72%) employed at least one preprocessing method such as downsampling or re-referencing. This result is not surprising, as applications of DNNs to other domains, such as computer vision, usually require some kind of preprocessing like cropping and normalization as well.
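Among the preprocessing steps listed above, re-referencing to the common average is simple enough to sketch directly. The example assumes a recording stored as a list of channels of equal length; the function name and the toy values are hypothetical, and real pipelines would typically rely on a dedicated EEG library rather than plain lists.

```python
def common_average_reference(eeg):
    """Subtract, at every time step, the mean across channels from each channel.

    eeg: list of channels, each a list of samples of equal length.
    """
    n_channels = len(eeg)
    n_samples = len(eeg[0])
    mean_per_sample = [sum(ch[t] for ch in eeg) / n_channels
                       for t in range(n_samples)]
    return [[ch[t] - mean_per_sample[t] for t in range(n_samples)]
            for ch in eeg]


# Hypothetical recording: 3 channels, 3 time samples.
eeg = [[1.0, 2.0, 3.0],
       [3.0, 2.0, 1.0],
       [5.0, 5.0, 5.0]]
rereferenced = common_average_reference(eeg)
for channel in rereferenced:
    print(channel)
```

After re-referencing, the channels sum to zero at every time step, which removes any component shared equally by all electrodes.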

Artifact handling
Artifact handling techniques are used to remove specific types of noise, such as ocular and muscular artifacts [186]. As emphasized in [203], removal of artifacts may be crucial for achieving good EEG decoding performance. Adding this to the fact that cleaning EEG signals might be a time-consuming process, some studies attempted to apply only minimal preprocessing such as removing bad channels and leave the burden of learning from a potentially noisy signal on the neural network [164]. With that in mind, we decided to look at artifact handling separately.
Artifact removal techniques usually require the intervention of a human expert [120]. Different techniques leverage human knowledge to different extents, and might fully rely on an expert, as in the case of visual inspection, or require prior knowledge to simply tune a hyperparameter, as in the case of wavelet-based independent component analysis (ICA) [108]. Among the studies which handled artifacts, a myriad of techniques were applied. Some studies employed methods which rely on human knowledge such as amplitude thresholding [114], manual identification of high-variance segments [71], and handling EEG blinking-related noise based on high-amplitude EOG segments [109]. On the other hand, many other articles favored techniques that rely less on human intervention, such as blind source separation techniques. For instance, in [166,207,208,45,131,133], ICA was used to separate ocular components from EEG data.
In order to investigate the necessity of removing artifacts from EEG when using deep neural networks, we split the selected papers into three categories, in a similar way to the preprocessing analysis (see Fig. 9). Almost half the papers (46%) did not use artifact handling methods, while 24% did. Additionally, 31% of the studies did not mention whether artifact handling was necessary to achieve their results. Given those results, we are encouraged to believe that using DNNs on EEG might be a way to avoid the explicit artifact removal step of the classical EEG processing pipeline without harming task performance.

Features
Feature engineering is one of the most demanding steps of the traditional EEG processing pipeline [98] and the main goal of many papers considered in this review [132,76,45,68,114,11,213] is to get rid of this step by employing deep neural networks for automatic feature learning. This aspect appears to be of interest to researchers in the field since its early stages, as indicated by the work of Wulsin et al. [198], which, in 2011, compared the performance of deep belief networks (DBNs) on classification and anomaly detection tasks using both raw EEG and features as inputs.
More recently, studies such as [165,66] achieved promising results without the need to extract features.
Given that the majority of EEG features are obtained in the frequency domain, our analysis consisted in separating the reviewed articles into four categories according to the respective input type. Namely, the categories were: "Raw EEG", "Frequency-domain", "Combination" (in case more than one type of feature was used), and "Other" (for papers using neither raw EEG nor frequency-domain features). Studies that did not specify the type of input were assigned to the category "N/M" (not mentioned). Notice that, here, we use "feature" and "input type" interchangeably. Fig. 9 presents the result of our analysis. One can observe that 49% of the papers used only raw EEG data as input, whereas 48% used hand-engineered features, from which 36% corresponded to frequency domain-derived features. Finally, 3% did not specify the type of input of their model. According to these results, we find indications that DNNs can in fact be applied to raw EEG data and achieve state-of-the-art results.

Architecture
A crucial choice in the DL-based EEG processing pipeline is the neural network architecture to be used. In this section, we aim at answering a few questions on this topic, namely: 1) "What are the most frequently used architectures?", 2) "How has this changed across years?", 3) "Is the choice of architecture related to input characteristics?" and 4) "How deep are the networks used in DL-EEG?".
To answer the first three questions, we divided and assigned the architectures used in the 156 papers into the following groups: CNNs, RNNs, AEs, restricted Boltzmann machines (RBMs), DBNs, GANs, FC networks, combinations of CNNs and RNNs (CNN+RNN), and "Others" for any other architecture or combination not included in the aforementioned categories (see Fig. 10a). In Fig. 10b, we provide a visualization of the distribution of architecture types across years. Until the end of 2014, DBNs and FC networks comprised the majority of the studies. However, since 2015, CNNs have been the architecture type of choice in most studies. This can be attributed to their capabilities of end-to-end learning and of exploiting hierarchical structure in the data [177], as well as their success and subsequent popularity on computer vision tasks, such as the ILSVRC 2012 challenge [35]. Interestingly, we also observe that as the number of papers grows, the proportion of studies using CNNs and combinations of recurrent and convolutional layers has been growing steadily. The latter shows that RNNs are increasingly of interest for EEG analysis. On the other hand, the use of architectures such as RBMs, DBNs and AEs has been decreasing with time. Commonly, models employing these architectures utilize a two-step training procedure consisting of 1) unsupervised feature learning and 2) training a classifier on top of the learned features. However, we notice that recent studies leverage the hierarchical feature learning capabilities of CNNs to achieve end-to-end supervised feature learning, i.e., training both a feature extractor and a classifier simultaneously.
To complement the previous result, we cross-checked the architecture and input type information provided in Fig. 9.
Results are presented in Fig. 10c and clearly show that CNNs are indeed used more often with raw EEG data as input. This corroborates the idea that researchers employ this architecture with the aim of leveraging the capabilities of deep neural networks to process EEG data in an end-to-end fashion, avoiding the time-consuming task of extracting features. From this figure, one can also notice that some architectures such as deep belief networks are typically used with frequency-domain features as inputs, while GANs, on the other hand, have been only applied to EEG processing using raw data.

Number of layers
Deep neural networks are usually composed of stacks of layers which provide hierarchical processing. Although one might think the use of deep neural networks implies the existence of a large number of layers in the architecture, there is no absolute consensus in the literature regarding this definition. Here we investigate this aspect and show that the number of layers is not necessarily large, i.e., larger than three, in many of the considered studies (see Fig. 10d).
Some studies specifically investigated the effect of increasing the model depth. Zhang et al. [218] evaluated the performance of models with depth ranging from two to 10 on a mental workload classification task. Architectures with seven layers outperformed both shallower (two and four layers) and deeper (10 layers) models in terms of accuracy, precision, F-measure and G-mean. Moreover, O'Shea et al. [123] compared the performance of a CNN with six and 11 layers on neonatal seizure detection. Their results show that, in this case, the deeper network presented a better area under the receiver operating characteristic curve (ROC AUC) in comparison to the shallower model, as well as a support vector machine (SVM). In [83], the effect of depth on CNN performance was also studied. The authors compared results obtained by a CNN with two and three convolutional layers on the task of classifying SSVEPs under ambulatory conditions. The shallower architecture outperformed the three-layer one in all scenarios considering different amounts of training data. Canonical correlation analysis (CCA) together with a KNN classifier was also evaluated and employed as a baseline method. Interestingly, as the number of training samples increased, the shallower model outperformed the CCA-based baseline.

EEG-specific design choices
Particular choices regarding the architecture might enable a model to mimic the process of extracting EEG features. An architecture can also be specifically designed to impose specific properties on the learned representations. This is for instance the case with max-pooling, which is used to produce feature maps that are invariant to slight translations of the input [53]. In the case of EEG signals, one might be interested in forcing the model to process temporal and spatial information separately in the earlier stages of the network. In [26,83,213,16,150,109], one-dimensional convolutions were used in the input layer with the aim of processing either temporal or spatial information independently at this point of the hierarchy. Other studies [224,167] combined recurrent and convolutional neural networks as an alternative to the previous approach of separating temporal and spatial content. Recurrent models were also applied in cases where it was necessary to capture long-term dependencies from the EEG data [100,220].
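The idea of processing temporal and spatial information in separate one-dimensional convolutions can be sketched as follows. This is a toy illustration and not the architecture of any cited study: the function names are hypothetical, the kernel and channel weights are fixed by hand whereas a network would learn them, and, as in most DL frameworks, the "convolution" is implemented as a cross-correlation.

```python
def temporal_conv(eeg, kernel):
    """Apply the same 1D kernel along time, independently for each channel."""
    k = len(kernel)
    return [[sum(kernel[j] * ch[t + j] for j in range(k))
             for t in range(len(ch) - k + 1)]
            for ch in eeg]


def spatial_conv(eeg, weights):
    """Combine channels with one weight per channel at every time step."""
    n_samples = len(eeg[0])
    return [sum(w * ch[t] for w, ch in zip(weights, eeg))
            for t in range(n_samples)]


# Hypothetical recording: 2 channels, 4 time samples.
eeg = [[1.0, 2.0, 3.0, 4.0],
       [0.0, 1.0, 0.0, 1.0]]

# Step 1: temporal filtering only (a two-sample moving average per channel).
smoothed = temporal_conv(eeg, kernel=[0.5, 0.5])
# Step 2: spatial filtering only (channel 0 minus channel 1 at each time step).
difference = spatial_conv(smoothed, weights=[1.0, -1.0])

print(smoothed)
print(difference)
```

The temporal step plays a role similar to band-pass filtering, while the spatial step resembles a learned montage or spatial filter, which is why this factorization is often described as mimicking classical EEG feature extraction.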

Training
Details regarding the training of the models proposed in the literature are of great importance as different approaches and hyperparameter choices can greatly impact the performance of neural networks. The use of pre-trained models, regularization, and hyperparameter search strategies are examples of aspects we took into account during the review process. We report our main findings in this section.

Training procedure
One of the advantages of applying deep neural networks to EEG processing is the possibility of simultaneously training a feature extractor and a model for executing a downstream task such as classification or regression. However, in some of the reviewed studies [86,195,116], these two tasks were executed separately. Usually, the feature learning was done in an unsupervised fashion, with RBMs, DBNs, or AEs. After training those models to provide an appropriate representation of the EEG input signal, the new features were then used as the input for a target task which is, in general, classification. In other cases, models pre-trained for a different purpose, such as object recognition, were fine-tuned on the specific EEG task with the aim of providing a better initialization or regularization effect [97].
In order to investigate the training procedure of the reviewed papers, we classify each one according to the adopted training procedure. Models which have parameters learned without using any kind of pre-training were assigned to the "Standard" group. The remaining studies, which specified the training procedure, were included in the "Pre-training" class, in case the parameters were learned in more than one step. Finally, papers employing different methodologies for training, such as co-learning [34], were included in the "Other" group.
In Fig. 11a, we show how the reviewed papers are distributed according to the training procedure. "N/M" refers to studies which have not reported this aspect. Almost half the papers did not employ any pre-training strategy, while 25% did. Even though the training strategy is crucial for achieving good performance with deep neural networks, 26% of the selected studies have not explicitly described it in their paper.

Regularization
In the context of our literature review, we define regularization as any constraint on the set of possible functions parametrized by the neural network intended to improve its performance on unseen data during training [53]. The main goal when regularizing a neural network is to control its complexity in order to obtain better generalization performance [21], which can be verified by a decrease in test error in the case of classification problems. There are several ways of regularizing neural networks, and among the most common are weight decay (L2 and L1 regularization) [53], early stopping [139], dropout [168], and label smoothing [169]. Notice that even though the use of pre-trained models as initialization can also be interpreted as a regularizer [97], in this work we decided to include it in the training procedure analysis instead.
As the use of regularization might be fundamental to guarantee a good performance on unseen data during training, we analyzed how many of the reviewed studies explicitly stated that they have employed it in their models. Papers were separated in two groups, namely: "Yes" in case any kind of regularization was used, and "N/M" otherwise. In Fig. 11 we present the proportion of studies in each group.
From Fig. 11, one can notice that more than half the studies employed at least one regularization method. Furthermore, regularization methods were frequently combined in the reviewed studies. Hefron et al. [71] employed a combination of dropout, L1- and L2-regularization to learn temporal and frequency representations across different participants. The developed model was trained for recognizing mental workload states elicited by the MATB task [31]. Similarly, Längkvist and Loutfi [86] combined two types of regularization with the aim of developing a model tailored to an automatic sleep stage classification task. Besides L2-regularization, they added a penalty term to encourage weight sparsity, defined as the KL-divergence between the mean activation of each hidden unit over all training examples in a training batch and a hyperparameter ρ.
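A KL-divergence sparsity penalty of the kind described above can be sketched as follows. We assume the Bernoulli form commonly used for sparse autoencoders, KL(ρ ‖ ρ̂_j) = ρ log(ρ/ρ̂_j) + (1 − ρ) log((1 − ρ)/(1 − ρ̂_j)), summed over hidden units j, where ρ̂_j is the mean activation of unit j over the batch; whether [86] used exactly this form is an assumption on our part. The function name and the toy activations are hypothetical, and activations are assumed to lie strictly between 0 and 1 (e.g. sigmoid units).

```python
import math


def kl_sparsity_penalty(activations, rho):
    """Sum over hidden units of KL(rho || mean activation of the unit).

    activations: list of examples, each a list of hidden-unit activations in (0, 1).
    rho: target mean activation (the sparsity hyperparameter).
    """
    n_examples = len(activations)
    n_units = len(activations[0])
    penalty = 0.0
    for j in range(n_units):
        # Mean activation of hidden unit j over the batch.
        rho_hat = sum(a[j] for a in activations) / n_examples
        penalty += (rho * math.log(rho / rho_hat)
                    + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))
    return penalty


# Hypothetical batch of 2 examples and 2 hidden units.
# Unit 0 is as sparse as requested; unit 1 is active most of the time.
batch = [[0.1, 0.9],
         [0.1, 0.7]]
print(kl_sparsity_penalty(batch, rho=0.1))
```

The penalty is zero when every unit's mean activation equals ρ and grows as units become more active than the target, so adding it to the cost function pushes the hidden representation towards sparsity.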

Optimization
Learning the parameters of a deep neural network is, in practice, an optimization problem. The best way to tackle it is still an open research question in the deep learning literature, as there is often a compromise between finding a good solution in terms of minimizing the cost function and the performance of a local optimum expressed by the generalization gap, i.e. the difference between the training error and the true error estimated on the test set. In this scenario, the choice of a parameter update rule, i.e. the learning algorithm or optimizer, might be key for achieving good results.
The most commonly used optimizers are reported in Fig. 11. One surprising finding is that even though the choice of optimizer is a fundamental aspect of the DL-EEG pipeline, 47% of the considered studies did not report which parameter update rule was applied. Moreover, 30% used Adam [80] and 17% Stochastic Gradient Descent [141] (notice that we also refer to the mini-batch case as SGD). The remaining 6% of the papers utilized different optimizers, such as RMSprop [178], Adagrad [40], and Adadelta [214].
Another interesting finding the optimizer analysis provided is the steady increase in the use of Adam. Indeed, from 2017 to 2018, the percentage of studies using Adam increased from 31.9% to 52.6%. Adam was proposed as a gradient-based method with the capability of adaptively tuning the learning rate based on estimates of first and second order moments of the gradient. It became very popular in general deep neural network applications (accumulating approximately 15,000 citations since 2014). Interestingly, we notice a proportional decrease from 2017 to 2018 in the number of papers which did not report the optimizer utilized.
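To illustrate what "adaptively tuning the learning rate based on estimates of first and second order moments of the gradient" means in practice, the sketch below applies the Adam update rule [80] to a one-dimensional toy problem. The function name, the quadratic objective and the number of steps are hypothetical choices for illustration; the default values of beta1, beta2 and eps follow those commonly used with Adam.

```python
import math


def adam_minimize(grad_fn, x0, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, n_steps=500):
    """Minimize a scalar function with Adam, given its gradient grad_fn."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, n_steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g       # running mean of the gradient
        v = beta2 * v + (1 - beta2) * g * g   # running mean of the squared gradient
        m_hat = m / (1 - beta1 ** t)          # bias correction for the zero init
        v_hat = v / (1 - beta2 ** t)
        # The step is scaled by the inverse root of the second-moment estimate.
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Toy objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_min = adam_minimize(lambda x: 2 * (x - 3.0), x0=0.0)
print(x_min)
```

Because each update is normalized by the recent gradient magnitude, the effective step size is largely independent of the scale of the gradient, which is one reason Adam tends to require less learning rate tuning than plain SGD.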

Hyperparameter search
From a practical point-of-view, tuning the hyperparameters of a learning algorithm often takes up a great part of the time spent during training. GANs, for instance, are known to be sensitive to the choices of optimizer and architecture hyperparameters [58,99]. In order to minimize the amount of time spent finding an appropriate set of hyperparameters, several methods have been proposed in the literature. Examples of commonly applied methods are grid search [18] and Bayesian optimization [158]. Grid search consists in determining a range of values for each parameter to be tuned, choosing values in this range, and evaluating the model, usually on a validation set, considering all combinations. One of the advantages of grid search is that it is highly parallelizable, as each set of hyperparameters is independent of the others. Bayesian optimization, in turn, defines a posterior distribution over the hyperparameter space and iteratively updates its values according to the performance obtained by the model with a hyperparameter set corresponding to the expected posterior.
Given the importance of finding a good set of hyperparameters and the difficulty of achieving this in general, we calculate the percentage of papers that employed some search method for tuning their models and optimizers, as well as the number of articles that have not included any information regarding this aspect. Results indicate that almost 80% of the reviewed papers have not mentioned the use of hyperparameter search strategies. It is important to highlight that among those articles, it is not clear how many have not done any tuning at all and how many have just not considered including this information in the paper. Of the 21% that declared having searched for an appropriate set of hyperparameters, some have manually done this by trial and error (e.g. [2,38,183,132]), while others employed grid search (e.g. [207,200,39,208,101,11,86]), and a few used other strategies such as Bayesian methods (e.g. [163,164,152]).
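Grid search, as described above, amounts to evaluating every combination of candidate values and keeping the best one. The following sketch makes this explicit; the function names, the hyperparameter names and the scoring function are hypothetical, and the scoring function is a toy stand-in for training a model and measuring its accuracy on a validation set.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Evaluate every combination in the grid and return the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        # Each evaluation is independent, so this loop could run in parallel.
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(params):
    """Toy stand-in for validation accuracy, highest at lr=0.01 and 4 layers."""
    return (-abs(params["learning_rate"] - 0.01)
            - 0.1 * abs(params["n_layers"] - 4))


grid = {"learning_rate": [0.1, 0.01, 0.001], "n_layers": [2, 4, 8]}
best_params, best_score = grid_search(toy_validation_score, grid)
print(best_params)
```

The number of evaluations is the product of the number of candidate values per hyperparameter (here 3 × 3 = 9), which is why grid search quickly becomes expensive as more hyperparameters are tuned.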

Inspection of trained models
In this section, we review if, and how, studies have inspected their proposed models. Out of the selected studies, 27% reported inspecting their models. Two studies focused more specifically on the question of model inspection in the context of DL and EEG [67,45]. See Table 5 for a list of the different techniques that were used by more than one study. For a general review of DL model inspection techniques, see [75].
The most frequent model inspection techniques involved the analysis of the trained model's weights [135,211,86,34,87,200,182,122,170,228,164,109,204]. This often requires focusing on the weights of the first layer only, as their interpretation in regard to the input data is straightforward. Indeed, the absolute value of a weight represents the strength with which the corresponding input dimension is used by the model; a higher value can therefore be interpreted as a rough measure of feature importance. For deeper layers, however, the hierarchical nature of neural networks means it is much harder to understand what a weight is applied to. The analysis of model activations was used in multiple studies [212,194,87,83,208,167,154,109]. This kind of inspection method usually involves visualizing the activations of the trained model over multiple examples, and thus inferring how different parts of the network react to known inputs. The input-perturbation network-prediction correlation map technique, introduced in [149], pushes this idea further by trying to identify causal relationships between the inputs and the decisions of a model. To do this, the input is first perturbed, either in the time- or frequency-domain, to alter its amplitude or phase characteristics [67], and then fed into the network. The impact of the perturbation on the activations of the last layer's units then sheds light on which characteristics of the input are important for the classifier to make a correct prediction. Occlusion sensitivity techniques [92,26,175] use a similar idea, by which the decisions of the network when different parts of the input are occluded are analyzed.
Table 5: Model inspection techniques used by more than one study.
Analysis of weights: [135,211,86,34,87,200,182,122,170,228,164,109,204,85,25]
Analysis of activations: [212,194,87,83,208,167,154,109]
Input-perturbation network-prediction correlation maps: [149,191,67,16,150]
Generating input to maximize activation: [188,144,160,15]
Occlusion of input: [92,26,175]

Several studies used backpropagation-based techniques to generate input maps that maximize activations of specific units [188,144,160,15]. These maps can then be used to infer the role of specific neurons, or the kind of input they are sensitive to.
Finally, some model inspection techniques were used in a single study. For instance, in [45], the class activation map (CAM) technique was extended to overcome its limitations on EEG data. To use CAMs in a CNN, the channel activations of the last convolutional layer must be averaged spatially before being fed into the model's penultimate layer, which is an FC layer. For a specific input image, a map can then be created to highlight parts of the image that contributed the most to the decision, by computing a weighted average of the last convolutional layer's channel activations. Other techniques include DeepLIFT [87], saliency maps [190], input-feature unit-output correlation maps [150], retrieval of closest examples [34], analysis of performance with transferred layers [63], analysis of most-activating input windows [67], analysis of generated outputs [66], and ablation of filters [87].
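The occlusion sensitivity idea discussed earlier in this section can be illustrated with a minimal sketch: successive parts of the input are replaced by a baseline value and the resulting change in the model output is recorded. The function name, the toy "model" and the choice of non-overlapping occlusion windows are hypothetical; the cited studies [92,26,175] may occlude inputs differently (e.g. along channels or frequency bands).

```python
def occlusion_sensitivity(predict, signal, window, baseline=0.0):
    """Return the drop in model output when each window of the input is occluded."""
    reference = predict(signal)
    drops = []
    for start in range(0, len(signal) - window + 1, window):
        occluded = list(signal)
        # Replace one window of the input by the baseline value.
        occluded[start:start + window] = [baseline] * window
        drops.append(reference - predict(occluded))
    return drops


def toy_model(signal):
    """Hypothetical 'model' that only responds to the energy in samples 4 to 7."""
    return sum(v * v for v in signal[4:8])


signal = [1.0] * 12
drops = occlusion_sensitivity(toy_model, signal, window=4)
print(drops)
```

Only the occlusion of the second window changes the output of this toy model, which is how such a map identifies the part of the input that the model actually relies on.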

Reporting of results
The performance of DL methods on EEG is of great interest as it is still not clear whether DL can outperform traditional EEG processing pipelines [105]. Thus, a major question we aim to answer in this review is: "Does DL lead to better performance than traditional methods on EEG?" However, answering this question is not straightforward, as benchmark datasets, baseline models, performance metrics and reporting methodology all vary considerably between the studies. In contrast, other application domains of DL, such as computer vision and NLP, benefit from standardized datasets and reporting methodology [53].
Therefore, to provide as satisfying an answer as possible, we adopt a two-pronged approach. First, we review how the studies reported their results by focusing on directly quantifiable items: 1) the type of baseline used as a comparison in each study, 2) the performance metrics, 3) the validation procedure, and 4) the use of statistical testing. Second, based on these points and focusing on studies that reported accuracy comparisons with baseline models, we analyze the reported performance of a majority of the reviewed studies.

Type of baseline
When contributing a new model, architecture or methodology to solve an already existing problem, it is necessary to compare the performance of the new model to the performance of state-of-the-art models commonly used for the problem of interest. Indeed, without a baseline comparison, it is not possible to assess whether the proposed method provides any advantage over the current state-of-the-art.
Points of comparison are typically obtained in two different ways: 1) (re)implementing standard models or 2) referring to published models. In the first case, authors will implement their own baseline models, usually using simpler models, and evaluate their performance on the same task and in the same conditions. Such comparisons are informative, but often do not reflect the actual state of the art on a specific task. In the second case, authors will instead cite previous literature that reported results on the same task and/or dataset. This second option is not always possible, especially when working on private datasets or tasks that have not been explored much in the past.
In the case of typical EEG classification tasks, state-of-the-art approaches usually involve traditional processing pipelines that include feature extraction and shallow/classical machine learning models. With that in mind, 67.9% of the studies selected included at least one traditional processing pipeline as a baseline model (see Fig. 15). Some studies instead (or also) compared their performance to DL-based approaches, to highlight incremental improvements obtained by using different architectures or training methodology: 34.0% of the studies therefore included at least one DL-based model as a baseline model. Out of the studies that did not compare their models to a baseline, six did not focus on the classification of EEG. Therefore, in total, 21.1% of the studies did not report baseline comparisons, making it impossible to assess the added value of their proposed methods in terms of performance.

Performance metrics
The types of performance metrics used by studies focusing on EEG classification are shown in Fig. 12a. Unsurprisingly, most studies used metrics derived from confusion matrices, such as accuracy, sensitivity, f1-score, ROC AUC and precision. As highlighted in [26,200], it is often preferable to use metrics that are robust to class imbalance, such as balanced accuracy, f1-score, and the ROC AUC for binary problems. Class imbalance is common in sleep or epilepsy recordings, where clinical events are rare.
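To illustrate why imbalance-robust metrics matter, the following sketch compares plain accuracy with balanced accuracy (the mean of per-class recalls) for a classifier that always predicts the majority class. The labels are hypothetical and only mimic a recording in which 5% of the windows contain a clinical event.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so every class has the same influence."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced problem: 95 background windows, 5 event windows.
y_true = [0] * 95 + [1] * 5
# A trivial model that never detects an event.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
bal_acc = balanced_accuracy(y_true, y_pred)
```

In this toy case the accuracy is high although no event is ever detected, whereas the balanced accuracy falls to chance level.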
Studies that did not focus on the classification of EEG signals also mainly used accuracy as a metric. Indeed, these studies generally used a classification task to evaluate model performance, although their main purpose was different (e.g., correcting artifacts). In other cases, performance metrics specific to the study's purpose, such as generating data, were used, e.g., the inception score [148], the Fréchet inception distance [74], as well as custom metrics.

Validation procedure
When evaluating a machine learning model, it is important to measure its generalization performance, i.e., how well it performs on unseen data. In order to do this, it is common practice to divide the available data into a training set and a test set. When hyperparameters need to be tuned, the performance on the test set cannot be used anymore as an unbiased evaluation of the generalization performance of the model. Therefore, the training set is divided to obtain a third set called a "validation set" which is used to select the best hyperparameter configuration, leaving the test set to evaluate the performance of the best model in an unbiased way. However, when the amount of data available is small, dividing the data into different sets and only using a subset for training can seriously undermine the performance of data-hungry models. A procedure known as "cross-validation" is used in these cases, where the data is broken down into different partitions, which will then successively be used as either training or validation data.
The cross-validation techniques used in the selected studies are shown in Fig. 12b. Some studies mentioned using cross-validation but did not provide any details. The category 'Train-Valid-Test' includes studies doing random permutations of train/valid, train/test or train/valid/test, as well as studies that mentioned splitting their data into training, validation and test sets but did not provide any details on the validation method. The Leave-One-Out variations correspond to the special case where N = 1 in the Leave-N-Out versions. 60% of the studies did not use any form of cross-validation. Interestingly, in [104], the authors proposed a 'warm restart' within the gradient descent steps to remove the need for a validation set.
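The generic procedure described above (a held-out test set, plus k-fold cross-validation on the remaining data) can be sketched as follows. This is only an illustrative index-based implementation; the function names, the fold assignment and the sizes used are assumptions, not the procedure of any particular reviewed study.

```python
import random


def train_test_split(n_examples, test_fraction, seed=0):
    """Shuffle example indices and set aside a fraction as the test set."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(test_fraction * n_examples))
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k):
    """Yield (train, valid) index lists; each index is in a validation fold once."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid


# Hypothetical dataset of 20 examples: 20% test, 4-fold CV on the rest.
dev_idx, test_idx = train_test_split(20, test_fraction=0.2)
splits = list(k_fold(dev_idx, k=4))
```

Hyperparameters would be selected using the validation folds only, and the test indices would be used once, at the very end, to report generalization performance.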

Subject handling
Whether a study focuses on intra- or inter-subject classification has an impact on the performance. Intra-subject models, which are trained and used on the data of a single subject, often lead to higher performance since the model has less data variability to account for. However, this means the data the model is trained on is obtained from a single subject, and thus often comprises only a few recordings. In inter-subject studies, models generally see more data, as multiple subjects are included, but must contend with greater data variability, which introduces different challenges.
In the case of inter-subject classification, the choice of the validation procedure can have a big impact on the reported performance of a model. The Leave-N-Subjects-Out procedure, which uses different subjects for training and for testing, may lead to lower performance, but is applicable to real-life scenarios where a model must be used on a subject for whom no training data is available. In contrast, using k-fold cross-validation on the combined data from all the subjects often means that the same subjects are seen in both the training and testing sets. In the selected studies, 22 out of the 108 studies using an inter-subject approach used a Leave-N-Subjects-Out or Leave-One-Subject-Out procedure.
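The difference between the two procedures can be made concrete with a short sketch. Here the subjects are held out in consecutive groups of N, which is a simplifying assumption (other variants enumerate all subject combinations or sample them at random); the function name and subject labels are hypothetical.

```python
def leave_n_subjects_out(subject_ids, n):
    """Yield (train, test) example indices with disjoint sets of subjects.

    subject_ids[i] is the subject from whom example i was recorded.
    """
    subjects = sorted(set(subject_ids))
    for start in range(0, len(subjects), n):
        held_out = set(subjects[start:start + n])
        train = [i for i, s in enumerate(subject_ids) if s not in held_out]
        test = [i for i, s in enumerate(subject_ids) if s in held_out]
        yield train, test


# Hypothetical dataset: 7 examples recorded from 4 subjects.
subject_ids = ["s1", "s1", "s2", "s2", "s3", "s3", "s4"]
loso_splits = list(leave_n_subjects_out(subject_ids, n=1))
```

With N = 1 this reduces to Leave-One-Subject-Out. A plain k-fold split over the pooled examples gives no such guarantee, since examples from one subject can land on both sides of the split.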
In the selected studies, 25% focused only on intra-subject classification, 61% focused only on inter-subject classification, 8.3% focused on both, and 4% did not mention it ('N/M'). The 'N/M' studies necessarily fall under one of the three previous categories. The lack of an explicit mention might be due to certain domains using a specific type of experiment (i.e., intra- or inter-subject) almost exclusively, thereby obviating the need to mention it explicitly. Fig. 13 shows that there has been a clear trend over the last few years to leverage DL for inter-subject rather than intra-subject analysis. In [34], the authors used a large dataset and tested the performance of their model both on new (unseen) subjects and on known (seen) subjects. They obtained 38% accuracy on unseen subjects and 75% on seen subjects, showing that classifying EEG data from unseen subjects can be significantly more challenging than from seen ones.
In [184], the authors compared their model on both intra- and inter-subject tasks. Despite the former case providing the model with less training data than the latter, it led to better results. In [62], the authors compared different DL models and showed that cross-subject (37 subjects) models always performed worse than within-subject models.
In [127], a hybrid system trained on multiple subjects and then fine-tuned on subject-specific data led to the best performance. Finally, in [175], the authors compared their DNN to a state-of-the-art traditional approach and showed that deep networks generalize better, although their performance on intra-subject classification is still higher than on inter-subject classification.

Statistical testing
To assess whether a proposed model is actually better than a baseline model, it is useful to use statistical tests. In total, 19.9% of the selected studies used statistical tests to compare the performance of their models to baseline models. The tests most often used were Wilcoxon signed-rank tests, followed by ANOVAs.
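As an illustration of the most frequently used test, the sketch below implements an exact two-sided Wilcoxon signed-rank test on paired accuracies (e.g., one pair per subject or per fold). It is a simplified version that assumes no zero differences and no tied absolute differences; the function name and the accuracy values are hypothetical.

```python
def wilcoxon_signed_rank(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (w, p_value), where w is the smaller of the positive and
    negative rank sums. Zero or tied absolute differences are rejected.
    """
    diffs = [a - b for a, b in zip(x, y)]
    abs_sorted = sorted(abs(d) for d in diffs)
    if 0 in abs_sorted or len(set(abs_sorted)) != len(abs_sorted):
        raise ValueError("zero or tied differences are not handled here")
    rank = {v: r for r, v in enumerate(abs_sorted, start=1)}
    n = len(diffs)
    total = n * (n + 1) // 2
    w_plus = sum(rank[abs(d)] for d in diffs if d > 0)
    w = min(w_plus, total - w_plus)
    # counts[s]: number of sign assignments whose positive rank sum equals s
    # (all 2**n assignments are equally likely under the null hypothesis).
    counts = [1] + [0] * total
    for r in range(1, n + 1):
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]
    p_value = min(1.0, 2 * sum(counts[:w + 1]) / 2 ** n)
    return w, p_value


# Hypothetical per-subject accuracies (in %) of a DL model and a baseline.
dl_acc = [81, 75, 90, 68, 77, 85]
baseline_acc = [78, 70, 92, 60, 71, 81]
w_stat, p_val = wilcoxon_signed_rank(dl_acc, baseline_acc)
```

For larger samples or data with ties, an established statistics library would be preferable to this sketch.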

Comparison of results
Although, as explained above, many factors make this kind of comparison imprecise, we show in this section how the proposed approaches and traditional baseline models compared, as reported by the selected studies.
We focus on a specific subset of the studies to make the comparison more meaningful. First, we focus on studies that report accuracy as a direct measure of task performance. As shown in Fig. 12a, this includes the vast majority of the studies. Second, we only report studies which compared their models to a traditional baseline, as we are interested in whether DL leads to better results than non-DL approaches. This means studies which only compared their results to other DL approaches are not included in this comparison. Third, some studies evaluated their approach on more than one task or dataset. In this case, we report the results on the task that has the most associated baselines. If that is more than one, we either report all tasks, or aggregate them if they are very similar (e.g., binary classification of multiple mental tasks, where performance is reported for each possible pair of tasks). In the case of multimodal studies, we only report the performance on the EEG-only task, if it is available. Finally, when reporting accuracy differences, we focus on the difference between the best proposed model and the best baseline model, per task. Following these constraints, a total of 102 studies/tasks were left for our analysis. Figure 14 shows the difference in accuracy between each proposed model and corresponding baseline per domain type (as categorized in Fig. 4), as well as the corresponding distribution over all included studies and tasks.
The median gain in accuracy with DL is 5.4%, with an interquartile range of 9.4%. Only four values were negative, meaning the proposed DL approach led to a lower performance than the baseline. The best improvement in accuracy was obtained by [161], where their approach led to a gain of 76.7% in accuracy in a rapid serial visual presentation (RSVP) classification task.

Reproducibility
Reproducibility is a cornerstone of science [111]: having reproducible results is fundamental to moving a field forward, especially in a field like machine learning where new ideas spread very quickly. Here, we evaluate the ease with which the results of the selected papers can be reproduced by the community using two key criteria: the availability of their data and the availability of their code.
From the 156 studies reviewed, 54% used public data, 42% used private data, and 4% used both public and private data. In particular, studies focusing on BCI, epilepsy, sleep and affective monitoring made use of openly available datasets the most (see Table 6). Interestingly, in cognitive monitoring, no publicly available datasets were used, and papers in that field all relied on internal recordings.
Fittingly, a total of 33 papers (21%) explicitly mentioned that more publicly available data is required to support research on DL-EEG. In clinical settings, the lack of labeled data, rather than the quantity of data, was specifically pointed out as an obstacle.
As for the source code, only 19% of the studies chose to make it available online [82,149,160,225,197,152,87,150,224,222,167,223,221,104,161,15,164,163,85], as illustrated in Fig. 15 (Medium: the code is available but some data is not publicly available; Hard: either the code or the data is available but not both; Impossible: neither the data nor the code are available). Needless to say, having access to the source code behind published results can drastically reduce the time needed and increase the incentive to reproduce a paper's results.
Therefore, taking both data and code availability into account, only 11 out of 156 studies (7%) could easily be reproduced using both the same data and code [149,160,152,224,222,167,221,104,161,164,85]. Four out of 156 studies (3%) shared their code but tested on both private and public data, making their studies only partially reproducible [225,87,150,223] (see Fig. 15). Moreover, a significant number of studies (61) did not have publicly available data or code, making them almost impossible to reproduce.
It is important to note, moreover, that for the results of a study to be perfectly reproduced, the authors would also need to share the weights (i.e. parameters) of the network. Sharing the code and the architecture of the network might not be sufficient since retraining the network could converge to a different minimum. On the other hand, retraining the network could also end up producing better results if a better performing model is obtained. For recommendations on how to best share the results, the code, the data and relevant information to make a study easy to reproduce, please see the discussion section and the checklist provided in Appendix B.

Discussion
In this section, we review the most important findings from our results section, and discuss the significance and impact of various trends highlighted above. We also provide recommendations for DL-EEG studies and present a checklist to ensure reproducibility in the field.

Rationale
It was expected that most papers selected for the review would focus on the classification of EEG data, as DL has historically led to important improvements on supervised classification problems [88]. Interestingly though, several papers also focused on new applications that were made possible or facilitated by DL: for instance, generating images conditioned on EEG, generating EEG, transfer learning between subjects, or feature learning. One of the main motivations for using DL cited by the papers reviewed was the ability to use raw EEG with no manual feature extraction steps. We expect these kinds of applications that go beyond using DL as a replacement for traditional processing pipelines to gain in popularity.

Data
A critical question concerning the use of DL with EEG data remains "How much data is enough data?". In Section 3.3, we explored this question by looking at various descriptive dimensions: the number of subjects, the amount of EEG recorded, the number of training/test/validation examples, the sampling rate and data augmentation schemes used.
Although a definitive answer cannot be reached, the results of our meta-analysis show that the amount of data necessary to at least match the performance of traditional approaches is already available. Out of the 156 papers reviewed, only six reported lower performance for DL methods than for traditional benchmarks. To achieve these results with limited amounts of data, shallower architectures were often preferred. Data augmentation techniques were also used successfully to improve performance when only limited data was available. However, more work is required to clearly assess their advantages and disadvantages. Indeed, although many studies used overlapping sliding windows, there seems to be no consensus on the best overlapping percentage to use, e.g., the impact of using a sliding window with 1% overlap versus 95% overlap is still not clear. BCI studies had the highest variability for this hyperparameter, while clinical applications such as sleep staging already appeared more standardized with most studies using 30 s non-overlapping windows.
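To make the effect of the overlap hyperparameter concrete, the following sketch counts the windows extracted from a single recording for different overlap values. The recording length (5 minutes), sampling rate (100 Hz) and function name are assumptions chosen for illustration.

```python
def sliding_windows(n_samples, window_size, overlap):
    """Return (start, stop) sample indices of windows with fractional overlap.

    overlap is the fraction of a window shared with the next one, in [0, 1).
    """
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    step = max(1, int(round(window_size * (1 - overlap))))
    return [(start, start + window_size)
            for start in range(0, n_samples - window_size + 1, step)]


sfreq = 100                # Hz (assumed)
n_samples = 5 * 60 * sfreq  # 5-minute recording (assumed)
window = 30 * sfreq         # 30 s windows, as commonly used in sleep staging

n_windows = {overlap: len(sliding_windows(n_samples, window, overlap))
             for overlap in (0.0, 0.5, 0.95)}
```

The same recording thus yields very different numbers of (highly correlated) training examples depending on the overlap, which is one reason why the number of examples should be reported together with the windowing scheme.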
Many authors concluded their paper suggesting that having access to more data would most likely improve the performance of their models. With large datasets becoming public, such as the TUH Dataset [65] and the National Sleep Research Resource [217], deeper architectures similar to the ones used in computer vision might become increasingly usable. However, it is important to note that the availability of data is quite different across domains. In clinical fields such as sleep and epilepsy, data usually comes from hospital databases containing years of recordings from several patients, while other fields usually rely on data coming from lab experiments with a limited number of subjects.
The potential of DL in EEG also lies in its ability (at least in theory) to generalize across subjects and to enable transfer learning across tasks and domains. When only limited data is available, intra-subject models still work best given the inherent subject variability of EEG data. However, transfer learning might be the key to moving past this limitation. Indeed, Page and colleagues [127] showed that with hybrid models, one can train a neural network on a pool of subjects and then fine-tune it on a specific subject, achieving good performances without needing as much data from a specific subject.
While we did report the sampling rate, we did not investigate its effect on performance because no relationship stood out particularly in any of the reviewed papers. The impact of the number of channels, though, was specifically studied. For example, in [26], the authors showed that they could achieve comparable results with a lower number of channels. As shown in Fig. 8a, a few studies used low-cost EEG devices, typically limited to a lower number of channels. These more accessible devices might therefore benefit from DL methods, but could also enable faster data collection on a larger scale, thus facilitating DL in return.
As DL-EEG is highly data-driven, it is important when publishing results to clearly specify the amount of data used and to clarify terminology (see Table 1 for an example). We noticed that many studies reviewed did not clearly describe the EEG data that they used (e.g., the number of subjects, number of sessions, window length to segment the EEG data, etc.) and therefore made it hard or impossible for the reader to evaluate the work and compare it to others. Moreover, reporting learning curves (i.e. performance as a function of the number of examples) would give the reader valuable insights on the bias and variance of the model.

EEG processing
According to our findings, the great majority of the reviewed papers preprocessed the EEG data before feeding it to the deep neural network or extracting features. Despite observing this trend, we also noticed that recent studies outperformed their respective baseline(s) using completely raw EEG data. Almogbel et al. [7] used raw EEG data to classify cognitive workload in vehicle drivers, and their best model achieved a classification accuracy approximately 4% better than their benchmarks which employed preprocessing on the EEG data. Similarly, Aznan et al. [11] outperformed the baselines by a 4% margin on SSVEP decoding using no preprocessing. Thus, the answer to whether it is necessary to preprocess EEG data when using DNNs remains elusive.
As most of the works considered did not use, or explicitly mention using, artifact removal methods, it appears that this EEG processing pipeline step is in general not required. However, one should observe that in specific cases such as tasks that inherently elicit quick eye movements (MATB-II [31]), artifact handling might still be crucial to obtaining desired performance.
One important aspect we focused on is whether it is necessary to use EEG features as inputs to DNNs. After analyzing the type of input used by each paper, we observed that there was no clear preference for using features or raw EEG time series as input. We noticed though that most of the papers using CNNs used raw EEG as input. With CNNs becoming increasingly popular, one can conclude that there is a trend towards using raw EEG instead of hand-engineered features. This is not surprising, as we observed that one of the main motivations mentioned for using DNNs on EEG processing is to automatically learn features. Furthermore, frequency-based features, which are widely used as hand-crafted features in EEG [105], are very similar to the temporal filters learned by a CNN. Indeed, these features are often extracted using Fourier filters which apply a convolutive operation. The temporal filters of a CNN apply the same kind of operation, except that their coefficients are learned rather than fixed.
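The analogy between Fourier-based features and CNN temporal filters can be illustrated with a small sketch: estimating the power at one frequency amounts to sliding fixed cosine and sine kernels over the signal, i.e., the same sliding dot product a convolutional layer computes with learned kernels. The sampling rate, frequencies, window length and function names below are assumptions.

```python
import math


def conv1d_valid(signal, kernel):
    """Sliding dot product (no padding), as computed by a CNN temporal filter."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def fourier_kernels(freq_hz, sfreq, length):
    """Fixed cosine and sine kernels for one frequency (not learned)."""
    t = [n / sfreq for n in range(length)]
    cos_k = [math.cos(2 * math.pi * freq_hz * ti) for ti in t]
    sin_k = [math.sin(2 * math.pi * freq_hz * ti) for ti in t]
    return cos_k, sin_k


def band_power(signal, freq_hz, sfreq, length):
    """Power at freq_hz in each sliding window of `length` samples."""
    cos_k, sin_k = fourier_kernels(freq_hz, sfreq, length)
    re = conv1d_valid(signal, cos_k)
    im = conv1d_valid(signal, sin_k)
    return [(r * r + i * i) / length ** 2 for r, i in zip(re, im)]


sfreq = 100  # Hz (assumed)
# Synthetic 10 Hz oscillation of unit amplitude, 2 s long.
signal = [math.sin(2 * math.pi * 10 * n / sfreq) for n in range(200)]
power_10hz = band_power(signal, 10, sfreq, length=100)
power_25hz = band_power(signal, 25, sfreq, length=100)
```

In this sketch the 10 Hz kernels respond strongly to the synthetic oscillation while the 25 Hz kernels do not; a CNN would instead adjust its kernel coefficients during training.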
From our analysis, we also aimed to identify which input type should be used when trying to solve a problem from scratch. While the answer depends on many factors such as the domain of application, we observed that in some cases raw EEG as input consistently outperformed baselines based on classically extracted features. For example, for seizure classification, recently proposed models using raw EEG data as input [64,185,156] achieved better performances than classical baseline methods, such as SVMs with frequency-domain features. For this particular task, we believe following the current trend of using raw EEG data is the best way to start exploring a new approach.

Deep learning methodology
Another major topic this review aimed at covering is the DL methodology itself. Our analysis focused on architecture trends and training decisions, as well as on model selection techniques.

Architecture
Given the inherent temporal structure of EEG, we expected RNNs would be more widely employed than models that do not explicitly take time dependencies into account. However, almost half of the selected papers used CNNs. This observation is in line with recent discussions and findings regarding the effectiveness of CNNs for processing time series [12]. We also noticed that the use of energy-based models such as RBMs has been decreasing, whereas on the other hand, popular architectures in the computer vision community such as GANs have started to be applied to EEG data as well.
Moreover, regarding architecture depth, most of the papers used fewer than five layers. When comparing this number with popular object recognition models such as VGG and ResNet for the ImageNet challenge comprising 19 and 34 layers respectively, we conclude that for EEG data, shallower networks are currently necessary. Schirrmeister et al. [177] specifically focused on this aspect, comparing the performance of architectures with different depths and structures, such as fully convolutional layers and residual blocks, on different tasks. Their results showed that in most cases, shallower fully convolutional models outperformed their deeper counterparts and architectures with residual connections.

Training and optimization
Although hyperparameter search is crucial to achieving good results when using neural networks, only 21% of the papers employed some hyperparameter search strategy. Even fewer studies provided detailed information about the method used. Amongst these, Stober et al. [164] described their hyperparameter selection method and cited its corresponding implementation; in addition, the available budget in number of iterations per searching trial as well as the cross-validation split were mentioned in the paper.

Model inspection
Inspecting trained DL models is important, as DNNs are notoriously seen as black boxes when compared to more traditional methods. This is problematic in clinical settings for instance, where understanding and explaining the choice made by a classification model might be critical to making informed clinical choices. Neuroscientists might also be interested in what drives a model's decisions and use that information to shape hypotheses about brain function.
About 27% of the reviewed papers looked at interpreting their models. Interesting work on the topic, specifically tailored to EEG, was reviewed in [150,67,45]. Sustained efforts aimed at inspecting models and understanding the patterns they rely on to reach decisions are necessary to broaden the use of DL for EEG processing.

Reported results
Our meta-analysis focused on how studies compared classification accuracy between their models and traditional EEG processing pipelines on the same data. Although a great majority of studies reported improvements over traditional pipelines, this result has to be taken with a grain of salt. First, the difference in accuracy does not tell the whole story, as an improvement of 10%, for example, is typically more difficult to achieve from 80 to 90% than from 40 to 50%. More importantly though, very few articles reported negative improvements, which could be explained by a publication bias towards positive results.
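One possible way to see why equal absolute gains are not equivalent is to express them as a relative reduction of the error rate. The sketch below is only an illustration of this argument; the metric and function name are not used in the reviewed studies.

```python
def relative_error_reduction(acc_before, acc_after):
    """Fraction of the initial error rate removed by the improvement."""
    err_before = 1.0 - acc_before
    err_after = 1.0 - acc_after
    return (err_before - err_after) / err_before


# Same +10% absolute gain, starting from two different baselines.
gain_from_80 = relative_error_reduction(0.80, 0.90)  # about half of the errors removed
gain_from_40 = relative_error_reduction(0.40, 0.50)  # about one sixth of the errors removed
```

Going from 80 to 90% removes roughly half of the remaining errors, whereas going from 40 to 50% removes only about one sixth of them.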
The reported baseline comparisons were highly variable: some used simple models (e.g., combining straightforward spectral features and linear classifiers), others used more sophisticated pipelines (including multiple features and non-linear approaches), while a few reimplemented or cited state-of-the-art models that were published on the same dataset and/or task. Since the observed improvement will likely be higher when comparing to simple baselines than to state-of-the-art results, the values that we report might be biased positively. For instance, only two studies used Riemannian geometry-based processing pipelines as baseline models [11,87], although these methods have set a new state-of-the-art in multiple EEG classification tasks [105].
Moreover, many different tasks and thus datasets were used. These datasets are often private, meaning there is very limited or no previous literature reporting results on them. On top of this, the lack of reproducibility standards can lead to low accountability: since study results are not expected to be replicated, results can be inflated by non-standard practices such as omitting cross-validation.
Different approaches have been taken to solve the problem of heterogeneity of result reporting and benchmarking in the field of machine learning. For instance, OpenML [189] is an online platform that facilitates the sharing and running of experiments, as well as the benchmarking of models. As of November 2018, the platform already contained one EEG dataset and multiple submissions. The MOABB [78], a solution tailored to the field of brain-computer interfacing, is a software framework for ensuring the reproducibility of BCI experiments and providing public benchmarks for many BCI datasets. In [73], a similar approach, but for DL specifically, is proposed.
Additionally, a few EEG/MEG/ECoG classification online competitions have been organized in the last years, for instance on the Kaggle platform (see Table 1 of [32]). These competitions informally act as benchmarks: they provide a standardized dataset with training and test splits, as well as a leaderboard listing the performance achieved by every competitor. These platforms can then be used to evaluate the state-of-the-art as they provide a publicly available comparison point for new proposed architectures. For instance, the IEEE NER 2015 Conference competition on error potential decoding could have been used as a benchmark for the studies reviewed that focused on this topic.
Making use of these tools, or extending them to other EEG-specific tasks, appears to be one of the greatest challenges for the field of DL-EEG at the moment, and might be the key to more efficient and productive development of practical EEG applications. Whenever possible, authors should make sure to provide as much information as possible on the baseline models they have used, and explain how to replicate their results (see Section 4.6).

Reproducibility
The significant use of public EEG datasets across the reviewed studies suggests that open data has greatly contributed to recent developments in DL-EEG. On the other hand, 42% of studies used data that is not publicly available, notably in domains such as cognitive monitoring. To move the field forward, it is thus important to create new benchmark datasets and share internal recordings. Moreover, the great majority of papers did not make their code available. Many papers reviewed are thus more difficult to reproduce: the data is not available, the code has not been shared, and the baseline models that were used to compare the performances of the models are either non-existent or not available.
Recent initiatives to promote best practices in data and code sharing would benefit the field of DL-EEG. FAIR neuroscience [196] and the Brain Imaging Data Structure (BIDS) [56] both provide guidelines and standards on how to acquire, organize and share data and code. BIDS extensions specific to EEG [136] and MEG [119] were also recently proposed. Moreover, open source software toolboxes are available to perform DL experiments on EEG. For example, the recent toolbox developed by Schirrmeister and colleagues, called BrainDecode [150], enables faster and easier development cycles by providing the basic functionality required for DL-EEG analysis while offering high level and easy to use functions to the user. The use of common software tools could facilitate reproducibility in the community. Beyond reproducibility, we believe simplifying access to data, making domain knowledge accessible and sharing code will enable more people to jump into the field of DL-EEG and contribute, transforming what has traditionally been a domain-specific problem into a more general problem that can be tackled with machine learning and DL methods.

Recommendations
To improve the quality and reproducibility of the work in the field of DL-EEG, we propose six guidelines in Table 7. Moreover, Appendix B presents a checklist of items that are critical to ensuring reproducibility and should be included in future studies.

Supplementary material
Along with the current paper, we make our data items table and related code available online at http://dl-eeg.com. We encourage interested readers to consult it in order to dive deeper into data items that are of specific interest to them; it should be straightforward to reproduce and extend the results and figures presented in this review using the code provided. The data item table is intended to be updated frequently with new articles, therefore results will be brought up to date periodically.
Authors of DL-EEG papers not included in the review are invited to submit a summary of their article following the format of our data items table to our online code repository. We also invite authors whose papers are already included in the review to verify the accuracy of our summary. Eventually, we would like to indicate which studies have been submitted or verified by the original authors.
By updating the data items table regularly and inviting researchers in the community to contribute, we hope to keep the supplementary material of the review relevant and up-to-date as long as possible.

Table 7. Guidelines for DL-EEG studies.
1 Provide a table or figure clearly describing your model (e.g., see [26,51,150]).
2 Clearly describe the data used. Make sure the number of subjects, the number of examples, the data augmentation scheme, etc. are clearly described. Use unambiguous terminology or define the terms used (for an example, see Table 1).
3 Use existing datasets. Whenever possible, compare model performance on public datasets.
4 Include state-of-the-art baselines. If focusing on a research question that has already been studied with traditional machine learning, clarify the improvements brought by using DL.
5 Share internal recordings.
6 Share code (including hyperparameter choices and model weights) that can easily be run on another computer, and potentially reused on new data.

Limitations
In this section, we quickly highlight some limitations of the present work. First, our decision to include arXiv preprints in the database search requires some justification. It is important to note that arXiv papers are not peer-reviewed. Therefore, some of the studies we selected from arXiv might not be of the same quality and scientific rigor as the ones coming from peer-reviewed journals or conferences. For this reason, whenever a preprint was followed by a publication in a peer-reviewed venue, we focused our analysis on the peer-reviewed version. ArXiv has been largely adopted by the DL community as a means to quickly disseminate results and encourage fast research iteration cycles.
Since the field of DL-EEG is still young and a limited number of publications was available at the time of writing, we decided to include all the papers we could find, knowing that some of the newer trends would be mostly visible in repositories such as arXiv. Our goal with this review was to provide a transparent and objective analysis of the trends in DL-EEG. By including preprints, we feel we provided a better view of the current state-of-the-art, and are also in a better position to give recommendations on how to share results of DL-EEG studies moving forward.
Second, in order to keep this review reasonable in length, we decided to focus our analysis on the points that we judged most interesting and valuable. As a result, various factors that impact the performance of DL-EEG were not covered in the review. For example, we did not cover weight initialization: in [51], the authors compared 10 different initialization methods and showed an impact on the specificity metric, which ranged from 85.1% to 96.9%. Similarly, multiple data items were collected during the review process, but were not included in the analysis. These items, which include data normalization procedures, software toolboxes, hyperparameter values, loss functions, training hardware, training time, etc., remain available online for the interested reader. We are confident other reviews or research articles will be able to focus on more specific elements.
Third, as with any literature review in a field that is quickly evolving, the relevance of our analysis decays with time as new articles are published and new trends are established. Since our last database search, we have already identified other articles that should eventually be added to the analysis. Again, making this work a living review by providing the data and code online will hopefully ensure the review will be of value and remain relevant for years to come.

Conclusion
The usefulness of EEG as a functional neuroimaging tool is unequivocal: clinical diagnosis of sleep disorders and epilepsy, monitoring of cognitive and affective states, as well as brain-computer interfacing all rely heavily on the analysis of EEG. However, various challenges remain to be solved. For instance, time-consuming tasks currently carried out by human experts, such as sleep staging, could be automated to increase the availability and flexibility of EEG-based diagnosis. Additionally, better generalization performance between subjects will be necessary to truly make BCIs useful. DL has been proposed as a potential candidate to tackle these challenges. Consequently, the number of publications applying DL to EEG processing has seen an exponential increase over the last few years, clearly reflecting a growing interest in the community in these kinds of techniques.
In this review, we highlighted current trends in the field of DL-EEG by analyzing 156 studies published between January 2010 and July 2018 applying DL to EEG data. We focused on several key aspects of the studies, including their origin, rationale, the data they used, their EEG processing methodology, DL methodology, reported results and level of reproducibility.
Among the major trends that emerged from our analysis, we found that 1) DL was mainly used for classifying EEG in domains such as brain-computer interfacing, sleep, epilepsy, and cognitive and affective monitoring, 2) the quantity of data used varied widely, with datasets ranging from 1 to over 16,000 subjects (mean = 223; median = 13), producing 62 to 9,750,000 examples (mean = 251,532; median = 14,000) and from two to 4,800,000 minutes of EEG recording (mean = 62,602; median = 360), 3) various architectures have been used successfully on EEG data, with CNNs, followed by RNNs and AEs, being most often used, 4) there is a clear growing interest towards using raw EEG as input as opposed to handcrafted features, 5) almost all studies reported a small improvement from using DL when compared to other baselines and benchmarks (median = 5.4%), and 6) while several studies used publicly available data, only a handful shared their code; the great majority of studies reviewed thus cannot easily be reproduced.
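The large gaps between the means and medians reported above (e.g., 223 versus 13 subjects) are what one would expect when a few very large datasets dominate an otherwise small-sample literature. The short Python sketch below illustrates this effect. The subject counts are hypothetical, chosen only to mimic a skewed distribution, and are not the values extracted in this review.

```python
from statistics import mean, median

# Hypothetical per-study subject counts (NOT the review's data): most
# studies are small, while a single study uses a very large dataset.
subjects_per_study = [1, 4, 9, 13, 20, 32, 16000]


def summarize(values):
    """Return the range, mean and median of a list of counts."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
    }


summary = summarize(subjects_per_study)
print(summary)
```

In this toy example the single large study pulls the mean far above the median, which is why the median is arguably the more representative summary of a typical study.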
Moreover, given the high variability in how results were reported, we made six recommendations to ensure reproducibility and fair comparison of results: 1) clearly describe the architecture, 2) clearly describe the data used, 3) use existing datasets, whenever possible, 4) include state-of-the-art baselines, ideally using the original authors' code, 5) share internal recordings, whenever possible, and 6) share code, as it is the best way to allow others to pick up where your work leaves off. We also provided a checklist (see Appendix B) to help authors of DL-EEG studies make sure all the relevant information is available in their publications to allow straightforward reproduction.
Finally, to help the DL-EEG community maintain an up-to-date list of published work, we made our data items table open and available online. The code to reproduce the statistics and figures of this review as well as the full summaries of the papers are also available at http://dl-eeg.com.
The current general interest in artificial intelligence and DL has greatly benefited various fields of science and technology. Advancements in other fields of application will most likely benefit the neuroscience and neuroimaging communities in the near future, and enable more pervasive and powerful applications based on EEG processing. We hope this review will constitute a good entry point for EEG researchers interested in applying DL to their data, as well as a good summary of the current state of the field for DL researchers looking to apply their knowledge to new types of data.