EEGdenoiseNet: A benchmark dataset for end-to-end deep learning solutions of EEG denoising

Deep learning networks are increasingly attracting attention in various fields, including electroencephalography (EEG) signal processing. These models provide performance comparable to that of traditional techniques. At present, however, the lack of well-structured and standardized datasets with specific benchmarks limits the development of deep learning solutions for EEG denoising. Here, we present EEGdenoiseNet, a benchmark EEG dataset that is suited for training and testing deep learning-based denoising models, as well as for performance comparisons across models. EEGdenoiseNet contains 4514 clean EEG segments, 3400 ocular artifact segments and 5598 muscular artifact segments, allowing users to synthesize contaminated EEG segments with the ground-truth clean EEG. We used EEGdenoiseNet to evaluate the denoising performance of four classical networks (a fully-connected network, a simple and a complex convolution network, and a recurrent neural network). Our analysis suggested that deep learning methods have great potential for EEG denoising even under high noise contamination. Through EEGdenoiseNet, we hope to accelerate the development of the emerging field of deep learning-based EEG denoising.


Introduction
Electroencephalography (EEG) solutions permit recording of changes in electrical potential on the scalp, which are generated by neurons in the gray matter. EEG is one of the most important direct and noninvasive approaches for studying brain activity under task and resting conditions. It has been widely used in psychology, neurology and psychiatry research, as well as for brain-computer interfaces [1,2,3,4,5,6].
EEG signals contain not only brain activity, but also a variety of noise and artifacts, including ocular [7], myogenic artifacts [8,9], and, in rare cases, cardiac artifacts. Therefore, a basic step in using EEG data to study neural activity is denoising or artifact attenuation [10]. Ocular and myogenic artifacts contaminate EEG signals in different ways. The former is often visible as relatively large pulses in the frontal region [11], while the latter frequently appears in the temporal and occipital regions, and has a wide frequency spectrum [9,12].
Various traditional denoising techniques have been developed to remove artifacts from EEG data, such as regression-based methods, adaptive filter-based methods and blind source separation (BSS)-based methods. Among them, regression-based methods first estimate the noise signal from a noise template, and then subtract the estimated noise signal from the EEG data to eliminate the artifacts [12,13,14,15]. In contrast, methods based on adaptive filters dynamically estimate filter coefficients from the input EEG signal itself, thereby filtering out noise [16,17]. BSS-based methods decompose the EEG signal into multiple components [18,19,20], assign them to neural and artifactual sources, and reconstruct a clean signal by recombining the neural components [9,12,21]. However, BSS-based methods can only be used when a large number of electrodes is available, which makes them unsuitable for single-channel denoising.
Deep learning (DL) has been increasingly attracting attention in the past few years [22,23,24,25]. Due to the increase in computing resources, the growth in data size, and the availability of new network architectures and learning algorithms, the performance of DL neural networks has made great breakthroughs, and deep learning has been successfully applied to solve various technical problems, such as image processing [22,23,26,27] and natural language processing [24,25,28]. DL methods have begun to be introduced into the field of EEG signal analysis [29], such as EEG-based classification [30,31,32], EEG reconstruction [33,34] and EEG signal generation [35,36]. Recently, deep learning has also been applied to EEG denoising, providing performance comparable to that of traditional denoising methods [37,38,39,40].
Deep neural networks can learn the hidden state of neural oscillations in EEG, thereby eliminating fluctuations that are not from real neural activity but from biological artifacts. The performance of deep neural networks fundamentally depends on the size of the training and test datasets; in other words, they require big data [41,42,43]. A big dataset with gold-standard clean EEG is essential for evaluating newly developed supervised deep learning models. Some EEG datasets have been collected while participants are at rest [44,45], during cognitive tasks [46,47,48], or during motor-related tasks [49,50,51,52]. However, none of them was specifically developed for training end-to-end deep learning models for EEG artifact removal. To the best of our knowledge, there is no open EEG dataset suitable for training deep learning models for EEG denoising. The lack of ground-truth clean EEG data and benchmarks has largely limited the development of DL methods for EEG denoising.
In this study, we present a publicly available structured dataset, named EEGdenoiseNet, which is particularly suitable for deep network-based EEG artifact attenuation (Sec 2). Specifically, the dataset contains 4514 clean EEG segments as ground truth, and 3400 pure EOG segments and 5598 pure EMG segments as ocular artifacts and myogenic artifacts, respectively. In addition, we implement four deep neural networks as benchmarks (Sec 3), including a fully-connected neural network (FCNN), a simple convolution neural network (CNN), a complex CNN, and a recurrent neural network (RNN). We train the deep learning models in a supervised end-to-end fashion, and the denoising performance is presented as a benchmark (Sec 4).

Data acquisition and preprocessing
Our main goal is to construct a dataset suitable for EEG denoising research based on deep learning networks. In this regard, we downloaded EEG, EOG and EMG data from several publicly available data repositories which were published in previous studies (see Table  1) [53,54,55,56,57,58,59,60]. These studies have been ethically approved by their respective local ethical committees, and followed the Helsinki Declaration of 1975, revised in 2000.
To generate clean EEG, pure EOG and pure EMG, we first preprocessed the data and then segmented them into 2-second segments. Afterwards, we re-scaled the segments to the same variance. Finally, each segment was visually checked by an expert to ensure that it was clean and usable. We set the length of segments to 2 seconds based on prior knowledge of EEG signals. On the one hand, a 2 s segment is long enough to recover the temporal and spectral characteristics of EEG, as well as EOG and EMG. On the other hand, it is difficult to obtain artifact-free EEG segments longer than 2 s due to random eye blinks or movements. The segments in each category have been uploaded to a publicly available repository (https://github.com/ncclabsustech/EEGdenoiseNet).
Specifically, for the EEG segments (Figure 1a) [53], the dataset included 52 participants who performed both real and imagined left and right hand movement tasks, with 64-channel EEG recorded simultaneously at a 512 Hz sampling frequency. For both real and imagined movement tasks, a participant repeated a 2-second baseline and a 3-second movement with a random interval of 4.1 to 4.8 seconds for 20 minutes. The data were band-pass filtered between 1 and 80 Hz, notched at the powerline frequency, and then re-sampled to 256 Hz. To obtain the clean EEG as ground truth, the 64-channel EEG signals were processed by ICLabel, a toolbox to remove EEG artifacts with independent component analysis (ICA) [9]. Then the pure EEG signals were segmented into one-dimensional segments of 2 seconds. It is worth noting that, to ensure the universality of this dataset and given the diversity of EEG caps, we did not construct clean EEG signals with a specific number of channels, but instead constructed a dataset of single-channel EEG signals.
For the ocular artifact segments (Figure 1b), multiple open-access EEG datasets with additional EOG channels are used [54,55,56,57,58,59]. The horizontal and vertical raw electrooculography (EOG) signals of the datasets are band-pass filtered between 0.3 and 10 Hz, and then re-sampled to 256 Hz. Finally, the EOG signals are segmented into one-dimensional segments of 2 seconds.
For the myogenic artifact segments (Figure 1c), a facial electromyography (EMG) dataset is used [60]. We choose facial EMG because it is the main source of myogenic artifacts. The raw EMG signal is band-pass filtered between 1 and 120 Hz, notched at the powerline frequency, and then resampled to 512 Hz. We resample the EMG to 512 Hz instead of 256 Hz because the EMG signal is concentrated in the high frequency range, so a higher sampling rate is required (according to the Nyquist sampling theorem). In the end, we extract one-dimensional 2-second EMG segments.
For all three categories, the segments are standardized by subtracting their mean and dividing by their standard deviation, and then visually inspected by an expert. We obtain a total of 4514 EEG segments, 3400 ocular artifact segments, and 5598 myogenic artifact segments. The segments of each category are respectively saved as Matlab matrix files and Python numpy matrix files in the public data repository. Figure 2 shows an example of the clean EEG, horizontal EOG, vertical EOG and EMG.
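The segmentation and standardization steps described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `segment_and_standardize` is hypothetical, non-overlapping segments are an assumption (the text does not say whether segments overlap), and filtering, resampling and visual inspection are left out.

```python
import numpy as np

def segment_and_standardize(signal, fs, seg_seconds=2.0):
    """Cut a 1-D recording into non-overlapping segments and z-score each one."""
    seg_len = int(round(fs * seg_seconds))      # 512 samples at 256 Hz
    n_seg = len(signal) // seg_len              # drop the incomplete tail
    segments = np.asarray(signal[: n_seg * seg_len], dtype=float)
    segments = segments.reshape(n_seg, seg_len)
    # Subtract the per-segment mean and divide by the per-segment standard deviation.
    mean = segments.mean(axis=1, keepdims=True)
    std = segments.std(axis=1, keepdims=True)   # a flat segment (std = 0) would need special handling
    return (segments - mean) / std

# Toy recording: 11 seconds of noise at 256 Hz gives five complete 2-second segments.
rng = np.random.default_rng(0)
recording = rng.standard_normal(256 * 11)
segments = segment_and_standardize(recording, fs=256)
print(segments.shape)  # (5, 512)
```

After this step every segment has zero mean and unit variance, so the mixing coefficient alone determines the noise level of a synthesized segment.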

Data Usage
The contaminated signals can be generated by linearly mixing the pure EEG segments with EOG or EMG artifact segments, according to Eq. (1) (see Figure 1c):
$$y = x + \lambda \cdot n \tag{1}$$
where y denotes the mixed one-dimensional signal of EEG and artifacts; x denotes the clean EEG signal as the ground truth; n denotes (ocular or myogenic) artifacts; λ is a hyperparameter to control the signal-to-noise ratio (SNR) in the contaminated EEG signal y. Specifically, the SNR of the contaminated segment can be adjusted by changing the parameter λ according to Eq. (2):
$$\mathrm{SNR} = 10 \log_{10} \frac{\mathrm{RMS}(x)}{\mathrm{RMS}(\lambda \cdot n)} \tag{2}$$
in which the root mean square (RMS) value is defined as in Eq. (3):
$$\mathrm{RMS}(g) = \sqrt{\frac{1}{N} \sum_{i=1}^{N} g_i^2} \tag{3}$$
where N denotes the number of temporal samples in the segment g, and g_i denotes the i-th sample of the segment g. Notably, a lower λ corresponds to a higher SNR, as less EOG or EMG artifact is added to the EEG signal. Conversely, a lower SNR means a higher noise level. According to previous studies, the SNR of EEG contaminated by ocular artifacts usually ranges from -7 dB to 2 dB [61], while the SNR of EEG contaminated by myogenic artifacts is between -7 dB and 4 dB [62,63].
In this way, we obtain a pair of EEG data (x, y). To train the end-to-end deep learning methods for EEG denoising, the clean EEG x can be regarded as the ground truth, and the contaminated EEG y can be used as the inputs.
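To make the role of λ concrete, the sketch below mixes one clean segment with one artifact segment at a requested SNR. It assumes the SNR is defined as 10·log10(RMS(x) / RMS(λ·n)); under that assumption, solving for λ gives λ = RMS(x) / (RMS(n) · 10^(SNR/10)). The function names `rms` and `mix_at_snr` are illustrative and are not taken from the dataset repository.

```python
import numpy as np

def rms(g):
    """Root mean square of a 1-D segment."""
    return np.sqrt(np.mean(np.square(g)))

def mix_at_snr(clean, artifact, snr_db):
    """Return (contaminated segment, lambda) for a target SNR in dB.

    Assumes SNR = 10 * log10(RMS(clean) / RMS(lambda * artifact)).
    """
    lam = rms(clean) / (rms(artifact) * 10.0 ** (snr_db / 10.0))
    return clean + lam * artifact, lam

rng = np.random.default_rng(0)
x = rng.standard_normal(512)   # stand-in for a clean EEG segment
n = rng.standard_normal(512)   # stand-in for an EOG or EMG segment
y, lam = mix_at_snr(x, n, snr_db=-3.0)

# Check that the mixture really has the requested SNR.
achieved_db = 10.0 * np.log10(rms(x) / rms(lam * n))
print(round(achieved_db, 6))
```

The pair (x, y) produced this way is exactly the (ground truth, input) pair used for supervised training.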

Benchmarking deep learning algorithms
The second goal of this study is to provide a set of benchmark algorithms. We train four standard deep-learning neural networks, then validate the networks. The evaluation metrics can be used as benchmarks for the EEG denoising algorithms.

Generating semi-synthetic data
The semi-synthetic ocular artifact contaminated signals are generated from 3400 EEG segments and 3400 ocular artifact segments, with 80% used for generating the training set, 10% for the validation set, and 10% for the test set [64]. Each set was generated by randomly and linearly mixing EEG segments and ocular artifact segments according to Section 2.2, at ten different SNR levels (-7dB, -6dB, -5dB, -4dB, -3dB, -2dB, -1dB, 0dB, 1dB, 2dB). This procedure expanded the size of each set tenfold. The EEG segments are treated as ground truth, and the corresponding mixed segments are treated as contaminated EEG.
The myogenic artifact contaminated signals come from 4514 EEG segments and 5598 myogenic artifact segments. To match the sampling frequency of the EEG segments with that of the myogenic artifact segments, we upsample the EEG segments to 512 Hz. To match the number of EEG segments with the number of myogenic artifact segments, we randomly reuse some EEG segments. We mix the EEG segments and the myogenic artifact segments according to Eq. (1) to generate the training, validation, and test data. Likewise, the EEG segments are treated as ground truth, and the corresponding mixed segments are treated as contaminated EEG.
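One way to picture the split and the tenfold expansion is sketched below on toy data. This is an assumption-laden illustration rather than the released generation script: the helper names are hypothetical, the EEG and artifact pools are split independently, artifacts are reshuffled at every SNR level, and the SNR is taken to be 10·log10(RMS(x) / RMS(λ·n)).

```python
import numpy as np

SNR_LEVELS_DB = np.arange(-7, 3)  # -7 dB ... 2 dB, ten levels

def rms(segments):
    """Row-wise root mean square of a (n_segments, n_samples) array."""
    return np.sqrt(np.mean(np.square(segments), axis=-1))

def split_indices(n_segments, rng):
    """Random 80/10/10 split of segment indices (train, validation, test)."""
    order = rng.permutation(n_segments)
    n_train = n_segments * 8 // 10
    n_val = n_segments // 10
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]

def mix_all_levels(eeg, artifacts, rng):
    """Pair every EEG segment with a random artifact at each SNR level."""
    noisy, clean = [], []
    for snr_db in SNR_LEVELS_DB:
        shuffled = artifacts[rng.permutation(len(artifacts))]
        lam = rms(eeg) / (rms(shuffled) * 10.0 ** (snr_db / 10.0))
        noisy.append(eeg + lam[:, None] * shuffled)
        clean.append(eeg)
    return np.concatenate(noisy), np.concatenate(clean)

# Toy pools: 50 EEG and 50 EOG segments instead of 3400 each.
rng = np.random.default_rng(0)
eeg = rng.standard_normal((50, 512))
eog = rng.standard_normal((50, 512))
eeg_train, eeg_val, eeg_test = (eeg[idx] for idx in split_indices(len(eeg), rng))
eog_train, eog_val, eog_test = (eog[idx] for idx in split_indices(len(eog), rng))

noisy_train, clean_train = mix_all_levels(eeg_train, eog_train, rng)
print(noisy_train.shape)  # (400, 512)
```

Splitting before mixing keeps every clean segment and every artifact segment inside a single subset, so the test set contains no waveform seen during training.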

Fully-connected Neural Network
A fully-connected network with four hidden layers using ReLU as the activation function is provided as a benchmarking algorithm (Figure 4a). The number of neurons in each layer is equal to the number of temporal samples of the input signal (i.e., 512 for ocular artifact reduction, and 1024 for myogenic artifact reduction). Dropout regularization [65] is used to reduce overfitting. The contaminated EEG is fed into the first layer of the neural network, and the denoised EEG is output from the last layer.

Simple Convolution Neural Network
A simple convolution network is implemented (Figure 4b). The simple CNN consists of four 1D-convolution layers with small 1×3 kernels, a stride of 1, and 64 feature maps (k3n64s1). Each 1D-convolution layer is followed by a batch-normalization (BN) layer [66] and a ReLU activation function. To reconstruct the signal, the last convolutional layer is followed by a flatten layer and a dense layer with 512 or 1024 output neurons (the same as the input length).
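The layer stack described above might look roughly like this in Keras. It is a hedged sketch, assuming the `tf.keras` API of TensorFlow 2.x: the function name `build_simple_cnn`, the single input channel and `padding="same"` are assumptions that the text does not state, and the Adam settings are the ones reported in the learning-process section.

```python
import tensorflow as tf

def build_simple_cnn(n_samples=512):
    """Four Conv1D (k3n64s1) + BN + ReLU blocks, then flatten and a dense output."""
    inputs = tf.keras.Input(shape=(n_samples, 1))  # one channel per segment (assumed)
    h = inputs
    for _ in range(4):
        # padding="same" keeps the temporal length unchanged (assumed, not stated in the text).
        h = tf.keras.layers.Conv1D(64, 3, strides=1, padding="same")(h)
        h = tf.keras.layers.BatchNormalization()(h)
        h = tf.keras.layers.ReLU()(h)
    h = tf.keras.layers.Flatten()(h)
    outputs = tf.keras.layers.Dense(n_samples)(h)  # linear output, same length as the input
    return tf.keras.Model(inputs, outputs)

model = build_simple_cnn(512)  # use 1024 for the 512 Hz myogenic data
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5, beta_1=0.5, beta_2=0.9),
    loss="mse",
)
```

Note that the dense layer after flattening holds most of the parameters (512 × 64 inputs to 512 outputs), which is worth keeping in mind when the overfitting on myogenic artifacts is discussed.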

Complex Convolution Neural Network
A one-dimensional residual convolutional neural network (1D-ResCNN), adapted from [38], is implemented (Figure 4c). Compared with the simple CNN, the 1D-ResCNN has a more complex structure, so it is called the complex CNN. The main difference between them is that a ResNet with skip-layer connections is added into the complex CNN to avoid gradient explosion, so that we can train a deeper network to obtain better feature extraction capabilities [23]. To extract multi-scale features, we repeatedly stack residual blocks, using 1×3, 1×5, 1×7 multi-scale convolutional kernels twice and arranging three branches of residual blocks in parallel [27,67].

Recurrent Neural network
A Long Short-Term Memory (LSTM) network (Figure 4d), adapted from [68], is regarded as the benchmark for recurrent neural networks (RNNs). LSTM can learn long-term dependencies, which may help distinguish the long-term features in noise and EEG signals. Each EEG sample is sequentially input to LSTM cells, and the output is obtained from the state of each cell through a fully-connected network. This RNN model is initialised to have 1 hidden state, and the output network is a three-layer fully-connected network with ReLU activation function, dropout regularization, and 512 or 1024 neurons per layer.

Learning process
To facilitate the learning procedure, we normalized the input contaminated EEG segment and the ground-truth EEG segment by dividing both by the standard deviation of the contaminated EEG segment, according to Eq. (4):
$$\hat{x} = \frac{x}{\sigma_y}, \qquad \hat{y} = \frac{y}{\sigma_y} \tag{4}$$
where σ_y is the standard deviation of the contaminated EEG signal segment y. The standard deviation of each contaminated segment is saved, so that the magnitude of the denoised EEG segment can be restored by multiplying the network output by the corresponding standard deviation value.
The networks are trained in an end-to-end manner, which means that we input the normalized contaminated EEG segment into the neural networks and directly obtain the denoised EEG segment as output. To this end, the goal of a denoising network is to learn a nonlinear function f that maps the contaminated EEG ŷ to the denoised EEG x̃:
$$\tilde{x} = f(\hat{y}; \theta) \tag{5}$$
where ŷ ∈ R^{1×T} denotes the contaminated EEG segment, x̃ ∈ R^{1×T} denotes the output of the neural network (the denoised EEG segment), and the vector θ contains all parameters to be learned.
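The normalization and the later restoration of amplitude can be sketched in a few lines. The function names `normalize_pair` and `restore` are hypothetical; the only operation taken from the text is dividing both segments by the standard deviation of the contaminated segment and multiplying the network output by that value afterwards.

```python
import numpy as np

def normalize_pair(noisy, clean):
    """Scale both segments by the standard deviation of the contaminated segment."""
    sigma = noisy.std()
    return noisy / sigma, clean / sigma, sigma

def restore(network_output, sigma):
    """Undo the scaling so the denoised segment is back in the original units."""
    return network_output * sigma

rng = np.random.default_rng(0)
x = 20.0 * rng.standard_normal(512)        # stand-in for a clean segment
y = x + 35.0 * rng.standard_normal(512)    # stand-in for its contaminated version
y_hat, x_hat, sigma_y = normalize_pair(y, x)

# A perfect denoiser would output x_hat; restoring it recovers x.
x_restored = restore(x_hat, sigma_y)
print(np.allclose(x_restored, x))  # True
```

Because the clean segment is divided by the contaminated segment's standard deviation rather than its own, the relative amplitude between input and target is preserved.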
We use the mean squared error (MSE) as the loss function L_MSE(f) (see Eq. (6)). The learning process is implemented with gradient descent to minimize the error between the denoised segment and the ground-truth clean segment.
$$L_{MSE}(f) = \frac{1}{N} \sum_{i=1}^{N} (\tilde{x}_i - \hat{x}_i)^2 \tag{6}$$
where N denotes the number of temporal samples of a segment; x̃_i denotes the i-th sample of the output of the neural network; x̂_i denotes the i-th sample of the normalized ground truth x̂.
For ocular artifact removal, we train the FCNN for 60 epochs and the RNN for 100 epochs, while the simple CNN and complex CNN models are trained for 40 epochs. For myogenic artifact removal, we train the FCNN and RNN for 60 epochs, while the simple CNN and complex CNN models are trained for 10 epochs. The Adam algorithm is applied to optimize the loss function, and its parameters were set to α = 5e-5, β1 = 0.5, β2 = 0.9. To increase the statistical power, the four networks are trained, validated and tested independently 10 times with datasets randomly generated via EEGdenoiseNet.
All four networks are implemented in Python 3.7 with the TensorFlow 2.2 library, running on a computer with two NVIDIA Tesla V100 GPUs. The code for the benchmarking algorithms is publicly available online on GitHub [69].

Performance Evaluation as Benchmarks
Several metrics are used to evaluate the performance of the networks, including the network convergence, the relative root mean squared error, and the correlation coefficient.
The network convergence is the first index used to evaluate the performance of the networks; it can provide rich information about the learning procedure and generalization ability. The convergence curves of both the training and validation processes are obtained by calculating the averaged loss (in Eq. (6)) with respect to the number of epochs.
We then quantitatively examine the performance of the networks by applying three objective measures to the denoised data [62], including the Relative Root Mean Squared Error (RRMSE) in the temporal domain (RRMSE_temporal, see Eq. (7)), the RRMSE in the spectral domain (RRMSE_spectral, see Eq. (8)) and the correlation coefficient (CC, see Eq. (9)).
$$RRMSE_{temporal} = \frac{RMS(f(y) - x)}{RMS(x)} \tag{7}$$
$$RRMSE_{spectral} = \frac{RMS(PSD(f(y)) - PSD(x))}{RMS(PSD(x))} \tag{8}$$
where the function PSD() denotes the power spectral density of an input segment. The frequency range of the PSD is 0-120 Hz. The FFT length equals the length of the input segment.
$$CC = \frac{Cov(f(y), x)}{\sqrt{Var(f(y)) \, Var(x)}} \tag{9}$$
To compare the deep learning methods with traditional methods, we implement two traditional EEG denoising methods: i) empirical mode decomposition (EMD) and ii) filtering. In the EMD method, the artifactual intrinsic mode functions (IMFs) are identified by the distance metric used in clustering [70]. In the filtering method, the ocular and myogenic artifacts are removed using a high-pass filter (12 Hz) and a band-pass filter, respectively. These two traditional methods are tested 10 times with randomly generated datasets. The corresponding code is available online at https://github.com/ncclabsustech/Single-Channel-EEG-Denoise.
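The three quantitative measures (temporal RRMSE, spectral RRMSE and CC) can be sketched in NumPy as below. The PSD estimator is an assumption: the text only fixes the 0-120 Hz range and an FFT length equal to the segment length, so a plain periodogram is used here, and the benchmark code may use a different estimator. All function names are illustrative.

```python
import numpy as np

def rms(g):
    return np.sqrt(np.mean(np.square(g)))

def psd(segment, fs):
    """Periodogram with FFT length equal to the segment length (assumed estimator)."""
    power = np.abs(np.fft.rfft(segment)) ** 2 / (fs * len(segment))
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / fs)
    return freqs, power

def rrmse_temporal(denoised, clean):
    return rms(denoised - clean) / rms(clean)

def rrmse_spectral(denoised, clean, fs, f_max=120.0):
    freqs, p_denoised = psd(denoised, fs)
    _, p_clean = psd(clean, fs)
    keep = freqs <= f_max                      # restrict to 0-120 Hz
    return rms(p_denoised[keep] - p_clean[keep]) / rms(p_clean[keep])

def correlation_coefficient(denoised, clean):
    cov = np.mean((denoised - denoised.mean()) * (clean - clean.mean()))
    return cov / np.sqrt(denoised.var() * clean.var())

rng = np.random.default_rng(0)
clean = rng.standard_normal(512)
denoised = clean + 0.1 * rng.standard_normal(512)  # stand-in for a network output

print(rrmse_temporal(denoised, clean))
print(rrmse_spectral(denoised, clean, fs=256))
print(correlation_coefficient(denoised, clean))
```

Lower RRMSE values and CC values closer to 1 indicate better denoising; a perfect reconstruction gives 0, 0 and 1, respectively.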

Results
To give a qualitative overview of the denoising results, we display some sample segments from the test set in the time domain and frequency domain for ocular artifact removal (see Figure 5) and for myogenic artifact removal (see Figure 6). For each network and artifact type, we show two examples: one of the best results (left column) and one of the worst results (right column). Generally, in both ocular and myogenic artifact removal, the artifacts are greatly attenuated, and the noise-free EEG samples are well reconstructed. The frequency domain diagram shows that the artifacts in the low frequency bands are well detected and attenuated, but the high frequency range is affected by residual noise.
The quantitative results are then examined. We first present the convergence of the four networks. The training and validation loss of the networks give a quantitative overview of the training and validation processes. For all four networks and both artifact types, the training loss is consistently lower than the validation loss, as expected. For the ocular artifact removal (see Figure 7a), the training and validation loss decrease as the number of epochs increases. Specifically, the losses of the simple CNN and complex CNN drop notably fast and eventually diminish after 20 epochs. The FCNN loss and the RNN loss, however, start from a relatively high level and remain at a significant level after 20 epochs. For the myogenic artifact removal (see Figure 7b), the training loss of the four networks decreases with the number of epochs, similar to the ocular artifact removal. The losses of the simple CNN and complex CNN decrease faster during training, but increase during validation. This phenomenon indicates that the two convolutional networks do not seem to learn the true characteristics of the EMG signal, which means that the CNNs suffer from an overfitting problem when removing myogenic artifacts.
We present the quantitative benchmarks (RRMSE_temporal, RRMSE_spectral and CC) from the four networks and the two traditional methods at multiple SNR levels (see Figure 8). Generally, for both ocular and myogenic artifact removal, the performance decreases as the SNR level decreases. The traditional methods showed higher RRMSE and lower CC compared with the four deep learning networks. The difference in performance is larger at high noise levels (low SNR), and it shrinks at low noise levels (high SNR, e.g., SNR > 0). Among the deep learning methods, the RNN has the lowest RRMSE and the highest CC for ocular artifact removal (see Figure 8a), and the complex CNN has the lowest RRMSE and the highest CC for myogenic artifact removal.
To compare the benchmarks more comprehensively, we separately plot the benchmarks at multiple SNR levels as boxplots (see Figure 9), and conduct ANOVA analyses. For the ocular artifact removal (see Figure 9a), DL-based methods have significantly better denoising performance than the two traditional methods, in terms of RRMSE_temporal, RRMSE_spectral and CC (p < 0.001 for each of the three metrics). Similarly, the DL-based methods outperform the traditional methods for myogenic artifact removal (p < 0.001 for each of the three metrics) (see Figure 9b). In the time domain, the RRMSE_temporal of the RNN is significantly higher than that of the FCNN, simple CNN, and complex CNN (p = 0.007, p < 0.001, and p < 0.001, respectively); the FCNN has a significantly higher RRMSE_temporal than the complex CNN (p = 0.020). In the frequency domain, the RNN has a significantly higher RRMSE_spectral than the FCNN and the complex CNN (p = 0.020 and p = 0.006, respectively). The CC of the complex CNN is significantly higher than that of the RNN and FCNN (p < 0.001 and p = 0.011, respectively). The same effect is shown for the simple CNN compared with the RNN (p = 0.004).
We finally evaluate the performance of the different methods for different frequency bands, by calculating the average power ratio of each frequency band (delta (1-4 Hz), theta (4-8 Hz), alpha (8-13 Hz), beta (13-30 Hz), and gamma bands) to the whole band (1-80 Hz) for ocular artifact removal and myogenic artifact removal (see Tables 2 and 3). For the ocular artifact removal (see Table 2), mixing in ocular artifacts increased the power ratio of the delta and theta bands, while reducing the ratio of the other bands. The simple CNN showed the delta, theta and beta power ratios closest to those of the ground truth; the same effect was observed for the complex CNN in the theta band, for the EMD in the alpha band, for the RNN in the gamma band, and for the FCNN in the theta and gamma bands. In myogenic artifact removal (see Table 3), the addition of myogenic artifacts increased the gamma power ratio and decreased the other power ratios. The FCNN showed the ratio closest to the ground truth in the beta band; the same effect was observed for the EMD in the alpha band, for the complex CNN in the theta and gamma bands, and for the RNN in the delta and alpha bands.
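The band power ratio used in this comparison can be illustrated with a short NumPy sketch. Several choices here are assumptions rather than details taken from the text: the gamma band is taken as 30-80 Hz (the upper part of the 1-80 Hz whole band), bands are half-open intervals so that no frequency bin is counted twice, the whole-band power is the sum over these five bands, and the spectrum is an unnormalized periodogram.

```python
import numpy as np

# Band edges in Hz; the gamma range is an assumption.
BANDS_HZ = {
    "delta": (1.0, 4.0),
    "theta": (4.0, 8.0),
    "alpha": (8.0, 13.0),
    "beta": (13.0, 30.0),
    "gamma": (30.0, 80.0),
}

def band_power_ratios(segment, fs):
    """Ratio of each band's power to the power summed over all five bands."""
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / fs)
    power = np.abs(np.fft.rfft(segment)) ** 2
    band_power = {}
    for name, (low, high) in BANDS_HZ.items():
        in_band = (freqs >= low) & (freqs < high)   # half-open interval [low, high)
        band_power[name] = power[in_band].sum()
    total = sum(band_power.values())
    return {name: value / total for name, value in band_power.items()}

# A pure 10 Hz sine sampled at 256 Hz for 2 seconds lies entirely in the alpha band.
t = np.arange(512) / 256.0
sine = np.sin(2.0 * np.pi * 10.0 * t)
ratios = band_power_ratios(sine, fs=256)
print(max(ratios, key=ratios.get))  # alpha
```

Comparing these ratios for the ground truth, the contaminated segment and each method's output shows which bands a method restores well and which it distorts.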

Discussion
In this study, we have provided an EEG benchmark dataset, EEGdenoiseNet, for training and testing end-to-end deep learning models. To obtain the ground-truth clean EEG data, the raw EEG data is denoised by ICLabel [9] and then manually inspected as a double check. Although there are other publicly available EEG datasets, they were not specifically developed for EEG denoising. Instead, they mainly focus on resting-state studies [44,45], psychological studies [46,47,48], or motor imagery or motor tasks [49,50,51,52]. A previous study has offered a semi-simulated dataset for EOG artifact removal, but EMG signals are not included [71]. Effective use of these datasets for DL-based denoising requires extensive EEG background knowledge, including properties of EEG and artifacts, data format conversion, and signal processing. In contrast, the segments in our dataset have been pre-processed, so users can immediately generate a large set of semi-synthetic noisy EEG segments with ground truth for their DL-based denoising models without being distracted by detailed electrophysiological knowledge. With this advantage, our well-structured dataset would greatly promote the development of DL-based EEG denoising.
Another major challenge in comparing the performance of different denoising algorithms is the lack of specific benchmarks. The use of standard benchmarks greatly simplifies the comparison of performance across multiple DL models. To fill this gap, we provided a set of benchmark algorithms along with a standardized EEG dataset. We chose four well-known and relatively basic networks, i.e., an FCNN, a simple CNN, a complex CNN, and an RNN, for benchmarking. The performance of these DL models in providing artifact-corrected EEG data has been measured using several standard metrics, such as RRMSE, PSD and CC. Furthermore, we define the network convergence, expressed by loss as a function of epoch number, as a qualitative part of the benchmarks. We expect our work to contribute to the DL-based EEG denoising field, in particular because it standardizes the evaluation metrics of performance.
Our benchmarks of four deep learning networks and two traditional methods have demonstrated the feasibility of using DL-based methods to attenuate artifacts from EEG signals. Our comparisons of the four networks (i.e., FCNN, simple CNN, complex CNN, RNN) with two traditional methods (i.e., EMD and filtering) suggest that DL-based methods outperform the traditional methods, and that supervised end-to-end deep learning has great potential to remove artifacts in EEG signals. Specifically, for the ocular artifacts, the range of CC values in our four networks is at a level equivalent to the CC values reported in a previous study, which used a regression-based method and an offline ICA-based method [72]. A consistent result has also been reported by a DL-based ocular artifact removal study [37]. However, these studies have not offered benchmarks for comparing with other methods [73,74,75,76]. For the myogenic artifacts, comparable RRMSE_temporal values have been reported in previous literature, such as an ICA-based method [63] and a canonical correlation analysis-based method [77].
The performance of the neural networks depends on the data quality and the frequency characteristics of the artifacts. The neural networks provide better results for high SNR signals than for low SNR signals (Figure 8). Moreover, high-frequency artifacts, such as EMG artifacts, are more difficult for neural networks to deal with (Figures 8-9). This phenomenon may be explained by the F-Principle of neural networks [78]. The F-Principle states that deep learning networks often learn low-frequency information in the early stages of training, and then learn high-frequency information as training iterations increase.
One advantage of deep learning for EEG artifact removal is its flexibility and generalizability. Although the DL-based denoising methods require a large amount of ground-truth EEG data in the training stage, once the model is trained, it can be easily applied to new data, such as multi-channel EEG data or task-related EEG data, regardless of the corresponding reference channels for artifact removal. Another advantage lies in the handling of complex (e.g., nonlinear and non-stationary) artifact mixtures. Due to the hierarchical structure of deep neural networks, DL models can directly learn the true nature of neural activities from training data in the hidden space, and then generate the cleaned EEG data according to the new contaminated EEG input, whereas traditional methods usually linearly attenuate artifacts. Therefore, methods based on deep learning are expected to provide better performance than traditional methods in noise removal.
Several limitations should also be mentioned. First of all, an important potential problem is the size of the dataset and the type of data. Although we provided thousands of segments of EEG, ocular and myogenic artifacts in EEGdenoiseNet, it is possible that more complex neural networks might require larger amounts of data for training and testing. Another drawback is the limited diversity of EEG types and artifact types. EEG data may be collected in resting state or in different task conditions; furthermore, artifacts in EEG recordings are not limited to ocular and myogenic ones. For example, removing motion artifacts is important for mobile EEG applications. Criteria for reviewing and approving additional EEG data submissions to EEGdenoiseNet would be helpful. Such an evolving dataset will help improve the generalizability of DL-based EEG denoising networks to diverse brain states. Third, we only focused on the denoising of 2 s-long EEG segments in this study. It is worth noting that some EEG tasks might be longer than 2 seconds, not to mention the case of resting EEG. In the future, EEGdenoiseNet needs to be extended to adapt to the denoising of continuous EEG. The continuous artifact removal problem can be solved by defining pseudo-segments in continuous EEG recordings, and extracting the hidden relationships between consecutive segments, such that the previous segment can be used in the training stage as input to constrain the denoising process of the current segment. Fourth, here we only focus on single-channel EEG denoising, and the deep learning model learns the temporal information of EEG signals and EOG/EMG artifacts. To use supervised models to learn spatial features, a benchmark dataset with multi-channel EEG data must be provided in the future. Finally, we did not explore unsupervised deep learning models in this study. When there is no gold standard for clean EEG signals and artifacts, unsupervised deep learning may be of great importance.

Conclusion
In this study, we provided a dataset containing thousands of clean EEG, ocular artifact and muscular artifact segments, which is suited for benchmarking DL-based EEG denoising methods. The dataset is well-structured and publicly available online in different formats. In addition, we included a set of benchmark tools to facilitate the evaluation of newly developed DL-based EEG denoising models. Our benchmarking results suggested that DL methods have great potential to remove both ocular and myogenic artifacts from EEG data, even at high noise levels. Our study may accelerate the development of the DL-based EEG denoising field.

Figure 1. (A) To obtain the clean EEG, pure EOG and pure EMG segments, we first preprocess the raw data. The data preprocessing includes filtering, ICA-based artifact removal, resampling, standardization, and visual checking by an expert. (B) 4514 pure EEG segments, 3400 pure EOG segments and 5598 pure EMG segments are obtained. The EEGdenoiseNet dataset includes two data formats: .mat files and .npy files. (C) The semi-synthetic data is generated by mixing a pure EEG segment and an EOG/EMG segment.