DeepSleepNet-Lite: A Simplified Automatic Sleep Stage Scoring Model with Uncertainty Estimates

Deep learning is widely used in the most recent automatic sleep scoring algorithms. Its popularity stems from its excellent performance and from its ability to process raw signals directly and to learn features from the data. Most of the existing scoring algorithms rely on computationally demanding architectures, owing to their high number of training parameters, and process lengthy input sequences (up to 12 minutes). Only a few of these architectures provide an estimate of the model uncertainty. In this study we propose DeepSleepNet-Lite, a simplified and lightweight scoring architecture that processes only 90-second EEG input sequences. We exploit, for the first time in sleep scoring, the Monte Carlo dropout technique to enhance the performance of the architecture and to detect the uncertain instances. The evaluation is performed on the single-channel EEG Fpz-Cz from the open source Sleep-EDF expanded database. DeepSleepNet-Lite achieves performance slightly lower than, if not on par with, the existing state-of-the-art architectures in overall accuracy, macro F1-score and Cohen's kappa (on Sleep-EDF v1-2013 ±30mins: 84.0%, 78.0%, 0.78; on Sleep-EDF v2-2018 ±30mins: 80.3%, 75.2%, 0.73). Monte Carlo dropout enables the estimate of the uncertain predictions. By rejecting the uncertain instances, the model achieves higher performance on both versions of the database (on Sleep-EDF v1-2013 ±30mins: 86.1%, 79.6%, 0.81; on Sleep-EDF v2-2018 ±30mins: 82.3%, 76.7%, 0.76). Our lighter sleep scoring approach paves the way to the application of scoring algorithms for sleep analysis in real time.


I. INTRODUCTION
GOOD sleep plays a crucial role in human well-being, and sleep disorders represent a significant and increasing public health problem [1]. Polysomnography (PSG) is used in sleep medicine as a diagnostic tool to objectively analyze the quality of sleep and the common sleep pathologies, e.g. sleep breathing disorders, narcolepsy and sleep-related movement disorders [2]. Electroencephalography (EEG), electrooculography (EOG), electromyography (EMG) and electrocardiography (ECG) signals are essential for the PSG exam. Physicians extract sleep cycle information through the well-known sleep stage scoring procedure, according to the AASM guidelines [3]. The whole-night sleep recording is divided into 30-second windows, called epochs, and each epoch is classified into one of the following five sleep stages: wakefulness W, stage N1, stage N2, stage N3, and stage R (REM sleep). This manual sleep stage classification is time-consuming and is affected by human error: several works report high values of inter- and intra-scorer variability [5]. Since 1960, a wide variety of techniques have been devised in an effort to automate this procedure. Still, up to now, no system has completely replaced the physician.
In the last decades, deep learning algorithms have been widely used to solve the sleep scoring task automatically. A thorough review of the application of deep learning architectures to sleep scoring can be found in [6]. Autoencoders [7], deep neural networks (DNNs) [8], convolutional neural networks (CNNs) [9]-[16], recurrent neural networks (RNNs) [17] and several combinations of them [18]-[26] have been recently proposed. The main advantage of all these networks is the ability to learn features directly from raw data, while taking into account the temporal dependency among the sleep stages. However, the architectures of these models are quite complex, and a high number of parameters needs to be trained. The most recent ones process lengthy input sequences (i.e. up to 12 minutes) using RNNs, thus requiring extra resources to buffer the PSG input and making them unsuitable for home-monitoring and real-time applications. As a rule, deep architectures with a high number of layers and parameters need to be trained on large databases to prevent overfitting. In various scenarios, sleep datasets have a limited number of labeled PSG samples available. Lighter architectures may be better suited if the model needs to be trained from scratch.
We found only two architectures [14], [21] that perform automatic sleep scoring and also provide an estimate of the model uncertainty. In [14] the authors use an additional classification block-2 (i.e. a multilayer perceptron in cascade with the deep convolutional scoring architecture) to output the final sleep stage and the associated relative confidence score. In contrast, [21] trains 16 different models and uses the relative model variance to estimate the uncertain predictions. It is important to know the level of confidence of each prediction, as it could be the key to identifying the misclassified sleep stages.
In this paper, we propose DeepSleepNet-Lite, a simplified and lightweight automatic sleep scoring architecture. It provides the predicted sleep stages along with an estimate of their uncertainty. The major advantage is that it does not require any additional computation over the baseline architecture to provide the estimate. The two main contributions of this paper are: 1) the optimization of a simple feed-forward sleep scoring architecture that processes only 90-second single-channel EEG input sequences; 2) the application of the Monte Carlo dropout sampling technique, using dropout at test time to capture the model uncertainty and to enhance the performance of the scoring system.
In Section II we describe the architecture, the training algorithm and the regularization techniques used in our scoring system. In Section III we briefly present the label smoothing technique used to calibrate the scoring architecture. Moreover, we propose a new conditional probability distribution computed over the targets (i.e. our prior knowledge) and used to smooth our labels. In Section IV we present the Monte Carlo dropout sampling technique and its application within our sleep scoring system to estimate the uncertainty of the model. In the last sections, we demonstrate the effectiveness of the label smoothing and Monte Carlo dropout techniques in both calibrating and enhancing the performance of our model. We also demonstrate the effectiveness of the uncertainty estimate procedure, by showing that it is able to identify the most challenging sleep stage predictions. We finally show that DeepSleepNet-Lite achieves performance on par with the most up-to-date scoring systems.

II. DEEPSLEEPNET-LITE
The architecture of DeepSleepNet-Lite is strongly inspired by DeepSleepNet from Supratak et al. [18]. Unlike the original network, we have employed only the first representation learning part, and trained it with a sequence-to-epoch learning approach. The architecture receives as input a sequence of PSG epochs, and it predicts the target of the central epoch of the sequence. In [27] we had already shown that the first representation learning part of the architecture, trained with a small temporal context (i.e. 90-second sequences), does most of the work on a small-sized database.

A. The Architecture
The representation learning architecture consists of two parallel CNNs employing small ($\mathrm{CNN}_{\theta_S}$) and large ($\mathrm{CNN}_{\theta_L}$) filters at the first layer. The small filter has been used to extract high-time-resolution patterns, while the large filter has been used to extract high-frequency-resolution patterns. The idea behind the use of the small and large filter sizes comes from the way signal processing experts define the trade-off between temporal and frequency precision in the feature extraction procedure [28]. Each CNN section consists of four convolutional layers and two max-pooling layers. Each convolutional layer executes three operations: a one-dimensional convolution of the filters with the 90-second epochs, a batch normalization [29] and an element-wise rectified linear unit (ReLU) activation function. The filter size, the number of filters and the stride size of each convolutional layer are defined in Fig. 1. The pooling layer is used to downsample the input. In each max-pool unit the pooling size and the stride size are specified.
The 90-second EEG signal $x_i$ is given as input to the convolutional neural networks $\mathrm{CNN}_{\theta_S}$ and $\mathrm{CNN}_{\theta_L}$. The parameters θ of each convolutional neural network are independently trained, so as to return as output two feature vectors $h_i^S$ and $h_i^L$. The outputs are concatenated in $f_i$, then forwarded to the softmax layer.
The softmax function, together with the cross-entropy loss function, is used to train the model to output the logits $z_i$ and the probabilities for the five mutually exclusive classes.
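In the standard formulation, and using only the symbols defined in the surrounding text ($f_i$ the concatenated features, $W$ and $b$ the softmax-layer parameters, $K = 5$ classes), the logits and the class probabilities can be written as below. This is a reconstruction under those assumptions rather than a verbatim copy of the original typesetting.

```latex
\begin{align}
z_i &= W f_i + b, \\
p_{i,k} &= \frac{\exp(z_{i,k})}{\sum_{j=1}^{K} \exp(z_{i,j})},
\end{align}
```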
where θ = {W, b} are the parameters of the softmax layer, j is the index over the components of the vector $z_i$, and $p_{i,k}$ is the output probability of class k associated to x(t), the central 30-second signal in $x_i$.
All the model specifications are reported in Fig. 1 and are identical to those of the first representation learning part in [18].

B. Training Algorithm
The architecture is trained end-to-end via backpropagation, using the sequence-to-epoch learning approach. Classification algorithms tend to predict the most represented class in the training set, leading to the so-called class imbalance problem. Here the least represented classes are balanced by using two techniques: (i) data augmentation, by vertically flipping the input data (i.e. multiplying the original signal by −1, see Fig. 2) belonging to the least represented classes, then (ii) random oversampling of the data, so that all the sleep stages are equal in number to the most represented class. In our model, the input is a sequence of three 30-second epochs, and the output is the target of the central epoch at time (t). We therefore refer to the target of the central epoch to determine the most and least represented classes. The model is trained using the mini-batch Adam gradient-based optimizer [30] with a learning rate lr. The training procedure runs up to a maximum number of iterations, unless the early stopping condition is met beforehand (further details in subsection II-C).
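The input construction and the balancing steps described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, every class smaller than the largest one is treated as "least represented", and the extra samples are drawn with replacement from the original and sign-flipped signals, all of which are assumptions.

```python
import numpy as np

def make_sequences(epochs, labels, context=1):
    """Build sequence-to-epoch pairs: each input is the central epoch plus
    `context` neighbours on each side; the target is the central label."""
    idx = np.arange(context, len(epochs) - context)
    x = np.stack([epochs[i - context:i + context + 1].reshape(-1) for i in idx])
    return x, labels[idx]

def balance_classes(x, y, rng):
    """Bring every class up to the size of the most represented one."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    xs, ys = [], []
    for c, n_c in zip(classes, counts):
        x_c = x[y == c]
        if n_c < target:
            # (i) augmentation: vertically flipped copies (signal * -1)
            pool = np.concatenate([x_c, -x_c])
            # (ii) random oversampling up to the majority-class count
            extra = pool[rng.integers(0, len(pool), size=target - n_c)]
            x_c = np.concatenate([x_c, extra])
        xs.append(x_c)
        ys.append(np.full(len(x_c), c))
    return np.concatenate(xs), np.concatenate(ys)
```

With 30-second epochs sampled at 100 Hz, each row returned by `make_sequences` holds 9000 samples, i.e. the 90-second input of the network.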

C. Regularization Techniques and Training Parameters
Dropout. Commonly used as a regularizer in convolutional neural networks, it prevents overfitting and co-adaptation of the feature maps [31]. During the training procedure a certain number of neurons are randomly removed, dropping units with a probability p. We fix the probability of dropping a connection to 50%, i.e. p = 0.5.
Early stopping. It provides guidance on how many iterations can be run before the model begins to overfit [32]. In principle, the training procedure should be stopped as soon as the performance (i.e. F1-score) on the validation set is lower than it was in the previous iteration step. In this study, however, rather than stopping the learning procedure immediately, the algorithm runs for an additional number of iterations (fixed by the so-called patience parameter). The model with the highest performance is the one we finally save.
L2 weight decay. This technique simply adds a term to the loss function that penalizes the weight values; by doing so it helps to avoid the exploding gradient phenomenon [33]. The parameter lambda defines the degree of penalty and has been set to $10^{-3}$.
All the training parameters are fixed as in [18]. The Adam optimizer's parameters beta1 and beta2 have been set to 0.9 and 0.999 respectively. The mini-batch size has been set to 100. During the batch normalization procedure, the value of $10^{-5}$ has been added to the mini-batch variance. In order to compute the mean and variance of the training samples, the moving average has been implemented using a fixed decay rate of 0.999. The learning rate parameter lr has been fixed to $10^{-4}$. The maximum number of iterations has been set to 100, with the early stopping patience parameter equal to 50.
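The early stopping rule with patience can be sketched in plain Python as below. It is a simplified illustration under stated assumptions: `early_stopping` is a hypothetical helper that receives the validation F1-score of each iteration, and "no improvement" is measured against the best score seen so far, which is the common convention but is not spelled out in the text.

```python
def early_stopping(val_scores, max_iters=100, patience=50):
    """Return (best_iter, best_score, last_iter) for a run that stops once
    `patience` consecutive iterations fail to improve the best score."""
    best_score, best_iter, waited, last_iter = float("-inf"), -1, 0, -1
    for it, score in enumerate(val_scores[:max_iters]):
        last_iter = it
        if score > best_score:
            # New best validation score: this is where the model would be saved.
            best_score, best_iter, waited = score, it, 0
        else:
            waited += 1
            if waited >= patience:
                break  # patience exhausted, stop training
    return best_iter, best_score, last_iter
```

With the settings reported above (100 iterations, patience 50), training can stop early only if the best validation score is reached within the first 50 iterations.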

III. MODEL CALIBRATION
Along with the estimated sleep stage, the model should also provide a calibrated confidence, i.e. the probability associated to the predicted stage should mirror its ground truth correctness likelihood. We adopted label smoothing [34] to calibrate our model. It has been shown to be a suitable technique to improve model calibration [35]. In the standard training of a neural network, the cross-entropy loss is minimized using the hard targets $y_k$ (i.e. one-hot encoded targets, '1' for the correct class and '0' for the others). For a network trained with label smoothing, the hard targets are weighted with the uniform distribution 1/K (eq. 6), and the cross-entropy loss is minimized using the weighted mixture of the targets (eq. 7).
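In the usual formulation of label smoothing [35], and with the symbols defined in the surrounding text, the two expressions referred to as eq. 6 and eq. 7 take the following form. This is a reconstruction under that assumption, and the equation numbers are taken from the references in the text.

```latex
\begin{align}
y_k^{LS_u} &= y_k\,(1-\alpha) + \frac{\alpha}{K}, \tag{6} \\
H(y, p) &= -\sum_{k=1}^{K} y_k^{LS_u} \log(p_k), \tag{7}
\end{align}
```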
where α is the smoothing parameter, K is the total number of classes, $y_k^{LS_u}$ are the targets smoothed with the uniform distribution, and $p_k$ are the softmax output probabilities.

A. Conditional Probability Distribution in Label Smoothing
In our study, we introduce a new distribution to smooth the labels, which takes into account the importance, in sleep scoring, of the transitions from one sleep stage to another. The idea is to compute the conditional probability distribution over the five sleep stages from all the sequences of epochs, and to collect in a matrix M the conditional probability values for each possible sequence of three sleep stages. In detail, we compute the probability of being in a stage at time t given the previous (t−1) and the next (t+1) sleep stages, i.e. $P(y_{(t)} \mid y_{(t-1)}, y_{(t+1)})$, over the whole database. The matrix M is K×K×K dimensional, where K is the total number of sleep stages. As stated previously, the architecture takes as input a sequence of three epochs, and outputs the target of the central epoch $y_{k,(t)}$. So, during the training procedure, given the knowledge of the sleep stage at time (t − 1) and the sleep stage at time (t + 1), the one-hot encoded $y_{k,(t)}$ is smoothed with the corresponding conditional probability vector from M.
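A minimal NumPy sketch of how such a matrix could be estimated and used is given below. Several choices are assumptions made for illustration: the axis order `M[prev, central, next]`, the uniform fallback for (previous, next) pairs never observed in the data, and the mixture `(1 - alpha) * one_hot + alpha * M[prev, :, next]`, chosen by analogy with uniform label smoothing. The function names are hypothetical.

```python
import numpy as np

def conditional_matrix(stage_sequences, n_classes=5):
    """Estimate M[a, k, b] = P(stage_t = k | stage_{t-1} = a, stage_{t+1} = b)
    by counting all triplets of consecutive labels in each recording."""
    counts = np.zeros((n_classes, n_classes, n_classes))
    for stages in stage_sequences:  # one integer label sequence per recording
        s = np.asarray(stages)
        np.add.at(counts, (s[:-2], s[1:-1], s[2:]), 1)
    totals = counts.sum(axis=1, keepdims=True)
    # Assumption: unseen (prev, next) pairs fall back to the uniform distribution.
    return np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_classes)

def smooth_target(central, prev, nxt, M, alpha=0.2):
    """Mix the one-hot target of the central epoch with the conditional vector."""
    one_hot = np.eye(M.shape[1])[central]
    return (1 - alpha) * one_hot + alpha * M[prev, :, nxt]
```

Because every conditional vector sums to one, the smoothed target remains a valid probability distribution for any α in [0, 1].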
In Table I we report an example of the conditional probability values computed over the sequences extracted from the Sleep-EDF v1-2013 dataset (see Section V), with the label at time (t − 1) fixed to the awake stage. We highlight in light green an example of the conditional probability vector to use when we have awake W at time (t − 1) and N1 at time (t + 1), which yields the corresponding smoothed target. The cross-entropy loss is minimized using the weighted mixture of the hard targets with these conditional probability distributions.
The smoothing parameter α has been set to 0.1 for the uniform distribution weighting and to 0.2 for the conditional probability distribution weighting. These two values gave us the highest performance. In both cases, we explored α values up to 0.5.

IV. ESTIMATING UNCERTAINTY
In order to estimate the model uncertainty, we exploit the dropout regularization technique. As explained above, during the training procedure, at each iteration, dropout removes a certain number of units within our network at random. It randomly samples a certain number of sub-networks, so that each time the model's architecture is slightly different. In a standard application, dropout is used only during the training phase. At test time, instead, all the trained neurons and connections are used, i.e. all the weights of the whole network. The output can be interpreted as an averaging ensemble of all the sub-networks.
We employ, for the first time in sleep staging, Monte Carlo (MC) dropout [36] to quantify the model uncertainty and, ultimately, to enhance the performance of the scoring architecture. Monte Carlo refers to a class of algorithms that rely on random sampling to provide estimates and distributions of numerical quantities. MC dropout simply consists in applying the randomized sampling at test time as well. The different sub-networks can be interpreted as Monte Carlo samples drawn from the space of all the possible models. As a result, by applying dropout N times at inference time (with the probability of dropping a connection p = 0.5), we obtain N different predictions. We compute the mean $\mu_{i,k}$ and the variance $\sigma^2_{i,k}$ of the N predictions for each sleep stage k, where $p_{n,i,k}$ is the output probability for the sleep stage k of the n-th prediction for the input $x_i$. The final prediction $\hat{y}_i$ of the model is the sleep stage with the highest mean probability, i.e. the one attaining $\max(\boldsymbol{\mu}_i)$.
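The sampling step described above can be sketched in NumPy as follows. This is an illustration under stated assumptions: `stochastic_forward` stands for any forward pass that keeps dropout active, `make_toy_forward` is a hypothetical single-layer stand-in for the real network, and the variance uses the 1/N normalization, which the text does not specify.

```python
import numpy as np

def make_toy_forward(weights, bias, rng, p_drop=0.5):
    """Hypothetical one-layer softmax classifier with dropout kept active."""
    def forward(x):
        mask = rng.random(x.shape) >= p_drop      # drop each unit with prob. p
        h = x * mask / (1.0 - p_drop)             # inverted-dropout rescaling
        z = h @ weights + bias
        z = z - z.max(axis=1, keepdims=True)      # numerically stable softmax
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)
    return forward

def mc_dropout_predict(stochastic_forward, x, n_samples=30):
    """Run N stochastic passes; return prediction, per-class mean and variance."""
    probs = np.stack([stochastic_forward(x) for _ in range(n_samples)])  # (N, batch, K)
    mu = probs.mean(axis=0)        # mean probability per sample and class
    var = probs.var(axis=0)        # variance over the N passes (1/N normalization)
    y_hat = mu.argmax(axis=1)      # class with the highest mean probability
    return y_hat, mu, var
```

Each of the N passes is independent of the others, so in practice they can be batched together when the framework allows it.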
The uncertain predictions are then estimated by analysing both their computed mean and variance. The selection procedure of the uncertain sleep stages is explained in detail in subsection VI-D. The selected uncertain predictions could then be presented to the physician for a secondary review.

V. DATA
Sleep-EDF (SC). The Sleep-EDF Sleep Cassette is a subset of the open source Sleep-EDF dataset [37]. The PSG data belong to 78 subjects (37 males and 41 females) aged from 25 to 101 years. Except for the first nights of subjects 36 and 52, and for the second night of subject 13, two whole-night recordings are available for every subject, resulting in 153 PSG recordings. Each recording includes two scalp EEG channels (Fpz-Cz and Pz-Cz), one EOG (horizontal) channel, one submental chin EMG channel and one oro-nasal respiration channel. The recordings are manually scored by sleep experts on 30-second epochs according to the Rechtschaffen and Kales scoring rules [38], resulting in the eight classes Wake, N1, N2, N3, N4, REM, MOVEMENT and UNKNOWN. In order to use the AASM standard [3], we have merged the N3 and N4 stages into a single stage N3, and we have excluded the MOVEMENT and UNKNOWN classes. In many recordings there were long wake periods before the patients went to sleep and after they woke up. We have run experiments with the two common ways in which these periods are trimmed in the literature: 1) only in-bed parts are employed [4], i.e. from light-off time to light-on time; 2) 30 minutes of data before and after the in-bed parts are also taken into account [18]. In our study we have considered the EEG Fpz-Cz channel, with a sampling rate of 100 Hz and without any pre-processing. In order to facilitate the comparison with many existing deep learning based scoring algorithms, in this work we use both the latest expanded version, published in 2018, and the previous upload of the Sleep-EDF database, published in 2013. The older upload contained only 39 PSG recordings from 20 subjects. In Table II we report a summary of the total number and percentage of the epochs per sleep stage.

A. Design of Experiments
The validation procedure is in line with the state-of-the-art methods considered in Table VIII in subsection VI-E. In fact, we evaluate our model using the k-fold cross-validation scheme. We set k equal to 20 for the v1-2013 and to 10 for the v2-2018 Sleep-EDF datasets. In Table III we summarize the data split for each dataset. In our study we decided to further standardize the experiments by considering in each fold the same subject IDs used in [26]. We believe that in such small datasets the subjects involved in the training/validation/test sets may have an impact on the final results. The following experiments are conducted:
• base. The model is trained without considering model calibration, i.e. without label smoothing.
• base+LS_u. The model is trained taking into account the confidence calibration, using label smoothing with the uniform distribution, i.e. the hard targets are weighted with the uniform distribution.
• base+LS_s. The model is trained taking into account the confidence calibration, using label smoothing with our statistical analysis of the sequences of sleep stages, i.e. the hard targets are weighted with the conditional probability distribution.
These three models, differently trained, have been evaluated with and without the MC dropout sampling technique. In Table IV (subsection VI-C) we present the results obtained for the three models, and the impact of MC dropout at inference time.
The models have been implemented in TensorFlow 1.14, and trained on a single workstation running Ubuntu 18.04.2 with an Intel Core i7-8700K CPU, an NVIDIA GTX 1080 GPU with 8 GB of memory and 32 GB of RAM.

B. Metrics
Performance. The per-class F1-score, the overall accuracy (Acc.), the macro-averaging F1-score (MF1) and the Cohen's kappa (κ) have been computed from the predicted sleep stages of all the folds to evaluate the performance of our model [39], [40]. In our experiments the weighted-averaging F1-score has also been reported, to account for the label imbalance problem. It computes the average of the metric weighted by the number of true instances for each label. The F1-score computed in this way is not a true weighted average of precision and recall, but it takes into account the high imbalance between the sleep stages.
Calibration. We evaluated the calibration of our model using the Expected Calibration Error (ECE) proposed in [41]. It approximates the difference in expectation between accuracy acc and confidence conf, where confidence refers to the softmax output probabilities. In more detail, we first divide the predictions into M equally spaced bins (of size 1/M); then, for each bin, we compute the accuracy $acc(B_m)$, i.e. the fraction of samples in the bin for which $\hat{y}_i = y_i$, and the average predicted probability value $conf(B_m)$, i.e. the mean of $\hat{p}_i$ over the bin. Here $y_i$ and $\hat{y}_i$ are the true and predicted labels for the sample i, $B_m$ is the group of samples whose predicted probability values fall into the interval $I_m = (\frac{m-1}{M}, \frac{m}{M}]$, and $\hat{p}_i$ is the predicted probability value for sample i. We finally compute the average of the absolute difference between acc and conf over the M bins, weighted by the number of samples in each bin. Clearly, perfectly calibrated models have $acc(B_m) = conf(B_m)$ for all m ∈ {1, ..., M}, resulting in ECE = 0.
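The binning procedure for the ECE can be sketched in NumPy as below. It follows the usual definition from [41], with each bin weighted by its share of all the samples; the function name and the default of 10 bins are assumptions made for illustration, since the number of bins M is not reported in the text.

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    """ECE = sum_m (|B_m| / n_total) * |acc(B_m) - conf(B_m)|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = (np.asarray(predictions) == np.asarray(labels)).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)  # interval (lo, hi]
        if in_bin.any():
            acc = correct[in_bin].mean()        # accuracy inside the bin
            conf = confidences[in_bin].mean()   # average predicted probability
            ece += in_bin.mean() * abs(acc - conf)
    return ece
```

Here `confidences` would hold, for each epoch, the softmax probability of the predicted class (or its Monte Carlo mean when MC dropout is used).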

C. Analysis of Experiments
In Table IV we report the overall performance and the calibration measure of the three different models, with and without Monte Carlo dropout at inference time, to which we refer as w/ MC and w/o MC respectively. In the following, we analyse only the results obtained on the Sleep-EDF v1-2013 ±30mins dataset, since the findings also hold for its expanded v2-2018 ±30mins version. In our tests w/o MC, we show the effectiveness of label smoothing in calibrating the model. The conf value refers to the average of all the predicted probability values. In both the LS_u and LS_s models, the conf probability better reflects the ground truth correctness likelihood, i.e. the accuracy value. Indeed, it results in better ECE values, 0.023 and 0.071, compared to the base model.
In Fig. 3 we report the F1-score against the number of Monte Carlo samples N, evaluated over all our experiments. Interestingly, on the average of the three models, Monte Carlo sampling outperforms the experiments run without MC after approximately three samples. On average we reach a plateau after 30 samples, so we decided to set N equal to 30. From here on, all the results refer to the best of our models, base+LS_u, using MC sampling at test time.
In Tables V and VI we report the confusion matrix and the per-class performance of the best of our models evaluated on Sleep-EDF v1-2013 ±30mins and v2-2018 ±30mins respectively. The entry in the i-th row and j-th column indicates the percentage of 90-s EEG instances whose true label is the i-th class and whose predicted label is the j-th class. In bold we highlight the percentage of correctly classified instances. As expected [42], the lowest performance has been obtained for the N1 sleep stage, i.e. F1-scores of 44.4% and 46.0%; most of the misclassified N1 epochs have been assigned to awake, N2 and REM. The F1-scores for all the other sleep stages range between 82.4% and 88.2% on v1-2013 ±30mins, and between 76.4% and 91.5% on v2-2018 ±30mins.
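The quantities reported in these tables can be computed from the true and predicted labels as in the NumPy sketch below. It uses the textbook definitions of the metrics; the function name and the dictionary keys are hypothetical, and this is not the authors' evaluation script.

```python
import numpy as np

def scoring_metrics(y_true, y_pred, n_classes=5):
    """Row-normalized confusion matrix, per-class F1, accuracy, macro and
    weighted F1, and Cohen's kappa from integer labels."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)  # rows: true class
    tp = np.diag(cm).astype(float)
    support = cm.sum(axis=1)      # true instances per class (TP + FN)
    predicted = cm.sum(axis=0)    # predicted instances per class (TP + FP)
    denom = support + predicted
    f1 = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)
    n = cm.sum()
    acc = tp.sum() / n
    p_e = (support * predicted).sum() / n**2   # chance agreement
    return {
        "cm_percent": 100 * cm / np.maximum(support[:, None], 1),
        "f1": f1,
        "acc": acc,
        "mf1": f1.mean(),                      # macro-averaged F1
        "wf1": (f1 * support).sum() / n,       # F1 weighted by true instances
        "kappa": (acc - p_e) / (1 - p_e),
    }
```

The row normalization in `cm_percent` matches the convention described above, where each row corresponds to a true class and sums to 100%.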

D. Uncertainty Estimate
MC dropout enables the estimate of the uncertain predictions. In order to select the uncertain instances, we first used the variance ($\sigma^2$ of the predicted probability values obtained from the N samples). The selection procedure (also referred to as the query procedure) simply relies on setting a threshold value q%, which corresponds to the percentage of epochs, for each PSG recording, to select and potentially send to the physician for a secondary review. The q% of epochs with the highest variance values are selected. We also tried to use the mean (µ of the predicted probability values obtained from the N samples) to select the uncertain instances. In this case the q% of epochs with the lowest mean values are selected. The selected epochs, in both cases, correspond to the predictions for which the averaging ensemble of the models outputs the highest uncertainty.
In Fig. 4 we report the F1-score computed over the remaining epochs against the percentage of selected instances. We have fixed the q% threshold value to 5%, because it was considered to be a reasonable number of epochs (54 on average for each PSG recording) to select and eventually present to the physician for a secondary review. The results show that using µ in the selection procedure yields higher performance. In Fig. 5 we also report, for each percentage q% of selected instances, the percentage of misclassified and correctly classified epochs among the selected ones. As illustrated, by using µ, the misclassified epochs outnumber the correctly classified ones up to a selection threshold q% equal to 10%. With $\sigma^2$, instead, the selection threshold q% up to which this holds drops sharply to 2%.
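The query procedure can be sketched in NumPy as below, for the epochs of a single PSG recording. The sketch assumes that the mean and the variance used for ranking are those of the predicted class, which the text does not state explicitly; `query_uncertain` and its arguments are hypothetical names.

```python
import numpy as np

def query_uncertain(mu, var, q=0.05, criterion="mu"):
    """Flag the q fraction of epochs with the lowest mean probability
    (criterion="mu") or the highest variance (criterion="var")."""
    y_hat = mu.argmax(axis=1)
    rows = np.arange(len(mu))
    if criterion == "mu":
        score = -mu[rows, y_hat]   # low mean probability = high uncertainty
    else:
        score = var[rows, y_hat]   # high variance = high uncertainty
    n_select = int(round(q * len(mu)))
    order = np.argsort(score)[::-1]            # most uncertain epochs first
    selected = np.zeros(len(mu), dtype=bool)
    selected[order[:n_select]] = True
    return selected
```

The epochs where the returned mask is `True` form the rejected set that would be sent for a secondary review; the performance on the kept set is computed over the remaining epochs.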
In Table VII we also report the average of the per-class $\sigma^2$ and µ predicted probability values, to give an overall estimate of the model uncertainty, evaluated on both the Sleep-EDF v1-2013 ±30mins and v2-2018 ±30mins datasets. As expected, the results show that the model has more difficulty in classifying N1 and REM epochs, while it is more confident in classifying the W, N2 and N3 sleep stages (lower variance and higher predicted probability values).

E. Comparison with state-of-the-art
In Table VIII we compare our best model with the other state-of-the-art methods evaluated on the two versions of the Sleep-EDF database. We report the results for each experimental scenario: 1) only in-bed recordings; 2) additional 30-minute recordings before and after in-bed. We have considered only the methods using deep learning based architectures, raw single-channel Fpz-Cz, the same evaluation procedure (i.e. k-fold cross-validation) and independent training and test sets. We decided to further standardize our experiments by considering in each fold the same subject IDs used in [26]. All the results indicated by † are not directly comparable, since they use a different set of subject IDs in their training/evaluation/testing procedure. The sleep scoring algorithms are compared across the overall metrics (Acc., MF1, Cohen's kappa and F1-score) and the per-class F1-score.
The proposed DeepSleepNet-Lite achieves performance slightly lower than, if not on par with, the state-of-the-art models on all the Sleep-EDF datasets. The results confirm what we had already partially observed in [27] on Sleep-EDF v1-2013: the first epoch processing block from DeepSleepNet, trained with a small temporal context in input, still succeeds in solving the classification task on a small-sized database. Indeed, on both the v1-2013 and v2-2018 in-bed recordings, our model achieves an overall accuracy at most 1.3% lower than that of the recent state-of-the-art XSleepNet2 [26]. We are not surprised to see our lighter architecture outperform DeepSleepNet: one of the reasons could be that in [18] the authors have not implemented any early stopping procedure and save their model only at the last iteration step, thus not mitigating the overfitting phenomenon.
The number of training parameters of our lighter model is significantly reduced: ∼0.6M, compared to ∼1.3M for TinySleepNet [25], ∼2.6M for SleepEEGNet [23], ∼5.6M for FCNN+RNN [26], ∼5.8M for Naive Fusion and XSleepNet2 [26], and ∼24.7M for DeepSleepNet [18]. Nevertheless, SeqSleepNet [24] is still the network with the lowest number of parameters, ∼0.2M. We did not report the number of training parameters for IITNet [22], since it was not available in the literature.

F. Comparison among our methods
In Table IX we report the results of our best model evaluated on the two versions of the Sleep-EDF database, in both experimental scenarios. The outcomes refer to the performance of the model evaluated before and after the selection procedure, using the $\sigma^2$ and µ query values. We report the results obtained after the selection procedure on both the kept and the rejected sets of epochs. Consistently with what we have observed in Fig. 5, on both Sleep-EDF v1-2013 and v2-2018 the model shows an increase in performance over the kept epochs, and a significant decrease on the rejected epochs (below 50% by using the µ query). These results highlight the effectiveness of the query procedure in capturing a large number of misclassified epochs among the selected ones. The best performance for each dataset is indicated in bold. We obtain an overall accuracy equal to 86.1% on v1-2013 ±30mins (84.5% on in-bed only) and equal to 82.3% on v2-2018 ±30mins (80.9% on in-bed only).

VII. DISCUSSION
Our simplified deep learning approach to sleep scoring achieves performance slightly lower than, if not on par with, the existing state-of-the-art methodologies evaluated on the Sleep-EDF database. Besides having a small number of training parameters, our method does not require any extra resources to buffer the input sequences, since it processes sequences of only 90 seconds of EEG. Therefore, we may assume that an automatic sleep scoring system does not necessarily have to encode long temporal structures; rather, intrinsic patterns of short-term PSG recordings may be sufficient. However, as a result of further experiments carried out on larger and more heterogeneous databases (e.g. Physio2018 [43], [44] and SHHS [45], [46]), we can state that these observations are valid only on small-sized datasets (i.e. with low heterogeneity between subjects).
The major advantage of the proposed approach is that it also provides an estimate of the model uncertainty by exploiting existing layers of the architecture. Unlike the existing confidence estimation algorithms for sleep scoring [14], [21], Monte Carlo dropout is easy to implement and does not require any additional computation over the baseline architecture. Moreover, it produces interpretable outputs, i.e. the mean and variance of the predicted probability values. A clear disadvantage of this approach, as of other ensemble learning based algorithms, is that it needs to be executed N times, which increases the evaluation time by a factor of N. However, in a real-time application, it may still be a valid solution because the evaluation of a single sequence takes only a few milliseconds.
The results obtained in subsection VI-C, when our model is trained by smoothing the labels through the conditional probability distribution, are still to be further investigated. The impact of this prior knowledge, inserted during the training of our architecture, is not so obvious. It seems to improve the calibration of the model while maintaining its overall good performance. Even though this technique allows us to better calibrate our network, we do not equally succeed in obtaining higher performance when using it in combination with Monte Carlo dropout. Therefore, contrary to what we expected, a better calibrated architecture does not always lead to higher performance, or to a better estimate of the model uncertainty.

VIII. CONCLUSION AND FUTURE WORKS
We propose DeepSleepNet-Lite, a simplified and lightweight automatic sleep scoring architecture, which provides the predicted sleep stages along with an estimate of their uncertainty. The scoring system is based on raw single-channel EEG, and it processes 90-second time sequences. Although the proposed simple feed-forward architecture has proven to be as efficient as RNN-based architectures, we cannot conclude that by using only this first representation learning block we will reach equally good results on larger databases. The Monte Carlo dropout technique allows us to enhance the performance of the architecture and to identify a relevant number of misclassified epochs among the ones selected during the query procedure. DeepSleepNet-Lite has a low capacity, i.e. a low number of training parameters, and is hence less prone to overfitting on a small dataset. Hence the need to further investigate its robustness on larger databases. It would be interesting to simulate the query procedure on the recent state-of-the-art architectures, e.g. XSleepNet2, to assess its benefit on them. Our lightweight sleep scoring approach paves the way to real-time applications and to home-monitoring scenarios.

Fig. 1. An overview of the representation learning architecture from [18], with our sequence-to-epoch input-output training approach.

Fig. 3. F1-score against the number of Monte Carlo samples N of the three models (base, base+LS_u and base+LS_s) evaluated on the Sleep-EDF v1-2013 dataset. Monte Carlo sampling converges after 30 samples, without further significant improvement on the average of the three models.

Fig. 4. F1-score computed over the remaining epochs after the query procedure against the percentage of epochs to select. The F1-score obtained when the selection procedure uses the variance ($\sigma^2$ query) is shown in light green, and the one obtained using the mean (µ query) in light blue. The performance refers to the best of our models evaluated on the Sleep-EDF v1-2013 ±30mins dataset.

TABLE I
CONDITIONAL PROBABILITY VALUES COMPUTED OVER THE SEQUENCES EXTRACTED FROM THE SLEEP-EDF V1-2013 DATASET, WITH THE LABEL AT TIME (t − 1) FIXED TO AWAKE, I.E. $M_{W, K \times K}$.

TABLE II
NUMBER AND PERCENTAGE OF 30-SECOND EPOCHS PER SLEEP STAGE OF THE SLEEP-EDF DATASETS WITH DIFFERENT TRIMMING.
V. DATASleep-EDF (SC).The Sleep-EDF Sleep Cassette is a subset of the open source Sleep-EDF dataset

TABLE III
SUMMARY OF THE SLEEP-EDF DATASET AND THE DATA SPLIT.

TABLE IV
OVERALL PERFORMANCE AND CALIBRATION MEASURE OF THE MODELS OBTAINED FROM 20-FOLD CROSS-VALIDATION WITH AND WITHOUT MC ON THE SLEEP-EDF V1-2013 ±30MINS DATASET. BEST SHOWN IN BOLD.

TABLE VIII
COMPARISON BETWEEN OUR METHOD AND THE OTHER DEEP LEARNING-BASED AUTOMATIC SLEEP SCORING SYSTEMS USING RAW SINGLE CHANNEL FPZ-CZ, EVALUATED ON SLEEP-EDF DATASETS WITH OVERALL ACCURACY (ACC.), MACRO F1-SCORE (MF1), COHEN'S KAPPA (κ) AND PER-CLASS F1-SCORE. THE BEST PERFORMANCE METRICS FOR EACH DATASET ARE INDICATED IN BOLD.

TABLE IX
COMPARISON AMONG OUR METHODS USING RAW SINGLE CHANNEL FPZ-CZ, EVALUATED ON SLEEP-EDF DATASETS WITH OVERALL ACCURACY (ACC.), MACRO F1-SCORE (MF1), COHEN'S KAPPA (κ), WEIGHTED-AVERAGING F1-SCORE (F1) AND PER-CLASS F1-SCORE. THE BEST PERFORMANCE METRICS FOR EACH DATASET ARE INDICATED IN BOLD.