Intra- and Inter-epoch Temporal Context Network (IITNet) Using Sub-epoch Features for Automatic Sleep Scoring on Raw Single-channel EEG

Abstract: A deep learning model, named IITNet, is proposed to learn intra- and inter-epoch temporal contexts from raw single-channel EEG for automatic sleep scoring. To classify the sleep stage of a half-minute EEG segment, called an epoch, sleep experts look for sleep-related events and consider the transition rules between the events they find. Similarly, IITNet extracts representative features at a sub-epoch level with a residual neural network and captures intra- and inter-epoch temporal contexts from the sequence of features via a bidirectional LSTM. The performance was investigated on three datasets as the sequence length (L) increased from one to ten. IITNet achieved performance comparable to other state-of-the-art results. The best accuracy, MF1, and Cohen's kappa (κ) were 83.9%, 77.6%, 0.78 for SleepEDF (L=10), 86.5%, 80.7%, 0.80 for MASS (L=9), and 86.7%, 79.8%, 0.81 for SHHS (L=10), respectively. Even with only four epochs, the performance remained comparable. Compared to using a single epoch, on average, accuracy and MF1 increased by 2.48%p and 4.90%p, and the F1 of N1, N2, and REM increased by 16.1%p, 1.50%p, and 6.42%p, respectively. Above four epochs, the performance improvement was not significant. The results support that considering the latest two minutes of raw single-channel EEG can be a reasonable choice for sleep scoring via deep neural networks with efficiency and reliability. Furthermore, the experiments with the baselines showed that introducing intra-epoch temporal context learning with a deep residual network contributes to the improvement in overall performance and has a positive synergy with inter-epoch temporal context learning.


I. INTRODUCTION
Sleep scoring, also known as "sleep stage classification" and "sleep stage identification," is essential for the diagnosis and treatment of sleep disorders [1]. Many individuals suffering from sleep disorders are at risk of underlying health problems [2]. Typical sleep disorders (e.g., sleep apnea, narcolepsy, and sleepwalking) can be diagnosed via polysomnography (PSG) [3], the gold standard of sleep scoring. PSG is based on the biosignals of body functions such as brain activity (electroencephalogram, EEG), eye movement (electrooculogram, EOG), heart rhythm (electrocardiogram, ECG), and muscle activity of the chin, face, or limbs (electromyogram, EMG). These recorded signals are analyzed by trained human experts, who label each 20- or 30-second segment of PSG data (an "epoch") with its corresponding sleep stage. The sleep stages are classified into wakefulness (W), rapid eye movement (REM), and non-REM (NREM) following the Rechtschaffen and Kales (R&K) rules [4], [5] or the American Academy of Sleep Medicine (AASM) rules [6]. For the AASM rules, NREM is further divided into three stages, referred to as S1, S2, and S3 or N1, N2, and N3. To draw a whole-night hypnogram showing the sleep stage as a function of sleep time, experts must visually inspect all epochs and label their sleep stages. This manual sleep scoring is labor-intensive and time-consuming [7]-[9]. Thus, automatic sleep scoring for healthcare and well-being is in high demand.
Many sleep scoring methods for automatic analysis of PSG data have been proposed. In particular, handcrafted feature extraction techniques have been widely used [10]. The features can be extracted from time-domain signals [11], [12], frequency/time-frequency domain data [3], [13]-[16], or non-linear parameters [17]-[19], and have been analyzed using fuzzy classification [3], decision trees [17], random forest algorithms [11], [13], [14], [16], and support vector machines [12], [18], [20]. In some cases, multi-channel or multi-modality data have been used [7], [18]. Such previous studies have shown that the application of machine learning to handcrafted features is effective for automatic sleep scoring. However, these approaches may require additional manual tuning to analyze PSG data obtained from different recording environments, as the features are hand-engineered for a specific PSG system and the available datasets [21].
Deep learning has been adopted to score sleep stages from PSG data automatically. When human sleep experts score the sleep stage of an epoch, they generally look for sleep-related events (such as K-complexes, sleep spindles, or frequency components: alpha, beta, delta, and theta activities) in the epoch. Then, they analyze the relations between the sleep-related events and the sleep stages in its neighboring epochs. Thus, they inspect the data at both intra- and inter-epoch levels. In [21]-[28], features were extracted by convolutional neural networks (CNNs). Some deep neural networks have learned features from multi-channel or multi-modality data to improve sleep scoring performance [23], [29], [30]. Recently, recurrent neural networks (RNNs) have been adopted to consider transition rules such as those of the AASM manual and to learn temporal information from the sequence of epochs [7], [21]-[23], [25], [31]-[35]. For instance, an epoch can be labeled N2 when K-complexes or sleep spindles exist in the last half of its preceding epoch. Supratak et al. used RNNs to consider the inter-epoch temporal context between epoch-wise features individually extracted from each epoch by CNNs [21]. Their results showed that consideration of the transition rules by analyzing the inter-epoch temporal context is essential for automatic sleep scoring. However, this model required two-step training and considered only the inter-epoch temporal context. Phan et al. [31] introduced RNNs to analyze the temporal context of the representative features extracted from sub-epochs at both the intra- and inter-epoch levels. Their results showed that considering both the intra- and inter-epoch temporal context is effective in improving the performance of sleep scoring. Although the model achieved state-of-the-art performance, it used multi-channel signals and required data preprocessing via the short-time Fourier transform to extract the time-frequency domain data.
In this paper, the intra- and inter-epoch temporal context network (IITNet) is proposed for automatic sleep scoring on raw single-channel EEG. IITNet encodes each sub-epoch of multiple EEG epochs into the corresponding representative feature and analyzes their temporal context at both intra- and inter-epoch levels. IITNet is an end-to-end deep learning model based on one-step training without data preprocessing such as short-time Fourier transform or filterbank processing. A modified deep residual neural network (ResNet-50) [36], [37] is used to extract representative features from each epoch (half-minute EEG) at a sub-epoch level. Two layers of bidirectional long short-term memory (BiLSTM) [38] are used to learn the temporal context of the representative features in the forward and backward directions. The performance was investigated on three public datasets: SleepEDF, Montreal Archive of Sleep Studies (MASS), and Sleep Heart Health Study (SHHS). In addition, the influence of the number of input epochs, i.e., the sequence length (L), was investigated by increasing the sequence length from one to ten.

II. IITNET: INTRA- AND INTER-EPOCH TEMPORAL CONTEXT NETWORK

A. Model Overview
IITNet is designed to classify time-series data by extracting representative features at the sub-epoch level and analyzing their temporal context. In this study, the model is applied to sleep stage classification from raw single-channel EEG. When human sleep experts label each half-minute PSG segment (target epoch) with its corresponding sleep stage, they visually inspect the frequency characteristics and the sleep-related events such as spindles and K-complexes. They also consider the sleep-related events in its neighboring epochs to check whether the relations of the events correspond to the transition rules [25]. Similarly, IITNet learns the sleep-related events by extracting representative features at the sub-epoch level and scores the sleep stage of the target epoch by capturing the contextual information from the sequence of the features.
IITNet is a convolutional recurrent neural network (CRNN) [39] and consists of two main parts: CNN and RNN layers, as shown in Fig. 1. The CNN layers learn the representative features associated with the sleep-related events from the EEG. To train the deep CNN effectively, a modified ResNet-50 [36] is used since its skip connections ensure that the higher layers can perform at least as well as the lower layers [37]. For this reason, ResNet has been widely used in deep networks as a promising feature extractor [40]-[42].
In the RNN layers, two layers of bidirectional LSTM (BiLSTM) are employed to capture the temporal context from the representative features in both the forward and backward directions [43]-[45]. At the top of the model, a softmax classifier is placed to output the most appropriate sleep stage. Specifically, IITNet disassembles each half-minute epoch into l overlapped sub-epochs and encodes each sub-epoch as its corresponding representative feature, as shown in Fig. 2. In the CNN layers, the epoch is converted to feature maps. Each column of the feature maps represents the sub-epoch feature corresponding to its receptive field. These l sub-epoch features are stacked from left to right in chronological order, and then a feature sequence is created for each epoch. The RNN layers analyze the temporal relation between the sub-epoch features.
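The link between feature-map columns and overlapped sub-epochs can be illustrated with standard receptive-field arithmetic for stacked one-dimensional layers. The following is a minimal sketch: the layer stack, the 100 Hz sampling rate, and the function name are hypothetical and do not reproduce the configuration in Table III.

```python
def feature_sequence_geometry(n_samples, layers):
    """Return (length, receptive_field, hop) after a stack of 1-D layers.

    layers: iterable of (kernel_size, stride, padding) tuples.
    length          number of feature-map columns (sub-epochs, l)
    receptive_field input samples seen by one column (sub-epoch width)
    hop             input samples between neighbouring columns
    """
    length, receptive_field, hop = n_samples, 1, 1
    for kernel, stride, padding in layers:
        length = (length + 2 * padding - kernel) // stride + 1
        receptive_field += (kernel - 1) * hop
        hop *= stride
    return length, receptive_field, hop


# Hypothetical three-layer stack (assumption, not the paper's network).
toy_layers = [(7, 2, 3), (3, 2, 1), (3, 2, 1)]
# A 30 s epoch at an assumed 100 Hz sampling rate.
length, receptive_field, hop = feature_sequence_geometry(3000, toy_layers)
```

In this toy case each column sees 19 input samples while neighbouring columns are only 8 samples apart, so consecutive sub-epochs overlap, which is the situation described above.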
For IITNet, the input can be either a single epoch or a series of successive epochs. Using only the target epoch as the input, the intra-epoch temporal context can be analyzed at the sub-epoch level. To consider both the intra- and inter-epoch temporal context, the sequence of the target epoch and its neighboring epochs should be fed together. In this study, the target epoch and its previous epochs are used as the input, which is practical for real-time sleep scoring in a smart bed or hospital bed because future epochs cannot be measured in advance. This is also clinically reasonable since human experts generally investigate the previous epochs to follow the AASM transition rules.

B. Intra-epoch Temporal Context Learning
For intra-epoch temporal context learning, IITNet takes a target epoch x as an input to model the conditional probability p(y|x), where y is the true sleep stage. In the CNN layers, as shown in Fig. 1, IITNet extracts the representative features f from the sub-epochs and produces a feature sequence F, which contains the sub-epoch features in chronological order as follows:
$$F = \{f_1, f_2, \cdots, f_l\},$$
where $f_i$ is the representative feature vector of the i-th sub-epoch, and l is the number of sub-epochs. The length of each feature vector is u, which is the number of filters in the last CNN layer. Note that the learnable parameters of the CNN layers are shared across the sub-epochs. Through backpropagation in training, the parameters are updated with the average of the gradients computed for the sub-epochs.
In the RNN layers, the BiLSTM has hidden states of dimension u for each direction. The two BiLSTM layers process each feature vector together with the previous hidden state in both the forward and backward directions, yielding the internal hidden states. To predict y, the last hidden states of the two directions are concatenated to form the bidirectional context $\overleftrightarrow{h}$ of size 2u. This concatenated vector is fed into the fully connected layer to output p(y|x) for the target epoch. Finally, the softmax classifier labels the epoch with the most likely sleep stage.
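The classification step described above can be written compactly as follows. This is a sketch under stated assumptions: the symbols $\overrightarrow{h}$ and $\overleftarrow{h}$ (last hidden states of the forward and backward passes), the weight matrix $W$, the bias $b$, and the predicted label $\hat{y}$ are notation introduced here for illustration, and $N_c$ is the number of sleep stages.

```latex
\[
\overleftrightarrow{h} = \bigl[\,\overrightarrow{h}\,;\,\overleftarrow{h}\,\bigr] \in \mathbb{R}^{2u},
\qquad
p(y \mid x) = \operatorname{softmax}\bigl(W\,\overleftrightarrow{h} + b\bigr),
\qquad
\hat{y} = \arg\max_{c \in \{1,\dots,N_c\}} p(y = c \mid x),
\]
\[
W \in \mathbb{R}^{N_c \times 2u}, \qquad b \in \mathbb{R}^{N_c}.
\]
```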

C. Intra-and Inter-epoch Temporal Context Learning
To consider the inter-epoch dependency at the sub-epoch level, IITNet takes the target epoch and its previous epochs as an input. The model scores the target epoch based on the temporal context in a series of feature sequences after encoding the target epoch and its previous L − 1 epochs at the sub-epoch level. Therefore, L is the number of epochs in the input. Formally, IITNet is trained to model the following conditional probability:
$$p(y_L \mid X_L),$$
where $X_L = \{x_1, x_2, \cdots, x_L\}$ is a sequence of successive epochs, $x_L$ is the target epoch, $x_1, x_2, \cdots, x_{L-1}$ are the previous epochs, and $y_L$ is the true sleep stage of the target epoch. Since predicting the latest sleep stage is the practical goal in real-time sleep scoring, the sleep stage of the latest epoch is chosen as the true label.
In the CNN layers, IITNet individually extracts the feature sequence for each epoch, as shown in Fig. 1. In other words, the CNN layers take the i-th epoch $x_i$ and produce a corresponding feature sequence $F_i$. At the top of the convolution layers, a series of feature sequences $S_L$ is created as follows:
$$S_L = \{F_1, F_2, \cdots, F_L\} = \{f_{1,1}, \cdots, f_{1,l}, f_{2,1}, \cdots, f_{2,l}, \cdots, f_{L,1}, \cdots, f_{L,l}\},$$
where $f_{i,1}, \cdots, f_{i,l}$ are the sub-epoch feature vectors corresponding to $F_i$. Accordingly, $S_L$ includes all the representative features from the sub-epochs in chronological order. The RNN layers therefore take this series of feature sequences as input, and, together with the softmax classifier, process $S_L$ in the same way as in the intra-epoch temporal context learning.
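The construction of the series of feature sequences can be sketched with plain Python lists. The sizes and the function name below are hypothetical toy values chosen for illustration; in the model, each feature vector would come from the shared CNN layers.

```python
def stack_feature_sequences(feature_sequences):
    """Concatenate per-epoch feature sequences F_1..F_L in chronological order.

    feature_sequences: list of L sequences, each a list of l feature vectors.
    Returns one sequence of L * l feature vectors.
    """
    stacked = []
    for sequence in feature_sequences:
        stacked.extend(sequence)
    return stacked


# Toy sizes (assumptions): L epochs, l sub-epochs per epoch, u features each.
L, l, u = 4, 3, 2
# Each dummy feature vector is filled with its chronological index.
F = [[[float(i * l + j)] * u for j in range(l)] for i in range(L)]
S = stack_feature_sequences(F)
```

The recurrent layers then see a single sequence of L·l steps, so the same BiLSTM covers both the intra-epoch and the inter-epoch context.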

A. Datasets
To evaluate the sleep scoring performance of IITNet, three public datasets containing PSG records and their corresponding sleep stages labeled by human sleep experts were used: SleepEDF [47], [48], MASS [29], and SHHS [49]. Table I lists the number of epochs in the datasets for the sleep stages and Table II

1) SleepEDF: The SleepEDFx dataset (2013 version) contains two types of PSG records: SC for 20 healthy subjects without sleep-related disorders and ST for 22 subjects of a study on Temazepam effects on sleep. Each record includes two-channel EEGs from the Fpz-Cz and Pz-Oz channels, a single-channel EOG, and a single-channel EMG. Each half-minute epoch is labeled as one of eight classes (W, REM, N1, N2, N3, N4, MOVEMENT, UNKNOWN) according to the R&K rules. In this study, the single-channel EEGs (Fpz-Cz) in the SC records (average subject age: 28.7 ± 2.9 years) were used since Fpz-Cz has shown higher performance than Pz-Oz with deep-learning-based approaches [21], [25]. As the class W group was quite large compared to the others, only sixty epochs (thirty minutes) before and after the sleep period were used [21].
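The trimming of wake epochs around the sleep period can be sketched as below. This is an illustration under an assumption: the sleep period is taken to run from the first to the last non-W epoch, which the text does not spell out, and the function name is hypothetical.

```python
def trim_wake(labels, margin=60, wake="W"):
    """Return (start, stop) slice bounds keeping the sleep period plus margins.

    The sleep period is assumed to span the first to the last non-wake epoch;
    up to `margin` epochs (60 epochs = 30 min) are kept on each side.
    """
    sleep_idx = [i for i, stage in enumerate(labels) if stage != wake]
    if not sleep_idx:
        return 0, len(labels)  # no sleep found: keep the record unchanged
    start = max(0, sleep_idx[0] - margin)
    stop = min(len(labels), sleep_idx[-1] + 1 + margin)
    return start, stop


# Toy record: long wake periods around a short sleep period.
labels = ["W"] * 100 + ["N1", "N2", "N2", "REM"] + ["W"] * 200
start, stop = trim_wake(labels)
kept = labels[start:stop]
```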
2) MASS: The MASS dataset includes the PSG records of 200 subjects in five subsets: SS1, SS2, SS3, SS4, and SS5. These subsets are grouped according to the research and acquisition protocols. The dataset contains twenty-channel EEG, two-channel EOG, three-channel EMG, and single-channel ECG. Each epoch is labeled as one of five classes (W, REM, N1, N2, N3) according to the AASM rules. In this

3) SHHS: The SHHS dataset comes from a multi-center cohort study investigating the effect of sleep-disordered breathing on cardiovascular diseases. The dataset consists of two rounds of PSG records: Visit 1 (SHHS-1) and Visit 2 (SHHS-2). Each record includes two-channel EEGs, two-channel EOGs, single-channel EMG, single-channel ECG, two-channel respiratory inductance plethysmography, position sensor data, light sensor data, pulse oximeter data, and airflow sensor data. In this study, the single-channel EEGs (C4-A1) in 5,791 records of SHHS-1 were used. Each epoch is scored as one of W, REM, N1, N2, N3, or N4 using the R&K rules. More details are described in [50]. Note that some epochs in the datasets were labeled as MOVEMENT or UNKNOWN. They were excluded in this study because their prediction is beyond the scope of sleep stage classification. The N3 and N4 stages were merged into a single N3 stage according to the AASM rules.
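The label harmonization described above (dropping MOVEMENT and UNKNOWN epochs and merging N4 into N3) can be sketched as follows. The label strings and the function name are illustrative assumptions; the actual annotation codes differ between datasets.

```python
# Mapping from R&K-style labels to the five classes used in this study.
STAGE_MAP = {"W": "W", "N1": "N1", "N2": "N2", "N3": "N3", "N4": "N3", "REM": "REM"}
EXCLUDED = {"MOVEMENT", "UNKNOWN"}


def harmonize_labels(labels):
    """Drop excluded epochs and merge N4 into N3.

    Returns the indices of the kept epochs and their harmonized labels,
    so that the matching EEG epochs can be selected with the same indices.
    """
    kept_idx, kept_labels = [], []
    for i, stage in enumerate(labels):
        if stage in EXCLUDED:
            continue
        kept_idx.append(i)
        kept_labels.append(STAGE_MAP[stage])
    return kept_idx, kept_labels


raw = ["W", "N1", "MOVEMENT", "N3", "N4", "UNKNOWN", "REM"]
idx, stages = harmonize_labels(raw)
```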

B. Baselines
A direct comparison between sleep scoring methods is not straightforward because the results depend on how each dataset is processed, especially on whether only the in-bed or lights-out parts of the recordings are used. For example, including more segments with the WAKE label can raise the overall accuracy because sleep scoring methods usually perform better on WAKE segments than on other segments. The training method also affects performance. To fairly verify the effectiveness of the deep residual network [36] and intra-epoch temporal context learning for sleep scoring, baseline experiments were conducted by modifying DeepSleepNet with an end-to-end (E2E) approach. First, E2E-DeepSleepNet was used to compare performance in terms of model architecture, excluding the influence of training methods such as pre-training and fine-tuning. Second, E2E-IntraDeepSleepNet was used to confirm whether intra-epoch temporal context learning is effective in a shallow network. Both baselines were trained and evaluated on the same three datasets (SleepEDF, MASS, and SHHS) with the same preprocessing and training conditions for sequence lengths (L) of 1, 4, and 10, as shown in Fig. 3.

1) End-to-End DeepSleepNet (E2E-DeepSleepNet): We used DeepSleepNet as a deep learning baseline [21] and implemented it with the same architecture and parameters as in the original paper. DeepSleepNet, which performed well in scoring the sleep stage of the target epoch from a sequence of single-channel EEG signals, utilizes two parallel CNNs with small and large filters to extract time-invariant features and uses a bidirectional LSTM to consider the sleep-stage transitions. For comparison under the same conditions, the model was trained in an end-to-end manner, like IITNet, instead of using the two-step training algorithm that fine-tunes the model on sequential whole-night epochs after pre-training the CNN parts. This end-to-end DeepSleepNet is similar to the model tested in [31].
2) End-to-End Intra-epoch Temporal Context DeepSleepNet (E2E-IntraDeepSleepNet): To evaluate the effectiveness of intra-epoch temporal context learning, we modified the end-to-end DeepSleepNet baseline by introducing the intra-epoch temporal context learning described in Sections II-B and II-C. This end-to-end intra-epoch temporal context DeepSleepNet first extracts the sleep-related features from two CNN branches. Interpolation is performed after the CNN branch with larger filters to make the number of sub-epochs of the two CNN branches equal. Thereafter, the features from the two branches are concatenated, and two convolutional layers are applied channel-wise to halve the length of the feature vector and form a feature sequence. In this way, sleep-related features can be analyzed at the sub-epoch level in the RNN layers. Whereas IITNet consists of a ResNet-50 with residual connections and 49 convolutional layers, the IntraDeepSleepNet baseline uses a relatively shallow network with two parallel CNNs of four convolutional layers.
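The alignment of the two CNN branches can be illustrated with a small resampling routine. This is only a sketch: the text does not state which interpolation is used, so linear interpolation with aligned endpoints is an assumption, and all names and sizes are hypothetical.

```python
def resample_sequence(seq, target_len):
    """Linearly interpolate a list of feature vectors to target_len steps.

    Assumption: the first and last steps of the input are mapped onto the
    first and last steps of the output.
    """
    n = len(seq)
    if n == 1 or target_len == 1:
        return [list(seq[0]) for _ in range(target_len)]
    out = []
    for k in range(target_len):
        pos = k * (n - 1) / (target_len - 1)  # fractional index into seq
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(seq[lo], seq[hi])])
    return out


# Toy branches: the small-filter branch has 5 steps, the large-filter branch 3.
small_branch = [[1.0] for _ in range(5)]
large_branch = [[0.0], [2.0], [4.0]]
large_resampled = resample_sequence(large_branch, len(small_branch))
# Concatenate the two branches feature-wise at every sub-epoch step.
fused = [s + r for s, r in zip(small_branch, large_resampled)]
```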

C. Model Specifications
As described in Table III, to handle one-dimensional time-series EEG, one-dimensional operations (convolution, max-pooling, and batch normalization) were used in the modified ResNet-50 [36] instead of the original two-dimensional operations. Furthermore, an additional max-pool layer was placed between the conv3_x and conv4_x layers to halve the feature sequence length. The global average pooling layer was excluded, and a dropout layer (p = 0.5) was added to the end of the CNN layers to prevent overfitting. In the RNN layers, two BiLSTM layers were adopted. The hidden state size of the BiLSTM in each direction was set to u = 128, which corresponds to the number of filters in the last convolutional layer.

D. Input for IITNet
The sequence length (L), the number of epochs given as an input, is adjustable in IITNet. To investigate the influence of the sequence length, training and evaluation were conducted while increasing the sequence length from one to ten. This range was chosen in light of the experimental result that there was no significant difference among sequence lengths of 10, 20, and 30 [31]. Although some studies used multi-channel input, this study uses only single-channel EEG so that the model remains applicable to a wide range of time-series data. Multi-channel input is also possible via an ensemble model based on IITNet, which can show the sensitivity of each channel for each class. Since the channel characteristics are beyond the scope of this study, multi-channel input is not dealt with here.
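How input sequences of length L might be assembled from one night's record is sketched below. It pairs each target epoch with its L − 1 preceding epochs and uses the label of the latest epoch. Skipping the first L − 1 epochs of a record, which lack enough history, is an assumption; the text does not describe how they are handled.

```python
def make_sequences(epochs, labels, seq_len):
    """Build (window, target_label) pairs from one recording.

    Each window holds the target epoch and its seq_len - 1 previous epochs,
    in chronological order; the label is that of the last (target) epoch.
    """
    samples = []
    for t in range(seq_len - 1, len(epochs)):
        window = epochs[t - seq_len + 1 : t + 1]
        samples.append((window, labels[t]))
    return samples


# Toy record: integers stand in for 30 s EEG epochs.
epochs = list(range(6))
labels = ["W", "N1", "N2", "N2", "N3", "REM"]
samples_l4 = make_sequences(epochs, labels, seq_len=4)
samples_l1 = make_sequences(epochs, labels, seq_len=1)
```

With `seq_len=1` the windows reduce to single epochs, which corresponds to the intra-epoch-only setting.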

E. Model Training
The model was trained on the raw signals without any preprocessing, using the Adam optimizer [51] with lr = 0.005, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. To avoid overfitting, L2 weight regularization was applied with wr = $10^{-6}$. No class-balancing methods were applied in data processing or model training, although state-of-the-art algorithms use balanced-class sampling [21], [25] or a class-balanced loss [24] to mitigate class imbalance. In all the experiments, the batch sizes were 256, 128, and 256 for SleepEDF, MASS, and SHHS, respectively. Early stopping was implemented by tracking the validation cost, such that training was stopped when the validation cost had not improved for ten successive training steps. For each cross-validation fold, the model that achieved the best validation accuracy was used for evaluation on the test set. The training process was implemented in Python 3.5.0 and PyTorch 0.4.0 [52]. On an RTX 2080 Ti, the total training time of IITNet was approximately 10 h (SleepEDF) to 15 h (MASS), i.e., approximately 30 min for each fold. For SHHS, the total training time was approximately 10 h. A single forward pass of IITNet took 11.7, 35.1, and 79.4 ms for sequence lengths (L) of 1, 4, and 10, respectively, averaged over 2,000 trials.
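The early-stopping rule can be sketched independently of any deep learning framework. The class below is a hypothetical helper, not the authors' implementation; it only tracks the validation cost and signals when it has not improved for `patience` successive checks. Selecting the checkpoint by best validation accuracy is a separate step that is not shown.

```python
class EarlyStopping:
    """Stop when the validation cost has not improved for `patience` checks."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.bad_steps = 0

    def step(self, val_cost):
        """Record one validation cost; return True when training should stop."""
        if val_cost < self.best:
            self.best = val_cost
            self.bad_steps = 0
        else:
            self.bad_steps += 1
        return self.bad_steps >= self.patience


# Toy run with a short patience so the stop is easy to follow.
stopper = EarlyStopping(patience=3)
stopped_at = None
for step, cost in enumerate([1.0, 0.9, 0.95, 0.92, 0.91, 0.5]):
    if stopper.step(cost):
        stopped_at = step
        break
```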

F. Model Evaluation
For SleepEDF and MASS, k-fold cross-validation was conducted. When the number of subjects in a dataset was $N_s$, $N_s/k$ records were used for model evaluation, and the other records were split into training and validation data. The test-set subjects were sequentially changed by repeating this process k times so that the evaluation was performed over all subjects. Specifically, 20-fold cross-validation was conducted for SleepEDF. In each fold, the numbers of subjects for training, validation, and evaluation were 15, 4, and 1, respectively. For MASS, 31-fold cross-validation was performed. In each fold, the records of two randomly selected subjects that did not overlap with the other folds were used as the test set, and the remaining records were divided into training (45 subjects) and validation (15 subjects) sets. For SHHS, the subjects were randomly divided into training, validation, and test sets in the ratio 5:2:3, as performed in [22].
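A subject-wise split of this kind can be sketched as follows. The function name, the fixed seed, and the random choice of validation subjects are assumptions made for illustration; the subject counts follow the text (20 subjects for SleepEDF, and 2 + 45 + 15 = 62 subjects implied for MASS).

```python
import random


def subject_folds(subjects, k, n_val, seed=0):
    """Subject-wise k-fold split with disjoint test sets.

    Assumes len(subjects) is divisible by k. Validation subjects are drawn
    at random from the non-test subjects of each fold (an assumption).
    """
    rng = random.Random(seed)
    order = list(subjects)
    rng.shuffle(order)
    test_size = len(order) // k
    folds = []
    for f in range(k):
        test = order[f * test_size : (f + 1) * test_size]
        rest = [s for s in order if s not in test]
        rng.shuffle(rest)
        folds.append({"train": rest[n_val:], "val": rest[:n_val], "test": test})
    return folds


sleepedf_folds = subject_folds(range(20), k=20, n_val=4)   # 15 / 4 / 1
mass_folds = subject_folds(range(62), k=31, n_val=15)      # 45 / 15 / 2
```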
The IITNet performance was assessed according to the following criteria: per-class precision (PR), per-class recall (RE), per-class F1 score (F1), overall accuracy (Acc.), macro-averaging F1 score (MF1), and Cohen's kappa coefficient (κ) [53], [54]. These criteria are computed from the confusion matrix, where $e_{ij}$ is the element in the i-th row and j-th column of the confusion matrix and $N_c$ is the number of sleep stages (five in this study). PR represents the precision with which the model distinguishes the sleep stage from the other stages. RE represents the accuracy with which the model predicts the sleep stage. Overall accuracy is the ratio of the correct predictions to the total predictions, which is an intuitive performance measure. Since F1 is the harmonic mean of PR and RE, it can be more informative than overall accuracy, especially in the case of an imbalanced class distribution such as that of the sleep stages in PSG. The average of the per-class F1 scores corresponds to MF1, and κ indicates the agreement between the human expert (truth) and IITNet (prediction) in sleep scoring.
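The criteria above can be computed from a confusion matrix with their standard definitions, as in the following sketch. The orientation of the matrix (rows as expert labels, columns as predictions) is an assumption, and the function name and toy matrix are hypothetical.

```python
def scoring_metrics(cm):
    """Standard metrics from a square confusion matrix.

    Assumption: cm[i][j] counts epochs whose true stage is i and whose
    predicted stage is j.
    """
    n_c = len(cm)
    total = sum(sum(row) for row in cm)
    row_sum = [sum(cm[i]) for i in range(n_c)]                        # true counts
    col_sum = [sum(cm[i][j] for i in range(n_c)) for j in range(n_c)]  # predicted counts
    diag = [cm[i][i] for i in range(n_c)]
    pr = [diag[i] / col_sum[i] if col_sum[i] else 0.0 for i in range(n_c)]
    re = [diag[i] / row_sum[i] if row_sum[i] else 0.0 for i in range(n_c)]
    f1 = [2 * p * r / (p + r) if p + r else 0.0 for p, r in zip(pr, re)]
    acc = sum(diag) / total
    mf1 = sum(f1) / n_c
    # Chance agreement for Cohen's kappa.
    p_e = sum(row_sum[i] * col_sum[i] for i in range(n_c)) / total ** 2
    kappa = (acc - p_e) / (1 - p_e)
    return {"PR": pr, "RE": re, "F1": f1, "Acc": acc, "MF1": mf1, "kappa": kappa}


# Toy two-class example; the study itself uses five sleep stages.
metrics = scoring_metrics([[40, 10], [20, 30]])
```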

A. Performance Comparison to State-of-the-Art Models
A hypnogram predicted by IITNet for one of the subjects is presented in Fig. 4, in which the model predictions are in good agreement with the human expert's scores. Specifically, Table IV lists the performance of IITNet and the state-of-the-art models with the model architectures, approaches, input channel types, subject numbers, input types for the deep learning models, and the number of epochs simultaneously input for scoring the sleep stage of the target epoch. The baseline of IITNet is the case in which the sequence length (L) is one. For all the datasets, IITNet achieved performance comparable to the state-of-the-art models using single-channel EEG, even though no preprocessing was used, the sequence length was shorter, and only the target epoch and its previous epochs were considered. The best overall accuracy, MF1, and κ are 83.9%, 77.6%, 0.78 for SleepEDF (L=10), 86.5%, 80.7%, 0.80 for MASS (L=9), and 86.7%, 79.8%, 0.81 for SHHS (L=10), respectively. The differences in overall accuracy, MF1, and κ between the best IITNet results and the other state-of-the-art models are +1.89%p, +0.73%p, +0.018 for SleepEDF, +0.28%p, −1.05%p, −0.004 for MASS, and −0.06%p, +1.32%p, −0.003 for SHHS, respectively. According to [31], SeqSleepNet-30 used the spectrograms of three channels (EEG, EOG, EMG) as the input, and its overall accuracy, MF1, and κ were 87.1%, 83.3%, 0.815 for MASS. Although IITNet used the raw EEG signals instead of spectrogram images, the performance was observed to be similar. The modified ResNet-50 could effectively learn the representative features related to the sleep events at the sub-epoch level. The features could then be analyzed by the two-layer BiLSTM at both the intra- and inter-epoch levels, which contributed to learning the transition rules that human experts consider.
Compared to the other state-of-the-art models, the main advantages of IITNet are efficiency and adjustability, which stem from using single-channel raw EEG and a controllable input length. The proposed model achieved performance comparable to the state of the art using raw single-channel EEG, although the other studies used multi-channel EEGs or spectrograms instead of the raw signal. Furthermore, the input length of the proposed algorithm can be adjusted for diverse application purposes. This makes the model more practical than the other algorithms. Table IV supports these strong points of the proposed algorithm. It is also feasible to score the sleep stages automatically in real time via IITNet with a single-channel EEG sensor, which can be one of the essential technologies for healthcare 4.0, especially for the next-generation treatment and diagnosis of sleep disorders. Furthermore, IITNet can be directly applied to classify various kinds of time-series data since the model is an end-to-end architecture without pretraining or preprocessing. On the other hand, it should be noted that the sampling frequency of the input cannot be changed after the model is trained. To score sleep stages from a signal with a different sampling frequency, preprocessing such as downsampling or upsampling is required. The compared models of other studies have the same limitation. To solve this, further study is necessary with datasets of various sampling frequencies.

B. Influence of the Sequence Length (L)
The influence of the sequence length (L) on the performance is shown in Fig. 5. For all the datasets, similar variation patterns can be seen. In Figs. 5a-5c, overall accuracy, MF1, and κ consistently increase until L is four. After that, they oscillate with L. The larger the dataset (the number of subjects), the smaller the fluctuation. The performance dropped in SleepEDF when the sequence length was 5 and 6. We think that the relatively small size of SleepEDF caused the drop since models trained on a small dataset are more likely to show high variance [55]. Moreover, the features of sleep stages more than two minutes (four epochs) in the past may be more difficult to characterize. With insufficient training data, it is also hard for the model to learn the features from further back that affect the current sleep stage. The results show that this difficulty can be overcome with a longer sequence length or larger datasets. Using four epochs (two minutes) as an input, the performance was still comparable to the state-of-the-art results for the three datasets (SleepEDF, MASS, and SHHS) in terms of overall accuracy (Acc.: 83.6%, 86.2%, 86.3%), macro F1-score (MF1: 76.5%, 80.0%, 78.8%), and Cohen's kappa (κ: 0.77, 0.79, 0.81). The differences in overall accuracy, MF1, and κ between IITNet (L=4) and the state-of-the-art models are +1.58%p, −0.41%p, +0.013 for SleepEDF, +0.01%p, −1.73%p, −0.007 for MASS, and −0.46%p, +0.35%p, −0.009 for SHHS, respectively. When L increases from one to four, overall accuracy and MF1 improve by 2.99%p, 4.36%p for SleepEDF, 1.73%p, 3.32%p for MASS, and 2.74%p, 7.03%p for SHHS, respectively. However, when L increases from four to ten, these two metrics increase by only 0.32%p, 1.14%p for SleepEDF, 0.13%p, 0.51%p for MASS, and 0.39%p, 0.97%p for SHHS, respectively. These results support that considering the last two minutes of epochs can be a reasonable choice to predict the sleep stage with efficiency and reliability. For SeqSleepNet, which uses spectrograms of multi-channel PSG [31], the performance likewise showed no significant difference when the sequence length was set to 10, 20, and 30. Other state-of-the-art models can be made more compact, and their prediction faster, by reducing the input length as recommended in this study.
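As a quick arithmetic check, the per-dataset gains quoted above can be averaged over the three datasets. The sketch below uses only the rounded values from the text, so the averages can differ in the last digit from figures computed on unrounded results.

```python
# Gains in (overall accuracy, MF1) in %p, copied from the text above.
gain_1_to_4 = {"SleepEDF": (2.99, 4.36), "MASS": (1.73, 3.32), "SHHS": (2.74, 7.03)}
gain_4_to_10 = {"SleepEDF": (0.32, 1.14), "MASS": (0.13, 0.51), "SHHS": (0.39, 0.97)}


def mean_gain(gains):
    """Average the accuracy and MF1 gains over the datasets."""
    n = len(gains)
    acc = sum(a for a, _ in gains.values()) / n
    mf1 = sum(m for _, m in gains.values()) / n
    return acc, mf1


early_acc, early_mf1 = mean_gain(gain_1_to_4)   # about 2.49 and 4.90 %p
late_acc, late_mf1 = mean_gain(gain_4_to_10)    # about 0.28 and 0.87 %p
```

The first pair is close to the average gains of 2.48%p and 4.90%p quoted in the abstract, while the second pair shows how little is gained, on average, beyond four epochs.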

C. Per-class Performance Improvement
The F1 of N1, N2, and REM increases in a similar fashion to the overall performance, whereas the enhancement in the F1 of W and N3 is not apparent. In Figs. 5d-5h, the F1 of N1 and REM steadily increases until L is four. The F1 of N2 also increases until L is four, except for MASS at L = 4. After that, they oscillate with L. The confusion matrices for L of one, four, and ten are illustrated in Fig. 6. Averaged over the three datasets, using four epochs instead of a single epoch increased accuracy and MF1 by 2.47%p and 4.93%p, respectively. Notably, the F1 of N1, N2, and REM increased by 16.1%p, 1.50%p, and 6.42%p, respectively. The results mean that the overall performance improvement was attributable to the enhanced prediction of N1, N2, and REM. The AASM recommends that sleep experts consider the target and previous epochs, especially when labeling the sleep stage as N1 or REM. The results support that the transition rules were well learned by IITNet through intra- and inter-epoch context learning at the sub-epoch level. On the other hand, the lack of significant improvement in W and N3 indicates that they have less inter-epoch dependency than the other stages. According to the AASM, for the target epoch to be identified as W, N2, or N3, the sleep-related events or specific EEG signal activities of the target epoch itself should mainly be considered.
For further improvement in N1, adding a frequency-aware module to IITNet and increasing the sequence length can be considered in future work, as mixed-frequency activity in the range of 4-7.99 Hz is a characteristic of N1 [6]. Modifying IITNet into a sequence-to-sequence classification model with a multi-channel ensemble, as was done in [31], [32], is also possible for a more elaborate classification. However, introducing intra-epoch temporal context learning with a deep residual network for sleep scoring is the main focus of this article, and the suggested modifications are left for future work.
On the other hand, the results show that IITNet overcame the imbalanced nature of PSG datasets without any sampling method or learning technique. Typically, PSG datasets have an imbalanced class distribution; e.g., N1 accounts for less than 10% of the epochs, as shown in Table I. To mitigate this imbalance, previous studies have applied class-balancing methods [7], [21], [25], [33], [34]. Although these kinds of methods improved the performance on certain classes, the overall performance became worse [22]. IITNet shows that using a sequence of no more than ten epochs as the input enhanced both the overall and class-wise performance.

D. Performance Comparison to the baselines
Table V compares the performance of IITNet, E2E-DeepSleepNet, and E2E-IntraDeepSleepNet for sequence lengths L of 1, 4, and 10 on the SleepEDF, MASS, and SHHS datasets. It is worth noting that all experiments with the baselines were conducted using the same datasets and under the same conditions. In SleepEDF and MASS, introducing our proposed intra-epoch temporal context learning into E2E-DeepSleepNet tends to improve the sleep scoring performance for both single-epoch and multiple-epoch inputs. E2E-IntraDeepSleepNet performed better than E2E-DeepSleepNet when the input is a single epoch (L=1), with the margins in overall accuracy, MF1, and κ between the two models being +0.7%p, +0.7%p, and +0.01 for SleepEDF, and +0.6%p, +0.2%p, and +0.01 for MASS. For the multi-epoch input, the overall performance of E2E-IntraDeepSleepNet is also generally better than that of E2E-DeepSleepNet, with the differences in overall accuracy being +0.6%p (L=4) and +0.9%p (L=10) for SleepEDF, and −0.1%p (L=4) and +0.7%p (L=10) for MASS. In SHHS, the overall accuracy of E2E-IntraDeepSleepNet was similar to that of E2E-DeepSleepNet, i.e., +0.0%p (L=1) and −0.2%p (L=4), or better, with +0.7%p (L=10). This indicates that considering the intra-epoch temporal context by learning from sub-epoch-level features can lead to a performance gain in sleep scoring and has a synergistic effect with the inter-epoch temporal context learning.
In SHHS, exploiting a ResNet-50 for representation learning was a key factor in the improvement of the sleep scoring performance. The overall accuracy of IITNet was significantly higher for all sequence lengths than those of E2E-DeepSleepNet and E2E-IntraDeepSleepNet, with margins of +2.8%p and +2.8%p (L=1), +2.4%p and +2.6%p (L=4), and +1.7%p and +1.0%p (L=10), respectively. This shows that a deeper neural network (49 convolutional layers) with residual connections can yield a better sleep scoring performance than the shallow network of DeepSleepNet (4 convolutional layers) on a large-scale dataset. However, for SleepEDF and MASS, which are relatively small datasets compared with SHHS, employing a ResNet-50 did not always guarantee a performance improvement. Although IITNet in SleepEDF showed similar or higher overall accuracy compared with E2E-IntraDeepSleepNet, with margins of −1.0%p (L=1), +1.0%p (L=4), and +0.3%p (L=10), the overall accuracy of IITNet in MASS was lower than that of E2E-IntraDeepSleepNet, with margins of −0.3%p (L=1), −0.5%p (L=4), and −0.9%p (L=10). This suggests that using ResNet-50 as a feature extractor can enhance the sleep scoring performance when a sufficient number of epochs per subject is given (2,115 in SleepEDF), while a shallow network can be sufficient to learn sleep-related features when the number of epochs per subject is small (926 in MASS). Thus, the numbers of epochs and subjects should be considered when choosing the depth of the CNN in a sleep scoring network.
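The SHHS accuracy margins quoted in this subsection can be cross-checked against one another: the margin of IITNet over E2E-DeepSleepNet minus the margin of IITNet over E2E-IntraDeepSleepNet should equal the margin of E2E-IntraDeepSleepNet over E2E-DeepSleepNet. The short sketch below performs this arithmetic on the values stated in the text. The variable names are illustrative, and because the quoted margins are rounded to one decimal place, such an identity need not hold exactly in general.

```python
# Overall-accuracy margins on SHHS in percentage points (%p), as quoted in
# the text, keyed by the sequence length L.
iitnet_vs_deepsleepnet = {1: 2.8, 4: 2.4, 10: 1.7}
iitnet_vs_intra = {1: 2.8, 4: 2.6, 10: 1.0}
intra_vs_deepsleepnet = {1: 0.0, 4: -0.2, 10: 0.7}


def implied_margin(seq_len: int) -> float:
    """Margin of E2E-IntraDeepSleepNet over E2E-DeepSleepNet implied by the
    two IITNet margins, rounded to the one-decimal precision of the text."""
    return round(iitnet_vs_deepsleepnet[seq_len] - iitnet_vs_intra[seq_len], 1)


# True when the implied and the quoted baseline margins agree for every L.
consistent = all(
    implied_margin(seq_len) == intra_vs_deepsleepnet[seq_len]
    for seq_len in (1, 4, 10)
)
```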

V. CONCLUSIONS
Human sleep experts search for sleep-related events and consider the transition rules to score the sleep stage of an epoch. Motivated by this approach, a novel deep learning model named IITNet is proposed to score the sleep stage more accurately by considering the intra- and inter-epoch temporal contexts using raw single-channel EEG. A deep CNN based on a modified ResNet-50 extracts the sleep-related features, and an RNN consisting of a two-layered BiLSTM learns the transition rules. Using ten epochs or less as the input, IITNet achieved performance comparable to other state-of-the-art results for three public datasets: SleepEDF, MASS, and SHHS. The results show that the proposed temporal context learning at both the intra- and inter-epoch levels is effective for classifying time-series inputs. The overall performance was enhanced when the sequence length increased from one to four, which was attributed to the enhanced prediction of N1, N2, and REM. However, the improvement was not significant above four epochs. Using the target epoch and its three previous epochs, the overall performance was still comparable to state-of-the-art results, which supports that considering the last two minutes of raw single-channel EEG can be a reasonable option for predicting the sleep stage with efficiency and reliability. Other state-of-the-art models could also become more compact and faster to train by reducing the input length. Moreover, IITNet can be directly applied to predict or classify various kinds of time-series data for healthcare and well-being applications, since the model is based on an end-to-end architecture without pre-training or preprocessing tailored to sleep scoring.
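To make the input construction concrete, the following minimal sketch groups a recording into sequences of L consecutive epochs, each consisting of the target epoch and its L − 1 previous epochs, and pairs each sequence with the label of the target epoch. With L = 4 and half-minute epochs, every sequence spans the last two minutes of EEG. The function name make_sequences and the choice to skip the first L − 1 epochs of a recording (rather than pad them) are assumptions made for illustration and may differ from the actual IITNet data pipeline.

```python
from typing import List, Sequence, Tuple, TypeVar

Epoch = TypeVar("Epoch")  # e.g., an array of raw EEG samples for one epoch


def make_sequences(
    epochs: Sequence[Epoch], labels: Sequence[int], seq_len: int
) -> List[Tuple[List[Epoch], int]]:
    """Pair every target epoch with its (seq_len - 1) previous epochs.

    Assumption: targets without enough preceding epochs are skipped.
    """
    if seq_len < 1:
        raise ValueError("seq_len must be at least 1")
    if len(epochs) != len(labels):
        raise ValueError("epochs and labels must have the same length")

    samples = []
    for target in range(seq_len - 1, len(epochs)):
        # Window ends at the target epoch, so only past context is used.
        window = list(epochs[target - seq_len + 1 : target + 1])
        samples.append((window, labels[target]))
    return samples


EPOCH_SECONDS = 30  # half-minute epochs, as described in the abstract


def context_seconds(seq_len: int) -> int:
    """Duration of EEG covered by one input sequence."""
    return seq_len * EPOCH_SECONDS
```

For example, a recording of six epochs with L = 4 yields three sequences, the first of which covers epochs 0 to 3 and carries the label of epoch 3.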