A Temporal-Spectral Fused and Attention-Based Deep Model for Automatic Sleep Staging

Sleep staging is a vital process for evaluating sleep quality and diagnosing sleep-related diseases. Most of the existing automatic sleep staging methods focus on time-domain information and often ignore the transformation relationship between sleep stages. To deal with the above problems, we propose a Temporal-Spectral fused and Attention-based deep neural Network model (TSA-Net) for automatic sleep staging, using a single-channel electroencephalogram (EEG) signal. The TSA-Net is composed of a two-stream feature extractor, feature context learning, and conditional random field (CRF). Specifically, the two-stream feature extractor module can automatically extract and fuse EEG features from time and frequency domains, considering that both temporal and spectral features can provide abundant distinguishing information for sleep staging. Subsequently, the feature context learning module learns the dependencies between features using the multi-head self-attention mechanism and outputs a preliminary sleep stage. Finally, the CRF module further applies transition rules to improve classification performance. We evaluate our model on two public datasets, Sleep-EDF-20 and Sleep-EDF-78. In terms of accuracy, the TSA-Net achieves 86.64% and 82.21% on the Fpz-Cz channel, respectively. The experimental results illustrate that our TSA-Net can optimize the performance of sleep staging and achieve better staging performance than state-of-the-art methods.


I. INTRODUCTION
GOOD sleep is a crucial human physiological activity intimately related to health. Studies have shown that sleep is inextricably linked to human metabolism [1], the immune system [2], and memory [3]. Narcolepsy, sleep apnea syndrome, and insomnia are a few examples of sleep-related problems [4]. With the rapid development of modern civilization, people are having trouble sleeping and are increasingly concerned about the quantity and quality of their sleep. Sleep staging is essential to evaluating the quality of sleep. Typically, sleep specialists perform sleep staging based on the patient's whole-night polysomnogram (PSG) [5], which usually includes electroencephalogram (EEG), electromyogram (EMG), electrooculogram (EOG), and electrocardiogram (ECG) recordings. Manual sleep staging methods generally divide the PSG into 30-second segments (called epochs), which specialists then classify according to established criteria. The existing criteria for evaluating sleep stages are the American Academy of Sleep Medicine (AASM) standard [6] and the Rechtschaffen and Kales (R&K) rule [7]. The R&K rule divides the sleep process into six stages: the wake stage (W), the rapid eye movement stage (REM), and the non-rapid eye movement stages S1, S2, S3, and S4. The AASM rule combines S3 and S4 into N3, yielding W, REM, and the non-rapid eye movement stages N1, N2, and N3. According to these rules, sleep specialists categorize sleep stages based on the distinctive waveforms and frequency characteristics that each stage exhibits. For instance, K-complexes and sleep spindles are evident in the N2 stage, the δ wave is more prevalent in the N3 stage of deep sleep, and the α wave is prominent during the W stage.
Generally, manual labeling and diagnosis in sleep staging require skilled and experienced specialists, which is time-consuming and laborious. Even trained specialists exhibit certain biases in the manual sleep staging procedure, and different specialists may evaluate the same PSG data differently. Considering these factors, it is important to construct objective and automatic sleep staging methods.
In recent years, automatic sleep staging techniques have gained popularity with the growth of machine learning. Traditional machine learning for sleep staging requires hand-crafted feature extraction and classifier selection. Firstly, the data are preprocessed and filtered to obtain clean, artifact-free signals; then feature extraction is carried out on the preprocessed signals, and valuable features are selected and input into a classifier to execute sleep staging. These hand-crafted features include time-domain features [8], frequency-domain features [9], and time-frequency domain features [10]. In addition, various classifiers have been used for sleep staging, such as support vector machines (SVM) [11], random forests (RF) [12], and adaptive boosting (AdaBoost) [13]. For instance, Guo et al. [11] proposed a sleep staging method using an SVM classifier based on the Hilbert-Huang transform and sample entropy features. In [14], the authors obtained maximized features based on multi-scale principal component analysis and the discrete wavelet transform, and realized sleep stage classification by integrating the RotSVM classifier. Based on the time and frequency domains, Timplalexis et al. [15] extracted mixed features and tested multiple classifiers to generate a final voting classifier for sleep staging. Hassan et al. [16] used the tunable-Q wavelet transform to extract features of sleep EEG signals and implemented a decision support system based on bootstrap aggregating to complete the classification of 2-state to 6-state sleep stages. In [17], the authors staged sleep using several small classifiers and employed overlapping sampling to enhance the quality of the data samples.
Feature selection requires a certain amount of expertise and has limited performance. Deep learning can automatically extract meaningful features from polysomnography and perform end-to-end sleep stage classification, attracting growing interest. Existing methods can be categorized by model features into temporal-domain feature-based, spectral-domain feature-based, and other spatial feature-based or multimodal methods. Most methods are based on temporal-domain features, with the raw temporal EEG sequence as model input. For instance, based on raw EEG signals, convolutional neural networks (CNN) have been widely used for discriminative feature extraction to realize sleep stage classification [18], [19], [20]. To further learn temporal correlation information, Supratak et al. [21] used a CNN to obtain shallow features and added a two-layer bi-directional long short-term memory (Bi-LSTM) network to mine temporal sequence relationships. In [22], Seo et al. proposed IITNet, which utilized Bi-LSTM to explore features within and between sleep stages. Drawing on the Transformer's idea of self-attention, Transformer-based models [23], [24] can learn the features of sequence signals more effectively and achieve higher accuracy.
In addition to the original time domain EEG signals, some methods consider extracting features from the frequency domain. For instance, considering the frequency-domain information, Kuo et al. [25] employed the spectrogram obtained by wavelet transform as the model input and built the SNet model by CNN. Instead, Gupta et al. [26] applied a modified Fourier decomposition method to construct a new time-frequency image representation. Some methods used multi-channel EEG data to construct functional connectivity [27] representing the features of brain spatial domain or combined EOG and ECG data for sleep staging research [28], [29], [30]. Though the above deep learning models achieved acceptable sleep staging results, most of the existing methods only take into account the raw time-domain signal or the 2-D image of time-frequency representation. In this way, the feature representation power is limited, and applying image representation for training requires more resource consumption. Besides, these methods fail to consider the inner and continuous relationship between stages in sleep staging. Actually, for uncertain signals, the specialist would observe the adjacent epochs to determine the sleep stage of the current epoch.
To alleviate the above problems, we propose a Temporal-Spectral fused and Attention-based deep neural Network model (TSA-Net) for automatic sleep staging using a single-channel EEG signal. The TSA-Net is a two-stream model that combines temporal and spectral features from two perspectives. In addition, we only use single-channel EEG data in our study: multimodal data can be informative, but single-channel EEG has lower equipment requirements and is more portable. The TSA-Net starts with a two-stream feature extractor (TSFE) module, in which the raw time-domain signals and the converted frequency-domain signals are the inputs of the temporal and spectral streams, respectively. In each stream of the TSFE, a different feature extraction module is designed. The feature context learning (FCL) module receives the extracted features and uses a multi-head self-attention [30] block to learn the correlation information between features and a residual block to learn multi-level features. After that, we obtain a preliminary sleep staging result, similar to a sleep specialist's initial determination for the current epoch. Furthermore, we employ a conditional random field (CRF) module [31] to learn the transition rules between epochs and obtain the final results.
Our contributions can be summarized as follows:
(1) We propose a fused dual-input model considering both temporal and spectral features from two perspectives to obtain salient features of epochs.
(2) We use the multi-head self-attention mechanism in the FCL module to learn the dependencies between features.
(3) We imitate the decision-making procedure of specialists in manual sleep staging and employ a CRF module to build transition rules, which makes the model more reasonable.
(4) Extensive experiments are conducted on two public datasets, and the results indicate that the proposed TSA-Net model outperforms state-of-the-art models in terms of staging accuracy and overall performance.
The rest of the paper is arranged as follows. Section II describes the proposed TSA-Net model. Section III introduces the experimental design, including the dataset used in the experiment, the evaluation criteria, and specific model settings. Section IV presents the experimental results and performances of our model, discussion and limitations. Finally, Section V concludes the whole paper.

II. METHOD
In this section, we provide a thorough introduction to the proposed TSA-Net model, covering its overall structure and each module in depth.

A. Overview of Our Model
Fig. 1. The overall framework of the proposed TSA-Net. Two branches of the raw EEG signals are input into the two-stream feature extractor: one branch into the temporal feature extractor, and the other into the spectral feature extractor after time-frequency transformation. The temporal and spectral fusion features obtained by the two branches are then entered into the feature context learning module to obtain preliminary sleep staging results, which the conditional random field module further optimizes into the final sleep stage.

The time-domain information and frequency-domain information of different sleep stages have different characteristics,
both of which can offer extensive detail for sleep staging. Based on that, we propose the TSA-Net model. The pipeline of our dual-input, single-channel sleep staging method is shown in Fig. 1. The dual input consists of the raw temporal EEG sequence and the spectral sequence obtained by time-frequency transformation. The two-stream feature extractor includes a temporal feature extractor and a spectral feature extractor, which differ according to the characteristics of their inputs. The temporal feature extractor follows a previous work [23], in which two branches with different convolution kernels are designed. The spectral feature extractor is simpler because features are more conspicuous in the frequency domain. These two feature extractors process the temporal and spectral sequences, respectively. After that, the features from the two streams are concatenated, pass through a convolutional mapping layer, and enter the FCL module. The FCL consists of a multi-head self-attention block and a residual block. The multi-head self-attention block learns the correlation and dependence information between the features from the two streams, and the residual structure combines features from different levels. We thus obtain a preliminary sleep staging result after the FCL, but this result does not consider the associations between epochs and may contain inappropriate transitions. For uncertain epochs, we imitate specialists, who look at adjacent epochs to determine the current one. Therefore, we use a CRF to further refine the results and complete the sleep staging. The main modules are described below.

B. Dual-Input
Since the characteristics of EEG span the time domain and frequency domain, we mine relevant information from both perspectives; this is the dual input. For an original 30-second EEG epoch with a sampling rate of r Hz, the raw EEG time series is the temporal sequence, with a shape of 1 × 30r. We use the Fast Fourier Transform for time-frequency transformation to obtain a spectral EEG sequence. Considering the low-frequency characteristics of EEG signals during sleep [32], we only select the frequency band of approximately 0-25 Hz. These two inputs are fed to the corresponding feature learning streams of the two-stream feature extractor.
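As a concrete illustration, the dual input for a single epoch can be prepared as follows. This is a minimal NumPy sketch: the use of the magnitude spectrum, the exact band selection, and the interpolation to a fixed length of 1000 are assumptions, since the paper only specifies an FFT and the approximate 0-25 Hz band.

```python
import numpy as np

def spectral_input(epoch, fs=100, f_max=25.0, out_len=1000):
    """Build the spectral stream input for one 30 s epoch (sketch).

    Magnitude spectrum, 0-25 Hz band, and interpolation to a fixed
    length are assumptions about details the paper leaves open."""
    spec = np.abs(np.fft.rfft(epoch))                  # magnitude spectrum
    freqs = np.fft.rfftfreq(len(epoch), d=1.0 / fs)
    band = spec[freqs <= f_max]                        # keep ~0-25 Hz
    idx = np.linspace(0, len(band) - 1, out_len)       # fixed model length
    return np.interp(idx, np.arange(len(band)), band)

epoch = np.random.randn(3000)                          # 30 s at 100 Hz
temporal = epoch.reshape(1, -1)                        # 1 x 30r temporal input
spectral = spectral_input(epoch).reshape(1, -1)        # 1 x 1000 spectral input
print(temporal.shape, spectral.shape)                  # (1, 3000) (1, 1000)
```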

C. Two-Stream Feature Extractor (TSFE)
The two-stream feature extractor includes two sub-modules, the temporal feature extractor and the spectral feature extractor, which correspond to the temporal domain and spectral domain, respectively. The temporal-domain sequence is fed into the temporal feature extractor, and the spectral-domain sequence enters the spectral feature extractor. The time-frequency domain features obtained by the two feature extractors are integrated and flow into the next module. These two sub-modules are described in detail next.
1) Temporal Feature Extractor: Since the time-domain input is the raw temporal sequence, inspired by prior works [21], [23], we implement two branches to capture valuable features in different bands. The convolutional layers of the two branches have different kernel sizes to explore feature information at different scales. The kernel size is related to the sampling rate of the EEG signal. In this study, with a sampling rate of 100 Hz, we set the convolution kernel sizes to 50 and 400, corresponding to time windows of 0.5 seconds and 4 seconds, respectively. Taking the 4-second window as an example, it can capture sinusoidal components as low as 0.25 Hz. Different window sizes capture different feature waveforms, so branches with different kernel sizes obtain signal features at two distinct scales. Each branch consists of four convolution layers, two max-pooling layers, and a dropout layer. Each convolution layer is followed by batch normalization and a Gaussian Error Linear Unit (GELU) activation function. Batch normalization normalizes the data so that the model generalizes better, and we choose GELU as the activation function to avoid vanishing gradients. A 1 × 3000 EEG temporal sequence is fed into the temporal feature extractor, which outputs a 128 × 80 feature map.
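The effect of the two kernel sizes can be illustrated with a minimal single-channel convolution sketch; the strides and random kernels below are placeholders, not the model's actual stacked layers, and serve only to show how the 0.5 s and 4 s windows yield features at two temporal scales.

```python
import numpy as np

def conv1d(x, kernel, stride):
    """Valid-mode strided 1-D convolution for a single channel (sketch)."""
    k = len(kernel)
    n = (len(x) - k) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + k], kernel)
                     for i in range(n)])

rng = np.random.default_rng(0)
x = rng.standard_normal(3000)                 # one 30 s epoch at 100 Hz

# kernel 50 -> 0.5 s window; kernel 400 -> 4 s window (strides assumed)
small = conv1d(x, rng.standard_normal(50), stride=6)
wide = conv1d(x, rng.standard_normal(400), stride=50)
print(small.shape, wide.shape)                # two different feature scales
```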
2) Spectral Feature Extractor: The spectral feature extractor consists of a stack of one-dimensional convolution operations, including two convolutional modules for feature learning and a max-pooling module to reduce the feature dimension. As in the temporal feature extractor, each convolutional layer is followed by a batch normalization layer and a GELU activation function. In the max-pooling module, the max-pooling layer is followed by a dropout layer that drops units with a certain probability to prevent overfitting. Since frequency-domain features are more obvious and a single branch reduces unnecessary resource consumption, we use only one branch in the spectral feature extractor. In our model, the input of the spectral feature extractor is 1 × 1000, and the output shape is 128 × 40.

D. Feature Context Learning Module (FCL)
Inspired by the Transformer [33], the FCL module utilizes multi-head self-attention to encode and learn the extracted time-frequency domain features. Multi-head self-attention can process features in parallel, improving the parallel efficiency of the model over recurrent models such as the recurrent neural network (RNN). The FCL module consists of a multi-head self-attention block and a residual block, stacked twice. The multi-head self-attention block can learn dependencies over long ranges. Compared with traditional self-attention, multi-head self-attention divides the input features into subspaces composed of multiple heads, and each subspace learns its own attention weights; the heads of different subspaces interact to convey attention information between subspaces. After that, we design a residual structure to lessen information loss across information levels, which makes up the second block. Therefore, the FCL module can improve the overall ability of the model and attend to different locations. The output of the TSFE module (denoted as $X \in \mathbb{R}^{l \times d}$) is input into this module, where $l$ is the length of the feature and $d$ is the dimension of the feature. After the two blocks, the final output of the FCL is a preliminary sleep stage. Suppose we have $H$ heads; the input features are evenly divided into $H$ subspaces, each represented as $x_n \in \mathbb{R}^{l \times d/H}$, where $1 \le n \le H$. For each subspace $n$, we calculate the corresponding $Q_n$, $K_n$, $V_n$ according to the learnable weight matrices $W_n^Q$, $W_n^K$, $W_n^V$:

$Q_n = x_n W_n^Q, \quad K_n = x_n W_n^K, \quad V_n = x_n W_n^V$

The self-attention $A_n$ of each subspace $n$ is obtained from $Q_n$, $K_n$, $V_n$ by the scaled dot-product operation:

$A_n = \mathrm{softmax}\!\left(\frac{Q_n K_n^\top}{\sqrt{d/H}}\right) V_n$

Multi-head self-attention concatenates the self-attention outputs of all subspaces:

$\mathrm{MultiHead}(X) = \mathrm{Concat}(A_1, A_2, \ldots, A_H)$

The result of the multi-head self-attention calculation is added to the input feature,

$M = X + \mathrm{MultiHead}(X),$

and then enters the residual block.
In the residual block, the input M will first handle layer normalization and then enter two fully connected layers. Each fully connected layer is followed by a dropout layer with a probability of 0.1. The residual operation will be performed on the output of the multi-head self-attention block. The output of the residual block will be sent into the fully connected layer to output the preliminary predicted sleep staging results.
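The multi-head self-attention computation with the residual addition can be sketched in NumPy as follows. Projecting with full-width matrices and then splitting into heads is an equivalent reformulation of the per-subspace weights $W_n$, and the weight scales here are arbitrary illustrative values.

```python
import numpy as np

def multi_head_self_attention(X, W_q, W_k, W_v, H):
    """Sketch of the FCL attention: X is (l, d); the d x d projections
    are split into H heads of d/H dims, equivalent to per-head weights."""
    l, d = X.shape
    dh = d // H
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for n in range(H):
        q, k, v = (M[:, n * dh:(n + 1) * dh] for M in (Q, K, V))
        scores = q @ k.T / np.sqrt(dh)                 # scaled dot product
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)             # softmax over keys
        heads.append(w @ v)                            # A_n
    A = np.concatenate(heads, axis=-1)                 # Concat(A_1..A_H)
    return X + A                                       # residual addition

rng = np.random.default_rng(0)
X = rng.standard_normal((120, 128))                    # fused feature sequence
W_q, W_k, W_v = (0.05 * rng.standard_normal((128, 128)) for _ in range(3))
M = multi_head_self_attention(X, W_q, W_k, W_v, H=8)
print(M.shape)                                         # (120, 128)
```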

E. Conditional Random Fields (CRF)
The preliminary results obtained in the FCL module only consider the characteristics of the current epoch. This is analogous to a specialist's pre-judgment of the sleep stage within each 30-second time window. However, when the EEG information within the time window cannot fully determine the sleep stage, the specialist considers the epochs before and after the current window to determine the current sleep stage. Based on this idea, we propose using a CRF module to correct the sleep stage transitions.
The CRF [31] is a discriminative probabilistic model based on an undirected graph that considers the dynamic relationship between adjacent variables. Our method is based on a linear-chain CRF, which defines two stochastic sequences: a state sequence $I = \{i_1, i_2, \ldots, i_T\}$ and an observation sequence $O = \{o_1, o_2, \ldots, o_T\}$, with $i_n, o_n \in \{W, N1, N2, N3, REM\}$ for $1 \le n \le T$. Here, the state sequence $I$ holds the ultimate labels we want, and the observation sequence $O$ holds the preliminary predictions from the FCL module; $i_n$ represents the true sleep stage at time $n$, and $o_n$ is the observed preliminary sleep staging result at time $n$. Since our goal is sleep staging, each element of either sequence can only take one of the five states W, N1, N2, N3, and REM. We calculate the final sleep stage prediction from the undirected graph based on probability, with the conditional probability distribution $P(i \mid o)$ defined as:

$P(i \mid o) = \frac{1}{Z(o)} \exp\!\Big( \sum_{n=1}^{T} \sum_{k=1}^{K} \omega_k f_k(i_n, i_{n-1}, o_n) \Big) \quad (9)$

where $f_k(i_n, i_{n-1}, o_n)$ is a feature function, specifically divided into transition functions $t_k(i_n, i_{n-1}, o_n)$ and state functions $s_l(i_n, o_n)$, and $Z(o)$ is the normalization factor. In Eq. (9), $\omega_k$ is the weight of the feature functions: $\lambda_k$ for transition functions and, correspondingly, $\mu_l$ for state functions. $K$ is the total number of feature functions.
For the constructed conditional probability distribution, we adopt maximum likelihood estimation and maximize the conditional log-likelihood $L(\theta) = \sum_{j=1}^{N} \log p(i_j \mid o_j)$, where $N$ is the length of the predicted sequence, and $i_j$ and $o_j$ represent the state value and observation value of the $j$-th sample, respectively. After training, we use the Viterbi algorithm [33] to solve for the predicted sleep stages. That is to say, the CRF optimizes the preliminary sleep staging sequence, and the final sleep staging result is obtained.
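Viterbi decoding over the CRF's transition scores can be sketched as follows; the emission scores stand in for the FCL module's preliminary per-epoch predictions, and both score matrices here are illustrative placeholders rather than trained parameters.

```python
import numpy as np

def viterbi(emissions, transitions):
    """Viterbi decoding for a linear-chain CRF (sketch).

    emissions: (T, S) per-epoch state scores (the preliminary FCL
    output); transitions: (S, S) learned transition scores."""
    T, S = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        # cand[i, j]: best score ending in state j coming from state i
        cand = score[:, None] + transitions + emissions[t]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):                 # backtrack the best path
        path.append(int(back[t, path[-1]]))
    return path[::-1]

stages = ["W", "N1", "N2", "N3", "REM"]
rng = np.random.default_rng(1)
em = rng.standard_normal((8, 5))                  # illustrative scores only
tr = rng.standard_normal((5, 5))
print([stages[s] for s in viterbi(em, tr)])
```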

F. Loss Function
Due to the class imbalance among sleep stages, we use a weighted cross-entropy loss function [23]:

$L = -\frac{1}{M} \sum_{m=1}^{M} \sum_{c=1}^{C} \omega_c \, y_m^c \log \hat{y}_m^c$

where $\omega_c$ is a weight parameter adjusted per category, $M$ is the total number of samples, and $C$ is the number of categories. $y_m^c$ is the true label of the $m$-th sample and $\hat{y}_m^c$ is the predicted label of the $m$-th sample; together they constitute the training loss of the model.
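A minimal sketch of this weighted cross-entropy, assuming integer labels and predicted class probabilities:

```python
import numpy as np

def weighted_ce(y_true, y_prob, class_weights):
    """Weighted cross-entropy over M samples (sketch).

    y_true: (M,) integer labels; y_prob: (M, C) class probabilities;
    class_weights: (C,) the per-class weights omega_c."""
    M = len(y_true)
    w = class_weights[y_true]                          # omega_c per sample
    ll = np.log(y_prob[np.arange(M), y_true] + 1e-12)  # log-prob of true class
    return -np.sum(w * ll) / M

# class weights used in the paper's setup for W, N1, N2, N3, REM
weights = np.array([0.3, 0.4, 0.3, 0.2, 0.3])
y = np.array([0, 1, 2])
p = np.full((3, 5), 0.2)                               # uniform predictions
print(round(weighted_ce(y, p, weights), 4))            # 0.5365
```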

III. EXPERIMENTS

A. Datasets
Our experiments are evaluated on two public datasets from PhysioBank [34], Sleep-EDF-20 and Sleep-EDF-78. Sleep-EDF-20 was released in 2013 and contains 20 healthy Caucasian subjects, while Sleep-EDF-78, a later expanded version, includes 78 subjects without any sleep problems. All subjects were aged between 25 and 101 and were recorded at home over two consecutive day-night periods. For various reasons, one of the two nights of PSG recordings was lost for subjects 13, 36, and 52. Each PSG record contains two EEG channels (Fpz-Cz and Pz-Oz), one horizontal EOG channel, one submental chin EMG channel, respiration, and rectal body temperature. Experienced specialists scored the sleep stage labels of each PSG record according to the R&K rules, which include W, S1, S2, S3, S4, REM, MOVEMENT, and UNKNOWN.

B. Data Preprocessing
In our research, following previous studies [21], [23], we only use the single-channel EEG Fpz-Cz, with a sampling rate of 100 Hz. Both datasets go through the following preprocessing procedure. Firstly, we divide the EEG signal into epochs of 30 seconds, since the specialists labeled the data by epochs. Secondly, following previous studies, we classify five stages of sleep according to the AASM standard, merging S3 and S4 into N3 and removing the MOVEMENT and UNKNOWN stages. Thirdly, since long non-sleep periods are useless for our research and we are only interested in the sleep process, we keep only the 30 minutes of wake before and after the sleep period. Table I shows the sleep stage distribution for the two datasets after preprocessing.
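The preprocessing steps above can be sketched as follows; the label encoding (0 for W) and the exact trimming logic are assumptions for illustration.

```python
import numpy as np

W = 0  # wake label (encoding is an assumption)

def preprocess(signal, labels, fs=100, epoch_s=30, margin_epochs=60):
    """Segment a night's EEG into 30 s epochs and keep only 30 min
    (60 epochs) of wake before and after the sleep period (sketch)."""
    samples = fs * epoch_s
    n = len(signal) // samples
    epochs = signal[:n * samples].reshape(n, samples)
    labels = np.asarray(labels[:n])
    sleep = np.where(labels != W)[0]                   # non-wake epochs
    lo = max(0, sleep[0] - margin_epochs)
    hi = min(n, sleep[-1] + margin_epochs + 1)
    return epochs[lo:hi], labels[lo:hi]

# synthetic night: 100 wake epochs, 50 sleep epochs, 50 wake epochs
labels = [0] * 100 + [2] * 50 + [0] * 50
signal = np.zeros(200 * 3000)
ep, lab = preprocess(signal, labels)
print(ep.shape, lab.shape)                             # (160, 3000) (160,)
```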

C. Evaluation Metrics
We adopt five indicators to comprehensively evaluate the performance of the proposed model: accuracy (ACC), recall (RE), precision (PR), macro-averaged F1-score (MF1), and Cohen's kappa ($\kappa$). MF1 and kappa provide a good evaluation on unbalanced datasets. Suppose TP, FP, FN, and TN denote the numbers of true positives, false positives, false negatives, and true negatives, $C$ is the number of sleep stages, and $P_e$ is the hypothetical probability of agreement by chance. The corresponding formulas are defined as follows:

$ACC = \frac{TP + TN}{TP + TN + FP + FN}$

$PR = \frac{TP}{TP + FP} \quad (13)$

$RE = \frac{TP}{TP + FN}$

$MF1 = \frac{1}{C}\sum_{c=1}^{C}\frac{2 \cdot PR_c \cdot RE_c}{PR_c + RE_c}$

$\kappa = \frac{ACC - P_e}{1 - P_e}$
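The five metrics can be computed from the confusion matrix using the standard definitions, as in this sketch (not the authors' evaluation code):

```python
import numpy as np

def metrics(y_true, y_pred, C=5):
    """ACC, per-class PR/RE, MF1, and Cohen's kappa from the confusion
    matrix, following the standard definitions (sketch)."""
    cm = np.zeros((C, C))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    tp = np.diag(cm)
    pr = tp / np.maximum(cm.sum(axis=0), 1)            # TP / (TP + FP)
    re = tp / np.maximum(cm.sum(axis=1), 1)            # TP / (TP + FN)
    f1 = 2 * pr * re / np.maximum(pr + re, 1e-12)
    acc = tp.sum() / cm.sum()
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / cm.sum() ** 2
    kappa = (acc - pe) / (1 - pe)                      # (ACC - Pe) / (1 - Pe)
    return acc, pr, re, f1.mean(), kappa

y_true = [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
y_pred = [0, 1, 2, 3, 4, 0, 2, 2, 3, 4]
acc, pr, re, mf1, kappa = metrics(y_true, y_pred)
print(round(acc, 2), round(kappa, 3))
```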

D. Experimental Setup
We adopt k-fold cross-validation for sleep staging, with k set differently for the two datasets. For Sleep-EDF-20, k is set to 20, which is equivalent to leave-one-subject-out (LOSO) cross-validation: in each fold, 19 subjects are used for training and the remaining subject for testing, repeated 20 times until every subject has been used. Because the Sleep-EDF-78 dataset is large and full LOSO cross-validation is time-consuming, we use 10-fold cross-validation for this dataset, as in other work [35].
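Subject-wise folds for the leave-one-subject-out setting can be generated as follows (a sketch; `subject_ids` holds the per-epoch subject label):

```python
def loso_splits(subject_ids):
    """Leave-one-subject-out folds (sketch): each subject's epochs form
    one test fold; all other subjects' epochs form the training set."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test

# toy example: epochs from three subjects
ids = [1, 1, 2, 2, 2, 3]
for held_out, train, test in loso_splits(ids):
    print(held_out, len(train), len(test))
```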
We apply the weighted cross-entropy loss to train our model, assigning different weights to the losses of different categories to mitigate the data imbalance problem. Considering the uneven distribution of the five categories, as in [23], we set the weights of W, N1, N2, N3, and REM to 0.3, 0.4, 0.3, 0.2, and 0.3, respectively. During model training, we use the Adam optimizer.

IV. RESULTS AND ANALYSIS

A. Classification Performance
We implement our temporal-spectral fused, attention-based, and CRF-optimized TSA-Net and validate it on the two public datasets Sleep-EDF-20 and Sleep-EDF-78, using the Fpz-Cz channel with 20-fold and 10-fold cross-validation, respectively. Table II and Table III show the confusion matrices of the two corresponding datasets. Each confusion matrix is computed as the sum over the validation sets of all folds. Each row represents the actual sleep stage, and each column the sleep stage predicted by our TSA-Net model. Taking Table II as an example, the number 184 in row one, column two means that 184 W-stage samples are misclassified as N1. The right part of each table lists the per-stage performance calculated from the confusion matrix.
As shown in Table II and Table III, the diagonal elements of the confusion matrices dominate, indicating that our method is effective. The F1 score of W exceeds 90% on both datasets, reaching 90.54% on Sleep-EDF-20 and 91.38% on Sleep-EDF-78. Compared with the other stages, N1 obtains the lowest F1 score, and unlike the other stages, its precision is higher than its recall. Looking at the confusion matrices, we also find that N1 is often misclassified as N2, REM, or W. In cases of misclassification, most stages are most often confused with N2. This may be because N1 has a small sample size, while N2 accounts for about 40% of the samples. Overall, performance is better on the Sleep-EDF-20 dataset: except for the W stage, all three indicators on the Sleep-EDF-78 dataset are lower than those on Sleep-EDF-20.

B. Comparison With State-of-the-Art Models
We compare our TSA-Net with different classifiers on the Sleep-EDF-20 dataset using 20-fold cross-validation. Referring to common EEG features in previous works, we extract approximate entropy, sample entropy, fuzzy entropy [36], and the power spectral density features [37] of five frequency bands. We adopt eight baseline models: artificial neural network (ANN), SVM, k-nearest neighbor (KNN), extreme gradient boosting (XGBoost), LSTM, BiLSTM, CNN, and CNN+BiLSTM. The first four are traditional classifiers with hand-crafted features, and the last four are deep models based on the raw sequence; among them, CNN+BiLSTM is the DeepSleepNet model proposed by Supratak et al. [21]. In the experiment, we also conduct a paired-samples t-test for statistical analysis of the classification accuracy over the 20-fold cross-validation.
As shown in Fig. 2, TSA-Net performs significantly better than the baseline classifiers. In terms of accuracy, the TSA-Net improves by 21.19%, 38.94%, 35.08%, 21.53%, 36.59%, 34.56%, 4.85%, and 4.99% compared to ANN, SVM, KNN, XGBoost, LSTM, BiLSTM, CNN, and CNN+BiLSTM, respectively. Among the baseline models, CNN and CNN+BiLSTM perform best, both achieving an accuracy of more than 81%, which demonstrates the superiority of CNN and the limitations of traditional classifiers. Among the traditional classifiers, ANN and XGBoost achieve the best accuracy. In contrast, LSTM-based methods do not perform well, probably because they are not suitable for 30-second-long sequences of sleep EEG and cannot extract effective information. Compared with hand-crafted features, CNN can effectively and automatically extract and abstract feature representations. These results indicate the superiority of CNN-based deep learning models.

TABLE IV: Comparison among TSA-Net and state-of-the-art models. The numbers in bold indicate the best performance, and the underlined numbers are the second best.

Moreover, we compare our model with the following recently published deep learning models to better illustrate the overall classification performance of the TSA-Net:
• CNN-LSTM-CRF [38] used the splicing of a CNN module and an LSTM module as pre-training, and then CRF was used for refinement.
• LightSleepNet [39] designed a residual lightweight model based on one-dimensional convolution.
• CNN-HMM [40] applied multi-core CNN for epoch classification, and HMM was used for optimization to obtain the final results.
• CCRRSleepNet [41] employed hybrid mixed relational inductive biases to learn different contributions between features from three levels of the frame, epoch, and sequence.
• DeepSleepNet-Lite [35] first used the Monte Carlo dropout technique to enhance sleep scoring performance and detect uncertain instances.
• AttnSleep [23] proposed a temporal context encoder and deployed causal convolutions to learn temporal correlation features.

Table IV shows the comparison of F1 scores for the five stages and the three overall evaluations of accuracy, kappa, and MF1. For fairness, the datasets and experimental settings of the chosen models are kept as identical as possible; the sleep staging performances of the above methods are taken from their corresponding papers. Our method is the best on W, N2, and all three overall metrics, achieving an accuracy of 86.64% on the Sleep-EDF-20 dataset. For the Sleep-EDF-78 dataset, the accuracy is 81.66% under 10-fold cross-validation and 82.21% under 20-fold cross-validation. As shown in Table IV, the TSA-Net outperforms the state-of-the-art models with improvements of 1.25% to 2.84% in accuracy. In addition, the accuracy of TSA-Net improved by 0.91% (20-fold cross-validation) and 1.33% (10-fold cross-validation) on Sleep-EDF-78. Compared with the CNN-HMM method, our model uses a CRF to learn transition rules, and the undirected-graph structure of the CRF is more effective. The extraction of dual features in the time and frequency domains learns more comprehensive features than the CNN-LSTM-CRF model. Additionally, multi-head self-attention enhances parallelism, resulting in overall performance superior to LSTM. For the Sleep-EDF-78 dataset, the overall accuracy and kappa of TSA-Net are also better than those of the other models, and the kappa results indicate that the model's agreement is not driven by class bias. Overall, our model performs better than the state-of-the-art models on both datasets under the same experimental settings. At the same time, we find that the improvement of TSA-Net differs between the two datasets. Firstly, we speculate that differences in data volume and distribution are one cause.
As shown in Table I, N1 accounts for only 6.6% of Sleep-EDF-20, while N2 accounts for 42.1%, much higher than the other stages. In Sleep-EDF-78, the smallest proportion is N3, at only 6.75%, while N2 and W account for roughly equal proportions. This difference in distribution likely contributes to the different performances. Secondly, leave-one-subject-out cross-validation is used on Sleep-EDF-20, whereas running it on Sleep-EDF-78 is not realistic given resource constraints; the different experimental settings are also an influencing factor. Finally, inter-subject differences are an important factor: as the number of subjects increases, the model is more affected by differences between subjects. Further optimization could therefore reduce inter-subject differences using methods such as transfer learning.

C. Model Analysis
To further analyze the feature representation ability of each module in the TSA-Net, we use t-SNE [43] for visual analysis. t-SNE is a nonlinear dimensionality reduction method that maps high-dimensional data to a low-dimensional space so that the structure of the data can be observed. Fig. 3 visualizes the distribution of the data processed by each module on the Sleep-EDF-20 dataset. Fig. 3(a) shows the raw data without any preprocessing: the five sleep categories are mixed together and difficult to distinguish from each other. Fig. 3(b) and (c) show the time-domain features produced by the small and wide convolution kernels, respectively. Categories within the same stage show some aggregation but cannot be clearly separated, especially the W and N1 stages. Fig. 3(d) represents the frequency-domain features, whose inter-class aggregation is also not clear.
The red W stage and the orange N1 stage are scattered into the other three categories. In other words, both the time-domain and frequency-domain features have a certain representation ability and capture some of the traits of the various sleep stages, but this is not enough on its own and requires further processing. Fig. 3(e) exhibits the result of fusing the time-domain and frequency-domain features. The fused features have better expressive ability than either domain alone, which further demonstrates the benefit of temporal-spectral feature fusion. The data distribution after the multi-head self-attention block is shown in Fig. 3(f): the W, N2, and N3 stages have obvious boundaries, but the N1 stage is still confused with the REM stage. As Fig. 3(g) displays, once the residual block combines information from different levels, the separation improves further. Fig. 3(h) shows the data distribution at the output of the last fully connected layer: the N1 stage is more tightly clustered, and the five stages are clearly separated. However, the N1 stage still performs worse than the other stages, which we attribute to its small sample size, making feature learning less effective than in the other stages.
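As an illustration of how such projections are produced, the following is a minimal sketch of a t-SNE projection using scikit-learn; the feature matrix here is a random placeholder standing in for the per-epoch features emitted by a TSA-Net module, not the actual learned features.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for per-epoch feature vectors (e.g., the output of an
# intermediate module): 100 epochs x 64 features. In the paper's setting,
# these would be the activations of the trained model.
rng = np.random.default_rng(0)
features = rng.normal(size=(100, 64))

# Project to 2-D for visualization. Note that perplexity must be smaller
# than the number of samples.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
print(embedding.shape)  # (100, 2)
```

Each row of `embedding` is then scatter-plotted and colored by its sleep-stage label to obtain panels like those in Fig. 3.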
Taking Sleep-EDF-20 as an example, we run a leave-one-subject-out cross-validation. Fig. 4 shows the overall performance of each subject across the 20 folds. The accuracy of most subjects is above 80%; subject 7 achieves the best overall performance, while all three metrics of subject 12 are unsatisfactory. A likely reason is that in fold 12 the test subject has fewer samples than the other subjects, with only 31 samples in the N1 stage, which may explain the poor performance of this fold.
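The leave-one-subject-out protocol can be sketched as follows. This is an illustrative NumPy implementation, not the authors' code; the subject IDs are a toy placeholder for the per-epoch subject labels of Sleep-EDF-20.

```python
import numpy as np

def loso_folds(subject_ids):
    """Yield (train_idx, test_idx) pairs for leave-one-subject-out
    cross-validation: each fold holds out all epochs of one subject."""
    subject_ids = np.asarray(subject_ids)
    for subj in np.unique(subject_ids):
        test_mask = subject_ids == subj
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Toy example: 3 subjects contributing different numbers of epochs.
ids = [0, 0, 1, 1, 1, 2]
folds = list(loso_folds(ids))
print(len(folds))  # 3 folds, one per subject
```

With 20 subjects this yields the 20 folds of Fig. 4, so each reported accuracy is measured on a subject never seen during training.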

D. CRF Refinement
In our model, we use a CRF to learn transition rules for further correcting the preliminary predictions. To observe the improvement brought by the CRF, we visualize the whole-night hypnograms of the preliminary TSA-Net results, the final results of TSA-Net with CRF, and the corresponding labels scored by sleep specialists.
As shown in Fig. 5, the preliminary results contain some abnormal transitions, which manifest as frequent jumps between stages. The CRF-corrected results are presented in the center hypnogram of Fig. 5: the jumps between stages are significantly reduced, as highlighted by the red boxes. In the first red box, epochs of N2 misclassified as REM are corrected; in the second, REM epochs frequently misclassified as N1 are likewise corrected, as are W epochs wrongly classified as REM. The corrected stage transitions are significantly improved and closer to the sleep stages labeled by the specialists.
We further explore the transitions learned by the CRF module and plot the transition probabilities based on the preliminary results on the Sleep-EDF-20 dataset, as shown in Fig. 6. Additionally, Fig. 7 shows, for each stage, the sample counts of the preliminary and final results where a correction occurred. The most frequent correction is optimizing N1 to REM, followed by N1 to W. The most likely explanation is that N1, the stage between REM and N2, has the smallest sample size and makes up only 5%-10% of the entire sleep cycle. Moreover, the preliminary results are easily misclassified into adjacent stages, and the CRF can correct this phenomenon by removing unreasonable jitter between adjacent stages. In addition, the mutual transition probabilities between N3 and N1, as well as between N3 and REM, are both zero. This may be because N3, a stage of deep sleep, exhibits prominent slow waves (0.5-2 Hz), whereas REM and N1, the stages of preparation for sleep and light sleep, commonly exhibit the α wave (8-13 Hz) and the θ wave (4-8 Hz). These two stage pairs are already well separated in the preliminary staging results thanks to the combination of temporal and spectral features, so the CRF module requires no further correction, which leads to their low transition probabilities. At the same time, the differences in correction between stages also verify the rationality and necessity of the CRF module.
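The smoothing effect of a linear-chain CRF at inference time can be illustrated with first-order Viterbi decoding over stage-transition scores. The scores below are illustrative toy values, not the learned TSA-Net parameters; the example shows a single REM jitter inside a run of N2 being corrected.

```python
import numpy as np

STAGES = ["W", "N1", "N2", "N3", "REM"]

def viterbi(emissions, transitions):
    """Most likely stage sequence given per-epoch log-scores (emissions,
    shape [T, 5]) and pairwise transition log-scores (shape [5, 5])."""
    T, S = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transitions   # score of every i -> j move
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emissions[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy scores: staying in a stage costs nothing, switching is penalized.
trans = np.full((5, 5), -2.0)
np.fill_diagonal(trans, 0.0)
# Epoch 1 slightly favors REM over N2 (an isolated jitter).
emis = np.array([[-5.0, -5.0, -0.5, -5.0, -5.0],
                 [-5.0, -5.0, -0.9, -5.0, -0.7],
                 [-5.0, -5.0, -0.5, -5.0, -5.0]])
print([STAGES[i] for i in viterbi(emis, trans)])  # ['N2', 'N2', 'N2']
```

Without the transition penalty the middle epoch would be labeled REM; with it, the implausible N2→REM→N2 jump is smoothed away, mirroring the corrections in Fig. 5.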

E. Ablation Study
To evaluate the effect of each module of the model, we perform an ablation analysis on the Sleep-EDF-20 dataset based on the modules of the flowchart in Fig. 1; the specific variant models are designed as follows. Fig. 8 shows that removing any of these critical parts weakens the model in all three metrics: ACC, Kappa, and MF1. Comparing W/O CRF with ALL, the CRF learns the transition rules of sleep and corrects some unreasonable stage changes, showing a remarkable improvement. Since the time domain contains abundant original information, the comparison of W/O TD with ALL shows that its removal also has a significant impact. In addition, the feature context learning module further improves the performance of TSA-Net: the multi-head self-attention learns correlations between features while improving parallelism. The frequency-domain module likewise enhances the overall performance of the model. In general, the comparison between the variants and the complete model verifies that each module contributes, to varying degrees, to the TSA-Net.
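The multi-head self-attention referred to above can be sketched in NumPy as follows. The dimensions, random inputs, and random projection weights are illustrative placeholders; the actual module operates on the fused temporal-spectral features.

```python
import numpy as np

def multi_head_self_attention(x, w_q, w_k, w_v, n_heads):
    """Scaled dot-product self-attention with n_heads heads.
    x: [T, d_model] sequence of epoch features; w_*: [d_model, d_model]."""
    T, d_model = x.shape
    d_head = d_model // n_heads

    # Project, then split the model dimension into heads: [n_heads, T, d_head].
    def split(m):
        return m.reshape(T, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # [n_heads, T, T]
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over keys
    out = weights @ v                                     # [n_heads, T, d_head]
    return out.transpose(1, 0, 2).reshape(T, d_model)     # concatenate heads

rng = np.random.default_rng(0)
T, d_model, n_heads = 10, 16, 4
x = rng.normal(size=(T, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
y = multi_head_self_attention(x, w_q, w_k, w_v, n_heads)
print(y.shape)  # (10, 16)
```

Because every epoch attends to every other epoch in the attention matrix, all pairwise dependencies are computed in parallel, which is the parallelism advantage over a sequential LSTM noted above.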

F. Limitations and Future Work
Nevertheless, some deficiencies in this study need further improvement. For example, the classification accuracy in the N1 stage remains insufficient, and the model cannot compensate for the low accuracy caused by insufficient samples. Additionally, cross-subject differences and the effects of different acquisition devices are not considered. In the future, we will explore how to address the class imbalance problem and how to mine the characteristics of N1 to improve its accuracy. We will also further investigate transfer learning [44], [45], as it is crucial for discovering characteristics shared among subjects and maintaining effective performance across devices and datasets.

V. CONCLUSION
In this study, we present a novel dual-input deep temporal-spectral representation and attention-based model, called TSA-Net, for automatic sleep staging. The TSA-Net includes three major modules: 1) a two-stream feature extractor that extracts salient features from the time and frequency domains, 2) a feature context learning module that learns temporal dependencies and produces preliminary staging results, and 3) a conditional random field that optimizes the preliminary results, rationalizes the transitions between sleep stages, and yields the final classification. We verify the proposed TSA-Net on two public sleep staging datasets (i.e., Sleep-EDF-20 and Sleep-EDF-78) and compare it with existing advanced sleep staging methods. Experimental results demonstrate that the TSA-Net can not only acquire the transition rules of sleep stages but also achieve better staging performance than state-of-the-art methods.