A comparison of probabilistic classifiers for sleep stage classification

Objective: To compare conditional random fields (CRF), hidden Markov models (HMMs) and Bayesian linear discriminants (LDs) for cardiorespiratory sleep stage classification on a five-class sleep staging task (Wake/N1/N2/N3/REM), to explore the benefit of incorporating time information in the classification and to evaluate the feasibility of sleep staging on obstructive sleep apnea (OSA) patients. Approach: The classifiers with and without time information were evaluated with 10-fold cross-validation on five- (Wake/N1/N2/N3/REM), four- (Wake/N1+N2/N3/REM) and three-class (Wake/NREM/REM) classification tasks using a data set comprising 443 night-time polysomnography (PSG) recordings of 231 participants (180 healthy participants, 100 of which with a 'regular' sleep architecture, and 51 participants previously diagnosed with OSA). Main results: CRF with time information (CRFt) outperforms all other classifiers on all tasks, achieving a median accuracy and Cohen's κ for all participants of 62.8% and 0.44 for five classes, 68.8% and 0.47 for four classes, and 77.6% and 0.


Introduction
Despite substantial advances in recent decades in simpler and less obtrusive techniques for sleep monitoring, overnight polysomnography (PSG) remains the gold standard for sleep assessment and diagnosis of sleep disorders. However valuable, it is a costly, highly specialized procedure, with substantial burden to the participants under investigation, and often limited to one or two consecutive nights. Because PSG is not practical for long-term sleep monitoring, significant research effort has been put into the development of sensors and techniques which allow sleep to be measured with less expensive sensors that are easier to set up and which can be used at home (Redmond et al. 2007, Hedner et al. 2011, Van De Water et al. 2011, Kortelainen et al. 2010, Willemen et al. 2014, Fonseca et al. 2015). Given the limitations of home sleep monitoring, however, many sensors commonly used in PSG, such as electroencephalography (EEG), electrooculography (EOG) or electromyography (EMG), are essentially excluded. The only physiological characteristics that can be conveniently and reliably monitored are body movements, for example by means of actigraphy (Ancoli-Israel et al. 2003); cardiac activity, for instance with electrocardiography (ECG), photoplethysmography (PPG) (Yu et al. 2006, Beattie et al. 2017), or ballistocardiography (BCG) (Mack et al. 2009); and surrogates of respiratory effort based on respiratory movements measured with respiratory inductance plethysmography (RIP) belts around the thorax, BCG, or even Doppler radars (de Chazal et al. 2011). Although these modalities can be comfortably used in a home setting, and even set up by the subjects themselves, the relation between these physiological characteristics and sleep stages cannot be trivially annotated using visual rules as is done with standard PSG. Exploiting the varying autonomic characteristics during different stages of sleep, algorithms have been described which automate this process, typically using cardiac characteristics computed from heart rate variability and respiratory features from respiratory effort to classify epochs as Wake, 'light sleep' (typically combining N1 and N2), 'deep sleep' or N3, and rapid eye movement (REM) sleep (Willemen et al. 2014, Fonseca et al. 2015, Beattie et al. 2017).
In earlier work (Fonseca et al. 2017) we explained how transitional features and non-separable features are better handled by conditional random fields (CRF) than by probabilistic classifiers traditionally used in the sleep stage classification literature, such as hidden Markov models (HMMs) or Bayesian linear discriminants (LDs). However, that work only explored binary classification problems and did not address the more complex problem of multiple-class classification. Here we investigate the use of CRF for the simultaneous classification of multiple sleep stages using the same cardiorespiratory features as in our previous work, and compare it with HMM and LD for the same task. In addition, we evaluate the extension of the multiple-class sleep stage classification problem to the five classes defined by the American Academy of Sleep Medicine (AASM) (Iber et al. 2007) to separately classify N1, N2, N3, REM sleep and Wake. This is in contrast with most work done in the area of cardiorespiratory sleep staging, where classification is usually limited to either three classes (Wake, REM and non-REM (NREM) sleep) (Redmond et al. 2007, Domingues et al. 2014) or four classes (N1 and N2 combined, N3, REM sleep and Wake) (Fonseca et al. 2015, Willemen et al. 2014). Finally, we evaluate the feasibility of sleep stage classification of obstructive sleep apnea (OSA) patients using CRF and the same set of cardiorespiratory features. Earlier work reported in the literature has shown that the presence of this condition negatively impacts the performance of cardiorespiratory classifiers, achieving modest performance on three-class classification of Wake, REM and NREM, with accuracy and Cohen's κ of 67% and 0.32 (Redmond & Heneghan 2006) and 70% and 0.41 (Willemen et al. 2015).

Data sets
The data sets used in this project were the same as used in our previous work (Fonseca et al. 2017), further extended with participants previously diagnosed with OSA. They comprise a total of 443 recordings of 231 participants from three databases recorded in different locations during different time periods. The first data set consists of 428 recordings of 216 participants, and was collected as part of the EU SIESTA project (Klosch et al. 2001) involving clinical partners and research groups from Austria, Germany, the United Kingdom, Finland, Spain, Denmark and The Netherlands, between 1997 and 2000. The study was approved by the local ethical committee of each research group and all participants signed informed consent. None of the participants had a history of drug or alcohol use, were working at night, or had been diagnosed with a medical disorder interfering with the goal of the study. All participants had a Mini-Mental State Examination score (Folstein et al. 1975) larger than 25 and a Pittsburgh Sleep Quality Index (PSQI) score (Buysse et al. 1989) of at most 5. The apnea patients in this data set had been diagnosed according to the International Classification of Diseases, Tenth Edition (ICD-10) (G47.3) and, at the date of investigation, had been free of psychopharmacotherapeutic drugs for at least five times the half-life of the medication, and in a steady state for at least two months for on-going non-psychopharmacotherapeutic medication.
The second data set comprises six healthy participants, recorded at the Philips Experience Lab of the High Tech Campus, Eindhoven, The Netherlands, during 2010, and the third data set comprises nine healthy participants, recorded at the Sleep Health Center, Boston, USA, during 2009. Both studies included participants with no primary history of neurological, cardiovascular, psychiatric, pulmonary, or sleep disorders. In addition, none of the participants were using sleep, antidepressant or cardiovascular medication or recreational drugs, and all had a Pittsburgh Sleep Quality Index (Buysse et al. 1989) of at most 5. The protocols were reviewed by the Internal Committee of Biomedical Experiments of Philips Research and all participants signed an informed consent.

All sleep technicians were blind to the condition of the participants and scored sleep stages on the second and third data sets in five classes (Wake, REM, N1, N2, N3) according to the AASM guidelines (Iber et al. 2007), and on the first data set in six classes (Wake, REM, S1, S2, S3, S4) according to the R&K guidelines (Rechtschaffen & Kales 1968). To use a common list of ground-truth classes, S1 was mapped to N1, S2 to N2, and S3 and S4 were merged into a single N3 class. These sleep stage annotations were used as ground-truth for training and validation in our study.
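As a minimal illustration, the stage-harmonization step described above can be expressed as a simple lookup table (the label strings are our own convention for this sketch, not those of any scoring software):

```python
# Mapping from R&K stages (first data set) to the AASM five-class scheme
# used for the other data sets: S1 -> N1, S2 -> N2, S3 and S4 merged into N3.
RK_TO_AASM = {
    "Wake": "Wake", "REM": "REM",
    "S1": "N1", "S2": "N2",
    "S3": "N3", "S4": "N3",
}

def harmonize(hypnogram):
    """Map a sequence of R&K stage labels to AASM labels."""
    return [RK_TO_AASM[stage] for stage in hypnogram]
```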
Besides all modalities needed for offline visual scoring of sleep stages according to the corresponding guidelines, all recordings comprise ECG recorded with a modified Lead II derivation, and respiratory effort recorded with thoracic RIP. Although all participants in the data set were healthy, we expect an effect of monitoring on the actual characteristics of sleep. The well-known 'first night effect' has an impact on sleep architecture, namely a reduction in sleep efficiency, a delay in the onset of N3 and REM, and an increased number of stage transitions (Agnew et al. 1966). In order to analyze the impact of these changes, and similar to our previous study, a subset of the complete data set was analyzed separately, comprising 134 recordings (100 participants) with a minimum percentage of each sleep stage considered representative of sleep statistics of healthy adults: at least 5% of N3, 15% of REM sleep, and a minimum of 7 hours in bed (Ohayon et al. 2004) (set 'regular'). Regarding sleep efficiency, we used a minimum threshold of 75% which, albeit low to describe healthy sleep, accounted for the increased sleep onset latencies recorded in many participants who otherwise had N3 and REM durations characteristic of healthy adult sleep. The second set, also analyzed separately, comprised all 102 recordings of the 51 OSA patients (set 'OSA'). The remaining 80 participants - not analyzed separately - did not have a sleep disorder and were as such considered healthy, but had a sleep architecture which, likely due to the first night effect, did not fit the normative values chosen for the 'regular' set described above. The third set comprised all 443 recordings from all three data sets (set 'all'). Table 1 summarizes the participant demographics and sleep statistics of the three sets.

Feature extraction
The classifiers used in this study use the same set of cardiac and respiratory features described in our previous work (Fonseca et al. 2017), which is only summarized here. The baseline wander of the ECG signal was first removed with a high-pass filter using a Kaiser window of 1.016 s, a cut-off frequency of 0.8 Hz and a side-lobe attenuation of 30 dB. QRS complexes were detected with a Hamilton-Tompkins QRS detector (Hamilton 2002, Hamilton & Tompkins 1986) and further localized with a post-processing algorithm (Fonseca et al. 2014). RIP signals were first resampled to a common sampling rate of 10 Hz, and then low-pass filtered with a 10th-order Butterworth filter with a cut-off frequency of 0.6 Hz. Baseline was removed by subtracting the median value calculated in a sliding window of 120 s. After the signals were processed, a total of 81 features from ECG, 33 from RIP and 3 from the simultaneous analysis of both modalities were extracted using windows centered on non-overlapping epochs of 30 s, aligned with the ground-truth annotations from PSG. The length of the extraction windows varied from feature to feature as described in earlier work (Fonseca et al. 2017). An automatic artifact rejection procedure was employed to determine whether segments of the ECG and RIP signals were too affected by noise or motion artifacts. For each feature extracted from ECG, a given epoch was marked as invalid if the sum of the lengths of the R-R intervals in the window used to compute that feature amounted to less than 50% of the length of that window. For all features extracted from RIP, respiratory peaks and troughs in each epoch were first detected, and a criterion was used to determine whether they had a plausible length (Long et al. 2014). Epochs with one or more implausible peaks or troughs were marked as invalid. After feature extraction and artifact rejection, the values of features for epochs marked as invalid were estimated with linear interpolation between the values of neighboring (valid) epochs. Finally, to reduce between-subject physiological and equipment-related variations (Long et al. 2015), all features were normalized to have zero mean and unit standard deviation.
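The last two post-processing steps (interpolation over invalid epochs and per-recording z-normalization) can be sketched as follows. This is a simplified per-feature illustration, not the authors' original implementation:

```python
import numpy as np

def clean_and_normalize(feature, valid):
    """Replace invalid epochs by linear interpolation between neighboring
    valid epochs, then normalize to zero mean and unit standard deviation.

    feature: per-epoch values of one feature for one recording
    valid:   boolean mask marking epochs that passed artifact rejection
    """
    feature = np.asarray(feature, dtype=float)
    valid = np.asarray(valid, dtype=bool)
    idx = np.arange(len(feature))
    # Linear interpolation over invalid epochs (edge epochs take the
    # nearest valid value, numpy's default extrapolation behavior).
    feature = np.interp(idx, idx[valid], feature[valid])
    # Per-recording normalization to reduce between-subject variation.
    return (feature - feature.mean()) / feature.std()
```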

Multiple-sleep stage classification
A detailed explanation of Bayesian LDs, Bayesian networks (BNs), HMMs and CRFs, and how they behave in the presence of transitional and non-separable features was given in our previous work (Fonseca et al. 2017).This section will give only a brief summary of the three classifiers.
The Bayesian LD classifier is a single-epoch classifier which uses Bayes' rule to minimize the probability of error by selecting, based on the observations (features), the most likely class (sleep stage). In contrast with LDs, BNs can make use of directed acyclic graphs to better exploit the structured nature of sleep, where a given sleep stage also depends on the previous stage. HMMs are a special type of generative BN which model the probability of an observation given an unobserved (hidden) state, and the probability of a state given the previous state. CRF classifiers are discriminative models which, in contrast with generative models, avoid the need to learn the emission probabilities from the training data and instead model directly the probability of a class given an observation and the class of the previous state. This has the benefit of requiring substantially less training data, or fewer simplifying assumptions about the feature space, and can make better use of correlated features.
In the following, the three classifiers are described in terms of their defining equations.
Applied in previous work on multiple-sleep stage classification (Fonseca et al. 2015), the LD classifier uses Bayes' rule to minimize the probability of error by selecting, separately for each epoch, the most likely class, i.e., the class that maximizes the discriminant function

g_i(x) = x^T Σ^{-1} μ_i - (1/2) μ_i^T Σ^{-1} μ_i + ln P(ω_i),    (1)

where μ_i is the average feature vector for class i, Σ is the pooled covariance matrix for all classes, and P(ω_i) is usually a fixed term with the prior probability of each class. It has been observed that the probability of different sleep stages changes with sleep time (Redmond et al. 2007). In order to exploit this characteristic, and in addition to the standard LD of (1), a variation of this classifier with time-varying prior probabilities (LDtvp) was considered (Redmond et al. 2007, Fonseca et al. 2015). In this case, the last term of the discriminant equation corresponds to P(ω_i, t), representing the prior probability of class i at time (since lights off) t, with t = 1, ..., T, where T corresponds to the duration of the recording. All parameters μ_i, Σ and P(ω_i, t) were determined during training.

Unlike LD classifiers, where the most likely class is considered separately for each epoch, HMM classifiers make use of BNs where the dependency between consecutive states can be modeled by means of directed edges. HMMs are a special case of BNs where the class of each state, w_t, remains hidden, and the only observable variables are the feature vectors, x_t, which depend on them (Rabiner 1989). Allowing the current state to depend on the previous state and on the current observations, the joint probability of the variables in this network is given by

P(w_{1:T}, x_{1:T}) = P(w_1) P(x_1 | w_1) ∏_{t=2}^{T} P(w_t | w_{t-1}) P(x_t | w_t).    (2)

Using the Viterbi algorithm (Viterbi 1967), the most likely sequence of states is progressively determined as the sequence that maximizes the joint probability up to each state,

δ_t(w_t) = max_{w_{t-1}} [δ_{t-1}(w_{t-1}) P(w_t | w_{t-1})] P(x_t | w_t).    (3)

Emission probabilities expressing the likelihood of an observation given a class, P(x_t | w_t), and transition probabilities expressing the likelihood of transitioning to a state w_t given the previous state w_{t-1}, i.e., P(w_t | w_{t-1}), are learned during training. In our study we used a discrete HMM classifier (Rabiner 1989) whereby the estimation of emission probabilities is obtained by mapping the feature space to a discrete set of symbols, or clusters. The number of clusters k was automatically determined during training as the value of k that maximized the normalized mutual information between the labels (class) of each feature vector and the corresponding closest clusters obtained with k-means clustering (Strehl & Ghosh 2002). During classification, the cluster index of each feature vector was determined by finding the closest cluster centroid. The index of this cluster was used as input to the Viterbi algorithm to obtain the posterior probability of each class and the most likely class for each state.

HMMs, like other generative models, require the distribution of the observations to be modeled, or somehow learned from the training data. When the features are not independent, this requires large amounts of training data or strong assumptions about the distribution and the relation between features. In our study, this was achieved by mapping the feature space to a limited, discrete number of clusters. Discriminative models such as CRF (Lafferty et al. 2001) do not have this requirement, because they compute the probability of a class based on an observation, P(w|X), directly from data. Because discriminative models do not require any assumptions about the independence of the observed variables, they make better use of interdependent, correlated features, which are very common in cardiorespiratory sleep stage classification. Parameter learning with CRF is usually done with factor graphs (Kschischang et al. 2001), and the conditional probability P(W|X) can be obtained from

P(W | X) = (1/Z(X)) ∏_{t=1}^{T} exp( Σ_{k=1}^{K} λ_k f_k(w_t, w_{t-1}, x_t) ),

with normalization term

Z(X) = Σ_W ∏_{t=1}^{T} exp( Σ_{k=1}^{K} λ_k f_k(w_t, w_{t-1}, x_t) ),

where λ_k are the model parameters and f_k the feature functions, in this study specified as the set of state-transition functions f_t(w_t, w_{t-1}, x_t) defined for all possible state-transition pairs (s, r), and state-observation functions f_e(w_t, w_{t-1}, x_t) defined for all state-observation pairs (s, x), and where the notation 1{condition} has a value of 1 if the condition is true, and 0 otherwise. A total of K = S² + S parameters θ = {λ_{s,r}, λ_{s,x}}, with S the number of classes, are estimated during training using N input sequences (corresponding to state and observation sequences from N different recordings), by maximizing the conditional log-likelihood (Sutton & McCallum 2007)

L(θ) = Σ_{i=1}^{N} log P(W^(i) | X^(i)),

where the superscript i indicates the i-th state-observation sequence pair. The set of parameters θ that maximizes L(θ) can be found using numerical optimization techniques such as the Broyden-Fletcher-Goldfarb-Shanno algorithm (Dennis, Jr & More 1997).
Decoding of a linear-chain CRF is done in a similar way as for HMMs, using a modified version of the Viterbi algorithm (Sutton & McCallum 2007) where the most likely sequence of states is progressively determined as the sequence of classes that maximizes the recursion

δ_t(s) = max_r [ δ_{t-1}(r) + Σ_{k=1}^{K} λ_k f_k(s, r, x_t) ].

In a manner comparable to the use of time-varying prior probabilities for LD, a trivial way of encoding time information in Bayesian chains is to simply add a time feature to the feature set. By defining the lights-off and lights-on instants as times t = 0 and t = 1 respectively, and by linearly interpolating the values for all epochs in between, a comparable (and normalized) time feature can be obtained for each recording. This technique was used for both the HMM and CRF classifiers, which are further denoted HMMt and CRFt to explicitly indicate the use of time information.
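As an illustration of the decoding machinery shared by the HMM and CRF classifiers, a minimal log-domain Viterbi decoder for a linear chain can be sketched as follows. For an HMM the transition matrix holds log transition probabilities and the observation scores are log emission probabilities; for a linear-chain CRF the same recursion applies with the (unnormalized) transition and state-observation scores. This is a generic sketch, not the authors' implementation:

```python
import numpy as np

def viterbi(log_start, log_trans, log_obs):
    """Most likely state sequence for a linear chain.

    log_start: (S,)   log score/prior of the first state
    log_trans: (S, S) log_trans[r, s] = log score of moving from state r to s
    log_obs:   (T, S) per-epoch log observation scores for each state
    """
    T, S = log_obs.shape
    delta = log_start + log_obs[0]          # best score ending in each state
    back = np.zeros((T, S), dtype=int)      # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # (S, S): previous r -> current s
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[t]
    # Backtrack from the best final state.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```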

Evaluation
To compare the performance of the three classifiers (and their corresponding time variants) in the multiple-stage classification tasks, a 10-fold cross-validation procedure was used. Furthermore, to eliminate the possibility of subject-dependent bias between training and validation, when a participant had more than one recording, all recordings of that participant were made part of the same fold, for either training or validation. The same folds were used for all classifiers to allow for paired comparisons.
The performance of each of the six classifiers (LD, LDtvp, HMM, HMMt, CRF and CRFt) was evaluated on epoch-per-epoch agreement between the predicted and ground-truth classes on a five-class classification task (Wake/N1/N2/N3/REM) in terms of accuracy (percentage of epochs where the predicted and ground-truth sleep stage agreed) and Cohen's κ coefficient of agreement, an agreement measure that compensates for the chance of random agreement (Cohen 1960). κ is usually interpreted with the following terms: below 0.20 indicates 'slight agreement', 0.21 to 0.40 'fair agreement', 0.41 to 0.60 'moderate agreement', 0.61 to 0.80 'substantial agreement', and above 0.81 'almost perfect agreement'. In addition, the performance of all classifiers was also evaluated on four-class (Wake/N1+N2/N3/REM) and three-class (Wake/NREM/REM) tasks. These results were obtained by merging both the predicted and ground-truth classes in the following way: for the four-class task, N1 and N2 were merged into a single N1+N2 class; for the three-class task, N1, N2 and N3 were merged into a single NREM class. A two-tailed Wilcoxon signed-rank test (Wilcoxon 1945) was used to test for significant differences between the best and the remaining classifiers for each task. This evaluation procedure was performed separately for the 'regular', the 'OSA' and the 'all' data sets.
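The two agreement metrics and the class-merging step can be sketched as follows (a simplified illustration; the merging dictionary follows the three-class task definition above):

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of epochs where predicted and ground-truth stages agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    po = accuracy(y_true, y_pred)                    # observed agreement
    ct, cp = Counter(y_true), Counter(y_pred)
    pe = sum(ct[c] * cp[c] for c in ct) / (n * n)    # chance agreement
    return (po - pe) / (1 - pe)

# Merging five-class labels to the three-class task (Wake/NREM/REM).
MERGE3 = {"Wake": "Wake", "N1": "NREM", "N2": "NREM", "N3": "NREM", "REM": "REM"}

def merge(labels, mapping=MERGE3):
    return [mapping[l] for l in labels]
```

Note that, as discussed later for Table 5, merging is applied to both the predicted and the ground-truth sequences, so confusions among the merged stages vanish from the merged-task metrics.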
In order to evaluate whether the class merging process is an adequate procedure when a higher performance is desired at the cost of simplifying the classification problem, the performance of each classifier was compared with that obtained with dedicated classifiers, trained specifically for the four- and three-class problems. This was done by merging the ground-truth annotation classes before training. Two-tailed Wilcoxon signed-rank tests were used to test for significant differences between the class-merging and the dedicated-training approaches.
Finally, given the difference in complexity between the classifiers, and to heuristically check for the presence of over-fitting or generalization problems, the training and validation performance of all classifiers was computed using cross-validation for a varying number of participants, and so-called learning curves were plotted.
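This learning-curve procedure can be sketched as follows, with hypothetical fit and score callbacks standing in for classifier training and evaluation (names and structure are our own, for illustration only):

```python
def learning_curve(train_fold, val_fold, fit, score, sizes):
    """Training and validation performance for increasing training-set sizes.

    train_fold/val_fold: lists of per-participant data (e.g. (features, labels))
    fit(subset) -> trained model; score(model, subset) -> accuracy in [0, 1]
    sizes: numbers of training participants to evaluate
    Over-fitting would show as a persistent gap between the two curves.
    """
    points = []
    for n in sizes:
        model = fit(train_fold[:n])
        points.append((n, score(model, train_fold[:n]), score(model, val_fold)))
    return points  # (n, training accuracy, validation accuracy) per size
```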
Results

For the three-class task, CRFt achieved the best performance of all classifiers, achieving an average κ of 0.59 ± 0.15 (median 0.62) and accuracy of 80.67 ± 6.93% (median 81.29%) in the 'regular' data set, a κ of 0.50 ± 0.13 (median 0.51) and accuracy of 75.40 ± 6.83% (median 75.77%) for 'OSA', and a κ of 0.54 ± 0.14 (median 0.55) and accuracy of 76.70 ± 7.43% (median 77.62%) for 'all'. For the three- and four-class classification tasks, the results were obtained by merging the classes resulting from five-class classification. The subset comparisons of Table 4 were obtained with cross-validation on the 'all' data set but analyzed separately for the recordings that were part of the 'regular', 'OSA' and 'irregular' (healthy participants not part of the 'regular' set) data sets.
An unpaired Wilcoxon rank-sum test (Wilcoxon 1945) was used to compare, for the CRFt classifier, the performance obtained between the subsets when performing five-class classification. The performance was not significantly different between the 'regular' and the 'irregular' sets, neither for accuracy (p = .64) nor for κ (p = .46). It was, however, significantly higher for 'regular' than for 'OSA' (p < .001 and p < .001), and for 'irregular' than for 'OSA' (p < .01 and p < .001).
Table 5 indicates the results for LDtvp and CRFt obtained with a dedicated classifier trained specifically with three and four classes and compares them with the results obtained by merging the classes detected with the five-class classifier for each set.
Tables 6, 7 and 8 indicate the confusion matrices for five-class classification with the CRFt classifier for the three data sets. In addition, each table indicates the sensitivity and the positive predictive value (PPV) for each class and, for comparison, the performance obtained with LDtvp. For Tables 7 and 8, the performance obtained by evaluating the results of each subset after cross-validation with CRFt on the 'all' data set is also indicated.
Finally, Figures 1a, 1b and 2 illustrate the so-called 'learning curves' for the LDtvp, HMM and CRFt classifiers for the three data sets. For each classifier, the curves illustrate the training and validation performance, averaged after cross-validation, for a varying number of participants. The confusion matrices compare the predicted and the reference sleep stages for all epochs of all participants in the data set after cross-validation with CRFt (above) and LDtvp (below); bold indicates the classifier (between CRFt and LDtvp) which obtained the best sensitivity and PPV for each class.

For the 'regular' data set, the differences between CRFt and the remaining classifiers are significant for all performance metrics, except for accuracy in the three-class task. The same can be observed for the 'all' data set in Table 4, where the differences are significant for both κ and accuracy, for all classification tasks. It is clear that the addition of normalized time information as a feature improves the classification performance for healthy participants in comparison with the standard CRF. Regarding the 'OSA' data set, as indicated in Table 3, CRFt achieves a significantly higher κ and, for the five-class task, a significantly higher accuracy than all other classifiers except CRF. Interestingly, CRF achieves a better accuracy for the three- and four-class tasks, albeit not significantly in comparison with CRFt and LDtvp. In contrast with the classification for healthy participants, it is clear that for the 'OSA' data set the addition of time information leads to smaller improvements over standard CRF. OSA is known to cause increased sleep fragmentation due to the arousals which often follow apnea and hypopnea events (Roehrs et al. 1989), and also a decrease of N3 (Ratnavadivel et al. 2009, Bianchi et al. 2010) and REM (Bianchi et al. 2010). Likewise, it is reasonable to expect that, at least for patients with a higher OSA severity, the occurrence of apnea events possibly dictates more of the changes in sleep architecture than the natural progression of sleep over time. While for healthy participants, for instance, there is a strong association between the occurrence of N3 and REM and sleeping time, the presence and progression of these sleep stages in an apnea patient may depend more on the disordered breathing events, which can occur during the entire night.

Performance comparison between data sets
Comparing Tables 2, 3 and 4, it is clear that the overall performance for 'all' participants is lower than for 'regular' participants but higher than for 'OSA' participants. Accordingly, healthy sleep architectures seem easier to model and classify with the current set of cardiorespiratory features, in contrast with the more complex characteristics of healthy participants who do not have regular sleep architectures, and even more so with OSA patients, for whom the architecture may be disrupted by their condition. Interestingly, as illustrated by the results for the subsets 'regular' and 'OSA' of the 'all' data set (also indicated in Table 4), the performance for those participants is overall lower than what was obtained when the classifier was trained separately with examples of only 'regular' and only 'OSA' participants (Tables 2 and 3). This suggests that by attempting to model the common characteristics of all participants, including those with regular, irregular and disrupted sleep architectures, the classifier compromises its ability to learn characteristics specific to individual groups of participants in order to better capture irregular sleep architectures. A prior stratification of the sleep architecture (at least between regular and irregular participants) and screening for a sleep disorder such as apnea would allow specific classifiers to be used instead of a globally trained one, and improve overall classification performance.
Regarding the comparison between the performance of CRFt and LDtvp for the subsets of the 'all' data set, it can be observed that the differences depend on the characteristics of the subset. Although there are no significant differences in performance for the 'regular' subset, the differences are significant for all tasks for the 'irregular' subset and for the five-class task for the 'OSA' subset. This suggests that a CRFt classifier trained with examples of all participants is better able to learn complex sleep architectures than LDtvp, in particular for the most complex case of five sleep stages.

Learning curves
Regarding Figures 1a and 1b, it is noticeable that the training and validation performances of the CRFt classifier converge at approximately the same number of recordings (around 70-80) as for the LDtvp classifier, and that both achieve a better training and classification performance than HMM, suggesting that the latter is simply not able to learn the characteristics of the sleep stages for either data set. The same can be seen in Figure 2 although, because the complexity of the problem increases with the addition of participants with irregular sleep architectures, both classifiers require a larger number of recordings for the performances to converge (around 200). What is clear from the three figures is that, for the 'regular', 'OSA' and 'all' sets, the number of recordings seems to be sufficient for adequate training and validation. There is no evidence of over-fitting or of problems generalizing to new data, in which cases the training performance would appear substantially higher than the validation performance when using the complete data sets.

Comparison between training strategies
Table 5 shows that there is, generally speaking, no clear advantage to training classifiers dedicated specifically to the three- and four-class tasks as compared with the results obtained by appropriately merging the classes resulting from five-class classification. For CRFt, the only significant improvements are in terms of accuracy for the three-class task in the 'regular' data set and in terms of accuracy for both tasks in the 'all' set. κ actually decreases (albeit not significantly) when compared with the results obtained by merging the five-class classification results. For LDtvp there are no significant improvements for any task, and the results are significantly worse in terms of κ for the four-class task for 'regular' and 'all', and for the three-class task for 'regular' and 'all'. This is likely due to substantial differences between the classes: when merged together, the resulting class is likely very heterogeneous, and a classifier trained with a fewer-class model still needs to learn - with a more complex model - the characteristics of the original classes. Consider for instance the CRF classifier, which uses the training data to model the probability P(w|X) of a sleep stage given a set of observations or features. When this is done for a class (for example NREM) obtained by merging multiple stages (in this case N1, N2 and N3), the merged feature space is highly heterogeneous since it comprises examples of substantially different stages, and the model that needs to be learned is highly complex. Instead, if the classifier is trained to learn P(w|X) separately for N1, N2 and N3, each individual model is simpler, and even if classification errors are made when trying to distinguish between these three stages, the confusion between the predictions for the three classes disappears when the results are combined into a single NREM class.

Confusion matrices
Table 6 indicates the confusion matrices obtained after cross-validation with CRFt and LDtvp for 'all' participants. It is clear that the CRFt classifier achieves a superior performance both in terms of sensitivity and PPV for almost all classes, with the exception of a lower sensitivity to N3 and a lower PPV for Wake. Furthermore, it is interesting to note that the LDtvp classifier confuses classes which are, in principle, quite disparate more often than CRFt does. For instance, notice the much larger number of predictions of N1 in epochs of REM (1170 versus 450 with CRFt), or the predictions of REM or Wake in epochs of N3 (721 and 1292 against 242 and 753 with CRFt, respectively). This suggests not only that CRFt does an overall better job of sleep stage classification, but also that its output is overall more plausible than that of LDtvp. Regarding the confusion matrix for the 'regular' data set, Table 7 indicates once more that CRFt, whether trained with 'regular' participants or with 'all' participants, yields better sensitivity and PPV than LDtvp for the same classes as in Table 6 with all participants. Interestingly, the same is not observed for 'OSA' participants, as indicated in Table 8. In this case, LDtvp achieves the best PPV for Wake, but not the best sensitivity for N3. Where LDtvp outperforms CRFt seems to be in the detection of N1, both in terms of sensitivity and PPV. A note should be made about the apparent superior performance of CRFt trained with 'all' participants versus CRFt trained with 'OSA' for some classes, for example the much higher sensitivity to N3. Although Tables 3 and 4 show that a dedicated classifier trained with OSA participants outperforms a classifier trained with all participants and then applied to this subset, the results in the confusion matrix appear to contradict that observation. This apparent contradiction stems from the fact that the confusion matrices are based on all epochs aggregated over all participants in that data set. As seen in Table 1, these OSA patients have, on average, less REM and especially less N3 sleep than the average for 'regular' participants and for the complete set. The pooled confusion matrix thus hides the much lower performance obtained by CRFt trained with 'all' participants, for example on participants with nearly no N3 or REM sleep.
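The per-class statistics discussed above follow directly from a pooled matrix, as in this minimal sketch (the counts are made up for illustration; rows are reference stages, columns predicted stages). Because epochs are pooled, participants with little N3 or REM contribute few epochs to those rows, which is exactly how a pooled matrix can mask poor per-participant performance.

```python
def sensitivity_ppv(conf, classes):
    """Per-class sensitivity and PPV from a square confusion matrix
    (rows = reference stage, columns = predicted stage)."""
    stats = {}
    for i, c in enumerate(classes):
        tp = conf[i][i]                            # correctly classified epochs
        ref_total = sum(conf[i])                   # epochs scored as c in the reference
        pred_total = sum(row[i] for row in conf)   # epochs predicted as c
        stats[c] = (tp / ref_total if ref_total else float("nan"),   # sensitivity
                    tp / pred_total if pred_total else float("nan")) # PPV
    return stats
```

Reporting per-participant κ alongside the pooled matrix, as done in Tables 3 and 4, avoids this epoch-count weighting.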

Examples of predicted hypnograms
To illustrate the classification output of the CRFt classifier, Figures 3 and 4 compare the ground-truth reference hypnogram with the classification output for the participant with the best performance and for the participant with the performance closest to the median (both in terms of κ) in the 'regular' data set. As can be clearly seen in Fig. 3, there is a substantial match between the reference and the prediction for the best participant, with most mistakes occurring in short transitions between classes, such as between epochs 50 and 100, where a sequence of transitions between N3 and N2 led to most mistakes in the first half of that recording. Regarding the participant with the performance closest to the median, in Fig. 4, it is clear that many classification errors again occur during periods with faster transitions between stages, such as between epochs 550 and 600, where there are successive transitions between REM and N2, or between epochs 700 and 800 and at the end of the recording, where there are successive transitions between Wake, REM and N1. Similar observations are made for 'OSA' participants, for whom the best participant and the participant with the performance closest to the median are illustrated in Figures 5 and 6. Although the cause of these successive transitions is more likely related to the occurrence of apnea events, the classifier is once more unable to fully cope with fragmented sleep architectures.
Regarding the classification of participants in the 'all' data set, Figures 7 and 8 compare the ground-truth reference hypnogram with the classification output for the participant with the best performance and for the participant with the performance closest to the median. The first remarkable aspect in Fig. 7 is the extremely low sleep efficiency of this participant, who spent a large part of the night awake. The classifier correctly detects this period, and most mistakes occur once more during short periods of N3 and N1 which alternate with N2. In Fig. 8 we again see that many mistakes occur during fast transitions to and from short periods of N3, although in this particular participant we also see a rather low sensitivity to N3, likely related to a weaker expression of some of the hallmark cardiac and respiratory characteristics of N3 seen in most participants in the training set.

Comparison with earlier work and state of the art
In earlier work we used a linear discriminant similar to the LDtvp classifier used in this manuscript and, with a κ of 0.49, achieved a moderate agreement with PSG, on par with the best classifiers reported in the literature at the time (Fonseca et al. 2015). That work used a smaller data set, composed exclusively of healthy participants with a regular sleep architecture. In comparison, we are now able to achieve a moderate agreement with nearly the same performance (κ of 0.48), but for a five-class classification task, using the CRFt classifier. In more recent work on four-class sleep stage classification of healthy adults, Beattie et al. (2017) used an accelerometer and a PPG sensor to extract movement, breathing and heart rate variability features, achieving a κ of 0.52 and an accuracy of 69%. Albeit on a different, larger data set, and when trained exclusively with 'regular' participants using cardiac and respiratory (but not movement) features, our CRFt classifier obtained comparable results, with an average and median κ of 0.50 and 0.53 and an accuracy of 69.6% and 70.8%, respectively. Additionally, CRFt seems better able to handle participants with irregular sleep architectures and with sleep disorders, achieving consistently better results than LDtvp both with a generally trained classifier evaluated on the 'irregular' and 'OSA' subsets, and with a dedicated classifier trained only on 'OSA' patients. In this regard, it is interesting to compare the results reported in the literature for sleep stage classification of OSA patients. Redmond and Heneghan reported an accuracy and κ on a three-class task of 67% and 0.32, achieving a fair agreement for 37 recordings (Redmond & Heneghan 2006), and Willemen et al. reported, also for a three-class task, an accuracy and κ of 70% and 0.41, thus achieving a moderate agreement for 25 recordings (Willemen et al. 2015). In comparison, the CRFt classifier in our work achieved an accuracy and κ of 75.4% and 0.50, respectively, for the same three-class problem, representing a substantial improvement, albeit on a different data set. Overall, these results suggest that the CRF classifier is better at modeling the transitional characteristics between sleep stages, while offering an improved separation between sleep stages with overlapping cardiorespiratory characteristics. These advantages allow it not only to better learn and predict more complex and irregular sleep architectures, but also to make the first steps towards five-class classification, the standard for sleep scoring used in clinical PSG, and, importantly, to achieve first promising results in the sleep stage classification of patients with sleep disorders.

Limitations
Finally, we should highlight the limitations of this technology. As discussed, the classification performance on OSA patients is higher when using a dedicated classifier than when using a classifier trained with both healthy and disordered participants. In a real-world application, this would imply knowing a priori whether a patient suffers from a sleep disorder (OSA in this case) before the appropriate classifier can be used. This may of course limit the immediate applicability of this technology in a clinical context, where the analysis of sleep architecture might be relevant for the complete understanding of a given condition in the first place. Alternative methods for the automatic detection of, in this case, respiratory events would be necessary in order to obtain a first segmentation of the participants into those with and without OSA. Furthermore, OSA is only one of many sleep disorders, and the applicability of these methods to patients suffering from other pathologies remains to be demonstrated. Regarding the HMM classifier, it clearly underperforms when compared with the LD and CRF classifiers. A possible explanation is related to the limitations imposed by the nature of the HMM classifier used, where the discretization of the features into a limited number of symbols, needed to compute the emission and transition probabilities, possibly limits the learning capabilities in this complex task. Although this problem was not so evident in our earlier results, where HMM was superior to LD for certain binary sleep classification tasks, this simplification is clearly not appropriate for the more complex problem of multi-class classification. The applicability of continuous HMM models remains to be evaluated.
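The discretization step can be sketched as follows (an illustrative assumption: equal-frequency binning into a small alphabet, not necessarily the scheme used here). Whatever the binning, all within-bin variation of a feature is invisible to a discrete-emission HMM, while the LD and CRF classifiers can exploit the continuous values directly.

```python
import numpy as np

def quantize(feature, n_symbols=8):
    """Map a continuous feature to integer symbols 0..n_symbols-1
    using equal-frequency (quantile) bin edges."""
    edges = np.quantile(feature, np.linspace(0, 1, n_symbols + 1)[1:-1])
    return np.digitize(feature, edges)  # one symbol per epoch
```

With, say, 8 symbols per feature, two epochs whose heart rate variability differs markedly can still be mapped to the same symbol, which may explain the HMM's difficulty in separating five overlapping stages.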

Conclusions
The performance of a CRF classifier was evaluated in a home sleep monitoring context with cardiorespiratory features, and compared with HMM and LD classifiers. In earlier work we showed how CRF can exploit transitional features and better handle non-separable features, but that evaluation was limited to binary classification in 'one class versus the rest' tasks. In this work we show that CRF is superior to the other classifiers also in the more useful multiple-stage classification problem. Furthermore, we explored the advantage of incorporating time as a feature for classification.
The results also demonstrated that five-class classification is feasible, and that dedicated three- and four-class classifiers do not outperform the results obtained by merging classes with a five-class classifier. Finally, it was shown that CRF classifiers also achieve a moderate performance in the sleep stage classification of OSA patients, outperforming earlier work reported in the literature for this population.
More research is needed to improve the sensitivity to the N1 stage, and to evaluate this technology in subjects suffering from other sleep disorders. Furthermore, given the improvements obtained with CRF over the single-epoch LD, higher-order multivariate Markov chains could be explored to make use not only of the previous state, but of a longer sequence of past states. Although this will introduce additional training complexity, it should lead to further performance improvements.
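A second-order transition model of the kind suggested above could be estimated as in the following sketch (illustrative, not a proposed implementation): the probability of the next stage is conditioned on the two preceding stages rather than only the last one.

```python
from collections import defaultdict, Counter

def second_order_transitions(stage_sequences):
    """Estimate P(next stage | two preceding stages) from scored hypnograms."""
    counts = defaultdict(Counter)
    for seq in stage_sequences:
        # slide a window of three consecutive epochs over each hypnogram
        for prev2, prev1, nxt in zip(seq, seq[1:], seq[2:]):
            counts[(prev2, prev1)][nxt] += 1
    # normalize counts into conditional probabilities per context
    return {ctx: {s: n / sum(ctr.values()) for s, n in ctr.items()}
            for ctx, ctr in counts.items()}
```

The number of contexts grows exponentially with the order, which is the additional training complexity mentioned above: with five stages, a second-order model already has up to 25 contexts to estimate reliably.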
Although the current performance of cardiorespiratory sleep stage classifiers is, overall, still modest, the technology has the potential to complement PSG for long-term monitoring at home, potentially replacing the role that actigraphy had over the last decades in the assessment of sleep-wake disorders, and extend it with more detailed information about sleep architecture.

Figure 1. Average validation and training (suffix TR) performance κ for a varying number of recordings in the (a) 'regular' and (b) 'OSA' data sets, for the LDtvp, HMM and CRFt classifiers.

Figure 3. Classification example for the participant with the highest κ among 'regular' participants (κ of 0.74 and an accuracy of 84.2%). From top to bottom: reference hypnogram, result of classification with the CRFt classifier, and epochs for which there is a mismatch between reference and classification.

Figure 5. Classification example for the participant with the highest κ among 'OSA' participants (κ of 0.71 and an accuracy of 79.9%). See caption of Fig. 3.

Figure 7. Classification example for the participant with the highest κ among 'all' participants (κ of 0.74 and an accuracy of 83.8%). See caption of Fig. 3.
cardiac and respiratory based classification of sleep and apneic events in subjects with sleep apnea Physiological Measurement 36 2103-18
Yu C, Liu Z, McKenna T, Reisner A T and Reifman J 2006 A method for automatic identification of reliable heart rates calculated from ECG and PPG waveforms Journal of the American Medical Informatics Association 13 309-20

Table 1 .
Demographics and sleep statistics of participants in the two subsets used in the study.

Table 5 .
Classification performance with classifiers trained with three and four classes.
The table indicates, for each classifier and classification task, the mean (standard deviation) κ and accuracy. Indicated in bold are the results for which the dedicated classifiers obtained a higher performance than that achieved by merging the results of the five-class classifier. Significant differences are indicated (*** p < .001, ** p < .01, * p < .05).

Table 6 .
Pooled confusion matrix for CRFt and LDtvp classification of 'all' participants.

Table 2. The CRF classifier with normalized time (CRFt) outperforms all other classifiers for the 'regular' data set for all classification tasks.

Table 7 .
Pooled confusion matrix for CRFt classification of 'regular' participants. See notes to Table 6. For comparison, the sensitivity and PPV obtained with LDtvp for the same data set, and with CRFt evaluated on the 'regular' set after cross-validation on the 'all' data set, are also given.