Validation of Somno-Art Software, a novel approach of sleep staging, compared with polysomnography in disturbed sleep profiles

Abstract
Study Objectives: Integrated analysis of heart rate (electrocardiogram [ECG]) and body movements (actimetry) during sleep in healthy subjects has previously been shown to generate, with Somno-Art Software, an evaluation of sleep architecture and continuity similar to that of polysomnography (PSG), the gold standard. However, the performance of this new approach to sleep staging has not yet been evaluated in patients with disturbed sleep.
Methods: Sleep staging from 458 sleep recordings from multiple studies comprising healthy and patient populations (obstructive sleep apnea [OSA], insomnia, major depressive disorder [MDD]) was obtained from PSG visual scoring using the American Academy of Sleep Medicine rules and from Somno-Art Software analysis of synchronized ECG and actimetry.
Results: Inter-rater reliability (IRR), evaluated with the 95% absolute agreement intra-class correlation coefficient, was rated as “excellent” (ICCAAAvg95% ≥ 0.75) or “good” (ICCAAAvg95% ≥ 0.60) for all sleep parameters assessed, except non-REM (NREM) and N3 sleep in healthy participants (ICCAAAvg95% = 0.43, ICCAAAvg95% = 0.56) and N3 sleep in OSA patients (ICCAAAvg95% = 0.59), which were rated as “fair” IRR. Overall sensitivity, specificity, accuracy, and Cohen’s kappa coefficient of agreement (κ) on the entire sample were 93.3%, 69.5%, 87.8%, and 0.65, respectively, for wake/sleep classification, and accuracy and κ were 68.5% and 0.55 for W/N1+N2/N3/rapid eye movement (REM) classification. Performance was similar in the healthy and patient populations.
Conclusions: The present results suggest that Somno-Art can be a valid sleep-staging tool in both healthy subjects and patients with OSA, insomnia, or MDD. It could complement existing non-attended techniques measuring sleep-related breathing patterns or be a useful alternative to laboratory-based PSG when the latter is not available.


Introduction
Polysomnography (PSG) is the gold standard for objective sleep monitoring and the diagnosis of many sleep disorders. PSG, composed mainly of an electroencephalogram (EEG), an electro-oculogram (EOG), and an electro-myogram (EMG), is cumbersome and time-consuming to set up; it is therefore costly, and access to it is limited (waiting lists are long and some large geographical areas are poorly equipped). For these reasons, PSG is generally limited to a maximum of one or two recording nights in the sleep laboratory. In parallel, the evaluation of sleep architecture and continuity consists of the visual scoring of 30-s epochs of PSG recordings based on the standard adopted by health care institutions, the American Academy of Sleep Medicine (AASM) manual [1]. Visual scoring is a tedious task, and several studies have reported an inter-rater reliability (IRR) under 85% [2][3][4][5].
Therefore, the development of new technologies to respond to these limitations of PSG could facilitate and improve clinical evaluation of sleep disturbances. Indeed, insomnia is often diagnosed based only on nonobjective tools such as the clinical interview, questionnaires, or sleep diary, which are much easier to obtain than PSG. Even if these nonobjective tools are useful and necessary to guide the diagnosis, objective sleep monitoring is mandatory to detect potential associated sleep disorders and may deliver information not inherent in the subjective patient report, such as detecting sleep state misperception. Furthermore, nonobjective tools often overestimate the symptoms compared to objective measures [6]. The diagnosis of insomnia, therefore, would benefit from several successive recording nights to be reliably evaluated. In addition, at-home sleep recording would avoid confounding factors specific to sleep laboratory settings, such as the first night effect, and reflect more accurately the normal environment in which the patient is living [7][8][9]. Diagnosis of sleep apnea syndrome would also benefit from an ambulatory sleep staging system to supplement ambulatory respiratory polygraphy, which does not discriminate between wake and sleep states, leading to misestimation of total sleep time (TST) and therefore of the apnea-hypopnea index (AHI). A new wave of research focuses on the detection of sleep independently of brain electrical activity (EEG), by adopting a multisensory approach based on the knowledge that autonomic variables such as heart rate and its variability are sleep stage-dependent [10][11][12]. However, most of the wearable devices on the market today lack robust validation studies and cannot be considered good and reliable alternatives to PSG [13]. In 2016, Muzet et al. validated the Somno-Art Software against PSG in healthy volunteers [14].
Somno-Art Software evaluates sleep architecture and continuity from an integrated analysis of heart rate and body movements. That study found an excellent intraclass correlation (according to Cicchetti [15] cutoffs) between Somno-Art Software and PSG for the combination of 12 sleep architecture and continuity descriptors (e.g. sleep efficiency [SE], sleep latency [SL], REM sleep) useful for the clinician in the diagnosis and quantification of treatments (both pharmacological and interventional).
Sleep disorders such as insomnia or obstructive sleep apnea (OSA) affect sleep architecture and continuity, complicating the visual scoring [2,3,16] and leading to lower IRR compared to healthy adults [2,3,17]. Therefore, most of the wearable devices based on cardiac and body movement or EEG signals are so far exclusively or mostly validated in healthy populations [18][19][20].
The aim of the present study is to evaluate the performance of the Somno-Art Software approach to sleep staging, based on heart rate and body movement, on disturbed sleep architecture and continuity. To do so, sleep recordings from healthy subjects and patients suffering from OSA, insomnia, or major depressive disorder (MDD) were analyzed. It is hypothesized that Somno-Art Software performance will be similar in healthy and pathological populations.

Source studies.
The dataset used for this research is based on data collected from six studies.
Recording nights from healthy subjects were acquired from two studies. The primary objective of Study 1 was to investigate the relationship between daytime activity and night sleep structure and the impact of noise on sleep patterns. The primary objective of Study 2 was to investigate the effect of light on sleep, wake, EEG, and cognitive performance as a function of homeostatic sleep drive. All recorded nights from these two studies were included in the dataset.
Recording nights from patients were acquired from four studies. The OSA study included patients diagnosed with OSA syndrome. The primary objectives of the insomnia study and the two depression studies were to evaluate the efficacy, safety, and tolerability of investigational drugs. Only pretreatment nights were included in the dataset. For all studies included in the present analysis, a standard screening of the health status of patients and healthy subjects was done before sleep recordings. More information on the protocols is provided in Supplementary 1.
All study protocols were approved by institutional review boards in accordance with the Declaration of Helsinki and the guidelines on Good Clinical Practice. Written consent was obtained from all participants according to local requirements.

Participants.
All subjects were free of any drug or medication that could affect sleep. Patients were diagnosed with OSA based on the AHI (≥5 [21]). Insomnia was diagnosed with the Insomnia Severity Index (> 15). MDD patients fulfilled Diagnostic and Statistical Manual of Mental Disorders (DSM)-IV or DSM-5 criteria (using MINI 6.0 or 7.0) and had a score ≥ 30 on the Inventory of Depressive Symptomatology (IDS-C30) or on the Montgomery-Åsberg Depression Rating Scale (MADRS) and a score ≥ 4 (markedly ill or worse) on the Clinical Global Impressions Severity Scale (CGI-S).
From a pool of 509 nights from 267 subjects, 458 recording nights from 246 subjects were included in the dataset after removing recordings that could not be analyzed due to Somno-Art Software limitations: recording nights with a time in bed under 5 h, recording nights with periodic movements, or recording nights with long R-R signal loss. In total, 79 nights from 26 healthy participants (up to five nights/subject), 33 nights from 30 patients with OSA (up to two nights/subject), 135 nights from 66 patients with insomnia (up to three nights/subject), and 211 nights from 124 patients with MDD (up to two nights/subject) were included in the analysis. Demographic and baseline information for each study group is presented in Table 1.

Study design
All the recordings combined standard PSG with ECG and actimetry recordings.
Sleep staging was performed according to the AASM rules, and the resulting reference classes were obtained by combining N1 and N2 into a single "N1 + N2" class while the remaining classes (wake, N3, and REM) were unchanged. The nights from the healthy and the OSA subjects were scored by experienced scorers, one per study. The nights from the insomnia and depression studies were scored by an independent expert scorer of the Siesta Group (Vienna, Austria) using the computer-assisted Somnolyzer software [23].

Cardiac activity from ECG.
Cardiac beat positions were extracted from the PSG ECG lead with Medilog Darwin v2.8. To avoid misdetection, periods without signal were excluded; no other beat correction was applied (artifacts and ectopic beats were left as is). Successive inter-beat intervals (R-R intervals) were then computed from this continuous series of beats. Heart rate (HR, in beats per minute) was calculated from R-R intervals as HR = 60/RR (RR in seconds) and then interpolated at 1 Hz.
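The conversion from beat positions to a 1 Hz heart rate series can be sketched as follows. This is a minimal illustration, not the Somno-Art implementation: the function name `heart_rate_1hz` is hypothetical, linear interpolation is an assumption (the interpolation method is not specified here), and the exclusion of periods without signal is not handled.

```python
import numpy as np

def heart_rate_1hz(beat_times_s):
    """Return (t, hr): heart rate in beats per minute resampled on a 1-s grid.

    beat_times_s: increasing beat positions in seconds (at least two beats).
    """
    beats = np.asarray(beat_times_s, dtype=float)
    rr = np.diff(beats)            # R-R intervals in seconds
    hr = 60.0 / rr                 # HR = 60 / RR
    hr_times = beats[1:]           # assumption: stamp each value at the beat closing its interval
    # 1 Hz grid restricted to the span covered by the HR samples
    t = np.arange(np.ceil(hr_times[0]), np.floor(hr_times[-1]) + 1.0)
    # assumption: linear interpolation between beat-to-beat values
    return t, np.interp(t, hr_times, hr)
```

For example, beats at 0, 1, and 3 s give R-R intervals of 1 s and 2 s, that is, 60 and 30 beats per minute, with an interpolated value of 45 at t = 2 s.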

Wrist movement from actimetry.
Nondominant wrist movement activity was recorded using an ActiGraph (Actigraph LLC, Pensacola, FL) activity monitor. Raw data were filtered and accumulated every second. Wrist actimetry was measured as the vector magnitude of the accelerations obtained every second in the three spatial dimensions, and its value is given in counts per second.
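As an illustration of the vector magnitude described above, the sketch below combines three per-second axis series into one activity value per second. It assumes the filtering and per-second accumulation have already been done by the device software; the function name and argument names are hypothetical.

```python
import numpy as np

def vector_magnitude(counts_x, counts_y, counts_z):
    """Per-second vector magnitude of three axis series (counts per second)."""
    x, y, z = (np.asarray(a, dtype=float) for a in (counts_x, counts_y, counts_z))
    return np.sqrt(x ** 2 + y ** 2 + z ** 2)
```

A second with axis values of 3, 4, and 0 counts therefore yields a vector magnitude of 5 counts.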

Somno-Art Software.
To perform Somno-Art Software 2.6.0 [3.1.0] analysis, a precise synchronization of the actimetry and the PSG ECG signal was achieved. Visual inspection was performed to confirm that events such as cardiac arousals (a sudden increase in heart rate followed by a return to initial values) were associated with wrist movements.
Using heart rate at a beat-to-beat resolution and actimetry data at a 1 Hz resolution, sleep stage classification (wake, N1+N2, N3, REM) was performed at a 1-s epoch resolution. This 1-s classification was then merged into 30-s epochs to be compared to visual scoring. To do so, the most prevalent stage within each 30-s epoch, or the first occurring stage when stages were equally represented, was selected.
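The merging rule can be sketched as a majority vote with a first-occurrence tie-break. This is an illustrative reading of the rule, not the software's code; the function name `merge_epochs` is hypothetical, and dropping an incomplete trailing window is an assumption.

```python
from collections import Counter

def merge_epochs(stages_1s, epoch_len=30):
    """Merge a 1-s stage sequence into epochs of epoch_len seconds.

    Each epoch takes the most prevalent stage; when several stages are
    equally represented, the one occurring first in the epoch is kept.
    """
    merged = []
    # assumption: an incomplete final window is discarded
    for start in range(0, len(stages_1s) - epoch_len + 1, epoch_len):
        window = stages_1s[start:start + epoch_len]
        counts = Counter(window)
        top = max(counts.values())
        # scanning the window in order returns the earliest of the tied stages
        merged.append(next(s for s in window if counts[s] == top))
    return merged
```

With this rule, a 30-s window made of 15 s of REM followed by 15 s of wake is labeled REM, because REM occurs first.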
The sleep classification algorithm is based on the detection and quantification of physiological events such as movements or cardiac arousals in association with Support Vector Machine (SVM) detectors. SVM detectors were trained on a pool of recording nights (3 to 5 recordings, depending on the detector), optimized on 123 recording nights, and tested on a pool of 118 recordings. In a final step, the sleep stage classification is fine-tuned by more than 40 expert rules to better discriminate transition phases. More information on the data processing methodology is described in Muzet et al. [14].

Statistical analysis
Following the guidelines published in SLEEP after the 2018 International Biomarkers Workshop on Wearables in Sleep and Circadian Science [13], the statistical tools recommended in Table 3 of those guidelines ("Guidelines for performing and interpreting results from device validation of sleep and circadian metrics"), namely descriptive statistics, Bland-Altman plots, and epoch-by-epoch (EBE) analysis (sensitivity, specificity, confusion matrix), were used to evaluate the agreement of Somno-Art Software with PSG.

Sleep parameter analysis.
Derived from the sleep stage classification, the following AASM sleep-wake statistics were computed: TST, SE, wake after sleep onset (WASO), and SL. In addition, latency to persistent sleep (LPS), defined as the elapsed time between lights-off and the first occurrence of 10 continuous minutes of sleep in any stage, and REM sleep latency (REML), defined as the elapsed time between sleep onset and the first occurrence of REM sleep, were computed.
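A minimal sketch of how these parameters can be derived from a 30-s hypnogram is given below. It assumes that the hypnogram runs from lights-off to lights-on, that wake is labeled "W" and REM sleep "REM", and that WASO is obtained as time in bed minus SL minus TST; the function names are hypothetical and this is not the Somno-Art code.

```python
def first_run(flags, length):
    """Index where the first run of `length` consecutive True values starts, or None."""
    run = 0
    for i, flag in enumerate(flags):
        run = run + 1 if flag else 0
        if run == length:
            return i - length + 1
    return None

def sleep_parameters(hypnogram, epoch_min=0.5, persistent_min=10):
    """Sleep-wake statistics in minutes (SE in %) from lights-off to lights-on."""
    hypnogram = list(hypnogram)
    asleep = [stage != "W" for stage in hypnogram]
    if not any(asleep):
        raise ValueError("no sleep epoch in the recording")
    tib = len(hypnogram) * epoch_min          # time in bed
    tst = sum(asleep) * epoch_min             # total sleep time
    onset = asleep.index(True)                # first sleep epoch
    sl = onset * epoch_min                    # sleep latency
    lps_idx = first_run(asleep, round(persistent_min / epoch_min))
    rem_idx = hypnogram.index("REM") if "REM" in hypnogram else None
    return {
        "TST": tst,
        "SE": 100.0 * tst / tib,
        "SL": sl,
        "WASO": tib - sl - tst,               # wake time after sleep onset
        "LPS": None if lps_idx is None else lps_idx * epoch_min,
        "REML": None if rem_idx is None else (rem_idx - onset) * epoch_min,
    }
```

LPS and REML are returned as `None` when no 10-min sleep period or no REM epoch is present, which leaves the handling of such nights to the caller.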
To take into consideration the multiple nights from the same subject, the mean sleep parameters of each subject were calculated and only one data point per subject was used for the analysis. The IRR between Somno-Art Software and the visual scorer was assessed for all sleep parameters (TST, SE, WASO, SL, LPS, REML, wake, N1 + N2, N3, NREM, and REM sleep) by calculating the absolute agreement intra-class correlation coefficient (ICC AAAvg: the degree of absolute agreement for measurements) using a two-way mixed model with "subject" as a random effect and "rater" as a fixed effect [24]. The 95% ICC AAAvg values were estimated after trimming 5% of the data as outliers (based on the differences between the PSG visual scorer and Somno-Art Software), using a two-sided approach that removes 2.5% at each end. An ICC estimate of 1 indicates perfect agreement and 0 indicates only random agreement (values increase by one method and decrease by another method, nondirectional). Cicchetti [15] provides commonly cited cutoffs for qualitative ratings of agreement based on ICC values: 0-0.39: "poor" agreement; 0.40-0.59: "fair" agreement; 0.60-0.74: "good" agreement; 0.75-1: "excellent" agreement.
Bland-Altman plots were constructed to qualitatively assess the concordance between Somno-Art Software and the visual scorer and to evaluate overall device performance. To quantify the bias, its 95% CI, and the lower and upper limits of agreement of the Bland-Altman analysis, Design 3 of the NCSS software, which addresses multiple within-subject assessments, was used (https://www.ncss.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Bland-Altman_Plot_and_Analysis.pdf).
In short, the mean difference corresponds to the mean of the subject means, and the limits of agreement (LoA) are calculated from the SD of the differences, which combines pooled estimates of the within-subject and between-subject random errors and the harmonic mean of the replicate counts. Finally, confidence interval estimation for the LoA is based on the MOVER method, which provides adjusted confidence intervals and is accurate for small to moderate sample sizes. The Bland-Altman plots allow the visualization of discrepancies and the interpretation of biases: a positive bias indicates that the Somno-Art Software underestimated the observed outcome, while a negative bias indicates that the Somno-Art Software overestimated the observed outcome.
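For illustration, an average-measures, absolute-agreement ICC can be computed from the mean squares of a two-way ANOVA as (MSR - MSE) / (MSR + (MSC - MSE)/n), where MSR, MSC, and MSE are the mean squares for subjects, raters, and residual error and n is the number of subjects. The sketch below applies this standard formula and the Cicchetti cutoffs; the function names are hypothetical, the outlier trimming step is not included, and it is not claimed to reproduce the exact procedure of the statistical package used in the study.

```python
import numpy as np

def icc_absolute_agreement_avg(ratings):
    """Average-measures, absolute-agreement ICC.

    ratings: array of shape (n_subjects, k_raters), e.g. one column for
    PSG visual scoring and one for the software.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()   # between subjects
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()   # between raters
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (msc - mse) / n)

def cicchetti_label(icc):
    """Qualitative rating of an ICC value using the cutoffs quoted above."""
    if icc < 0.40:
        return "poor"
    if icc < 0.60:
        return "fair"
    if icc < 0.75:
        return "good"
    return "excellent"
```

With these cutoffs, an ICC of 0.59 is labeled "fair" and an ICC of 0.60 is labeled "good".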
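The sign convention of the bias can be illustrated with a classic Bland-Altman computation on one value per subject. This simplified sketch does not implement the replicate-measures design or the MOVER confidence intervals used in the study; it assumes differences are taken as PSG minus Somno-Art Software, so that a positive bias corresponds to an underestimation by the software, and the function name is hypothetical.

```python
import numpy as np

def bland_altman(psg, device):
    """Return (bias, lower LoA, upper LoA) for paired per-subject values."""
    psg = np.asarray(psg, dtype=float)
    device = np.asarray(device, dtype=float)
    diff = psg - device            # assumption: positive means the device underestimates
    bias = diff.mean()
    sd = diff.std(ddof=1)          # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd
```

For PSG values of 10, 12, and 14 min and device values of 9, 12, and 12 min, the differences are 1, 0, and 2 min, giving a bias of 1 min and limits of agreement of about -0.96 and 2.96 min.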

EBE analysis.
Sensitivity, specificity, accuracy, and Cohen's kappa coefficient of agreement κ [25] were used to evaluate EBE agreement. Sensitivity is defined as the ability to correctly classify PSG sleep epochs, while specificity is defined as the ability to correctly classify wake epochs. Accuracy indicates the percentage of epochs correctly labeled relative to PSG. κ indicates the agreement between the two hypnograms corrected for agreement due to chance. These metrics were computed on each night before evaluating the distribution on the whole dataset. The κ score scale was applied for evaluating agreement between recorders: <0: poor; 0-0.20: slight; 0.21-0.40: fair; 0.41-0.60: moderate; 0.61-0.80: substantial; 0.81-1: almost perfect agreement [25].
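For the wake/sleep classification, these four metrics can be computed for one night as sketched below, with sleep as the positive class and PSG as the reference. The function name and the "W" label for wake are assumptions, and the sketch expects each night to contain both wake and sleep epochs in the PSG scoring (otherwise a division by zero occurs).

```python
def wake_sleep_agreement(psg, device, wake_label="W"):
    """Sensitivity, specificity, accuracy, and Cohen's kappa for one night.

    psg, device: equal-length sequences of 30-s epoch labels; any label
    other than wake_label is counted as sleep.
    """
    pairs = [(p != wake_label, d != wake_label) for p, d in zip(psg, device)]
    tp = sum(p and d for p, d in pairs)              # sleep scored as sleep
    tn = sum((not p) and (not d) for p, d in pairs)  # wake scored as wake
    fn = sum(p and (not d) for p, d in pairs)        # sleep scored as wake
    fp = sum((not p) and d for p, d in pairs)        # wake scored as sleep
    n = len(pairs)
    accuracy = (tp + tn) / n
    # agreement expected by chance from the marginal totals of both hypnograms
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": accuracy,
        "kappa": (accuracy - expected) / (1 - expected),
    }
```

Computing the metrics per night and then summarizing their distribution matches the order of operations described above.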
Confusion matrices represent EBE analysis by cross-tabulating the agreement and disagreement between Somno-Art Software and PSG visual scoring.

Results
Table 2 presents the mean ± SD of each sleep architecture and continuity descriptor obtained with Somno-Art Software and visual scoring of PSG, computed on the mean value of each subject (n = 246). For the entire sample, the IRR, based on ICC values, was "good" for N3 sleep and "excellent" for all remaining sleep parameters presented in Table 2. The healthy sub-group presents "excellent" ICC for TST, SE, WASO, SL, LPS, REML, and wake, "good" ICC for N1 + N2 and REM sleep, and "fair" ICC for N3 and NREM sleep. For the overall pathology dataset, "excellent" ICC was observed for TST, SE, WASO, SL, LPS, REML, N1 + N2, and REM sleep and "good" ICC for N3 sleep. For OSA patients, the ICC of TST, WASO, REML, wake, N1+N2, NREM, and REM sleep was "excellent", while SE, SL, and LPS had "good" ICC and N3 had "fair" ICC. All the sleep parameters of the insomniac and MDD patients had "excellent" ICC, except REML and N3, which had "good" ICC in both study groups.
Bland-Altman plots (Figure 1) and the specific bias, ±95% CI of the biases and the lower and upper LoA (
Table 4 illustrates EBE agreement measured with accuracy, κ coefficient, sensitivity, and specificity. Wake/sleep classification accuracy ranged from 82.8% for the OSA sub-group to 93% for the healthy sub-group. The κ coefficient for wake/sleep classification was moderate for the OSA sub-group (κ: 0.54) and substantial for the other groups (κ: 0.63-0.70). Sensitivity ranged from 88.9% (OSA patients) to 95.1% (healthy sub-group), while specificity was lowest for OSA patients (64.5%) and highest for insomniac patients (74.5%). For the four-stage classification (W/N1 + N2/N3/REM), accuracy was lowest for OSA patients (63.9%) and highest for healthy participants (71.2%). The κ coefficient was moderate for all studied groups.
The confusion matrices (Table 5) illustrate the percentage agreement between Somno-Art Software and PSG visual scoring for each sleep stage. For all study groups, confusions between Somno-Art Software and PSG are mostly due to N1 + N2 sleep misclassification, principally confusions with N3 sleep. Mean misclassification between wake and REM sleep was < 10%. Sleep stage accuracy across the various studied groups was >85% for wake, N3, and REM sleep, and between 68% and 73% for N1 + N2 sleep.

Discussion
The results of the present study bring additional evidence for using algorithms that combine heart rate and body movement for scoring normal sleep in accordance with standard visual rules. The present research further extends these results to the two most common sleep disorders (chronic insomnia and OSA) as well as to the sleep of patients with MDD.
When considering all the investigated sleep parameters (TST, SE, WASO, SL, LPS, REML, Wake, N1 + N2, N3, NREM, and REM sleep), the agreement between Somno-Art Software and PSG showed "excellent" or "good" ICC in healthy subjects and in OSA, insomniac, and MDD patients, with the exception of N3 sleep in healthy subjects and OSA patients and NREM sleep in healthy subjects, where the ICC was "fair". Interestingly, healthy subjects and OSA patients present a higher bias for N3 sleep than the other groups. A closer look at the Bland-Altman plot shows that in the case of the healthy group, the bias increases with longer N3 duration: Somno-Art Software tends to underestimate N3 sleep for long N3 sleep durations (> 150 min). In OSA patients, who present with the lowest mean N3 sleep duration (56.8 min), Somno-Art Software tends to overestimate N3 sleep duration.
As expected from the sleep parameter analysis, EBE analysis achieves promising results. Sensitivity, specificity, accuracy, and κ coefficient of the overall dataset were 93.3%, 69.5%, 87.8%, and 0.65, respectively. Movement-based wearable devices, such as actimetry, often suffer from poor specificity, with difficulties detecting calm wake periods. A systematic review of the literature indicated that the specificity of actimetry ranges between 28% and 67% in healthy populations [26]. Somno-Art Software showed a higher specificity than actimetry on healthy subjects (mean specificity: 73.3%) and even on recording nights from patient populations (mean specificity: 69.2%). Compared to other algorithms based on heart rate and wrist movements, Somno-Art Software shows higher specificity in insomniac patients than the algorithm evaluated by Kahawage et al. (74.5% for Somno-Art Software vs. 45% by Kahawage et al. [27]), while Fonseca et al. [28] showed performance similar to the present results on a sample of patients with sleep disorders (69.2% for Somno-Art Software versus 72.9% for Fonseca et al. [28]).
For the four-stage classification (W/N1 + N2/N3/REM), Somno-Art Software presented an accuracy and κ coefficient of 68.5% and 0.55, respectively, on the overall dataset, a performance comparable to that of the heart rate-based algorithm evaluated by Radha et al. [29] on a similar population (accuracy: 77%, κ: 0.61).
Sleep stage accuracy was > 85% for wake, N3, and REM sleep. These results are comparable to or slightly above the IRR of visual scorers for wake and REM sleep, and clearly exceed it for N3 sleep [4]. Visual scorers present the highest inter-rater variability for sleep stage N3, generally due to the complexity associated with the measurement of slow wave (SW) duration and amplitude. In contrast, Somno-Art Software, as an automatic algorithm, is consistent in its definition of SW and may therefore yield more accurate results. Of note, the interpretation of the confusion matrices is improved by taking the duration of each sleep stage into consideration. In the case of wake, which represents 23.2% of the scored recordings for the overall dataset, 69.5% of wake epochs were correctly scored by the software, while 30.5% were misclassified as sleep. In parallel, only 6.5% of sleep epochs, which represent 76.8% of the scored recordings, were misclassified as wake. Moreover, the overall accuracy of wake was 87.8%. To further illustrate this point, insomniac patients, who have more wake epochs (29.6%) than the other participant groups, present a higher rate of correctly detected wake epochs (specificity: 74.5%).
Table legend: ICC 95%AAAvg, 95% absolute agreement ICC; LowB, lower bound of ICC 95%AAAvg. ICC cutoffs: 0-0.39: "poor" agreement; 0.40-0.59: "fair" agreement (yellow); 0.60-0.74: "good" agreement (light green); 0.75-1: "excellent" agreement (green) [15].
Similarly, N3 sleep represents only 15.5% of the scored recordings for the overall dataset, and in consequence presents the lowest score, with only 64.1% of epochs correctly scored. However, the overall accuracy of N3 sleep is 88.0%. Somno-Art Software presents an average accuracy of 71.5% for the discrimination of N1 + N2 sleep on the overall dataset. Most misclassifications were observed between N1 + N2 and N3 sleep. This finding is not surprising, as N1 + N2 sleep represents the predominant sleep stage. Moreover, the N1 sleep stage has low inter-rater agreement even between human scorers [2,4,17,30], and as previously mentioned, confusion between N2 and N3 sleep stages is well known in visual scoring [4], as the characterization of N3 sleep depends on the amount of SW, which are also present in N2 sleep [1]. With Somno-Art Software, switches between N1 + N2 sleep and N3 sleep are less frequent than with PSG, illustrating the lower performance of Somno-Art Software in the estimation of N1 + N2 sleep.
Interestingly, these results indicate comparable scoring performance of the Somno-Art Software in normal and pathological sleep. It should be emphasized that, in contrast to the results obtained with the Somno-Art Software, scoring pathological sleep has consistently been found less reliable than scoring normal sleep [2,16,17]. Indeed, the sleep of patients often presents a fragmented hypnogram and less obvious sleep stage characteristics (K-complexes, spindles, SW) than that of healthy subjects, leading to higher variance in visual sleep scoring. In addition, the Rechtschaffen and Kales and subsequently the AASM sleep scoring rules were developed for healthy individuals and may not adequately describe disturbed sleep [2]. Cardiac-based sleep scoring algorithms are usually evaluated only in healthy subjects [14,19,20,31,32]. Fortunately, recent studies are starting to evaluate algorithms on patients suffering from sleep disorders [27][28][29], which is necessary to ascertain whether they fit data coming from patients with disrupted sympatho-vagal balance, such as patients with OSA, insomnia, or MDD [33][34][35].

Limitations
Of note, the difference in the sample size of each study group may lead to different statistical power and thus different levels of precision. However, this concern is less relevant given the stated objective of this study, which was to evaluate the performance of the software on the various study groups and not to perform a between-group comparison. The analyzed data were already integrated in the learning process of the algorithm and could therefore bias the reported performance. However, the software has been trained on large datasets (more than 600 nights, including the data used for the above results), and the technologies used for this sleep classification algorithm are resistant to overfitting.
One limitation of the current version of Somno-Art Software is the duration of the recording. Indeed, the Software has been validated on recordings longer than 5 h and is therefore currently inadequate for use with shorter recordings (e.g. naps).

Conclusion
The present study indicates that Somno-Art is a reliable tool for the characterization of sleep architecture and continuity in both healthy subjects and patients with OSA, insomnia, or MDD. It opens new possibilities for measuring sleep at home in a less invasive, less costly, and more time-saving way than the gold standard, PSG. Somno-Art could, for instance, complement existing nonattended techniques measuring sleep-related breathing patterns or be a useful alternative to laboratory-based PSG when the latter is not available.

Supplementary Material
Supplementary material is available at SLEEP Advances online.