Sleep staging using nocturnal sound analysis

Sleep staging is essential for evaluating sleep and its disorders. Most sleep studies today use contact sensors that may interfere with natural sleep and bias results. Moreover, the availability of sleep studies is limited, and many people with sleep disorders remain undiagnosed. Here, we present a pioneering approach for estimating rapid eye movement (REM), non-REM, and wake stages (macro-sleep stages, MSS) based on sleep sound analysis. Our working hypothesis is that the properties of sleep sounds, such as breathing and movement, differ within each MSS. We recorded audio signals, using non-contact microphones, of 250 patients referred for a polysomnography (PSG) study in a sleep laboratory. We trained an ensemble of one-hidden-layer, feedforward neural network classifiers fed by time series of sleep sounds to produce real-time and offline analyses. The audio-based system was validated and produced an epoch-by-epoch (standard 30-sec segments) agreement with PSG of 87%, with a Cohen's kappa of 0.7. This study shows the potential of audio signal analysis as a simple, convenient, and reliable means of MSS estimation without contact sensors.


Macro sleep stage estimation
Offline system performance for each subject can be seen in the rightmost columns of Fig. S1. Additionally, the performance of epoch estimation (wake, REM, and NREM) was measured, using the validation dataset, as a function of the level of subject-induced sounds, such as respiratory and body movement sounds, relative to the background noise level (signal-to-noise ratio, SNR) of the testing room in the sleep laboratory. In our setting, the average overnight SNR among subjects ranged from -18.3 dB to 2.7 dB (95% CI). We found that the estimation accuracy of a given epoch improved by 2.2% for every 10 dB increase in SNR.
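For concreteness, the following is a minimal sketch of one way such a per-epoch SNR could be computed, assuming the raw audio and a boolean mask of subject-induced sound samples are available; the function and its interface are illustrative, not the study's implementation.

```python
import numpy as np

def epoch_snr_db(audio, fs, event_mask, epoch_len_s=30):
    """Per-epoch SNR (dB): power of subject-induced sounds (breathing,
    body movements) relative to background-noise power in the same epoch.
    `event_mask` is a boolean array, one entry per audio sample, marking
    subject-induced sound; its complement is taken as background noise."""
    spe = int(fs * epoch_len_s)                # samples per 30-s epoch
    n_epochs = len(audio) // spe
    snr = np.full(n_epochs, np.nan)
    for i in range(n_epochs):
        seg = np.asarray(audio[i * spe:(i + 1) * spe], dtype=float)
        msk = np.asarray(event_mask[i * spe:(i + 1) * spe], dtype=bool)
        if msk.any() and (~msk).any():         # need both signal and noise
            snr[i] = 10 * np.log10(np.mean(seg[msk] ** 2) /
                                   np.mean(seg[~msk] ** 2))
    return snr
```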
System accuracy of offline MSS estimation was analyzed across subjects' anthropometric characteristics (Supplementary Fig. S3A-C), AHI, and sleep efficiency (Supplementary Fig. S3D,E) using the validation dataset. Univariate analysis revealed that accuracy inversely correlates with age, BMI, and AHI, while sleep efficiency positively correlates with system accuracy. Multivariate analysis (adjusting for gender, age, BMI, AHI, and sleep efficiency) revealed that only sleep efficiency correlates with system accuracy (Supplementary Fig. S3F). For every 10% increase in sleep efficiency, system MSS accuracy increases by 1.3%. Supplementary Fig. S3F shows a multivariate regression analysis between predicted SSA, based on subject characteristics (gender, age, BMI, AHI, and SE), and system accuracy; each dot represents one individual from the validation dataset (n = 100), and r denotes the regression coefficient, shown with its p-value.
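A sketch of such a multivariate adjustment using ordinary least squares is given below; the file name and column names are placeholders, not from the study.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical per-subject table (n = 100 validation subjects); file and
# column names are placeholders: accuracy, gender, age, bmi, ahi, se.
df = pd.read_csv("validation_subjects.csv")

X = sm.add_constant(df[["gender", "age", "bmi", "ahi", "se"]])
fit = sm.OLS(df["accuracy"], X).fit()
print(fit.summary())
# Under this model, a coefficient of ~0.13 (accuracy % per % of sleep
# efficiency) on `se` would correspond to the reported 1.3% accuracy gain
# per 10% increase in sleep efficiency.
```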

Sleep quality parameters
Using the detected MSS from the offline analysis, sleep quality parameters were calculated. The comparison between the SSA estimation and PSG is presented using a Bland-Altman plot (a sketch of this comparison is given below). Future studies are needed to validate our approach on narcolepsy or insomnia patients.

Sleeping sounds include several types of sounds from several sources, including vocal sounds, body frictions (body movements), and other sounds such as clock ticking and barking dogs. In this study, we grouped these into three main sources: 1) breathing sounds; 2) body movement sounds; and 3) other sounds. Fig. S5 shows the distribution of sleeping sound sources and their sound intensities (volume) during the night. Manual annotations and segmentations were conducted by several raters using an ad hoc graphical user interface (GUI), which involved listening to the audio and visualizing several PSG channels (including effort belts and EMG) along with the spectrogram of the audio for each epoch. Annotations were made at 50 ms resolution.
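As referenced above, the following is a minimal sketch of a Bland-Altman comparison between PSG-derived and audio-derived (SSA) sleep quality parameters, assuming one value per subject; the plotting code is illustrative, not the original.

```python
import numpy as np
import matplotlib.pyplot as plt

def bland_altman(psg_values, ssa_values, label="Sleep efficiency (%)"):
    """Bland-Altman plot of a sleep quality parameter derived from PSG
    vs. the audio-based (SSA) estimation, one value per subject."""
    psg = np.asarray(psg_values, dtype=float)
    ssa = np.asarray(ssa_values, dtype=float)
    mean = (psg + ssa) / 2
    diff = ssa - psg
    bias = diff.mean()
    loa = 1.96 * diff.std(ddof=1)            # 95% limits of agreement
    plt.scatter(mean, diff, alpha=0.6)
    plt.axhline(bias, color="k", label=f"bias = {bias:.2f}")
    plt.axhline(bias + loa, color="r", linestyle="--", label="+/-1.96 SD")
    plt.axhline(bias - loa, color="r", linestyle="--")
    plt.xlabel(f"Mean of PSG and SSA ({label})")
    plt.ylabel("SSA - PSG difference")
    plt.legend()
    plt.show()
```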

Supplementary Fig. S5 | Sleeping sounds distribution. A) Sleeping sound content and sound intensity in dB SPL (the sum of all sources across all dBs gives 100%); B) Distribution of each sound source according to the sound intensity (the sum of each sound source across all dBs gives 100%). Data were collected from 67 patients who underwent full manual annotation of sleeping sounds (into three classes: breaths, body movements, and other sounds). Pdf, probability density function.

Detector robustness estimation
Sleeping sounds arise from several sources, including vocal sounds, body movements (frictions), and other sounds. Fig. S6 shows the performance of the detectors as a function of signal quality (SNR). To measure performance at a specific SNR value, sound events from all subjects were sorted and divided into 4 dB subbands with 50% overlap.
Detection agreement of a sound event (segmentation and detection) was measured by comparing the frame-by-frame (50 ms resolution) manual annotation with the detector's predictions. Although this comparison is extremely strict (e.g., detecting a 1.0-second event of 20 frames with a 50 ms delay of one frame results in 19/21, roughly 90%, frame agreement), we chose this comparison method because it measures the quality of both segmentation and detection, and it is easy to implement as a classifier cost function. Detectors were designed based on 25 subjects and were validated on a dataset of 42 subjects.
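A minimal sketch of this frame-by-frame comparison follows, reproducing the worked 19/21 example above (restricted to the frames that either source labels as the event); the interface is illustrative.

```python
import numpy as np

def frame_agreement(annotation, prediction):
    """Frame-by-frame (50 ms resolution) agreement between manual
    annotation and detector prediction; both are equal-length label
    sequences (e.g., 0 = non-event, 1 = event)."""
    annotation = np.asarray(annotation)
    prediction = np.asarray(prediction)
    return np.mean(annotation == prediction)

# Worked example from the text: a 1.0-second event (20 frames) detected
# with a one-frame (50 ms) delay agrees on 19 of the 21 frames that
# either source labels as the event.
ann = np.r_[np.zeros(5), np.ones(20), np.zeros(5)].astype(int)
pred = np.roll(ann, 1)                    # detector delayed by one frame
event_frames = (ann == 1) | (pred == 1)   # union of labeled event frames
print(np.mean(ann[event_frames] == pred[event_frames]))  # 19/21 ~ 0.905
```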

Supplementary Fig. S6 | Breathing and body movement detectors - performance vs. SNR. A) Breathing detector (inhalation/exhalation/non-breathing) accuracy and Cohen's kappa coefficient; B) Body movement detector (body-movement/non-body-movement) accuracy and Cohen's kappa coefficient. In each plot, the right Y-axis represents the accuracy scale (blue curve), and the left Y-axis represents the Cohen's kappa coefficient value (red curve); please note the different Y-axis scales. Data were collected from 42 patients who underwent full-night manual annotation of sleeping sounds.

Assessing feature importance
To assess the importance of each feature in the classifier, we measured the impact on MSS classification (Cohen's kappa coefficient) of corrupting only the tested feature. The table presents the name of the feature, its symbol, the number of features used (count), and its importance. The code name (symbol) of each feature is composed of a prefix (symbol group family) and a suffix (individual symbol abbreviation); for example, WB_SIE01 denotes a within-breathing feature (WB) indicating the top 1% of expiratory sound intensity. µ, mean; σ, standard deviation. The importance value represents the decrement in MSS classification (kappa coefficient) from the complete model (reference kappa = 0.694); e.g., without the within-breathing feature set (WB), system performance degrades to 0.424 (-0.270).
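A sketch of this corruption-based importance measure is given below, assuming corruption by randomly permuting the tested feature column(s) across epochs; the study's exact corruption scheme may differ.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def feature_importance(model, X, y, feature_cols, seed=0):
    """Importance of a feature (group) as the drop in Cohen's kappa when
    only that feature is corrupted. Corruption here is a random permutation
    of the tested column(s) across epochs (an assumption; the original
    corruption scheme may differ)."""
    rng = np.random.default_rng(seed)
    kappa_ref = cohen_kappa_score(y, model.predict(X))   # complete model
    X_bad = X.copy()
    X_bad[:, feature_cols] = rng.permutation(X_bad[:, feature_cols])
    kappa_bad = cohen_kappa_score(y, model.predict(X_bad))
    return kappa_ref - kappa_bad   # e.g., 0.694 - 0.424 = 0.270 for WB
```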

Feature calculations
The following section describes how to calculate some of the features presented in Table S2.
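As one illustrative example (an assumption based on the WB_SIE01 description above, not Table S2's verbatim definition), a within-breathing intensity feature could be computed as follows:

```python
import numpy as np

def wb_sie01(frame_intensity_db, exhalation_mask):
    """Sketch of a WB_SIE01-style feature (within-breathing, sound
    intensity of expiration, top 1%): the 99th percentile of frame
    intensities over the exhalation frames of an epoch. The exact
    definition in Table S2 may differ; this only illustrates the idea."""
    exp_frames = np.asarray(frame_intensity_db)[np.asarray(exhalation_mask, bool)]
    if exp_frames.size == 0:
        return np.nan    # no exhalation detected in this epoch
    return np.percentile(exp_frames, 99)
```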

Training the MSS classifiers
In this work, we configured time-series classifiers designed to learn short- to long-term relations between epochs, i.e., from adjacent-epoch relations up to relations between two REM cycles of roughly 90-100 minutes (180-200 epochs). We chose to work with a one-hidden-layer ANN, as it was powerful enough to learn the discriminative information in the design dataset features, yet simple enough to overcome the "overfitting curse" on a small database. In our case, 100 subjects (out of 150) in the design dataset had simultaneous recordings from two audio recorders, thereby presenting 200 observations for each time step. We treated the additional records as new subjects, i.e., a total of 250 records in the design dataset; although this is not as good as 250 different subjects, it is better than 150 in terms of model convergence and robustness to different microphone specifications and to the distance and angle to the subject's head. Another challenging task in the design phase was the unbalanced MSS class sizes; e.g., NREM is more abundant than REM and wake during sleep. To overcome this, we formulated a penalty weight for each epoch proportional to the PSG a priori probability of each MSS, as given in Eq. 1, where N is the total number of epochs in the design dataset, and Bool is the Boolean operator resulting in "1" if the statement is true.
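A minimal sketch of per-epoch penalty weights in this spirit follows; since Eq. 1 is not reproduced here, the exact form (weights inversely proportional to each MSS class prior, so the three classes contribute equally to the cost) is an assumption.

```python
import numpy as np

def epoch_penalty_weights(mss_labels):
    """Per-epoch penalty weights countering MSS class imbalance. Assumed
    form (Eq. 1 itself is not reproduced here): each epoch is weighted
    inversely to the empirical PSG prior of its class, computed with the
    Bool operator as P(c) = (1/N) * sum_n Bool(label_n == c), so that
    wake, REM, and NREM contribute equally to the training cost."""
    labels = np.asarray(mss_labels)
    n = len(labels)                               # N: epochs in design set
    classes, counts = np.unique(labels, return_counts=True)
    prior = {c: k / n for c, k in zip(classes, counts)}
    return np.array([1.0 / (len(classes) * prior[c]) for c in labels])
```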
Recent studies1-3 have shown that recurrent neural networks (RNN) and long short-term memory (LSTM) neural networks have superior potential to learn the relations found in time-series data. In our ongoing attempts to use a bi-directional LSTM (BiLSTM) model as a classifier, we achieved almost similar, yet slightly inferior, results for three-class estimation, mainly due to REM misdetection (see Fig. S10). We hypothesize that, similar to the BiLSTM, our originally proposed classifier holds enough of the information present in both short- and long-term memory for past and future information (using up to 200 adjacent epochs on each side). Consequently, training each sub-classifier (separately) proved to be a much easier task (to converge), with less sensitivity to hyperparameter values compared with the BiLSTM model. Further studies are needed to support these findings. Presented here is our attempt to use the BiLSTM model.
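For reference, here is a minimal sketch of a BiLSTM sequence classifier of the kind described; the layer sizes, sequence length, and feature count are placeholders, not the study's values.

```python
from tensorflow.keras import layers, models

SEQ_LEN = 400      # epochs per sequence (~200 on each side, as above)
N_FEATURES = 64    # per-epoch audio features (placeholder count)

# Sequence-to-sequence BiLSTM: one wake/REM/NREM prediction per epoch.
model = models.Sequential([
    layers.Input(shape=(SEQ_LEN, N_FEATURES)),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.TimeDistributed(layers.Dense(3, activation="softmax")),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# Per-epoch penalty weights (see the Eq. 1 sketch above) could be supplied
# to training as per-timestep sample weights.
```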

Overfitting assessment
Several factors affect system performance, including microphone specifications, distance to the subject's head, bedroom background noise level, and even the subject's loudness. In our study, most of these factors were fixed, except for the sound intensities generated by the subjects. To mitigate some of these factors, we trained our model using a large database and two types of microphones (see the main body for more information), located at a 90-degree angle and at about 0.5-1.0 m from the subject's head.
Robustness of the model is indicated by achieving high agreement with the gold standard (PSG) for each subject, with minor deviation between subjects. Additionally, performance on the training dataset may indicate the upper-limit performance, while large differences between training and validation datasets may indicate overfitting. The p-value was calculated using a two-tailed unpaired t-test (250 and 100 samples for the design and validation datasets, respectively).
One can see minor differences between the design and validation datasets (~3% accuracy and κ = 0.06).
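A sketch of this comparison follows, with placeholder accuracy values standing in for the study's per-record results.

```python
import numpy as np
from scipy import stats

# Placeholder per-record accuracies; real values come from the per-subject
# results (250 design records, 100 validation subjects).
rng = np.random.default_rng(0)
acc_design = rng.normal(loc=0.87, scale=0.05, size=250)
acc_valid = rng.normal(loc=0.84, scale=0.05, size=100)

# Two-tailed unpaired t-test between design and validation accuracies.
t_stat, p_value = stats.ttest_ind(acc_design, acc_valid)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```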