Prediction of Sleep Stages Via Deep Learning Using Smartphone Audio Recordings in Home Environments: Model Development and Validation

Background: The growing public interest in and awareness of the significance of sleep is driving demand for sleep monitoring at home. In addition to various commercially available wearable and nearable devices, sound-based sleep staging via deep learning is emerging as a promising alternative because of its convenience and potential accuracy. However, sound-based sleep staging has only been studied using in-laboratory sound data. Real-world sleep environments (homes) contain abundant background noise, in contrast to quiet, controlled environments such as laboratories. The use of sound-based sleep staging at home has not been investigated, although it is essential for practical, daily use. The main challenges are the lack of home data annotated with sleep stages and the expected huge expense of acquiring enough such data to train a large-scale neural network.

Objective: This study aims to develop and validate a deep learning method that performs sound-based sleep staging using audio recordings acquired in various uncontrolled home environments.

Methods: To overcome the lack of home data with known sleep stages, we adopted advanced training techniques and combined home data with hospital data. The training of the model consisted of 3 components: (1) the original supervised learning using 812 pairs of hospital polysomnography (PSG) and audio recordings, and the 2 newly adopted components, (2) transfer learning from hospital to home sounds by adding 829 smartphone audio recordings made at home, and (3) consistency training using augmented hospital sound data. The augmented data were created by adding 8255 home noise clips to the hospital audio recordings. In addition, an independent test set of 45 pairs of overnight PSG and smartphone audio recordings collected at home was built to examine the performance of the trained model.

Results: The accuracy of the model on our test set was 76.2% (63.4% for wake, 64.9% for rapid eye movement [REM], and 83.6% for non-REM). The macro F1-score and mean per-class sensitivity were 0.714 and 0.706, respectively. The performance was robust across demographic groups such as age, gender, BMI, and sleep apnea severity (accuracy 73.4%-79.4%). In the ablation study, we evaluated the contribution of each component. While supervised learning alone achieved an accuracy of 69.2% on home sound data, adding consistency training to the supervised learning increased the accuracy to a larger degree (+4.3%) than adding transfer learning (+0.1%). The best performance was obtained when both transfer learning and consistency training were adopted (+7.0%).

Conclusions: This study shows that sound-based sleep staging is feasible for home use. By adopting 2 advanced techniques (transfer learning and consistency training), the deep learning model robustly predicts sleep stages from sounds recorded in various uncontrolled home environments, using only smartphones and no special equipment.

For the other two home datasets, the home smartphone dataset and the home PSG dataset, we prospectively enrolled participants and collected data between June and November 2022. First, for the home smartphone dataset, volunteers were recruited and screened through an internet survey. Those who passed the screening were asked to download a mobile application specifically designed for audio recording.
Instructions were provided, such as placing the phone 0.5-1.0 m from the subject's head, connecting the phone to a charger, and activating the recording button before sleep. Various smartphone models were used, including both Android (OS version 8.0 or later) and iOS (version 15 or later) devices.
For the home PSG dataset, volunteers were recruited by the sleep center of the SNUBH, and written informed consent was obtained from each participant. An Embletta MPR/ST+ Proxy (Natus Medical Inc., Middleton, WI, USA) was used for home PSG with standard electrodes and sensors. An iPhone 11 was provided for audio recording. Participants were asked to place the smartphone on a side table or on the mattress, 0.5-1.0 m from their head.
The inclusion criterion for all three datasets was age 20 years or older. The two home datasets had an additional criterion: subjects needed to sleep alone in the bedroom, i.e., without a partner or pet, during the recording. For the large home smartphone dataset, recruitment was stratified by age at a ratio of 1:1:2 across participants in their 20s, in their 30s, and those aged 40 years or older, respectively.
Exclusion criteria for all three datasets were as follows: (1) patients with a major physical illness or psychiatric disorder; (2) patients with a history of head trauma, neurological disorders, cerebrovascular diseases, or a brain tumor; and (3) incomplete audio data. Audio data were considered incomplete in the following cases: (1) a recording error, indicated by zero-valued samples exceeding 15% of the one-night audio; or (2) insufficient information for the temporal synchronization of audio and PSG. For the two datasets that included PSG, subjects who failed PSG or whose total sleep time on PSG was less than 240 minutes were excluded.
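The zero-value criterion above is straightforward to compute directly from the raw waveform. The following is a minimal Python sketch of such a check (the function name and the use of the soundfile library are our illustration, not the study's actual pipeline):

import numpy as np
import soundfile as sf  # assumed I/O library; any WAV reader works

def is_recording_error(wav_path, zero_threshold=0.15):
    """Flag a one-night recording whose exactly-zero samples exceed 15%."""
    audio, _ = sf.read(wav_path)        # float samples, mono or multichannel
    audio = np.atleast_2d(audio.T)[0]   # keep the first channel only
    zero_fraction = float(np.mean(audio == 0))
    return zero_fraction > zero_threshold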

Forming the Noise Dataset
The noise clips were downloaded from Freesound using the Python API client provided at https://github.com/MTG/freesound-python. We chose keywords and sound tags (home, room noise, fan, etc.) that are highly likely to be recorded in residential environments. Only clips with a user rating above 4.0 were selected for training. In the end, we formed a noise dataset of 8255 noise clips to be used for consistency training.
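For illustration, a collection script along these lines can be written with the freesound-python client from the repository above; the API key placeholder, query string, rating filter, and output directory here are our assumptions, and the actual keyword list used in the study was broader:

import os
import freesound  # client from https://github.com/MTG/freesound-python

client = freesound.FreesoundClient()
client.set_token("<YOUR_API_KEY>", "token")

os.makedirs("noise_clips", exist_ok=True)
results = client.text_search(
    query="room noise",                 # one of several residential keywords
    filter="avg_rating:[4.0 TO 5.0]",   # keep only highly rated clips
    fields="id,name,avg_rating,previews",
    page_size=50,
)
for sound in results:
    # Download the MP3 preview of each matching clip (no OAuth required).
    sound.retrieve_preview("noise_clips", str(sound.id) + ".mp3")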

HomeSleepNet Aggregated Training Algorithm
Combining the three components described in the sections above yields the complete training algorithm for HomeSleepNet (Algorithm 1).

Algorithm 1. HomeSleepNet training algorithm
The following two steps are sequentially repeated until convergence.
Step 1: Train the classifier. Sample M hospital data X_h with M corresponding sleep stages Y_h, M smartphone data X_s, and 2×M noise samples X_n to update θ_f and θ_c:

ℒ_sup(X_h, Y_h; θ_f, θ_c) + ℒ_cons(X_h, X_n; θ_f, θ_c) + ℒ_adv(X_h, X_s; θ_f)

Step 2: Train the domain discriminator. Sample M hospital data X_h and M smartphone data X_s to update θ_d:

ℒ_disc(X_h, X_s; θ_d)

We use the notation ℒ(D; θ) to mean that the loss ℒ updates the parameters θ by using the data from D. θ_f represents the feature extractor parameters, θ_d the domain discriminator parameters, and θ_c the feature classifier parameters. X_h represents the hospital audio data and Y_h denotes the corresponding sleep stage labels. Similarly, X_s represents the home smartphone dataset, and X_n is the home noise dataset. Please note that we used SoundSleepNet as the pretrained network to train HomeSleepNet.
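To make the alternating scheme concrete, the following PyTorch-style sketch shows one training iteration. The exact loss forms, their weighting, and the noise-mixing function are our simplifying assumptions for illustration, not the published HomeSleepNet implementation:

import torch
import torch.nn.functional as F

def mix_noise(x_h, x_n):
    # Simplified stand-in for the augmentation: add a home noise clip to hospital audio.
    return x_h + x_n

def train_iteration(feat, clf, disc, opt_fc, opt_d, x_h, y_h, x_s, x_n1, x_n2):
    # Step 1: update the feature extractor (θ_f) and classifier (θ_c).
    z_h = feat(x_h)
    loss_sup = F.cross_entropy(clf(z_h), y_h)              # supervised loss, hospital data

    p1 = clf(feat(mix_noise(x_h, x_n1))).softmax(-1)       # two noise-augmented views,
    p2 = clf(feat(mix_noise(x_h, x_n2))).softmax(-1)       # hence 2×M noise samples
    loss_cons = F.mse_loss(p1, p2)                         # consistency between the views

    z_s = feat(x_s)
    d_logit = disc(torch.cat([z_h, z_s])).squeeze(-1)
    d_target = torch.cat([torch.ones(len(z_h)), torch.zeros(len(z_s))])
    # One common adversarial choice: the extractor maximizes the discriminator loss.
    loss_adv = -F.binary_cross_entropy_with_logits(d_logit, d_target)

    opt_fc.zero_grad()
    (loss_sup + loss_cons + loss_adv).backward()
    opt_fc.step()

    # Step 2: update the domain discriminator (θ_d) on frozen features.
    with torch.no_grad():
        d_in = torch.cat([feat(x_h), feat(x_s)])
    loss_disc = F.binary_cross_entropy_with_logits(disc(d_in).squeeze(-1), d_target)
    opt_d.zero_grad()
    loss_disc.backward()
    opt_d.step()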

The Seeming Underestimation of HomeSleepNet in Sleep Onset Latency
The mean sleep onset latency (SOL) from portable PSG is 26.4 minutes, while the value from HomeSleepNet predictions is 12.6 minutes, which suggests that HomeSleepNet underestimates SOL. This seeming underestimation is caused by a few special recordings in the home PSG dataset. For example, one recording spans almost 8 hours in total, but the sleep stage for the first 5 hours was Wake, resulting in an SOL of 5 hours. The HomeSleepNet model, though performing well overall, predicted several Light sleep epochs at around the 3-hour mark, which results in a predicted SOL of around 3 hours, 2 hours less than the true value from the PSG data (Figure S1). We removed the special recordings with a long initial Wake period that HomeSleepNet failed to predict (6 subjects in total) and recalculated the statistics. Without the special data, the results became reasonable: the mean SOL was 9.2 minutes for portable PSG and 11.0 minutes for HomeSleepNet predictions, a difference of only 1.8 minutes.

Figure S1. Illustration of special data with 5 hours of Wake from the beginning of the sleep test.
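For reference, SOL in this comparison can be computed from a hypnogram of 30-second epochs as the time from recording start to the first non-Wake epoch; here is a minimal sketch (our own helper, not the study's code):

def sleep_onset_latency(hypnogram, epoch_sec=30):
    """Minutes from recording start to the first epoch scored as anything but Wake."""
    for i, stage in enumerate(hypnogram):
        if stage != "Wake":
            return i * epoch_sec / 60.0
    return None  # subject never fell asleep during the recording

# Example: 5 hours of Wake followed by sleep gives an SOL of 300 minutes.
hyp = ["Wake"] * 600 + ["Light"] * 300
print(sleep_onset_latency(hyp))  # 300.0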