A novel method to increase specificity of sleep-wake classifiers based on wrist-worn actigraphy

ABSTRACT The knowledge of the distribution of sleep and wake over a 24-h day is essential for a comprehensive picture of sleep-wake rhythms. Current sleep-wake scoring algorithms for wrist-worn actigraphy suffer from low specificities, which leads to an underestimation of the time spent awake. The goal of this study (ClinicalTrials.gov Identifier: NCT03356938) was to develop a sleep-wake classifier with increased specificity. By artificially balancing the training dataset to contain as many wake as sleep epochs from day- and nighttime measurements from 12 subjects, we tuned the classification parameters towards an optimal trade-off between sensitivity and specificity. The resulting sleep-wake classifier achieved a high specificity of 80.4% and a sensitivity of 88.6% on the balanced dataset containing 3079.9 min of actimeter data. In the validation on night sleep of separate adaptation recordings from 19 healthy subjects, the sleep-wake classifier achieved 89.4% sensitivity and 64.6% specificity and accurately estimated total sleep time and sleep efficiency, with mean differences of 12.16 min and 2.83%, respectively. This new, device-independent method rids sleep-wake classifiers of their bias towards sleep detection and lays a foundation for more accurate assessments in everyday life, which could be applied to monitor patients with fragmented sleep-wake rhythms.


Introduction
Good and restful sleep is important to our health. It enables the body to repair and be fit and ready for another day (Thomas et al. 2017; Zhang et al. 2017). However, sleep and health problems quite commonly result in fragmented and non-restorative night sleep and may impair vigilance. The distribution of sleep and wake episodes over a 24-h day may change, with reduced sleep efficiency at night and/or additional sleep during the day. Assessing the quantity and timing of sleep during the night and during the day is therefore important.
Polysomnography (PSG) is the gold standard to quantify sleep time, to differentiate sleep stages, and to assess sleep fragmentation (McKenna et al. 2007). This technique includes a variety of simultaneous recordings such as electroencephalography (EEG), electromyography (EMG), electrooculography (EOG), and electrocardiography (ECG). In clinical PSGs, additional sensors, e.g., for respiration detection, help to rule out the pathogenesis of sleep-wake disorders. However, PSG is resource-intensive and thus costly, and it is conducted in an artificial sleep laboratory, neglecting the influence of the natural sleeping environment. Although portable ambulatory PSG systems exist, their use is limited by battery runtime and cumbersome design, and they are thus not feasible for 24-h measurements over several days (Lau-Zhu et al. 2019).
Wearable movement sensors (actimeters) have emerged as an accepted alternative to PSG for settings outside the sleep laboratory and present a valuable approach to observing sleep-wake rhythms over several days in a natural environment (Smith et al. 2018). The wrist-worn sensors continuously measure the change in acceleration and allow the sleep-wake rhythm to be dissected by relating epochs of low and high physical activity to rest and wake periods, respectively. Actimeters are small, cost-efficient, and unobtrusive, and are hardly noticeable while sleeping, making them ideal candidates for measurements outside the laboratory. Since the publication of the first automatic scoring algorithm for actigraphy by Webster et al. (1982), various algorithms have been developed to automatically score sleep and wake from the recorded movement data in children and adults (Cole et al. 1992; Sadeh 2011; Sadeh et al. 1994; Sazonov et al. 2004; Schoch et al. 2021; Sundararajan et al. 2021). However, in recent years, concerns have been raised regarding the validity and reliability of such methods, in particular the systematic underestimation of wake phases after sleep onset (de Zambotti et al. 2018; Meltzer et al. 2015; Palotti et al. 2019; Souza et al. 2003) and the unreliable detection of short naps during the day (Cook et al. 2019; Depner et al. 2020; Galland et al. 2016; Samson et al. 2016; Yoon et al. 2003).
The typically unbalanced datasets used for developing the scoring algorithms have been identified as a reason for this (Souza et al. 2003): the data used to train sleep-wake classifiers are usually derived from typical sleep experiments in which subjects spend most of the time asleep, resulting in a large overrepresentation of sleep epochs in the dataset. When such data are used as the learning base for training a scoring algorithm, a bias towards detecting sleep is induced, which leads to the typical underestimation of wake.
Various attempts, although mostly on children's data, have been made to increase the specificity (i.e., the rate of correctly detected wake epochs) of sleep-wake classifiers. Tilmanne, who compared ankle actimeters in infants to polysomnography during night sleep, adapted the cost function in the training process of the classifier. Additionally, she increased the proportion of wake epochs in the learning database (Tilmanne et al. 2009). Galland optimized a linear scoring algorithm on data from a nap protocol in infants to increase the detection accuracy of daytime sleep (Galland et al. 2012). These approaches each led to higher specificities, but whether these can be translated to wrist actigraphy and the adult population, and how such algorithms perform on standard night sleep, remains to be shown.
In this work, we propose and train a novel, unbiased sleep-wake classifier for wrist actigraphy on adult data. We increased the proportion of wake in the training data to 50% and included overnight sleep as well as daytime nap recordings in the training dataset. We trained and tested different cost functions that combine sensitivity and specificity. We hypothesized that such a balanced dataset prevents the classifier from being biased towards sleep and therefore achieves higher specificities compared to a state-of-the-art scoring algorithm. Additionally, we validated our novel sleep-wake classifier on data from more and partially different subjects collected during the study adaptation nights, where sleep is often more fragmented, by comparing the results from the classifier epoch-by-epoch to polysomnography. With this, we aim to measure wake epochs more accurately during fragmented night-time sleep. In the long term, such an improved sleep-wake classifier may facilitate sleep assessments over 24 hours and over several days and support the diagnostic workup of patients with sleep-wake disorders.

Subjects
For this work, we relied on data from young healthy individuals who participated in an in-lab study that includes night sleep and shorter daytime sleep episodes, as part of a larger project aiming to investigate the role of the circadian system in patients with neurologic sleep-wake disorders (ClinicalTrials.gov Identifier: NCT03356938). While initially collected for a different objective, this dataset provides the ideal sleep-wake data for the purpose of the present work. In short, nineteen candidates in the age range from 19 to 33 years underwent an adaptation polysomnography to exclude any sleep disturbances and absorb first night effects. Twelve out of the nineteen subjects (6 female, 6 male) showed a sufficiently high sleep efficiency as defined in the inclusion criteria and agreed to subsequently partake in the main study, which consisted of a poly-nap protocol where overnight sleep was distributed over 40 hours into repetitive sleep-wake cycles so that sleep episodes occurred at different circadian phases. The study protocol was approved by the responsible ethics committee and all participants gave written informed consent.

Study procedure
After giving informed consent, candidates were invited for a first adaptation night of eight hours. Participants arrived at the lab facilities at the University Hospital Zurich in the evening. All polysomnography sensors (9 EEGs (F3, F4, C3, C4, O1, O2, M1, M2), 2 EOGs, 3 EMGs, ECG, respiratory sensors, air flow, thorax and abdomen belts, SaO2) were attached, and an actimeter was attached to the non-dominant wrist. We performed PSG recordings with digital videography (Embla N7000, RemLogic v3.2) according to AASM standard criteria (Berry et al. 2012).
The 12 participants who were included in the main study continued with a 70 h sleep protocol on a separate date. Again, electrophysiological sensors and the actimeter on the non-dominant wrist were attached. The following electrophysiological signals were recorded: 21 EEGs according to the standardized 10-20 system, 2 EOGs, 1 EMG on the chin, and ECG. The first night consisted of seven hours of consolidated sleep (see Figure 1). Afterwards, participants underwent ten alternating cycles of forced wake (160 min, "wake period") and sleep opportunities (80 min, "nap period"). The protocol ended with another seven hours of night sleep for recovery. During the experiment, subjects stayed in bed in a sitting position during the wake periods and were not allowed to get up, except for regularly scheduled toilet visits. The participants were instructed to perform only sedentary, low-intensity activities (e.g., reading, watching TV, using mobile electronic devices with dimmed screens and appropriate blue-light filter goggles). During the forced wake phases, they also performed various vigilance tests and filled out questionnaires for the main study purpose. Room light was always dimmed to 10 lux. If needed, study examiners interacted with the participants to prevent them from falling asleep. During the nap period, subjects lay down, the lights were turned off (0 lux), and the subjects were asked to relax and fall asleep. Sleep efficiency within naps varied between 0% and 98.7%, depending on the strength of the circadian wake drive. During night sleep episodes, all nap periods, and a major part of the wake periods, electrophysiological signals were continuously recorded. The recording was only stopped to check and replace electrodes and during toilet visits. Sleep was scored by an experienced somnologist according to the standard criteria of the American Academy of Sleep Medicine (AASM) over epochs of 30 seconds (Berry et al. 2012).
Whenever the recording was stopped, subjects were closely supervised, and the corresponding episodes were therefore scored as wake.

Actimeter
An actimeter was used to record the movement of the subjects, which was later used to train the sleep-wake classification algorithm proposed in this work. The custom-made device (ZurichMOVE, Zurich, Switzerland) consists of an inertial measurement unit (IMU) comprising a triaxial accelerometer with a range of ±8 g. The device continuously monitors the activity and logs the raw acceleration data, sampled at 50 Hz, on an integrated SD card. The ZurichMOVE sensor was attached with elastic straps to the non-dominant wrist, which is considered the optimal position for actigraphy-based discrimination of sleep and wake (Sadeh 2011). The raw acceleration data were filtered with a first-order high-pass Butterworth filter with a cut-off frequency of 0.25 Hz. For each time point, the Euclidean norm of the acceleration vector was calculated and defined as the magnitude $M(t)$ (see Equation (1)). Values below 0.05 g were considered sensor noise and set to zero. The acceleration magnitude was integrated with the trapezoidal method over an epoch (see Equation (2)) and normalized by the number of samples (i.e., mean value). Here, an epoch length of 30 s was chosen to be consistent with the clinical scores from the EEG scoring ($t_{\text{epoch}} = 30$ s, while $f_s$ is the sampling frequency). The resulting mean activity was defined as the activity count $A_i$. The epochs were synchronized to the PSG recording by aligning the internal clocks of each data acquisition system, so that the epochs of the actimeter were identical to the epochs of the sleep scoring.
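The preprocessing chain described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `highpass_first_order` and `activity_counts` are hypothetical, the filter is applied causally with its state initialized to the first sample, and the trapezoidal integration uses unit sample spacing so that dividing by the number of samples approximates the epoch mean.

```python
import math


def highpass_first_order(x, fc=0.25, fs=50.0):
    """First-order Butterworth high-pass (bilinear transform), applied causally."""
    if not x:
        return []
    k = math.tan(math.pi * fc / fs)      # pre-warped cut-off
    b0 = 1.0 / (1.0 + k)                 # numerator is b0 * (1 - z^-1)
    a1 = (k - 1.0) / (k + 1.0)           # denominator is 1 + a1 * z^-1
    prev_x, prev_y = x[0], 0.0           # assumption: start at rest on the first sample
    y = []
    for v in x:
        out = b0 * (v - prev_x) - a1 * prev_y
        y.append(out)
        prev_x, prev_y = v, out
    return y


def activity_counts(ax, ay, az, fs=50.0, epoch_s=30.0, noise_g=0.05, fc=0.25):
    """Return one activity count per complete epoch from raw acceleration in g."""
    fx, fy, fz = (highpass_first_order(c, fc, fs) for c in (ax, ay, az))
    # Euclidean norm of the filtered acceleration vector, M(t)
    mag = [math.sqrt(x * x + y * y + z * z) for x, y, z in zip(fx, fy, fz)]
    # Values below the noise floor are set to zero
    mag = [m if m >= noise_g else 0.0 for m in mag]
    n = int(round(epoch_s * fs))         # samples per epoch
    counts = []
    for start in range(0, len(mag) - n + 1, n):
        seg = mag[start:start + n]
        # Trapezoidal rule with unit spacing, then normalization by sample count
        integral = sum(0.5 * (seg[j] + seg[j + 1]) for j in range(n - 1))
        counts.append(integral / n)
    return counts
```

With a motionless sensor (constant gravity on one axis), the high-pass filter removes the offset and every epoch yields a count of zero, whereas a 2 Hz wrist movement of 0.5 g amplitude yields a clearly positive count.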

Datasets
For the training of the new classifier to differentiate between sleep and wake phases, we compiled three independent datasets based on the main study data collection: the training, testing, and validation sets. To provide a neutral training dataset without any preference or bias, we compiled a dataset consisting of the same number of wake epochs as sleep epochs. Sleep originated from the consolidated night sleep (the baseline night at the beginning and the recovery night at the end of the protocol) and from the multiple naps (see Figure 1). As the total time awake exceeded the total time asleep in our protocol, all epochs scored as sleep were used in the analysis, complemented by the same number of wake epochs per subject. The wake epochs were included chronologically until the number of sleep epochs was reached. From each subject, between 1378 and 3057 30-s epochs each of sleep and wake were used. This yielded a balanced dataset with a total of 15 399.5 min of sleep and 15 399.5 min of wake over all subjects. Then, the data were partitioned into two subsets with a nine-to-one split per subject. The larger subset ("training set") of 27 719.1 min was used to train the sleep-wake classifier. The smaller subset of 3079.9 min was reserved for testing ("testing set"). The balance of sleep and wake epochs was maintained in both subsets.
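The balancing and splitting steps can be illustrated with a short sketch. The helper names `balance_subject` and `split_subject` are hypothetical, and the text does not state how epochs were assigned to the two subsets, so the seeded random split per class used here is an assumption.

```python
import random


def balance_subject(labels):
    """Keep all sleep epochs and the chronologically first wake epochs.

    labels: chronological list of "S" (sleep) / "W" (wake) for one subject.
    Returns (sleep_indices, wake_indices) of equal length, assuming the
    subject has at least as many wake epochs as sleep epochs.
    """
    sleep = [i for i, lab in enumerate(labels) if lab == "S"]
    wake = [i for i, lab in enumerate(labels) if lab == "W"]
    return sleep, wake[:len(sleep)]


def split_subject(sleep, wake, test_fraction=0.1, seed=0):
    """Nine-to-one split per class, so both subsets stay balanced."""
    rng = random.Random(seed)
    train, test = [], []
    for indices in (sleep, wake):
        shuffled = indices[:]
        rng.shuffle(shuffled)            # assumption: random assignment
        n_test = round(len(shuffled) * test_fraction)
        test += shuffled[:n_test]
        train += shuffled[n_test:]
    return sorted(train), sorted(test)
```

Splitting each class separately is one simple way to keep the 50%-50% distribution in both the training and the testing subset.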
Data from the adaptation nights were used to validate the new classifier. This dataset ("validation set") comprises 191.7 h of data (117.2 h of sleep and 25.6 h of wake). The data were analyzed from lights off until lights on.

Sleep-wake classifier
The sleep-wake classifier determines for each epoch whether it represents sleep or wakefulness. To do so, it calculates a score $S_i$ for each 30-s epoch, based on the weighted sum of the activity counts $A_i$ from the current epoch, the four previous epochs, and the two subsequent epochs (see Equation (3)).
The resulting score is linearly scaled by a factor $P$ and compared to a threshold of 1. If the scaled score is above 1, the epoch is scored as wake; otherwise, it is scored as sleep.
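The decision rule can be written as $P \sum_{k=-4}^{2} W_k A_{i+k} > 1$ for wake, which the following sketch implements. The function name `score_epochs`, the zero-padding at the start and end of a recording, and the example weights are illustrative assumptions and not the published parameters.

```python
def score_epochs(counts, weights, p, before=4, after=2):
    """Label each epoch "W" (wake) or "S" (sleep) from activity counts.

    weights: sequence of before + 1 + after values, ordered from W_-4 to W_2.
    p: scaling factor P applied to the weighted sum.
    """
    if len(weights) != before + 1 + after:
        raise ValueError("expected one weight per epoch in the window")
    n = len(counts)
    labels = []
    for i in range(n):
        weighted_sum = 0.0
        for offset, w in zip(range(-before, after + 1), weights):
            j = i + offset
            # Assumption: epochs outside the recording contribute zero activity
            a = counts[j] if 0 <= j < n else 0.0
            weighted_sum += w * a
        labels.append("W" if p * weighted_sum > 1.0 else "S")
    return labels


# Hypothetical parameters: only the current epoch contributes.
example = score_epochs([0.0, 0.2, 0.05], [0, 0, 0, 0, 1, 0, 0], p=10.0)
```

In this toy example, only the middle epoch exceeds the threshold (10 x 0.2 = 2 > 1) and is scored as wake.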
We aimed to increase specificity without changing the structure of the standard approach, which is still the most established and most widely used today. This allows others to use the new classifier immediately without having to rework existing evaluation frameworks and facilitates comparability. We therefore kept the number and distribution of epochs considered for classification and optimized the classification parameters $(W_i, P)$ on the balanced training dataset containing 50% sleep and 50% wake epochs. To this end, a genetic optimization (implementation from MATLAB 2018b, MathWorks Inc.) was conducted over the training data (see Figure 2). The scaling parameter $P$ and the weights $W_i$ were bounded to $P \in [1, 1000]$ and $W_i \in [0, 1]$, respectively. An equality constraint ($\sum_{i=-4}^{2} W_i = 1$) ensured that the leverage of the different weights stayed comparable. The optimization process started with a distinct set of parameters, applied the scoring algorithm to the training data, compared the labels epoch-by-epoch to the PSG scoring, and updated the classification parameters $(W_i, P)$. This routine was repeated until the optimization objective (see description below) no longer improved. The genetic optimization is not guaranteed to find the global optimum. Therefore, the whole optimization routine was performed five times (per optimization objective), and the scoring algorithm with the highest performance was chosen for the test and validation step.
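To make the role of the bounds, the equality constraint, and the repeated runs concrete, the sketch below uses a deliberately simple evolutionary loop. It is not MATLAB's genetic algorithm and makes no claim to reproduce it: the names `repair`, `evolve`, and `optimize_with_restarts`, the population size, the mutation widths, and the stopping rule are all assumptions chosen for illustration. The `cost` argument is any function of `(weights, p)` to be minimized.

```python
import math
import random

N_WEIGHTS = 7                 # W_-4 ... W_2
P_MIN, P_MAX = 1.0, 1000.0    # bounds on the scaling parameter P


def repair(weights, p):
    """Project a candidate onto the constraints: W_i in [0, 1], sum(W) = 1, P bounded."""
    w = [min(1.0, max(0.0, x)) for x in weights]
    total = sum(w)
    w = [x / total for x in w] if total > 0 else [1.0 / N_WEIGHTS] * N_WEIGHTS
    return w, min(P_MAX, max(P_MIN, p))


def evolve(cost, rng, pop_size=30, generations=100, patience=15):
    """One optimization run; stops when the best cost no longer improves."""
    pop = [repair([rng.random() for _ in range(N_WEIGHTS)],
                  rng.uniform(P_MIN, P_MAX)) for _ in range(pop_size)]
    best, best_cost, stale = None, float("inf"), 0
    for _ in range(generations):
        pop.sort(key=lambda cand: cost(*cand))
        current = cost(*pop[0])
        if current < best_cost:
            best, best_cost, stale = pop[0], current, 0
        else:
            stale += 1
            if stale >= patience:
                break
        parents = pop[:pop_size // 2]            # the better half survives
        children = []
        while len(parents) + len(children) < pop_size:
            (wa, pa), (wb, pb) = rng.sample(parents, 2)
            # Uniform crossover plus Gaussian mutation, then constraint repair
            w = [rng.choice(pair) + rng.gauss(0.0, 0.05) for pair in zip(wa, wb)]
            p = rng.choice((pa, pb)) * math.exp(rng.gauss(0.0, 0.2))
            children.append(repair(w, p))
        pop = parents + children
    return best, best_cost


def optimize_with_restarts(cost, n_runs=5, seed=0):
    """Repeat the run several times and keep the best result."""
    runs = [evolve(cost, random.Random(seed + r)) for r in range(n_runs)]
    return min(runs, key=lambda run: run[1])
```

Handling the constraints by repairing each candidate keeps every evaluated parameter set feasible; other constraint-handling strategies are equally possible.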
The scoring algorithms depend on the activity counts of adjacent epochs. Therefore, the temporal information of the data needs to be conserved. This must also be considered when the data are split into the training and testing datasets, or when the data are mixed with data from other subjects and randomized during the training step. This was ensured by creating a feature vector per data point containing the activity counts of the epoch of interest and its adjacent epochs, for both the training and test datasets, so that the temporal relation was preserved even if neighboring epochs were assigned to different sets.
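A minimal sketch of such a feature vector is given below. The function name `window_features` and the zero-padding at the recording boundaries are assumptions; the window of four previous and two subsequent epochs follows the classifier description.

```python
def window_features(counts, before=4, after=2):
    """Return one feature vector per epoch: [A_(i-4), ..., A_i, ..., A_(i+2)]."""
    n = len(counts)
    features = []
    for i in range(n):
        window = [counts[j] if 0 <= j < n else 0.0   # assumption: pad with zeros
                  for j in range(i - before, i + after + 1)]
        features.append(window)
    return features
```

Because each row carries its own neighborhood, rows can be shuffled or assigned to different subsets without losing the temporal context the score depends on.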
Optimization objective: During the optimization process, an objective, or optimization "cost," is minimized. It was shown that a cost function combining sensitivity and specificity leads to a better trade-off in the sensitivity-specificity-dilemma than other cost functions, such as the sum of squared errors (SSE) (Tilmanne et al. 2009). In order to investigate the influence of different cost functions, we introduced a parametrized cost function based on the sum of sensitivity and specificity (see Equation (4)).
The parameters a and b define the proportion of sensitivity and specificity in the cost function. Therefore, we refer to them as sensitivity parameter (a) and specificity parameter (b). We tested several optimization objectives, by systematically changing the sensitivity parameter a from 0 to 1 and decreasing b from 1 to 0, respectively, in increments of 0.25. This optimization process was repeated for each cost function.
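The parametrized objective can be sketched as follows. The exact form of Equation (4) is not reproduced here; writing the cost as one minus the weighted sum of sensitivity and specificity is an assumption that merely turns the maximization of both metrics into a minimization. Sensitivity is taken as the fraction of correctly detected sleep epochs and specificity as the fraction of correctly detected wake epochs; the function names are hypothetical.

```python
def sensitivity_specificity(predicted, reference):
    """Epoch-by-epoch comparison of "S"/"W" labels against the reference."""
    pairs = list(zip(predicted, reference))
    n_sleep = sum(r == "S" for _, r in pairs)
    n_wake = sum(r == "W" for _, r in pairs)
    hit_sleep = sum(p == "S" and r == "S" for p, r in pairs)
    hit_wake = sum(p == "W" and r == "W" for p, r in pairs)
    sens = hit_sleep / n_sleep if n_sleep else 0.0
    spec = hit_wake / n_wake if n_wake else 0.0
    return sens, spec


def cost(predicted, reference, a=0.5, b=0.5):
    """Assumed cost: lower is better, weighted by a (sensitivity) and b (specificity)."""
    sens, spec = sensitivity_specificity(predicted, reference)
    return 1.0 - (a * sens + b * spec)


# The five objectives tested: a from 0 to 1 in steps of 0.25, with b = 1 - a.
OBJECTIVES = [(k * 0.25, 1.0 - k * 0.25) for k in range(5)]
```

For example, a prediction that finds three of four sleep epochs and two of four wake epochs has a sensitivity of 0.75 and a specificity of 0.5, giving a cost of 0.375 for a = b = 0.5.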
After the optimization, the sleep-wake classifier is applied to the testing set (6160 epochs with 50% sleep and 50% wake). The resulting accuracy, sensitivity and specificity were calculated based on an epoch-by-epoch comparison to the EEG-labels.

Traditional sleep-wake algorithm for comparison
We compared our classifier against the state of the art for accelerometer-based sleep-wake detection. To date, two sleep-wake detection algorithms have been most widely used: the Cole-Kripke algorithm (Cole et al. 1992) and the Sadeh algorithm (Sadeh et al. 1994). The latter is designed for 60-s epochs. Therefore, we focused on the Cole-Kripke algorithm, which has since been validated for various devices and cohorts (Jean-Louis et al. 2000; Kripke et al. 1997; Mason and Kripke 1995). It has been implemented for commercial actimeter devices, such as Motionlogger, Actillume, ActiGraph, and, although with a different distribution of epochs and weights, Philips Actiwatch, and is applied for sleep-wake detection and used as a benchmark for validation purposes (Chinoy et al. 2021; Rosa et al. 2021; Mahadevon, 2021; Wang et al. 2021). The original Cole-Kripke algorithm was later adapted for single- or multi-axis accelerometers with linearly interpolated activity counts (Mason and Kripke 1995) and optimized for young healthy subjects. This version corresponds most closely to our setup; it was therefore chosen for comparison and applied to our testing dataset. For the purpose of this paper, this algorithm will be referred to as the "traditional algorithm." The traditional algorithm is also based on Equation (3) (Mason and Kripke 1995). The factor $P$ serves as a scaling factor for the overall discrimination score.
To be able to compare different scoring algorithms to each other, the activity counts should be normalized to the maximum acceleration amplitude that can be measured with a specific device. Additionally, we suggest normalizing the weights $W_i$. Then, scoring algorithms can be computed and applied to different actimeter devices.
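One way to carry out this normalization without changing any classification result is sketched below. The function name `normalize_classifier` and the symbol `a_max` (the maximum activity count a device can produce) are hypothetical, and the numbers in the usage are arbitrary. The idea is that dividing the weights by their L1-norm and the counts by `a_max` is compensated by multiplying $P$ with both factors, so the scaled score stays identical.

```python
def normalize_classifier(weights, p, a_max):
    """Rescale (weights, P) so that sum(|W|) = 1 and counts are expressed in [0, 1].

    The product P * sum(W_k * A_k) is unchanged when the activity counts
    are divided by a_max before scoring.
    """
    l1 = sum(abs(w) for w in weights)
    if l1 == 0 or a_max <= 0:
        raise ValueError("weights must not all be zero and a_max must be positive")
    w_norm = [w / l1 for w in weights]
    p_norm = p * l1 * a_max
    return w_norm, p_norm


def scaled_score(counts, weights, p):
    """Scaled weighted sum that is compared to the threshold of 1."""
    return p * sum(w * a for w, a in zip(weights, counts))
```

For instance, weights (2, 1, 1) with P = 0.5 on counts (1, 2, 3) give a score of 3.5; after normalization with `a_max = 4`, the weights (0.5, 0.25, 0.25) with P = 8 on counts (0.25, 0.5, 0.75) give the same score.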
Here, we converted the classification parameters of the traditional algorithm to be compatible with our actimeter device by changing the scaling parameter $P$ based on a visual comparison of the maximum activity counts shown by Jean-Louis and in our data. We applied the traditional algorithm with the adapted scaling parameter $P$ to the testing set and compared its performance to the validity metrics (accuracy, sensitivity, and specificity) of our sleep-wake classifier with a paired t-test with a significance level of $\alpha = 0.05$.
Figure 2. Schematic of the optimization process. A cost function is defined as the weighted sum of sensitivity and specificity. A genetic optimization is performed over the training data set until the optimization criterion is reached. The resulting sleep-wake classifier is applied to the testing data set with a 50%-50% distribution of sleep and wake epochs, and to the validation set with 89% sleep and 11% wake epochs. The colors correspond to the colors used in the results section for better comprehension.

Validation
Our sleep-wake classifier, trained on a balanced dataset, learned activity patterns from sleep and wake episodes and is therefore hypothesized to detect wake epochs better. To test whether the optimized classifier can also detect awakenings during night-time sleep, and to ensure that the classifier did not develop a "bias for balanced data sets," it was further validated on the 19 adaptation nights in the validation dataset. The actimeter data were compared epoch-by-epoch to the labels from the sleep scorings. The sensitivity, specificity, and accuracy were calculated. Additionally, a common set of night sleep parameters was derived to describe the sleeping pattern: total sleep time (TST), defined as the sum of epochs scored as sleep; sleep efficiency (SE), the percentage of time spent asleep; wake after sleep onset (WASO), defined as the number of wake epochs after falling asleep for the first time; and sleep latency (SL), defined as the duration between lights off and the first epoch scored as sleep. The sleep parameters were compared statistically with a two-sided paired t-test (significance level $\alpha = 0.05$). As SL did not show a normal distribution, a Wilcoxon rank sum test was used instead. Additionally, the differences of the sleep parameters from their equivalents derived from PSG scorings were shown graphically by means of Bland-Altman plots (Bland and Altman 1999). These plots are a common method to graphically analyze the difference between two measurement approaches for the same parameter.
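The four night sleep parameters can be derived from a sequence of epoch labels as sketched below. The function name `sleep_parameters` is hypothetical; the labels are assumed to cover the period from lights off to lights on, and WASO is converted from epochs to minutes, which is an assumption about the reporting unit.

```python
def sleep_parameters(labels, epoch_s=30):
    """Compute TST, SE, WASO and SL from "S"/"W" labels (lights off to lights on)."""
    if not labels:
        raise ValueError("at least one epoch is required")
    epoch_min = epoch_s / 60.0
    n_sleep = labels.count("S")
    tst = n_sleep * epoch_min                        # total sleep time in minutes
    se = 100.0 * n_sleep / len(labels)               # sleep efficiency in percent
    if n_sleep:
        onset = labels.index("S")                    # first epoch scored as sleep
        sl = onset * epoch_min                       # sleep latency in minutes
        waso = labels[onset:].count("W") * epoch_min # wake after sleep onset
    else:
        sl, waso = len(labels) * epoch_min, 0.0      # no sleep detected at all
    return {"TST": tst, "SE": se, "WASO": waso, "SL": sl}


night = sleep_parameters(["W", "W", "S", "S", "W", "S", "W", "W"])
```

In this eight-epoch toy example, three epochs are scored as sleep, so TST is 1.5 min and SE is 37.5%; sleep onset occurs after two wake epochs (SL of 1 min), and three wake epochs follow it (WASO of 1.5 min).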

Results
To analyze the influence of the cost function, the optimization objective was gradually changed. The results are shown in Table 1. Decreasing the sensitivity weighting parameter increased the specificity, with a maximum of 86.5% for $a = 0$. For cost functions considering both sensitivity and specificity, i.e., with $a \in [0.25, 0.75]$, overall classification accuracies between 83% and 84.6% were consistently achieved. For $a = 1$, sensitivity increased up to 94.5% at the expense of specificity (56.7%) and overall performance in terms of accuracy (75.6%).
To derive an unbiased sleep-wake classifier, the optimization was run to minimize a cost function with $a = 0.5$ and $b = 0.5$. The resulting sleep-wake classifier was the following:
In Figure 3, the performance results of this sleep-wake classifier are shown. Over the test data, the sleep-wake classifier achieved a significantly higher accuracy of $84.6 \pm 1.4\%$ compared to the traditional algorithm with $77.4 \pm 1.7\%$ ($t(11) = -5$, $p < .001$, $r = 0.83$). The specificity was significantly increased from $57 \pm 3.8\%$ with the traditional algorithm to $79.2 \pm 2.6\%$ with the sleep-wake classifier ($t(11) = -8.38$, $p < .001$, $r = 0.9$). On the other hand, the traditional algorithm performed with a higher sensitivity of $97.6 \pm 0.7\%$ compared to our sleep-wake classifier, which achieved a sensitivity of $87.9 \pm 2.4\%$ ($t(11) = 4.59$, $p < .001$, $r = 0.8$).
To validate the sleep-wake classifier on night-time sleep, it was applied to 19 adaptation nights. Results are presented in Table 2. On average, an accuracy of 84.12% was achieved, with a sensitivity of 89.38% and a specificity of 64.62%. The highest variability was observed in specificity, with values ranging from 29.73% to 100%. The inter-individual differences show that in certain individuals, the classifier is still biased towards sleep detection (significantly higher sensitivity than specificity), while in others, all wake phases are correctly detected. Looking further into the misclassifications using the PSG information about the different sleep stages, we can observe that misclassifications of true sleep as wake happened mainly during stage N1 (3.85%) and stage N2 (4.14%), and rarely during stage N3 (0.89%) or REM (1.74%). Despite the relatively large variability between individuals, an average wake detection of 64% is significantly higher than with previous actigraphy methods.
The sleep parameters derived from the validation set are presented in Table 3. For sleep latency, the median is reported instead of the mean. In Figure 4, the corresponding Bland-Altman plots are shown. For TST and SE, only one value lies outside the ±1.96 SD limits of agreement, in line with what would be expected from a normal distribution. For wake after sleep onset, all values lay within the ±1.96 SD range. For the sleep latency, a tendency towards overestimation is observed, especially for short latencies.
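The quantities plotted in a Bland-Altman analysis, i.e., the mean of both methods, their difference, the mean difference (bias), and the ±1.96 SD limits of agreement, can be computed as in the following sketch. The function name `bland_altman`, the direction of the difference (first minus second method), and the use of the sample standard deviation are assumptions; the input values in the usage are arbitrary and are not study data.

```python
import statistics


def bland_altman(method_a, method_b):
    """Return per-subject means and differences, the bias, and the limits of agreement."""
    if len(method_a) != len(method_b) or len(method_a) < 2:
        raise ValueError("need two equally long series with at least two values")
    means = [(a + b) / 2.0 for a, b in zip(method_a, method_b)]
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)          # sample standard deviation (n - 1)
    limits = (bias - 1.96 * sd, bias + 1.96 * sd)
    return means, diffs, bias, limits


means, diffs, bias, limits = bland_altman([10.0, 12.0, 14.0], [9.0, 12.0, 16.0])
```

Values whose difference falls outside `limits` correspond to the points lying beyond the outer lines of the plot.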

Discussion
The quantification of wake during sleep episodes is of great interest, as not every sleeper has a good night's sleep and many people suffer from sleep disturbances. Current state-of-the-art scoring algorithms mostly underestimate wake episodes. By training a classifier on a balanced dataset, which contains as much data from wake as from sleep, we were able to address this issue and achieve good overall performance with an optimized trade-off between wake and sleep detection. The difficulty for sleep-wake classifiers lies in the discrimination between sleep and passive wake. As the participants were restricted to sedentary activities, we collected a specialized learning database that specifically addresses this challenge.

Performance on night sleep
The sleep-wake classifier presented in this work performed with significantly higher specificity outcomes.
Figure 4. On the x-axis, the mean of a value calculated with both methods is drawn; on the y-axis, the difference between both methods. The dashed line in the middle represents the mean of the differences. The two lines above and below represent ±1.96 standard deviations.
On PSG data from 19 healthy subjects, the classifier detects 89.38% of all sleep episodes (sensitivity) and 64.62% of all wake episodes (specificity). This results in a significantly higher specificity than what is usually reported in the literature: Jean-Louis reported a specificity of 40.6% (at an overall accuracy of 91.3%) for the traditional algorithm. De Souza compared the scoring algorithms from Cole and Sadeh, reporting specificities of 34% and 44% (Souza et al. 2003). In general, the low specificity values (37%-55%) and the resulting low validity of sleep-wake classifiers were seen as one barrier to the use of actimeters in patients with fragmented sleep (Paquet et al. 2007). Recent approaches, including machine learning techniques, also reported no remarkable increase in specificity (30%-55%) (Li et al. 2020; Sundararajan et al. 2021; Tilmanne et al. 2009) when no specific optimization or training criterion was set. For hierarchical trees, for example, an average precision of 58.9% for wake and 89.6% for sleep was reported (Sundararajan et al. 2021). Two recent works showed, however, that with the right training criteria, an increase in specificity is possible with machine learning as well (Banfi et al. 2021; Lüdtke et al. 2021). They discussed the same trade-off reported here, namely that the improved specificity comes at the cost of a reduced sensitivity. While they showed a modelling approach for large and unspecific datasets, the advantage of our optimization approach lies in the opportunity to tailor a scoring algorithm to specific dataset characteristics, i.e., very high or very low sleep-wake proportions. Furthermore, we calculated the most common parameters characterizing night sleep. No significant differences between our classifier and the PSG gold standard were found, except for sleep latency. Moreover, the increased specificity led to more accurate sleep parameter estimations than reported with other scoring algorithms in the literature.
While mean differences for TST of 15 min to 50 min are reported in the literature (Banfi et al. 2021; Danzig et al. 2020; Regalia et al. 2021; van Hees et al. 2018; Yavuz-Kodat et al. 2019), our sleep-wake classifier only showed a mean difference of 12.16 min. This is also reflected in an accurate estimation of sleep efficiency, with 82.12% compared to a PSG-based SE of 84.95%. The mean difference of 2.83% is lower than the values reported in the literature (3.58% to 7.7%), except for Banfi (Banfi et al. 2021), who reported a difference of −2.5% with a different machine learning approach. While accelerometer-based algorithms typically overestimate TST and SE, our sleep-wake classifier shows a tendency towards underestimation. This is due to the increased specificity, which leads to a slight overestimation of WASO (−20.61 min), while in the literature, WASO is typically underestimated (50.7 min (Danzig et al. 2020), 7.57 min (Yavuz-Kodat et al. 2019), 37 min (Regalia et al. 2021)). This shows that the increased specificity due to a balanced training set successfully counteracts the bias towards sleep detection during overnight sleep.
We presented a sleep-wake classifier with a significant increase in specificity. However, this increase comes at the cost of sensitivity. While the literature consistently reports high sleep detection rates of above 90%, our sleep-wake classifier only finds 89% of all sleep episodes. For the assessment of sleep in healthy individuals (where sleep efficiency normally exceeds 90%), the existing scoring algorithms have been deemed satisfactory, whereas our classifier lends itself as a reliable actigraphy method for subjects with sleep disorders, where higher specificity values have been required (Depner et al. 2020). However, further validation is needed to investigate whether sensitivity is affected in disturbed sleep, where the classes with the highest misclassification rates (N1 and N2 sleep) are typically present more often. The observed trade-off between sensitivity and specificity suggests that there most likely will not be an optimal one-size-fits-all solution. In the future, the choice of a scoring algorithm should rather depend on the patient population and the protocol applied. If consolidated night sleep with only a few awakenings is expected in a dataset, scoring algorithms trained with a high sleep-wake ratio may be suitable. For datasets including patients with sleep-wake disorders, frequent awakenings, or full 24-h sleep-wake profiles, the presented, more balanced classifier might be superior.

Sleep-wake classifier
In order to train an unbiased sleep-wake classifier, we artificially balanced the training and testing datasets. The performance metrics on the test datasets (central tendencies of 88.6% and 80.4% and standard deviations of 8.34% and 9.08% for sensitivity and specificity, respectively) suggest that this approach indeed trains an unbiased classifier. The traditional algorithm, tested on the same test data, showed a clear bias towards sleep, with very high sensitivity values that are distributed with narrow variability (standard deviation of 2.48%) over all subjects.
Although genetic optimizations do not necessarily converge to a global optimum, repeated optimizations yielded comparable results in terms of performance outcome, despite the different relative distribution of the weighting parameters $W_i$ (mean ± standard deviation of the distribution per parameter weight: $W_{-4} = 0.57 \pm 0.25$; $W_{-3} = 0.04 \pm 0.04$; $W_{-2} = 0.07 \pm 0.09$; $W_{-1} = 0.04 \pm 0.04$; $W_0 = 0.18 \pm 0.2$; $W_1 = 0.05 \pm 0.04$; $W_2 = 0.06 \pm 0.07$). The scaling parameter $P$ seemed to have the greatest influence, which reproduces previous observations (Cole et al. 1992). This can be explained by the fact that $P$ proportionally scales the score compared to 1, or equivalently shifts the threshold below or above 1, and therefore changes the sensitivity of the scoring algorithm to detect sleep. In the future, if a scoring algorithm is needed for a specific patient population, it might even be enough to optimize $P$ to adjust the bias towards sleep for different applications. It is interesting to note that the coefficients differ strongly in magnitude, with the first ($A_{i-4}$) and current ($A_i$) epochs having the most weight. This might indicate that the epochs in between do not contribute any valuable information. At best, they correlate with each other, which in turn could mean that longer epochs of more than 30 s would be equally effective. However, these observations would need to be investigated in more detail in future work.
A direct comparison of scoring algorithms across publications and devices would be valuable (Neishabouri et al. 2022). To ensure comparability, we reported the weights $W_i$ such that their L1-norm equals one and normalized the scaling parameter P with respect to the maximum possible activity count. Reporting sleep-wake classifiers in such a standardized way decouples them from the technical specifications of the devices and makes the classifier directly applicable to different actimeters.
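One way such a standardized form can be obtained without changing the classification is sketched below. It assumes a score of the form P times the weighted sum of activity counts. The exact normalization procedure used in the study is not spelled out here, so the function `normalize_classifier`, the raw weights, the raw P, and the maximum count of 4095 are all hypothetical.

```python
import numpy as np


def normalize_classifier(weights, p, max_count):
    """Rescale weights to unit L1-norm and express P for counts in [0, 1].

    The score p * sum(w * A) is unchanged when the weights are divided by
    their L1-norm, the counts are divided by max_count, and p absorbs
    both factors.
    """
    weights = np.asarray(weights, dtype=float)
    l1 = np.abs(weights).sum()
    return weights / l1, p * l1 * max_count


# Hypothetical device-specific parameters and raw activity counts.
raw_weights = np.array([40.0, 5.0, 6.0, 4.0, 15.0, 5.0, 5.0])
raw_p = 1e-5
max_count = 4095.0
counts = np.array([120.0, 0.0, 35.0, 900.0, 10.0, 0.0, 4.0])

w_norm, p_norm = normalize_classifier(raw_weights, raw_p, max_count)

score_raw = raw_p * (raw_weights @ counts)
score_norm = p_norm * (w_norm @ (counts / max_count))
print(score_raw, score_norm)  # the two scores agree up to rounding error
```

With this convention, applying the classifier to another device only requires dividing that device's counts by its own maximum count.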
For the optimization process, an optimization objective had to be defined. We optimized for the sum of sensitivity and specificity but allowed the ratio between the two terms to vary. This made it possible to observe the influence of the objective function.
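As an illustration, such an objective can be written as a weighted sum of sensitivity and specificity. The sketch below assumes that sleep is the positive class and that the objective has the form a times sensitivity plus b times specificity. The function names and the example labels are hypothetical.

```python
import numpy as np


def sensitivity_specificity(pred_sleep, true_sleep):
    """Sensitivity (sleep detected as sleep) and specificity (wake as wake)."""
    pred = np.asarray(pred_sleep, dtype=bool)
    true = np.asarray(true_sleep, dtype=bool)
    tp = np.sum(pred & true)    # sleep scored as sleep
    fn = np.sum(~pred & true)   # sleep scored as wake
    tn = np.sum(~pred & ~true)  # wake scored as wake
    fp = np.sum(pred & ~true)   # wake scored as sleep
    return tp / (tp + fn), tn / (tn + fp)


def objective(pred_sleep, true_sleep, a, b):
    """Weighted sum a * sensitivity + b * specificity (assumed form)."""
    se, sp = sensitivity_specificity(pred_sleep, true_sleep)
    return a * se + b * sp


# Hypothetical balanced example: 4 sleep epochs followed by 4 wake epochs.
true = [1, 1, 1, 1, 0, 0, 0, 0]
sleep_biased = [1, 1, 1, 1, 1, 1, 0, 0]  # all sleep found, half of wake missed

print(objective(sleep_biased, true, a=0.9, b=0.1))
print(objective(sleep_biased, true, a=0.1, b=0.9))
```

The same sleep-biased prediction is rewarded under a sleep-weighted objective and penalized under a wake-weighted one, which is the mechanism by which the ratio steers the optimization.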
The results show the strong influence of the objective on sensitivity and specificity. The accuracy, however, was robustly estimated between 83.1% and 84.5% for $a \in [0, 0.75]$. Only for an optimization objective that clearly prioritizes sleep detection ($a = 1$) did the accuracy decrease to 75.7%. Interestingly, even the most unilateral objectives did not lead to simple all-sleep or all-wake scoring algorithms. Running an optimization for specificity only ($(a, b) = (0, 1)$) only slightly decreased the sensitivity and increased the specificity, and vice versa when optimizing for sensitivity only ($(a, b) = (1, 0)$). The reason for this result could be that the genetic optimization stopped in a local optimum rather than converging to the global maximum, which in this case is expected to be found at the border of the eight-dimensional parameter space for very low or very high values of P. In general, a stronger response to changes in the optimization objective was observed for sensitivity than for specificity. This might be explained by the specific characteristics of our dataset, in which the subjects stayed in bed also during the wake periods while performing sedentary activities, and thus presumably show a greater overlap of low-activity wake with sleep than during unconstrained wake periods.
The different optimization objectives can be interpreted as performing optimizations on unbalanced datasets. With a cost function with, for example, $(a, b) = (0.9, 0.1)$, the optimization routine values the correct detection of sleep epochs as nine times more important than a correct detection of wake epochs. This can also be interpreted as an optimization (for accuracy) on a training set containing 90% sleep epochs and 10% wake epochs. Therefore, a classifier trained with a cost function with $(a, b) = (0.9, 0.1)$ would perform similarly to a typical scoring algorithm trained on unbalanced data (Depner et al. 2020).
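The equivalence can be made explicit with a short derivation. It assumes that the objective is the weighted sum of sensitivity (Se) and specificity (Sp); the symbols $N_s$, $N_w$ and $\pi$ are introduced here only for illustration and denote the number of sleep epochs, the number of wake epochs, and the fraction of sleep epochs in the dataset.

```latex
\begin{align*}
\mathrm{Acc}
  &= \frac{TP + TN}{N_s + N_w}
   = \frac{\mathrm{Se}\, N_s + \mathrm{Sp}\, N_w}{N_s + N_w}
   = \pi\, \mathrm{Se} + (1 - \pi)\, \mathrm{Sp},
  \qquad \pi = \frac{N_s}{N_s + N_w}, \\
\pi = 0.9 \;\Rightarrow\; \mathrm{Acc}
  &= 0.9\, \mathrm{Se} + 0.1\, \mathrm{Sp}
   = a\, \mathrm{Se} + b\, \mathrm{Sp}
  \quad \text{with } (a, b) = (0.9, 0.1).
\end{align*}
```

Under this assumption, maximizing accuracy on a dataset with 90% sleep epochs is the same as maximizing the weighted objective on a balanced dataset, and a balanced dataset corresponds to equal weights.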

Limitations
As discussed above, the performance of the classifier depends on the data it is trained and validated with. Hence, the data used in this study also illustrate the main limitations: we included only healthy subjects without sleep disorders and collected data under well-controlled conditions in a laboratory setting. External artifacts such as displaced actimeters were excluded by design. As the validation data set was drawn from the adaptation procedures, there was an overlap between the subjects included in the validation and optimization steps. However, as the data were taken from two separate measurements, the variability in the data due to sensor placement and handling is expected to be higher than that due to individual activity levels; therefore, independence of the data was assumed. The test data set was created by a one-time allocation with a fixed split of 10% of all data points, while maintaining the balance between sleep and wake data points. However, it cannot be ruled out that an iterative allocation procedure might increase the variance in the test data set. The sleep-wake classifier was validated on a sample of 19 single-PSG nights, and therefore day-to-day variability still needs to be investigated. Future work should extend this validation, for example by including publicly available datasets. Future studies will thus examine the validity of the proposed sleep-wake classifier with patient data, and additionally apply the classifier to unsupervised activity profiles collected over 24 h outside the laboratory.
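A one-time, class-balanced allocation of the kind described above could look like the following sketch. It assumes that balancing is achieved by randomly downsampling the larger class and that 10% of each class is then set aside for testing. The balancing procedure, the function `balanced_split`, the random seed, and the example labels are assumptions made for illustration and may differ from the procedure used in the study.

```python
import numpy as np


def balanced_split(is_sleep, test_fraction=0.1, seed=0):
    """Return (train_idx, test_idx) with equal numbers of sleep and wake epochs.

    The larger class is randomly downsampled to the size of the smaller
    one (an assumption), then the same fraction of each class is held out.
    """
    is_sleep = np.asarray(is_sleep, dtype=bool)
    rng = np.random.default_rng(seed)
    sleep_idx = rng.permutation(np.flatnonzero(is_sleep))
    wake_idx = rng.permutation(np.flatnonzero(~is_sleep))

    n = min(len(sleep_idx), len(wake_idx))  # size of each balanced class
    sleep_idx, wake_idx = sleep_idx[:n], wake_idx[:n]

    n_test = int(round(test_fraction * n))  # held-out epochs per class
    test_idx = np.concatenate([sleep_idx[:n_test], wake_idx[:n_test]])
    train_idx = np.concatenate([sleep_idx[n_test:], wake_idx[n_test:]])
    return train_idx, test_idx


# Hypothetical labels: 700 sleep epochs and 300 wake epochs.
labels = np.array([True] * 700 + [False] * 300)
train_idx, test_idx = balanced_split(labels)
print(len(train_idx), len(test_idx), labels[test_idx].mean())
```

Repeating the split with different seeds would be one way to examine how much the test metrics depend on a single allocation.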

Conclusion
Our goal was to improve the specificity of sleep-wake classifiers based on actigraphy. We therefore analyzed data collected under sleep laboratory conditions, which contained sleep and calm wake episodes in equal proportion. We trained the sleep-wake classifier on a balanced dataset with a genetic optimization, and thereby tuned the classification parameters towards an optimized trade-off between sleep and wake detection. Compared to a standard sleep-wake classifier, the resulting classifier showed a significantly increased specificity when applied to nocturnal sleep recordings that are normally more fragmented than the normal sleep of healthy subjects. It was thereby possible to markedly decrease the overestimation of total sleep time and sleep efficiency. At the same time, significant interindividual variation was observed, which appears to represent the cost of the less biased algorithm.
The classifier presented in this work is based on normalized activity counts and can, in theory, be transferred to other accelerometer devices. In summary, we aim to contribute to finding a sleep-wake classifier that is optimized for patients with fragmented sleep and that allows reliable detection of sleep and wake over 24 h.

Disclosure statement
No potential conflict of interest was reported by the author(s).