Heart Rate Variability for Classification of Alert Versus Sleep Deprived Drivers in Real Road Driving Conditions

Driver sleepiness is a contributing factor in many road fatalities. A long-standing goal in driver state research has therefore been to develop a robust sleepiness detection system. It has been suggested that various heart rate variability (HRV) metrics can be used for driver sleepiness classification. However, since heart rate is modulated not only by sleepiness but also by several other time-varying intra-individual factors such as posture, distress, boredom and relaxation, it is relevant to highlight not only the possibilities but also the difficulties involved in HRV-based driver sleepiness classification. This paper investigates the reliability of HRV as a standalone feature for driver sleepiness detection in a realistic setting. Data from three real-road driving studies were used, including 86 drivers in both alert and sleep-deprived conditions. Subjective ratings based on the Karolinska sleepiness scale (KSS) were used as ground truth when training four binary classifiers (k-nearest neighbours, support vector machine, AdaBoost, and random forest). The best performance was achieved with the random forest classifier with an accuracy of 85%. However, the accuracy dropped to 64% for three-class classification and to 44% for subject-independent, leave-one-participant-out classification. The worst results were obtained in the severely sleepy class. The results show that in realistic driving conditions, subject-independent sleepiness classification based on HRV is poor. The conclusion is that more work is needed to control for the many confounding factors that also influence HRV before it can be used as input to a driver sleepiness detection system.

numbers by convincing sleepy drivers to pull over for a rest or nap. Such systems are typically based on (i) vehicle-based information such as lane keeping performance, (ii) behavioural information such as yawning and eye movements, and/or (iii) driver physiological data such as electrocardiography (ECG), electroencephalography (EEG), or electrooculography (EOG) [3]-[5].
Today's and tomorrow's advanced driver assistance and automated driving systems are changing the playing field for sleepiness detection. When longitudinal and lateral positioning is secured by the vehicle, vehicle-based data can no longer be used for sleepiness detection. Similarly, camera-based systems may not see the driver's face due to obstructed camera views or because the driver is out of position, both of which are more likely during assisted or automated driving. At the same time, physiological data are becoming more available thanks to wearable [6], [7] or remote sensors that can be embedded in the steering wheel, seat belt, and seat [8], [9]. This has revived heart rate and heart rate variability (HRV)-based concepts, and it is logical to use the relationship between cardiac function and sleepiness when designing sleepiness detection systems.
The heart rate is regulated by the sympathetic (activation) and parasympathetic (rest) branches of the autonomic (involuntary) nervous system. Sympathetic activation leads to increased heart rate, more forceful heart contractions, dilated airways, higher respiration rate and increased muscle strength. Parasympathetic activation causes reduced heart rate, decreased blood pressure, and stimulates digestion and waste disposal. Heart rate, and thus HRV, reflects the balance between sympathetic and parasympathetic activity. In general, a lowered heart rate gives more room for variability between successive heartbeats, allowing higher HRV. This is typically seen during sleepiness when the body is winding down to prepare for sleep. It is important to note that while a change in alertness level will likely cause a change in HRV (Fig. 1), the opposite is not necessarily true, as a change in HRV could equally be due to a change in any of the other factors that also affect HRV. These other factors include both inter-individual factors, such as age, gender, and medical conditions, and intra-individual factors, some of which vary over time [10], [11].
It has been hypothesized that parasympathetic activity should increase with increasing levels of driver sleepiness. High-frequency (HF) power in the HRV signal, which is a sign of parasympathetic activity, has indeed been found to increase with increasing levels of sleepiness [12], [13]. However, it has also been found to decrease [14] or to display unreliable changes [15]-[17]. It has also been hypothesized that a sleepy driver should have either decreased or increased sympathetic activity, depending on whether or not the driver is struggling to remain awake. Low-frequency (LF) power in the HRV signal, which is often used as an indicator of sympathetic activity, has consequently been found to decrease [12], [14], increase [13], [16], or display unreliable changes [15], [17], [18] with increasing levels of driver sleepiness. These differing results would make sense if increased LF activity was found in experiments with manual driving (due to the struggle to remain awake) while decreased LF activity was found in studies with conditional or full automation (since the driver is then allowed to relax and possibly even sleep). However, all reviewed results come from manual driving experiments. The ambiguous hypotheses are nevertheless very convenient as they provide an explanation regardless of the direction of the results.
Fig. 1. Model of how different time-varying intra-individual factors affect the cardiovascular control system. The autonomic nervous system controls heart rate via a complex feedback system, in which homeostasis is maintained by information received by the baroreceptors acting on changes in arterial blood pressure. Sleepiness is but one of the many time-varying intra-individual factors that modulate heart rate and HRV.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
For the reasons outlined above, it is relevant to question how appropriate it is to use HRV for driver sleepiness detection outside a controlled experimental environment. The primary aim of this paper is therefore to investigate whether HRV metrics alone, measured during real-road driving, can be used for driver sleepiness detection. Preliminary results of this study were presented by Persson, et al. [19].

II. MATERIALS AND METHODS
Two methodological approaches have been used for the HRV analyses in this paper: a group-level statistical approach and a machine learning approach. Common to both approaches was a pre-processing stage in which the ECG signals were filtered and divided into 5-min epochs. The HRV metrics (or features) under investigation were then calculated for each epoch. The first methodological approach involved an analysis of variance (ANOVA). This was done to see whether the dataset provided results similar to those previously reported in the HRV literature. The second methodological approach used a standard machine learning pipeline to investigate how appropriate it is to design a sleepiness classifier based on HRV metrics alone.

A. Sleepiness Database
The database used in this paper consists of data from three separate driver sleepiness experiments (see [20]-[22] for a detailed account). To the best of our knowledge, this is one of the largest labelled real-road driver sleepiness datasets with data from both alert and sleep-deprived conditions. Experiment 1 included 18 drivers (8 women, mean age 41 ± 9 years) who drove for about 90 min on a motorway in real traffic. Each driver drove two times, once in a supposedly alert state during daytime and once in a sleep-deprived state during night-time. Experiment 2 included 24 drivers (12 women, mean age 35 ± 10 years) who drove for about 135 min three times on a motorway (supposedly alert during daytime, mostly alert in the evening, and sleep deprived during night-time). Experiment 3 included 44 drivers (21 women, mean age 45 ± 8 years) who drove for about 90 min three times on a rural road (supposedly alert during daytime, mostly alert in the evening, and sleep deprived during night-time). The participants were recruited by random selection from the Swedish register of vehicle owners. All participants were prepared in the same way in all experiments. Before arrival, the participants were requested to avoid alcohol for 72 h and to abstain from nicotine and caffeine for 3 h before driving. All participants reported that they were healthy with good to excellent sleep quality. The Swedish government approved the testing of sleepy drivers on real roads (N2007/5326/TR), and each of the three experiments was approved by the Regional Ethics Committee in Linköping, Sweden. The experimental car was equipped with dual command, allowing the test leader to take control of the vehicle if necessary.
The ECG (lead II) was measured using disposable Ag/AgCl electrodes connected to a portable digital recording system (Vitaport 2 and 3, Temec Instruments BV, the Netherlands) using a sampling rate of 256 Hz. The Karolinska Sleepiness Scale (KSS) was used to acquire self-reported sleepiness every fifth minute. KSS has nine levels [23]: 1-extremely alert, 3-alert, 5-neither alert nor sleepy, 7-sleepy, no effort to stay awake, and 9-very sleepy, great effort to keep awake, fighting sleep. The reported value corresponds to the average feeling over the past 5 min.
The KSS values are used as target values when training the classifiers. Alternative approaches used as ground truth in driver sleepiness studies include EEG [24], [25], reaction time tests [26], and expert ratings based on observations [12], [15]. However, reaction time tests are difficult to administer in real-road driving and video-based expert ratings have been found to be unreliable [27]. EEG-based measures suffer from noise in naturalistic settings, large inter-individual variability, and the fact that some individuals do not respond despite being clearly sleepy [3], [28], [29]. The main drawbacks of KSS are that the subjective feeling does not always reflect the actual sleepiness level [30], repeated reporting can have an alerting effect [31], and participants may interpret the levels of KSS differently. Advantages are that KSS correlates with lane departures, is easily applied, and has been found to be the measure of driver sleepiness least affected by inter-individual variations [23]. All in all, subjective ratings seem to be the best option.

B. Pre-Processing
R-peaks were extracted from the ECG using the Pan-Tompkins algorithm [32] and the RR time series were extracted as the time difference between heart beats. The corresponding normal-to-normal (NN) time series were obtained by removing outliers using the standard deviation method [33], [34]. Here the threshold was set to 4 standard deviations. Epochs with more than 5% outliers were discarded. In total, five complete recordings were discarded due to poor signal quality, and the outlier removal step removed 13 additional 5-min epochs.
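The outlier step described above can be sketched as follows. This is a minimal illustration in plain Python and not the authors' implementation: the function names `nn_from_rr` and `keep_epoch` are hypothetical, and the sketch assumes that the mean and standard deviation are computed over the same RR segment that is being screened, which the text does not specify.

```python
from statistics import mean, stdev


def nn_from_rr(rr, n_sd=4.0):
    """Return (nn, outlier_fraction) for a list of RR intervals in seconds.

    An RR interval counts as an outlier when it deviates from the segment
    mean by more than n_sd sample standard deviations (assumption: the
    statistics are taken over the segment itself).
    """
    mu, sd = mean(rr), stdev(rr)
    nn = [x for x in rr if abs(x - mu) <= n_sd * sd]
    return nn, 1.0 - len(nn) / len(rr)


def keep_epoch(rr, n_sd=4.0, max_outlier_fraction=0.05):
    """Keep an epoch only if at most 5% of its RR intervals are outliers."""
    _, fraction = nn_from_rr(rr, n_sd)
    return fraction <= max_outlier_fraction
```

Because outliers inflate the standard deviation they are compared against, a fixed multiple of the standard deviation becomes less sensitive as the share of outliers grows.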

C. HRV Feature Extraction
All NN signals were divided into 5-min epochs, partly because the KSS values were reported every fifth minute, but also because this is the recommended minimum duration for short-term HRV analysis [35]. This resulted in 3954 epochs.
Twenty-four commonly used HRV features were extracted according to Table I. These 24 features make up the main feature set, which will be referred to as the HRV feature set. Due to the large individual differences encountered in HRV analyses [10], [11], [17], [36], a second baseline-corrected feature set was constructed by subtracting the mean feature values per participant corresponding to the KSS ≤ 5 cases. This will be called the baseline-corrected feature set. Finally, the time difference between consecutive 5-min epochs of the 24 original features was computed. This feature set will be called the time-difference feature set. All 72 features were standardized by removing the mean and dividing by the standard deviation.
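The three feature sets can be illustrated for a single feature from a single participant. The sketch below is an assumption-laden illustration, not the authors' code: the function names are hypothetical, the population standard deviation is used for standardization, and the first epoch is simply dropped from the time-difference series, since the text does not say how either detail was handled.

```python
from statistics import mean, pstdev


def baseline_correct(values, kss, alert_max=5):
    """Subtract the participant's mean over the epochs with KSS <= alert_max."""
    baseline = mean(v for v, k in zip(values, kss) if k <= alert_max)
    return [v - baseline for v in values]


def time_difference(values):
    """Difference between consecutive epochs (the first epoch has no predecessor)."""
    return [b - a for a, b in zip(values, values[1:])]


def standardize(values):
    """Remove the mean and divide by the (population) standard deviation."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]
```

Note that `baseline_correct` raises `statistics.StatisticsError` for a participant without any epoch at KSS ≤ 5, a case that a real pipeline would have to handle explicitly.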
The frequency domain features were derived from an autoregressive power spectral density estimate using the modified covariance method with a model order of 32. The VLF band was defined as 0.003-0.04 Hz, the LF band as 0.04-0.15 Hz, and the HF band as 0.15-0.4 Hz. Sample entropy was derived using embedding dimension = 2 and threshold = 0.2 × std(NN) according to Richman and Moorman [37]. The potentials of unbalanced complex kinetics (PUCK) metrics were implemented according to Igasaki, et al. [18].
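As an illustration of the sample entropy settings given above (embedding dimension 2, threshold 0.2 × std(NN)), a straightforward quadratic-time sketch is shown below. It is not the authors' implementation: the function name is hypothetical, and it assumes the Chebyshev distance, a "less than or equal to" tolerance test, and the population standard deviation, all of which vary between published implementations.

```python
from math import log
from statistics import pstdev


def sample_entropy(x, m=2, r_factor=0.2):
    """Sample entropy of the sequence x, with self-matches excluded."""
    n = len(x)
    r = r_factor * pstdev(x)

    def count_matches(length):
        # Use the same n - m template start points for both lengths.
        templates = [x[i:i + length] for i in range(n - m)]
        count = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):
                dist = max(abs(a - b) for a, b in zip(templates[i], templates[j]))
                if dist <= r:
                    count += 1
        return count

    b = count_matches(m)      # matches of length m
    a = count_matches(m + 1)  # matches of length m + 1
    if a == 0 or b == 0:
        raise ValueError("too few template matches; sample entropy is undefined")
    return -log(a / b)
```

A perfectly regular series gives a value of zero, whereas an irregular series gives a positive value.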

D. Statistical Analysis
The HRV metrics were analysed with mixed-model ANOVAs using two fixed factors: Daytime (day or evening versus night-time driving) and Sleepiness (KSS ≤ 5, KSS = 6, KSS = 7, KSS = 8, and KSS = 9). Participant (1-86) was included as a random factor. The key effect of interest was how different HRV metrics relate to subjective sleepiness. The factors Daytime and Sleepiness are dependent, but since high KSS levels are also common during the day, both factors were included in the analysis. The other two factors (Daytime and Participant) should be considered confounding factors. The significance level was set to 0.01, which corresponds to p < 0.0004, with Bonferroni correction to compensate for the 24 comparisons.
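The Bonferroni-corrected threshold follows from dividing the significance level by the number of comparisons. The symbols α (significance level) and m (number of comparisons) are introduced here for illustration only:

```latex
\alpha_{\text{corrected}} = \frac{\alpha}{m} = \frac{0.01}{24} \approx 0.00042
```

The value is reported as 0.0004 in the text.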

E. Sleepiness Classification
An overview of the machine learning pipeline is illustrated in Fig. 2. The feature set was split into three parts: 30% for feature selection and parameter tuning, 50% for training, and 20% for testing. To make sure that the results are not just coincidental, stemming from a certain data partitioning, this process was repeated ten times. The final test results are mean values across these ten repetitions. This 10-fold cross validation will be referred to as classification-repetitions.
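The repeated partitioning can be sketched as below. This is a hedged illustration only: the function name `split_indices` is hypothetical, and the sketch assumes a plain random split at epoch level, since the text does not state whether the partitions were stratified or grouped by participant. The number of epochs (3954) is taken from the feature extraction step.

```python
import random


def split_indices(n, seed, fractions=(0.3, 0.5, 0.2)):
    """Randomly split range(n) into selection/tuning, training and test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_sel = round(fractions[0] * n)
    n_train = round(fractions[1] * n)
    # The test part takes whatever remains after the first two parts.
    return idx[:n_sel], idx[n_sel:n_sel + n_train], idx[n_sel + n_train:]


# Ten repetitions, each with a different random partitioning.
splits = [split_indices(3954, seed) for seed in range(10)]
```

Test results would then be averaged over the ten entries in `splits`.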
Three classes of subjective sleepiness were defined: alert (KSS ≤ 5), somewhat sleepy (6 ≤ KSS ≤ 7), and severely sleepy (KSS ≥ 8). In the case of binary classification, the somewhat sleepy class was left out to obtain a clearer distinction between the alert and sleepy classes [38]. The class definitions are justified by the observation that hardly any line crossings occur at KSS ≤ 5, whereas a markedly increased frequency of unintentional line crossings occurs at KSS ≥ 8 [22].
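The class definitions translate directly into a small mapping. The function names below are hypothetical and the sketch is only meant to make the thresholds explicit.

```python
def kss_to_class(kss):
    """Map a KSS rating (1-9) to one of the three sleepiness classes."""
    if not 1 <= kss <= 9:
        raise ValueError("KSS must be between 1 and 9")
    if kss <= 5:
        return "alert"
    if kss <= 7:
        return "somewhat sleepy"
    return "severely sleepy"


def binary_subset(kss_values):
    """Drop the somewhat sleepy epochs, as done for binary classification."""
    labelled = [(k, kss_to_class(k)) for k in kss_values]
    return [(k, c) for k, c in labelled if c != "somewhat sleepy"]
```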
Feature selection was carried out using sequential forward floating selection (SFFS) [39]. SFFS was wrapped with a binary decision tree classifier, 5-fold cross-validation, 20 cross-validation runs, and a trade-off between sensitivity and specificity as optimization score. Since SFFS often results in low-dimensional non-redundant but noise sensitive feature sets, the SFFS procedure was run repeatedly (20 times) on different partitions of the feature selection set. The features selected in ≥20% of the repetitions were used as the final feature set.
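The aggregation across repeated SFFS runs can be sketched as follows. The sketch does not implement SFFS itself; it only shows how features selected in at least 20% of the runs could be retained. The function name `stable_features` and the example feature lists are hypothetical.

```python
from collections import Counter


def stable_features(selections, min_share=0.2):
    """Return the features chosen in at least min_share of the selection runs."""
    # set(run) guards against a feature being counted twice within one run.
    counts = Counter(name for run in selections for name in set(run))
    n_runs = len(selections)
    return sorted(name for name, c in counts.items() if c / n_runs >= min_share)


# Hypothetical output of five selection runs.
runs = [["RMSSD", "pNN50"], ["RMSSD"], ["RMSSD", "LF"], ["RMSSD", "LF"], ["RMSSD"]]
selected = stable_features(runs)
```

Raising `min_share` trades a larger, more redundant feature set for a smaller one that depends more on the partitioning.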
Four different binary classifiers were evaluated: k-nearest neighbours (kNN), support vector machine (SVM), AdaBoost, and random forest. These classifiers were chosen because they are well established and because there is a clear difference in complexity and computational cost between them. The kNN used 25 neighbours and a Euclidean distance function with no distance weighting. The SVM used a Gaussian kernel, a heuristic procedure to set an appropriate kernel scale factor, and a box constraint level = 10. The AdaBoost classifier used decision trees as weak learners (50 voting trees) with the maximum number of decision splits set to 20 and a learning rate of 0.1. The AdaBoost.M1 algorithm was used as the ensemble aggregation method. The random forest also used decision trees as weak learners (50 voting trees) and applied bootstrap aggregation as the ensemble-aggregation method. A cost function was used to avoid misclassifications of the severely sleepy class, since the dataset was unbalanced (54% alert, 32% somewhat sleepy, and 14% severely sleepy). The number of neighbours, box constraint level, and number of trees, respectively, were selected by evaluating classification performance over a range of parameter values. Except for these parameters, the default settings in the MATLAB Statistics and Machine Learning Toolbox version 11.1 (The MathWorks Inc., Natick, MA, USA) were used. The best performing binary classifier was also evaluated in a three-class classification setting. This was done using both 10-fold cross-validation as outlined above and leave-one-participant-out (LOO) cross-validation. The LOO evaluation was added to investigate subject-independent performance.
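The leave-one-participant-out scheme can be sketched as an index generator in which all epochs from one participant form the test set and the epochs from all other participants form the training set. The function name is hypothetical and the sketch is independent of the MATLAB toolbox used in the study.

```python
def leave_one_participant_out(participant_ids):
    """Yield (held_out, train_idx, test_idx) once per unique participant.

    participant_ids holds one entry per epoch, identifying its driver.
    """
    for held_out in sorted(set(participant_ids)):
        train = [i for i, p in enumerate(participant_ids) if p != held_out]
        test = [i for i, p in enumerate(participant_ids) if p == held_out]
        yield held_out, train, test
```

Since no epoch from the held-out driver is seen during training, the resulting scores reflect subject-independent performance.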
All classification results are derived based on three different feature sets: (i) the HRV feature set, (ii) the HRV feature set combined with the baseline-corrected feature set, and (iii) all three feature sets combined. This was done to investigate the importance of a personalised feature set and to investigate the added value of taking time history into account [17].

III. RESULTS
The ANOVA results are summarized in Table II and Fig. 3. Almost all HRV metrics varied with the level of driver sleepiness, except for the absolute power in VLF, the LF peak, the absolute power in HF, sample entropy, and the slope from the PUCK analysis. By and large, the results support the hypothesis that sleepiness is associated with lowered heart rate, more HRV, increased LF power (i.e. increased sympathetic activity, fighting to stay awake), and increased HF power (i.e. increased vagal activity). Note that the relative power in HF is changing in the "wrong" direction, which is a consequence of the comparatively stronger influence of LF when normalizing with LF+HF. Also note that the baseline-normalized relative power in HF does not show this reversed behaviour. The random factor participant was significant for all HRV metrics, suggesting large inter-individual differences. Many of these features represent similar information (Fig. 3) and it is likely that many features are redundant. It is therefore warranted to use feature selection to reduce the complexity of the developed classifier. The SFFS iterations show that both the number and selection of features varied substantially from iteration to iteration (Fig. 4). Five features were selected in all iterations: RMSSD, NN50, pNN50, and mean NN and SSD1 from the PUCK analysis. When adding the baseline-corrected feature set, many of these features were replaced by their baseline-corrected counterparts (e.g. RMSSD, mean NN, absolute power in LF, and SSD1 and SSD2 from the PUCK analysis). Interestingly, adding the time-difference feature set changed the result only marginally.
The best overall binary classification performance was achieved with the random forest classifier, but both kNN and AdaBoost had higher sensitivity (Table III). The mean accuracy of the random forest classifier was 81% when the HRV feature set was used as input. This increased to 85% when including the baseline-corrected features. Adding time history did not improve the results.
TABLE III. TEST ACCURACY, SENSITIVITY, SPECIFICITY, AND F1 SCORES FOR BINARY CLASSIFICATION USING KNN, SVM, ADABOOST, AND RANDOM FOREST
TABLE IV. TEST AND LEAVE-ONE-PARTICIPANT-OUT (LOO) ACCURACY, SENSITIVITY, SPECIFICITY, AND F1 SCORES FOR THE RANDOM FOREST CLASSIFIER
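For reference, the four reported scores can be computed from the confusion counts of a binary classifier as sketched below. The function name is hypothetical, and the sketch assumes that the sleepy class is treated as the positive class, which the text does not state explicitly.

```python
def binary_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity and F1 from confusion counts.

    tp/fn: sleepy epochs classified as sleepy/alert (assumed positive class),
    tn/fp: alert epochs classified as alert/sleepy.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }
```

With unbalanced classes, accuracy can remain high while sensitivity is low, which is why the scores are reported together.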
Performance dropped by about 20 percentage points for multi-class classification (Table IV). Despite the cost function, only a small share of the severely sleepy drivers was identified as sleepy. To some extent (an additional 5 percentage points), this was remedied by adding the baseline-corrected features. Compared with the multi-class cross-validation results, the LOO results dropped by another 15-20 percentage points (Table IV). This suggests that personalized algorithms, not just personalized features, are needed in a driver sleepiness monitoring system based on HRV.

IV. DISCUSSION
The suitability of using HRV metrics for driver sleepiness classification has been evaluated. Significant effects of sleepiness were found on most HRV metrics and the binary classification results reached an accuracy of 85.4%. However, the accuracy dropped by about 20 percentage points for three-class classification, and by almost another 20 percentage points for subject-independent LOO classification. Sensitivity was low in general, with the worst results in the severely sleepy class.
Driver fatigue is defined as a suboptimal psychophysiological condition caused by exertion [40]. Sleep-related fatigue, i.e. sleepiness or drowsiness, is defined as "a physiological drive to fall asleep" [41] caused by sleep loss, circadian rhythm and time since awakening. Certain characteristics of driving, like task demand and driving environment, can produce task-related fatigue in the absence of any sleep-related cause. Task-related fatigue can then also be subdivided into fatigue due to overload and underload, respectively [42]. Long-term fatigue that is not relieved by sleep, such as chronic fatigue syndrome, and muscle fatigue, such as after sports, are out of scope of this paper.
A summary of previous research on driver sleepiness/fatigue and HRV is presented in Table V. Note that differences in experimental design, state inducement, and validation approaches make it difficult to compare results across studies. The studies in Table V should thus only be compared on a higher level. Here it can be seen that Patel, et al. [43] reached an accuracy of 90% (sensitivity not reported) with a neural network classifier. Li and Chung [44] obtained an accuracy of 95% (sensitivity = 95%) with wavelet analysis and an SVM classifier. Fujiwara, et al. [45] used multivariate statistical process control and managed to predict 12 out of 13 sleep onsets. Abe, et al. [36] managed to predict 7 out of 8 sleepiness-related accidents. Finally, Vicente, et al. [17] achieved a positive predictive value of 96% (sensitivity = 59%) with a linear discriminant analysis classifier. Our achieved accuracy is lower than those reported in these other studies. However, our lower sensitivity scores are in line with Vicente, et al. [17]. One reason for the discrepancies could be the reduced level of experimental control in our real-road data as compared with the driving simulator data used in most other studies [12], [18], [36], [43]-[45]. If sleepiness is the only factor manipulated in the experiment while other influencing factors are kept as constant as possible, then a discovered change in HRV is likely due to the altered alertness level. In a real-road setting, the experimental control is weakened, and the association between HRV and sleepiness is weakened. Another reason for the discrepancies could be that the number of participants in the other studies was sometimes as low as 4-12 drivers [15], [18], [43], [44]. A small dataset may be justified in a within-subject design experiment, but it does not allow for proper validation of the sophisticated machine learning methods that are often used. A third reason for the discrepancies could be that different ground truths for sleepiness/fatigue have been used in different studies.
The conclusions drawn in the studies outlined in Table V are that HRV-based sleepiness detection "may add significant improvements to existing car safety systems" [17], "can be used as a fatigue countermeasure" [43], and could "contribute to prevent[ing] accidents caused by drowsy driving" [36], [45]. We draw a different conclusion, namely that it will be difficult to use HRV for sleepiness detection in a production vehicle. As mentioned in the introduction (see Fig. 1), there are many time-varying intra-individual factors that can cause a change in HRV. Stating that a particular change in HRV is due to sleepiness alone and not to any of these other factors is difficult, especially since other mental states such as boredom and relaxation give rise to very similar changes in the HRV metrics.
There is considerable individual variability in HRV and changes in HRV are different in every person [46]. To circumvent this issue, the use of personalised algorithms [36], [45] or individually normalized features [17] has been suggested. In this study, the latter approach was used when creating the baseline-corrected feature set, something that led to a 5-percentage-point increase in classification accuracy. While personalised algorithms or features may account for inter-individual variability due to age, gender, and other factors, this solution will not be able to account for time-varying intra-individual variability that arises due to distress, anger, boredom, relaxation, etc. [10], [11]. Thus, although some studies have shown promising results for drowsiness and sleepiness detection based on personalised HRV-based algorithms, it is unlikely to be possible to develop a sleepiness detection system based solely on HRV. Multimodal systems that use HRV as one of several inputs have therefore been suggested [47], [48]. However, when doing so, it has been found that HRV does not contribute much when used in combination with EOG, EEG, or vehicle measures [47], [49].
There are some inherent issues with HRV that need to be accounted for even when personalised algorithms are used. One challenge is that diseases such as diabetes mellitus markedly reduce HRV to a stable but very low magnitude [50], which in turn makes it very difficult or even impossible to detect changes over time. The same is true for high age [51], when abnormal HRV patterns have been observed that could mask changes due to sleepiness observed with advancing age [52]. HRV is also severely affected by arrhythmia, and extrasystolic beats must be detected and accounted for since they increase the magnitude of the beat-to-beat variation in heart rate. Spurious extrasystolic beats can be detected and the time window when they occur can be excluded from the HRV analysis [17]. However, sleepiness cannot be evaluated in subjects with very frequent extrasystolic beats based on HRV, since the extrasystolic beats mask the underlying autonomic modulation of heart rate. This means that a proportion of the population will never be able to use an HRV-based sleepiness detector. A limitation of the dataset used in this study is that we have limited knowledge of the health status of the participants. Via the background questionnaire, the participants reported that they were of good health, did not have chest pain/tachycardia/respiratory problems, and had good to excellent sleep quality. However, we do not know whether they have treated heart conditions or diabetes mellitus that do not affect their general health.

V. CONCLUSION
Partial or conditional automation will transform the role of the driver from active driving to passive monitoring. This will lead to increased levels of fatigue due to boredom and under-stimulation [53], and consequently an increased prevalence of driver sleepiness and fatigue. Finding robust driver monitoring solutions is therefore increasingly important.
Generalising an HRV-based sleepiness detector is a major challenge, as HRV data vary both between individuals and over time within individuals, depending on both internal and external factors. A successful classifier should thus make use of driver profiles that evolve over time [37]. Future research on HRV-based driver sleepiness characterization should focus on complementing personalised algorithms with functionalities able to account and control for the many confounding factors known to affect HRV.