Utilizing a Wristband to Detect the Quality of a Performed CPR

—Cardiopulmonary Resuscitation (CPR) is often trained using special manikins that provide feedback. When CPR is performed in out-of-hospital scenarios, feedback can only come from sensors that are already on the rescuer, such as a smartwatch. This paper proposes and evaluates a method for detecting CPR quality using the sensors available in most smartwatch devices: accelerometers and a gyroscope. We collected data from 18 nursing students performing CPR on a CPR manikin while wearing a wristband, and we used the manikin's responses to create a labeled dataset. Feature engineering includes extraction of vertical acceleration, Fourier analysis of the acceleration data, and numerical integration to estimate push amplitude. This paper compares multiple machine learning models on top of the extracted features, with L1-regularized logistic regression producing the best results. The model achieved 90% cross-validation accuracy and 80% test set accuracy. A discussion of noise removal in the data outlines a path toward higher accuracy. The results of this work can contribute to the development of CPR feedback applications on smartwatches, providing a cheap and accessible solution to guide untrained people when CPR is needed.


I. INTRODUCTION
Cardiopulmonary Resuscitation (CPR) is a lifesaving procedure performed on a person suffering from cardiac arrest [1]. The procedure consists of chest compressions, often combined with ventilation, in an effort to manually restore the flow of oxygenated blood to the brain. Untrained people are advised to provide hands-only CPR (chest compressions) until trained personnel arrive and provide further help [2]. Chest compressions should be around 5.5 cm deep, performed at a rate of 100 to 120 compressions per minute [3].
Performing CPR correctly requires training. CPR training is mandatory for many workers in the healthcare domain. In many states in the United States it has even recently become a mandatory skill for students to learn before they can graduate from high school [4]. There are courses and guidelines for CPR practice and training from different associations worldwide, such as the American Heart Association (AHA) [5]. Trainees can practice CPR on an automated voice advisory manikin that gives real-time feedback to help them obtain the necessary skills [6]. The feedback indicates whether the compressions are performed correctly or if there is an error in the pushes. Errors are detected if pushes are too fast, too slow, too deep, or not deep enough.
(Manuscript received April 20, 2019; revised September 26, 2019.)
The problem we address in this paper is how to provide feedback on CPR performance without relying on a manikin. This can aid CPR training when training manikins are not available, and it can provide real-time feedback in real-life scenarios, where correct CPR performance is crucial. We investigate the possibility of using sensor data, mainly from the accelerometers and gyroscope of a wristband worn by the rescuer, to detect whether any of the aforementioned errors happens when pushing (too fast, too slow, too deep, not deep enough). Accelerometers and gyroscopes are the most common inertial sensors and are available in many wrist-worn devices [7]. In addition, accelerometers are well-known sensors for capturing motion [8]. Their small size, low power consumption, and light weight make them very suitable for wearable applications [9].
The results of the presented work can be employed in different applications. For instance, this service can provide a cheap solution for training nursing students at universities. The service can also be added to smartwatches, so anyone wearing a smartwatch has easy access to such an application.
The remainder of this article is organized as follows. Section II discusses related work in the field of CPR monitoring and training, as well as the use of wristband sensors for analyzing motion and detecting user activities. Section III describes the method of the proposed experiment and illustrates the equipment used and the data collection process. Section IV presents the evaluation of the selected features and the models for analyzing the data. Section V discusses the results. Finally, Section VI concludes the paper and presents future work.

II. RELATED WORK
Many researchers have used sensors to investigate the quality of a performed CPR. For instance, Aase and Myklebust [10] attempted to estimate the depth of CPR chest compressions using an accelerometer. Depth was calculated by double integration of the acceleration signal. To overcome the instability of the integration, the authors proposed placing a switch on the patient's chest, so that triggering the switch during each compression acted as an integrator reset. The authors achieved, with 95% confidence, ±1.6 mm precision in estimating compression depth for a patient lying on the floor. The precision was lower, however, when the patient was in a vehicle or a boat.
Zhang et al. [11] used a pressure sensor, an accelerometer, and an ECG monitoring module to develop a wireless chest compression monitoring system. The system sends performance feedback to a main program on an iPhone. Another study, presented by Chen et al. [12], introduced a sensor system called "Rhythm of Life Aid" (ROLA) that gives audio and visual feedback about chest compressions to support medical staff during CPR of newborn infants. Pressure is detected through a Force-Sensitive Resistor (FSR) on a soft transparent sheet placed on the infant's chest. The problem with the aforementioned systems is that they rely on dedicated devices and sensors, making them impractical in an emergency where the rescuer has no access to the system. Amemiya and Maeda [13] presented a system that uses the accelerometer sensors in a smartphone to estimate chest compression depth and rhythm while performing CPR. A CPR manikin was used for the data collection. One of the authors collected the data by performing compressions for 40 to 80 seconds, putting the smartphone on the manikin's chest or securing it in a sports armband. The results indicated that acceleration-based estimation is stable. This study, however, reported no detailed results on the accuracy and reliability of using accelerometers to give feedback in a CPR situation. A similar study was conducted by Yamamoto and Ohmura [14] to give on-site CPR instructions relying on wearable sensors. The aim of the study was to compare visual instructions with voice instructions when guiding inexperienced people performing CPR. The authors used a smartphone with an accelerometer and gyroscope as a wrist-worn device. Four subjects participated in the study, wearing the device while performing CPR. The results showed that "voice teaching" is best for teaching CPR to a rescuer in an emergency.
Even though the results showed a recognition rate of compression strength with an accuracy of 93%, the authors noted that these results need deeper investigation with a larger number of subjects.
Gruenerbl, Pirkl, Monger, Gobbi, and Lukowicz [15] conducted an experiment to monitor the frequency and compression depth of CPR and provide feedback through an application on a smartwatch. The magnitude of the watch's acceleration is used to estimate both parameters (frequency and depth). 41 subjects participated in the study, each performing CPR sessions of 30 compressions on a manikin. Without feedback, approximately 20% of the subjects performed CPR correctly (frequency and depth) for 50% of the time; with feedback from the smartwatch, approximately 50% of the subjects stayed in the recommended range for 50% of the time. Another study [16] suggested the use of a smartwatch feedback device during infant cardiac arrest. 30 volunteers participated in the experiment, wearing a smartwatch on the wrist and performing CPR on an infant CPR training manikin. The three-axis acceleration signals from the smartwatch were used to count compressions with certain depths and rates. The results were similar to the previous study: CPR performance on infants improved when using a smartwatch feedback device.
In contrast to the related work, we focus on the use of a single wristband. This allows better handling of out-of-hospital scenarios, where specialized on-patient sensors are not available. We consider the general correctness of CPR pushes rather than focusing on one particular aspect (such as compression depth alone). We also extend previous work by investigating different machine-learning techniques to improve detection accuracy. We aim to add knowledge to this field that facilitates the development of CPR feedback applications on smartwatches. This will provide a cheap and accessible solution to guide untrained people when CPR is needed.

III. EXPERIMENT SETUP
The experiment is a step towards potentially lifesaving smartwatch applications that can guide a person to perform CPR correctly in real-life out-of-hospital scenarios: estimate in real time whether the person is performing CPR correctly, and provide corrective advice (e.g. instruct them to do chest compressions faster or slower, deeper or less deep, etc.). The same application can also be used for CPR training with any CPR manikin. However, a smartwatch has only certain sensors available, so for this goal to be achievable, the application needs to evaluate CPR quality using only the sensors available on a smartwatch. The main goal of the experiment was to evaluate how accurately we can determine CPR quality using only the sensors that are available on most smartwatches. We did the following to achieve that:
- Estimate the performance of different models for CPR quality detection. High accuracy is required for practical purposes. We therefore compared the performance of multiple machine learning algorithms.
- Determine the most important features that should be used in practical scenarios. The algorithms need to run on resource-constrained devices (like a smartwatch). Extracting only essential features can speed up the computation and also improve accuracy by reducing model overfitting.

A. Equipment
1) Wristband
The experiment was conducted using a Shimmer ExG sensor [17] (see Fig. 1). In this experiment we used data from accelerometers and gyroscope. In addition, we used quaternions that help in calculating vertical acceleration. The rest of the available sensors (battery charge, atmosphere pressure, EMG kit, etc.) were not used.
- Accelerometer readings. The accelerometers provided 3-dimensional readings with axes aligned relative to the sensor.
- Gyroscope readings. The sensor tracks rotation or twist of the wrist.
- Quaternions. Orientation quaternions are precalculated by the Shimmer sensors and provided as if they were sensor readings.

2) CPR training manikin
The CPR training manikin in this experiment was Resusci Anne [18] (see Fig. 2). The manikin is used at the Department of Health and Nursing Sciences at the University of Agder, Norway, to train nursing students in performing CPR. Resusci Anne has contributed to educating more than 500 million people in vital lifesaving skills. According to the provider's official website [18], Resusci Anne can report compression rate and depth, correct release for each compression, correct hand position, and the number of compressions. Resusci Anne is connected to a laptop that logs the data and gives real-time feedback while CPR is being performed, telling immediately whether the compressions are done correctly. For instance, the manikin can report that compressions are too fast, so the student can slow down the pushes.

B. Data Collection
18 nursing students participated in the data collection. The students were training for a CPR course as part of the nursing program at the University of Agder, Grimstad, Norway. The study was approved by the Department of Health and Nursing Sciences at the University of Agder, and no ethical application was needed since the data was collected anonymously. The researchers nevertheless explained the experiment to the students and gave them the choice of wearing the wristband during training or skipping it. Each participant wore the wristband on the non-dominant wrist while performing CPR on the Resusci Anne manikin. The training was done in sessions, each lasting 2 minutes on average. Each participant performed three sessions.
A script running on a separate laptop was developed by the researchers to label the data while the CPR was performed. The wristband was synchronized with that laptop, and the researcher started the script at the start of each session and stopped it at the end of that session. During the session, the researcher manually pressed Y (yes) or N (no) based on the continuous feedback from the manikin. As the manikin gives immediate feedback (every 3 seconds on average), every 2-minute session ended up with around 40 labels, where Y means correct compression and N means incorrect compression. Even though the manikin indicates the error type (such as a push that is not deep enough), the focus in this study was only on correct versus incorrect pushes rather than the precise error type.
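As an illustration of how such keypress labels can later be aligned with the sensor stream, the sketch below assigns each analysis window the most recent manikin label recorded at or before the window's end. This is a hypothetical helper (the function name and the event format are ours, not from the actual script used in the experiment):

```python
from bisect import bisect_right

def label_windows(events, window_ends):
    """Assign each window the latest Y/N keypress at or before its end time.

    events: list of (timestamp_seconds, 'Y' or 'N'), sorted by timestamp.
    window_ends: end timestamps of the analysis windows.
    Windows that end before the first keypress get None (no label yet).
    """
    times = [t for t, _ in events]
    labels = []
    for end in window_ends:
        i = bisect_right(times, end) - 1  # index of latest event not after `end`
        labels.append(events[i][1] if i >= 0 else None)
    return labels
```

For example, with keypresses at 3 s (Y), 6 s (N) and 9 s (Y), a window ending at 7 s receives the label N.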

IV. EVALUATION
As a result of the experiment, we collected the wristband sensor readings for correctly and incorrectly performed CPR. The CPR manikin reported problems related to the depth and frequency of compressions. Essentially, the CPR manikin checks that the frequency and amplitude of pushes are within bounds.

A. Feature Engineering
1) Vertical acceleration
During CPR the patient lies on their back, and the pushes are effectively vertical relative to the ground. Vertical acceleration relative to the ground (referred to as vertical acceleration from now on) can be extracted by combining accelerometer readings and orientation quaternions as described, for example, by Shoemake [19]. Extracting vertical movement also helps account for different accelerometer angles due to different positions of the wristband. A smartwatch, for example, can slide and tilt on the wrist while the rescuer is performing CPR. This affects relative acceleration readings, but not vertical acceleration relative to the ground. A sample of the collected vertical acceleration data is provided in Fig. 3. We highlighted 5 areas in the plot to illustrate the vertical acceleration before, during and after a CPR session.
Data in Area (1) was recorded while the participant had the wristband on before starting the session, while Area (5) was recorded after ending the session. The data after Area (1) and before Area (5) therefore represents a CPR session. The presented session shows a period when CPR was performed correctly (according to the manikin), except for Area (3), where the CPR manikin reported incorrect performance. There is a slight visible difference in the pattern between Area (3) and the rest of the session, and there is a chance that machine learning can detect it. Area (2) and Area (4) in Fig. 3 represent possible noise in the data: places where the pattern changes, yet the manikin reported that CPR was performed correctly. Area (2) shows where the participant stopped performing CPR for a moment, but the stop was not long enough to trigger a response from the manikin. Area (2) therefore shows one kind of noise in the data: imprecise CPR manikin response. Area (4) shows where the participant stopped performing CPR, but the researcher stopped the script (finishing the session) a few seconds later. This is another source of noise in the data: slow researcher response. Fig. 4 provides a more detailed illustration. The figure shows randomly picked 5s samples of correct CPR performance (left column) and incorrect CPR performance (right column). The illustration also indicates that acceleration signals for correct CPR performance are periodic with a frequency of slightly below 2 Hz (8-9 iterations over 5 seconds). Extracting the frequencies of the signal is therefore a natural next step in the feature engineering process.
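The vertical acceleration extraction can be sketched as follows: each accelerometer sample is rotated from the sensor frame into the world frame using the orientation quaternion, and the vertical (z) component is taken with gravity subtracted. This is an illustrative sketch, not the Shimmer API; the quaternion component order (w, x, y, z) and the gravity constant are assumptions.

```python
import numpy as np

def quat_rotate(q, v):
    """Rotate vector v from the sensor frame to the world frame using unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    # Rotation matrix equivalent to q * v * q_conjugate.
    r = np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])
    return r @ v

def vertical_acceleration(accel, quat, g=9.81):
    """World-frame vertical acceleration per sample, with gravity removed."""
    return np.array([quat_rotate(q, a)[2] - g for q, a in zip(quat, accel)])
```

With the identity quaternion and an accelerometer reading of (0, 0, 9.81) m/s² (device at rest, z axis up), the function returns a vertical acceleration of zero, as expected.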
2) Frequencies
One of the possible mistakes during CPR is pushing too fast or too slow. Fig. 4 suggests that pushes should be performed at a rate of about 2 per second. To confirm this, we applied the discrete Fourier transform [20] to the acceleration signal and analyzed the relationship between absolute Fourier coefficients and the CPR outcome (Fig. 5 illustrates the results). The figure shows the correlation between coefficients corresponding to various frequencies and the outcome (1 being correct CPR and 0 being incorrect CPR). As expected, there is a strong positive correlation for coefficients around 2 Hz (slightly less, just as shown in Fig. 4). The multiples of those frequencies also show strong positive correlation with the outcome: if the final signal has a periodicity of about 2 Hz, it can be a sum of sinusoids with frequencies 4 Hz, 6 Hz, 8 Hz, etc. In Fig. 5, coefficients for the frequencies of 1 Hz, 3 Hz, 5 Hz, 7 Hz and 9 Hz show negative correlation with the outcome, most likely because they correspond to common mistakes, such as pushing too fast or too slow; those frequency components cannot be part of a ≈2 Hz signal. High-frequency components can be part of correct or incorrect CPR signals. For example, a 12 Hz component has a period of 1/12 s and can be part of any signal with a period of 1/12 s, 2/12 s, 3/12 s, etc. So 12 Hz can be part of correct 2 Hz pushes (period of 6/12 s), too-fast 3 Hz pushes (period of 4/12 s), or too-slow 1.5 Hz pushes (period of 8/12 s). The presence of a high-frequency component therefore cannot help in telling whether the final CPR signal is correct, and Fourier coefficients for high-frequency components have near-zero correlation with the outcome. The pattern in Fig. 5 is as expected, but there is significant noise.
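The analysis above can be sketched in code: take the real FFT of each acceleration window and correlate each frequency bin's absolute amplitude with the binary outcome. The function names and the 100 Hz sampling rate are illustrative assumptions, not values from the paper.

```python
import numpy as np

FS = 100  # assumed sampling rate in Hz

def fourier_amplitudes(window):
    """Absolute amplitudes of the real FFT of one (mean-removed) acceleration window."""
    coeffs = np.fft.rfft(window - np.mean(window))
    freqs = np.fft.rfftfreq(len(window), d=1.0 / FS)
    return freqs, np.abs(coeffs)

def bin_outcome_correlation(windows, labels):
    """Correlate each frequency bin's amplitude with the binary CPR outcome (1 = correct)."""
    amps = np.array([fourier_amplitudes(w)[1] for w in windows])
    labels = np.asarray(labels, dtype=float)
    return np.array([np.corrcoef(amps[:, i], labels)[0, 1] for i in range(amps.shape[1])])
```

On synthetic data where correct windows are 2 Hz sinusoids and incorrect windows are 3 Hz sinusoids, the 2 Hz bin correlates positively and the 3 Hz bin negatively with the outcome, mirroring the pattern in Fig. 5.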

3) Extracted features
As described before, the most relevant features are likely to be the amplitudes corresponding to various frequencies of the acceleration signal. Fig. 4 also showed the importance of the acceleration value itself. We therefore used the following feature engineering procedure:
- Split the data into segment windows. We considered window sizes from 0.5s (about 1 push) to 10s.
- Extract absolute vertical acceleration relative to the ground.
- Use the discrete Fourier transform to extract frequency components of the signal, then use the absolute amplitudes corresponding to those frequencies as features.
- Add the minimum, maximum, mean, and standard deviation of the acceleration to the feature list.
- Estimate the amplitude of the push and add it to the list of features. For estimation we used double integration of acceleration using Simpson's rule [21] combined with trend removal.
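The push amplitude estimation in the last step can be sketched as follows. For brevity this sketch substitutes cumulative trapezoidal integration for Simpson's rule; the sampling rate and function names are assumptions, and a real pipeline would also need the noise handling discussed in Section V.

```python
import numpy as np

FS = 100  # assumed sampling rate in Hz

def cumulative_integral(y, dt):
    """Cumulative trapezoidal integral (stand-in for the Simpson's rule integration in the paper)."""
    out = np.zeros_like(y)
    out[1:] = np.cumsum((y[1:] + y[:-1]) * dt / 2.0)
    return out

def detrend(y):
    """Remove a linear least-squares trend, suppressing integration drift."""
    x = np.arange(len(y))
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

def push_amplitude(vertical_accel, fs=FS):
    """Estimate peak-to-peak push displacement by double integration with trend removal."""
    velocity = detrend(cumulative_integral(vertical_accel, 1.0 / fs))
    displacement = detrend(cumulative_integral(velocity, 1.0 / fs))
    return displacement.max() - displacement.min()
```

On a synthetic 2 Hz sinusoidal displacement of 5 cm peak-to-peak, the estimate recovers the amplitude to within a few percent; real accelerometer data drifts much more, which is why this feature is noisy (see Section V).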

B. Data Models
Despite the mentioned challenges, Fig. 3 and Fig. 4 show a visible difference between correct and incorrect CPR performance, and machine learning models might be able to distinguish between them. We compared the performance of the following models:
- K nearest neighbors [22]. We varied K and used Euclidean distance in the feature space as the metric. Although the model is used mostly as a baseline, it can be used in practice for small datasets.
- Logistic regression [23]. Logistic regression attempts to find a linear separating hyperplane in the feature space in order to distinguish between the classes (such as correct or incorrect CPR). Here we used L1 and L2 regularization and varied the regularization strength hyperparameter.
- Random forest [24]. Random forest uses bootstrap aggregation (bagging) and random feature selection to build a multitude of small decision trees. When a new example arrives, the random forest model uses voting among the decision trees to label it (e.g. correct or incorrect CPR).
- Gradient boosted trees [25]. This is another ensemble method that builds a multitude of small decision trees, but unlike random forest it is based on boosting rather than bagging. After building each decision tree, the algorithm puts more weight on the examples that the tree classified wrongly, and then builds the next tree accordingly.
All the models were evaluated in the following manner:
- There is a separate dataset for each sliding window size. Models were evaluated on those datasets separately using the same principles as described below.
- Depending on the time window size, between 42% and 46% of the dataset examples were positive. The classes are therefore not skewed, and accuracy is a reasonable estimate of model performance.
- The dataset was separated into an 80% training and cross-validation set and a 20% test set.
- All models were trained and evaluated using 5-fold cross-validation over the training and cross-validation set. The average cross-validation accuracy was used for model selection.
- The best performing model was retrained on the entire training and cross-validation set (80% of the data) and its performance on the 20% test set was recorded.
Figure 6. Cross-validation set accuracies (average for 5-fold cross-validation). The maximal accuracy is taken for each method and for each sliding window duration over all considered method parameters.
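The evaluation protocol can be sketched with scikit-learn as follows. The feature matrix here is a synthetic stand-in for the extracted window features, and the hyperparameter values are illustrative; we read the paper's "L1 regularization with a strength of 100" as C=100 in scikit-learn terms, which is an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))  # stand-in for the extracted window features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# 80% training/cross-validation set, 20% held-out test set.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "logreg_l1": LogisticRegression(penalty="l1", C=100, solver="liblinear"),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gbt": GradientBoostingClassifier(random_state=0),
}

# Model selection by mean 5-fold cross-validation accuracy on the 80% split,
# then a single evaluation of the winner on the 20% test set.
cv_scores = {name: cross_val_score(m, X_tr, y_tr, cv=5).mean() for name, m in models.items()}
best_name = max(cv_scores, key=cv_scores.get)
test_acc = models[best_name].fit(X_tr, y_tr).score(X_te, y_te)
```

In the real pipeline this loop is repeated once per sliding window size, producing the per-window accuracies summarized in Fig. 6.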
Cross-validation accuracies of the various models are presented in Fig. 6. For each of the considered models and for each sliding window size, the highest cross-validation accuracy was taken. The highest average accuracy on 5-fold cross-validation, 90.81%, was achieved on a 7s window using the logistic regression model (L1 regularization with a strength of 100). The model, however, achieved only 80.85% test set accuracy.
For the 7s window size the number of features exceeded the number of samples (179 and 148, respectively), and hence there was a high risk of overfitting. The lower test set performance showed that this risk was only partly mitigated by the high regularization strength.

V. DISCUSSION
With the proposed feature engineering methods, most of the models achieved ≈90% cross-validation accuracy, and the best performing model achieved ≈80% test set accuracy. In this section we investigate why the accuracy was not higher and how to improve the accuracy of CPR correctness recognition. The section also discusses the assumptions that can affect the accuracy and usefulness of the approach.
We found in the experiment that there are several possible sources of noise in the dataset:
- Human reaction time. It takes a short but non-negligible time to react to the CPR manikin feedback and record whether the interval was successful. During that time, one or even a few more pushes can be done. So, a few pushes at the end of "correct CPR" or "incorrect CPR" segments can be incorrectly labeled (the manikin already reported the change, but the researcher had not yet reacted to it). This problem can be mitigated by removing the last few pushes (1-2s) from each segment, because their labels are not reliable. It can also be mitigated by using more precise data collection methods (e.g. if the manikin itself records correct and incorrect pushes at the exact time without researcher involvement).
- Amplitude detection noise. There is a very specific noise in the dataset related to the double integration of acceleration data to get the amplitude of the pushes. Despite being very important, the push amplitude feature turned out to have low correlation with the actual outcome. Some researchers have attempted to overcome this problem [26], and improving and calibrating amplitude detection might improve the overall CPR correctness detection accuracy.
- Movements of the hand are not the same as movements of the sensor. The sensors in a wristband measure the acceleration and orientation of the device itself. However, unless the wristband is very tight on the wrist, the movement of the hand is not the same as the movement of the sensors. The noise is even higher if the wristband keeps moving constantly (e.g. a slightly loose smartwatch during CPR), which can result in significantly different movement patterns. This effect can be viewed as a special source of measurement noise (when the movement of the hand is what is measured).
We attempted to find an algorithm robust enough to recognize the movement patterns of correct or incorrect CPR even in the presence of that noise. Additional data collection (with many examples of this situation) can also help in adjusting the algorithm.
- Data size. The number of features and the number of samples in a dataset varies depending on the sliding window size. Starting from the 7s window size, the number of features exceeds the number of samples, and this brings a significant risk of overfitting. A grid search for proper hyperparameters mitigated the overfitting problem, but did not eliminate it completely. As mentioned in Section IV, the models show signs of overfitting. Another possible way to combat overfitting is to collect more data.
Apart from the noise in the dataset, another possible problem is the time it takes to recognize whether the CPR is performed correctly. The best results were achieved with the 7s window, but in practice a feedback time of 7s might be too long. It is possible to overcome this problem with an overlapping sliding window (e.g. repeating the calculation every second for the last 7s of CPR), but this can introduce another subtle problem. Consider a rescuer who performs CPR incorrectly over the last 7s before correcting the performance. The new 7s window will consist of 1s of the new correct performance and 6s of the previous incorrect performance, and there is a good chance that the model's output will be "incorrect". As a result, the new feedback might be misleading. Reducing the overlap can mitigate this to some extent, but the smaller the overlap, the longer the feedback time. Using smaller time windows can also help, but the models have shown lower accuracy on them. The extent of the problem and possible mitigation strategies should be investigated as part of future work.
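The overlapping-window scheme discussed above can be sketched as a simple generator (the names and parameterization are ours). With a 7s window recomputed every 1s, up to 6s of an old performance lingers in each new window, which is the feedback-lag effect described above.

```python
def sliding_windows(samples, window_len, stride):
    """Yield overlapping windows over a sample stream.

    window_len and stride are given in samples: with 100 Hz data, a 7s
    window recomputed every second would be window_len=700, stride=100.
    """
    for start in range(0, len(samples) - window_len + 1, stride):
        yield samples[start:start + window_len]
```

Each yielded window would be passed through the feature extraction and classifier; a smaller stride gives faster feedback, but each window still mixes up to window_len - stride samples of the previous performance.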

VI. CONCLUSIONS AND FUTURE WORK
This paper presented an investigation into the accuracy of using the accelerometers and gyroscope of a wristband (worn by the rescuer) to detect the quality of a performed CPR. The results have shown that the most relevant features from the sensor data are the amplitudes corresponding to various frequencies of the acceleration signal and the acceleration value itself. Logistic regression (L1 regularization with a strength of 100) showed the best performance among the chosen machine-learning methods. The model achieved over 90% cross-validation accuracy and 80% test set accuracy.
The results of the study are promising, but the accuracy needs to be improved before the proposed methods can be used in practice. Improvement directions include collecting more data, reducing the noise in the dataset by enhancing the data collection procedures, and improving the feature engineering, mostly in the area of push amplitude estimation. Future work should therefore focus on collecting more data and reducing noise to improve accuracy, and on developing a practical application that can be integrated into smartwatches for use in real-life situations when CPR is needed. Another direction for future work is more detailed error analysis, so the application can report the exact push error type rather than only whether the CPR is performed correctly.

ACKNOWLEDGMENT
We would like to thank the Centre for eHealth and the Department of Health and Nursing Sciences at the University of Agder for approving and supporting the study. We would also like to thank all the students who agreed to participate in the experiment. In addition, special thanks go to Kari Hansen Berg and Kjersti Marie Frivoll Johnsen from the University of Agder, and to Aakash Arora, Mederic Hourier, and Anshuman Bhadauria Singh from the Interdisciplinary Centre for Security, Reliability and Trust of the University of Luxembourg for their assistance and valuable inputs.