A Method for Intelligent Allocation of Diagnostic Testing by Leveraging Data from Commercial Wearable Devices: A Case Study on COVID-19

Mass surveillance testing can help control outbreaks of infectious diseases such as COVID-19. However, diagnostic test shortages are prevalent globally and continue to occur in the US with the onset of new COVID-19 variants, demonstrating an unprecedented need for improving our current methods for mass surveillance testing. By targeting surveillance testing towards individuals who are most likely to be infected and, thus, increasing testing positivity rate (i.e., percent positive in the surveillance group), fewer tests are needed to capture the same number of positive cases. Here, we developed an Intelligent Testing Allocation (ITA) method by leveraging data from the CovIdentify study (6,765 participants) and the MyPHD study (8,580 participants), including smartwatch data from 1,265 individuals of whom 126 tested positive for COVID-19. Our rigorous model and parameter search uncovered the optimal time periods and aggregate metrics for monitoring continuous digital biomarkers to increase the positivity rate of COVID-19 diagnostic testing. We found that resting heart rate features distinguished between COVID-19 positive and negative cases earlier in the course of the infection than steps features, as early as ten and five days prior to the diagnostic test, respectively. We also found that including steps features increased the area under the receiver operating characteristic curve (AUC-ROC) by 7–11% when compared with RHR features alone, while including RHR features improved the AUC of the ITA model’s precision-recall curve (AUC-PR) by 38–50% when compared with steps features alone. The best AUC-ROC (0.73 ± 0.14 and 0.77 on the cross-validated training set and independent test set, respectively) and AUC-PR (0.55 ± 0.21 and 0.24) were achieved by using data from a single device type (Fitbit) with high-resolution (minute-level) data. 
Finally, we show that ITA generates up to a 6.5-fold increase in the positivity rate in the cross-validated training set and up to a 3-fold increase in the positivity rate in the independent test set, including both symptomatic and asymptomatic (up to 27%) individuals. Our findings suggest that, if deployed on a large scale and without needing self-reported symptoms, the ITA method could improve allocation of diagnostic testing resources and reduce the burden of test shortages.


Introduction
The COVID-19 pandemic has severely impacted our world, with more than 478 million COVID-19 cases and 6.13 million deaths estimated worldwide [1]. In the US alone, there have been more than 81 million cases and nearly 1 million deaths at the time of writing [2]. Mass surveillance testing has been identified as the most effective tool to monitor the spread of infectious diseases including COVID-19 [3]. However, a combination of cost, availability, and impracticality of frequent and widespread testing impedes the mass epidemiologic surveillance needed to curb new disease outbreaks. To date, COVID-19 diagnostic test shortages are still prevalent globally, and shortages continue to occur in the US with the onset of new variants (e.g., Delta, Omicron) [4]-[6]. For example, when the Delta variant emerged in July 2021, daily demand for tests across the US surged from 250k to 1.5 million in the span of 2 months [7]. A similar circumstance occurred with the Omicron variant, where testing capacity failed to meet the sudden demand [8]-[10]. Furthermore, rural-urban disparities in testing access have exacerbated existing inequities, resulting in further harm to underserved communities [11], [12]. In June 2020, it was estimated that 64% of counties in the United States, predominantly rural, did not have access to COVID-19 testing [13]. Such circumstances lead to underreporting of COVID-19 incidence and may lead to a premature sense of security and unwarranted changes in public health measures [12]. Thus, there is an unprecedented need to improve our current and future methods for mass COVID-19 surveillance testing, especially as stronger testing capacity has been associated with reduced mortality and greater pandemic control [14].
By targeting surveillance testing toward individuals who are more likely to be infected with the disease, more positive cases can be captured with the same number of tests, increasing the positivity rate of the tested population (Fig. 1A) [4]. The positivity rate (i.e., percent positive rate or percent positive) is the percentage of all diagnostic tests performed that are positive. The likelihood of disease presence prior to a diagnostic test, or the pretest probability, is dependent on disease prevalence in the population under surveillance. By filtering the broader surveillance population to a subpopulation with a higher likelihood of infection, the allocation and utility of tests can be improved (Fig. 1A). In other words, more positive cases can be captured with the same number of tests and, thus, the testing positivity rate is increased. The development of tools to increase testing positivity rate is crucial not only in the early phase of a pandemic outbreak, when the available clinical diagnostic testing tools are inadequate to meet the existing demand, but also throughout a pandemic in remote locations, underserved communities, and low- and middle-income countries where testing is known to be particularly scarce [15].
The rapid adoption of commercial wearable devices such as smartwatches and activity trackers brings forth opportunities to develop novel tools to support an intelligent disease detection infrastructure. Multiple studies suggest the utility of digital biomarkers, which are objective and quantifiable physiological and behavioral data collected digitally by consumer devices (e.g., resting heart rate, step count, sleep duration, and respiratory rate), along with patient-reported symptoms, to monitor the progression of respiratory and influenza-like illnesses [16]-[22].
To determine who to test in settings where there are a limited number of diagnostic tests available (i.e., limited testing capacity), we explored whether information from wearables could help rank individuals by their likelihood of a current COVID-19 infection. To achieve this, we developed an Intelligent Testing Allocation (ITA) model that leverages differences in digital biomarkers to distinguish individuals who are likely positive or negative for COVID-19, in order to improve current methods of diagnostic test allocation and increase testing positivity rates.

Results
We developed the CovIdentify platform in April 2020 to integrate commercial wearable device data and electronic symptom surveys to calculate an individual's real-time risk of being infected with COVID-19. A total of 7,348 participants e-consented to the CovIdentify study between April 2, 2020 and May 25, 2021 through the secure research electronic data capture (REDCap) system (Fig 1B) [23]. Of those consented, 6,765 participants enrolled in the study (Supplementary Table 1) by completing an enrollment survey consisting of 37-61 questions that followed branching logic (Supplementary Document 1) [23]. Of those enrolled, 2,887 participants connected their smartwatches to the CovIdentify platform, including 1,689 Garmin, 1,091 Fitbit, and 107 Apple smartwatches. Throughout the course of the study, 362,108 daily surveys were completed by 5,859 unique participants, with a mean of 62 and a median of 37 daily surveys completed per individual. Of all CovIdentify participants, 1,289 participants reported at least one diagnostic test result for COVID-19 (132 positive and 1,157 negative) (Fig 1B). All survey and device data collected through CovIdentify were transferred securely to a protected cloud environment for further analysis. Out of the 1,289 participants with self-reported diagnostic test results, 136 participants (16 positive and 120 negative) had smartwatch data available during the time periods needed for analysis. The relatively small number of participants with available smartwatch data out of the larger population is consistent with similar bring-your-own-device (BYOD) studies aimed toward COVID-19 infection prediction from personal devices [18], [19], [24].
Development of the Intelligent Testing Allocation (ITA) model. A diagnostic testing decision support model was designed to leverage real-world data to intelligently allocate diagnostic tests in a surveillance population where there are insufficient tests available to test all people in the surveillance group (Fig 1A, top). To increase the study population size, we augmented our dataset with data from the MyPHD study. Similar to CovIdentify, MyPHD collected simultaneous smartwatch, symptom, and diagnostic testing data during the COVID-19 pandemic [24], [25]. The wearables and diagnostic testing data were publicly available [24], [25] while symptom data were added for this work. From the MyPHD study, smartwatch, symptom, and diagnostic testing data from an additional 1,129 participants (110 positive and 1,019 negative) were included in this analysis.
Differences in resting heart rate (RHR) and steps measured by smartwatches well before and immediately prior to a COVID-19 diagnostic test.
To compare digital biomarkers between healthy and infected states, data were segmented into two time periods: a baseline period (22 to 60 days prior to the diagnostic test date) and a detection period (21 days prior to the diagnostic test date). We chose this window for the detection period to encompass the COVID-19 incubation period (2-14 days) reported by the CDC as well as the common delay between symptom onset and diagnostic testing. Consistent with prior literature [16], [26], daily RHR increased significantly during the detection period from baseline for those who were COVID-19 positive, with an average difference (± SD) of 1.65 ± 4.63 bpm (n=117, p-value < 0.001, paired t-test) over the entire time periods.
On average, daily RHR values more than two standard deviations above the baseline mean were present as early as 13 days prior to the positive test, with an increasing trend that peaked at one day prior to the test date (Fig 1C, bottom). Conversely, the step count during the detection period decreased significantly from baseline, with a difference of -854 ± 2386 steps/day (n=125, p-value < 0.0001). On average, step counts more than two standard deviations below the baseline mean were present as early as 10 days prior to the positive test and reached the minimum value 2 days after the test date (Fig 1C, top). For the subset of participants in our dataset with available symptom onset dates, daily RHR and step count that differed by more than two standard deviations from the baseline mean occurred as early as five days before the symptom onset date (Supplementary Fig 1). Timelines for this and other real-world infection studies should be considered rough estimates because exact dates of exposure and symptom onset are unknown, unlike in controlled infection studies [22], [27]. Our findings, however, are consistent with the 2-14 day COVID-19 incubation period reported by the CDC [28].
There was also a significant difference in digital biomarkers between the baseline and detection periods of participants who tested negative, but it was less pronounced than for those who tested positive. Specifically, the daily RHR difference was 0.58 ± 4.78 bpm (n=1,094, p-value < 0.05) and the step count difference was -281 ± 2,013 steps/day (n=1,136, p-value < 0.0001). We hypothesized that the digital biomarker differences in the COVID-19 negative group arose because a subset of the negative group may have experienced a health anomaly other than COVID-19 (e.g., influenza) that resulted in physiological differences between the baseline and detection periods. Another recent study also observed RHR elevation and activity reduction in individuals who were COVID-19 negative but flu positive, and the magnitudes of these differences were lower than in individuals who were COVID-19 positive [18]. To explore the possibility that our COVID-19 negative group contains false negatives due to test inaccuracies or physiological differences due to a health anomaly besides COVID-19, we performed hierarchical clustering on the symptom data from individuals who reported negative tests and found a trend toward multiple subgroups (Extended Data Fig 1). This finding supports the existence of COVID-19 negative subgroups. It should also be noted that the highly significant p-value for the digital biomarker differences in the COVID-19 negative group is likely attributable to the higher number of participants (9- 1). Of the 520 total participants with sufficient wearable data, 469 participants had high-frequency minute-level wearable data (280 from Fitbits) from which we calculated daily RHR and step counts. Device-reported daily values were available for the remaining 51 participants. To explore whether high-frequency wearable data, or high-frequency wearable data from a single device type, could improve the performance of digital biomarkers for ITA, we developed and validated our ITA model using three
cohorts, which we refer to as 1) the All-Frequency (AF) cohort: participants with either high-frequency data or device-reported daily values, 2) the All-High-Frequency (AHF) cohort: participants with high-frequency data only, and 3) the Fitbit-High-Frequency (FHF) cohort: participants with high-frequency Fitbit data only (Extended Data Fig 2). We analyzed these three cohorts separately in the subsequent analysis and compared the resulting ITA model performance. We divided each cohort into an 80% train and 20% test split, with FHF as a subset of AHF, which itself is a subset of AF, to ensure that no observations in the training set of one cohort existed in the test set of another (Extended Data Fig 2).
Feature engineering: To explore differences in digital biomarkers (median or mean) between the detection and baseline periods that may be useful for the development of ITA model features, we designed four deviation metrics: (1) Δ (detection - baseline), (2) normalized Δ, (3) standardized Δ, and (4) z-score ((detection - baseline mean) / baseline standard deviation) (Table 2). Each of the four deviation metrics was calculated on the training data by digital biomarker (RHR and step count), day in the detection period, and cohort (examples in Supplementary Fig 2 and 3), resulting in four calculated metrics per cohort per biomarker. These training data deviation metrics were used as inputs into the subsequent statistical analysis for feature extraction and the ITA model training. We extracted the same resultant features from the independent test set for subsequent ITA model evaluation.
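As a rough illustration of two of these deviation metrics, the sketch below computes Δ and the z-score for one participant's daily resting heart rate. It is a minimal example under stated assumptions, not the study's pipeline: the function names and the RHR values are hypothetical, the baseline is summarized by its mean, and the sample standard deviation is used. Normalized Δ and standardized Δ are defined in Table 2, which is not reproduced here, so they are omitted.

```python
from statistics import mean, stdev


def delta(detection_value, baseline_values):
    """Delta: detection-day value minus the baseline mean (assumed summary)."""
    return detection_value - mean(baseline_values)


def z_score(detection_value, baseline_values):
    """Z-score: (detection - baseline mean) / baseline standard deviation."""
    return (detection_value - mean(baseline_values)) / stdev(baseline_values)


# Hypothetical daily resting heart rate (bpm) for a single participant.
baseline_rhr = [60, 62, 61, 59, 63, 60, 62]  # days in the baseline period
detection_rhr = [61, 64, 66, 68]             # days in the detection period

# One value of each metric per day in the detection period.
deltas = [delta(value, baseline_rhr) for value in detection_rhr]
z_scores = [z_score(value, baseline_rhr) for value in detection_rhr]

print("delta RHR (bpm):", deltas)
print("z-score RHR:", [round(z, 2) for z in z_scores])
```

Each metric yields one value per day in the detection period, which is the form the later summary statistics operate on.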
On average, step count decreased (ΔSteps) significantly from baseline to the detection period in COVID-  bottom plots). The 95% confidence intervals of the mean ΔSteps and the mean ΔRHR overlap considerably between positive and negative participants for the initial phase of the detection period (approximately twenty-one to five days prior to the test date). However, closer to the diagnostic test date (approximately four to one days prior to the test date) the 95% confidence intervals of mean ΔSteps largely do not overlap, and the 95% confidence intervals of mean ΔRHR do not overlap at all (Fig 2A). The fact that the 95% confidence intervals of mean ΔSteps and mean ΔRHR do not overlap later in the detection period is consistent with prior literature [29] and suggests that it is possible to aggregate data into summary statistics to develop a decision boundary that effectively separates COVID-19 positive and negative cases. However, the overlap in estimated mean values prior to day 5 suggests that separation between positive and negative cases may be more challenging prior to that point in time. Although the 95% confidence intervals closer to the test date were non-overlapping, there was overlap in the variance of the digital biomarkers between the two groups during that time period (Extended Data Figure 5), which may hinder model performance, as separation of the 95% confidence intervals does not necessarily imply significant differences between the groups [30]. Similar estimates of variability have not been reported previously, so we were unable to compare our mean statistics variability to prior literature.
Optimizing the detection period for the ITA model: To maximize the separability of the COVID-19 positive and negative groups in the training set, we performed statistical analysis to explore how the timing and length of the detection window affect the separation between these two groups. The window was parametrized by two variables: the detection end date (DED), defined in days prior to the diagnostic test date, and the detection window length (DWL), defined in number of days. We performed a combinatorial analysis across these two parameters (DED and DWL) to calculate five summary statistics (mean, median, maximum, minimum, and range) of the four deviation metrics (Table 2) to be used as features for model building. This resulted in 40 total summary statistics (20 each from steps and RHR), which we refer to as steps and RHR features, respectively. Statistical comparison of the steps and RHR features between the COVID-19 positive and COVID-19 negative groups was performed on the training data for the AF, AHF, and FHF cohorts separately to uncover the statistically significant features (unpaired t-tests; Benjamini-Hochberg corrected p-value < 0.05).
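The two steps described above, summarizing a deviation metric over a detection window and then correcting per-feature p-values, can be sketched as follows. This is an illustrative example only: `window_features` and `benjamini_hochberg` are hypothetical helper names, the ΔRHR values and raw p-values are invented, and the window is assumed to end DED days before the test and extend DWL days further back.

```python
from statistics import mean, median


def window_features(daily_metric, ded, dwl):
    """Five summary statistics of a deviation metric over one detection window.

    daily_metric maps "days before the test" (1 = the day before) to a value.
    Assumption: the window covers days ded, ded + 1, ..., ded + dwl - 1.
    """
    values = [daily_metric[d] for d in range(ded, ded + dwl) if d in daily_metric]
    return {
        "mean": mean(values),
        "median": median(values),
        "max": max(values),
        "min": min(values),
        "range": max(values) - min(values),
    }


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical delta-RHR values (bpm), keyed by days before the test.
delta_rhr = {1: 7.0, 2: 5.0, 3: 3.0, 4: 0.0, 5: 1.0, 6: -1.0}
features = window_features(delta_rhr, ded=1, dwl=5)

# Hypothetical raw p-values from unpaired t-tests on four features.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
significant = [p < 0.05 for p in adjusted]

print(features)
print(adjusted, significant)
```

Repeating `window_features` over a grid of DED and DWL values, for each of the four deviation metrics and both biomarkers, would give the full set of candidate features that the correction is applied to.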
A systematic grid search to optimize the detection end date and detection window length (DED and DWL, respectively) demonstrated that the closer the detection period is to the diagnostic test date, the larger the number of features that are significantly different between the COVID-19 positive and negative groups (  and 4B). Additionally, RHR features became significant earlier in the detection period than steps features (DED as early as -10 days vs. -5 days, respectively), which indicates that changes in RHR occur earlier than changes in steps during the course of infection. Comparison across the three cohorts revealed that AF generated the highest number of significant features compared with the AHF and FHF cohorts, which may be attributable to the larger population size of AF. This demonstrates the tradeoff in wearables studies between high-frequency data, which is less common but contains more information, and larger population data, which contains data at a variety of sampling frequencies but provides more data overall to train the models. Across the DWL values, 3 and 5 days generated the largest number of significant features for all cohorts (Figure 2C, Extended Data Fig 3C and 4C), while 5 days also corresponded to the date of the maximum divergence between ΔSteps and ΔRHR (Fig 2A). Ultimately, this systematic analysis pointed to an optimal DED of 1 day prior to the diagnostic test date and an optimal DWL of 5 days, both of which were used to generate features for the ITA model.
ITA feature selection: When implementing the DED timepoint and DWL duration that best separated the COVID-19 positive and negative groups, there were 28-31 significant features (p-value < 0.05; unpaired t-tests with Benjamini-Hochberg multiple hypothesis correction) that overlapped across the three cohorts, indicating their robustness to differences in data resolution and device types (Supplementary Table 2).
The top 7-9 features, ranked in order of significance, originated from the RHR digital biomarker. To gain a more mechanistic understanding of the RHR and step digital biomarkers, we explored the top two most significantly different (lowest p-value) features for each digital biomarker between those who were COVID-19 positive or negative in the AF cohort (Fig 2D). The decrease in steps during the detection period as compared to baseline was greater in those with COVID-19, with a 2054 vs 99 median decrease in steps (median ΔSteps) and a 1775 vs 64 mean decrease in steps for those who were COVID-19 positive vs those who were COVID-19 negative, respectively (p-values < 0.0001). Conversely, the increase in maximum deviation in RHR in the detection period as compared to baseline (maximum ΔRHR) and the increase in mean of Z-scores in the detection period as compared to baseline (mean of Z-score RHR) were both significantly higher for COVID-19 positive participants compared to COVID-19 negative participants (8.4 vs. 4.3 bpm for maximum ΔRHR and 0.9 vs.
0.2 for the mean of Z-score-RHR; p-values < 0.0001). Consistent across all three cohorts, the median and mean ΔSteps were the most significant (lowest p-value) steps features (Extended Data Fig 3D and 4D). However, the top two RHR features differed across cohorts: they were median and mean Z-score-RHR for the AHF cohort, and maximum ΔRHR and maximum of normalized ΔRHR for the FHF cohort (Extended Data Fig 3D and 4D and Supplementary Table 2). The observation of the same top two steps features, despite the differences in the top two RHR features across the three cohorts, may originate from differences in data resolution and in device-reported digital biomarkers. For example, the definition of a step and the calculation of the daily step count may be more similar across different device types, while the RHR definition and available HR data resolution may vary more substantially across device types. Although these top features are significantly different between those who are COVID-
Development of the ITA model: To achieve our broader goal of determining who should receive a diagnostic test under circumstances where there are limited tests available, we aimed to design a model that outputs the probability of a person being infected. However, because our ground truth information is binary (positive or negative for COVID-19), we designed this model as a binary classifier, which enabled straightforward evaluation of its performance. We used the features that were significantly different in the training data between those who were COVID-19 positive and negative (29 features for AF, 28 for AHF, and 31 for FHF) as inputs into five machine learning classification models: logistic regression, k-nearest neighbors (KNN), support vector machine (SVM), Random Forest (RF), and extreme gradient boosting (XGBoost) (Extended Data Table 1). We chose these five well-established classification models to explore how increasing model complexity and the addition of non-linearity impact the model performance. We trained these classification
models on the training data using nested cross-validation with an inner loop for hyperparameter tuning and an outer loop for model selection. We chose recall as our preferred scoring metric for model selection and evaluation to emphasize the relative impact/cost of false negatives compared to false positives, as an individual who is truly positive for COVID-19 and is wrongly classified as negative (or healthy) would further spread disease.
Following training, we evaluated the performance of the trained model on the independent test set and used two well-established reporting metrics, including the most commonly reported metric for studies of this kind (the area under the curve for the receiver operating characteristic curve (AUC-ROC)) [26], [31]-[35], and the metric that is most appropriate for this classification task (AUC for the precision-recall curve (AUC-PR)) [36] Fig 6). AUC-PR is more appropriate with class-imbalanced data [36], [37], which is the case here (12-15% COVID-19 positive and 85-88% negative in each of the three cohorts). The class imbalance in our dataset was not correctable through Of the five models tested, logistic regression outperformed all other models based on the training AUC-PR for all three cohorts and was also the best performing model based on the training AUC-ROC for the AF and FHF cohorts. The superior performance of logistic regression over the other (more complex and non-linear) models may be attributed to the tendency of more complex and non-linear models to overfit the training data [38], which comes to light with our cross-validation methods. The superior performance of logistic regression also points to the potential to develop explainable machine learning predictive models for ITA, which would enable rapid translation from bench to bedside. Overall, the classifier performed best on the FHF cohort (Extended Data Table 1, Fig 4 A, B, C, and D), followed by the AHF cohort (Extended Data Fig 6 A, B, C, and D) and finally the AF cohort (Fig 3 A, B, C, and D). These performance differences indicate that device-related and data resolution differences may confound disease-related physiological differences captured by digital biomarkers. Therefore, building models using a single device type and with higher resolution data improves performance. For the FHF cohort, the logistic regression model resulted in an AUC-ROC of 0.73 ± 0.12 and AUC-PR of 0.55 ± 0.21 on the cross-validated
training set, and AUC-ROC of 0.77 and AUC-PR of 0.24 on the test set (Fig 4). The AUC-ROC values from the models were similar to those reported in recent similar studies [26], [32], [35].
However, the performance of the models based only on AUC-ROC in the context of imbalanced data can be misleading, as a large change in the number of false positives may have a small effect on the false positive rate [37]. The precision metric, which integrates both true positives and false positives, can mitigate the effect of an imbalanced dataset (e.g., the higher proportion of negatives seen in this type of data) on a model's performance. Our precision-recall analysis (Fig 3B, 4B, and Extended Data Fig 6B) demonstrates that we can improve the recall (minimizing false negatives) at the expense of precision. In an extreme example, we were able to achieve 100% recall with a precision of 0.4 on the cross-validated training set of the FHF cohort, whereas a dummy classifier with random chance (i.e., Random Testing Allocation (RTA)) can achieve a precision of 0.15 on this dataset. It is also important to note that we are not considering resource-limited settings in the ROC and PR analysis; instead, it is assumed that there are a sufficient number of diagnostic tests available for the entire surveillance group. In a resource-limited setting, 100% recall may not be achievable due to the shortage of diagnostic testing.
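The effect of class imbalance on these metrics can be seen in a small worked example. The confusion-matrix counts below are hypothetical (1,000 people, 5% of them positive) and are not taken from the study; `confusion_rates` is an illustrative helper name.

```python
def confusion_rates(tp, fp, fn, tn):
    """Recall, false positive rate, and precision from confusion-matrix counts."""
    return {
        "recall": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        "precision": tp / (tp + fp),
    }


# Hypothetical population: 50 positives and 950 negatives.
# Classifiers A and B find the same 40 true positives, but B raises
# three times as many false alarms (60 instead of 20).
classifier_a = confusion_rates(tp=40, fp=20, fn=10, tn=930)
classifier_b = confusion_rates(tp=40, fp=60, fn=10, tn=890)

# A classifier that flags people at random has an expected precision
# equal to the prevalence of positives in the population.
random_precision = 50 / 1000

for name, rates in (("A", classifier_a), ("B", classifier_b)):
    print(name, {metric: round(value, 3) for metric, value in rates.items()})
print("random precision:", random_precision)
```

In this toy case the false positive rate moves only from about 0.02 to about 0.06, while precision falls from about 0.67 to 0.40, which is why a precision-based view is more sensitive when negatives dominate.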
To understand the relative contribution of the steps and RHR digital biomarkers to the ITA model performance, we developed two separate sets of models using features based only on either steps or RHR using the training set data with logistic regression. Consistent with previous literature [26], [32], the models using steps-based features alone had a higher AUC-ROC than models using RHR-based features alone (cross-validated AUC-ROC of 0.67 vs. 0.64, 0.69 vs. 0.63, and 0.72 vs. 0.68 for steps vs RHR features for the AF, AHF, and FHF training sets, respectively) (Extended Data Fig 7). Interestingly, when using the AUC-PR as the performance metric, models using features based on RHR digital biomarkers outperformed models using features based on steps digital biomarkers, a finding which has not been previously reported (cross-validated AUC-PR of 0.30 vs. 0.38, 0.28 vs. 0.37, and 0.40 vs. 0.49 for steps and RHR features for the AF, AHF, and FHF training datasets, respectively) (Extended Data Fig 7). Overall, the addition of steps features increased the AUC-ROC of the ITA model by 7-11% compared with RHR features alone, while RHR features improved the AUC-PR of the ITA model by 38-50% compared with steps features alone. These results suggest that, while steps features provide more salient information on the trade-off between the true positive rate and false positive rate, RHR features provide more salient information on the trade-off between the true positive rate and the precision (positive predictive value). In other words, while steps features improved the specificity of the predictive model, RHR features improved the precision. We also compared the relative feature importances in the logistic regression and found that two, one, and four of the top five features originated from RHR in the AF, AHF, and FHF cohorts, respectively, with the remaining features originating from steps (Extended Data Fig 8). In all three cohorts, median ΔSteps and mean ΔSteps were the two most important
steps features, which was consistent with our earlier statistical analysis. Maximum ΔRHR was the most important RHR feature for the AF and AHF cohorts and the second most important RHR feature for the FHF cohort, and was also one of the top two most significant features in our earlier statistical analysis for the AF and FHF cohorts.
Improvement in positivity rate for COVID-19 diagnostic testing using the ITA method. We next evaluated how the ITA model can improve the current standard of practice for COVID-19 infection surveillance.
Under current surveillance testing methods in the US, while some tests are taken due to symptoms or possible exposure, many are taken as precautionary measures for traveling or for surveillance in schools and workplaces [28]. While such forms of widespread RTA surveillance are beneficial, the positivity rate of widespread diagnostic testing is typically low and, thus, requires sufficient testing capacity in order to prevent testing shortages (e.g., sold out at-home testing kits). Applying an equivalent RTA surveillance approach to our study population results in a 12% positivity rate in both our AF-training (50 COVID-19 positive participants out of 365 participants in total) and AF-test (13 COVID-19 positive participants out of 92 participants in total) datasets. It is important to note that the 12% positivity rate is consistent for all levels of diagnostic testing capacity (0-100% of population). When employing ITA and adding the constraint of limited diagnostic testing capacity (10-30% of population), the testing positivity rate of the cross-validated model increased 2 to 3-fold ( Comparison of the three cohorts demonstrated that the best performing ITA model stemmed from the FHF cohort and was followed by the AHF cohort (Fig. 4C and 4F, and Extended Data Fig.
6C and 6F). By utilizing ITA and assuming a diagnostic testing capacity at 10-30% of the population, the positivity rate of the FHF and AHF cross-validated training datasets increased by 4-fold (64% positivity rate) and 3-fold (35% positivity rate) when compared to the RTA positivity rates of 15% and 12% for FHF and AHF cohorts, respectively. For the FHF cohort, the positivity rate further increased up to 6.5-fold (100% positivity rate) in the cross-validated training dataset when the diagnostic testing capacity was reduced to 2.5-5% of the population (5-11 diagnostic tests to be allocated to individuals in the training dataset) (Fig 4F). Using the independent test data set, the positivity rate of the FHF and AHF cohorts increased by 1.5 to 3-fold (17-31% positivity rate) and 2 to 3-fold (21-32% positivity rate), respectively, compared to the RTA positivity rate of 11%, when the diagnostic testing capacity was 10-30% of the population. These results indicate the potential of the ITA model to target diagnostic testing resources towards individuals who have a higher likelihood of testing positive (i.e., increasing the positivity rate of diagnostic testing) and to enable more efficient allocation of testing capacity.
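The comparison between ITA and RTA under limited testing capacity reduces to a ranking computation, sketched below. This is a simplified illustration rather than the study's evaluation code: the scores stand in for model-predicted probabilities of infection, the labels are invented (3 positives among 20 people), and `ita_positivity_rate` is a hypothetical helper name.

```python
def positivity_rate(labels):
    """Fraction of tested individuals who are positive (1 = positive)."""
    return sum(labels) / len(labels)


def ita_positivity_rate(scores, labels, capacity):
    """Positivity rate when only the top-ranked fraction `capacity` is tested."""
    n_tests = max(1, round(capacity * len(scores)))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tested = [label for _, label in ranked[:n_tests]]
    return positivity_rate(tested)


# Hypothetical predicted infection probabilities and true test results.
scores = [0.90, 0.80, 0.70, 0.60, 0.55, 0.50, 0.45, 0.40, 0.35, 0.30,
          0.28, 0.25, 0.22, 0.20, 0.18, 0.15, 0.12, 0.10, 0.08, 0.05]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 1,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Under random allocation the expected positivity rate equals the
# prevalence, whatever the testing capacity.
rta_rate = positivity_rate(labels)

# With capacity for 25% of the group, test the 5 highest-ranked people.
ita_rate = ita_positivity_rate(scores, labels, capacity=0.25)
fold_increase = ita_rate / rta_rate

print(f"RTA: {rta_rate:.2f}, ITA: {ita_rate:.2f}, fold: {fold_increase:.2f}")
```

In this toy ranking, two of the five tested individuals are positive, so the positivity rate rises from 15% to 40%. One positive individual is ranked too low to be tested, which illustrates why recall can fall short of 100% when capacity is limited.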
We further explored how the ITA model performs in symptomatic versus asymptomatic COVID-19 positive individuals in each cohort. We considered participants symptomatic if they reported any symptoms in the detection period or on the diagnostic test date. Assuming a diagnostic testing capacity of 30%, ITA indicates testing for 19 of 29 symptomatic and 7 of 21 asymptomatic COVID-19 positive individuals in the cross-validated model, and 5 of 8 symptomatic and 1 of 5 asymptomatic COVID-19 positive individuals in the independent test set of the AF cohort. In other words, 7 of 26 (27%) and 1 of 6 (17%) COVID-19 positive individuals were asymptomatic in the ITA-determined subpopulation for the cross-validated training set and independent test set of the AF cohort, respectively. Results were similar for the AHF and FHF cohorts (Extended Data Table 2). These findings indicate that the ITA model can target diagnostic testing resources not only towards individuals with symptoms, but also towards those without any reported symptoms, further increasing the utility of this method.

Discussion
The COVID-19 pandemic revealed the fragility of our existing healthcare infrastructure to detect the virus and prevent its spread. One key tool for reducing disease spread is bringing diagnostic testing to the right people at the right time and ensuring appropriate interpretation of the diagnostic testing results based on the prevalence of the disease in the population [4]. In light of this need, in April 2020 we developed CovIdentify to integrate commercial wearable device data and electronic symptom surveys to assess the real-time risk of being infected with COVID-19. We envisioned two possible scenarios where CovIdentify would be useful for informing intelligent testing decisions: 1) ranking individuals in a group by likelihood of current infection with COVID-19 to determine who to test, and 2) tracking a single individual over time for evidence of new infection onset to determine when to test. In our initial development of the Intelligent Testing Allocation (ITA) model, we focused on the first question, and ultimately improved the positivity rate of COVID-19 diagnostic testing up to 6.5-fold when compared against Random Testing Allocation (RTA). Based on these results, if deployed on a large scale, the ITA model could be used to better allocate diagnostic testing resources. This method is likely applicable to other diagnostic areas as well, where digital biomarkers can be used to indicate the likelihood of disease.
In this work, we demonstrated that wearable device data can be used to strategically target the allocation of diagnostic tests to where they are most useful. This approach not only increases testing efficiency but also reduces the costs and supply chain burden of surveillance testing, which is an ongoing challenge. Our results further demonstrate that the ITA method is able to filter a surveillance population to generate a subpopulation with a higher density of true positives, regardless of the prevalence and pre-test probability of COVID-19 infection in the population under surveillance, and thus increases testing positivity rates. We also demonstrate the utility of the ITA method for allocating diagnostic tests not only to symptomatic individuals but also to asymptomatic individuals who may not be tested and diagnosed otherwise. While the sensitivity and specificity of diagnostic tests are not affected by ITA, this more efficient testing allocation approach identifies more cases in less time and with fewer resources [39]-[42].
Consistent with prior literature, we demonstrate here that digital biomarkers derived from heart rate and physical activity are indicative of infection onset. A unique contribution of our work is the demonstration of differences in digital biomarker significance with respect to time prior to the diagnostic test date; specifically, we show that differences in RHR features were significant between COVID-19 positive and negative groups as early as ten days prior to the diagnostic test date, whereas differences in most steps features were not significant until five days prior to the diagnostic test date. One steps feature, minimum ΔSteps, was significant up to nine days prior to the diagnostic test date, potentially demonstrating a link between activity levels (and perhaps noncompliance with lockdown measures) and COVID-19 exposure. Furthermore, RHR begins to deviate from baseline earlier than steps (as early as thirteen days vs. ten days prior to the diagnostic test date, respectively), and the peak effect (maximum deviation from baseline) of infection also occurs earlier in RHR than steps (one day prior vs. two days after the diagnostic test date, respectively) for those who were COVID-19 positive. These results indicate that changes in physiology (RHR) occur earlier in the infection period, while symptoms and reduced physical activity (steps) transpire later in the infection period, when people may limit their movement due either to illness or mandatory quarantine. A recent COVID-19 study assessing prolonged physiological and behavioral changes using wearables also observed that COVID-19 positive individuals took more time to return to their RHR baseline values compared to their step and sleep baseline values following the acute COVID-19 infection period [29]; however, this work explored the post-infection period of the data, whereas here we explore the pre-infection period as well as the acute infection period using a systematic grid search approach. Another recent study [32] that developed machine learning models to passively detect COVID-19 using wearable data noted relative changes in feature importance when including data post-diagnosis. However, to our knowledge, we are the first to demonstrate and establish the dynamics of feature importance over time prior to the diagnostic test date, indicating which features should be weighted more heavily in prediction models and when.
Another important contribution of our work is demonstrating the utility of RHR and steps features in the trade-off between the true positive rate and false positive rate (ROC analysis) and the trade-off between the true positive rate and the positive predictive value (PR analysis). Specifically, we show that while steps features provide more salient information on the trade-off between the true positive rate and false positive rate, RHR features provide more salient information on the trade-off between the true positive rate and the precision (positive predictive value). To our knowledge, this is the first demonstration of this trade-off in predictive model development for COVID-19 infection detection. The ITA model, in addition to using features of RHR and steps, can likely be further extended and improved with features from other digital biomarkers such as skin temperature, respiratory rate, blood oxygen saturation, and sleep duration [21], [22], [33], [34]. It is anticipated that each of these distinct digital biomarkers would capture a physiological response to infection at different times during the detection period, thus improving the robustness and overall performance of the ITA approach.
One of the important observations from our work was the clear separation of the 95% confidence intervals of the means of digital biomarkers between COVID-19 positive and negative populations as early as five days prior to the test date (Fig. 2A and Extended Data Fig. 3A and 4A), while the variances of the groups have overlapping distributions in the same time window (Extended Data Fig. 5). Notably, a lack of overlap in 95% confidence intervals does not necessarily imply significant differences between the groups [30], as standard deviation is a valuable descriptive measure of the data that should be considered as well. There are many possible sources of variance in studies involving wearable data, including the inclusion of different device types and technologies, contexts of measurement (e.g., time of day, activity type, etc.), differences in physiological response to infection, etc. We mitigated this issue by segmenting by device type and data resolution, as well as by utilizing measurements during resting periods only for the RHR calculation. In the future, larger datasets can enable segmentation by demographics (e.g., age, sex, weight, etc.) that would likely further reduce the variance. Sharing datasets between studies, as demonstrated here, can also augment the study population and further reduce the variance. An open question is whether the resolution of current photoplethysmography-based wearable heart rate technologies is high enough to adequately detect signals above the population variance.
Here, we did not deploy the ITA method in real time and, thus, its performance in practice remains to be tested. Both the CovIdentify and MyPHD studies were primarily Bring Your Own Device (BYOD) study designs, in which people who already own smart devices are recruited to participate. The BYOD design presents two major challenges: 1) participants must own a smart device, which limits eligibility to those who can afford devices, and 2) many different types of devices are used, introducing an additional source of noise in the analysis. We mitigated the first challenge by developing and implementing the Demographic Improvement Guideline [45], and the second challenge by dividing our overall dataset into cohorts with homogeneous sampling frequencies and/or device types. Although we recognize that certain factors decrease the likelihood of wearable device ownership, such as lower income or living in a rural area [46]-[48], the precipitously decreasing cost of wearable technology is rapidly increasing the equitable distribution of these technologies [49].
Another limitation of the study is data missingness and its impact on the deviation of the digital biomarkers, as the source of missingness may confound the disease-related physiological variation. For example, we observed that some participants in our study did not wear their devices when they were feeling sick, as observed in other studies [25], which resulted in a reduction in recorded physical activity. For that reason, it can be a challenge to isolate the effects of physiological and behavioral changes on the digital biomarkers. Furthermore, some devices require more frequent charging (e.g., Apple Watch), which results in more missing data that may also impact model performance. We mitigated this challenge by further developing our model on a cohort with a single device type and homogeneous sampling frequency (FHF).
The recent body of work on COVID-19 detection using smartwatches uses AUC-ROC to evaluate model performance [26], [32]-[35], which is only an appropriate metric for class-balanced data, and is otherwise misleading [36], [37]. In these large-scale studies conducted on a convenience sample of the population for a disease with low prevalence, there exists an inherent challenge of class imbalance because most of the study population does not contract the disease. This was a challenge that we faced in our study, and, further complicating matters, many of the COVID-19 positive participants did not wear their wearable devices at the start of their infection, exacerbating the class imbalance. While less frequently reported than AUC-ROC, the AUC-PR is the correct evaluation metric for evaluating a classifier on imbalanced data [36], which is what we report here. We show that even with a strong AUC-ROC, the AUC-PR demonstrates the limitations of performance. Methods to resolve class imbalance, especially when working with wearable device data, can be further investigated for future studies.
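The gap between the two metrics on imbalanced data can be seen in a small worked example. The sketch below is illustrative only: it computes AUC-ROC as the fraction of positive-negative pairs ranked correctly, and it approximates AUC-PR by average precision, which is one common estimator and is an assumption here rather than the estimator used in this study. The scores and labels are made up, and the functions assume at least one positive and one negative and no tied scores.

```python
def auc_roc(scores, labels):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean of the precision values at the rank of each true positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / hits


# Hypothetical imbalanced sample: 2 positives among 20 (10% prevalence).
# The positives sit at ranks 3 and 5 when sorted by score.
labels = [0, 0, 1, 0, 1] + [0] * 15
scores = [0.95, 0.90, 0.85, 0.80, 0.75] + [0.70 - 0.02 * i for i in range(15)]

roc = auc_roc(scores, labels)
pr = average_precision(scores, labels)
print(f"AUC-ROC {roc:.2f}, average precision {pr:.2f}")
```

In this toy sample only three negatives outrank a positive, so the ROC-based metric stays high, yet most of the highest-scored individuals are still negative, which is what the precision-based metric captures.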
While our study focused on improving testing allocation for COVID-19, the methods developed herein are extensible to other types of infections and could be used to fortify our future pandemic preparedness. Using ITA to improve disease surveillance could be especially important in underserved communities, because the ITA method is useful even with only steps digital biomarkers. These can be obtained from smartphones, which are owned by 85% of the population in the US [50] and up to 76% globally [51]. By targeting diagnostic testing toward individuals who are more likely to truly be infected with a disease, we can improve the allocation and utility of diagnostic tests, ultimately reducing mortality and increasing our ability to control the current and future pandemics.

Participant Recruitment and Data Collection
The CovIdentify study launched on April 2, 2020 (Duke University Institutional Review Board #2020-0412). Eligibility criteria included age over 18 years and internet access. Social networks and social media advertising were used to recruit participants. By May 25, 2021, a total of 7,348 participants were recruited and e-consented through the Research Electronic Data Capture (REDCap) system [23]. During enrollment, participants were given the option to donate 12 months of retrospective wearable data and 12 months of prospective wearable data. Wearable data were collected via the CovIdentify iOS app for devices connected to Apple HealthKit (e.g., Apple Watch) or via Application Programming Interfaces (APIs) for other devices (e.g., Garmin and Fitbit devices). The participants were also asked to complete an onboarding (enrollment) survey and daily surveys. The surveys were in English or Spanish and included questions on symptoms, social distancing, diagnostic testing results, and related information (Supplementary Document 1). Surveys were collected using the CovIdentify iOS app, text messaging, and/or emails. All wearable data and survey results were stored in a secured Microsoft Azure data platform and later analyzed in the Microsoft Azure Machine Learning environment. Soon after CovIdentify was launched, exploratory data analysis revealed major differences between CovIdentify demographics and the demographics of COVID-19 positive cases and deaths in the U.S., as well as overall U.S. demographics based on the 2020 U.S. Census [52], [53]. We sought to mitigate the imbalance throughout the duration of the study by providing wearable devices to underrepresented populations [45]. COVID-19 vaccine reporting was added to the daily surveys in February 2021, where we asked questions regarding the vaccination date, vaccine brand, vaccine-related symptoms, and dose number.
Wearable data processing and analysis: Participants were asked to fill out an enrollment survey following the informed e-consent. Daily symptom survey and wearable data from the participants were analyzed both separately and together. For the overall analysis, we only included participants with self-reported diagnostic test results for COVID-19. These participants were further divided into two categories based on the self-reported diagnostic test results: COVID-19 positive and COVID-19 negative.
In addition to the data collected via CovIdentify, we augmented our analysis by including data from the MyPHD study, as reported in two recent publications by Mishra et al. [25] and Alavi et al. [24]. For wearable data analysis, we only included days of wearable data when both heart rate and step count were available. Out of the 1,239 participants (113 from CovIdentify and 1,126 from the MyPHD study) who had both heart rate and step count data available, we had device-reported daily values of RHR and step count for 67 participants, and high frequency (second- or minute-level, depending on device type) wearable data for 1,172 participants. For participants with high frequency heart rate data, we calculated daily RHR from the heart rate data points recorded between midnight and 7 AM, when there were no steps recorded. For those participants with available high frequency wearable data, we chose a data-driven threshold (i.e., a minimum number of heart rate data points between midnight and 7 AM with zero recorded steps) to include our calculated RHR data from that day in the subsequent analysis. As the sampling rate varies by device type (Fitbit, Garmin, and Apple Watch), we generated separate data distributions of the datasets for these three device types and selected the first quartile of heart rate data points per device as the data-driven threshold, which resulted in a threshold of 2,630, 19, and 1,389 heart rate data points for Fitbit, Apple Watch, and Garmin devices, respectively. In other words, on a given day, a participant with Fitbit wearable data needed to have at least 2,630 heart rate data points between midnight and 7 AM with zero recorded steps for us to include our calculated RHR value in the subsequent analysis. Following this intraday data point threshold, we used an interday data threshold: a minimum number of days with available wearable data to be included in the analysis (50% in the baseline period and 50% between nine days and one day prior to the diagnostic test date in the detection period). We explored different minimum numbers of days of available wearable data in the baseline and detection periods and selected these two thresholds to maximize the number of participants while keeping the performance of the ITA model on the training dataset consistent, defined as less than 10% variation of the performance metrics (AUC-ROC and AUC-PR).
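The two data availability filters described above can be sketched as follows. This is a minimal illustration under stated assumptions: the quartile is computed with Python's default `statistics.quantiles` method (the exact quartile method used in the study is not specified here), the function names are hypothetical, and all counts are made up. Only the detection window (nine days to one day prior to the test) is shown; the same check would apply to the baseline period.

```python
import statistics


def first_quartile(points_per_day):
    """First quartile of the number of usable heart rate points per day."""
    # quantiles(..., n=4) returns the three quartile cut points.
    return statistics.quantiles(points_per_day, n=4)[0]


def has_enough_days(valid_days, period_days, min_fraction=0.5):
    """True if enough days of the period have usable wearable data."""
    covered = sum(1 for day in period_days if day in valid_days)
    return covered / len(period_days) >= min_fraction


# Hypothetical counts of resting heart rate points per participant-day
# for one device type.
points_per_day = [100, 200, 300, 400, 500, 600, 700]
threshold = first_quartile(points_per_day)

# Intraday filter: keep only days that reach the threshold.
usable = [n for n in points_per_day if n >= threshold]

# Interday filter: days -9 to -1 relative to the diagnostic test date.
detection_window = list(range(-9, 0))
participant_a = {-9, -7, -5, -3, -1}  # usable data on 5 of 9 days
participant_b = {-9, -1}              # usable data on 2 of 9 days
keep_a = has_enough_days(participant_a, detection_window)
keep_b = has_enough_days(participant_b, detection_window)
print(threshold, len(usable), keep_a, keep_b)
```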
Cohort Definition: The wearable data availability thresholds (both intraday and interday) resulted in an AF cohort of 520 participants (83 from CovIdentify and 437 from MyPHD) with sufficient wearable data. We then created two more subsets from this cohort (Extended Data Fig. 2): (1) AHF cohort: participants with high frequency wearable data (469 participants, 54 COVID-19 positive and 415 COVID-19 negative), and (2) FHF cohort: participants with high frequency wearable data from a single source (Fitbit) (280 participants, 40 COVID-19 positive and 240 COVID-19 negative) to explore the impact of utilizing wearable data from different sources and resolutions on the ITA model development. We employed these three cohorts separately for the ITA model development and compared the resulting models' performance in the corresponding training and test datasets of these cohorts. We divided each cohort into an 80% train and 20% test split, with FHF as a subgroup of AHF (which itself is a subset of AF) to ensure that no observations in the training dataset of one cohort existed in the test dataset of another (Extended Data Fig. 2).

Digital biomarker definition:
Given the use of datasets with different device types, a consistent RHR definition was used in order to harmonize the cohorts with high frequency wearable data. We calculated the daily RHR digital biomarker by aggregating the high frequency heart rate data points available between midnight and 7 AM, when there were no steps recorded. Step count was calculated by summing all recorded step values during a 24 hour period in order to produce a daily step count digital biomarker.
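A minimal sketch of these two daily digital biomarkers is given below. The record layout, function names, and values are hypothetical. The aggregate used for RHR is not specified in this section, so the sketch assumes the median, and it interprets "between midnight and 7 AM" as timestamps from 00:00 through 06:59.

```python
from datetime import datetime
from statistics import median


def daily_rhr(records, min_points):
    """Resting heart rate for one participant-day, or None if data are too sparse.

    records: list of (timestamp, heart_rate, steps) tuples for a single day.
    """
    resting = [hr for ts, hr, steps in records if ts.hour < 7 and steps == 0]
    if len(resting) < min_points:
        return None
    return median(resting)  # assumption: the median serves as the aggregate


def daily_steps(records):
    """Daily step count: the sum of all step values recorded that day."""
    return sum(steps for _, _, steps in records)


# Hypothetical records for a single day (values are made up).
day = [
    (datetime(2020, 6, 1, 1, 0), 58, 0),
    (datetime(2020, 6, 1, 2, 0), 60, 0),
    (datetime(2020, 6, 1, 3, 0), 62, 0),
    (datetime(2020, 6, 1, 4, 0), 90, 25),     # excluded: steps were recorded
    (datetime(2020, 6, 1, 12, 0), 100, 500),  # excluded: after 7 AM
]
rhr = daily_rhr(day, min_points=3)
steps = daily_steps(day)
print(rhr, steps)
```

Returning `None` for sparse days keeps the intraday threshold explicit, so that downstream steps can drop those days rather than use an unreliable RHR value.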

Feature Engineering and Extraction
Following the creation of three cohorts (AF, AHF, and FHF) and their corresponding training and test sets, we performed exploratory data analysis (EDA) and extracted features from the time-series digital biomarkers (RHR and step count). For the EDA on the time-series digital biomarkers, we explored the difference in trajectories of digital biomarkers between COVID-19 positive and COVID-19 negative participants (Fig. 2A and Extended Data Fig. 3A and 4A). Following the EDA, we extracted the features listed in Table 2 from the raw digital biomarkers. We first calculated four deviation metrics, which capture the deviation in digital biomarkers from participants' baseline during the detection phase. We then calculated summary statistics of these four deviation metrics, which we refer to as features in this manuscript. We extracted the same features from the training and test datasets. Following the feature extraction, we performed statistical analysis on the features from the training datasets of the three cohorts to see which features are statistically different between the two groups and how their significance levels vary with different detection period combinations (detection end date (DED) and duration of the detection window (DWL)), using a systematic grid search to optimize DED and DWL (Fig. 2B and Extended Data Fig. 3B and 4B). We utilized multiple hypothesis testing with Benjamini-Hochberg adjusted p-values for this statistical analysis. Following the statistical analysis and systematic grid search to obtain the optimal detection period to extract the features, we only utilized the intersection of the statistically significant features (p-value < 0.05) extracted from digital biomarkers recorded in two windows, from five days to one day and from three days to one day prior to the diagnostic test date, for the development of the ITA model.
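The multiple testing adjustment and the intersection of significant features across two detection windows can be sketched as follows. The Benjamini-Hochberg step-up adjustment itself is standard; the feature names and raw p-values are hypothetical placeholders and are not results from this study.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value to enforce monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_features(names, p_values, alpha=0.05):
    """Names of features whose adjusted p-value is below alpha."""
    adjusted = benjamini_hochberg(p_values)
    return {name for name, p in zip(names, adjusted) if p < alpha}


# Hypothetical features and raw p-values for two detection windows.
names = ["feature_a", "feature_b", "feature_c", "feature_d"]
p_five_to_one = [0.001, 0.010, 0.030, 0.200]   # window: days -5 to -1
p_three_to_one = [0.002, 0.040, 0.020, 0.500]  # window: days -3 to -1

adjusted_example = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
selected = (significant_features(names, p_five_to_one)
            & significant_features(names, p_three_to_one))
print(sorted(selected))
```

Taking the set intersection keeps only the features that remain significant in both windows, which mirrors the selection rule described in the text.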

ITA Model
Following feature extraction, we developed predictive models to classify COVID-19 positive and negative participants in the training dataset of each cohort (AF, AHF, and FHF) using nested cross-validation (CV) and later validated the models on corresponding independent test datasets. We chose five state-of-the-art machine learning models (logistic regression, K-nearest neighbor (KNN), support vector machine (SVM), Random Forest (RF), and extreme gradient boosting (XGBoost) [54], [55]) for the development of the ITA models to explore how increasing model complexity and adding non-linearity would impact the model performance. We trained these classification models on the training dataset using nested CV with an inner CV loop for hyperparameter tuning and an outer CV loop for model selection. For model training, we selected recall as our preferred scoring metric for model selection to emphasize the relative impact/cost of false negatives compared to false positives, as an individual who is truly positive for COVID-19 and is wrongly classified as negative (or healthy) would further spread disease. For model performance evaluation, we used two well-established reporting metrics, including the most commonly reported metric for studies of this kind (the area under the curve for the receiver operating characteristic curve (AUC-ROC)) [26], [31]-[35], and the metric that is most appropriate for this classification task (AUC for the precision-recall curve (AUC-PR)) [36] (Extended Data Table 1; Fig. 3, Fig. 4, and Extended Data Fig. 6). AUC-PR is more appropriate with class-imbalanced data [36], [37], which is the case here (12-15% COVID-19 positive and 85-88% negative for each of the three cohorts). The results reported for the training dataset (Extended Data Table 1, Fig. 3A, B, and C, Fig. 4A, B, and C, and Extended Data Fig. 6A, B, and C) were generated from the validation on the held-out dataset (fold) from each iteration of the outer CV loop, which was not used in the model training. Based on the CV results of the five machine learning models on the training dataset, we chose the logistic regression model to further evaluate performance on the independent testing dataset (Fig. 3D, E, and F, Fig. 4D, E, and F, and Extended Data Fig. 6D, E, and F). For validation on the independent test dataset, we trained the logistic regression model on the entire training dataset using a grid search with five stratified folds for hyperparameter tuning and selected the best model (with tuned hyperparameters) to validate on the test dataset.

Nested Cross-Validation
For model development with the training dataset, we utilized nested CV rather than traditional CV, which is a common approach in similar studies [26], [32], [34], [35], because traditional CV uses the same data for hyperparameter tuning and model performance evaluation [56]. In nested CV (also called double CV), the hyperparameter tuning procedure (inner loop) is nested under the model selection procedure (outer loop): the inner loop optimizes the hyperparameters of the model with inner CV, and the outer loop computes the error of the optimized model with outer CV [57]. For the nested CV, we divided the training set into ten stratified folds (keeping the ratio of COVID-19 positive and negative participants the same across each fold) for the outer loop. For each iteration of the outer loop, the model was trained on data from nine folds by optimizing the hyperparameters of the model with inner CV, and validated on the left-out fold, a process which was repeated nine more times. In each iteration of the outer loop, the outer training data (from nine folds) were further divided into five stratified folds (inner loop) to tune hyperparameters using a grid search. Out of the five iterations with the grid search in the inner loop, the best model (including hyperparameters) was selected, and this model was used in the model performance evaluation in the outer loop. This way of model development using two CV steps separates hyperparameter tuning and model selection in order to reduce bias in model performance.
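The structure of the two loops can be sketched in a few lines of dependency-free Python. This is an illustration of the nesting only, not the study pipeline. The toy model (a threshold placed between the class means, with the placement `w` as its single hyperparameter), the balanced accuracy score, the round-robin stratification, and the synthetic data are all assumptions made for the example. The sketch also assumes that the hyperparameter with the best mean inner-fold score is refit on the full outer training data before it is scored on the held-out outer fold.

```python
import random


def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds that preserve the class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for position, i in enumerate(members):
            folds[position % k].append(i)
    return folds


def fit(x, y, w):
    """Toy model: a threshold placed a fraction w between the class means."""
    pos = [v for v, t in zip(x, y) if t == 1]
    neg = [v for v, t in zip(x, y) if t == 0]
    mean_pos, mean_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    return mean_neg + w * (mean_pos - mean_neg)


def balanced_accuracy(x, y, threshold):
    """Mean of sensitivity and specificity for the thresholded predictions."""
    pred = [1 if v >= threshold else 0 for v in x]
    tpr = sum(p for p, t in zip(pred, y) if t == 1) / y.count(1)
    tnr = sum(1 - p for p, t in zip(pred, y) if t == 0) / y.count(0)
    return (tpr + tnr) / 2


def cross_validate(x, y, w, k, seed):
    """Mean held-out score of hyperparameter w over k stratified folds."""
    scores = []
    for held_out in stratified_folds(y, k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        threshold = fit([x[i] for i in train], [y[i] for i in train], w)
        scores.append(balanced_accuracy([x[i] for i in held_out],
                                        [y[i] for i in held_out], threshold))
    return sum(scores) / len(scores)


def nested_cv(x, y, grid, outer_k=10, inner_k=5, seed=0):
    """Outer loop estimates performance; inner loop tunes w on training folds only."""
    outer_scores = []
    for held_out in stratified_folds(y, outer_k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        x_train, y_train = [x[i] for i in train], [y[i] for i in train]
        best_w = max(grid, key=lambda w: cross_validate(x_train, y_train, w, inner_k, seed))
        threshold = fit(x_train, y_train, best_w)
        outer_scores.append(balanced_accuracy([x[i] for i in held_out],
                                              [y[i] for i in held_out], threshold))
    return outer_scores


# Hypothetical toy data: one feature, 20 positives and 80 negatives.
rng = random.Random(1)
y = [1] * 20 + [0] * 80
x = [rng.gauss(1.0, 1.0) if label == 1 else rng.gauss(0.0, 1.0) for label in y]

outer_scores = nested_cv(x, y, grid=[0.25, 0.5, 0.75])
print(f"mean outer-fold score: {sum(outer_scores) / len(outer_scores):.2f}")
```

The key property is that the held-out outer fold never influences the choice of `w`, so the outer scores are not biased by the tuning step.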

Feature Importance Ranking
To calculate the feature importance ranking, we trained the logistic regression model using a grid search with five stratified folds for hyperparameter tuning and selected the best model (with optimized hyperparameters) to train on the entire training set of each cohort and extracted the coefficients for each feature used in the optimized model. We reported the absolute value of each coefficient as the relative importance of the features (Extended Data Fig. 8).
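As a brief illustration of this ranking step, the sketch below orders features by the absolute value of their coefficients. The feature names and coefficient values are made up for the example and are not the fitted values from this study.

```python
# Hypothetical coefficients of a fitted logistic regression (made-up values).
coefficients = {"feature_a": -1.2, "feature_b": 0.4, "feature_c": 0.9}

# Relative importance is the absolute value of each coefficient.
importance = {name: abs(value) for name, value in coefficients.items()}

# Rank features from most to least important.
ranking = sorted(importance, key=importance.get, reverse=True)
print(ranking)
```

Absolute coefficients are comparable across features only when the features share a common scale, so a ranking of this kind would typically assume standardized inputs.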
Fig 2B, Extended Data Fig 3B and Extended Data Fig 4B). Across all DED values, a DED of one day prior to the diagnostic test date (DED = -1) generated the largest number of significant features for all cohorts. Also, across all cohorts, there were more significant RHR features than steps features (Fig 2B, Extended Data Fig 3B 19 positive and negative, their distributions do overlap, even though the tailedness varies in direction and extent (Fig 2D, Extended Data Fig 3D and 4D, and Supplementary Fig 4), which points to broader challenges surrounding predictive modeling efforts using standard consumer wearable device data for COVID-19 infection detection.
resampling methods - we have observed that distributions of features overlap between the COVID-19 positive and negative participants, as demonstrated in the individual feature comparison (Fig 2D, Extended Data Fig 3D and 4D), as well as in the low dimensional representation (using principal component analysis and t-stochastic neighbor embedding) of all the features in the training set of the AF cohort (Supplementary Fig 5).
This work was supported in part by Duke OIT, Duke Bass Connections Fellowship, Duke Margolis Center for Health Policy, Duke MEDx, Microsoft AI in Health, Duke CTSI (UL1TR002553), and NC Biotech (2020-FLG-3884). This article was prepared while Geoffrey S Ginsburg was employed at Duke University. Dana K Pasquale was funded by NIH/NICHD (R25HD079352) and CDC (BAA 75D301-20-R-68024). The opinions expressed in this article are the author's own and do not reflect the view of the National Institutes of Health, the Department of Health and Human Services, or the United States government. We would also like to thank Leatrice Martin, Veronica Palacios-Grandez, John Owens, Julie Ekstrand, Cecilia Plez, Hugh Thomas, Philabian Lindo, Richard Outten, Shellene Walker, Tracey Futhey, Jimmy Dorff, Rob Carter, Sean Dilda, Vanessa Simmons, Andy Ingham, Charley Kneifel, Andrew Olson, Whitney Welsh, Jonathan McCall, Margaret Pendzich, Marialuisa Solis-Guzman, Erich Huang, and Victoria Christian for their contributions to the infrastructure and their dedication to this project.

Figures Figure 1

Figure 2 Overview
The data from Mishra et al. included heart rate, step count, and sleep data for 27 COVID-19 positive cases. It also included metadata of symptom onset and test dates. The data from Alavi et al. included heart rate and step count data for 83 COVID-19 positive cases and 1,019 COVID-19 negative cases as well as metadata including symptom onset and test dates.

Table 1 .
of the Cohorts. Total refers to training + test data

Table 2 :
Extracted from the Digital Biomarkers (DBs) for the Development of ITA Algorithm