Dear Watch, Should I get a COVID Test? Designing deployable machine learning for wearables

Commercial wearable devices are surfacing as an appealing mechanism to detect COVID-19 and potentially other public health threats, due to their widespread use. To assess the validity of wearable devices as population health screening tools, it is essential to evaluate predictive methodologies based on wearable devices by mimicking their real-world deployment. Several points must be addressed to transition from statistically signiﬁcant diﬀerences between infected and uninfected cohorts to COVID-19 inferences on individuals. We demonstrate the strengths and shortcomings of existing approaches on a cohort of 32 , 198 individuals who experience inﬂuenza like illness (ILI), 204 of which report testing positive for COVID-19. We show that, despite commonly made design mistakes resulting in overestimation of performance, when properly designed wearables can be eﬀectively used as a part of the detection pipeline. For example, knowing the week of year, combined with naive randomised test set generation leads to substantial overestimation of COVID-19 classiﬁcation performance at 0.73 AUROC. However, an average AUROC of only 0 . 55 ± 0 . 02 would be attainable in a simulation of real-world deployment, due to the shifting prevalence of COVID-19 and non-COVID-19 ILI to trigger further testing. In this work we show how to train a machine learning model to diﬀerentiate ILI days from healthy days, followed by a survey to diﬀerentiate COVID-19 from inﬂuenza and unspeciﬁed ILI based on symptoms. In a forthcoming week, models can expect a sensitivity of 0.50 (0-0.74, 95% CI), while utilising the wearable device to reduce the burden of surveys by 35%. The corresponding false positive rate is 0.22 (0.02-0.47, 95% CI). In the future, serious consideration must be given to the design, evaluation, and reporting of wearable device interventions if they are to be relied upon as part of frequent COVID-19 or other public health threat testing infrastructures.


INTRODUCTION
The use of wearable devices has skyrocketed in recent years -about 3 in 10 people in the USA wear a fitness or health tracking sensor [62].The abundance in physiological data has lead to the development of algorithms to deliver healthcare at the edge, with regulatory approval in applications such as atrial fibrillation [45].In the wake of the global COVID-19 pandemic there has been unprecedented need for early detection of COVID-19 in a way which differentiates it from seasonal influenza or otherwise unspecified influenza-like-illnesses (ILI), due to the substantial overlap in signs and symptoms.Jurisdictions have circumvented transmission through extensive testing and proper infection control through physical distancing, isolation of positive cases, and contact tracing [26; 48].In contrast, a reduced capacity for testing has been identified as a rate-limiting step in the control of the pandemic [25].Wearable devices can contribute to testing infrastructure, and in turn, reduce transmission.Studies conducted prior to and throughout the pandemic demonstrate that the physical manifestations of ILI, and in particular, COVID-19 can be captured using wearable devices [35; 38; 43; 51].To understand the usefulness of wearable devices for the public health infrastructure, proposed models must be evaluated in a scenario that resembles conditions of practical deployment.It is essential to exhaustively evaluate the potential impacts of models based on wearable data to avoid unattainable expectations or under-utilisation.
Current testing infrastructure is typically limited to COVID-19 RT-PCR tests 1 , which are invasive and only show a single time-point that determines whether the viral load of SARS-COV2 is above a detectable threshold.The properties of PCR are such that they are highly specific (i.e.COVID-19 negative patients are hardly detected as COVID-19 positive 98%) and reasonably sensitive (71-98% of tests on COVID-19 positive individuals are positive [57; 59; 61]).The highest viral load is detected on the days surrounding the onset of symptoms [60].However, many individuals are only encouraged to get a COVID-19 test if they have had known contact with a COVID-19 case, or have already shown symptoms.It is estimated that 44% of viral transmission occurs prior to the onset of symptoms in cases where symptoms eventually develop [17].This is problematic as false negative rates from the test are elevated prior to symptom onset [27].Emerging evidence for COVID-19 estimates that 17.9% of cases are asymptomatic [39].As with influenza, asymptomatic COVID-19 cases can still transmit the virus [30], though their transmission may be at lower odds than symptomatic individuals [4].
Accessible and frequent COVID-19 tests, even if at lower sensitivity, have the potential to undercut viral transmission if SARS-COV2 harbouring individuals are notified during or prior to peak infectiousness.Simulations have shown that by using frequent, low-sensitivity COVID-19 tests it is possible to rapidly detect and contain outbreaks of the virus on university campuses [44].Simulated daily tests with sensitivities as low as 70% (and 98% specificity) result in 162 cases out of a cohort of 4990 over an 80 day semester [44]; this is in stark contrast with symptom based screening estimates 4970 of 4990 would be infected, although this assumes only 30% of infected individuals show symptoms and the reproductive number, R t , is 2.5.It is presumed that many low viral-RNA-count, positive-PCR tests are beyond peak infectiousness (possibly even no longer infectious) due to the long RNA-positive tail [36; 37].This emphasises the need for early testing.Using frequent tests simulates better control of the pandemic than improving the sensitivity of tests [28].While it is possible to acquire and administer rapid tests, we should utilise the infrastructure already available, including the sensors on our wrists which can passively monitor for ILI events.
Unfortunately, there is no sensor for measuring SARS-CoV-2 viral load in consumer wearable devices that are widely used.Instead we must rely upon the physical manifestations of the infection to trigger a classification.Fever is a hallmark symptom of infection that is characterized by marked elevation in basal body temperature.While some consumer wearables are capable of measuring temperature [55], many are primarily intended as fitness related devices equipped with optical heart rate sensors and accelerometers.Despite the lack of direct temperature monitoring, body temperature is mediated by and affects cardiac rhythm and function [24].In influenza, evidence indicates that there is an increase in resting heart rate (RHR) correlated with onset of fever [14].Abnormally high heart rate measurements from wrist based devices have been associated with inflammatory illnesses [33].Increases in RHR, impaired sleep due to illness, and reduced steps due to inactivity measured using Fitbit user data have been correlated with influenza rates at a population-level [51].Other available wearable measurements, such as heart rate variability (HRV) and respiratory rate, also correlate to infection [35; 43].
Using wearable devices as diagnostic tools is enticing, but to truly understand their potential for this purpose, models that use wearable data must be assessed in frameworks similar to the deployment conditions and subject to the same scrutiny that any non-wearable-based diagnostic test would face.There are numerous examples in the literature outlining the prerequisites needed to build diagnostic models for wearable data [18] and yet the recently published papers [35; 38; 43] lack in one or another aspect as elaborated below.
In this work we address the design flaws in previous works' evaluation methods which have so far lead to the overestimation of COVID-19 detection; this is done by intentionally designing the study so that we can estimate post-deployment performance.First, model performance is compared in the format of current literature.We highlight the strengths and weaknesses of these reporting paradigms.After, we demonstrate the actual performance one can expect in a forthcoming week for COVID-19 detection using wearable devices only, survey data only, and a combination of wearable data and survey data.
While we are able to obtain similar performance to those published under their evaluation procedures, in a realistic framework the wearable contribution is likely to be much weaker.Despite lower performance, implementation of the wearable device prediction models results in a reduction in the number of symptom-based surveys required to detect COVID-19 cases, and elevates the likelihood that a tested individual has COVID-19 compared to a randomly tested sample of the population.Maximizing the pre-test probability of COVID19 ensures that we do not over-test, an important consideration for public health.

RESULTS
We collected a dataset, referred to as FLUCOVID, which contains survey and wearable device data collected between November 1, 2019, and June 1, 2020.There are 32, 198 participants with both surveys and wearable data in this cohort, 2, 554 had medically diagnosed influenza, 204 had COVID-19, and the remainder had unspecified ILI.A complete description of the data collection process can be found in [53].Our survey data proportionally matched peaks and trends in CDC data for influenza [8] and COVID-19 incidences [16] (See Figure 1).Using self reported symptom onset and recovery dates, combined with reports of medically diagnosed COVID-19 we define the task as a daily classification of whether a date is within the self reported ILI symptom dates (label of 1) or outside of that region during presumably healthy days (label of 0).Under this paradigm, day 0 is the first day in which a participant experiences symptoms, despite acquiring the virus and still being capable of transmitting the virus prior to symptom onset (negatively labelled days in the range of -14 to -1).
We investigated several prepossessing steps for the input data and the outcome.We investigated normalizing individuals' covariates by their own history as was previously done in the literature [51].However, we found that this step did not outperform simple normalisation of the individual's covariates by the training set population means.We thus normalise our data by the training set population means.This step allows to make more predictions immediately after a user enrols rather than waiting to accumulate sufficient data.In the spirit of using the most data available, we did not eliminate users with vast amounts of missing data due to lack of wear-time or missing channels.This deviates from the convention accepted in previous works [35; 38; 43; 50] but allows us to make predictions possible even in tough edge cases.To address the increased amount of missing data due to the broader inclusion criteria, we mean imputed missing data at the beginning of the study to left align the participants and forward filled data after an individuals' steps, heart rate, or sleep time has already been observed.Note the XGBoost [12] and GRU-D [11] models are capable of handling missing data.See further details on input data pre-processing in the Online Methods and the Appendix.
For the survey response covariates, survey responses are given once per week recounting symptoms experienced on each day.These responses are disambiguated into binary daily symptom flags for each of the reported symptoms or behaviours.Unfortunately, some participants respond to surveys weeks after their ILI event to provide onset and recovery dates.These individuals are retained during all training and testing procedures.Wherever no information is reported it is assumed that participants are not experiencing any symptoms and are following all COVID-19 precautions.The day-level ILI labels are constructed by labelling all days between self-reported symptom onset and self-reported recovery as positive.This differs from previous work where a fixed number of days around symptom onset are labelled as positive days [35; 43; 50].We did this in order to be able to make predictions continuously on any date prior to, or after symptom onset, to better simulate deployment.It is important to remember that our detector is not a wearable COVID-19 specific detector, it is a detector of ILI events that are a superset of any medically undiagnosed ILI, self-reported medically diagnosed flu, or self-reported medically diagnosed COVID-19.Finally, approximately a third of users had multiple reported events.When these events are separated by fewer than 7 days, they are merged into a single event, with the COVID-19 label superseding the unspecified ILI label.
All presented wearable models are trained using 48 variables representing sleep, heart rate, and steps measurements taken from wearable devices aggregated on a daily basis.
Throughout this work we use the following definitions to distinguish training and evaluation schemes which we compare in this paper: 1. Random splits of data refers to a random shuffling of historic data to create train, validation, and test event splits.We also refer to it as retrospective evaluation, since the training events and testing events are all occurring in the past.In such a scenario, the prevalence of COVID-19 and ILI are likely to be the same across train and test, a conventional but not a realistic deployment scenario.2. Prospective evaluation is the setting where test data is chronologically after any of the training and validation datasets and data splits.We know that over a single season of data, influenza cases surge and fall, and with emergent COVID-19 infections cases are anything but stationary.The prevalence in the week after a COVID-19 wearable model deployment is unlikely to match the prevalence experienced in the training set when real life is concerned.Inevitably, a deployment-ready model would have to face this modelling challenge so it makes sense to test in this manner to avoid providing the public with poorly calibrated COVID-19 or ILI predictions.Aggregate metrics from randomly sampled cohorts artificially inflate performance In the instance of seasonal influenza, models are able to use trends over time estimate current trends [51].With emergent diseases like COVID-19, there is a lack of past season data to rely on to create such estimates.We trained XGBoost [12] model on 5 random splits of data (5 repetitions of 35% training, 7.5% validation, 7.5% held out retrospective test set, and 50% held out prospective test set by participant) to detect ILI events.When analyzing the performance of randomised test splits, we found that most of the AUROC metric was driven by between week comparisons, giving an inflated estimate of performance.For example, knowledge that first week of February had a high prevalence of influenza would lead the model to make more calls in the first week of February in the test set.Dissecting the test set predictions to their constituent weeks lowered the AUROC for nearly all weeks (See Figure 2).Removing the week-of-year feature helped mitigate the overestimate for the XGBoost model, but there was still performance drop from the overall AUROC, to the weekly AUROC on the randomly drawn test set (Figure 2).These results indicate that the model had learned to predict the prevalence of ILI in each week, rather than connecting the underlying physiological data to the outcome.Note that in each instance, the machine learning model outperforms a non-machine learning baseline that was designed for population level influenza surveillance [51].

Random retrospective test cohorts overestimate future performance
In emergent diseases, one would imagine training a model on all available data, selecting parameters, then deploying the model.The new cases that this model predicts on will be temporally disjoint from the training data.Then in the following week, the process would be repeated by training a freshly initialised model with an additional week of data.Naturally this creates a situation where there is a continuously growing training set, a sliding validation set window, and a sliding test set window (See Figure S1).This deployment-style evaluation is in contrast to the conventional ML random split dataset scenario used in the previous section and related work [43], since none of the future observations are seen during training.Additionally, our prospective evaluation only contains participants who are distinct from the participants in the training data for that week's model for FLUCOVID data.Each consecutive week features a new, 50% of that week's participants (sampled to obtain uncertainty estimates).
With each new model trained on the additional week of data, we also reserve a 7.5% random split of the data that is concurrent with the training set, since by the time of deployment, the participants in this set have already been encountered.This allows us to evaluate our models on a standard random split setting as one would in a typical ML workflow that is negligent to temporal patterns in data.We find that the perceived performance from the random split overestimates the prospective evaluation in nearly every week for both models (See Figure 3).Random test set evaluation overestimates our expected performance for ILI detection using wearable devices.These findings exemplify that the danger of overoptimistic result estimation is real when test setting is not representative of deployment scenario.
Given the incidences of ILI and COVID-19 are non-stationary throughout the year, we attribute the drop in performance to an over-reliance on memorising the COVID-19 prevalence in a given week as before.This attribution is supported by the observation that by removing the week-of-year feature there is a corresponding drop in performance in retrospective test sets for XGBoost, which is not observed in prospective test sets (See Figure 3).To elicit the cause of the dataset shift, we apply PCA and t-SNE visualisations to the input data.When the week-of-year feature is absent, data do not show coherent clustering when falsely coloured by time.This temporal feature is only covariate with a large contribution to the temporal variance (See Figure S2).We observe that the means of features change throughout the study for their respective classes (See Figure S3).However, covariate shift is not the main source of dataset drift in the wearable features.Instead we attribute it to the fact that the likelihood of acquiring COVID-19 shifts as prevalence of COVID-19 changes, i.e. there is a prior probability shift [40].We address this in model development by selecting thresholds on a prospective validation set, instead of using a randomly selected validation set.We found this to be essential to attain reasonable performance on future data during the ebbing prevalence of COVID-19 with a single season of data (See Figure S8).

COVID-19 detection using only wearable data
For any prospective week, a model is trained to differentiate symptomatic COVID-19 days from symptomatic ILI days, or healthy days prior to illness.When models are evaluated on prospective data, we find that the margins of separation for COVID-19 vs. non-COVID-19 ILI & healthy classes is small.These weekly results are reported in Figure 4a for XGBoost.By nature of the FLUCOVID study design, only individuals who eventually experience an ILI are included.For any given week using wearable devices to directly detect COVID-19 achieves a sensitivity of 0.52 (0-1, 95% CI), with a false positive rate of 0.49 (0.04-0.99, 95% CI).The test set sensitivity will not change as more negative controls are added to the model, though all other metrics (specificity, positive predictive value, negative predictive value, AUROC, and AUPR) are affected by the negative class proportion.The reported scores are conservative, as we have a disjoint set of participants and non-intersecting date ranges between the training, validation, and test sets.In reality only a small proportion of participants would be expected to enter or leave the survey each week, meaning the deployment performance would only be harmed by lack of generalisation through time, rather than lack of generalisation through time and across participants.

COVID-19 detection using only survey data
Ultimately, it could be easier to differentiate between COVID-19, ILI, and healthy days using daily surveys [50].However, long-term engagement in digital health studies is a burden to the model's beneficiaries [47].Survey models currently upper bound mobile COVID-19 diagnostics.Though, surveys will still fail to detect pre-symptomatic, asymptomatic, and possibly mild symptomatic COVID-19 detection.Recall that for COVID-19, 17% of cases are estimated to be asymptomatic [5; 39], which is comparable to the rate of asymptomatic influenza [32].Both asymptomatic COVID-19 [30] and asymptomatic influenza [21] shed viral RNA.For those who eventually develop symptoms, the relatively long incubation time of SARS-COV-2 is problematic as 44% of transmission occurs during pre-symptomatic phase [17], though active viral load peaks around symptom onset [9].These sources of transmission are unable to be prevented with symptom-based screening alone.
Our data is collected such that a participant would not be asked to complete a survey if they did not have symptoms.The task for this survey model is to detect which symptoms are a result of COVID-19, and which can be presumed to be non-COVID-19 ILI.Models trained on the daily symptom history and demographic covariates to distinguish COVID-19 positive individuals from other ILI.The top performing model, a GRU, achieves this task with an AUROC of 0.78 ± 0.19 and an AUPR of 0.28 ± 0.11 for a prospective week, on a disjoint set of participants.The sensitivity and specificity of this method is shown in Figure 4b.In comparison, the linear model on a curated set of features as reported in [34; 50] yielded AUROC of 0.62 and AUPR 0.03 on our entire FLUCOVID survey data.Since the nature of the task is to determine if symptoms are caused by COVID-19 or not, and since all ILI cases are included in the study by design, the reported metrics would remain valid even if additional healthy individuals were included.

Triggering survey model using wearable predictions for COVID-19 detection
For any sensitivity chosen on the wearable model, we always improve the likelihood that the individual has COVID-19 (See Figure S10).Hence for a permissive sensitivity, we can dismiss many healthy individuals, which minimises reliance on continued survey engagement.The performance of the XGBoost and GRU-D wearable models, paired with a GRU survey model are shown in Figures 5a & 5b.For an average week, the performance on a disjoint set of participants can be expected to be 0.50 (0-0.74,95% CI) sensitivity and 0.79 (0.53-0.98, 95% CI) specificity for the XGBoost architecture on the wearable model combined with the GRU survey model.
However, given a positive prediction at any date near or during the onset of infection, participants could have been directed to get further confirmatory testing.Any positive prediction leading up to symptom onset could elicit behavioural change such as staying home from work or school, or getting a PCR test.Under this assumption, false negative results that follow positive predictions do not have such a profound impact.We consider a cumulative score automatically classifies a positive score any of the past 7 days has been classified as positive.We hypothesise that this would increase the sensitivity (by decreasing the number of FN) while decreasing the specificity.This nuance in reporting leads to an expected weekly sensitivity of 0.65 (0.19-0.87, 95% CI) and specificity of 0.69 (0.41 -0.97, 95% CI) on prospective test sets on FLUCOVID data.Roughly 63.5% of COVID-19 cases are detected at symptom onset (47.7% for non-COVID-19 ILI), and 68.9% of cases are captured by day 3 of symptoms (55.8% for non-COVID-19 ILI), before many individuals would be compelled to take a PCR test and receive results (See Figure S11).
The wearable component for this prediction scheme prompts surveys 65% of the time (a 35% reduction in survey burden as compared to daily symptom reporting).The sensitivity of the wearable-only model (any ILI detection)  The pandemic has a disparate impact on minorities.One meta-analysis found Black and Asian groups were more likely to contract COVID-19 than their White counterparts [58].These results are in agreement with several other studies for Black [23; 42; 49], Hispanic, Asian, and North American Indigenous minorities [22].We look at the performance of the combined wearable and survey model subset by race and gender, where subgroups have more than 1000 members.These results are shown in Table I. Results show that the differences between male and female subpopulations are significant for both specificity and sensitivity, but there is a tradeoff: for the same decision threshold, one group experiences a higher sensitivity and lower specificity, whereas the other group experiences the opposite effect.This pattern is also observed in the black participants where the model sensitivity on black participants is 12% less than the overall sensitivity, but the specificity is 10% higher with the same threshold.In models where the week-of-year is omitted, the sensitivity difference is abridged.Our results also show that the model's sensitivity is significantly higher on the Asian subpopulation compared to the non-Asian population.The sheer number of non-COVID-19 days renders all of the examined subpopulation specificities significant when compared to the rest of the population.The Asian, Black, and Hispanic minority subpopulations have specificities exceeding the population mean, whereas the white subpopulation specificity is below that of other groups.Overall, the model performance does not exacerbate existing inequalities of the COVID-19 pandemic noted in previous studies [23; 42; 49; 58].However, in specific use contexts it may be essential to leverage post-processing techniques to achieve demographic parity or equalized odds.

DISCUSSION
Importance of including imperfect data All of the 32, 198 participants in the FLUCOVID study had at least one ILI event and consented their wearable survey data.Participants have varying degrees of device wear time.Some may only take it off to charge it (nearing 100% wear time when acquiring signals for an entire day) whereas others have sporadic usage.Previous studies require a minimum device wear time for their cohort selection [35; 43], and all participants who do not meet this requirement are dropped.We instead choose models that can implicitly handle missingness.When we gradually eliminate patient-days based on their percent of missingness, we find that separation between ILI and healthy classes diminishes (See Figure S12).Intuitively, this suggests that including missingness makes the task more Results are reported as score (95% confidence interval).Z-score calculated as difference in binomial proportions (Wald test) to other groups (e.g.Asian vs non-Asian).This includes week-of-year and null survey answers are assumed to be healthy predictions.
difficult.By the nature of the task, abstaining from making a prediction is synonymous with classifying a participant as healthy.In order to compare wearable device model performance to other diagnostic measures, such as PCR tests, missing data cases cannot be neglected in test data.We replace these missing predictions with a classification of healthy when evaluating models on test data.
It is important to note that all labels in the dataset have been created using self-reported symptoms, onset dates and recovery dates.When the time between the self-reported date of disease recovery and the survey response is more than one week, model area under the precision recall curve (AUPR) drops considerably corresponding to the reduction in positive cases, however the sensitivity and specificity are not overly perturbed by delays in reporting (See Figure S7).The deteriorating AUPR is driven by a decreasing positivity rate.
Reporting performance across the entire study Our reporting differs from existing literature in that we report across the entire timeline whereas other approaches [35; 38; 43; 50] report classification results for the windows around the onset.A machine learning model trained to detect ILI would be expected to have a detectable change in outputs corresponding to an increase in detection around the time of ILI symptom onset.It is tempting to validate the machine learning model by aligning participants in the test set by the date of onset to confirm this.Previous research has demonstrated that model performance will be artificially inflated unless an outcome-independent reference point is used [54].Prior wearable work calculates AUROC and the ROC is plotted after having discarded a week of data before the reported date of onset, and excludes dates more than 7 days after onset and earlier than 21 days prior to onset [43; 50].However, even a small false positive rate can completely neutralise the perceived utility of the model for an imbalanced task such as determining ILI days vs healthy days.For posterity, we also aggregate our results in this fashion in Figure 6b.A full comparison of our model performance to published models, under their own evaluation strategies can be found in Table S7, though we do not encourage communicating model performance according to these metrics as they do not reflect model utilisation.
A similar design was followed where nighttime respiratory rate, RHR, and heart rate variability (HRV) were used to train a model to classify the onset of COVID-19 [35].To accommodate the class imbalance between sick and healthy days in the training set, sick days were synthetically oversampled by adding noise.The model captured 80% of the COVID-19 positive individuals, but the same model detected 34% of individuals who exhibited symptoms but tested negative for COVID-19.This is understandable as the model was not specifically trained to discern between COVID-19 positive and negative cases.However, this evaluation should be observed cautiously.The number of sick days, and the number of healthy days in the objective are arbitrarily chosen; only days within -30 to -14 range were used in the validation sets as negative labels (For COVID-19, 99% of symptomatic cases will show symptoms within 14 days after exposure [2; 29]).For those that eventually got COVID-19 the FPR was 4.7% and those who eventually had ILI had a FPR of 5%.The FPR of 4.7% corresponds to a COVID-19 prediction once every 21 days, or 17 false positive calls per year for a healthy individual [35].
In another study, an online outlier detection method was shown to correlate with a post-hoc/offline outlier detection method on heart rate, heart rate to steps ratio, and sleep features to detect the onset of symptoms for 32 SARS-COV-2 positive individuals [38].Specifically, their method correctly identified an outlier region within 14 days of SARS-COV-2 symptom onset for the majority of users.When we applied this outlier detection method to a subset of our FLUCOVID cohort, we were also able to find a detection event within 14 days of symptom onset for most individuals.However, the sensitivity of this approach was strongly influenced by the amount of data that was recorded before and after the SARS-COV-2 symptom onset date.For example, when we (artificially) decreased the amount of data by 14 day increments in our study, the symmetric sensitivity (a detection event within 14 days before/after onset) rose from 63% to 100%, and the one-sided sensitivity (a detection event within 14 days before onset) rose from 40% to 61%.Reducing the amount of recorded data around onset increased the positive predictive rate from 27% to 30%, as well as lowered the p-value calculated relative to a null detection proportion from 3% to <1%.Constraining the detection window size over which any ML method is evaluated will artificially increase sensitivity, specificity, and statistical significance.Outlier detection results are unlikely to generalize for small time frames.For COVID-19 the median time from exposure to onset is 4-6 days [2; 29; 65], and 97.5% of symptomatic cases occur within 11.5 days after exposure [29].If there is exposure earlier than 14 days to prior to symptom onset, it is unlikely to be a true positive.Influenza has an even shorter median incubation time, with only 1.4 or 0.6 days until influenza A and influenza B symptom onset, respectively [31].
Grouping the predictions by proximity to ILI cases can hide real trends in the data.Calculating the AUROC around onset is highly influenced by the class balance [46] which in this setting is contrived.Prior commentary is available on the attribution of important baselines, such as prevalence [13].The likelihood of contracting COVID-19 is non-stationary because the transmissability of COVID-19 changes in response to public health measures, the number of infected cases and the circulating strains [1].Evaluation procedures must be agnostic to these effects.This type of dataset shift is known as prior probability shift [40], which warrants careful calibration of machine learning models or additional training techniques [52].In this study we have focused on evaluating our models and communicating model performance that is reflective of deployment under these circumstances.
Account for ILI cases when building a COVID-19 detector It is essential that studies account for ILI cases when building COVID-19 wearable detectors [53].When applying machine learning models to aid in the detection of emergent diseases we must use training data which are representative of data in a deployment scenario.An individual with influenza may share several characteristics of COVID-19, despite never being exposed to the SARS-COV-2 virus.There is an overlap between COVID-19 and influenza in both symptoms and autonomic responses which cannot be ignored when reporting the specificity of COVID-19 diagnostic tests.Fortunately, these overlapping symptoms present with different frequencies.For example, COVID-19 patients are significantly more likely to have ageusia (loss of taste), fever, anosmia (loss of smell), myalgia/arthralgia (aching body), diarrhea, and nausea [7; 34; 63].The symptoms which are more prevalent in COVID-19 negative ILI patients are cough and sore throat, though only sore-throat is statistically more likely to be associated with influenza [63].The associations of these symptoms with COVID-19 have been corroborated in other studies [10; 43].Symptom based COVID-19 detection is somewhat able to distinguish COVID-19 positive patients from other respiratory illnesses using a simple logistic regression model [6].However, when determining the chance of a positive PCR COVID-19 test given that a patient presents symptoms, the major drivers of the prediction were travel history and case proximity in an early stage of an outbreak (not fever or cough) [10].These results will vary based on the richness of the feature set, the accuracy of symptom recording, the size of the cohort, and the progression of the pandemic in the region.Using wearable device data in place of symptom based screening exchanges improved frequency of monitoring with further abstraction from the non specific symptom features.
COVID-19, mild symptomatic COVID-19, and ILI differentiation With COVID-19, mild symptomatic COVID-19, and ILI all concurrently present, it is essential that models can differentiate between them.Differentiating between COVID-19 days and healthy days tends to be the easiest task amongst the subclasses of COVID-19, influenza, and unspecified ILI, alongside the healthy class (Figure S9).It is considerably more difficult to differentiate between unspecified ILI and healthy cases.
Unspecified ILI cases in the FLUCOVID surveys subside at a proportionately slower rate than the CDC's reported values [8] (See Figure 1).We suspect that these ILI cases may be undiagnosed, or misdiagnosed "mild symptomatic COVID-19".Amongst the 8743 participants who reported ILI symptoms and did not test positive for SARS-COV-2 between April 1, 2020-June 1, 2020, 86% received a negative COVID-19 test within -2 and 21 days after symptom onset (14 days post symptoms +7 days reporting delay).Only 6% of patients who didn't have COVID-19 tested positive for influenza in the FLUCOVID dataset.Several of these included participants who had not been tested for COVID-19.Less than 1% of participants in this time period experienced symptoms, but did not get tested for influenza or COVID-19, have had not previously tested positive for COVID-19, yet reported having close contact with a COVID-19 case within the 14 days prior to symptom onset.This small proportion of untested, COVID-19 contact, unspecified ILI cases equates to roughly 10% of the confirmed COVID-19 cases during the same period.If any of these untested cases are actually COVID-19 it could decrease the specificity in our reported results, as the TPs would be disguised as FPs.This  both retrospectively and prospectively and grouped relative to the symptom start date (day 0).Note that elevated percent predicted after day 0 indicates better performance under our labelling scheme.
coincides evidence of individuals in the USA who had been infected with SARS-COV-2 based off of the presence of antibodies, yet had not been reported [3] (though these results must be interpreted cautiously [56]).These individuals would still be capable of transmitting virus, however if we followed the convention of previous works, unspecified ILI patients would have been excluded from the dataset.The signal that would be learned under that convention would be the joint probability of getting a PCR test and that test being positive for SARS-COV-2 [15].We want to discern the latter unconditioned on getting a PCR test.
Limitations There are several limitations of this study which would affect deployment-level performance.Most importantly, our training dataset does not contain participants who are healthy throughout the entire duration of the study.Upon deployment, the PPV, NPV, specificity, AUROC, and AUPR will all be affected since these metrics involve the calls made on COVID-19 negative participants.In addition, the participants consenting in these studies may not represent the population which this would be deployed for.Both cohorts are comprised of primarily female, and primarily white demographics.This could be attributed to the selection bias in those who wish to participate, or the selection bias of individuals who own a wearable device.We find that symptom-based screening substantially outperforms wearable device screening in differentiating between COVID-19 cases and non-COVID-19 ILI cases.The limitation of symptom screening is that it negates the ability of the wearable devices to identify positive COVID-19 cases prior to symptom onset.Both models could be improved from enriched features.For example changes in heart rate variability metrics have been shown to be associated with COVID-19 [19; 35; 43].Our features are summarised on the day level.More granular time intervals could improve identification of ILI signatures.Improvements could be made to the modelling such as using side information from the survey data to improve the representational capacity of the wearable-only model, or a one-time demographic survey could be required so that wearable devices can leverage participant attributes to improve performance.A further reduction on the reliance of surveys could be achieved through objectives that learn to defer to surveys when the wearable-model is uncertain [41].Specific modelling techniques could be applied to address dataset shift due to frequently changing lockdown status or changing prevalence [52].The prospective results on this study report performance on a disjoint test set, which is likely to overestimate the performance gap caused by generalisation across time by encompassing the generalisation across participants.Users could potentially benefit from having personalised models to capture outlying personal patterns [64], though global patterns would need to be leveraged for COVID-19 detection since it does not occur more than once per individual in our datasets.Finally, changes in how these predictions are used could impact their performance.Perhaps tests or treatments are scarce and only allocated to the top-k individuals.In another use case, not providing a prediction when there is insufficient data would improve the performance of the model.This could be useful in a context where a negative prediction is required to attend class in person (whereas positive and null predictions would dictate remote attendance).

ONLINE METHODS
We collected Fitbit activity data, demographic baselines, and weekly symptom surveys from 33777 participants from November of 2019 through August, 2020.All users reside in the United States.We refer to this dataset as FLUCOVID.The characteristics of this dataset are described in greater detail in the following sections.For details on the collection of data, we refer the reader to our previous work [53].

Demographic Data
Participants consented data for the study which includes age, education, ethnicity, children and relationship status, height, weight, and location.In addition to these demographic variables, comorbidities are available including asthma and COPD (amongst others).Demographic data and comorbidities are not used in training wearable models, because they are not available in the wearable device data.This allows us to cast the widest net in COVID-19 detection (sensitivity) at the expense of specificity.The characteristics of the data can be found in Table II.

Wearable Device Data
Our wearable features contain 48 covariates derived from heart rate monitoring, step tracking, and sleep tracking features for each day (Table S1).We compare several data preparations, including regularly vs irregularly indexed data [20], population normalisation vs. individual normalisation [51], and forward filling imputation vs. no imputation.A complete description of our data preparation is available in Supplementary A. When the effect is insignificant, or the effect size is small between preparation tasks, we err on the side of including as many patient-days as possible.Unless otherwise specified, we use regular indexing, population-level normalisation, and forward-filling imputation while reporting our results.

FLUCOVID Survey Data
Each participant in the FLUCOVID cohort experienced at least one influenza-like illness (ILI) event between November 1, 2019, and June 1, 2020.Of the 32198 participants with both surveys and wearable data in this cohort, 2554 had medically diagnosed influenza, 204 had COVID-19, and the remainder had unspecified ILI.27% of users reported more than one ILI event in the study.All of our labels in the survey are derived from self-reported symptoms and test results.Each week, participants are asked to recall their symptoms in the past 7 days.They will note which days they experienced flu related symptoms, COVID-19 related symptoms, and complications with chronic diseases.ILI onset and recovery dates are also reported by the participant.In addition to illness, there are questions pertaining to self-isolation, air travel, and household members with ILI or COVID.

Model Training
We opt for machine learning models that can implicitly handle missingness, namely, we use GRU-D [11] or XGBoost [12] models.Models are exclusively trained on the FLUCOVID dataset such that 35% of participants fall into the training set, 7.5% in the validation set, 7.5% in the held out random retrospective test set, and 50% in the held-out prospective test set (to preserve prevalence).Features are collected on a daily basis, and predictions are made at the same frequency.A complete description of the features, and their preprocessing can be found in Supplementary A. For the XGBoost model, we use the past 7 days of data (inclusive of the prediction day), flattened into a matrix x ∈ R Np×tN f , where N p is the number of participants, t is the number of days, and N f is the number of features.This flattening procedure is rolled across every time-point for each participant.Hyper-parameter selection was performed to minimise the validation set loss whenever a new model is trained.The GRU-D model ingests data in the format of x ∈ R Np×t×N f as it is recurrent.The GRU-D model used the validation set for early stopping.Because of this, Table II.FLUCOVID: Demographics of the Large-scale Flu Surveillance study for survey and wearable cohorts different weights were learned for the model which is stopped on the concurrent portion of the validation set for randomised testing as opposed to the prospective portion of the validation set for prospective testing.The comparisons between randomised and prospective evaluations for the XGBoost model use the same weights in our results whereas the same comparisons for GRU-D do not.The participant splits between XGBoost and GRU-D models are shared, so the performance can be directly compared between the two models.
The survey models used a subset of features that are common for COVID-19 screening questionnaires as seen in Table S2.Notably, several of these features are significantly different between those who tested negative for COVID-19, and those who tested positive.Even still the relative abundances of behaviour and travel are similar.Statistical significance does not promise separability during inference.These features, combined with demographic variables are ingested into a GRU model.Due to the nature of the surveys, days where no answer is provided are assumed to be symptom free.Models are trained only where symptoms are reported, to differentiate between COVID-19 and non-COVID-19 ILI using all symptoms and behaviour patterns observed to date.
Choosing thresholds on all historical validation data does not generalise to the risks that are predicted in the following prospective week.Due to the class imbalance, outcome probabilities (logits) are poorly calibrated.A prospective validation set is necessary to select thresholds which will be calibrated in the next week.Through this process we found the choice of threshold for a given sensitivity on that validation set would better estimate the sensitivity we would observe on the subsequent test set (See Figure S8b).In our case this can be attributed to inconsistency of the target distributions (prior probability shift) [40] seen in Figure 1.For randomised test sets, thresholds were drawn on a validation set that was concurrent with the training and test sets.deviation of the individuals data is buffered by a 7 day gap.This means that on the first day of ILI symptoms, the z score is unencumbered by the most recent 7 days of data to avoid the bias from the viral incubation period.however, illnesses lasting over a week will start to normalise the elevated resting heart rate values.A larger gap, or a longer data requirement period could be administered at the expense of requiring a longer participation period before a prediction can be made.This calculation is shown in equation C1.
Where x i,train is a vector of all observations of feature i in the training set.In order for someone to have a feature observed on a certain day for the z-score normalisation, the raw feature must be observed and they must have at least 14 of the last 30 days of data as a source of standard deviation [51].In contrast, when normalising data by the training population means, participants only need the single raw value in order for that day to be observed.We compare these two methods, referred to as z-score and standard data preparation in Figure S5.We do not see a large difference in performance, meaning the extra observations available in the standard data preparation regime offset any perceived gain of individual z-score normalisation.This is in spite of the inter-individual variance exceeding the intra-individual variance (See Figure S6b) as previously observed [35].
Next, data is imputed using forward-filling.At the earliest time-step for each participant, the training set population mean is applied (for z-score this corresponds to filling with 0), then the most recent measurement is carried forward in all subsequent observations.We retain the mask (whether or not the data was observed at that time-step), the time that has elapsed since the feature was last measured, and the forward-filled data [11].The examples of regularly and irregularly sampled data are shown in Table S5 and Table S6, respectively.All pre-processing conditions (regularly vs. irregularly sampled data, imputed vs. non imputed, z-score vs raw data) are supported by the provided code.

Event reporting delay does not explain poor performance
The FLUCOVID dataset requires exclusively on self-reported events, including onset and recovery dates in order to form labels.Some efforts are taken to smooth events, like merging two ILI events for a participant that bookcase a healthy period of less than one week.We find that despite long delays in reporting illness, the sensitivity and specificity of the models on these labels are not substantially harmed (See Figure S7).Fortunately the majority of labels are reported within a 1-week period, and reporting lapses exponentially decay.However, all of our test sets include these participants in performance reporting nonetheless.Using outcomes attained directly from health systems may eliminate this perceived lapse in test performance and it could strengthen signals during training.We show performances of models as found in their respective citations in Table S7.Wherever possible we match their evaluation scheme for retrospective test sets, prospective test sets, and XGBoost and GRU-D models.Note, that we do not specifically optimise for these objectives, but instead simply select the appropriate days and calculate the metrics.

Figure 1 .
Figure 1.Influenza and COVID-19 incidence: Publicly available data for outpatient influenza from the CDC ILINet and COVIDcast case counts are compared to our incidence of survey ILI and COVID-19 cases.The COVID-19 data is scaled as denoted in the legend.After COVID-19 becomes more prevalent than ILI, as denoted by the vertical line, the ILI survey cases show an unusually high proportion of unspecified ILI, compared to the CDC outpatient counts.Some unspecified ILI after March 16 may be attributed to undiagnosed COVID-19.

Figure 2 .
Figure 2. XGBoost trained on randomly split ILI data: Conditioning the performance by time (solid lines) reduces performance compared to when performance is assessed irrespective of time (dashed lines) for randomly drawn training, validation, and testing sets.All machine learning models outperform the non-machine learning elevated RHR baseline.

Figure 3 .
Figure 3. Random vs. prospective performance : Random retrospective test set AUROC compared to first week prospective test set AUROC for detection of ILI symptom days.Each week uses a newly initialised model trained on a random split of participants (50% prospective test set, 35% retrospective training set, 7.5% retrospective, last-week validation set, 7.5% retrospective test set split by participant).

Figure 4 .
Figure 4. Wearable vs. survey model performance:Performance of two data modalities, wearable data or survey data, for directly predicting COVID-19 each week.Results are shown as mean ± standard deviation (a) XGBoost model tested on 50% of the FLUCOVID dataset.(b) GRU-D model tested on 50% of the FLUCOVID dataset.

Figure 5 .
Figure 5. Combined wearable and survey model performance:Performance of the Covid-19 Wearable+Survey detection models in the FLUCOVID dataset (top) and COVID2020 dataset (bottom) for both XGBoost (left) and GRU-D (right) wearable models.
(a) Models evaluated on a retrospective test set.(b)Models evaluated on a prospective test set.

Figure 6 .
Figure 6.Positive calls around symptom onset:XGBoost model performance for detecting any ILI or COVID-19 both retrospectively and prospectively and grouped relative to the symptom start date (day 0).Note that elevated percent predicted after day 0 indicates better performance under our labelling scheme.

Figure S1 .
Figure S1.Retrospective vs. Prospective testing setup Figure S3.Heart rate feature throughout the study (mean ± standard deviation).
Figure S4.Regularly and irregularly sampled data trained on the same participant splits did not substantially affect the performance.

Figure S5 .
Figure S5.Comparison of z-score individually normalised features to population normalised features for XGBoost.
Figure S7.Performance of the ILI detection Wearable model, stratified by how far in the past an individual was remembering the event when responding to the survey

Figure S9 .
Figure S9.The wearable model outcome logits around symptom onset date for COVID-19, confirmed influenza, and unspecified ILI.The outcomes are gathered from the XGBoost model on retrospective, randomly selected test set.

Figure S10 .
Figure S10.The gain of the model.Green represents the best-case scenario, blue is the model agnostic performance, and red is the when the data are normalised by the model.

( a )
XGBoost model tested on 50% of the FLUCOVID dataset.(b) GRU-D model tested on 50% of the FLUCOVID dataset.

Figure S11 .
Figure S11.Performance of the Covid-19 Wearable+Survey detection models in the FLUCOVID dataset for both XGBoost (left) and GRU-D (right) wearable models aggregated by their time since onset.Cumulative predictions is for a subset of participants because of random sampling of participants each week.

( a )
XGBoost model AUROC for a minimal amount of unobserved measurements.(b) XGBoost model AUPR for a minimal amount of unobserved measurements.

Figure S12 .
Figure S12.Varying the maximum amount of missing data allowed during testing affects the performance.

Table I .
Combined wearable and survey model performance for COVID-19 detection by race or by gender for cohorts with more than 1000 participants.