Leveraging genomic sequencing data to evaluate disease surveillance strategies

Summary In the face of scarce public health resources, it is critical to understand which disease surveillance strategies are effective, yet such validation has historically been difficult. From May 1 to December 31, 2021, a cohort study was carried out in Santa Clara County, California, in which 10,131 high-quality genomic sequences from COVID-19 polymerase chain reaction tests were merged with disease surveillance data. We measured the informational value, the fraction of sequenced links surfaced that are biologically plausible according to genomic sequence data, of different disease surveillance strategies. Contact tracing appeared more effective than spatiotemporal methods at uncovering nonresidential spread settings, school reporting appeared more fruitful than workplace reporting, and passively retrieved links through survey information presented some promise. Given the rapidly dwindling cost of sequencing, the informational value metric may enable near real-time, readily available evaluation of strategies by public health authorities to fight viral diseases beyond COVID-19.


INTRODUCTION
One of the central questions in public health is the effectiveness of distinct disease surveillance strategies. 1 In the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic, officials have proposed, deployed, and terminated a wide range of techniques.State and local public health authorities, for instance, have prioritized testing in congregate settings (e.g., long-term care facilities [LTCFs], shelters, and jails), employed contact tracing in conjunction with positive test notification, and directed mandated reporting and outbreak investigation in schools, workplaces, and congregate settings.Emerging tools, such as spatiotemporal scan statistics for identifying clusters of interest, have also attracted interest. 2These approaches have been used for many communicable diseases, from foodborne illness to sexually transmitted infections, 3,4 but there is limited evidence as to the relative effectiveness of distinct disease surveillance strategies, particularly as a disease spreads.
In Santa Clara County (SCC), California, home to over 1.8 million residents, the Public Health Department developed one of the largest COVID-19 case investigation and contact tracing initiatives in the country, along with myriad approaches to monitor and address potential disease spread.We briefly review principal strategies as deployed or considered in SCC during the study period from May 1 to December 31, 2021.First, individualized contact tracing has been widely adopted to map and monitor community transmission of SARS-CoV-2. 5 Contact tracing has been used in the past for surveillance of a wide variety of diseases, including smallpox and SARS, and aims to stem COVID-19 outbreaks by identifying the close contacts of diagnosed cases and informing them of their potential exposure to the virus. 6][9][10] A systematic review of the empirical evidence for contact tracing for several diseases found that it can be an effective public health tool but noted a ''scarcity of quantitative evidence on its effectiveness'' with wide-ranging practices. 11econd, mandated reporting has been employed by public health authorities where extensive identification, testing, querying, and monitoring of individuals, who were exposed to a first positive case and might be part of a larger outbreak, can be targeted at the facility level with the cooperation of designated third parties.3][14] Standardized and comprehensive reporting by employers may expose workplace-specific risks that individualized contact tracing efforts miss and may inform workplace-or ll OPEN ACCESS industry-specific nonpharmaceutical interventions. 15SCC delegated one distinct team to perform contact tracing through interaction with a case or their family, and another team to perform in-depth outbreak investigations, working with COVID designees from an involved institution (e.g., school, employer, and congregate facility) to generate a curated cohort of potentially exposed individuals and to actively increase their testing rate.While various forms of mandated reporting and associated outbreak investigation have been widely deployed for COVID-19 control, they are time-consuming and pose a variety of challenges related to responsiveness, confidentiality, and liability. 16,17Much of the academic literature consists of case studies of individual investigations rather than evidence for the utility of reporting or investigation as a public health tool. 18,19hird, spatiotemporal clustering, which identifies case clusters based on proximity in space and time, has also been proposed as a COVID-19 surveillance strategy.Spatiotemporal clustering has long been a component of syndromic surveillance tools and platforms, such as the Centers for Disease Control and Prevention (CDC) Electronic Surveillance System for the Early Notification of Community-Based Epidemics (ESSENCE) and Early Aberration Reporting System (EARS), which ingest syndromic surveillance data and provide alerts and visualization for spatiotemporal clusters of interest. 20In the context of COVID-19 surveillance, spatiotemporal methods may facilitate the detection of clusters beyond a single household, and can further refine cluster detection by measuring the statistical significance of a cluster relative to the background case rate.][23] One common challenge around assessing these disease surveillance strategies is validation.Confirming a transmission link is resourceintensive, and, even with such information, it is hard to assess how counterfactual strategies might have fared.In this paper, we show that sequenced viral RNA, collected at an unprecedented scale during the COVID-19 pandemic, provides opportunities to assess the relative ability of disease surveillance strategies to recover related cases in near real time as the disease spreads.7][28] In addition, such data may help assess strategies that have not yet been widely implemented, such as spatiotemporal clustering or automatic case linkage using information like workplace location.
In the face of scarce public health resources, it is critical to understand which disease surveillance strategies provide reliable transmission information, what underutilized strategies may exist, and how strategies may serve as complements or substitutes.We hence leverage rich linked datasets from SCC and illustrate one approach for comparing the relative ability of disease surveillance strategies to identify related cases by utilizing increasingly available data from whole-genome sequences.While our illustration is with SARS-CoV-2 data, the approach can be used with any communicable disease that yields genomic data.

RESULTS
We examine the effectiveness of existing and potential strategies for identifying linked cases, summarized in Figure 1.Overlap counts and p values between all pairs of strategies are shown in Figures S1 and S2 in the supplemental information (SI), respectively.

Contact tracing and spatiotemporal clustering
We find that a high proportion of close contacts (83% [95% confidence interval, 67-94%]) identified by contact tracing are plausible transmission links according to whole-genome sequence data.However, a key question is whether contact tracing provides any marginal informational value compared to same address links (83% [78-88%]) which can be passively determined from the original laboratory results, thereby not requiring contact tracing to uncover.Notably, the 17% of close contacts links that do not share an address (i.e., the non-obvious relationships we might depend on contact tracers to find) are also highly plausible (74% [27-100%]), but infrequent.We also find that spatiotemporal links via SaTScan yield high informational value (75% [66-82%]), but when same address links are excluded, leaving only 45% remaining proximate connections, informational value drops (29% [13-47%]), suggesting a lower likelihood of direct transmission.

DISCUSSION
Our initial assessment has yielded valuable insights about SCC's range of disease surveillance strategies.School reporting appears more fruitful than workplace reporting, perhaps in part due to the greater and more relevant training of school COVID designees, often professional school nurses, compared to that of workplace designees, often offsite human resources staff.Our results may point to important differences between not only disease surveillance strategies, but also the underlying spread settings in which surveillance is conducted (i.e., classroom and office layouts).Contact tracing yields many potential links through close contacts, but 83% of the recovered links were from the same home address.The relatively high informational value of close contacts outside the home, compared to more data-driven spatiotemporal attempts at the same, suggests the promise of contact tracing protocols that focus more on uncovering nonresidential spread settings, where circumstantial evidence is otherwise low.Moreover, our analysis suggests there are lighter-weight passive alternatives, such as automated linkage of shared location histories, schools, and workplaces via surveys, which may potentially serve as time-saving substitutes or precursors to active outbreak investigation.We note that all of these examined fields were designed as open text fields, and that the true potential of passive surveillance may hinge on improved user interface design (e.g., autocomplete, drop-down lists, and entity resolution).

Limitations of the study
There are several limitations to our approach.First, while sequencing is broadly representative of the county population, it was not designed as a random sample.The sequencing rate varies across disease surveillance strategies, in part because availability of specimens for genomic sequencing was often tied to the County hospital system or testing sites, which overrepresent portions of the county with higher social vulnerability, and in part because the County explicitly targeted testing in high-risk congregate settings like jails to varying degrees throughout the pandemic.Our comparisons of informational value could be confounded if sequencing practices bias toward transmissibility within specific spread settings.However, in practice, SCC could not operationally bias sequencing within disease surveillance strategies; rather, the County could only vary the overall availability of sequencing across strategies, which we expect to only affect the precision of our results.
Another potential source of selection bias is differences between the two sources of genomic data we rely on in our study, which flow through very different chains of custody.COVID-19 tests that are analyzed by the SCC Public Health Laboratory are sequenced in-house, and then submitted to the County's PCR test-based repository of SARS-CoV-2 genomes, Terra, 29 for analysis.Tests that are performed by Fulgent are sequenced at their own laboratories, submitted to the CDC, and published in the Global Initiative on Sharing Avian Influenza Data (GISAID), a public database of SARS-CoV-2 and influenza sequences, [30][31][32] with separate metadata including personal identifiers shared privately with the County.Both sequencing pipelines were crucial to the County's COVID-19 response, but since they are independent, unobserved practices of the third-party laboratory could introduce bias.As a robustness check, in Figures S3 and S4 in the SI, we present informational value broken out by the source of genomic data.For instance, for Terra, we subset to the links for which both cases are sequenced by Terra.We exclude congregate setting investigations (including jails and LTCFs) because tests from these disease surveillance strategies were nearly entirely sequenced by the County.While some variation exists at the strategy level between the two sources of sequencing, on the whole they present a similar rank order of informational value.The Pearson correlation of informational value between Terra and GISAID is 0.84, and the Spearman correlation is 0.83.To the extent that our findings could still be driven in other ways by selection bias, public health authorities considering the use of similar sequence-based assessment methods should assess the representativeness of their sequencing and explore techniques like uncertainty sampling to improve sequencing balance across strategies. 33,34econd, disease surveillance strategies are not monolithic.SCC designed its contact tracing to prioritize ZIP codes with a higher social vulnerability index and disproportionately invested resources toward school reporting and investigations to help ensure schools stayed open, so SCC's results for these strategies may not generalize to contact tracing and mandated reporting in other jurisdictions.Disease surveillance strategies were not even consistent within SCC; the definition of which students should be included in school outbreak reporting varied by school and over time, from ''only those not wearing masks in the classroom'' to ''the whole classroom'' to ''all classmates, even those who were not at school.''While we are able to generally compare informational yield across disease surveillance strategies, our sample sizes limit us from understanding the specific, potentially manageable conditions (e.g., reporting protocol, room size, speed of testing, and improved form collection) under which low-yield strategies may turn out to excel.Future work can readily assess the informational value of disease surveillance strategy variations beyond SCC's, and with greater granularity, using increasingly available viral sequencing data.
Third, our results inform, but should not be interpreted as conclusive of, the optimal strategy.The resources required to implement different surveillance strategies-ranging from monetary to technological to political-may vary more widely than their ability to identify plausible transmissions.The same is true of the varied cost of not implementing different strategies.Informational value may in fact directly tradeoff with resource investment; for instance, congregate setting investigations may have been more extensive in testing of full populations, hence yielding lower genomic similarity, because of the higher risks of, and political impetus to avoid, transmission in those vulnerable populations.Any policy choice should incorporate the relative benefits and costs.At its core, higher informational value means a lower likelihood of ''wrong links,'' which are undesirable because they incur significant costs in both public health resources and impacts on livelihoods, i.e., someone's ability to earn income, provide childcare, etc.
Last, the efficacy of disease surveillance strategies would be expected to be disease-specific (i.e., contact tracing may be unrealistic for diseases with significantly higher airborne transmissibility).In the case of SARS-CoV-2, genomic similarity is itself limited due to the relatively slow mutation rate and the potential for parallel evolution events 24,35 ; nonetheless, our approach yielded substantive insights.

Conclusion
COVID-19 led to a flurry of distinct strategies to investigate and identify cases, far outpacing the bandwidth to assess their relative usefulness.While pandemic response to COVID-19 has receded, our results illustrate an important lesson that carries relevance for public health prospectively.Our approach illustrates how public health authorities might begin to validate distinct disease surveillance strategies using genomic sequences.Given the rapidly dwindling cost of genomic sequencing, the informational value metric may open the door to not just variant tracking, but also near real-time, readily available evaluation of strategies to fight many viral diseases beyond COVID-19 (e.g., Mpox).Our investigation adds important evidence on which disease surveillance strategies appear to be effective in the face of significant public health investments.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:

Lead contact
Requests for further information should be directed to and will be fulfilled by the Lead Contact, Daniel E. Ho (deho@stanford.edu).

Materials availability
This study did not generate new materials.

Data and code availability
This paper analyzes existing, publicly available sequences, but due to privacy concerns, linkages to patient-level data (obtained through SCC's contact tracing database, CalCONNECT) cannot be shared.A DOI containing accession numbers from an initial superset of sequences is available in the key resources table.
Other aggregate data, the study protocol, and code have been deposited at GitHub and are publicly available as of the date of publication.The repository link is available in the key resources table.
Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

Disease surveillance strategies
CalCONNECT provides information about two primary existing disease surveillance strategies of interest.First, in contact tracing, close contacts reported by a positive case may be subsequently contacted by the contact tracer and encouraged to quarantine and test; if they end up testing positive within 14 days of the original case, we can link the two positive individuals, and consider this link a unique disease surveillance artifact of contact tracing.Multiple contacts can be linked to one case, and multiple cases can be linked to one contact.Second, we observe the artifacts of mandated reporting and associated outbreak investigation as exposure events categorized by location type and link all pairs within an event that are also within 14 days of each other, regardless of whether the outbreak investigation team ultimately determined that they met the definition of a confirmed outbreak.We have enough events from schools, workplaces, jails, and LTCFs to evaluate them as distinct location types of reporting, while events from other settings like shelters are insufficiently numerous to analyze on their own.We consider these disease surveillance strategies to active given their reliance on trained personnel to consider cases one by one.
Disease surveillance can also include passive retrieval of links from other information gathered, such as a shared home address (collected upstream via testing) or shared employer, school, or nonresidential location history (collected via survey or interview).These are passive in the sense that they are derived not from direct linkage of individuals via active staff effort, but rather from post hoc matching of patient-provided information, and are for the most part not currently used to identify potential outbreaks (though they may fill gaps in ongoing investigations).
Patient address is present in data from the lab that performs the COVID-19 test, but it must be geocoded to produce a location usable for spatiotemporal scans and to prevent matching failure due to different ways of writing the same address.In order to protect patient privacy and prevent leakage of information via a third-party API, we perform geocoding offline using fuzzy record linkage to a comprehensive table of addresses in SCC, provided by private data vendor Melissa, which includes latitude and longitude.The query address is parsed to extract house number, street name, unit, and postal code.We first search for exact matches, and then for addresses that do not have an exact match in Melissa, we use the fuzzyjoin R package with a custom matching function.Our matching algorithm finds an exact, unit-level match for 82.1% of cases in the study period and finds at least a match with the same house/building number for 85.0%.The resulting matched address is used to identify cases at the same address, and the latitude and longitude are used as input to SaTScan's spatiotemporal scan algorithm.
Same school links require an exact match on three fields filled in as part of the informational interview or survey: city location of the school, type of school (elementary, middle, or high school), and the school name.Fields are lightly cleaned to remove punctuation and remote locations.Same supervisor links require at least one match out of three fields: supervisor name, supervisor phone number, or supervisor e-mail address.Location histories are unique object types in CalCONNECT which specify the start and end date range for a patient visiting a specific location, uniquely identified by location name and address.We impose an additional time-based constraint that two cases must have start and end date ranges within seven days of overlapping one another (i.e., the start of the second case's location history must be within seven days of the end of the first case's location history), in addition to the 14-day window for episode date.Location histories were used both passively by contact tracers during interviews and actively by investigators to record cases who were present at an exposure event but believed to have been exposed to SARS-CoV-2 off-site; in other words, investigated cases linked through location histories were expected to be less-plausible transmission links.Available location history information was ultimately sparse because of a variety of factors, but has the potential to be more substantively collected with future technology and protocol improvements.
Altogether, these data allow us to identify proposed links between cases for an array of different strategies.We use these proposed links to measure the effectiveness of each disease surveillance strategy at identifying plausible SARS-CoV-2 transmission links.While we assess the informational value of each strategy independently, we also assess set differences of two strategies, such as removing same address links from the set of contact tracing links to assess inter-household spread.

Spatiotemporal clustering
We make use of SaTScan, a software package which implements spatial, temporal, and space-time scan statistics for disease surveillance and analysis. 37We use the retrospective space-time scan statistic, developed by Kulldorff, which tests whether observations of COVID-19 are uniformly distributed among units over space and time, adjusting for underlying spatial and temporal inhomogeneity. 40The scan uses a ''moving cylinder'', with the base and height corresponding to the spatial and temporal dimensions respectively.As the cylinder ''slides'' across space and time, SaTScan uses statistical tests to measure whether the case prevalence in the area of interest is significantly higher than the baseline process would suggest.SaTScan identifies spatiotemporal ''hotspots'', so mere proximity (e.g., three cases at the same address) is not sufficient to be flagged; cases must exhibit statistically significant concentrations.SaTScan offers several baseline models, and the choice of model largely corresponds to the type of data available: Bernoulli for 0-1 data corresponding to infected and uninfected people, Poisson for case data combined with knowledge of the underlying population at risk, and a space-time permutation model for case data only.Since CalCONNECT only gives us information about COVID-19 infections, we use the space-time permutation model. 41aTScan is highly configurable.Some parameters, such as the maximum radius of the scan cylinder in space and its maximum duration in time, do not have obvious defaults, and their values may substantially affect the sort of clusters that SaTScan produces.We want to test SaTScan with a ''good'' set of parameters, but also avoid overfitting to the study data and biasing the result.In order to pick reasonable parameters, we conduct a randomized grid search over possible values of the radius (between 0.05 and 5 km) and the time-duration (7, 14, or value based on the sample of sequenced cases.We restrict the analysis to proposed pairs of cases whose episode dates are within 14 days of one another. To determine whether the genomic data support a proposed transmission link, we measure the genomic distance between sequenced pairs of cases.A straightforward way to measure the similarity of two sequences is the number (or proportion) of observed substitutions (i.e., the Hamming distance).However, this does not account for the uncertainty introduced by missing or ambiguous sites.Bootstrapping is a statistical method used to estimate the sampling distribution of a random variable, based only on the observed data, by resampling from the observed data with replacement. 45 Felsenstein (1985) applies this idea to phylogenetic trees, using it to estimate branch support values (i.e., confidence values on clades) by resampling genome sites with replacement.These bootstrap samples are used to construct a new genome, and then a phylogenetic tree is constructed from the bootstrapped genome. 39By doing this many times, and observing how often a clade appears, one can calculate a support value for the clade.In our setting, we are interested in evolutionary distances rather than clade support values, but these distances can also change across bootstrap iterations, as missing and mismatched sites can be resampled or omitted.Therefore, we apply phylogenetic bootstrapping, in combination with ordinary bootstrap resampling over cases, to calculate non-parametric confidence intervals on our measure of informational value.
For each bootstrap iteration, we first perform a resampling (with replacement) from the 29,903 sites of the SARS-CoV-2 genome, which may repeat or omit sites from the original sequence, and construct a phylogenetic tree with evolutionary distances based on maximum likelihood.We consider two sequences to constitute a plausible transmission link if their distance (i.e., expected number of substitutions) is fewer than two base pairs.The expected number of substitutions is the estimated substitution rate (observed mismatches divided by the number of nonmissing sites) multiplied by the total number of sites.This quantity differs across bootstrap iterations because: (a) mismatched sites may be repeated or omitted, which changes the numerator of the substitution rate; and (b) missing sites may be repeated or omitted, which changes the denominator of the substitution rate.We conduct our analysis in Terra, which uses IQ-TREE to build phylogenetic trees. 38By supplying a command-line argument, we can specify that the tree be constructed using the phylogenetic bootstrap, in which case all n bootstrapped trees are outputted along with the consensus tree.Each bootstrapped tree results in slightly different evolutionary distances, which in turn affects which pairs are considered plausible transmission links, which incorporates uncertainty about similarity between strains into our downstream estimates of informational value.
Only a minority of cases in CalCONNECT are sequenced, and of those that are sequenced, not all are of high quality.Informational value is thus estimated from the sample of the population of all COVID-19 cases that have high quality sequences.To quantify the sampling uncertainty of this method, on top of the uncertainty from missing sites in genomes, we also apply bootstrapping over cases.To do this, for each bootstrap iteration, we resample all 67,374 CalCONNECT cases with replacement.When a case is selected multiple times, its corresponding proposed links count multiple times.For example, if case A is sampled three times, and case B is sampled one time, then link A-B is counted three times.If case B was instead sampled twice, then link A-B is counted six times.When a case is not sampled, then its edges do not count at all.Resampling cases rather than links is faithful to the data generating process, in which cases are the unit selected to be sequenced in a laboratory or not, and links can exist between any pair of cases that was chosen to be sequenced.
Finally, for each iteration, we group all sequenced pairs of cases that are identified by a disease surveillance strategy and calculate that strategy's informational value, equal to the fraction of proposed sequenced case pairs, counted using the weights from case resampling, that are also plausible transmission links, according to the bootstrapped phylogenetic tree.We repeat this process for 100 trials using parallelization, given one bootstrap alone requires nearly 24 h of computation.From the bootstrap trials, we obtain a distribution of estimates for the informational value of each disease surveillance strategy, allowing us to calculate empirical confidence bounds and conduct statistical tests to determine whether the yield of a given strategy is significantly higher than the yield of another.We report p values when comparing two disease surveillance strategies, defined as the proportion of the 100 trials in which their ranking reverses, compared to their ranking based on average informational value.

Robustness to different thresholds for plausible transmission
In the main analysis, we consider two sequences to constitute a plausible transmission link if their distance (i.e., expected number of substitutions) is fewer than two base pairs.We selected this threshold in close consultation with SCC epidemiological staff.Given that SARS-CoV-2 is a relatively small genome with a relatively low mutation rate, and that the higher our threshold, the more pairs of sequences would be considered to be ''transmission links,'' we concluded that our choice of possible thresholds was practically limited to small integer numbers.Empirically, we note that over half of all pairs of sequences, across 100 phylogenetic bootstrap iterations, differ by fewer than 20 base pairs, while more than a quarter of all pairs differ by fewer than three base pairs.Nonetheless, given the possibility that our results are sensitive to our selected threshold, we also recompute our main results from Figure 1 using different base pair thresholds, from one to five.The robustness check can be seen in Figure S8 in the SI.We see that, as expected, higher base pair thresholds are associated with higher informational values.We also see that, in general, the rank order of disease surveillance strategies does not significantly change, which suggests that our approach is relatively robust to the exact method of defining transmission and related uncertainty, and is surfacing robust insights about the relative differences in yield between distinct disease surveillance strategies.As notable exceptions, Jail Reporting (and, as a result, All Mandated Reporting) and Same Occupation Location are the disease surveillance strategies most sensitive to different threshold definitions, which suggests that our main results may underestimate their informational value, but not in ways that fundamentally change our key takeaways.

Robustness to alternative uncertainty measures
The main results reported in Figure 1 use a methodology that accounts for both uncertainty from missing sites (via phylogenetic bootstrapping), and uncertainty from sampling (via case bootstrapping).We compare this ''double-bootstrap'' approach to some simpler approaches, as illustrated in Figure S9 in the SI, to understand the relative contributions of two bootstrapping steps to the overall uncertainty and to see if the added complexity is worthwhile.

Phylogenetic bootstrapping only
In this approach, we ignore sampling uncertainty, and only consider uncertainty from missing sites in sequence data.This is not a recommended approach since sampling uncertainty should be accounted for, but it helps to understand the relative contribution of phylogenetic uncertainty.These intervals are often (but not always) the narrowest intervals, implying that sampling uncertainty contributes more heavily.

Pairwise Distances, binomial confidence interval
Pairwise distances are calculated directly by simply counting the number of mismatched sites (also called single-nucleotide polymorphisms, or SNPs) between pairs of strains.This ignores any uncertainty from missing sites.In this approach, we also forgo bootstrapping cases, and instead treat the fraction of proposed edges that are plausible transmission links as a binomial proportion.We assume each link proposed by a disease surveillance strategy is plausible with probability p, estimate p as equal to the number of plausible links divided by the number of proposed edges, and estimate the 95% confidence interval with the Clopper-Pearson exact method.This is a simpler way to consider sampling uncertainty but a less accurate representation of the data-generating process.Given that cases are the actual unit sampled, links are not independent of one another: knowing that link A-B and link B-C are sequenced implies that link A-C is also sequenced.Thus, we think it makes more sense to treat cases as the unit of consideration, as we do in the next two approaches.

Pairwise distances, case-bootstrap confidence interval
As in the previous approach, we ignore sequence missingness, and only consider sampling uncertainty.Instead of treating links as randomly sampled for sequencing from the population, we treat cases as the unit of interest, and quantify uncertainty by resampling cases from the population.Then, the sampled cases are used downstream to reweight the links.When a case is absent because it isn't sampled, its links are not counted.When a case is sampled multiple times, all its links are reweighted so that they count proportionally to how many times the case was sampled.

Phylogenetic bootstrapping, case-bootstrap confidence interval
This is the methodology used for the main analysis.While all four uncertainty methodologies yield a similar rank ordering of informational value of disease surveillance strategies, we ultimately select this methodology to incorporate the most conservative confidence intervals into our comparisons across disease surveillance strategies.

Figure 1 .
Figure 1.Informational value of disease surveillance strategies Author's analysis of Santa Clara County CalCONNECT and genomic sequencing data from May 1 to December 31, 2021.For each disease surveillance strategy, we show: the informational value (proportion of sequenced case pairs which are plausible transmission links according to whole-genome sequence data); the 95% confidence interval on informational value; and the number of proposed links which are sequenced.Contact Tracing: identified as close contacts by contact tracing.Same Address: shared home address.SaTScan: assigned to the same cluster by spatiotemporal analysis.All Mandated Reporting: linked through mandated reporting in a high-risk setting; School, Long-Term Care Facility, Jail, and Workplace Reporting are sufficiently numerous and sequenced to also be presented on their own.The following five strategies are based on matching results from the case interview or survey.Location History: matching location visited within the same time period.School: same school name, school type, and school location.Occupation Location: same workplace location, usually an address.Supervisor: same supervisor name, phone number, or e-mail.Employer Name: same employer name.CI, confidence interval; CT, contact tracing; LTCF, long-term care facility.See also Figures S1-S4.

TABLE
d RESOURCE AVAILABILITY B Lead contact B Materials availability B Data and code availability d EXPERIMENTAL MODEL AND STUDY PARTICIPANT DETAILS d METHOD DETAILS B Disease surveillance strategies d QUANTIFICATION AND STATISTICAL ANALYSIS B Spatiotemporal clustering B Genomic data processing B Sequencing balance B Informational value B Robustness to different thresholds for plausible transmission B Robustness to alternative uncertainty measures STAR+METHODS KEY RESOURCES TABLE RESOURCE AVAILABILITY