Cryptic transmission of SARS-CoV-2 in Washington State

Following its emergence in Wuhan, China, in late November or early December 2019, the SARS-CoV-2 virus has rapidly spread throughout the world. Genome sequencing of SARS-CoV-2 strains allows for the reconstruction of transmission history connecting these infections. Here, we analyze 346 SARS-CoV-2 genomes from samples collected between 20 February and 15 March 2020 from infected patients in Washington State, USA. We found that the large majority of SARS-CoV-2 infections sampled during this time frame appeared to have derived from a single introduction event into the state in late January or early February 2020 and subsequent local spread, indicating cryptic spread of COVID-19 before active community surveillance was implemented. We estimate a common ancestor of this outbreak clade as occurring between 18 January and 9 February 2020. From genomic data, we estimate an exponential doubling between 2.4 and 5.1 days. These results highlight the need for large-scale community surveillance for SARS-CoV-2 and the power of pathogen genomics to inform epidemiological understanding.

and subsequent local spread, indicating cryptic spread of COVID-19 before active community surveillance was implemented. We estimate a common ancestor of this outbreak clade as occurring between 18 January and 9 February 2020. From genomic data, we estimate an exponential doubling between 2.4 and 5.1 days. These results highlight the need for large-scale community surveillance for SARS-CoV-2 and the power of pathogen genomics to inform epidemiological understanding.

Main text
The novel coronavirus, referred to alternately as SARS-CoV-2 ( 2 ) or hCoV-19 ( 3 ) , emerged in Wuhan, Hubei, China, in late November or early December 2019 ( 4 ) . As of 25 March 2020, COVID-19, the disease caused by SARS-CoV-2, COVID-19, has globally caused 413,467 confirmed cases and 18,433 deaths ( 5 ) . After its initial emergence in China, travel-associated cases with travel histories related to Wuhan appeared in other parts of the world ( 6 ) . The first confirmed case in the United States was travel-associated and was detected in Snohomish County, Washington State, on 19 January 2020. Until 27 February 2020, the US Centers for Disease Control and Prevention (CDC) guidance recommended focusing testing on persons with direct travel history or exposure to a known case-cases of respiratory disease with no known risk factors were not routinely tested. In the 6 weeks between 19 January and 27 February, 59 confirmed cases were reported in the United States ( 7 ) , all with either direct travel history or exposure to a known confirmed case. On 28 February 2020, a community case was identified close to the location of the original Snohomish County case ( 8 ) . By 25 March 2020, in the setting of continuing transmission and increased testing, Washington State had reported 2580 confirmed cases and 132 deaths ( 9 ) .
Here we report on the putative history of community transmission in Washington State as revealed by genomic epidemiology. We conclude that SARS-CoV-2 was circulating cryptically, ie undetected by the surveillance apparatus, in Washington State since January 2020, and we underscore the following recommendations in settings where large-scale community transmission is not yet recognized: the importance of early identification of the virus, extensive testing of potential cases, and immediate self-isolation of infected persons.
All SARS-CoV-2 genomes represented in convenience samples from the COVID-19 pandemic appear closely genetically related with the large majority possessing between 0 and 12 mutations relative to a common ancestor estimated to exist in Wuhan between late Nov and early December 2019 ( Supplementary Fig. 1 ). This pattern is consistent with a reported rate of molecular evolution of~0.8 × 10 -3 substitutions per site per year or~2 substitutions per genome per month ( 4 ) . After its initial emergence by the zoonotic route in Wuhan ( 10 ) , SARS-CoV-2 viral genomes began to accumulate substitutions and spread from Wuhan to other regions in the world ( 4 ) . During December 2019, the Wuhan outbreak was too small to seed many introductions outside of China, but by January 2020, it had grown large enough to begin seeding cases elsewhere in the world ( 11 ) . At this point, travel cases with origins in Hubei began to appear in the US. The first confirmed case recorded in the United States was a travel-associated case from an individual returning from Wuhan on 15 January 15 2020 who presented for care at an outpatient clinic in Snohomish County on 19 January 2020 and tested positive ( 12 ) . This infection is recorded as strain USA/WA1/2020 (referred to here as WA1) and appears closely related to viruses from infections in China (Fujian, Hangzhou and Guangdong provinces). Relative to the basal virus at root of the phylogeny, WA1 possesses mutations C8782T and T28144C (found in 74/224 sampled viruses from China) alongside C18060T (found in 6/224 sampled viruses from China). This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint this version posted April 16, 2020. . California in green, viruses from the Grand Princess cruise ship in yellow and viruses from Washington State in red. This comb-like phylogenetic structure is consistent with rapid exponential growth of the virus population. (B) Highest posterior density estimates for the date of the common ancestor of viruses from the Washington outbreak clade as well as the doubling time in days of the growth of this clade.
Sequencing of viruses from the Washington State outbreak began on 28 February 2020 and has continued since then. We analyzed the sequences of 346 SARS-CoV-2 viruses from the Washington State outbreak collected between 20 February and 15 March 2020. The large majority (293, 85%) of these viruses fall into a closely related clade, and all share the mutations possessed by WA1 and the additional mutations C17747T and A17858G. This results in the nested tree structure shown in Figure 1A in which the Washington State outbreak viruses group together and appear as "direct descendants" of WA1 in a maximum likelihood tree with branches measuring substitutions. This tree structure is consistent with the WA1 strain transmitting locally after arrival into the United States. However, because the rate of evolution of 1 mutation per~15 days is slower than the transmission rate calculated for SARS-CoV-2 of 1 transmission event every 4-8 days ( 13 , 14 ) , it is also possible that WA1 sits on a side branch of the underlying transmission tree even if it appears as a direct ancestor in the maximum likelihood tree.
We sought to test these two hypotheses: (a) SARS-CoV-2 was introduced into Washington State on 15 January 2020 with the arrival of WA1; subsequent cryptic transmission led to a community outbreak first detected on 28 February 2020 and (b) SARS-CoV-2 was imported on 15 January 2020 but this infection did not transmit onwards; a second, initially undetected importation event of a genetically identical or highly similar virus occurred, followed by cryptic transmission that led to a community outbreak. The fact that only a small proportion of infections in China have been sequenced makes it difficult to test these two hypotheses. We attempted a simple probability calculation by noting that the C8782T, T28144C, C18060T variant appears in only six (Fujian/8/2020, Chongqing/YC01/2020, Hangzhou/ZJU-08/2020, Guangdong/GD2020086-P0021/2020, Guangdong/GD2020234-P0023/2020, Guangdong/FS-S30-P0052/2020) out of a total of 224 viral genomes from mainland China. If another variant had been introduced to Washington State, it would not result in the observed nesting pattern. Thus, an extremely rough probability calculation is that there is a 6/224 or 3% chance of observing this pattern under hypothesis (b). However, this does not fully account for the probability of stochastic genetic collision, as sampling in China was non-random and the genetic variant in question may be more frequent and introductions from China into Washington State may be more likely to occur from a subset of individuals within China.
We additionally analyzed 293 Washington State viruses from the outbreak clade (excluding WA1) in a coalescent analysis to estimate temporal patterns. Here, we assume an evolutionary rate of~0.8 × 10 -3 substitutions per site per year, consistent with rates for SARS-CoV-2 across the entire pandemic phylogeny. This analysis uses the degree and pattern of genetic diversity of sampled genomes to estimate the date of a common ancestor and exponential growth rate of the virus population. Applying it to these data gives a median estimate for the date of the clade's for use under a CC0 license.
This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint this version posted April 16, 2020. . common ancestor at 1 February 2020 with a 95% Bayesian credible interval of between 18 January and 9 February 2020 ( Fig. 1B ). This is consistent with either hypothesis (a) or (b) above. We additionally calculate a rate of exponential growth from the coalescent analysis for this clade, finding a median doubling time of 3.4 days with a 95% Bayesian credible interval of between 2.4 and 5.1 days ( Fig. 1B ). In addition to the 293 viruses sampled from Washington State falling into the WA1 outbreak clade, we observe that seven viruses sampled from the Grand Princess cruise ship in late February and early Mar 2020 all group into the same outbreak clade ( Fig. 1A ). The genetic relationship among these viruses is consistent with a single introduction onto the Grand Princess cruise ship of the basal outbreak variant, possessing C17747T and A17858G, and subsequent transmission and evolution on the ship. With this phylogenetic structure, it is impossible to confidently determine whether the common ancestor of the outbreak clade is in for use under a CC0 license.
This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint this version posted April 16, 2020. . Washington State, is on the Grand Princess, or was transmitted to both from an unsampled source. However, given the relative sizes of the Washington State outbreak and the Grand Princess outbreak, we believe a Washington State to Grand Princess transmission to be more likely, occurring subsequent to the introduction of the WA1 outbreak clade into Washington State.
In addition, we observed 53 SARS-CoV-2 genomes from Washington State that fall outside this primary outbreak clade and into multiple separate clusters ( Fig. 2 ). Many of these separate clusters consist of single viruses and appear to be recent introductions not yet related to large clusters of local transmission. There is a second clade with 38 viruses that represents 11% of Washington State viruses. This clade is closely related to viruses from the European outbreak and likely represents a second introduction occurring at sometime in February 2020.
Assuming a putative 15 January 2020 introduction based on the phylogenetic evidence, we simulated forward in a stochastic epidemiological transmission model to investigate the expected size of outbreak given this date of introduction. Based on epidemiological literature ( 15 -17 ) , we chose a model with basic reproduction number ( R 0 ) of 3.2 corresponding to a mean doubling time of 6.1 days with 90% uncertainty interval of 5.1 to 8.2 days. Under this model, the median prevalence of active infections descended from the presumptive index case as of 1 March 2020 was 310 (90% uncertainty interval 50, 960), and with a total incidence to that date of 400 (90% uncertainty interval 80, 1300) infections ( Fig. 3 ). As we have not yet been able to quantify the impact of social distancing policies since 5 March 2020, we estimate an upper bound of 1600 (90% uncertainty interval 250, 5100) active infections in this transmission chain by 15 March 2020. Although this simulation lacks explicit mobility, we expect the majority of downstream infections to be geographically localized. This model assumes a constant transmission rate; mitigation measures enacted during early March may have decreased transmission rate.
In January and February, 2020, screening for SARS-CoV-2 in the United States was directed at travelers with fever, cough and shortness of breath, with the geographic areas increasing as new outbreaks were identified, but also specifying travel to China up until 24 February 2020 ( 18 , 19 ) . Our analysis suggests that a single clade of SARS-CoV-2 had likely been circulating in the Seattle area for 4-6 weeks by the time the virus was first detected in a non-traveler on 28 Feb 2020. By then, variants within this clade constituted the majority of confirmed infections in the region (293 of 346;85%). Several factors could have contributed to the delayed detection of presumptive community spread, including limited testing among non-travelers or the presence of asymptomatic or mild illnesses. Genetic evidence suggests that this cluster may descend from an initial introduction in mid-January with the WA1 travel case, but other origin scenarios are also possible.
for use under a CC0 license. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint this version posted April 16, 2020. . We demonstrate that SARS-CoV-2 was circulating in Washington State for 4-6 weeks before the first community-acquired case was detected on 28 February, 2020. Refining the time and geographic origin of the introduction into Washington State will require a combination of earlier samples and samples from other geographic locations, including from elsewhere in the United States and from China. It is possible that the Washington State outbreak originated from the WA1 introduction or from a separate introduction directly from China into Washington State, or from an introduction into Washington State from elsewhere in the United States. Given its size, the ongoing outbreak in New York City could also have resulted in early introduction(s) and cryptic community transmission. As of this date, the relative lack of genomic data from New York City limits what can be inferred about transmission there and how that relates to Washington State.
Our results highlight the critical need for widespread surveillance for community transmission of SARS-CoV-2 throughout the United States and the rest of the world even after the current pandemic is brought under control. The broad spectrum of disease severity ( 20 ) makes surveillance challenging ( 21 ) . The combination of traditional public health surveillance and genomic epidemiology can provide actionable insights, as happened in this instance: upon sequencing the initial community case on 29 February 2020, results were immediately shared via Twitter ( 22 ) , resulting in rapid rollout of social distancing policies as Seattle and Washington State came to grips with the extent of existing COVID-19 spread. From 29 February onwards, new genomic data was immediately posted to the GISAID EpiCoV sequence database ( 23 , 24 ) and analyzed alongside other public SARS-CoV-2 genomes via the Nextstrain online platform ( 25 ) to provide immediate and public situational awareness. We see the combination of for use under a CC0 license. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

Ethics Approval
Sequencing and analysis of samples from the Seattle Flu Study was approved by the institutional review board at the University of Washington (protocol STUDY00006181). Informed consent was obtained for all community participant samples and survey data. Informed consent for residual sample and clinical data collection was waived. For UW Virology Lab, use of residual clinical specimens was approved by the institutional review board at the University of Washington (protocol STUDY00000408) with a waiver of informed consent.

Disclaimer
This manuscript represents the opinions of the authors and does not necessarily reflect the position of the U.S. Centers for Disease Control and Prevention.

Acknowledgments
We gratefully acknowledge the authors, originating and submitting laboratories of the sequences from GISAID's EpiFlu™ Database on which this research is based. A full acknowledgements table is available as supplemental information. We have tried our best to avoid any direct analysis of genomic data not submitted as part of this paper and use this genomic data as background. We thank Joe Felsenstein and Chris Spitters for helpful discussion.
The Seattle Flu Study is run through the Brotman Baty Institute for Precision Medicine and funded by Gates Ventures, the private office of Bill Gates. The funder was not involved in the design of the study and does not have any ownership over the management and conduct of the study, the data, or the rights to publish. JS is an Investigator of the Howard Hughes Medical Institute. TB is a Pew Biomedical Scholar and is supported by NIH R35 GM119774-01. EBH and RAN are supported by University of Basel core funding. Sequencing analyses of SARS-CoV-2 genomes from California was supported by an NIH grant R33-AI129455 and the Charles and Helen Schwab Foundation to CYC, and an NIH grant K08-CA230156 and the Burroughs-Wellcome CAMS Award to WG. the 2018-2019 flu season to combine clinical and innovative community sampling methods to measure how influenza, RSV, and other respiratory pathogens enter and circulate with the Seattle metropolitan region ( 26 ) . Samples screened for COVID-19 were collected as part of routine clinical testing, and residual samples were utilized for this study. Samples were additionally collected as part of prospective community enrollment of individuals with acute respiratory illness.

Diagnostics and sequencing
Extracted nucleic acids (Magna Pure, 96 Roche) were screened for SARS-CoV-2 in a multiplexed Taqman assay with primer/probe sets targeting SARS-CoV-2 Orf1B (FAM) and human RNaseP (VIC) in duplicate (Life Technologies assay ID APGZJKF and A30064) in 384 well plates. Samples that are positive for SARS-CoV-2 are retested using the CDC-designed rtRT-PCR assays acquired directly from IDT (2019 nCoV Kit lot# 0000500389, 2019-nCoV_N positive control lot# 0000500326) according to CDC instructions with the exception that a 384-well plate and ViiA7 thermocycler was used.
SARS-CoV-2 genome sequencing was conducted using a metagenomic approach. RNA from positive specimens was converted to cDNA using random hexamers and reverse transcriptase (Superscript IV, Thermo) and a sequencing library was constructed using the Illumina TruSeq RNA Library Prep for Enrichment kit (Illumina). The library was sequenced on a MiSeq instrument using a V2 300 kit (Illumina). The resulting reads were assembled against the SARS-CoV-2 reference genome Wuhan-Hu-1/2019 (Genbank accession MN908947) using the bioinformatics pipeline https://github.com/seattleflu/assembly . Consensus sequences were deposited to Genbank (accessions pending) and GISAID.
For UW Virology samples, sequencing was performed as described previously ( 27 ) . Libraries were sequenced on Illumina MiSeq or NextSeq instruments using 1x185 or 1x75 runs respectively. Consensus sequences were assembled using a custom bioinformatics pipeline ( https://github.com/proychou/hCoV19 ) adapted for SARS-CoV-2 from previous work ( 28 , 29 ) . Briefly raw reads were trimmed to remove adapters and low quality regions using BBDuk and a k-mer based filter was used to pull out reads matching the reference sequence NC_045512. Filtered reads were de novo assembled using SPAdes ( 30 ) and contigs were ordered against the reference using BWA-MEM ( 31 ) . Gaps were filled by remapping reads against the assembled scaffold and a consensus sequence was called from this alignment using a custom script in R/Bioconductor. Consensus sequences were annotated using Prokka ( 32 ) and deposited to Genbank (accessions pending), GISAID, and NCBI SRA (Bioproject PRJNA610428).
We examined date of sample collection and other available metadata to remove possible duplicates from sequencing. The final dataset of 346 SARS-CoV-2 genomes from Washington State represent a convenience sample from the underlying outbreak. All Washington State sequences used in the paper are available here https://github.com/blab/ncov-cryptic-transmission .
for use under a CC0 license. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.
The copyright holder for this preprint this version posted April 16, 2020. .

Phylogenetics
SARS-CoV-2 genomes from the global COVID-19 pandemic were downloaded from GISAID ( 23 , 24 ) and processed using the Nextstrain ( 25 ) bioinformatics pipeline Augur to align genomes via MAFFT v7.4 ( 33 ) , build maximum likelihood phylogeny via IQTREE v1.6 ( 34 ) and reconstruct nucleotide and amino acid changes on the ML tree. This bioinformatic processing pipeline is fully documented and reproducible at https://github.com/nextstrain/ncov . The resulting tree was visualized in the Nextstrain web application Auspice to view resulting inferences.
Additionally, 293 SARS-CoV-2 aligned genomes from the WA1 outbreak clade were analyzed in BEAST ( 35 ) to estimate time of common ancestor and rate of epidemic growth. This analysis used an exponential growth coalescent model in which effective population size and rate of exponential growth are estimated. We assumed a HKY85 nucleotide substitution model ( 36 ) with gamma distributed rate variation and a strict molecular clock with a mean of 0.8 × 10 -3 substitutions per site per year. Full analysis details, including BEAST XML, are available at https://github.com/blab/ncov-cryptic-transmission .

Dynamical modeling
To model plausible ranges for the cumulative incidence and current prevalence following the introduction of SARS-CoV-2 into Snohomish County, WA, from the presumptive index case described in ref. ( 12 ) , we used a stochastic susceptible-exposed-infectious-recovered (SEIR) model with the following assumptions. The exposed (latent) period prior to the onset of viral shedding is normally-distributed with a mean of 4 days and standard deviation of 1 day; this is one day shorter than the 5 day consensus estimate of the incubation period prior to symptom onset ( MIDAS-network ) to acknowledge reports of pre-symptomatic shedding. The infectious period is normally distributed with mean 8 days and standard deviation 2 days, based on measured upper-respiratory viral shedding after symptom onset ( 37 ) . We drew transmission events from a truncated normal distribution to approximately reproduce the negative binomial transmission dynamics with transmission heterogeneity parameter k =0.54 from ref. ( 38 ) . We chose a truncated normal to reproduce the general overdispersion associated with SARS-CoV-2 transmission but with reduced probability of very large events involving more than twenty transmissions from a single person. For initial conditions, we assumed that the introduction into Snohomish County began on 15 Jan, 2020, coincident with the return of the presumptive index case to western Washington from Wuhan, China and with their symptom onset date ( 12 ) . The model code and scripts to run it are available at https://github.com/blab/ncov-cryptic-transmission .
for use under a CC0 license. This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.