Low Viral Diversity Limits the Effectiveness of Sequence-Based Transmission Inference for SARS-CoV-2

ABSTRACT Tracking the spread of infection amongst individuals within and between communities has been a major challenge during viral outbreaks. With the unprecedented scale of viral sequence data collection during the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic, the possibility of using phylogenetics to reconstruct past transmission events has been explored as a more rigorous alternative to traditional contact tracing; however, the reliability of sequence-based inference of transmission networks has yet to be directly evaluated. E. E. Bendall, G. Paz-Bailey, G. A. Santiago, C. A. Porucznik, et al. (mSphere 7:e00400-22, 2022, https://doi.org/10.1128/mSphere.00400-22) evaluate the potential of this technique by applying best-practice sequence comparison methods to three geographically distinct cohorts that include known transmission pairs and demonstrate that linked pairs are often indistinguishable from unrelated samples. This study clearly establishes how low viral diversity limits the utility of genomic methods of epidemiological inference for SARS-CoV-2.

The mapping of transmission networks is a powerful tool for understanding pathogen dynamics during an outbreak. While these networks have often been constructed using traditional contact tracing methods (1,2), phylogenetic techniques can also be used to infer transmission linkages between individuals by identifying samples that map closely together on a phylogeny constructed from community sequences (3,4). Accounting for the spread of subconsensus variants between infected individuals can also enhance sequence-based transmission analyses (5,6).
These sequence-based methods have been used as an alternative or a supplement to traditional epidemiological tactics, especially in settings where contact tracing is rendered less effective due to widespread host interactions within large interpersonal networks (7-9). The complementation of traditional methods with genomic epidemiology can therefore yield a more robust approach toward mapping transmission networks. For example, genomic methods have been employed to determine epidemiological factors associated with the sustained circulation of antibiotic-resistant Staphylococcus aureus in regions across the globe (10). During the 2016 Ebola outbreak, phylogenies of sample sequences were constructed to track viral spread between countries (11), and sequencing of dengue virus samples has been used to understand transmission dynamics and identify factors that contribute to increased risk of outbreak (12).
However, the usefulness of these phylodynamic methods depends in part on the amount of genetic variation present within the local pathogen population. If the pathogen in question readily generates and preserves mutations, samples from epidemiologically linked individuals may have differing sequences. If little population diversity exists, sequence similarity between samples may not necessarily indicate a transmission linkage. Therefore, not all infectious agents are equally appropriate candidates for sequence-based transmission inference. During the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) pandemic, sequence-based inference has been used to identify superspreading events (13), track global viral transmission (14), and map community infection networks (15). However, due to the relatively low mutation rates of coronaviruses (16) and the low levels of within-host genetic diversity observed during acute infections (17-19), it is possible that the viral diversity generated during intracommunity circulation is inadequate to distinguish true transmission linkages from unlinked samples. Insufficient community sampling may also hinder the reliable inference of transmission networks and decrease the accuracy of these methods, as low sequence availability will decrease the quality of any phylogenetic inference. The dependability of sequence-based inference methods for SARS-CoV-2 must therefore be validated before the technique can be confidently used.
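The concern about low mutation rates can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only and is not taken from Bendall et al.: it assumes a substitution rate of about 1e-3 substitutions per site per year, a genome of roughly 30,000 nucleotides, and about 5 days between successive infections. All three values are round-number assumptions, and the Poisson model is a simplification.

```python
import math

# All three constants are illustrative assumptions, not values from the study.
ASSUMED_RATE_PER_SITE_PER_YEAR = 1e-3   # rough order of magnitude
ASSUMED_GENOME_LENGTH = 30_000          # nucleotides, rounded
ASSUMED_DAYS_BETWEEN_INFECTIONS = 5     # rough serial interval


def expected_substitutions(rate_per_site_per_year, genome_length, days):
    """Expected number of consensus substitutions accumulated over `days`."""
    return rate_per_site_per_year * genome_length * days / 365.0


def prob_identical(expected):
    """Poisson probability of zero substitutions given the expected count."""
    return math.exp(-expected)


mu = expected_substitutions(
    ASSUMED_RATE_PER_SITE_PER_YEAR,
    ASSUMED_GENOME_LENGTH,
    ASSUMED_DAYS_BETWEEN_INFECTIONS,
)
p0 = prob_identical(mu)
print(f"expected substitutions per transmission: {mu:.2f}")
print(f"probability that donor and recipient consensus match: {p0:.2f}")
```

Under these assumptions, fewer than one substitution is expected per transmission event, so most donor and recipient consensus sequences would be identical, and so would many sequences separated by several links in a chain.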
Bendall et al. used SARS-CoV-2 sequence data from households where transmission events between close contacts could be established with high confidence to test whether sequences from within a known transmission cluster are more similar to each other than to sequences from the broader community (20). This approach allowed for the evaluation of the accuracy of phylodynamic inference in a scenario in which known transmission linkages were already defined.
Drawing on samples collected from three distinct household transmission studies, the authors constructed phylogenetic trees composed of SARS-CoV-2 sequences from study participants alongside sequences from the surrounding communities. The amount of community sequence data included on each tree was determined by estimating the overall sampling densities in each study region (New York City, Utah, or Puerto Rico). Though sequences sampled from participants within a household generally grouped together on a phylogenetic tree, these clusters were often interspersed with other (sometimes identical) sequences from the surrounding community that were unlikely to be directly linked by transmission. In a situation where the probable transmission linkages were not already known, this lack of differential clustering would confound efforts to accurately resolve transmission chains. The low levels of SARS-CoV-2 genetic diversity within communities thus hinder the detection of transmission chains from sequence data.
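The confounding effect described above can be illustrated with a simple pairwise comparison. The following is a minimal sketch, not the authors' pipeline (which relied on phylogenetic trees): it assumes pre-aligned, uppercase consensus sequences and uses Hamming distance as a stand-in for tree proximity. The sample names, the toy sequences, and the helper functions `hamming` and `community_matches` are all hypothetical.

```python
def hamming(a, b):
    """Count differing sites between two aligned sequences.

    Sites where either sequence has an ambiguous base (N) or a gap (-)
    are skipped rather than counted as differences.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    skip = {"N", "-"}
    return sum(
        1 for x, y in zip(a, b)
        if x != y and x not in skip and y not in skip
    )


def community_matches(household_pair, community_ids, seqs):
    """Return community samples that are at least as close to either
    household member as the two household members are to each other."""
    a, b = household_pair
    d_pair = hamming(seqs[a], seqs[b])
    return [
        c for c in community_ids
        if min(hamming(seqs[a], seqs[c]), hamming(seqs[b], seqs[c])) <= d_pair
    ]


# Toy alignment (hypothetical data for illustration only).
seqs = {
    "hh1_a": "ACGTACGTAC",
    "hh1_b": "ACGTACGTAC",
    "com_1": "ACGTACGTAC",  # identical to the household sequences
    "com_2": "ACGTACGTAT",  # one substitution away
    "com_3": "ACNTACGTAC",  # ambiguous site is ignored
}

confounders = community_matches(
    ("hh1_a", "hh1_b"), ["com_1", "com_2", "com_3"], seqs
)
print(confounders)
```

Any community sample returned here is as plausible a transmission partner as the true household contact on sequence evidence alone, which is the situation the household clusters in the study often faced.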
The authors also asked whether including subconsensus genetic variants in sequence comparisons could improve efforts to match linked samples by providing an additional level of genetic diversity. They found that the inclusion of subconsensus variants was not always sufficient to resolve known transmission linkages from a larger pool of community sequences. Therefore, while sequence comparisons could confirm transmission between individuals who were already known to be epidemiologically linked (i.e., household pairs), Bendall et al. show that phylogenetic clustering is not sufficient to confidently determine SARS-CoV-2 transmission linkage in the absence of supplemental contact tracing information.
This study highlights important limitations that should be taken into consideration when reconstructing SARS-CoV-2 transmission networks based on sequence data alone. While the authors note that their results may not translate to congregate settings with higher infection densities, the lack of viral diversity observed in more dispersed communities poses a challenge to epidemiological inference. These findings suggest that sequence data should be used to help confirm likely SARS-CoV-2 transmission events, rather than to identify new ones. Furthermore, the authors demonstrate that phylodynamic methods cannot be applied to all infectious agents with equal effectiveness. The background population diversity and underlying biology of the pathogen of interest must be considered before attempting to draw epidemiological conclusions from sequence data.