The need for linked genomic surveillance of SARS-CoV-2

Genomic surveillance during the coronavirus disease 2019 (COVID-19) pandemic has been key to the timely identification of virus variants with important public health consequences, such as variants that can transmit among and cause severe disease in both vaccinated and recovered individuals. The rapid emergence of the Omicron variant highlighted the speed with which the extent of a threat must be assessed. Rapid sequencing and public health institutions’ openness to sharing sequence data internationally provide an unprecedented opportunity to do this; however, assessing the epidemiological and clinical properties of any new variant remains challenging. Here we highlight a “band of four” key data sources that can help to detect viral variants that threaten COVID-19 management: 1) genetic (virus sequence) data; 2) epidemiological and geographic data; 3) clinical and demographic data; and 4) immunization data. We emphasize the benefits that can be achieved by linking data from these sources and by combining data from these sources with virus sequence data. The considerable challenges of making genomic data available and linked with virus and patient attributes must be balanced against the major consequences of not doing so, especially if new variants of concern emerge and spread without timely detection and action.


Introduction
Since the start of the pandemic, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has evolved in multiple ways that increase its public health threat, with higher transmissibility (Alpha, Delta, Omicron variants) (1)(2)(3)(4), partial immune escape (Beta, Omicron variants) (5,6) and greater severity (Alpha, Delta variants) (7)(8)(9). The continued emergence [...] It is to be hoped that SARS-CoV-2 will not evolve higher transmissibility simultaneously with higher severity among vaccinated or recovered individuals. The cellular immune response is strong and complex (12)(13)(14), and breakthrough infections have had reduced severity compared to infections in unvaccinated individuals (15). Before Omicron emerged, vaccine-induced antibody responses remained strong across a variety of variants of concern (VOCs) (16,17), but Omicron is a stark reminder that variants can emerge that substantially evade our immune responses (1)(2)(3)(18), at least in terms of neutralizing antibodies (14)(18)(19)(20), dramatically reducing vaccine-induced protection against infection (21). There is no guarantee that future variants will follow Omicron's path in terms of severity.
Virus sequencing initiatives and related genomic surveillance systems give a high-resolution and near-real-time view of how SARS-CoV-2 is evolving and spreading and of the mutations that are rising in frequency (22). Establishing surveillance systems that can detect evolving viral characteristics that impact clinical outcomes and effectiveness of control measures is a key aim of viral sequencing efforts (23). For a newly emerging variant with uncertain impact, rapidly assessing the degree of risk to control efforts is paramount and requires multiple sources of data.

Data and linkages that are required
While genomic data alone allow certain inferences (e.g. identifying which cases are related, and identifying which mutations occur in a new variant), substantially greater value can be obtained by combining a "band of four" key data sources: genetic data; epidemiological and geographic data; clinical and demographic data; and immunization (or recovery) data.
Genetic data refer to attributes of the virus. Here we focus on SARS-CoV-2 whole genome sequence data, but note that polymerase chain reaction testing can identify specific mutations or deletions without fully sequencing the virus genome and so can provide rapid VOC detection.
Epidemiological and geographic data refer to information about the transmission context, including the geographic location and the reason for testing or sequencing (e.g. whether the individual was part of a known outbreak, was a traveller, was randomly sampled, was a vaccine breakthrough infection, was someone previously infected or was tested for other reasons). Epidemiological data also include information about the source and location of exposure: workplace outbreak; household; travel; community exposure; animal exposure; and health care worker, as well as any other contact investigation information (e.g. indoors vs outdoors, ventilation, community setting).
Clinical and demographic data refer to attributes of individuals infected with SARS-CoV-2, including treatments provided, outcomes (e.g. symptoms, severity) and demographic aspects (e.g. age, comorbidities, exposure risks). Immunization (or recovery) data refer to attributes of past COVID-19 infection or vaccination, including vaccine type(s), number of doses and dates of doses.
These data are typically gathered by different parts of a health system at different times and are used for a variety of purposes, creating challenges for data linkage. Medical facilities manage the clinical course of disease; contact tracing and other case data are gathered by epidemiological teams in public health; vaccination status may be in medical records or known only to the individual; and sequence information is often collected at specialized sequencing centres. Along the way, information may be lost or remain disconnected. Jurisdictions differ in the extent to which linkages among these data can be made; however, linking these four data sources is the most promising way to rapidly detect variants that have the potential to break through pandemic containment measures.
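As a rough illustration of what linking the four sources involves, the sketch below joins hypothetical extracts on a shared pseudonymous case identifier and reports how many cases can be linked across all four. The identifier, the field names and the records are all invented for illustration; they do not describe any real surveillance system.

```python
from typing import Any, Dict, Optional

# Hypothetical extracts from four separately managed sources. The shared
# pseudonymous case identifier ("c1", "c2", ...) is an assumption; in practice
# such a key often does not exist and has to be created.
SOURCES: Dict[str, Dict[str, Dict[str, Any]]] = {
    "genetic": {
        "c1": {"sequence_id": "seq-001"},
        "c2": {"sequence_id": "seq-002"},
    },
    "epidemiological": {
        "c1": {"reason_for_sequencing": "random sample"},
        "c3": {"reason_for_sequencing": "known outbreak"},
    },
    "clinical": {
        "c1": {"age_group": "60-69", "hospitalized": True},
        "c2": {"age_group": "20-29", "hospitalized": False},
    },
    "immunization": {
        "c1": {"doses": 2},
        "c2": {"doses": 0},
    },
}


def link_case(case_id: str) -> Dict[str, Optional[Dict[str, Any]]]:
    """Collect one case's record from every source; None marks a missing link."""
    return {name: table.get(case_id) for name, table in SOURCES.items()}


def linkage_completeness() -> float:
    """Fraction of all known cases that can be found in all four sources."""
    all_ids = set()
    for table in SOURCES.values():
        all_ids.update(table)
    fully_linked = [cid for cid in all_ids
                    if all(cid in table for table in SOURCES.values())]
    return len(fully_linked) / len(all_ids)


print(link_case("c2"))          # the epidemiological entry is None
print(linkage_completeness())   # one of the three known cases is fully linked
```

Tracking a completeness figure of this kind is one simple way to see where information is being lost between the parts of a health system.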

Opportunities with partial data
It is essential to understand vaccine effectiveness against a variety of outcomes (infection, symptoms, hospitalization, death), as well as intrinsic transmissibility and severity in vaccinated and unvaccinated individuals. These can change rapidly as new variants arise and spread. Links to genetic data can attribute transmissibility, severity and vaccine effectiveness to viral types, and thereby provide a better basis for projecting infections and healthcare burden in the context of vaccination. Viral evolution also causes a continual turn-over in how we classify a virus, as names are given only when a variant has spread and become sufficiently distinct (e.g. by Phylogenetic Assignment of Named Global Outbreak Lineages) (24). Consequently, case data with linked lineage information need to be updated as our classification system changes, and this is only possible if links to sequence data, as opposed to lineage names, are maintained.
With only viral sequences and sample dates, it is possible to identify unusual new variants, bursts of mutations, "mutator" lineages that evolve faster than predicted (25,26) or genetic changes that spread more rapidly than expected; however, rapid growth is difficult to interpret. Rapid growth could be due to viral characteristics, epidemiological fluctuations, travel-associated introductions or sampling artifacts (26). For example, the mutational profile of the Omicron variant was a cause for concern because it includes both new mutations and a number of mutations already seen in other VOCs, including mutations known to enable the virus to evade neutralizing antibodies (27). Because of its genetic surveillance system, the Department of Health in South Africa sounded the alarm about Omicron (B.1.1.529; November 25, 2021) after detecting the new variant and witnessing its rapid spread in a matter of weeks (first collected on November 11, 2021). The researchers noted key outstanding questions about the effect of Omicron on transmissibility, effectiveness of vaccines and disease severity, which cannot be determined from data on the number of detected Omicron sequences alone (28).
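One common way to quantify "spreading more rapidly than expected" from sequences and sample dates alone is to fit a straight line to the log-odds of a variant's frequency over time. The sketch below does this with ordinary least squares. The function name and the counts are invented for illustration, and the estimate inherits every caveat above: a positive slope does not by itself distinguish a viral advantage from introductions or sampling artifacts.

```python
import math


def logit_growth_rate(days, variant_counts, total_counts):
    """Least-squares slope of logit(variant frequency) against time, per day."""
    xs, ys = [], []
    for day, k, n in zip(days, variant_counts, total_counts):
        if 0 < k < n:  # the logit is undefined at frequencies of 0 or 1
            xs.append(day)
            ys.append(math.log(k / (n - k)))
    if len(set(xs)) < 2:
        raise ValueError("need at least two distinct time points with 0 < frequency < 1")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Invented weekly counts in which the odds of the variant double every week.
days = [0, 7, 14, 21]
variant = [10, 20, 40, 80]
total = [90, 100, 120, 160]

rate = logit_growth_rate(days, variant, total)
doubling_time_of_odds = math.log(2) / rate
print(rate, doubling_time_of_odds)
```

In this constructed example the odds double every seven days, so the fitted slope is ln(2)/7 per day.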
The fields of phylogeography and phylodynamics have enabled the use of virus sequence data to infer the geographic movements of viruses (24,25), identify factors driving transmission across geographic regions (29), estimate the effective reproduction number over time (30,31) and link virus sequences to epidemiological models for a range of applications (32,33); however, there are limitations. Phylogeographic analyses are affected by geographic differences in both sampling rates and strategies. Phylodynamic estimates of reproduction numbers over time tend to be retrospective, apply to large virus populations at the national or international scale, have high degrees of uncertainty and are often not immediately actionable at smaller locations, where public health units need to act. Combining sequence data with the other three bands of data offers more opportunities to use virus sequences to understand transmission, severity and immunity. This combination does not necessarily require individual-level linked data; much could be done with data that are de-identified and even data reported for small groups rather than individuals. Even disaggregating outcomes by VOC status would have very high value, as noted recently for Omicron (34).
If the epidemiological context is known, it is possible to distinguish the emergence of a variant with a high growth rate from growth driven by chance "founder effects" (e.g. superspreader events, social gatherings among unvaccinated individuals, introductions vs transmission in care settings or increased sampling due to a particular outbreak) (35,36). Making this distinction increases the reliability of the inference and the value for both research and public health (36,37). For example, Volz et al. combined sequencing and polymerase chain reaction testing data with reason for sequencing (community samples) and geography in estimating transmissibility of the Alpha variant B.1.1.7 (1). Virus sequences can also be linked to travel history to monitor the spread of emerging variants and to inform public health measures aiming to limit importation (24,38,39).
In densely sampled outbreaks, linking virus sequences to epidemiology can offer information of immediate relevance to infection prevention, especially when analysis can be done in real time. Lucey et al. used whole genome sequence data to identify previously undetected transmission events in hospital-acquired infections, finding evidence that transmission occurred from both symptomatic and asymptomatic healthcare workers, and occurred disproportionately in patients who required high levels of nursing care, informing better prevention tools (40). In a real-time genomic epidemiology study in Australia, sequencing linked to epidemiological data indicated the probable source of infection and identified previously unknown connections between institutions (37,41). Linking virus sequences to additional host and epidemiological data, such as the location of exposure, would also make it possible to detect mutations that give the virus a context-specific advantage, such as transmitting more efficiently outdoors or among specific age groups.
Linking viral sequence data with host data on age, sex, race, occupation, dwelling type, comorbidities and other clinical/demographic data permits virus and host factors contributing to severe disease to be identified. For example, Bager et al. used linked data for virus sequences, hospitalization outcome and a large number of host covariates to demonstrate a higher adjusted risk ratio of hospitalization for the Alpha variant (42). Similarly, Fisman and Tuite estimated the increase in risk of hospitalization, intensive care unit admission and death from N501Y-containing variants and the Delta variant (43). Further resolution could be achieved with whole genome sequence in place of VOC screening data.
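To make the quantity concrete, the sketch below computes a crude (unadjusted) risk ratio of hospitalization for one variant relative to a reference, with a Wald confidence interval on the log scale. This is a textbook calculation, not the adjusted analysis used in the studies cited above, which control for host covariates; the function name and counts are invented.

```python
from math import exp, log, sqrt


def risk_ratio(events_variant, n_variant, events_reference, n_reference, z=1.96):
    """Crude risk ratio with a Wald interval on the log scale (z=1.96 for ~95%)."""
    if min(events_variant, events_reference) <= 0:
        raise ValueError("both groups need at least one event")
    if events_variant > n_variant or events_reference > n_reference:
        raise ValueError("events cannot exceed group size")
    rr = (events_variant / n_variant) / (events_reference / n_reference)
    # Standard error of log(RR) for two independent binomial proportions.
    se_log_rr = sqrt(1 / events_variant - 1 / n_variant
                     + 1 / events_reference - 1 / n_reference)
    lower = exp(log(rr) - z * se_log_rr)
    upper = exp(log(rr) + z * se_log_rr)
    return rr, lower, upper


# Invented linked counts: hospitalizations among sequenced cases of each type.
rr, lower, upper = risk_ratio(30, 1000, 20, 1000)
print(rr, lower, upper)
```

With these invented counts the point estimate is 1.5 but the interval spans 1, which illustrates why linked data on many cases are needed before a severity difference can be asserted.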
Linked immunization and sequence data are essential to determine whether newly emerging types and/or variants reduce vaccine effectiveness and to what extent. For example, Skowronski et al. linked VOC typing with vaccine status and testing information to show that a single dose of messenger ribonucleic acid (mRNA) vaccines was similarly effective against the Alpha and Gamma variants and non-VOC SARS-CoV-2 (44). Examining clusters or sets of closely related virus sequences together with immunization status informs us about potential transmission. If a cluster consists mainly of vaccinated individuals, this suggests considerable transmission among these individuals; however, if breakthrough infections are preferentially sequenced, an apparent cluster of breakthrough cases could be missing many unvaccinated individuals who comprised most of the transmission. Distinguishing between these requires linking sequences, vaccination status and reason for sequencing, which may include contact tracing or household information.
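A minimal sketch of variant-specific vaccine effectiveness, assuming a test-negative comparison in which cases of each variant are compared with the same pool of test-negative controls and effectiveness is estimated as one minus the odds ratio of vaccination. The counts and names are invented, the estimate is crude (no adjustment for age, calendar time or other confounders), and it is not presented as the method of the study cited above.

```python
def vaccine_effectiveness(vacc_cases, unvacc_cases, vacc_controls, unvacc_controls):
    """Crude VE = 1 - odds ratio of vaccination, cases versus test-negative controls."""
    if min(vacc_cases, unvacc_cases, vacc_controls, unvacc_controls) <= 0:
        raise ValueError("all four cell counts must be positive")
    odds_ratio = (vacc_cases / unvacc_cases) / (vacc_controls / unvacc_controls)
    return 1.0 - odds_ratio


# Invented counts. Controls are people who tested negative; cases are split by
# the variant identified through linked sequencing or VOC screening.
controls = {"vaccinated": 400, "unvaccinated": 600}
cases_by_variant = {
    "variant_A": {"vaccinated": 40, "unvaccinated": 200},
    "variant_B": {"vaccinated": 120, "unvaccinated": 200},
}

ve_by_variant = {
    name: vaccine_effectiveness(
        cases["vaccinated"], cases["unvaccinated"],
        controls["vaccinated"], controls["unvaccinated"],
    )
    for name, cases in cases_by_variant.items()
}
print(ve_by_variant)
```

Splitting cases by variant is only possible when vaccination status and variant identity are linked for the same individuals. The sampling caveat above also applies: if breakthrough infections are preferentially sequenced, vaccinated cases are over-represented and the estimate is biased downward.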
The entire band of four is needed to determine whether a virus variant can be transmitted by vaccinated individuals and cause severe disease among them: sequence data can tell us whether this is a new variant; epidemiological data and vaccination data can tell us whether it is being transmitted among vaccinated individuals; and clinical data will indicate whether the variant is causing severe disease. Without these four linked pieces, shared sufficiently rapidly and over a large enough area to have strong statistical power, there will be gaps that substantially weaken our ability to monitor the virus' changing phenotype. Small-scale but aggregated and de-identified data may be sufficient for early warnings and help to avert concerns over privacy.
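One way aggregated, de-identified reporting could work is sketched below: linked records are reduced to counts per group, and any count below a minimum cell size is suppressed. The grouping fields, the threshold of five and the records are assumptions chosen for illustration; real disclosure-control rules vary by jurisdiction.

```python
from collections import Counter


def aggregate_with_suppression(records, group_keys, min_cell=5):
    """Count records per group; counts below min_cell are replaced by None."""
    counts = Counter(tuple(record[key] for key in group_keys) for record in records)
    return {group: (n if n >= min_cell else None) for group, n in counts.items()}


# Invented linked records with direct identifiers already removed.
records = (
    [{"region": "north", "voc": True, "vaccinated": True}] * 6
    + [{"region": "north", "voc": True, "vaccinated": False}] * 2
)

table = aggregate_with_suppression(records, ["region", "voc", "vaccinated"])
print(table)
```

Even a table this coarse shows VOC cases by vaccination status and region, which is the kind of disaggregation an early warning needs.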

Data sharing and statistical power
Many jurisdictions may gather virus sequences and clinical, epidemiological and immunization data, but may not permit linkage among them due to structural or other barriers. Even where timely joint analysis of these data is possible, however, there is an additional challenge that an emerging variant or type is necessarily rare when it is first emerging. Sharing data across jurisdictions results in greatly improved statistical power by increasing the total amount of data available. Data delays are an additional problem. Even for countries sharing virus genomic data through the Global Initiative on Sharing All Influenza Data database, lags can span months (45). These extensive time lags hamper international efforts to track variants and their mutations, determine which are rising in frequency and where, track variants' epidemiological and biological consequences and develop effective public health policy (45). Furthermore, even where sequences are shared in a timely manner to the Global Initiative on Sharing All Influenza Data database, they are typically not shared alongside epidemiological, clinical/demographic and immunization data. Indeed, the barriers to public health data sharing are extensive: van Panhuis et al. described technical, motivational, economic, political, legal and ethical barriers (46). Many of these are of daily relevance in the COVID-19 pandemic.
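The gain from pooling can be illustrated with a simple binomial calculation: the probability that at least k cases of a rare variant appear among the sequenced samples, used here as a rough proxy for having enough linked cases to analyse. The sketch assumes random sampling and the same variant prevalence in every jurisdiction; the prevalence, sample sizes and threshold are invented.

```python
from math import comb


def prob_at_least(k, n_sequenced, prevalence):
    """P(X >= k) for X ~ Binomial(n_sequenced, prevalence)."""
    p_fewer = sum(
        comb(n_sequenced, i) * prevalence**i * (1 - prevalence) ** (n_sequenced - i)
        for i in range(k)
    )
    return 1.0 - p_fewer


prevalence = 0.002          # hypothetical: 2 in 1,000 infections are the new variant
per_jurisdiction = 1000     # hypothetical number of sequences per jurisdiction
n_jurisdictions = 5

single = prob_at_least(5, per_jurisdiction, prevalence)
pooled = prob_at_least(5, per_jurisdiction * n_jurisdictions, prevalence)
print(single, pooled)
```

Under these assumptions a single jurisdiction is unlikely to observe five variant cases, whereas the pooled data set almost certainly does.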

Timeliness matters
To make an immediate practical difference, these data linkages and analyses need to be conducted with as little delay as possible. The sooner a new VOC can be characterized, the more warning decision-makers have about the risk. Identifying the spread of a VOC requires strong real-time genomic surveillance with sampling that reflects community transmission, and it requires regular reporting on the makeup of the virus population.
There are significant challenges to developing timely surveillance for emerging VOCs, and these challenges differ according to whether the concern is an increase in severity, immune escape, transmissibility or a combination. Changes in severity will shape the impact on the healthcare burden, yet it takes many infections before a difference in severity can be estimated: only a minority of individuals experience severe disease, and there are inherent delays between infection and eventual outcomes. By the time the risks of hospital and acute care needs can be estimated, many hundreds or thousands of infections will have occurred. To stratify severity estimates by viral factors requires even more hospital records and therefore more infections (potentially thousands). This can be ameliorated slightly by focusing on measures with minimal time lags (for example hospital admissions rather than occupancy) and with timely reporting.
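The scale of the problem can be sketched with the standard normal-approximation sample size formula for comparing two proportions. The hospitalization risks below are invented, and the calculation ignores reporting delays, confounding and stratification, all of which increase the number of infections required.

```python
from math import ceil, sqrt
from statistics import NormalDist


def infections_per_group(p_reference, p_variant, alpha=0.05, power=0.8):
    """Approximate infections needed per group to detect p_variant != p_reference."""
    if not (0 < p_reference < 1 and 0 < p_variant < 1) or p_reference == p_variant:
        raise ValueError("risks must lie strictly between 0 and 1 and must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_reference + p_variant) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_reference * (1 - p_reference) + p_variant * (1 - p_variant))
    ) ** 2
    return ceil(numerator / (p_reference - p_variant) ** 2)


# Hypothetical risks: 2% of reference infections hospitalized versus 3% for a variant.
n_needed = infections_per_group(0.02, 0.03)
print(n_needed)
```

For these hypothetical risks the formula gives several thousand infections in each group, in line with the order of magnitude described above; a larger difference in risk would need fewer.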
Differences in transmissibility are likely to be apparent earlier than differences in severity, because transmission occurs for all infections (whereas severe outcomes occur for a small minority). Indeed, with both the Alpha and Delta variants, increases in transmissibility were detected well ahead of increases in severity (1,7). Differences in immune evasion may or may not be apparent soon after the relevant variants arise, depending on the genomic surveillance system (e.g. prioritization of breakthrough infections, extent of surveillance) and whether the new type causes severe disease among vaccinated individuals. An effective surveillance system also requires linking timely detection with timely action. Public health and policy makers need to assess when to take action in the face of the uncertainty that is inherent in early assessments of variants that might increase transmission, severity or immune escape. Early localized actions that prevent a VOC from spreading widely, while costly in the short-term, reduce the risk of prolonged and global challenges to effective COVID-19 control.

Discussion
Timely and accurate surveillance requires a range of expertise spanning infectious disease epidemiology, statistics, virus evolution, genomics and public health. Benefits are gained not just from combining data but from conducting joint analyses, bringing together a sufficient range of expertise to increase the chance of early detection of an emerging threat. Many standard approaches used to estimate transmissibility, vaccine effectiveness and severity (e.g. attack rates, test negative study designs) are only possible after community transmission is well established. Designing systems to warn of possible elevated transmission, immune evasion and severity when there are still few cases requires integrating many sources of information and expertise and developing and using analytical methods designed to combine these data streams. Furthermore, progress in establishing linked surveillance for SARS-CoV-2 is likely to benefit surveillance for other respiratory pathogens, including newly emerging zoonotic viruses and high-burden pathogens such as influenza and respiratory syncytial virus. Improvements in sequencing technology also allow sequencing multiple viral pathogens sampled from patients or the environment, improving the ability to respond rapidly to any newly emerging virus (47).
There are precedents for strong genomic-based surveillance systems with linkage to clinical and epidemiological data. PulseNet Canada (48) is a virtual electronic network that delivers systemic surveillance for enteric disease and ensures that genomes of causal bacteria are rapidly sequenced. The presence of clusters of cases triggers coordinated outbreak investigations in which data are collected and linked to sequences to assess the full extent of the outbreak and identify the source. For SARS-CoV-2 surveillance, the Canadian COVID-19 Genomics Network (16) aims to establish large-scale virus and host sequencing at a national scale to inform decision-making and track the evolution and spread of the virus. Such national platforms can enable data linkage, either with public access or with privileged access given to approved researchers. To date, however, such goals have been hampered in Canada, in part by limited or delayed access to virus sequences and limited linkage.
Throughout the SARS-CoV-2 pandemic, the United Kingdom has led the world in data linking, analyses and public communication in its efforts to understand SARS-CoV-2 evolution and impact on public health. The COVID-19 UK Genomics Consortium (49) performs and coordinates sequencing, with over 1.5 M publicly available viral genomes as of February 17, 2022 (50). Sequences are linked with clinical and epidemiological information and are stored securely. Public health agencies use genomic data linked to clinical, demographic and epidemiological data in the public health response and can provide de-identified COVID-19 patient information into the Cloud Infrastructure for Microbial Bioinformatics (CLIMB-COVID-19) (51) database. There are systems in place for researchers to access the data.
A recent briefing (SARS-CoV-2 VOC and variants under investigation in England: technical briefing 36) from the UK Health Security Agency (21) provides an excellent example of the impact of research enabled by data linkage in the United Kingdom. This report summarizes research linking Phylogenetic Assignment of Named Global Outbreak Lineages (Pango) lineage information to contact tracing data, permitting the discovery that the BA.2 sublineage of Omicron has shorter serial intervals than the BA.1 sublineage, which in turn affects the interpretation of selection (the higher rate of spread is in part due to faster transmission rather than more overall transmission). Linking to vaccination data, age profiles and severity permitted estimates of protection against severe disease and the likely health care burden of BA.2. Sequence and screen-based characterization of the rise of BA.2 allowed estimates of its rate of spread, which are needed to project the future burden of infection and disease. The report is a collaboration of teams that combine expertise in genomics, outbreak surveillance, contact tracing, epidemiology and data analytics, linking and analyzing emerging data with very rapid turn-around and thereby benefitting the global community.
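The point about serial intervals can be made concrete with a standard relationship between the exponential growth rate r and the reproduction number R when the generation interval follows a gamma distribution with a given mean and shape: R = (1 + r * mean / shape) ** shape. The sketch below uses the serial interval as a stand-in for the generation interval, which is an assumption, and the numbers are invented rather than taken from the briefing.

```python
def reproduction_number(growth_rate, mean_interval, shape=4.0):
    """R implied by growth rate r (per day) under a gamma generation interval."""
    base = 1.0 + growth_rate * mean_interval / shape
    if base <= 0:
        raise ValueError("growth rate is too negative for this interval distribution")
    return base ** shape


growth_rate = 0.1                      # hypothetical growth rate per day
r_short = reproduction_number(growth_rate, mean_interval=3.0)
r_long = reproduction_number(growth_rate, mean_interval=4.0)
print(r_short, r_long)
```

For the same observed growth rate, the shorter interval implies a smaller reproduction number: faster growth can reflect quicker transmission rather than more onward infections per case, a distinction that can only be drawn when lineage information is linked to contact tracing data.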
Beyond national-level analyses, linking data at a local level can provide important insight into transmission routes and outbreak risks; for example, genomic epidemiology tools have been used to examine transmission at the scale of outbreaks (52)(53)(54)(55)(56). By linking sequences, clinical outcome, epidemiological data and vaccination status, such local analyses can alert public health to the emergence of a concerning cluster. If there were a growing cluster with transmission among vaccinated individuals and high severity, this could be detected early. Both national and local-scale analyses require linkage among disparate data systems through unique identifiers, collaboration across multiple disciplines, and a process by which researchers can access linked data to develop and validate methods.

Conclusion
The SARS-CoV-2 virus will continue to evolve. We cannot predict where new variants of concern will arise, nor rely on them being detected early in locations that have strong genomic surveillance. The more we build strong surveillance systems worldwide, with high-quality data and linkages, the earlier we will be able to detect new variants and act accordingly. Many wealthy countries have high rates of vaccination, which leads to selection of variants with the ability to transmit among vaccinated individuals. With extensive international travel, emerging variants will be able to rapidly migrate around the world, and any that evade immunity will not be as impacted by vaccination requirements. In the worst case, viral evolution could undermine the potential for vaccination to mitigate the pandemic, even in countries that have not yet reached high vaccination rates. Countries with the resources to conduct high volumes of sequencing and to develop strongly linked surveillance programs are also the ones that have most benefited from early and extensive vaccination programs. Developing and supporting strong genomic surveillance that enables monitoring the virus' phenotypes is important to help ensure that the vaccines remain effective for the rest of the world.