One year into the pandemic: Short-term evolution of SARS-CoV-2 and emergence of new lineages

The COVID-19 pandemic was officially declared on March 11th, 2020. Since the very beginning, the spread of the virus has been tracked nearly in real-time by worldwide genome sequencing efforts. As of March 2021, more than 830,000 SARS-CoV-2 genomes have been uploaded to GISAID and this wealth of data has allowed researchers to study the evolution of SARS-CoV-2 during this first pandemic year. In parallel, nomenclature systems, often poorly consistent with each other, have been developed to designate emerging viral lineages. Despite general fears that the virus might mutate to become more virulent or transmissible, SARS-CoV-2 genetic diversity remained relatively low during the first ~8 months of sustained human-to-human transmission. At the end of 2020/beginning of 2021, though, some alarming events started to raise concerns of possible changes in the evolutionary trajectory of the virus. Specifically, three new viral variants associated with extensive transmission have been described as variants of concern (VOC). These variants were first reported in the UK (B.1.1.7), South Africa (B.1.351) and Brazil (P.1). Their designation as VOCs was determined by an increase of local cases and by the high number of amino acid substitutions harboured by these lineages. This latter feature is reminiscent of viral sequences isolated from immunocompromised patients with long-term infection, suggesting a possible causal link. Here we review the events that led to the identification of these lineages, as well as emerging data concerning their possible implications for viral phenotypes, reinfection risk, vaccine efficiency and epidemic potential. Most of the available evidence is, to date, provisional, but still represents a starting point to uncover the potential threat posed by the VOCs. We also stress that genomic surveillance must be strengthened, especially in the wake of the vaccination campaigns.

Introduction
The small outbreak of "pneumonia-like illness" reported in Wuhan, China, during December 2019 has grown (Zhou et al., 2020b; Li et al., 2020a) to be the most devastating pandemic of the 21st century. The thirteen months of the SARS-CoV-2 pandemic have also witnessed synergistic efforts of the global scientific community, not only to demystify the emergence of the virus, but also to develop diagnostics and vaccines. Like never before, the spread of the virus has been tracked nearly in real-time by genome sequencing efforts throughout the world. This wealth of data has allowed researchers to study the evolution of SARS-CoV-2, map the emergence of new viral variants and, in the past few months, track the emergence of novel lineages, which are generating concerns at multiple levels. Focusing on the recent events associated with the detection of novel viral lineages in the UK, South Africa, and Brazil, we provide a summary of the lessons learned during the first year of the COVID-19 pandemic.

Origin of SARS-CoV-2
The most fundamental questions that needed to be addressed at the beginning of the pandemic were focused on the understanding of the disease and the origin of the virus. The disease of unknown aetiology was termed COVID-19 (Coronavirus disease 2019) by the World Health Organisation (WHO) (Novel Coronavirus (2019-nCoV) Situation Report 1, 21 January 2020, World Health Organization, 2020) and it became the first example of "Disease X", caused by an unknown infectious agent. The virus was referred to as a novel coronavirus (2019-nCoV) by the WHO and subsequently named SARS-CoV-2 by the International Committee on Taxonomy of Viruses (ICTV) (Coronaviridae Study Group of the International Committee on Taxonomy of Viruses, 2020).
The outbreak of the SARS-CoV-2 at the animal market in Wuhan was investigated from two perspectives with respect to transmission of the virus. The virus could have been either transmitted to the human population from an animal source or it could have been introduced in the market by individual(s) already infected with the virus. Investigation of initial cases revealed that not all the patients had a direct link to the animal market. Further, information on the appearance of their symptoms around December 1, 2019 indicated that they might have acquired the infection from potentially undetected cases, prior to December 2019.
The initial investigation for the characterisation and identification of the viral pathogen was carried out using whole genome sequencing and the first SARS-CoV-2 genome sequence was deposited in the public domain (Virological.org) on January 10th, 2020. This facilitated the comparison of the genomic sequence of isolate "Wuhan-Hu-1" (GenBank accession number NC_045512) with the sequences of known viruses, which provided insight into the evolution of SARS-CoV-2. These sequence-based searches revealed that the Wuhan-Hu-1 isolate of SARS-CoV-2 shared >96% identity with RaTG13, a bat coronavirus belonging to subgenus Sarbecovirus (genus Betacoronavirus) that was isolated in 2013 from Rhinolophus affinis, a horseshoe bat in Yunnan, China. Based on the genetic proximity between RaTG13 and SARS-CoV-2, a bat origin for the latter was proposed. However, viruses similar to SARS-CoV-2 were also found in pangolins (Xiao et al., 2020). An independent evolutionary analysis using the RNA-directed RNA polymerase gene (RdRp), a stable genetic marker present in all RNA viruses, revealed that SARS-CoV-2 isolates are a homogeneous population devoid of recombination and cluster independently of SARS-CoV. This study further revealed that SARS-CoV-2 shares a common ancestor with Bat-CoV (RaTG13) and Pangolin-CoVs, hinting at the presence of additional host(s) that might have also contributed to the evolution of SARS-CoV-2 to infect humans (Kasibhatla et al., 2020).
Another novel bat coronavirus, named RmYN02, subsequently discovered using metagenomic analysis of bat viruses from the Yunnan province of China, showed, in some parts of the genome, higher sequence similarity to SARS-CoV-2 than RaTG13, its closest relative (Zhou et al., 2020b). Furthermore, both SARS-CoV-2 and RmYN02 were observed to harbour multiple amino acid insertions at the junction of the S1 and S2 subunits of the spike (S) protein. However, the relatively low sequence identity between the spike proteins of RmYN02 and SARS-CoV-2, coupled with key differences in their RBDs (receptor binding domains), suggested that RmYN02 might not bind the ACE2 (angiotensin-converting enzyme 2) receptor. These studies also indicated that insertion events in spike proteins could naturally and independently occur amongst betacoronaviruses infecting animals and humans (Zhou et al., 2020b). An independent study conducted by Boni et al. to decipher the origin of the pandemic suggested that, though frequent recombination events might have been responsible for shaping genetic diversity and evolution of sarbecoviruses in general, SARS-CoV-2, in particular, is not a recombinant of any sarbecovirus reported so far (Boni et al., 2020). Members of the family Coronaviridae have large genome sizes (~30,000 nt), with positive sense single-stranded RNA, and have a wide host-range that includes mammals, birds, and fishes (Cui et al., 2019; Forni et al., 2017). Members of the genus Betacoronavirus have been responsible for three major outbreaks in nearly two decades of the 21st century: the SARS, MERS and COVID-19 epidemics in 2002-2003, 2012, and since 2019, respectively (Chan et al., 2020; Drosten et al., 2003; Zaki et al., 2012). The other two members of this genus that infect humans, namely HCoV-HKU1 and HCoV-OC43, have been associated with mild diseases (Cui et al., 2019; Forni et al., 2017).
However, based on the estimate that HCoV-OC43 emerged around the 1890s from a related bovine coronavirus, HCoV-OC43 was proposed as the potential causative agent of the "Russian flu" pandemic (that occurred during 1889-1890), which had been speculated to be caused by an influenza A virus (Vijgen et al., 2005). For all betacoronaviruses, bats are reported as reservoirs; however, these viruses also infect different intermediate hosts such as palm civets (SARS-CoV) and dromedary camels (MERS-CoV) (Forni et al., 2017; Cui et al., 2019).
Since the genomic sequence of SARS-CoV-2 was obtained, the spike gene has remained the focus of much attention, as it encodes the surface protein that harbours the RBD. Sequence comparisons revealed that it is one of the hyper-variable regions in coronavirus genomes and that 5 out of 6 residues involved in receptor binding differ between SARS-CoV-2 and SARS-CoV. Structural and biochemical studies further explained the role of these mutations in binding to the ACE2 receptor, as well as in host switching (Letko et al., 2020; Wrapp et al., 2020). Owing to the high sequence similarity between the ACE2 receptor-binding sites of SARS-CoV-2 and the pangolin-CoVs, a potential role of pangolins as intermediate hosts was also deliberated extensively. A more recent analysis positions pangolins as a species that might have had a role in the transmission of SARS-CoV-2 to humans, but dismisses the possibility of pangolins contributing to the process of adaptation of SARS-CoV-2 to humans (Boni et al., 2020). Similarly, the presence of a polybasic furin cleavage site in the spike protein of SARS-CoV-2 and its absence in other betacoronaviruses triggered discussion regarding the emergence of the furin cleavage site through recombination, which was ruled out. It was suggested that recombination events might have played a role in the evolution of sarbecoviruses prior to the diversification of SARS-CoV-2 (Boni et al., 2020). These analyses have also helped to dismiss the possibility of SARS-CoV-2 being a synthetic construct and underlined a need for systematic surveillance of potential reservoir species, not only to connect the missing dots and decipher the zoonotic spill-over of SARS-CoV-2, but also as a systematic plan for pandemic preparedness. More recent genomic analyses, however, indicate the possibility of recombination in SARS-CoV-2 due to coinfections, which is deliberated in the succeeding sections.

COVID-19: from an outbreak to the pandemic
Deciphering the emergence of disease, tracing its source, mapping associated events, identifying the potential mechanisms and/or routes of spread and releasing related information for consumption to governments, health officials and people at large are critical for management of infectious diseases in general and to arrest the spread of the infectious agent in particular. The WHO has played a significant role in monitoring the spread of SARS-CoV-2 through various stages. The WHO, in collaboration with other international agencies, has published several situation reports, guidelines and advisories to monitor the spread of the virus.
The first public release regarding the onset of "pneumonia-like" illness was made based on the detection of a "cluster of cases" by the Wuhan Municipal Health Commission on December 31, 2019. The first case of the disease outside China was reported in Thailand on January 13, 2020. On January 30, when the number of confirmed cases reached 98 and no deaths were reported outside of China, the WHO declared the outbreak of coronavirus as a public health emergency of international concern. The advisory urged nations to use the "window of opportunity" to prevent widespread transmission of the virus and to develop preparedness to combat the disease at every possible level. Subsequently, COVID-19 epidemic announcements were made by various nations. The epidemic threat of SARS-CoV-2 and its implications on global health are reviewed by Sheahan and Frieman (2020).
The WHO, in the interest of the health of the global population, declared the SARS-CoV-2 pandemic on March 11, 2020 (Coronavirus Disease 2019 Situation Report 51, World Health Organization, 2020), when the number of cases exceeded 118,000 across 110 countries. This action was also intended to inform people at large that the virus was circulating in several countries due to people with travel history and that it was expected to spread far and beyond.

SARS-CoV-2 genetic diversity
Similarly to other viruses with an RNA genome, coronaviruses are prone to mutate during their replication and they do so at a much higher rate than viruses and cellular organisms with a DNA genome. In the case of viruses, mutation rates indicate the rate at which errors are made during genome replication, whereas substitution rates indicate the rate at which evolution proceeds at the molecular level, replacing preexisting alleles by new mutations. For example, substitution rates for Influenza A viruses (negative sense ssRNA) are estimated to be around 1.8 × 10⁻³ substitutions per site per year (s/s/y) and that of human enterovirus 71 (positive sense ssRNA) is estimated to be 3.4 × 10⁻³ s/s/y (Jenkins et al., 2002). Substitution rates can also vary across genomic regions of the same virus.
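To make the units concrete, the following minimal sketch converts a per-site substitution rate (s/s/y) into an expected number of substitutions per genome per year. The two rates are those quoted above; the genome lengths are rounded, assumed values introduced here only for illustration and are not taken from the source.

```python
def substitutions_per_genome_per_year(rate_per_site: float, genome_length_nt: int) -> float:
    """Expected substitutions accumulated along one genome in one year.

    rate_per_site: substitution rate in substitutions/site/year (s/s/y).
    genome_length_nt: genome length in nucleotides.
    """
    if rate_per_site < 0 or genome_length_nt <= 0:
        raise ValueError("rate must be non-negative and genome length positive")
    return rate_per_site * genome_length_nt


# Rates quoted in the text (Jenkins et al., 2002).
INFLUENZA_A_RATE = 1.8e-3
ENTEROVIRUS_71_RATE = 3.4e-3

# ASSUMPTION: approximate, rounded genome lengths used only for this example.
ASSUMED_INFLUENZA_A_LENGTH = 13_500
ASSUMED_ENTEROVIRUS_71_LENGTH = 7_400

if __name__ == "__main__":
    for name, rate, length in [
        ("Influenza A", INFLUENZA_A_RATE, ASSUMED_INFLUENZA_A_LENGTH),
        ("Enterovirus 71", ENTEROVIRUS_71_RATE, ASSUMED_ENTEROVIRUS_71_LENGTH),
    ]:
        per_genome = substitutions_per_genome_per_year(rate, length)
        print(f"{name}: about {per_genome:.1f} substitutions per genome per year")
```

Because the rate is expressed per site, a longer genome accumulates proportionally more substitutions per year at the same rate, which is worth keeping in mind when comparing viruses with very different genome sizes.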
The time to the most recent common ancestor (tMRCA) has been calculated for SARS-CoV-2 based on complete genome sequences. Most studies reported a temporal signal with a clock-like pattern of molecular evolution. So far, all analyses have indicated late November 2019 (~13 to 19 Nov 2019, 95% CI: August 2019 to December 2019) as the tMRCA using log-normal relaxed molecular clock model(s). It is interesting to note that the tMRCA did not vary much, irrespective of the number of genomes analysed (Gómez-Carballa et al., 2020; Chaw et al., 2020; Ladner et al., 2020). This finding is consistent with the fact that the first reported isolate from Wuhan dates to December 1, 2019, and that the incubation time for this virus is about two weeks. Strict clock models show a slightly earlier date for the tMRCA (November 7, 2019). The likely progression of the SARS-CoV-2 lineages (as described by the Pangolin tool) is A, B and B.1 (Gómez-Carballa et al., 2020).
Analyses of members of the Sarbecovirus subgenus (including Pangolin-CoV, Bat-CoV, and SARS-CoV) revealed that the divergence time estimate for SARS-CoV-2 and RaTG13 is the year 1969 (Boni et al., 2020). This suggests that viruses closely related to SARS-CoV-2 have been circulating in horseshoe bats for many decades.

SARS-CoV-2 lineage nomenclature
One important problem in Virology is how to establish a nomenclature system below the species level that is useful, informative, and widely accepted by all those who need to name variants (Rambaut et al., 2020a). Categories below the species level in viruses have usually relied on immunogenic features (e.g., serotypes as in Dengue virus) or distinct and stable genetic differences (e.g., genotypes in hepatitis C virus or subtypes in influenza A virus). However, there are no common rules and additional levels, which usually are of interest for epidemiology, are difficult to accommodate consistently. This is especially so in the initial stages of spread of a new virus, when there has not been enough time to observe clear genetic differences that can be used at least for the major divisions within the species. In the case of SARS-CoV-2, there have been up to three different naming systems coexisting during the first year of the pandemic (Alm et al., 2020).
Most complete genome sequences of SARS-CoV-2 have been uploaded to GISAID (https://www.gisaid.org) since the start of the pandemic. GISAID initiated a naming system based on large clades defined by marker variants from the reference genome (WIV04, MN996528) (Table 1). This scheme is simple but it is not consistent and it fits poorly the evolutionary characteristics of this virus. The high mutation rates of RNA viruses, even allowing for the lower rate of coronaviruses, imply frequent parallel and backward mutations, which affect the reconstruction of evolutionary relationships and thereby introduce inconsistencies in the assignment of lineages to viral isolates. If a lineage is defined solely by a few mutations, recurrent mutations make it possible that the same combination appears in a different lineage, or a backward mutation may make a member of a lineage lose one of the defining mutations.
SARS-CoV-2 sequences deposited in GISAID are readily analysed in different platforms among which NextStrain (www.nextstrain.org) has gained a well-deserved popularity. In this platform, a consistent pipeline of phylogenetic and phylodynamic analysis (Hadfield et al., 2018) is applied to specific subsets of sequences. This is intended to optimize the balance between information and time used to produce it by reducing the total number of sequences used in each "build" (a new global build is constructed on a daily basis from around 4000 sequences). The nomenclature used in NextStrain builds is based on the inferred phylogenetic relationships, ultimately depending on the accumulation of mutations. Evolutionary stable lineages receive a name based on the year of emergence and a successive letter (e.g., 19A corresponds to the first lineage from 2019 and 20B to the second lineage from 2020). Some particular sublineages are identified with additional information (e.g., 20E/EU1, 20I/501Y.V1) but with no systematic rule.
To alleviate these problems, Rambaut et al. (2020a) proposed a new method of nomenclature for within-species categories of viruses, based on evolutionary relationships and epidemiological relevance, denoted PANGO (Phylogenetic Assignment of Named Global Outbreak Lineages). This is a dynamic scheme that accommodates the expanding phylogenetic diversification of SARS-CoV-2 lineages by constraining the number and depth of hierarchical levels. The names start with a letter (A and B, for the earliest lineages) followed by up to 4 hierarchical levels, each defined as descendant from a preceding level given four conditions (van Dorp et al., 2020c): (a) one or more shared nucleotide differences from the ancestor lineage; (b) comprising at least five genomes with >95% of the genome sequenced; (c) genomes within a lineage exhibit at least one shared nucleotide change among them; and (d) a bootstrap support >70% for the lineage-defining node. The resulting names (e.g., A, B.1, B.2.5, etc.) are informative, consistent, and coherent and facilitate the use of neutral tags, with no negative connotations. Continuous supervision of active lineages is facilitated by easily accessible tools such as Pangolin (https://github.com/hCoV-2019/pangolin). Viral lineages with no observations are assumed to be inactive and are delabelled. The different names that a particular genome can receive are provided in the Nextstrain builds (https://nextstrain.org/ncov). Ideally, it would be desirable to have a "rule of equivalences" that would "translate" the name of a clade/lineage in a nomenclature system to the corresponding name in any other. However, this is not possible. Table 2 summarizes the distribution of most (96.6%) sequences deposited in GISAID as of February 4, 2021, according to the three naming systems. All clades included in GISAID are represented in the table, but only a few of the lineages included in NextStrain and PANGO are considered.
The reduction is especially relevant in the PANGO system, because there are 18 sublineages derived from lineage A, and 815 lineages derived from B. It is evident that, although many combinations never occur, it is not possible to establish an equivalence between the clade/lineage a sequence corresponds to in a system and those in the other systems.
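The four designation conditions (a) to (d) listed above can be expressed as a simple check. The sketch below is only an illustration of those thresholds: the class and field names are hypothetical, and it is not the procedure implemented by the Pangolin tool or by the lineage designation committee.

```python
from dataclasses import dataclass


@dataclass
class CandidateLineage:
    """Summary statistics for a candidate lineage (hypothetical field names)."""
    shared_diffs_from_ancestor: int  # (a) shared nucleotide differences from the ancestor lineage
    high_coverage_genomes: int       # (b) genomes with >95% of the genome sequenced
    shared_changes_within: int       # (c) nucleotide changes shared among genomes in the lineage
    bootstrap_support: float         # (d) bootstrap support (percent) for the defining node


def meets_designation_conditions(candidate: CandidateLineage) -> bool:
    """Return True only if all four conditions quoted in the text hold."""
    return (
        candidate.shared_diffs_from_ancestor >= 1
        and candidate.high_coverage_genomes >= 5
        and candidate.shared_changes_within >= 1
        and candidate.bootstrap_support > 70.0
    )


if __name__ == "__main__":
    example = CandidateLineage(
        shared_diffs_from_ancestor=2,
        high_coverage_genomes=12,
        shared_changes_within=1,
        bootstrap_support=85.0,
    )
    print(meets_designation_conditions(example))
```

The bootstrap threshold is strict (>70%), whereas the genome count is inclusive (at least five), mirroring the wording of the conditions.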
Recently, some recombinant lineages have been found in the UK (see below) and an appropriate naming system has been proposed for them within the PANGO framework (Pybus, 2021). The general rules for designation of PANGO lineages still apply, but new recombinant lineages of the highest level are preceded by an X, so we will have XA.1, XB.1.1. These names do not contain information about the putative parental lineages.
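As an illustration of how the hierarchical names encode ancestry, the sketch below parses a PANGO-style name into its letter prefix and numeric levels, derives the parent name, and flags names that carry the recombinant "X" prefix. The function names and the regular expression are assumptions made for this example; this is not code from Pangolin, and it does not handle aliases (for instance, it cannot tell which longer name a short label such as P.1 stands for).

```python
import re
from typing import Optional, Tuple

# ASSUMPTION: a name is one or more capital letters followed by
# zero or more ".<number>" levels, e.g. "A", "B.1", "B.1.1.7", "XA.1".
_NAME_PATTERN = re.compile(r"^([A-Z]+)((?:\.\d+)*)$")


def parse_lineage(name: str) -> Tuple[str, Tuple[int, ...]]:
    """Split a lineage name into (letter prefix, tuple of numeric levels)."""
    match = _NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a recognised lineage name: {name!r}")
    prefix, suffix = match.groups()
    levels = tuple(int(part) for part in suffix.split(".") if part)
    return prefix, levels


def parent(name: str) -> Optional[str]:
    """Name one level up the hierarchy, or None for a bare letter prefix."""
    prefix, levels = parse_lineage(name)
    if not levels:
        return None
    return ".".join([prefix, *(str(level) for level in levels[:-1])])


def is_recombinant(name: str) -> bool:
    """True if the prefix follows the 'preceded by an X' convention (e.g. XA, XB)."""
    prefix, _ = parse_lineage(name)
    return len(prefix) > 1 and prefix.startswith("X")


if __name__ == "__main__":
    for lineage in ["A", "B.1", "B.1.1.7", "B.1.351", "XA.1", "XB.1.1"]:
        print(lineage, parse_lineage(lineage), parent(lineage), is_recombinant(lineage))
```

Consistent with the text, the recombinant prefix says nothing about the parental lineages, so `parent("XA.1")` simply returns `"XA"`.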

Emergence and spread of novel variants
The epidemic growth of an organism provides ideal conditions for rapid diversification. This is especially so in fast mutating viruses. SARS-CoV-2, as other coronaviruses, has a low mutation rate compared to other RNA viruses as described above (Sironi et al., 2020). The fate of these mutations is governed by the usual processes operating in evolving populations: genetic drift, natural selection, founder effects, and, in some cases, recombination. Most mutations appearing within an infected patient are deleterious, and hence usually eliminated by selection, or neutral, and their fate is dictated by stochastic events (i.e., genetic drift). Although some mutations might have a selective advantage at the within-individual level, their destiny at the population level will depend on factors such as their effects on transmissibility, which might increase or decrease their chances of survival during an epidemic. Fear of mutations was common in the early stages of the COVID-19 pandemic, although they did not represent any significant change in the virulence, transmissibility, lethality, or other relevant features of the infectious viruses (Grubaugh et al., 2020b). Furthermore, some of the mutations defining variants of concern also appeared in earlier lineages without provoking noticeable effects. Apart from the inherent difficulties in demonstrating a clear effect by a single mutation (MacLean et al., 2020;Tang et al., 2020), epidemiological processes may confound the genetic effects on the phenotype when evaluated in real populations.
Some of the early mutations were already present in individuals who originated large clusters of transmission. As a consequence, many of the viruses sampled for genomic surveillance will present shared mutations originating in their common ancestors, eventually giving rise to new variants, i.e., organisms that share a common set of mutations usually associated with successful transmission or particular features. The descendants of these variants give rise to a new lineage or sublineage; many of these may be present at a given time during the pandemic, while others emerge and disappear some time after. Hence, it is important to differentiate between mutations (particular genetic changes), variants (sets of organisms that share some mutations), and lineages or clades (the set of individuals descending from a given variant). In Virology, a variant becomes a strain when there is a significant change in its transmissibility, pathogenicity, immunogenicity or lethality characteristics compared to other such variants (Kuhn et al., 2013; Van Regenmortel, 2007).
Before the current variants of concern and interest (analysed in detail below), there have been a few notable variants in the COVID-19 pandemic. The first such variant is known as D614G. This name corresponds to an A to G transition at position 23,403 of the SARS-CoV-2 genome (relative to the Wuhan reference sequence), which results in the replacement of an aspartic acid by a glycine at position 614 of the S or spike protein. The D614G mutation is usually accompanied by three additional mutations: a C-to-T mutation in the 5' UTR at position 241, a silent C-to-T mutation at position 3,037, and a C-to-T mutation at position 14,408 that results in an amino acid change in the RNA-dependent RNA polymerase (RdRp:P323L). These mutations defined a clade (G clade under the GISAID nomenclature system) that became dominant worldwide by the end of March 2020. Although several cases of G614 were detected in China and Germany by the end of January 2020, they likely represented independent mutations, and the first representative of the D614G variant was sampled on 20 February 2020 in Italy from where it rapidly spread to Europe and the rest of the world (Korber et al., 2020). The reasons for the rapid spread of this variant have been controversial, because its emergence was coincident with the start of the rapid diffusion of the virus outside of Asia, first in Europe and later in the Americas and the rest of the world. The main point of discussion was whether the rapid spread was due to an intrinsic advantage of this variant compared to others or whether the main causes were epidemiological and unrelated to the intrinsic features of the variant (Grubaugh et al., 2020a). Structural and in vitro analyses, along with experiments in animal models, of the G614 form of the spike protein point to an increased infectivity and higher viral load than that of the D614 alternative, thus providing support for intrinsic viral properties as responsible for the observed replacement (Korber et al., 2020; Plante et al., 2020). A large study of more than 25,000 genomes from the UK tested the hypothesis of positive selection of G614 in this population. The results were not conclusive concerning the action of natural selection on this variant, but they showed a significant association with viral load and younger age of patients. Using an even larger sample and a new method that takes into account the recurrence of mutations, van Dorp et al. (2020b) found no evidence that D614G increases viral transmissibility.
Table 2. Distribution of SARS-CoV-2 sequences of human origin deposited at GISAID (on February 4, 2021) in the major clades/lineages according to the three nomenclature systems: GISAID (clades V, S, O, L, G, GH, GR, GV), NextStrain (only lineages 19A, 19B, 20A, 20A.EU2, 20B, 20C, 20D, 20E(EU1), 20G, 20H/501Y.V2, and 20I/501Y.V1), and PANGO (lineages A*, including 18 sublineages, and B*, including 815 sublineages). In total, 96.6% of the total number of sequences at GISAID are included in this analysis.
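The relationship between the genomic coordinate 23,403 and spike residue 614 can be verified with a few lines of arithmetic. The sketch below assumes that the spike coding sequence starts at nucleotide 21,563 of the reference genome and that the reference codon for residue 614 is GAT; both values are assumptions brought in for this example and are not stated in the text, and the helper names are hypothetical.

```python
from typing import Tuple

# ASSUMPTION: 1-based genomic coordinate of the first nucleotide of the
# spike coding sequence in the Wuhan-Hu-1 reference (NC_045512).
SPIKE_CDS_START = 21_563

# Only the two codons needed here; a fragment of the standard genetic code.
CODON_TO_AMINO_ACID = {"GAT": "D", "GGT": "G"}


def locate_in_cds(genome_position: int, cds_start: int) -> Tuple[int, int]:
    """Map a 1-based genomic position to (codon number, position within codon), both 1-based."""
    offset = genome_position - cds_start
    if offset < 0:
        raise ValueError("position lies upstream of the coding sequence")
    return offset // 3 + 1, offset % 3 + 1


def substitute(codon: str, position_in_codon: int, new_base: str) -> str:
    """Return the codon with one base (1-based position) replaced."""
    index = position_in_codon - 1
    return codon[:index] + new_base + codon[index + 1:]


if __name__ == "__main__":
    codon_number, position_in_codon = locate_in_cds(23_403, SPIKE_CDS_START)
    reference_codon = "GAT"  # ASSUMPTION: reference codon encoding D614
    mutant_codon = substitute(reference_codon, position_in_codon, "G")
    print(
        f"codon {codon_number}, base {position_in_codon}: "
        f"{reference_codon} ({CODON_TO_AMINO_ACID[reference_codon]}) -> "
        f"{mutant_codon} ({CODON_TO_AMINO_ACID[mutant_codon]})"
    )
```

Under these assumptions the A-to-G change falls on the second base of codon 614, which is why it is non-synonymous, in contrast to the silent C-to-T change at position 3,037.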
Is selective advantage a necessary condition for a variant to increase its frequency? Population genetics theory shows that a mutant allele can increase its frequency, even reach fixation, without the action of selective processes. Genetic drift operates with intensity inversely proportional to population size, which would seem to exclude it from acting in large viral populations; however, the greater the overdispersion of transmission, the more genetic drift is expected. Additionally, the epidemic spread of SARS-CoV-2 has been shown to depend strongly on "super-spreading" events (Adam et al., 2020; Liu et al., 2020; Majra et al., 2021), the epidemiological equivalent of population bottlenecks followed by exponential growth, which can rapidly increase the frequency of a neutral, even slightly deleterious, mutation. Such events have been observed during the COVID-19 pandemic. For instance, the 20E.EU1 (or lineage B.1.177) variant was detected at the beginning of the 2020 summer in the North-East of Spain, linked to agricultural workers. In a few weeks, it became the dominant variant in Spain, which opened its borders to international travel. At the beginning of fall, this variant had become the most prevalent in several European countries, including Switzerland, the UK, and Denmark. The most noticeable variant-defining mutation, A222V, again in the spike protein, does not show significant effects on the ability of the protein to mediate viral entry. Hence, its spread seems to be better explained by chance and opportunity than by any selective advantage.
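The claim that a neutral allele can rise to fixation by chance alone can be illustrated with a small simulation. The sketch below uses the textbook Wright-Fisher model with a constant population size, which is an assumption made here for simplicity and not a method used in the studies cited above; the function names and parameter values are likewise illustrative. In this model a neutral allele is expected to fix with probability equal to its starting frequency, and the small population stands in for the transmission bottlenecks described in the text.

```python
import random
from typing import List


def drift_trajectory(initial_frequency: float, population_size: int,
                     generations: int, rng: random.Random) -> List[float]:
    """Frequency of a neutral allele over time under Wright-Fisher sampling.

    Each generation, every one of the population_size offspring inherits
    the allele with probability equal to its current frequency.
    """
    frequency = initial_frequency
    trajectory = [frequency]
    for _generation in range(generations):
        # Once lost (0.0) or fixed (1.0) the frequency can no longer change.
        if 0.0 < frequency < 1.0:
            carriers = sum(
                1 for _offspring in range(population_size)
                if rng.random() < frequency
            )
            frequency = carriers / population_size
        trajectory.append(frequency)
    return trajectory


def fixation_fraction(initial_frequency: float, population_size: int,
                      generations: int, replicates: int, seed: int = 0) -> float:
    """Fraction of independent replicates in which the allele reaches fixation."""
    rng = random.Random(seed)
    fixed = sum(
        1 for _replicate in range(replicates)
        if drift_trajectory(initial_frequency, population_size, generations, rng)[-1] == 1.0
    )
    return fixed / replicates


if __name__ == "__main__":
    # A neutral allele starting at 10% in a small (bottlenecked) population.
    fraction = fixation_fraction(0.1, population_size=20, generations=300, replicates=1000)
    print(f"fraction of replicates fixed: {fraction:.3f}")
```

Raising `population_size` while keeping the starting frequency fixed makes each trajectory fluctuate far less per generation, which is the sense in which drift weakens in large populations and strengthens through bottlenecks.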
New, adaptive variants may arise at any time during a pandemic. However, benefitting from being advantageous depends on the environment, including external factors and genomic context, and on chance, because population size (i.e., genetic drift) is still important in medium sized populations and in those undergoing frequent bottlenecks. The same D614G mutation that became dominant in the COVID-19 pandemic was detected in other lineages before it originated the highly successful clade 20A. So far, there has been very little immune selection on SARS-CoV-2 from the human population, but this is likely to change as more people develop immune defences as a result of natural infection and vaccination. An excellent opportunity to learn how SARS-CoV-2 adapts to changing environmental conditions and the basis of how it might have jumped to infect humans is given by the changes observed upon jumping to another host species.
SARS-CoV-2 has been found in many domestic and captive animals because of infections from humans or experimental infections (Abdel-Moneim and Abdelwhab, 2020). One particular case has received considerable attention: the jump to minks, which has resulted in serious outbreaks in mink farms and even transmissions from minks to humans (Oude Munnink et al., 2021). Outbreaks in mink farms have been detected in several countries, including Denmark, the Netherlands, Sweden, USA, and Spain (van Dorp et al., 2020c). Apart from the economic losses for the fur industry, the possibility of adaptation of SARS-CoV-2 to a new host with a widespread distribution is a cause of much concern, given the additional difficulties in controlling a pandemic with new natural reservoirs.
A detailed genomic analysis of outbreaks in several Dutch mink farms revealed that they were caused by different lineages (Oude Munnink et al., 2021) and showed a clear relationship between the sequences obtained from minks and those from the corresponding farm workers. In one case, transmission from minks to workers of the farm was observed. However, although some mutations were repeatedly observed among the sequences obtained in the different farms, none was consistently found in all of them. Van Dorp et al. (2020c) identified up to 23 recurrent mutations including three non-synonymous mutations in the RBD that appeared independently on at least four occasions (Fig. 1A). These observations might indicate that the virus has explored several ways to adapt to a new host. This might have also been the case during the phase of adaptation to humans.

Recently emerged SARS-CoV-2 lineages
As mentioned above, since the initial phases of the COVID-19 epidemic, a major source of concern was the possibility that SARS-CoV-2 might mutate to acquire novel phenotypes and, possibly, increased virulence or transmissibility. Worries about the rapid spread of the 20E.EU1 lineage were soon tempered, but emphasized the relevance of genomic surveillance during the pandemic. This became even clearer at the beginning of December 2020, when routine epidemiological investigation for increasing incidence of COVID-19 in Kent (England), together with analysis of sequences obtained by the COVID-19 Genomics UK (COG-UK) consortium, revealed the presence of a large monophyletic cluster highly divergent from genomes sampled in the UK and worldwide. Almost half of the genomes sampled in Kent belonged to this new cluster (https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/947048/Technical_Briefing_VOC_SH_NJL2_SH2.pdf). Inspection of sampling dates indicated that the earlier genomes in the cluster were collected in September. The long branch (Fig. 2) separating these sequences from the others facilitated the identification of the new lineage, which was initially designated VUI-202012/01 (where VUI stands for variant under investigation) by Public Health England and then renamed VOC-202012/01 (here VOC is variant of concern) on December 18th (https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/947048/Technical_Briefing_VOC_SH_NJL2_SH2.pdf). VOC-202012/01 is also referred to as B.1.1.7 (PANGO nomenclature) and 20B/501Y.V1 (Nextstrain nomenclature). As the long branch in the phylogenetic tree indicates, B.1.1.7 is characterized by an unusually large number of nucleotide substitutions, many of which are nonsynonymous changes or deletions in the spike protein (Rambaut et al., 2020b) (see below and Table 3).
The surge in cases associated with B.1.1.7, as well as its biological features, prompted the UK Government to enforce strict control measures. Epidemiological investigations during this time of strict social distancing suggested that B.1.1.7 is more transmissible than pre-existing lineages, with estimated ratios of reproduction numbers varying between 1.4 and 1.8 (Leung et al., 2021; Volz et al., 2020; Vöhringer et al., 2020; Davies et al., 2021a). Alarmingly, B.1.1.7 also seems to be associated with higher viral loads and increased disease severity (Borges et al., 2021a; Davies et al., 2021b; Kidd et al., 2021). By the end of March 2021, the B.1.1.7 lineage had spread to 94 countries, although the coverage of genomic surveillance varies greatly across the globe (https://www.GISAID.org) (Fig. 3). Importantly, the proportion of B.1.1.7 increased significantly, from 1 to 70%, over the course of the recent SARS-CoV-2 outbreak in Portugal (Borges et al., 2021b). Data from the USA, where B.1.1.7 was introduced in October-November, however indicated that this lineage has spread at an unremarkable pace in California, whereas its diffusion in Florida was definitely faster. The reasons why B.1.1.7 displays different epidemiological characteristics depending on the geographic area are presently unknown, but may relate to the specific control measures that are in place in distinct regions.
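To give a feel for what a reproduction-number ratio of 1.4 to 1.8 implies, the following is a minimal sketch rather than the method used in the cited studies. It assumes that both lineages share the same generation time, that there are no new introductions, and that the odds of the variant therefore multiply by the ratio once per generation. The function names are hypothetical.

```python
import math


def generations_to_reach(p_start: float, p_end: float, r_ratio: float) -> float:
    """Generations needed for a variant to grow from frequency p_start to p_end.

    Assumption (not from the source): the odds p / (1 - p) of the variant
    multiply by r_ratio in every generation.
    """
    if not (0.0 < p_start < 1.0 and 0.0 < p_end < 1.0):
        raise ValueError("frequencies must lie strictly between 0 and 1")
    if r_ratio <= 1.0:
        raise ValueError("r_ratio must be greater than 1 for the variant to grow")
    odds_start = p_start / (1.0 - p_start)
    odds_end = p_end / (1.0 - p_end)
    return math.log(odds_end / odds_start) / math.log(r_ratio)


def frequency_after(p_start: float, r_ratio: float, generations: float) -> float:
    """Variant frequency after a given number of generations (same assumption)."""
    odds = p_start / (1.0 - p_start) * r_ratio ** generations
    return odds / (1.0 + odds)


if __name__ == "__main__":
    # Frequencies taken from the Portuguese example in the text (1% to 70%).
    for ratio in (1.4, 1.8):
        n = generations_to_reach(0.01, 0.70, ratio)
        print(f"ratio {ratio}: about {n:.1f} generations")
```

Under these simplifying assumptions, a rise from 1% to 70% takes on the order of 9 to 16 generations; real trajectories also depend on control measures and importations, which this sketch ignores.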
B.1.1.7 carries an in-frame deletion in the N-terminal domain (NTD) of the spike protein (HV69-70Del). An apparently unrelated lineage (B.1.375) with the same deletion was detected in multiple locations in the United States (Worobey, 2020, 2021). B.1.375 most likely originated in mid-September 2020 and, in addition to the deletion, carries fewer mutations compared to lineage B.1.1.7 (Fig. 1A, Table 3). At present, there is no indication that B.1.375 has peculiar characteristics in terms of disease severity or transmissibility (Moreno et al., 2021). Notably, the 69-70 deletion is found in yet another lineage, B.1.258, which has now been reported at considerable frequencies in several locations, especially in Europe (https://cov-lineages.org/lineages/lineage_B.1 Table).
Fig. 1. Mutations observed in emerging SARS-CoV-2 lineages, in mink clusters, and in immunocompromised patients with long-term infection. (A) Schematic representation of the coding regions of the SARS-CoV-2 genome. The furin cleavage site in the S protein is represented with an elongated red triangle. Mutations are represented with colored triangles, as per legend. Variants possibly associated with increased mortality were derived from a previous work (Hahn et al., 2020). Subjects with long-term infection were described in the following works: immunosuppressed individual treated with CP (Kemp et al., 2020), immunocompromised individual with cancer and treated with CP, immunocompromised individual treated with Regeneron, lymphoma patient (Bazykin et al., 2021). (B) Mutations found in lineages B.1.1.7 (orange), B.1.351 (blue), and P.1 (green) are mapped onto the three-dimensional structure of the spike protein. Mutations shared by two lineages are in chocolate, those shared by three lineages in dark red. The 3D structure corresponds to Swissmodel P0DTC2, which includes amino acids that are disordered in the spike crystallographic structures. For clarity, mutations are mapped on one monomer only.
In particular, 20 sequences/month were included. Sequence alignments were generated using MAFFT (v7.427) (Katoh et al., 2019; Polack et al., 2020; Walsh et al., 2020). The phylogenetic tree was constructed using RAxML version 8.2.12 (Stamatakis, 2014) and visualized with FigTree (http://tree.bio.ed.ac.uk/).
Unfortunately, these are not the only lineages to generate concern in this phase of the pandemic, as the Network for Genomic Surveillance in South Africa reported the emergence of a distinctive SARS-CoV-2 lineage (lineage 501Y.V2 or B.1.351) in October 2020. Just like B.1.1.7, this lineage is defined by a substantial number of mutations, only a minority of which are shared with the B.1.1.7 lineage (Fig. 1, Table 3). Epidemiological evidence suggests that B.1.351, which emerged in early August in Nelson Mandela Bay, has been displacing other lineages in several provinces in South Africa. B.1.351 has been isolated from people infected with SARS-CoV-2 in more than 50 different countries across the globe (https://www.GISAID.org). Very preliminary data from the Centre of Mathematical Modelling of Infectious Diseases (CMMID COVID-19 working group, London School of Hygiene and Tropical Medicine) indicated that B.1.351 is possibly more transmissible or less susceptible to cross-protection from previous exposure (or both) (https://cmmid.github.io/topics/covid19/sa-novel-variant.html).
Even more recently (i.e., in December 2020), a resurgence of COVID-19 cases in Manaus, Brazil, prompted a genome sequencing program that led to the identification of another novel lineage (P.1) (Sabino et al., 2021; Faria et al., 2021a). Specifically, Manaus experienced a high attack rate, which was estimated to be approximately 75% by October 2020 (Faria et al., 2021a). The P.1 lineage was absent in samples collected until November 2020, but its prevalence rose to 41% during December 2020. P.1, a descendant of lineage B.1.1.28, is phylogenetically distinct from pre-existing strains circulating in Brazil and elsewhere (Fig. 2). In analogy to the two novel lineages described above, it carries a number of mutations, particularly in the spike protein. Some of these are shared with B.1.1.7 and B.1.351 (Fig. 1, Table 3) (Faria et al., 2021a). The progenitor of P.1, lineage B.1.1.28, has been circulating in Brazil since the early pandemic phase (February-March 2020) (Resende et al., 2020). In addition to the P.1 lineage, B.1.1.28 also gave rise to another independent sub-lineage (P.2) (Fig. 2), which shares one mutation (E484K in the RBD of the S protein) with P.1. Both P.1 and P.2 have been associated with cases of re-infection in Brazil (Vasques Nonaka et al., 2021; Resende et al., 2021). Notably, a recent update from Public Health England reported the detection of the E484K mutation in a small subset of sequences belonging to the B.1.1.7 lineage, suggesting multiple independent acquisitions of this change (https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/957504/Variant_of_Concern_VOC_202012_01_Technical_Briefing_5_England.pdf).
Finally, on January 19th, 2021, the California Department of Public Health dispatched a note about a SARS-CoV-2 variant carrying the L452R mutation. This lineage, now referred to as B.1.429, caused multiple large COVID-19 outbreaks in Santa Clara County and other regions (https://www.cdph.ca.gov/Programs/OPA/Pages/NR21-020.aspx) and is now detected as the majority lineage in California and Nevada (https://outbreak.info). B.1.429 is defined by four non-synonymous substitutions, three of which are in the spike protein (S13I, W152C, L452R). Among these, the L452R mutation is located within the RBD. Epidemiological and in vitro analyses have suggested that B.1.429 has increased infectivity and transmissibility, as well as the ability to escape neutralization by CP and vaccine-induced antibodies (Deng et al., 2021; Li et al., 2020b). At the end of January 2021, a descendant lineage of B.1.429 carrying additional variants including Q677H (S protein) was detected in Colorado. Substitutions at position 677 are notable because they seem to have arisen independently in multiple lineages, either as Q677H or as Q677P. The proximity of this position to the polybasic cleavage site might be consistent with a functional relevance for the proteolytic processing of the spike protein (Hodcroft et al., 2021).

Genomic features and possible origin of emerging SARS-CoV-2 lineages
Clearly, one of the most notable features of the emerging lineages, especially B.1.1.7, B.1.351, and P.1, is the large number of nucleotide substitutions they carry (Table 3, Fig. 1). This is readily visualized by the long phylogenetic branches separating these clusters from the other SARS-CoV-2 variants (Fig. 2). As mentioned above, the estimated substitution rate for SARS-CoV-2 is around 10⁻³ substitutions per site per year (van Dorp et al., 2020a; Ghafari et al., 2020). Thus, circulating viruses accumulate on average ~2 substitutions per month. The large number of changes on the emerging lineages is thus highly unexpected and represents a new twist in the evolutionary trajectory of SARS-CoV-2. Moreover, the nature of these changes and their location are strongly suggestive that some form of selective pressure underlies the origin of the novel lineages. In fact, evolutionary analyses indicated that the emergence of the three lineages carrying the N501Y substitution (B.1.1.7, B.1.351 and P.1) was accompanied by a shift in the strength of natural selection and all of them carry a number of sites that show evidence of ongoing adaptation. Most substitutions in these lineages are either missense, nonsense or indels (insertions/deletions) that alter protein sequences (Table 3), and the majority of them are located in the spike protein, which accounts for only ~13% of the coding capacity of SARS-CoV-2 (Fig. 1). Importantly, some of the spike protein changes have been associated with increased infectivity (e.g., N501Y, HV69-70Del), escape from immune responses (e.g., E484K, K417N/T, HV69-70Del), or spillover to mink farms (N501T) (Table 3). Several mutations in the S protein are also predicted to affect conformational epitopes (Fig. 4). It is also clear that substitutions at positions N501, K417, and E484 arose independently on multiple lineages, suggesting convergent evolution or recurrent mutation (Faria et al., 2021a).
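The arithmetic behind the "~2 substitutions per month" figure, and why a long branch is surprising, can be sketched as follows. This is an illustrative calculation only: it assumes the Wuhan-Hu-1 reference genome length of 29,903 nucleotides and treats substitutions as a Poisson process, which is a simplification. The observed count used in the example is hypothetical and is not taken from Table 3.

```python
import math

GENOME_LENGTH = 29_903          # assumption: Wuhan-Hu-1 reference length (nt)
RATE_PER_SITE_PER_YEAR = 1e-3   # approximate rate quoted in the text


def expected_substitutions(months: float,
                           rate: float = RATE_PER_SITE_PER_YEAR,
                           genome_length: int = GENOME_LENGTH) -> float:
    """Expected number of substitutions per genome over a number of months."""
    return rate * genome_length * months / 12.0


def poisson_tail(k: int, mean: float) -> float:
    """P(X >= k) for X ~ Poisson(mean); a simplifying model, not the authors'."""
    cdf = sum(math.exp(-mean) * mean ** i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


if __name__ == "__main__":
    per_month = expected_substitutions(1)
    print(f"expected substitutions per month: {per_month:.2f}")
    hypothetical_branch_changes = 14  # hypothetical count, for illustration only
    p = poisson_tail(hypothetical_branch_changes, per_month)
    print(f"P(at least {hypothetical_branch_changes} in one month) = {p:.1e}")
```

With these inputs the expectation is roughly 2.5 substitutions per month, the same order as the figure quoted above, and a branch carrying many more changes over a comparable period is very unlikely under a constant-rate model.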
Indeed, recent observations indicate that a substantial fraction of mutations that define the emerging lineages occur in protein regions of remarkable evolutionary plasticity in sarbecovirus genomes (Garry et al., 2021). In fact, a spike protein alignment of representative sarbecoviruses detected regions that, along the evolutionary history of this virus subgenus, have accumulated micro-deletions and insertions. Such regions include the polybasic furin cleavage site, where P681H is located, but also exposed loop regions, which are most likely tolerant to change and subject to immune selection (e.g., the one where E484 lies). Overall, these observations were taken to imply that the selective pressures that have shaped the evolution of spike proteins in sarbecoviruses are driving the ongoing evolution of SARS-CoV-2 in humans, or at least contributed to the emergence of the highly divergent lineages (Garry et al., 2021). This hypothesis is also in line with preliminary analyses of B.1.351 sequences indicating that both the entire S gene and several mutated sites show evidence of positive selection (Tegally et al., 2020).
Whereas all these lines of evidence suggest an underlying selective pressure, the question remains as to which factors or circumstances prompted SARS-CoV-2 to accrue mutations and gave rise to the emerging lineages. At present, the most widely accepted hypothesis involves patients with chronic or long-standing SARS-CoV-2 infection. This is because highly divergent SARS-CoV-2 genomes that carry multiple mutations have been sequenced from immunocompromised subjects with long-term COVID-19, either treated or not with CP or therapeutic antibodies (Bazykin et al., 2021; Avanzato et al., 2020; Choi et al., 2020; Kemp et al., 2020). Moreover, some of the mutations observed in these samples correspond to those identified in the B.1.1.7, B.1.351, and P.1 lineages (Table 3, Fig. 1).
Fig. 4. Relationships between amino acid substitutions and conformational epitopes in the SARS-CoV-2 spike proteins. Conformational epitopes may induce neutralizing antibodies against various viruses (Aso et al., 2019). Therefore, to estimate the vaccine's efficacy, the relationships between conformational epitopes and amino acid substitutions of the S protein in the B.1.1.7 and B.1.351 lineages were examined. These examinations were performed as previously described (Aso et al., 2019). The changes detected in the two lineages (red) affected partially overlapping epitopes (blue). In particular, amino acid substitutions (E484K and N501Y) in conformational epitopes were found in the RBD of the S protein (yellow). Reports suggested that amino acid substitutions in conformational epitopes lead to viral reinfection and changes in vaccine efficacy (Russi et al., 2018). Thus, our in silico study suggests that the mRNA vaccine efficacy partially changes against the variants (Russi et al., 2018; Sahin et al., 2020), although this should also be examined in vitro and in vivo.
As of March 2021, four cases of long-term infection (up to four months) have been described in immunocompromised COVID-19 patients, two with haematological malignancies and two receiving immunosuppressant therapies (Bazykin et al., 2021; Avanzato et al., 2020; Choi et al., 2020; Kemp et al., 2020). In all patients, highly divergent viral lineages emerged, irrespective of symptom severity and disease outcome (two deceased and two recovered) (Fig. 1). Three of these patients were treated at least once with either CP or antibody cocktails (Regeneron). Longitudinal sequencing revealed a progressive accumulation of mutations, as well as the appearance of dynamic viral populations, indicative of intra-host evolution of SARS-CoV-2 (Choi et al., 2020; Kemp et al., 2020). In particular, Kemp and coworkers noted that a viral population carrying the HV69-70Del and D796H mutations emerged after the first CP administration, fell to low frequency in a few days, and then increased again after a second CP treatment. This observation suggests competition among intra-host viral populations and supports the idea that CP treatment exerts a selective pressure that favours specific mutants. Indeed, using in vitro assays they found that the HV69-70Del+D796H variant was less sensitive to neutralization by CP (Kemp et al., 2020).
Overall, these data raise the possibility that, in immunocompromised hosts, long-term infection and reduced immune control allow intra-host virus evolution, and that treatment with CP or therapeutic antibodies selects for specific mutations. It is however worth mentioning that a recent report on a patient with lymphoma and long-term COVID-19 also described the emergence of a highly divergent lineage carrying, among others, the HV69-70Del mutation. The patient did not receive CP or antibodies and did not develop neutralizing responses against SARS-CoV-2 (Bazykin et al., 2021). It is thus possible that long-term viral replication in the context of an immunocompromised host, irrespective of the treatment regime, creates the conditions to generate high viral diversity and novel lineages. This has previously been suggested for other viruses, including norovirus and influenza A virus (McMinn et al., 1999; Karst and Baric, 2015; Memoli et al., 2010; Rogers et al., 2015), although in the latter case the interpretation is complicated by the administration of antivirals to chronically infected patients. In this respect, it is also interesting to note that long-term infections with SARS-CoV-2 (more than 2 months) were previously shown to occur in immunocompetent hosts and, when genome sequencing was performed, limited viral genetic diversity was observed (Li et al., 2020c; Abu-Raddad et al., 2020).
More abundant data on short-term infections indicate that, within the host, SARS-CoV-2 accumulates mutations at a pace consistent with its estimated substitution rate (i.e., ~1 mutation per genome every two weeks) (Tonkin-Hill et al., 2020). Thus, diversity remains low and viral evolution is mainly shaped by purifying selection (Abu-Raddad et al., 2020;Valesano et al., 2021;Popa et al., 2020;Tonkin-Hill et al., 2020;Lythgoe et al., 2021). Nonetheless, variability in the number of detectable viral variants was observed among patients with short-term infections (Popa et al., 2020), and still limited data indicate that SARS-CoV-2 intra-host diversity might increase with age (Al Khatib et al., 2020) and in cancer patients (Al Khatib et al., 2020;Siqueira et al., 2020). Intense monitoring of viruses transmitted by these (and other) patient categories will thus be required to assess whether they can transmit genetically diverse viral genomes.
In summary, given the limited evidence available to date, the most likely explanation for the emergence of the new, highly divergent lineages is that their evolution was accelerated by some specific circumstances (possibly infection of an immunocompromised host) and onward transmission introduced them back into the human population. This is also in line with an analysis of B.1.1.7 genomes (collected up to November 30th, 2020), which indicated that, since its detection in the UK, this lineage has been evolving with a substitution rate similar to that of other SARS-CoV-2 lineages (Rambaut et al., 2020b).

Recombination in SARS-CoV-2
New evidence points to the occurrence of recombination in SARS-CoV-2 (Latinne et al., 2020; VanInsberghe et al., 2021). This is not a novelty in coronaviruses (Gribble et al., 2021) and previous analyses have shown that recombination has played an important role in the evolution of SARS-CoV-2 from its ancestors (Boni et al., 2020; Kirtipal et al., 2020; MacLean et al., 2021; Wells et al., 2021). Some recombinant sequences detected in the UK involve a breakpoint near the 5' end of the spike gene from B.1.1.7 variants, but there has been no indication of changes in their phenotypic properties nor an increase in their frequency above the threshold for gaining the assignment of a new PANGO lineage. Nevertheless, in light of the possible epistatic effects of mutations in different portions of the SARS-CoV-2 genome (McCallum et al., 2021a), it will certainly be necessary to closely watch the emergence and possible spread of recombinant genomes of this virus.

Possible impact of SARS-CoV-2 variants on the performance of molecular testing methods
Since COVID-19 has reached pandemic status causing a serious global health threat, a widespread availability of diagnostic testing is crucial to detect SARS-CoV-2 in a variety of specimen types collected from both symptomatic and asymptomatic patients. Many academic and commercial clinical microbiology laboratories and companies have worked around the clock to develop molecular tests that are fast, highly accurate, and inexpensive to meet testing demands. In addition, more accessible and scalable testing is a critical component in managing the COVID-19 pandemic. Molecular tests detect genetic material and they are sensitive enough to be able to pick up very small amounts of viral RNA very early in an infection. Molecular tests are considered the gold standard diagnostic test for SARS-CoV-2 detection. Molecular tests use two major techniques known as reverse transcription polymerase chain reaction (RT-PCR) and isothermal amplification. Another emerging molecular technology is the clustered regularly interspaced short palindromic repeats (CRISPR). The Sherlock™ CRISPR SARS-CoV-2 kit is the first CRISPR-based diagnostic test receiving an EUA by the FDA. This assay is intended for the qualitative detection of the SARS-CoV-2 virus in upper respiratory tract specimens from patients suspected of having COVID-19.
The COVID-19 molecular tests have been designed and developed based on genomic information of SARS-CoV-2. Although molecular tests can provide rapid and accurate diagnosis of the infection, there is a considerable risk of misdiagnosis due to genomic variations, which may have a critical impact on test performance. Molecular tests often use different primer/probe sets targeting different regions of the SARS-CoV-2 genome. To safeguard against potential mutational drift in the SARS-CoV-2 genome, many molecular tests were developed to amplify and detect at least two conserved regions. Molecular tests designed to detect multiple SARS-CoV-2 genetic targets are less susceptible to the effects of genetic variation than tests designed to detect a single genetic target. The Xpert Xpress SARS-CoV-2/Flu/RSV test targets the N2 and E genes of SARS-CoV-2. The BioFire COVID-19 test consists of three independent and non-overlapping assays targeting the ORF1ab and ORF8 sequences. In the Simplexa COVID-19 Direct assay, two different regions of the SARS-CoV-2 genome, ORF1ab and S gene, are amplified. The Panther Fusion SARS-CoV-2 test detects two conserved regions of the ORF1ab gene; the two regions are not differentiated, and amplification of either or both regions leads to a fluorescence signal. A recent study performed an in silico reassessment of previously published primers and probes for COVID-19 diagnosis using a total of 17,026 SARS-CoV-2 sequences. Mutations or mismatches in primer/probe binding regions were discovered in seven out of 27 molecular assays. While the US-CDC-N-1 probe in the US CDC 2019-nCoV Real-Time RT-PCR Diagnostic Panel showed only one mismatch with 1.6% of viral sequences, the CN-CDC-N forward primer (developed by China CDC, China) had three mismatches with 18.8% of viral sequences. The reverse primer of NIID-JP-N (developed by National Institute of Infectious Diseases, Japan) also showed one mismatch with all the sequences (Khan and Cheung, 2020).
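An in silico reassessment of this kind boils down to comparing each primer or probe with its binding site in many genomes and tabulating the mismatches. The sketch below is not the pipeline of Khan and Cheung (2020); it assumes that the binding sites have already been extracted from an alignment, are on the same strand and of the same length as the oligonucleotide, and contain no degenerate (IUPAC) bases. The sequences are toy examples, not real primers.

```python
from collections import Counter
from typing import Dict, Sequence


def count_mismatches(oligo: str, site: str) -> int:
    """Number of positions at which an oligo differs from its binding site."""
    if len(oligo) != len(site):
        raise ValueError("oligo and binding site must have the same length")
    return sum(1 for a, b in zip(oligo.upper(), site.upper()) if a != b)


def mismatch_profile(oligo: str, sites: Sequence[str]) -> Dict[int, float]:
    """Fraction of genomes showing 0, 1, 2, ... mismatches against the oligo."""
    if not sites:
        raise ValueError("at least one binding site is required")
    counts = Counter(count_mismatches(oligo, site) for site in sites)
    return {n: counts[n] / len(sites) for n in sorted(counts)}


if __name__ == "__main__":
    toy_primer = "ACGTACGTAC"               # hypothetical sequence
    toy_sites = (
        ["ACGTACGTAC"] * 8                   # perfect matches
        + ["ACGTACGTAT"]                     # one mismatch
        + ["TTTTACGTAC"]                     # three mismatches
    )
    for n, fraction in mismatch_profile(toy_primer, toy_sites).items():
        print(f"{n} mismatch(es): {fraction:.0%} of sequences")
```

A real analysis would additionally need to handle alignment gaps, ambiguous bases, and reverse-complemented primers.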
Since single nucleotide polymorphisms (SNPs) of the SARS-CoV-2 genome are now a regular occurrence, with more discovered every day, it is unrealistic to avoid all SNPs in the different primer/probe binding sites. However, many molecular tests can tolerate a few single nucleotide mismatches, which can have little to no impact on their performance. Lefever et al. found that single mismatches located >5 bp from the 3' end have a moderate effect on target amplification and can be tolerated. In addition, four mismatches in a single primer block amplification almost completely, whereas three mismatches in one of the primers must be combined with at least two mismatches in the other primer to achieve the loss of target hybridization (Lefever et al., 2013). Recently, Ziegler and co-authors reported a case tested with the Xpert Xpress SARS-CoV-2 assay with a cycle threshold (CT) value of 22.7 for the E gene, but a negative result for the N2 gene. However, other platforms including the Allplex SARS-CoV-2 assay, the Charité protocol (Charité - Universitätsmedizin Berlin Institute of Virology, Germany) and the US CDC 2019-nCoV Real-Time RT-PCR Diagnostic Panel revealed positive results for all designed targets, including the N gene with CT values of 26.2-27. Sanger sequencing of the two independent PCR amplicons revealed three SNPs compared with the SARS-CoV-2 strain Wuhan-Hu-1 reference genome (Ziegler et al., 2020).
Given the high frequency of SNP occurrence, the FDA alerts clinical laboratories that false negative results may occur with any molecular test for the detection of SARS-CoV-2 if a mutation occurs in the part of the virus' genome assessed by that test. Since primer/probe sequences of most commercial assays are not revealed, the FDA closely monitors the potential negative impact of genetic variation on molecular tests that have received Emergency Use Authorization (EUA). Based on the FDA's analysis to date, the Accula SARS-CoV-2 test performance may be impacted when a SARS-CoV-2 strain having a genetic variant at position 28,881 (GGG to AAC) is tested. Two other molecular tests, the TaqPath COVID-19 Combo Kit (which may also be labelled as the TaqPath COVID-19 Combo Kit Advanced) and the Linea COVID-19 Assay Kit, have significantly reduced sensitivity due to certain mutations, including one of the mutations in the recently identified VOC 202012/01 variant (lineage B.1.1.7 or 20I/501Y.V1). Because these tests are designed to detect multiple genetic targets, the overall test sensitivity should not be impacted. However, the pattern of detection when certain mutations are present may help with early identification of new variants in patients to reduce further spread of infection. Laboratories should be aware of the pattern of detection associated with certain mutations, including those of the B.1.1.7 variant: specifically, a pattern of 2/3 positive targets showing the S-gene drop out (reduced sensitivity with the S-gene target, also denoted as SGTF, S gene target failure) when using the TaqPath COVID-19 Combo Kit, and a pattern of 1/2 positive targets showing the S-gene drop out when using the Linea COVID-19 Assay Kit. Recently, the FDA reported that the Cepheid tests are impacted by a single point mutation in the target area of the test. Two independent single point mutations reduce the test's sensitivity for detecting the N2 target.
This observation is unexpected, and the FDA's analysis suggests that the impact of a single point mutation on the test performance is associated with the unique chemistry of the Cepheid tests. The E target is still detected when enough virus is present, leading to a "presumptive positive" result in the Xpert Xpress SARS-CoV-2 and Xpert Xpress SARS-CoV-2 DoD tests. Detection of the E target without detecting the N2 target will be reported as "positive" in the Xpert Omni SARS-CoV-2. The FDA also recommends considering repeat testing with a different test (with different genetic targets) if COVID-19 is still suspected after receiving a negative test result (https://www.fda.gov/medical-devices/letters-health-care-providers/genetic-variants-sars-cov-2-may-lead-false-negative-results-molecular-tests-detection-sars-cov-2; https://www.fda.gov/medical-devices/coronavirus-covid-19-and-medical-devices/sars-cov-2-viral-mutations-impact-covid-19-tests).
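The detection pattern described above for a three-target assay (two of three targets positive, with the S-gene target missing) can be flagged automatically in a laboratory information system. The following is a minimal sketch under stated assumptions: the target names ("ORF1ab", "N", "S") and the CT cutoff of 37 are illustrative choices, not values taken from any kit's instructions for use, and a flagged sample would still require sequencing for confirmation.

```python
from typing import Mapping, Optional

TARGETS = ("ORF1ab", "N", "S")   # assumed target names for a three-target assay


def classify_targets(ct_values: Mapping[str, Optional[float]],
                     ct_cutoff: float = 37.0) -> str:
    """Classify a result from per-target CT values (None means not detected).

    ct_cutoff is a hypothetical threshold chosen for illustration only.
    """
    detected = {}
    for target in TARGETS:
        ct = ct_values.get(target)
        detected[target] = ct is not None and ct <= ct_cutoff

    if all(detected.values()):
        return "positive"
    if not any(detected.values()):
        return "negative"
    if detected["ORF1ab"] and detected["N"] and not detected["S"]:
        # 2/3 targets positive with S-gene drop out (SGTF)
        return "positive, S-gene target failure"
    return "inconclusive"


if __name__ == "__main__":
    example = {"ORF1ab": 21.4, "N": 22.0, "S": None}   # hypothetical CT values
    print(classify_targets(example))
```

Samples flagged as S-gene target failure are candidates for follow-up sequencing, since the drop out is a proxy signal and not a definitive lineage assignment.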
Viruses often mutate, and SARS-CoV-2 is no exception. As a consequence, there is an urgent need for continued surveillance of viral evolution and for fast, accurate and sensitive detection methods. Viral metagenomics has emerged as a powerful method to detect SARS-CoV-2 mutants. The Illumina COVIDSeq test (Illumina, Inc.) is the first next-generation sequencing (NGS) test approved for use under the EUA. This amplicon-based NGS test can amplify up to 98 targets on the SARS-CoV-2 genome for highly accurate detection. Other NGS in vitro diagnostic tests under EUA include the Clear Dx SARS-CoV-2 test on the Oxford Nanopore GridION Sequencer (Clear Labs, Inc.), Guardant-19 on the Illumina NextSeq 500 & NextSeq 550 Sequencing Systems (Guardant Health, Inc.), and the Helix COVID-19 NGS test on the Illumina NovaSeq 6000 Sequencing System (Helix OpCo LLC) (https://www.fda.gov/medical-devices/coronavirus-disease-2019-covid-19-emergency-use-authorizations-medical-devices/vitro-diagnostics-euas#individual-molecular). These high-throughput assays should be considered when clinical laboratories want to further characterize a clinical specimen by genetic sequencing after the above-mentioned pattern of detection associated with certain mutations has been identified.

Characterising the immune response to SARS-CoV-2 and the determinants of immunopathology
To understand the likely impact of new variants and facilitate vaccine design, we first need to understand the natural immune response and the way the immune responses differ between asymptomatic, mildly and severely infected individuals. Typically for viral infections, IgG and IgM antibodies are seen after symptom onset, and neutralising antibodies are correlated with resolution of disease. Asymptomatic individuals show long periods of viral shedding and have low levels of antibody-mediated immunity. The magnitude of the neutralising antibody response is positively correlated with disease severity, as is the slow decline of antibody levels (Seow et al., 2020). The major target of these neutralising antibodies is the S protein, both the S1 (containing the RBD) and S2 domains (Jeyanathan et al., 2020). The roles of T cells are key, including for memory responses, with CD4+ T helper cells stimulating the production of antibodies and CD8+ cytotoxic T cells (Sharma et al., 2020). Cytotoxic CD4+ and CD8+ T cells are also known to be associated with disease outcome. CD4+ T cells are generated in high amounts in COVID-19: mild disease and the acute phase of infection have been associated with high SARS-CoV-2 specific CD4+ T cells, while these have been absent where the outcome has been fatal (Rydyznski . Early expression of Type 1 interferons will induce an anti-viral state and promote a pro-inflammatory response through inflammatory cytokines and chemokines. SARS-CoV-2 is associated with delayed Type 1 IFN production: the innate immune response is suppressed, and this delayed response causes an inability to control viral replication, leading to immunopathology, cellular damage of airway epithelia and the lung parenchyma, sometimes resulting in Acute Respiratory Distress Syndrome (ARDS) and an eventual lethal inflammatory cytokine storm (Jeyanathan et al., 2020; Varghese et al., 2020).
COVID-19 tends to be most severe in elderly patients and in those with underlying health complications. In addition, South Asian and Black people have a higher chance of COVID-19-related death than white people, which is only partly accounted for by comorbidities. With age, both innate and adaptive immunity undergo cellular and functional changes, known as 'immunosenescence', characterised by higher baseline inflammatory responses, so that even healthy elderly people have a continual low-grade inflammation known as 'inflammaging'. They have low numbers of naïve T cells, and although they have good memory T cell populations, the diversity of the T cell repertoire is reduced. The efficient adaptive immune response seen in the young, associated with a good prognosis for COVID-19, is not seen in the aging population, where the poor adaptive response fails to control viral replication (Rydyznski . Heightened innate immunity results in release of cytokines and an inflammatory storm, which may cause tissue damage, ARDS, and ultimately prove lethal (Cunha et al., 2020).

Approaches to vaccination: efficacy, protection and severity
Hundreds of vaccines are under development, and a thorough review of these is beyond the scope of this article. For almost all vaccines under development, the major protein target is the S protein, given its direct involvement in infection and its role as a major target of the immune response. The development of vaccines against COVID-19 has seen unprecedented use of a large number of platforms for design and development, which are already well reviewed, together with antigen selection, route of delivery and regimens (Jeyanathan et al., 2020; Grifoni et al., 2020; Sharma et al., 2020; Flanagan et al., 2020).
Examples of well-progressed vaccines that have been through clinical trials and are now being delivered to the general population include the Pfizer/BioNTech, Moderna, and Oxford/AstraZeneca vaccines (Jeyanathan et al., 2020). It is from these first vaccinations that we are beginning to collect empirical data to understand immunity and vaccine efficacy, including with respect to new variants. The Pfizer/BioNTech and Moderna vaccines are both based on mRNA synthesis. These are lipid nanoparticle mRNA vaccines, coding for the RBD of the S-protein, showing very high efficacy and almost identical results in phase III trials. Reports of trials recorded high titres of neutralising antibodies, as well as CD4+ and CD8+ T cell responses (Laczkó et al., 2020; Mulligan et al., 2020; Sahin et al., 2020). Notably, both these vaccines appeared to confer similar protection in young and old age groups.
The Oxford/AstraZeneca vaccine is a non-replicating recombinant viral vector (ChAd) expressing the S protein. Phase I-III clinical trials have taken place in the UK, South Africa, the USA and Brazil, with an overall efficacy of 71% and induction of neutralising antibodies and T cell responses (Folegatti et al., 2020; van Doremalen et al., 2020; Ziegler et al., 2020). It is notable that, regardless of which vaccine is employed and of its reported efficacy, no severe cases have been found amongst vaccinated people to date. Most recently, the AstraZeneca US phase III trial (astrazeneca.com, 22nd March 2021), with 32,499 participants and 141 cases of symptomatic COVID-19, reported 79% vaccine efficacy at preventing symptomatic COVID-19, 100% efficacy against severe or critical disease and hospitalisation, and comparable efficacy across ethnicity and age.
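For readers less familiar with how such efficacy figures are obtained, the basic calculation compares attack rates between trial arms. The sketch below uses the simple risk-ratio definition and entirely hypothetical counts; it does not reproduce the arm-level data, the person-time analysis, or the confidence intervals of any of the trials cited above.

```python
def vaccine_efficacy(cases_vaccinated: int, n_vaccinated: int,
                     cases_placebo: int, n_placebo: int) -> float:
    """Vaccine efficacy as 1 minus the ratio of attack rates between arms."""
    if n_vaccinated <= 0 or n_placebo <= 0:
        raise ValueError("arm sizes must be positive")
    if cases_placebo <= 0:
        raise ValueError("efficacy is undefined without cases in the placebo arm")
    attack_rate_vaccinated = cases_vaccinated / n_vaccinated
    attack_rate_placebo = cases_placebo / n_placebo
    return 1.0 - attack_rate_vaccinated / attack_rate_placebo


if __name__ == "__main__":
    # Hypothetical counts chosen only to illustrate the formula.
    ve = vaccine_efficacy(cases_vaccinated=21, n_vaccinated=10_000,
                          cases_placebo=100, n_placebo=10_000)
    print(f"vaccine efficacy: {ve:.0%}")
```

Because efficacy depends on the ratio of attack rates and not on their absolute values, trials with unequal arm sizes or different background incidence can still be compared on this scale.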

Predicting the immune response and new SARS-CoV-2 variants
As of March 2021, there is a paucity of empirical evidence in the peer-reviewed literature on whether new variants affect the ability to mount an effective immune response, correlate with immunopathology, or reduce the protective immunity conferred by vaccination. However, this situation is rapidly changing, with preprints relevant to the variants of concern (lineages B.1.1.7, B.1.351, and P.1) appearing in 2021. It is envisaged that, during the next few months, data-driven analyses will be performed to review current vaccines and potential modifications to account for loss of efficacy due to new escape variants.
The vast literature that exists on the analysis of viral variants arising during the pandemic is covered elsewhere in this article. Based on the very large databases of viral sequences, along with time of origin and details of sampling, predictive tools have been proposed to assess the likely effects of new variants: not only which variants will be selected for and spread through increased transmissibility, and which will result in increased virulence, but also their effects on immunity and vaccination (Tomaszewski et al., 2020). Interest in predicting the likely development and impact of new variants and combinations of variants has often, but not exclusively, focussed on the S protein and the variable RBD. For instance, bioinformatics and machine learning methods have been applied to predict epitope targets for CD4+ and CD8+ T cell responses and also to select T and B cell epitopes for vaccine design (Kalita et al., 2020;Kiyotani et al., 2020). These predictions have helped to identify T cell targets experimentally and hence to investigate the effects of virus mutations on the immune response. Such predictions, for both T-cell and B-cell epitopes, have been used together with information on the likelihood of viral variants undergoing antigenic drift (Koyama et al., 2020).
A major hurdle in developing vaccines that specifically induce T cell responses is the design of a heterogeneous set of epitopes that can cover the vast heterogeneity of MHC genotypes in humans. In the context of other viral infections, such as DENV, influenza and HIV, several strategies have been developed to achieve an optimal set of epitopes, and these approaches are likely to be underway in COVID-19 vaccine development as well. The use of animal models in our understanding of new SARS-CoV-2 variants and their relevance to vaccine development can be difficult and they are certainly not essential (Flanagan et al., 2020). Interestingly, and potentially benefiting protection through cross reaction, some epitopes from S and N proteins map identically in SARS-CoV-1 and SARS-CoV-2. Identification of SARS-CoV-2 specific T cell epitopes, including those encompassing the new variants, as well as epitopes recognised by T cells potentially cross reactive with related viruses, will be key for understanding immunity and immunopathological sequelae, which will impact on vaccine design (Flanagan et al., 2020;Ahmed et al., 2020;Olvera et al., 2020).

Empirical data: immunity, immunopathology, vaccine efficacy and new variants?
Considering the widespread variants, SARS-CoV-2 D614G became dominant globally in early 2020, but despite greater infectivity and transmission, it is not thought to result in greater severity of disease (WHO.int, 2020). This suggests no marked change in overall immunity to this new variant.
Early information on SARS-CoV-2 lineage B.1.1.7, also designated VOC 202012/01, suggested that this variant shows increased transmissibility but, again, no greater severity of disease, as measured by hospitalisation and 28-day fatality, in a study matching 1769 variant cases with 1769 wild-type control cases. Similarly, reinfection rates did not appear to differ between the 2 groups, with 2 reinfections in the variant case group compared to 3 in the wild-type case group (Public Health England, Investigation of novel SARS-CoV-2 variant, Variant of Concern 202012/01 Technical briefing 2-28 December 2020. PHE: London; 2020). Conversely, the most recent data for the B.1.1.7 lineage suggest that it is associated with increased disease severity. In a large study of COVID-19 cases in England considering 4,945 deaths, Davies et al. (2021b) used S gene target failure as a proxy for infection with B.1.1.7. After controlling for confounding variables, they found an increased mortality rate associated with B.1.1.7: for men in the age range of 55 to 69, the risk of death from COVID-19 increases from 0.6% to 0.9%, and overall there is a 61% higher hazard of death (Davies et al., 2021b). This VOC has also been associated with increased viral loads in respiratory samples, as well as with longer duration of infection and of viral shedding (Calistri et al., 2021;Kidd et al., 2021;Kissler et al., 2021;Borges et al., 2021a); these latter findings help to explain its increased transmissibility.
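The group comparisons quoted above can be restated as simple risk ratios. The Python sketch below uses only the numbers given in the text (2 versus 3 reinfections among 1769 cases each, and an absolute risk of death rising from 0.6% to 0.9%); the function and variable names are illustrative. Note that a crude risk ratio is not the same quantity as the adjusted 61% higher hazard, which comes from a survival model, and that counts as small as 2 versus 3 carry very wide uncertainty.

```python
def risk_ratio(events_a, n_a, events_b, n_b):
    """Ratio of the risk in group A to the risk in group B."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("group sizes must be positive")
    if events_b == 0:
        raise ValueError("risk ratio is undefined with no events in group B")
    return (events_a / n_a) / (events_b / n_b)


# Reinfections in the matched cohort: 2 (variant) vs 3 (wild type) of 1769 each.
rr_reinfection = risk_ratio(2, 1769, 3, 1769)

# Absolute risk of death for men aged 55 to 69: 0.6% rising to 0.9%.
risk_baseline, risk_variant = 0.006, 0.009
rr_death = risk_variant / risk_baseline            # relative increase
absolute_increase = risk_variant - risk_baseline   # additional deaths per case

print(f"Reinfection risk ratio: {rr_reinfection:.2f}")
print(f"Death risk ratio: {rr_death:.2f}")
print(f"Absolute increase in risk of death: {absolute_increase:.1%}")
```

Reporting the absolute increase alongside the ratio helps to keep a large relative change in proportion when the baseline risk is low.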
Since the substitution N501Y in the RBD of the S-protein is common to the three rapidly spreading B.1.1.7, B.1.351 and P.1 lineages, and has been shown to allow more efficient infection of mice (Gu et al., 2020) (Table 3), efforts are underway to test the substitution-specific neutralisation activity of sera from recovered patients and vaccinated people to assess the likely impact on vaccine performance (who.int, 2020). BNT162b2 (Pfizer Inc. and BioNTech SE) is an mRNA-based vaccine encoding the full-length prefusion S protein (Polack et al., 2020;Walsh et al., 2020). The vaccine has been shown to elicit virus neutralising titres similar to those from recovered patients. Isogenic N501 and Y501 SARS-CoV-2 strains were generated and tested against sera from 20 participants of the vaccine trial. The sera had equivalent titres of neutralising antibodies to the 2 viruses (Gu et al., 2020;Xie et al., 2021). The absence of any reduction in neutralising activity against the virus carrying the Y501 variant is consistent with the preserved neutralisation of 15 pseudoviruses carrying other circulating mutations (Sahin et al., 2020). However, another study (Wang et al., 2021), which used a similar in vitro approach, found that the activity against the N501Y variant of plasma from recipients of either the BNT162b2 or the mRNA-1273 (Moderna) vaccine was slightly but significantly reduced; and two further studies on sera from BNT162b2-vaccinated individuals detected only a minor reduction in neutralisation of the B.1.1.7 variant compared to the prototypic strain (Muik et al., 2021;Hoffmann et al., 2021). Finally, an in vivo study using hamsters as model organisms indicated that the ChAdOx1 vaccine (AZD1222, Oxford/AstraZeneca) is protective against the B.1.1.7 lineage (Fischer et al., 2021). Overall, these results suggest that most available vaccines should retain an acceptable efficacy against the B.1.1.7 variant.
Patients infected with lineage B.1.351 (variant 20H/501Y.V2), originally detected in South Africa and now spreading rapidly, albeit not yet globally, show a greater viral load, which is in line with its increased transmissibility and greater disease burden. However, there is no clear evidence of greater disease severity in patients infected with this variant (who.int, SARS-CoV-2 Variants-Disease Outbreak News, 31st December 2020). Information is needed on cross neutralisation, where antibodies to an infecting strain might provide protection to a different strain. Faulkner et al. (2021) reported that antibodies elicited during B.1.1.7 infection showed reduced recognition and neutralisation of parental strains or B.1.351, and that this drop in cross reactivity was greater following infection with B.1.1.7 than with parental strains (Faulkner et al., 2021). The rapid spread of B.1.351, containing multiple S protein mutations, is a cause for concern, not only because of its increased infectiousness but also because there is a suggestion that it could compromise vaccination (Sahin et al., 2020;Tegally et al., 2020). In particular, the E484K substitution is found in B.1.351 and also in P.1 lineages (Table 3, Fig. 1). This mutation is in the RBD recognised by neutralising antibodies. Greaney et al. (2021) longitudinally sampled polyclonal convalescent plasmas and, looking at the 3 main epitopes of the RBD (443-450, RB Motif, and 494-501), found that binding of 11/11 samples was reduced at F456 and of 9/11 at E484. The most important site was E484, where neutralisation by some plasma was reduced >10 fold. However, the impact of epitope variation on responsiveness varied amongst individuals and temporally in the same individual. E484K was predicted to be an immune escape variant by Andreano et al. (2020) and has been described to be present after reinfection (Vasques Nonaka et al., 2021).
Experimentally, SARS-CoV-2 was grown in the presence of serum from a convalescing patient, selecting for mutations that evaded the antibody repertoire. Three mutations were picked up, one of which was E484K. This mutation escaped the patient's neutralising antibodies. We might therefore expect viruses carrying the E484K mutation to have greatly reduced susceptibility to neutralisation in some individuals.
Additional studies further emphasise the need for vigilance, with implications for immunity and vaccination against strains carrying combinations of new variants, particularly since conflicting data are appearing in the literature. Wibmer et al., considering the mutations in SARS-CoV-2 B.1.351, demonstrated the complete escape of this lineage from therapeutically relevant monoclonal antibodies (Wibmer et al., 2021). With either mutation, K417N or E484K (both in the RBD), 3 such antibodies failed to bind, and the same antibodies were unable to neutralise the respective pseudovirus. This lineage also showed substantial or complete escape from neutralising antibodies in COVID-19 plasma. In their study of antibody and memory B cell responses of 20 volunteers who had received either of the mRNA vaccines described above, Wang et al. (2021) showed high IgM and IgG anti-S and anti-RBD titres after a second vaccination, with RBD neutralising antibodies and memory B cell responses similar to those seen after natural infection; however, neutralising activity against E484K- and N501Y-carrying variants, or the SARS-CoV-2 B.1.351 combination K417N:E484K:N501Y, showed a small but significant decrease (Wang et al., 2021;Wibmer et al., 2021). In further monoclonal antibody studies examining B cell memory, 26% of the antibodies showed at least a 5-fold decrease in binding to at least one of the RBD mutants tested, which included E484K. A closely related combination, K417T-E484K-N501Y, is found in the Brazilian P.1 lineage, reported by Faria et al. (2021b) as ~1.4-2.2 fold more transmissible and 25-61% more likely to evade protective immunity elicited by previous infection with non-P.1 lineages (Faria et al., 2021b). Recently, Wu et al. reported the neutralising capacity of sera of humans and non-human primates after vaccination with the Moderna mRNA-1273 vaccine.
When testing against the lineage B.1.1.7 S protein, including the H69-V70 deletion and N501Y, in a pseudovirus system, there was no significant impact on neutralising activity. However, there was a significant reduction in neutralising activity when testing against the B.1.351 lineage in pseudoviruses containing K417N-E484K-N501Y. Similar results were reached by Hoffmann and co-workers, who showed that the B.1.351 and P.1 variants are resistant to therapeutic monoclonal antibodies and to the neutralisation activity of plasma from BNT162b2-vaccinated individuals (Hoffmann et al., 2021;Muik et al., 2021). Finally, a double-blind, randomized, controlled trial with ChAdOx1 in South Africa showed that the vaccine offered limited protection against mild and moderate COVID-19 caused by B.1.351 (Madhi et al., 2021).
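Several of the results above are expressed as fold reductions in binding or neutralising titre against a variant relative to a reference virus, or as the fraction of samples exceeding a given fold reduction. The Python sketch below shows one straightforward way of computing both quantities. The sample names and titres are hypothetical and serve only to illustrate the arithmetic; published studies generally summarise such data with geometric means and formal statistical tests, which are not shown.

```python
def fold_reduction(reference_titre, variant_titre):
    """Fold reduction in titre against a variant relative to the reference virus."""
    if reference_titre <= 0 or variant_titre <= 0:
        raise ValueError("titres must be positive")
    return reference_titre / variant_titre


def fraction_at_least(fold_values, threshold):
    """Fraction of samples whose fold reduction is at or above the threshold."""
    fold_values = list(fold_values)
    if not fold_values:
        raise ValueError("no samples supplied")
    return sum(f >= threshold for f in fold_values) / len(fold_values)


# Hypothetical (reference, variant) titres for four samples; not measured data.
titres = {
    "sample_1": (640, 80),
    "sample_2": (320, 160),
    "sample_3": (1280, 128),
    "sample_4": (160, 160),
}
folds = {name: fold_reduction(ref, var) for name, (ref, var) in titres.items()}
fraction = fraction_at_least(folds.values(), threshold=5.0)

print(folds)
print(f"{fraction:.0%} of samples show at least a 5-fold reduction")
```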
Research on neutralising antibodies has focussed initially on the RBD. However, antibodies binding outside this region will also play a part in control of infection. McCallum et al. (2021b) generated monoclonal antibodies from memory B cells recognising the N-terminal domain. A subset of these antibodies had potent neutralising effects. Using a Syrian hamster model, they mapped neutralisation escape mutants and produced an NTD antigenic map. Similarly, although neutralising antibodies, including those recognising the RBD and NTD, will control infection per se, T cell responses will shape later stages of infection and thus disease severity, and are particularly relevant to observations made on cross strain protection and vaccine efficacy. Reports on T cell responses appearing in the literature include studies on model organisms and using cells from convalescent subjects. The Oxford/AstraZeneca vaccine efficacy against B.1.1.7 and B.1.351 was tested in a Syrian hamster model (Fischer et al., 2021). Whilst neutralising antibodies in vaccinated hamsters were much reduced against B.1.351 compared to B.1.1.7, the vaccine was nevertheless effective against clinical disease caused by either variant. This is likely due to protection against severe COVID-19 being mediated by T cells, since T cell epitopes are not affected by the substitutions in these lineages, though mild infection, prior to T cell activity, still occurs. Tarke et al. (2021) reported on CD4+ and CD8+ T cell responses in convalescent subjects and in Moderna and Pfizer/BioNTech vaccinees, comparing recognition of the ancestral strain with that of variant lineages including B.1.1.7, B.1.351 and P.1, and concluded that T cell responses in convalescent subjects are not substantially affected by mutations in SARS-CoV-2 variants (Tarke et al., 2021).

Ongoing investigations concerning host immunity
Since variant strains are rapidly occurring and their novel genetic composition includes new substitutions as well as indels, emphasis in the coming months will need to be placed on identifying specific genomic alterations that significantly impact viral evolution, transmission, and, notably, immune responses, which are critical for successful vaccines. Effort is needed to associate particular mutations, and also their combinations, with prevalence and transmissibility, for instance by identifying changes in antigenic epitopes within novel variants. While the majority of antigen epitopes lie within the S protein, a broad set of epitopes for T cells have also been discovered along the entire length of the genome. It seems likely that mutations outside the S region may impact the immune response. Population data so far are encouraging, as it seems that mutations are not notably impacting the immune response, and disease severity has not been markedly different in people infected with different strains to date. Consortium groups, such as the one at Imperial College in London announced in January 2021, will be required to study the threats from new SARS-CoV-2 variants.
However, recent laboratory findings suggest that caution is needed and that the complex problems associated with 'immune escape', particularly for vaccination and therapeutics, are real (Kupferschmidt, 2021). More investigations are required to test the effect of novel variants on immune recognition and response, for instance the identification of viral epitopes recognised in both recovered patients and vaccinees, which may shed some light on the extent of immune escape. Monitoring will also be required to understand whether novel variants may affect reinfection or vaccine failure. It is possible that vaccines will need to be adjusted annually, as is the case for vaccinating against influenza. Preparations for providing boosters adjusted for new variants are already underway (Kupferschmidt, 2021). Developers of new vaccines are reporting on efficacy against new variants within the approval phase; e.g., data are just appearing for the Novavax S-protein vaccine. Although it is currently thought that immunity to SARS-CoV-2 will last for months after the initial infection at a minimum, too few individuals have been studied, and too little time has elapsed, to firmly establish the duration and nature of immunity to infection or current vaccines.

Importance of large-scale sequencing and surveillance
Since the early identification of SARS-CoV-2, a large effort has been made to characterize the global evolving genetic diversity of the virus. As of March 2021, more than 830 thousand full-genomic sequences of SARS-CoV-2 have been made available on public databases. Genomic characterization and analysis provide valuable tools for the global effort to control the pandemic. Specifically, sequence analysis has been used for phylogenetic classification into clades or lineages using slightly different approaches, such as the GISAID clades (https://www.gisaid.org/) and the PANGO lineages (https://cov-lineages.org/index.html; https://pangolin.cog-uk.io/) (Rambaut et al., 2020a), as detailed above. PANGO classification of newly sequenced viruses can be performed online through the pangolin website (https://pangolin.cog-uk.io/).
Genomic sequencing and analysis have been used for many purposes, for example to identify the geographic origin of SARS-CoV-2 transmission, to uncover the patterns of virus dispersal, to quantify the levels of virus importation, to infer transmission dynamics, and to estimate whether contacts identified by phylogenetic analysis were in accordance with those identified by contact tracing. Specifically, Worobey et al. investigated the temporal and geographic origin of SARS-CoV-2 infections at the early stage of the pandemic in Europe and North America, suggesting that the virus in Washington state and in Italy derived from two independent introductions from China. In a study in Brazil, Candido et al. suggested that the majority of viruses were introduced from Europe between late February and the beginning of March, and that during the early phase of the pandemic the virus spread within the country (Candido et al., 2020). Thereafter, several transmissions from large urban centres were detected across the country, which coincided with the increase in national air travel (Candido et al., 2020). In one of the earliest studies in Iceland, Gudbjartsson et al. found that the founder strains in Iceland were separated from the original haplotype in Wuhan by 5 mutations (Gudbjartsson et al., 2020). They identified the number of distinct viral clades among those who had travelled abroad, and provided a lower bound of virus importation events and the putative geographic origin of the infections. Importantly, they also found that the contacts identified through molecular analyses and contact tracing were highly concordant, thus suggesting that the latter approach can be used for the accurate identification of contacts at highest risk and prevention of SARS-CoV-2 transmission (Gudbjartsson et al., 2020). Multiple molecular epidemiology studies from Europe (i.e. Austria, Scotland, Italy, Romania, the UK, the Netherlands, Greece, France) (Di Giallonardo et al., 2020;du Plessis et al., 2021;Oude Munnink et al., 2020;Popa et al., 2020;Spanakis et al., 2021;Surleac et al., 2020), the Americas (Lemieux et al., 2021), Asia (Ko et al., 2021), or globally (Mastriani et al., 2021) investigated the origin of SARS-CoV-2 transmission during the first pandemic wave, the patterns of viral dispersal, and the dynamics of SARS-CoV-2 lineages (Dellicour et al., 2020).
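Statements such as founder strains being "separated from the original haplotype by 5 mutations" rest on counting the positions at which two aligned genomes differ. The Python sketch below shows this count for a pair of short, made-up sequences; the function name, the toy sequences and the choice to skip gaps and ambiguous bases (`-` and `N`) are assumptions made for illustration, and real analyses operate on full-genome alignments with dedicated phylogenetic software.

```python
def count_differences(seq_a, seq_b, ignore="-N"):
    """Count mismatching positions between two aligned sequences.

    Positions where either sequence has a gap or an ambiguous base are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a not in ignore and b not in ignore
    )


# Toy aligned sequences; NOT real SARS-CoV-2 genome fragments.
reference = "ATGGCTAGCTAGGCTTACGA"
sample = "ATGACTAGCTNGGCTTATGA"

print(count_differences(reference, sample))
```

In this toy pair the sequences differ at two informative positions, while the position carrying an `N` in the sample is ignored rather than counted as a mutation.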
Genomic surveillance is of particular importance for SARS-CoV-2 for many reasons, including the detection of new variants of biological importance, such as the novel lineages, and the monitoring of their local or global spread; this becomes even more important during and after vaccination, when the emergence of immune escape variants is highly likely. Moreover, as shown by previous studies, genomic analysis helps to better understand the spatiotemporal characteristics of SARS-CoV-2 transmission, thus adding value to traditional epidemiology for the control of the pandemic. In this respect, genomic epidemiology can be regarded as a tool to inform public-health decisions and containment strategies.

Conclusions and perspectives
The COVID-19 pandemic has triggered an unprecedented international effort at multiple levels. From the perspective of viral genomics, since its initial phases, the spread of SARS-CoV-2 has been described in real-time through the phylodynamic analysis of viral genomes. The speed and scale of sequencing programs have been extraordinary: as of March 23rd 2021, more than 836,000 high-quality, complete genomes have been sequenced and deposited in public databases (Fig. 3). These efforts have proved to be pivotal for the early identification of novel variants and emerging lineages. However, genomic surveillance strategies and resources are highly heterogeneous across geographic areas, with the overwhelming majority of sequences coming from a few countries (Fig. 3) (The Lancet, 2021). Moreover, the top 10 countries (UK, USA, Denmark, Germany, Canada, Japan, Switzerland, Australia, Netherlands, Italy) produced 83.5% of sequences despite only having 35% of worldwide cases (data from the WHO, https://covid19.who.int/table). This implies that a large portion of the genetic diversity of SARS-CoV-2 remains unsampled and that, should new lineages emerge in regions where surveillance is leaky, they may remain undetected for a long time, with clear consequences on their spread and possibility of control (or lack thereof). Thus, it is essential to raise the bar on genomic surveillance, both in already active areas and in regions where sequencing efforts have lagged behind. Overall, there is a need to improve genomic sequencing capabilities to characterize the genetic diversity of SARS-CoV-2 isolates that are circulating in populations worldwide, in line with WHO guidelines (https://www.who.int/publications/i/item/WHO-2019-nCoV-genomic_sequencing-2021.1). This capability shall enhance diagnostic, therapeutic and preventive strategies.
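The sampling imbalance described above can be made explicit with simple arithmetic on the two percentages given (83.5% of sequences and 35% of cases for the top 10 countries). The Python sketch below is an illustration only; the variable names and the "sequences per case" ratio are our own way of summarising the imbalance and are not a metric taken from the cited sources.

```python
# Shares quoted in the text for the top 10 sequencing countries.
top10_sequence_share = 0.835
top10_case_share = 0.35

# Complementary shares for all remaining countries.
rest_sequence_share = 1.0 - top10_sequence_share
rest_case_share = 1.0 - top10_case_share

# Share of sequences divided by share of cases: values above 1 indicate
# over-representation in the sequence databases, values below 1 the opposite.
top10_ratio = top10_sequence_share / top10_case_share
rest_ratio = rest_sequence_share / rest_case_share

# How many times more densely the top 10 countries are sampled per case.
relative_sampling = top10_ratio / rest_ratio

print(f"Top 10 countries: {top10_ratio:.2f}")
print(f"Rest of the world: {rest_ratio:.2f}")
print(f"Relative sampling density: {relative_sampling:.1f}x")
```

Under these figures, a case occurring in one of the top 10 countries is roughly nine times more likely to be represented by a sequence than a case occurring elsewhere.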
Active surveillance will be even more important in the coming months, when the deployment of large-scale vaccination campaigns and the use of monoclonal antibodies may subject the virus to novel selective pressures and possibly favour antigenic drift.
Finally, as noted elsewhere (Cyranoski, 2021), genomic epidemiology has transformed into an essential tool for adopting informed public health measures quickly, but its full potential will only be realized when embedded in surveillance programs that are widespread, standardized and incorporated into national programmes for pandemic prevention.

Glossary
Antigenic drift: accumulation of mutations in viral protein regions that are recognized by the immune system (antigens).
Backward mutation: a mutation that acts on a previously mutated site restoring the original nucleotide (C → A → C).
Convergent evolution: in general, the evolution of similar traits in two or more distantly related organisms. In this context, the independent acquisition of the same mutation by different viral lineages.
Monophyletic cluster: in this context, a group of sequences that descend from a common ancestor.
Parallel mutations: Mutations that appear repeatedly and independently in different lineages.
Positive selection: the increase in frequency of a beneficial mutation in a population. In coding sequences, positive selection is often detected by searching for genes or sites that show an excess of nonsynonymous changes compared to synonymous substitutions.
Reproduction number: the expected number of secondary cases produced by a single infected individual.
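As an illustration of the distinction between nonsynonymous and synonymous changes used in the definition of positive selection above, the Python sketch below translates reference and mutated codons with the standard genetic code and classifies each change. The codon pairs are given as illustrative assumptions: the first three are the changes generally reported for N501Y, E484K and D614G, and the fourth is a made-up synonymous change. A raw count of this kind is not a dN/dS estimate, which additionally normalises by the numbers of synonymous and nonsynonymous sites.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon, alt_codon):
    """Return (reference amino acid, new amino acid, type of change)."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    kind = "synonymous" if ref_aa == alt_aa else "nonsynonymous"
    return ref_aa, alt_aa, kind


# Illustrative codon changes (assumed, see the paragraph above).
changes = [("AAT", "TAT"), ("GAA", "AAA"), ("GAT", "GGT"), ("TTT", "TTC")]
results = [classify_substitution(ref, alt) for ref, alt in changes]

n_nonsyn = sum(kind == "nonsynonymous" for _, _, kind in results)
n_syn = len(results) - n_nonsyn

for (ref, alt), (ref_aa, alt_aa, kind) in zip(changes, results):
    print(f"{ref}>{alt}: {ref_aa}>{alt_aa} ({kind})")
print(f"nonsynonymous: {n_nonsyn}, synonymous: {n_syn}")
```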

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.