Introduction

One year after SARS-CoV-2 was declared a pandemic by the WHO, some of its variants with various combinations of amino acid substitutions or deletions in the spike protein have taken centre stage [1]. This protein leads to virus entry into human respiratory cells [2, 3]. It is also the major target of neutralising antibodies which are elicited post-infection or vaccine immunisation, and is the target of most vaccine strategies implemented to date [4]. As a result of our surveillance of genomic epidemiology, our institute has observed and described the emergence of a dozen SARS-CoV-2 variants since summer 2020, after the rate of diagnoses fell to almost zero for several weeks between May and June 2020. These included the Marseille-1 and Marseille-4 variants later assigned to Pangolin lineages B.1.416 and B.1.160, respectively, and the Marseille-501 variant corresponding to lineage A.27 [5,6,7,8]. The variants recently considered to be of greatest concern are those carrying amino acid substitutions N501Y and/or E484K within the spike protein [9, 10], as they have increased affinity for the ACE2 cellular receptor, decreased sensitivity to neutralising antibodies, and may escape the immune responses elicited by the vaccines currently used in Western countries [2, 3, 11]. Nevertheless, various other SARS-CoV-2 variants have been reported as emerging worldwide.

At the end of 2020, the incidence of variants carrying the Q677H, or Q677P substitutions in the spike increased, mainly in the USA, where the first sequences originated from Louisiana [12]. Interestingly, these strains were reported as belonging to six independent sublineages, including one, two, and three in Nextstrain clades 20A, 20B, and 20G, respectively. Amino acid 677 of the SARS-CoV-2 spike protein is only three codons upstream of the polybasic/furin cleavage site of S1/S2 spike domains, which is critical for SARS-CoV-2 pathogenesis, as this cleavage induces a spike conformational change that favours the binding to the Angiotensin-Converting Enzyme 2 (ACE2) cellular SARS-CoV-2 receptor and may enhance viral infection [12,13,14,15,16]. We looked in our SARS-CoV-2 genome sequence database which we have been contributing to since February 2020, in order to analyse the prevalence and genotypic patterns of viruses mutated at codon 677 of the spike.

Results and discussion

SARS-CoV-2 genome sequences analysed here were obtained by next-generation sequencing from viral RNA extracted from 200 µL of nasopharyngeal swab fluid collected from patients for the diagnosis of SARS-CoV-2 infection, using the EZ1 Virus Mini Kit v2.0 on an EZ1 Advanced XL instrument (Qiagen, Courtaboeuf, France) or the KingFisher Flex system (Thermo Fisher Scientific, Waltham, MA, USA) following the manufacturer's recommendations [17]. Extracted viral RNA was reverse-transcribed using SuperScript IV (ThermoFisher Scientific) prior to cDNA second strand synthesis with Klenow Fragment DNA polymerase (New England Biolabs, Beverly, MA, USA), with the LunaScript RT SuperMix kit (New England Biolabs), or according to the COVIDSeq protocol (Illumina Inc.). Next-generation sequencing was performed as previously described [17] using either the Illumina Nextera XT paired-end procedure on a MiSeq instrument (Illumina Inc., San Diego, CA, USA), until mid-March 2021; the Oxford Nanopore technology (ONT) on a GridION instrument (Oxford Nanopore Technologies Ltd., Oxford, UK) with cDNA amplification using a multiplex PCR protocol with the ARTIC nCoV-2019 V3 Panel primers (Integrated DNA technologies, Coralville, IA, USA) according to the ARTIC procedure (https://artic.network/), between mid-March and mid-April 2021; or the Illumina COVIDSeq protocol on a NovaSeq 6000 instrument (Illumina Inc), since mid-April. Genome consensus sequences were obtained as previously described [17] by mapping on the SARS-CoV-2 genome GenBank accession no. NC_045512.2 (Wuhan-Hu-1 isolate) with the CLC Genomics workbench v.7 (https://digitalinsights.qiagen.com/) or the Minimap2 software [18]. Sequences described in the present study have been deposited on the GISAID sequence database (https://www.gisaid.org/) [19] (Supplementary Table S1). The phylogenetic tree was based on the SARS-CoV-2 genomes obtained in our laboratory and on the 67 genomes the most similar to them retrieved using the GISAID BLAST tool (https://www.epicov.org/epi3/frontend#4ee9c) from the GISAID database [18], and on additional reference genomes corresponding to major SARS-CoV-2 variants or to the Wuhan-Hu-1 isolate. The tree was built using the Nextstrain tool (https://docs.nextstrain.org/projects/ncov/en/latest/index.html) [20] that performs maximum-likelihood phylogeny using IQ-TREE [21] then visualised with Auspice (https://docs.nextstrain.org/projects/auspice/en/stable/). The nature and number of nucleotide changes in the SARS-CoV-2 genomes and of amino acid changes in the SARS-CoV-2 proteins were obtained using the Nextclade tool (https://clades.nextstrain.org/results). Classification into Marseille variants was performed using an in house Python script based on sets of mutations. Classification into Nextstrain clades was performed with the Nextclade online tool at URL: https://clades.nextstrain.org/) and classification into Pangolin lineages was performed using the Pangolin online tool at URL: https://cov-lineages.org/pangolin.html [22].

A codon change or deletion was detected in 83 (2.5%) of 3364 SARS-CoV-2 genomes obtained from respiratory samples collected from different patients between February 2020 and April 2021. These 83 genomes were classified in ten different Pangolin lineages, namely A, A.27 (Marseille-501), B.1, B.1.1, B.1.1.10, B.1.1.241 (Marseille-9), B.1.1.7, B.1.160 (Marseille-4), B.1.416 (Marseille-1), B.1.525 (Marseille-484 K.v3) in three, six, four, three, three, one, 38, three, one, and 21 cases, respectively (Fig. 1; Supplementary Table S1).

Fig. 1
figure 1

adapted from screenshots of an output of the Nextstrain tool (https://docs.nextstrain.org/projects/ncov/en/latest/index.html) [20]. Sequences described in the present study are labelled with the light grey colour

Phylogenetic tree of SARS-CoV-2 genomes harbouring a deletion or substitution from Q to H of spike amino acid 677 and obtained from patients diagnosed with SARS-CoV-2 in our institute. The phylogenetic tree was built using the Nextstrain tool (https://docs.nextstrain.org/projects/ncov/en/latest/index.html) [20] that performs maximum-likelihood phylogeny using IQ-TREE [21], and was visualised with Auspice (https://docs.nextstrain.org/projects/auspice/en/stable/). The tree incorporated 69 of the SARS-CoV-2 genomes described here and obtained in our laboratory; 67 genomes corresponding to the best hits of those obtained in the present study, retrieved using the GISAID BLAST tool (https://www.epicov.org/epi3/frontend#4ee9c) from the GISAID database [18]; and additional reference genomes corresponding to major SARS-CoV-2 variants or to the Wuhan-Hu-1 isolate. SARS-CoV-2 genomes harbouring a Q677 deletion are indicated by a black asterisk. This figure is

In seven of the 3634 patients (0.2%), a deletion of five amino acids was observed at spike protein positions 675–679 (QTQTN) which include amino acid 677 (Supplementary Table S1). These sequences were obtained from respiratory samples collected in six cases between 28 March 2020 and 12 October 2020 and in one case on 1 February 2021. They involved three, one, and three viral strains classified in Nextstrain clades 20A, 20B, and 20G, respectively. Pangolin lineages were B.1 (n = 4 cases), B.1.416 (1), B.1.1.241 (1), and B.1.160 (1). The genome obtained in 2021 was a Marseille-4 variant/B.1.160 [7]. This deletion of spike amino acids 675–679 (QTQTN) was reported to abrogate the enzymatic cleavage of the S protein and the efficiency of both TMPRSS2 and TMPRSS13 for facilitating this cleavage, and to possibly restrict late-phase viral replication in Vero E6 cells [23,24,25]. In addition, it was reported to be present at a greater frequency from culture supernatant, in 12 (50%) out of 24 isolated viruses, than from respiratory samples, in three (4%) of 68 cases [24]. This suggests that selection pressures on SARS-CoV-2 regarding infection and/or replication differ in vitro and in vivo. In addition, the rare identification of this deletion in the clinical samples in this latter study and in the present work further suggests that this QTQTN sequence is under strong purifying selection in vivo. It is also worthy to note that the SARS-CoV-2-like bat strain RmYN02 was reported to exhibit a similar QTQT deletion in the spike protein [26].

In 76 patients (2.3%), a spike Q677H substitution was observed in the 3634 SARS-CoV-2 genomes obtained in our institute. These changes involved viral genomes that were all obtained from respiratory samples collected from 19 January 2021, and were classified as belonging to seven different Pangolin lineages (Supplementary Table S1). A majority of these genomes were of Pangolin lineage B.1.1.7, in 41 cases (54% of the 76 genomes and 7.2% of all B.1.1.7 genomes in our database of 3364 SARS-CoV-2 genomes). In addition, two were classified in the Marseille-4/B.1.160 lineage that was first detected in our institute in July 2020 [7] and accounted for 573 (17%) genomes in our database, although its members did not harbour this spike Q677H substitution before 19 January 2021. Also, eight genomes were classified in the Marseille-501/A.27 lineage that was first detected in our institute in January 2021, accounted for 18 genomes in our database and which may or may not harbour Q677H substitution [8]. Codon changes from CAG to CAT or CAC leading to this Q677H substitution deoptimise the viral codon usage relative to that of the human genome. Indeed, in the human genome, CAG, CAT, and CAC usage frequencies are 34.2, 10.9, and 15.1, respectively (https://www.kazusa.or.jp/codon/cgi-bin/showcodon.cgi?species=9606), which represent 3.1-fold and 2.3-fold decreases. Worldwide, 13659 SARS-CoV-2 genomes were found to encode this amino acid substitution Q677H according to the CoV-GLUE online tool (http://cov-glue.cvr.gla.ac.uk/) [27] (Fig. 2). They were classified in 229 different Pangolin lineages, the majority (77%) being classified in lineages B.1.2 (n = 5537), B.1.1.7 (1434), B.1.525 (952), B.1.170 (818), B.1.234 (714), B.1.1.316 (560), and B.1.1.284 (512). These genomes were obtained from clinical samples collected in 84 countries, mostly in the USA (89%; n = 7801), England (1866), Japan (610), Denmark (497), Canada (379), Switzerland (314), Germany (309), India (194), and Egypt (122).

Fig. 2
figure 2

Numbers of SARS-CoV-2 genomes harbouring the spike Q677H mutation worldwide according to timeline. Data were collected from the Cov-Glue online tool (http://cov-glue.cvr.gla.ac.uk/) [27]

Therefore, the spike Q677H substitution should be considered as another example of convergent evolution, in addition to spike amino acid substitutions N501Y [28], L452R [29], and L18F [30] which also independently appeared in various lineages. Q677H is notably part of the substitutions that emerged in strains of the B.1.1.7 [31] and B.1.351 [32] variants of concern. Here, we report the presence of this substitution in SARS-CoV-2 of B.1.146/Marseille-1, B.1.160/Marseille-4, B.1.1.241/Marseille-9, and B.1.525/Marseille-484 K.v3 lineages, which are different bona fide variants [17] although all B.1-derived lineages. In addition, we report the presence of the spike Q677H substitution in the A.27/Marseille-501 variant, which is a A-like lineage [8]. Therefore, we found this substitution in lineages that separated at different stages, including early, in the pandemic. Congruently, Hodcroft et al. reported the emergence late 2020 of variants harbouring substitutions Q677H/P in the spike of viruses that were classified into sublineages of Nextstrain clades 20B and 20G as well as into Nextstrain clade 20A [12], and the Q677H substitution has been identified in the B.1.525 lineage first described in Nigeria [33] and as a fast growing mutation in variants of concern [31].

Such convergent evolution is deemed to be the result of positive selection, and suggests that these amino acid substitutions at spike position 677 confer an evolutionary advantage to the virus [28]. The Q677H substitution in the SARS-CoV-2 spike protein is in the close vicinity of the polybasic RRAR furin-cleavage site of the spike S1/S2 boundary, possibly impacting binding between the spike receptor binding domain and ACE2 [12, 15, 16]. It has been hypothesised that histidine protonation in Q677H could induce a conformational switch that may affect the accessibility to protease of this site, which may enhance the cleavage at the S1/S2 junction and viral entry efficiency [1215, 16]. Structural analyses demonstrated that the spike amino acid 677 was located in subdomain SD2 of each of the protomers forming the homotrimeric spike protein, at the beginning of a very flexible loop (residues 675–690) and in the very vicinity of two experimentally observed O-glycosylation sites at T676 and T678 [34,35,36]. Zeng et al. used a luciferase-bearing lentiviral pseudotype-based neutralisation assay to assess the effect of this spike Q677H substitution [37]. They did not evidence a significant effect on the cleavage of the spike protein. In contrast, they reported that this amino acid substitution increased viral infectivity and syncytium formation. In addition, they reported that when present in variants of concern of the B.1.1.7 and P1 lineages, it increased viral infectivity by 150 and 26%, respectively, and it decreased susceptibility to neutralisation by serum samples from recipients of the Moderna mRNA-1273 and Pfizer BNT162b2 vaccines of 22 and 29%, respectively. Finally, viral neutralisation in the presence of a monoclonal conformation-dependent antibody targeting the spike receptor binding domain led to a 50% reduction in neutralisation of 677H-harbouring spike relative to 677Q-harbouring spike (wild type), and of Q677H/D614G-harbouring spike relative to D614G only-harbouring spike. This suggested an alteration of the conformation of the spike receptor binding domain. Thus, overall, the functional consequences of the spike Q677H substitution and its epistatic interactions with other amino acid substitutions located inside or outside the receptor binding domain of the spike protein are currently not precisely deciphered. Taking into account the growing prevalence of this substitution in distinct SARS-CoV-2 variants of interest or of concern worldwide and its possible impact on viral infectivity and immune escape, previous data justify carrying out more investigations and monitoring its evolution through genomic surveillance.

Combined with previous findings, present data highlight the great genetic variability of SARS-CoV-2 and warrant broader genomic surveillance of SARS-CoV-2 in order to gain a better insight of the epidemiology and evolution of this virus at the global, national, and local scales.