Introduction and effect of natural selection analysis at common mutations of SARS-CoV-2 spike gene in Iran

Highlights • The sequences were classified into 35 haplotypes, which 11haplotypes were new (H1, H2, H3, H4, H6, H7, H11, H13 H15, H16, H25) and have not been reported so far.• Amino acid substitutions were found at 40 positions that 23 were located at S1 subunit and 16 were at S2 subunit and one was at cleavage loop.• The neutrality index (NI) analyses showed a negative departure from the neutral substitution patterns (NI>1) for S1 and S2 subunit in the studied sequences.


Introduction
The pandemic of coronavirus disease 2019 (COVID-19) was caused by SARS-CoV-2 (Zhu et al., 2020).Globally, as of 3April 2023 over 765 million confirmed cases and over 6.9 million deaths have been reported (World Health Organization, 2023).SARS-CoV-2 is a single-strand RNA-enveloped virus.The genome size is 29,881 bp in length encoding structural proteins including the S, E, M, N genes and nonstructural proteins such as 3-chymotrypsin-like protease, papain-like protease and RNA-dependent RNA polymerase (Chan et al., 2020;Goldsmith et al., 2020;Lu et al., 2020).
The surface of the virus is covered by a large number of glycosylated spike proteins which bind to the host cell receptor angiotensin-covering enzyme 2 (ACE2) mediating viral cell entry (Letko et al., 2020).Among all human coronaviruses, the S protein is highly conserved and critical to the virus life cycle (Huang et al., 2020).
The SARS-CoV-2 S protein is made of 1273 amino acid, which is composed of two subunits (S1 and S2) (Huang et al., 2020).The S1 subunit includes an N-terminal domain and the receptor binding domain (RBD) and the S2 subunit contains an internal membrane fusion peptide, two heptapeptide repeat sequences (HR1 and HR2), a membrane-proximal external region and a trans membrane domain (TM) (Robson, 2020).
Evidences show that the S protein is the main surface glycoprotein and the most relevant source of antigens for vaccine development (Martinez-Flores et al., 2021;Sharma et al., 2020).On the other hand, the emergence of new variants may be occurred by random mutations, co-infection with different strains and recombination between their genomes that may increase their transmissibility and/or virulence with a possible reduction of vaccine effectiveness (Gui et al., 2017;Yu et al., 2017).Although antigenic variation is one of the major problems to develop an effective anti-COVID-19 vaccine, Knowledge of the virus genome polymorphisms, help us to design an effective vaccine.
To date, five SARS-CoV-2 variants including Alpha, Beta, Gamma, Delta, and Omicron variants have been declared variants of concern (VOCs) by the World Health Organization on the basis that they exhibit substantially altered transmissibility or immune escape, warranting close monitoring.Compared to the original virus, VOCs have increased transmissibility and have the potential for increasing disease severity (Carabelli et al., 2023;Choi and Smith, 2021).
Early investigations classified available SARS-CoV-2 genome sequences from all over the world into haplotype groups (Forster et al., 2020;Pachetti et al., 2020;Tang et al., 2020).A haplotype includes a set of variants along a chromosome that can be limited to a single gene or include multiple genes (Leitwein et al., 2020).An investigation has grouped the SARS-CoV-2 genome sequences, pertaining to viruses isolated from patients prior to the end of March 2020 into 66 haplotypes (Safari et al., 2021).Other studies have labeled haplotypes as clades, cluster, or lineages (Liu et al., 2020;Rambaut et al., 2020).Here, the SARS-CoV-2 S gene sequences obtained from samples, collecting from different regions of Iran, were classified into haplotypes based on the S gene by using Dna SP Software.

Sample collection
A total of 95 positive saliva samples were collected from 14 different cities including Tehran, Karaj, Sari, Tabriz, Mashhad, Hamedan, Boroujerd, Esfahan, Zabol, Ahvaz, Khorramshahr, Abadan, Abadeh, Jahrom in Iran during 2020− 2023 (Fig. 1).The samples were obtained from patients with symptomatic COVID-19 infections who had referred to medical centers.All studied individuals gave informed consent according to Universidad ethical guidelines.This study has the ethical approval of the Islamic Azad University Sari Branch (IR.IAU.SARI.REC.1402.111).
The characteristics of the studied individuals and their travel history were recorded before sampling.The majority of the patients (~63%) were between 40 and 60 years old.Saliva samples were self-collected under the observation of a healthcare worker.Viral agents in specimen aliquots were inactivated in guanidinium-thiocyanate-based lysis buffer at room temperature for 10 min before extraction of viral nucleic acid.

RNA extraction, cDNA synthesis, and PCR
RNA was extracted from clinical samples with the High pure viral RNA kit (Roche) according to the manufacturer's instructions.After RNA extraction, cDNA was synthesized by cDNA synthesis Kit (Yektatajhiz, Iran).Overlapping PCR was performed to amplify the S gene and primers were designed as described in Salles et al. (2022).PCR amplifications were performed in a 25 μl PCR reaction volume containing 0.2 μM of each primer pair, 1.5 mM MgCl2, 1 × PCR buffer (50 mM KCl, 10 mM Tris-HCl, pH 8.3), 0.2 mM dNTP and 0.5 u of Taq DNA polymerase (Bioron, Ludwigshafen, Germany).The cycling parameters to amplify the fragments were as follows: initial denaturation at 94 • C for 10 min, 35 cycle of denaturation at 94 • C for 1 min, annealing at 57 • C for 1 min, extension at 72 • C for 1 min and a final extension at 72 • C for 10 min (Applied bio system, Hong Kong, China).

Nucleotide sequencing and data analysis
PCR products were sequenced using ABI Big Dye Terminator Reaction ready kit (PE Applied Bio system, Foster City, USA) and an automated DNA sequencer (ABI 310 Genetic Analyzer; PE Applied Bio system).Nucleotide sequences were aligned using MEGA7 (ver.7.0.26).The S1 sequence of the wuhan-Hu-1 (accession No. NC045512) isolate was used as the reference sequence.The haplotypes were classified  based on the S nucleotide sequence using Dna SP (ver.5.10.01).The haplotype diversity (Hd), polymorphic or segregating sits (S), nucleotide diversity (π) and average number of pairwise nucleotide differences within the population (K) were estimated using Dna SP program.Evidence of recombination including Recombination between adjacent polymorphic site (Rb), Recombination per gene (Ra) and minimum number of recombination events (Rm) was also measured using Dna SP program.
To test the signature of neural selection under the null hypothesis based on a comparison of substitutions within and between sequences, the MC Donald and Kreitman test was performed with a single Middle East respiratory syndrome (MERS)-related coronavirus isolate HCoV-EMC (accession No. NC019843) as an out group, under neutrality the ratio of non-synonymous (Pn) to synonymous (Ps) mutations within species is expected to be equal to that of non-synonymous (Dn) to synonymous (Ds) mutations among species.However, under positive or negative selection these ratio will not be equal (Neutrality Index; NI = 1).The higher value of fixed differences among species to that with in species for non-synonymous mutations [Dn/Pn>Ds/Ps] indicates that genetic polymorphisms have been subjected to positive selection.To determine the significant deviations from neutrality, the two tailed Fisher's exact test was computed (P > 0.05).Also, Tajima's test and other analysis were performed using Dna SP software to study the effects of natural selection in the population including Fu and Li's D* and F* tests (Fu and Li, 1993;Tajima, 1989).
Phylogenetic tree based on the S gene sequences was performed by the neighbor joining method in MEGA software to analyses the genetic relationships among the haplotypes from this study and previously reported SARS-CoV-2 sequences from different geographic regions.
Fixation-statistics (Fst) was performed to measure the extent of allele diversity among populations from Germany, Switzerland, Colombia, and USA using Dna SP program.The Fst interpretation at each locus is based on four categories as follows: no genetic differentiation (0), low genetic differentiation (0-0.05),moderate differentiation (0.05-0.15), a value between 0.15-0.25 indicates high differentiation and above 0.25 shows very high genetic differentiation (Balloux and Lugon-Moulin, 2002).

Genetic variation and natural selection
The fragments of the S gene were successfully amplified and sequenced from 95 SARS-CoV-2 infected Saliva samples.The nucleotide sequence data in this study are available in the Gen Bank database under accession numbers OR121421-OR121455.The 95 sequences were classified into 35 haplotypes where haplotype H5 was predominant (17.9%, n = 17).The H9 and H17 were observed in 15 isolates each (15.8%, n = 15) and H27 in 10 isolates (Table 1).Amino acid sequence polymorphisms of the 35 haplotypes, reported in this study, are available in Table 2. Also, the Gen Bank BLAST searches for each haplotype showed that 11 haplotypes were novel (H1, H2, H3, H4, H6, H7, H11, H13 H15, H16, and H25) and have not been reported so far.
Analysis and comparison of the sequences with the isolate Wuhan-Hu-1 at the nucleotide level showed that point mutations occur at 50 sites (19 singleton variables and 31 parsimony informative sites).
Computing the McDonald and Kreitman test to determine whether natural selection contributes to the diversity within and between species showed an excess of synonymous substitutions between species and nonsynonymous substitutions within species compared to the neutral substitutions (PnDs/PsDn > 1) (Table 3).Neutrality index was calculated to be 2.110, 1.712 and 1.509 for the S1, S2+cleavage site and the entire S gene, respectively indicating an excess of polymorphic variation compared to the neutral substitutions due to negative selection and statistically nonsignificant (Fisher' s exact text P < 0.05) (Table 3).Also, the values of D* (F&L) and F* (F&L) tests (Table 5) showed a negative departure from neutral mutations at S1, S2+cleavage site and full length of S gene sequences in the studied population.
Amino acid changes were found at 40 sites where 23 polymorphic sites located at S1 subunit and 16 sites were at S2 and one was at cleavage loop.Out of these polymorphic sites, 38 sites were dimorphic and two were trimorphic (K417T/N, P681H/R) (Table 4) Nucleotide sequence analysis was performed for S1, S2+cleavage site and overall S gene within 95 sequences.Total number of mutation sites for these regions were 27, 26 and 53, respectively (Table 5).The overall nucleotide diversity (π) and Hd for complete S gene were 0.0025 and 0.91, respectively.Also, the average number of pairwise nucleotide differences (K) was 9.822 (Table 5).Results indicated that the extent of genetic polymorphisms (Hd and π) was higher at S1 subunit than S2 (Table 5).
The Fst estimation at the S gene among sequences from Iran, Germany, Switzerland, Colombia and USA showed no genetic differentiation between populations from USA and Germany (Fst = 0.00) whereas the fixation index among isolates from other countries indicated very high genetic differentiation (Fst > 0.25) (Table 6).
Analysis with sliding window plot (window length of 100-bp, step size of 25-bp) of the Dna SP package showed π diversity ranging from 0 to 0.1658.Maximum diversity was found between nucleotide positions 1-150 and then 1426-1525 at S1 subunit.The nucleotide positions 2751-2850 have also a high level of polymorphism at S2 subunit (Fig. 2).Minimum number of recombination events between adjacent polymorphic sites (Rm) for the entire S gene was 6, whereas recombination between adjacent site and recombination per gene were 0.003 and 11.4,respectively.

Phylogenetic analysis
Phylogenetic analysis based on the S gene of SARS-CoV-2 put the isolates from different geographic regions of the world in different subdivision of phylogenetic tree (Fig. 3).

Analysis of the B-cell epitopes and molecule structure of SARS-CoV-2 S antigen
The B cell epitopes predicted using ABCpred server.172 residues were predicted by the threshold of 0.9 to show a given region as a reliable linear B-cell epitope (Fig. 4).The molecule structure of SARS-CoV-2 S antigen was constructed to determine the distribution of mutation sites.The results showed that almost all of these mutations were located in one side of the molecule (Fig. 5).

Discussion
This study was aimed to investigate the extent of genetic variations and effect of natural selection at SARS-CoV-2 spike protein among Iranian isolates.Although, antigenic polymorphism is a major problem to design an effective vaccine, the detailed knowledge of the virus antigenic variations is a prerequisite to design an effective vaccine in various regions of the world.
K417N/T, L452R, E484K, N501Y, P681H, P681R mutations reported in this study was classified in Alpha-Delta variants.K417N appears in the Beta and Omicron variants, while K417T appears in Gamma (Hadj Hassine, 2022).Both of them are resistant to neutralizing antibodies (Khan et al., 2021).K417N mutation in Omicron variant reduces protein stability and increases disease susceptibility (Kumar et al., 2022).L452R is found in the Delta variants and appears to enhance ACE2 receptor binding affinity and can diminish the interaction with vaccine-elicited antibodies (Hadj Hassine, 2022;Tchesnokova et al., 2021).E484K is shared by Beta and Gamma variants (Hadj Hassine, 2022).This mutation improves the capability to escape the immune system by affecting antibody recognition (Greaney et al., 2021).The substitution N501Y is shared by Alpha, Beta, Gamma, and Omicron variants (Hadj Hassine, 2022;Kumar et al., 2022).The N501Y mutation is one of 15 Omicron variant mutations found in RBD.Although it improves protein stability and is well tolerated; however, it is disease-prone (Kumar et al., 2022).This change induces higher concentrations in the pharynx and the nasal cavities and therefor, increases its transmission rate (Conti et al., 2021).The P681H and P681R substitutions are located at the cleavage site and have been detected in Alpha and Delta, respectively (Hadj Hassine, Psthe number of synonymous substitutions within the spike gene; Pnthe number of non-synonymous substitutions within the spike gene; Dsthe number of synonymous changes between species; Dnthe number of non-synonymous changes between species; NIneutrality index; The values were not significant with P < 0.05.2022).The P681R mutation increases viral fusogenicity (Liu et al., 2022).Also, A701V mutation reported in this investigation is significantly resistant to neutralizing antibodies elicited by immunization (Wang et al., 2023), indicating reduced protective efficacy of the vaccines against these SARS-CoV-2 variants and enhance the infectivity of the virus, the evidence shows the necessity of vaccine renewal and providing direction for the development of new vaccines.
Analysis of nucleotide diversity showed more variation at S1 subunit than S2, in other words S2 subunit is more conserved across coronaviruses.This result is similar to previous studies (Ng et al., 2021;Premkumar et al., 2020;Shah et al., 2021).Furthermore a high level of   polymorphism were found among Iranian isolates (H = 35) that may be occurred due to recombination between genomes of SARS CoV-2 strains with passage time.In other words, in addition to random mutations, Co-exiting viruses and recombination between their genomes may significantly lead to the emergence of new SARS-CoV-2 haplotypes (Taghizadeh et al., 2021).
The McDonald Kreitman test was performed as a test of neutrality and findings showed a negative departure from neutral substitution patterns (NI > 1) for S1, S2+cleavage site and the entire length of S gene in the studied sequences supported by the results of the D* (F&L) and F* (F&L) test indicating host immune selection pressure due to antibodies elicited by infection, vaccination or both.The result is in agreement with previous report by (Lu et al., 2023) whereas is in contrast to some other reports (Hou et al., 2022;Morales et al., 2021).
To design an effective vaccine against SARS-CoV-2, selection of antigenic epitopes that induce protective antibodies would be a conductive strategy.Indeed, the co-occurrence of B-cell epitopes and variable sites represents the regions which are more likely to be exposed to the immune system and such residues have been found in 7 positions in this study.Therefore, most of the mutation sites have located outside B-cell epitopes.
In conclusion, present study extends our knowledge of polymorphisms and the prevalence of spike protein alleles in natural populations of SARS-CoV-2 from Iran.Moreover, combination of this information with similar reports from other geographical regions can be helpful in developing vaccine constructs that will elicit immune response against all known variants of the virus.

Funding
None.

Fig. 1 .
Fig. 1.Locations of sample collection and haplotype distribution of 95 SARS-CoV-2 samples in Iran.

F
.Nedaei et al.

Fig. 2 .
Fig. 2. Sliding window plot of nucleotide diversity per site (π) at the spike gene in 95 Iranian samples with the window length 100 bp and step size of 25 bp.The analysis was performed using DnaSP ver. 5. 10. 01.The level of nucleotide diversity was high between the nucleotide positions 1 -150 and then 1426-1525 at S1 subunit also nucleotide positions 2751-2850 at S2 subunit.

Fig. 3 .
Fig. 3. Phylogenetic tree based on the complete S gene sequences.The evolutionary history was reconstructed using the Neighbor-Joining method in MEGA7.The bootstrap consensus tree was inferred from 1000 replicates and values for the nodes are shown in the tree.

F
.Nedaei et al.

Fig. 4 .
Fig. 4. Location of detected SNPs in the present study in the predicted B-cell epitopes using ABCpred server.Seven mutation sites were found in the B-cell epitopes.Gray shadow: SNPs in the current study; underlined shows B-cell epitopes.

Fig. 5 .
Fig. 5. Crystal structure of Spike protein was generated with Web Lab Viewer Lite 4.2.The polymorphic sites are indicated in Black color.All other positions show invariant amino acids.Panels A and B are two opposite faces of the same structure.

F
.Nedaei et al.

Table 1
Distribution of COVID-19 cases and haplotype frequency by different age categories.

Table 2
Amino acid polymorphisms in the SARS-CoV-2 spike protein among Iranian (H = 35) and their comparison with Wuhan-Hu-1 sequence.

Table 3
Null hypothesis results with the McDonald and Kreitman test.

Table 4
Amino acid substitutions at spike protein of SARS-CoV-2 among 95 Iranian samples.

Table 5
DNA sequence polymorphisms estimation and neutrality tests at spike protein of SARS-CoV-2 among 95 Iranian samples.

Table 6
Genetic differentiation estimation at SARS-CoV-2 spike gene among populations from various geographic areas.Fst values are showed in the lower left and P-values are showed in the upper right of the table; nnumber of samples analyzed;* the Fst values were found as significant difference (P < 0.05).