Molecular evolutionary model based on phylogenetic and mutation analysis of SARS-CoV-2 spike protein sequences from Asian countries: A phylogenomic approach

The lethal pathogenic severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) infection has caused the COVID-19 pandemic, posing serious risks to people. The clove-like spike (S) protein that distinguishes coronaviruses from other viruses is important for viral pathogenicity, evolution, and transmission. The investigation of the unique structural mutations of the SARS-CoV-2 spike protein among 34 Asian countries, as well as the resulting phylogenetic relationship, provided critical information in understanding the pathogenesis. This can be utilized for the discovery of possible treatments and vaccine development. The current study analyzed and depicted phylogenetic and evolutionary models useful for understanding SARS-CoV-2 human-human transmission dynamics in Asian regions with shared land borders. Further, integrated bioinformatics analysis was performed to predict the pathogenic potential and stability of 53 mutational positions among 34 coronavirus strains. Mutations at positions N969K, D614G and S884F have deleterious effects on protein function. These findings are crucial because the Asian mutations could potentially provide a vaccine candidate with co-protection against all SARS-CoV-2 strains. This region is vulnerable because of the high population density and the volume of domestic and international travel for business and tourism. These discoveries would also aid in the development of plans for governments and the general populace to implement all required biocontainment protocols common to all countries.


Introduction
Alphacoronavirus, Betacoronavirus, Gammacoronavirus, and Deltacoronavirus are the four genera that make up the Coronavirinae subfamily of the Coronaviridae family, which includes coronaviruses. It has long been recognised that humans and several animal species can contract coronaviruses (CoVs) which can lead to systemic infections of the liver, gut, brain system, and respiratory airways [1][2][3]. Severe acute respiratory syndrome coronavirus (SARS-CoV-2) [4,5], is an enclosed virus with a positive-sense single-stranded RNA that is roughly 30 Kb long [6][7][8]. The first emergence of this virus was in Wuhan, China on December 2019 with numerous cases of severe pneumonia were reported [9]. The structure of this virus, though similar to other lineages, is distinct in organization and function. The capsid is created by the nucleocapsid protein (N), and the envelope that surrounds the genome is connected to three structural proteins called membrane protein (M), spike protein (S), and envelope protein (E) [10]. SARS-CoV-2 has sixteen non-structural proteins (nsp1- 16) in addition to the aforementioned four structural proteins (S, E, M, and N). Total amino acid length of SARS-CoV-2 S is 1273 aa and consists of a signal peptide (amino acids 1-13) located at the N-terminus, the S1 subunit (14-685 residues), and the S2 subunit (686-1273 residues). The S1 and S2 subunits make up the S protein of the SARS-CoV-2, a class I fusion transmembrane structural glycoprotein homotrimer that weighs 180-200 kDa [11]. The clove-shaped structure, which protrudes from the virus's surface, carries the viral binding property to the host cell and facilitates fusion. The S protein is a primary target for antibodies that work to neutralise viruses since it is the virus's fundamental unit that identifies and attaches to the host cell receptor ACE-2 [12,13].
To effectively control SARS-CoV-2 and further slow the present pandemic, it is imperative that the major issues related to its cross-host transmission dynamics, evolutionary mechanisms, variants divergence from the original animal reservoir be addressed. The SARS-CoV-2 genome has a high rate of recombination and adaptive expression strategies which allows them to pass into and infect several hosts through variety of environmental niches. Several studies have shown viral variations at different geographical locations like Southeast Asian [14], European [15] and North American nations [16], and globally [17].
The rapidly changing epidemiology of SARS-CoV-2 makes it imperative for continue updating. It was first suggested that 2019-nCoV and SARS like bat coronaviruses and MERS are related [18]. Comparison of the three strains Wuhan/IVDC-HB-01/2019 (GISAID accession ID: EPI_ISL_402119) (HB01), Wuhan/IVDCHB-04/2019 (EPI_ISL_402120) (HB04), and Wu-han/IVDC-HB-05/2019 (EPI_ISL_402121) (HB05) revealed high identity. However, their differences including the 8a protein is present in SARS-CoV and absent in 2019-nCoV; the 8b protein is 84 amino acids in SARS-CoV, but longer in 2019-nCoV, with 121 amino acids; the 3b protein is 154 amino acids in SARS-CoV, but shorter in 2019-nCoV, with only 22 amino acids [19]. This raises questions on how these differences affected the pathogenicity of 2019-nCoV. Similarly, recent SARS-CoV-2 viruses isolated a from number of patients indicated sequence identity of about 99.9%, potentially implying a very recent host jump into humans [20,21]. SARS-CoV-2 showed 86% similarity to SARSr-CoV (species of virus consisting of strains that are phylogenetically related to severe acute respiratory syndrome coronavirus 1 (SARS-CoV-1) possessing the capability to infect human, bats and other mammals) [22]. Thus, since high identity of former lineage, and that RNA coronaviruses recombination are frequent, SARS-CoV-2 have newly emerged from bat SARSr-CoVs by recombination. Another evidence includes recombination of two existing bat strains, WIV16 and Rf4092 giving rise to the civet SARS-CoV strain SZ3 have been reported [23]. More important is that because αand β-CoVs infect livestock animals they pose significant risk to human. Analysis of 103 SARS-CoV-2 genomes nucleotides have shown evolution into two major types (L and S type); a more prevalent and aggressive L type that evolved during the outbreak from the ancestral less aggressive clone S type [24].
In this study, we have studied thirty-four of SARS-CoV-2 to construct the molecular evolutionary model based on their phylogenetic analysis and to explore the possible mutations while the spread of the infections in the countries of Asia sub-continent. The study mainly deals with identification of phylogenetic relationship shared among spike protein sequences from Asian countries. Mutation is a major evolutionary force that must be studied and understood to understand evolution. Therefore, to understand the evolutionary process characterization of the rates and patterns of mutation is performed in present study. This study focuses on spike protein sequences from 34 Asian countries only because of high proportion of frequent cross-border travels and high percentage of genetic similarity shared among population setting of Asian countries. The mutational rates of RNA virus alter the function of protein. These can interfere with the functions of the transcribed proteins by affecting their stability and modulating interactions with other biological molecules [25]. By altering their stability and regulating interactions with other biological molecules, these can prevent the transcribed proteins from carrying out their intended functions. Understanding various biological processes, such as resistance to disease and medication, is crucial for understanding the effects of mutations on protein stability and interactions, which may also have an impact on the severity of a disease [26]. Therefore, computational techniques are utilized to predict the effects of mutations on protein stability and pathogenicity in order to support the quick and routine analysis of sequencing data required for personalized medicine. This results in the identification of mutation sites in spike gene region (substitutions and deletion), which can be utilized for target-based drug and vaccine discovery. Such phylogenomic analysis results in better understanding of shared evolutionary pattern among genomic strains of SARS-CoV-2.

Dataset collection
The National Center for Biotechnology Information (NCBI) and representative sequences from the SARS-CoV-2 database GSAID was mined for the retrieval of complete nucleotide sequences of reference genomes of SARS-CoV-2 from Asian countries by December 23, 2022 (https://www.ncbi.nlm.nih.gov). Reference nucleotide sequences underpinned the possibility of performing numerous studies related to development, variation and diseases. They serve as foundation for medical, functional and diversity studies [26]. The protein sequences of spike protein sequences from each strain were obtained in FASTA format which enables the phylogenetic analysis and identification of plausible substitutions of nucleotide bases at various genomic positions.

Alignment, nucleotide and amino acid variation analysis
The MUSCLE programme version 3.8.31 was used to perform multiple sequence alignment (MSA), and Clustal v2.1 was used to generate percent identity matrices [27]. This enables the identification of genome variation from single-nucleotide polymorphism sites. Further, Nextstrain enables the visualization of the alignment with respective mutation at multiple sites aiming to understand epidemiological understanding and improve outbreak response [28]. The relationship of each variation to spike protein sequences of SARS-CoV-2 genome ORFs was then investigated. MEGA 11 enables the computation of evolutionary model and evolutionary rates [29].

Phylogenomic analysis
Phylogenomic analysis enabled the evolutionary patterns shared among selected viral strains in a defined geographical location. The nucleotide alignment obtained was subjected to NextClade v2.1.0 tool [30]. Further, the output obtained from NextClade was used to generate a maximum likelihood (ML) in IQ-TREE multicore v1.6.12 [31] for phylogenomic analysis, using the best substitution model and other default parameters. Further, phylogenetic analysis was generated based on spike protein genomic regions. The obtained tree was graphically represented in Interactive Tree of Life (iTOL) v5 [32] and newick format [33] where the clades were assigned using the intrataxa classification system that most accurately describes the viral diversity.

Predicting functional impact of mutations in spike protein sequences
Plausible amino acid substitutions affect the functions of protein sequences. Therefore, combination of tools was applied to evaluate the biological relevance of mutated sequences from 34 nations in terms of being considered either as neutral or deleterious. The computationally derived information from PredictSNP [34], MutPred2 [35] renders the consensus results for spike protein sequences. Based on sequence homology and the physical characteristics of the amino acids, PredictSNP is a consensus classifier which represents the combined results from MAPP, nsSNPAnalyzer, PANTHER, PolyPhen-1, PolyPhen-2, SIFT and SNAP. A computer-based system and software called MutPred2 (http://mutpred.mutdb.org/) combines genetic and molecular data to offer probabilistic reasoning on amino acid substitution pathogenicity.

Stability analysis of mutational sites
Structural conformation caused by amino acid variations alters the stability of protein structure. Such structural changes directly affect the pathogenicity and functionality of proteins. Therefore, the stability analysis was performed using I-Mutant 3.0 [36], MUpro [37], to analyze the impact of variants on the Spike glycoprotein. I-Mutant 3.0 (htt p://gpcr2.biocomp.unibo.it/cgi/predictors/I-Mutant3.0/I-Mutant3.0. cgi) utilizes single-point mutations to predict protein stability changes. It is based on support vector machine classifier algorithm to estimate change in protein stability by performing linear regression. In addition, combination of neural networks and SVM was utilized in MUpro (http:// mupro.proteomics.ics.uci.edu/) to analyze protein stability for single point amino acid mutations.

Evolutionary conservation analysis
To measure conservation degree at each aligned position, ConSurf server (http://consurf.tau.ac.il) [38] enables the analysis and computation of conservation degree at each aligned position of spike glycoprotein. It involves the clustering of amino acid residues at a fixed window length for detection and monitoring of substitutions [39], which computes evolutionary survival rate utilizing Bayesian inference. This platform addresses the conservation patterns by means of score (1quickly developing sites, 5 -average rate, 9-gradually evolving sites). This enables the assessment of structural and functional effect of amino acids within the protein.

Dataset retrieval
The nucleotide sequences were obtained from NCBI virus database and their respective NCBI accession numbers are listed in Table 1. Complete nucleotide sequences from 34 Asian countries were selected and subjected to multiple sequence alignment. This study includes reference genome sequences of all selected countries, which results in the end-to end comparative analysis of multiple sequences.

Alignments, nucleotide and amino acid variation analysis
Multiple sequence alignment identified common evolutionary descents or common structural and function features shared among aligned sequences. The MUSCLE programme was used to perform multiple sequence alignment at default parameters (Gap penalties: gap open = − 2.90, gap extension = 0, hydrophobicity multiplier = 1.20; Maximum Iterations = 16; Method = Unweighted Pair Group Method with Arithmetic Mean). Nextstrain visualized and tracked the mutational sites along multiple sequencing genome sequences which assisted in the generation of phylogenomic relationships among strains. Table 2 enlists the total number of substitutions, deletions, amino acid substitutions, amino acid deletions at whole genome level. In this study, spike protein amino acid substitutions and deletions were taken into consideration for phylogenomic analysis (Table 3).

Phylogenomic analysis
The alignment output file was subjected to NextClade for the construction of phylogenetic tree. Further, iTOL assisted in the visualization of the tree and supports the generation of tree files in newick format ( Figs. 1 and 2). It is a method of representing graph-theoretical trees with edge lengths using parentheses and commas [40]. Heuristic search was used to infer the evolutionary history, which was performed by using the Neighbor-Join and BioNJ algorithms to a matrix of pairwise distances calculated using the Tamura-Nei model, and then choosing the topology with the highest log likelihood value. The result presented that genomes were highly conserved, and phylogenetic reconstruction suggests the effects of major substitutions in evolution and surveillance.

Pathogenicity and stability predictions
To predict pathogenicity and stability, several tools for predicting the impact of mutations on the structure and function of the protein were utilized. Based on the underlying algorithm, mutational sites were categorised as either neutral or deleterious. PredictSNP renders the consensus results (red colour represents deleterious mutations and green colour represents neutral mutations) of all tools (Supplementary File: Sheet 1). A total of eleven mutational sites L452Q, V367F, D614G, N440K, N501Y, Y505H, N764K, N969K, S884F, T95I, G232C were identified as deleterious with predicted conservation scores. Further, consensus stability prediction from I-Mutant 3.0 and MuPro identified three deleterious mutational sites (N969K, S884F and D614G) resulting in decreased stability of spike protein (Supplementary File: Sheet 2).

Conservation analysis
A residue's importance in maintaining the structure and   functionality of proteins is demonstrated by the conservation pattern. ConSurf examines the degree of conservation at each aligned position, highlighting the localized development. Multiple sequence alignment obtained from MUSCLE program was subjected to ConSurf server followed by Bayesian empirical interface computation for the analysis of rate of developmental preservation. Protein function is affected by conserved position of mutation, therefore N969K and S884F positions may have destructive effects by decreasing the stability of the protein (Fig. 4).

Discussion
The SARS-CoV-2 that is causing the current pandemic is spreading through human-to-human transmission dynamics [41]. Comorbidities such as diabetes, hypertension, lung problems, obesity, and malignancies aggravate the disease [42,43]. Although most infected persons only have mild to moderate symptoms, acute respiratory distress syndrome can still happen to certain people [44]. It is not clear how the virus spreads, evolves and adapts to selectively infect different populations leading to evolution of variants at different geographic location. Information about the phylogenetic origins, evolutionary line, and adaptive mutation at SNP, is limited. This study deals with the understanding the evolutionary pattern of SARS-CoV-2 reference strains among 34 Asian countries ( Table 1). The major focus of this study the phylogenetic relationship shared among strains based on the genome sequences their respective spike protein sequences.
The situation in which we exercise public health has been shaped by evolution and its components of natural selection, population migration, genetic drift, and founder effects. And phylogenomic analysis promise for resolving incongruences highlighted in phylogenetic studies [45]. Therefore, reference whole genome sequences were taken into consideration for this study. Reference genome has inherent ability to characterize particular gene, or gene families that are related to species-specific conservation. These are high quality genomes offering completeness, low number of errors and high percentage of sequence assembled into chromosomes [46]. The complete nucleotide sequence of SARS-CoV-2 genomes were subjected to multiple sequence alignment to identify the incurring mutation or variation sites at whole genome level. All plausible variation sites are indicated in Table 2. Maximum mutational variations/substitutions were found among isolates from Japan, Mongolia, Singapore, Laos, Vietnam, Malaysia and Myanmar. The evolutionary history indicates that there were mutations among the genomes, but they were highly conserved. The molecular evolutionary model analysis was performed for 34 genomic strains utilizing the Tamura-Nei model with the discrete Gamma distribution. This model demonstrated that nucleotides formed at various rates and underwent various degrees of transitions and transversions throughout transmission. Further, the nucleotide heterogeneity rate, which was represented in Fig. 3. For ease of calculation, the sum of r values is set to 100 (default). With an estimated value of 0.0500 for the shape parameter for the discrete Gamma Distribution, ML has been estimated for site rates  [46]. Differences in evolutionary rates among sites were modelled using a discrete Gamma distribution (5 categories, [+ G]). In these categories, the average evolutionary rates were 0, 0, 0, 0, 0, and 4.97 substitutions per site. The total nucleotide frequencies were A = 29.88%, T/U = 32.15%, C = 18.36%, and G = 19.62%. Computed tree topology of the maximum Log-likelihood was -43780.941. The results suggested that matrix found significant as the estimated Transition/Transversion bias (R) was 2.90 and confirming that there was heterogeneity among the genomes, which is also supported by MSA using CLUSTAL O v1.2.4. The equality of evolutionary rate between sequences A (BS006524.1SARS-CoV-2/human/Japan) and B (OM881786.1|SARS-CoV-2/human/Singapore), with sequence C (ON527300.1|SARS-CoV-2/human/Malaysia) with maximum substitution was seen using Tajima's relative rate test through the molecular clock. The χ 2 test statistic was 61.79(P = 0.001 with 1 • of freedom). Since the P value here is less than 0.05, the null hypothesis was rejected for equal rates between lineages [47]. There was a total of 29,586 positions in the final dataset (Table 4). Results indicated that there were 90 mutations in BS006524.1, 11 in OM881786.1 and 11 mutations in ON527300.1. The evolutionary rates within the phylogenetic tree of the genomes were determined using CorrTest analysis of the correlation rate among 34 taxa. The ML method was used to examine the tree topology, and the CorrTest analysis score was 1.0000 [48]. As a result, the model is significant, and the null hypothesis was rejected (p value was 0.001 demonstrating evolutionary rates). This demonstrates a strong correlation between the genetic distances in all taxa's genomes and the ML tree (Fig. 1).
Further, spike protein sequences of 34 complete genome sequences were taken into consideration for phylogenetic analysis. Multiple sequence alignment shows maximum substitutions in Japan, Mongolia, Singapore, Laos, Vietnam, Malaysia and Myanmar. Phylogenetic relationship among spike protein sequences results in the similar phylogenetic relationship with several amino acid deletions and substitutions. As per Table 3, deletion of E156, F157 is common among spike protein regions of Laos, Malaysia, Mongolia, Singapore, Uzbekistan and Vietnam. The geographical locations of these connected countries may result in the maximum substitutions. In addition, amino acid substitutions at similar places (T19, G142, R158, L452, D614, D950, P681) results in identification of closely related phylogenetic relationship among these above stated strains ( Fig. 1: Blue colour). Maximum amino substitutions at spike gene sequence make genome sequence of Japan far variable than other sequences from Asia sub-continent ( Fig. 1: Red colour). Spike protein regions are evolutionary related for Lebanon, West Bank, Iraq and Philippines (Fig. 1: Green colour). D614G is common amino acid substitution change among Bangladesh and Iran spike protein sequences of SARS-CoV-2 genomes (Fig. 1: Magenta colour). Further, India, Nepal, Qatar, Pakistan, Saudi Arabia, Turkey, Bahrain, Bhutan, Jordan, South Korea, China, Georgia, Hong Kong etc. are shared closely related   In addition, bioinformatics tools have been used to examine and clarify the function of the SARS-CoV-2 spike protein and natural mutations. According to the results, the most reliable tools for predicting the pathogenicity of the SARS-CoV-2 mutational sites was PredictSNP. This leads to a great understanding of genetic differences in disease susceptibility and drugs responses. For example, the mostly occurred mutation D614G shown deleterious affect with decrease in stability. An interesting issue is the recurrence of this mutation in many phylogenetic SARS-CoV-2 tree lineages and among all people around the world. This study provides the platform to future researchers to unravel the structural conformational differences caused by mutational sites. These sites will be further analyzed for drug and vaccine discovery. Given the epidemiological impact and genomic features of this new outbreak, it is necessary to perform genomic surveillance on this virus in order to design disease control and prevention mechanisms in non-endemic regions.

Conclusion
In conclusion, when compared to genomes from earlier viral outbreaks, all genomes from the present human SARS-CoV-2 epidemic of 2020 show a distinct monophyletic lineage. This increased mutational signature indicates a more rapid evolutionary path than the rate typically seen in coronaviruses. Identification of mutational signatures among SARS-CoV-2 strains from countries on the Asian subcontinent renders the possibility of developing a vaccine that can be beneficial for all nations. To reduce the morbidity and mortality of the continuing pandemic as a worldwide priority, rational medication and vaccine designs that prove successful will be guided by the findings discussed in this review. Further research is necessary to provide the necessary epidemiological insights and tools to better forecast disease dynamics and stop further spread because there is a lack of information regarding the cause and modes of transmission of the present outbreak. It is also necessary to prepare for several possible eventualities, including spillover to developing nations in these new niches for expansion and the emergence of developing pandemic events that may help the illness endure in non-endemic areas.

Funding
This research has been funded by Scientific Research Deanship at University of Ha'il-Saudi Arabia, through project number RG-21 047.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Configuration Count
Identical sites in all three sequences 29,474 Divergent sites in all three sequences 0 Unique differences in Sequence A 90 Unique differences in Sequence A 11 Unique differences in Sequence A 11