Understanding the mutational frequency in SARS-CoV-2 proteome using structural features

The prolonged transmission of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus in the human population has led to demographic divergence and the emergence of several location-specific clusters of viral strains. Although the effect of mutation(s) on severity and survival of the virus is still unclear, it is evident that certain sites in the viral proteome are more/less prone to mutations. In fact, millions of SARS-CoV-2 sequences collected all over the world have provided us a unique opportunity to understand viral protein mutations and develop novel computational approaches to predict mutational patterns. In this study, we have classified the mutation sites into low and high mutability classes based on viral isolates count containing mutations. The physicochemical features and structural analysis of the SARS-CoV-2 proteins showed that features including residue type, surface accessibility, residue bulkiness, stability and sequence conservation at the mutation site were able to classify the low and high mutability sites. We further developed machine learning models using above-mentioned features, to predict low and high mutability sites at different selection thresholds (ranging 5–30% of topmost and bottommost mutated sites) and observed the improvement in performance as the selection threshold is reduced (prediction accuracy ranging from 65 to 77%). The analysis will be useful for early detection of variants of concern for the SARS-CoV-2, which can also be applied to other existing and emerging viruses for another pandemic prevention.

Almost two years into the pandemic, several variants of the SARS-CoV-2 virus have emerged all around the world. Viruses naturally have high mutation rates, which provide critical diversity for natural selection to screen variants with better transmission and survival according to the environment [12,13]. One such example, "mutation D614G in spike protein," is studied extensively. It enhances viral replication in human airway passage tissues and lung epithelial cells by increasing the infectivity and stability of virions [14]. Some of the new strains of SARS-CoV-2, for example, Alpha (B.1.1.7), Beta (B.1.351), Gamma (P.1), Delta (B.1.617.2) or Omicron (B.1.1.529) are more transmissible than the original one [15,16]. Currently, Delta and Omicron variants are dominant strains in circulation. The Delta variants had relatively fewer mutation sites in the Spike protein (T19, E156, L452, T478, D614, P681 and D950) compared to the Omicron variants, which had more than 30 mutation sites in the spike protein. Previously reported Alpha, Beta and Gamma variants had few overlapping mutation sites (N501, K417 and D614) in the spike protein and these variants are no longer in circulation.
As of April 2022, a significant part of the population is vaccinated in most of the countries. Although, there is a possibility of existing or emerging strains to be vaccine-resistant. It is an utmost priority for the scientific community to observe and understand these new strains to terminate the pandemic as early as possible. A deeper understanding of the mutation can lead to early prediction of the viral variants and subsequently in silico protocols can be utilized to identify/design drugs and vaccines [17][18][19].
In the recent advancement in the mutational study of the SARS-CoV-2, Garvin et al. [20] used an artificial intelligence approach to find the mutational hotspots in SARS-CoV-2 genome to provide insights into drug development and surveillance strategies to combat the current and future pandemics. Other studies have compared the binding interface [21,22] and mutational pattern in the SARS-CoV-2 with other similar coronaviruses [23]. Several studies have looked into the evolving geographic diversity of the SARS-CoV-2 [24][25][26][27][28]. Sen et al. [29] have analyzed the structural malleability of viral proteins that may lead to comorbidities. Researchers have also studied the effect of mutation on binding affinity of the known SARS-CoV-2 specific antibodies [30]. The analysis of protein site conservation/mutability was mainly limited to HIV viruses before pandemic to identify immunogens for vaccine [31,32] due to the unavailability of relatively large-scale datasets. Similarly, there have been few studies exploring the conservation of SARS-CoV-2 proteome using multiple sequence alignment to identify the vaccine targets [33,34]. However, these studies depend on prior knowledge of viral variants and may not be effective for the prediction in new viruses. A recent study used unsupervised probabilistic models using direct coupling analysis (DCA) to predict SARS-CoV-2 mutable and constrained positions, which incorporate pairwise epistatic terms and all known coronavirus genomes to allow selective pressure for coronaviruses [35].
The ongoing pandemic is changing the mutation dynamics of the SARS-CoV-2 proteome on a daily basis. However, not all protein sites observe an equal mutation rate in the population. The observed mutability of protein sites can be potentially attributed to the combined effect of intrinsic physicochemical parameters [36] and effect on the transmission/survival of the virus [13]. The intrinsic physicochemical parameters (such as residue composition, surface accessibility, local stability, residue contacts, hydrophobicity, etc.) predicted from the viral sequence-structure information, are well characterized in several studies such as aggregation [37,38], stability [39,40], function [41][42][43], binding [44], etc. These are the inherent properties of the sequence and therefore expected to be applicable to protein site mutability of all multicellular organisms including virus species, over a large time frame [45]. On the other hand, protein-protein interaction and potential residue modifications affecting biological processes are important for preservation of mutation. However, these phenomena cannot be generalized and are specific to organisms [46,47].
In this work, we have analyzed the mutation information from "2019 Novel Coronavirus Resource (2019nCoVR, https://bigd.big.ac.cn/ ncov)" to understand the intrinsic sequence-structure factors that affect the mutability of proteins with respect to reference SARS-CoV-2 proteome from Wuhan. The initial analysis to understand the physicochemical parameters affecting mutability of protein sites in the whole proteome and each protein is done at 30% selection threshold (top 30% high and low mutation sites based on mutant isolate count). We observed that physicochemical features such as residue type, surface accessibility, residue bulkiness, stability and conservation can distinguish the sites with high and low mutation frequency. Further, we developed machine learning (ML) models using these features to classify the high and low mutation sites at different selection thresholds ranging from 5 to 30% and obtained model accuracy in the range of 65-76.7%. It was observed that increasing the confidence level of low and high mutability sites (i.e. lowering the selection threshold) improves the prediction performance of the ML models. We further observed that the physicochemical features of the mutation sites in variant of concern (VOC) and interest (VOI) are potential causes of their higher mutation rate (VOC and VOI information collected in June 2021). The study provides significant insights into the viral mutability, which can be helpful in early detection of potentially harmful viral variants for SARS-CoV-2 and other infectious viruses.

Dataset preparation
We have downloaded the variance annotation dataset from the 2019 Novel Coronavirus Resource (https://bigd.big.ac.cn/ncov/variation/a nnotation) [48] in June 2021. The non-synonymous single nucleotide polymorphism (SNP) entries were considered in the analysis. The structures of the SARS-CoV-2 proteins were obtained from (https://zh anglab.ccmb.med.umich.edu/COVID-19/) [49]. The physicochemical features of the mutation sites were analyzed by classifying the mutation sites at 30% selection threshold, where top 30% mutation sites with high isolate count were considered "high mutability sites" and bottom 30% of the mutation sites with low isolate count were considered "low mutability sites". The remaining 40% in the middle were considered ambiguous to be classified in any category. The number of mutation sites and cutoff of isolate count for 30% selection threshold are given in Table S1. A separate dataset at 10% selection threshold was also prepared to verify the observations of 30% selection threshold.

Collection of sequence and structural-based features
Collection of biologically-relevant features is an important step in machine learning model development and statistical analysis of complex biological problems [50][51][52][53]. We collected several sequence and structure features for the viral proteins from various sources and custom scripts. Briefly, features include relative accessible surface area (r ASA ) [54], all atom residue depth [55], surrounding hydrophobicity (within heavy atoms contact distance of 5 Å) (https://www.iitm.ac.in/bioinfo/ pdbparam/index.html), sequence based physicochemical and energetic features [56], residue type (polar, non-polar and charged) and contacting residues information (within the heavy atom contact distance of 5 Å). The sequence-based features were average values of tripeptides occurring at the mutation site along with one residue on each side. The position specific scoring matrix (PSSM) profiles were generated for each viral protein position using "blastpgp" on the "UniRef90" database [57]. Further, above-mentioned features were filtered based on inter-property correlation (r ≤ 0.8) and statistical difference in mean value of low and high mutability sites (p-value≤ 10-11, at 30% selection threshold). The final dataset contained 21 sequence and structure-based features (Table S2).

Feature selection and development of machine learning (ML) models
We used a forward feature selection approach to select the optimal number of features in the baseline model. Firstly, we selected the best performing feature in the ML model based on Area under the ROC curve (AUC). Further, features were added one by one until the best performance (AUC) of the model was reached (Fig. 1). We restricted to a maximum of six features to avoid overfitting and the final model contains four features with a balance between the number of features and performance. The baseline ML model was developed at 30% selection threshold using "support vector machine (SVM)" and linear kernel in Weka 3.8.6 [58]. The SVM based parameter "BuildLogisticModels" was kept "True" and "optimal complexity parameter (c)" was optimized to 2.0 to obtain the best performance. The rest of the parameters were kept default. The final selected model was also trained on selection thresholds ranging from 5% to 25% to observe change in performance upon undersampling.

Performance evaluation
The performance of the model at 30% selection threshold was evaluated primarily using area under the ROC (receiver operating characteristic) curve. We have also included following performance measures for the final ML model: where TP, TN, FP and FN are the number of true positives, true negatives, false positives and false negatives, respectively. In our study, low mutability sites are considered positive and high mutability sites are considered negative class. The robustness of the model was evaluated using 10-fold and leave-one-out cross-validation (LOOCV). The 10-fold cross-validation was performed 100 times while randomizing the dataset each time. In leave-one-out cross-validation, the regression model was trained on n-1 data points and tested on the remaining one data point, recursively.

Analysis of variants of concern (VOCs) and variant of interest (VOIs)
The list containing mutations in VOCs and VOIs of SARS-CoV-2 virus (designated by WHO as of June 2021) were obtained from https://outbr eak.info/. In addition, we have also collected the list of mutations of interest and mutation of concern. The radar plot for these mutation(s) was plotted using Matplotlib library [59] in python.

Analysis of the dataset
We analyzed 8673 protein sites in SARS-CoV-2 proteome containing at least one mutation among 1079273 isolates (Fig. 1). Firstly, we plotted a histogram for the number of mutation sites in the whole SARS-CoV-2 proteome with respect to their isolate counts (Fig. 2). The histogram represented approximately 90% of the protein sites with less than 1000 isolate count and showed an exponential decay curve, where more than 950 protein sites had less than 10 mutant isolates and only 14 protein sites had mutant isolate count of more than 100,000. The higher isolate count generally denotes mutation in the protein site at an early

Role of sequence and structure-based features on site mutability
We have analyzed several sequence and structure-based features to classify the low and high mutability of the protein sites. We first filtered the features based on the "statistical significance" and "low interproperty correlation" selection criteria (as described in Collection of sequence and structural-based features in Methods section). The features, capable of distinguishing the low and high mutability of protein sites, are further classified into five general categories and discussed in detail. It is also important to note that size of viral proteins (corresponding to number of mutation sites) vary greatly (ranging from 30 to 1757 residues), which can lead to less to no statistical significance in some protein-wise results (Table S1).

Residue type
We observed that residue-type is the most capable feature to classify the high and low mutability of protein sites in the SARS-CoV-2 proteome (Fig. 3). The proportion of bulky aromatic residues (F, Y, W) is significantly higher in the low mutability sites. The positively charged residues (R, K) occur frequently at low mutability sites whereas negatively charged residues (D, E) are present more in high mutability sites. Gly (G), a smaller amino acid with similar physicochemical features as Ala (A) has surprisingly higher frequency in the low mutability. Overall, Ala (A), Cys (C), Asp (D), Phe (F), Trp(W) amino acids have more than 2-fold difference in the frequency in the low and high mutability sites. The probable reason for low mutation frequency observed for bulky and small amino acids is likely to be due to the loss of interactions and steric hindrance, respectively. The mutability of the charge residues is mainly dependent on the environment. However, intravirion environment (RNA) and overall viral surface is negatively charged leading to higher mutation rate in negatively charged residues for better stability [60]. Mutations in R, G, C and W residues have been linked to higher probability of disease-causing mutations in humans [61]. A similar observation in SARS-CoV-2 virus indicates that mutations in these residues may also lead to decrease in fitness of the virus.

Surface accessibility
The Relative accessible surface area (r ASA ) feature calculated from Dictionary of Secondary Structure of Proteins (DSSP) [54] is an important feature to identify mutation sites with high and low mutation rates (Figs. 4a and S1a). In the SARS-CoV-2 proteins, it was observed that high mutability sites also have higher relative accessible surface area and vice versa, for most of the large proteins. The observation is reasonable as most residues in small proteins are surface accessible due to small size. The proteins including E, M, nsp4, nsp6, nsp7, nsp8 and ORF6 showed an opposite or no trend for surface accessibility. The features related to buriedness (such as number of contacts at 5 Å distance and residue depth) also supports the observation of relative accessible surface area (data not shown). Surface accessibility is also important for the interaction with the environment including self/host proteins [62]. Therefore, mutability of these sites can significantly affect the survival or transmission of the virus. It is also important to note that surface accessibility alone is not sufficient to predict the mutability of protein sites as only a small percentage of the surface accessible sites interact with other molecules.

Residue bulkiness
The residue type analysis showed that bulky residues such as aromatic residues are highly preferred in the low mutability sites in the SARS-CoV-2 proteome. The extended analysis using residue volume feature (AAindex id: BIGC670101) supported the observation for all amino acids (Fig. 4b). Similar trend was also observed in each protein (Fig. S1b). However, the p-values from the t-test showed relatively less statistically significant outcomes among other major features discussed. Other related features such as molecular weight also showed that low mutability sites have higher molecular weight and vice versa (data not shown). The bulky amino acids most likely show low mutational frequency due to higher contact order and biosynthesis cost [63][64][65]. This also explains the observation that bulky aromatic groups such as Tyr are preferred at protein interaction sites and are less likely to mutate [21,66,67].

Stability of the mutation site
Understandably, we observed that locally stable protein sites are less likely to mutate to avoid destabilization of the protein structure [68]. We calculated the local average stability of the mutation site and one flanking residue on each side using unfolding enthalpy of the chain (ΔH c ). The feature showed that low mutability sites are more stable compared to high mutability sites (Fig. 4c). The protein-wise analysis also showed similar results except for E, nsp10 and ORF6 proteins (Fig. S1c).

Conservation of the mutation site
Residue conservation is directly linked to the mutability of the amino acids. A residue is likely to be conserved in observed protein if it is conserved in the closest homologous proteins. Therefore, we used several sequence conservation related features derived from the position-specific scoring matrix (PSSM). The average value of the 20 amino acids in the position weight matrix (PWM) was able to classify the high and low mutation sites (Fig. 4d) and it is more negative for specific dominant mutations. The higher chances of random mutations shift the average value of the PWM matrix towards the positive scale. We observed that high mutability sites also have higher values of average PWM, which is consistent in all SARS-CoV-2 proteins except N, nsp1 and nsp7 (Fig. S1d). Therefore, these high mutability sites are more prone to be replaced by any other amino acids. On the other hand, low mutability sites prefer only self (or specific) mutations and are considered relatively more conserved. We have also analyzed the information content (IC) parameter in the PSSM file, which measures the probability of a given PWM to be different from the uniform distribution. Expectedly, we observed a weak negative correlation (− 0.14) between isolate count and information content (Fig. S2a). The high mutability sites are expected to have more uniform distribution of possible mutations leading to decrease in information content [69]. However, it is not sufficient to differentiate high and low mutability sites alone (Fig. S2b).   Fig. 3. Amino acid frequency in low and high mutation sites class.

Analysis for high and low mutability sites using 10% selection threshold
The above-discussed major features are also calculated for an undersampled dataset to observe consistency of the results. The isolate count cutoffs and number of mutation data are reevaluated at 10% selection threshold (881 protein sites in low mutability class and 867 protein sites in high mutability class), which in turn also reduced the dataset size for each protein and restricted statistically significant observations. However, the observations with the undersampled dataset at 10% selection threshold were the same as the observation at 30% selection threshold. In summary, the residue type at the mutation site showed higher presence of Gly, positively charged and aromatic residues in the low mutability sites (Fig. S3). High mutability sites also observed lower values for residue bulkiness and stability features and higher values for surface accessibility and conservation features (Fig. S4).

Machine-learning model development
We further developed a machine-learning model to assess the ability of the intrinsic physicochemical features to predict low and high mutability sites in SARS-CoV-2 proteome. The baseline model is developed at 30% selection threshold (top and bottom 30% of the mutation sites selected based on mutant isolate count) as discussed below:

Development of baseline model
We used a forward feature selection approach to select the optimal number of features in the baseline model (Fig. S5). We observed the best model performance (area under the ROC curve: 0.71) with four features and SMO (Sequential Minimal Optimization) algorithm, a SVM (Support Vector Machine) based method for the classification (see Methods section for more detail). SVM based models have been extensively used in the biological problem due to better interpretability, learnability and generalization [70][71][72]. The selected feature in the SVM model includes residue at the mutation site (residue type; Res), residues flanking the mutation site (Res flank ), relative accessible surface area (r ASA ) and average value of position weight matrix (PWM avg ). Further optimization of the model parameters revealed the accuracy of 65% with sensitivity of 62.4% and specificity of 67.7% with ROC value of 0.711 for the training dataset ( Table 1). The performance of the model was further rigorously tested using different performance measures including 10-fold cross-validation with randomization (average ROC of 0.646 ± 0.002 after 100 iteration) and n-fold cross-validation (ROC = 0.648). The analysis showed that the developed model is robust (Table 1).
We further analyzed the importance of each feature in the ML model. The features "Res" and "Res flank " significantly reduce the performance of the model upon elimination (ROC 0.652 and 0.649, respectively). On the other hand, "Res" feature showed the best performance (ROC = 0.621) when only one feature was used in the model. Therefore, we concluded that "Residue type (Res)" feature is the most important feature for the classification of the low and high mutability of protein sites (Table S3).

Performance of the machine learning model at different selection threshold
The baseline model developed at 30% selection threshold was further tested on other thresholds ranging from 5 to 25%. We observed that the performance of the model increases as the selection threshold decreases ( Table 2). The correlation between area under the ROC curve and selection threshold was also high (r 2 = 0.95; Fig. S6). This is mainly due to the fact that decreasing the selection threshold proportionally setup more stringent conditions for mutations to be assigned to either low or high mutation sites, thus improving the confidence level.

Case study: analysis of variants/mutation of concern/interest
The physicochemical features including surface accessibility, residue bulkiness, stability and conservation were analyzed for the mutation/ variants of concern and interest with respect to average value of all protein sites (Table 3). There was a total of nine protein sites in spike protein containing at least one mutation of interest (L18, K417, N439, L452, S477, S494, N501, P681) or concern (E484). These mutation sites were considered as high mutability sites, where they are expected to have higher than average values for surface accessibility and conservation features and lower than the average values for residue bulkiness and stability features. Among the nine mutations of interest and concern in the spike protein, six mutations (E484, K417, N439, S477, N501, P681) satisfied the criteria for all four features. The remaining three mutation sites L18, S494 and L452 satisfy the criteria for 3, 2 and 1 features, respectively.
A further extended analysis was carried out for all mutation sites in the proteome of SARS-CoV-2 variants of concern (VOC) and interest   Table 3 The features related to mutation probability analyzed for the mutation of concern and mutation of interest.

Note:
The average values are calculated from the mutation sites considered in the current study of SARS-CoV-2 proteome. These mutations of concern/interest are expected to be present at the high mutability sites. The features that do not follow the observed trend in the study are highlighted.  (Table 4). Therefore, higher chances of mutation in these protein sites lead to emergence of new variants that improved the fitness of the virus in terms of better survivability and more transmissibility.

Potential applications
The study will improve our understanding of intrinsic physicochemical parameters affecting mutability of the viral proteome, which in combination with the virus-specific biological features (such as important binding/cleavage sites) can be used to predict the potential future mutations leading to improvement in survivability, infectivity or lethality of the virus. The intrinsic parameters discussed here can be used as a starting point for in silico prediction of future variants of any pathogen. Moreover, exposed protein sites with less probability of mutation can be used as immunogens for vaccine development or potential epitopes for antibody-based therapeutics.

Conclusion
In this study, we have provided insights into the mutability of SARS-CoV-2 proteome from the perspective of intrinsic sequence-structurebased features. The study highlights the role of surface accessibility, residue bulkiness, stability and evolutionary conservation in determining the mutational probability of a protein site. The major advantage of the study is that it does not require any priori information other than the sequence and structure information of the virus of concern. The study leverages the large-scale mutational data (1079273 viral isolates of SARS-CoV-2) to predict the protein sites that are less or more prone to mutations. The study also focuses on the robustness of the inference by utilizing different selection thresholds, as the reference dataset is changing daily. Although, it is also important to note that the study has some limitations such as mutations are considered mutually independent, deletions/insertions are excluded and biological/functional aspects are not considered. Moreover, mutations are considered only with respect to reference Wuhan strain due to lack of real-time mutation data, which may lead to biases towards the early mutations in the SARS-CoV-2 genome. A more sophisticated time series analysis based on real-time viral mutation, effect of concurrent mutations and role of the biologically relevant protein sites can be explored further for greater understanding of viral protein mutability. The dataset/features used in the study can be obtained from the GitHub repository (https://github.com/ puneetrawat/COVID_Mutation_Site).

Declaration of competing interest
The authors declare no competing interests.

Acknowledgement
We thank the Department of Biotechnology and Indian Institute of Technology Madras for computational facilities and the Ministry of human resource and development (MHRD) for HTRA scholarship to DS and MP. This work is partially supported by the Robert Bosch Center for Data Science and Artificial Intelligence (RBCDSAI), Indian Institute of Technology Madras, India to MMG (Project no: CR1718CSE001RBEIBRAV).

Table 4
The mutation sites of variants of concern and interest analyzed for four physicochemical features.

Variant of concern
Mutation sites Number of features satisfying criteria As per the study, the satisfactory criteria for the feature is: 1. High mutability sites are likely to have higher than average value for surface accessibility and conservation, and vice versa. 2. High mutability sites are likely to have lower than average value for residue bulkiness, and stability, and vice versa.