A study on non-synonymous mutational patterns in structural proteins of SARS-CoV-2.

SARS-CoV-2 is mutating and creating divergent variants across the world. An in-depth investigation of the amino acid substitutions in the genomic signature of SARS-CoV-2 proteins is essential for understanding its host adaptation and infection biology. A total of 9587 SARS-CoV-2 structural protein sequences collected from 49 different countries are used to characterize protein-wise variants, substitution patterns (type and location), and major substitution changes. The majority of the substitutions are distinct, mostly occur at a particular location, and lead to a change in the amino acid's biochemical properties. In terms of mutational changes, Envelope (E) and Membrane (M) proteins are relatively more stable than Nucleocapsid (N) and Spike (S) proteins. Several co-occurring substitutions are observed, particularly in S and N proteins. Substitutions specific to active sub-domains reveal that the Heptapeptide Repeat, Fusion Peptide, and Transmembrane domains in S protein, and the N-terminal and C-terminal domains in N protein, are remarkably mutated. We also observe a few deleterious mutations in the above domains. Overall, this study of non-synonymous mutations in structural proteins of SARS-CoV-2 early in the pandemic indicates diversity amongst virus sequences.


The distinct variants are represented as v1, v2, ..., vk (k is the number of distinct variants observed in each protein category). Considering the group sizes for the different proteins, we observe that more than 95% of the samples from E and M proteins fall into two groups, while 85% of the samples from N and S proteins are distributed in two groups (Figure 1(c)). We observe the maximum variation in S protein, followed by N, M and E proteins. This further reveals the interesting fact that, in terms of mutational changes, E and M proteins are relatively more stable than N and S proteins. The country-wise variants and their counts (sample frequency) for each structural protein are listed in Supplementary-A (Table S1).
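
The grouping of samples into distinct variants can be illustrated with a minimal sketch. It assumes that two samples belong to the same variant when their protein sequences are identical, and that variants are labeled v1, v2, ..., vk in order of decreasing group size; the ordering, the function names `group_variants` and `top_group_share`, and the toy sequences are illustrative assumptions, not taken from the study.

```python
from collections import Counter


def group_variants(sequences):
    """Group identical sequences and label the distinct variants v1..vk.

    Labels are assigned by decreasing group size (an assumption made
    for this sketch); ties keep first-seen order.
    """
    counts = Counter(sequences)
    return {
        f"v{i}": (seq, n)
        for i, (seq, n) in enumerate(counts.most_common(), start=1)
    }


def top_group_share(sequences, top=2):
    """Fraction of all samples covered by the `top` largest variant groups."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no sequences given")
    covered = sum(n for _, n in counts.most_common(top))
    return covered / total


# Toy sequences (made up for illustration, not real protein data).
toy_samples = ["MYSF", "MYSF", "MYSF", "MYSL", "MHSF", "MYSF"]
toy_variants = group_variants(toy_samples)
toy_share = top_group_share(toy_samples, top=2)
```

In the toy data, the two largest groups cover five of the six samples. A share close to 1 corresponds to the situation reported for E and M proteins, where most samples fall into two groups.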

In this section, we first try to investigate the substitution patterns in comparison to our [...] occur for S to other (≈ 13%, 10 distinct types). For spike (S) protein, we observe that the A to other change is the maximum (≈ 14%, 5 distinct types). However, the maximum number of distinct substitutions is observed in S to other (10 distinct types). We also observe several co-occurring substitutions in Spike (S) and Nucleocapsid (N) proteins (Table 2) [...] Aliphatic, Aliphatic to Hydroxyl-containing for M protein, Hydroxyl-containing to Aliphatic for N protein, Aliphatic to Aromatic and Hydroxyl-containing for S protein (Figure 6(a)).
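
A tally of "X to other" substitution types, with their share and number of distinct types, could be computed along the following lines. This is a sketch under stated assumptions: sequences are taken to be pre-aligned to the reference with equal length and no insertions or deletions, every observed mismatch is counted once per sample, and the names `call_substitutions` and `tally_by_source` as well as the toy sequences are hypothetical.

```python
from collections import Counter, defaultdict


def call_substitutions(reference, sample):
    """List (ref_aa, position, sample_aa) for every mismatch.

    Positions are 1-based. Assumes both sequences are already aligned
    to the same length and contain no indels.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (r, i, s)
        for i, (r, s) in enumerate(zip(reference, sample), start=1)
        if r != s
    ]


def tally_by_source(substitutions):
    """For each source residue X, return (share, number of distinct X > Y types)."""
    total = len(substitutions)
    counts = Counter(r for r, _, _ in substitutions)
    targets = defaultdict(set)
    for r, _, s in substitutions:
        targets[r].add(s)
    return {x: (counts[x] / total, len(targets[x])) for x in counts}


# Toy reference and samples (made up for illustration).
toy_reference = "MASAS"
toy_sequences = ["MATAS", "MASAL", "MGSAS"]
toy_subs = [
    sub for seq in toy_sequences for sub in call_substitutions(toy_reference, seq)
]
toy_tally = tally_by_source(toy_subs)
```

Here `toy_tally["S"]` holds the share of substitutions whose source residue is S together with the number of distinct target residues. Swapping the roles of source and target gives the complementary "other to Y" tally.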

In the case of more than 50% of substitutions (except E protein), we observe a change of one [...]

The blue node represents a country (labeled with a three-letter country code), whereas the light red node represents a substitution, with its position indicated within parentheses. An edge between the two kinds of nodes depicts the existence of a particular substitution in a sequence from that country.

We try to highlight and quantify the substitutions in various functional sub-domains of two structural proteins, S and N. We report domain-specific substitutions in S and N proteins in Table 3.

Observed substitutions are mostly unique. At any particular location, no more than two substitutions are observed. In the case of S protein, we observe the maximum substitutions in the HR2 and FP domains (≈ 22% of locations), followed by the Transmembrane (TM) domain (≈ 17% of locations).
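
The "% of locations" figure for a domain can be read as the fraction of positions within the domain that carry at least one substitution. The sketch below follows that reading, which is an assumption about how the percentages were derived. The domain names and position ranges are placeholders and not the real boundaries of HR2, FP or TM.

```python
def substituted_fraction(domain_positions, substituted_positions):
    """Fraction of domain positions with at least one observed substitution."""
    domain = set(domain_positions)
    if not domain:
        raise ValueError("domain has no positions")
    return len(domain & set(substituted_positions)) / len(domain)


# Placeholder domains and positions (hypothetical, 1-based).
toy_domains = {
    "domain_A": range(1, 11),   # positions 1..10
    "domain_B": range(11, 31),  # positions 11..30
}
toy_substituted = [2, 5, 5, 12, 40]  # repeats and out-of-domain hits are allowed

toy_fractions = {
    name: substituted_fraction(positions, toy_substituted)
    for name, positions in toy_domains.items()
}
```

Repeated positions count once, so a location with two different substitutions contributes the same as a location with one.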

Table 1: Top 5 substitution types and associated quantitative information for two categories. For a substitution X > Y, in the first category X is fixed and Y can be any amino acid; in the second category, X is any amino acid and Y is fixed. [...] (Table S3)). [...] (Table 4).

The value of $J(R_m, S_m)$ lies between 0 and 1, i.e. $0 \le J(R_m, S_m) \le 1$. A higher value of $J(R_m, S_m)$ indicates more similarity between the sample sets. We report in Figure
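
The bounded similarity score above can be illustrated with a minimal sketch, assuming that J denotes the standard Jaccard index, i.e. the size of the intersection of two sets divided by the size of their union. The exact definition of the sets R_m and S_m is not given in this passage, so the toy sets below and the function name `jaccard` are purely illustrative.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections, in [0, 1]."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty sets score 0.
        return 0.0
    return len(a & b) / len(union)


# Toy sets with made-up substitution labels (not real observations).
toy_r = {"A10V", "S20L", "T30I"}
toy_s = {"A10V", "T30I", "G40R"}
toy_j = jaccard(toy_r, toy_s)
```

Two of the four distinct labels are shared in the toy sets, so the score is 0.5. Identical sets give 1 and disjoint sets give 0, matching the stated bounds.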

In this work, we performed a sequence analysis on the structural proteins of SARS-CoV-2 reported from different parts of the world. All the unique sequences are grouped, and the variations are analyzed based on the Wuhan-Hu-1 SARS-CoV-2 sequence (as reference). We highlighted various commonly and rarely occurring amino acid substitutions in the four structural proteins. We reported location-wise amino acid substitution patterns in the structural proteins.

The changes in biochemical groupings as a result of substitutions are also reported for the candidate variants. We also highlighted the variations specific to particular countries.

Although we observed many more unique variants in S protein than N protein, in terms [...] Hydroxyl-containing, and from Hydrophilic to Hydrophobic. A few substitutions in functional domains of S and N proteins are also highlighted and found to be deleterious.

The overall study summarizes the diversity amongst viruses sequenced early in the pandemic. We believe that the current findings will help to better understand the disease pathogenesis of COVID-19, followed by the identification of suitable and relatively stable small molecules that may bind susceptible structural proteins such as the frequently changing Spike (S) protein.