Unlocking the puzzle: non-defining mutations in SARS-CoV-2 proteome may affect vaccine effectiveness

Introduction: SARS-CoV-2 variants are defined by specific genome-wide mutations compared to the Wuhan genome. However, non-clade-defining mutations may also impact protein structure and function, potentially leading to reduced vaccine effectiveness. Our objective is to identify mutations across the entire viral genome, rather than focusing on individual mutations, that may be associated with vaccine failure, and to examine the physicochemical properties of the resulting amino acid changes.

Materials and methods: Whole-genome consensus sequences of SARS-CoV-2 from COVID-19 patients were retrieved from the GISAID database. Analysis focused on Dataset_1 (7,154 genomes from Italy) and Dataset_2 (8,819 sequences from Spain). Bioinformatic tools identified amino acid changes resulting from codon mutations with frequencies of 10% or higher, and sequences were organized into sets based on identical amino acid combinations.

Results: Non-defining mutations in SARS-CoV-2 genomes belonging to clades 21L (Omicron), 22B/22E (Omicron), 22F/23A (Omicron) and 21J (Delta) were associated with vaccine failure. Four sets of sequences from Dataset_1 were significantly linked to low vaccine coverage: one from clade 21L with mutations L3201F (ORF1a), A27- (S) and G30- (N); two sets shared by clades 22B and 22E with changes A27- (S), I68- (S), R346T (S) and G30- (N); and one set shared by clades 22F and 23A containing changes A27- (S), F486P (S) and G30- (N). Booster doses showed a slight improvement in protection against Omicron clades. Regarding 21J (Delta), two sets of sequences from Dataset_2 exhibited the combination of non-clade mutations P2046L (ORF1a), P2287S (ORF1a), L829I (ORF1b), T95I (S), Y145H (S), R158- (S) and Q9L (N), which was associated with vaccine failure.

Discussion: Vaccine coverage associations appear to be influenced by the mutations harbored by marketed vaccines. An analysis of the physicochemical properties of amino acids revealed that primarily hydrophobic and polar amino acid substitutions occurred. Our results suggest that non-defining mutations across the proteome of SARS-CoV-2 variants could affect the extent of protection of the COVID-19 vaccine. In addition, alteration of the physicochemical characteristics of viral amino acids could potentially disrupt protein structure or function or both.


Introduction
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), first reported in Wuhan (China) in late 2019, soon spread around the globe. To date, more than 700 million people have been infected and 6.97 million have died worldwide. Despite the availability of vaccines, SARS-CoV-2 remains a cause for concern. Like other RNA viruses, SARS-CoV-2 is endowed with a high mutation rate and high viral load and accumulates mutations with each replication cycle. As a result, viral variants that differ in one or more nucleotides are continuously being generated in infected hosts (1)(2)(3). The nomenclature of SARS-CoV-2 variants plays a critical role in facilitating their clear identification, tracking, and global collaboration in understanding the evolutionary dynamics of the virus. For Nextstrain clade naming, a new major clade earns its designation once it attains a 20% frequency on a global scale, regardless of the time frame. When computing these frequencies, it is crucial to ensure a relatively uniform sampling of sequences across different times and geographical locations due to the considerable disparity in sequencing efforts among countries. Clade names are formulated using a standardized protocol, typically derived from the year of emergence and the subsequent available letter in the alphabet. Additionally, a newly identified clade must exhibit a minimum of two mutations differentiating it from its parent major clade. This systematic approach ensures consistency and accuracy in clade designation, facilitating the clear identification and tracking of viral lineages in genomic surveillance studies. SARS-CoV-2 has undergone several genetic mutations leading to the emergence of various clades since its first identification in 2019. A recent study using Belgian data revealed an enhanced immune escape capability exhibited by the Omicron variant compared to the Alpha and Delta variants, resulting in a substantial reduction in the protective efficacy conferred by both acquired immunity and vaccination. Furthermore, a decline in vaccine effectiveness over time was observed, underscoring the significance of booster doses to sustain long-term immunity (4).
In the Dutch population, the effectiveness of primary and booster vaccination against SARS-CoV-2 infection was estimated overall and in four risk groups defined by age and medical conditions during the Delta and Omicron BA.1/BA.2 periods. The findings underscored the advantages of booster vaccinations in reducing infection rates, particularly within at-risk groups (5).
Investigations of the complete SARS-CoV-2 proteome have often been limited to analyzing only a few sequences or individual sequences, rather than taking advantage of a large sequencing dataset. However, a direct RNA sequencing approach was employed to assess the SARS-CoV-2 transcriptome in Vero E6 cells, and mass spectrometry was used to explore the proteome and phosphoproteome of these virus-infected cells (6). Furthermore, a proteomic analysis of proteins extracted from nasopharyngeal swabs of 12 patients diagnosed with COVID-19 identified 13 different SARS-CoV-2 proteins. Additionally, host proteome analysis revealed that several key host proteins were uniquely expressed in patients with COVID-19 (7). In a separate study, distinct epitopes of seven different proteins were identified using the complete SARS-CoV-2 virus genome obtained from the NCBI database. However, the 12 protein sequences of the genome were formatted as FASTA files using RefSeq accessions (8). Furthermore, a proteome-wide study of SARS-CoV-2 assessed its potential to induce autoimmune diseases by segmenting the proteome into peptides and identifying shared peptides with experimentally confirmed human T-cell and B-cell epitopes (9).
The SARS-CoV-2 genome consists of a single-stranded, unsegmented, positive-polarity RNA molecule [(+) ssRNA] 29,903 nucleotides in length encoding 13 ORFs (10). Two-thirds of the viral genome corresponds to ORF1a and ORF1b, which express the two polyproteins pp1a and pp1ab, the latter through a −1 ribosomal frameshift, and which are processed by two viral proteases into 16 non-structural proteins (nsp). ORF1a encodes nsp1 to nsp11 and ORF1b comprises nsp12 to nsp16. Non-structural proteins make up the replication and transcription machinery and are responsible for the maintenance of the viral genome (11). Some nsp proteins are targets for antiviral drugs, such as nsp12, the RNA-dependent RNA polymerase (RdRp), and nsp5, the 3C-like protease (Mpro, 3CLpro) (12). In addition, nsp3, the papain-like protease (PLpro), is also a therapeutic target for antivirals (13).
The structural proteins are the surface glycoprotein or spike (S), the envelope protein (E), the membrane glycoprotein (M), and the nucleocapsid phosphoprotein (N). The S protein has been shown to play a major role in virus attachment and entry into cells, being a key antigen for development of vaccines and neutralizing antibodies, and a pharmacological target (14)(15)(16). The E protein plays a key role in the pathogenesis of the virus, affecting the binding of SARS-CoV-2 to the tight junction proteins (17). The M protein is responsible for maintaining the shape of the virion by spanning the membrane bilayer and facilitates budding of the viral particles from the host cells (18). Interestingly, the M protein was found to elicit an IgM response during the acute phase of SARS-CoV-2 infection (19). The function of the nucleocapsid is to maintain the genome structure inside the envelope (20). The N protein has been identified as an important target for T-cell response, making it a suitable candidate for next-generation COVID-19 vaccines against emerging variants (21)(22)(23).
The existing literature predominantly concentrates on isolated mutations within the S protein rather than examining mutations as a network involving multiple proteins. Individually, mutations in SARS-CoV-2 might not pose significant risks; however, their collective effect in tandem with other mutations could amplify the virus's transmissibility and virulence. Consequently, relying solely on information about distinct segments of the virus might provide an incomplete understanding. Therefore, a comprehensive study of mutations across the entire SARS-CoV-2 genome and proteome becomes essential. Such a holistic approach is critical in understanding the virus's mechanisms for evading vaccines and is instrumental in the development of effective vaccines and therapies. Mutations (including deletions) that alter protein sequence may affect the physicochemical properties and folding conformation of proteins, resulting in changes in biological functions. Amino acid positions in proteins can be assigned a single conservation number based on the alignment of homologous proteins for quantification of amino acid substitutions (28,29).
Given that structural variations due to mutations could affect vaccine effectiveness and drug function, and thus the severity of COVID-19, this work involved an extensive genome-wide mutation combination analysis of SARS-CoV-2 in vaccinated patients from Italian and Spanish populations as two independent datasets. Analyzing more than 7,000 proteomes from each population increases the statistical power and reliability of the results. Our objective was to identify mutations across the entire viral genome, rather than focusing on individual mutations, that may be associated with vaccine failure, and to analyze the physicochemical properties of the resulting amino acid changes. This study involved aligning SARS-CoV-2 genomic sequences from both vaccinated and unvaccinated COVID-19 patients using the Wuhan genome as reference. We examined the frequency of specific mutation sets and analyzed their physicochemical properties to understand how these mutations may affect the structure and function of viral proteins. This approach provides a comprehensive view of the genetic diversity of SARS-CoV-2 variants circulating in different geographic regions and contributes to a deeper understanding of the mechanisms underlying vaccine effectiveness, which is crucial for informing public health strategies and vaccine development efforts.

SARS-CoV-2 genome sequences
Sequences of COVID-19 patients were retrieved from the Global Initiative on Sharing All Influenza Data (GISAID) (30): (a) Dataset_1 contained 7,154 aligned consensus sequences of SARS-CoV-2 genomes isolated from patients of Friuli-Venezia Giulia (Italy) from January 01, 2021 to June 24, 2023. Of these, 2,419 were fully vaccinated and 1,667 received a booster dose vaccination against the Omicron variant; (b) Dataset_2 contained 8,819 aligned genomes mainly from Catalonia (Spain) from January 01, 2021 to July 25, 2022. Of those, 2,969 were completely vaccinated and 699 received the third COVID-19 vaccine dose against Omicron. Table 1 shows a description of the datasets. Datasets are available in the Supplementary material.
The NetAlign CLI software (version 2.4.0) (31) was used for sequence alignment. Genome and coding sequences (CDS) of the SARS-CoV-2 Wuhan-Hu-1 reference sequence NC_045512.2 were retrieved from GenBank. SARS-CoV-2 variants and mutations were sourced from the CoVariant website, which employs the Nextstrain nomenclature for identification of variants.

Mutations of interest
Changes in the genome, including indeterminations during sequencing base calling, were referred to as non-synonymous or missense mutations when the codons contained mutations (substitutions or deletions) with a frequency of 10% or higher in the genomic sequence alignments. These were referred to as mutations of interest (MOIs). Sequences sharing the same combination of MOIs (haplotypes) were grouped together into sets of sequences. Non-defining mutations were also referred to as additional MOIs.
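The selection and grouping rule described above can be illustrated with a minimal sketch. It is not the authors' script: the function names, the change labels and the toy data are hypothetical, and it assumes that each sequence has already been reduced to a list of amino acid changes relative to the reference.

```python
from collections import Counter, defaultdict

def find_mois(changes_per_seq, threshold=0.10):
    """Return the changes carried by at least `threshold` of the sequences."""
    n = len(changes_per_seq)
    counts = Counter(c for changes in changes_per_seq for c in set(changes))
    return {c for c, k in counts.items() if k / n >= threshold}

def group_by_haplotype(changes_per_seq, mois):
    """Group sequence indices by their combination of MOIs (haplotype)."""
    groups = defaultdict(list)
    for idx, changes in enumerate(changes_per_seq):
        key = tuple(sorted(set(changes) & mois))  # rare changes are ignored
        groups[key].append(idx)
    return dict(groups)

# Toy data (hypothetical): 20 sequences, each a list of changes.
toy = (
    [["S:A27-", "N:G30-"]] * 8
    + [["S:A27-"]] * 6
    + [["S:R346T"]] * 5
    + [["S:A27-", "M:D3G"]] * 1   # M:D3G occurs in 5% of sequences only
)

mois = find_mois(toy)
sets_of_sequences = group_by_haplotype(toy, mois)
for haplotype, members in sorted(sets_of_sequences.items()):
    print(haplotype, len(members))
```

In this toy example the change seen in a single sequence falls below the 10% threshold, so that sequence is grouped with the sequences carrying only A27-.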

Bioinformatic tools and statistical analysis
To identify amino acid changes within SARS-CoV-2 genomes, codon translation in coding sequences (CDS) was accomplished using scripts written in the R programming language (version 4.1.0) (32) and Python (version 3.8.11) (33). The Biopython library (version 1.76) (34) was employed for managing the amino acid alphabet. Qualitative variable analysis was performed using the Chi-square test from the scipy.stats Python module (scipy version 1.9.3) (35). We employed the trackViewer package (version 1.34.0) (36) for creating visual representations of mutations in SARS-CoV-2 proteins. Furthermore, EMBOSS Seqret (version 6.6.0.0) (37) was employed to record significant mutation combinations in mega non-interleaved output format. Data manipulation and analysis were carried out using libraries such as Matplotlib (version 3.5.1) (38) and Pandas (version 1.2.3) (39). In-house scripts used in this study were developed by the authors. Scripts used in our analysis are openly available at: https://github.com/papersarscov2proteome/.
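As an illustration of the codon translation step, the following sketch compares an aligned coding sequence with its reference codon by codon and labels each changed codon in the notation used in this article (for example R346T, or A27- for a deletion). It is a simplified stand-in for the authors' scripts: it uses a hand-built standard genetic code table rather than Biopython, the function names and toy sequences are hypothetical, and it assumes that both sequences are in frame and of equal length.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate_codon(codon):
    """Translate one codon; '---' is a deletion, anything unresolved is 'X'."""
    codon = codon.upper().replace("U", "T")
    if codon == "---":
        return "-"
    return CODON_TABLE.get(codon, "X")

def call_changes(ref_cds, query_cds, gene):
    """List codon-level changes of an aligned CDS relative to the reference."""
    changes = []
    for i in range(0, len(ref_cds) - 2, 3):
        ref_codon = ref_cds[i:i + 3].upper()
        qry_codon = query_cds[i:i + 3].upper()
        if ref_codon == qry_codon:
            continue
        # A synonymous codon change yields a label such as Y4Y.
        label = f"{translate_codon(ref_codon)}{i // 3 + 1}{translate_codon(qry_codon)}"
        changes.append(f"{label} ({gene})")
    return changes

# Toy example (hypothetical sequences, not SARS-CoV-2 data).
ref = "ATGGCTCGTTAT"     # M A R Y
query = "ATG---ACTTAC"   # M - T Y
toy_changes = call_changes(ref, query, "toy")
print(toy_changes)
```

Codons containing ambiguous bases or partial gaps translate to 'X' in this sketch, which mirrors the handling of undetermined amino acids described in the next paragraph.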
Undetermined amino acids and special characters from the Biopython dictionary (B, Z, J, U, and O) were represented as 'X'. Sets containing 'X' were excluded from further analysis, while those with a frequency of 1% or greater were retained for statistical examination. Statistical significance was determined with p values <0.05 (following False Discovery Rate (FDR) correction) (40). Vaccine coverage was identified when expected frequencies exceeded observed frequencies.
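The statistical decision rule can be sketched as follows. Several points are assumptions rather than details taken from the text: the layout of the 2 × 2 table (a given set of sequences versus all other sequences, fully vaccinated versus not fully vaccinated), the absence of a continuity correction, and the use of the Benjamini-Hochberg procedure for the FDR adjustment. The sketch uses only the Python standard library instead of scipy.stats, and all names and counts are hypothetical.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (1 degree of freedom, no continuity correction).

    a: vaccinated patients in the set     b: unvaccinated patients in the set
    c: vaccinated patients elsewhere      d: unvaccinated patients elsewhere
    Returns (chi2, p_value, expected count for cell a).
    """
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    observed = ((a, b), (c, d))
    expected = [[rows[i] * cols[j] / n for j in range(2)] for i in range(2)]
    chi2 = sum(
        (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
        for i in range(2) for j in range(2)
    )
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, df = 1
    return chi2, p_value, expected[0][0]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p values (assumed FDR method)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # from the largest p value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy counts (hypothetical): 10 vaccinated and 40 unvaccinated in the set,
# 40 vaccinated and 10 unvaccinated among all other sequences.
chi2, p, expected_vaccinated = chi_square_2x2(10, 40, 40, 10)
coverage = expected_vaccinated > 10   # expected exceeds observed
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(chi2, p, expected_vaccinated, coverage, fdr)
```

In the toy table, fewer vaccinated patients are observed in the set than expected under independence, which corresponds to vaccine coverage under the criterion stated above; the opposite direction would indicate an association with vaccine failure.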
Genetic distance estimation within each sequence set was carried out using MEGA (41). The mean distance was calculated using the Bootstrap method for variance estimation, with 1,000 bootstrap replications, employing the p-distance model for amino acid substitution type. Ambiguous positions were eliminated for each sequence pair using the pairwise deletion option.
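For readers unfamiliar with these options, the sketch below shows what a mean p-distance with pairwise deletion and a bootstrap standard error amount to. It is an illustrative reimplementation, not MEGA itself: the function names and toy sequences are hypothetical, and treating 'X', '-' and '?' as the positions to skip is an assumption.

```python
import math
import random
import statistics
from itertools import combinations

def p_distance(seq_a, seq_b, ambiguous="X-?"):
    """Proportion of differing sites, skipping sites ambiguous in either sequence."""
    compared = diffs = 0
    for a, b in zip(seq_a, seq_b):
        if a in ambiguous or b in ambiguous:
            continue                      # pairwise deletion
        compared += 1
        diffs += a != b
    return diffs / compared if compared else float("nan")

def mean_pairwise_distance(seqs):
    """Average p-distance over all sequence pairs with comparable sites."""
    dists = [p_distance(a, b) for a, b in combinations(seqs, 2)]
    dists = [d for d in dists if not math.isnan(d)]
    return sum(dists) / len(dists) if dists else float("nan")

def bootstrap_se(seqs, replicates=1000, seed=0):
    """Standard error of the mean distance from resampling alignment columns."""
    rng = random.Random(seed)
    length = len(seqs[0])
    means = []
    for _ in range(replicates):
        cols = [rng.randrange(length) for _ in range(length)]
        resampled = ["".join(s[c] for c in cols) for s in seqs]
        m = mean_pairwise_distance(resampled)
        if not math.isnan(m):
            means.append(m)
    return statistics.stdev(means)

# Toy alignment (hypothetical amino acid sequences).
toy_set = ["MKTAY", "MKTAY", "MRTAX"]
mean_d = mean_pairwise_distance(toy_set)
se = bootstrap_se(toy_set)
print(mean_d, se)
```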

Patients
We classified as having complete active immunity, or fully vaccinated (FV), those individuals who had received a minimum of 2 doses (or 1 dose in the case of the Janssen vaccine) at least 14 days prior to infection, regardless of the specific infectious variant. FV patients infected by the Omicron variant who had received a third COVID-19 vaccine dose (or a second dose of the Janssen vaccine) were designated as booster patients. Vaccines referred to as others include Sinovac and Sinopharm. Individuals who did not meet any of the above criteria were classified as not fully vaccinated.
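These criteria can be written as a small classification function. The sketch below is a hypothetical illustration: the function and parameter names are invented, and it assumes that the booster dose only needs to precede the infection date, since the text does not state a minimum interval for the third dose.

```python
from datetime import date, timedelta

def classify_patient(dose_dates, infection_date, janssen=False, omicron=False):
    """Return 'booster', 'fully vaccinated' or 'not fully vaccinated'."""
    required = 1 if janssen else 2
    cutoff = infection_date - timedelta(days=14)
    # Doses count toward full vaccination if given at least 14 days before infection.
    effective = sum(1 for d in dose_dates if d <= cutoff)
    if effective < required:
        return "not fully vaccinated"
    # Assumption: the extra dose only has to precede the infection date.
    received = sum(1 for d in dose_dates if d <= infection_date)
    if omicron and received >= required + 1:
        return "booster"
    return "fully vaccinated"

# Toy patients (hypothetical dates).
infection = date(2022, 3, 1)
two_doses = [date(2021, 6, 1), date(2021, 7, 1)]
three_doses = two_doses + [date(2022, 1, 10)]
print(classify_patient(two_doses, infection, omicron=True))
print(classify_patient(three_doses, infection, omicron=True))
print(classify_patient([date(2022, 2, 25)], infection, janssen=True))
```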

Residue conservation
Conservation analysis is one of the most widely used methods for predicting functionally significant residues in protein sequences. Residue conservation, as defined by Livingstone et al. (29), employs two distinct methods to quantify a single conservation score for each position. For both, the physicochemical properties assessed for the 20 amino acids take into account whether the residues are hydrophobic, polar, small, proline, tiny, aliphatic, aromatic, positive, negative and charged. In this study, we used method 1, which considers any property exhibiting positive or negative conservation. A deletion is considered to possess all of these properties for the conservation index calculation. In this work, a high conservation index refers to values from 8 to 10, an intermediate index to values of 6 and 7, and a low index to values equal to or less than 5.

(ORF3a)), and two were associated with the 21J (Delta) clade (T3255I (ORF1a) and T478K (S)). On the other hand, Dataset_2 exclusively contained seven MOIs, with none of them overlapping with Dataset_1. Among these, three were defining mutations of the 21K (Omicron) clade (T95I (S), V143-(S), and G496S (S)), while four lacked Nextstrain Clade assignments (L829I (ORF1b), Y1759Y (ORF1b), Y145H (S), and Q9L (N)).
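The conservation index defined in this section (method 1, counting properties that are either shared or jointly absent) can be sketched for a single substitution as follows. The residue property table is an assumption: it follows the Taylor-style classification commonly used with this approach and should be checked against reference (29) before any reuse. Function names are hypothetical.

```python
# Assumed property table (Taylor-style classification); verify against ref. 29.
PROPERTIES = {
    "hydrophobic": "ACFGHIKLMTVWY",
    "polar": "DEHKNQRSTWY",
    "small": "ACDGNPSTV",
    "proline": "P",
    "tiny": "AGS",
    "aliphatic": "ILV",
    "aromatic": "FHWY",
    "positive": "HKR",
    "negative": "DE",
    "charged": "DEHKR",
}

def has_property(residue, prop):
    """A deletion ('-') is treated as possessing every property."""
    return residue == "-" or residue in PROPERTIES[prop]

def conservation_index(ref_aa, alt_aa):
    """Count properties that both residues share or both lack (0 to 10)."""
    return sum(
        has_property(ref_aa, p) == has_property(alt_aa, p) for p in PROPERTIES
    )

def conservation_class(index):
    """Thresholds used in this work: high 8-10, intermediate 6-7, low <= 5."""
    if index >= 8:
        return "high"
    if index >= 6:
        return "intermediate"
    return "low"

for ref_aa, alt_aa in [("L", "I"), ("R", "T"), ("A", "-")]:
    idx = conservation_index(ref_aa, alt_aa)
    print(f"{ref_aa}->{alt_aa}: index {idx} ({conservation_class(idx)})")
```

Under this assumed table, a leucine to isoleucine change keeps all ten properties, whereas a deletion of a small residue scores low because the deletion is treated as carrying every property.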

Associations with vaccine failure

Dataset_1
When applying a Chi-square test, individuals who were fully vaccinated and infected with non-Omicron variants showed a statistically significant association with vaccine coverage. This trend was observed from Set1_ds1 to Set4_ds1. Thus, variants that peaked in December, January, March, and April 2021 were strongly linked to vaccine protection (Table 2). However, Omicron cases in the sets of sequences Set5_ds1 (21L, FDR = 1.58 × 10^−41), Set6_ds1 and Set7_ds1 (22B and 22E, FDR = 1.32 × 10^−122 and 1.12 × 10^−50, respectively), as well as Set8_ds1 (22F and 23A, FDR = 1.24 × 10^−88), displayed a strong association with COVID-19 infection (Table 3; Supplementary Table S1). The booster dose demonstrated a slight improvement in protection against viral infection, with the exception of Set7_ds1. When comparing Set7_ds1 to Set6_ds2 from the same clade, we observed that both shared the same non-defining MOIs A27-(S), I68-(S), R346T (S) and G30-(N), with the only difference being the presence of a 'T' at locus 346 in the spike protein (Table 6).
Additionally, the booster dose was found to provide increased protection against the Omicron 21K and 22B clades (Table 5; Supplementary Table S1).

Dataset_2
The conservation index, assessing properties that are positively or negatively conserved, displayed high values in 39.5% of the MOIs, while 37.0% showed low conservation values (Table 8; Supplementary Figure S2B; Supplementary Table S3). The predominant physicochemical properties affected by these changes were hydrophobic (69.1%) and polar (53.1%) (Table 8). Only the 21J variant was found to be associated with vaccine escape, so it was analyzed in detail. This analysis identified 26 loci that differed between the sets of sequences ranging from Set2_ds2 to Set6_ds2, resulting in 26 amino acid changes characterized predominantly by their hydrophobic (69.2%) property (Table 8). Substitutions with a conservation index corresponding to high and intermediate conservation were dominant (34.6 and 38.5%, respectively). Non-defining mutations of 21J (Delta) included P2046L, P2287S, L829I, T95I, Y145H, R158-, and Q9L, which were characterized by hydrophobic (83.3%) and aliphatic (66.7%) amino acid substitutions. These changes mainly led to an intermediate conservation of physicochemical properties.

Vaccine effectiveness
Given the significant role of non-clade-defining mutations in the risk of vaccine failure, the next step was to conduct a joint analysis of both datasets. This analysis aimed to identify non-clade-defining MOIs associated with vaccine failure caused by the Omicron (21L, 22B and 22E) and Delta (21J) clades, which are common variants present in both datasets but were not initially linked to vaccine failure in the same manner. Both Set5_ds1 and Set8_ds2 belonged to the 21L clade and shared three additional mutations: L3201F (ORF1a), A27-(S), and G30-(N) (Table 9). However, while Set5_ds1 was associated with vaccine failure even after receiving a booster dose (FDR = 1.58 × 10^−41 and 3.20 × 10^−31, respectively), Set8_ds2 did not exhibit the same level of association. The Chi-square test revealed statistically significant differences in vaccine distribution between Set5_ds1 and Set8_ds2, both for fully vaccinated individuals (FDR = 3.58 × 10^−17) and for those who received a booster dose (FDR = 1.71 × 10^−20). Set5_ds1 was primarily associated with the Pfizer and Moderna vaccines (61.1 and 35.4% for fully vaccinated, and 59.3 and 39.0% for booster doses, respectively). In contrast, Set8_ds2 was predominantly associated with the Pfizer and AstraZeneca vaccines (77.9 and 13.6% for fully vaccinated, and 78.9 and 14.5% for booster doses, respectively). Furthermore, the analysis of the genetic distance in the sets of sequences showed remarkable differences between sets, with the genetic distance of Set8_ds2 being 3.8 times higher than that of Set5_ds1 (Figure 3).
Only Set2_ds2 and Set5_ds2 from Dataset_2 were associated with vaccine failure. The vaccine distribution between these risk sets exhibited marginal significance (FDR = 0.049), with the major vaccine brands being both Pfizer and AstraZeneca. The genetic distances for Set2_ds2 were twice as high as those observed in Set5_ds2 (Figure 4).
There was no correlation between the variability of amino acids in the sets of sequences, measured by the genetic distance, and the loss of vaccine effectiveness for the compared clades.

Discussion
Mutations and deletions in SARS-CoV-2 proteins can significantly alter their structure and function. This study investigated the impact of mutations of interest (MOI) across the entire SARS-CoV-2 proteome on vaccine escape using two datasets. The analysis revealed several mutations and combinations of residues that may influence vaccine coverage, particularly concerning the Delta (clade 21J) and Omicron BA.2 (clade 21L) variants.
Observations of mutations in the spike protein, particularly A27-, I68-, and R158-, located within the N-terminal domain (NTD), and the R346T mutation within the Receptor Binding Domain (RBD), align with previous studies. Molecular dynamics simulations have shown the critical involvement of NTD residues in interactions with monoclonal antibodies (42), suggesting potential immune evasion risks for viruses carrying mutations in these regions. Furthermore, mutations within the RBD have been shown to affect the binding affinity to the ACE2 receptor, indicating potential shifts in the binding free energy of the RBD-ACE2 complex and modified chemical interactions, leading to increased stability (43). Limited published data on vaccine efficacy or immunity related to the A27-(S) deletion were found. However, the spike 68-76 deletion within the NTD was identified in a human hepatoma cell clone termed Huh7.5-adapted-SARS2, indicating genetic adaptations. This modified version of SARS-CoV-2 effectively infiltrated A549 lung cancer cells, inducing cellular damage, a capability absent in the original strain, which exhibited no infectivity toward A549 cells. Additionally, the spike 68-76 deletion variant displayed increased susceptibility to IFN-α2b treatment in comparison to the wild-type SARS-CoV-2 strain. However, the spike 68-76 deletion was not found in SARS-CoV-2 isolates obtained from Vero E6 cells (44). In the context of vaccine stability and effectiveness, this suggests that despite the presence of the 68-76 deletion in vaccine batches (CoronaVac), it might not drastically alter the vaccine's effectiveness. The R158-(S) mutation in combination with the E156G/157 deletion and the L452R mutation has been suggested to exhibit higher infectivity in spike-pseudotyped viruses (45). However, experimental evidence points to the 156-158 deletion notably diminishing the neutralization capacity of antibodies present in the sera of convalescent COVID-19 patients and vaccinated individuals (42,46,47).
The R346T change in the RBD is a key mutation for neutralization escape, enhanced fusogenicity, and enhanced S protein processing. Structural modeling suggests that R346T appears to disrupt salt bridge formation between the S protein and class III monoclonal antibodies (e.g., Cilgavimab), lowering effectiveness (48,49). However, in our study the set of sequences containing R indicates poorer vaccine coverage than T for the Omicron clades 22B/22E.
Reported effects of spike mutations on the infectivity of SARS-CoV-2 are wide-ranging. In contrast, the available data regarding the influence of mutations occurring in other viral proteins appear comparatively limited.
An analysis conducted on 244 SARS-CoV-2 positive samples, gathered during the second wave of the pandemic, indicated that mutations P2046L and P2287S in the nsp3 (ORF1a) gene might contribute to persistent symptomatic COVID-19 infections post-vaccination (50). Additionally, an investigation involving severe, moderate, and mild COVID-19 cases, encompassing individuals who were either partially or fully vaccinated (with Covishield/Covaxin) or unvaccinated, revealed a marginal association of the P2287S mutation with disease severity (51). The nsp3 protein in the SARS-CoV-2 virus constitutes a crucial component of the viral replicase complex, contributing significantly to multiple functions associated with viral replication (52), transcription (53), and modulation of the host immune response (54), but no specific data on the impact of the L3201F mutation on these functions have been found in the literature.
A study aimed at modeling the fitness of several SARS-CoV-2 lineages by combining the effect of individual mutations introduced a scalable hierarchical Bayesian regression model to analyze all available SARS-CoV-2 genomes. The study identified the L829I (ORF1b) mutation in nsp12, which promotes an amino acid change in the RdRp (RNA-dependent RNA polymerase) thumb subdomain that could affect the function of the enzyme. RdRp plays a critical role in replicating and transcribing the viral genome in RNA viruses like SARS-CoV-2 (55).
The N protein consists of different structural components, namely, an N-arm, an N-terminal RNA-binding domain, a linker region containing serine/arginine-rich loops (SR-rich region), a C-terminal RNA-binding domain, and a C-tail (56-61). Some regions of the N-arm (amino acids 1-46) have been identified as immunodominant. The research revealed that one particular antibody had a specific affinity for the N-arm region of the N protein (62), suggesting the possible involvement of mutations identified in this study, such as Q9L and G30-, in immune evasion.

Genetic distances between amino acids per site within each sequence set of Dataset 1 and 2 obtained by averaging all sequence pairs, along with the standard error estimates. The average distance was calculated using the Bootstrap method for variance estimation, with 1,000 bootstrap replicates. The p-distance model was used for the amino acid substitution type. Ambiguous positions were removed for each pair of sequences using the pairwise deletion option. The bars are organized based on the datasets and are color-coded according to sets of sequences that share the same MOIs.
As mentioned, some of the substitutions and deletions associated with vaccine failure in our study seem to be in line with previous studies. However, others have not been previously linked to vaccine coverage or immunity. Therefore, a genome-wide analysis of SARS-CoV-2 mutations, and their effects on the proteome, could help to understand the molecular basis of viral vaccine escape, in connection with vaccine and therapeutic drug development.
The effect of COVID-19 booster-dose vaccination against the Omicron variant has been reviewed by a study that identified a total of 27 published studies supporting the effectiveness of the booster dose (63). Our results are consistent with the improved effectiveness of the booster dose against the Omicron variant in Dataset_2, where the vaccines show high coverage, further improved by the administration of the third dose for clades 21L and 22B. However, in Dataset_1, where the vaccine is not effective for Omicron clades, administration of the third dose showed slight protection for clades 21L, 22B/22E, and 22F/23A. We hypothesize that the discrepancies observed between these datasets might be attributed to the molecular composition of the administered vaccines.
Full implementation of SARS-CoV-2 vaccines is a major goal in facing the COVID-19 pandemic. A comparative analysis of COVID-19 vaccine characteristics, adverse events, efficacy and effectiveness reported that all vaccines up to 22nd September 2021 appeared to be safe and effective tools against all variants of concern to prevent severe COVID-19, hospitalization, and death. However, the evidence varies greatly depending on the vaccines considered (64). In addition, the effectiveness of the BNT162b2/Comirnaty vaccine (Pfizer) against the Omicron variant has been reported as 60% (65). Conversely, the effectiveness of the Spikevax/mRNA-1273 vaccine (Moderna) was published for symptomatic and asymptomatic cases, without specifying effectiveness for different clades of SARS-CoV-2 (66)(67)(68). Along these lines, in our study the sets of sequences associated with vaccine failure were mainly related to nucleic acid-based vaccines developed by BioNTech-Pfizer and Moderna-Lonza. However, Omicron cases where the vaccine exhibited a protective effect showed a high percentage of the Pfizer brand.
Investigation of the physicochemical attributes of amino acids enables an understanding of the intricate dynamics between viral proteins and the host immune system. This exploration provides valuable insights into viral pathogenicity, contributes to vaccine design and shapes strategies for drug development. In our study a significant number of substitutions, evaluated by conservation scores, showed a robust conservation index, indicating a strong correlation with amino acid physicochemical properties in approximately one-third of the changes. Furthermore, a comparable proportion of these substitutions exhibited lower conserved similarities in both datasets, implying an equal prevalence of such disparities among substitutions analyzed using conservation scores. Analysis of the physicochemical properties of amino acid changes revealed a predominant occurrence of hydrophobic and polar amino acid substitutions in both datasets. Substitutions in the Omicron variant (clades 21L, 22B/22E, and 22F/23A) were predominantly characterized by hydrophobic and polar properties. However, the non-defining mutations of each Omicron clade were mainly polar and small. The 21J (Delta) clade sequence set featured mainly amino acid substitutions with hydrophobic properties. The non-defining mutations of 21J (Delta) were distinguished by prevalent hydrophobic and aliphatic amino acid substitutions. This agrees with research showing the significant role of hydrophobic residues in the spike protein, enhancing interactions in the Delta variant (69). Recently, it was highlighted that the defining mutations in the Delta and Omicron variants markedly impact hydrophobicity, polarity, and charge distribution in all regions of the N-protein (70).
The role of missense mutations and deletions in SARS-CoV-2 has been recognized as pivotal for vaccine effectiveness and residue interactions, highlighting the need to elucidate the molecular basis of these substitutions and deletions for advancing vaccine and drug development. Our investigation identified six mutations significantly associated with reduced vaccine coverage, namely P2046L (ORF1a), P2287S (ORF1a), L3201F (ORF1a), L829I (ORF1b), R346T (S), and Q9L (N), along with the four deletions A27-(S), I68-(S), R158-(S), and G30-(N). Analysis of whole proteome sequences of SARS-CoV-2 derived from COVID-19 patients revealed a correlation between non-clade-defining mutations and vaccine effectiveness. Currently approved vaccines primarily target the spike protein. Thus, changes in this protein could challenge vaccine effectiveness. Our findings support a proteome perspective in SARS-CoV-2 vaccine design, which could improve vaccine effectiveness. In addition, we found that amino acid substitutions exhibited predominantly hydrophobic and polar properties. Understanding the physicochemical properties of amino acid substitutions is crucial, as it reveals how these modifications affect protein structure, function, and interactions. This understanding provides valuable insights into disease mechanisms and the identification of potential therapeutic targets.

FIGURE 1
Flowchart of the methodology used in this study. (1) Sequences and metadata of vaccinated and unvaccinated COVID-19 patients from Italy and Spain were downloaded from GISAID. (2) The SARS-CoV-2 genome sequences of the COVID-19 patients were aligned to the Wuhan reference genome. (3) The mutations of interest (MOIs) are then defined. (4) Next, the proteomes are obtained and those sequences that have the same combination of MOIs are grouped together. (5 and 6) Statistical analysis is then performed using a Chi-square test to identify sets of sequences between fully vaccinated and unvaccinated patients that are associated with vaccine failure. (7) The role of non-clade-defining mutations in the risk of reduced vaccine effectiveness is then investigated by comparing populations. (8) Finally, the physicochemical properties of the MOIs in the sequence sets associated with vaccine failure are analyzed.

TABLE 1
Brief description of Dataset_1 and Dataset_2.

TABLE 3
Descriptive statistics of Omicron set of sequences of Dataset_1, comparing fully vaccinated and booster doses with non-fully vaccinated COVID-19 patients.
a FDR, false discovery rate. b Refers to the highest number of cases.

TABLE 5
Descriptive statistics comparing fully vaccinated and booster-dosed COVID-19 patients with those who are not fully vaccinated for Omicron sets of sequences.
a FDR, false discovery rate. b Expected frequency exceeded the observed frequency for fully vaccinated patients infected by SARS-CoV-2 sequences sharing the same combination of MOIs; in clinical terms, this translates to vaccine coverage or efficacy. c Expected frequency exceeded the observed frequency for booster patients infected by SARS-CoV-2 sequences sharing the same combination of MOIs; in clinical terms, this translates to booster dose efficacy. d Refers to the highest number of cases.
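The FDR values reported in the tables come from a multiple-testing correction. The captions do not state which procedure was used, so the sketch below assumes the Benjamini-Hochberg step-up method purely for illustration; the function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Assumption: the FDR procedure is not named in the captions, so this
    method is used here only as a common example of FDR control.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A set of sequences would then be reported as significant when its adjusted value falls below the chosen FDR threshold.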

TABLE 6
Set of sequences sharing the same combination of MOIs in Dataset_1 and Dataset_2 for 22B (Omicron) and 22E (Omicron) variants and their distribution based on vaccine brands.
a False discovery rate value for the comparison of fully vaccinated vs not fully vaccinated COVID-19 patients. b False discovery rate value for the comparison of patients who received a booster dose vs not fully vaccinated COVID-19 patients. c Expected frequency exceeded the observed frequency for fully vaccinated patients infected by SARS-CoV-2 sequences sharing the same combination of MOIs; in clinical terms, this translates to vaccine coverage or efficacy. d Expected frequency exceeded the observed frequency for booster patients infected by SARS-CoV-2 sequences sharing the same combination of MOIs; in clinical terms, this translates to booster dose efficacy.

TABLE 4
Descriptive statistics for non-Omicron sets in Dataset_2 among fully vaccinated and non-fully vaccinated COVID-19 patients.
a FDR, false discovery rate. b Expected frequency exceeded the observed frequency for fully vaccinated patients infected by SARS-CoV-2 sequences sharing the same combination of MOIs; in clinical terms, this translates to vaccine coverage or efficacy.

TABLE 7
Set of sequences sharing the same combination of MOIs in Dataset_1 and Dataset_2 for 21J (Delta) variant and their distribution based on vaccine brands.
a False discovery rate value for the comparison of fully vaccinated vs not fully vaccinated COVID-19 patients.

TABLE 8
Physicochemical properties of the amino acids substituted in Dataset_1 and Dataset_2.

TABLE 9
Set of sequences sharing the same combination of MOIs in Dataset_1 and Dataset_2 for 21L (Omicron) variant and their distribution based on vaccine brands.
a False discovery rate value for the comparison of fully vaccinated vs not fully vaccinated COVID-19 patients. b False discovery rate value for the comparison of patients who received a booster dose vs not fully vaccinated COVID-19 patients.

Frontiers in Public Health, frontiersin.org, doi: 10.3389/fpubh.2024.1386596

This work was supported by the European Commission-NextGenerationEU (Regulation EU 2020/2094), through CSIC's Global Health Platform (PTI Salud Global). Project: PID2021-124662OB-I00 financed by MCIN/AEI/10.13039/501100011033/ and by FEDER "A way to make Europe." With support from the AXA-ICMAT Chair in Adversarial Risk Analysis. The research was also made possible through the Proyectos Estratégicos Orientados a la Transición Ecológica y a la Transición Digital under the code "TED2021-129970B-C21." AG-P acknowledges funding by the Programa Operativo FEDER 2014-2020 and Consejería de Economía y Conocimiento, Junta de Andalucía (grant number CV20-10932).