Prediction of the effects of the top 10 synonymous mutations from 26645 SARS-CoV-2 genomes of early pandemic phase

Background The emergence of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has led to a global pandemic since December 2019. SARS-CoV-2 is a single-stranded RNA virus, and RNA viruses mutate at a higher rate than DNA viruses. Many studies have examined nonsynonymous mutations, which change protein sequences. However, few studies have addressed the effects of SARS-CoV-2 synonymous mutations, which may affect viral fitness. This study aims to predict the effects of synonymous mutations on the SARS-CoV-2 genome. Methods A total of 26,645 SARS-CoV-2 genomic sequences retrieved from the Global Initiative on Sharing All Influenza Data (GISAID) database were aligned using MAFFT. The mutations and their respective frequencies were then identified. Multiple RNA secondary structure prediction tools, namely RNAfold, IPknot++ and MXfold2, were applied to predict the effects of the mutations on RNA secondary structure, and the base pair probabilities were estimated using MutaRNA. Relative synonymous codon usage (RSCU) analysis was also performed to measure the codon usage bias (CUB) of SARS-CoV-2. Results A total of 150 synonymous mutations were identified. The synonymous mutation with the highest frequency is the C3037U mutation in nsp3 of ORF1a. Of the 10 highest-frequency synonymous mutations, the C913U, C3037U, U16176C and C18877U mutants show pronounced changes between wild type and mutant in all 3 RNA secondary structure prediction tools, suggesting that these mutations may have some biological impact on viral fitness. These four mutations also show changes in base pair probabilities. All mutations except U16176C change the codon to a more preferred codon, which may result in higher translation efficiency. Conclusion Synonymous mutations in the SARS-CoV-2 genome may affect RNA secondary structure, changing base pair probabilities and possibly resulting in a higher translation rate. However, laboratory experiments are required to validate the results obtained from the prediction analysis.



Introduction
In December 2019, coronavirus disease 2019 (COVID-19) cases first emerged in Wuhan, China 1. Soon after, the rapid spread of COVID-19 resulted in a serious global outbreak. COVID-19 is an infectious and potentially lethal disease caused by a newly identified coronavirus strain, known as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The virus causes clinical manifestations ranging from asymptomatic infection to severe pneumonia and, in the worst scenario, death 2. SARS-CoV-2 seems to have a higher transmission rate 3 but a lower mortality rate 2 than Middle East respiratory syndrome coronavirus (MERS-CoV) and severe acute respiratory syndrome coronavirus (SARS-CoV).
SARS-CoV-2 is a single-stranded RNA virus with a genome size of 29,903 bases. In general, RNA viruses have a higher mutation rate than DNA viruses, which allows them to evolve rapidly and escape the host immune defence response 4. Different SARS-CoV-2 variants with multiple synonymous and nonsynonymous mutations have been reported since the beginning of the outbreak 5. Some variants are classified as variants of concern (VOCs) because they are associated with changes in viral pathogenicity, such as higher disease severity, higher transmission rate and lower immune response in the host, as a consequence of the mutations 6. However, most of these mutations in the SARS-CoV-2 genome are expected to be either neutral or mildly deleterious 7. Numerous studies have been carried out to understand the molecular mechanisms by which nonsynonymous mutations affect the functions of different SARS-CoV-2 proteins 6. For example, the Alpha variant (B.1.1.7) of SARS-CoV-2, first identified in the UK in late 2020, is characterized by several mutations, including the D614G mutation in the spike protein, which enhances its binding affinity to the ACE2 receptor 8. This variant exhibits increased transmissibility compared to the original virus, which led to its rapid spread globally 9. However, there are only a few studies on the synonymous mutations of the SARS-CoV-2 genome 10,11.
Synonymous mutations are also known as silent mutations because the nucleotide change alters the RNA sequence without altering the amino acid sequence 12. Synonymous mutations have been suggested to have no functional consequence for the fitness of organisms or their long-term evolution 13. However, numerous recent studies have shown that synonymous mutations may affect the folding and stability of RNA structures 14. Interestingly, a large-scale study of synonymous mutations in multiple yeast genes has shown that most synonymous mutations are not neutral and affect the fitness of the cell 15. For RNA viruses, even though synonymous mutations generally do not change pathogenicity directly, some studies reveal that synonymous mutations may affect the RNA secondary structure of the virus 16 and also change the codon usage bias of viral genes 17,18. The use of mRNA-based COVID-19 vaccines reduces the severity of the disease. However, the mRNA molecule is susceptible to degradation due to the presence of the 2' OH group in the ribose. To improve the stability of mRNA vaccines, Zhang et al. (2023) designed a novel algorithm that optimizes codon usage and RNA secondary structure by using synonymous codons 19.
Synonymous mutations play some important biological roles, which may affect viral fitness and pathogenicity. However, the biological consequences of synonymous mutations have been largely overlooked. In this study, we identified synonymous mutations of the SARS-CoV-2 genome from the early pandemic phase. We predicted the effects of the 10 highest-frequency synonymous mutations on the RNA secondary structures and codon usage bias of the SARS-CoV-2 genome. These findings allow researchers to prioritize these mutations for functional analysis in the future.

Methods
Sequence retrieval
A total of 30,229 SARS-CoV-2 genomic sequences, collected from 31 December 2019 to 22 March 2021, were downloaded from the GISAID database (Global Initiative on Sharing All Influenza Data, RRID:SCR_018251) 20. The sequences were filtered to keep only those with a complete genome and high coverage. They were further filtered to remove sequences in which more than 0.1% of the bases were unresolved "N" nucleotides or ambiguous letters. A total of 3,584 sequences were removed by this filter, leaving 26,645 sequences for analysis. The reference sequence of the SARS-CoV-2 genome (NC_045512.2) 21 was retrieved in FASTA format from the NCBI database (NCBI, RRID:SCR_006472). It is a Wuhan isolate with a complete genome comprising 29,903 bases.
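The ambiguity filter described above can be sketched in a few lines of Python. This is a minimal illustration rather than the script used in the study: the function names, the treatment of every non-ACGTU character as unresolved, and the toy records are all assumptions made for the example.

```python
def ambiguous_fraction(sequence):
    """Return the fraction of characters that are not A, C, G, T or U."""
    sequence = sequence.upper()
    if not sequence:
        return 1.0  # treat an empty record as fully unresolved
    unresolved = sum(1 for base in sequence if base not in "ACGTU")
    return unresolved / len(sequence)


def passes_filter(sequence, max_fraction=0.001):
    """Keep a genome only if at most 0.1% of its bases are unresolved."""
    return ambiguous_fraction(sequence) <= max_fraction


# Hypothetical toy records of 2,000 nt each (real genomes are about 29,903 nt).
toy_records = {
    "clean": "ACGT" * 500,              # no unresolved bases
    "one_n": "ACGT" * 499 + "ACGN",     # 1 of 2,000 bases unresolved (0.05%)
    "many_n": "ACGT" * 495 + "N" * 20,  # 20 of 2,000 bases unresolved (1%)
}
kept = [name for name, seq in toy_records.items() if passes_filter(seq)]
print(kept)

# Bookkeeping from the text: downloaded minus removed gives the analysed set.
print(30229 - 3584)
```

Only the third toy record exceeds the 0.1% threshold, and the subtraction reproduces the 26,645 genomes carried forward to the alignment.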

Multiple sequence alignment
The rapid calculation option available in the MAFFT online server (MAFFT, version 7.467, RRID:SCR_011811) 22 was used to perform multiple sequence alignment (MSA) of the 26,645 SARS-CoV-2 genomes. This option supports the alignment of more than 20,000 sequences with approximately 30,000 sites. The alignment length was kept the same as the reference sequence, which means that insertions in the mutated sequences relative to the reference were removed. All other parameters were left at their defaults.
Amendments from Version 3
The title of the paper has been revised to "Prediction of the effects of the top 10 synonymous mutations from 26645 SARS-CoV-2 genomes of early pandemic phase". Information on the Alpha variant has been added to the introduction, and parts of the introduction have been revised. Additional paragraphs on the identification of synonymous mutations, RNA secondary structure prediction and codon usage bias have been included in the discussion section to improve the clarity of the manuscript. Some citations have been removed, added or updated accordingly.
Any further responses from the reviewers can be found at the end of the article.

Identification of mutations and their frequency in SARS-CoV-2 genomes
A simple Python script was written to identify the mutations in the 26,645 SARS-CoV-2 genomes. To determine whether the identified mutations are synonymous or nonsynonymous, MEGA X software, version 10.2.5 build 10210330 (MEGA Software, RRID:SCR_000667) 23 was utilized to perform the translation for inspection purposes. The presence of amino acid changes was identified by referring to the genomic position of the nucleotide mutations. The 10 synonymous mutations with the highest frequencies were then selected.
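The mutation-calling step can be illustrated with a short Python sketch. It is not the authors' script, which is not shown in the article; it assumes that every aligned genome has the same length as the reference (as is the case when the alignment length is kept), skips gaps and ambiguous letters, and uses a hypothetical 10-base reference with four toy genomes.

```python
from collections import Counter


def call_substitutions(reference, aligned):
    """Return labels such as 'C3037U' for every column that differs from the reference."""
    calls = []
    pairs = zip(reference.upper(), aligned.upper())
    for position, (ref_base, alt_base) in enumerate(pairs, start=1):
        # Skip gaps, N and other ambiguous letters on either side.
        if ref_base not in "ACGT" or alt_base not in "ACGT":
            continue
        if ref_base != alt_base:
            # Report in RNA notation (T written as U), as in the text.
            calls.append(f"{ref_base}{position}{alt_base}".replace("T", "U"))
    return calls


# Hypothetical toy alignment; positions are 1-based reference coordinates.
reference = "ACGTACGTAC"
genomes = [
    "ACGTACGTAC",  # identical to the reference
    "ATGTACGTAC",  # C2U
    "ATGTACGTAT",  # C2U and C10U
    "ACGTNCGTAC",  # ambiguous base is ignored
]

frequency = Counter()
for genome in genomes:
    frequency.update(call_substitutions(reference, genome))

# The most frequent substitutions come first; the study kept the top 10.
print(frequency.most_common(10))
```

Deciding whether each substitution is synonymous still requires translating the affected codon, which the study did by inspection in MEGA X.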

SARS-CoV-2 RNA secondary structure prediction
The RNA secondary structures of the wild type and mutant sequences were predicted using the RNAfold program, version 2.4.18 (Vienna RNA, RRID:SCR_008550) 24 with the incorporation of SHAPE reactivity data obtained from the study by Manfredonia et al. (2020) 25. The prediction was performed on a sequence window covering 250 nucleotides upstream and 250 nucleotides downstream of the mutation site.
In addition to RNAfold, two other programs, IPknot++ version 2.2.1 (RRID:SCR_022557) 26 and MXfold2 (RRID:SCR_022558) 27, were used to predict the RNA secondary structures of the SARS-CoV-2 wild type and mutants.
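A window of 250 nucleotides on each side of the mutated base gives a 501-nucleotide input for each prediction tool. The sketch below shows one way to cut that window and build the mutant sequence. The helper names and the toy genome are hypothetical, and the clipping at the genome ends is an assumption, since the article does not say how positions near the ends would be handled.

```python
def mutation_window(genome, position, flank=250):
    """Return (window, offset): the flanked window and the 0-based index of the site in it."""
    index = position - 1                      # convert 1-based position to 0-based index
    start = max(0, index - flank)             # clip at the 5' end
    end = min(len(genome), index + flank + 1)  # clip at the 3' end
    return genome[start:end], index - start


def apply_substitution(window, offset, ref_base, alt_base):
    """Replace the base at offset, checking that the wild type base is as expected."""
    if window[offset] != ref_base:
        raise ValueError(f"expected {ref_base} at offset {offset}, found {window[offset]}")
    return window[:offset] + alt_base + window[offset + 1:]


# Hypothetical toy genome: all A except a C at position 913.
toy = ["A"] * 4000
toy[912] = "C"
toy_genome = "".join(toy)

wild_type, offset = mutation_window(toy_genome, 913)
mutant = apply_substitution(wild_type, offset, "C", "U")
print(len(wild_type), offset)
```

The wild type and mutant windows differ at a single base, so any difference between the predicted structures can be attributed to that substitution, provided no other variant falls inside the window.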

Base pair probability estimation
To predict how the mutations affect local RNA folding, base pair probabilities were estimated using MutaRNA, version 1.3.0 (MutaRNA, RRID:SCR_021723) 28. MutaRNA is a web-based tool that allows prediction and visualization of the structural changes induced by a single nucleotide polymorphism (SNP) in an RNA sequence. Its output includes the base pair probabilities within the RNA molecule for both the wild type and the mutant. The MutaRNA parameters were left at their defaults, except that the window size was changed to 501 nt.

Relative Synonymous Codon Usage (RSCU)
Relative synonymous codon usage (RSCU) represents the ratio of the observed frequency of a codon in a gene to the frequency expected under equal usage of all synonymous codons. RSCU is calculated using the formula:
$$\mathrm{RSCU}_i = \frac{X_i}{\frac{1}{n}\sum_{j=1}^{n} X_j}$$
where $X_i$ is the number of occurrences of codon $i$ and $n$ is the number of synonymous codons encoding that particular amino acid.
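The RSCU definition above translates directly into code. The following Python sketch assumes the standard genetic code and a coding sequence that starts in frame; the function name and the toy sequence are illustrative and not taken from the study. Stop codons and single-codon amino acids (AUG, UGG) are skipped, and amino acids that never occur in the sequence are left out to avoid division by zero.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in UCAG order (first base varies slowest).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def rscu(coding_sequence):
    """Return {codon: RSCU} where RSCU_i = X_i / ((1/n) * sum of X_j over synonymous codons)."""
    seq = coding_sequence.upper().replace("T", "U")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))

    families = defaultdict(list)
    for codon, amino_acid in CODON_TABLE.items():
        families[amino_acid].append(codon)

    values = {}
    for amino_acid, codons in families.items():
        if amino_acid == "*" or len(codons) == 1:
            continue  # stop codons and single-codon amino acids are excluded
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from this sequence
        n = len(codons)
        for codon in codons:
            values[codon] = counts[codon] * n / total
    return values


# Toy coding sequence: alanine encoded three times by GCU and once by GCC.
toy_values = rscu("GCUGCUGCUGCC")
print(toy_values["GCU"], toy_values["GCC"], toy_values["GCA"])
```

An RSCU above 1 marks a codon used more often than expected under equal usage, and the values within one amino acid family sum to $n$.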

Results and discussion
A synonymous mutation is a nucleotide change that does not alter the encoded amino acid. Synonymous mutations were previously considered to be less important, but they have now been shown to have some effects on RNA folding, RNA stability, miRNA binding and translational efficiency 29. Synonymous mutations may have significant effects on the adaptation, virulence and evolution of RNA viruses 30. Another study indicated that synonymous mutations are associated with more than 50 human diseases, such as hemophilia B, tuberculosis (TB), cystic fibrosis (CF), Alzheimer's disease, schizophrenia and chronic hepatitis C 31. All these studies show that synonymous mutations have gained increasing importance over the years. Hence, it is necessary to study the effects of synonymous mutations in the SARS-CoV-2 genome.

Identification of SARS-CoV-2 synonymous mutations
A total of 381 mutations were found in the SARS-CoV-2 genomes using the Python script, of which 150 are synonymous mutations. The distribution of these 150 synonymous mutations across 11 coding regions is shown in Figure 1. ORF1a and ORF1b carry a higher number of synonymous mutations, at 76 and 33 respectively, which might be due to their longer sequence length. In addition, our findings show a high C to U mutation rate in the SARS-CoV-2 genome, and this mutational skew is in line with multiple studies [32][33][34][35]. The high C to U mutation rate may be driven by the host APOBEC-mediated RNA editing system, and overexpression of APOBEC3 protein promotes viral replication and propagation in a human colon epithelial cell line 36. These mutational skews need to be considered when deducing the selection acting on synonymous variants in SARS-CoV-2 evolution 11. Synonymous mutations are generally assumed to be subject to weaker selective pressure than nonsynonymous mutations, presumably because purifying selection acts more strongly against nonsynonymous mutations and reduces their frequencies. Interestingly, a few studies suggest that there may be some selection on synonymous mutations, indicating that these mutations are not random and neutral and may have some biological impact on viral fitness 11,32,37.
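The suggestion above that ORF1a and ORF1b carry more synonymous mutations because they are longer can be checked by expressing the counts per kilobase. The counts in this sketch (76 and 33) come from the text. The ORF lengths do not: they are assumptions derived from approximate NC_045512.2 coordinates (ORF1a at 266 to 13,483 and ORF1b at 13,468 to 21,555) and should be replaced with the annotation actually used.

```python
# Counts of synonymous mutations reported in the text.
synonymous_counts = {"ORF1a": 76, "ORF1b": 33}

# Assumed ORF lengths in nucleotides (approximate coordinates, not from the article).
assumed_lengths_nt = {"ORF1a": 13218, "ORF1b": 8088}


def mutations_per_kb(counts, lengths_nt):
    """Return the number of mutations per 1,000 nucleotides for each region."""
    return {name: counts[name] / (lengths_nt[name] / 1000) for name in counts}


density = mutations_per_kb(synonymous_counts, assumed_lengths_nt)
for name, value in density.items():
    print(f"{name}: {value:.2f} synonymous mutations per kb")
```

Applying the same normalization to all 11 coding regions would show whether the pattern in Figure 1 reflects length alone.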
The 10 synonymous mutations with the highest frequencies among the 150 synonymous mutations are listed in Table 1. Our sequence samples were obtained from December 2019 to March 2021, and this period overlapped with the peak of the Alpha variant (B.1.1.7) outbreak 5. The defining synonymous SNPs of the Alpha variant include C241T, C913T, C3037T, C5986T, C14676T, C15279T and T16176C 5, and all except C241T are reported in our study as well. As shown in Table 1, the synonymous mutation with the highest frequency in the SARS-CoV-2 genomes is the C3037U mutation located in nsp3 of ORF1a, followed by the C313U mutation in nsp1 of ORF1a and the C9286U mutation in nsp4 of ORF1a. Mutations with higher frequency are mostly found in ORF1a and ORF1b. Although there are some overlapping ORFs in the SARS-CoV-2 genome, such as ORF1a and ORF1b, and ORF3a and ORF3c 38, the 10 highest-frequency synonymous mutations are not located in these overlapping sites.
It is of great interest to find out the effects of these top 10 synonymous mutations on the SARS-CoV-2 genome. However, it is important to note that the high frequency of some mutations is not necessarily due to their positive effects. They may have emerged during the early stage of the pandemic and been transmitted to all of their descendants, even though they have little or no effect on viral fitness 39. As in a companion paper, which focuses on the prediction analysis of nonsynonymous mutations of SARS-CoV-2 proteins 40, the same SARS-CoV-2 genome data from the GISAID database, ranging from 1 January 2020 to 22 March 2021, were used in this study. The data collection period overlapped with the period when the frequency of the Alpha variant reached its highest numbers, around March-May 2021 34. Seven synonymous mutations have been identified as defining mutations of the Alpha variant, of which all except C241T are also reported in our study. Due to the rapid evolution of the SARS-CoV-2 genome, it is beyond the scope of our study to keep track of the SARS-CoV-2 mutational profile and to predict the consequences of all these mutations. Two independent studies reported that Alpha or Alpha-like SARS-CoV-2 variants were circulating among wild deer populations in North America in late 2021 41,42. Although there is no reported case of viral spillback from deer to humans, we cannot rule out this possibility yet. Hence, our findings remain relevant despite not using the latest genome dataset.
RNA secondary structure prediction and base pair probability estimation analysis
SARS-CoV-2 can form highly structured RNA elements, which may affect viral replication, discontinuous transcription and translation 43,44. For example, SARS-CoV-2 forms a three-stemmed pseudoknot structure that promotes programmed -1 ribosomal frameshifting to increase the synthesis of the proteins required for viral replication 43,44. There are numerous high-throughput studies on the characterization of the RNA secondary structure of the SARS-CoV-2 genome 25,[45][46][47][48]. In these recent high-throughput studies, the RNA secondary structures of the SARS-CoV-2 genome were determined experimentally using chemical probing methods, such as SHAPE-MaP 25,45, or proximity ligation methods, such as RIC-seq 47 and COMRADES 48. Although these data are very useful for determining the RNA secondary structures of SARS-CoV-2, few studies have examined the effects of synonymous mutations on RNA secondary structure, which may be beneficial or deleterious to viral fitness. Therefore, we performed RNA secondary structure prediction and base pair probability estimation analysis for the 10 highest-frequency synonymous mutations.
To improve the outcome of the study, multiple RNA secondary structure prediction tools, namely RNAfold with SHAPE reactivity data 24, IPknot++ 26 and MXfold2 27, were applied in our study. In addition, the MutaRNA analysis tool was used to estimate the base pair potential of the wild type and mutant sequences. RNAfold with SHAPE reactivity data uses a thermodynamic approach to calculate the minimum free energy of the most probable RNA secondary structure. In addition, it has been shown that SARS-CoV-2 may adopt different RNA secondary structure conformations 7,19,36,37,39,41.
Our study aims to predict whether a synonymous SNP (sSNP) may affect RNA secondary structure, and the outcomes allow us to prioritize variants for experimental functional studies in the future. Using multiple prediction tools may help to increase the accuracy and reliability of the prediction results. The prediction results for all 10 synonymous mutations using these 3 tools and the base pair probability estimation results are summarized in Table 2 (✓: change, ×: no change). The results for all 10 synonymous mutations predicted with RNAfold, IPknot++ and MXfold2 are available in Extended data 2, 3 and 4, respectively 51. The base pair probabilities for all 10 synonymous mutations are shown as circular plots in Extended data 5 51. The darker the edge, the more likely the two connected bases are to form a base pair. Of these 10 synonymous mutations, four mutants, all located in ORF1ab, namely C913U, C3037U, U16176C and C18877U, show pronounced changes between wild type and mutant in all 3 prediction tools, suggesting that these synonymous mutations may have some biological impact on viral fitness. Having said that, it is also possible that other mutants with changes predicted by only one or two of these analyses may also affect RNA secondary structures and have some impact on viral fitness. It has been shown that SARS-CoV-2 can form elaborate RNA secondary structures at the 5' and 3' UTRs and at the frameshifting element (FSE), located at the boundary of ORF1a and ORF1b 7,19,36,37,39,41. The 5' UTR of SARS-CoV-2 is important for viral mRNA stability 52 and protein translation 53, while the 3' UTR may be involved in viral proliferation in the host cell 54. Interestingly, it has been observed that C to U transitions occur at higher frequency in the stem regions of the RNA secondary structures of the 5' and 3' UTRs of the SARS-CoV-2 genome, possibly because they have a less detrimental effect on the structure 34. The FSE can form pseudoknot structures, which regulate the relative protein expression of ORF1a and ORF1ab during viral infection 43,44.
Beyond the 5' and 3' UTRs, Huston et al. (2021) found that the ORF1ab region forms an extensive RNA secondary structure network 45. Coincidentally, all four mutations reported in our study, C913U, C3037U, U16176C and C18877U, are located within ORF1ab. The C913U mutation is found in Nsp2, near its start codon (position 806) in ORF1a of the SARS-CoV-2 genome. As shown in Figure 2, the wild type structures predicted by RNAfold and MXfold2 share some degree of similarity around positions 95-330 of the 501-base-long structure. The C913U mutation has a pronounced effect on the RNA secondary structure predicted by RNAfold. It results in the appearance or disappearance of multiple loops, not only near the mutated residue but also at sites further away, suggesting that this mutation may affect long-range RNA interactions. MXfold2 predicts that the U913 mutant forms a shorter stem and a larger hairpin loop compared to the C913 wild type. However, the structure predicted by IPknot++ is quite different from the others, with C913U resulting in a change of pseudoknot structure. Figure 2D shows that the base pair interactions of the wild type RNA are changed substantially by the C913U mutation.
The C3037U mutation is found in Nsp3 in ORF1a. As shown in Figure 3, both IPknot++ and MXfold2 predict that the U3037 mutant forms a longer stem and a smaller internal loop compared to the wild type. In contrast, RNAfold predicts that a small internal loop fuses into a bigger internal loop in the U3037 mutant. The MutaRNA circular plot shows some minor differences in base pair probabilities between the C3037 wild type and the U3037 mutant. Nsp3 is a papain-like protease that cleaves several Nsp proteins involved in viral replication 58. Hence, the effect of this mutation on its cleavage activity should be investigated, probably through changes in the transcription or translation level of Nsp3.
The U16176C mutation is located in Nsp12, close to the boundary of the Nsp12 and Nsp13 genes in ORF1b. As shown in Figure 4, the U16176C mutation results in a drastic change in the RNA secondary structure predicted by RNAfold. IPknot++ predicts that the C16176 mutant forms new pseudoknot structures, which are absent in the wild type U16176. MXfold2, on the other hand, predicts that the C16176 mutant forms a larger multi-branched loop and a shorter stem compared to the wild type. Similarly, the MutaRNA result shows that the C16176 mutant affects base pair potential at multiple sites. Nsp12 is one of the subunits of RNA-dependent RNA polymerase (RdRp), which is required for RNA synthesis 59. A study showed that a 1.4-kb-long SARS-CoV-2 RNA sequence (residues 15071-16451) located in the Nsp12 and Nsp13 regions is required to facilitate viral RNA packaging 60. Since the U16176C mutation may affect RNA secondary structure, it will be interesting to see whether it affects viral RNA packaging. U16176C, C14676U and C15279U have very similar frequencies, as shown in Table 1. Interestingly, IPknot++ predicts that all of them result in changes in pseudoknot structure, as shown in Extended data 3. We speculate that these three sSNPs may be functionally related. These mutations are located downstream of the frameshifting element (residues 13405-13488), which forms a pseudoknot to promote ribosomal frameshifting during viral replication 61. It has been demonstrated that synonymous mutations affect both the RNA secondary structure of the ribosomal frameshift signal and the frameshifting efficiency in SARS-CoV 62. Another study has shown that this ribosomal frameshifting structure in SARS-CoV-2 involves a long-range sequence interaction of 1.5 kb 48. It remains to be seen whether the long-range sequence interaction for ribosomal frameshifting can extend beyond 1.5 kb.
The C18877U mutation is located in Nsp14 in ORF1b. As shown in Figure 5, RNAfold predicts that an additional internal loop is formed in the U18877 mutant. IPknot++ predicts that the U18877 mutant forms extra internal loops and a longer hairpin near the mutated residue, and that it also affects the pseudoknot structure at 2 different sites further from the mutated residue. MXfold2 predicts that the U18877 mutant forms one hairpin with multiple loops instead of the single hairpin seen in the wild type. Changes at multiple base pairing sites due to the U18877 mutation are also observed in the MutaRNA circular plot. Nsp14 is important for maintaining high fidelity during viral RNA synthesis 63.

RSCU analysis of SARS-CoV-2
Besides affecting RNA secondary structure, synonymous mutations have been shown to affect protein translation efficiency and accuracy through codon usage bias (CUB), the non-random usage of synonymous codons that is common in all species 64. It is a phenomenon in which some codons are preferred over others for a specific amino acid. SARS-CoV-2 replicates using the host cell's machinery and synthesizes its proteins using host cellular components. Hence, codon usage bias may affect the replication of viruses 65.
Relative synonymous codon usage (RSCU) is a widely used statistical approach 66 for measuring codon usage bias in coding sequences. The RSCU values of SARS-CoV-2 are shown in Table 3, and the most preferred codon for each amino acid is marked in bold. Stop codons (UAA, UAG, UGA) and codons that uniquely code for an amino acid (AUG, UGG) are excluded from the RSCU analysis.
Table 4 shows the RSCU analysis of the top 10 synonymous mutations. The codons in bold in the 'codon change' column are the codons with the higher RSCU value, which means they are more preferred in the SARS-CoV-2 genome. Most of the mutations change the codon to a more preferred codon, as shown in Table 4. Nine of the ten synonymous mutations involve changes from C to U, and eight of them are located at the third position of the codon, suggesting that these changes are not random and are possibly subject to some selection pressure. In agreement with our study, an excess of C to U changes in the SARS-CoV-2 genome has been reported in multiple studies [32][33][34][35]. Since preferred codons may have better translation efficiency and accuracy than nonpreferred codons 64, it is possible that most of these mutations increase viral fitness. While one study shows, based on experimental evidence, that RNA secondary structures may be functionally linked to protein translation 67, it is difficult for us to establish this connection using in silico studies alone.
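The codon-position observation above can be made concrete with a small Python sketch that locates a genomic position within its codon and checks whether a codon change is synonymous under the standard genetic code. Two inputs are assumptions rather than values taken from the article: the ORF1a start coordinate of 266 in NC_045512.2, and the example codon pair UUC to UUU, which is used purely for illustration. Positions in ORF1b would need the frame after the ribosomal frameshift, which this sketch does not handle.

```python
from itertools import product

# Standard genetic code, codons enumerated in UCAG order (first base varies slowest).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

ASSUMED_ORF1A_START = 266  # assumption: 1-based start of ORF1a in the reference genome


def codon_position(genome_position, orf_start):
    """Return 1, 2 or 3: the position of a base within its codon."""
    return (genome_position - orf_start) % 3 + 1


def is_synonymous(ref_codon, alt_codon):
    """True when both codons encode the same amino acid."""
    return CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]


for position in (913, 3037):
    print(position, codon_position(position, ASSUMED_ORF1A_START))

# Illustrative codon pair: a C to U change at the third position of a Phe codon.
print(is_synonymous("UUC", "UUU"))
```

Whether the new codon is more preferred then follows from comparing the two RSCU values in Table 3.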

Conclusions
The effects of SARS-CoV-2 synonymous mutations on RNA secondary structure and codon usage bias were studied, even though these mutations do not change the amino acid residues of the protein. The C913U, C3037U, U16176C and C18877U mutants show pronounced changes between wild type and mutant in all 3 RNA secondary structure prediction tools, suggesting that these mutations may have some biological impact on viral fitness. In addition, these mutations showed changes in the base pair potential estimated by MutaRNA. All mutations except U16176C change the codon to a more preferred codon, which may result in higher translation efficiency. Due to the shortcomings of prediction tools, experimental studies, such as protein translation assays and RNA packaging assays, are needed to give a more comprehensive understanding of the biological consequences of synonymous mutations in SARS-CoV-2.

Data and software availability
Underlying data
SARS-CoV-2 genome sequence data were obtained from the GISAID database. The multiple alignment data can be accessed through FigShare.
Extended data 3. The RNA secondary structure of SARS-CoV-2 genome predicted using IPknot++.
Extended data 4. The RNA secondary structure of SARS-CoV-2 genome predicted using MXfold2.
Extended data 5. The base pair probabilities of SARS-CoV-2 genome estimated using MutaRNA.
Data are available under the terms of the Creative Commons Attribution 4.0 International license (CC-BY 4.0).

Diego Forni
Scientific Institute IRCCS E Medea, Bosisio Parini, Italy
I would like to thank the authors for answering my questions.
I still believe that it is not optimal to report counts without taking into account ORF length in Figure 1. I also still believe that differences among RNA secondary structures are relevant, and I am not really sure that "C913U, C3037U, U16176C and C18877U mutants show pronounced changes between wild type and mutant" as stated in the conclusion.
Finally, the two sentences starting with "Synonymous mutations are assumed subject to a lower selective pressure than nonsynonymous mutations, presumably.." in the introduction are not clear and need to be better explained.
Competing Interests: No competing interests were disclosed.

Reviewer Expertise: viral genomics, viral evolution
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Version 3
Reviewer Report 11 September 2024 https://doi.org/10.5256/f1000research.162881.r304727 © 2024 Schlick T. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Tamar Schlick
Department of Chemistry and Courant Institute of Mathematical Sciences, New York University, New York, New York, USA
In this work, synonymous mutations of the SARS-CoV-2 genome are explored, with the rationale that these mutations impact function through altered RNA folding, despite unaltered protein products. Specifically, the researchers find 150 synonymous mutations after performing multiple sequence alignment and examining the protein products, and map their distribution in the coding regions. RNA secondary structure predictions and base pair probability calculations are then presented for 4 mutations, C913U, C3037U, U16176C and C18877U, that show pronounced changes between wild type and mutant structures. Different prediction tools were used to confirm the structure predictions. The value of the study, while interesting, is limited because the long RNA lengths used make the disparate predicted structures unreliable. If the authors examine some mutant systems experimentally with chemical reactivity probing like SHAPE or DMS, the results may be more meaningful.

Is the work clearly and accurately presented and does it cite the current literature? Yes
Is the study design appropriate and is the work technically sound? Partly

Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate? Partly

Are all the source data underlying the results available to ensure full reproducibility? Yes

Are the conclusions drawn adequately supported by the results? Partly
Competing Interests: No competing interests were disclosed.
I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
First of all, it is clear that the dataset is not up to date and it will not be changed; still, this should be clearly stated, starting from the title, by reporting either the time period or something like "at the beginning of the pandemic".
-Following this, the introduction section should also be focused on what we know about genomic variability and viral evolution in the time period being analyzed herein. If I understood well, most of these sequences belong to the alpha variant.
-Moreover, "variant of concern" (abbreviated VOC, not VOI as stated in the text) refers to a specific lineage that carries several distinct mutations. As stated in the text now, it seems that a single mutation is a VOC. Please check it.
-It is expected that most mutations are found in ORF1ab, as this covers most of the genome, so counts should be normalized by ORF length (e.g. Fig. 1).
-Table 1 should report mutation frequency, not count.
-The authors should discuss further why different predictors give very different results in terms of RNA secondary structure, and how this can affect their analysis.
-Why did the authors select 250 bp flanking positions for their prediction? What is the rationale? How does this influence the results?
-Are there any other mutations in the 501-nucleotide regions? Could such mutations have an impact on predictions?
-How did the authors handle overlapping/internal ORFs? In these situations a synonymous mutation could also be nonsynonymous for the other ORF. I do not think this is the case for the top 10 mutations, but this should still be explained.
-The CUB section seems to me not really linked to the rest of the analyses and could be expanded more.
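Two of the requests in the list above are mechanical enough to illustrate with a short sketch: normalizing mutation counts by ORF length, and cutting a 501-nucleotide window (250 flanking positions on each side) around a mutated site. The function names, the toy genome and the toy counts below are hypothetical and are not taken from the manuscript; real ORF lengths and counts would have to come from the reference annotation and the authors' data.

```python
def mutations_per_kb(counts, orf_lengths):
    """Normalize raw mutation counts to mutations per 1,000 nt of each ORF."""
    return {orf: 1000 * n / orf_lengths[orf] for orf, n in counts.items()}


def flanking_window(genome, pos, flank=250):
    """Return (window, offset) around a 1-based genome position.

    With flank=250 the window is 501 nt long unless the position lies
    closer than `flank` nt to either end of the genome, in which case
    the window is truncated at that end.
    """
    i = pos - 1                               # convert to 0-based index
    start = max(0, i - flank)
    end = min(len(genome), i + flank + 1)
    return genome[start:end], i - start


def apply_substitution(window, offset, ref, alt):
    """Replace the reference base at `offset` with the alternative base."""
    if window[offset] != ref:
        raise ValueError(f"expected {ref!r} at offset {offset}, found {window[offset]!r}")
    return window[:offset] + alt + window[offset + 1:]


if __name__ == "__main__":
    # Toy sequence (assumption): a 1,000-nt genome with a C at position 500.
    toy_genome = "A" * 499 + "C" + "A" * 500
    wild_type, offset = flanking_window(toy_genome, 500)
    mutant = apply_substitution(wild_type, offset, "C", "U")
    print(len(wild_type), offset, mutant[offset])

    # Toy counts and lengths (assumption), only to show the arithmetic.
    print(mutations_per_kb({"ORF_A": 10, "ORF_B": 3}, {"ORF_A": 2000, "ORF_B": 300}))
```

The wild-type and mutant windows produced this way are the kind of paired inputs a structure predictor would receive. Because any other variant inside the window changes the input sequence, the question about additional mutations in the 501-nucleotide region bears directly on the predictions.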

Is the work clearly and accurately presented and does it cite the current literature?
Partly

Chong Han Ng
In this manuscript, Dr Boon et al. analyzed the most common synonymous mutations of early pandemic SARS-CoV-2 genomes. They identified substitutions that influence the RNA secondary structure surrounding these positions. The authors have already modified and updated their manuscript based on previous comments; still, there are a few open questions in my opinion.
Reviewer comment: First of all, it is clear that the dataset is not up to date and it will not be changed; still, this should be clearly stated, starting from the title, by reporting either the time period or something like "at the beginning of the pandemic".
Author response: The title has been revised to "Prediction of the effects of the top 10 synonymous mutations from 26645 SARS-CoV-2 genomes of early pandemic phase".
Reviewer comment: Following this, the introduction section should also be focused on what we know about genomic variability and viral evolution in the time period being analyzed herein. If I understood well, most of these sequences belong to the alpha variant.
Author response: We added a few lines about the Alpha variant in the introduction and in the results and discussion section (Identification of SARS-CoV-2 synonymous mutations).
Reviewer comment: Moreover, "variant of concern" (abbreviated VOC, not VOI as stated in the text) refers to a specific lineage that carries several distinct mutations. As stated in the text now, it seems that a single mutation is a VOC. Please check it.
Author response: The abbreviation has been corrected from VOI to VOC. In the text, we mentioned that different SARS-CoV-2 variants with multiple synonymous and nonsynonymous mutations have been reported since the beginning of the outbreak. Therefore, it is clear that VOCs contain multiple mutations.
Reviewer comment: It is expected that most mutations are found in ORF1ab, as this covers most of the genome, so counts should be normalized by ORF length (e.g. Fig. 1).
Author response: Figure 1 shows the distribution of 150 different types of synonymous

Roland Huber
Bioinformatics Institute, A*STAR, Singapore
I agree with previous reviewers that the analyzed sequences represent only a limited sample of variation in SARS-CoV-2, specifically from early in the pandemic. This might introduce unexpected biases in the analysis. For example, it would be more likely to observe host adaptation early on, which would be consistent with more favourable codon usage.
With regard to the data that was analysed, the structure models obtained show limited consistency. The authors state that they do not expect the tools used to concur on the structures since they use different algorithms. This is concerning, as one would expect unambiguous structures to be consistent, even using different methodologies. Other tools, e.g. RNAstructure, also allow the inclusion of SHAPE data and the prediction of pseudoknots. We are thus left with a series of diverging structure predictions and unsure what, if any, effect these specific mutations have. This is not helped by inconsistent presentation of the results. Figures 2-5 use different visualisations for the results of the 4 tools employed, which makes comparisons of the structures difficult for the reader.
The study unfortunately does nothing to associate the regions or structural elements with any type of functional or biological information.We are thus left with a study that analyses a limited set of synonymous mutations using inconsistent structure predictions and offers no additional biological insight.

Is the work clearly and accurately presented and does it cite the current literature? Partly
Is the study design appropriate and is the work technically sound? Partly

Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate? Partly
Are all the source data underlying the results available to ensure full reproducibility? Yes

Are the conclusions drawn adequately supported by the results? Partly
Competing Interests: No competing interests were disclosed.
Reviewer Expertise: Computational Biology, Structural Genomics, RNA biology, Virology
I confirm that I have read this submission and believe that I have an appropriate level of expertise to state that I do not consider it to be of an acceptable scientific standard, for reasons outlined above.
In the abstract section, the results part only included those mutations with high frequency but did not contain results of other analyses. Instead, they included the results of RNA secondary structure analysis in the conclusion.
Based on the timelines of emergence of different variants, the cut-off date of our data collection overlaps with the period when the frequency of the alpha variant peaked. All the synonymous mutations reported in the alpha strain, except C241T, have been identified in our study as well. We added a new paragraph in the discussion section to discuss the impact of our findings and to explain why the study of the alpha variant remains relevant now. In addition, we used three RNA secondary structure prediction tools, instead of one, to improve the outcome of the analysis.

The prediction results for all top 10 synonymous mutations using these 3 tools and the base pair probability estimation results are summarized in Table 2.
A major concern is that the sequences, even when filtered to keep only those with a complete genome and high coverage, still contain bad sequences. How did the authors exclude those sequences?
Response: In our original method, we used the high coverage filter when downloading from the GISAID database, in which only entries with less than 1% N and 0.05% unique amino acid mutations are included. To further reduce bad sequences, we filtered to remove those sequences with more than 0.1% N (unresolved nucleotides) and ambiguous letters. A total of 3584 sequences were removed by applying this filter. The list of synonymous mutations remains unchanged despite the revision of the mutation frequency.
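The additional filter described in this response can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes that N and all other non-ACGT/U letters are counted together against the 0.1% threshold, which is only one possible reading of the response, and the function names are hypothetical.

```python
RESOLVED = set("ACGTU")  # letters treated as unambiguous bases


def unresolved_fraction(seq):
    """Fraction of positions that are N or any other non-ACGT/U letter."""
    if not seq:
        return 1.0  # treat an empty record as fully unresolved
    seq = seq.upper()
    unresolved = sum(1 for base in seq if base not in RESOLVED)
    return unresolved / len(seq)


def passes_quality_filter(seq, max_fraction=0.001):
    """Keep a sequence only if at most 0.1% of its letters are unresolved.

    Assumption: N and IUPAC ambiguity codes are pooled into one count.
    """
    return unresolved_fraction(seq) <= max_fraction


def filter_records(records, max_fraction=0.001):
    """Split {name: sequence} into kept and removed dictionaries."""
    kept, removed = {}, {}
    for name, seq in records.items():
        target = kept if passes_quality_filter(seq, max_fraction) else removed
        target[name] = seq
    return kept, removed
```

Reporting the size of the removed set alongside the kept set, as the response does with its 3584 removed sequences, makes it easy to check how sensitive the mutation frequencies are to the chosen threshold.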
In the abstract section, the results part only included those mutations with high frequency but did not contain results of other analyses. Instead, they included the results of RNA secondary structure analysis in the conclusion.
Response: The abstract has been revised to update the results and conclusion.
Competing Interests: No competing interests were disclosed.
Reviewer Report 02 November 2021
https://doi.org/10.5256/f1000research.76505.r97299
the time of the study. RNAfold computes an optimal structure over the whole sequence length, which means its performance and accuracy can be affected as the sequences used get longer. However, in our case, the accuracy of the prediction is still acceptable since the sequence we used for the structure prediction is relatively short. Besides, the prediction results we obtained from RNAfold are further supported by the prediction results obtained from MutaRNA, which predicts the structural changes induced by the mutation by estimating the base pairing probabilities. In our results, the circular plot from MutaRNA shows that the changes in the base pairing probabilities near the mutation site correlate well with the RNA secondary structure predicted by RNAfold.
of Molecular and Cell Biology in Warsaw, 3.

Figure 2. The effect of C913U mutation on RNA secondary structure of nsp2 in ORF1a. (A) RNA secondary structure of C913 wild type and U913 mutant predicted using RNAfold. (B) RNA secondary structure of C913 wild type and U913 mutant predicted using IPknot++. (C) RNA secondary structure of C913 wild type and U913 mutant predicted using MXfold2. (D) MutaRNA circular plots of the base pairing probabilities of C913 wild type and U913 mutant. The black arrow indicates the position of WT and mutated nucleotides while the red arrow indicates the starting position of the query sequence.

Figure 3. The effect of C3037U mutation on RNA secondary structure of nsp3 in ORF1a. (A) RNA secondary structure of C3037 wild type and U3037 mutant predicted using RNAfold. (B) RNA secondary structure of C3037 wild type and U3037 mutant predicted using IPknot++. (C) RNA secondary structure of C3037 wild type and U3037 mutant predicted using MXfold2. (D) MutaRNA circular plots of the base pairing probabilities of C3037 wild type and U3037 mutant. The black arrow indicates the position of WT and mutated nucleotides while the red arrow indicates the starting position of the query sequence.

Figure 4. The effect of U16176C mutation on RNA secondary structure of nsp12 in ORF1b. (A) RNA secondary structure of U16176 wild type and C16176 mutant predicted using RNAfold. (B) RNA secondary structure of U16176 wild type and C16176 mutant predicted using IPknot++. (C) RNA secondary structure of U16176 wild type and C16176 mutant predicted using MXfold2. (D) MutaRNA circular plots of the base pairing probabilities of U16176 wild type and C16176 mutant. The black arrow indicates the position of WT and mutated nucleotides while the red arrow indicates the starting position of the query sequence.

Figure 5. The effect of C18877U mutation on RNA secondary structure of nsp14 in ORF1b. (A) RNA secondary structure of C18877 wild type and U18877 mutant predicted using RNAfold. (B) RNA secondary structure of C18877 wild type and U18877 mutant predicted using IPknot++. (C) RNA secondary structure of C18877 wild type and U18877 mutant predicted using MXfold2. (D) MutaRNA circular plots of the base pairing probabilities of C18877 wild type and U18877 mutant. The black arrow indicates the position of WT and mutated nucleotides while the red arrow indicates the starting position of the query sequence.
Report 27 July 2024
https://doi.org/10.5256/f1000research.162881.r304726
© 2024 Forni D. This is an open access peer review report distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Diego Forni
Scientific Institute IRCCS E Medea, Bosisio Parini, Italy
In this manuscript, Dr Boon et al. analyzed the most common synonymous mutations of early pandemic SARS-CoV-2 genomes. They identified substitutions that influence the RNA secondary structure surrounding these positions. The authors have already modified and updated their manuscript based on previous comments; still, there are a few open questions in my opinion.

Table 1. SARS-CoV-2 synonymous mutations with the top 10 highest frequency.
25,m-loop structures, and in some studies, 8 SLs, depending on the sequence length 25,45-48,50. To demonstrate the usability of the prediction tools, we predicted the RNA secondary structure of the 5' UTR sequence (1-480 nt) of SARS-CoV-2 (Extended data 1) 51. The RNA secondary structures predicted by RNAfold with SHAPE data, IPknot++ and MXfold2 are similar, especially in the SL1 and SL5-8 regions, and they are comparable to most of the published experimental data

Is the study design appropriate and is the work technically sound? Partly
Are sufficient details of methods and analysis provided to allow replication by others? Partly
If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions drawn adequately supported by the results? Partly
Competing Interests: No competing interests were disclosed.

Is the work clearly and accurately presented and does it cite the current literature? Partly
Is the study design appropriate and is the work technically sound? Partly
Are sufficient details of methods and analysis provided to allow replication by others? Yes
If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions drawn adequately supported by the results? Partly
Competing Interests: No competing interests were disclosed.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.
March 2021. As of 17 Dec 2021, there are over 4.4 million complete genomes with high coverage. Will the sequences used in the study include all variants reported, or which variants' sequences are included? A table summarizing all variants, with or without these mutations, is needed.
Response: We downloaded the raw data on 23rd March 2021 and completed the data analysis in late June 2021. We submitted the manuscript in late August 2021. The mutational landscape of the SARS-CoV-2 genome is very dynamic and changes rapidly. It is far beyond the scope of this study to include all variants reported.