Characterizing the West Nile Virus's polyprotein from nucleotide sequence to protein structure – Computational tools

Objectives: West Nile virus (WNV) belongs to the Flaviviridae family and causes West Nile fever. Transmission involves Culex mosquito species. Most infected individuals are asymptomatic, and few exhibit common symptoms. Moreover, approximately 10 % of neuronal infections caused by this virus are fatal. The proteins encoded by the WNV genome have not been fully characterized, although understanding their function and structure is important for formulating antiviral drugs.
Methods: Herein, we used in silico approaches, including various bioinformatic tools and databases, to analyse the proteins from the WNV polyprotein individually. The characterization included GC content, physicochemical properties, conserved domains, soluble and transmembrane regions, signal localization, protein disorder, secondary structure features and the respective 3D protein structures.
Results: Among 11 proteins, eight had >50 % GC content, eight had basic pI values, three were unstable under in vitro conditions, four were thermostable, with aliphatic index (AI) values >100, and some had negative GRAVY values in physicochemical analyses. All conserved protein domains were shared among Flaviviridae family members. Five proteins were soluble and lacked transmembrane regions. Two proteins had signals for localization in the host endoplasmic reticulum. Non-structural (NS) 2A showed low protein disorder. The secondary structural features and tertiary structure models provide a valuable biochemical resource for designing selective substrates and synthetic inhibitors.
Conclusions: WNV proteins NS2A, NS2B, PM, NS3 and NS5 can be used as drug targets for the pharmacological design of lead antiviral compounds.


Introduction
West Nile virus (WNV) causes an insect-borne, periodically epidemic disease with seasonal infections in temperate climatic regions. This virus belongs to the genus Flavivirus of the Flaviviridae family. 1 Transmission occurs through Culex species mosquitoes, which acquire the virus after feeding on infected birds. 2 Transmission through breastfeeding, blood transfusion and organ transplantation has also been reported to lead to infections in humans. 3 The first infection was reported in 1937, in Uganda. In the latter half of the 20th century, the virus spread worldwide. 3 Most infected individuals are asymptomatic, and the few symptoms include general fever, vomiting, headache and rashes. Fewer than 1% of infected people develop neuroinvasive disease, such as encephalitis or meningitis; approximately 10% of infections affecting the nervous system are fatal.
The genome of WNV is composed of a positive-stranded RNA containing approximately 10,000 nucleotides. 4 The coding region is translated into a polyprotein, which is then cleaved into structural and non-structural proteins. The structural proteins are C, PM, M and E, whereas the non-structural (NS) proteins are NS1, NS2A, NS2B, NS3, NS4A, NS4B and NS5. The C (capsid) protein packs the RNA in an immature state. 5 The PM and M proteins are considered a single entity known as M protein, which plays a crucial role in infection by activating the viral entry proteins within the cell. 6 The envelope formed by the E protein binds the host cell's surface receptors. 7 E protein is typically targeted by the CD8+ T cell response. 8 Regulation of the replication complex by NS1 supports viral viability. Host cell death, viral assembly and replication depend on NS2A protein mechanisms. 9 The cofactor NS2B and NS3 form the NS2B-NS3 protease complex, which plays crucial roles in polyprotein cleavage and virion replication. 5 NS3 has a multifunctional serine protease at the N-terminal domain (NTD) and a helicase at the C-terminal domain (CTD). 10 NS3 helicase activity is regulated by NS4A, and NS3 acts as a cofactor for infectious virion propagation. 1 NS4B blocks interferon signalling of the host cell. NS5, known as the RNA-dependent RNA polymerase, also acts as a methyltransferase. 11 WNV protease (NS2B-NS3) activity is inhibited by 8-hydroxyquinoline 12 and the trypsin inhibitor aprotinin. 13 Only the protease complex and the E protein have been used to design antiviral drugs based on peptides and ligands. The remaining proteins have roles in replication and infection. Hence, detailed analysis of genome characteristics is a necessary prerequisite for developing drugs against WNV.
The present work focused on characterizing the 11,022-nucleotide WNV genome sequence retrieved from the NCBI database (https://www.ncbi.nlm.nih.gov/nuccore/KT862844.1) by using various computational tools, from nucleotide sequence to 3D protein structure (Figure 1). To our knowledge, no prior studies have performed complete genome characterization of proteins from the WNV polyprotein. Hence, this work may help medicinal chemists design new and better antiviral drugs on the basis of the described protein properties.

Physicochemical properties of the proteins
The amino acid sequence of each protein was submitted to ProtParam (https://web.expasy.org/protparam/) to analyse its physicochemical properties 15 ; descriptors included the total number of amino acids, molecular weight, isoelectric point (pI), total number of basic (Lys+Arg) and acidic (Glu+Asp) amino acids, extinction coefficient (EC), 16 instability index (II), 17 aliphatic index (AI) 18 and grand average of hydropathicity (GRAVY). 19

Conserved domain detection
NCBI's interface was used to search the Conserved Domain Database (CDD) for the 11 protein sequences (https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi). 20 The protein sequences were annotated with RPS-BLAST, a PSI-BLAST variant that scans a set of precalculated position-specific scoring matrices (PSSMs) with the query protein, with a threshold of 0.01.

Determination of soluble or transmembrane proteins
The SOSUI server (https://harrier.nagahama-i-bio.ac.jp/sosui/mobile/) was used to determine whether each protein was soluble or contained transmembrane regions. 21

Prediction of protein signal localization
The web tool Virus-mPLoc (http://www.csbio.sjtu.edu.cn/bioinf/virus-multi/), based on data for 252 viral protein sequences, was used to predict the subcellular localization of the viral proteins within the host and at various sites of virus-infected cells. 22

Protein disorder
Under normal conditions, disordered proteins do not form well-defined tertiary structures. IUPred2A (https://iupred2a.elte.hu/), 23 a combined web interface, was used to determine protein disorder and disordered binding regions by using IUPred2 and ANCHOR2.

Protein secondary structure features
The secondary structure of the proteins from the polyprotein was predicted with the web server Self-Optimized Prediction Method with Alignment (SOPMA) (https://npsa-prabi.ibcp.fr/NPSA/npsa_sopma.html), 24 which provides detailed information on the α-helix, β-sheet, random coil and extended regions of proteins supplied in FASTA format.

Prediction of protein tertiary structures
Three-dimensional protein structures were predicted through a template-based search with SWISS-MODEL homology modelling (https://swissmodel.expasy.org/interactive). 25 ProMod3 was used to build models from the target-template alignment, including insertions and deletions; OpenMM and OpenStructure were used for simulations; and comparative modelling followed by parameterization was performed with the CHARMM22/CMAP force field. The automated pipeline built structures from the updated UniProtKB sequences and 190,687 structures in the PDB repository, which were used for template identification and determination of sequence identity. The generated tertiary structures were then validated with PROCHECK on the SAVES v6.0 server. 26

GC content calculation and translation
The total length of the WNV genome was 11,022 nt; a 10,297-nucleotide coding sequence was found to encode a polyprotein (Table 1), which undergoes protease cleavage to form the 11 proteins. NS5 had a nucleotide sequence length of 2715 nt and NS3 had a length of 1857 nt, whereas M and PM had shorter sequences of 225 nt and 275 nt, respectively. Among the 11 proteins, eight had >50 % GC content: C, M, E, NS1, NS3, NS4A, NS4B and NS5. PM, NS2A and NS2B had 49.00–49.99 % GC content. NS4B had the highest GC content, at 52.15 %, followed by M at 52.00 %. PM had the lowest GC content, at 49.09 %.
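The per-protein GC percentages reported above can be reproduced in principle with a short script. The following is a minimal sketch, not the tool used in this study; the function names and the toy sequence are illustrative assumptions, and region coordinates would need to be taken from the genome annotation.

```python
from __future__ import annotations


def gc_content(seq: str) -> float:
    """Return the GC percentage of a nucleotide sequence (DNA or RNA).

    Ambiguous symbols (e.g. N) are ignored in both numerator and denominator.
    """
    counted = [base for base in seq.upper() if base in "ACGTU"]
    if not counted:
        raise ValueError("sequence contains no unambiguous nucleotides")
    gc = sum(base in "GC" for base in counted)
    return 100.0 * gc / len(counted)


def gc_by_region(genome: str, regions: dict[str, tuple[int, int]]) -> dict[str, float]:
    """GC percentage per named region; coordinates are 1-based and inclusive."""
    return {
        name: gc_content(genome[start - 1:end])
        for name, (start, end) in regions.items()
    }


if __name__ == "__main__":
    toy = "ATGGCGCCATTA"  # toy sequence for illustration, not WNV
    print(gc_content(toy))
    print(gc_by_region(toy, {"first_half": (1, 6), "second_half": (7, 12)}))
```

Passing the coding-region coordinates of each mature protein to `gc_by_region` would yield one value per protein, comparable in form to Table 1.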

Physicochemical properties of the proteins
Each protein's physicochemical properties were computed with the ExPASy ProtParam tool. Descriptors including the length of the protein, EC, molecular weight, II, pI, AI, total number of positively (Arg+Lys) and negatively (Asp+Glu) charged residues, and GRAVY values were determined (Table 2). Amino acid composition, including polar, non-polar, aromatic, acidic and basic properties, is represented in Figures 2 and 3.
The pI values for these proteins ranged from 12.31 to 4.

Conserved domain detection
In the CDD analysis of proteins from the WNV polyprotein, 4 of 11 proteins did not have domains identified from the database: PM, NS2A, NS2B and NS4B. Domains and descriptions for each protein are provided in Table 3.

Prediction of protein signal localization
Investigation of the localization signals suggested that C, M, NS1, NS2A, NS2B, NS4A and NS4B localized in the viral capsid, whereas NS5 and NS3 localized in the viral capsid and host endoplasmic reticulum, and PM protein localized in the host cytoplasm (Table 5).

Protein secondary structure features
Secondary features of the proteins were predicted with the SOPMA tool (composition descriptors in Figure 6). The analysis revealed that the maximum frequency of

Prediction of protein tertiary structures
The tertiary structure models of the proteins from the WNV polyprotein were matched to structures deposited in the PDB (Table 6), with the percentage sequence identity and sequence alignments reported. The templates used were 5OW2 for C, 2BSJ for PM, 7LCG for M, 7KV9 for E, 4TPL for NS1, 6ZLH for NS2A, 2QQV for NS2B, NS2B for NS3, 6HUM for NS4A, 5H3C for NS4B and 4K6M for NS5. A sequence identity >30 % was required to build a 3D protein structure. The predicted 3D protein structures for C, M, E, NS1, NS2B, NS3 and NS5 are shown in Figure 7. The generated 3D protein structures were validated with Ramachandran plots; a model was considered appropriate when >90 % of residues were in allowed regions. Only two proteins, NS1 and NS2A, had <90 % of residues in allowed regions (Table 6 and Figs. S1–S11).

Discussion
The guanine (G) and cytosine (C) content in the nucleotide sequence affects DNA stability. A high GC content enhances genome stability at elevated temperatures and promotes strong base stacking. 27 It provides crucial information for determining the PCR annealing temperature and designing primers, and constitutes a key feature in shaping proteins. 28 The coding sequences of NS4B, M, NS5, NS4A, C, E, NS3 and NS1 had high GC contents of 52.15 %, 52.00 %, 51.82 %, 51.67 %, 51.49 %, 51.29 %, 51.26 % and 50.75 %, respectively, each exceeding 50 %; consequently, these proteins are considered stable and rigid, and able to withstand relatively higher temperatures than the other proteins from the WNV polyprotein. In PCR, when generating complementary gene sequences for these proteins, the annealing temperature is slightly higher for sequences with higher GC composition. Proteins with less stability and greater flexibility are more prone to mutations under abnormal conditions.
Physicochemical analysis was performed through an in silico approach. The pI is the pH at which the molecule carries no net electrical charge, that is, the total numbers of negative and positive charges are equal. 15 Four proteins (PM, NS1, NS2B and NS4A) had pI values below seven, whereas the remaining seven proteins had pI values above seven. The isoelectric focusing technique separates molecules from complexes on the basis of their pI values. 29 These values aid in the isolation of proteins of interest from the WNV polyprotein in wet laboratory experiments after digestion.
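The dependence of primer melting temperature on GC content discussed above can be illustrated with the Wallace rule, a rough estimate intended for short oligonucleotides: Tm ≈ 2 × (A+T) + 4 × (G+C) in °C. The sketch below is illustrative only; the function name and the two 18-mer primers are hypothetical and are not taken from the WNV genome or from the methods of this study.

```python
def wallace_tm(primer: str) -> int:
    """Estimate primer melting temperature (deg C) with the Wallace rule.

    Tm = 2 * (number of A and T) + 4 * (number of G and C).
    The rule is a coarse approximation for short primers only.
    """
    primer = primer.strip().upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    if not primer or at + gc != len(primer):
        raise ValueError("primer must contain only A, C, G and T")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    # Hypothetical primers of equal length but different GC content.
    low_gc = "ATTATGCAATTAGCATTA"
    high_gc = "GCCGTGCAGGCAGCACCG"
    for label, primer in (("low GC", low_gc), ("high GC", high_gc)):
        print(label, wallace_tm(primer))
```

For primers of the same length, each G or C contributes twice as much to the estimate as each A or T, so the GC-rich primer receives the higher value. In practice, the annealing temperature is commonly set a few degrees below the estimated Tm and then optimized experimentally.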
EC is defined as the amount of light absorbed per mole of protein at a specific wavelength. A protein's EC value is calculated from its content of tryptophan, tyrosine and cysteine residues, because these amino acids contribute substantially to the protein's optical density in the 276–282 nm range. 15 Protein–protein and protein–ligand interactions can be studied quantitatively on the basis of EC values. 30 The highest EC value was observed for the NS5 protein. II indicates protein stability under both in vivo and in vitro conditions. Proteins with II above 40 are considered unstable, whereas those with II below 40 are stable. 17 The proteins PM, NS1 and NS2A were found to be unstable; thus, efficient procedures are required for laboratory experiments with them. The remaining eight proteins were found to be stable. AI is another parameter describing protein stability with respect to temperature. AI is defined as the relative volume occupied by aliphatic side chains: alanine (Ala), valine (Val), leucine (Leu) and isoleucine (Ile). 15,18 A high AI indicates high thermostability of a protein, which is an additional consideration for wet laboratory studies. Among proteins from the WNV polyprotein, NS2A (133.90) and NS4A (122.42) had high AI values, indicating high stability over a wide range of temperatures. In contrast, PM (68.15) and NS5 (72.87) had low AI values and highly flexible structures. Proteins with lower GC and AI values were more flexible, as indicated in Tables 1 and 2. GRAVY values range from −4 to +4 and indicate the hydrophilic or hydrophobic nature of proteins. 19 A low (negative) GRAVY value suggests a globular (hydrophilic) rather than a membranous (hydrophobic) protein. In this study, the proteins with negative GRAVY values (PM, E, NS3 and NS5) (Table 2) did not form transmembrane regions, as confirmed by the SOSUI server (Table 4).
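Three of the descriptors defined above follow from simple composition-based formulas, sketched here for illustration. This is not the ProtParam implementation; the function names are assumptions, and the coefficients are the commonly published ones (aliphatic index weights of 2.9 for Val and 3.9 for Ile and Leu, the Kyte-Doolittle hydropathy scale for GRAVY, and molar extinction values at 280 nm of 5500 for Trp, 1490 for Tyr and 125 per cystine). The instability index is omitted because it requires a dipeptide weight table.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def _clean(seq: str) -> str:
    """Upper-case the sequence and reject empty input or non-standard residues."""
    seq = seq.strip().upper()
    unknown = set(seq) - set(KYTE_DOOLITTLE)
    if not seq or unknown:
        raise ValueError(f"empty sequence or unsupported residues: {sorted(unknown)}")
    return seq


def aliphatic_index(seq: str) -> float:
    """AI = X(Ala) + 2.9 * X(Val) + 3.9 * (X(Ile) + X(Leu)), X in mole percent."""
    seq = _clean(seq)
    counts = Counter(seq)
    pct = {aa: 100.0 * counts[aa] / len(seq) for aa in "AVIL"}
    return pct["A"] + 2.9 * pct["V"] + 3.9 * (pct["I"] + pct["L"])


def gravy(seq: str) -> float:
    """Grand average of hydropathicity: mean Kyte-Doolittle value per residue."""
    seq = _clean(seq)
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


def extinction_coefficient(seq: str, cystines: int = 0) -> int:
    """Molar extinction coefficient at 280 nm (M^-1 cm^-1).

    `cystines` is the assumed number of disulfide bonds; 0 treats all Cys as reduced.
    """
    counts = Counter(_clean(seq))
    return 5500 * counts["W"] + 1490 * counts["Y"] + 125 * cystines


if __name__ == "__main__":
    peptide = "MKWYVLLAIG"  # hypothetical peptide for illustration, not a WNV protein
    print(aliphatic_index(peptide), gravy(peptide), extinction_coefficient(peptide))
```

A negative `gravy` result corresponds to a predominantly hydrophilic sequence and a positive result to a predominantly hydrophobic one, which is the sense in which the GRAVY values in Table 2 are read.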
A set of proteins from the WNV polyprotein subjected to NCBI/CDD-BLAST is listed in Table 3. The proteins were found to contain conserved domains, such as Flavi_capsid, Flavi_M superfamily, Flavi_E_C, Flavi_E_Stem, Flavi_NS1 superfamily, Flavi_DEAD, SF2_C_viral, Peptidase_S7, Flavi_NS4A superfamily, Flavi_RdRp and capping_2_OMTase-Flaviviridae. A few of these domains have enzymatic activity. Four proteins, PM, NS2A, NS2B and NS4B, were found not to have conserved domains, according to the CDD search. A superfamily is a set of conserved domain models that generate overlapping annotations of protein sequences, are assumed to represent evolutionarily related domains and may be redundant with one another. 31 The superfamilies identified here are all shared among members of the Flaviviridae family.
Four amino acid characteristics are applied in SOSUI predictions: hydropathy index, amphiphilicity index, amino acid sequence charge and protein length. 21 Transmembrane (TM) regions within the lipid bilayer function in physiological processes such as cell recognition, intercellular joining, attachment, enzymatic function and signal transduction. 32 The results in Table 4 indicated that proteins with TM regions had a high content of non-polar amino acids, which maintain hydrophobicity, and a relatively low content of polar residues. NS2A, NS4A, NS4B, NS2B, E and M contain TM regions that actively perform cellular functions.
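Of the four characteristics listed above, the hydropathy index is the simplest to illustrate. The sketch below is a plain Kyte-Doolittle sliding-window scan, not the SOSUI algorithm: it ignores amphiphilicity, charge and protein length. The window of 19 residues and the cutoff of 1.6 are conventional heuristic values assumed for illustration, and the function names are hypothetical.

```python
from __future__ import annotations

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq: str, window: int = 19) -> list[float]:
    """Mean hydropathy of every full-length window along the sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in seq.strip().upper()]
    if window <= 0 or window > len(scores):
        return []
    total = sum(scores[:window])
    profile = [total / window]
    # Slide the window one residue at a time, updating the running sum.
    for i in range(window, len(scores)):
        total += scores[i] - scores[i - window]
        profile.append(total / window)
    return profile


def candidate_tm_windows(seq: str, window: int = 19, threshold: float = 1.6) -> list[int]:
    """1-based start positions of windows whose mean hydropathy meets the cutoff."""
    return [
        start + 1
        for start, value in enumerate(hydropathy_profile(seq, window))
        if value >= threshold
    ]


if __name__ == "__main__":
    # Hypothetical sequence: a leucine-rich stretch flanked by acidic residues.
    toy = "D" * 10 + "L" * 19 + "D" * 10
    print(candidate_tm_windows(toy))
```

A sequence with no window above the cutoff returns an empty list, which is the behaviour expected of soluble proteins; hits from such a scan are only candidates and would need confirmation with a dedicated predictor.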
Viruses can replicate their genome to increase their progeny only in host cells, and their functions depend on the environmental conditions in the body. 33 Therefore, knowledge of the subcellular localization of viral proteins in host cells or virus-infected cells greatly aids in understanding the relevant mechanisms and designing antiviral drugs. On the basis of analysis of the subcellular localization of WNV proteins (Table 5), all proteins except PM were predicted to localize in the viral capsid. NS3 and NS5 were also predicted to localize in the host endoplasmic reticulum (ER), and PM protein was predicted to localize in the host cytoplasm. The translation of WNV RNA in the host ER results in polyprotein formation, increases ER stress and leads to protein unfolding. 34
Disordered proteins have relatively low hydrophobic residue content and high amounts of charged polar residues; the latter are responsible for hydrophilicity and consequently for interaction with water molecules and the formation of weak multivalent interactions. 35,36 Analysis of WNV polyprotein disorder (Figure 5) indicated that NS2A was less disordered and more stable than NS3, in terms of both disordered protein regions and disordered binding regions. Among all proteins, the β-sheet composition was greatest in NS2B (Figure 6). Graces et al., in X-ray crystallography studies, observed that NS2B comprises more β-sheets, 37 in agreement with the results of this study.
Tertiary structure has broad applications in the pharmacological design of drugs targeting a given protein. The tertiary structures of proteins from the WNV polyprotein are represented in Figure 7. However, protein structure refinement and analysis through modelling must be performed before these structures can be used reliably. Nonetheless, our validation results indicated that most of the generated models had >90 % of residues in allowed regions of the Ramachandran plot. Further structural assessments, such as QMEAN scoring and RMSD calculation, are suggested to extend the study of these proteins. The proteins PM, NS2A, NS4A and NS4B showed <30 % identity (Table 6), and ab initio modelling must be performed to obtain their 3D structures.
In the current study, we recognize the limitation of potential sampling bias, wherein the chosen proteins might not have fully captured WNV polyprotein diversity. This bias might affect the external validity or generalizability of our findings to the broader WNV population. To mitigate this limitation, future research could explore a more diverse set of proteins or consider comparisons with other WNV strains or Flaviviridae family viruses. Although this study provides valuable understanding within the scope of the selected proteins, we acknowledge the importance of expanding the analysis to gain a more comprehensive understanding.
The methods used in the current study were formulated on the basis of insights from previously published research articles. 38,39 Although every bioinformatics tool has its own advantages, being aware of limitations is crucial. For example, working with very large datasets may lead to difficulties with EMBOSS Transeq, thus delaying processing times. ProtParam is a useful program for calculating basic protein parameters but might be unable to capture finer details, particularly in the case of sophisticated post-translational modifications or interactions with other biomolecules. Because NCBI-CDD is dependent on preexisting databases, its accuracy relies on the comprehensiveness of those databases. Whereas SOSUI, Virus-mPLoc and IUPred2 provide useful information about transmembrane regions, subcellular localization and protein disorder, they might not be able to accurately predict fine structural features or handle proteins with significant sequence divergence. Furthermore, ANCHOR2 is useful in locating disordered binding regions, but care must be taken, because its predictions might not always align perfectly with experimental observations. SOPMA can predict secondary structures, but its accuracy varies, particularly in regions with large structural differences. SWISS-MODEL homology modelling is a powerful tool whose efficacy is contingent on the quality and availability of template structures in the Protein Data Bank. Because proteins are dynamic, and computational predictions have inherent limits, results must be evaluated cautiously. These bioinformatics tools are best used in conjunction with experimental validation to support strong conclusions.
The conserved domains of WNV are highly conserved across the Flavivirus genus and thus may serve as potential targets for novel therapeutic strategies. 40 A large-scale analysis of the WNV proteome has revealed numerous evolutionarily stable nonameric positions across the proteome and identified several completely conserved sequences. 41 Mutational studies on the fusion glycoproteins indicated that the mutations inhibited production of infectious virions in Zika virus and yellow fever virus, but not in WNV. These conserved sequences are shared by other flaviviruses and have been associated with the functional and structural properties of viral proteins. 42 Additionally, the WNV envelope glycoprotein fusion peptide region has been identified as an immunodominant epitope stimulating antibodies, particularly monoclonal antibodies with diverse patterns of cross-reactivity. 43 Understanding these conserved domain characteristics is important for the development of antiviral therapies and the design of peptide-specific vaccines for flavivirus infections. 44
Because in silico methods were used, an additional limitation of this study is the lack of wet laboratory (in vivo and in vitro) validation. The computational tools and servers, after updating or reconstruction with different algorithms, might yield different results even if the same WNV sequence is input. Future investigations based on the key findings of this study might consider the non-structural and pre-membrane proteins of the WNV polyprotein. Notably, NS2A, NS2B, PM, NS3 and NS5 proteins may be favourable drug targets for designing small molecules/ligands in drug discovery 45 and for epitope design through immunoinformatic approaches. 46

Conclusion
The proteins derived from the WNV polyprotein are associated with viral replication and induce disease in host cells. In the present study, several bioinformatic tools were used to study the characteristics of the WNV genome, from basic nucleotide sequences to complex 3D protein structures, for structural and non-structural proteins. NS4B protein, with its high GC content, is considered relatively stable. In physicochemical analysis, the pI of C protein was highly basic, whereas that of NS2B was weakly acidic. The total number of positive and negative amino acid residues was relatively greater in NS5, thus indicating its high reactivity and ability to be isolated easily from the complex, given its high EC value. NS2A was the most thermostable among all proteins, on the basis of its high AI. Four proteins did not have transmembrane domains, on the basis of negative GRAVY values. Conserved domains were identified from the Flaviviridae family. The SOSUI server identified seven proteins with transmembrane domains. The 2D and 3D structures of these proteins provide insights for developing accurate models. These proteins might be crucial for host invasion and therefore could potentially be used as drug targets for pharmacological study.
C, M and NS4A contained the conserved domains of the flavivirus family capsid protein, envelope glycoprotein M and the Flavi_NS4A superfamily, respectively. E protein had two domain clans. Protein NS1 had a domain in the Flavi_NS1 superfamily. NS3 contained three domain clans: flavivirus DEAD domain, viral helicase CTD and peptidase S7 (flavivirus serine protease NS3). NS5 contained two domain clans: one relevant to the RNA-dependent RNA polymerase, which acts as the catalytic domain in the Flavivirus genus, and one specific to the Flaviviridae methyltransferase.

Figure 1: Methods used in the present study.

Figure 2: Heat map representation of amino acid composition in proteins from the WNV polyprotein.

Figure 3: Amino acid residue composition of proteins from the WNV polyprotein.

Figure 7: Predicted 3D structures of proteins from the WNV polyprotein.

Table 1: Coding sequence length and GC content for proteins from the WNV polyprotein.

Table 2: Physico-chemical properties of proteins from the WNV polyprotein.
AA – Total Number of Amino Acids; M.wt – Molecular Weight; pI – Isoelectric Point; (−)R – Total Number of Negatively Charged Residues (Asp+Glu); (+)R – Total Number of Positively Charged Residues (Arg+Lys); EC – Extinction Coefficient (in units of M⁻¹ cm⁻¹, at 280 nm measured in water); II – Instability Index; AI – Aliphatic Index; GRAVY – Grand Average of Hydropathicity

Table 3: Identified conserved domains and descriptions for proteins from the WNV polyprotein.
C – Capsid; PM – Pre-Membrane; M – Membrane; E – Envelope; NS – Non-Structural Protein.

Table 4: Functional characterization of proteins from the WNV polyprotein.

occurrence of α-helices was in NS2A (67.53), and the minimum was in E (21.56). Relatively more random coils were present in NS1 (46.88), whereas fewer were present in C (5.69). Relatively more extended turns were present in NS3 (23.42), whereas fewer were present in NS2A (7.36).
Figure 4: Representation of transmembrane regions in wheel form.

Table 5: Subcellular localization of the proteins from the WNV polyprotein.

Table 6: Shared protein templates from the PDB with respect to proteins from the WNV polyprotein and validation in Ramachandran plots.
C – Capsid; PM – Pre-Membrane; M – Membrane; E – Envelope; NS – Non-Structural Protein.