A detailed analysis of codon usage bias and influencing factors in the nucleocapsid gene of Nipah Virus

Several outbreaks of Nipah virus (NiV) have recently been reported in various parts of the world, including India. The nucleocapsid (N) protein is the major structural protein of NiV and also regulates the viral replication cycle. In the current study, we conducted a codon usage analysis of the N protein-encoding gene (N gene) of NiV. The relative synonymous codon usage (RSCU) values, in combination with an effective number of codons (ENC) value of 50.98, indicated low codon usage bias in the N gene. The effect of mutational pressure on codon usage bias was confirmed by significant correlations of GC3s, G3s, C3s, A3s, U3s, and ENC values with whole nucleotide contents (GC%, G%, C%, A%, and U%). Correlations of GC3s, G3s, C3s, A3s, and U3s with the axis values of correspondence analysis (CA) also supported the role of mutational pressure. The correlation of Gravy values with GC3s, G3s, C3s, A3s, and U3s revealed the presence of natural selection in addition to mutational pressure on codon usage bias. Moreover, NiV codon adaptation index (CAI) values higher than the corresponding expected CAI (eCAI) values against human (CAI, 0.726; eCAI, 0.713), pig (CAI, 0.838; eCAI, 0.819), and bat (CAI, 0.763; eCAI, 0.751) also indicated that natural selection plays a role in codon usage bias. Additionally, geographical distribution and evolutionary processes also influenced the codon usage bias to some extent.


Introduction
Nipah virus (NiV) is a highly contagious zoonotic virus that can infect both wild animals and human beings and is listed under the "Terrestrial Animal Health Code" of the World Organization for Animal Health (OIE) (https://www.oie.int/en/disease/nipah-virus/). It is a single-stranded negative-sense RNA virus of the genus Henipavirus in the family Paramyxoviridae. NiV was first reported twenty-three years ago [1]; during the first outbreak, between September 1998 and May 1999, it had a 40% mortality rate and caused the loss of 105 lives in Malaysia [2]. In India and Bangladesh, NiV outbreaks with a high fatality rate of 70% were reported in 2001 [3]. This virus also causes catastrophic infections in animals such as pigs, causing huge financial losses in the piggery industry. Several NiV strains have been reported with varied clinical and epidemiological characteristics [4].
NiV has a single-stranded negative-sense RNA (ssRNA) genome of 18.2 kb. The genome has six genes that encode nine proteins: phosphoprotein (P), nucleoprotein (N), fusion protein (F), glycoprotein (G), large polymerase (L), matrix protein (M), and the W, V, and C proteins [5]. The N protein is the most abundant viral protein [6] and interacts with the P protein of the polymerase complex. The relative availability of N protein is a determining factor for the activation of genome encapsidation, replicase activity, and the regulation of viral RNA synthesis. Overexpression of N protein inhibits viral transcription in trans while promoting viral genome synthesis [7]. Therefore, the N protein plays an important regulatory role in virus replication.
Amino acids are the building blocks of proteins, and the 20 amino acids are encoded by a set of 61 different codons. All amino acids except methionine and tryptophan are coded by more than one codon due to the degeneracy of the genetic code. The use of more than one codon for each amino acid is referred to as synonymous codon usage. Synonymous codon usage is not random; some codons are preferred over others. This non-random usage of synonymous codons is termed codon usage bias. Several factors, including mutational pressure, natural selection, geographical distribution, and evolutionary processes, are the main driving forces for codon usage bias in many RNA viruses [8,9]. It has been reported that a virus with several ORFs can have varied codon usage patterns for different genes [10,11].
Considering the public health concern of NiV and the importance of the N protein in virus replication and assembly, we examined the codon usage bias of the N gene and its influencing factors.

Nucleotide sequence data Collection
The N gene sequences (1599 bases in length) of thirty-two NiV isolates were downloaded in FASTA format from the nucleotide database maintained by the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov). These sequences were used for the analysis of codon usage indexes and for phylogenetic analysis. According to the Wisconsin system, nucleic acid sequences are conventionally presented in the 5'-3' direction, and the nucleotide T in the NCBI database is replaced by U in the RNA genome [12]. Supplementary Table 1 contains the detailed information (GenBank accession number, geographic location, year of sequence submission to NCBI, etc.) for all sequences used in this study.

Whole nucleotide and codon third position nucleotide composition analysis
The nucleotide content (G%, C%, A%, and U%) of the N gene coding region was calculated using CAIcal (available at http://genomes.urv.es/CAIcal/). Moreover, the nucleotide compositions of synonymous codons at the third position (G3s, C3s, A3s, and U3s) and the G + C content of synonymous codons at the third position (GC3s) were calculated as explained by Peden [13] with the help of the CodonW program (version 1.4.2) (http://codonw.sourceforge.net//).
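As a rough illustration of these composition measures, the sketch below computes overall nucleotide percentages and a simplified GC3s for a single coding sequence. The function names and the toy sequence are hypothetical, and the GC3s shown here simply excludes Met, Trp, and stop codons; it is not the actual implementation used by CAIcal or CodonW.

```python
from collections import Counter

# Codons left out of the "3s" statistic in this sketch:
# Met (AUG), Trp (UGG), and the three stop codons.
NON_SYNONYMOUS = {"AUG", "UGG", "UAA", "UAG", "UGA"}


def nucleotide_percentages(seq):
    """Return the overall A%, C%, G%, and U% of a coding sequence."""
    seq = seq.upper().replace("T", "U")
    counts = Counter(seq)
    return {nt: 100.0 * counts[nt] / len(seq) for nt in "ACGU"}


def gc3s(seq):
    """Fraction of synonymous codons with G or C at the third position."""
    seq = seq.upper().replace("T", "U")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    synonymous = [c for c in codons if c not in NON_SYNONYMOUS]
    if not synonymous:
        raise ValueError("sequence contains no synonymous codons")
    gc_third = sum(1 for c in synonymous if c[2] in "GC")
    return gc_third / len(synonymous)


# Toy sequence (hypothetical, not an NiV sequence): Met-Ala-Ala-stop.
example = "AUGGCUGCCUGA"
print(nucleotide_percentages(example))
print(gc3s(example))
```

The per-base third-position values (G3s, C3s, A3s, U3s) follow the same idea, but CodonW restricts their denominators further, so they are not reproduced here.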

Relative synonymous codon usage (RSCU) analysis
The relative synonymous codon usage (RSCU) value is the ratio of the observed frequency of a given codon to the frequency expected if all synonymous codons for that amino acid were used equally. The use of RSCU for the assessment of codon usage bias was first described by Sharp and Li [14]. An RSCU value lower than 1.0 represents negative codon usage bias, while a value higher than 1.0 indicates positive codon usage bias for a given synonymous codon [15]. RSCU values of the N gene of NiV isolates and bat (Pteropus alecto) were calculated by using the following formula in the CodonW program: Here, 'x' represents any codon.
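In its standard form (Sharp and Li [14]), the RSCU of a codon is its observed count divided by the mean count of all synonymous codons for the same amino acid, so a value of 1.0 means no preference. A minimal sketch of that calculation follows; the function name and example sequence are hypothetical, and this is an illustration of the definition rather than the CodonW code itself.

```python
from collections import Counter, defaultdict

# Standard genetic code in the canonical UCAG ordering (RNA alphabet).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def rscu(seq):
    """Return {codon: RSCU} for amino acids that occur and have >1 codon."""
    seq = seq.upper().replace("T", "U")
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)

    # Group sense codons into synonymous families by amino acid.
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if total == 0 or len(family) == 1:
            continue  # amino acid absent, or Met/Trp with a single codon
        for c in family:
            # observed count / (total / family size)
            values[c] = counts[c] * len(family) / total
    return values


# Toy alanine-only sequence (hypothetical): GCU x2, GCC x1, GCA x1.
print(rscu("GCUGCUGCCGCA"))
```

Within each synonymous family the RSCU values sum to the family size, which gives a quick sanity check on any RSCU table.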

Effective number of codons analyses
In a coding sequence, all amino acids except methionine and tryptophan are encoded by multiple codons, which are known as synonymous codons. The effective number of codons (ENC), which ranges from 20 to 61, can be used to estimate the degree of codon usage bias [16]. An ENC value of 61 indicates no codon usage bias, whereas an ENC value of 20 indicates extreme codon usage bias, with only one codon being used for each amino acid [9]. The ENC value was calculated using the following formula in the CodonW software: $\mathrm{ENC} = 2 + s + \frac{29}{s^{2} + (1 - s)^{2}}$, where 's' represents the GC3s value [16].
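The relation between ENC and GC3s attributed to Wright [16] gives the ENC expected when GC3s composition alone constrains codon choice, which is the reference curve used in ENC versus GC3s plots. The sketch below evaluates that curve; the function names are hypothetical, and GC3s is assumed to be expressed as a fraction between 0 and 1.

```python
def expected_enc(gc3s):
    """Expected ENC under GC3s constraint alone.

    Implements ENC = 2 + s + 29 / (s^2 + (1 - s)^2), with s = GC3s
    given as a fraction in [0, 1].
    """
    if not 0.0 <= gc3s <= 1.0:
        raise ValueError("gc3s must be a fraction between 0 and 1")
    return 2.0 + gc3s + 29.0 / (gc3s ** 2 + (1.0 - gc3s) ** 2)


def lies_below_curve(observed_enc, gc3s):
    """True when an observed ENC falls below the expected curve."""
    return observed_enc < expected_enc(gc3s)


# Tabulate the curve at a few GC3s values.
for s in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(s, round(expected_enc(s), 2))
```

A point that falls below the curve for its GC3s value is usually read as a sign that forces other than nucleotide composition also shape codon usage.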

Codon adaptation index analysis
The codon adaptation index (CAI) can be used to estimate the synonymous codon usage bias of a protein-coding nucleic acid sequence. It compares the synonymous codon usage of a given gene with the synonymous codon frequencies of a reference set [14]. The CAI is a quantitative value that indicates how frequently the codons preferred in highly expressed genes are used.
It is an indicator of translational efficiency [17]. CAI values range between 0 and 1, and a higher CAI value indicates higher gene expression potential. The CAI value is independent of sequence length and depends only on codon frequency [18]. The effect of the hosts (human, pig, and bat) on the codon usage of NiV was estimated using the CAI calculation method described by Puigbo et al. [19]. Codon usage tables for human (Homo sapiens) and pig (Sus scrofa domesticus) were taken from previously published data [9], while the table for bat (Pteropus alecto) was prepared using CDS sequences available at NCBI (https://www.ncbi.nlm.nih.gov/nuccore/). The host codon usage tables were used for calculating the CAI of the N gene of the various NiV isolates. Occasionally, extreme nucleotide or amino acid compositions may yield statistically irrelevant CAI values. Therefore, Puigbo et al. [19] recommended the expected CAI (eCAI) metric for the statistical assessment of CAI values and developed a Perl script (CAIcal_ECAI_v1.4.pl). The eCAI of the N gene was calculated with a 95% confidence interval, as described by Puigbo et al. [19].
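To make the CAI calculation concrete, the sketch below follows the general definition of Sharp and Li [14]: each codon gets a relative adaptiveness weight (its reference frequency divided by that of the most used synonymous codon), and the CAI of a gene is the geometric mean of the weights of its codons. The function names and the reference counts are hypothetical, codons with zero weight are simply skipped as a simplification, and this is not the CAIcal implementation.

```python
import math
from collections import defaultdict

# Standard genetic code in the canonical UCAG ordering (RNA alphabet).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def relative_adaptiveness(reference_counts):
    """Weight of each codon relative to the best synonymous codon."""
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa not in "*MW":  # skip stops and single-codon amino acids
            families[aa].append(codon)
    weights = {}
    for family in families.values():
        best = max(reference_counts.get(c, 0) for c in family)
        if best == 0:
            continue  # amino acid not represented in the reference set
        for c in family:
            weights[c] = reference_counts.get(c, 0) / best
    return weights


def cai(seq, weights):
    """Geometric mean of codon weights over the scored codons of seq."""
    seq = seq.upper().replace("T", "U")
    codons = [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]
    # Simplification: codons with no weight or zero weight are ignored.
    logs = [math.log(weights[c]) for c in codons if weights.get(c, 0) > 0]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


# Hypothetical host codon counts, for illustration only.
host_counts = {"GCU": 10, "GCC": 5, "GAA": 8, "GAG": 8}
w = relative_adaptiveness(host_counts)
print(cai("GCUGCC", w))
```

Because the result is a geometric mean of per-codon weights, it does not grow or shrink with sequence length, which matches the length independence noted above.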

Aromaticity and hydropathicity analysis
Aromaticity (Aromo) and hydropathicity (Gravy) are indices that reflect the effect of translational or natural selection on a given gene product. The Gravy value represents the average hydropathy value of the amino acids in a protein, whereas the Aromo value represents the frequency of aromatic amino acids [20].
The Aromo and Gravy values were calculated by using the following formulas [13]: $\mathrm{Gravy} = \frac{1}{N}\sum_{i=1}^{N} k_i$ and $\mathrm{Aromo} = \frac{1}{N}\sum_{i=1}^{N} v_i$. Here, 'N' is the number of amino acids, $k_i$ is the hydrophobic index of the i-th amino acid, and $v_i$ is either 1 (for an aromatic amino acid) or zero.
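Both indices are simple averages over the residues of the translated protein, as the sketch below shows. It assumes the Kyte-Doolittle hydropathy scale and treats Phe, Tyr, and Trp as the aromatic residues; the text does not name the scale explicitly, so both choices, along with the function names and example peptides, are assumptions.

```python
# Kyte-Doolittle hydropathy index per amino acid (assumed scale).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Residues counted as aromatic in this sketch.
AROMATIC = set("FWY")


def gravy(protein):
    """Mean hydropathy index over all residues (sum of k_i divided by N)."""
    protein = protein.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in protein) / len(protein)


def aromo(protein):
    """Fraction of aromatic residues (sum of v_i divided by N)."""
    protein = protein.upper()
    return sum(1 for aa in protein if aa in AROMATIC) / len(protein)


# Toy peptides (hypothetical, not the NiV N protein).
print(gravy("AI"))
print(aromo("FAWA"))
```

Because Aromo depends only on the amino acid sequence, isolates whose N proteins have the same aromatic residue count will share one Aromo value.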

Correspondence analysis
Correspondence analysis (CA) is a multivariate analysis method used for the analysis of complex codon usage data. CA not only represents the data in the form of rows and columns but also helps in identifying the major trends of variation in the data. To better understand the variation, the output of CA can be plotted along various axes [9]. In this study, we used the CodonW program to perform CA on RSCU values. Further, CA of the codon usage pattern of the N gene from NiVs of different geographical distributions was also performed.
With the help of XLSTAT 2015 software, graphs were plotted using the first two principal axes of CA.

Mutational pressure and natural selection analysis
Mutational pressure and natural selection are two important factors determining codon usage bias in viruses. Correlations of GC3s, G3s, C3s, A3s, U3s, and ENC values with the nucleotide contents and the axis values of CA can be used to evaluate the effect of mutational pressure on codon usage. Mutational pressure can also be estimated by correlation analysis of %GC and GC3s values [9]. Furthermore, correlations of the Gravy and Aromo values with GC3s, G3s, C3s, A3s, and U3s can be used to evaluate the effect of natural selection [21,22]. In this study, XLSTAT 2015 software was used for all correlation analyses.
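The correlation step can be sketched as below. The text does not state which correlation coefficient was computed in XLSTAT, so Pearson's r is an assumption here, and the function name and input values are hypothetical placeholders rather than data from this study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient, or None if either input is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        # A variable with no variation has no defined correlation.
        return None
    return sxy / math.sqrt(sxx * syy)


# Placeholder per-isolate values (hypothetical, for illustration only).
gc_percent = [41.0, 41.5, 42.0, 42.5]
gc3s_values = [0.39, 0.40, 0.42, 0.43]
print(pearson_r(gc_percent, gc3s_values))
```

Returning None for a constant input reflects a practical point in this kind of analysis: an index that takes the same value in every isolate cannot be correlated with anything.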

Phylogenetic Analysis
Codon usage bias is also influenced by the evolutionary processes of several viruses [23,24]. MEGAX software was used for phylogenetic analysis based on the nucleotide sequences. The nucleotide sequences of the N gene of the NiV isolates were first aligned using the MEGAX program. The aligned sequences were then used to construct the phylogenetic tree with the maximum likelihood method and complete deletion parameters. The robustness of the phylogenetic tree was tested using the bootstrap method [9].
Further, ENC values of the N gene of NiV isolates were calculated (Supplementary Table 1) to estimate the degree of codon usage bias in NiV. The ENC value of 50.98 ± 0.367 (mean ± SD) indicated low codon usage bias in NiV.

RSCU analysis
To analyze the codon usage bias and the effect of the hosts (human, pig, and bat) on the N gene codon usage bias, the RSCU values for each synonymous codon in the N gene of NiV isolates and in its hosts were calculated and compared (Table 1). Out of 18 amino acids coded by more than one codon, the preferred codons for 2 amino acids [Ile (AUC) and Arg (AGA)] were the same in NiV and humans. The preferred codons for 5 amino acids [Ile (AUC), Pro (CCA), Glu (GAA), Arg (AGA), Gly (GGA)] were the same in NiV and pig, whereas the preferred codon for only 1 amino acid [Ile (AUC)] was the same in NiV and bat (Table 1). This small number of preferred codons shared between NiV and its hosts indicated codon usage bias in the N gene. The RSCU values in bold in Table 1 mark the preferentially used codons.
Further, to check the codon usage bias in the N gene of NiV isolates, correspondence analysis based on RSCU (CA-RSCU) was performed. The CA-RSCU indicated that the first, second, and third axes accounted for 86.05%, 4.61%, and 2.48% of the total variation, respectively. Therefore, the first axis explained most of the codon usage bias. A graph was plotted using the first and second axis values to understand the distribution of synonymous codons. The most preferred and moderately preferred codons in the various synonymous groups were found closer to the intersection of axis 1 and axis 2, while the least preferred codons in the respective synonymous groups were found away from the intersection (Fig. 1).

Effect of mutational pressure on N gene codon usage bias
To analyze the factors affecting codon usage bias in the N gene, a graph of ENC against GC3s values was plotted. In the graph, a single cluster was observed for all NiVs, indicating little variation of ENC values among the isolates (Fig. 2). However, all NiV isolates lay slightly below the expected curve. This suggested that the codon usage bias of NiV might be due to a combination of mutational pressure and natural selection.
To evaluate the degree to which codon usage bias is influenced by mutational pressure, correlation analysis was performed among the nucleotide composition at the third position of codons, ENC values, and whole nucleotide compositions. Significantly high correlations among the third position nucleotide composition of codons, ENC values, and whole nucleotide compositions (excluding poor correlations of whole nucleotide compositions with GC3s values and of %G with ENC values) were observed (Table 2), while a weak correlation (r = 0.402, p < 0.02) between %GC and GC3s values was observed (Fig. 3). These results indicated that, in addition to mutational pressure, other factors also influence NiV codon usage bias.

Effect of natural selection on N gene codon usage
To assess the effect of natural selection on the codon usage bias of the N gene of NiV, correlation analysis of the nucleotide composition at the third position of all codons and the ENC values with the Aromo and Gravy values was performed. The Aromo values did not show any correlation with the GC3s, G3s, C3s, A3s, U3s, and ENC values, owing to the absence of variation in the Aromo value of the N gene among the NiV isolates. However, the GC3s, G3s, C3s, A3s, U3s, and ENC values were significantly correlated with the Gravy values (Table 3). These results indicated that, in addition to mutational pressure, natural selection has also influenced the codon usage bias of the N gene. Subsequently, the relative adaptiveness of NiV codon usage to its hosts was measured by using the CAI metric. The CAI values of NiV were 0.726 ± 0.003 (mean ± SD), 0.838 ± 0.003, and 0.763 ± 0.004 when compared with human (CAI_H), pig (CAI_P), and bat (CAI_B), respectively (Supplementary Table 1). To lessen the effect of extreme G + C and/or amino acid compositions, Puigbo et al. [19] suggested the use of the eCAI algorithm. The influence of natural selection on codon usage bias was also confirmed by the CAI values of all NiV isolates being higher than the corresponding eCAI values against human (eCAI_H, 0.713), pig (eCAI_P, 0.819), and bat (eCAI_B, 0.751).

Effect of geographical distribution and evolutionary processes on N gene codon usage bias
Based on the geographical distribution and the time at which the N gene sequences of the various isolates were reported to NCBI, the NiVs were grouped into three different groups (Fig. 4B). The first group contained all Malaysian isolates (sequences reported between 1999 and 2004). The second group contained NiV isolates from Cambodia and Thailand (sequences reported between 2005 and 2013), while the third group contained isolates from India and Bangladesh (sequences reported between 2005 and 2011) (Supplementary Table 1). Further, we evaluated the codon usage in NiV isolates from different geographical locations by CA. In the CA, the first and second axes accounted for 86.05% and 4.61% of the total variation, respectively. The graph of the first and second axes of CA showed that all NiV isolates were organized into two distinct clusters (Fig. 4C). All isolates from Malaysia and Cambodia were found in Cluster-A, while all isolates from India and Bangladesh were found in Cluster-B. NiV isolates from Thailand were distributed in both Cluster-A and Cluster-B. Subsequently, phylogenetic analysis using the N gene of the various NiV isolates was carried out. In the phylogenetic tree, all NiV isolates were organized into two separate clades (Fig. 4A), similar to the clustering pattern observed in the CA (Fig. 4C). The similarity among the geographical distributions, the evolutionary tree, and the CA results of the various isolates indicated the influence of geographical distribution and evolutionary processes on codon usage bias.

Discussion
NiV is an emerging bat-borne pathogen that causes severe respiratory and neurological disease with high mortality. It can spread in the population through infected people or infected animals. Based on epidemiological distribution, different strains of the virus with differing clinical features have been reported [1]. Information on the factors influencing codon usage bias and on its intensity is important for understanding viral evolution and transmission in detail. Previously, Khandia et al. [3] and Chakraborty et al. [12] studied the NiV codon usage pattern and its influencing factors. In both studies, RSCU values (a major indicator of codon usage bias) and codon usage patterns were evaluated for the complete genome. However, it has been reported that a virus with different open reading frames (ORFs) can have varied codon usage patterns for different genes [10,11].
The NiV genome (single-stranded negative-sense RNA; 18.2 kb) has six genes/ORFs which encode nine different proteins (N, P, F, G, W, V, C, M, and L proteins) [5]. The N gene encodes the viral N protein, which is the most abundant of all structural proteins of NiV [6], and the relative abundance of N protein is a major controlling factor for genome encapsidation, replicase activity, and the regulation of viral RNA synthesis [7]. The current study therefore focused on understanding the codon usage bias of the N gene of NiV using multiple systematic analytical methodologies. We calculated and compared RSCU values for each synonymous codon of the N gene of various NiV isolates and its hosts (human, pig, and bat). Comparison of the preferred codon for each amino acid in the viral N gene and in its hosts indicated that 2 preferred codons [Ile (AUC) and Arg (AGA)] were common between virus and humans, 5 preferred codons [Ile (AUC), Pro (CCA), Glu (GAA), Arg (AGA), Gly (GGA)] were common between virus and pig, and only 1 preferred codon [Ile (AUC)] was common between virus and bat. The RSCU comparison of the viral N gene and its hosts indicates the presence of codon usage bias in NiV.
The ENC is a simple indicator of codon bias. Earlier, ENC values have been determined for various RNA viruses, such as Japanese encephalitis virus (mean ENC = 55.30) [9], Zika virus (mean ENC = 52.72) [8], and chikungunya virus (mean ENC = 55.56) [21]. An ENC value higher than 45 is an indicator of weak codon usage bias [8,9]. In the current study, the mean ENC value for the N gene of NiV isolates was 50.98, indicating low codon usage bias in NiV. A virus with low codon usage bias can use multiple codons for each amino acid, which allows it to replicate more efficiently in the host cell [22].
In several RNA viruses, mutational pressure and natural selection are two key forces that determine codon usage bias [21]. If mutational pressure is the only factor determining codon usage bias, all data points in the ENC versus GC3s analysis should lie on the expected curve [16]. In this study, the data points were found below the expected curve. This indicated that, in addition to mutational pressure, other factors also influence the codon usage bias of the N gene of NiV. The effect of mutational pressure on codon usage bias was supported by substantial correlations between overall nucleotide contents and A3s, U3s, C3s, and G3s. The significant correlation between ENC values and whole nucleotide contents (except %G) confirmed the involvement of mutational pressure. The first and second axis values of CA were also significantly correlated with whole nucleotide content. All of the above findings indicate that mutational pressure is a significant factor influencing the codon usage bias of the N gene of NiV.
Natural selection may also alter codon usage patterns during the virus's adaptation to host cells [8,9]. Strong correlations of Gravy values with GC3s, G3s, C3s, A3s, U3s, and ENC values were observed in the current study, indicating that viral protein characteristics have also been responsible for the observed variation in NiV codon usage. High CAI values of the N gene of NiV isolates with respect to its hosts (human, pig, and bat) indicated the effect of natural selection on codon usage bias. Moreover, CAI values higher than the eCAI values for the respective hosts also indicated that the significant adaptation of the virus to its hosts may be due to natural selection.
In many RNA viruses, geographical dispersion and evolutionary processes also contribute to codon usage bias [8,9]. In this study, CA based on geographical distribution and phylogenetic analysis were used to investigate the effects of geographical dispersion and evolutionary processes on codon usage, respectively.
In the CA, two different clusters were formed. All Malaysian and Cambodian isolates fell into Cluster-A and all Indian and Bangladeshi isolates fell into Cluster-B, while isolates from Thailand were distributed in both Cluster-A and Cluster-B. This distribution of area-specific NiV isolates in specific clusters of the CA graph indicated the role of geographical distribution in codon usage bias. In the phylogenetic tree, two clades were observed, and the distribution of NiV isolates was similar to the distribution of isolates in the CA graph. The similar patterns of clustering in the CA and clade formation in the phylogenetic tree supported the role of evolutionary processes in codon usage bias in NiV.
The current study indicated low codon usage bias in NiV. Mutational pressure and natural selection were found to be the two key factors driving codon usage bias. In addition to mutational pressure and natural selection, geographical distribution and evolutionary processes also influenced codon usage bias to some extent.

Declarations Funding Information
There is no role of any funding agencies in the current study.

Figure 4
The phylogenetic tree based on N gene and geographical distributions of various NiV isolates. (A) Phylogenetic tree parameters included: pairwise deletion, 1000 replicates for bootstrap analysis, neighbor-joining method for tree construction. All NiV isolates were organized into two separate clades (Clade-X and Clade-Y) (Ban, Cam, Ind, Mal, and Thi stand for Bangladesh, Cambodia, India, Malaysia, and Thailand, respectively). (B) Based on the geographical distribution and time of N gene sequence of various isolates

Figures

Figure 1

Figure 2

Figure 3

Table 1: The synonymous codon usage pattern of the N gene in NiVs and its hosts. AA: amino acid.