Genetic and codon usage bias analyses of polymerase genes of equine influenza virus and its relation to evolution

Equine influenza is a major health problem of equines worldwide. The polymerase genes of influenza virus have key roles in virus replication, transcription, transmission between hosts and pathogenesis. Hence, the comprehensive genetic and codon usage bias of polymerase genes of equine influenza virus (EIV) were analyzed to elucidate the genetic and evolutionary relationships in a novel perspective. The group - specific consensus amino acid substitutions were identified in all polymerase genes of EIVs that led to divergence of EIVs into various clades. The consistent amino acid changes were also detected in the Florida clade 2 EIVs circulating in Europe and Asia since 2007. To study the codon usage patterns, a total of 281,324 codons of polymerase genes of EIV H3N8 isolates from 1963 to 2015 were systemically analyzed. The polymerase genes of EIVs exhibit a weak codon usage bias. The ENc-GC3s and Neutrality plots indicated that natural selection is the major influencing factor of codon usage bias, and that the impact of mutation pressure is comparatively minor. The methods for estimating host imposed translation pressure suggested that the polymerase acidic (PA) gene seems to be under less translational pressure compared to polymerase basic 1 (PB1) and polymerase basic 2 (PB2) genes. The multivariate statistical analysis of polymerase genes divided EIVs into four evolutionary diverged clusters - Pre-divergent, Eurasian, Florida sub-lineage 1 and 2. Various lineage specific amino acid substitutions observed in all polymerase genes of EIVs and especially, clade 2 EIVs underwent major variations which led to the emergence of a phylogenetically distinct group of EIVs originating from Richmond/1/07. The codon usage bias was low in all the polymerase genes of EIVs that was influenced by the multiple factors such as the nucleotide compositions, mutation pressure, aromaticity and hydropathicity. However, natural selection was the major influencing factor in defining the codon usage patterns and evolution of polymerase genes of EIVs.


Background
Equine influenza outbreaks are of major concern throughout the world. The vaccines often fail to protect horses, especially in the case of newly emerged equine influenza A (H3N8) viruses owing to antigenic variation of haemagglutinin (HA) protein [1,2]. Thus due to its importance for vaccine strain selection [3,4] most evolutionary dynamic studies have been focused on HA gene [5][6][7]. Since their first isolation in 1963 (Miami/ 63), EIV H3N8 have diverged due to variation in the HA gene viz., Pre-divergent, Eurasian and American (evolved into currently circulating Florida sub-lineage clade1 & 2) [5,7,8]. The evolution of internal genes of EIV H3N8 in Greek outbreak [9] and evidence of species-jumping of H3N8 from horse to canine [10][11][12]; and pigs [13] emphasizes the role of internal genes, especially polymerase genes in replication, transcription and host adaptation.
The three largest genome segments of EIVs encode polymerase heterotrimeric proteins namely basic protein 1 (PB1), basic protein 2 (PB2) and acidic protein (PA). These subunits together with the nucleoprotein form the complex which is accountable for viral RNA transcription and replication [14][15][16]. Genome segment 2 encodes the PB1 subunit, the central component of the polymerase complex with RNA polymerase activity; PB1 interacts directly with PA and PB2 subunits [17][18][19][20][21]. Segment 2 also encodes another protein, PB1-F2, from an alternate reading frame of PB1 mRNA [22,23]. PB1-F2 is known to increase virulence through the induction of cell death, promoting inflammation and up-regulating viral polymerase activity [24][25][26]. Genome segment 1 encodes the PB2 subunit, a key player in the initiation of viral transcription through cap-snatching activity in conjunction with endonuclease function of PA subunit [17,[27][28][29]. Finally, segment 3 encodes the PA subunit, which is responsible for generalized proteolysis of viral and host proteins [30,31] and plays a role in virus assembly [32]. Segment 3 also encodes a second protein called PA-X [33], which modulates viral pathogenesis [34,35]. In addition to their basic functions in viral genome expression, the polymerase proteins play a pivotal role in host adaptation and viral pathogenesis [34,36,37]. Several mutations in the polymerase subunits of avian and human influenza viruses have been implicated in enhancing virus replication and adaptation to various hosts [38][39][40][41][42]. However, the factors dictating the genetic changes of polymerase genes of EIV H3N8 especially in terms of codon usage bias patterns have not been explored in detail.
The codon usage pattern has a significant role in the evolution of viruses. Several studies have testified the species-specific synonymous codon usage bias [43][44][45], which shows preference for certain codons encoding for the same amino acids [46]. Such preference for codon usage offers evolutionary force for determining the overall fitness of the virus influencing various cellular processes [47][48][49]. So far, limited studies have highlighted the factors responsible in shaping the codon usage bias in different influenza viruses. For instance, while mutation pressure plays a key role in shaping the codon usage bias in H1N1 (human influenza virus) [50] and H9N2 (avian influenza virus) [51], natural selection primarily dictates the bias of synonymous codon usage for EIV [52]. One study on the PB1-F2 gene of EIV showed that mutational bias along with selection pressure and gene length influenced the codon usage pattern [53]. Furthermore, the codon usage bias analysis provides a different perspective regarding virus evolution studies in comparison to phylogenetic studies. Nevertheless, comprehensive analysis of codon usage pattern of polymerase of EIVs has not been elucidated so far.
Thus the present study focuses on comprehensive analysis of genetic evolution, synonymous codon usage pattern and factors involved in shaping the codon usage pattern of three polymerase genes across the lineages of EIV H3N8 strains circulating worldwide from 1963 to 2015. Combining the codon usage bias and traditional phylogenetic analyses of polymerase genes of EIV H3N8 will help in understanding the novel perspective of molecular evolution dynamics of EIVs.

Results
Amino acid variations in the phylogenetic clusters of polymerase genes The deduced amino acid sequences of polymerase genes of EIV H3N8 were aligned and compared against A/ equine/Richmond/07 isolate -representative strain of Florida clade 2 sub-lineage of EIV H3N8. The detailed amino acid substitutions observed in PA, PA-X, PB1, PB1-F2 and PB2, proteins encoded by the polymerase genes of EIV H3N8 are presented in additional file (see Additional file 1: Table S1a, S1b, S1c, S1d & S1e).
In  Table S1b). However, one specific consensus point mutation (G213S) was observed in Florida clade 1 & 2 isolates from 2002 onwards except A/equine/Cheshire/06 & A/equine/ Linconshire/06 isolates. However, a non-sense point mutation was observed at position 210 in A/equine/ Richmond/07 isolate. These results in a truncated PA-X protein, which was reported previously [54].
In PB1 protein, lineage -specific amino acid changes were found (Table 1 and Additional file 1:   Table S1e).

Phylogenetic analysis
The phylograms of polymerase genes broadly showed that the segment 3 (PA) is more evolutionarily divergent as compared to segment 1 (PB2) & 2 (PB1). The phylogram of PA gene divided EIVs circulating globally into four major groups: Florida clade (1& 2), American, Eurasian and Pre-divergent viruses ( Fig. 1) corresponding to those observed in the HA1gene phylogram [5]. The topogram of PA tree also revealed time scale specific grouping of EIVs, wherein A/equine/Newmarket/1/ 93 was the closest sister lineage of Florida clade-1 & 2 lineages and A/equine/Newmarket/5/03 was the parent isolate of the Florida clade 2 sub-lineage ( Fig. 1). Furthermore, in Florida clade 2 lineage, from A/equine/ Richmond/07 onwards, EIVs formed a separate group along with 2007 isolates from China (A/equine/Xinjiang/ 07) and Kazakhstan (A/equine/Almaty/07). The PA-X phylogenetic tree depicted a similar pattern of EIV H3N8 grouping into four major groups as observed for PA gene (Additional file 2). However, two exceptions were found, where A/equine/Shropshire/10 isolate grouped close to A/equine/Newmarket/5/03 in Florida clade 2 lineage and A/equine/Snailwell/98 isolate clustered between A/equine/Newmarket/1/93 and other Florida clade 1 members.
The phylogram of PB1 gene divided EIV H3N8 into Florida clade 1 & 2, America + Eurasian and Predivergent lineages (Fig. 1). The EIVs from A/equine/ Richmond/07 onwards were divided into two separate groupsthe isolates from Asian countries including China, Kazakhstan & India formed one group and European isolates formed other group. The phylogeny of PB1 also revealed that A/equine/Newmarket/1/93 isolate although clustered separately with A/equine/Cheshire/ 06, but broadly grouped under Eurasian lineage (Fig. 1

Nucleotide composition affects the codon usage bias
The influence of compositional constraints on the codon usage was assessed by computing the nucleotide compositions of the polymerase genes of EIV and correlating these with the principal axes generated on correspondence analysis (COA) which is used to study the trends in Moreover, correlation of different nucleotide compositions with the two principal axes of COA was performed. In the case of PB1, axis 1 has distinct positive correlation with G3 (r = 0.959, p < 0.0001), GC3s (r = 0.954, p < 0.0001) and negative correlation with A3 (r = −0.956, p < 0.0001). Furthermore, the Enc (Effective number of codons) values were computed to assess the degree of codon usage bias in the polymerase coding sequences of EIV. There is significant positive correlation between ENc and GC3s (r = 0.613, p < 0.0001), while ENc has positive (r = 0.668, p < 0.0001) and negative (r = −0.678, p < 0.0001) correlation with axis 1 and 2, respectively. The case of PB2 is quite different, where axis 1 has distinct positive correlation with A3 (r = 0.852, p < 0.0001) and distinct negative correlation with G3 (r = −0.871, p < 0.0001), C3 (r = −0.821, p < 0.0001), GC3s (r = −0.880, p < 0.0001), and ENc (r = −0.840, p < 0.0001). Similarly, axis 2 also showed distinct correlations with these nucleotide compositions in an opposite polarity pattern (Additional file 3). In the case of PA, axis 1 has distinct positive correlation with A3 (r = 0.910, p < 0.0001), and distinct negative correlation with C3 (r = −0.892, p < 0.0001), GC3s (r = −0.840, p < 0.0001), ENc (r = −0.659, p < 0.0001). These results demonstrate that compositional constraints indeed affect codon usage bias in all the polymerase genes of EIV, but to a different extent.

Relative synonymous Codon usage (RSCU) analysis
The RSCU analysis determines the patterns of synonymous codons usage without the confounding influence of amino acid compositions. We analyzed RSCU values of polymerase genes of EIVs and compared them with their clinical host species ( Table 2). The RSCU values of the majority of preferred and non-preferred codons fell between 0.6 and 1.6. Amongst the 18 most frequently used codons in EIVs, fourteen in PA genes (eight Aended; six U-ended), thirteen in PB1 (nine A-ended; four U-ended) and twelve codons in PB2 (eight Aended; four U-ended) were A/U-ended, while remaining were G/C-ended codons. We furthermore analyzed the over-and under-represented codons and found that almost all the over-represented codons (RSCU ≥1.

Codon usage bias among polymerase genes varies and is not clade specific
The ENc values were calculated to estimate the magnitude of codon bias in the polymerase genes of EIV H3N8. The mean ENc values of PB1 (53.04 ± 0.63) was highest, followed by PA (52.12 ± 0.55) and PB2 (49.48 ± 0.88). The ENc values of PB2 were significantly lower compared to PB1 and PA (p < 0.0001). Furthermore, the ENc values were analyzed clade wise in each polymerase i.e. in four well defined clusters of EIV H3N8 (Pre-divergent, Eurasian, Florida clade 1 and Florida clade 2). There was no significant differences Mutation bias plays a minor role in the codon usage bias of polymerase genes of EIV H3N8 An ENc-plot is widely used to find whether mutation bias/mutation pressure influence the codon usage bias or not, whereas the Parity rule 2 (PR2) plot estimates the effects of mutation pressure and natural selection on the codon usage. Hence, to examine whether codon usage patterns in the polymerase genes of EIV have been governed by mutation pressure or not, ENc-GC 3s and PR2 plots were constructed. Firstly, a PR2 plot was constructed to examine whether the biased codon choices are restricted in highly biased protein-coding genes. For this, the relationship between purines (A and G) and pyrimidines (C and T) in the four fold degenerate codon families (Ala, Arg, Gly, Leu, Pro, Ser, Thr and Val) were determined. This plot showed that AU bias predominated in the fourfold degenerate codon families in all the polymerase genes coding sequences of EIV (Fig. 3). Furthermore elucidation was derived from the ENc-GC 3s plot. In this plot, all the polymerase genes of EIV H3N8 strains clustered below the expected ENc curve (Fig. 3). None of the strains of polymerase genes fell on the expected curve, which would have indicated a major role of mutation pressure in the coding sequences of polymerase genes. These under-curve clustering indicated that factors other than mutation pressure played a major role in the codon usage bias of polymerase genes of EIV H3N8.
Natural selection predominates in shaping the codon usage patterns of polymerase genes of EIV H3N8 ENc-GC 3s plot analysis demonstrated the minor contribution of mutational pressures in shaping the codon usage patterns of the polymerase genes. We next sought to determine the magnitude of natural selection or mutation pressure in generating codon usage bias by constructing the neutrality plots, which determine the mutation-selection equilibrium in shaping the codon usage bias [55]. The distribution range of GC3 was narrow in all the polymerase genes, i.e. PA (40.2 to 47.8%), PB1 (42.5 to 50.3%), and PB2 (40 to 46.1%).
There was noticeable correlation between GC1 and GC3 in PA (r = 0.280, p < 0.05) and PB1 (r = 0.610, p < 0.0001), which seemed to indicate the qualitative role of mutation pressure in codon usage bias.
In the neutrality plot analysis, a significant positive correlation was observed between the GC12 and GC3 values of PA (r = 0.61, p < 0.0001) and PB2 (r = 0.29, p = 0.001) genes of the EIV. However, the slopes of the regression line in PA and PB2 were calculated to be 0.0946 and 0.0335, respectively (Fig. 3). This indicates that the influence of direct mutation pressure for codon usage bias in PA and PB2 genes is only 9.46% and 3.35%, respectively. The contribution of natural selection in influencing the codon usage bias was high i.e. 90.54% in PA and 96.65% in PB2 genes. In the PB1 gene of EIV, a negative correlation (−0.332, p < 0.05) was observed, with a slope of −0.0133, and the mutation pressure and natural selection were determined to be 1.33% and 98.67%, respectively, again demonstrating the dominant influence of natural selection. Despite the observed correlation in all the polymerase genes, natural selection emerged as the dominant factor influencing the codon usage bias. Similar results were also obtained when selection pressure was executed on the polymerase genes, with PB1 (dN/dS = 0.082) being under the strongest purifying selection followed by PB2 (dN/dS = 0.129) and PA (dN/dS = 0.147).

Translational selection affects codon usage bias
The influence of translational selection on codon usage bias is usually assessed based on how frequently preferred codons are recognized by the most abundant isoacceptor tRNAs of the host species. The most preferred codons (for each amino acid) were compared with their respective tRNA isotypes in equine cells. This analysis revealed that out of a total of nine, seven two-fold synonymous codon families (Asn, Lys, Asp, Glu, His, Gln, Cys) in PB1 & PB2 and six in PA (Phe, Lys, Asp, Glu, Gln, Tyr) were found to have 'non-optimal codonanticodon usage' ( Table 3). The remaining two (Phe and Tyr) had 'optimal codon-anticodon usage' in PB1 and PB2. But, these two aromatic amino acid (Phe and Tyr) in PA have their corresponding less frequent tRNA isotypes. We noticed optimal codon-anticodon usage, especially in hydrophobic amino acid (Val, Leu, Ile) of PA which is lacking in PB1 and PB2. Overall findings suggest that PA segment appears to have more adaptability to the tRNA pool of equine cells compared to PB1 and PB2 segments.
COA analysis built on RSCU of codons revealed that codons in PB1 and PB2 were frequently distributed along the first (ƒ´1) and second (ƒ´2) principal axes, respectively (Fig. 4), while diffusely distributed in the case of PA. The PB1 and PA gene segments of EIV strains grouped into four defined clusters on both of the principal axes. Moreover, investigation in this type of clustering in PB1 and PB2 revealed that the strains clustered steadily in the format of clusters generated in the phylogenetic analysis, i.e. Pre-divergent, Eurasian, Florida clade 1 and clade 2 clusters (Fig. 4). This was not in the case of PA, where strains did not form these defined clusters and rather frequently distributed along the second (ƒ´2) principal axis.

PB2 gene shows highest codon usage deoptimization for Equus caballus
The relative codon deoptimization index (RCDI) compares the similarity in codon usage of a given coding sequence with that of a reference genome. The RCDI values of each polymerase genes of EIV were computed to compare the similarity of the codon usage of these genes and the codon usage of Equus caballus. Mean RCDI values were highest for PB2 (1.43 ± 0.017) followed by PB1 (1.33 ± 0.013) and PA (1.32 ± 0.010) in relation to their clinical host, Equus caballus. Furthermore, the RCDI values were estimated cluster-wise in these polymerase genes to examine whether these variations are due to inherent properties of genes or arose during the course of evolution. We found that Florida clade 1 (PB1 = 1.37 ± 0.015, PB2 = 1.49 ± 0.020) and clade 2 (PB1 = 1.37 ± 0.003, PB2 = 1.47 ± 0.013) isolates displayed higher RCDI values than pre-divergent (PB1 = 1.28 ± 0.015, PB2 = 1.36 ± 0.014) and Eurasian (PB1 = 1.31 ± 0.019, PB2 = 1.38 ± 0.022) isolates in PB1 and PB2 genes (Fig. 5).

Equus caballus-induced selection pressure is variable among EIV polymerase genes
A similarity (SiD) analysis was performed to examine whether codon usage patterns of Equus caballus influence the evolution of the codon usage patterns of the polymerase genes coding sequences. We found that Equus caballus exerted more selective pressure on the codon usage patterns of PB1 followed by PB2 and PA (Additional file 5).
Amino acid hydropathicity plays a significant role in codon usage bias rather than aromaticity or genome length The relationship among the hydropathicity, aromaticity, genome length, ENc and first two principal axes of COA were assessed through Spearman's rank correlation analysis. Both General average hydropathicity (GRAVY) and aromaticity (AROMO) are indices of amino acid usage, and the variation in amino acid compositions can   Table S6c) suggesting that both hydropathicity and aromaticity significantly influences the codon usage bias in PA gene segment.

Discussion
The polymerase genes of EIVs like other gene segments undergo evolutionary changes. We showed that groupspecific consensus amino acid substitutions occurred in polymerase genes among EIVs (H3N8) circulating since 1963, which led to divergence of EIVs. Furthermore, we analyzed codon usage pattern of these genes to understand the factors involved in evolution of EIV and their fitness towards host. Currently EIVs belonging to clade 1 and 2 of Florida sublineage are circulating globally. Analysis of various isolates representing two lineages showed consistent 14 consensus amino acid changes from 2007 onwards. However, unavailability of sequences of polymerase genes between 2003 and 2007 obscures the exact origin of these changes. Availability of partial sequences of a Greek isolate (A/equine/Athens/03) with some of these mutations indicates their origin some time before [9]. Among polymerase proteins, PB2 is more conserved than the other two subunits (PA & PB1) of the RNP complex and interestingly, few stable point mutations in PB2 were found in all Asian isolates and A/equine/ Richmond/07 isolate which are located in the cellular mRNA cap -binding site (318-483 amino acid residues) of this protein [28]. Similarly, the D377E change observed in PB1 of all Florida clade 2 EIVs may lead to alteration of stability of the protein as previously described [56]. An amino acid substitution G70E in PB1-F2 protein, which lies in the region essential for virus mediated apoptosis pathway [57][58][59], was found to be stable in all clade 2 viruses, but variable in European and Florida clade 1 lineages.
The surveillance of internal genes along with HA1 gene may help in better prediction of evolution pattern [60,61] and severity of infections [62]. The phylogenetic analyses of the polymerase genes of EIVs circulating globally till 2015 have revealed differing evolutionary patterns (Figs. 1, 2, 3, 4 and 5). The isolates such as A/ equine/Wisconsin/03 and A/equine/Newmarket/05/03 which represent clade 1 and clade 2 of Florida sublineage according to HA gene analysis did not group in the respective clades. The PA gene showed more evolutionary divergence and classified EIVs into specific lineages which, barring few isolates as mentioned above, followed the evolutionary pattern of HA1 and corroborate with the earlier analysis of equine influenza [63] and other influenza A viruses in various host animals [64,65]. Our findings suggest that the divergence of Florida clade 2 sublineage of EIVs circulating in Asian region is occurring not only on the basis of stable point mutations in surface encoding HA protein [66], but also in the RNP complex proteins as well as Matrix (M1 and M2) protein [67]. The point mutations in polymerase complex subunits of influenza A viruses may have implication in host adaptation, virus replication and virulence as supported by several previous research findings [38,[68][69][70].
Various factors shaping the synonymous codon usage bias of the EIV polymerase genes were analyzed systematically. The correspondence analysis was carried out to assess the trends in codon usage variations of polymerase genes. This analysis has previously been used for demonstration of the evolutionary trends and classification of envelope glycoprotein genes or mitochondrial genome [71][72][73][74]. This revealed EIV strains to be grouped in well-defined clusters generated in the phylogenetic analysis in respect of PA and PB1 genes, whereas strains did not form defined clusters for PB2 gene.
The overall nucleotide composition affected the codon usage bias especially for codons ending with A/U, which is supported by the fact that over-represented codons (RSCU ≥1.6) were A-ended. Earlier studies on genome based codon usage bias analysis in different influenza viruses specifically H3N8 [52], H5N1 [75], H1N1 and H3N2 [50] also demonstrated a preference towards Aor U-ended codons. Interestingly, none of the overrepresented codons in any of the polymerase genes was common to their host species. The ENc values indicated that overall codon usage bias was significantly lower in PB2 (49.48 ± 0.88) as compared to PB1 (53.04 ± 0.63) and PA (52.12 ± 0.55), while the difference was not significant amongst the four clusters of respective polymerase gene. ENc values dropped steadily during the course of evolution from 1963 to current isolates which may reduce competition between the virus and the host for utilizing cellular machinery and thus favour efficient replication of the virus [74,76]. Similar low codon bias has also been observed in other RNA viruses such as Ebola virus [77], Zika virus [78], H1N1pdm IAV [79]; equine influenza virus H3N8 [52], Foot-and-Mouth Disease Virus [80]; H5N1 influenza virus [75] and equine infectious anaemia virus [81].
However, the low codon bias as revealed by the ENc does not reflect its underlying cause (mutation pressure and/or natural selection). To investigate this ENc-GC3s and PR-2 bias plots were constructed where the undercurve clustering of polymerase genes of EIVs showed that mutation pressure is not the sole factor that influenced the codon usage patterns of the EIV polymerase genes coding sequences. The magnitude of these factors in generating codon usage bias as determined by the   The RCDI analysis of phylogenetic derived clusters of EIVs polymerase genes in relation to their host species, Equus caballus neutrality plots, which indicated the predominant role of natural selection in influencing the codon usage bias in all the three polymerase genes (highest in PB1) of EIVs. However, previous studies reported that mutational pressure dominated in structuring the codon usage patterns of H1N1 pdm IAV [79], various RNA [82] and DNA viruses [83]. Contrastingly, PB1-F2, a protein encoded by an alternative open reading frame (ORF) of segment 2 of EIV having role in enhanced pathogenicity was governed by multiple factors such as mutation pressure, natural selection and gene length [53]. Host specific mutation bias was detected while comparing the human and avian influenza viruses [84].
The host imposed translational selection pressure also influenced the codon usage bias of the EIV polymerase genes which was assessed by comparing how most preferred codons are recognized by their respective tRNA isotypes in equine cells. Our previous studies showed that overall codon usage of EIVs does not seem to be well adapted to tRNA pool of equine cells [52]. However, more stringent analysis of polymerase genes in this study suggests that PA segment in comparison to PB1 and PB2 appears to have better adaptability to its clinical host, horse especially by utilizing the high copy number tRNA-anticodon for valine, serine leucine, aaparagine, histidine, isoleucine, cysteine in equine cells. Overall, the polymerase genes by virtue of this could capture the more copy of respective tRNA-anticodon in the equine cells and allow the virus to quickly replicate. However, overall adaptability of polymerase genes to tRNA pool of equine cells was moderate. Other determining parameters of natural selection driven codon bias, GRAVY and AROMO values (indices of amino acid usage) were estimated and correlated with two main axes of COA [85]. Interestingly, aromatic amino acid (Phe and Tyr) had 'optimal codon-anticodon usage' in PB1 and PB2, but not in PA. Also, both hydropathicity and aromaticity significantly influence the codon usage bias in PA gene segment, but only hydropathicity in both PB1 and PB2 gene segments.
The high RCDI in PB2 as compared to PB1 and PA genes indicates less adaptation to the codon usage towards highly expressed gene in Equus caballus. The increment in RCDI during the course of evolution in PB1 and PB2 furthermore suggests their less adaptability towards the host. This might lead to low replication rate for successful establishment of virus in its host with alternative codon usage patterns [86]. A similarity (SiD) index revealed that Equus caballus exerted more selective pressure on the codon usage patterns of PB1 as compared to PB2 and PA. Earlier studies also estimated low adaptability in HIV-1 [86] and H1N1pdm IAV in human cells [50]. Destite this, HIV-1 modulate the tRNA pool by selectively enhancement of tRNAs translating A-ending codons (86). This suggests that polymerase genes of EIV having moderate adaptability, may modulate tRNA pool and selectively enhance tRNAs translating Aending codons for maintaining the efficient replication.

Conclusions
The genetic analysis of the three polymerase genes of EIVs in the present study and matrix gene in our previous study revealed that clade 2 EIVs underwent major changes which led to emergence of a phylogenetically distinct group of EIVs originating from Richmond/1/07 (other sequences being unavailable). These consensus amino acid changes especially in clade 2 EIVs may have repercussions on pathologenicity and virulence of EIVs and needs further studies. The polymerase genes of EIVs exhibit a weak codon usage bias. Multiple factors such as nucleotide composition, mutation pressure, aromaticity and hydropathicity influenced the synonymous codon usage bias in polymerase genes, with natural selection being the major influencing factor.

Viruses and sequencing
The equine influenza virus (EIV) H3N8 isolates of Florida clade 2 sub-lineage viz. A/equine/Katra-Jammu/06/08, A/ equine/Mysore/01/08 and A/equine/Ahmedabad/1/09 isolated by our laboratory from EI epizootic (2008-09) in India [66], were included in the study. The isolates were propagated in embryonated hen's eggs. Viral RNA was extracted from 200 μl of allantoic fluid using a RNA extraction kit (QIAamp (R) Viral RNA Mini Kit; Qiagen, Valencia, CA), cDNA synthesized using Uni12 primer [87] and polymerase genes were amplified using designed primers (Additional file 7). Amplifications of polymerase genes were carried out using 0.5 μl (3 U) of Hot-Start Taq DNA polymerase (Qiagen, Valencia, CA) in a 50 μl reaction volume with cycling conditions of 95°C for 5 min followed by 34 cycles of 95°C for 1 min, 52°C for 1 min, 72°C for 3 min and final extension at 72°C for 10 min. All amplicons were cloned into pTZ57R/T vector (MBI Fermentas, Burlington, Canada) and three recombinant plasmids of each polymerase genes were sequenced commercially using Dye Deoxy Terminator Cycle Sequencing strategy from EuroFins Genomics India Pvt. Ltd. Bangalore, India.

Sequence data
The full length nucleotide sequences of three polymerase genes of EIVs across different lineages reported worldwide between 1963 to 2015 were retrieved from the Influenza Virus Resource at the National Center for Biotechnological Information (http://www.ncbi.nlm.nih.gov/ genomes/FLU/FLU.html). Few sequences from GISAID (www.gisaid.org) were also included in the present study in consultation with the Animal Health Trust, UK. The data set comprised 387 nucleotide sequences (124 of PA gene, 134 of PB1 gene & 129 of PB2 gene). The details of the EIV strains and accession numbers of sequences used in this study are provided in Additional file 8.

Sequence and phylogenetic analyses
The consensus nucleotide sequences of 3 clones of each polymerase gene of Indian isolates were generated. Deposited in GenBank, NCBI and accession numbers are provided in Table S2. For comparative studies, the nucleotides as well as deduced amino acid sequences of these genes of EIV H3N8 circulating throughout the globe were aligned using ClustalW programme of MEGA5.0 software [88].
Phylogenetic trees of the polymerase genes were inferred using MEGA5.0 software employing the maximum likelihood model using nucleotide sequences including parameters of transitions and transversions of nucleotide substitutions, equal substitution rates among sites but heterogenous rate among lineages and gaps treated by pairwise-deletion [88]. The bootstrap analyses of the phylograms were carried out with 1000 replicates of dataset to determine the robustness of the individual nodes of the tree.

Selection pressure analysis
Site specific selection pressure was estimated on three gene segments (PB1, PB2, and PA) of EIV using the HyPhy software implemented in the Datamonkey webserver [89]. The similar/duplicate sequences were excluded by this server. The best nucleotide substitution models, PA (012212), PB1 (012210), and PB2 (010210) were chosen based on Akaike Information Criterion (AIC) on a defined Neighbor-Joining (NJ) phylogenetic tree. We also performed GARD (Genetic Algorithm for Recombination Detection) to detect any recombination [90]. We compared four models, Single Likelihood Ancestor Counting (SLAC) model, Fixed Effect Likelihood (FEL) model, Mixed Effects Model of Evolution (MEME), and Fast Unbiased Bayesian Approximation (FUBAR) for estimation of sites under selection pressure [91][92][93]. The ratio of nonsynonymous (dN) to synonymous (dS) substitutions per site (x = dN/dS) was estimated to know the strength of selection pressure. The sites with p values <0.1 for SLAC, FEL and MEME models, and a posterior probability >0.90 for FUBAR were accepted as candidates for selection.

Nucleotide composition analysis
The nucleotide compositional parameters were calculated for polymerase genes of each EIV strain. These includes the frequencies of occurrence of each nucleotide (A %, U %, C %, and G %); each nucleotide at the third position of the synonymous codons (A3%, U3%, C3%, and G3%); G + C at the first (GC1), second (GC2), and third codon positions (GC3); G + C at the first and the second positions (GC12). The codons for Met (AUG), Trp (UGG) and termination codons (UAA, UGA, UAG) were removed from the analysis, as they were not expected to contribute in the codon usage bias.

Relative synonymous codon usage (RSCU)
The RSCU values for all the coding sequences of polymerase genes of EIV were calculated to determine the patterns of synonymous codon usage without the confounding influence of amino acid composition. The RSCU value of a codon is measured as the ratio of its observed frequency to its expected frequency given that all codons for a particular amino acid are used equally [89]. RSCU values are not affected by sequence length and amino acid frequency since these factors are eliminated during the computation. The RSCU values were estimated as per equation given below [94]: Where g ij is the observed number of the i th codon for the j th amino acid which has n i kinds of synonymous codons. Synonymous codons showing the RSCU values of <1.0 represent negative codon usage bias, while of 1.0 and >1.0 represent no bias and positive codon usage bias, respectively.

Effective number of codons (ENC)
The ENc values were computed to assess the degree of codon usage bias in the polymerase coding sequences of EIV using the formula given below: Where F (i = 2,3,4,6) is the mean of F i values for the ifold degenerate amino acid. The F i values were calculated using the formula given below: Where n is the total number of occurrences of the codons for that amino acid and n j is the total number of occurrences of the j th codon for that amino acid. The ENc values range from 20 to 61 [95]. The ENc value of 20 states an extreme codon usage bias (only one of the possible synonymous codons is used for the corresponding amino acid), while that of 61 states no bias at all (all possible synonymous codons are used equally for the corresponding amino acid). Consequently, the smaller the ENc value, the greater will be the extent of codon usage bias. Altogether, a gene of ENc ≤ 35 is designated to possess strong codon bias [95,96].

ENc-plot
An ENc-plot is widely employed to see whether mutation bias/mutation pressure influence the codon usage bias or not. In this plot, the ENc values are the ordinate and the GC3s values (frequency of either a guanine or cytosine at the third codon position of the synonymous codons, excluding Met, Trp, and stop codons) are the abscissa [95]. If predicted ENc values lies on or around the standard curve (functional relation between expected ENc and GC3s), then codon usage is constrained only by G + C mutation bias. Else, other factors such as natural selection play significant role in shaping the codon usage bias. Expected ENc values were calculated as below: Where 's' is the frequency of G + C at the third codon position of synonymous codons (i.e. GC3s).

Parity rule 2 analysis
The Parity rule 2 (PR2) plot was constructed to estimate the effects of mutation pressure and natural selection on the codon usage of EIVs polymerase genes. In this plot, the AU-bias [A3/(A3 + U3)] at the third codon position of the four-codon amino acid of entire coding sequences is the ordinate and the GC-bias [G3/(G3 + C3)] is the abscissa. The center of the plot, where both coordinates crosses at 0.5, denotes no bias between the influence of the mutation pressure and natural selection [97,98].

Neutrality plot
A neutrality plot was generated where, GC12 is the ordinate and GC3 is abscissa. A plot regression with a slope of 0 (the points positioned on the parallel lines of the abscissa) indicates no effect of directional mutation pressure, while a slope of 1 (the points positioned on the diagonal line) is indicative of complete neutrality [99].

General average hydropathicity (GRAVY) and aromaticity (AROMO)
The GRAVY values were calculated as a sum of the hydropathy values of all the amino acid in the gene product divided by the number of residues in the polymerase gene sequences [100]. These values range from −2 to 2; where positive and negative values are indicative of hydrophobic and hydrophilic proteins, respectively. AROMO value denotes the frequency of aromatic amino acid (Phe, Tyr, Trp).
Both GRAVY and AROMO values are indices of amino acid usage, and the variation in amino acid compositions can also influence the results of codon usage analysis.

Relative codon deoptimization index
The relative codon deoptimization index (RCDI) compares the similarity in codon usage of a given coding sequence with that of a reference genome. The RCDI values for the polymerase genes of EIV were computed using web-based RCDI/eRCDI server (http://genomes.urv.es/ CAIcal/RCDI/). This server also calculates expected RCDI values for a set of sequences by generating random sequences with similar G + C content and amino acid compositions. The expected RCDI provides a direct threshold value for discerning whether the differences in the RCDI value are statistically significant and arise from the codon preferences or whether they are merely artifacts that arise from internal biases in the G + C composition and/or amino acid composition of the query sequences. The RCDI value of 1 specifies that the virus trails the host codon usage pattern and shows a host-adapted codon usage pattern. Contrariwise, RCDI values higher than 1 indicate the deoptimization of the codon usage patterns of the virus from that of its host(s) [101,102].

Similarity index
The similarity (SiD) index provides an insight into the influence of the overall codon usage pattern of the host on the formation of the overall codon usage of the virus, and was calculated as follows: Where R(A,B) is defined as the cosine value of the angle included between the A and B spatial vectors, and represents the degree of similarity between the virus and host overall codon usage patterns. a i is defined as the RSCU value for a specific codon among the 59 synonymous codons of the virus coding sequence. b i is the RSCU value for the same codon in the host. D(A,B) represents the potential effect of the overall codon usage of the host on that of the virus, and its value range from 0 to 1.0 [103].

Statistical analysis
Correspondence analysis (COA), a multivariate statistical method is widely used to study the trends in codon usage variations. In this analysis, the degrees of freedom were condensed to 40 (from 59 synonymous codons) by