A Bioinformatic Glimpse of Human-Origin Zika Virus Polyprotein

An amino acid position with both maximum Shannon information entropy and maximum cumulative mutual information is identified in the Zika virus polyprotein. This position is used to sort the subset of Zika virus polyprotein mutations found exclusively in viruses isolated from human hosts and not in viruses isolated from vector Aedes mosquitos. The identified mutational position is a component of a 20-mer peptide in the NS1 protein that Freire et al. reported as a putative epitope. It is suggested that the coincidence of these two bioinformatic maxima at an exclusively human mutational site supports the proposed immunological function of that site.

It was recently reported that the set of mutations in polyproteins obtained from ZIKV isolated from humans can be partitioned into two subsets [6]: an Exclusive (x) subset and a Common (c) subset. Mutations in the x subset occur exclusively in the human host. In contrast, mutations in the c subset occur in ZIKV isolated both from human hosts and from the Aedes mosquito vectors. The present report focuses on the identification and proposed immunological significance of the bioinformatically predominant member of the exclusively human (x) subset. It is proposed that mutations that occur only in the x subset and not in the c subset may reflect biological processes that occur in humans but not in mosquitos. Some of these processes may be metabolic, some may be conformational, and some may be immunologic. In the work reported here, an amino acid position with prominent bioinformatic properties is identified within the Zika virus polyprotein and a biologic function of that amino acid position is proposed.
Polyprotein domains were assigned by alignment with the default MR 766 reference sequence [7]. Polyprotein sequence management was facilitated with Jalview 2.9.0b2 [8]. Information entropy (H) was computed by the equation of Shannon [9] and is expressed in bits. H was determined for all amino acid positions in the set of polyproteins isolated from humans, and independently for all amino acid positions in the set of polyproteins isolated from mosquitos. Amino acid positions where H>0.0 in the polyprotein were classified and sorted into Exclusive (x) and Common (c) subsets as previously described [6], depending upon whether (1) a positive H value occurred only at amino acid positions in ZIKV polyproteins obtained exclusively from human hosts or whether (2) the positive H value occurred at amino acid positions in ZIKV polyproteins common both to human hosts and to Aedes species of vector mosquitos.
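The entropy computation and the Exclusive/Common sorting described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' supplementary code: the function names, the treatment of gap characters, and the toy alignments are assumptions, and Common is taken here to mean H>0.0 in both the human and the Aedes sets.

```python
import math
from collections import Counter

def column_entropy(column, gap_chars="-X?"):
    """Shannon entropy (bits) of one alignment column, ignoring gaps/unknowns."""
    residues = [aa for aa in column if aa not in gap_chars]
    counts = Counter(residues)
    if len(counts) <= 1:          # empty or fully conserved column
        return 0.0
    n = len(residues)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_profile(aligned_seqs):
    """Entropy of every column in a list of equal-length aligned sequences."""
    return [column_entropy(col) for col in zip(*aligned_seqs)]

def sort_positions(human_seqs, aedes_seqs):
    """Split variable human positions (H > 0) into Exclusive and Common lists.

    Positions are 1-based, matching polyprotein numbering.
    """
    h_human = entropy_profile(human_seqs)
    h_aedes = entropy_profile(aedes_seqs)
    exclusive, common = [], []
    for pos, (hh, ha) in enumerate(zip(h_human, h_aedes), start=1):
        if hh > 0.0:
            (common if ha > 0.0 else exclusive).append(pos)
    return exclusive, common

# Toy alignments invented for illustration (not ZIKV data).
human = ["MKRW", "MKQW", "MRRW", "MKRW"]
aedes = ["MKRW", "MRRW", "MKRW"]
exclusive, common = sort_positions(human, aedes)
print("Exclusive:", exclusive, "Common:", common)
```

In the toy data, position 3 varies only in the human set and position 2 varies in both sets, so they fall into the Exclusive and Common lists respectively.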
Mutual information (MI), also in bits, was computed according to Cover and Thomas [10]. Cumulative mutual information (cMI) was computed with exclusion of autocorrelation. Z-tests were performed using 1000 pseudo-random trials and are reported with two-tailed probabilities. Secondary structure (ss3) of the non-structural NS1 protein was computed online with RaptorX [11].
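A hedged sketch of the MI and cMI calculations follows. It assumes that "exclusion of autocorrelation" means omitting the pairing of a position with itself, and that cMI for a position is the sum of its pairwise MI with every other position in the chosen subset; the function names are hypothetical and gap handling is omitted for brevity.

```python
import math
from collections import Counter

def mutual_information(col_a, col_b):
    """Mutual information (bits) between two equally long alignment columns."""
    n = len(col_a)
    count_a, count_b = Counter(col_a), Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_joint = c / n
        # p(a,b) / (p(a) p(b)) written with raw counts
        mi += p_joint * math.log2(c * n / (count_a[a] * count_b[b]))
    return mi

def cumulative_mi(aligned_seqs, positions):
    """cMI per position: sum of MI with every *other* listed position.

    `positions` are 1-based column numbers, e.g. the Exclusive subset.
    """
    columns = list(zip(*aligned_seqs))
    return {
        i: sum(
            mutual_information(columns[i - 1], columns[j - 1])
            for j in positions
            if j != i          # assumed meaning of "exclusion of autocorrelation"
        )
        for i in positions
    }

# Toy alignment: columns 1 and 2 co-vary perfectly, column 3 is independent.
toy = ["ACK", "ACR", "GTK", "GTR"]
print(cumulative_mi(toy, [1, 2, 3]))
```

Passing only the Exclusive positions as `positions` corresponds to the xx pairing, in which both members of each pair belong to the Exclusive subset.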
Sample Python code for computing the major components of H and cMI, and for sorting human and Aedes polyprotein sequences, is in the Supplementary Information file.
A plot of the information entropy (Hx) as a function of cumulative mutual information (cMIxx) for the Exclusive (x) human set of ZIKV polyprotein amino acid positions at which Hx>0.0 is shown in Figure 1. The xx notation indicates that both members of the pair used to compute mutual information are members of the Exclusive subset. Of the two hundred and ninety-four (294)
The sorting of mutational positions described above is based upon the mutational position with both maximum information entropy and maximum cumulative mutual information. This mutational position was amino acid 1118 of the ZIKV polyprotein. Amino acid 1118 is a component of the nonstructural NS1 protein domain of the ZIKV polyprotein. Position 1118 was occupied by an amino acid in 387 of the 389 sequences (99.49%) in the dataset prepared from humans. The amino acids at position 1118 were (counts in parentheses): wild type=ARG (315), mutant1=TRP (61) and mutant2=GLN (11). Each of the two mutant counts differs significantly from zero: mutant1 Z=7.9173, p=2.4278 × 10^-15 and mutant2 Z=3.2354, p=0.0012, with a combined (mutant1, mutant2) probability p=2.9488 × 10^-18.
It should be noted that the Hopp-Woods [12] hydrophilicity coefficients of these amino acids are: HW(ARG)=3.0, HW(TRP)=-3.4 and HW(GLN)=0.2. The values of the HW coefficients for the observed amino acids thus span the entire hydrophilicity spectrum. The expected effects of these amino acids on the secondary structure of the NS1 protein are shown in Figure 2. Each of the three amino acids at position 1118 is a component of an extended strand. However, there is an increased helical tendency in the wild type ARG1118 NS1 protein, both in the neighboring NH2-region and in several more distal regions between position 1118 and the NH2-terminus of the protein.
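As a worked illustration, the Shannon entropy at position 1118 can be recomputed from the residue counts reported above. The sketch assumes that the two unoccupied sequences are simply excluded from the frequencies; the entropy it prints is a derived figure and is not a value quoted in the text.

```python
import math

# Residue counts at polyprotein position 1118 in the human dataset (from the text).
counts = {"ARG": 315, "TRP": 61, "GLN": 11}
total_sequences = 389                      # all human-origin sequences

occupied = sum(counts.values())            # sequences with a residue at 1118
occupancy_pct = 100.0 * occupied / total_sequences

# Shannon entropy in bits over the occupied sequences only (assumption).
freqs = {aa: c / occupied for aa, c in counts.items()}
h_bits = -sum(p * math.log2(p) for p in freqs.values())

print(f"occupied: {occupied} of {total_sequences} ({occupancy_pct:.2f}%)")
print(f"H(1118) = {h_bits:.4f} bits")
```

For comparison, three equally frequent residues would give the upper bound log2(3), roughly 1.585 bits, so the dominance of ARG keeps the entropy well below that limit.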
Amino acid position 1118 is a member of the Exclusive human subset of mutation positions, i.e., no mutations were observed at position 1118 in the dataset of polyproteins of Aedes origin. The mutation rate at position 1118 in the sequences exclusively of human origin was 72/387=18.60%. Applying that mutation rate to the set of sequences of Aedes origin yields a predicted 50 × 18.60%=9.3 mutations. Instead of the predicted 9.3 mutations, zero mutations were observed. The difference between zero observed mutations and 9.3 predicted mutations is statistically significant (Z=3.0569, p=2.2362 × 10^-3). It is noted that a larger ZIKV dataset of mosquito origin may be needed to detect a possible shift of position 1118 from the Exclusive to the Common subset. Meanwhile, it may currently be concluded that position 1118 is indeed a member of the Exclusive human subset of mutations and that membership in the Exclusive subset of ZIKV mutations reflects biological processes that occur in human hosts but not in Aedes mosquito vectors. One class of such biological processes is the immunologic response.
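The arithmetic in this paragraph, and the conversion of a Z score into a two-tailed probability, can be checked with the short sketch below. It does not reproduce the 1000 pseudo-random trials from which the Z values were obtained; it assumes only that each reported p is the two-tailed tail area of a standard normal distribution, and the helper name `two_tailed_p` is hypothetical.

```python
import math

def two_tailed_p(z):
    """Two-tailed probability that a standard normal deviate exceeds |z|."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Mutation rate at position 1118 in the human-origin sequences (from the text).
mutants, occupied = 61 + 11, 387
rate = mutants / occupied

# Mutations expected among the Aedes-origin sequences at the same rate.
n_aedes = 50
expected = n_aedes * rate

print(f"rate = {100 * rate:.2f}%, expected Aedes mutations = {expected:.1f}")
for z in (3.0569, 3.2354, 7.9173):          # Z values reported in the text
    print(f"Z = {z}: two-tailed p = {two_tailed_p(z):.4e}")
```

Using `math.erfc` rather than `1 - erf(...)` avoids loss of precision for very small tail probabilities such as the one associated with Z=7.9173.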

Results and Discussion
ZIKV is known to induce antibody-mediated and cell-mediated immunological responses [13,14]. Most significant to the bioinformatic parameters reported here, polyprotein amino acid 1118 is a component of a putative 20-mer conformational epitope (CE6) that has been reported [15] for polyprotein amino acids 1105-1124. These 20 amino acids are within the nonstructural protein NS1 domain of the polyprotein, where they occupy positions 311-330 of the mature NS1. This 20-amino acid sequence was computationally identified [15] by structural, conformational and epitopic mapping of Zika virus polyproteins by means of the combined use of ElliPro [16], Epitopia [17] and DiscoTope [18]. ElliPro uses the protein structure and geometric properties to computationally predict immunogenic regions of the protein [16]. Epitopia [17] uses protein structure and amino acid sequence to computationally predict B-cell antigenicity of the protein. DiscoTope [18] computationally predicts the occurrence of discontinuous B-cell epitopes on the basis of the three-dimensional structure and surface accessibility of the protein. Thus, none of these epitope-prediction methods depends directly upon the propensity of the amino acids at a given position in a set of sequences to mutate, as does the Shannon information entropy reported here.
Polyprotein position 1118 is position 324 of the mature NS1 protein.
CE6 is depicted below as peptide 1, along with the variants described in this report as peptide 2 and peptide 3:

R1118, R324 = WCCRECTMPPLSFRAKDGCW (1)
W1118, W324 = WCCRECTMPPLSFWAKDGCW (2)
Q1118, Q324 = WCCRECTMPPLSFQAKDGCW (3)

The mutation site at polyprotein amino acid 1118 (NS1 amino acid 324) is the fourteenth residue of each peptide. The data and analysis presented here support and extend those presented by Freire et al. [15] and therefore suggest that the observed Shannon entropy may be associated with immunological activity. Unfortunately, not all immunological activity is favorable to the infected host. For example, Zika virus has been shown to inhibit and evade the immune response by interaction with several regulatory physiological processes at the molecular level [19] and to cross-react with antibodies against other flaviviruses, thereby worsening infection through antibody-dependent enhancement [20,21].
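The relationship between polyprotein numbering, mature NS1 numbering and the three CE6 peptides can be made explicit with a small sketch. The offset of 794 is inferred from the stated correspondence between polyprotein positions 1105-1124 and NS1 positions 311-330; the constant and function names are hypothetical.

```python
CE6_START = 1105                    # first polyprotein residue of CE6 (from the text)
NS1_OFFSET = 1105 - 311             # polyprotein minus mature-NS1 numbering
CE6_WILD_TYPE = "WCCRECTMPPLSFRAKDGCW"

def to_ns1(polyprotein_pos):
    """Convert a polyprotein position to mature NS1 numbering."""
    return polyprotein_pos - NS1_OFFSET

def substitute(peptide, polyprotein_pos, new_residue, start=CE6_START):
    """Return the peptide with one residue replaced at a polyprotein position."""
    idx = polyprotein_pos - start   # 0-based index within the peptide
    if not 0 <= idx < len(peptide):
        raise ValueError("position lies outside the peptide")
    return peptide[:idx] + new_residue + peptide[idx + 1:]

# Wild type (R) and the two observed variants (W, Q) at position 1118.
variants = {aa: substitute(CE6_WILD_TYPE, 1118, aa) for aa in "RWQ"}

print("NS1 position:", to_ns1(1118))
for aa, seq in variants.items():
    print(f"{aa}1118, {aa}{to_ns1(1118)} = {seq}")
```

Deriving the variants from a single wild-type string and a position, rather than typing each peptide by hand, guards against transcription errors in the one residue that differs.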
Because of its serious and common effects on infants infected in utero and the serious, albeit rare, CNS diseases it causes, Zika virus remains a significant public-health problem [7,22]. As of this writing, there is neither a preventive vaccine nor a treatment for infection by Zika virus. The three peptides reported here, which share a highly mutating position at polyprotein amino acid 1118, may help provide a basis for the needed anti-Zika vaccine. Initial analysis of the immunological characteristics of these peptides in an experimental system should be relatively simple, rapid and cost-effective.

Conclusion
It is recognized that the sorting assignment of amino acid position 1118 to the Exclusive human subset may change with time, especially because of the relatively small size of the current set of Zika polyproteins of Aedes mosquito origin (n=50). As stated above, it is also recognized that the bioinformatic maxima of position 1118 reported here may be associated with non-immunological biological processes. Those processes may be manifest as networks of interacting genes detectable by bioinformatic techniques similar to those used here for ZIKV and reported previously for influenza A virus [23]. Indeed, the network analysis previously used for influenza A can be applied to the Exclusive and Common subsets of Zika polyprotein mutational amino acid positions. A network analysis can increase insight into the biological organization and the driving forces behind those mutations. However, in the context of infantile microcephaly and the other complications associated with Zika infection, the results reported here, in agreement with the findings of Freire et al. [15], suggest that the immunogenicity, toxicity and protective effectiveness of these three peptides should expeditiously be tested experimentally for potential clinical usefulness.