Systematic analysis of protein identity between Zika virus and other arthropod-borne viruses

Abstract
Objective To analyse the proportions of protein identity between Zika virus and dengue, Japanese encephalitis, yellow fever, West Nile and chikungunya viruses as well as polymorphism between different Zika virus strains.
Methods We used published protein sequences for the Zika virus and obtained protein sequences for the other viruses from the National Center for Biotechnology Information (NCBI) protein database or the NCBI virus variation resource. We used BLASTP to find regions of identity between viruses. We quantified the identity between the Zika virus and each of the other viruses, as well as within-Zika virus polymorphism for all amino acid k-mers across the proteome, with k ranging from 6 to 100. We assessed accessibility of protein fragments by calculating the solvent accessible surface area for the envelope and nonstructural-1 (NS1) proteins.
Findings In total, we identified 294 Zika virus protein fragments with both low proportion of identity with other viruses and low levels of polymorphisms among Zika virus strains. The list includes protein fragments from all Zika virus proteins, except NS3. NS4A has the highest number (190 k-mers) of protein fragments on the list.
Conclusion We provide a candidate list of protein fragments that could be used when developing a sensitive and specific serological test to detect previous Zika virus infections.


Introduction
Monitoring the geographic and the demographic distribution of people infected with Zika virus is important for informing decision-makers and researchers during the ongoing epidemic. Health officials also need further knowledge about the associations between Zika virus infection and its sequelae, such as microcephaly and Guillain-Barré syndrome. However, the absence of a sensitive and specific serological test for detecting prior Zika virus infection impedes research. According to the World Health Organization's Target product profiles for better diagnostic tests for Zika virus infection, 1 such a test must be able to differentiate between chikungunya, dengue and Zika viruses, since these mosquito-borne arboviruses can be cocirculating and can cause similar symptoms. 2 Dengue and Zika viruses belong to the virus family Flaviviridae, while chikungunya virus belongs to the Togaviridae family. Although they belong to different virus families, Zika and chikungunya viruses share some similarities in envelope protein folding and membrane fusion mechanisms. 3 Active Zika virus infections can be detected by nucleic acid-based diagnostic tools. 4,5 However, developing serological diagnostic tests to detect previous Zika virus infections has been challenging, because of cross-reactivity between antibodies against different arboviruses. 6-12 Hence, current serological assays, such as enzyme-linked immunosorbent assay (ELISA) and plaque reduction neutralization tests, may not be able to distinguish if a person has been infected with Zika virus or another flavivirus or if a person has received a previous yellow fever or Japanese encephalitis vaccination. 13,14 A study has shown that neutralizing monoclonal antibodies generated against recombinant fragments of the envelope protein of dengue virus serotype 2 tend to be cross-reactive among flaviviruses, while nonneutralizing antibodies seem to be virus specific. 15
We hypothesize that immunogenic protein regions with sequence dissimilarity may exist across arthropod-borne viruses (arboviruses) and that antibodies targeting these regions may be less likely to be cross-reactive. Identifying such regions could aid the development of specific microarray-based serological tests, such as a peptide microarray, to detect Zika virus and/or other related viruses. A peptide microarray is a high-throughput method for detecting interactions between peptides and antibodies and is composed of multiple spots of peptides on a solid surface. 16 We also hypothesize that protein regions that are more conserved among different strains of the Zika virus are more likely to contribute to the sensitivity of the peptide microarray. Thus, to identify Zika virus conserved protein fragments that are variable among other virus species, we analysed proportions of protein sequence identity across virus species and protein polymorphism among different strains of Zika virus. We analysed the flaviviruses Zika, dengue, West Nile, Japanese encephalitis and yellow fever, and the alphavirus chikungunya.

Methods
We used publicly available proteomic sequencing data (Table 1). For the Zika virus, we used data set A from Faria et al. 17 We downloaded the protein sequences of Japanese encephalitis virus, yellow fever virus and chikungunya virus from the National Center for Biotechnology Information (NCBI) protein database and the sequences for dengue virus serotypes 1-4 and West Nile virus from the NCBI virus variation resource. 18 We used BLASTP 19 to find regions of identity between arboviruses, applying the default Expect (E)-value threshold of 10; that is, the number of hits with the observed similarity that is expected by chance is fewer than 10. The results were robust: we obtained the same results when the E-value threshold was 5 or 50. When comparing the chikungunya and the Zika viruses, we used an E-value threshold of 1000, because chikungunya does not belong to the Flaviviridae family and we could not identify any regions of similarity when using an E-value threshold of 10. For all protein fragments across the proteome, we calculated the proportion of shared amino acids between virus species and polymorphism among different Zika virus strains. We analysed protein fragments of different lengths, so-called k-mers (where k is the amino acid length of the protein fragment), with k equal to 6 or ranging from 10 to 100. We used a sliding window approach, where we moved the window one amino acid at a time along the proteome to include every possible k-mer. To be conservative, we defined the protein fragment identity between species as the maximum identity among all the pairs of strains for each window considered. For analysing the identity with dengue virus, we used the highest identity between the Zika virus and all four serotypes of the dengue virus for each window considered. To assess if protein identity between the Zika virus and each of the dengue serotypes was significantly associated with polymorphism within each dengue virus serotype, we calculated P-values by using Pearson's correlation test.
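The sliding window calculation described above can be sketched as follows. This is a minimal illustration rather than the code used for the analysis: it assumes the sequences have already been aligned to equal length (the study located regions of identity with BLASTP), it counts a gap character as a mismatch, and the function names and the short sequences in the usage note are hypothetical.

```python
from typing import List, Sequence


def window_identity(a: str, b: str, k: int) -> List[float]:
    """Proportion of identical residues in every k-residue window.

    Assumes `a` and `b` are already aligned to the same length.
    A gap ('-') never counts as a match (an assumption of this sketch).
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    matches = [int(x == y and x != "-") for x, y in zip(a, b)]
    n_windows = len(a) - k + 1
    if n_windows <= 0:
        return []
    current = sum(matches[:k])
    identities = [current / k]
    # Slide the window one residue at a time, updating the match count.
    for start in range(1, n_windows):
        current += matches[start + k - 1] - matches[start - 1]
        identities.append(current / k)
    return identities


def max_identity(zika: Sequence[str], others: Sequence[str], k: int) -> List[float]:
    """Conservative per-window identity: the maximum over all strain pairs."""
    best = None
    for z in zika:
        for o in others:
            ident = window_identity(z, o, k)
            best = ident if best is None else [max(p, q) for p, q in zip(best, ident)]
    return best or []
```

For example, `max_identity(["MKNPKKKS"], ["MKTPKAKS", "MKNPQAKS"], 4)` returns one value per window, each being the highest identity found against either comparison strain. Passing all four dengue serotypes as `others` would correspond to taking the highest identity across serotypes.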
To identify polymorphisms within viruses, we used both the average pairwise difference and the proportion of polymorphic sites. Average pairwise difference is calculated by averaging the proportions of differences in peptide sequences from all pairs of the virus strains. We chose to plot the proportion of polymorphic sites in the figures because it is less sensitive to population structure and/or sampling bias.
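The two polymorphism measures can be illustrated with a short sketch. It assumes aligned strain sequences of equal length; the function names and example sequences are hypothetical and not taken from the study.

```python
from itertools import combinations
from typing import Sequence


def proportion_polymorphic_sites(strains: Sequence[str]) -> float:
    """Fraction of alignment positions at which the strains are not all identical."""
    length = len(strains[0])
    if any(len(s) != length for s in strains):
        raise ValueError("strains must be aligned to equal length")
    polymorphic = sum(1 for column in zip(*strains) if len(set(column)) > 1)
    return polymorphic / length


def average_pairwise_difference(strains: Sequence[str]) -> float:
    """Mean, over all strain pairs, of the proportion of differing positions."""
    pairs = list(combinations(strains, 2))
    if not pairs:
        return 0.0
    total = sum(sum(x != y for x, y in zip(a, b)) / len(a) for a, b in pairs)
    return total / len(pairs)


strains = ["MKNPKK", "MKNPKR", "MKTPKK"]  # hypothetical aligned fragments
poly = proportion_polymorphic_sites(strains)
diff = average_pairwise_difference(strains)
```

In this toy example, adding further copies of one strain changes the average pairwise difference but leaves the proportion of polymorphic sites unchanged, which illustrates why the latter is less affected by uneven sampling.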
To identify potential protein fragments that could be used for diagnostic tests, we selected k-mers with low proportion of identity between the Zika virus and other arboviruses as well as low polymorphism between different strains of Zika virus as lead candidate protein fragments. The rationale for this approach was that fragments with low between-species identity and low within-species polymorphism are most likely to have both the required specificity and sensitivity for such tests. We chose k-mers in the bottom quintile of values of identity and polymorphism for each k-mer length.
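A sketch of this selection step for a single k-mer length is given below. The exact cut-off rule is not stated in the text, so this example assumes that the bottom quintile is bounded by the value at rank ceil(n/5) in ascending order and that ties at the cut-off are included; the function names and values are hypothetical.

```python
import math
from typing import List, Sequence


def bottom_quintile_cutoff(values: Sequence[float]) -> float:
    """Value at rank ceil(n/5) in ascending order (assumed cut-off rule)."""
    if not values:
        raise ValueError("no values supplied")
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) / 5))
    return ordered[rank - 1]


def lead_candidates(identity: Sequence[float], polymorphism: Sequence[float]) -> List[int]:
    """Window indices in the bottom quintile of both identity and polymorphism."""
    id_cut = bottom_quintile_cutoff(identity)
    poly_cut = bottom_quintile_cutoff(polymorphism)
    return [
        i
        for i, (ident, poly) in enumerate(zip(identity, polymorphism))
        if ident <= id_cut and poly <= poly_cut
    ]
```

Applying the function separately to the values for each k keeps the cut-offs specific to each k-mer length, as described above.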
Insights into protein structures are critical for assessing the possible antigenicity of peptides, because buried peptides are less likely to be antigenic. 20 To determine if any of the fragments are exposed or buried in the two Zika virus proteins with available protein structures, the envelope protein and the non-structural (NS) protein 1, we calculated the solvent accessible surface area for each amino acid. We used the published structures of dimeric NS1 (protein data bank identification, PDB ID: 5GS6) 21 and the envelope protein in the biological assembly of the mature virus (PDB ID: 5IRE). 22 To calculate the solvent accessible surface area, we used the linear combinations of pairwise overlaps method 23 and used 10 Å² as the upper limit for buried residues, as this value corresponds to half the surface area of a single water molecule. The regions at the C-terminal end of the dengue virus envelope protein interact with the viral lipid membrane 24 and are unlikely to be exposed. Due to the high structural similarity of the envelope proteins between dengue and Zika viruses, we assume that the region from residue 404 to the C-terminus in the Zika virus envelope protein is also buried. For the lead candidate list, we excluded the k-mers without any continuous exposed peptides longer than five amino acids in the two proteins, because exposed peptides are more likely to be antigenic. The threshold of five amino acids was chosen because 99.7% of experimentally determined antigenic B-cell epitopes for flaviviruses found in the Virus Pathogen Database and Analysis Resource are longer than five amino acids. 25 We obtained the list of these epitopes through the database's web site at http://www.viprbrc.org/.
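The exposure filter can be sketched as follows, assuming that the per-residue solvent accessible surface areas (in Å²) for a k-mer have already been computed elsewhere. The function names and the example values are hypothetical; a residue with an area of exactly 10 Å² is treated as buried, following the upper limit stated above.

```python
from typing import Sequence


def longest_exposed_run(sasa: Sequence[float], buried_limit: float = 10.0) -> int:
    """Length of the longest stretch of consecutive residues above the buried limit."""
    longest = current = 0
    for area in sasa:
        if area > buried_limit:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def has_exposed_peptide(sasa: Sequence[float], min_length: int = 6) -> bool:
    """True if the k-mer contains an exposed stretch longer than five residues."""
    return longest_exposed_run(sasa) >= min_length


# Hypothetical per-residue areas for one 10-mer.
areas = [12.0, 15.0, 3.0, 20.0, 25.0, 30.0, 11.0, 40.0, 18.0, 5.0]
keep = has_exposed_peptide(areas)
```

Residues assumed to be buried for other reasons, such as the envelope protein region from residue 404 to the C-terminus, could be handled in this sketch by setting their area to zero before the call.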

Results
On average, Zika virus shares 55.6% amino acid sequence identity with dengue virus, 46.0% with yellow fever virus, 56.1% with Japanese encephalitis virus, 57.0% with West Nile virus and 1.3% with chikungunya virus. The identity between Zika virus and other viruses and Zika virus polymorphism for all k-mers are available from the corresponding author. As an example, Fig. 1 and Fig. 2 show the identity between Zika virus and other viruses investigated and polymorphisms within the Zika virus for all 50-mer peptides. Fig. 3 shows protein fragments mapped to the corresponding envelope or NS1 proteins. The exposed areas of the proteins show regions with both low identity with other flaviviruses and low Zika virus polymorphism.
The lead candidate list for developing a specific and sensitive microarray-based serological test contains 294 protein fragments. These fragments have low similarity between viruses, low polymorphism within the Zika virus and continuous exposed peptides longer than five amino acids (Table 2; available at: http://www.who.int/bulletin/volumes/95/7/16-182105). The list excluded 10.9% (36/330) of k-mers containing previously identified B-cell epitopes for flaviviruses other than Zika, because they are likely to be cross-reactive. Protein fragments from all Zika virus proteins, except NS3, are present in the list. NS4A has the highest number (190 k-mers) of candidate protein fragments (Table 3).
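The epitope-based exclusion can be illustrated with a minimal sketch. It assumes that a k-mer "contains" an epitope when the epitope sequence occurs as an exact substring, which is an assumption of this example; the function name and the sequences are hypothetical.

```python
from typing import List, Sequence, Tuple


def exclude_known_epitopes(
    kmers: Sequence[str], epitopes: Sequence[str]
) -> Tuple[List[str], List[str]]:
    """Split k-mers into those kept and those removed for containing a known epitope."""
    kept: List[str] = []
    removed: List[str] = []
    for kmer in kmers:
        if any(epitope in kmer for epitope in epitopes):
            removed.append(kmer)
        else:
            kept.append(kmer)
    return kept, removed


# Hypothetical candidate k-mers and one hypothetical epitope from another flavivirus.
kept, removed = exclude_known_epitopes(
    ["AAAKLMNPQR", "GGGTTTSSSA", "KLMNPQWWWW"], ["KLMNPQ"]
)
```

The reported counts are consistent with this kind of filter: removing 36 of 330 k-mers leaves 294.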
As Zika virus infection is associated with birth defects that are not seen in other flavivirus infections, we compared identity and polymorphism of proteins between flaviviruses. Overall, the level of identity between Zika virus and other flaviviruses is similar to the level of identity seen when comparing other flaviviruses with each other (available from the corresponding author). In contrast, one region (amino acid positions 430-500 in the proteome) in the envelope protein shows both low identity between Zika virus and other flaviviruses and low polymorphism within Zika virus (Fig. 2). In addition, the relative polymorphism of NS2A and NS2B is on average 53.6% and 69.5% lower, respectively, in Zika virus than in other flaviviruses (Fig. 4).
Protein identity between dengue and Zika viruses is negatively associated with polymorphism within the dengue virus proteins (P-values < 0.01 for all dengue serotypes; Fig. 5). This result can be explained by so-called negative selection, i.e. protein regions under stronger selective constraints tend to be more conserved and have higher identity between species and lower polymorphism within species. 26 We did not observe a similar association for within-Zika virus polymorphism, which might be due to fewer strains analysed and/or smaller effective size of the global Zika virus population from which sequences were sampled, resulting in lower selection efficiency.
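The association test can be sketched with the standard library alone. The correlation coefficient below is Pearson's r, but the P-value is obtained by a permutation test, which is a substitute chosen to keep the example self-contained and not the parametric test used in the analysis. The function names, the number of permutations, the seed and the example values are all hypothetical.

```python
import math
import random
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equal-length series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two series of equal length with at least 2 values")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(
    x: Sequence[float], y: Sequence[float], n_perm: int = 2000, seed: int = 0
) -> float:
    """Two-sided P-value: share of shuffles with |r| at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical per-window values: identity with Zika virus and within-serotype polymorphism.
identity = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
polymorphism = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.33, 0.41]
r = pearson_r(identity, polymorphism)
p = permutation_p_value(identity, polymorphism)
```

The example values give a positive correlation and only demonstrate the calculation; a negative r with a small P-value would correspond to the negative association reported above.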

Discussion
Here we identified regions within the Zika virus proteome that have low identity with other viruses and low within-species polymorphism. These regions may be used to develop new serological diagnostic tests to detect Zika virus infection. However, the antigenic properties of some of the identified regions are unknown, so these regions would first need to be evaluated for antigenicity. The regions identified as antigenic could then be used for developing a peptide microarray, where a collection of identified peptides is displayed on a surface. Antibodies generated during a previous Zika virus infection will then be able to bind to these displayed peptides. The read-out of the microarray is the fluorescent signal generated by fluorescence-coupled secondary antibodies that have bound to the serum antibody-peptide complexes. An advantage of assessing multiple peptides simultaneously in one test is that individual peptides do not need to generate a strong signal, since the signal intensities of all the different antibody-peptide complexes can be combined. 16 Microarrays also have a greater potential to identify prior virus infections than neutralization-based assays, because microarrays can detect a broader range of antibodies than only antibodies that neutralize the virus and protect against infections. Peptide microarrays have been used to differentiate between serological responses to closely related bacterial pathogens 16 and to detect previous viral infections. 27 The computational selection strategy used here represents a targeted approach, which reduces the number of potential candidate peptides. These peptides could be used for creating a peptide-antibody signature for a given viral infection. Once the signature is identified, a diagnostic test employing only the most important peptides contributing to that signature can be designed and produced.
While our computational analysis of k-mers focused on linear epitopes, specific and sensitive linear epitopes together may be sufficient to distinguish different arboviruses. Moreover, depending on how a serological diagnostic test is produced, some of the longer k-mers might fold with sufficient similarity to their native folding to present conformational epitopes.
Our analysis showed that NS1 protein polymorphism is low. Therefore, using peptides from the NS1 protein for a diagnostic test might result in a high-sensitivity test for detecting antibodies against Zika virus from different geographical locations. On the other hand, the identity of the NS1 protein across flaviviruses is not particularly low compared to other proteins (third highest among 10 proteins), suggesting that NS1 is not the top candidate protein for low cross-reactivity. Recently, Euroimmun AG (Lübeck, Germany) developed a Zika virus ELISA for immunoglobulins (Ig)M and IgG, based on the NS1 protein. 28,29 However, the small sample size, the fact that the samples were not from regions with endemic dengue and the lack of samples from patients with different stages of infection weaken the conclusions that can be drawn about this assay. 28,29 Moreover, because each diagnostic test has its advantages and disadvantages, having multiple approaches available is helpful for providing an accurate diagnosis. A sensitive and specific diagnostic test detecting several arbovirus infections simultaneously would be valuable, 1 so that only one assay is required to diagnose active and previous flavivirus infection(s). While we designed the sequence analysis for specificity and sensitivity of detection of Zika virus infection, the same type of analysis could be used for identifying specific and sensitive markers for each arbovirus. By including specific and sensitive markers from all arboviruses in the same peptide microarray, the microarray has the potential to detect several arbovirus infections simultaneously.
The protein fragments presented in the candidate list may be useful for further dissecting the molecular mechanism leading to the Zika virus sequelae not seen with other flaviviruses. The NS2A and NS2B proteins, which show low polymorphism in Zika virus, might be good candidates with which to start investigating the possible molecular link between Zika virus and microcephaly and Guillain-Barré syndrome.
Peptide-sequence identity is unlikely to fully predict cross-reactivity due to other factors, such as glycosylation. Nonetheless, this analysis based on publicly available sequences provides a step towards the development of a serological test that can distinguish previous Zika virus and co-circulating arbovirus infections. 1