Shannon Entropy Screening of Influenza Hemagglutinin for Tetrapeptides with Exact Homology to Human Proteins

Based upon a unique pair of non-mutating contiguous amino acids in the HA2 region of influenza H1N1 hemagglutinin, identical tetrapeptides were identified in the influenza hemagglutinin and in proteins of human origin. It is hypothesized that such peptide domains, present in both host and virus, increase the adaptability of the virus to the host.


Introduction
Influenza virus remains a significant public health problem [1]. Understanding the biology of influenza virus may facilitate the design of new anti-viral therapeutic and preventive agents and strategies [2]. The present report is based upon an analysis of Shannon entropy and secondary protein structure in the hemagglutinin protein of H1N1 influenza virus. The work focuses on positions of zero Shannon entropy.

Materials and Methods
Sequences of influenza virus H1N1 HA protein were downloaded from the Influenza Virus Resource (https://www.ncbi.nlm.nih.gov/genomes/FLU/Database/nph-select.cgi?go=database) on 2 Aug 2018 [3]. Of a total of 16678 HA protein sequences, 1915 were between 560 and 565 amino acids in length, 30 were between 567 and 575 amino acids in length, and 14733 were exactly 566 amino acids in length. The largest subset, consisting of the HA sequences of length 566 amino acids, was used for this study.
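The length-based selection described above could be reproduced along the following lines. This is a minimal sketch rather than the script used in the study: it assumes the downloaded sequences are available as FASTA-formatted text, and the function names (read_fasta, length_histogram, keep_length) are hypothetical. It is written for Python 3.

```python
from collections import Counter


def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def length_histogram(records):
    """Count how many sequences occur at each sequence length."""
    return Counter(len(seq) for _, seq in records)


def keep_length(records, length=566):
    """Keep only the records whose sequence has exactly the given length."""
    return [(header, seq) for header, seq in records if len(seq) == length]
```

Applied to the full download, length_histogram would be expected to reproduce the subset sizes reported above, and keep_length(records, 566) would yield the working set of 14733 sequences.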
Computations were performed with Anaconda Python 2.7.14. Information entropy (H) was computed by the method of Shannon and is reported in bits [4]. Protein secondary structure was computed on the RaptorX server [5]. Sequence management and calculation of the consensus sequence were performed with the Jalview application [6]. The domains of the HA protein were assigned as signal sequence (positions 1-17), HA1 (positions 18-344) and HA2 (positions 345-566) according to reference sequence Influenza A virus (A/Puerto Rico/8/1934(H1N1)) segment 4, complete sequence NP_040980.1. The Mann-Whitney U test and the Z-test with 1000 pseudorandom trials were performed with SciPy [7].
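For a set of equal-length sequences, the Shannon entropy at a position is H = -sum(p_i * log2(p_i)), where p_i is the observed frequency of residue i at that position; H=0.0 means that a single residue occupies the position in every sequence. The sketch below illustrates this calculation together with the domain boundaries given above. It is an illustration under stated assumptions, not the code used in the study: the function names are hypothetical, the sequences are assumed to be comparable position by position without gaps, and it is written for Python 3.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy, in bits, of the residues observed at one position."""
    counts = Counter(column)
    total = float(len(column))
    h = 0.0
    for n in counts.values():
        p = n / total
        h -= p * math.log(p, 2)
    return h


def alignment_entropy(sequences):
    """Per-position entropy for a list of equal-length sequences."""
    if len(set(len(s) for s in sequences)) != 1:
        raise ValueError("all sequences must have the same length")
    # zip(*sequences) iterates over positions rather than over sequences.
    return [column_entropy(col) for col in zip(*sequences)]


def zero_entropy_positions(entropies):
    """1-based positions at which H == 0.0 (a single residue in every sequence)."""
    return [i + 1 for i, h in enumerate(entropies) if h == 0.0]


def domain(position):
    """Domain of a 1-based HA position, using the boundaries stated in the text."""
    if 1 <= position <= 17:
        return "signal sequence"
    if 18 <= position <= 344:
        return "HA1"
    if 345 <= position <= 566:
        return "HA2"
    raise ValueError("position outside 1-566")
```

A position occupied by two residues at frequencies 0.75 and 0.25 has H of about 0.811 bits, whereas a fully conserved position has H=0.0 and would be returned by zero_entropy_positions.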
Protein-protein searches for human protein sequences (Homo sapiens, taxid 9606) were performed on the National Library of Medicine National Center for Biotechnology Information (NLM-NCBI) website (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE=Proteins) using BLASTP with the BLOSUM62 substitution matrix and the NLM-NCBI Reference Proteins (refseq_protein) database. Two tetrapeptides were used as query sequences: GLY TRP TYR GLY and GLY TRP PHE GLY. These two tetrapeptides were detected at positions of H=0.0 in the HA2 domain of H1N1 influenza hemagglutinin.
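Because the study concerns exact tetrapeptide identity, the matching criterion can be illustrated with a plain substring search. The sketch below converts the three-letter query peptides to the standard one-letter code (GWYG and GWFG) and reports every 1-based position at which a peptide occurs in a protein sequence. It is not a substitute for BLASTP and does not reproduce its scoring; the function names are hypothetical, and the example sequence in the usage note is invented for illustration.

```python
# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def to_one_letter(three_letter_peptide):
    """Convert e.g. 'GLY TRP TYR GLY' to 'GWYG'."""
    residues = three_letter_peptide.upper().split()
    return "".join(THREE_TO_ONE[r] for r in residues)


def find_all(sequence, peptide):
    """1-based start positions of every exact occurrence, overlaps included."""
    hits = []
    start = sequence.find(peptide)
    while start != -1:
        hits.append(start + 1)
        # Advance by one residue so that overlapping matches are not missed.
        start = sequence.find(peptide, start + 1)
    return hits
```

For example, find_all("MAGWYGKL", to_one_letter("GLY TRP TYR GLY")) returns [3], since the invented sequence contains GWYG starting at its third residue.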

Results and Discussion
The distribution of H in the H1N1 HA protein with intact signal sequence is shown in the top graph of the Figure.
Table 1. The "Position Count" column shows the running total count of amino acid positions at which H=0.0, beginning with the N-terminal MET.
Two tetrapeptides were detected at positions of H=0.0 in the HA2 domain:
(1) GLY TRP TYR GLY
(2) GLY TRP PHE GLY
The occurrence of these two influenza tetrapeptides as possible components of human proteins was next addressed. Screening human protein sequences for the presence of the two tetrapeptides detected in H1N1 influenza HA2 proteins yielded a total of 68 hits (Z=8.0183, p=1.0722 × 10^-15). This total consisted of 57 occurrences of tetrapeptide 1 in the absence of tetrapeptide 2 (Z=7.6666, p=1.7666 × 10^-14) and 11 occurrences of tetrapeptide 1 in the presence of tetrapeptide 2 (Z=3.3351, p=0.0009). These Z-test results indicate that each of the tetrapeptide counts was significantly greater than zero. Tetrapeptide 2 was not observed in human proteins in the absence of tetrapeptide 1.
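The hit categories reported above (tetrapeptide 1 alone, both tetrapeptides together, tetrapeptide 2 alone) can be tallied from a set of protein sequences as sketched below. This is an illustrative sketch only: it uses exact substring matching on the one-letter forms GWYG and GWFG, the function names are hypothetical, and it does not reproduce the Z-test, whose construction is not detailed in the text.

```python
from collections import Counter

TETRAPEPTIDE_1 = "GWYG"  # GLY TRP TYR GLY
TETRAPEPTIDE_2 = "GWFG"  # GLY TRP PHE GLY


def classify(sequence):
    """Assign a protein sequence to one of four presence/absence categories."""
    has_1 = TETRAPEPTIDE_1 in sequence
    has_2 = TETRAPEPTIDE_2 in sequence
    if has_1 and has_2:
        return "both"
    if has_1:
        return "tetrapeptide 1 only"
    if has_2:
        return "tetrapeptide 2 only"
    return "neither"


def tally(proteins):
    """Count proteins per category; proteins maps an identifier to a sequence."""
    return Counter(classify(seq) for seq in proteins.values())
```

With the counts reported in the text, such a tally would correspond to 57 proteins in "tetrapeptide 1 only", 11 in "both" and none in "tetrapeptide 2 only".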
Table 2 shows one example, integrin beta-like protein 1 isoform 4, of a human protein sequence retrieved because it contains a tetrapeptide detected in influenza HA protein (tetrapeptide 2).
A complete list of the human reference proteins retrieved because they contain tetrapeptide 1 or tetrapeptide 2 is given in Supplementary Information.

Conclusion
It is proposed that the tetrapeptides expressed both in human proteins and in influenza H1N1 HA are structural features that are associated with immunological or other disguising features of the virus in the human host, thereby permitting viral replication and function [8]. The effects of such peptides on HA-based vaccines and treatments should be determined.