Information Signatures of Viral Proteins: A Study of Influenza A Hemagglutinin and Neuraminidase

Hemagglutinin (HA) and neuraminidase (NA) are glycoproteins encoded by several types of viral particles. Most notably, they exercise complementary chemical functions during infection and propagation of influenza A: infection of a host is initiated by HA while NA catalyzes the release of newly-made viral particles. The antibodies of the molecules form the means of classifying the influenza A subtypes: H1N1, H2N2, H3N2, etc. Given the risks of viral exposure to global host populations, intense effort is directed toward understanding the molecular mechanisms. Further, the design and formulation of drugs which subvert the mechanisms are on-going challenges. This research focuses on the primary structure information expressed by the two proteins, applying an information-theoretic model from previous research. The amino acid sequences for HA and NA, such as MKARLLILLCALSATD... and MNPNQKIITIGSICMAI..., are parsed for their correlated information, both the total accumulation and the fluctuations. Data for the HA and NA of multiple influenza A subtypes are illustrated via information signatures and phase plots. This enables sharp contrasts to be drawn between seasonal infectious proteins and ones with high pandemic potential. Overall, the analysis illuminates new ways of evaluating HA and NA molecules for their subtype and virulence based on information properties. Just as important, the results point to mutation strategies for re-directing and attenuating the protein functions.


Introduction
Hemagglutinin (HA) and neuraminidase (NA) are glycoproteins in the surface membrane of influenza particles [1]. Infection of a host is initiated by HA while NA catalyzes the release of newly-made viral particles [2]. The antibodies of the molecules form the means of classifying the influenza A subtypes: H1N1, H2N2, H3N2, etc. [3]. At present, there are at least 16 and 9 known subtypes for HA and NA, respectively.
Given the risks of viral exposure to global populations, intense effort is directed toward understanding the molecular mechanisms. Further, the design and formulation of drugs which subvert the mechanisms are on-going challenges [4]. The sequences offer detailed information. Yet reading them without computer assistance is bewildering: among other things, one cannot distinguish the extraordinary from the ordinary. The above include formulae allied with the "Spanish flu" pandemic of 1918 [5]. But which ones are these? The correct answers are Seqs. (2) and (4). The reader's uncertainty is understandable given the lengths and complexities of the sequences.
Our approach to proteins has looked for guidance from information theory [6-10]. Here we focus on the HA and NA primary structure information.
The results draw contrasts between seasonal molecules and ones with high virulence potential.
The data further point to mutation strategies for re-directing and attenuating the functions.

Proteins and Sequence Information
The approach builds on research from the mid-2000s. Work in this lab quantified the correlated information CI expressed by the naturally occurring amino acids based on their atom and covalent bond structure [6,8]. An average $\langle CI \rangle$ and standard deviation $\sigma_{CI}$ were established, and a dimensionless quantity $Z^{(i)}_{CI}$ was based on each amino acid's CI contribution relative to the average CI, e.g.
There are twenty amino acids and thus sixteen more $Z^{(i)}_{CI}$ to note, as in reference [6]. The superscript symbols refer to the amino acid, while the numerical value represents the CI distance from the average in standard deviation ($\sigma_{CI}$) units. The sign reflects whether the amino acid contributes information above or below the natural average. The Z-terms largely follow chemical intuition. Tryptophan (W) features a network of aromatic bonds and functional groups; it exerts nearly $+3\sigma_{CI}$ impact in a protein. Alanine (A) is a simple aliphatic and contributes CI below average at ca. $-0.5\sigma_{CI}$. The methodology originated in an information study of ribonuclease A and lysozyme [8,9].
Mol2Net, 2015, 1(Section A, B, C, etc.), 1-x, type of paper, doi: xxx-xxxx
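The standardization described above can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: the CI values below are hypothetical placeholders (the actual values are in refs. [6,8]), and the sketch assumes an unweighted average over the listed residues and a population standard deviation, neither of which is confirmed by the text.

```python
from statistics import mean, pstdev

def z_scores(ci_by_residue):
    """Standardize each residue's CI against the average, in sigma units."""
    avg = mean(ci_by_residue.values())
    sd = pstdev(ci_by_residue.values())  # assumption: population std. deviation
    return {aa: (ci - avg) / sd for aa, ci in ci_by_residue.items()}

# Hypothetical CI values in arbitrary units, for illustration only.
# They are NOT the values reported in refs. [6,8].
ci = {"A": 2.0, "G": 1.0, "L": 3.0, "W": 8.0}
z = z_scores(ci)

for aa, value in sorted(z.items()):
    print(f"Z({aa}) = {value:+.2f}")
```

With any such table, residues richer in information than the average receive positive Z-terms and simpler residues receive negative ones, which is the sign convention the text describes.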
A dimensionless function G(k) is constructed for a sequence by adding the Z-terms in sequence order; k is a counting index less than or equal to N, the number of residues in the protein. G(k) tracks the accumulation and fluctuations of information: ... (1)
Proteins generally host a majority of low-information residues. As a consequence, G(k) scales linearly with negative slope and is well suited to least-squares analysis. The analysis establishes an ensemble of linear regression functions $L_j(k)$ with typical correlation coefficient $R^2 > 0.95$.
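A short sketch may help make the accumulation and regression steps concrete. It assumes that G(k) is the running sum of the Z-terms up to residue k, which is how the text describes it. The Z table is partly hypothetical: W and A follow the approximate figures quoted earlier, while G and L are placeholders, and the toy sequence is invented for illustration.

```python
from itertools import accumulate

# W and A follow the approximate values quoted in the text;
# G and L are hypothetical placeholders.
Z = {"W": 3.0, "A": -0.5, "G": -1.0, "L": -0.2}

def g_trace(seq, z=Z):
    """Running sum of Z-terms: G(1), G(2), ..., G(N)."""
    return list(accumulate(z[aa] for aa in seq))

def linear_fit(y):
    """Ordinary least squares of y against k = 1..N; returns slope, intercept, R^2."""
    n = len(y)
    ks = range(1, n + 1)
    k_mean = (n + 1) / 2
    y_mean = sum(y) / n
    sxx = sum((k - k_mean) ** 2 for k in ks)
    sxy = sum((k - k_mean) * (yk - y_mean) for k, yk in zip(ks, y))
    slope = sxy / sxx
    intercept = y_mean - slope * k_mean
    ss_res = sum((yk - (slope * k + intercept)) ** 2 for k, yk in zip(ks, y))
    ss_tot = sum((yk - y_mean) ** 2 for yk in y)
    return slope, intercept, 1 - ss_res / ss_tot

seq = "AAGLAAWAGLAAGALAAGWA"  # invented toy sequence, not a real HA or NA
g = g_trace(seq)
slope, intercept, r2 = linear_fit(g)
print(f"slope = {slope:.3f}, R^2 = {r2:.3f}")
```

Because most residues in the toy sequence carry negative Z-terms, the fitted slope is negative, in line with the behavior described above. A sequence this short should not be expected to reproduce the correlation coefficients reported for full-length proteins.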
Graphs such as in Figure 1 serve as signatures of the primary structure information. They reflect more than a molecule's local composition. If a substitution is made at site j, the collection in Eq. (2) is altered. The amplitude is impacted at all sites k = 1, 2, …, N.
The information signatures can be strikingly different, depending on the subtype.
There are as many signatures as there are HA and NA variants.

____
In thermodynamics, the variance of an extensive property such as enthalpy and entropy scales with a capacity [11]. In the same way, the variance in $H_j(k)$ can be viewed in terms of a protein's functional capacity. Molecules composed of only one type of amino acid, e.g. AAAAAAAAA…, offer zero capacity. They are of no biochemical utility because they lack diversity of information. This is borne out in the signatures: their G(k) trace perfect lines ($R^2 = 1.000$); corresponding H(k) express zero amplitude.
The information signature variance is calculated as follows: The square root $\sigma_H$ is the standard deviation, as indicated in Figure 1.
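The homopolymer argument can be checked numerically. The sketch below assumes that H(k) is the residual of G(k) about its least-squares line and that the variance is the mean squared residual (dividing by N); both are assumptions, since the defining equations are not reproduced here. The Z value for G is a hypothetical placeholder and the mixed sequence is invented.

```python
from itertools import accumulate
from math import sqrt

# W and A follow the approximate values quoted in the text; G is hypothetical.
Z = {"W": 3.0, "A": -0.5, "G": -1.0}

def fluctuation(seq, z=Z):
    """H(k) = G(k) - L(k), assuming H is the residual about the fitted line."""
    g = list(accumulate(z[aa] for aa in seq))
    n = len(g)  # requires n >= 2 so that the fit is defined
    k_mean, g_mean = (n + 1) / 2, sum(g) / n
    sxx = sum((k - k_mean) ** 2 for k in range(1, n + 1))
    slope = sum((k - k_mean) * (gk - g_mean) for k, gk in enumerate(g, 1)) / sxx
    intercept = g_mean - slope * k_mean
    return [gk - (slope * k + intercept) for k, gk in enumerate(g, 1)]

def sigma_h(seq, z=Z):
    """Standard deviation of H(k); least-squares residuals have zero mean."""
    h = fluctuation(seq, z)
    return sqrt(sum(v * v for v in h) / len(h))

print(sigma_h("AAAAAAAAA"))   # homopolymer: zero up to rounding error
print(sigma_h("AAGWAAGAWA"))  # mixed toy sequence: positive amplitude
```

For the homopolymer, G(k) is exactly linear in k, so the residuals vanish and the sketch returns a value indistinguishable from zero, matching the zero-capacity statement above.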

_____
The second lesson is that the neighborhood distribution is markedly uneven. A significant fraction of influenza subtypes clusters in the upper third of Figure 2 while fewer ones occupy the lower third. Further, there are several low-density regions: these correspond to HA, NA variants which have yet to manifest, or are outright avoided by natural selection.
Information signatures discriminate the subtypes. What do things look like for proteins specific to human populations? For humans, the major circulating strains of influenza A have been H1N1, H2N2, and H3N2; the global pandemic of 1918 was attributed to the first of these [5]. The avian strain H5N1 has rarely infected humans, although it poses high virulence potential. Figure 3 shows a phase plot based on human host isolates. Different color symbols distinguish the subtypes while the isolate years are included. Not all years are represented as the analysis was directed to complete genomes. The point locus for the 1918 pandemic year sample (Brevig Mission, >gb:AF250356|gi:8572169|UniProtKB:Q9IGQ6|) is marked in red. It is considerably removed from the H1N1 neighborhood. Its nearest neighbors derive from H5N1 isolates. The black, green, and violet symbols are placed, respectively, by H5N1, H2N2, and H3N2 samples.
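Identifying the nearest neighbors of a point on the phase plot is a simple distance ranking. The sketch below assumes Euclidean distance in the ($\sigma_{HA}$, $\sigma_{NA}$) plane, which the text does not specify. The labels and coordinates are invented placeholders, not measured values from Figure 3.

```python
from math import dist

def nearest_neighbors(query, points, n=2):
    """Return the n labels whose coordinates lie closest to `query`."""
    return sorted(points, key=lambda label: dist(query, points[label]))[:n]

# Made-up (sigma_HA, sigma_NA) coordinates for illustration only.
points = {
    "isolate_a": (4.0, 3.0),
    "isolate_b": (4.2, 3.1),
    "isolate_c": (2.0, 1.0),
}
query = (2.1, 1.2)  # hypothetical point of interest

print(nearest_neighbors(query, points))
```

Applied to real coordinates, the same ranking would show which subtype neighborhood a given isolate sits closest to.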

Discussion
HA and NA exercise complementary functions: the former enables attachment of influenza particles to a cell surface while the latter catalyzes the release [1,2]. Given the essential nature of the functions, there is significant pressure for variants to manifest over time. Variants enable the virus to sidestep host immune responses and to thwart drug therapy. There are more than $10^5$ HA and NA sequences on record, yet this is a paltry number compared to the possibilities.
Thermodynamic analysis of a system commences with variables and functions of state. This has been the approach to HA and NA in constructing information G and H. The functions track the information accumulation and fluctuation in a manner dependent on all the amino acids. No one site or region is viewed as more important than others.
One learns several things, the first being an information method of evaluating HA, NA pairs. All sequences are confounding in their complexity. However, using a spreadsheet and  The second insight is the contrast between seasonal and pandemic-year proteins. HA and NA from the 1918 pandemic place an outlier point on the phase plot. This placement stems from a lower fluctuation amplitude, compared with that of seasonal proteins. This suggests that lower chemical noise in the primary structures underpins a more invasive chemical function.
The third insight is a strategy for re-directing, and possibly attenuating, the functions. Natural selection favors molecules which promote viral infection and suppresses variants that serve otherwise. We conjecture that the latter type places state points in the low-density regions of Figure 2.  1) and (4). With each substitution, there is a displacement of the ($\sigma_{HA}$, $\sigma_{NA}$) coordinate. As the primary structures belong to the H1N1 subtype, each pathway commences near the center of the H1N1 neighborhood and terminates in a zero-to-low density region. The paths are annotated by the following ordered-pair sequences:  Figure 4 is traversed via nine pair-substitutions. This demonstrates that the proteins do not have to be radically altered for the information signatures to move out of the virulent neighborhood of origin. In Figure 4, pathways 1 and 2 direct HA and NA away from all the subtype neighborhoods. In contrast, pathways 3 and 4 cross territory allied with highly virulent subtypes. In redirecting HA and NA functions, the upward-going pathways 1 and 2 would seem preferable. Molecules with information removed from the active neighborhoods would likely offer diminished potency, yet stimulate some production of host antibodies. This would enhance the overall immunity of host populations.
1) and (3) of the Introduction. The pathways are annotated above.
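The idea of a substitution pathway can be sketched as a sequence of phase-plot coordinates, one per pair-substitution. This is an illustration under stated assumptions: $\sigma_H$ is taken as the root-mean-square residual of G(k) about its least-squares line, the Z value for G is a hypothetical placeholder, the toy sequences are invented, and the `(site, residue)` step format is a made-up stand-in for the paper's ordered-pair annotation.

```python
from itertools import accumulate
from math import sqrt

# W and A follow the approximate values quoted in the text; G is hypothetical.
Z = {"W": 3.0, "A": -0.5, "G": -1.0}

def sigma_h(seq, z=Z):
    """RMS residual of G(k) about its least-squares line (assumed definition)."""
    g = list(accumulate(z[aa] for aa in seq))
    n = len(g)
    k_mean, g_mean = (n + 1) / 2, sum(g) / n
    sxx = sum((k - k_mean) ** 2 for k in range(1, n + 1))
    slope = sum((k - k_mean) * (gk - g_mean) for k, gk in enumerate(g, 1)) / sxx
    resid = [gk - g_mean - slope * (k - k_mean) for k, gk in enumerate(g, 1)]
    return sqrt(sum(r * r for r in resid) / n)

def substitute(seq, site, residue):
    """Return seq with the residue at 1-based position `site` replaced."""
    return seq[: site - 1] + residue + seq[site:]

def pathway(ha, na, pair_substitutions):
    """Track (sigma_HA, sigma_NA) after each paired HA/NA substitution."""
    points = [(sigma_h(ha), sigma_h(na))]
    for (ha_site, ha_res), (na_site, na_res) in pair_substitutions:
        ha = substitute(ha, ha_site, ha_res)
        na = substitute(na, na_site, na_res)
        points.append((sigma_h(ha), sigma_h(na)))
    return points

toy_ha = "AAGWAAGAWAAG"  # invented, not a real HA sequence
toy_na = "GAAWAGAAGAWA"  # invented, not a real NA sequence
steps = [((4, "A"), (4, "G")), ((9, "G"), (11, "A"))]  # hypothetical steps
for point in pathway(toy_ha, toy_na, steps):
    print(f"sigma_HA = {point[0]:.3f}, sigma_NA = {point[1]:.3f}")
```

Each substitution changes G(k) at every site from the substituted position onward, so a single replacement shifts the whole signature and moves the coordinate, which is the displacement the text describes.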

Summary and Closing
The primary structure information expressed in influenza HA and NA was investigated using a model