Incorporating structure context of HA protein to improve antigenicity calculation for influenza virus A/H3N2

The rapid and consistent mutation of influenza viruses requires frequent evaluation of antigenic variation among newly emerged strains, and several in-silico methods have been reported to support these assays. In this paper, we designed a structure-based antigenicity scoring model in place of the sequence-based models published previously. Protein structural context was used to derive the antigenicity-dominant positions, as well as the physicochemical change of the local micro-environment that correlates with antigenicity change. A position-specific scoring matrix (PSSM) profile and the local environmental change over these positions were then integrated to predict antigenic variance. Independent testing showed a high accuracy of 0.875 and a sensitivity of 0.986, with a significant ability to discover antigenic-escaping strains. When the model was applied to historical data, global and regional antigenic drift events were successfully detected. Furthermore, two well-known vaccine failure events were clearly indicated. This structure-context model may therefore be particularly useful for identifying vaccine strains that are about to fail, in addition to suggesting potential new vaccine strains.


Preprocessing of HA1 sequence
To align all sequences in our dataset to a unified length, a multiple alignment was performed with the HA1 sequence of A/Aichi/2/1968 (H3N2) as the template. For each sequence, the aligned part spanning from the first to the last position of the template sequence was extracted. Sequences whose aligned part was shorter than 327 amino acids were excluded, and when several sequences from the same strain had exactly the same aligned part, only one was kept in our dataset. Amino acids were numbered according to the A/Aichi/2/1968 (H3N2) HA1 sequence. In total, 18072 HA1 sequences longer than 327 amino acids from 1968 to 2015 were collected.
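The filtering and de-duplication steps can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the multiple alignment has already been computed and is available as equal-length gapped rows, that gaps are written as "-", and the names `template_span` and `preprocess` are hypothetical.

```python
def template_span(template_row):
    """Return (start, end) alignment columns covered by the template; end is exclusive."""
    cols = [i for i, ch in enumerate(template_row) if ch != "-"]
    return cols[0], cols[-1] + 1


def preprocess(records, template_row, min_len=327):
    """Keep one (strain, aligned part) per strain and drop short aligned parts.

    records: iterable of (strain_name, aligned_row) taken from one multiple
    alignment that also contains template_row (assumed input layout).
    """
    start, end = template_span(template_row)
    seen = set()
    kept = []
    for strain, row in records:
        part = row[start:end]
        # Count residues only; gap characters do not contribute to the length.
        if sum(ch != "-" for ch in part) < min_len:
            continue
        key = (strain, part)
        if key in seen:  # same strain with an identical aligned part
            continue
        seen.add(key)
        kept.append(key)
    return kept
```

With the real data, `min_len` would stay at 327 and `template_row` would be the aligned A/Aichi/2/1968 HA1 row.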

Preprocessing of antigenic distance parameter
The antigenic distance $D_{ab}$ between two strains $a$ and $b$ was defined by Lapedes and Farber in 2001 by the following equation 4 :

$$D_{ab} = \sqrt{\frac{H_{aa} H_{bb}}{H_{ab} H_{ba}}}$$

The HI titer $H_{ab}$ is the maximum dilution of serum raised against strain $a$ that is necessary to inhibit cell agglutination caused by strain $b$.
$D_{ab}$ was calculated only if all four HI values ($H_{aa}$, $H_{bb}$, $H_{ab}$, $H_{ba}$) were available. For an HI value reported with ">" or "<", double or half of the value was used. For example, an HI value of ">2560" was treated as 2560*2=5120, and an HI value of "<20" was treated as 20/2=10. Because of differing experimental conditions, the HI measurements collected from different reports for the same strain pair were not identical. To limit this effect, for the $D_{ab}$ values of the same strain pair derived from different reports, the outliers, defined as those with $|D_{ab} - \bar{D}_{ab}|$ ranked within the top 10% in descending order, were discarded. The final experimental distance between two strains was defined as the average of the remaining $D_{ab}$ values. Two viruses were defined as antigenic variants when this final distance was above 4; otherwise, the pair was treated as antigenically similar 5 .
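The distance calculation can be illustrated with a short sketch. It assumes the distance takes the form sqrt(Haa*Hbb / (Hab*Hba)) over the four HI values named in the text, and that the number of discarded outliers is the floor of 10% of the available values; both are assumptions made for illustration, and all function names are hypothetical.

```python
import math
from statistics import mean


def parse_titer(raw):
    """Convert an HI titer string; '>' values are doubled and '<' values are halved."""
    raw = str(raw).strip()
    if raw.startswith(">"):
        return float(raw[1:]) * 2
    if raw.startswith("<"):
        return float(raw[1:]) / 2
    return float(raw)


def antigenic_distance(h_aa, h_bb, h_ab, h_ba):
    """Assumed form of the distance: sqrt(Haa * Hbb / (Hab * Hba))."""
    return math.sqrt((h_aa * h_bb) / (h_ab * h_ba))


def consensus_distance(distances, drop_fraction=0.10):
    """Average the per-report distances after discarding the most deviant ones.

    Values are ranked by |D - mean(D)| in descending order and the top
    drop_fraction are removed (floor rounding is an assumption).
    """
    center = mean(distances)
    n_drop = int(len(distances) * drop_fraction)
    ranked = sorted(distances, key=lambda d: abs(d - center), reverse=True)
    return mean(ranked[n_drop:])


def is_variant(distance, threshold=4.0):
    """A pair is an antigenic variant when its final distance is above 4."""
    return distance > threshold
```

For instance, homologous titers of 1280 and heterologous titers of 160 give a distance of 8 under this form, so the pair would be labeled an antigenic variant.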
Finally, the intersection of the HI assays and the sequence set was taken as our dataset. This set contains 3867 pairs involving 288 HA proteins, with 2286 antigenic variant pairs and 1581 antigenically similar pairs. Data from 2011 to 2013 were selected as the independent dataset to evaluate the performance of this model.

Preprocessing of structure modeling
To describe the spatial features of the HA protein, three-dimensional structures were generated with Modeller 9.11 6 . For all target sequences, the top 5 templates with the highest sequence identity in the Protein Data Bank 7 were the same: 2VIU_A (pdb_id: 2VIU, chain: A), 1HA0_A, 2VIR_C, 1MQM_A and 3EYK_A. Five models were generated from these templates, and the coordinates of each residue were calculated as the average of the coordinates of its atoms. The general structure of each protein was then produced by averaging these five models.
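The two averaging steps can be sketched as below. This is only an illustration under stated assumptions: the five models are taken to be already superimposed in a common coordinate frame with identical residue counts, atom coordinates are plain (x, y, z) tuples, and the function names are hypothetical rather than part of Modeller.

```python
def residue_centroid(atom_coords):
    """Average the (x, y, z) coordinates of the atoms of one residue."""
    n = len(atom_coords)
    return tuple(sum(atom[i] for atom in atom_coords) / n for i in range(3))


def average_models(models):
    """Average per-residue centroids across models.

    models: list of models; each model is a list of residues; each residue
    is a list of (x, y, z) atom coordinates (assumed input layout).
    """
    centroids = [[residue_centroid(res) for res in model] for model in models]
    n_models = len(centroids)
    n_residues = len(centroids[0])
    return [
        tuple(sum(model[r][i] for model in centroids) / n_models for i in range(3))
        for r in range(n_residues)
    ]
```

Reducing each residue to a single point first keeps the averaged structure independent of how many atoms each model resolves for a given side chain.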
The 18072 collected HA1 sequences were clustered at a sequence identity of 99%. One representative was randomly selected from each cluster, giving 1848 cluster representatives, and their structures were then modeled.

... difference between the scores of amino acids on the position. When a gap or X was compared with an amino acid, the maximum or average of all absolute differences between this amino acid and all the others at the position was assigned as the score. The score of amino acid Asx (B) was treated as the average of those of Asn (N) and Asp (D).

Tables

Table S1. List of the 47 screened antigenicity-dominant positions. This table lists the selected antigenicity-dominant positions and the clusters to which they belong. For the definitions of mutation rate and escape ratio, see Method, Identifying antigenicity-dominant positions 1) and 2). The last column lists the PDB structure (PDB id and HA chain) in which the position is an epitope position. In the HA protein, epitope positions were defined as residues whose minimum atom distance to the antibody was less than 5 Å. Position numbers are labeled on the A/Aichi/2/1968 HA1 sequence (PDB id: 3HMG_A).

Lee and Chen's method was a classification model without a predicted antigenic distance, and the predicted antigenic distance of AntigenCO differs from the one used here; thus the RMSE was not calculated for these two methods.
* Among 24 structures, only one structure (PDB: 4WE7_A) showed a difference between the antigenic distance calculated from the PDB structure and that calculated from the modeled structure. The averaged absolute error of all tested values was 0.16, and the median ∆ was 0.181.