Machine Learning Methods for Predicting Human-Adaptive Influenza A Viruses Based on Viral Nucleotide Compositions

Abstract Each influenza pandemic was caused at least partly by avian- and/or swine-origin influenza A viruses (IAVs). The timing of and the potential IAVs involved in the next pandemic are currently unpredictable. We aim to build machine learning (ML) models to predict human-adaptive IAV nucleotide composition. A total of 217,549 IAV full-length coding sequences of the PB2 (polymerase basic protein-2), PB1, PA (polymerase acidic protein), HA (hemagglutinin), NP (nucleoprotein), and NA (neuraminidase) segments were decomposed for their codon position-based mononucleotides (12 nts) and dinucleotides (48 dnts). A total of 68,742 human sequences and 68,739 avian sequences (1:1) were resampled to characterize the human adaptation-associated (d)nts with principal component analysis (PCA) and other ML models. Then, the human adaptation of IAV sequences was predicted based on the characterized (d)nts. Respectively, 9, 12, 11, 13, 10 and 9 human-adaptive (d)nts were optimized for the six segments. PCA and hierarchical clustering analysis revealed the linear separability of the optimized (d)nts between the human-adaptive and avian-adaptive sets. The results of the confusion matrix and the area under the receiver operating characteristic curve indicated a high performance of the ML models to predict human adaptation of IAVs. Our model performed well in predicting the human adaptation of the swine/avian IAVs before and after the 2009 H1N1 pandemic. In conclusion, we identified the human adaptation-associated genomic composition of IAV segments. ML models for IAV human adaptation prediction using large IAV genomic data sets can facilitate the identification of key viral factors that affect virus transmission/pathogenicity. Most importantly, it allows the prediction of pandemic influenza.


Introduction
Type A influenza viruses (IAVs) infect a wide range of avian and mammalian hosts, generally with species specificity. Avian influenza viruses (AIVs) typically exist in natural reservoirs, waterfowl, and shorebirds (Yoon et al. 2014), which mostly cause subclinical bird infection (Webster et al. 1978;Long et al. 2019). AIVs sporadically infect mammalian hosts, such as swine (Pensaert et al. 1981), human beings (Subbarao and Katz 2000;de Jong et al. 2006;Lam et al. 2013), and other mammals (White 2013;Lee et al. 2017) and are incapable of intraspecies transmission (Tran et al. 2004;Maines et al. 2006;Long et al. 2019). However, the high frequency of mutation and segment recombination endows AIVs with the chance to obtain human-adaptive genomes, which pose a high pandemic risk. Notably, swine adaptation and swine-adapted IAVs are closely related to human pandemics. All of the last five recorded influenza pandemics were caused by avianorigin, swine-origin, or reassortant IAVs (Reid et al. 2004;Kislinger et al. 2006;Bragstad et al. 2011;Long et al. 2019). Thus, it is of great importance to predict the adaptation of avian or swine IAVs to humans.
Human-adaptive IAVs are capable of infecting and causing disease in humans easily and of spreading among human populations efficiently. To date, H3N2 and H1N1 (including seasonal H1N1 and A(H1N1)pdm09) are dominant humanadaptive IAV subtypes that cause epidemics in humans (Ren et al. 2016). H5N1, H7N9, and other IAV subtypes occasionally infect humans but are not yet capable of spreading in human populations (Yang et al. 2007;Rudge and Coker 2013;Hu et al. 2014;Deng et al. 2017). Laboratory studies have identified numerous viral determinants that are associated with the human adaptation of IAVs via mediating receptor binding, regulating the virus's replication cycle, and antagonizing host immunity (Taubenberger and Kash 2010;Bouvier 2015;Long et al. 2019). However, there are no universal human adaptation determinants for IAVs.
Gene sequencing technology and machine/deep learning methods have facilitated virus sequence identification of a considerably large data set, including IAVs. Machine learning (ML) methods have recently demonstrated their effectiveness in multiple disciplinary fields, including virology. The distinct host tropism protein signatures of IAVs (Eng et al. 2016), the zoonotic risk of various viruses (Eng et al. 2017), and even the avian-to-human transmission risk of IAVs (Qiang et al. 2018) have been recognized. The host dependence of mononucleotides (nts) and tetranucleotide compositions of influenza viruses has also been studied with ML methods (Iwasaki et al. 2013). Notably, the prominent role of dinucleotides in virus genomes has been implicated in both experimental and computational reports. Viral dinucleotides are targets for the host innate immune system (Takata et al. 2017), and they independently regulate the virulence Tulloch et al. 2014) and replication (Witteveldt et al. 2016) of IAV viruses. Species-specific (Glass et al. 2007) and virusfamily-specific (Di Giallonardo et al. 2017) dinucleotide compositions have also been computationally recognized. More recently, the dinucleotide composition in RNA virus genomes accurately predicts viral reservoir hosts and arthropod vectors using ML methods (Babayan et al. 2018). Therefore, we hypothesize that genomic dinucleotide composition is another crucial genomic feature for influenza viruses, which is most likely amino acid independent, and may be useful for characterizing the human adaptation feature of IAVs.
In the present study, 60 types of mono-and dinucleotide compositions were analyzed based on the nucleotide position within a codon in the full-length coding sequences of the first six genomic segments of IAVs: PB2 (polymerase basic protein 2), PB1, PA (polymerase acidic protein), HA (hemagglutinin), NP (nucleoprotein), and NA (neuraminidase). These (d)nts were optimized based on their relative importance, with principal component analysis (PCA) and support vector classifier (SVC) methods. Then, ML models of gradient-boosted regression trees (GBRT), multilayer perceptron (MLP) classifier, random forest (RF) classifier, and SVC were built to analyze and predict the human adaptation of human-, swine-, or avianorigin IAVs. Our models perform well in predicting humanadaptive swine or avian IAVs.

Prediction Pipeline and Data Processing of the Genomic Nucleotide Composition in IAVs
As the workflow diagram in figure 1A shows, data wrangling was performed for IAV open reading frame (ORF) sequences. Twelve types of mononucleotides (nts) and 48 types of dinucleotides (dnts) in the ORF were counted for all the sequence samples. The phylogeny of the sample ORF sequences, the hierarchical clustering of sequence samples based on the 60 (d)nts,, and the distribution of the sequence samples, in each type of sequence label, were analyzed ( fig. 1A). The 60 (d)nts were sorted based on their importance (cross-validation score, cv_Score) for the classifier with PCA and SVC methods ( fig. 1B), and the best (d)nts ( fig. 1B), which were optimized with ML approaches from the sorted (d)nts, were utilized for the final data optimization and the final prediction ( fig. 1B).
In total, 226,183 full-length coding sequences for the first six segments (PB2, PB1, PA, HA, NP, and NA), available up to December 31, 2018, had skewed distributions for the labels of country/area, host, subtype, segment, or year. In total, 8,634 sequences were dropped, beyond the length range had repeated sequence IDs (supplementary table 1 The Characterization of the Human Adaptation-Associated Nucleotide Composition of IAV Sequences The (d)nt composition was counted and compared within and between species based on the profile of relative dinucleotide abundance values according to previous reports (Karlin and Mrazek 1997). Hierarchical clustering was performed for 3.59-5.01& (59-61) randomly sampled sequences from each segment sequence set according to the (d)nt composition. The majority of human and avian IAVs were not clustered into human and avian branches, respectively, for the PB1, PA, HA, and NA segments (supplementary fig. 3, Supplementary Material online), and these selected sequence samples were not clustered into human and avian groups in a phylogeny tree (supplementary fig. 4, Supplementary Material online). Additionally, a PCA transversion of the 60 (d)nts was performed to evaluate the linear separability between major human sequences and avian sequences. There was no such separability in the principal component 1 or principal component 2 of the 60 (d)nts in the PB2, PA, HA, or NA segments; only the PB1 and NP segments were separable for both groups of subtypes for principal component 1 The ML-based GBRT, MLP classifier, RF classifier, and SVC models were utilized to identify the optimal number of (d)nts for the human/avian IAV classification. As indicated in figure 2A-F, a leveling off of the crossing of the crossvalidation score (Cross_val score) with its moving average 3 (MA3) level, along with the (d)nt accumulation, was defined as the indicator of the optimal number of (d)nts. Accordingly, an average of 9-13 top (d)nts in the sorted list was identified as the best/optimized (d)nts by the four types of ML classifiers

Predicting Human Adaptation of IAVs Based on the Characterized Nucleotide Composition
The unsupervised clustering and supervised two-category classification with the four ML classifiers mentioned above were performed to evaluate the effectiveness of the characterized (d)nts for human/avian IAV classification. It was demonstrated that the two principal components of the nine optimized (d)nts were separable in distribution between avian and human sequence sets for the PB2 Original ORF sequences within the length range were utilized for codon dependently counting 12 mononucleotides (nts) and 48 dinucleotides (dnts). The counting file for the 60 (d)nts for the PB2, PB1, PA, HA, NP, or NA segments was randomly resampled to maintain a balanced distribution of human and avian sequences. The (d)nt composition distribution and the sequence phylogeny were analyzed, and then the randomly split (d)nt counting data were utilized for model building and testing. (B) Workflow of the feature extraction with PCA/ SVC methods. The 60 (d)nts were sorted according to the score (AUC) from the SVC analysis, with the PCA-extracted principal component from the randomly selected (d)nt counting information. Accumulated (d)nts in the sorted list were analyzed with ML models, with the cross of the AUC score curve down its MA score (n ¼ 3) as the threshold of the best number of (d)nts. "DF2_60 (d)nts.csv" for the (d)nt sorting and "Best x (d)nts" for data optimization were, respectively, labeled with blue and green text boxes filled in both (A) and (B).
Li et al. . doi:10.1093/molbev/msz276 MBE segments. Interestingly, the PA sequences of A(H1N1)pdm09 were clustered into the avian sequence group, thought distinctive from other human sequences and avian sequences, as indicated previously (Smith et al. 2009).
An SVC was used to predict the human adaptation of all IAV sequence samples, with the optimized (d)nts, with the same number of tail (d)nts in the sorted list as control. The true negative rate and the true positive rate for the control (d)nts were 64.76% and 95.58%, respectively, for the PB2 segment (upper-left panel, fig. 5A); their average AUC for the 5fold tests was 0.861 6 0.004 (upper-right panel, fig. 5A). However, the true negative/positive rates for the optimized (d)nts for the PB2 segment were 98.45% and 94.10%, respectively (lower-left panel, fig. 5A), and the average AUC increased to 0.995 6 0.001 (lower-right panel, fig. 5A). As indicated in figure 5B-F, the prediction and the probability of the optimized (d)nt-based SVC was markedly higher than that of the control (d)nt-based SVC. High performance with the optimized (d)nts was also obtained with the other three supervised learning models (GBRT, RF classifier, and MLP classifier) Supplementary Material online, respectively).
The human adaptation of the sequence-resampled sequence set from the United States and all-sequence set was predicted (with an SVC probability threshold of 0.5) by the above-mentioned SVC model. Then, the association of such adaptation was analyzed with sequence labels, such as subtype and host. Regardless of the other labels, the H1N2 subtype was highly adaptive to humans (approximately 75% by both sequence sets, the left and right parts in fig. 6A), as well as the designated human-adaptive H3N2 and H1N1 subtypes. There were 10% or more human-adaptive sequences for H2N2 and H11N9, mixed or H4N8 sequences in the sequence-resampled sequence set from the United States (left part of fig. 6A), and H16N3 and H4N6 in the allsequence set (right part of fig. 6A). In terms of the host, 4.4% or 2.5% human sequences, mainly from humaninfected AIVs, were not adaptive to humans ( fig. 6B), and almost 70% of swine sequences were human-adaptive sets ( fig. 6B). Surprisingly, 16.4% and 19.4% of the turkey sequences FIG. 2. Optimization by SVC with the most different (d)nts between human and avian virus segments. The sorted 60 (d)nts were successively put into the accumulating feature list for the SVC (random state ¼ 1 and cv ¼ 5); the MA of the three cross-validation scores (cross_val importance) (MA3) and the cross_val importance itself for each segment were curved (A-F). The first 9, 12, 11, 13, 10, 9 (d)nts in each list were defined as the best features for PB2, PB1, PA, HA, NP, and NA (G), respectively, when the cross_val importance curve crossed the MA3 curve. (H) Boxplot of the best 9 (d)nts for the PB2 segment; the Mann-Whitney U test was performed between the two groups for each (d)nt, and the Pvalue is indicated.
Predicting Human-Adaptive IAVs from Nucleotide Compositions . doi:10.1093/molbev/msz276 MBE from the two sets were human-adaptive sets, ranking first for avian hosts; more than 10% sequences were human-adaptive sets from other birds, such as American black ducks, shorebirds, gulls, blue-winged teal, and quails in either sequence set ( fig. 6B).
Additionally, when both the segment and subtype labels were taken into account, more different details appeared in such adaptations. As indicated in figure 7A, most of the human-adaptive sequences (more than 50%) from swine were H3 or H1 in the hemagglutinin subtype and were N1 or N2 in neuraminidase subtype, with H3N8 as an exception. For avian sequences, H11N9, H4N8, H16N3, H3N2, H4N6, H5N2, H1N1, and H3N8 were at the top of the list ( fig. 7B).
Human Adaptation of Avian and Swine IAVs before and after the 2009 H1N1 Influenza Pandemic A(H1N1)pdm09 viruses, also known as swine-origin IAVs (H1N1) (S-OIVs), first emerged in North America (Smith et al. 2009) and spread all over the world within the following 6 months (Swerdlow et al. 2011;Fineberg 2014). Origins and evolutionary genomics of S-OIVs have been well identified (Garten et al. 2009;Smith et al. 2009). Given the high importance in influenza pandemics of swine as a mixer for human and avian IAVs (Vijaykrishna et al. 2011;Nelson et al. 2015), we analyzed the worldwide distribution of human-adaptive swine and avian sequences before the 2009 H1N1 influenza pandemic. It was indicated that North America and East Asia were the high-risk areas of human-adaptive swine sequences (table 1) for the six IAV segments, based on the human adaptation ratio. Notably, the United States is ranked first for the absolute number of human-adaptive sequences for each segment. East Asia, particularly Hong Kong and mainland China, had larger human adaptation ratios. Human-adaptive avian sequences are dominantly distributed in the United States and China (table 1). Interestingly, the United States still led the world in the number or proportion of human-adaptive sequences for the five segments (table 1), except NA (maximum in China) (table 1). Accordingly, such distribution bias of human-adaptive swine and avian sequences was consistent with the origin of each segment for the 2009 H1N1 pandemic (Smith et al. 2009).
Furthermore, the hierarchical clustering analysis of the S-OIVs with the IAVs before 2009 was performed based on the SVC model-characterized (d)nts by random sampling 1,000 samples from the total samples. As indicated ( The influence of the 2009 H1N1 influenza pandemic on the human adaptation of avian and swine IAVs post-2009 was also evaluated. As indicated in table 2, all six segments of avian sequences decreased in terms of the human adaptation ratio, although the adaptive sequence number increased. In particular, the adaptive sequences of PB1, PA, and HA increased much more (over the median level). For swine viruses, all six segments of the sequences, particularly for NA, PA, and HA (over the median level), increased in terms of both the human adaptation ratio and adaptive sequence number.

Discussion
IAVs must acquire sufficient human adaptation before they can promote human pandemics. To date, there has been no universal definition of IAVs' human adaptation, although numerous human-adaptive viral determinants have been reported (Taubenberger and Kash 2010;Bouvier 2015;Long et al. 2019). In the present study, we defined it as the capability to infect humans easily, to transmit among populations efficiently, and to be virulent to some degree to humans. Accordingly, the human-adaptive IAVs were limited to the H3N2 and H1N1 viruses, either of which can continuously cause endemics or even pandemics in humans (Ren et al. 2016), whereas other subtypes of avian IAVs (Yang et al. 2007;Rudge and Coker 2013;Hu et al. 2014) were classified into the avian-adaptive group. There might be a concern about a selection bias of the human adaptation criteria. If so, the "human adaptation" label for the four segments (PB2, PB1, PA, and NP) would be wrongly associated with "H3N2" or "H1N1," which would be inconsistent with the "true" human adaptation of these segments. Under such circumstances, the "true" human-adapted PB2, PB1, PA, and NP sequences of the human-adaptive H2N2 virus (Cox and Subbarao 2000) might be underestimated, and the host adaptation of the four segments of swine H1N2 virus would not be correctly predicted. However, high HA and NA of the human adaptation frequencies were unanimously predicted for the PB2, PB1, PA, and NP sequences for both the H2N2 and H1N2 viruses. Interestingly, since 2005, dozens of human H1N2 infection cases have been reported in the United States (Pulit-Penaloza et al. 2018) and the Netherlands (Meijer et al. 2018); the high human adaptation of the H1N2 virus was also experimentally supported (Pulit-Penaloza et al. 2018). Therefore, the IAV human adaptation criteria are acceptable to some degree, according to the existing human-adapted IAVs.

FIG. 4.
Heatmap and hierarchical clustering of human and avian IAV PB2 sequences based on the Euclidean distance of the optimized (d)nts. The 61 PB2 sequence samples were randomly (random state ¼ 1) selected from PB2 (3.59&) and then clustered with a heatmap and hierarchical clustering for PB2 based on the Euclidean distance of the optimized 9 (d)nts; the sequence identity and (d)nts were clustered. Standardized scaling was performed for data with the function (x-x.mean)/x.std. The color in the heatmap presented the value for each (d)nt on the x-axis, as shown by the color bar in the upper-left corner. The hierarchical relationships for the sampled sequences and (d)nts are indicated on the left and upper sides, respectively, in each image. The red-blue column to the left of the heatmap was utilized to show the human (red) and avian (blue) groups.
Predicting Human-Adaptive IAVs from Nucleotide Compositions . doi:10.1093/molbev/msz276 MBE In the last few decades, the overwhelming majority of studies on viral determinants have focused on protein levels for virus infection, transmission, virulence, and host adaptation. In particular, distinct protein signatures for host tropism (Eng et al. 2016) and avian-to-human transmission (Qiang et al. 2018) have been recognized with ML methods. Recently, accumulating reports found a significant influence of synonymous viral nucleotide or dinucleotide mutation on the virus response to the host's innate immune system (Takata et al. 2017) on virus virulence Tulloch et al. 2014) and virus replication (Witteveldt, Martin-Gans and Simmonds 2016). Here, we compressed the fulllength coding information of the six segments into the counting information of 12 nts and 48 dnts, all of which were sorted according to their classification importance with the PCA/ SVC method. Optimization is one of the crucial parts of machine/deep learning. A moving average (MA; Kashyap 1982), also known as the rolling mean, was utilized here to optimize the number of features for the ML models, with the crossing of the MA with its MA3 value as a cutoff point, at which the number of (d)nts was the best.
Interestingly, the counting information of each monoor dinucleotide varied in the importance of each segment. Besides, 9-13 optimized (d)nts were enough to predict the human adaptation for each of the six segments. Given the high performance in the avian/human adaptation classification, no other optimization methods were explored here.
The species-specific (Glass et al. 2007) and virus-familyspecific (Di Giallonardo et al. 2017) dinucleotide composition has been computationally explored for viruses. The genomic dinucleotide composition of RNA viruses is useful for predicting viral reservoir hosts and arthropod vectors (Babayan et al. 2018). Therefore, we speculated here that the mono-/dinucleotide composition should be another critical genomic feature, and we assume here that there should be a species selection bias of IAV nucleotides/dinucleotides. Taking PB2 as an example, the frequency of T, C, A, or G at the first position and G at the third position within a codon, the odds ratios of ct_N3M1, ag_N12 and at_N12 determined the human adaptation of IAVs. According to the eukaryotic codon list , every amino acid is coded by one (for methionine and tryptophan) to six trinucleotide codons (for leucine, serine, and arginine); six (phenylalanine, leucine, serine, tyrosine, cysteine, and tryptophan), five, seven, and five types of amino acids were respectively dependent on the nucleotides of T, C, A, and G, at the first nucleotide position within a codon, 4-7 types of amino acids were dependent on the four types of nucleotides at the first nucleotide position, and 13-15 types of amino acids were dependent on the four types of nucleotides at the third nucleotide position. Therefore, each mononucleotide feature is theoretically associated only with 5-15 possible amino acids (the stop codon is not considered). Accordingly, every dinucleotide is associated with 1-4 types of amino acids . Therefore, the nucleotide composition was associated only with the amino MBE acid compositional information, with less than 50% probability. Thus, we speculated that the genomic composition of mono-/dinucleotides is another essential genomic characteristic of IAVs and is probably biologically associated with the host adaptation of IAVs.
An A(H1N1)pdm09 virus caused the latest worldwide influenza pandemic (Swerdlow et al. 2011;Fineberg 2014). Here, our results regarding the high human adaptability of swine IAVs before 2009 in the United States precisely predicted the high risk of these IAVs. Of course, a possible underestimation of human-adaptive swine viruses/sequences was not excluded in many high-risk developing countries, such as China and Vietnam, due to a likely undeveloped monitoring/detection program for swine influenza. However, a marked lower human adaptation of the swine IAV sequence was also indicated by our model in the area of the European Union, which is the second largest pig plantation area, with twice as much pig production in this area in 2018 than the Predicting Human-Adaptive IAVs from Nucleotide Compositions . doi:10.1093/molbev/msz276 MBE United States (https://www.statista.com/statistics/273232/ net-pork-production-worldwide-by-country/). Moreover, the 2009 H1N1 pandemic was not initiated in China, although pork production in China is 4-fold that of the United States.
This phenomenon implies that our results might reveal the "true" severity of swine infection of human-adaptive swine viruses in the United States rather than in other areas before 2009. FIG. 7. The prediction of the human adaptation of swine and avian IAV sequences before the 2009 influenza pandemic. Human-adaptive swine (A) and avian (B) IAV sequences before the 2009 influenza pandemic were predicted by the SVC model with a probability threshold of 0.5. The total number of sequences, the number of adaptive sequences, and the human adaptation ratio are represented by the orange, blue, and yellow histograms, respectively. The human adaptation ratio is also presented as a floating number above the stacked histogram. IAV subtypes are indicated by the labels along the x-axis.

MBE
In summary, the human-adaptive and avian-adaptive nucleotide compositions of influenza A viruses (IAVs) were determined with supervised/unsupervised ML methods. ML, based on human-adaptive nucleotide composition, performed well in predicting the human adaptation of IAVs before the 2009 H1N1 pandemic. This approach might be promising for the prediction of the risk of an influenza pandemic and global vulnerability to influenza.

Sequence Data Processing
Full-length coding sequences of the first six IAV segments of PB2, PB1, PA, HA, NP, and NA (the M and NS segments are not included due to their short length) were utilized for the nucleotide composition analysis. In total, 115,917 human sequence samples, 76,538 avian sequences, and 35,569 swine sequences, up to December 31, 2018, were downloaded from the Influenza Research Database (IRD) (Zhang et al. 2017) or from the Global Initiative on Sharing All Influenza Data (GISAID) database . The ID, strain name, sequence length, and other labels were extracted from the definition content of sequence file in FASTA format via a Python script (Script-1, supplementary data, Supplementary Material online). The ID number and other labels are listed in the supplementary data, Supplementary Material online. The composition of mononucleotide (nt, T, C, A, and G) and dinucleotides (dnts, 16 types of combination of every two nts) were counted and calculated according to formulas 1 and 2/3 (Script-2, supplementary file, Supplementary Material online) for each of the three types of nucleotide positions within a trinucleotide codon (Karlin and Mrazek 1997). In total, there were 60 (d)nts, including the frequency of 12 types of nts (freqx n ) and the relative frequency of 48 types of dnts (q x n y n ).
n ¼ codon nt position 1; 2; or 3Þ (1) freqx n y m ¼ Rx n y m P 16 i¼1 x n y m ; ðx; y ¼ T; C; A or G; m ¼ n þ 1 for m 3; m ¼ n À 2 for m ¼ 4; n ¼ codon nt position 1; 2; or 3Þ (2) q x n y n ¼ freqx n y m freqx n Ãfreqy m ; ðx n ; y m ¼ T; C; A or G; n ¼ codon nt position 1; 2; or 3; Sequences with duplicate ID, incorrect labels, or out-oflength ranges were excluded, and the remaining 217,549 sequences are listed in supplementary table 1, Supplementary Material online. To avoid a country/area bias for ML modeling, due to the overwhelming majority of US samples, we randomly resampled the US sequences, with the United States to China ratio of approximately 1:1 for each segment (Script-3, supplementary file, Supplementary Material online), via the pandas.DataFrame.sample (Python) model. In total, 46,042 randomly resampled human-adaptive sequences and 46,488 human-inadaptive avian sequences were utilized for feature extraction and model building. Human-originated H5N1, H7N9, and other subtypes were excluded from the human adaptation set and were not included in the avian set for model building.  Kumar et al. 2016) was utilized to build a maximum likelihood tree with the Tamura-Nei model (Tamura and Nei 1993) for PB2 (A) and the other five segments (B-F). The parameters were set as follows: uniform rates among sites, gaps complete deletion, the ML heuristic method set to the nearest-neighbor interchange, and with making the initial tree automatically (default-NJ/BioNJ) as an initial tree. Multiple and pairwise alignments were performed with ClustalW with a gap-opening penalty of 15, a gap extension penalty of 6.66, an IUB DNA weight matrix, and a transition weight of 0.5 before a phylogenetic tree was built. Another two rounds of random resampling were performed from each segment sequence set with the same sampling ratio as mentioned above, and the maximum likelihood tree was built with the same parameters.
PCA is a widely utilized unsupervised ML model for constructing a low-rank model of a data matrix. For the following (d)nt sorting, an orthogonal transformation by PCA (Jolliffe and Cadima 2016) was performed to convert every four (d)nts with possible correlations into one principal component, with the most significant possible variance (formula 4). For the evaluation of the separability between avian and human sequences, PCA was also utilized to transform the information of all (60) or the optimized (d)nts into two principal components.
Hierarchical clustering is another important unsupervised ML method for hierarchical cluster analysis. Strategies for hierarchical clustering generally fall into two types: "bottom-up" approaches, by which each observation starts from a lower cluster and then are clustered with its paired cluster(s) in a higher hierarchy; and "top-down" approaches, by which all the observations start from the top cluster and then are split into lower hierarchies recursively down the hierarchy axis. For the hierarchical clustering of the IAV sequences based on all of or the optimized (d)nts, the Euclidean distance was calculated and utilized as a hierarchical clustering scalar (formula 5).
A ij À x i y j À Á 2 g; s:t: XR mÂk ; YR kÂn ; k < m or n (4) ka À b 2 k ¼ ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi X n i¼1 a i À b i ð Þ 2 s ; a; b ¼ avian; human d nt ð Þ; 1 n 60 (5) An SVC, also known as a support vector machine or a support vector network (Noble 2006), is one of the most popular supervised learning models for classification and regression analysis. An SVM training algorithm builds a model that assigns new samples to one category or another, making it a nonprobabilistic binary linear classifier (Noble 2006) (formula 6). The other three ML models were the GBRT MLP classifier and RFC classifier algorithms, by which a prediction is made to evaluate the probability of human adaptation for each sequence. The GBRT algorithm, also known as gradient tree boosting, is a greedy generalized boosting model for differentiable loss functions.

Feature Extraction
To evaluate the importance of each (d)nt, we first sorted the (d)nts with a PCA/SVC combined model. Three thousand and five hundred and forty iterations of PCA/SVC analysis were performed to transform every four (d)nts into one PCA component, which was then utilized for the SVC classification of the avian and human IAV sequences. Thus, the 60 (d)nts were sorted according to their average area under the curve (AUC) (a) of the repeated above-mentioned PCA/SVC analysis. Supervised ML models (GBRT, MLP, RFC, and SVC) were utilized to evaluate the efficiency of the sorted (d)nts as human/avian classification features. Accumulated (d)nts, from 1 to 60 from the sorted list, were input into each of the four models, and the Cross_val score was utilized as an efficiency indicator. The optimized ML (d)nt number was defined as the number of the accumulated (d)nts, with which the Cross_val score did not increase as much as the (d)nt number accumulation, was evaluated by the MA strategy (Slawnych et al. 2009) and was determined at the crossing point of the Cross_val score curve with its MA3 curve.

Data Availability
Original