Dive into Machine Learning Algorithms for Influenza Virus Host Prediction with Hemagglutinin Sequences

Influenza viruses mutate rapidly and can pose a threat to public health, especially to those in vulnerable groups. Throughout history, influenza A viruses have caused pandemics by crossing between different species. It is important to identify the origin of a virus in order to prevent the spread of an outbreak. Recently, there has been increasing interest in using machine learning algorithms to provide fast and accurate predictions for viral sequences. In this study, real testing data sets and a variety of evaluation metrics were used to evaluate machine learning algorithms at different taxonomic levels. As hemagglutinin is the major protein in the immune response, only hemagglutinin sequences were used, represented by position-specific scoring matrices and word embeddings. The results suggest that the 5-grams-Transformer neural network is the most effective algorithm for predicting viral sequence origins, with approximately 99.54% AUCPR, 98.01% F1 score and 96.60% MCC at the higher classification level, and approximately 94.74% AUCPR, 87.41% F1 score and 80.79% MCC at the lower classification level.


Introduction
Influenza is an infectious disease that affects up to one-fifth of the world's population each year, although its prevalence tends to be underestimated (Cox and Subbarao, 1999). In comparison to seasonal influenza epidemics, influenza pandemics occur less frequently, although each such crisis may result in millions of deaths. Influenza epidemics adversely affect vulnerable individuals with chronic diseases, whereas pandemics affect people of all ages.
Influenza viruses are classified into four types based on their internal ribonucleoprotein: A, B, C, and D. Influenza D virus does not cause disease in humans. The influenza C virus only infects humans; however, it is unlikely to lead to epidemics on a large scale. Because of this, seasonal influenza vaccine strains cannot be used to vaccinate against influenza C and D viruses. Seasonal epidemics are primarily caused by influenza A and B viruses. The influenza B virus is only infectious to humans, while the influenza A virus is infectious to humans and animals and may result in global epidemics (i.e., pandemics). Two glycoproteins on the virus envelope distinguish the influenza A virus subtypes: hemagglutinin (HA) and neuraminidase (NA). A total of 18 HA subtypes (numbered 1-18) and 11 NA subtypes (numbered 1-11) have been identified so far (Lazniewski et al., 2018).
HA and NA proteins contain antigenic sites that are recognised by the immune system, and this recognition inhibits flu infection. These antigenic sites can alter rapidly in order to escape recognition by the immune system. A process known as antigenic drift generates new influenza A, B, and C strains that are not fully recognised by human immune systems, thereby contributing to seasonal influenza.
When influenza A virus proteins undergo drastic changes at antigenic sites, this can cause antigenic shifts. Antigenic shifts may result from the reassortment of different viruses within one or more hosts, leading to the emergence of new viruses (Brockwell-Staats et al., 2009). In the past, several pandemics have been the result of extreme antigenic shifts in which most people were incapable of resisting the novel virus. As a result of recombination between animal viruses (swine and avian) and human viruses, four major influenza pandemics have emerged since 1900: the Spanish flu (1918-1919), the Asian flu (1957-1958), the Hong Kong flu (1968-1969), and the 2009 flu pandemic (2009-2010).
The virus responsible for the Spanish flu was A/H1N1. An estimated 17-100 million people died in this pandemic, making it the deadliest in recorded history (Spreeuwenberg et al., 2018; Morens and Fauci, 2007; Johnson and Mueller, 2002). Despite the mystery surrounding the origins of the Spanish flu (Antonovics et al., 2006), recent studies suggest it may have originated in birds or pigs (Taubenberger et al., 2005; Worobey et al., 2014; Smith et al., 2009a). The virus adapted and continued to play a major role in flu epidemics until 1957, when major changes in HA and NA resulted in the novel A/H2N2 virus and the Asian flu pandemic. The Asian flu is associated with a higher rate of morbidity and mortality than the Hong Kong flu of 1968, as the Hong Kong flu was caused by A/H3N2, which changed only the HA antigen (Kilbourne, 2006). Both the Asian flu and the Hong Kong flu occurred as a result of reassortment between human and avian viruses. A/H1N1 was also responsible for the 2009 flu pandemic; however, that iteration involved a triple reassortment between human, avian, and swine viruses (Garten et al., 2009; Smith et al., 2009b).
Influenza viruses are known to infect a number of hosts, including humans, birds, pigs, and horses. Birds are a major natural reservoir for the influenza A virus (Long et al., 2019; Gorman et al., 1990), which can infect humans and pigs alike (Webster et al., 1992). Additionally, pigs are considered an intermediate host for influenza A viruses passing between humans and birds (Brown, 2001). The reassortment of viruses between different hosts can pose life-threatening risks to human populations, since the resulting viruses do not require an intermediate host to propagate.
The transmission of influenza viruses can therefore occur through animal-to-human (zoonotic) as well as animal-to-animal (enzootic) contact (Long et al., 2019). Zoonotic infections can either be dead-end transmissions or lead to a pandemic in the human population after the virus accumulates enough adaptive mutations to sustain human-to-human transmission, after which it regularly circulates as a seasonal influenza virus (Long et al., 2019; Taubenberger and Kash, 2010). The origin of each virus during an outbreak is difficult to determine because some viruses can cross species boundaries; it is thus possible to isolate swine-origin viruses from humans. However, the virus needs sufficient time to complete the process of accumulating adaptive mutations (Long et al., 2019). Thus, early identification of the original viral host can effectively prevent or control the spread of a viral outbreak.
Most traditional methods are laboratory-based, such as the use of hemagglutination inhibition (HI) assays to subtype viruses. Laboratory-based methods are laborious and time-consuming. To save manpower and time, a variety of machine learning algorithms have been used to predict viral hosts, such as k-nearest neighbours (KNNs) (Sherif et al., 2017), random forests (RFs) (Sherif et al., 2017), artificial neural networks (ANNs) (Attaluri et al., 2010a), and decision trees (DTs) (Kargarfard et al., 2016). Most previous studies selected balanced data sets manually (Sherif et al., 2017; Attaluri et al., 2010b), used small data sets (Attaluri et al., 2010a), encoded the sequence as sparse matrices (Attaluri et al., 2010a; Mock et al., 2021), or incorporated feature extraction procedures (Yin et al., 2018). Moreover, some deep learning techniques, such as convolutional neural networks (CNNs), have been applied in this field, although only to avian and human viruses thus far (Scarafoni et al., 2019).
We used various machine learning algorithms (i.e., RF, RUSBoost, SVM, XGB, MLP, Transformer, and CNN) to determine the origin host of influenza viruses. Both alignment-based and alignment-free approaches were used to vectorise a sequence. For the alignment-based sequence representation, evolutionary features of viral sequences were extracted by PSI-BLAST and fed into four traditional machine learning models. A word embedding technique was used as an alignment-free approach to process sequences. The morphological structure of sequence data is similar to that of text data, which suggests that methods for encoding text in natural language processing (NLP), such as word embedding, could also be applied to viral sequence data. Word embeddings are a widely used and powerful technique for processing sequential data: by learning the meanings of words, similar words are given similar embeddings. In this study, instead of evaluating models only at a higher classification level, we divided the avian classes and also evaluated models at a lower classification level (Sherif et al., 2017; Attaluri et al., 2010a; Kargarfard et al., 2016; Attaluri et al., 2010b; Mock et al., 2021; Xu et al., 2017; Yin et al., 2018). In addition, we discuss the impact of incomplete virus sequences on model performance. The results show that the Transformer neural network outperforms the other models in most scenarios.

Data
In this study, two data sets were used. The primary difference between the two is that data set 2 includes both complete and incomplete HA protein sequences. Data set 2 is therefore completely unseen by the models and noisier than data set 1. We used data set 2 as an additional testing set to further test the performance of the pre-trained models, which were trained and validated on data set 1.

Data Set 1
Data set 1 includes the complete Influenza A virus (IAV) hemagglutinin (HA) protein sequences isolated from avian, swine, and human samples in the GISAID (GISAID) database (status 2020-09-25). Only HA protein sequences are used, as HA is the dominant protein in the immune response and helps the virus bind to target hosts (Earn et al., 2002). To maintain the quality of the data, we further removed sequences that are redundant, multi-label, or contain the ambiguous amino acid codes X, B, and Z (WORTHS). A total of 59,785 sequences from the original set were thus selected. Only data set 1 was subjected to nested cross-validation (CV), as described in Section 5.2.

Data Set 2
Data set 2 is an additional testing set. It includes the IAV HA protein sequences collected from 2020-09-26 to 2022-05-05 in the GISAID (GISAID) database. Sequences that appear in both data sets 1 and 2 were removed from data set 2, so the data sets are mutually exclusive. In addition, any redundant or multi-label sequences were also removed. The sequences in data set 2 are therefore also unique and have an unambiguous isolation host. The final set consists of 3,686 sequences.
The change in the amount of data before and after filtering appears in Table 1. Fig. 1 and Fig. 2 show the data distribution at different taxonomic levels. For model performance at the higher taxonomic level, we consider only three classes: human, avian, and swine, whereas at the lower classification level, the avian data is divided further, resulting in 26 classes. As the number of classes increases, the models face more challenges.

Position-Specific Scoring Matrix
One of the most commonly used methods for extracting evolutionary information from protein sequences is the position-specific scoring matrix (PSSM) (Altschul et al., 1997), which can be generated using the Position-Specific Iterated BLAST program (PSI-BLAST) (Altschul and Koonin, 1998).
As protein sequences typically contain 20 different types of amino acids (A, R, ..., V), a PSSM is an $L \times 20$ matrix for a query protein sequence of length $L$. The PSSM for the sequence $a_1 a_2 \dots a_L$ can be expressed as follows:

$$\mathrm{PSSM} = (s_{i,j})_{L \times 20}, \quad (1)$$

where $s_{i,j}$ is the score of the amino acid $a_i$ mutating to the $j$-th amino acid. It can also be interpreted as a probability of mutation in the range $[0, 1]$ using the sigmoid function:

$$p_{i,j} = \frac{1}{1 + e^{-s_{i,j}}}. \quad (2)$$

We ran PSI-BLAST (blast) iteratively with default parameters (E-value = 0.001, number of iterations = 3) on the non-redundant (nr) database. PSSMs cannot be fed directly into classic machine learning models due to their variable size. To overcome this hindrance, we propose three sequence encoding schemes based on the PSSM. In order to reduce the complexity of proteins and avoid unnecessary computation, we first introduce a residue grouping rule.
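The sigmoid conversion in (2) can be sketched in a few lines, assuming the PSSM has already been parsed into a list of score rows (`normalise_pssm` is a hypothetical helper name, not part of PSI-BLAST):

```python
import math

def sigmoid(x):
    """Map a raw PSSM log-odds score to a mutation probability in [0, 1]."""
    return 1.0 / (1.0 + math.exp(-x))

def normalise_pssm(pssm):
    """Apply the sigmoid element-wise to an L x 20 PSSM (list of rows)."""
    return [[sigmoid(s) for s in row] for row in pssm]

# toy 1 x 3 fragment of a PSSM (real matrices are L x 20)
norm = normalise_pssm([[0.0, 2.0, -2.0]])
```

A score of 0 maps to 0.5, and symmetric positive and negative scores map to probabilities that sum to one.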

Residue Grouping Rule
As amino acids with similar properties behave similarly in proteins, they can be classified into 10 groups (Li et al., 2003): G1 (F, Y, W), G2 (M, L), G3 (I, V), G4 (A, T, S), G5 (N, H), G6 (Q, E, D), G7 (R, K), G8 (C), G9 (G) and G10 (P). A grouped PSSM (GPSSM) with $L \times 10$ dimensions can be created by applying the residue grouping rule to the columns of the original PSSM (1):

$$\mathrm{GPSSM} = (g_{i,k})_{L \times 10}, \quad g_{i,k} = \frac{1}{n_k} \sum_{j \in G_k} s_{i,j}, \quad (3)$$

where $g_{i,k}$ represents the score of the amino acid $a_i$ mutating to an amino acid belonging to group $G_k$, $L$ is the length of the sequence, and $n_k$ is the number of amino acid types in group $G_k$. The GPSSM (3) was used to derive the following proposed feature sets: EG-PSSM, GDPC-PSSM, and ER-PSSM.
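A minimal sketch of the grouping step, assuming the PSSM columns follow the standard PSI-BLAST amino-acid order (that ordering, and the function name `gpssm`, are assumptions of this illustration):

```python
# Residue grouping rule of Li et al. (2003); column order is assumed to be
# the standard PSI-BLAST amino-acid order A R N D C Q E G H I L K M F P S T W Y V.
AA_ORDER = "ARNDCQEGHILKMFPSTWYV"
GROUPS = ["FYW", "ML", "IV", "ATS", "NH", "QED", "RK", "C", "G", "P"]
GROUP_OF = {aa: g for g, aas in enumerate(GROUPS) for aa in aas}

def gpssm(pssm):
    """Collapse an L x 20 PSSM into an L x 10 GPSSM by averaging the
    columns whose amino acids belong to the same residue group."""
    out = []
    for row in pssm:
        sums = [0.0] * 10
        for aa, score in zip(AA_ORDER, row):
            sums[GROUP_OF[aa]] += score
        out.append([s / len(GROUPS[g]) for g, s in enumerate(sums)])
    return out
```

For a row of identical scores, every grouped entry equals that score, since each group average is taken over its own members.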

EG-PSSM
The length of the input sequence, and hence of the original PSSM (1) and GPSSM (3), may vary, which means they cannot be directly used in many machine learning models. An intuitive way to overcome this problem is to apply the residue grouping rule across the rows of the GPSSM (3) as well, merging the rows whose residues belong to the same group. This results in a $10 \times 10$ matrix. We reformat the matrix by iterating through it row by row, moving from left to right within each row, thereby generating a $1 \times 100$ feature vector from each GPSSM (3) before feeding it to classical machine learning models.
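A minimal sketch of the EG-PSSM encoding; summation is assumed as the row-merging operation, and `eg_pssm` is a hypothetical function name:

```python
GROUPS = ["FYW", "ML", "IV", "ATS", "NH", "QED", "RK", "C", "G", "P"]
GROUP_OF = {aa: g for g, aas in enumerate(GROUPS) for aa in aas}

def eg_pssm(sequence, gpssm_rows):
    """EG-PSSM sketch: merge the L rows of an L x 10 GPSSM into 10 group
    rows (summing rows whose residue falls in the same group, an assumed
    aggregation), then flatten the 10 x 10 matrix row by row into a
    1 x 100 feature vector."""
    eg = [[0.0] * 10 for _ in range(10)]
    for aa, row in zip(sequence, gpssm_rows):
        k = GROUP_OF[aa]
        for j, v in enumerate(row):
            eg[k][j] += v
    return [v for row in eg for v in row]

vec = eg_pssm("AC", [[1.0] * 10, [2.0] * 10])
```

The output always has 100 entries regardless of the sequence length, which is the point of the encoding.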

GDPC-PSSM
Dipeptide composition (DPC) (Saravanan and Gautham, 2015) captures amino acid composition information and partial local-order information in protein sequences. DPC acts directly on raw sequence data and generates a 400-dimensional feature vector per sequence, but it can also be extended to the PSSM (Liu et al., 2010). Therefore, each $L \times 10$ GPSSM (3) can be rewritten as a $10 \times 10$ matrix by grouped dipeptide composition encoding, and this matrix reshaped into a 100-dimensional feature vector:

$$D_{j,k} = \frac{1}{L-1} \sum_{i=1}^{L-1} g_{i,j}\, g_{i+1,k}, \quad (4)$$

where $g_{i,j}$ is the value in row $i$ and column $j$ of the GPSSM (3).
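A sketch of the grouped dipeptide composition; the exact normalisation (averaging products of adjacent rows, the usual DPC-PSSM form) is an assumption, as is the name `gdpc_pssm`:

```python
def gdpc_pssm(gpssm_rows):
    """GDPC-PSSM sketch: for each pair of group columns (j, k), average
    the product of adjacent-row entries of an L x 10 GPSSM, yielding a
    10 x 10 matrix flattened into a 100-dimensional vector."""
    L = len(gpssm_rows)
    feat = [[0.0] * 10 for _ in range(10)]
    for i in range(L - 1):
        for j in range(10):
            for k in range(10):
                feat[j][k] += gpssm_rows[i][j] * gpssm_rows[i + 1][k]
    n = max(L - 1, 1)
    return [v / n for row in feat for v in row]

vec = gdpc_pssm([[1.0] * 10, [1.0] * 10])
```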

ER-PSSM
The third proposed sequence representation is adapted from the RPSSM (Ding et al., 2014), which computes the pseudo-composition of dipeptides in sequences. Like GDPC-PSSM, the RPSSM extracts partial local sequence-order information, but it only considers pairs of adjacent amino acids. We extended the RPSSM computation to any two amino acids $a_i a_{i+g}$ separated by a gap $g$ and extracted a $91 \times 10$ matrix per sequence, which can be reformatted as a $1 \times 910$ feature vector:

$$D_{s,t}^{(g)} = \frac{1}{L-g} \sum_{i=1}^{L-g} \frac{(g_{i,s} - g_{i+g,t})^2}{2}, \qquad R_t = \frac{1}{L} \sum_{i=1}^{L} (g_{i,t} - \bar{G}_t)^2, \quad (5)$$

where $\bar{G}_t$ is the average of the values of GPSSM (3) in column $t$, and $R_t$ computes the average pseudo-composition of all the amino acids in the protein sequence corresponding to column $t$ of the GPSSM (3).

Learning Representations
Traditional machine learning algorithms require manual preprocessing and feature extraction to obtain representative features from each protein sequence, and a priori knowledge is required to select suitable features during the feature extraction process. Conversely, deep learning algorithms can learn implicit representations of protein sequences directly. In this study, a powerful vectorisation scheme from natural language processing (NLP), word embedding, is applied to map protein sequences into numerical vectors.

Overlapping N-grams
Protein sequences are morphologically similar to text sentences, except that text is composed of words whereas a protein sequence is composed of amino acid letters. In order to transform a protein sequence into a protein "sentence", we split the sequence into overlapping n-grams (with n varying from 3 to 5). An n-gram is a protein "word" composed of n successive amino acids. Fig. 3 shows an example of overlapping 3-grams for a protein sequence, and Fig. 4 shows the word clouds of trigrams for human, avian and swine hosts.
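The splitting step is a one-liner; the toy sequence below is purely illustrative:

```python
def overlapping_ngrams(sequence, n=3):
    """Split a protein sequence into overlapping n-gram 'words'."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

# overlapping_ngrams("MKTIIAL", 3) -> ['MKT', 'KTI', 'TII', 'IIA', 'IAL']
```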

Word Embedding
Word embedding overcomes the drawbacks of one-hot encoding: it produces dense vectors and captures the relationships between similar words. The idea behind word embedding is to map words to an embedding space in which words with similar meanings are closer together and hence have similar embeddings. Word2Vec (Mikolov et al., 2013) is a popular implementation of word embedding, but pre-trained models do not include domain-specific words. Thus, we generated a custom word embedding from the training set and mapped the n-grams of each sequence to the embedding vectors. An n-gram is represented as a vector of size $d$, and a protein sequence is represented as an $n \times d$ matrix, where $n$ is the length of the sequence (the number of n-grams in the sequence) and $d$ is the embedding dimension.
To unify the dimensions of the matrices, we left-pad the sequences to match the longest sequence length. Most sequence information is thereby retained; however, more noise is introduced for the shortest sequences.
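The lookup-and-pad step can be sketched as follows; random vectors stand in for the Word2Vec-style embeddings trained in the study, and both function names are hypothetical:

```python
import random

def build_embedding(vocab, dim, seed=0):
    """Toy embedding table: random vectors stand in for vectors that the
    paper learns with a custom Word2Vec-style model on the training set."""
    rng = random.Random(seed)
    return {w: [rng.uniform(-1, 1) for _ in range(dim)] for w in vocab}

def embed_and_left_pad(ngram_sentences, table, dim, max_len):
    """Map each n-gram to its embedding vector and left-pad every
    sequence of vectors with zero rows up to the longest length."""
    out = []
    for sent in ngram_sentences:
        rows = [table[w] for w in sent]
        pad = [[0.0] * dim for _ in range(max_len - len(rows))]
        out.append(pad + rows)  # left padding: zero rows come first
    return out

table = build_embedding(["AAA", "AAB"], 4)
mats = embed_and_left_pad([["AAA"], ["AAA", "AAB"]], table, 4, 2)
```

Left padding keeps the informative n-grams at the end of each matrix, so only shorter sequences receive the zero-row noise.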

RUSBoost
Class imbalance can be addressed using data sampling and boosting algorithms. Oversampling (enriching minority classes) and undersampling (reducing majority classes) are common data sampling methods. Random undersampling boosting (RUSBoost) (Seiffert et al., 2008) is an algorithm that combines undersampling with boosting techniques. RUSBoost is computationally cheaper and more efficient than oversampling-based alternatives such as SMOTEBoost (Chawla et al., 2003).
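The random undersampling (RUS) step inside RUSBoost can be sketched as follows (a minimal sketch that balances every class down to the smallest class; the real algorithm re-samples inside each boosting round):

```python
import random
from collections import defaultdict

def random_undersample(X, y, seed=0):
    """Randomly drop majority-class samples until every class has as
    many samples as the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for xi, yi in zip(X, y):
        by_class[yi].append(xi)
    n_min = min(len(items) for items in by_class.values())
    Xs, ys = [], []
    for label, items in by_class.items():
        for xi in rng.sample(items, n_min):
            Xs.append(xi)
            ys.append(label)
    return Xs, ys

Xs, ys = random_undersample(list(range(10)), [0] * 8 + [1] * 2)
```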

Extreme Gradient Boosting
Extreme Gradient Boosting (XGBoost) (Chen and Guestrin, 2016) implements gradient boosting algorithms in a scalable and efficient fashion. Gradient boosting is similar to AdaBoost, except that it uses gradient descent to optimise a differentiable loss function when adding new models. XGBoost helps with tough problems in data science, such as handling missing values and sparse data automatically. One of its biggest advantages is that it supports parallel training to speed up the training process and can handle large datasets.

Random Forest
The bagging algorithm reduces the variance of a model, while decision trees have high variance and low bias; therefore, combining bagging with decision trees (the random forest) performs better. Random forests (Ho, 1995) consider only a small subset of all features at each split, making each decision tree more random. Contrary to boosting-based ensembles, bagging-based ensembles tend to construct deep trees, which makes them more complex. Bagging-based ensembles may therefore, in some instances, require more training time than boosting-based ensembles, but they can estimate generalisation performance without a separate validation process (e.g. via out-of-bag samples).

Support Vector Machine
The Support Vector Machine (SVM) is one of the most commonly used supervised learning algorithms (Gove and Faytong, 2012). In addition to classifying linearly separable data, an SVM can classify non-linearly separable data by introducing the kernel trick. The kernel trick allows SVMs to handle high-dimensional (even infinite-dimensional) feature spaces by implicitly mapping lower-dimensional data into a higher-dimensional space without explicitly transforming it. We construct the multi-class SVM using the Gaussian kernel as the kernel function.
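The Gaussian (RBF) kernel itself is a one-line formula, K(x, y) = exp(-gamma * ||x - y||^2); the `gamma` value below is illustrative, not the one used in the study:

```python
import math

def gaussian_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel: equals 1 for identical points and decays
    with squared Euclidean distance; its implicit feature space is
    infinite-dimensional."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```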

Multi-Layer Perceptron
The Multi-Layer Perceptron (MLP) is a feed-forward neural network. It overcomes the limitation of the single-layer perceptron, which cannot solve problems that are not linearly separable (Minsky and Papert, 2017). A simple MLP usually consists of three layers: an input layer, a hidden layer, and an output layer. Fig. 5 shows an example of a three-layer fully connected MLP, where the input layer, hidden layer, and output layer have three, four, and two neurons, respectively. The number of neurons in the input layer equals the number of features in the input data, and the number of classes in the data set determines the number of neurons in the output layer.

Convolutional Neural Network
Convolutional neural networks (CNNs) are widely used in a variety of fields, including facial recognition, object recognition, and autonomous vehicles. CNNs were initially trained on images and later expanded to other types of data, such as time series, text, and audio. As opposed to traditional machine learning, CNNs learn data features in each hidden layer. CNNs also differ from standard fully connected neural networks (Fig. 5) in that their layers are sparsely connected, reducing the number of parameters that must be learned. CNNs typically include three kinds of layer: the convolutional layer for learning spatial features, the activation layer for activating features, and the pooling layer for downsampling the features.
We designed a simple CNN for classifying protein sequences. The CNN in this study consists of one input layer (Input), three convolution layers (Conv), three max-pooling layers (Max-Pool), one flattening layer (Flatten), three dense layers with Rectified Linear Unit (ReLU) activation (Dense), and one dense layer with softmax activation (Output). Either PSSMs or tokenised virus sequences can be used as input data. The CNN also includes an embedding layer (Embedding) when the input data are tokenised sequences, as shown in Fig. 6. The hyperparameter settings for the CNN can be seen in Table 2.

Transformer
The Transformer neural network is an improvement over standard recurrent neural networks (RNNs) and performs better than many state-of-the-art models on the Winograd schemas and language translation tasks (Ackerman, 2014) according to Bilingual Evaluation Understudy (BLEU) scores (Vaswani et al., 2017). Typical RNNs are slow to train on sequential data, as they must process words in order, resulting in a lack of parallelisation capability. Additionally, RNNs cannot handle long sequences very well due to the vanishing and exploding gradient phenomena.
Contrary to RNNs, Transformer networks abandon recurrence and embrace self-attention mechanisms. The attention mechanism, as the name suggests, focuses on the parts of the input it deems important. Attention takes a query and key-value pairs as input. The specific attention mechanism used in the classic Transformer network is scaled dot-product attention, expressed by the following formula:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,$$

where $d_k$ is the dimension of the key, so $1/\sqrt{d_k}$ is the scaling factor, and $Q$, $K$ and $V$ denote the query, key and value matrices, respectively. The softmax function converts the attention scores into an attention distribution. The core idea behind dot-product attention is that the dot product is higher between similar sequences than between dissimilar ones.
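The formula above can be sketched in a few lines of NumPy (a minimal single-head version; batching, masking, and the learned projection matrices are omitted):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for one attention head.
    Q, K, V are (seq_len, d) arrays; returns (output, attention weights)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

Q = K = np.eye(2)
V = np.array([[1.0, 0.0], [0.0, 2.0]])
out, w = scaled_dot_product_attention(Q, K, V)
```

Each row of the weight matrix is a probability distribution over the input positions, so the output is a convex combination of the value vectors.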
One of the innovations of the Transformer is multi-head attention, which assembles multiple scaled dot-product attention heads. Compared with single-head attention, multi-head attention prevents a word from attending too strongly to itself and produces more robust results. Similar to bag-of-words, attention by itself lacks sequence-order information: the input embedding layer (Input Embedding) captures the meaning of words but not their positions. To compensate for this drawback, the original paper (Vaswani et al., 2017) adds positional encoding to the input embedding; in this paper, we added a positional embedding (Positional Embedding) for the same purpose. Further details of the Transformer can be found in (Vaswani et al., 2017).
A sequence-to-sequence Transformer model usually includes an encoder block and a decoder block, takes a sequence as input, and outputs a sequence, as in the models often used in language translation. In contrast, a sequence-to-vector model accepts a sequence as input and outputs a class label for that sequence, so no decoder block is needed. We used a sequence-to-vector Transformer model; thus, the Transformer in this study only has an encoder block. The architecture used in this study is shown in Fig. 7. The number of heads (num_heads) ranges from 1 to 5, and the embedding dimension (embed_dim) is 32, 64, or 128.

Cross-Validation
Stratified $k$-fold cross-validation (CV) was used to evaluate the models, so the class ratio of each training set was almost the same as that of the corresponding test set. The generalisation performance of the models was only evaluated on the test set (data unseen by the model). Nested $k$-fold CV adds an outer $k_1$-fold CV around the inner $k_2$-fold CV for the final evaluation, reducing bias when it comes to hyperparameter optimisation or model selection (Cawley and Talbot, 2010). A nested $k$-fold CV therefore takes advantage of the full diversity of the data set and ensures that all data will be tested. An example of stratified nested $k$-fold CV is shown in Fig. 8.
In this study, we chose $k_1 = 5$ and $k_2 = 4$: in each outer iteration, one outer fold (20% of data set 1) serves as the test set, while the remaining 80% forms the training set, which is further split into four inner folds for model training and validation. Specifically, within the inner loop, one of the inner folds (25% of the training set, i.e. 20% of the total data set) is used for validation, and the other three (75% of the training set, i.e. 60% of the total data set) are used for training during each inner-loop iteration.

Figure 6: Example of the CNN architecture with an embedding layer: here, for simplicity, we assume that the longest sequence length is 500, the embedding dimension is 100, and the number of filters in the first, second, and third convolutional layers is 64, 32, and 16, respectively. The layers are represented by coloured boxes: input layer (light yellow), embedding layer (light orange), convolutional layer (light red), max-pooling layer (light pink), and flatten/dense layer (light blue).
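The nested stratified split can be sketched in plain Python (a toy index-based illustration with round-robin fold assignment, not the exact implementation used in the study):

```python
import random
from collections import defaultdict

def stratified_folds(y, k, seed=0):
    """Assign sample indices to k folds so that each fold keeps roughly
    the class ratio of the full label list y."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

def nested_cv_splits(y, k_outer=5, k_inner=4):
    """Yield (train, val, test) index lists: each outer fold is the test
    set exactly once; the remaining data is re-split into k_inner folds,
    each serving as the validation set once."""
    outer = stratified_folds(y, k_outer)
    for i, test in enumerate(outer):
        rest = [idx for j, fold in enumerate(outer) if j != i for idx in fold]
        inner = stratified_folds([y[idx] for idx in rest], k_inner)
        for val_pos in inner:
            val = {rest[p] for p in val_pos}
            train = [idx for idx in rest if idx not in val]
            yield train, sorted(val), test

splits = list(nested_cv_splits([0] * 20 + [1] * 20))
```

With 40 samples, 5 outer folds and 4 inner folds give 20 (train, val, test) triples of sizes 24, 8 and 8, matching the 60/20/20 split described above.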

Evaluation Metrics
The evaluation measurements used in this study are the $F_1$-score, the Area Under the Precision-Recall Curve (AUCPR), and Matthews's correlation coefficient (MCC). The $F_1$-score and MCC for each class are defined as follows:

$$F_1 = \frac{2\,TP}{2\,TP + FP + FN},$$

$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}},$$

where $TP$ is the number of positive data correctly predicted, $TN$ is the number of negative data correctly predicted, $FP$ is the number of negative data misclassified as positive, and $FN$ counts the number of positive data incorrectly predicted as negative. For multi-class classification, the one-vs-all strategy is applied to produce the $F_1$-score for each class. In this study, data set 2 and a proportion of data set 1 served as unseen data for the pre-trained models.

The overall $F_1$-score (OF$_1$) and overall AUCPR (OAUCPR) are the micro-averages of the corresponding per-class metrics, and the overall MCC (OMCC) is defined as follows:

$$OMCC = \frac{c \cdot s - \sum_k p_k t_k}{\sqrt{\left(s^2 - \sum_k p_k^2\right)\left(s^2 - \sum_k t_k^2\right)}},$$

where $t_k$ is the number of times class $k$ truly occurred, $p_k$ is the number of times class $k$ was predicted, $c$ is the total number of correctly predicted data, and $s$ is the total number of data items.
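A minimal sketch of these metrics, assuming binary confusion counts per class and plain label lists for the multiclass OMCC (function names are illustrative):

```python
import math

def f1_mcc(tp, tn, fp, fn):
    """Per-class F1 and MCC from the binary (one-vs-all) confusion counts."""
    f1 = 2 * tp / (2 * tp + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc

def overall_mcc(y_true, y_pred):
    """Multiclass MCC in the OMCC form: c correct predictions, s samples,
    p_k predicted counts, t_k true counts per class k."""
    classes = set(y_true) | set(y_pred)
    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    t = {k: y_true.count(k) for k in classes}
    p = {k: y_pred.count(k) for k in classes}
    num = c * s - sum(p[k] * t[k] for k in classes)
    den = math.sqrt((s * s - sum(v * v for v in p.values())) *
                    (s * s - sum(v * v for v in t.values())))
    return num / den if den else 0.0
```

A perfect classifier yields 1.0 for all three quantities, while always predicting the majority class drives the OMCC to 0.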
In Section 6, we also use the mean score to present the overall performance of each model, defined as the mean of OAUCPR, OF$_1$ and OMCC. Regarding the models' performance in each class, we only show AUCPR results, as AUCPR is recommended (Branco et al., 2016) for evaluating classifiers when the data are highly imbalanced.

Performance at Different Taxonomic Levels
We evaluated all models at different taxonomic levels. The higher taxonomic level comprises only three classes: avian, human, and swine, while at the lower taxonomic level, the avian class is further subdivided, giving a total of 26 classes. The performance of the sequence representations and machine learning algorithms on data set 1 is reported in Sections 6.1.1 and 6.1.2, and Section 6.1.3 presents the overall results of each model.

Comparison of Sequence Representations
In order to encode protein sequences, we used two kinds of representation: alignment-free (word embedding) and alignment-based (PSSM-based representations). Table 3 presents a comparison of the sequence representations, with metric scores averaged across machine learning algorithms. At the lower and higher taxonomic levels, 3-grams word embeddings reach mean scores of 87.73% and 97.94%, respectively. In contrast, PSSM-based representations show wider variability due to their instability and poor performance with RUSBoost. Among the PSSM-based representations, EG-PSSM has a relatively low deviation and a higher mean score.
At the lower taxonomic level, the data become more skewed, with mean-score drops for all sequence representations ranging from 10% to 15%. ER-PSSM is the most affected, with a mean-score drop of 14.85%, while 3-grams word embeddings are the least affected, with a drop of 10.21%.

Comparison of Machine Learning Algorithms
Table 4 presents the comparison of machine learning algorithms, with metric scores averaged across sequence representations. RUSBoost performs the worst at both classification levels, but it is the only algorithm with narrower variability at the higher classification level than at the lower one. Contrary to RUSBoost, SVM shows a larger deviation when the data are more skewed and the number of classes increases. The performance of RUSBoost and SVM therefore depends most on the sequence representation, but of the two, RUSBoost is the least affected by data skewness.
The Transformer and XGB perform best at the higher and lower taxonomic levels, respectively. All classifiers perform worse at the lower taxonomic level, with mean-score drops ranging from 9% to 30%. SVM is the most affected, with a mean-score drop of 26.1%, whereas XGB is the least affected, with a drop of 9.74%.

Overall Results
Fig. 9 shows the ten models with the highest performance at the different taxonomic levels. The results are ranked in descending order according to the mean score of each model. Each model is denoted as sequence representation - machine learning algorithm. Most models reached at least a 96% mean score when the data had fewer classes. ER-PSSM-XGB, 3-grams-CNN and 5-grams-Transformer work better than the other models at both taxonomic levels.

Performance in Individual Hosts
Most machine learning algorithms focus on the majority class; it can therefore be assumed that they perform better on the majority class than on the minority classes. In data set 1, 55% of the sequences belong to humans, while only 0.01% belong to partridges. This degree of data imbalance poses a great challenge to classifiers. ER-PSSM-XGB, 3-grams-CNN, and 5-grams-Transformer are the top three models that perform well regardless of how skewed the data are. Their AUCPR scores for individual hosts of data set 1 are shown in Fig. 10.
All three models scored an AUCPR below 80% in all hosts except human, swine, and chicken, which together account for approximately 81% of the sequences in data set 1. However, the baseline AUCPR for each class is the proportion of that class in the data set, so the human, swine, and chicken classes also have relatively higher baselines than the other classes. The AUCPR and corresponding baseline for each class are shown in lime and green, respectively. All three models achieved higher scores than the baseline, but the variability increased for classes with fewer samples. Fig. 11 illustrates the performance of the three models on individual hosts at the lower taxonomic level for data set 2. The AUCPR of 5-grams-Transformer and 3-grams-CNN in each class still outperformed the baseline, even with the introduction of incomplete sequences, while ER-PSSM-XGB beat the baseline only very slightly.

Effect of Incomplete Sequences
Additionally, we investigated the effect of incomplete sequences on all models. Data set 1 contains numerous selected HA sequences, which we used to produce the pre-trained models. Data set 2 contains 103 incomplete sequences, which we used to evaluate the impact of incomplete sequences on the models, as shown in Fig. 12 and Fig. 13.
The Transformer algorithm outperforms the others at both taxonomic levels and is least affected by incomplete sequences. Overall, the performance of each model on data set 2 was reduced, but the impact of incomplete sequences on the models was small.

Ensemble Results
We assembled all models and generated predictions for data set 1. For 97.56% of the sequences, the predictions matched the true label, while for 2.44% they did not. More specifically, for 0.88% of the sequences, none of the models predicted the true label. To determine which sequences were not predicted "correctly" by any model, we performed a basic statistical analysis.
Approximately 30% of these sequences were collected between 2009 and 2011, when the 2009 influenza pandemic occurred; 57.52% and 40.76% of the sequences belong to swine and human, respectively; and 50.66%, 23.42%, 8.57%, and 4.00% belong to H1N1, H5N1, H9N2, and H5N6, respectively. We also plotted the word clouds for these sequences, as shown in Fig. 14. The most frequent tokens were LVL and GAI, which were also the most frequent tokens in swine sequences (Fig. 4).
Sequences that failed to fit the model may have been collected during outbreaks (e.g. pH1N1-like) or may have strong cross-species capabilities. For example, A/Beijing/1/2017 and A/India/TCM2581/2019 are avian viruses isolated from humans (Pan et al., 2018; Potdar et al., 2019); these sequences are therefore labelled human while our prediction is avian. Similarly, A/swine/Jangsu/48/2010 is a pH1N1-like swine virus used to demonstrate retro-infection from humans to swine in China (Zhao et al., 2012). This sequence is labelled swine while our prediction is human.

Discussion and Conclusion
In this work, we applied popular supervised machine learning algorithms to predict influenza A virus hosts from hemagglutinin sequences. We implemented a position-specific scoring matrix (PSSM) to extract evolutionary features of sequences, and then fed them into classic machine learning algorithms (SVM, RF, RUSBoost, and XGB). We proposed three methods to unify the dimension of each PSSM (i.e., ER, GDPC, and EG). For deep learning models (i.e., Transformer, CNN, and MLP), an embedding layer can be added to learn features automatically. Without an embedding layer, a feature extraction step must be applied before feeding the input data into the models. For large data sets, the computation of PSSMs is time-consuming and laborious, taking days or even months. To reduce computational time, we can reduce the size of the data set or perform a BLAST search against a partial protein database, but at the expense of reduced accuracy. The other disadvantage of PSSMs is that they do not have a unified dimension, so unifying strategies are required to process them efficiently. End-to-end deep learning approaches address these limitations without the need for cumbersome feature engineering. In this study, we demonstrate that word embedding can be a useful alternative for processing sequences.
To evaluate the models at different taxonomic levels and on different data sets, nested cross-validation and a variety of evaluation metrics were used. Sequences in data set 1 were chosen selectively, including only unique complete sequences. The sequences in data set 2 were collected in the most recent year, with only redundant sequences discarded. The two data sets are mutually exclusive, and the models are blind to data set 2. Based on our results, ER-PSSM-XGB has the best performance on data set 1 at a lower classification level, whereas 5-grams-Transformer performs optimally at a higher classification level.
Data set 2 contains a small proportion of incomplete sequences, which can be regarded as noisy data or data with incomplete information. In this circumstance, the 5-grams-Transformer still performs optimally. The Transformer is more powerful than classic RNNs and CNNs because it incorporates a multi-head attention mechanism that focuses on the most important parts of the sequence, and it introduces sequence-order information through positional encoding.
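The two mechanisms just mentioned can be sketched in a few lines of NumPy. This is a minimal, single-head illustration of sinusoidal positional encoding and scaled dot-product attention, not the exact architecture used in this study; the dimensions are chosen only for the example.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: injects token-order
    information into the token embeddings."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])
    pe[:, 1::2] = np.cos(angles[:, 1::2])
    return pe

def scaled_dot_product_attention(q, k, v):
    """One attention head: each position attends to the most
    relevant positions of the sequence (softmax over scores)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

seq_len, d_model = 7, 32  # e.g. 7 overlapping n-gram tokens
x = np.random.randn(seq_len, d_model) + positional_encoding(seq_len, d_model)
out, attn = scaled_dot_product_attention(x, x, x)
print(out.shape, attn.shape)  # (7, 32) (7, 7)
```

A multi-head layer simply runs several such heads on learned projections of `x` and concatenates their outputs; without the positional-encoding term, the layer would be blind to token order.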
We exclusively utilised supervised machine learning, but supervised learning relies on ground truth. We extracted class labels from sequence metadata, which indicate the host from which each sequence was isolated. As a result, some labels used in this study do not perfectly represent the ground truth. In contrast to supervised learning, semi-supervised or unsupervised learning requires partial or even no ground truth during training, and could be combined with our approach in future studies.
seq: M L S I T I L F L ...
trigrams: MLS LSI SIT ITI TIL ILF LFL ...
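The overlapping n-gram tokenisation shown above can be sketched as follows (a minimal illustration; the function name is ours):

```python
def ngrams(seq, n=3):
    """Split a protein sequence into overlapping n-gram tokens."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

print(ngrams("MLSITILFL"))
# ['MLS', 'LSI', 'SIT', 'ITI', 'TIL', 'ILF', 'LFL']
print(ngrams("MLSITILFL", n=5))  # 5-grams, as in the 5-grams-Transformer
```

These tokens are what the embedding layer of the deep models consumes in place of hand-crafted PSSM features.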

Figure 5: Example of a fully connected MLP architecture.

Figure 7: Example of the Transformer architecture; here we assume that the number of heads is 3 and the dimension of the embedding is 32.

Figure 8: Example of the nested k-fold CV (k_outer = 5 and k_inner = 4): we used models trained during nested cross-validation (i.e., pre-trained models) to predict unseen data. In this study, data set 2 and a proportion of data set 1 served as unseen data for the pre-trained models.
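The nested split described above can be sketched in pure Python, with the fold counts from the figure (5 outer, 4 inner). This is only an index-generation sketch under those assumptions, not the authors' implementation, and it omits shuffling and stratification.

```python
def kfold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def nested_cv_splits(n, k_outer=5, k_inner=4):
    """Yield (outer_test, inner_train, inner_val) index lists: each
    outer fold holds out ~20% of the data, and the remaining ~80%
    is split again into k_inner train/validation folds."""
    for outer_test in kfold_indices(n, k_outer):
        rest = [i for i in range(n) if i not in set(outer_test)]
        for val_pos in kfold_indices(len(rest), k_inner):
            inner_val = [rest[i] for i in val_pos]
            inner_train = [i for i in rest if i not in set(inner_val)]
            yield outer_test, inner_train, inner_val

splits = list(nested_cv_splits(100))
print(len(splits))        # 5 outer x 4 inner = 20 train/val splits
print(len(splits[0][0]))  # 20 sequences held out per outer fold
```

The models trained on the inner folds are the "pre-trained models" later applied to data set 2.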

Figure 9 :
Figure 9: Performance of different models at different classification levels on data set 1.

Figure 12: Performance of different models at a higher taxonomic level on data set 2.
Figure 13: Performance of different models at a lower taxonomic level on data set 2.
Figure 14: Word clouds of the sequences that no model predicted correctly.

Table 1: Number of sequences before and after selection. Each outer fold includes approximately 20% of the data from data set 1.

Table 3: Comparison of sequence representations on data set 1.
Figure 10: Performance of different models in individual hosts at a lower taxonomic level on data set 1.
Figure 11: Performance of different models in individual hosts at a lower taxonomic level on data set 2.