INFINITy: A fast machine learning‐based application for human influenza A and B virus subtyping

Influenza viruses are among the main agents causing acute respiratory infections (ARI) in humans, resulting in substantial illness and death globally. Influenza virus classification is based on the nomenclature proposed by the World Health Organization (WHO), which is widely accepted and used by the medical and scientific communities throughout the world. Since the 2009 pandemic, two subtypes of human influenza A viruses, A(H1N1)pdm09 and A(H3N2), and two lineages of influenza B, B/Victoria and B/Yamagata, have been responsible for the vast majority of cases each year. Within each subtype and lineage, different clades and genetic groups have been described to reflect the continuous viral evolution driven by antigenic drift. The WHO Global Influenza Surveillance and Response System (GISRS) studies human influenza viruses from >110 countries to monitor circulating strains, understand their epidemiology and evolution, and help verify vaccine effectiveness and update the vaccine formulation each year. A growing number of laboratories and research centers are contributing to this initiative by sequencing the whole viral genome or the hemagglutinin (HA) gene from local strains. Influenza clade classification is usually performed by phylogenetic analysis of HA gene sequences from circulating strains along with reference sequences, which is a time-consuming process that requires specific training and equipment. Alternatively, it can be done by comparing amino acid substitutions, either manually or by using in-house scripts.
While specific tools are currently available for influenza classification, they have several limitations, such as: (a) they require an alignment of the input data against reference sequences, which can be computationally expensive; (b) they require multiple ad hoc programs to be installed; (c) they require users to be familiar with the command line; (d) they require users to create a template containing the clade-defining amino acid pattern by position; (e) they classify sequences only into type A or B and subtype/lineage, without discerning clades or genetic groups; and (f) they take into account only the most prevalent and recent influenza clades. Advanced machine learning techniques have proven to make accurate predictions, using algorithms that reveal patterns in large datasets. In the analysis of viral data, machine learning methods have recently been implemented in, for example, COVIDEX, a tool that classifies complete genome nucleotide sequences of SARS-CoV-2 into lineages [9]; a recent application for avian influenza clade classification [10]; the prediction of phenotypes for human influenza A from proteomic input [11]; and the detection of new variants using ensemble learning [12]. In this context, we developed INFINITy, a tool based on alignment-free machine learning for the classification of human influenza viruses into subtypes and clades. INFINITy is a web application that runs over an internet connection, requires no installation, and has a user-friendly interface. It is fast, sensitive, specific, and ready to implement. Additionally, it is available as an R package for R and RStudio users who wish to run it locally. Furthermore, two Docker images are available to ensure the reproducibility of the results. INFINITy includes two classification models: one for complete HA sequences (FULL HA, for the whole gene sequence length of 1700 bp) and another for the HA1 subunit coding sequence (HA1, for the initial 1030 bp of the HA gene).
The influenza classification comprises 75 clades or genetic groups: 25 for A(H1N1)pdm09, 32 for A(H3N2), 14 for B/Victoria, and 4 for B/Yamagata (supporting information Table S1).


The overall classification algorithm is divided into three phases:
1. The first phase loads the user data in multi-FASTA format and performs the k-mer counting operation using the kmer package [13]. Each k-mer count is normalized over the k-mer size (k = 6) and the sequence length.
2. The second phase calls the predict function of the ranger package [14] using one of the two pre-trained random forest models (FULL HA or HA1) and obtains a probability score based on the majority vote rule. From this, the app obtains the classification score for each query sequence, the proportion of N bases in the sequence, and the sequence length.
3. Finally, two tables are created: one showing the sequences that passed all the quality checks and another showing the sequences that failed at least one filter. The filters check that each sequence obtained a probability score of 0.4 or more, that the sequence length differs from the length expected by the classification model (1700 bp for FULL HA or 1030 bp for HA1) by no more than 50%, and that the percentage of ambiguous bases (N) in the sequence is no larger than 2%.
A brief report can be pro[...] Table S2).
Correlation heatmaps, metrics tables, precision-recall curves, and other statistics were generated for each model (supporting information File S1).
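As an illustration of the first and third phases described above, the following is a minimal Python sketch; it is not the INFINITy code, which is written in R and relies on the kmer and ranger packages. The function names (`kmer_profile`, `passes_quality_filters`) and the model keys are hypothetical. Two points are assumptions rather than statements from the text: the normalization is taken to mean dividing each count by the number of k-mer windows in the sequence (length minus k plus 1), and the length filter is taken to mean a deviation of at most 50% from the expected length.

```python
from collections import Counter
from itertools import product

K = 6  # k-mer size stated in the text
EXPECTED_LENGTH = {"FULL_HA": 1700, "HA1": 1030}  # hypothetical model keys


def kmer_profile(seq, k=K):
    """Return normalized counts for every A/C/G/T k-mer in the sequence.

    Assumption: each count is divided by the number of overlapping
    k-mer windows (len(seq) - k + 1). K-mers containing other symbols,
    such as N, are not part of the feature set.
    """
    seq = seq.upper()
    windows = max(len(seq) - k + 1, 1)
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: counts.get(kmer, 0) / windows for kmer in kmers}


def passes_quality_filters(seq, score, model):
    """Apply the three filters described in the text to one sequence."""
    if not seq:
        return False
    expected = EXPECTED_LENGTH[model]
    n_fraction = seq.upper().count("N") / len(seq)
    # Assumption: "no more than 50%" is read as a relative deviation.
    length_ok = abs(len(seq) - expected) <= 0.5 * expected
    return score >= 0.4 and length_ok and n_fraction <= 0.02


if __name__ == "__main__":
    profile = kmer_profile("ACGTACGTACGT")
    print(len(profile))  # number of features, 4 ** 6
    print(passes_quality_filters("A" * 1700, 0.9, "FULL_HA"))
```

With k = 6 the profile always has 4^6 = 4096 entries regardless of sequence length, which is what allows sequences to be compared without an alignment.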
To use the app, the user only loads the input file, a FASTA file with unaligned influenza HA or HA1 gene segment query sequences, selects one of the models according to the length of the query sequences (FULL HA or HA1), and presses the run button (Figure 1).
To obtain the most accurate results, we recommend using sequences with a proportion of N bases <1%. Since the HA gene allows for more accurate predictions for subtyping based on phylogeny or machine learning models, the other seven influenza genomic segments were not considered in this version but could be incorporated in the future.
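The recommendation on N content can be checked before uploading a file. The following Python sketch is an illustration under stated assumptions and is not part of INFINITy: the helper names (`read_fasta`, `n_fraction`, `flag_high_n`) are hypothetical, and the parser handles only plain multi-FASTA text with unique headers.

```python
def read_fasta(text):
    """Parse multi-FASTA text into a {header: sequence} dictionary."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


def n_fraction(seq):
    """Return the proportion of N bases; empty sequences count as 1.0."""
    if not seq:
        return 1.0
    return seq.upper().count("N") / len(seq)


def flag_high_n(records, max_fraction=0.01):
    """List headers whose N proportion is not below the recommended 1%."""
    return [h for h, s in records.items() if n_fraction(s) >= max_fraction]


if __name__ == "__main__":
    example = ">clean\nACGTACGT\nACGTACGT\n>ambiguous\nACGNNCGT\n"
    print(flag_high_n(read_fasta(example)))
```

Flagged sequences can still be submitted; the sketch only identifies those that fall outside the recommended range.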
Due to the increasing number of laboratories and researchers applying sequencing technologies to molecular epidemiology, there is a growing need for easier and faster applications that allow an accurate and specific classification of viral sequences with no need for specialized training. This is particularly relevant for respiratory pathogens such as influenza viruses, which cause annual epidemics with up to 60 million ARI cases worldwide and require continuous monitoring of circulating strains. For this reason, we believe INFINITy can help researchers working in this area.

ACKNOWLEDGMENTS
We gratefully acknowledge all the authors, the originating laboratories responsible for obtaining the specimens, and the submitting laboratories for generating the genetic sequence and metadata and sharing via the GISAID Initiative, on which this research is based. We also thank Dr. Andrés Culasso for technical assistance, Dr. Osvaldo Uez for motivation, and Dr. Laura Mojsiejczuk for critical review of the manuscript.
We also thank the Centro de Investigación, Docencia y Extensión en [...]

FIGURE 1 Overview of the INFINITy application. The user loads a sequence file, or copies and pastes the sequences, selects the corresponding model, and presses RUN. Two results tables are then shown: one with the sequences that passed the quality controls and one with those that did not. Although all sequences are classified, the user should interpret the results carefully, considering the quality control for each one. Sequences that did not pass the quality filters are shown as "LowQuality", and sequences with a probability score below 0.2 are shown as "unknown". Finally, the user can download an automatic report.

CONFLICT OF INTEREST
The authors declare no conflict of interest.