K-mer-based machine learning method to classify LTR-retrotransposons in plant genomes

Every day, more plant genomes become available in public databases, and additional massive sequencing projects (i.e., projects that aim to sequence thousands of individuals) are formulated and released. Nevertheless, there are not enough automatic tools to analyze this large amount of genomic information. LTR retrotransposons are the most frequent repetitive sequences in plant genomes; however, their detection and classification are commonly performed using semi-automatic and time-consuming programs. Despite the availability of several bioinformatic tools that follow different approaches to detect and classify them, none of these tools can individually obtain accurate results. Here, we used Machine Learning algorithms based on k-mer counts to distinguish LTR retrotransposons from other genomic sequences and to classify them into lineages/families with an F1-Score of 95%, contributing to the development of an alignment-free and automatic method to analyze these sequences.


INTRODUCTION
The availability of large-scale biological data is changing the way researchers analyze and solve problems in almost every area of the biological sciences. Machine Learning (ML) algorithms can use these data to automatically learn the parameters needed to fit a model to a specific problem (Shastry & Sanjay, 2020) in order to predict known labels. This process is called supervised learning (Zou et al., 2018). Bioinformatics, which lies at the intersection of computer science, biological sciences, and mathematics (Orozco-Arias et al., 2017), plays a central role in storing, analyzing, categorizing, and labeling the huge flow of information generated, for example, by next-generation sequencing (NGS) platforms. Advances in these sequencing technologies have provided a new paradigm in the field of post-genomics, which focuses on how to store, process, and analyze these data streams in acceptable time frames (Rigal & Mathieu, 2011; Chen et al., 2014; Rishishwar et al., 2017).
Although there is an open debate on how to classify TEs, the most common classification approach is based on their life cycle (i.e., how they move inside the genome). TEs are classified hierarchically (Orozco-Arias et al., 2019): first, they are divided into two main classes (Class I, or retrotransposons, and Class II, or DNA transposons (Wicker et al., 2007)) and further divided into orders according to the similarity of their coding domains (Chaparro et al., 2015; Orozco-Arias, Isaza & Guyot, 2019; Neumann et al., 2019). In plants, the most frequent elements are, by far, long terminal repeat (LTR) retrotransposons (LTR-RTs, an order of Class I) (Gao et al., 2012; Grandbastien, 2015a), which account for 80% of the genome size of species such as wheat, barley, or the rubber tree (Rahman et al., 2013).
Indeed, many bioinformatic tools have been developed to detect and classify transposable elements, including multiple approaches that group TEs based mainly on their structure, homology, redundancy, or conservation across genomes (Rawal & Ramaswamy, 2011; Jiang & Ramachandran, 2013; Loureiro et al., 2013; Schietgat et al., 2018). Nevertheless, the complexity of these elements does not consistently allow for accurate, reliable, and reproducible results across programs for all types of TEs (Arkhipova, 2017). In recent years, several machine learning-based methods have been proposed and evaluated, which take advantage of the thousands of sequences available in datasets such as Repbase (Jurka et al., 2005), RepetDB (Cornut et al., 2019), PGSB (Spannagl et al., 2016a), and InpactorDB (Orozco-Arias et al., 2021).
Several studies have, therefore, proposed the use of ML for TE analysis (reviewed in Orozco-Arias et al., 2019), for example to differentiate between LTR-RTs and SINEs (Ashlock & Datta, 2012) or between autonomous and non-autonomous LTR-RTs in the Coffea canephora genome (Arango-López et al., 2017), and to improve the accuracy and performance of the classification (Loureiro et al., 2013; Nakano et al., 2017). Recently, a random forest algorithm was used to broadly classify LTR-RTs into superfamilies (Schietgat et al., 2018), while additional pre-processing techniques and coding schemes allow their deeper classification (Orozco-Arias et al., 2020). Moreover, several deep neural network (DNN) architectures that perform TE classification have been published so far. For instance, Nakano et al. implemented a fully connected neural network (FNN) (Nakano et al., 2018), da Cruz et al. used a convolutional neural network (CNN) with a 2D representation of the sequences (da Cruz et al., 2019), and Yan et al. used a 1D CNN to classify TEs into superfamilies (da Cruz et al., 2020; Yan, Bombarely & Li, 2020).
Despite these efforts, none of them addresses the need to both detect and classify elements at the same time, or specifically aims to classify LTR-RTs at the lineage/family level. In this paper, we evaluate and demonstrate the application of ML algorithms to the binary classification of LTR-RTs against other genomic features, their multi-class classification into lineages/families, or both (in a single process), as well as the importance of the features used, in order to design an alignment-free method for the annotation of LTR-RTs in plant genomes based on k-mer frequencies.
For the binary classification task, we randomly selected 10,000 LTR retrotransposon sequences (taken as positive instances) and 10,000 genomic feature sequences as negative instances. For the classification task into lineages/families, we used only InpactorDB data, while, for the binary plus multi-class classification problem (unified into a single ML process), we filtered the negative instances to retain only sequences longer than 6 Kb. We applied this filter to reduce the number of sequences from more than 3 million to 34,830, and because the average length of Copia elements in InpactorDB is 5,957.48 bp, whereas that of Gypsy elements is 10,760.57 bp.
As features, we selected k-mer frequencies with 1 <= k <= 6, as recommended in Orozco-Arias et al. (2020), generating all possible k-mers and then counting the number of occurrences of each in every sequence. We calculated them for the lineage-level classification and the binary plus multi-class classification tasks. For binary classification, we used the same coding schemes as implemented in Orozco-Arias et al. (2020): DAX (Yu et al., 2015), EIIP (Nair & Sreenadhan, 2006), Complementary (Akhtar, Epps & Ambikairajah, 2008), Enthalpy (Kauer & Blöcker, 2003), and Galois (4) (Rosen, 2006). Additionally, two techniques were applied to automatically extract features from the sequences: (i) k-mer frequencies were obtained for each element, and (ii) three physico-chemical (PC) properties were extracted, namely average hydrogen bonding energy per base pair (bp), stacking energy (per bp), and solvation energy (per bp), which were calculated by taking the first di-nucleotide and moving in a sliding window of one base at a time (Jaiswal & Krishnamachari, 2019). Moreover, we pre-processed the data by scaling, following the strategy implemented in Tabares-Soto et al. (2020), and performed dimensionality reduction through principal component analysis (PCA) (Wold, Esbensen & Geladi, 1987) with a cumulative variance of 96% and a tolerance of 1e−4.
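As an illustration of this feature extraction step, the following minimal sketch (not the authors' implementation) counts every possible k-mer with 1 <= k <= 6 in a DNA sequence using a sliding window; note that 4^1 + 4^2 + ... + 4^6 = 5,460, which matches the total number of k-mer features reported later in the paper.

```python
from itertools import product

def kmer_frequencies(seq, k_min=1, k_max=6):
    """Count occurrences of every possible k-mer (k_min <= k <= k_max)
    in a DNA sequence, sliding one base at a time."""
    seq = seq.upper()
    counts = {}
    for k in range(k_min, k_max + 1):
        # initialise all 4^k possible k-mers of this length to zero
        for kmer in product("ACGT", repeat=k):
            counts["".join(kmer)] = 0
        # slide a window of width k over the sequence
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if kmer in counts:  # skip windows containing N or other IUPAC codes
                counts[kmer] += 1
    return counts
```

Each sequence then becomes a fixed-length numeric vector (one entry per possible k-mer), which is what makes the approach alignment-free.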
For the binary classification task, we divided the dataset into a training set (80% of the data), a validation set (10%), and a test set (10%). For multi-class classification into lineages, we used the same partition and, additionally, applied k-fold cross-validation (Komer, Bergstra & Eliasmith, 2014) with k = 9 after tuning the hyper-parameters, in order to test the generalization capability of each model.
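The 80/10/10 partition and the 9-fold cross-validation can be sketched with Scikit-learn as follows; the feature matrix here is random toy data standing in for the k-mer vectors, and the choice of KNN as the demonstration model is ours.

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# toy stand-in for the k-mer feature matrix (hypothetical data)
rng = np.random.default_rng(0)
X = rng.random((200, 20))
y = rng.integers(0, 2, 200)

# first split off 20%, then halve it: 80% train, 10% validation, 10% test
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42)

# 9-fold cross-validation, scored with F1 as in the paper
scores = cross_val_score(KNeighborsClassifier(), X_train, y_train,
                         cv=9, scoring="f1")
```

The validation set is used for hyper-parameter tuning, while the held-out test set is touched only once, for the final performance estimate.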

Machine learning algorithms used
For binary classification between positive (LTR retrotransposons) and negative (other genomic features) instances, we used the same algorithms described in Orozco-Arias et al. (2020): Linear Support Vector Classifier (SVC), Logistic Regression (LR), Linear Discriminant Analysis (LDA), K-Nearest Neighbors (KNN), Naive Bayesian Classifier (NB), Multi-Layer Perceptron (MLP), Decision Trees (DT), and Random Forest (RF), selecting the highest F1-Score across different values of a hyper-parameter (as in Orozco-Arias et al. (2020)).
For classification, we used the supervised models that showed the best performance in Orozco-Arias et al. (2020), namely KNN, LR, SVC, and LDA, but we applied hyper-parameter tuning (Table 1) using GridSearchCV from Scikit-learn (Pedregosa et al., 2011), using only sequences from InpactorDB (we did not include negative instances due to the high memory requirements). We used the F1-Score as the performance metric in all executions, since it is not greatly affected by unbalanced datasets such as LTR-RTs (Orozco-Arias et al., 2020).
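A minimal sketch of the GridSearchCV tuning step is shown below for KNN; the grid values here are illustrative placeholders (the actual values iterated per model are those listed in Table 1), and the dataset is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# synthetic multi-class data standing in for InpactorDB k-mer vectors
X, y = make_classification(n_samples=300, n_features=30, n_informative=10,
                           n_classes=3, random_state=1)

# hypothetical grid; the real values per model are given in Table 1
param_grid = {"n_neighbors": [1, 3, 5, 7],
              "weights": ["uniform", "distance"]}

# exhaustive search over the grid, scored with a class-weighted F1
search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      scoring="f1_weighted", cv=5)
search.fit(X, y)
best_knn = search.best_estimator_  # refit on the full data with best params
```

GridSearchCV cross-validates every parameter combination and refits the best one, which is then retrained and evaluated as described below.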
Next, a Stacking Classifier was implemented as an ensemble algorithm, i.e., a combination of multiple ML models that creates a more complex model (Zhang & Ma, 2012; Müller & Guido, 2016). The stacking classifier comprised the LDA, Linear SVC, and KNN algorithms and used Random Forest as a meta-classifier. Similarly, for binary plus multi-class classification, we used a Stacking Classifier with the KNN, LDA, and LR algorithms and Random Forest as a meta-classifier. Figure 1 summarizes the three approaches used in this study. The Python 3 implementation of all algorithms used in this study is available in the Supplemental Material 2 or at https://github.com/simonorozcoarias/MachineLearningInTEs/blob/master/Scripts/binary_plus_multi_clasification.py.
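The stacking arrangement described above (LDA, Linear SVC, and KNN as base learners, Random Forest as meta-classifier) can be sketched with Scikit-learn's StackingClassifier; the synthetic data and the cv value are our own illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, StackingClassifier

# synthetic multi-class data standing in for lineage-labeled k-mer vectors
X, y = make_classification(n_samples=300, n_features=30, n_informative=10,
                           n_classes=4, random_state=1)

# base learners feed their cross-validated predictions to the meta-classifier
stack = StackingClassifier(
    estimators=[("lda", LinearDiscriminantAnalysis()),
                ("svc", LinearSVC()),
                ("knn", KNeighborsClassifier())],
    final_estimator=RandomForestClassifier(random_state=1),
    cv=5)
stack.fit(X, y)
preds = stack.predict(X)
```

The meta-classifier learns from the base models' out-of-fold predictions rather than the raw features, which is what allows the ensemble to outperform its individual members.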

Feature selection
We used the Gradient Boosting algorithm (Friedman, 2002) implemented in Scikit-Learn to determine the importance of each feature using the complete dataset (InpactorDB plus negative instances). Gradient Boosting generates a score for each feature that represents how useful it was in the construction of the boosted decision trees: the more valuable a feature is for making key decisions, the higher its importance score (Hastie, Tibshirani & Friedman, 2009). Thus, we extracted features with an importance score greater than 60, 40, 30, 20, and 10, which yielded 65, 289, 508, 1,034, and 2,397 features, respectively, out of 5,460 k-mers. We used the following hyper-parameters: boosting_type='goss', n_estimators=10000, and class_weight='balance'. Finally, we extracted the selected features to create new, reduced datasets that were used to train the same ensemble algorithm implemented in the binary plus multi-class classification task.
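The selection idea can be sketched as follows with Scikit-learn's GradientBoostingClassifier (the boosting_type and class_weight hyper-parameters quoted above suggest the LightGBM implementation was actually used; the principle is the same). Note that Scikit-learn importances are normalized to sum to 1, so the threshold here is illustrative rather than the raw scores (60, 40, ...) used in the paper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# synthetic data standing in for the k-mer feature matrix
X, y = make_classification(n_samples=300, n_features=50, n_informative=8,
                           random_state=1)

gb = GradientBoostingClassifier(n_estimators=50, random_state=1)
gb.fit(X, y)

# keep only features whose importance exceeds an illustrative threshold
threshold = 0.01
selected = np.where(gb.feature_importances_ > threshold)[0]
X_reduced = X[:, selected]  # reduced dataset to retrain the ensemble on
```

Retraining on the reduced matrix trades a small amount of information for a large reduction in memory and training time, as evaluated in the results below.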

Binary classification of LTR retrotransposons and other genomic features
As negative instances, we obtained 2,713,028 coding sequences (CDS), 262,925 RNAs of different types (e.g., ncRNA, mRNA, miRNA, rRNA, snRNA, and tRNA), 37,077 TEs that did not correspond to LTR retrotransposons (i.e., Class II TEs, LINEs, SINEs, DIRS, and PLEs), and 3,566 chimeric sequences from Repbase, PGSB, and RepetDB (Table 2). These sequences, with the exception of the non-LTR-RT TEs, were obtained from 47 plant species available in public databases, such as Ensembl Plants and JGI (Supplemental Material 1). We used sequences from InpactorDB as positive instances. Due to the high imbalance between the two classes (2,979,519 negative vs 67,241 positive), we randomly extracted 10,000 sequences from each. Then, the DNA sequences were converted to numerical representations using the coding schemes and automated techniques described in Orozco-Arias et al. (2020). Finally, we applied data scaling and dimensionality reduction through PCA (Tabares-Soto et al., 2020). Using this dataset, we trained the ML algorithms and determined the performance, in terms of F1-Score, of each coding scheme over each ML model (Fig. 2). For the binary classification task, we obtained F1-Scores of up to 97.9%, 96.3%, and 95.9% for MLP, SVC, and LR, respectively, on the test dataset using k-mer frequencies as features.
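The scaling and PCA pre-processing step can be sketched as a Scikit-learn pipeline; the 96% cumulative variance and 1e−4 tolerance come from the paper, while the synthetic data is a stand-in for the encoded sequences.

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# synthetic data standing in for the numerically encoded sequences
X, _ = make_classification(n_samples=200, n_features=40, n_informative=10,
                           random_state=0)

# scale each feature, then keep the minimal number of principal
# components explaining 96% of the cumulative variance
pipe = make_pipeline(StandardScaler(), PCA(n_components=0.96, tol=1e-4))
X_pca = pipe.fit_transform(X)
```

Passing a float to n_components makes PCA choose the component count automatically from the cumulative explained-variance ratio, so the reduced dimensionality adapts to each coding scheme.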

Multi-class classification of LTR retrotransposons into lineages/families
For each of the selected models, a dictionary was created containing the hyper-parameters and the values to be iterated. After training each ML algorithm with GridSearchCV (Tabares-Soto et al., 2020), we determined the parameters that generated the best performance, as shown in Table 3. After tuning the hyper-parameters, each model was retrained to determine its performance. We obtained F1-Scores of 91%, 97%, 96%, and 97% with the LR, KNN, LDA, and SVC algorithms, respectively (Fig. 3).
For the ensemble algorithm, the LR classifier was excluded since it showed the lowest performance (Fig. 3A). Therefore, the Stacking Classifier was implemented as an ensemble algorithm composed of the LDA, Linear SVC, and KNN algorithms, using Random Forest as meta-classifier. This ensemble model reached 97% in F1-Score, accuracy, recall, and precision (Fig. 4), and 99% in area under the ROC (receiver operating characteristic) curve (AUC) (Fig. S1) for the classification of LTR retrotransposons.

Binary plus multi-class classification task
After obtaining promising results in both the binary and multi-class classification tasks, we proceeded to merge them into a single ML problem. Thus, we included the negative instances as another class but deleted the sequences shorter than 6 Kb (Table 4). Furthermore, only k-mer frequencies were used as features because of the high performance obtained for the two problems separately. We also used the tuned hyper-parameter values (Table 3) for KNN, LDA, and LR (Fig. 5). F1-Scores of 95%, 94%, and 84% were obtained using KNN, LDA, and LR, respectively. We then implemented an ensemble method (the same as implemented for the classification task) using the three aforementioned algorithms and RF as a meta-classifier. We obtained an F1-Score of 96% in k-fold cross-validation with k = 9 (Fig. 6). This method also obtained 95% precision and recall and 98% AUC (Fig. S2). Furthermore, as shown in Fig. 7, the classes with the lowest F1-Scores are Classes 8 (Ikeros, Copia) and 16 (Galadriel, Gypsy), since these classes have the lowest number of samples.

Feature selection and evaluation
Using the Gradient Boosting algorithm and the entire dataset (negative instances plus InpactorDB), we obtained the importance of each feature (k-mer frequencies). The number of features is relatively high (5,460), and since the computational cost of processing them can be very high, the number of features must be reduced without degrading the performance of the ML algorithm. Figure 8 shows the importance of all features as determined by Gradient Boosting.

DISCUSSION
Transposable element detection and classification are time-consuming tasks for plant genome annotation projects and involve semi-automatic pipelines and curation by experts. These pipelines generally rely on sequence alignment approaches, which have a significant computational cost in the analysis of large or TE-rich genomes. The development of automatic algorithms for TE detection and annotation, without sequence alignment, is now required to process the huge amount of genomic information being generated. The most frequent repeated sequences in plant genomes are LTR retrotransposons (Gao et al., 2012; Grandbastien, 2015b). Their transcriptional activity and mobility can have a profound impact on the structure, composition, and size of genomes, among other effects (Orozco-Arias, Isaza & Guyot, 2019). To detect these elements in assembled genome sequences, the most frequent methods are based on their specific internal structure (e.g., the duplication of their LTR at both ends of the element) and domains, or on homology searches against reference libraries (Loureiro et al., 2013; Nicolas, Peterlongo & Tempel, 2016). Nevertheless, these strategies have strong limitations (Orozco-Arias, Isaza & Guyot, 2019). Structure-based methods will not accurately detect partial elements, variants, or non-autonomous elements such as solo-LTRs, TR-GAG (Chaparro et al., 2015), or TRIM (Witte et al., 2001). Moreover, homology-based methods can produce low-quality annotations if the reference library is incomplete or constructed from species distant from the one to be annotated (Orozco-Arias, Isaza & Guyot, 2019). These problems can be overcome using new methods, such as machine learning, that are not based on structure or homology (i.e., sequence alignment) (Nakano et al., 2018; Schietgat et al., 2018; Orozco-Arias et al., 2019; da Cruz et al., 2020). Although some studies have used ML to analyze TEs, none of them have focused on the detection and lineage/family-level classification of LTR retrotransposons. In previous studies (Orozco-Arias et al., 2020), we have shown that the classification of these elements can be very efficient with ML approaches, but that coding schemes and feature extraction techniques can deeply influence the performance of such algorithms. Here, we have shown the possibility of both binary and multi-class classification through ML with different approaches. First, by using a negative dataset (i.e., sequences of genomic features excluding LTR-RTs), we obtained an F1-Score of up to 97.9% in the binary classification task. In the detailed lineage/family-level classification task, we obtained a performance of 97% by using the same datasets as in Orozco-Arias et al. (2020) but including more elements from InpactorDB. Finally, we unified both tasks into a single process, obtaining a 96% F1-Score. These results were obtained using the k-mer frequencies of the sequences. K-mers are frequently used in bioinformatics, for example in quality control of generated sequences (Mapleson et al., 2017), metagenomics (Breitwieser, Baker & Salzberg, 2018), de novo assembly of genomes and transcriptomes (Zerbino & Birney, 2008; Simpson et al., 2009), genome size estimation (Sun et al., 2018), and de novo detection of transposable elements (Price, Jones & Pevzner, 2005). In addition to detection, k-mers have also proven useful in higher hierarchical classification approaches (at the superfamily level): Nakano et al. (2018) exploited k-mer frequencies (with k = 2, 3, 4) as features to train a deep neural network, and Yan, Bombarely & Li (2020) demonstrated interesting statistical differences using k-mer frequencies with k ranging from 3 to 7.
Here, we found that the combination of k-mer frequencies with different values of k (from 1 to 6) added reliability to the binary plus multi-class classification task, requiring only 289 features (out of the initial 5,460) to obtain an F1-Score of 95%. We also observed that, among all possible k-mers generated, larger values of k yield fewer features with high importance scores. The specific k-mers that contribute the most to the binary and multi-class classifications must be analyzed in detail to understand their involvement in the composition and structure of the elements of each lineage. This work highlights the feasibility of designing and implementing ML-based tools that automate the complex process of annotating LTR retrotransposons in plant genomes. As future work, we propose a benchmark comparing this method with well-established methods, such as those based on homology and structure, across various plant genomes.

CONCLUSIONS
Massive sequencing projects require automatic tools to analyze large amounts of genomic information in a fast yet accurate, reliable, and reproducible manner. The binary classification of LTR-RTs against other genomic features, and their lineage-level classification in plant genomes, can be performed using ML-based and ensemble methods with good performance (up to 96% F1-Score). This task can be performed with only 289 k-mer frequencies, requiring little computational time and few resources. These results can be used in the design and implementation of automatic, alignment-free tools to address the processing of the increasing number of available plant genomes.

Figure 1
Figure 1 Schematic representation of the different approaches used in this study for the binary and multi-class classification of LTR retrotransposons through Machine Learning. Full-size DOI: 10.7717/peerj.11456/fig-1

Figure 2
Figure 2 F1-Scores obtained by the Machine Learning models using different coding schemes on the test set. Full-size DOI: 10.7717/peerj.11456/fig-2

Figure 4
Figure 4 Learning curves of the Stacking Classifier algorithm in the classification of LTR retrotransposons up to the lineage/family level. The shaded area around the line indicates the standard deviation. Full-size DOI: 10.7717/peerj.11456/fig-4

Figure 5
Figure 5 Learning curves for KNN (A), LDA (B), and LR (C) for the binary plus multi-class classification task. The shaded areas around the lines indicate the standard deviation. Full-size DOI: 10.7717/peerj.11456/fig-5

Figure 6
Figure 6 Learning curves for the Stacking Classifier (ensemble method) for the binary plus multi-class classification task. The shaded areas around the lines indicate the standard deviation. Full-size DOI: 10.7717/peerj.11456/fig-6

Figure 10
Figure 10 Learning curves for the Stacking Classifier (ensemble method) for the binary plus multi-class classification task using only the 289 selected features. The shaded areas around the lines indicate the standard deviation. Full-size DOI: 10.7717/peerj.11456/fig-10

Table 1
Machine Learning models and hyper-parameters tuned.

Table 2
Composition of the negative instances dataset.

Table 3
Tuned parameter values.

Table 4
Composition of the dataset used in the binary plus multi-class classification task.
