Using multi-layer perceptron to identify origins of replication in eukaryotes via informative features

Background The origin is the starting site of DNA replication, a vital part of the inheritance of genetic information from parents to offspring. Accurately identifying origins of replication has great application value in the diagnosis and treatment of diseases related to errors in genetic information, but traditional experimental methods are time-consuming and laborious. Results We studied the origins of replication in a variety of eukaryotes and proposed a dedicated prediction method for each species. We collected data from 7 species: Homo sapiens, Mus musculus, Drosophila melanogaster, Arabidopsis thaliana, Kluyveromyces lactis, Pichia pastoris and Schizosaccharomyces pombe. In addition to the commonly used sequence feature extraction methods PseKNC-II and Base-content, we designed a feature extraction method based on TF-IDF. A two-step method was then used for feature selection. After comparing a variety of traditional machine learning classification models, the multi-layer perceptron was employed as the classification algorithm. The data and code used in the experiments are available at https://github.com/Sarahyouzi/EukOriginPredict. Conclusions The prediction accuracies on the training sets of the seven species after 100 times fivefold cross-validation reached 92.60%, 90.80%, 91.22%, 96.15%, 96.72%, 99.86% and 96.72%, respectively, indicating that the methods we designed outperform other methods. In addition, our experiments reveal that the models of multiple species can predict each other with high accuracy, and the STREME results show that these species share a common motif. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04431-x.

origin, and they will begin to replicate from multiple points during replication [2], which are mainly divided into unidirectional replication and bidirectional replication. Abnormal replication may result in heritable variation in the organism. The accurate replication of DNA not only maintains the continuity of genetic information, but also ensures the relative stability of the species.
However, most related studies focus only on Saccharomyces cerevisiae. In 2004, Cozzarelli's group [3] predicted origins in Saccharomyces cerevisiae by exploiting the AT-richness of replication initiation regions. In 2012, Chen et al. [4] studied the replication initiation sites of Saccharomyces cerevisiae by calculating the bending degree and cleavage intensity of the DNA sequence, an approach that is highly effective for identifying positive samples. In 2016, Zhang et al. [5] made the first attempt to study the origins of human DNA replication and constructed a predictor based on random forest. In 2016, Wang et al. [6] studied H. sapiens, M. musculus and E. coli and proposed a method called "MaloPred"; the AUC values obtained by this method for the three organisms are 0.755, 0.827 and 0.871, respectively. In 2018, Liu et al. [7] studied four kinds of yeasts. In 2019, Dao et al. [8] collected data from a variety of eukaryotes and, using features such as Kmer with an SVM classifier, conducted a complete study of each organism and made some progress. In 2020, Wei et al. [9] presented a novel machine learning-based approach called Stack-ORI, encompassing 10 cell-specific prediction models, which predicts the origins of human and four other organisms well. Nevertheless, further work is needed to improve classification accuracy.
In this study, we collected datasets of 7 eukaryotes, including Homo sapiens (H. sapiens), Mus musculus (M. musculus), Drosophila melanogaster (D. melanogaster), Arabidopsis thaliana (A. thaliana), Pichia pastoris (P. pastoris), Schizosaccharomyces pombe (S. pombe) and Kluyveromyces lactis (K. lactis), and studied each species independently. We employed three types of feature extraction methods (TF-IDF, PseKNC-II, Base-content) and performed two-step feature selection based on SVM. When selecting classification models, we compared SVM, Naïve Bayes, Decision Tree, KNN, MLP and XGBoost to find the best model. Finally, we designed a dedicated classification algorithm for each organism. After the classification experiment, we conducted cross-species tests and sequence analysis using STREME [10]; the results showed that there were similar motifs among various species.

Feature analysis
As mentioned above, we used three feature extraction methods. In this section, we analyze the four Base-content features. We first randomly selected the same number of positive and negative samples from the seven species and then plotted the four feature values for the different samples. As shown in Fig. 2, the features of the positive and negative samples of H. sapiens and S. pombe are not clearly separated, while the other five species show significant differences in GC-skew and AT-profile, which indicates that the extracted features are effective for those species.

Feature ranking analysis
As mentioned above, the method we originally applied for feature ranking was F-score. However, when extracting the TF-IDF features, we found that the TF-IDF score could also serve as a ranking criterion for the corresponding features. To compare the two methods, we used each score in turn as the ranking criterion in the IFS experiment. As shown in Fig. 3, both TF-IDF scores and F-scores are reasonable ranking criteria, as both reflect the importance of features. When the number of features is small, feature selection based on F-score performs better, whereas feature selection based on TF-IDF performs better as the number of features increases. For H. sapiens, M. musculus and D. melanogaster, TF-IDF achieves the best feature selection effect, while F-score is more suitable for A. thaliana, P. pastoris, S. pombe and K. lactis. More importantly, the experiment in this section shows that feature selection significantly improves the classification effect.

Performance evaluation on different feature extraction methods
In this experiment, we extracted three types of sequence features: TF-IDF, PseKNC-II and Base-content. By evaluating a variety of feature sets with SVM, we obtained the most effective feature set for each species.
First, the six pseudo-nucleotide feature sets were combined, the classification effect of the combined set was compared with that of the best single set, and the better of the two was selected as the pseudo-nucleotide feature.
We then compared the three feature extraction methods. As shown in Fig. 4, the features extracted by TF-IDF are the most effective for H. sapiens, M. musculus and D. melanogaster, while pseudo-nucleotide features are more suitable for representing the sequences of A. thaliana, P. pastoris, S. pombe and K. lactis. The classification results for the six individual pseudo-nucleotide feature sets and the combined set are given in Additional file 1.

Performance evaluation on different model
To improve classification accuracy as much as possible, we compared six classification models. As shown in Fig. 5, MLP is clearly superior to the other models for six species, including H. sapiens; only A. thaliana achieved better results when KNN was applied for classification.

Comparison with published methods
To verify the advantages of our methods, a detailed comparison was made with the prediction methods proposed by Dao et al. [8] and Wei et al. [9], based on the same training dataset and independent test dataset. As shown in Table 1, after 100 times fivefold cross-validation, the prediction methods we designed perform much better for all species.
Since we only divided the datasets of H. sapiens, M. musculus, A. thaliana and D. melanogaster into training sets and independent test sets, the comparative experiments based on the independent test were only carried out for these four species. The specific results are shown in Table 2.

Cross-species validation and sequence analysis
In this paper, we conducted independent studies on the origin of replication in seven eukaryotes and trained the corresponding models. To verify the predictive ability of the species-specific models, we carried out cross-species tests; the results are shown in Fig. 6. We then used STREME [10] to analyze the sequences, as it is more suitable than MEME [11,12] for processing datasets containing more than 50 sequences. As shown in Fig. 7, the sequences of H. sapiens, M. musculus and A. thaliana share the motif fragment "GGG", while the sequences of S. pombe, P. pastoris and K. lactis share the motif fragment "AAA". This explains the high prediction accuracy in the cross-species test between H. sapiens and M. musculus, and the results of the sequence analysis point out a direction for further research (Additional file 2).

Conclusion
In this work, we studied the identification of origins of replication for seven eukaryotes. Three methods, PseKNC-II, Base-content and TF-IDF, were used to extract features, and a variety of machine learning models were compared. Our study shows that TF-IDF is the more suitable feature extraction method for H. sapiens, M. musculus and D. melanogaster, which indicates that a text classification algorithm is also suitable for sequence classification and deserves further investigation. In contrast, A. thaliana and the other three organisms achieve the best classification results with features extracted by PseKNC. After comparing various classification models, we found that MLP has a better classification effect for most species. In addition, the models of H. sapiens, M. musculus and D. melanogaster can predict each other with high accuracy, and the STREME results reveal that they share a common motif. Finally, we have open-sourced the code and data used in the experiments, hoping to assist related studies.

The benchmark dataset
To study the origin of DNA replication in various eukaryotes, sample datasets of seven eukaryotes were collected: H. sapiens, M. musculus, D. melanogaster, A. thaliana, P. pastoris, S. pombe and K. lactis [5,7,8]. All the sequences are 300 bp in length, and the positive and negative sample sets are balanced on the whole. Studies indicate that the existing datasets of H. sapiens, M. musculus and D. melanogaster contain several cell types, and that the sample sequences of different cell types differ considerably [8]. To avoid mixing them, we collected the sequences of only one cell type for each of these three species. As shown in the

Feature extraction
For sequence prediction, feature extraction is a necessary and critical step, because almost all machine learning models can only deal with numerical inputs [13]. Extracting effective features not only expresses the characteristics of the sequence well but also improves the accuracy of classification with machine learning models. Since different features capture different key information, our experiments used a variety of feature extraction methods and compared TF-IDF, PseKNC-II and Base-content, in order to capture a variety of sequence characteristics and raise the accuracy of the prediction.

TF-IDF
TF-IDF [14–18] is a method proposed for text classification. The main idea is to find subject terms that appear frequently in a given type of text but rarely in other types. Common words such as "the" and "and" have a high frequency in any given type of text, but they are not representative, since they are common in all articles. Searching for common motifs in sequences is similar to text classification. We therefore applied the classic TF-IDF algorithm in our experiment, with some modifications, to extract sequence features from DNA. The specific formulas are as follows.
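Assuming the standard TF-IDF form, and reading the definitions that follow literally, the scores can be written as below. Here c_i (the number of occurrences of the i-th k-tuple in the positive samples) and K(i) (the set of k-tuples with the same length as the i-th one) are notation introduced for illustration; the normalisation of tf_i and the use of the natural logarithm are assumptions, not details confirmed by the text.

```latex
\begin{aligned}
tf_i &= \frac{c_i}{\sum_{m \in K(i)} c_m}, \\
idf_i &= \log \frac{|D|}{1 + |\{\, j : t_i \in d_j \,\}|}, \\
tf\_idf_i &= tf_i \times idf_i .
\end{aligned}
```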
where tf_i represents the frequency of the i-th k-tuple nucleotide in the positive samples. The value of k ranges from 1 to 6, so there are 5460 k-tuple nucleotides in total and i ranges from 1 to 5460.
where |D| represents the number of all samples and |{j: t_i ∈ d_j}| represents the number of samples containing the i-th k-tuple nucleotide; 1 is added to the denominator to prevent it from being 0.
From this, the TF-IDF score corresponding to each k-tuple nucleotide can be obtained. A [5460 * 1] numerical matrix L is then used to represent each sequence, with each entry holding the score of the corresponding k-tuple. The formula is as follows.
Here, tf_idf_i represents the TF-IDF score of the i-th k-tuple nucleotide, and n_i represents the frequency of this k-tuple in the sequence.
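The encoding described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: tf_i is taken as the count of a k-tuple in the positive samples divided by the total count of k-tuples of the same length, the logarithm is natural, n_i is taken as the raw count of overlapping occurrences in the sequence, and each entry of L is tf_idf_i * n_i. The function names are hypothetical.

```python
import math
from itertools import product


def all_kmers(max_k=6, alphabet="ACGT"):
    """Enumerate every k-tuple for k = 1..max_k (4 + 16 + ... + 4096 = 5460)."""
    return ["".join(p) for k in range(1, max_k + 1)
            for p in product(alphabet, repeat=k)]


def count_kmers(seq, max_k=6):
    """Count overlapping occurrences of each k-tuple in one sequence."""
    counts = {}
    for k in range(1, max_k + 1):
        for start in range(len(seq) - k + 1):
            kmer = seq[start:start + k]
            counts[kmer] = counts.get(kmer, 0) + 1
    return counts


def tfidf_scores(positive_seqs, all_seqs, max_k=6):
    """Return the k-tuple list and a TF-IDF score per k-tuple (assumed form)."""
    kmers = all_kmers(max_k)
    # Term counts are taken over the positive samples only.
    pos_counts = {}
    for seq in positive_seqs:
        for kmer, c in count_kmers(seq, max_k).items():
            pos_counts[kmer] = pos_counts.get(kmer, 0) + c
    # Assumption: normalise by the total count of k-tuples of the same length.
    totals = {}
    for kmer, c in pos_counts.items():
        totals[len(kmer)] = totals.get(len(kmer), 0) + c
    # Document frequency: number of samples containing each k-tuple.
    doc_freq = {}
    for seq in all_seqs:
        for kmer in count_kmers(seq, max_k):
            doc_freq[kmer] = doc_freq.get(kmer, 0) + 1
    n_docs = len(all_seqs)
    scores = {}
    for kmer in kmers:
        total = totals.get(len(kmer), 0)
        tf = pos_counts.get(kmer, 0) / total if total else 0.0
        # The +1 in the denominator avoids division by zero.
        idf = math.log(n_docs / (1 + doc_freq.get(kmer, 0)))
        scores[kmer] = tf * idf
    return kmers, scores


def encode(seq, kmers, scores, max_k=6):
    """Represent a sequence as a 5460-dimensional vector: tf_idf_i * n_i."""
    counts = count_kmers(seq, max_k)
    return [scores[k] * counts.get(k, 0) for k in kmers]
```

With 300 bp sequences, `encode` yields one 5460-dimensional vector per sequence, which can then be passed to feature selection.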

Base-content
Base-content extracts the base information of the sequence. Specifically, the content of single nucleotides (A, C, G, T) in each DNA sequence was used to build the features. Four base characteristics (GC-skew, GC-profile, AT-skew, AT-profile) were considered in this paper [3,19–22].
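As a rough illustration, the four Base-content features could be computed as follows. The skew definitions, (G - C)/(G + C) and (A - T)/(A + T), are the usual ones; treating GC-profile and AT-profile as the GC and AT fractions of the sequence is an assumption made for this sketch, since the exact definitions are not given here. The function name is hypothetical.

```python
def base_content_features(seq):
    """Return GC-skew, AT-skew, GC-profile and AT-profile for one sequence.

    Assumption: the two "profile" features are taken as the GC and AT
    fractions of the sequence; the source does not define them explicitly.
    """
    seq = seq.upper()
    a, c, g, t = (seq.count(base) for base in "ACGT")
    total = a + c + g + t

    def ratio(numerator, denominator):
        # Guard against sequences that lack the relevant bases.
        return numerator / denominator if denominator else 0.0

    return {
        "GC-skew": ratio(g - c, g + c),
        "AT-skew": ratio(a - t, a + t),
        "GC-profile": ratio(g + c, total),
        "AT-profile": ratio(a + t, total),
    }
```

For example, `base_content_features("AACGGT")` gives a GC-skew and an AT-skew of 1/3 each and both profile values equal to 0.5.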

PseKNC-II
PseKNC-II, also known as the series correlation PseKNC [5,23], not only considers the frequency information of k-tuple nucleotides but also accounts for the physicochemical properties of the nucleotides. In this work, we extracted six pseudo-nucleotide feature sets, with k = 1, 2, 3, 4, 5 and 6.

Feature selection
Using numerous features may introduce data redundancy, and invalid features can reduce prediction accuracy. Therefore, the two-step method [24,25] was applied to perform feature selection. The main idea is to score all the features based on F-score and then use IFS to pick out the effective ones, which not only saves computation time during prediction but also improves prediction accuracy. F-score [26] measures the ability of a feature to distinguish between two classes. Given the training set x, let $n_+$ and $n_-$ denote the number of positive samples and the number of negative samples, respectively. The F-score of the i-th feature is
$$F_i = \frac{\left(\bar{x}_i^{(+)} - \bar{x}_i\right)^2 + \left(\bar{x}_i^{(-)} - \bar{x}_i\right)^2}{\frac{1}{n_+ - 1}\sum_{k=1}^{n_+}\left(x_{k,i}^{(+)} - \bar{x}_i^{(+)}\right)^2 + \frac{1}{n_- - 1}\sum_{k=1}^{n_-}\left(x_{k,i}^{(-)} - \bar{x}_i^{(-)}\right)^2}$$
where $\bar{x}_i$, $\bar{x}_i^{(+)}$ and $\bar{x}_i^{(-)}$ represent the average value of the i-th feature in all samples, positive samples and negative samples, respectively; $x_{k,i}^{(+)}$ is the i-th feature of the k-th positive sample, and $x_{k,i}^{(-)}$ is the i-th feature of the k-th negative sample. The larger the F-score, the more effective the feature.
The second step of feature selection is incremental feature selection (IFS) [24,27]. Starting from a training set containing only the top-ranked feature, the remaining features are added one by one in descending order of F-score, and the number of features that yields the highest classification accuracy is finally selected.
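The two-step procedure can be sketched as follows, using the standard F-score definition (between-class separation of the feature means divided by the sum of the within-class sample variances). The `evaluate` callback is hypothetical: it stands in for whatever classifier and cross-validation scheme is used to score a feature subset (SVM in this study) and is assumed to return an accuracy. Each class needs at least two samples for the variances to be defined.

```python
from statistics import mean


def f_score(pos, neg):
    """F-score of one feature; pos and neg are lists of its values per class."""
    m_all, m_pos, m_neg = mean(pos + neg), mean(pos), mean(neg)
    # Within-class sample variances (denominators n+ - 1 and n- - 1).
    var_pos = sum((x - m_pos) ** 2 for x in pos) / (len(pos) - 1)
    var_neg = sum((x - m_neg) ** 2 for x in neg) / (len(neg) - 1)
    denom = var_pos + var_neg
    between = (m_pos - m_all) ** 2 + (m_neg - m_all) ** 2
    return between / denom if denom else 0.0


def rank_features(x_pos, x_neg):
    """Return feature indices sorted by descending F-score."""
    n_features = len(x_pos[0])
    scores = [
        f_score([row[i] for row in x_pos], [row[i] for row in x_neg])
        for i in range(n_features)
    ]
    return sorted(range(n_features), key=lambda i: scores[i], reverse=True)


def incremental_feature_selection(ranked, evaluate):
    """Add features one at a time in ranked order and keep the best prefix.

    `evaluate` is a hypothetical callback mapping a list of feature indices
    to a classification accuracy.
    """
    best_acc, best_n = -1.0, 0
    for n in range(1, len(ranked) + 1):
        acc = evaluate(ranked[:n])
        if acc > best_acc:
            best_acc, best_n = acc, n
    return ranked[:best_n], best_acc
```

The same `incremental_feature_selection` loop works with any ranking, so a ranking by TF-IDF score can be substituted for the F-score ranking without other changes.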

Model training
After feature selection based on SVM, the most effective feature set for each species was obtained. To further improve classification accuracy, six traditional machine learning classification models were used in our study, namely SVM, Decision Tree, Naïve Bayes [28], XGBoost, KNN and MLP. For a fair and objective comparison, all models were trained on the same selected features. Before the comparison, the key parameters of each model were tuned for best performance, evaluated by 100 times fivefold cross-validation, as shown in Table 4.
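The repeated cross-validation protocol can be outlined with a small model-agnostic helper. This is a sketch under assumptions: the splits are random rather than stratified, and `train_and_score` is a hypothetical callback that trains any of the compared models on the training indices and returns its accuracy on the held-out indices.

```python
import random


def repeated_kfold_accuracy(n_samples, train_and_score,
                            n_splits=5, n_repeats=100, seed=0):
    """Average accuracy over n_repeats rounds of n_splits-fold cross-validation.

    `train_and_score(train_idx, test_idx)` is a hypothetical callback that
    fits a model on train_idx and returns its accuracy on test_idx.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # a fresh random partition for every repeat
        folds = [indices[f::n_splits] for f in range(n_splits)]
        for f in range(n_splits):
            test_idx = folds[f]
            train_idx = [i for g in range(n_splits) if g != f for i in folds[g]]
            scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / len(scores)
```

Passing the same helper and the same selected features to every model keeps the comparison on an equal footing.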

Performance evaluation
To better present and compare the experimental results, fivefold cross-validation [29] was used to compute them, which gives more reliable estimates. The evaluation metrics include Acc, Sn, Sp and MCC [30,31]. In addition, the AUC value was calculated from the ROC curve.
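In the notation explained just below, these four metrics are commonly written as follows. The formulation shown is the standard one for this notation and is assumed here rather than taken from the text.

```latex
\begin{aligned}
Sn  &= 1 - \frac{N^{+}_{-}}{N^{+}}, \\
Sp  &= 1 - \frac{N^{-}_{+}}{N^{-}}, \\
Acc &= 1 - \frac{N^{+}_{-} + N^{-}_{+}}{N^{+} + N^{-}}, \\
MCC &= \frac{1 - \left(\frac{N^{+}_{-}}{N^{+}} + \frac{N^{-}_{+}}{N^{-}}\right)}
            {\sqrt{\left(1 + \frac{N^{-}_{+} - N^{+}_{-}}{N^{+}}\right)
                   \left(1 + \frac{N^{+}_{-} - N^{-}_{+}}{N^{-}}\right)}}
\end{aligned}
```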
where N^+ represents the number of origin sequences, N^- represents the number of non-origin sequences, N^+_- represents the number of positive samples misjudged as negative, and N^-_+ represents the number of negative samples misjudged as positive.
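A small helper shows how Sn, Sp, Acc and MCC follow from these four counts. It assumes the standard formulation in this notation (Sn = 1 - N^+_-/N^+, Sp = 1 - N^-_+/N^-, and so on), which is algebraically equivalent to the usual confusion-matrix definitions; the function and argument names are hypothetical.

```python
import math


def evaluation_metrics(n_pos, n_neg, pos_as_neg, neg_as_pos):
    """Compute Sn, Sp, Acc and MCC from the four counts.

    n_pos / n_neg: numbers of origin and non-origin sequences.
    pos_as_neg: positives misjudged as negatives (false negatives).
    neg_as_pos: negatives misjudged as positives (false positives).
    """
    sn = 1 - pos_as_neg / n_pos
    sp = 1 - neg_as_pos / n_neg
    acc = 1 - (pos_as_neg + neg_as_pos) / (n_pos + n_neg)
    denom = math.sqrt(
        (1 + (neg_as_pos - pos_as_neg) / n_pos)
        * (1 + (pos_as_neg - neg_as_pos) / n_neg)
    )
    # MCC is undefined when the denominator vanishes; report 0.0 in that case.
    mcc = (1 - (pos_as_neg / n_pos + neg_as_pos / n_neg)) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}
```

For instance, with 100 origin and 100 non-origin sequences, 10 missed origins and 20 false alarms, this gives Sn = 0.90, Sp = 0.80 and Acc = 0.85.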