Computational identification of N4-methylcytosine sites in the mouse genome with machine-learning method

N4-methylcytosine (4mC) is a kind of DNA modification which could regulate multiple biological processes. Correctly identifying 4mC sites in genomic sequences can provide precise knowledge about their genetic roles. This study aimed to develop an ensemble model to predict 4mC sites in the mouse genome. In the proposed model, DNA sequences were encoded by k-mer, enhanced nucleic acid composition and composition of k-spaced nucleic acid pairs. Subsequently, these features were optimized by using minimum redundancy maximum relevance (mRMR) with incremental feature selection (IFS) and five-fold cross-validation. The obtained optimal features were inputted into random forest classifier for discriminating 4mC from non-4mC sites in mouse. On the independent dataset, our model could yield the overall accuracy of 85.41%, which was approximately 3.8% 6.3% higher than the two existing models, i4mC-Mouse and 4mCpred-EL respectively. The data and source code of the model can be freely download from https://github.com/linDing-groups/model_4mc.


Introduction
DNA modifications, such as demethylation and methylation, play important roles in the regulation of gene expression [1]. At the site of (5'-C-phosphate-G-3'), the methylation of cytosine is an important epigenetic trait, which is closely related to cell proliferation and chromosomal stability protection [2,3]. 5-methylcytosine (5mC), 4-methylcytosine (4mC), and 3-methylcytosine are the most common methylations of cytosine in eukaryotic and prokaryotic genomes [4,5]. 5mC is the common kind of methylation of cytosine and, relates to many cancerous and neural diseases [6,7]. 4mC is also an effective modification that guards its own genetic information from deterioration through restriction enzymes [8][9][10]. Accurate recognition of 4mC could provide key clues for understanding its regulation roles. Currently, several experimental methodologies, including mass spectrometry, reduced-representation bisulfite sequencing, and single-molecule real-time sequencing, have been developed to identify 4mC sites [11][12][13]. Although these methodologies are helpful in the identification of 4mC sites, they are highly expensive when implemented on extensively large sequencing data. Thus, a bioinformatics tool to identify 4mC sites is urgently needed. At present, some computational methods have been presented to identify 4mC sites. In 2017, an innovative prediction model based on the confirmed 4mC dataset was constructed to predict 4mC sites in several species [14]. Afterwards, an iterative feature representative algorithm was designed based on the benchmark dataset of Chen et al. [15], which helped to learn and train the features from numerous progressive models to predict 4mC sites. iEC4mC-SVM [16] was developed to predict the 4mC in the Escherichia coli by using light gradient boosting machine feature selection technology. DNA4mc-LIP [17], a linear integration tool, was developed by combining existing prediction methods to identify 4-methyl cytosine sites in multiple species. Then, Meta-4mCpred [18] was developed to predict 4mC sites in the genomes of six species. However, to date, only two predictors, i4mC-Mouse and 4mCpred-EL are available for recognizing 4mC sites in mice [19,20]. These two methods employed various features and machine learning algorithms on the sequence data of mice derived from the Meth-SMRT database [21]. Although both i4mC-Mouse and 4mCpred-EL can produce good outcomes, there is still room for further improvement by extracting more feature information. To address the aforementioned issues, an ensemble model was established to predict 4mC sites in mice. Figure 1 shows the workflow of the proposed model. First, three types of feature descriptors, k-mer, enhanced nucleic acid composition and composition of k-spaced nucleic acid pairs, were used as features to input into a random forest classifier [22] for identifying 4mC sites. After this, the mRMR [23] with IFS [24,25] technique was utilized to get optimal feature vectors. Finally, the best model was examined on an independent dataset. The outcomes on independent-samples indicated that the proposed model outpaced the two existed predictors, i4mC-Mouse and 4mCpred-EL.

Materials and methods
A reliable and accurate dataset is necessary to establish a prediction model. Therefore, we obtained the benchmark dataset from Hasan et al. work [20], and Manavalan et. al. [19]. In their study, they excluded similar sequences using 70% as cutoff of sequence identity [26]. After this elimination procedure, they finally obtained the benchmark dataset of 906 positive and 906 negative sequences with length of 41bp. Subsequently, the benchmark data were separated into 80% training data and 20% independent data to objectively estimate the efficiencies and performances of predictors, as shown in Table 1. Table 1. The distribution of sample numbers in benchmark dataset.

k-mer nucleotide compositions (k-mer NC)
k-mer NC can reflect short-range nucleotide interaction of sequences [44][45][46]. The (N-k+1) nucleotide residues can be obtained via a sliding window method by setting the window size of k bp where Ri signifies the nucleotide (A, T, C, and G) at the i-th position. The sequences can be transformed into the 4 k -D vector using k-mer nucleotide composition as follows where T denotes the transposition of the vector, and f1 k-tuple symbolizes the occurrence of the i-th k-mer nucleotide composition in the sequence. When k =1, a DNA sample can be deciphered into a 4-D vector When k = 2, the DNA sample can be described by a 16-dimension vector.

Enhanced nucleic acid composition (ENAC)
The ENAC calculates the nucleic acid composition based on the sequence window. It can be used to formulate the sequence with equal length. The enhanced nucleic acid composition can be calculated as In Equation (4), k characterizes the size of the sliding window, NA,win denotes the number of nucleotide A in the sliding window p, T ∈ [G, C, A, T], and (p = 1, 2, ..., L-k+1). In this study, the sliding window was set to 5. Then the feature dimension is 148.

Composition of k-spaced nucleic acid pairs (CKSNAP)
The CKSNAP embodies the incidence of nucleotide pairs disconnected by any k nucleotide (k = 0, 1, 2, 3, 4 5). The composition of k-spaced nucleic acid pairs feature comprises 16 nucleotide pairs [AA, AG, ... TG, TT]. By taking k = 1 as an instance, composition of k-spaced nucleic acid pairs can be specified as follows: where * signifies (A, G, C, and T), NY ⃰ Z signifies the number of nucleotides Y*Z pairs in the sequence, and NTotal embodies the total number of single-spaced nucleotide pairs in the sequence. If the nucleic acid pair AA appears j times in the nucleotide sequence, the composition of the nucleic acid pair AA can be equal to j divided by the total number of 0-spaced nucleic acid pairs NTotal in the nucleotide sequence. For k = 0, 1, 2, 3, 4 and 5, the value of NTotal is −1, −2, − 3, −4, −5 and − 6 for a nucleotide sequence of length , respectively. In this study, k = 2 and the dimension of the composition of k-spaced nucleic acid pairs feature was 48.

Feature selection with mRMR and IFS
The insertion of noisy features might result in the unsatisfactory performance of a model. Dao et al. proposed a two-step feature selection strategy to exclude noise [47]. Feng et al. used a mRMR technique to reduced noise [48]. Shao et al. performed three ranking algorithms to exclude irrelevant features [49]. Cheng et al. used MetaMap to reduced noisy features [50]. Other computational works did the similar works [51][52][53]. Therefore, the selection of features is an obligatory phase to remove the less important features and increase the productivity of a model [54]. Many feature selection and ranking techniques are available, such as f-score , mRMR [23], MRMD [55], chi-square [56]. In this study, mRMR with IFS [24,57] was applied to obtain the optimal feature subset. mRMR is a filterbased selection technique [58] to achieve an optimal model. Compactness functions are described as y and z, and P (y) and P (z) are the two corresponding probabilities. P (y, z) is the possibility of compactness, and the common information between the two functions can be demarcated as In shared information, searching a subset S with m optimum features helps to determine the feature transmission, which majorly depends on the target {y } class q.
Minimum redundancy can be defined as Final selection criteria can be articulated as: The principle of the mRMR technique is to use a typical redundancy and relevance to rank features to acquire the best subset. Mostly, if a model was built on a high-dimensional feature subset, it can produce overfitting and informational redundancy problems. Therefore, mRMR (minimum redundancy maximum relevance) with the IFS (Incremental Feature Selection) [24,59] technique and five-fold cross-validation method was applied to examine the optimal feature subset with the maximum accuracy. We ranked all features according to the ∅ -values and obtained new feature vectors, which is given in the below equation 10. * = [ℎ 1 , ℎ 2 , ℎ 3 . . . ℎ ] The first feature subset comprises the feature with the highest ∅ -value * = [ℎ 1 ] . By adding the second highest ∅-value to the first subset, the second feature subset * = [ℎ 1 , ℎ 2 ] is formed and by adding the third highest ∅ -value to the second feature subset, the third feature subset * = [ℎ 1 , ℎ 2 , ℎ 3 ] is formed [47]. The process was repeated until all the features were considered.

Machine learning classifier
Support vector machine is very famous and has been used in many bioinformatics tools [44][45][46].
It performs binary classification on data in supervised learning. We have used a free available package LibSVM version 3.21, which can be easily downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvm/ to train and test the model. We have used rbf kernel function due to its efficiency in non-linear classification. We have optimized cost and gamma parameters of RBF kernel function by using grid search with searching space [2 -5 ,2 5 ] for cost and [2 -12 ,2 1 ] for gamma. Naïve Bayes classifier has been widely used in bioinformatics due to its simplicity and better performance [60]. It is a classification technique and totally depends on Bayes theorem. Ada boost classifier is also very famous and has been widely used in bioinformatics [61]. It is an ensemble technique and combines various classifiers to enhance the accuracy. The main idea of this is to set the classifiers weights and trained the data in each iteration. We implemented these classifiers in Weka (version 3.8.4) [62]. Random forest is a combined knowledge technique extensively applied in bioinformatics [63,64]. The underlying principle is to combine several weak classifiers. The outcome is attained by the voting process therefore, the outcome of the model has higher exactness and simplification. The model was constructed using a random forest algorithm [22] and the complete procedure is clearly described in [65]. Scikit -learn package (v -0.22.1) [66,67] was used to execute the random forest classifiers. Firstly, we used randomized search CV and then grid search CV to tune hyperparameter. The best tuned parameters of the proposed model are given in Table 3.

Evaluation metrics
Matthews correlation coefficient (MCC), accuracy (Acc), sensitivity (Sn) and specificity (Sp) were used in this study to check the overall efficiency of the model defined as Equation 11. where TP represents the correctly identified 4mC sequences in benchmark data and FP signifies the 4mC sequences false-classified as non-4mC. Likewise, TN represents the correctly recognized non-4mC sequences in the data and FN signifies the non-4mC sequences, which were false-classified as 4mC. Consequently, the receiver operating characteristic (ROC) curve was used to illustrate the efficiency of the model graphically. The ROC curvature could assess the projecting ability of the proposed model on the whole assortment of resultant values. The area under the curve was premeditated to check the efficiency of the model. A good classifier gave AUC = 1, and the arbitrary performance gave AUC = 0.5.

Composition analysis of sequences
The sequence pattern around the modification site is an operative stage to predict and interpret the genetic meanings of variations [68,69]. In this study, Two Sample Logo [70] (http://www.twosamplelogo.org/cgi-bin/tsl/tsl.cgi) was used to examine the distribution of nucleotides around 4mC. Figure 2 shows that nucleotide distribution among positive and negative sequences are different in regions flanking the nucleotide C. Both T and C nucleotides were individually abundant at the upstream and downstream of the positive sequences, whereas A and G were correspondingly enriched at the upstream and downstream of the negative samples. Some nucleotides tend to act continuously along the sequences.

Performance evaluation
Based on sequence feature, we constructed a model to identify 4mC site. First, the training data were converted into feature vectors using feature descriptors (k-mer, composition of k-spaced nucleic acid pairs, enhanced nucleic acid composition, and feature fusion). Subsequently, the feature vectors of each encoding model were evaluated by random forest classifier using a five-fold CV test. mRMR with IFS method was used to pick out the best feature subset for the sake of better prediction accuracy. Figure 3 shows the IFS curve for searching optimal features. Table 2 recorded that performances of the three single-encoding models and the feature fusion model. The AUCs of single-encoding models (k-mer, CKSNAP, and ENAC) were 0.88, 0.80, and 0.79, respectively. The AUC of k-mer was around 1%-4% higher than those of the other encodings. Fusion feature-based model could produce the best results. In this optimal model, the Acc, MCC, Sn, Sp, and AUC were 79.91%, 0.598, 81.88%, 78.12% and 0.908, respectively. Figure 4 also shows the AUC of random forest based fusion model on training dataset and independent dataset by using five-fold cross validation. The best parameters were shown in Table 3.

Performance evaluation of different ML algorithms
k-mer, CKSNAP, ENAC and their fusion were inputted into three machine learning classifiers, namely Adaboost, SVM, and Naive Bayes algorithm, for comparing with random forest classifierbased models [71]. Cross-validation is a statistical analysis method and has been widely used in machine learning to train and test model. A five-fold CV test was used to elevate their corresponding machine learning constraints on individual encoding classifiers. In five-fold CV, the benchmark dataset was arbitrarily separated into five groups of about equal size. Each group was individualistically tested by the model which trained with the remaining four groups. Therefore, the five-fold CV method was performed five times, and the average of the results was the final result. Finally, an ideal model was achieved for each classifier. The results are shown in Table 4. We noticed that fused feature did produce high accuracy except Adaboost (69.57%). Then, comparison between feature fusion-based models with single-encoding based models indicates that the multiple information was effective to achieve better results. As shown in Figure 5, based on fused features, random forest model exhibits higher accuracy compare with other three machine learning models. Particularly, the AUC of the feature fusion model based on random forest classifier was 1%-10% higher than that of the other models, indicating that the random forest model was the best for 4mC identification.

Comparison with existing models on an independent dataset
Independent dataset test was used to examine and compare the anticipated model with already published models. Two existing models, i4mC-Mouse and 4mCpred-EL could provide 4mC identification in mouse. Therefore, the efficiency of the proposed model was assessed against that of the aforementioned two existed models on the same independent dataset (160 4mC, and 160 non-4mC), as shown in Table 5. The MCC, Sn, Sp, Acc, and AUC of the i4mC-Mouse were 0.633, 80.71%, 82.52%, 81.61%, and 0.920, respectively. The MCC, Sn, Sp, Acc, and AUC of the 4mCpred-EL were 0.584, 75.72%, 82.51%, 79.10% and 0.881, respectively. The Feature Fusion model could produce 0.711, 82.00%, 89.13%, 85.41%, and 0.944, respectively for MCC, Sn, Sp, Acc, and AUC. Obviously, our proposed model outpaced both existing models by 2.4% and 6.3% in AUC which is shown in Figure 6. The good performance of the proposed model was due to the use of different and accurate encoding schemes and the selection of suitable classifiers.  6. AUC of proposed model and two existing tools.

Conclusions
4mC is a DNA modification with a series of significant genetic progressions such as regulation of gene expression and cell differentiation. The identification of 4mC sites in the whole genome is vital for understanding their genetic roles. To date, numerous predictors have been established to classify 4mC sites in diverse species [14,17,18,[72][73][74], but only two methods 4mCpred-EL [19] and i4mC-Mouse [20] exist for mice. In this study, an advanced ensemble model was established to identify 4mC sites in the mouse genome. In the proposed model, DNA sequences were encoded using k-mer, CKSNAP and ENAC. Then, these encoding-features were optimized by using mRMR with IFS. On the basis of the top feature subset, the finest 4mC sorting model was achieved by the random forest classifier using a five-fold CV test. The estimated outcomes on independent data showed that the proposed model provided outstanding generalization capability. Further studies will aim to create a user-friendly web server for the projected model. Also, additional feature selection methods and algorithms will be implemented to further improve the efficiency to classify 4mC sites.