6mA-Pred: identifying DNA N6-methyladenine sites based on deep learning

With the accumulation of data on 6mA modification sites, an increasing number of scholars have begun to focus on the identification of 6mA sites. Despite the recognized importance of 6mA sites, methods for their identification remain scarce, and most existing methods target a single species. In the present study, we aimed to develop an identification method suitable for multiple species. Building on previous research, we propose 6mA-Pred, a method for 6mA site recognition. Our experiments show that 6mA-Pred effectively identifies 6mA sites in genes from taxa such as rice, Mus musculus, and human. A series of experimental results show that 6mA-Pred performs competitively. The source code used in the study can be obtained from http://39.100.246.211:5004/6mA_Pred/.


INTRODUCTION
DNA modification sites play vital roles in multiple biological processes and are attracting increasing research attention. Methylation continues to be a hot topic in epigenetics, and 5mC methylation has been extensively studied (Liu, Li & Zuo, 2019). With the advancement of sequencing technology, 6mA methylation has gradually attracted increasing attention. 6mA methylation not only affects gene expression but also regulates development in plants and animals (Xu et al., 2020a). Many diseases, including cancer, are related to 6mA methylation (Chen et al., 2019b; Xu et al., 2019a). With the progress of 6mA methylation-related research, large amounts of data have been collected. However, effective methods for 6mA site identification are lacking.
Methods for identifying modification sites have consistently been a hot spot in bioinformatics. Many methods have been studied and have achieved good results. Although research on 4mC (He, Jia & Zou, 2019) and 5mC is mature, research on the identification of 6mA modification sites has just begun. The computational method i6mA-Pred was used to identify 6mA modification sites in the rice genome with high accuracy. Several methods for identifying 6mA loci in the rice genome have been proposed, such as MM-6mAPred, iDNA6mA-Rice (Hao et al., 2019), SDM6A (Basith et al., 2019), i6mA-DNCP (Kong & Zhang, 2019) and SNNRice6mA (Yu & Dai, 2019). In addition, methods for the identification of 6mA sites in Mus musculus and humans have gradually emerged, such as iDNA6mA-PseKNC (Feng et al., 2019), csDMA (Liu et al., 2019c), SICD6mA, and 6mA-Finder (Xu et al., 2020b). Several datasets are publicly available, and many desirable features and models have been proposed. Application of the feature algorithms NCP and one-hot, feature fusion, and deep learning methods has greatly accelerated the identification of 6mA-modified sites. Among the employed algorithms, SVM and RF exhibit stable performance and perform well on some datasets (Liu, Gao & Zhang, 2019; Sun et al., 2020; Wang et al., 2020a, 2020b; Yan et al., 2020; Zhou et al., 2017, 2018). In addition, the Markov model has achieved excellent results in predicting 6mA sites in the rice genome. In the application of feature methods, most researchers use multiple feature fusion methods and analyze various features. In general, the different methods have achieved good results and provided direction for subsequent research. In the research mentioned above, most methods have employed machine learning (Patil & Chouhan, 2019; Zou, 2019; Zou & Ma, 2019) together with detailed analysis of different feature methods. There are also some good models that use deep learning methods, such as SNNRice6mA and SICD6mA.
SNNRice6mA employs a CNN (Ren et al., 2019) to build a network that works well. SICD6mA uses GRUs to achieve a good network structure and has been applied to datasets from two species. In this article, through a summary of previous research, we found that LSTM combined with an attention mechanism can identify modification sites very well, and a large number of experimental results suggest that this is a very good approach.

Datasets
Much research has aimed to identify 6mA sites in rice. In reviewing research from the past 2 years, we found that the amount of data on 6mA sites is increasing. We obtained datasets for three species. The first dataset is a rice dataset obtained from 6mA-RicePred (Huang et al., 2020b); it was first used in i6mA-Pred (Chen et al., 2019c) and was provided by the authors (Hu et al., 2019). The second dataset is a Mus musculus dataset obtained from iDNA6mA-PseKNC, which achieved good results on it. The third dataset is a human dataset obtained from SICD6mA and is the largest of the three. Table 1 provides a summary of each dataset. All sequences have the same length of 41 bp. Details of these datasets are provided in their source papers. We have organized the datasets, which can be obtained from https://github.com/huangqianfei0916/6ma-rice.
All three datasets used CD-HIT to remove redundancy: sequences with similarity above 80% were excluded by the CD-HIT program. All negative samples were 41 bp in length with adenine (A) at the center but were not detected as 6mA by SMRT sequencing. Moreover, for the rice dataset, negative samples were collected according to the ratio of GAGG, AGG, and AG motifs in the positive samples, and for the Mus musculus dataset, positive samples with a modQV greater than 30 were removed.
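CD-HIT itself is a compiled tool that clusters with fast short-word counting; as a simplified illustration of the redundancy-removal step, a naive greedy filter at the 80% identity threshold might look like the following (a sketch in our own naming, not the actual CD-HIT algorithm):

```python
def identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def greedy_filter(seqs, threshold=0.8):
    """Keep a sequence only if it is at most `threshold` identical to every
    sequence already kept (greedy pass, analogous to CD-HIT's clustering order)."""
    kept = []
    for s in seqs:
        if all(identity(s, k) <= threshold for k in kept):
            kept.append(s)
    return kept

reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC"]
# The second read is 90% identical to the first, so it is dropped.
print(greedy_filter(reads))
```

Such a pairwise filter is quadratic in the number of sequences, which is why the real datasets were processed with CD-HIT rather than a script like this.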

Feature encoding and classification algorithms
One-hot encoding has been used by many researchers for sequence processing with good results (Cheng, 2019; Cheng et al., 2018a; Li et al., 2020; Liu & Li, 2019). One-hot encoding encodes each nucleotide separately; a disadvantage is that it ignores the ordering relationships between neighboring bases. Therefore, we used Kmer word segmentation instead of one-hot to capture the relationship between bases (Zuo et al., 2017), with Kmer helping the embedding layer generate better word vectors. We investigated both normal word segmentation and Kmer word segmentation, and the experimental results showed that Kmer segmentation achieved superior performance. Figure 1 shows the process of Kmer word segmentation. Our tests for the selection of the k value revealed three to be the most suitable; the experimental results are shown in Fig. 2. When k is 3, the dictionary size is 64, which is not a large parameter. In the feature extraction stage, an embedding layer is used to extract features, and the quality of the features largely determines the performance of the model. Embedding is a very important module in deep learning, and word2vec is one of the best embedding methods. The feature encoding can be learned dynamically, and secondary learning of a pretrained encoding (fine-tuning) is also possible in deep learning. We chose the init (randomly initialized) method for our experiments: the effects of init and fine-tune were almost the same, and in some cases the init method was superior, although an excellent pretrained model, if available, is also a good choice. In this paper, we use simple init embedding and Kmer word segmentation.
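As an illustration, the overlapping 3-mer segmentation described above can be sketched in a few lines (a minimal example with our own variable names, not the released code):

```python
from itertools import product

def kmer_tokens(seq, k=3):
    """Split a sequence into overlapping k-mers with stride 1."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# With k = 3 the vocabulary holds all 4^3 = 64 possible trinucleotides,
# matching the dictionary size of 64 noted in the text.
vocab = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=3))}

seq = "ACGTA"
tokens = kmer_tokens(seq)          # ['ACG', 'CGT', 'GTA']
ids = [vocab[t] for t in tokens]   # integer ids fed to the embedding layer
print(tokens, ids, len(vocab))
```

A 41-bp window thus yields 39 overlapping 3-mer tokens, each mapped to one of 64 embedding rows.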
Most methods currently employed for 6mA site recognition are machine learning methods, and most of them are only effective for a single species (Cheng, 2019; Cheng et al., 2019). In reviewing the latest research, we found that the attention mechanism is well suited to the recognition of 6mA sites. Furthermore, LSTM has achieved excellent performance in dealing with sequence problems (Huang et al., 2020a). In constructing the model, we did not adopt a particularly complex structure, as model complexity and performance are not directly related. After feature extraction with the embedding layer, a bidirectional LSTM is used to process the sequence features (Xia et al., 2019). The sequence information obtained after LSTM processing yields a good feature vector that represents the overall sequence, and each time step of the LSTM produces an output summarizing the sequence information up to the current time. The LSTM algorithm can be formulated as follows:

i_t = σ(W_i x_t + U_i h_{t−1} + b_i)
f_t = σ(W_f x_t + U_f h_{t−1} + b_f)
o_t = σ(W_o x_t + U_o h_{t−1} + b_o)
c̃_t = tanh(W_c x_t + U_c h_{t−1} + b_c)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

where x_t is the input at time step t, h_t is the hidden state, σ is the sigmoid function, and ⊙ denotes element-wise multiplication. In general, the LSTM produces an output at each time step and a feature containing the sequence information (Liu, Li & Yan, 2020), and we can analyze these features to obtain the expected results. The typical approach is to average the step outputs or take the last one and then apply a fully connected layer to obtain the result. Many scholars have added other layers after the LSTM to obtain good features; the design of these layers varies with the specific application scenario and problem. 6mA-Pred applies the attention mechanism to the output of the LSTM and connects a fully connected layer after the attention layer. The inner product of the final LSTM output and the outputs at the previous time steps is used to generate the corresponding attention scores.
Then, a Softmax layer is applied to the attention scores to obtain the weights, and the LSTM outputs are weighted by them to obtain the final context vector. The last layer of the network is a fully connected layer, which yields the probability of each category. Figure 1 shows the structure of the entire network and describes the Kmer word segmentation and attention mechanism. The attention mechanism adopted by 6mA-Pred is not complicated and acts directly on the output of the LSTM. The purpose of 6mA-Pred is to obtain the final feature through the interplay between global information and local information. The feature corresponding to a sequence containing a modification site differs markedly from that of a sequence without one; because of these differences, their final context vectors differ. We used the inner product to obtain the attention score, reflecting the intersection of global and local information. The inner product is not the only option, and other operations are possible; self-attention as in the Transformer is also a good choice, but its network structure is more complicated. The dot product captures the overlap between different parts of the sequence, and 6mA-Pred uses this structure to increase the amount of local information in the final feature. The attention mechanism is formulated as follows:

e_t = h_t · h_T
α_t = exp(e_t) / Σ_j exp(e_j)
c = Σ_t α_t h_t

where h_t is the LSTM output at time step t, h_T is the final output, and c is the context vector.
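As a sketch, dot-product attention over the LSTM outputs can be written in a few lines of NumPy. The hidden states here are random placeholders standing in for the bidirectional LSTM outputs of the real model:

```python
import numpy as np

def attention_context(H):
    """Dot-product attention over LSTM outputs.

    H: (T, d) array of hidden states h_1..h_T. Each step is scored
    against the final state h_T, the scores are normalized with a
    softmax, and the weighted sum of states is the context vector."""
    scores = H @ H[-1]                       # e_t = h_t . h_T
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ H                       # c = sum_t alpha_t h_t

rng = np.random.default_rng(0)
H = rng.normal(size=(41, 8))  # 41 time steps, hidden size 8 (illustrative)
c = attention_context(H)
print(c.shape)  # (8,)
```

In the full model this context vector is passed to a fully connected layer followed by Softmax to produce the class probabilities.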

PERFORMANCE EVALUATION
TP, TN, FP and FN represent true positives, true negatives, false positives, and false negatives, respectively. Sn, Sp, Acc, and MCC can be calculated from these indicators:

Sn = TP / (TP + FN)
Sp = TN / (TN + FP)
Acc = (TP + TN) / (TP + TN + FP + FN)
MCC = (TP × TN − FP × FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

In addition, AUC (area under the ROC curve) was used to evaluate our model (Cheng & Hu, 2018; Cheng et al., 2018b; Ding, Tang & Guo, 2019a). For further experiments, Table 2 records the hyperparameters of the model.
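These standard metrics can be computed directly from the confusion-matrix counts, for example:

```python
from math import sqrt

def metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy, and MCC from confusion counts."""
    sn = tp / (tp + fn)                    # sensitivity (recall)
    sp = tn / (tn + fp)                    # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)  # accuracy
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, acc, mcc

print(metrics(90, 80, 20, 10))
```

MCC is the most informative of the four here because, unlike accuracy, it remains near zero for a trivial classifier on imbalanced data.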

PERFORMANCE COMPARISON WITH DIFFERENT DATASETS
Methods for identifying sites in the rice genome include iDNA6mA-Rice and SNNRice6mA, which are excellent models. After comparing different features in feature extraction, the developers of iDNA6mA-Rice chose binary encoding and used RF (random forest) as the classifier. Both the choice of feature method and the performance of the classifier are excellent. iDNA6mA-Rice was applied to various scale segmentation experiments on a rice dataset and achieved very good results. 6mA-Pred was applied in a similar experiment with the rice dataset; the results are shown in Fig. 3. The performance of 6mA-Pred was better than that of iDNA6mA-Rice at all ratios, although iDNA6mA-Rice is also a very good model and the performance difference between the two was very small. SNNRice6mA also performs very well for rice genes. Unlike iDNA6mA-Rice, SNNRice6mA uses a deep learning model: it uses one-hot encoding in the feature stage and, regarding the overall network structure, a stacked CNN (convolutional neural network), and it has achieved good results. The network structure of SNNRice6mA was adjusted to derive SNNRice6mA-large, which also achieved good results. SNNRice6mA and SNNRice6mA-large were evaluated by five-fold cross-validation on the rice dataset. Table 3 shows the results of comparisons among the different models; the performance of 6mA-Pred was excellent compared to that of the other models.
The model also performed well on the Mus musculus dataset. iDNA6mA-PseKNC has achieved good results in predicting 6mA loci in the Mus musculus genome and uses machine learning methods for analysis. iDNA6mA-PseKNC uses NCP as the feature algorithm, and many experiments have been conducted for this feature. In addition, iDNA6mA-PseKNC employs the SVM classifier and achieved very good results. 6mA-Pred is also effective in identifying 6mA sites in the Mus musculus genome. In this study, two experiments were conducted with 6mA-Pred, one involving five-fold cross-validation on the dataset, and one involving independent testing by splitting the dataset. Table 4 shows the results of these two experiments and the results for iDNA6mA-PseKNC. iDNA6mA-PseKNC was evaluated via the jackknife test; for deep learning methods, leave-one-out cross-validation is time consuming and not representative. For evaluation of 6mA-Pred, five-fold cross-validation (Fang et al., 2019;He et al., 2018a;Xiong et al., 2018;Xu et al., 2019b;Zhu et al., 2019) and segmentation of the dataset were employed. As shown in Table 4, the performance of 6mA-Pred remained good.
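The five-fold split used in these evaluations can be sketched as follows (a generic fold-indexing helper in our own naming, not the authors' exact pipeline):

```python
def kfold_indices(n, k=5):
    """Split indices 0..n-1 into k roughly equal, disjoint test folds,
    returning (train_indices, test_indices) pairs for each fold."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds

for train, test in kfold_indices(10, k=5):
    print(len(train), len(test))  # 8 2 for each of the 5 folds
```

In practice the data would be shuffled (with positives and negatives balanced across folds) before splitting; the model is trained on each train set and scored on the held-out fold, and the five scores are averaged.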
Among the models used for identifying the 6mA sites of human genes, SICD6mA is currently the best. SICD6mA is a deep learning model that uses the GRU as its basic unit, and it performs well not only for human genes but also for rice genes. The developers of SICD6mA contributed data and performed extensive data processing, and we used the training set and test set they provided for our experiments. SICD6mA does not use one-hot encoding; rather, it uses 3-mers. Two basic units, BGRU and UGRU, are used in the network structure, followed by a two-layer fully connected layer and a Softmax layer. The experimental results revealed that the performance of SICD6mA was very good. Table 5 shows the experimental results for 6mA-Pred, which were very similar to the SICD6mA results. These findings show that 6mA-Pred is very effective in identifying 6mA sites in human genes. Following the conclusions above, we also conducted experiments with traditional machine learning methods: NCP and Kmer were used as feature extraction methods, and SVM, RF, and XGB, which performed well in previous studies, were used as classifiers. The experimental results are shown in Fig. 4.

CONCLUSION
Through the analysis of current studies and a large number of experimental comparisons, we found that 6mA-Pred is an effective method for identifying 6mA sites. LSTM performs well in processing sequence features and can obtain good features, and the attention mechanism we used is effective for identifying 6mA sites. The combination of the LSTM and attention mechanisms can produce a theoretically excellent model, and the experiments confirm this conclusion. Related methods will be considered for RNA and protein modification prediction (Dou et al., 2020; He, Wei & Zou, 2018; Huang & Li, 2018) in the future.
The previous studies on this topic are excellent and provide theoretical and experimental support for our research. The attention mechanism in 6mA-Pred can be improved; for example, self-attention or a combination of two attention mechanisms could be used to obtain a better context vector. A combination of a CNN and an attention mechanism may also yield an excellent method (Su et al., 2014). These possibilities warrant investigation.

ADDITIONAL INFORMATION AND DECLARATIONS Funding
This work was supported by the Natural Science Foundation of China (No. 61902259) and the Natural Science Foundation of Guangdong province (grant no. 2018A0303130084). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Grant Disclosures
The following grant information was disclosed by the authors: Natural Science Foundation of China: 61902259. Natural Science Foundation of Guangdong province: 2018A0303130084.

Competing Interests
The authors declare that they have no competing interests.

Author Contributions
Qianfei Huang conceived and designed the experiments, performed the experiments, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft. Wenyang Zhou conceived and designed the experiments, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft. Fei Guo analyzed the data, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft. Lei Xu analyzed the data, authored or reviewed drafts of the paper, and approved the final draft.