M6ATMR: identifying N6-methyladenosine sites through RNA sequence similarity matrix reconstruction guided by Transformer

Numerous studies have treated the classification of N6-methyladenosine (m6A) modification sites in RNA sequences as a multi-feature extraction task. In these studies, physicochemical properties of nucleotides have been incorporated to enhance recognition efficacy. However, introducing excessive supplementary information may add noise to the RNA sequence features, and sequence similarity information remains underexplored. In this research, we present a novel method for RNA m6A modification site recognition called M6ATMR. Our approach relies solely on sequence information, leveraging the Transformer to guide the reconstruction of the sequence similarity matrix and thereby enhance feature representation. Initially, M6ATMR encodes RNA sequences using 3-mers to generate the sequence similarity matrix. Meanwhile, the Transformer is applied to extract a sequence structure graph for each RNA sequence. Subsequently, to capture low-dimensional representations of the similarity matrices and structure graphs, we introduce a graph self-correlation convolution block. These representations are then fused and reconstructed through a local-global fusion block. Notably, we adopt iteratively updated sequence structure graphs to continuously optimize the similarity matrix, thereby constraining the end-to-end feature extraction process. Finally, we employ the random forest (RF) algorithm to identify m6A modification sites based on the reconstructed features. Experimental results demonstrate that M6ATMR achieves promising performance using RNA sequences alone. Our proposed method can be considered an effective complement to existing RNA m6A modification site recognition approaches.


INTRODUCTION
To date, approximately 160 chemical modifications have been discerned in RNA, substantially enriching the diversity of RNA function and genetic information (Rehman et al., 2021). Among these modifications, N6-methyladenosine (m6A) stands out as the most prevalent modification type in eukaryotes and, along with N1-methyladenosine, as a dynamic, reversible RNA modification (Wang & Yan, 2018). m6A plays pivotal roles in numerous biological processes. Existing studies have made significant progress in identifying RNA m6A modification sites. However, they also have limitations. For instance, incorporating additional information, such as the physicochemical properties of nucleotides, alongside RNA sequences may introduce information interference. Moreover, these methods have mainly focused on learning features from sequential nucleotide distributions, potentially overlooking nucleotide associations expressed through self-correlations in RNA sequences. To address these issues, we propose a novel approach, named M6ATMR, for RNA m6A modification site recognition. M6ATMR utilizes the Transformer (Vaswani et al., 2017) to guide the reconstruction of the nucleotide similarity matrix, thereby enhancing feature representations of RNA sequences in a sequence-dependent manner. Specifically, RNA sequences are first encoded using the 3-mer method, generating the initial similarity matrix for each sequence. Then, the Transformer is applied to obtain the sequence structure graphs of the RNA sequences. To optimize the sequence structure graphs, we calculate the Manhattan distance and perform threshold screening on the vector representations output by the Transformer. Next, we design a graph self-correlation convolution block to obtain low-dimensional representations of both the similarity matrix and the structure graph. In addition, we dynamically combine these low-dimensional representations with the initial 3-mer representations of the RNA sequences, considering both local and global perspectives, to create the final recombined features. To explore potential nucleotide associations in RNA sequences, we use iteratively updated sequence structure graphs to continuously optimize the similarity matrices, further enhancing the end-to-end feature extraction process. Finally, we employ the random forest (RF) algorithm to classify RNA sequences based on the learned features. By following this approach, M6ATMR aims to overcome the limitations of previous methods and improve the accuracy of RNA m6A modification site identification.
The main contributions of this article are as follows. First, we propose a sequence-based approach for identifying RNA m6A modification sites without introducing potentially misleading additional information. Second, the similarity matrices of RNA sequences are computed to provide more learnable information about the sequences; on this basis, the Transformer is used to reconstruct the similarity matrices and further optimize the sequence representations. Third, we propose a graph self-correlation convolution that learns a low-dimensional representation of each sequence without introducing prior information about the nodes. A series of experiments demonstrates the effectiveness of the representation strategies of M6ATMR. It is worth noting that relying solely on RNA sequences for m6A modification site identification reduces the need for additional prior information while still ensuring high identification accuracy. In addition, the sequence representations learned by our model perform consistently well across different classifiers, indicating that our method does not depend on the choice of a specific classifier. In conclusion, our experimental results demonstrate that M6ATMR achieves excellent performance in identifying m6A modification sites using only RNA sequences, highlighting its effectiveness as a complementary method for RNA m6A modification site recognition.

Problem description and datasets
One focus of RNA modification research is site recognition. Computational methods usually convert this task into a binary classification problem, taking the RNA sequence information as the initial input to a classification model and producing the probability that a site is modified. The method in this article follows this paradigm. For a given sequence $X = \{x_1, x_2, \ldots, x_n\}$, we summarize the model objective as $Y = F(X)$, where $F(\cdot)$ is the optimal mapping relationship and $Y$ is the prediction label: $Y = 1$ if the prediction result is a positive sample, otherwise $Y = 0$. It is worth noting that, as in most studies, the presence of an m6A modification site is used as the classification criterion for positive and negative samples. That is, for a given sequence X, if its central position is a modified site, X is regarded as a positive sample; negative samples do not contain modification sites. To this end, we select the RNA m6A modification site dataset of Arabidopsis thaliana (the A101 dataset) (Wan et al., 2015) for study. The dataset contains 2,100 samples, with a 1:1 ratio of positive to negative samples.
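As a minimal illustration of the labeling convention above (the helper names below are hypothetical, not part of the M6ATMR code), a candidate window can be m6A-positive only if the nucleotide at its central position is an adenosine; the actual positive/negative label comes from experimental annotation:

```python
def central_nucleotide(seq: str) -> str:
    """Return the nucleotide at the central position of an odd-length window."""
    assert len(seq) % 2 == 1, "candidate windows are assumed to have odd length"
    return seq[len(seq) // 2]

def is_candidate(seq: str) -> bool:
    # Only windows centered on an adenosine ('A') can carry an m6A site;
    # windows centered on other nucleotides are trivially negative.
    return central_nucleotide(seq) == "A"
```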

Model description
M6ATMR implements the identification of m6A modification sites through several procedures. First, the RNA sequences are processed into a readable coding representation using the k-mer algorithm, and the similarity between RNA sequences is measured by the Manhattan distance to construct the similarity matrix. After that, M6ATMR learns the structural information of the RNA sequences through the Transformer encoder and converts this structural information into a structure graph, again via the Manhattan distance. It then inputs the similarity matrix and structure graph into the self-correlation graph neural network to iteratively update the similarity matrix and obtain a low-dimensional representation of each RNA sequence. These low-dimensional representations are passed through a local-global fusion block to generate the final fused representation, which is fed into the RF classifier for identification. For convenience of description, section 2 describes M6ATMR in detail in four parts: similarity matrix calculation, structure graph learning, similarity matrix optimization, and local-global representation fusion. The framework of M6ATMR is shown in Fig. 1, and the details of each part are as follows.

Similarity matrix calculation
For similarity matrix calculation, the key step is to transform RNA sequences into readable representations using k-mer frequency statistics. K-mer frequency information can reveal the distribution law of seed sequences and is an important tool for studying sequence similarity. In sequence X, a substring of length K starting from any position in X is called a k-mer. In this article, considering that each group of three adjacent nucleotides in an mRNA molecule encodes a particular amino acid during protein synthesis, we set K to 3 to give the model a biological interpretation. The algorithm requires that the starting positions of the sequence set $\{X_1, X_2, \ldots, X_m\}$ be aligned. For a fixed K, at each offset position $l$ $(0 \le l \le n - K)$, we count the occurrence frequency of the different substrings of length K starting from position $l$:

$R_X = \mathrm{Nom}\left(\bigoplus_i \mathrm{fre}(M_i)\right)$

where $R_X \in \mathbb{R}^{n \times 1}$ denotes the k-mer representation of sequence X, $\mathrm{fre}(\cdot)$ represents the frequency of the k-mer substring $M_i$, $\oplus$ represents the operation of concatenating all substring frequencies, and $\mathrm{Nom}(\cdot)$ represents a normalization operation. After representing all RNA sequences by their k-mer frequencies, we calculate the similarity between each pair of sequences and construct an initial similarity matrix. We construct matrices between k-mer representations for the following reasons. Firstly, earlier studies on the prediction of biomedical entity associations (Shao et al., 2020) have shown that similarity matrices can reflect the potential associations between different entities, which supports establishing associations between the nucleotides of different sequences. Secondly, constructing a graph data structure based on the similarity matrix allows us to capture the relationships between multiple sequences, which creates the conditions for further sequence feature extraction. Therefore, it is reasonable to transform the sequence representation problem into an optimization task over the similarity matrix. In this article, we employ the Manhattan distance to measure the degree of similarity between nucleotides on different sequences. In the similarity matrix $S \in \mathbb{R}^{m \times m}$, each element is the Manhattan distance between the k-mer representations at the corresponding positions. Taking $d_{m1}$ as an example, its value is the Manhattan distance between the sequence representations $R_1$ and $R_m$:

$d_{m1} = \sum_j \left| R_1^j - R_m^j \right|$

where $R_1^j$ and $R_m^j$ represent the jth k-mer value in $R_1$ and $R_m$ respectively, and $|\cdot|$ denotes the absolute value operation.
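The 3-mer frequency encoding and the Manhattan-distance similarity matrix described above can be sketched as follows. This is a simplified illustration under stated assumptions (frequencies normalized by the number of windows; the exact normalization in M6ATMR may differ):

```python
from itertools import product

BASES = "ACGU"
KMERS = ["".join(p) for p in product(BASES, repeat=3)]  # all 64 possible 3-mers

def kmer_freq(seq, k=3):
    """Normalized 3-mer frequency vector of an RNA sequence."""
    counts = {m: 0 for m in KMERS}
    for i in range(len(seq) - k + 1):
        sub = seq[i:i + k]
        if sub in counts:
            counts[sub] += 1
    total = max(len(seq) - k + 1, 1)  # number of sliding windows
    return [counts[m] / total for m in KMERS]

def manhattan(r1, r2):
    """Manhattan distance d = sum_j |r1[j] - r2[j]|."""
    return sum(abs(a - b) for a, b in zip(r1, r2))

def similarity_matrix(seqs):
    """Pairwise Manhattan-distance matrix over k-mer representations."""
    reps = [kmer_freq(s) for s in seqs]
    m = len(seqs)
    return [[manhattan(reps[i], reps[j]) for j in range(m)] for i in range(m)]
```

Identical sequences get distance 0, and the matrix is symmetric by construction.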

Structure graph learning
After obtaining the initial similarity matrices of the RNA sequences, we use a Transformer encoder to capture structural information and learn the structure graph of the RNA sequences. To this end, we apply the encoder part of the Transformer to further process the k-mer sequences. A Transformer encoder has two basic components: the position encoding block and the self-attention mechanism. In addition, to further explore the structural relationships between RNA sequences, we calculate the Manhattan distance between the vector representations output by the Transformer encoding block and strictly constrain the values of the structure matrix to the set $\{0, 1\}$. The details are as follows. The position encoding of the Transformer is a functional encoder; that is, a position vector is calculated for each element in the sequence:

$p_t^{(2i)} = \sin(w_i t), \quad p_t^{(2i+1)} = \cos(w_i t)$

where $w_i$ denotes the frequency, which is calculated as follows:

$w_i = \frac{1}{10000^{2i/d}}$

where $d$ is the output dimension of the neural network. It is worth noting that the length of each position vector equals the length of the k-mer representation. Thus, the RNA sequences are represented by the k-mer encoding plus the position encoding:

$R_t = R_X + \vec{p_t}$

where $\vec{p_t}$ denotes the position vectors of all elements in the k-mer representation. After calculating the position vectors, the encoder further optimizes the representation $R_t$ through the self-attention mechanism:

$\mathrm{Attention}(Q_i, K_i, V_i) = g\!\left(\frac{Q_i K_i^{T}}{\sqrt{d_k}}\right) V_i$

where $Q_i$, $K_i$ and $V_i$ are the query matrix, key matrix and value matrix respectively, and $g$ denotes the softmax activation function. Through the self-attention mechanism, the encoder can reveal potential associations within the sequence and further excavate the structural associations of nucleotides.
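The sinusoidal position encoding above can be sketched in a few lines (interleaved sine/cosine channels follow the standard Transformer scheme; any deviation in M6ATMR's implementation is not specified in the text):

```python
import math

def position_encoding(seq_len, d):
    """Sinusoidal position vectors: even channels use sin(t * w_k),
    odd channels use cos(t * w_k), with w_k = 1 / 10000**(2k/d)."""
    pe = []
    for t in range(seq_len):
        row = []
        for i in range(d):
            w = 1.0 / (10000 ** (2 * (i // 2) / d))
            row.append(math.sin(t * w) if i % 2 == 0 else math.cos(t * w))
        pe.append(row)
    return pe
```

Each row is then added element-wise to the corresponding k-mer representation before self-attention.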
In this regard, we believe that describing the structural associations of RNA sequences through the internal relationships captured by the self-attention mechanism is valid. To this end, we measure these structural associations by the same method as in section 2.3 and construct the structure graph $G_{st}$ for the sequences. Since the model uses the same data and distance formulas during this calculation, the structure graph has a potential correlation with the similarity matrix, which makes it reasonable to further optimize the similarity matrix through $G_{st}$. In addition, we ensure that the elements of $G_{st}$ are strictly constrained to the set $\{0, 1\}$ by threshold filtering, as shown below:

$G_{st}(i, j) = \begin{cases} 0, & G(i, j) < 0.5 \\ 1, & \text{otherwise} \end{cases}$

That is, for each element of $G_{st}$, the value is updated to 0 if $G(i, j)$ is less than 0.5, and to 1 otherwise.
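The threshold filtering step is straightforward to express in code (a direct sketch of the rule above, with the 0.5 cutoff exposed as a parameter):

```python
def binarize_structure_graph(G, threshold=0.5):
    """Constrain every element of the structure graph to {0, 1}:
    values below the threshold become 0, all others become 1."""
    return [[0 if v < threshold else 1 for v in row] for row in G]
```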

Similarity matrix optimization
To optimize the sequence similarity matrix and update the sequence structure graph, we design a self-correlation graph neural network that does not depend on prior node representations. In traditional graph neural networks, the embedding learning process relies on existing representations of the nodes or edges in the graph. These prior representations serve as the starting point for the learning algorithm, which updates and refines the embeddings based on the graph structure. For instance, in studies of drug-drug association prediction, the SMILES strings of drugs are often applied as prior information serving as the initial input to the graph neural network. However, for the similarity matrix and structure graphs of sequences, initial node information is difficult to obtain. In addition, both the similarity matrix and the structure graph describe the self-correlation of the sequences, so introducing additional prior information may be misleading. Therefore, inspired by the self-attention mechanism, we generate learnable initial node representations from the input matrix information. These initial representations are continually updated and optimized during graph convolution and participate in the optimization of the final embeddings. Taking the processing of the similarity matrix as an example, we describe the learning process in detail below.
For a given similarity matrix $S \in \mathbb{R}^{m \times m}$, we define the initial node representation matrix as $Er \in \mathbb{R}^{m \times m}$. Each value in this matrix is determined and iteratively optimized by the network. S and Er are input into the three-layer self-correlation graph neural network to obtain the embeddings related to the similarity matrix S:

$e_{sm} = \alpha \cdot \mathrm{Gcov}^{(I)}(S, Er)$

where $\alpha$ is a scaling hyperparameter that prevents the elements of the similarity matrix representation from becoming infinitesimal during graph convolution, $I = 3$ denotes the number of convolution layers, and $\mathrm{Gcov}(\cdot)$ represents the convolution process. The hidden representation of layer $(i+1)$ and the representation of layer $(i)$ satisfy the following equation:

$H^{(i+1)} = \mathrm{diag}(De_s)^{-1} \, S \, H^{(i)} W$

where $De_s$ denotes the degree matrix of the similarity matrix S, $\mathrm{diag}(\cdot)$ denotes the diagonalization operation, and W is a learnable weight. The hidden representation is initialized to Er, that is, $H^{(0)} = Er$. Similarly, the sequence structure graphs are fed into the self-correlation neural network in the same way. The network employs learnable initial representations to mine the self-correlation of the sequences, which also ensures consistency between the representations learned from the similarity matrices and those learned from the structure graphs.
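A single convolution step of this kind can be sketched in plain Python. The degree-based row normalization and the ReLU activation are assumptions filling in details the text leaves open; the text states only that $\alpha$ rescales the representations and that $H^{(0)} = Er$ is learnable:

```python
def matmul(A, B):
    """Plain-Python matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def relu(M):
    return [[max(0.0, v) for v in row] for row in M]

def gcn_layer(S, H, W, alpha=1.0):
    """One self-correlation convolution step, sketched as
    H^(i+1) = ReLU(alpha * De_s^-1 * S * H^(i) * W),
    where De_s is the degree matrix of S (row sums)."""
    deg_inv = [1.0 / max(sum(row), 1e-12) for row in S]
    norm_S = [[deg_inv[i] * v for v in row] for i, row in enumerate(S)]
    out = matmul(matmul(norm_S, H), W)
    return relu([[alpha * v for v in row] for row in out])
```

Stacking three such layers, starting from the learnable `Er` as the initial hidden state, corresponds to the $I = 3$ configuration used in the model.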
In addition, another task of the network is to obtain better RNA representations by optimizing the similarity matrix. Hence, we apply the reconstructed similarity matrix and sequence structure graph to calculate the loss of the network:

$Loss = \mathrm{BCELoss}(\hat{S}, S) + \mathrm{BCELoss}(\hat{G}_{st}, G_{st})$

where $\mathrm{BCELoss}(\cdot)$ denotes the binary cross-entropy loss. In the optimization process, we employ the difference between the reconstructed element values and the original element values to measure the quality of the representations.
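The element-wise binary cross-entropy term can be sketched as follows (a minimal illustration assuming matrix entries lie in [0, 1]; the clamping constant `eps` is a numerical-stability detail not taken from the paper):

```python
import math

def bce_loss(recon, orig, eps=1e-12):
    """Mean binary cross-entropy between a reconstructed matrix and the
    original: -(y*log(p) + (1-y)*log(1-p)), averaged over all elements."""
    n, total = 0, 0.0
    for r_row, o_row in zip(recon, orig):
        for p, y in zip(r_row, o_row):
            p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
            total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
            n += 1
    return total / n
```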

Local-global representation fusion
To obtain comprehensive knowledge of RNA sequences, we propose a local-global strategy for fusing the learned embedding representations. We process the two kinds of representations of the same sequence from the local and global perspectives respectively, and further determine the weight relationship between the embeddings from the similarity matrices and the embeddings from the sequence structure graphs. Specifically, we design a local information extraction block and a global information extraction block to calculate the weights of the two types of embeddings.
For a given similarity matrix embedding $e_{sm}$ and structure graph embedding $e_{st}$, we first treat them as residue sequences and sum them to obtain an overall representation $e_a$:

$e_a = e_{sm} + e_{st}$

We then pass $e_a$ into the local and global information extraction blocks, whose extraction processes are as follows:

$e_a^{lo} = f'(\vartheta(f(e_a))), \quad e_a^{gl} = f'(\vartheta(f(\delta(e_a))))$

where $e_a^{lo}$ and $e_a^{gl}$ denote the outputs of the local and global extraction blocks respectively, $f$ and $f'$ each represent a one-dimensional convolution layer with normalization, and $\vartheta$ represents the ReLU function. For the global information extraction block, we add a global average pooling layer $\delta$ on top of the local information extraction block. After that, we calculate the weight difference $w_f$ between the two representations:

$w_f = g(e_a^{lo} - e_a^{gl})$

We utilize this weight difference to integrate the two types of embeddings into the similar-structural representation $e_{ss}$:

$e_{ss} = w_f \cdot e_{sm} + (1 - w_f) \cdot e_{st}$

In addition, considering the indispensable role of the k-mer representations of the sequences in recognizing m6A modification sites, we further integrate these representations with the similar-structural representations to obtain the final embedding $e_{fi}$:

$e_{fi} = w_{f'} \cdot e_{km} + (1 - w_{f'}) \cdot e_{ss}$

where $e_{km}$ denotes the k-mer representation of the sequence, and $w_{f'}$ is the weight difference between $e_{km}$ and $e_{ss}$.
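The weighted-fusion step can be sketched as below. Applying the scalar weight difference as a convex combination is an assumption about how $w_f$ enters the fusion; the convolutional weight-extraction blocks themselves are omitted:

```python
def global_avg_pool(v):
    """Global average pooling over a 1-D embedding."""
    return sum(v) / len(v)

def weighted_fuse(e_sm, e_st, w_f):
    """Sketch of e_ss = w_f * e_sm + (1 - w_f) * e_st, combining the
    similarity-matrix embedding and the structure-graph embedding."""
    return [w_f * a + (1 - w_f) * b for a, b in zip(e_sm, e_st)]
```

The same pattern is applied a second time to blend the k-mer representation $e_{km}$ with $e_{ss}$ into the final embedding.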

Experiments setting
For a binary classification problem, the prediction outcomes can be divided into four categories: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). On this basis, we select several indicators to evaluate the prediction performance, calculated as follows:

$Acc = \frac{TP + TN}{TP + TN + FP + FN}, \quad Prec = \frac{TP}{TP + FP}, \quad Sen = \frac{TP}{TP + FN}$

$F1 = \frac{2 \times Prec \times Sen}{Prec + Sen}, \quad MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$

Moreover, we report the area under the precision-recall curve (AUPR) and the area under the receiver-operating characteristic curve (AUC) for evaluating our model.
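These standard indicators can be computed directly from the four confusion-matrix counts, as in this small sketch:

```python
import math

def metrics(tp, fp, tn, fn):
    """Accuracy, precision, sensitivity, F1 and MCC from confusion counts."""
    acc = (tp + tn) / (tp + fp + tn + fn)
    prec = tp / (tp + fp) if tp + fp else 0.0
    sen = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * sen / (prec + sen) if prec + sen else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Prec": prec, "Sen": sen, "F1": f1, "MCC": mcc}
```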
In our study, we set the learning rate to 0.0001 and set the number of layers as three in the self-correlation graph neural network.Considering the time and complexity of training, we reduced the number of heads from eight to six in the Transformer encoder.We set the embedding size of the encoder to 32, the hidden dimension of the feed-forward layer to 128, and the number of encoder blocks to six.

Performance on A101 datasets
We conduct a rigorous 10-fold cross validation to evaluate the performance of our proposed model on the A101 dataset. The dataset is systematically partitioned into ten subsets of equal size, ensuring non-overlapping test sets across folds. For each fold, we use nine subsets for training and one for testing. During evaluation, we consider six essential indicators: AUPR, AUC, Acc, F1 score, Prec, and Sen. The results of the 10-fold cross validation are presented through PR curves and ROC curves in Fig. 2 and summarized in Table 1. Across the ten folds, the ROC curves demonstrate remarkable stability, with the maximum AUC reaching 93.87% and the minimum 89.79%. Similarly, the PR curves display consistent performance, with the highest AUPR at 93.17% and the lowest at 87.66%. Table 1 reveals strong performance across the evaluation metrics. The Acc reaches 84.43%, signifying a high correct identification rate for both TN and TP samples. Additionally, the MCC attains 83.72%, reflecting the overall strength of our model. Furthermore, the values of the other metrics, including F1 score, Prec, and Sen, surpass 80%, indicating the reliability of our model. These experimental findings substantiate the robustness and efficacy of our proposed model in identifying m6A modification sites.
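The fold-partitioning scheme described above can be sketched as follows (a generic 10-fold splitter; the shuffling seed is an illustrative assumption, not a detail from the paper):

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross validation,
    with non-overlapping test folds that together cover every sample once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k near-equal-size folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```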

Performance comparison of Classifiers
In some studies, the downstream classifier has been shown to significantly influence the classification performance of RNA sequence representations generated by models, potentially leading to model instability. To demonstrate the stability and efficacy of our proposed model, we conduct comparison experiments using five additional classifiers: logistic regression, eXtreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), CatBoost, and support vector machines (SVM). Logistic regression employs maximum likelihood estimation to fit model parameters, yielding binary results by minimizing the cross-entropy loss during training. XGBoost, an ensemble classification algorithm, iteratively trains multiple simple base learners on the input data, continually reducing the discrepancy between the model's predictions and the target values. LightGBM stands out for its low memory usage and fast training speed. CatBoost is designed to extract the most information from the given data and is particularly effective for small machine-learning datasets. SVM, a binary classification model, seeks an optimal hyperplane for sample separation. For evaluation, we use 10-fold cross validation on the A101 dataset, as in "Performance on A101 datasets", with the same set of indicators: AUPR, AUC, Acc, F1 score, Prec, and Sen. The experimental results, presented in Fig. 3, Fig. 4 and Table 2, indicate that when RF is used as the downstream classifier, the model achieves the highest performance, with an AUC of 91.27% and an AUPR of 90.40%. While XGBoost exhibits relatively inferior performance compared to logistic regression and RF, the overall classification effect of all three classifiers remains favorable; the difference in AUC between XGBoost and RF is 7.84%. These findings support the notion that the RNA sequence representations learned by our model remain effective across different downstream classifiers, with RF delivering superior performance and highlighting its role in effectively capturing comprehensive sequence information.

Performance comparison of predictors
In this study, we conduct a rigorous 10-fold cross validation to compare our model with five existing methods on the A101 dataset: M6AMRFS, BERMP, RFAthM6A, BERT-m7G (Zhang et al., 2021), and the model designed by Le & Ho (2022). M6AMRFS encodes RNA sequences using two feature descriptors, dinucleotide binary coding and local site-specific dinucleotide frequency; it enhances feature representation through the F-score algorithm combined with sequence forward search (SFS) and employs XGBoost as the downstream classifier. BERMP utilizes a GRU to represent RNA sequences and adopts an end-to-end training process for site recognition. RFAthM6A classifies various types of features derived from RNA sequences using machine learning methods. Our model, M6ATMR, adopts the Transformer encoder to extract sequence representations and uses the RF classifier for predicting m6A sites. We also consider two other Transformer-based models: BERT-m7G uses bidirectional encoder representations from transformers (BERT) to extract sequence representations, while Le & Ho (2022) employ a fine-tuned transformer to explore features together with a convolutional neural network for further feature extraction. However, we note that the specificity (Sp) value of our method is slightly lower than that of BERMP and RFAthM6A, indicating a minor deficiency in predicting true negative samples. Nevertheless, overall, the experimental results clearly demonstrate the effectiveness of our model, which stands out as a superior approach for m6A site prediction on RNA sequences. Certain limitations of our study nonetheless warrant attention. First, the restriction on RNA sequence length necessitated the selection of the A101 dataset for model verification, rendering our approach less adept at handling short RNA sequences. Second, our current model primarily focuses on nucleotide distribution information, with limited exploration of other sequence properties. Future work will address these issues and explore the application of our model to the identification of other modification types, such as m1A, and of modifications on DNA sequences.
Minghao Wu conceived and designed the experiments, authored or reviewed drafts of the article, and approved the final draft.

Figure 1
Figure 1 The framework of M6ATMR. (A) The details of the similarity matrix calculation process. (B) The details of the structure graph learning process with the Transformer encoder. (C) The similarity matrix optimization process with the self-correlation graph neural networks. (D) The structure of the local-global representation fusion block. DOI: 10.7717/peerj.15899/fig-1

Figure 3
Figure 3 The comparison results of different classifiers. DOI: 10.7717/peerj.15899/fig-3

Figure 4

Features selection
In this study, we integrate three types of features for comprehensive analysis: feature representations from similarity matrices, feature representations from sequence structure graphs, and k-mer sequence representations. K-mer representations are pre-computed encodings based on the frequency counts of each sequence's k-mer substrings, while the other two features are dynamically learned through neural networks. To achieve a holistic understanding of the sequences, we perform local-global integration of these features. To illustrate the importance of combining the three features, we construct two additional feature sets, drop-raw and drop-trans, to validate our model's performance. Drop-raw denotes features without the similarity matrix information, and drop-trans denotes features without the sequence structure graph information. The comparison results for the three types of features are presented in Fig. 5. Our model demonstrates the best recognition performance when all features are utilized. Drop-raw performs slightly better than drop-trans in terms of AUC and AUPR, suggesting that the similarity matrices exert a stronger influence on the model than the sequence structure graphs. Overall, all three types of features are essential and significantly enhance the effectiveness of site recognition. The local-global fusion block is a crucial component of our model, integrating multiple features from both local and global perspectives by combining the learned similarity matrix features with the sequence structure graph features. To demonstrate the necessity of this fusion block, we design an alternative strategy using only weighted fusion; these results are also presented in Fig. 5. The outcomes reveal that the recognition performance of the model is inferior when only weighted fusion is employed. This underscores the significance of the local-global fusion block in achieving superior recognition performance.

Table 1
The value of some indicators in each fold.

Table 2
The values of some indicators in each fold based on different classifiers.
Le & Ho (2022) employ a fine-tuned transformer to explore features together with a convolutional neural network for further feature extraction. The comparison results are presented in Table 3. Our model achieves an Acc value of 84.42% and an MCC value of 83.72%. These indicators demonstrate that our model outperforms the other five methods in most aspects. Compared to the other models, our approach exhibits a remarkable improvement of up to 11.09% in accuracy and 16.13% in MCC. The model designed by Le & Ho (2022) has the lowest MCC value among all methods, while BERMP and RFAthM6A show similar performance across the various indicators.

Table 3
The comparison results of different recognition methods. N.A. denotes that the value of the indicator is not provided by the corresponding study.

CONCLUSION
In this article, we commence by reviewing classical methods for identifying RNA m6A modification sites and presenting our own perspectives. We then analyze the limitations of these methods, leading us to propose a novel, sequence-only RNA m6A modification site recognition method named M6ATMR. M6ATMR utilizes the Transformer to guide the reconstruction of the similarity matrices of the RNA sequences, thereby optimizing their feature representations. Comparative analysis with other recognition methods reveals that M6ATMR demonstrates superior predictive performance, as evidenced by improved metrics, and comprehensive experiments further attest to the accuracy and robustness of our model. Additionally, we delve into several critical aspects. First, computing the similarity matrix and optimizing feature generation proves effective in enhancing the recognition of RNA m6A modification sites. Second, cooperatively updating the similarity matrices and sequence structure graphs in the representation of the same RNA sequence helps retain richer nucleotide distribution information. Third, the deep fusion of multiple features from both local and global perspectives yields a comprehensive understanding of RNA sequences.