ncRFP: A Novel end-to-end Method for Non-Coding RNAs Family Prediction Based on Deep Learning

Accumulating evidence shows that non-coding RNAs (ncRNAs) play important roles in cellular biological processes and disease pathogenesis. High-throughput techniques have produced a large number of ncRNAs whose functions remain unknown. Because accurately identifying the family of an ncRNA helps in studying its function, predicting the family of each ncRNA is both necessary and urgent. Although several established methods can predict ncRNA families, their complex procedures and limited accuracy remain major problems. The main idea of those methods is to first predict the secondary structure and then identify the ncRNA family from properties of that structure. Unfortunately, the accumulation of errors across multiple steps, especially the imperfection of RNA secondary structure prediction tools, may be the cause of the low accuracy. In this paper, a novel end-to-end method, ‘ncRFP’, is proposed to complete the prediction task based on deep learning. Instead of predicting the secondary structure, ncRFP predicts the ncRNA family by automatically extracting features from ncRNA sequences. Compared with other methods, ncRFP not only simplifies the process but also improves accuracy. The source code of ncRFP is available at https://github.com/linyuwangPHD/ncRFP.


INTRODUCTION
The expression of protein-coding genes (messenger RNAs: mRNAs) has been the focus of life science studies for decades. In recent years, however, increasing evidence has shown that ncRNAs play multiple vital roles in biological processes [1] and complex diseases [2] through the regulation of replication, transcription, or gene expression [3], [4], [5]. With continued research, a large number of ncRNAs have been identified based on their sequence length, structural properties, or function. The best-known ncRNAs are transfer RNAs (tRNAs) and ribosomal RNAs (rRNAs). tRNAs, which adopt the highly conserved 'cloverleaf' secondary structure, are mainly responsible for transporting amino acids during translation [6], and rRNAs are mainly responsible for the synthesis of peptide chains [7]. Other well-known ncRNAs are microRNAs (miRNAs) and long non-coding RNAs (lncRNAs). miRNAs are small single-stranded ncRNAs of approximately 22 nucleotides in length that bind directly to the 3'-untranslated region of target genes and negatively regulate them. Hence, miRNAs are essential in many physiological and pathological processes, such as cancer and Alzheimer's disease. Through this regulation, they can promote or inhibit cell proliferation and thus act as tumor suppressors or oncogenes [8]. lncRNAs are longer than 200 nucleotides [9] and are important regulators of signal transduction through various mechanisms. One of the most widely acknowledged molecular mechanisms of lncRNAs is to act as "sponges" that modulate the activity of miRNAs [10]. Other important ncRNAs include small nuclear RNAs (snRNAs), small nucleolar RNAs (snoRNAs), silencing RNAs (siRNAs), riboswitches, and internal ribosome entry sites (IRES). ncRNAs have a three-level structural hierarchy [11]: primary structure, secondary structure, and tertiary structure. The primary structure is the base sequence.
Although ncRNA sequences appear to be less conserved than secondary and tertiary structures, ncRNAs of the same family share consistent seed sequences. The secondary structure refers to the planar structure formed when the sequence folds on itself and pairs bases internally, combining secondary structure elements of various specific shapes. Since secondary structures with similar functions are conserved, they can be represented as graphs and used to predict ncRNA families. The tertiary structure refers to the stable three-dimensional conformation formed in space by the interaction of secondary structure elements. Although the tertiary structure is conserved, it is not suitable for ncRNA family prediction because it is difficult to obtain.
In this era of high-throughput technology [12], a great number of unknown ncRNAs have been found, and research on their function poses great challenges to researchers. Because different families of ncRNAs have distinct functions, accurately predicting the ncRNA family is conducive to research on ncRNA function. Identifying the ncRNA family via biological experiments is time-consuming, laborious, and expensive, and cannot keep pace with high-throughput technology. Therefore, computational methods are needed to predict the family of ncRNAs. Several established methods (GraPPLE [13], RNAcon [14], nRC [15]) can complete the prediction task. GraPPLE uses machine learning methods to predict the ncRNA family based on secondary structure features. RNAcon considers 20 graph features obtained from the predicted ncRNA secondary structure and adopts a random forest (RF) classifier. nRC currently represents the state-of-the-art method: typical secondary structure features are first extracted by Moss [16] and processed into a one-hot code, and a convolutional neural network is then employed to identify ncRNAs. All of these traditional methods must first obtain the secondary structure of the ncRNAs and then identify the ncRNA family from secondary structure features. Because the real secondary structure of ncRNAs is difficult to obtain, they rely on computational tools (IPknot [17], RNAfold [18]) to predict it. Errors made in predicting the secondary structure propagate to the subsequent steps. Therefore, the low prediction accuracy of traditional methods may be caused by the accumulation of errors across multiple steps, especially the imperfection of secondary structure prediction tools.
In those traditional methods, the ncRNA secondary structure is predicted from the ncRNA sequence, so the characteristics of the secondary structure are also implicit in the sequence. Therefore, as long as the feature extraction method is appropriate, discriminative family features can be extracted from ncRNA sequences to identify their family. In view of the requirements of ncRNA family prediction and the shortcomings of traditional methods, a novel end-to-end method, 'ncRFP', is proposed to predict the family of ncRNAs. The main novelty of ncRFP is that, unlike those traditional methods, it predicts the family directly from ncRNA sequences. With ncRNA sequences as input, ncRFP extracts features automatically and learns them with a deep learning model. In this paper, three models were created based on an RNN, a CNN, and a DNN, respectively. A comparison of those models shows that the RNN performs better than the other two. Fig. 1 shows the performance comparison among those models. Hence, we chose the RNN model, which contains Bi-LSTM [19], an attention mechanism [19], and a fully connected neural network [20], as the final model. Compared with other methods, ncRFP reduces the number of steps and the sources of error. Therefore, it not only simplifies the process but also has the potential to improve prediction accuracy.

Data Collection and Processing
The data used in this paper come from the recent literature [15] and were collected from the Rfam database [21]. They include 13 different types of ncRNAs: microRNAs, 5S_rRNA, 5.8S_rRNA, ribozymes, CD-box, HACA-box, scaRNA, tRNA, Intron_gpI, Intron_gpII, IRES, leader, and riboswitch. There are 6320 nonredundant ncRNA sequences, of which IRES contains 320 and each of the other families contains 500. All sequences of each family were randomly divided into 10 parts. One part of each family is selected to form a test set and the remaining parts form the training set, so that all ncRNA sequences yield 10-fold cross-validation training and test sets. In this paper, two encoding methods were considered for converting ncRNA sequences into matrices. The first converts each base into a 1 × 8 vector [22]; the other converts each base into a 1 × 4 vector (one-hot encoding). Tables 1 and 2 show the conversion rules of those two encoding methods.
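The per-family 10-fold split described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `stratified_ten_fold`, the round-robin assignment of shuffled sequences to parts, and the toy sequence identifiers are all assumptions made for the example.

```python
import random

def stratified_ten_fold(sequences_by_family, n_folds=10, seed=0):
    """Split every family into n_folds parts; fold k tests on part k of each family.

    Hypothetical helper: returns a list of (train, test) pairs, where each
    element of train/test is a (sequence, family) tuple.
    """
    rng = random.Random(seed)
    parts = {}
    for family, seqs in sequences_by_family.items():
        shuffled = list(seqs)
        rng.shuffle(shuffled)
        # Deal the shuffled sequences round-robin so part sizes differ by at most one.
        parts[family] = [shuffled[i::n_folds] for i in range(n_folds)]

    folds = []
    for k in range(n_folds):
        train, test = [], []
        for family, family_parts in parts.items():
            for j, part in enumerate(family_parts):
                target = test if j == k else train
                target.extend((seq, family) for seq in part)
        folds.append((train, test))
    return folds

# Toy stand-in for two of the 13 families (500 sequences, and 320 for IRES).
toy = {
    "tRNA": [f"tRNA_{i}" for i in range(500)],
    "IRES": [f"IRES_{i}" for i in range(320)],
}
folds = stratified_ten_fold(toy)
train, test = folds[0]
print(len(train), len(test))  # 738 82
```

Splitting within each family keeps the family proportions the same in every training and test set, which matters here because IRES is smaller than the other families.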
Because ncRNAs vary in length, it is convenient for training and testing to process them into the same length. In this paper, a new intercepting/padding method (IPM) is put forward to process ncRNAs into a fixed length. ncRNAs longer than the fixed length are truncated from the beginning to the fixed length, and ncRNAs shorter than the fixed length are padded with 'N' at the tail up to the fixed length. To choose a suitable encoding method and fixed length, we randomly chose one fold (training set and test set) to train and test ncRFP. Fig. 2 shows the prediction accuracy for different lengths and encodings. The accuracy of the 1 × 8 encoding is greater than that of the 1 × 4 encoding at all lengths, and the maximum accuracy is achieved at a length of 500. Therefore, a length of 500 and the 1 × 8 encoding were selected. Hence, each ncRNA sequence was processed into 500 bases by IPM and converted into an 8 × 500 matrix.
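A small sketch of IPM followed by encoding may make the preprocessing concrete. Several details here are assumptions rather than facts from the paper: truncation is read as keeping the leading bases, the 1 × 4 one-hot column order (A, C, G, U) and the all-zero vector for 'N' are illustrative choices because Tables 1 and 2 are not reproduced in this text, and DNA-style 'T' is mapped to 'U'. The 1 × 8 encoding finally selected would replace the `ONE_HOT` table.

```python
# Assumed 1 x 4 one-hot table; the column order and the zero vector for 'N'
# are illustrative, not the rules from the paper's Table 2.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "U": [0, 0, 0, 1],
    "N": [0, 0, 0, 0],
}

def ipm(sequence, fixed_length=500, pad_char="N"):
    """Intercepting/padding: keep the first fixed_length bases, pad the tail with 'N'."""
    return sequence[:fixed_length].ljust(fixed_length, pad_char)

def encode(sequence, fixed_length=500):
    """Return a fixed_length x 4 matrix (list of rows) for one sequence."""
    normalized = sequence.upper().replace("T", "U")  # assumption: accept DNA-style input
    fixed = ipm(normalized, fixed_length)
    # Any symbol outside the table is treated like padding (assumption).
    return [ONE_HOT.get(base, ONE_HOT["N"]) for base in fixed]

matrix = encode("GGCUACGUAGCUCAGUUGG")
print(len(matrix), len(matrix[0]))  # 500 4
```

The sketch stores one row per base; transposing it gives the channels-by-length layout (8 × 500 for the selected encoding) mentioned in the text.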

Method
ncRFP is a deep learning model composed of a Bi-LSTM, an attention mechanism (AM), and a fully connected network. The Bi-LSTM and AM are mainly responsible for encoding different ncRNAs into fixed-format data, and the fully connected network decodes the output of the Bi-LSTM and AM. Fig. 3 displays the architecture of ncRFP. After the original ncRNA sequences are converted into matrices, they are used as the input of the Bi-LSTM and AM. The Bi-LSTM encodes each base at each position into fixed-size data based on its context. The AM primarily helps the model focus on the important positions of the different ncRNAs and encodes the output of the Bi-LSTM into a representation of the same size for every sequence. The fully connected network mainly undertakes the task of decoding the output of the Bi-LSTM and AM into the corresponding ncRNA family.

Bi-LSTM
ncRNAs are context-sensitive, text-like data [22], so it is necessary to capture the context of each base when predicting their family. Because a bidirectional RNN can effectively capture both the past and the future context of each base, we chose it as the first part of the model to process each base, combined with its context, into data of the same format. The memory capacity of an ordinary RNN is limited: as sequence length increases, it loses the ability to learn long-range information and suffers from vanishing gradients. As a special RNN, LSTM alleviates the vanishing-gradient problem of the ordinary RNN by introducing a memory cell and a gate mechanism, which makes it better at representing past and future information and at extracting long-distance dependencies between elements in sequence data. In the LSTM memory cell, σ is the logistic sigmoid function, and i, f, o, and c are the input gate, forget gate, output gate, and cell vector, respectively, all of which have the same dimension as the hidden vector h. Meanwhile, w denotes the weight matrices and b the bias vectors. Hence, the Bi-LSTM was selected as the first part of ncRFP.
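For reference, a widely used textbook form of the LSTM memory cell, written with the symbols defined above, is sketched below. It is not recovered from the original equations: the subscripted weight names, the input vector x_t, the element-wise product \odot, and the omission of peephole connections (which some formulations add to the gates) are assumptions. The last line shows one common way to form a bidirectional output, by concatenating the forward and backward hidden vectors, which would be consistent with a 128-node hidden layer producing a 256-node output.

```latex
\[
\begin{aligned}
i_t &= \sigma\left(w_{xi} x_t + w_{hi} h_{t-1} + b_i\right) \\
f_t &= \sigma\left(w_{xf} x_t + w_{hf} h_{t-1} + b_f\right) \\
o_t &= \sigma\left(w_{xo} x_t + w_{ho} h_{t-1} + b_o\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\left(w_{xc} x_t + w_{hc} h_{t-1} + b_c\right) \\
h_t &= o_t \odot \tanh\left(c_t\right) \\
h_t^{\mathrm{bi}} &= \left[\overrightarrow{h}_t \, ; \, \overleftarrow{h}_t\right]
\end{aligned}
\]
```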

Attention Mechanism
In recent years, AM has gradually become a research hotspot in the field of deep learning. AM originates from simulating the attention characteristics of the human brain. Its core idea is to allocate more attention to important information and less to other information, thereby ignoring irrelevant information and amplifying the desired information. Thus, the sensitivity and speed of information processing in the focused area are greatly improved. Ref. [23] first applied AM to investigate representations and proposed that global and local attention models be applied to machine translation. Ref. [24] applied AM to image description and obtained satisfying results. Ref. [25] also reported that, in automatic document summarization, AM can create sentence and document embeddings, which in turn improve the summarization task. ncRNAs of the same family have consistent seed sequences, which can be used to distinguish them. Therefore, AM was selected as the second part of ncRFP, where it focuses more of the model's attention on the consistent seed sequences so as to improve prediction accuracy.
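One simple way to realize this kind of attention over per-position outputs is attention pooling: score every position, normalize the scores with a softmax, and take the weighted sum. The sketch below assumes that form. The text does not state ncRFP's scoring function, so the tanh-and-dot-product score, the `query` vector (which would be a learned parameter), and the function name `attention_pool` are hypothetical.

```python
import math

def attention_pool(hidden_states, query):
    """Pool T hidden vectors of size d into one vector of size d.

    hidden_states: list of T lists (e.g. Bi-LSTM outputs, one per base).
    query: list of d floats; a trainable vector in a real model.
    Returns (context, weights), where weights sum to 1 over the T positions.
    """
    # Score each position: dot product of the query with tanh of the hidden vector.
    scores = [
        sum(q * math.tanh(h_j) for q, h_j in zip(query, h))
        for h in hidden_states
    ]
    # Numerically stable softmax over positions.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of hidden vectors gives a fixed-size summary of the sequence.
    dim = len(query)
    context = [
        sum(w * h[j] for w, h in zip(weights, hidden_states))
        for j in range(dim)
    ]
    return context, weights

states = [[0.1, -0.2], [2.0, 1.5], [0.0, 0.3]]
context, weights = attention_pool(states, query=[1.0, 1.0])
print(max(range(len(weights)), key=weights.__getitem__))  # 1
```

Because the output size depends only on the hidden dimension, every sequence is reduced to a vector of the same size, and positions with high weights (ideally the seed regions) dominate that vector.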

Fully Connected Network
After the Bi-LSTM and AM, the output must be decoded into the corresponding family of each ncRNA. In this paper, a four-layer fully connected neural network is used to accomplish this task. Fig. 3 contains the architecture, which consists of one input layer, two hidden layers, and one output layer, with ReLU [26] as the activation function. Each layer of the fully connected network maps its input x to its output y using the weight matrix w and the bias vector b.
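The decoding step can be illustrated with a forward pass through layers of 256, 128, 64, and 13 nodes, the sizes listed in the parameter settings. This is a hedged sketch, not the authors' implementation: each layer is assumed to compute y = f(wx + b), the softmax on the output layer and the zero-initialized biases are assumptions, and the random input stands in for the attention output.

```python
import math
import random

def dense(x, weights, bias):
    """Affine map y = w x + b for one layer (weights is n_out x n_in)."""
    return [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(weights, bias)
    ]

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def init_layer(n_in, n_out, rng, std=0.05):
    """Gaussian weights (mean 0, std 0.05 as in the paper); zero bias is an assumption."""
    weights = [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def forward(x, layers):
    """ReLU on the hidden layers, softmax on the output layer (assumed)."""
    for weights, bias in layers[:-1]:
        x = relu(dense(x, weights, bias))
    weights, bias = layers[-1]
    return softmax(dense(x, weights, bias))

rng = random.Random(0)
sizes = [256, 128, 64, 13]  # input, two hidden layers, 13 family outputs
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
probs = forward([rng.random() for _ in range(256)], layers)
predicted_family_index = max(range(len(probs)), key=probs.__getitem__)
print(len(probs))  # 13
```

The index of the largest output would then be mapped back to one of the 13 family labels.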

Parameters of ncRFP
In this paper, the adaptive moment estimation (Adam) optimizer [27], which dynamically adjusts the learning rate of each parameter, was used to train the model. In each iteration, the learning rate stays within a definite range, which effectively prevents gradients from vanishing or exploding and makes the parameters change smoothly. The parameters of ncRFP are given below:
1) The Bi-LSTM hidden layer had 128 nodes and its output layer had 256 nodes. The fully connected layers had 128, 64, and 13 nodes.
2) All weight matrices were initialized from a Gaussian distribution (mean 0, standard deviation 0.05).
3) A dropout [28] layer was added to prevent over-fitting by reducing the number of nodes involved in the calculation. The dropout rate was 0.3 in the Bi-LSTM and 0.4 in the fully connected neural network.
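To show how Adam adapts the step size of each parameter, the sketch below applies the standard Adam update to a toy one-parameter problem. The hyperparameter values (learning rate 0.001, beta1 0.9, beta2 0.999, epsilon 1e-8) are the commonly used defaults and are assumptions here, since the text does not list the values used for ncRFP; `adam_step` and the `state` dictionary are hypothetical names.

```python
import math

def adam_step(params, grads, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update and return the new parameter list.

    state holds the step counter "t" and the running moment estimates "m" and "v".
    """
    state["t"] += 1
    t = state["t"]
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        # Exponential moving averages of the gradient and the squared gradient.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        # Bias correction compensates for the zero initialization of m and v.
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        # The effective step is scaled per parameter by the gradient magnitude.
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_params

# Toy problem: minimize f(x) = x^2, whose gradient is 2x.
params = [1.0]
state = {"t": 0, "m": [0.0], "v": [0.0]}
for _ in range(2000):
    grads = [2.0 * p for p in params]
    params = adam_step(params, grads, state)
```

On the first step the update equals the learning rate regardless of how large the gradient is, which illustrates the bounded step size mentioned in the text.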

RESULT
In this section, we present the training and prediction results of ncRFP. To verify the stability of ncRFP, we trained it with 10-fold cross-validation and recorded the training process at each epoch. After training, we used ncRFP to obtain prediction results on the 10-fold cross-validation test sets and compared it with nRC and RNAcon in two respects: the average results over all test sets and the average results for each family. Fig. 4 shows the average accuracy and loss at each epoch in the 10-fold cross-validation experiments. In the last few epochs, the accuracy and loss on the test sets become stable, which indicates that ncRFP can successfully complete ncRNA family prediction. Tables S1-S3 (supplementary document, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TCBB.2020.2982873) give the detailed average prediction results of ncRFP, nRC, and RNAcon on the 10-fold cross-validation test sets. In Table S3, available online, because the predictions of RNAcon do not include scaRNA, we removed scaRNA from the test sets when RNAcon was used to predict ncRNA families. The green parts of each table are the numbers of correct predictions for each ncRNA family. To estimate the performance of each method, Accuracy, Sensitivity, Precision, F-score, and MCC are used to compare ncRFP, nRC, and RNAcon.
Accuracy is the proportion of correctly predicted ncRNAs among all ncRNAs. Sensitivity (Equation (1)), Precision (Equation (2)), F-score (Equation (3)), and MCC (Equation (4)) are well known and have been applied earlier in various prediction methods [13], [14], [15]:

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{1}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{2}$$

$$\mathrm{F\text{-}score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Sensitivity}}{\mathrm{Precision} + \mathrm{Sensitivity}} \tag{3}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{4}$$

where TP, TN, FP, and FN are True Positives, True Negatives, False Positives, and False Negatives, respectively.
Table 3 shows the average performance comparison of different methods on the 10-fold cross-validation test sets. According to Table 3, ncRFP achieves the best performance on all indexes. Fig. 4 demonstrates the average performance of different methods in each family. ncRFP achieves the best performance in 5S_rRNA, 5.8S_rRNA, tRNA, CD_box, miRNA, Intron_gpII, HACA-box, IRES, riboswitch, leader, and scaRNA. In ribozyme, ncRFP achieves the best sensitivity but is suboptimal on the other indexes. With regard to Intron_gpI, ncRFP is suboptimal on all indexes.
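The indexes used in this comparison can be computed per family from a multi-class confusion matrix by treating one family as positive and all others as negative. The sketch below uses the standard definitions of Sensitivity, Precision, F-score, and MCC. The function names, the 2 × 2 toy matrix, and the choice to return 0.0 when a denominator is zero are assumptions, and the text does not specify how per-family values were averaged.

```python
import math

def one_vs_rest_metrics(confusion, k):
    """Metrics for family k, where confusion[i][j] counts family-i sequences predicted as j."""
    n = len(confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp
    fp = sum(confusion[i][k] for i in range(n)) - tp
    tn = sum(map(sum, confusion)) - tp - fn - fp

    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + sensitivity:
        f_score = 2 * precision * sensitivity / (precision + sensitivity)
    else:
        f_score = 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "precision": precision,
        "f_score": f_score,
        "mcc": mcc,
    }

def accuracy(confusion):
    """Fraction of all sequences on the diagonal of the confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / sum(map(sum, confusion))

# Toy two-family example: rows are true families, columns are predictions.
toy_confusion = [[45, 5], [10, 40]]
metrics = one_vs_rest_metrics(toy_confusion, 0)
print(accuracy(toy_confusion))  # 0.85
```

For the toy matrix, family 0 has TP = 45, FN = 5, FP = 10, and TN = 40, so its sensitivity is 0.9 and its precision is 45/55.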

DISCUSSION
RNAs are crucial biological macromolecules in life activities and can be mainly divided into coding RNAs and non-coding RNAs. Coding RNAs participate in life activities by guiding protein synthesis, whereas non-coding RNAs participate in life activities directly. Owing to the complexity and diversity of non-coding RNAs, research on their function poses great challenges. Meanwhile, the application of high-throughput technology has generated a large number of unknown ncRNAs, which poses a further challenge to researchers. Studies have revealed that the function of ncRNAs is closely related to their families, so accurate recognition of the ncRNA family is helpful to the study of ncRNA function. Given the huge volume of high-throughput ncRNA sequences, computational methods are necessary to predict the ncRNA family. Hence, we proposed a novel computational method, 'ncRFP', to predict the ncRNA family. Unlike traditional methods, ncRFP directly uses ncRNA sequences as input, which reduces the intermediate steps and sources of error. ncRFP is a deep learning model that contains a Bi-LSTM, an AM, and a fully connected network. To select a good model, we created three models based on an RNN, a CNN, and a DNN, respectively, and tuned each for its best performance. Fig. 1 shows the performance comparison of those three models. The performance of the RNN and the CNN is close, which means that both can be used for ncRNA family prediction. Because the RNN performs slightly better than the CNN, we chose the RNN-based model as the final model. We trained it using 10-fold cross-validation and compared it with several established methods.
In the comparison of multiple methods, our method achieves the best performance over the whole set of ncRNA sequences and in 11 individual families: 5S_rRNA, 5.8S_rRNA, tRNA, CD_box, miRNA, Intron_gpII, HACA-box, IRES, riboswitch, leader, and scaRNA. Over the whole set of ncRNA sequences, Accuracy, Sensitivity, Precision, F-score, and MCC increased by 13.88, 14.36, 14.92, 14.61, and 16.40 percent, respectively. Table 4 shows the detailed improvement of the indexes in each family. There are two main reasons for the best performance of our method over the whole set and in most families. On the one hand, our method is a deep learning model that can comprehensively extract the features of ncRNAs for prediction, and the combination of Bi-LSTM and AM can effectively focus on the important segments of ncRNA sequences. On the other hand, our method is an end-to-end method, which reduces the intermediate steps. In Tables S1 and S2, available online, ribozyme and Intron_gpI are most often mispredicted as each other, which suggests that Intron_gpI and ribozyme are similar in sequence and secondary structure. There are two main reasons why the performance of ncRFP is not the best in ribozyme and Intron_gpI. On the one hand, the high similarity between ribozyme and Intron_gpI makes the extracted features insufficient to separate the two families. On the other hand, our method loses some information when truncating ncRNA sequences, which degrades its performance.
In future work, we will create a model that combines an RNN and a CNN to simultaneously learn the primary and secondary structural features of ncRNAs and improve prediction accuracy. The new model will predict more types of ncRNAs to provide more computational support for biological experiments. Like nRC, we will also create a publicly available web service for the family prediction of unlabeled non-coding RNA sequences.