MU-PseUDeep: A deep learning method for prediction of pseudouridine sites

Graphical abstract


Introduction
Pseudouridine (W) is one of the most abundant RNA modifications in a cell [1]. W is also known as the fifth nucleotide base of RNA [2]. It results from the isomerization of a uridine base. This process of isomerization is a post-transcriptional mechanism known as pseudouridylation [3,4], which is catalyzed by pseudouridine synthases (PUS) [5][6][7][8]. W has significant functional and disease implications. For some types of cancers, W provides important biomarkers [9][10][11][12][13]. The sequencing method (Pseudo-seq) can identify W sites on a large scale at a single nucleotide resolution, but it requires a high sequencing depth as well as multiple biological replicates in order to do so accurately; thus, Pseudo-seq can be very costly [14,15]. A low-cost alternative is predicting W sites using machine learning. Most machine learning methods developed so far use traditional approaches like support vector machines (SVM). For instance, PPUS, an SVM based method uses nucleotides around W as features [16]. In contrast, iRNA-PseU uses the pseudo-nucleotide composition, including a combination of physicochemical properties of nucleotides and nucleotide densities as features for SVM [17]. Another method, pseudo-uridine (W) identification (PseUI) uses five different kinds of features including nucleotide composition (NC), dinucleotide composition (DC), pseudo dinucleotide composition (pseDNC), position-specific nucleotide composition (PSNP) and position-specific dinucleotide propensity (PSDP), followed by a sequential forward selection strategy to select features for SVM classification of mRNA fragments [18]. While these methods have reasonable performance, there is considerable room for improvement. These tools do not benefit from the latest deep learning techniques, which pose several advantages in comparison to traditional machine learning methods. First, deep learning has been demonstrated to significantly outperform traditional machine-learning methods in multiple domains. Secondly, deep learning reduces the need for feature engineering. Lately, there has been an upsurge in the development of deep learning methods in genomics [19]. Some of these prediction methods have been in the area of RNA modification prediction [20][21][22].
We novelty in this work is to use the secondary structure context of an mRNA fragment as an input feature in addition to the sequence for the input to our deep learning model. W modification plays an important role in stabilizing the secondary structure of RNA. Ribonucleoproteins depend strongly on the structural context of RNA when they catalyze the isomerization of uridine to W [23].
Thus, it is reasonable to hypothesize that the secondary structure is crucial for the identification of W sites. To the best of our knowledge, secondary structure context has never been used for this problem, although deep learning approaches do exist, which utilize a secondary structure context to predict RNA-protein sequence and structure binding [18]. By including secondary structure features, we significantly improved the prediction of W sites in comparison to other available methods. Very recently, a new study explored a deep learning approach to predict W sites called iPseU-CNN [24].
Since no source code or webserver was available for this approach, a direct comparison is impossible; however, we have compared our method with a sequence-only CNN, the architecture of which closely resembles that of iPseU-CNN. Compared to the sequence-only CNN which only uses RNA sequence fragment encoded with Oneof-K encoding, our method which combines both sequence and secondary structure information shows significant improvements. We have made predictions using the MU-PseUDeep model for human, mouse and yeast datasets. We have also identified potential W sites by conducting a transcriptome-wide prediction of a human transcriptome at >0.99 precision threshold, to explore the functional importance of W in mRNA. These predicted W sites may provide useful hypotheses for experimental validations.

Data collection and pre-processing
The W site information was downloaded from RMBase v2.0 [23] for all three species, namely human, mouse, and yeast. For each of the three species, we extracted the W and surrounding 25 bases upstream and downstream nucleotides using BEDTools [24] with reference files of three species, hg19 (human), mm10 (mouse), and sacCer3 (yeast). To create the negative dataset, we collected those regions of RNA that did not contain any experimentally validated W sites. Since in nature, W sites are relatively rare, the number of negative samples in our data is 10 times larger than the number of positive samples, which is a classical imbalance machine learning problem. We did a 10-fold stratified split of positive and negative RNA samples into an 80:20 ratio for training and testing data to maintain the same class ratio in training and testing sets using pandas and Scikit-learn [25]. We reduced the sequence identity between training and testing sets for each fold using cdhit-est-2d with a minimum sequence identity threshold (0.8) (allowed by cd-hit for RNA sequence with a word length of 4) [26].
To further reduce the sequence identity, we globally aligned the remaining training set against the test set using the Needleman-Wunsch algorithm and removed the sequences from the training set that had >60% sequence identity with the test set. The high sequence identity was further removed within the test set using cd-hit-est at the above-defined sequence identity threshold and word size. For the processed sequence data, the abstract secondary structure dot-bracket notation was generated using the RNAshapes package [27,28]. The dot-bracket notation was further converted into secondary structure context using EDeN (https:// github.com/fabriziocosta/EDeN), a neighborhood subgraph pairwise distance kernel-based method for explicit feature representation of graphs. The RNA secondary structure context is represented by six generic sub-shapes, namely Stem (S), multi-loops (M), hairpins (H), internal loop (I), dangling start (F), and dangling end (T). Each 51 bp RNA fragment was coded into a secondary structure context, where each nucleotide was coded into one of the abovementioned sub shapes. Sequence data was converted into a oneof-K encoding binary matrix of size 51 Â 4, where 51 is the length of the fragment of 4 nucleotides. Similarly, the secondary structure was encoded into a one-of-K encoding binary matrix of size 51 Â 6, where 51 is the length of the fragment with six sub-shapes of the RNA fragment. There are two input layers for sequence and secondary structure. Both layers are one-of-K encoding of a 51-base pair RNA fragment and its secondary structure context. Feature maps for each encoding are generated using two convolutional layers for each of the two encodings. Feature maps are then concatenated and fed into the 512-neuron dense layer. The Final layer is a 2-neuron dense layer with a softmax binary output.

Deep learning architecture of MU-PseUDeep
For each input (sequence and secondary structure), a pair of 1D CNN was used. The first layer of sequence input (seq_1) and secondary structure input (sec_1) both have a filter size of 5 and a kernel size of 10. Similarly, the second 1D CNN layer for both sequence input (seq_2) and secondary structure input (sec_2) has a filter size of 9 and a kernel size of 4. The kernel initializer for each 1D CNN layer was 'glorot_normal.' The kernel regularizer weight for each layer (rounded to four decimal places) followed 0.0321 (seq_1), 0.01608 (seq_2), 0.00109 (sec_1) and 0.0340 (sec_2). Dropout rate for each layer was as follows: 66.5% (seq_1), 3.8% (seq_2), 74.5% (sec_1) and 36.9% (sec_2). All layers had a 'PRelu' activation function. All layers were concatenated and fed into the dense layer with a 'softplus' activation function. A stochastic gradient was used as the optimization algorithm with a learning rate of 0.0137. A binary cross-entropy was used as a loss function with an early-stop patience of 20 and a model checkpoint serving as a callback for fitting the model. The batch size was 32 and the number of epochs was set to 500. The total number of trainable parameters in the network was 661,118. The model was implemented in Keras version 2.2.2 with a Tensorflow (1.10.1) backend [29].

Hyperparameter optimization
A hyperparameter optimization of various hyperparameters was carried out using Hyperas (https://github.com/maxpumperla/hyperas), a convenience wrapper for Hyperopt (https:// github.com/hyperopt/hyperopt), and a distributed asynchronous hyperparameter optimization library. A tree-structured Parzen estimator approach was used to optimize the models by maximizing each model's F1-score on validation data for a single fold [30]. We optimized several hyperparameters of our deep learning architecture including ''dropout-rate," ''kernel regularizer weight," ''optimization algorithm," and ''learning rate for the optimizer." The performance of the top 10 hyperparameter-optimized models on our test data is shown in Supplementary Fig. S1.

Bootstrapping
A bootstrapping approach was applied, like the one used by Wang et al. (2017). In this case, we divided our negative samples into N bins. Each bin was the same size as the number of samples in the positive class and was iterated when training the model with the positive class. The final results were calculated by averaging the results from each iteration of every fold [31][32][33].

Transfer learning
A bootstrapped hyperparameter optimized human model was further finetuned for transfer learning on yeast and mouse data. All layers were kept fixed/untrainable except for each of the two 1D CNN layers and dense 512 neuron layers. The learning rate of the stochastic gradient descent algorithm was reduced from 0.0137 to 0.0086.

Human transcriptome scanning
The human transcriptome was obtained using BedTools from the hg19 genome and gencode gtf file. The coding sequences were converted from DNA to RNA based on their strand. The positive pseudouridine sites from RMBase were masked with BedTools along with the 25 bases flanking upstream and downstream. Running windows of 51 base pair fragments were generated using Seq-Kit [34]. Only those fragments with uridine at their center were considered for further prediction. A precision threshold of >0.99 was used to predict whether the uridine site is a potential W site.
The GO, pathway, and disease enrichment was performed for genes containing the predicted sites using clusterProfiler [35]. Network construction was based on GO semantic similarity with each edge representing the semantic similarity score between two genes. The GO semantic similarity scores were calculated using GOSemSim [36], and the network construction was done using Cytoscape [37] and a ClueGO plugin [38]. Motif visualization was based on ggseqlogo [39].

Results
We compared MU-PseUDeep, which used both sequence and secondary structure context as features, with the one using the sequence-only context (a deep learning model which closely resembles iPseU-CNN) or only the secondary structure context as input. The results of MU-PseUDeep indicate a significant improvement in performance in comparison to either only-sequence CNN or only secondary structure CNN. The improvement was by 3-4% for accuracy and F1 and up to 9% for sensitivity in the balanced dataset in comparison to sequence CNN (which had proved to be better than a secondary CNN structure). Similarly, for the imbalanced dataset, our combined model outperformed the sequence CNN with a 2% accuracy. The improvement was also 2% for F1, up to 4% for MCC, and up to almost 7% for sensitivity as shown in Table S1. Fig. 2 shows the Precision-recall curves of the optimized models for all three species for both balanced and imbalanced test data.  Table 1 for both balanced and imbalanced datasets. For balanced human data, the performance metrics of our model in comparison to PseUI [18] improved on average by 7% for accuracy, 10% for F1 score, 46% for MCC, 5% for sensitivity, and 19% for specificity. Similarly, in comparison to iRNA-PseU [17], the performance metrics of our model improved by 8% for accuracy, 14% for F1, 53% for MCC, and 13% for sensitivity and specificity. Likewise for XG-PseU [40], the accuracy improved by 11%, F1 score by 14%, MCC by 69.5%, sensitivity by 9.2%, and specificity by 22.9%. For the imbalanced data, the performance metrics improved in comparison to PseUI by 30% for accuracy, 41% for F1, 64% for MCC, 5% for sensitivity, and 18% for specificity. In comparison to iRNA-PseU, our method improved by 24% for accuracy, 37% for F1, 61% for MCC, 13% for sensitivity, and 11% for specificity as shown in Table 1 and Fig. 3. Correspondingly, in comparison to XG-PSeU, we saw improvements in accuracy by 35.6%, F1 score by 54.2%, MCC score by around 94.2%, and specificity by 23.4%. Similar improvements were noticed for both the mouse and yeast data as well, as shown in Table 1 and Figures S2-S6. The performance of the MU-PseUDeep model was further assessed by visualizing t-SNE plots of the feature map of the deep learning model. Fig. 4(a) and (b) shows a good separation between positive and negative classes. Similar results were observed for mouse and yeast datasets as shown in Figs. S7 and S8. We clustered the last feature map of our deep learning model on the whole positive dataset as shown in Fig. S9. Subtle differences can be seen between clusters of fragments for nucleotides surrounding the W site within the positive class as shown in    5, and Figs. S10 and S11 for human, mouse and yeast, respectively. Furthermore, using secondary structure context, we improved the sensitivity of our model.

Identification of W sites in human transcriptome
We applied MU-PseUDeep to scan the protein-coding genes in the human transcriptome, which identified 2441 genes with one or more predicted W site at the >0.99 precision threshold (details about genomic region can be found in the supplemental Excel file, ''Supplemental_table2.csv"). Among them, 284 genes already had one or more known W site as documented in the RMBase. Func-tional enrichment of all 2441 genes indicated a few interesting categories, namely 'guanyl-nucleotide exchange factor activity,' 'DNAbinding,' 'Protein-binding,' as shown in Fig. 6. Likewise, the KEGG pathway enrichment against the whole gene-set background resulted in several enriched pathways namely 'Cushing syndrome,' 'cortisol synthesis,' and 'Hippo signaling pathway.' Network visualization of some of the most functionally similar genes based on GO semantic similarity score >0.9 indicates how some of the genes which contain known as well as predicted W sites are strongly connected with the ones which have one or more predicted W site.
Some of these genes belong to a specific functional/pathway category as shown in Fig. 5, and Supplementary Figs. S12 and S13. Some of these genes are enriched in signaling pathways that have potentially an important role to play in brain functions. Our prediction results justify our hypothesis of the potential importance of RNA secondary structure that is critical for PUS (pseudouridine synthase) to successfully catalyze the Pseudouridylation reaction.

Discussion and conclusion
Our transcriptome scanning results indicate how most genes enriched for predicted W sites have a role in nucleotide and protein binding. In addition, enrichment of these genes in certain cancer pathways, cholinergic pathways, and calcium and potassium gated ion channel activity implies their potential involvement in some types of cancers as well as brain disorders, which has already been demonstrated for W in t-RNAs and r-RNAs.
Literature mining for W along with any of the PubMed search terms related to enriched pathways, diseases, and molecular functions, as well as biological components and cellular compartments revealed some interesting relationships between W and insulin secretion [41]. One of the enriched molecular function terms is the ''guanyl-nucleotide exchange factor" activity. It is known that W and other modified ribonucleosides play an important role in inter-nucleotide bond formation by means of guanyl-specific ribonucleases [42]. W is also known to bind protein phosphatase in some bacterial species [43][44][45]. Previous research has also indicated the importance of W in regulating neuronal functions [46], which is corroborated by the enrichment of biological processes shown in Fig. S12(a) and (b). Disease gene network enrichment articles have shown a relationship of W to brain disorders as shown in Fig. S12(c), which is consistent with an earlier study suggesting that W has a role in mental retardation [12]. Other evi-dence of W's role in neural disorders is by elevated levels of W in the urine of mild to moderate-severe Alzheimer patients [47]. Pseudouridylation has also been linked to high oxidative stress, which is known to be one of the risk factors for increased neurodegeneration [48]. W modification has also been linked to myotonic dystrophy [49], a type of genetic neuromuscular disease, which is associated with intellectual disability-another enriched term from the disease gene network database for our list of genes containing putative W sites. This is perhaps the only method that utilizes RNA secondary structure context along with sequence as featured in W site prediction using a deep learning architecture. We significantly improved upon the performance of existing methods by incorporating both secondary structure and sequence information. Our method has shown considerable improvement in terms of accuracy, F1 score, MCC, sensitivity, and specificity for both balanced and imbalanced datasets over the existing tools including PseUI and iRNA-PseU.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement
The authors would like to thank Juexin Wang and Chunhui Xu for helpful discussions, and Northeast Normal University for their GPU resources. Fig. 6. Pathway enrichment of predicted W site containing genes. After transcriptome scan the genes containing one or more W sites were examined for gene enrichment in pathways, GO ontology and disease enrichment (Fig. S12) (a) the KEGG pathway enrichment of genes containing one or more predicted W sites at >0.99 prediction threshold and (b) gene ontology (GO) molecular function enrichment of genes containing one or more predicted W sites.