Inferring RNA sequence preferences for poorly studied RNA-binding proteins based on co-evolution

Background Characterizing the binding preference of RNA-binding proteins (RBP) is essential for us to understand the interaction between an RBP and its RNA targets, and to decipher the mechanism of post-transcriptional regulation. Experimental methods have been used to generate protein-RNA binding data for a number of RBPs in vivo and in vitro. Utilizing the binding data, a couple of computational methods have been developed to detect the RNA sequence or structure preferences of the RBPs. However, the majority of RBPs have not yet been experimentally characterized and lack RNA binding data. For these poorly studied RBPs, the identification of their binding preferences cannot be performed by most existing computational methods because the experimental binding data are prerequisite to these methods. Results Here we propose a new method based on co-evolution to predict the sequence preferences for the poorly studied RBPs, waiving the requirement of their binding data. First, we demonstrate the co-evolutionary relationship between RBPs and their RNA partners. We then present a K-nearest neighbors (KNN) based algorithm to infer the sequence preference of an RBP using only the preference information from its homologous RBPs. By benchmarking against several in vitro and in vivo datasets, our proposed method outperforms the existing alternative which uses the closest neighbor’s preference on all the datasets. Moreover, it shows comparable performance with two state-of-the-art methods that require the presence of the experimental binding data. Finally, we demonstrate the usage of this method to infer sequence preferences for novel proteins which have no binding preference information available. Conclusion For a poorly studied RBP, the current methods used to determine its binding preference need experimental data, which is expensive and time consuming. Therefore, determining RBP’s preference is not practical in many situations. This study provides an economic solution to infer the sequence preference of such protein based on the co-evolution. The source codes and related datasets are available at https://github.com/syang11/KNN. Electronic supplementary material The online version of this article (10.1186/s12859-018-2091-8) contains supplementary material, which is available to authorized users.


Measuring the co-evolution between RBPs and their binding sites
The co-evolution measurement approach was used in our previous study for DNA-protein interaction (please refer to the article for details) [2], and was derived from the "mirror tree" method originally used in the protein-protein co-evolution by Pazos et al. [4]. The intuition is that we can build a multiple sequence alignment with a set of RBP sequences from the same family but different species (i.e. orthologs), then derive a phylogenetics tree. If each RBP has a corresponding RNA binding site and the RBP co-evolves perfectly with the RNA, then we could construct a phylogenetics tree which looks perfectly symmetrical (like a mirror) to the RBP tree from these RNA sequences. In reality, however, it is unlikely to observe exactly mirrored trees since the co-evolution is usually not perfect. Also, due to the limited data available at present, the proteins are not necessarily orthologs (could be paralogs or mix). As suggested in the Pazos et al.'s study [4], it is not necessary to construct the phylogenetics trees since the structure information of the phylogenetics tree is comprised in a similarity matrix. By constructing the similarity matrices, the "mirror tree" method is independent of any given tree-construction method. Therefore, instead of the phylogenetics trees, we constructed a pairwise similarity matrix based on the protein sequence identity for each protein family. We used ClustalW [5] to build a multiple-sequence alignment and computed pair-wise sequence similarities. Correspondingly, we also constructed a pairwise similarity matrix for RNA motifs i.e. PWMs. We used several different metrics to compare the similarity of a pair of PWMs, which we will describe in the next paragraph. Finally, to quantify the co-evolution between RBPs and their binding motifs, we compared the RBP pairwise similarity matrix with the RNA pairwise similarity matrix directly by concatenating all the rows in each matrix into a vector and computing a Pearson's correlation coefficient, as suggested in the Pazos et al.'s study [4].

PWM-PWM pairwise similarity
The main principle to compare a pair of PWMs is as follows: assuming a PWM contains 4 rows (one for each base) and k columns (one for each position), to compare two matrices, we only need to compare the similarity values of individual column pairs and use them to compose the overall value on matrix level. To measure the similarity, we tried two different metrics. The first one is Pearson's correlation, again. PCC has been suggested as a standard way for DNA PWM comparison and has superior performance than other measures such as Euclidean distance or Chi-squares according to several studies [6][7][8][9]. We used PCC (different from co-evolution PCC) as the similarity measure to compare two PWMs by computing the correlation of each position (represented by column vector) pair of the two PWMs then averaging over all positions, as suggested in Yang et al. [2].  Figure S1: PCCs for shuffled pairs. The figure shows PCCs from 1000 randomly shuffled sets. For each set, a PCC is computed using random KH RBP, RRM PWM pairs from the same species at RBR level.

Assess the significance of co-evolution
As suggested in Yang et al.'s study [2], here we used both a nonparametric test and a parametric test to assess the significance of the co-evolution PCC. In brief, for the nonparametric test, first we permuted the protein sequences and the PWMs 1000 times to generate a background pool of the random sequences and the PWMs which have the same compositions and lengths as the original ones. Then, the real PCC was rank tested against a null distribution of PCCs which were computed from the permuted proteins and PWMs. For the parametric test, we converted the real PCC to a test statistics following a t-distribution, followed by a two-tailed test against the null hypothesis that PCC=0.

Control for the effects of speciation
It is worth to note that each of the RRM-FL, RRM-RBR, KH-FL, and KH-RBR set contains proteins from multiple species. The observed correlation could be simply due to speciation. If it is true, we should see similar correlations between the KH's PWMs and RRM's RBPs or RRM's PWMs and KH's RBPs from the same distribution of species. Also, as we observe from Table 3 in the main paper, KH-RBR and RRM-RBR did not show statistically significant PCCs in the parametric test. Are these correlations really strong or not? For these purposes, we calculated the PCCs between the randomly paired RRM's PWMs and KH's protein sequences, and vice versa, keeping species the same, as suggested in [2]. The PCCs from both shuffled sets were much smaller. And the PCC values in Table 3 always ranked the highest in a 1000 times random shuffling pool (Supplementary Figure S1). The tests suggested that the correlations in Table 3 were not due to speciation, and they are strong.

The importance of sequence or structure preference alone
To assess the importance of the sequence information alone, RCK provided the functionality of sequence mode (run the program with -q option) which ignored all input structural probabilities by setting them to uniform distribution instead. When applying this functionality to KNN-RCK, this was equivalent to taking our PWM computed sequence scores as features and fitting a nonlinear model to training data. Moreover, similar to RCK's native sequence mode, KNN-RCK had the functionality of structure mode which ignored all sequence information by simply setting the input PWM to uniform distribution at each position (i.e. 0.25 for all entries in the matrix). These modes provided the flexibility for us to separately examine the importance of the sequence or structure information alone in RBP binding events. We demonstrate this in Supplement Figure S2, using two datasets: a subset of the InVitro dataset and the InVivoRay dataset. The subset of the InVitro dataset contained all the overlapped entries between the full InVitro dataset and the InVivoRay dataset. As shown in Supplement Figure S2A, when evaluated on the InVitro dataset, the performance of structure alone (average PCC=0.409) was almost the same as the full model (0.417) with no statistically significant difference. While the performance of sequence mode (0.368) was significantly worse than the full model (p-value=0.022), and also worse than structure mode (p-value=0.050). However, when evaluated on the in vivo InVivoRay dataset (Supplement Figure S2B), the performance of sequence mode (average AUC=0.755) was better than structure mode (0.652), and was even better than the full model (0.667). These results were indeed consistent with our observations in Table 4 and Figure 2 that models using structure features did not perform as well as sequence models on the InVivoRay dataset, because the training data were short RNA probes which had weak structures and the testing data were long RNA segments which formed much more structures (i.e. the training and testing data were not quite identically distributed). Nevertheless, if in the future we have more structured RNA data so that we can train on them, then we may see better performance when incorporating structure information to the model.