RNAcontext: A New Method for Learning the Sequence and Structure Binding Preferences of RNA-Binding Proteins

Metazoan genomes encode hundreds of RNA-binding proteins (RBPs). These proteins regulate post-transcriptional gene expression and have critical roles in numerous cellular processes including mRNA splicing, export, stability and translation. Despite their ubiquity and importance, the binding preferences for most RBPs are not well characterized. In vitro and in vivo studies, using affinity selection-based approaches, have successfully identified RNA sequences associated with specific RBPs; however, it is difficult to infer RBP sequence and structural preferences without specifically designed motif finding methods. In this study, we introduce a new motif-finding method, RNAcontext, designed to elucidate RBP-specific sequence and structural preferences with greater accuracy than existing approaches. We evaluated RNAcontext on recently published in vitro and in vivo RNA affinity selected data and demonstrate that RNAcontext identifies known binding preferences for several control proteins including HuR, PTB, and Vts1p and predicts new RNA structure preferences for SF2/ASF, RBM4, FUSIP1 and SLM2. The predicted preferences for SF2/ASF are consistent with its recently reported in vivo binding sites. RNAcontext is an accurate and efficient motif finding method ideally suited for using large-scale RNA-binding affinity datasets to determine the relative binding preferences of RBPs for a wide range of RNA sequences and structures.

counts, or expected word counts, is done as part of a procedure that iterates between attempting to locate the NBP binding sites within the input sequences and refining the fit of the motif model [2,12].
Formally, let {[P-s_1], [P-s_2], . . . , [P-s_N]} be the counts (or expected counts) of how often various words {s_1, s_2, . . . , s_N} appear as binding sites of protein P in the input set. Let Pr(s; Θ) be a word frequency motif model, where Pr(s; Θ) is a probability distribution over words s parameterized by Θ. Often, but not always, the support of Pr(s; Θ) is NA words of a fixed, pre-defined length K. The parameters Θ can be learned by optimizing the fit to the empirical word probabilities, i.e.,

Pr(s_i; Θ) ≈ [P-s_i] / Σ_{j=1}^{N} [P-s_j].

Until recently, NBP binding preferences were estimated from a relatively small number of binding sites defined in vivo or through low-throughput in vitro selection procedures. This small number of observations was insufficient to reliably estimate Pr(s_i; Θ) for each word, and as such word probabilities for fixed-length motif models were estimated using a product-multinomial model, commonly known as a position frequency matrix (PFM), in which the distributions of bases at each position in the word are independent. In the PFM,

Pr(s_i; Θ) = ∏_{k=1}^{K} Θ_{k,s_i(k)},

where s_i(k) indicates the k-th base in word s_i. This model only contains a small number of parameters, 3K, and its maximum likelihood estimate is easily found by setting Θ_{k,s_i(k)} = f_{k,s_i(k)}, where f_{k,s_i(k)} is the frequency of s_i(k) at position k. However, because PFM models are inaccurate representations of transcription factor (TF) binding affinity [13,14], a variety of more complex probability distributions have been developed to model interactions between bases [15-19] or variable spacing between binding sites of obligatory heterodimers like bZIP or bHLH proteins [20].
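As an illustration, the maximum-likelihood PFM estimate above can be sketched in a few lines of Python. This is a toy sketch, not code from RNAcontext: the word counts, the RNA alphabet, and the function names are invented for illustration.

```python
from collections import Counter

BASES = "ACGU"

def fit_pfm(word_counts, K):
    """Maximum-likelihood PFM: theta[k][b] is the frequency of base b
    at position k among the (weighted) binding-site words."""
    totals = [Counter() for _ in range(K)]
    for word, count in word_counts.items():
        for k, base in enumerate(word):
            totals[k][base] += count
    n = sum(word_counts.values())
    return [{b: totals[k][b] / n for b in BASES} for k in range(K)]

def pfm_prob(word, theta):
    """Pr(s; Theta) = product over positions k of theta[k][s(k)]."""
    p = 1.0
    for k, base in enumerate(word):
        p *= theta[k][base]
    return p

# Toy input: observed counts of 3-mers selected as binding sites.
counts = {"ACG": 6, "ACU": 3, "GCG": 1}
theta = fit_pfm(counts, K=3)
print(round(pfm_prob("ACG", theta), 4))  # 0.63 = 0.9 * 1.0 * 0.7
```

Because the per-position frequencies are estimated independently, the PFM assigns positive probability to words such as "GCU" that were never observed, which is exactly the smoothing behavior that made PFMs attractive when binding-site collections were small.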

Affinity-based motif models
These models are based on physical principles of protein-ligand interactions. In particular, consider the equilibrium reaction of binding of a protein P to the NA word s:

P + s ⇌ P-s,

where k_on and k_off represent the protein binding and dissociation rates, respectively. The binding affinity of the protein for s can be expressed in terms of its equilibrium constant K_a(s):

K_a(s) = [P-s] / ([P][s]) = k_on / k_off,

where [P], [s], and [P-s] correspond to the concentrations of the unbound protein, unbound s, and the protein in complex with s, respectively. Note that we are using the same notation for concentration as we do for word counts because these two quantities differ only in their units. Affinity-based motif finding methods fit the parameters Ω of their motif models W(s; Ω) by trying to match their affinity estimates for a given word s to those implied by the input, i.e., W(s; Ω) ≈ K_a(s).
Note that, in addition to K_a(s), motif models in this class have also been designed to estimate a number of other measures of binding affinity, e.g. the dissociation constant K_d(s) = K_a(s)^{-1}, the log binding affinity log K_a(s), or the relative binding affinity C·K_a(s) up to an unknown constant C that is independent of s [8]. The popular position weight matrix (PWM) [21,22] and the position-specific affinity matrix (PSAM) [23] are examples of these types of models. Note also that one can derive an estimate of any one of these measures from a model of another, since they are simple transformations of one another (e.g. K_d(s) from a model of K_a(s)).
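A minimal sketch of a PSAM-style relative-affinity model makes the product form concrete. The weight values below are hypothetical, chosen only to illustrate that W(s; Ω) estimates K_a(s) up to an unknown constant C that cancels when comparing words:

```python
# Hypothetical PSAM for a 3-nt site: each entry is a relative affinity
# weight in (0, 1], with 1.0 marking the preferred base at that position.
PSAM = [
    {"A": 1.0, "C": 0.1, "G": 0.2, "U": 0.1},
    {"A": 0.1, "C": 1.0, "G": 0.1, "U": 0.3},
    {"A": 0.2, "C": 0.1, "G": 1.0, "U": 0.5},
]

def relative_affinity(word, psam):
    """W(s; Omega) = product of per-position weights; estimates K_a(s)
    up to an unknown multiplicative constant C independent of s."""
    w = 1.0
    for k, base in enumerate(word):
        w *= psam[k][base]
    return w

# The preferred word scores 1.0; a single substitution rescales the
# estimate by that position's weight.
print(relative_affinity("ACG", PSAM))  # 1.0
print(relative_affinity("ACU", PSAM))  # 0.5
```

The independence assumption is the same as in the PFM, but the weights are fit to match affinities rather than word frequencies; taking logarithms turns the product into the familiar additive PWM score.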

Estimating sequence affinity from word affinity
Rarely do the sequences input into motif finding algorithms consist of delineated binding sites. Furthermore, the input sequences "enriched" for binding sites can contain more than one binding site, or possibly none at all. As such, an important component of any motif finding procedure is a sequence scoring function that takes as input the probabilities or affinities assigned to each word by the motif model and outputs a "score" for the entire sequence that reflects the number of likely NBP binding sites therein and their strength.
Word frequency motif models are usually paired with probabilistic generative models of sequences.
The "score" computed by these generative models for an arbitrary sequence is the probability of generating the sequence under the model. These procedures are often subject to certain constraints on how the sequences were generated, such as the presence of exactly one, zero or one, or an arbitrary number of binding sites per sequence. In the MEME (and MEMERIS [25]) algorithm, for example, these three possibilities are called the OOPS, ZOOPS, and TCM options, respectively. Further refinements to generative models of sequences employ, for example, Hidden Markov Models to model steric hindrances that prevent overlapping binding sites and/or to model clustering of binding sites within cis-regulatory modules. A good summary of recent work in this area is provided in [26]. One advantage of this approach is that it is easy to incorporate competition for NBP binding sites from, for example, nucleosomes [27] or internal RNA secondary structure [25], by assigning a prior probability to possible NBP binding sites according to the strength of competition for the site. A disadvantage of this approach is that the physical interpretation of these generative probabilities becomes difficult when there are multiple binding sites within a sequence.
There remains some controversy about the best approach for scoring sequences using affinity-based motif models. Early algorithms (e.g. [6-8]) used the sum of the affinities of each word in the sequence as an estimate of the NBP affinity for the entire sequence. One criticism of this approach is that the number of proteins bound to the sequence (also called the "occupancy" of the sequence) also depends on the number of proteins initially available for binding. So, if the initial protein concentration is low and the sequence has many high-affinity binding sites, the actual occupancy of the sequence can be much lower than its potential occupancy implied by the estimated affinity.

To address these concerns, some sequence scoring functions use affinity-based motif models to compute a function N(s), the "occupancy" of word s [28-30], that also considers the initial concentration of proteins available for binding. This occupancy, which represents the proportion of words s bound by the protein, can be expressed as follows:

N(s) = [P-s] / ([s] + [P-s]),

which is equal to:

N(s) = 1 / (1 + exp(-log K_a(s) - log[P])).

So, given an affinity-based motif model W(s; Ω) of binding affinity K_a(s) = K_d(s)^{-1}, measured under the same conditions, one can calculate an estimate N̂(s) of occupancy as follows:

N̂(s) = 1 / (1 + exp(-log W(s; Ω) - log[P])).

When fitting an affinity-based model using occupancy-based scoring, one can represent the often unknown constant log[P] with a bias β and train it while fitting the model. Note also that one can adapt an existing affinity-based motif model Ŵ for an NBP to predict occupancy under different experimental conditions (e.g. a change in temperature) by also introducing a scale α, so that the final model becomes:

N̂(s) = 1 / (1 + exp(-α log Ŵ(s; Ω) - β)).
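The occupancy model above is simply a logistic function of the log affinity, and can be written directly as code. This is a sketch in our own notation, not RNAcontext's implementation; the function and parameter names are ours.

```python
import math

def occupancy(log_affinity, alpha=1.0, beta=0.0):
    """N(s) = 1 / (1 + exp(-alpha * log W(s) - beta)): the proportion of
    words s bound by the protein, given a (relative) log binding affinity.
    beta absorbs the often-unknown log protein concentration log[P], and
    alpha rescales affinities measured under different conditions."""
    return 1.0 / (1.0 + math.exp(-alpha * log_affinity - beta))

# With alpha = 1 and beta = log[P] = 0, a word whose dissociation
# constant K_d equals the free protein concentration is half occupied:
print(occupancy(log_affinity=0.0))  # 0.5
```

The saturating shape is the point of the criticism in the text: doubling the affinity of an already high-affinity word barely changes its occupancy, whereas a summed-affinity score would double its contribution.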
Occupancy-based sequence scoring functions include those that simply sum the occupancy of all words in the sequence [30], those that calculate the probability that at least one site in the sequence is bound using the "noisy-OR" function [29], and more complex schemes that consider competitive and cooperative binding [29,31].
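The two simplest occupancy-based sequence scores can be sketched as follows. The per-word occupancy values are toy numbers and the helper names are ours; the independence assumption behind the noisy-OR is as described in [29].

```python
def sum_score(occupancies):
    """Expected number of bound proteins on the sequence: the sum of
    the occupancies of all words in the sequence (as in [30])."""
    return sum(occupancies)

def noisy_or_score(occupancies):
    """Probability that at least one site in the sequence is bound,
    assuming sites are occupied independently (the "noisy-OR" of [29])."""
    p_none = 1.0
    for n in occupancies:
        p_none *= 1.0 - n
    return 1.0 - p_none

# Toy per-word occupancies N(s) for the words of one sequence.
occ = [0.5, 0.2, 0.1]
print(round(sum_score(occ), 4))       # 0.8
print(round(noisy_or_score(occ), 4))  # 0.64 = 1 - 0.5 * 0.8 * 0.9
```

The sum can exceed 1 when a sequence has many sites, which is sensible as an expected protein count; the noisy-OR is bounded by 1, which is sensible as the probability that the sequence is bound at all.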