An Integrated Method for Detecting Micro RNA Target Proteins through Reverse-phase Protein Arrays

Objective: Understanding functions of microRNAs (or miRNAs), particularly their effects on protein degradation, is biologically important. Emerging technologies, including the reverse- phase protein array (RPPA) for quantifying protein concentration and RNA-seq for quantifying miRNA expression, provide a unique opportunity to study miRNA-protein regulatory mechanisms. One naïve way to analyze such data is to directly examine the correlation between the raw miRNA measurements and protein concentrations estimated from RPPA. However, the uncertainty associated with protein concentration estimates is ignored, which may lead to less accurate results and significant power loss. Methods: We propose an integrated nonlinear hierarchical model for detecting miRNA targets through original RPPA intensity data. This model is fitted within a maximum likelihood framework and the correlation test between miRNA and protein is assessed using Wald tests. We compare this model and the simple method through extensive simulation studies and a real dataset from the Cancer Genome Atlas (TCGA) project. Results: This integrated method is shown to have consistently higher power than the simple method, especially when sample sizes are limited and when the RPPA intensity levels are close to the boundaries of imaging limits. Conclusions: Our proposed method is powerful in detecting miRNA’s protein target through RPPA. We recommend this method in practice.


Introduction
MicroRNA (miRNA) is a set of small, non-coding RNA molecules that can post-transcriptionally regulate a broad range of gene expression in both plants and animals. They have been suggested to be involved in many important biological processes, such as normal physiological development and disease onsets. In the past decade, many efforts have been put to search for the miRNA targets [1]. Although our understanding on some miRNAs has been dramatically improved, as of today, the targets of many others remain largely unknown. Therefore, powerful methods for efficient detection of miRNA targets are still in great need.
In general, miRNA regulates the expression of its target genes through two mechanisms-mRNA degradation or translation inhibition. That is, if a miRNA and its target gene can complement extensively, the miRNA-mRNA target may form a double-strand RNA (dsRNA) structure, after which, the mRNA can be cleaved and degraded to lower the mRNA expression and subsequently protein expression [2,3]. On the other hand, if a miRNA and its target can only complement partially, the target mRNA will not be directly degraded but its translation may be repressed [4,5]. So, in both mechanisms, the total protein level relating to the miRNA targets would be reduced, resulting in their functional losses.
Based on the phenomenon that the sequences of miRNAs and their target genes complement to each other, or at least partially, one way of the miRNA target identification is through in silico prediction. Several software tools have been developed for such purpose, each with its own unique feature. For example, miRanda scored the likelihoods of mRNA down-regulation according to a regression model that is trained on sequence and contextual features of the predicted miRNA::mRNA duplex [6], and Target Scan studies on the miRNA::mRNA duplex interactions according to a thermodynamics-based modeling and comparative sequence analysis [7]. Based on these computational tools, several databases with predicted miRNA targets have been generated (microRNA.org and targetscan.org). However, one major limitation is that they all suffer from large percentage of false positives, which hinder their practical usage.
Another popular way to determine the miRNA targets is through experimental data by measuring downstream effects of miRNAs. Since miRNAs can induce protein reduction via both functional mechanisms, protein expression seems to be the right mark. However, due to difficulties in high-throughput quantification of protein expression but relative ease in that of mRNA, conventionally, scanning of the miRNA targets is mainly through testing negative correlations between miRNAs and mRNAs. For example, high-throughput techniques, such as miRNA and mRNA gene microarray, can be applied to measure their expression levels, and then the correlations analyses can be conducted subsequently to filter out miRNA-mRNA pairs that show significant negative correlations as potential candidates for further analyses [8]. More recently, with the advent and rapid advance of sequencing techniques, the miRNA sequencing (miRNA-seq) and RNA sequencing (RNA-seq) platforms have become more and more popular for the quantification of the miRNA and RNA expressions. The main drawback for the miRNA/mRNA correlation analyses is that they can only identify miRNA targets that may change at mRNA levels, but is determined to fail for those modulated through translation inhibition. Therefore, miRNA/mRNA correlation analysis is only able to find partial targets of miRNAs.
It is only until recently that a new technology, called the reverse phase protein array (RPPA) or protein lysate array has been developed to simultaneously quantify the protein expression in a large sample cohort [9]. In contrast to the usual array design that quantifies expression levels of multiple genes in one sample, the RPPA measures the protein expression levels of many samples on one array. Usually, a 2-fold serial dilution of samples is used to avoid signal saturation for very high protein concentration. Since the RPPA data are measured through image signals, which are characterized by background noises at the low end and saturation signals at the high end, a sigmoid-like response curve model has been used to estimate protein concentration [10,11]. Additionally, some flexible nonparametric joint sample models have also been proposed [12]. In principle, we can directly use the protein concentrations estimated from these methods to correlate with miRNA expression for miRNA target identification. However, in this case, the uncertainties associated with the protein concentration estimates are ignored, leading to less accurate and less powerful inference.
In this article, we propose an integrated hierarchical model to detect miRNA targets based on protein expression data measured by the recently emerged RPPA and high throughput miRNA-seq data. Extensive Monte Carlo simulations have been performed to examine the performance of our proposed model under different scenarios. This model was further applied to a real dataset from TCGA to demonstrate its practical use.

PPA
To quantify the protein expression on a RPPA array, a sigmoidal model is commonly assumed to describe the relationship between the intensity level and the protein concentration with additive error [10,13] where is the gray-level intensity from sample at th dilution, =1,…, and =1,…, , is the binary logarithm of the median effective protein concentration level (EC 50 ), a single quantity per dilution series to represent the concentration of the protein, + is the binary logarithm of the protein concentration after h dilution where (1 ) 2 ε i j is the error term assumed to have a normal distribution with mean 0 and variance 2 1 σ , and ={ 1 , 2 , 3 }. Since 1 lim ( ) 1 is interpreted as the lowest intensity level without noise, and 2 is the increment from the lowest to the highest intensity or the saturation level.
To estimate the relative protein concentration in RPPA, one algorithm based on logistic model fitting used by Hu [12] is given as follows: The initial intensity data are first transformed as and initial are set as The initial median effective protein concentration level are estimated by using: , , ) β β β were calculated based on the following model [14]: After obtaining , β β β , the nonlinear least-square method is used again to update the relative protein level ={ , =1… } and 2 1 σ .
This iteration continues until convergence.

A naïve model for correlating miRNA and protein expression
Since ={ , =1,2,… }, the log2 transformation of the median effective protein concentration levels, can be estimated from the RPPA, a straightforward way to examine the relationship between miRNA and protein expression levels is through Pearson's correlation coefficients or simple linear regression models, which is referred as the naïve model in this article.
The linear relationship between protein and miRNA in the naïve model can be expressed as: and { , =1,2… } are log-transformed expression levels of a specific miRNA from sample =1,2… .
The parameter estimates 1 2 0ˆ, , α α α can be calculated by using linear regression. 2 α is our parameter of interest, describing the correlation between a miRNA and protein pair. We can test 0 : 2 =0 to determine if a particular pair of miRNA and protein is related or not.

A hierarchical model for correlating miRNA and protein expression
Although the naïve method is straightforward, uncertainty associated with protein concentration estimates is ignored. As demonstrated in later sections, it leads to less accurate results and significant power loss. Here we propose a nonlinear hierarchical model for studying the relationship between miRNA and protein expression, in which the correlation analysis is integrated with the estimation of protein concentration. The model is given as follows: Here (.) is a general function to describe how , the protein level, and , the miRNA expression level, is related. = { 1 , 2 , 3 } is the parameter vector of the response curve function (.). To directly is the number of quadrature points which is set to 5 in our analysis. and denote the standard Gauss-Hermite abscissas and weights; ̂ minimizes (10) and Γ is the Hessian matrix from the minimization.
With the approximate likelihood function, we employed the quasi-Newton algorithm for parameter estimation. Unlike the Newton algorithm, approximate Hessian matrix was used instead of true Hessian matrix in the quasi-Newton algorithm. The step size was calculated by quadratic interpolation and cubic extrapolation and BFGS method was used to update the approximate Hessian matrix. A grid searching with center from the estimates of Naïve model was applied in our algorithm.

Hypothesis testing
Since it is expected that miRNA negatively regulate the protein level of its target gene, to test if there is a significant relationship between a specific pair of miRNA and protein, the hypothesis test can set up as a one-sided test: Once the maximum likelihood estimates are obtained, a likelihood ratio test (LRT), a Wald test or a Score test can be constructed. However, LRT can be very time consuming and is not appropriate for one-sided test. For Score test, the confidence interval of 2 is difficult to be calculated. Thus Wald tests were used in our simulation and real data example, and its test statistic is: where () represent the fisher information matrix of the likelihood function. The null hypothesis was rejected when p-value was above 0.05 in our simulation study.

Simulation studies
Extensive simulation studies were carried out to examine the performance of our proposed integrated model and to compare with the naïve model approach. Protein intensities were generated by using a sigmoidal response curve (Figure 2a). And a typical miRNA expression distribution in the TCGA data was borrowed in this simulation to mimic the real data and generate protein EC 50S (Figure 2b). Also, the true values of { 1 , 2 , 3 , 0 , 1 } were set as {50, 30000, 1, 1, and 300} to mimic parameter values estimated from a real TCGA ovarian cancer data set. Different strengths of correlation between miRNA and protein expression levels, as characterized by 2 , were examined in a range from 0, which represents the null hypothesis, to -1.5, which yields the power of 1 for the integrated method. In order to investigate the performance of two models with protein intensity values located in different areas of the sigmoidal curve, 1 was set as 0 and 5 corresponding to the middle part and upper part of sigmoidal curve, respectively. The upper part of a sigmoidal curve corresponds to a scenario where most of intensity levels are close to the saturation point. The RPPA intensity levels range between 10 and 30100. An illustration of the sigmoidal curve used to generate simulated data was showed in Figure 2a. The locations of compare with the naïve model, we assume ( ) to be linear in equation (6), that is, We further assume that the two error terms, and are independent of each other. In this hierarchical framework, the relationship between miRNA and protein expression level can be estimated without explicitly quantifying the protein concentration based on intensity data first. This model is referred as the integrated model hereinafter.

Computational algorithm
The unknown parameters ={ 1 , 2 , 1 , 2 , 3 } can be estimated within the maximum likelihood framework. The adaptive Gaussian quadrature method is used to approximate the integral and the dual quasi-Newton method can be further applied in maximizing the likelihood function [15][16][17]. Our computational algorithm is illustrated in Figure 1. This algorithm is implemented in SAS nlmixed procedure and a SAS macro is available upon request to fit the integrated model.
Our likelihood function can be written as Step 0: estimate the initial value of , 0 , 1 , denoted as (0), by using the naïve model; Step 1: generate the approximate likelihood function by using adaptive Gaussian Quadrature method; Step 2: compute the quasi-Newton direction Δ , determine the step size to satisfy the Goldstein conditions; Step 3: update parameters value; Step 4: update Hessian matrix; Step 5: check if the iteration stops. If not, go to Step 2, if yes, the iteration stops. A grid searching with center from the estimates of naïve model was applied in our algorithm.
protein intensity center were marked by circles. If simulated intensity values are beyond the imaging boundary, they would be replaced with the boundary value with small error (Gaussian distributed with mean 0 and standard deviation 5). 1000 simulations were carried out for each parameter setting under different sample sizes (N=20, 50, 100 and 300). Generally, there are 5 diluted samples in one dilution series, so =5 were used in our simulation setting. Pre-specified Type I error was set to be 0.05.
The false positive rates and detection powers for miRNA targets for both the integrated model and the naïve model under different sample sizes were shown in Figures 3 and 4. It is clear that when there was no relationship between miRNA and protein ( 2 =0), both models can well control the pre-specified type-I error when sample size were bigger than 50. The integrated model was consistently more powerful than the naïve model, especially when the RPPA intensity levels are close to the boundaries of imaging limits ( Figure 4). Table A1-A2 in appendix listed the detailed point estimates of all parameters and their corresponding standard errors. Figure A1 and A2 in appendix illustrated the variations of the point estimates of 2 under different simulation settings. The integrated model consistently yielded parameter estimates of 2 with similar or much less standard errors than those from the naïve model. When the RPPA intensities reached the upper flatter part of the sigmoidal curve, which caused information loss because of intensity level truncation at the saturation points, both the naïve and the integrated method over-estimated 1 , which represents the lower imaging limits. However, in this situation the integrated method still had a much larger detection power than the naïve method ( Figure 4).

Analysis of TCGA Ovarian Cancer Data
Both models were applied onto an ovarian cancer dataset from the TCGA project. In this dataset, there were 333 ovarian cancer samples with both miRNA and RPPA data available. 352 miRNAs having more than 50% of non-zero counts and 165 proteins were included in our analyses.
The results from both naïve and integrated models on predicting miRNA targets were reported in Table 1. False Discover Rate (FDR) at 10% was used to adjust for multiple testing [18]. The integrated model approach we proposed found 1106 potential miRNA-protein pairs, 797 of which were on non-phosphorylated protein array. 822 pairs were found on non-phosphorylated protein array: 250 out of them were found by integrated model only and 25 pairs were found by naïve model only. Integrated model found significantly more number of potential miRNA-protein pairs (P<0.0001 according to McNemar's test). Furthermore, we compared our results with miRNA targets identified by miRanda algorithm [19][20][21][22]. 98 targets, which were found by both the integrated and the naïve model, and 31 targets, which An illustration of (a) a sigmoidal shape response curve. When 1 was set to be 0, the center of the EC50s would located at 0; When 1 was set to be 5, the center of EC50s would located at 5; (b) distribution of a typical miRNA expression in the TCGA data we used in the simulation. and the integrated model (dashed line) according to different simulation scenarios: sample size ranged from 20 to 300 and the protein intensities were located in the middle part of a sigmoidal curve. Detection powers (type I error if 2 =0) denoted by 1 and 2 under different correlation strengths were report on the bottom of each plot for the naïve and integrated models, respectively. Both models can well control the pre-specified type-I error when sample size were bigger than 50. Two models had similar detection performance, especially when sample size increased.

Figure 4:
Power curves of the naïve model (solid line) and the integrated model (dashed line) according different simulation scenarios: sample size ranged from 20 to 300 and the protein intensities were located in the upper part of a sigmoidal curve. Detection powers (type I error if 2 =0) denoted by 1 and 2 under different correlation strength were report on the bottom of each figure for the naïve and integrated models, respectively. Both models can well control the pre-specified type-I error when sample size were bigger than 50. The integrated model was consistently more powerful than the naïve model.   tau-m-c hsa-mir-142 MAPT cdk1-r-v hsa-mir-1247 CDC2 Table 3: A list of miRNA/protein pairs suggested by the naïve model and the integrated model and they were classed into three groups: "Found by the integrated model only", "Found by both the integrated model and the naïve model" and "Found by the naïve model only". Pairs found by the integrated model were sorted by ascending order of adjusted p-values from the integrated model and the naïve model, respectively. Pairs found by the naïve model only were sorted by ascending order of adjusted p-values from the naïve model andintegrated model, respectively.In addition, a number "1" will mark under the column for pairs found by MirTarbase, MirTarBase with strong experimental evidences or miRanda, and those pairs will be list on the top of each group after pairs got ordered. igf-1r-beta-r-c hsa-mir-223 IGF1R 1 1 igf-1r-beta-r-c hsa-mir-139 were found only by the integrated model, were confirmed by miRanda database. However, only 6 targets found by the naïve model only were confirmed by miRanda database. MirTarbase, a dataset based on manually surveying pertinent literature [23] was used to further verify our results. 15 suggested targets found by both the integrated and the naïve model were supported by the MirTarBase dataset. 11 of the 15 suggested targets found by both the integrated and the naïve model were supported by strong experimental evidences according to the MirTarBase dataset. One suggested target found by the integrated model only were supported by strong experimental evidences according to the MirTarBase dataset. None of the suggested targets found by the naïve model only were supported by strong experimental evidences according to the MirTarBase dataset. This suggests that there could be a number of undiscovered miRNA targets included in the findings of integrated and naïve models. The list of 822 miRNA/protein pairs was included in the appendix (Table A3).

Discussion
The traditional way to detect direct targets of miRNA using miRNA-mRNA experiment method is limited, due to the fact that miRNAs may regulate their targets post-transcriptionally. In addition, other computational methods, which were based on optimal sequence complementarity of miRNA and mRNA, suffer from large percentage of false positives and of limited practical use. Taking the advantage of recent technique advance in measuring of miRNA expression and protein concentration levels in a high-throughput scale, we proposed to search for potential miRNA targets through a nonlinear hierarchical model. Computationally, this integrated model measures the correlation between miRNA and its targeting protein without making estimation of protein expression levels first as in the naïve method. We used both simulation studies and an application to the real data to compare our proposed method and the naïve method. Our simulation results suggested that both integrated and naïve methods can well control their type-I errors, however, the integrated method consistently showed higher detection powers than the naïve method under different scenarios, particularly when the protein intensity values were located close to the saturation point or the background noise level. In the real data example, our proposed integrated method detected much more potential miRNA targets than the naïve method. Furthermore, the number of potential miRNA targets, which can be confirmed by computational methods or literatures, is larger in the integrated method than that in the naïve method.
A significant association between a miRNA/protein pair can be either direct or indirect. For example, a miRNA may directly target and degrade a transcription factor (TF), which in turn induces indirect cascading effects of down-regulating the TF's target genes. The association analyses from the simple or our integrated model would reveal both direct and indirect associations. In contrast, the other computer-based algorithms, e.g. miRanda, can only predict direct miRNA targets based on sequence comparison. In the real data analyses (Table 1), the relatively smaller percentage of overlap between our findings and miRanda database suggests that our algorithm may detect more indirect targets. This is sound since our algorithm is more powerful, as demonstrated by our simulation studies, and hence is capable of detecting smaller indirect associations. With the crossreference to miRanda database, those direct miRNA targets of more biological relevance could be filtered out to serve as top candidates for further biological validations. It is worth noting that our algorithm can indeed detect more direct miRNA targets in absolute number. Also, in Table 1, the results were based on a FDR of 10% for the multiple test adjustment; however, we also checked a FDR at 5% level and found the conclusion remained the same. That is, the proposed integrated method found more miRNA targets that appear in other existing databases, demonstrating its advantage over the naïve method.
Unknown parameters in our proposed model were estimated within the maximum likelihood framework. Using the asymptotic properties of maximum likelihood estimates, test statistics were straightforward to construct. However, some improvement can be made to further improve the proposed model. For example, we assumed a linear relationship between miRNA and protein to directly compare with the naïve method and to illustrate our model using simple examples, but in reality, the relationship between miRNAs and proteins could follow a nonlinear relationship, such as a dose-response curve. In this case, ( ) can be replaced by other parametric or nonparametric functions. With some simple modifications, our model can be easily extended to relax these assumptions. Additionally, in this article the random error terms for different dilution steps were set to be independent and identically distributed as proposed in other RPPA analysis papers [10]. However, it is possible that the errors may be highly correlated. In this case, more complicated dependence matrix among serial dilution steps can also be readily incorporated into our model framework.