Prediction of Combinatorial Protein-Protein Interaction from Expression Data Based on Conditional Probability

Proteins are indispensable players in virtually all biological events. The functions of proteins are coordinated through intricate regulatory networks of transient protein-protein interactions (PPIs). To predict and/or study PPIs, a wide variety of techniques have been developed over the last several decades. Many in vitro and in vivo assays have been implemented to explore the mechanism of these ubiquitous interactions. However, despite significant advances in these experimental approaches, many limitations exist such as false-positives/false-negatives, difficulty in obtaining crystal structures of proteins, challenges in the detection of transient PPI, among others. To overcome these limitations, many computational approaches have been developed which are becoming increasingly widely used to facilitate the investigation of PPIs. This book has gathered an ensemble of experts in the field, in 22 chapters, which have been broadly categorized into Computational Approaches, Experimental Approaches, and Others.

In this article, we treat interactions among three proteins. We derive the combinatorial effect level, which emerges only when the three proteins are together, besides the sole effects that emerge between two proteins. The combinatorial effect level is estimated in a statistical manner, which will lead to a better understanding of protein interactions and a guide to deeper investigations.
The remainder of this paper is organized as follows. In Section 2, we describe related work to understand the current state of the art in this research area. In Section 3, we describe the model of protein-protein interactions used in our method, and present the method to retrieve the combinatorial effect of three proteins. In Section 4, we evaluate our method by applying it to real protein expression data, and finally in Section 5 present the conclusions.

Related work
In this section, we give a short introduction to the major approaches used to predict protein-protein interactions.
Many computational methods to predict protein-protein interactions have been proposed. They utilize various kinds of public data such as genome sequences, amino-acid sequences, pathways, domains, 3D structures, motifs, and phylogenetic profiles, to identify a property of protein pairs in order to predict protein-protein interactions. One typical genome-sequence-based technique is based on conservation of gene neighbourhood [3]. This technique assumes that genes with similar functions or genes that are in the same pathways are transcribed together as a single unit known as an operon. Thus, finding two proteins that are neighbours in several genomes suggests that they interact or have similar functions. Another typical sequence-based technique is called the Rosetta Stone method [4] [5]. This method is based on the fact that several pairs of proteins interacting with each other have their homologs in other single proteins, called Rosetta Stone proteins. The phylogenetic profile method [6] uses a series of gene sequences in evolution and detects the set of genes that are simultaneously present or absent in the sequences. Since interacting proteins tend to disappear simultaneously, finding such a set of genes predicts that the corresponding proteins interact. In addition, the in silico two-hybrid system [9] provides a fully alignment-based protein-protein interaction prediction. This technique tries to detect physical interaction of proteins within their 3D structures by means of correlation of sequences of sites among target proteins. Recently, docking analysis using 3D structures of proteins has progressed rapidly. The main difficulty in docking analysis is that there are many potential ways in which proteins can interact, and protein surfaces are flexible. Currently, one of the major approaches is a global search based on the fast Fourier transform [10].
Including the methods introduced in this brief discussion, there are a tremendous number of techniques to predict protein-protein interactions, and their algorithms and results are available in public databases. For more details, see [7] [11].
Boolean networks [12] and Bayesian networks [8] are well known as computational methods to predict interactions from expression data. It is important to note that they treat gene interactions rather than protein interactions since most of them originally suppose microarray data as their source of analysis. However, they can also treat protein expression data.
A Boolean network [12] is a network that represents causal association and it is typically generated from a pattern of time-series expression data. In Boolean networks, the set of expression levels for a sample at time t is regarded as the "state" at time t, where each expression level is typically represented by "1 (expressed)" or "0 (not expressed)." To compute the network, the time-series state transition is analyzed to learn the functions that determine the state at time t+1 from the current state at time t. As a result, the expression level of a protein at time t+1 is determined by the expression levels of several proteins at time t. This dependency indicates a protein-protein interaction, although it does not always indicate a direct interaction. There are several versions and extensions of Boolean networks. Akutsu et al. proposed a model and an algorithm for Boolean networks that are generated from non-time-series expression data [13]. Laubenbacher et al. proposed multistate Boolean networks [14]. However, these models cannot treat noise and, thus, often fail in computing networks. To overcome this problem, Shmulevich et al. proposed a model of probabilistic Boolean networks [15] that allows Boolean networks to be applied to real expression data that include noise.
A Bayesian network [8] is also a model of interactions often used in computational approaches, and it is typically built from expression data with discrete expression levels. A Bayesian network represents a joint distribution of random variables, and a directed edge between two nodes represents a causal association between those nodes. The learning process of a Bayesian network includes the optimization of network topology, where the evaluation of topologies is based on some information criterion, which is typically based on entropy. Note that it evaluates, for each node, the strength of the relationship between the node and its parents in the network, meaning that the sole effects and the combinatorial effects are evaluated together. Later, as an extension of the model, the Dynamic Bayesian network model was proposed [16], which handles time-series expression data. For details of this kind of network learning, there are several survey articles available, such as [17][18].

Expression data used in our method
In this section, we explain the typical representation of protein expression data. Protein expression data represents the expression level of each protein i in sample j. Typically, the number of proteins in the data is several hundred to several thousand, while the number of samples is usually several tens and at most hundreds. Protein expression data is obtained by several methods or devices such as protein arrays, 2D electrophoresis, and mass spectrometry. Among these, we now introduce a 2D electrophoresis-based method [19] as a typical way of generating protein expression data. The process of obtaining protein expression data is somewhat complicated compared to microarray data that measures gene expression levels (see Figure 1). First, we prepare target samples and obtain 2D electrophoresis images from each target sample through an experimental biological process. Second, we identify areas (in the rest of this article we call them spots) of separated proteins using image-processing software and measure the expression level of each spot. Third, we match the spots among different images such that the matched spots indicate the same protein. Finally, we normalize the values of expression levels using a normalization method as a preprocess to the data mining processes. As a result, we have a set of protein expression levels as shown in Figure 2, which shows the expression levels of each protein in each sample.

Combinatorial protein-protein interaction model
The protein-protein interaction model we try to predict in this paper is shown in Figure 3. Three proteins, A, B, and C, are related in this model, where A and B individually affect the expression level of C, but if both A and B are expressed together, they have a far larger effect on the expression level of C. We call the effect from A to C (resp. B to C) the sole effect, and we call the whole effect from A and B on C the total effect. Note that the total effect consists of the two sole effects and the combinatorial effect, and that the combinatorial effect appears only if both A and B are expressed. What we want to retrieve from expression data is the combinatorial effect of A and B on C.
To measure the combinatorial effect, we first estimate the amount of total effect of A and B on C. Then from the estimated total effect level, we subtract the two sole effects, i.e., the effect of A−C and B−C, to obtain the combinatorial effect level.
Note that the three proteins may interact directly or indirectly. We try to extract the three proteins that work in the same functional groups by identifying the behaviour of expression levels following our model of interaction.

Estimating sole and total interaction levels based on conditional probability
We use conditional probability to retrieve this interaction from expression data. The levels of the sole interactions of A−C and B−C are measured by conditional probability, as shown in Figure 4. Namely, the sole interaction effect level of A on C is measured as the ratio of the number of samples in which the expression levels of both A and C are sufficiently high to the number of samples in which the expression level of A is sufficiently high. The total interaction effect of A and B on C is measured in a similar manner, i.e., as the ratio of the number of samples in which the expression levels of A, B, and C are all sufficiently high to the number of samples in which the expression levels of both A and B are sufficiently high.
The definitions and formulation of our problems are as follows. We handle proteins i (1 ≤ i ≤ I) and samples j (1 ≤ j ≤ J), both of which are included in the input expression data. We also call the proteins A, B, C, ..., and so on. As a parameter, we define r (0 < r < 1) as the threshold of the ratio used to judge the expression, i.e., if the expression level of sample j for protein i is within the top r among all the expression levels of protein i, we call the protein i "expressed" in sample j. Let $|A|$ be the number of samples in which protein A is expressed, and similarly, let $|A \cap B|$ be the number of samples in which both proteins A and B are expressed. With this notation, the sole effect level of A on C is $S_{A,C} = |A \cap C| / |A|$, and the total effect level of A and B on C is $S_{AB,C} = |A \cap B \cap C| / |A \cap B|$.
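The thresholding rule and the two ratios can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names `expressed_samples`, `sole_effect`, and `total_effect`, the tie-breaking rule, the rounding of r·J, and the toy expression values are all introduced here for illustration.

```python
import math

def expressed_samples(levels, r):
    """Return the indices of the samples whose level is within the top fraction r.

    Assumptions: the number of expressed samples is floor(r * J), and ties
    are broken by sample index.
    """
    k = math.floor(r * len(levels))
    ranked = sorted(range(len(levels)), key=lambda j: (-levels[j], j))
    return set(ranked[:k])

def sole_effect(a, c):
    """|A ∩ C| / |A|: fraction of the samples expressing A that also express C."""
    return len(a & c) / len(a)

def total_effect(a, b, c):
    """|A ∩ B ∩ C| / |A ∩ B|, or None when A and B are never expressed together."""
    ab = a & b
    return len(ab & c) / len(ab) if ab else None

# Toy data: 8 samples, r = 50% (values are made up for illustration).
A = expressed_samples([9, 8, 7, 6, 1, 2, 3, 4], 0.5)
B = expressed_samples([9, 8, 1, 2, 7, 6, 3, 4], 0.5)
C = expressed_samples([9, 8, 7, 1, 6, 2, 3, 4], 0.5)
print(sole_effect(A, C), sole_effect(B, C), total_effect(A, B, C))
```

Because only the rank of each expression level matters, the sketch never looks at the shape of the underlying distribution of the expression values.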

Retrieving combinatorial effect
What we want to estimate is the combinatorial interaction effect level, which can be estimated from the total interaction level (presented in the previous section) and the sole effect levels of A−C and B−C (see Figure 5). To estimate the combinatorial effect level for the combination of the three proteins A, B, and C, we split the total interaction effect into two parts, i.e., the two sole interaction effects and the combinatorial effect. The part of the total effect that is not explained by the two sole effects is then regarded as the combinatorial effect level that we wish to compute. To obtain the combinatorial effect level, we compute the statistical distribution of the total effect levels $S'_{AB,C} = |A \cap B \cap C| / |A \cap B|$, which are computed through a simulation executed under the assumption that no combinatorial effect exists over A, B, and C. From the distribution of $S'_{AB,C}$ and the total effect score $S_{AB,C} = |A \cap B \cap C| / |A \cap B|$, which is the total effect level presented in the previous subsection, we can estimate the combinatorial effect level.
The computer simulation to compute the distribution of $S'_{AB,C} = |A \cap B \cap C| / |A \cap B|$ is performed as follows. For the given values of $S_{A,C}$ and $S_{B,C}$, which are the sole effect values for the combinations A−C and B−C, we first create distributions of A, B, and C randomly such that the sole effect levels of A−C and B−C are $S_{A,C}$ and $S_{B,C}$, respectively. Since those distributions are created randomly, it is possible to assume that they do not include any combinatorial effect. Then we compute the total effect score of the combination A, B, and C. After a sufficient number of repetitions of this process, we obtain the distribution of $S'_{AB,C}$ as the accumulation of the total effect scores. Note that we do not consider what kind of distribution A, B, and C follow in our method since we determine whether a protein is expressed using the threshold r of the ranking in expression levels.
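The simulation can be sketched as follows. The text does not specify how the random data are generated, so this sketch makes its own assumptions, all of which are hypothetical: the expressed set of C is fixed, the expressed sets of A and B are drawn independently so that the required fraction of each falls inside the expressed set of C, and r ≤ 0.5 so that enough samples exist outside that set. The names `random_expressed` and `simulate_null` and the small default sizes are also illustrative; the table described in the text was built with far more samples and trials.

```python
import random
import statistics

def random_expressed(c_list, other_list, k, sole, rng):
    """Draw k expressed samples so that |X ∩ C| / |X| is as close to `sole` as possible."""
    inside = round(sole * k)
    return set(rng.sample(c_list, inside)) | set(rng.sample(other_list, k - inside))

def simulate_null(s_ac, s_bc, n_samples=200, r=0.5, trials=2000, seed=0):
    """Mean and standard deviation of the total effect with no combinatorial effect.

    Assumes r <= 0.5 so that k - inside samples can always be drawn outside C.
    """
    rng = random.Random(seed)
    k = int(r * n_samples)
    c_list = list(range(k))                 # samples in which C is expressed
    other_list = list(range(k, n_samples))  # samples in which C is not expressed
    c = set(c_list)
    totals = []
    for _ in range(trials):
        a = random_expressed(c_list, other_list, k, s_ac, rng)
        b = random_expressed(c_list, other_list, k, s_bc, rng)
        ab = a & b
        if ab:  # skip the rare trial in which A and B are never co-expressed
            totals.append(len(ab & c) / len(ab))
    return statistics.mean(totals), statistics.stdev(totals)

mu, sigma = simulate_null(0.6, 0.6)
print(f"mean={mu:.3f}, sd={sigma:.3f}")
```

Since A and B are drawn independently given C, any excess of the observed total effect over the simulated mean cannot come from the sole effects alone, which is the property the method relies on.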
From this total effect distribution of $S'_{AB,C}$, we compute the combinatorial effect as a z-score in the distribution of $S'_{AB,C}$. The z-score $z_{AB,C}$ is defined as $z_{AB,C} = (S_{AB,C} - \mu) / \sigma$, where $S_{AB,C}$ is the total effect level of A, B, and C obtained from the real data, and $\mu$ and $\sigma$ are the average and the standard deviation of the distribution of $S'_{AB,C}$ obtained from the computer simulation, respectively. Namely, the z-score is the difference between the real total effect level obtained from the real data and the average of the distribution of $S'_{AB,C}$, measured in units of $\sigma$. Intuitively, the z-score indicates how improbable the value $S_{AB,C}$ is under the assumption that the combinatorial effect does not exist, which implies the level of the combinatorial effect.
Computing the distribution of the total effect levels through the simulation, however, requires considerable computing time, so it is desirable to precompute the distribution. Thus, we prepared a distribution table that shows the average $\mu$ and the standard deviation $\sigma$ of the distribution for each value of $S_{A,C}$ and $S_{B,C}$, as shown in Figure 6. Note that when we computed the distributions in Figure 6, we prepared the data of A, B, and C with 10,000 samples and performed 5,000,000 trials for each pair of $S_{A,C}$ and $S_{B,C}$. Because we computed the table for 20 values of $S_{A,C}$ and $S_{B,C}$ between 0 and 1, to obtain the corresponding values of $\mu$ and $\sigma$ we used the table entry that is the closest to $S_{A,C}$ and $S_{B,C}$ of A, B, and C.
Now we summarize the proposed method. First, we enumerate every combination of the three proteins A, B, and C from the input data set. For each of the combinations, we compute the total effect level $S_{AB,C}$ of A, B, and C. By referring to the precomputed distribution table, we find the distribution of $S'_{AB,C}$ corresponding to the values $S_{A,C}$ and $S_{B,C}$ of A, B, and C. From the distribution of $S'_{AB,C}$ and the total effect level $S_{AB,C}$, we obtain the combinatorial effect level of A and B on C as the corresponding z-score. Finally, we create a ranking of all the combinations of the three proteins by ordering them by the z-score.
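The lookup and ranking steps can be sketched as below. This is an illustrative sketch only: the grid spacing of 0.05 is an assumption (the text states only that 20 values between 0 and 1 were tabulated), the table entries and the triples are made-up numbers, and the names `nearest_grid`, `z_score`, and `rank_triples` are hypothetical.

```python
def nearest_grid(value, step=0.05):
    """Snap a sole effect level to the nearest grid point (assumed spacing: 0.05)."""
    return round(round(value / step) * step, 2)

def z_score(total, s_ac, s_bc, table):
    """z = (total - mu) / sigma, with mu and sigma taken from the nearest table entry."""
    mu, sigma = table[(nearest_grid(s_ac), nearest_grid(s_bc))]
    return (total - mu) / sigma

def rank_triples(triples, table):
    """triples: iterable of (label, s_ac, s_bc, total); returns (label, z) by descending z."""
    scored = [(label, z_score(total, s_ac, s_bc, table))
              for label, s_ac, s_bc, total in triples]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical table entries (mean, standard deviation); real values would
# come from the simulation under the no-combinatorial-effect assumption.
table = {(0.40, 0.45): (0.50, 0.05), (0.60, 0.60): (0.70, 0.04)}
triples = [("A1-B1-C1", 0.41, 0.44, 0.60), ("A2-B2-C2", 0.59, 0.61, 0.90)]
for label, z in rank_triples(triples, table):
    print(label, round(z, 2))
```

Snapping to the nearest grid point trades some accuracy for speed; a finer grid or interpolation between neighbouring entries would be a natural refinement.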

Property of expression data used in our method
In this section, we explain the preprocess applied to the expression data, and also describe the basic property of the data. The expression data used in this experiment originated from the sample of fat near the kidney of black cattle. We performed 2D electrophoresis on each sample and measured the volume of each separated spot that corresponds to each protein.
For details of the protocol of the experiment, see [19].
We preprocessed the expression data to improve its reliability. Our preprocess consists of the following three steps. First, we removed from the data the samples and the proteins that included more than 10% of null expression levels. This was done because samples or proteins with so many null values significantly reduce the reliability of the expression data. Next, we normalized the expression data with the global scaling method [20], where for every sample a scale factor is applied such that the total sum of the protein expression levels in the sample is 1. Finally, we removed the samples with high repetition error. Note that in this data set we performed 2D electrophoresis twice for each sample to confirm the accuracy of each electrophoresis experiment. To maintain the reliability of the data, we removed the samples in which more than 30% of the spots have a high repetition error or a null value. Specifically, we consider a spot to have a high repetition error if the larger expression level is larger than 1.3 times the smaller expression level. Otherwise, the average of the two expression levels is used for each sample-protein pair. As a result, the expression data used for our evaluation consist of 124 samples and 670 proteins.
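The normalization and repetition-error rules can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: null values are represented by `None`, each function handles a single sample, and the function names and example values are assumptions introduced here.

```python
def global_scale(sample):
    """Scale one sample so that its non-null expression levels sum to 1."""
    total = sum(v for v in sample if v is not None)
    return [None if v is None else v / total for v in sample]

def merge_duplicate_spots(run1, run2, max_ratio=1.3):
    """Average the two runs spot by spot.

    A spot becomes None when either value is null or when the larger value
    exceeds max_ratio times the smaller one (high repetition error).
    """
    merged = []
    for x, y in zip(run1, run2):
        if x is None or y is None or max(x, y) > max_ratio * min(x, y):
            merged.append(None)
        else:
            merged.append((x + y) / 2)
    return merged

def keep_sample(merged, max_bad_fraction=0.3):
    """Keep the sample unless more than 30% of its spots were rejected."""
    bad = sum(v is None for v in merged)
    return bad / len(merged) <= max_bad_fraction

# Made-up values for two electrophoresis runs of the same sample.
run1 = [10.0, 20.0, 30.0, None]
run2 = [11.0, 30.0, 31.0, 5.0]
merged = merge_duplicate_spots(run1, run2)
print(merged, keep_sample(merged))
print(global_scale([1.0, 3.0, None]))
```

In this toy sample two of the four spots are rejected, so the sample exceeds the 30% limit and would be dropped.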
In order to indicate a characteristic of this data, we investigated the correlation between proteins. See Figure 7 for the results of calculating correlation coefficients for all pairs of the proteins. Note that the number of pairs is ${}_{670}C_{2} = 224{,}115$ in total. Figure 7 is a histogram in which the horizontal axis shows the correlation coefficient separated into classes with 0.05 intervals and the vertical axis shows the frequency of each class. From this result, we can see that most of the correlation coefficients take positive values, and many of them take relatively large values.

Methods
We performed the experiment to evaluate the performance of the proposed method by applying it to the expression data described in Section 4.1. As a parameter of the experiment, we used the values of 50% and 30% as the threshold r to define the phenomenon that a protein is expressed.
To maintain statistical reliability, we excluded from the analysis the combinations of three proteins for which the number of samples was insufficient. Namely, we ignored a combination of three proteins if |A∩B|, which is the denominator of the total effect level $S_{AB,C}$, was less than 35 when r is 50%, or less than 20 when r is 30%. Similarly, we also removed a combination if |A∩B∩C| was less than 18 when r is 50%, or less than 10 when r is 30%. Furthermore, for the computation, we only used the samples in which none of the expression levels of the three proteins is null.
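The exclusion rule amounts to a small lookup keyed by r. The sketch below is illustrative; the names `MIN_COUNTS` and `is_reliable` are hypothetical, while the threshold values are those stated above.

```python
# r -> (minimum |A∩B|, minimum |A∩B∩C|), as stated in the text.
MIN_COUNTS = {0.5: (35, 18), 0.3: (20, 10)}

def is_reliable(n_ab, n_abc, r):
    """Return True if a combination has enough samples to be analysed."""
    min_ab, min_abc = MIN_COUNTS[r]
    return n_ab >= min_ab and n_abc >= min_abc

print(is_reliable(35, 18, 0.5), is_reliable(34, 30, 0.5), is_reliable(20, 9, 0.3))
```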

Results
In this section, we describe the results of the evaluation experiments. Figure 8 shows the histogram for the case of r = 50%, where the horizontal axis indicates the z-scores separated into classes with 0.5 intervals, and the vertical axis indicates the number of combinations in each class. Figure 9 shows the ranking of the top 30 combinations of proteins in terms of z-score. This table includes the columns of the spot numbers of proteins A, B, and C, the z-score of the combination, $S_{A,C}$ and $S_{B,C}$ (the sole effect levels), $S_{AB,C}$ (the total effect level), and |A∩B| and |A∩B∩C| (the number of samples contained in each phenomenon).
Under the significance level of 1%, we extracted 462,706 combinations in which a strong combinatorial effect is inferred. Here, we calculate the p-value corresponding to the significance level of 1% using the formula of the Bonferroni correction presented in [21], i.e.,

$$\text{p-value} = 1 - (1 - \alpha)^{1/n},$$
where $n$ is the number of combinations of three proteins and $\alpha$ is the significance level. This suggests that if the p-value is $1 - (1 - 0.01)^{1/149{,}708{,}820} = 6.713 \times 10^{-11}$ or less, the combinatorial effect exists. When the p-value is $6.713 \times 10^{-11}$, the corresponding z-score is 6.423. This is computed as the point in the normal distribution where the probability that the value exceeds the point is $6.713 \times 10^{-11}$. Figure 8 shows only the part where the z-score is larger than 6.423. Note that the probability of a z-score larger than 6.423 is only $6.713 \times 10^{-11}$ if we assume that there is no combinatorial effect. This and the results of Figure 8 imply that our expression data includes many combinations in which the combinatorial effect exists. Figure 9 shows that most of the sole effects of the shown combinations lie between 0.4 and 0.45, and the total effects lie between 0.45 and 0.55. Moreover, in most of the combinations, |A∩B| takes values close to 35, which is the threshold value used to judge statistical reliability. This implies that combinations with lower |A∩B| tend to have larger z-scores. Although it is not shown in Figure 9, the combinations of lower ranks have larger values of |A∩B|.
Figures 10 and 11 show the results with r = 30%. Compared to Figure 8, the z-scores tend to have lower values. In addition, the number of combinations with z-scores larger than 6.423 decreases to 167,320. Here, 6.423 is the z-score corresponding to the significance level of 1%. In Figure 11, all of the total effects take a value of 1.0 and all of |A∩B| take a value of 20, which is the threshold value used to judge statistical reliability. Furthermore, about 97.8% of the total effects take the value 1.0 in the retrieved 167,320 combinations. This means that in most of the retrieved combinations, protein C is expressed in all the samples in which both proteins A and B are expressed. This appears to be an unusual tendency.
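The threshold computation can be checked numerically with the Python standard library. The sketch rests on two assumptions that are not spelled out in the text: that the corrected threshold has the form $1 - (1 - \alpha)^{1/n}$, and that $n$ counts one target protein C together with an unordered pair {A, B} chosen from the remaining 669 proteins. Under these assumptions the resulting cut-off should land close to the reported z-score of 6.423.

```python
import math
from statistics import NormalDist

n_proteins = 670
alpha = 0.01

# Assumption: C is the target and {A, B} is an unordered pair of other proteins.
n = n_proteins * math.comb(n_proteins - 1, 2)

# 1 - (1 - alpha)**(1/n), evaluated with log1p/expm1 to limit rounding error.
p_value = -math.expm1(math.log1p(-alpha) / n)

# Point of the standard normal distribution whose upper-tail probability is p_value.
z_threshold = -NormalDist().inv_cdf(p_value)

print(n, p_value, round(z_threshold, 3))
```

For such a small $\alpha / n$, this form and the simpler ratio $\alpha / n$ give nearly the same threshold, so the choice between them has little effect on which combinations are retained.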
Since in the case of 30% the number of samples in the phenomenon "express" is smaller than in the case of 50%, it is possible that the number of samples is not sufficient to ensure a reliable statistical analysis. One of our future projects will be to clarify why this result appears in the case of r = 30%.

Procedure to exchange proteins
In this section, for the combinations that have high z-scores, we investigate the z-scores obtained when we exchange protein A with protein D in the case where D has a high correlation coefficient with A. Figure 9 shows that many high z-score combinations include C as the common protein, although A and B are also found as common proteins. Since our method defines the samples with the top r expression levels as expressed, we intuitively expect similar z-scores if we exchange A with D when D has a high correlation coefficient with A. We believe that the many pairs of proteins with a high correlation coefficient in our data set are the reason we retrieve so many combinations with a high combinatorial effect. In order to confirm this, we performed an experiment in which we exchanged proteins.
The experiment is as follows. First, we create the list of proteins D whose correlation coefficients with A are larger than a certain threshold value. Next, we exchange A with D, and calculate the z-score $z_{DB,C}$ for all combinations of proteins D, B, and C. Figure 12 shows the values of the z-scores $z_{DB,C}$ when A and D are exchanged in the highest z-score combination of A, B, and C in the case r = 50%, where A is exchanged with D if the correlation coefficient of D with A is larger than 0.8. This table includes the columns of the spot numbers of proteins A, B, and C, the protein D exchanged with A, correl(A,D) (the correlation coefficient of A and D), $S_{D,C}$ (the sole effect level when A and D are exchanged), $S_{A,C}$ (the sole effect level before exchanging), $S_{DB,C}$ (the total effect level), and |D∩B| and |D∩B∩C| (the number of samples contained in each phenomenon). In addition, this table is sorted in descending order of z-score. Figure 12 shows that the lowest z-score resulting from the exchange is 5.503. Note that there are only three combinations that have a z-score less than 6.423, the value above which the combinatorial effect is inferred under the significance level of 1%. This means that the z-score tends to remain high when two proteins with a strong correlation are exchanged. Accordingly, one of the reasons that so many combinations with a combinatorial effect are retrieved from our data seems to be that our data includes many pairs of proteins with a high correlation coefficient.
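The first step of the exchange experiment, listing the candidate proteins D, can be sketched as follows. The sketch assumes that the correlation coefficient is the Pearson coefficient, which the text does not state explicitly; the function names and the expression profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def exchange_candidates(target, profiles, threshold=0.8):
    """Return (protein, correlation) pairs whose correlation with `target` exceeds the threshold."""
    result = []
    for name, levels in profiles.items():
        if name == target:
            continue
        corr = pearson(profiles[target], levels)
        if corr > threshold:
            result.append((name, corr))
    return sorted(result, key=lambda item: item[1], reverse=True)

# Made-up expression profiles over five samples.
profiles = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "D1": [1.1, 2.1, 2.9, 4.2, 5.0],
    "D2": [5.0, 1.0, 4.0, 2.0, 3.0],
}
print(exchange_candidates("A", profiles))
```

Each candidate returned here would then replace A, and the z-score of the combination D, B, and C would be recomputed in the same way as for the original combination.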

Conclusion
In this paper, we proposed a method to retrieve the combinatorial protein-protein (or gene-gene) interactions from expression data using statistics of conditional probability. We suppose a model of protein-protein interactions in which the expression level of C takes a large value only if proteins A and B are expressed together. This is the first study to estimate the combinatorial effect level apart from the sole effect. In this study we described our method to treat protein interactions, but note that our method is also applicable to gene expression data generated from microarray experiments.
We evaluated our method using real expression data obtained from a 2D electrophoresis-based experiment. We performed two evaluation experiments with two different parameters, i.e., r = 50% and r = 30%. As a result, the real expression data used in our experiment included a considerable number of combinations in which a combinatorial effect is inferred. However, the results are quite different between the two values of r that we used in our experiment. This may be because the number of samples is not sufficient for statistical analysis, and we hope to clarify the validity of our method in detail in our future work. Further, we confirmed that we can exchange protein A with D when D has a strong correlation with A, and we found that the combinatorial effect is still strong even when A is exchanged with D.
In the future, we would like to perform more experiments to further validate our proposed method. In addition, we would like to develop an algorithm for the analytical computation of the statistical distribution under the assumption of no combinatorial effect, i.e., we would like to compute the distribution shown in Figure 6 without simulation. If such fast computation is possible, it will enable us to easily vary the threshold r and to perform a more accurate analysis. Finally, we also would like to find known interactions in our results to verify the value of this data-mining method.

Acknowledgment
This work was partly supported by the Program for Promotion of Basic and Applied Researches for Innovations in Bio-oriented Industry.