An integrative in silico approach for discovering candidates for drug-targetable protein-protein interactions in interactome data

Background Protein-protein interactions (PPIs) are challenging but attractive targets for small chemical drugs. Whole sets of PPIs, called the 'interactome', have emerged for several organisms, including human, owing to the recent development of high-throughput screening (HTS) technologies. Individual PPIs have been targeted by small drug-like chemicals (SDCs); however, interactome data have not been fully utilized for exploring drug targets because a comprehensive methodology for utilizing these data is lacking. Here we propose an integrative in silico approach for discovering candidates for drug-targetable PPIs in interactome data. Results Our novel in silico screening system comprises three independent assessment procedures: i) detection of protein domains responsible for PPIs, ii) finding SDC-binding pockets on protein surfaces, and iii) evaluating similarities in the assignment of Gene Ontology (GO) terms between specific partner proteins. We discovered six candidates for drug-targetable PPIs by applying our in silico approach to original human PPI data composed of 770 binary interactions produced by our HTS yeast two-hybrid (HTS-Y2H) assays. Among them, we further examined two candidates, RXRA/NRIP1 and CDK2/CDKN1A, with respect to their biological roles, the PPI network around each candidate, and the tertiary structures of the interacting domains. Conclusion An integrative in silico approach for discovering candidates for drug-targetable PPIs was applied to original human PPI data. The system excludes false positive interactions and selects reliable PPIs as drug targets. Its effectiveness was demonstrated by the discovery of the six promising candidate target PPIs. Inhibition or stabilization of the two examined interactions, RXRA/NRIP1 and CDK2/CDKN1A, may have potential therapeutic effects against human diseases.

Here we propose a novel and integrative in silico approach for discovering candidates for drug-targetable PPIs by computationally screening large amounts of PPI data. The approach is first applied to the previously investigated target PPIs; its effectiveness and potential are then demonstrated by applying the methodology to original human PPI data produced by our HTS-Y2H assays.

Synopsis of our in silico system
Many previously investigated target PPIs satisfy several criteria sufficient for them to be chosen as drug targets. One criterion is that the interacting domains involved in a PPI have already been identified. Domain-domain interactions responsible for PPIs are more informative than the PPIs themselves when researchers select potential drug targets [36]. This is because two domains that exclusively interact with each other can be specifically inhibited by an SDC without other PPIs being inhibited. In contrast, if a domain targeted by an SDC is shared by a large number of interacting proteins, and if this domain interacts with other domains, it is likely that the SDC will cause an off-target effect by inhibiting non-targeted PPIs that are essential to the organism.
A second criterion is the presence of SDC-binding pockets on the surface of the interacting protein. In many of the previously investigated target PPIs, SDCs interact with a pocket containing the small number of amino acid residues that contribute a large fraction of the protein-protein binding free energy, the so-called 'hot spots' [1,37]. In order for SDCs to inhibit a PPI, one or both of the two interacting proteins should have a pocket on the protein surface to which SDCs can bind. This criterion holds whether the SDCs exert their inhibiting effects via direct binding to the PPI interface or via allosteric effects caused by an SDC-induced conformational change in the tertiary structure of the SDC-interacting protein.
A third criterion is that the biological roles of the PPI are well understood. This is necessary in order to infer the phenotypic effects caused by inhibition of the PPI in the cell. In addition, if the two interacting proteins detected in an experimental study have the same cellular location and/or have similar biological functions, it is more probable that the interaction between these two proteins actually occurs in living cells.
Based on the idea of in silico structure-based drug design, our novel and integrative in silico system discovers candidates for drug-targetable PPIs satisfying the above-mentioned criteria by integrating three independent assessment procedures:
• detection of protein domains responsible for PPIs,
• finding SDC-binding pockets on protein surfaces,
• evaluating similarities in the assignment of GO terms between specific partner proteins.
The in silico system is schematically represented in Figure 1. The first assessment procedure utilizes protein domain information in the Pfam [38] database. In the second assessment procedure, we use two programs, CASTp [39] and MOE Alpha Site Finder [40], to find SDC-binding pockets. In the third assessment procedure, similarity scores for GO-term assignment between specific partner proteins are calculated, and the statistical significance of the scores is evaluated. For more details of these methods, see the Methods section. In the following studies, we investigate a suitable threshold for each assessment procedure by applying our system to the previously investigated target PPIs. Then, our system is applied to original human PPI data composed of 770 unique binary interactions produced by our HTS-Y2H assays.
• A domain pair in the PPIs has been already known or predicted as interacting partner in the public databases.
• One or both proteins have at least one pocket on the protein surface to which SDCs can bind.
• Similarity score for the GO-term assignment is statistically significant (P < 0.05) in two out of the three GO categories.
By adopting these thresholds, our system can select 8 PPIs (BAK/BCL2(BCL-XL), β-catenin/Tcf4, CD4/MHC class II, IL1α(IL1β)/IL1R type I, iNOS/iNOS, LFA1/ICAM1, NGF/p75NTR, and p53/MDM2) from the 15 previously investigated target PPIs. In addition, the locations of the pockets found on the 8 PPIs are in good agreement with those of pockets targeted by SDCs in the previous studies (data not shown). Thus, we consider the thresholds to be suitable for assessing the drug-targetability of each PPI, although some PPIs may be missed as false negatives.
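The three thresholds listed above can be read as a simple conjunctive filter. The following Python sketch illustrates that reading only; the class and field names are hypothetical, the example values are invented for illustration, and "two out of the three GO categories" is assumed here to mean at least two.

```python
from dataclasses import dataclass


@dataclass
class PPIAssessment:
    """Results of the three assessments for one PPI (field names are illustrative)."""
    name: str
    domain_pair_known: bool  # criterion 1: interacting domain pair known or predicted
    pocket_found: bool       # criterion 2: SDC-binding pocket on one or both proteins
    go_p_values: dict        # criterion 3 input: GO category -> P value of the score


def passes_go_criterion(go_p_values, alpha=0.05, min_categories=2):
    """True if the score is significant (P < alpha) in at least min_categories categories."""
    significant = sum(1 for p in go_p_values.values() if p < alpha)
    return significant >= min_categories


def is_candidate(ppi):
    """A PPI is kept as a candidate only if all three criteria are satisfied."""
    return (
        ppi.domain_pair_known
        and ppi.pocket_found
        and passes_go_criterion(ppi.go_p_values)
    )


# Invented example: significant in two GO categories, with a domain pair and a pocket.
example = PPIAssessment("pair_1", True, True, {"C": 0.01, "F": 0.04, "P": 0.30})
print(example.name, is_candidate(example))
```

Because the criteria are combined with a logical AND, a PPI that fails any single assessment is discarded, which is how known target PPIs can be missed as false negatives.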

Application to original human PPI data
Figure 1. Schematic representation of our novel and integrative in silico system for discovering candidates for drug-targetable PPIs in binary PPI data. The system uses binary PPI data as an input and assesses each PPI based on three independent in silico investigations: detection of protein domains responsible for PPIs, finding SDC-binding pockets on protein surfaces, and evaluating similarities in the assignment of GO terms. By integrating the results of these three investigations, the system discovers candidates for drug-targetable PPIs.
Most PPIs in the original human PPI data are those between human transcription factors (baits) and other proteins (preys) (see Additional file 2). The numbers of unique baits and preys are 99 and 738, respectively (Table 2). The baits and preys used in our HTS-Y2H assays were sequence fragments. Protein domains included in the bait and prey fragments are likely involved in the interaction between the two fragments. All domains in the bait and prey fragments used in the present study were retrieved from the Pfam database (see Methods). We identified Pfam-A and/or Pfam-B domains in most of the bait (98% (97/99)) and prey (97% (714/738)) fragments (Table 2). Table 3 indicates that in most (95% (734/770)) bait-prey pairs, both fragments have Pfam-A and/or Pfam-B domains. This table also shows that only 3% (23/770) of bait-prey pairs satisfy the first criterion of our system, dramatically reducing the number of candidate PPIs. We then further identified two domains as interacting partner domains when a single domain was present in the bait fragment and a single domain in the prey fragment. Among the bait and prey fragments with domains, 32 (33%) bait and 350 (49%) prey fragments have a single domain. In 62 (8%) out of the 734 bait-prey pairs, we detected a single domain in both the bait and the prey fragments. As a result, we identified interacting partner domains in 83 (11%) bait-prey pairs. It is highly probable that these domain pairs are involved in the interaction between the bait and prey fragments. See Additional file 2 for the full list of the detected domains in the fragments.
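The two routes to an interacting domain pair described above (a pair already registered in a domain-domain interaction resource, or a single domain in each fragment) can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the function name, the `known_ddi` lookup structure, and the placeholder domain identifiers are all assumptions.

```python
def interacting_domain_pairs(bait_domains, prey_domains, known_ddi):
    """Return the (bait_domain, prey_domain) pairs proposed as responsible for a PPI.

    bait_domains, prey_domains: lists of domain identifiers found in each fragment.
    known_ddi: set of frozensets, each holding a registered interacting domain pair
               (a hypothetical stand-in for a public domain-domain interaction resource).
    """
    # Route 1: any bait/prey domain combination already registered as interacting.
    registered = {
        (b, p)
        for b in bait_domains
        for p in prey_domains
        if frozenset((b, p)) in known_ddi
    }
    if registered:
        return registered
    # Route 2: exactly one domain in each fragment, so the pairing is unambiguous.
    if len(bait_domains) == 1 and len(prey_domains) == 1:
        return {(bait_domains[0], prey_domains[0])}
    # Multi-domain fragments without a registered pair are left unassigned.
    return set()


known = {frozenset(("DOM_A", "DOM_B"))}  # placeholder identifiers
print(interacting_domain_pairs(["DOM_A", "DOM_C"], ["DOM_B"], known))
print(interacting_domain_pairs(["DOM_C"], ["DOM_D"], known))
```

Returning an empty set for ambiguous multi-domain fragments mirrors the conservative choice described in the text, at the cost of leaving many pairs unassigned.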
In order to computationally detect pockets on the surfaces of domains/proteins in the bait and prey fragments, it is essential that tertiary structures nearly identical to the bait and prey fragments are available. To detect such structures, we searched for entries in the PDB [51] database showing high amino acid sequence identity and sequence coverage rate to the fragments (see Methods). The rigorous threshold of sequence identity ≥ 90% and coverage rate ≥ 90% in the results of sequence-similarity searches was adopted in the present study. This is because we detected pockets based on their volume and the number of hydrophobic amino acid residues in pockets, and these pocket properties are very sensitive to slight conformational changes of protein tertiary structure caused by amino acid replacement, deletion, or insertion. If sequence identity between a bait or prey fragment and a PDB entry fell within the range of 50%-90%, one could reconstruct a tertiary structure of the protein by homology modeling based on the template structure of the PDB entry. In these situations, however, pocket properties on the reconstructed tertiary structure would not always be nearly identical to those on the template structure. Therefore, we adopted the rigorous threshold of sequence identity ≥ 90% and coverage rate ≥ 90% for pocket detection. Results of the sequence-similarity search indicate that 15% (15/99) of bait and 7% (51/738) of prey fragments have nearly identical tertiary structures in the PDB database (Table 2). Most of these bait and prey fragments (100% (15/15) in bait, 84% (43/51) in prey) have one or more pockets on their protein surface.
Table 3 shows that one or both fragments in 27% (211/770) of bait-prey pairs have nearly identical tertiary structures. In 96% (203/211) of the bait-prey pairs, we found SDC-binding pockets in one or both fragments. See Additional file 2 for the full results of the pocket analyses.
Table footnote: In the column of 'presence of pockets', 'yes' means that one or more pockets were found by at least one of the two programs, 'no' means that no pocket was found, and '-' indicates that pockets were not searched because of lack of nearly identical tertiary structures. Statistical significance of similarity scores (S i C , S i F , and S i P ) for GO-term assignment is indicated by '*' (* P < 0.05, ** P < 0.01). In the row of 'BAK/BCL2(BCL-XL)', similarity scores for BAK/BCL-XL are shown in parentheses.
GO [52] is useful for assessing the biological significance of the bait-prey pairs and for selecting well-studied pairs. This is due to the hierarchical data structure of GO, in which biological terms are systematically organized to allow the computational handling of many terms related to biology. We counted the numbers of shared identical GO terms and calculated similarity scores between the bait and prey fragments (see Methods). Table 2 shows that most bait proteins (> 90%) and many prey proteins (> 80%) have at least one GO term in any of the three GO categories. Table 3 indicates that many bait-prey pairs (> 75%) share one or more identical GO terms. We calculated similarity scores and evaluated the statistical significance of the scores based on frequency distributions of scores calculated for PPI data composed of random protein pairs (see Additional file 3). The number of bait-prey pairs with a statistically significant (P < 0.05) score is shown in Table 3. Among these pairs, 201 bait-prey pairs have statistically significant scores in two out of the three GO categories. See Additional file 2 for similarity scores calculated for all bait-prey pairs and results of the statistical evaluation of these scores.
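The significance test against random protein pairs can be illustrated with an empirical P value. The sketch below is an assumption-laden stand-in: the similarity score actually used in this study is defined in the Methods and is not reproduced here, so the count of shared identical GO terms serves as a placeholder score, and the function names and the random-score list are hypothetical.

```python
def shared_go_terms(terms_a, terms_b):
    """Number of identical GO terms assigned to both proteins (placeholder score)."""
    return len(set(terms_a) & set(terms_b))


def empirical_p_value(observed, random_scores):
    """Fraction of random-pair scores that are at least as large as the observed score."""
    if not random_scores:
        raise ValueError("random_scores must not be empty")
    n_at_least = sum(1 for s in random_scores if s >= observed)
    return n_at_least / len(random_scores)


# Invented background distribution of scores for ten random protein pairs.
background = [0, 0, 0, 1, 1, 2, 0, 0, 1, 0]
score = shared_go_terms({"GO:0005634", "GO:0003677"}, {"GO:0005634"})
print(score, empirical_p_value(score, background))
```

In practice the background distribution would be computed separately for each of the three GO categories, since the criterion is evaluated per category.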
Among the 770 unique bait-prey pairs, we selected candidates for drug-targetable PPIs that satisfy all three criteria. As shown in Table 3, 83 bait-prey pairs satisfied the first criterion. The numbers of bait-prey pairs satisfying the second and third criteria were 203 and 201, respectively. Figure 2 illustrates the distribution of the bait-prey pairs satisfying one, two, or three of the criteria described above. Twenty-six bait-prey pairs satisfy the first and second criteria, 70 pairs the second and third ones, and 29 pairs the first and third ones. Nine bait-prey pairs (six protein pairs: RXRA/NRIP1, PPARA/RXRA, RXRB/PPARD, STAT1/STAT6, CDK2/CDKN1A, and STAT3/DST) were discovered as candidates for drug-targetable PPIs satisfying all three criteria.
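The overlap counts described above correspond to the regions of a three-set Venn diagram. As a minimal sketch of how such counts could be tabulated, the hypothetical function below takes the sets of pair identifiers satisfying each criterion and returns the size of every exclusive region; the toy identifiers are invented and do not correspond to the data of this study.

```python
def venn_regions(first, second, third):
    """Count the items in each exclusive region of a three-set Venn diagram."""
    a, b, c = set(first), set(second), set(third)
    return {
        "only_first": len(a - b - c),
        "only_second": len(b - a - c),
        "only_third": len(c - a - b),
        "first_and_second_only": len((a & b) - c),
        "second_and_third_only": len((b & c) - a),
        "first_and_third_only": len((a & c) - b),
        "all_three": len(a & b & c),
    }


# Toy identifiers for pairs satisfying the first, second, and third criterion.
regions = venn_regions({1, 2, 3, 4}, {3, 4, 5}, {4, 5, 6, 7})
print(regions["all_three"], sum(regions.values()))
```

The region sizes sum to the number of pairs satisfying at least one criterion, which gives a quick consistency check on the tabulated counts.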

Drug-targetability of selected PPIs
In this section, we discuss the drug-targetability of the two candidate PPIs, retinoid X receptor α (RXRA)/nuclear receptor-interacting protein 1 (NRIP1) and cell division protein kinase 2 (CDK2)/cyclin-dependent kinase inhibitor 1 (CDKN1A) (Table 4). The two candidates were selected because both bait and prey fragments had a single domain, so that interacting partner domains were explicitly determined, and because similarity scores for GO-term assignment were statistically significant in all three GO categories. We further examined the two candidates with respect to their biological roles, the PPI network around each candidate, and the tertiary structures of the interacting domains.
Figure caption: Selecting candidates for drug-targetable PPIs.

RXRA/NRIP1
The biological functions of RXRA and NRIP1 have been studied in detail [53][54][55][56]. The statistically significant similarity scores for the GO-term assignment indicate that RXRA and NRIP1 have related biological functions (Table 4). In fact, the two proteins share a number of gene-transcription-related GO terms: 'nucleus' in the cellular component category, 'transcription coactivator activity' and 'DNA binding' in the molecular function category, and 'regulation of transcription, DNA-dependent' and 'positive regulation of transcription from RNA polymerase II promoter' in the biological process category. RXRA is a member of the nuclear hormone receptor family. When a ligand binds to its hormone receptor domain, RXRA forms a homo- or hetero-dimer with other nuclear hormone receptors in order to function as a transcription factor [56]. NRIP1 interacts with homo- or hetero-dimers of various nuclear hormone receptors and modulates their function by repressing the transcriptional activity of the dimers [53][54][55]. Figure 3 shows the interaction network based on PPI data originally produced by our HTS-Y2H assays and retrieved from a public PPI database, HPRD [57] (see Additional file 4 for the original and larger version of Figure 3). The network shows that RXRA interacts with proteins related to a tumor (THRA related to pituitary adenoma) and those related to certain diseases caused by abnormalities in lipid metabolism (e.g., NR0B2 related to obesity, PPARA to hyperapobetalipoproteinemia, and PPARGC1A to lipodystrophy). Among the proteins interacting with RXRA and NRIP1, several proteins (e.g., PPARA, THRA, RARG, and RXRA itself) are targeted by drugs approved by the Food and Drug Administration (FDA) [58]. Indeed, members of the nuclear hormone receptor family, including RXRA, have been intensively studied as targets for therapeutic drugs for human diseases such as type II diabetes, obesity, and cancer [56].
Considering the biological functions of RXRA and NRIP1, we speculate that SDCs inhibiting the RXRA/NRIP1 interaction may have an effect similar to that of an RXRA agonist. If inhibition of the RXRA/NRIP1 interaction by the SDCs results in NRIP1 separating from a protein complex composed of RXRA, another nuclear receptor, and NRIP1, the transcription factor functionality of the resulting dimer would be restored.
Figure 3. PPI network connecting proteins used in the HTS-Y2H assays in the present study. Part of the network around the RXRA/NRIP1 and CDK2/CDKN1A interactions is enlarged in the upper frame. Proteins are represented as diamonds (targets of drugs approved by FDA) and circles (non-targets of FDA-approved drugs). The information on target proteins of FDA-approved drugs was obtained from the DrugBank database [58]. RXRA, NRIP1, CDK2, and CDKN1A are colored yellow. Proteins related to OMIM [96] diseases are colored brown and the remaining proteins are grey. Interactions between proteins are indicated by lines. Novel PPIs detected in this study are shown in red, and those retrieved from a public database, HPRD [57], are in blue. PPIs are colored green if the interaction was detected in the present study and also retrieved from the HPRD. The network was drawn using the program Cytoscape (version 2.3.2) [97]. See Additional file 4 for the original and larger version of the PPI network.
We identified an interaction between the Hormone_recep domain (ligand-binding domain) [Pfam:PF00104] in RXRA and a fragment of the PB064381 domain containing LXXLL motifs in NRIP1 (Table 4). The RXRA/NRIP1 interaction is believed to occur between α-helix 12 (H12) located in the C-terminal region of the Hormone_recep domain in RXRA and the LXXLL motifs in NRIP1 [54,55]. Since RXRA interacts with NRIP1 in a ligand-dependent manner [53][54][55], one would expect to detect pockets on the surface of RXRA in the ligand-bound state. 1LBD in Table 4, however, is not suitable for the present study because it is the tertiary structure of RXRA homodimers in the non-ligand-bound state. We therefore detected pockets on 1MVC_A (RXRA in the ligand-bound state), which had the second-highest score to the bait fragment from RXRA in the sequence similarity search. Figure 4(a) and 4(b) show the locations of the detected pockets and of the H12 from the Hormone_recep domain superimposed on the tertiary structure of 1MVC_A. We found four pockets using CASTp and three using MOE Alpha Site Finder on the surface of the Hormone_recep domain in RXRA. The pockets range in size from 152 Å³ to 1,092 Å³. The ratio of the number of hydrophobic amino acid residues to that of total residues was calculated for each pocket, ranging from 48% to 82%. The pocket with a size of 152 Å³ and 78% hydrophobic residues (shown in yellow in Figure 4(a)) seems the most suitable for SDCs designed to inhibit the RXRA/NRIP1 interaction, because several amino acid residues in the pocket are shared with the H12 (Figure 4(b)). Based on this structural information, it may be possible to discover inhibitors of the RXRA/NRIP1 interaction by designing SDCs to specifically bind to the pocket. Peptidomimetics of the LXXLL motif [5] in NRIP1 could be used as templates for designing RXRA/NRIP1-inhibiting drugs. In addition, the PB064381 domain is unique to NRIP1 [59], suggesting that inhibition of the Hormone_recep/PB064381 interaction may not affect other domain-domain interactions in living cells.
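The hydrophobic-residue ratio reported for each pocket is a simple proportion, sketched below. The set of residues treated as hydrophobic is an assumption made for this illustration (the classification used in the study is not stated here), and the function name and example residue strings are hypothetical.

```python
# Assumption: one-letter codes treated as hydrophobic for this illustration only.
HYDROPHOBIC = frozenset("AVLIMFWPC")


def hydrophobic_ratio(pocket_residues):
    """Ratio of hydrophobic residues to all residues lining a pocket.

    pocket_residues: iterable of one-letter amino acid codes for the pocket residues.
    """
    residues = [r.upper() for r in pocket_residues]
    if not residues:
        raise ValueError("pocket has no residues")
    n_hydrophobic = sum(1 for r in residues if r in HYDROPHOBIC)
    return n_hydrophobic / len(residues)


# Invented pocket composition: four of six residues fall in the assumed set.
print(round(100 * hydrophobic_ratio("LLFAKD")), "%")
```

A different hydrophobicity classification would shift the percentages, so ratios computed this way are comparable only within one consistent definition.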

CDK2/CDKN1A
CDK2 and CDKN1A share several GO terms: 'nucleus' in the cellular component category, 'protein kinase activity' and 'protein binding' in the molecular function category, and 'cell cycle' in the biological process category. This indicates that both proteins have biological functions in signaling pathways related to cell cycle regulation in the nucleus. CDK2 forms a protein complex with a member of the cyclin family and functions in cell cycle progression at the transition between the G1 and S phases [60]. CDKN1A arrests cell cycle progression by acting as an inhibitor of the CDK2/cyclin protein complex [61]. The PPI network illustrated in Figure 3 shows that CDK2 interacts with the TP73 protein related to neuroblastoma. Like RXRA, the CDK family proteins have attracted researchers' interest as targets for anticancer drugs [62][63][64]. A large number of SDCs have been developed that interact with the ATP-binding pocket and inhibit the kinase activity of CDKs [63,64]. Likewise, CDK/cyclin protein complexes have been well studied as therapeutic targets [62]. CDKN1A represses CDK2/cyclin activity by simultaneously binding to the 'cyclin groove' on cyclin and the ATP-binding pocket on CDK2 [61,62], which suggests that CDKN1A has an effect similar to that of an antagonist of CDK2's kinase activity. Indeed, Kontopidis and colleagues have obtained peptides that mimic the cyclin-groove-binding motif in CDKN1A and inhibit the interaction between the CDK/cyclin complex and transcription factors [62]. In addition to these peptidomimetics of CDKN1A, SDCs called 'dimerizers' [65], which induce or stabilize the CDK2/cyclin A/CDKN1A protein complex, could potentially lead to treatments for cancer.
Figure caption: Locations of the detected pockets superimposed on the tertiary structures of proteins or the amino acid sequence.
We identified a domain-domain interaction between the Pkinase domain [Pfam:PF00069] in CDK2 and the CDI domain [Pfam:PF02234] in CDKN1A (Table 4). This is in good agreement with the results of previous studies [66] identifying the interaction interface of CDK2/CDKN1A. One strategy for inducing or stabilizing a PPI is to design an SDC that can simultaneously bind to a pocket lying across two interacting proteins in a protein complex. In the case of CDK2/CDKN1A, we found pockets on the Pkinase domain [PDB:1V1K_A] in CDK2 but did not detect any pocket on the CDI domain in CDKN1A because it has no nearly identical tertiary structure (Table 4). Instead of 1V1K_A, we further investigated the tertiary structure of a protein complex [PDB:1JSU] composed of CDK2, cyclin A, and CDKN1B, which is a homolog of CDKN1A (sequence identity < 45%). Figure 4(c) shows that there is a pocket (shown in blue in Figure 4(c)) composed of atoms from CDK2 and from CDKN1B. Most of the atoms overlap with those composing the ATP-binding pocket on CDK2. The size of the pocket is 714 Å³, and the ratio of hydrophobic residues in the pocket is 50%. It is highly probable that the CDK2/CDKN1A complex has a tertiary structure that is not nearly identical but similar to that of the CDK2/CDKN1B complex, and that CDKN1A binds to CDK2 in a mode similar to that of CDKN1B [67].

Advantages of targeting PPIs
Targeting PPIs has a distinct advantage over targeting single proteins: a larger number of undiscovered potential drug targets. In traditional approaches to drug target discovery from the human proteome, drug targets were single proteins and were limited to a small number (~480) of proteins such as membrane receptors and enzymes [70]. Furthermore, most pockets targeted by small chemical drugs in these approaches were those to which endogenous small molecule ligands or substrates bind. By focusing on PPIs, the number of latent and novel drug targets can be expected to increase dramatically. This is because the size of the human interactome must be considerably larger than that of the human proteome and because many pockets involved in PPIs but not targeted in the traditional approaches become accessible. Since the total number of proteins encoded in the human genome is about 25,000-40,000, the size of the human interactome has been estimated to be 40,000-200,000 PPIs, based on extrapolation from the yeast interactome (10,000-30,000 PPIs (3-10 interactions/protein)) [71]. However, the number of human PPIs registered in the public interaction database is limited to ~38,000 [57]. Therefore, it is highly probable that most PPIs in the human interactome, including those that could be potential drug targets, remain undiscovered. For example, some PPIs, including BAK/BCL2, BAK/BCL-XL, p53/MDM2, and homo- or hetero-dimers of nuclear receptors, are mediated by hydrophobic grooves formed by three α-helices [1,56]. These PPIs utilizing α-helix grooves are thought to be amenable to small-molecule drug discovery [1], and thus may be promising targets of PPI-inhibiting SDCs [1,5].
Our in silico system can select more reliable interactions as drug targets by excluding spurious interactions via the three independent assessment procedures. The PPI data used in the present study were obtained from our HTS-Y2H assays. In general, the false positive rate of HTS-Y2H methods is believed to be higher than that of other physical, genetic, biochemical, or immunological methods for experimental detection of PPIs, mainly due to 'sticky' proteins that non-specifically interact with various proteins [72]. While a recent study on PPI prediction by a Support-Vector-Machine-based method has implied that PPI data produced by our HTS-Y2H assays are more reliable than data in the previous HTS-Y2H studies (Table 4 in [73]), we do not neglect the possibility that our PPI data also contain false positive interactions. Indeed, our HTS-Y2H assays identified PPIs between baits derived from nucleus-located proteins and preys from extracellular proteins such as collagen α-1(XV) chain (COL15A1), extracellular matrix protein 1 (ECM1), and laminin proteins (LAMA3, LAMB3, and LAMC2) (see Additional file 2). These PPIs are highly likely to be false positives. Our in silico system, however, can exclude these spurious interactions because, in these cases, similarity scores for GO-term assignment are not statistically significant in the cellular component category. Therefore, our approach should be widely applicable to PPI data even if a number of false positive interactions are included.

Issues in our approach
Our approach has the advantages described above, but some issues should be noted for its further refinement. To keep the assessment of domain detection conservative, we did not identify interacting partner domains when bait and/or prey fragments had multiple domains, unless a domain pair was registered in the public domain-domain interaction databases. However, a large number of human proteins are multi-domain proteins, and this is also the case for the bait (> 60%) and prey (> 45%) fragments used in the present study. Several computational methods have been developed in recent years for predicting interacting partner domains from large amounts of experimental PPI data [74][75][76][77][78][79][80]. Application of these methods to the PPI data used in this study will be needed for more exhaustive identification of interacting domains. For the purpose of pocket detection, we adopted simple criteria mainly based on pocket volume and the number of amino acid residues composing the pocket. Many studies in the past few decades have revealed various properties of pockets involved in endogenous ligand binding or PPIs ([37][81][82][83] and references therein). These properties, such as volume, shape, hydrophobic clusters, shallowness, roughness, and accessible surface area, can be taken into consideration as parameters for assessing the drug-targetability of each pocket. We are now developing a computer program that evaluates the drug-targetability of pockets based on these parameters. The program will enable us to judge whether a pocket is suitable as a drug target. To investigate whether the biological function of each PPI has been well understood, we assessed each PPI by using GO terms. GO has been frequently used in PPI network studies for annotating the biological functions of PPIs [28][29][30][31][32]34], but it also has the weak point that well-studied proteins have many GO terms and poorly understood ones have few. PPIs between well-studied proteins are therefore heavily annotated, whereas those between poorly understood proteins are sparsely annotated. Thus, when our approach assesses PPIs by using GO terms, it may miss poorly understood but therapeutically important target PPIs as false negatives. However, one of the aims of our system is to select PPIs for which biological information is more abundant. In the in vivo and in vitro validation of PPIs as drug targets, it is desirable that a researcher can obtain as much information as possible on the biology of the PPIs. Since sparsely annotated PPIs are considered difficult targets in this respect, our system does not select such PPIs in this study. Further accumulation of GO annotation will help us select therapeutically important target PPIs that are sparsely annotated with GO terms at present.

Future directions
Our in silico system can be further expanded for more precise assessment of candidates for drug-targetable PPIs if other computational methods are incorporated. These methods include the prediction of interaction interfaces on protein tertiary structures, the prediction of disordered regions, and the evaluation of similarities in the expression patterns of messenger RNAs encoding the two interacting proteins in every tissue/organ. In the case of RXRA/NRIP1 and CDK2/CDKN1A, it is fortunate that the interaction interfaces have been well studied by biochemical and immunological approaches [54,55,66], although the tertiary structures of the protein complexes remain unsolved. However, if the interaction interface of a candidate target PPI has not been well studied and the tertiary structure of the protein complex is unknown, computational methods to predict the PPI interface [84][85][86][87][88] are required in order to determine whether a detected SDC-binding pocket is located at the interface. Cheng and colleagues [89] recently proposed that interaction interface regions in proteins tend to have disordered tertiary structures and that information regarding these disordered regions is useful for drug target discovery. As for gene expression patterns, two proteins could presumably interact in living cells if the expression patterns of their corresponding genes were similar to each other.
We focused on discovering drug targets for SDCs based on the idea of structure-based in silico drug design, although there are various other types of drugs, including peptides, antisense RNAs or DNAs, aptamers, and antibodies. Candidate target PPIs for each type of drug, as well as for small chemical drugs, could be selected by adopting distinct criteria based on the three (or more) independent in silico investigations in our system. For example, to select candidate target PPIs for antibodies, one can adopt criteria such that i) at least one tertiary structure of the interacting domains is known, ii) the interacting domain has an interaction interface predicted to be recognized by antibodies, and iii) the interacting proteins share identical GO terms such as 'extracellular' in the cellular component category and have expression patterns similar to each other.

Conclusion
In this paper, we propose a novel and integrative in silico approach for discovering candidates for drug-targetable PPIs in interactome data. The system excludes false positive interactions and selects more reliable PPIs as drug targets. The application of our system to original human PPI data demonstrated its effectiveness by discovering the six promising candidates for drug-targetable PPIs. Advances in HTS technologies for detecting PPIs and the accumulation of high fidelity PPI data in the near future will enable our system to facilitate the more comprehensive exploration of drug-targetable PPIs.

PPI data
The PPI data analysed in the present study consist of 770 binary interactions between human proteins. The data were produced by our HTS-Y2H assays supported by the Genome Network Project from the Ministry of Education, Culture, Sports, Science and Technology of Japan. See Additional file 2 and the website of the Genome Network Platform [90] for all PPI data used in this study. Most of the bait proteins used in the HTS-Y2H assays are transcription factors, including members of the nuclear hormone receptor family (NR1D1, NR1D2, PPARA, PPARD, RORB, RXRA, THRA, etc.), those of the Signal Transducer and Activator of Transcription (STAT) family (STAT1, STAT3, and STAT4), homeodomain proteins (FOXP2, LHX1, LHX2, PKNOX1, etc.), and zinc-finger proteins (RFP, ZNF31, ZNF581, TRIM21, etc.). Preys used in the assays were prepared from cDNA libraries derived from various cell lines (brain, breast cancer/prostate cancer, liver, and macrophage). Our HTS-Y2H method uses sequence fragments as baits, and preys isolated with the baits are also sequence fragments. This enables us to identify protein domains responsible for PPIs because it is highly probable that protein domains included in the bait or prey fragments are involved in the interactions between the two fragments. Full details of our HTS-Y2H method, including experimental materials and conditions, will be reported elsewhere in the near future.

Detection of protein domains responsible for PPIs
All domains in the bait and prey fragments were retrieved from the Pfam (version 20.0) database [38] [49], and DIMA [50].

Finding SDC-binding pockets on protein surfaces
Using the amino acid sequences of the bait and prey fragments as queries, we searched the PDB database [51] (the version dated 2006/5/18) for tertiary structures similar to each fragment using the program BLASTP (version 2.2.13) [93]. This similarity search was performed with the default program parameters except for '-F F' (no masking of low-complexity regions) and '-e 0.001' (E-value < 0.001). We considered a bait or prey fragment to have a tertiary structure nearly identical to a chain in a PDB entry when the fragment showed a sequence identity of ≥ 90% and a query coverage rate (length of the query sequence showing the identity / full length of the query sequence) of ≥ 90% to that chain, and the sequence length showing the identity was ≥ 50 residues. If no nearly identical tertiary structure was detected for a fragment, the fragment was further searched against the PDB database using the program PSI-BLAST (version 2.2.13) [93]. The default program parameters were used for the PSI-BLAST search except for '-j 10' (a maximum of 10 search iterations).
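The three thresholds above can be expressed as a simple filter over BLASTP hits. The following is a minimal sketch, not the pipeline used in this study: the `Hit` record and the function names are hypothetical, and it assumes that "the sequence length showing the identity" corresponds to the aligned length of the query.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One alignment between a fragment (query) and a PDB chain (hypothetical record)."""
    chain_id: str
    percent_identity: float  # identity over the aligned region, in percent
    aligned_length: int      # assumed: number of query residues in the alignment
    query_length: int        # full length of the bait or prey fragment


def is_nearly_identical(hit, min_identity=90.0, min_coverage=90.0, min_length=50):
    """Apply the identity, query coverage, and aligned-length thresholds."""
    coverage = 100.0 * hit.aligned_length / hit.query_length
    return (hit.percent_identity >= min_identity
            and coverage >= min_coverage
            and hit.aligned_length >= min_length)


def needs_psiblast(hits):
    """A fragment goes on to the PSI-BLAST search only if no hit passes the filter."""
    return not any(is_nearly_identical(h) for h in hits)
```

In this sketch a 42-residue fragment aligned over 40 residues at 99% identity is still rejected, because the aligned length falls below the 50-residue minimum.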
The search for pockets on protein surfaces was performed for the bait and prey fragments showing high sequence identity (≥ 90%) to a chain in a PDB entry. We used two programs, CASTp [39] and MOE Alpha Site Finder [40], which implement different pocket-search algorithms. Coordinate data for the chains in the PDB showing high sequence identity to the bait and prey fragments were used as input to the programs. To detect potential SDC-binding pockets, we counted the number of pockets satisfying the following empirically determined criteria: i) in the case of CASTp, the volume (v) of a detected pocket was within the range 150 Å³ < v ≤ 2000 Å³; ii) in the case of MOE Alpha Site Finder, a) the number of atoms comprising the side chains of the amino acids inside the pocket was ≥ 37 or b) the number of hydrophobic atoms inside the pocket was ≥ 22.
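The pocket criteria can be written as two small predicates. This is a minimal sketch under stated assumptions: the function names and the input layout (a list of volumes for CASTp, a list of atom-count pairs for MOE Alpha Site Finder) are hypothetical, and parsing of the actual program output is not shown.

```python
def is_castp_candidate(volume_A3):
    """CASTp criterion: pocket volume v with 150 < v <= 2000 (cubic angstroms)."""
    return 150.0 < volume_A3 <= 2000.0


def is_moe_candidate(side_chain_atoms, hydrophobic_atoms):
    """MOE Alpha Site Finder criterion: >= 37 side-chain atoms or >= 22 hydrophobic atoms."""
    return side_chain_atoms >= 37 or hydrophobic_atoms >= 22


def count_candidate_pockets(castp_volumes, moe_sites):
    """Count potential SDC-binding pockets separately for each program.

    castp_volumes: iterable of pocket volumes in cubic angstroms.
    moe_sites: iterable of (side_chain_atoms, hydrophobic_atoms) pairs.
    """
    n_castp = sum(1 for v in castp_volumes if is_castp_candidate(v))
    n_moe = sum(1 for s, h in moe_sites if is_moe_candidate(s, h))
    return n_castp, n_moe
```

Note that the lower volume bound is exclusive and the upper bound inclusive, following the inequality as stated.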

Evaluating similarities in the assignment of GO terms between specific partner proteins
Based on the GO terms assigned to the two proteins from which the bait and prey fragments were derived, we evaluated similarities between fragments by counting the number of shared identical GO terms. GO terms assigned to the proteins were retrieved from the QuickGO database [94] using the UniProt/TrEMBL accession numbers. GO organizes a wide variety of biological terms as a hierarchy. If a specific term is assigned to a gene product, then all 'parent' terms in all paths ascending from that specific term to the top-level terms ('cellular component', 'biological process', and 'molecular function') of the hierarchy are also assigned to that gene product [96]. Thus, we collected all parent terms of the specific terms assigned to each protein. A similarity score ($S_i$) between a protein pair $i$ is calculated as
$$S_i = \sum_{j} L_j \, n_{ij}$$
where $L_j$ is the $j$th level of the GO hierarchy (in the present study, $L_j = 1, 2, 3, \ldots, 13$, from the top-level term ($L_j = 1$) to a specific term ($L_j > 1$)) and $n_{ij}$ is the number of shared identical GO terms in the $j$th level between a protein pair $i$. We calculated the scores for the three GO categories: cellular component ($S_i^C$), molecular function ($S_i^F$), and biological process ($S_i^P$).
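The scoring procedure can be illustrated on a toy hierarchy. The sketch below assumes that the score is the level-weighted sum of shared terms, i.e. the sum over levels of $L_j \, n_{ij}$, and that the level of a term is one plus its shortest distance to the top-level term. Both are assumptions made for illustration, and the term names and the `PARENTS` table are hypothetical.

```python
from collections import Counter

# Toy hierarchy with hypothetical term names: each term maps to its parent terms.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "A2": ["A"],
    "B1": ["B"],
}


def with_ancestors(terms, parents):
    """Return the given terms together with all parent terms up to the top level."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents[term])
    return seen


def level(term, parents):
    """Level of a term; the top-level term is level 1 (assumption: shortest path)."""
    if not parents[term]:
        return 1
    return 1 + min(level(p, parents) for p in parents[term])


def similarity_score(terms_a, terms_b, parents):
    """Sum of L_j * n_ij over levels j, for one GO category."""
    shared = with_ancestors(terms_a, parents) & with_ancestors(terms_b, parents)
    n_by_level = Counter(level(t, parents) for t in shared)  # n_ij for each level j
    return sum(lvl * n for lvl, n in n_by_level.items())
```

Under these assumptions, two proteins annotated with sibling terms "A1" and "A2" share "A" (level 2) and "root" (level 1), so deeper shared terms contribute more to the score than the top-level term that every pair shares.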
Statistical significance of the similarity scores was evaluated on the basis of frequency distributions of scores calculated for PPI data composed of 10,000 random pairs of human proteins (see Additional file 3). The random pairs were constructed from proteins in the UniProt and TrEMBL database with GO terms. The frequency distributions of random scores were calculated for all three GO categories, and probabilities of the real scores were estimated based on the distributions.
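A probability of this kind can be estimated empirically from the random-pair scores. The following is a minimal sketch, assuming a one-sided estimate (the fraction of random pairs scoring at least as high as the real pair); the text does not specify the exact estimator, and the function names are hypothetical.

```python
import random


def random_pairs(proteins, n_pairs, seed=0):
    """Draw n_pairs random pairs of distinct proteins (seeded for reproducibility)."""
    rng = random.Random(seed)
    return [tuple(rng.sample(proteins, 2)) for _ in range(n_pairs)]


def empirical_p_value(real_score, random_scores):
    """Fraction of random-pair scores that are >= the real score (assumed one-sided)."""
    n_at_least = sum(1 for s in random_scores if s >= real_score)
    return n_at_least / len(random_scores)
```

With 10,000 random pairs, the smallest non-zero probability this estimator can report is 1/10,000.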