Integration and Prediction of PPI Using Multiple Resources from Public Databases

Background: The analysis and usage of biological data is hindered by the spread of information across multiple repositories and the difficulties posed by different nomenclature systems and storage formats. In particular, the study and use of protein-protein interactions is one area where there is an important need for data integration. Without good integration strategies, it is difficult to assess how much interaction data is available and its properties. Results: We present a data integration approach for protein-protein interactions. This integrative approach has been implemented into PIANA, a protein-protein interaction software framework under the GNU Public License (http://sbi.imim.es/piana). We find that the integrated network of interactions shows properties very similar to those observed in previously reported protein interaction networks. We also find that interaction prediction methods find interactions for many proteins for which experimental methods have not produced any information.


Introduction
The completion of genome sequencing projects stimulated the development of high-throughput experimental methods aimed at functional characterization of the discovered genes. In particular, the identification of protein-protein interactions has been accelerated by the development of new technologies such as two-hybrid assays (Parrish et al., 2006; Rual et al., 2005; Stelzl et al., 2005) and affinity purifications followed by mass spectrometry (Gavin et al., 2006; Krogan et al., 2006; Puig et al., 2001). Thus, a vast amount of protein-protein interaction data has been collected, including proteome-scale interactome maps for yeast (Ito et al., 2001; Uetz et al., 2000), fly (Giot et al., 2003) and worm (Li et al., 2004), and a partial map for human (Rual et al., 2005; Stelzl et al., 2005). In addition to providing insights about biological systems (Barabasi et al., 2004), protein interaction maps can be used to infer the function of proteins (Sharan et al., 2007), detect remote homologs (Espadaler et al., 2005a) and identify the binding sites of a protein. However, interaction data is spread across multiple repositories and codified using various nomenclature systems (Mathivanan et al., 2006). In consequence, experimental biologists face difficulties when trying to find all known interactions for their proteins of interest, and the computational analysis and usage of protein interaction data is usually constrained to using a partial subset of all available knowledge.
For example, any comprehensive search of interactions for a particular protein must include at least seven databases of protein-protein interactions: the Database of Interacting Proteins (DIP) (Salwinski et al., 2004), the MIPS database of interactions (Pagel et al., 2005), the Molecular INTeraction database (MINT) (Chatr-aryamontri et al., 2007), IntAct, the Biomolecular Interactions Database (BIND) (Alfarano et al., 2005), the BioGrid (Stark et al., 2006) and the Human Protein Reference Database (HPRD) (Peri et al., 2003).
Besides, each database uses different strategies for identifying proteins, and translations between synonym identifiers (i.e. identifiers linked to the same protein sequence) are required before any manual search or automatic processing. Moreover, there are methods for predicting protein interactions that can be used when no experimental interactions have been detected for a protein, but results from these methods are usually spread across multiple websites, each one in its own format.
There are efforts to standardize and harmonize protein interaction data. HUPO-PSI (Hermjakob, 2006) has developed a schema that enables the description of interactions between a wide range of molecular types, thus facilitating access and data exchange between different research groups. The IMEx consortium is a group of major public interaction data providers sharing curation effort and exchanging completed records on molecular data following the HUPO standard exchange format. In consequence, the rate of data curation and data sharing between different repositories has improved, but integration is still not complete. For example, the HUPO PSI-MI 2.5 format allows the identification of interactors by unique identifiers from different databases, but the guidelines implemented do not include a strategy for naming proteins, which leaves many of the integration issues unresolved.
The issue of protein nomenclature has been addressed by internationally recognized scientific organizations like HGNC (Wain et al., 2002) and SGD (Christie et al., 2004), but they do not cover all species and do not map all database identifiers. IPI (Kersey et al., 2004) offers a non-complete redundant data set with cross-references with external identifiers.
The importance of protein interaction analysis has prompted the development of tools focused on protein interaction networks and their visualization, analysis and data integration (Aittokallio et al., 2006; Cline et al., 2007). For example, Cytoscape is focused on centralizing network analysis tools on a single platform with built-in visualization (Shannon et al., 2003). Other visualization and analysis tools include Osprey (Breitkreutz et al., 2003), VisANT (Hu et al., 2004), and ProViz (Iragne et al., 2005). On the other hand, current packages aimed at data integration include tYNA (Yip et al., 2006), a web system for managing, comparing and mining multiple networks, and cPath (Cerami et al., 2006), a platform for collecting and storing biological pathways that can be used from third-party software for visualization and analysis. Some other works provide merged views of most public interaction data, such as MiMI (Jayapandian et al., 2007), APID (Prieto et al., 2006), and UniHI (Chaurasia et al., 2007). While these tools have been shown to be useful for creating and analyzing protein-protein interaction networks, there is still the need for an integration engine that truly unifies all available data into a single network and allows automatic analyses on a global scale. Most current integration tools are designed to work with interactions coming from one single type of data format, and others have problems when dealing with interactions codified using different types of protein identifiers. Two previous analyses of public interaction data (Futschik et al., 2007; Mathivanan et al., 2006) concluded that the overlap between repositories is small but significant, and showed that the different interaction maps suffer from sampling and detection biases. The integration strategy of both works consisted of mapping all binary interactions to pairs of Entrez Gene identifiers.
Marcotte and coworkers (Hart et al., 2006) analyzed yeast and human interaction data sets, and estimated that their protein interaction networks should contain 37,800-75,500 and 154,000-369,000 interactions respectively.
In a recent work, we presented PIANA (Protein Interactions And Network Analysis), a framework for creating, managing and analyzing protein-protein interactions (Aragues et al., 2006). Here, we describe the PIANA approach to protein nomenclature and its strategy for protein-protein interaction data integration. Furthermore, we describe the properties of the experimental interaction network obtained for all species by integrating interactions from DIP (Salwinski et al., 2004), MIPS (Pagel et al., 2005), MINT (Chatr-aryamontri et al., 2007), IntAct, BIND (Alfarano et al., 2005), BioGrid (Stark et al., 2006) and HPRD (Peri et al., 2003). We also describe the properties of the interaction networks obtained from different methods of protein-protein interaction prediction. We conclude by discussing potential enhancements to the integration approach described here.
Table 1: Summary of the most relevant protein identifier types, calculated from a total of 6,476,028 distinct sequenceIDs in the database. Columns are: identifier type, number of distinct identifiers, the proportion of proteinIDs with respect to external identifier correspondences, the proportion of external identifiers with respect to proteinIDs, and the percentage of proteinIDs covered by the external identifier. Primary gene symbols are those gene symbols that have been established as the official gene name by nomenclature authorities such as HUGO (Wain et al., 2002) or FlyBase (Crosby et al., 2007).

Methods for the Prediction of Protein Interactions
We used predictions of protein-protein interactions obtained by four different methods: (i) Gene fusion, in which two proteins are predicted to interact if their corresponding genes appear fused in another genome (Enright et al., 1999); (ii) Phylogenetic profiles, in which similarity of phylogenetic profiles is interpreted as indicating that two proteins need to be simultaneously present to perform a given function together (Pellegrini et al., 1999); (iii) Distant conservation of sequence patterns and structure relationships, in which structural similarities among domains of known interacting proteins and conservation of pairs of sequence patches involved in protein-protein interfaces are used to predict putative protein interaction pairs (Espadaler et al., 2005b); and (iv) Structural interologs, in which interactions are transferred between proteins with the same structural domains (Aragues et al., 2006). Interactions for the first two methods were retrieved from STRING (von Mering et al., 2007) by querying the database for interactions with a score higher than 0.7 for that particular methodology. Interactions for (iii) were obtained from the work of Espadaler et al. (Espadaler et al., 2005b). Interactions for (iv) were predicted by transferring experimental interactions in PIANA between proteins with a domain within the same SCOP family.
When translating the nodes of the network to external protein identifiers (a process referred to as 'unifying the network'; see sections 'Mapping protein identifiers' and 'Protein-protein interactions integration'), there are two possibilities: 1) one proteinID corresponds to a single external identifier and 2) different proteinIDs correspond to the same identifier, and thus, nodes and interactions are merged. Therefore, the same PIANA proteinID network will correspond to different unified networks, depending on the external identifier. Statistics in this article have been obtained after unifying the networks by NCBI geneID. Although geneIDs only cover 42% of proteinIDs, the cardinality proteinID:externalIdentifier is the highest (Table 1), and therefore geneID is the best suited identifier type for obtaining an unbiased view of the integrated protein interaction network. Protein sequences of unknown geneID were unified using UniProt accessions.
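The unification step can be illustrated with a minimal sketch. It assumes a network given as pairs of proteinIDs and two lookup tables (proteinID to geneID, and proteinID to UniProt accession as a fallback); the function name, the table layout and the accession 'P00000' are hypothetical and are not taken from the PIANA code base.

```python
def unify_network(edges, gene_ids, uniprot_accessions):
    """Collapse a proteinID network onto external identifiers.

    edges: iterable of (proteinID, proteinID) pairs.
    gene_ids: dict proteinID -> NCBI geneID (preferred identifier).
    uniprot_accessions: dict proteinID -> UniProt accession (fallback).
    Returns a set of undirected edges between labelled external identifiers.
    """
    def label(protein_id):
        if protein_id in gene_ids:
            return ("geneID", gene_ids[protein_id])
        if protein_id in uniprot_accessions:
            return ("uniprot", uniprot_accessions[protein_id])
        return None  # no external identifier known for this sequence

    unified = set()
    for a, b in edges:
        node_a, node_b = label(a), label(b)
        if node_a is None or node_b is None:
            continue
        # frozenset makes the edge undirected; proteinIDs sharing an
        # identifier collapse onto the same node, so duplicate edges merge.
        unified.add(frozenset((node_a, node_b)))
    return unified


# proteinIDs 1 and 2 share geneID 217, so edges 1-3 and 2-3 merge into one.
edges = [(1, 3), (2, 3), (4, 3)]
gene_ids = {1: 217, 2: 217, 3: 3336}
uniprot_accessions = {4: "P00000"}  # hypothetical accession
unified = unify_network(edges, gene_ids, uniprot_accessions)
print(len(unified))
```

In this toy input, three proteinID-level interactions become two unified interactions, which is why the same proteinID network yields different unified networks for different identifier types.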

Mapping Protein Identifiers
PIANA handles an extensive set of protein identifier types: UniProt entries and accessions; gene symbols; NCBI gi, geneID, Unigene and accession numbers; ENSEMBL; RefSeq; PDB; and FastA formatted sequences. PIANA internally identifies proteins with proteinIDs (integers). Each proteinID is linked to a pair [aminoacid sequence, taxonomy id], so there is a unique identifier for each protein sequence for a given organism. This allows PIANA to use the <sequence, species> of the protein as an inter-lingua between the external identifiers provided by the main repositories of genes and proteins. Therefore, one external protein identifier (e.g. UniProt entry THRB_HUMAN) can be associated to one or more proteinIDs (e.g. 11483), which are in turn linked to other external identifiers that are also used to represent that protein (e.g., gene symbol 'f2' and Unigene 'Hs.410092'). Consequently, throughout the input and output processes of PIANA, external identifiers are 'translated' to proteinIDs, the desired operations are performed, and finally, if needed, proteinIDs are translated back into the external identifier type expected by the user (Figure 2). This strategy reduces ambiguity and processing problems to a minimum: there is no need for continuously translating between distinct types of protein identifiers, since all information has been previously stored by assigning it to specific proteinIDs. Furthermore, codifying interactions in terms of proteinIDs allows PIANA to capture a larger number of interactions than platforms based on third-party protein identifiers. A set of parsers inserts information from external repositories into the PIANA database. The data warehousing approach and software architecture of PIANA are shown in Figure 1 (see additional file X for details). The PIANA database is accessed by the Graph library through a database interface, which is also used by the PIANA library to create, manage and analyze protein-protein interaction networks.
The whole process can be controlled from a user interface module. PIANA keeps all information in terms of proteinIDs (an integer that uniquely identifies a protein sequence of a given taxonomy). User inputs are immediately translated to proteinIDs. Once this translation has been performed, all operations are performed at the sequence level, reducing ambiguities and synonym conversions to a minimum.
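The role of the proteinID as an inter-lingua can be sketched with a small in-memory registry. This is only an illustration under stated assumptions: the class and method names are hypothetical, the real system stores these correspondences in a relational database, and the sequence string below is a placeholder rather than a real protein sequence.

```python
from collections import defaultdict


class ProteinRegistry:
    """Toy stand-in for the identifier tables (hypothetical, not PIANA's API)."""

    def __init__(self):
        self._protein_ids = {}                # (sequence, taxonomy_id) -> proteinID
        self._by_external = defaultdict(set)  # (id_type, identifier) -> {proteinID}
        self._externals = defaultdict(set)    # proteinID -> {(id_type, identifier)}

    def protein_id(self, sequence, taxonomy_id):
        """Return the integer proteinID of a [sequence, taxonomy id] pair."""
        key = (sequence, taxonomy_id)
        if key not in self._protein_ids:
            self._protein_ids[key] = len(self._protein_ids) + 1
        return self._protein_ids[key]

    def add_cross_reference(self, id_type, identifier, sequence, taxonomy_id):
        """Record that a source database uses this identifier for this sequence."""
        pid = self.protein_id(sequence, taxonomy_id)
        self._by_external[(id_type, identifier)].add(pid)
        self._externals[pid].add((id_type, identifier))
        return pid

    def to_protein_ids(self, id_type, identifier):
        """Translate an external identifier into its set of proteinIDs."""
        return set(self._by_external.get((id_type, identifier), ()))

    def translate(self, id_type, identifier, target_type):
        """Translate between identifier types by going through proteinIDs."""
        found = set()
        for pid in self.to_protein_ids(id_type, identifier):
            found.update(i for t, i in self._externals[pid] if t == target_type)
        return found


registry = ProteinRegistry()
placeholder_seq = "MSEQA"  # placeholder, not the real sequence
# Two different sources describe the same human sequence (taxonomy id 9606).
registry.add_cross_reference("uniprot_entry", "THRB_HUMAN", placeholder_seq, 9606)
registry.add_cross_reference("gene_symbol", "f2", placeholder_seq, 9606)
print(registry.translate("uniprot_entry", "THRB_HUMAN", "gene_symbol"))
```

Because both identifiers point to the same [sequence, taxonomy id] pair, their equivalence is obtained without any source stating it explicitly, and the same sequence registered under a different taxonomy id receives a different proteinID.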
Moreover, PIANA uses a number of techniques to ensure the quality and completeness of the identifiers used as input/output: 1) inferring correspondences between identifiers and sequences even when no external database explicitly contains the cross-reference: if one database identifies sequence A with identifier id1 and another database uses identifier id2 for sequence A, PIANA infers that id1 is equivalent to id2; 2) uniqueness of output protein identifiers: if two proteinIDs are linked to the same external identifier, those proteins are considered to be the same, and hence, merged into a single network node; 3) avoiding gene name ambiguities: because the species of the protein is part of the internal identifier, gene names are not confounded even if the same symbol is used for several species; and 4) using representative protein identifiers: (i) PIANA will use the identifier labeled as 'preferred' by the source database (e.g. official gene symbol) unless the user specifies otherwise; and (ii) any input identifiers given by the user are prioritized over other identifiers in the PIANA database.
Since PIANA works internally with identifiers linked to the sequence of proteins (i.e. proteinID), the output identifier that is used for proteins depends not only on the type of identifier chosen by the user (e.g. UniProt) but also on the specific results that are being outputted. The reason is that one proteinID can be associated to several external identifiers (e.g. one sequence is associated to three gene names) and consequently, one of those external identifiers has to be chosen above the others. The algorithm used to choose among external identifiers depends on the input identifiers given by the user (they are prioritized over other identifiers) and the number of external databases that linked that sequence to the identifiers. Therefore, one proteinID will not always be represented in the output by the same external identifier.
Our internal protein identifiers do not distinguish between identical paralogs. We believe this distinction is not needed, since most repositories of interactions do not reach that level of specificity. Finally, proteinIDs are not intended to be new external protein identifiers; their only purpose is to be used for integration. Therefore, the way the integration is performed remains transparent to the user, whose only concern is to decide on the type of identifiers for input and output.
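The selection among several external identifiers can be sketched as a ranking. The two criteria (user-supplied identifiers first, then the number of supporting databases) come from the description above; the function name, the data layout, the identifiers 'geneA' and 'geneB', and the alphabetical tie-break are assumptions made for this illustration.

```python
def choose_output_identifier(candidates, user_inputs=()):
    """Pick one external identifier to represent a proteinID in the output.

    candidates: dict mapping identifier -> set of source databases that
        link that identifier to the sequence.
    user_inputs: identifiers the user supplied as input (given priority).
    """
    if not candidates:
        return None
    user_inputs = set(user_inputs)
    return min(
        candidates,
        key=lambda ident: (
            ident not in user_inputs,  # False sorts first: user input wins
            -len(candidates[ident]),   # then the most widely supported
            ident,                     # assumed deterministic tie-break
        ),
    )


# Hypothetical gene names attached to a single proteinID.
candidates = {"geneA": {"db1", "db2"}, "geneB": {"db1"}}
print(choose_output_identifier(candidates))                         # geneA
print(choose_output_identifier(candidates, user_inputs=["geneB"]))  # geneB
```

The two calls return different identifiers for the same proteinID, which illustrates why a protein is not always represented by the same external identifier in the output.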

Protein Sequences Integration
Sequence and taxonomy data was obtained from UniProt Swiss-Prot and TrEMBL (The Uniprot Consortium, 2007), NCBI GenBank (Benson et al., 2007) and NCBI Blast nr (Maglott et al., 2007) databases (see additional file 2 for the complete list of protein sequence repositories used). Unexpectedly, UniProt Swiss-Prot (i.e. curated sequences) and UniProt TrEMBL (i.e. predicted sequences) have a significant overlap (additional file 3). Moreover, the overlap between TrEMBL and GenBank is lower than anticipated. Table 1 shows the coverage provided by the main external protein identifiers for all proteinIDs (i.e. pair [protein sequence, taxonomy]) in the PIANA database.

Protein-Protein Interactions Integration
Each interaction described in a third-party database is 'translated' to one or more interactions between proteinIDs. For example, if the external database contains an interaction between proteins A and B, with A corresponding to two proteinIDs (e.g. 1 and 2) and B to one proteinID (e.g. 3), two interactions (1-3 and 2-3) will be inserted into the PIANA database. Both interactions will be described in the PIANA database as coming from that specific external database and labeled with the method used to detect the interaction between A and B. For example, HPRD describes an interaction between Entrez Gene 217 (mitochondrial ALDH) and Entrez Gene 3336 (heat shock protein). According to the correspondences in the PIANA database, Entrez Gene 217 corresponds to 13 different proteinIDs, and Entrez Gene 3336 corresponds to 12 proteinIDs. Therefore, PIANA will internally store the interaction between those two proteins as 156 different interactions. This methodology allows PIANA to give full control to the user: 1) interactions can be retrieved from any type of identifier; 2) a network can be created for a given external database (e.g. use only interactions from IntAct) and/or a specific method (e.g. do not use interactions detected in two-hybrid assays) and/or a species (e.g. only interested in human interactions); 3) PIANA outputs can be set to use any type of protein identifier and therefore, interactions between proteinIDs are transformed to non-redundant interactions between protein identifiers (Methods). Consequently, describing interactions in terms of protein sequences instead of external identifiers provides a true integration of all known interactions into a single network, while keeping record of the source databases and detection methods associated with the interaction. In general, PIANA accepts any interaction data that is in tabulated or PSI-MI (Hermjakob, 2006) formats.
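The expansion of one external interaction into proteinID-level interactions is a Cartesian product, as the following sketch shows. The function name and record layout are hypothetical, the proteinID values are invented for the example, and the method label is a placeholder because the text does not state the detection method of the HPRD example.

```python
from itertools import product


def expand_interaction(protein_ids_a, protein_ids_b, source_db, method):
    """Expand one external interaction into proteinID-level records.

    Each record keeps the source database and the detection method, so the
    origin of every stored interaction can still be traced.
    """
    records = set()
    for a, b in product(protein_ids_a, protein_ids_b):
        pair = tuple(sorted((a, b)))  # store undirected pairs in a fixed order
        records.add((pair, source_db, method))
    return records


# A maps to proteinIDs 1 and 2, B maps to proteinID 3: two interactions.
small = expand_interaction({1, 2}, {3}, "ExampleDB", "unspecified")
print(sorted(pair for pair, _, _ in small))

# 13 proteinIDs for one gene and 12 for the other (invented ID values).
ids_217 = range(100, 113)
ids_3336 = range(200, 212)
hprd = expand_interaction(ids_217, ids_3336, "HPRD", "unspecified")
print(len(hprd))
```

With disjoint sets of 13 and 12 proteinIDs the product contains 13 x 12 = 156 records, matching the number given in the text.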
See additional file 2 for the detailed description of interaction repositories that have been used in this work. Furthermore, data does not have to be integrated indiscriminately, without differentiating high-throughput from small-scale experiments and literature annotation. Therefore, PIANA allows users to define subsets of interactions based on the source repository and detection methods employed. For example, a subset of reliable interactions can be extracted by requiring them to be in at least two different repositories.
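Selecting interactions supported by at least two repositories can be sketched as a simple grouping step. The record layout (pair, source database, method) and the function name are assumptions of this example, and the sample records are invented.

```python
from collections import defaultdict


def reliable_interactions(records, min_repositories=2):
    """Return the pairs reported by at least `min_repositories` databases.

    records: iterable of (pair, source_db, method) tuples, where pair is a
        tuple of two proteinIDs in a fixed order.
    """
    sources = defaultdict(set)
    for pair, source_db, _method in records:
        sources[pair].add(source_db)
    return {pair for pair, dbs in sources.items() if len(dbs) >= min_repositories}


# Invented records: pair (1, 3) is reported by two repositories.
records = [
    ((1, 3), "DIP", "two hybrid"),
    ((1, 3), "MINT", "two hybrid"),
    ((2, 3), "DIP", "affinity"),
    ((2, 3), "DIP", "two hybrid"),  # same repository twice: still one source
]
print(reliable_interactions(records))
```

The same grouping could be keyed on the detection method instead of the repository to require support from independent experimental techniques.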

Experimental Interactions
The integrated set of experimental interactions consisted of 4,055,698 interactions between 113,785 different proteinIDs. When grouping proteinIDs by their associated NCBI geneID (Methods), there were 405,808 interactions for 53,143 proteins, an average of 7.63 interactions per protein.

Interactions Distribution
The experimental interactions in the PIANA integrated database have been obtained from 7 different repositories, belong to 736 different species, and were detected using 106 different experimental methods. As shown in Table 2, the species with the largest number of experimental interactions are yeast (111,535 interactions) and human (110,457 interactions). Most interactions were found in just one database and were detected by just one method (Figure 3). The high correlation between the number of methods and databases is explained by the fact that most interactions appear in just one external repository, and these repositories usually label interactions with a single detection method. We calculated the overlap between the 7 repositories with experimental information in terms of interactions (Table 3A) and proteins (Table 3B). BioGrid (Stark et al., 2006) is the repository with the highest number of interactions (216,370) and with the highest number of unique interactions (163,700). The two repositories that show the greatest overlap are MINT and IntAct (61% of interactions and 82% of proteins in MINT are also in IntAct) while the lowest overlap was between HPRD and DIP (only 4% of interactions and 9% of proteins in HPRD are also in DIP). Most low overlaps in terms of interactions are explained by the low overlap in terms of proteins. Therefore, data integration is required in order to obtain an interaction network that covers most proteins and interactions.
We were interested in analyzing the distribution of interactions in terms of the detection method employed. We examined the overlaps between different detection methods in terms of interactions (Table 4A) and proteins (Table 4B). We observed that high-throughput methods account for most of the known interactions (126,136 for affinity methods and 103,334 for yeast two-hybrid assays). The overlap between the interactions detected by the different methods is low, even in cases where the overlap at the protein level is high. For example, while 51% of proteins with interactions from affinity methods also had interactions detected by yeast two-hybrid methods, only 9% of interactions from yeast two-hybrid were also detected by affinity methods.
Table 3: For each repository, cells show the overlap with other repositories in terms of (A) interactions and (B) proteins. In parentheses, the percentage that the overlap represents over the repository from the pair with fewer interactions or proteins is shown. Unique interactions and proteins are those only appearing in that repository. This table reflects the overlaps in the interaction network unified by NCBI geneID identifiers.
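The overlap measure described in the table caption (shared items as a percentage of the smaller set of the pair) can be sketched as follows. The function names and the toy sets are assumptions for illustration; they are not the actual repository contents.

```python
def pairwise_overlap(items_a, items_b):
    """Return (shared count, percentage over the smaller of the two sets)."""
    shared = len(items_a & items_b)
    smaller = min(len(items_a), len(items_b))
    percentage = 100.0 * shared / smaller if smaller else 0.0
    return shared, percentage


def unique_to(name, repositories):
    """Items that appear in repository `name` and in no other repository."""
    others = set().union(
        *(items for other, items in repositories.items() if other != name)
    )
    return repositories[name] - others


# Toy repositories holding interaction keys (invented values).
repositories = {
    "repoA": {1, 2, 3, 4},
    "repoB": {3, 4, 5, 6, 7, 8, 9, 10},
}
print(pairwise_overlap(repositories["repoA"], repositories["repoB"]))
print(unique_to("repoA", repositories))
```

Normalizing by the smaller set means that a small repository fully contained in a large one scores 100%, regardless of how much additional data the larger one holds.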

Properties of the Experimental Integrated Protein Interaction Network
Well-documented observations about protein interaction networks are confirmed when analyzing the integrated experimental interaction networks of different species. Moreover, the integrated network shows the modular functional organization of the proteome reported by previous works (Gavin et al., 2006). In particular, proteins tend to interact with proteins of the same Gene Ontology (GO) (Harris et al., 2004) biological process (Table 5). Furthermore, 95% of the interacting proteins in the integrated network have the same cellular component according to GO. In addition, the following properties were observed for the yeast protein interaction network (Table 6): (i) yeast hubs (proteins with 5 or more interactions) are more likely to be essential (Giaever et al., 2002) than non-hubs (22% of hubs are essential versus only 5% of non-hubs), although this might be a reflection of hubs usually having multiple interfaces; (ii) approximately 59% of the interactions have the same cell localization according to (Lee et al., 2002); (iii) approximately 60% of the interactions reported are found coexpressed during the yeast cell cycle according to (Cho et al., 1998).
Table 5: Commonalities in localization, molecular function and biological process of experimentally detected interacting proteins. This table shows the fraction of experimentally detected interacting proteins with the following properties: a) co-localized according to GO cellular component terms; b) same biological process according to GO biological process terms; and c) same molecular function according to GO molecular function terms. An interaction was considered to respect the GO restriction if both interacting proteins shared a GO term when retrieving GO parents up to level 3 (Harris et al., 2004). In parentheses, the percentage of interactions where both interacting proteins share a GO term is shown. Interactions were used for the study only if both proteins had at least one GO term assigned. Interactions where a protein interacts with itself were discarded for this study.
Table 6: Yeast co-localization data was obtained from the work of Lee and coworkers (Lee et al., 2002). Yeast co-expression data was obtained from the work of Cho et al. (Cho et al., 1998). Yeast essentiality data was obtained from the work by Giaever et al. (Giaever et al., 2002). A yeast protein was considered a hub if it had 5 or more interaction partners. The interactions and proteins were included in the study for those cases in which information was available. Interactions where a protein interacts with itself were discarded for this study.
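The GO-sharing criterion used for Table 5 can be sketched as below. The text does not specify exactly how 'GO parents up to level 3' were retrieved, so this example assumes one possible reading: each term is expanded with the ancestors reachable in at most three parent steps. The term names, the parent map and the function names are all hypothetical.

```python
def expand_terms(terms, parents, max_steps=3):
    """Return the terms plus ancestors reachable in at most `max_steps` steps."""
    seen = set(terms)
    frontier = set(terms)
    for _ in range(max_steps):
        frontier = {p for t in frontier for p in parents.get(t, ())} - seen
        if not frontier:
            break
        seen |= frontier
    return seen


def fraction_sharing_go(interactions, annotations, parents, max_steps=3):
    """Fraction of usable interactions whose two proteins share a GO term.

    Self-interactions and interactions where either protein lacks a GO
    annotation are discarded, as described in the table caption.
    Returns (fraction, number of interactions used).
    """
    used = shared = 0
    for a, b in interactions:
        if a == b:
            continue
        terms_a, terms_b = annotations.get(a), annotations.get(b)
        if not terms_a or not terms_b:
            continue
        used += 1
        expanded_a = expand_terms(terms_a, parents, max_steps)
        expanded_b = expand_terms(terms_b, parents, max_steps)
        if expanded_a & expanded_b:
            shared += 1
    return (shared / used if used else 0.0), used


# Toy hierarchy: t1 and t2 share parent p; t3 has an unrelated parent q.
parents = {"t1": ["p"], "t2": ["p"], "t3": ["q"]}
annotations = {"A": {"t1"}, "B": {"t2"}, "C": {"t3"}, "D": set()}
interactions = [("A", "B"), ("A", "C"), ("A", "A"), ("A", "D")]
print(fraction_sharing_go(interactions, annotations, parents))
```

In the toy data only the A-B and A-C interactions are usable, and only A-B shares a term after expansion, so half of the usable interactions respect the restriction.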

Protein Function Prediction from the Experimental Integrated Network
Recently, it has been shown that the number of common interaction partners between two proteins can be used to annotate proteins (Brun et al., 2003; Samanta et al., 2003). We have studied the use of this heuristic to predict molecular functions and biological processes as defined by GO (Harris et al., 2004), by calculating the percentage of shared GO terms between proteins with common interaction partners (Figure 4). As expected, we observe that the interactions of a protein in the integrated network can be used to predict its function and the biological processes in which it intervenes. For example, proteins with 10-20 interaction partners in common share 90% of their GO biological process terms. Moreover, the accuracy of the predictions based on the integrated network is similar to that obtained when solely using the subset of interactions from DIP (Salwinski et al., 2004), while the number of annotated proteins is much higher (additional file 4).
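The common-partner heuristic can be sketched in two steps: counting shared interaction partners for each protein pair, and measuring how many GO terms the pair shares. The text does not define how the percentage of shared GO terms was computed, so the ratio of shared terms to all terms of the pair used here is an assumption, as are the function names and the toy network. The all-pairs loop is only suitable for small examples.

```python
from collections import defaultdict
from itertools import combinations


def build_neighbors(edges):
    """Map each protein to its set of interaction partners."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-interactions
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def common_partner_counts(edges):
    """Number of shared interaction partners for every pair with at least one."""
    neighbors = build_neighbors(edges)
    counts = {}
    for a, b in combinations(sorted(neighbors), 2):
        n_common = len(neighbors[a] & neighbors[b])
        if n_common:
            counts[(a, b)] = n_common
    return counts


def shared_term_percentage(terms_a, terms_b):
    """Shared terms as a percentage of all terms of the pair (assumed measure)."""
    union = terms_a | terms_b
    return 100.0 * len(terms_a & terms_b) / len(union) if union else 0.0


# Toy network: A and B both interact with X and Y; C interacts only with X.
edges = [("A", "X"), ("B", "X"), ("A", "Y"), ("B", "Y"), ("C", "X")]
counts = common_partner_counts(edges)
print(counts[("A", "B")])
print(shared_term_percentage({"go1", "go2"}, {"go2", "go3"}))
```

Grouping protein pairs by their common-partner count and averaging the shared-term percentage within each group would give a curve of the kind plotted in Figure 4.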

Predicted Interaction Networks
We were interested in assessing protein interaction predictions and evaluating the similarities between the predicted interaction network and the experimental interaction network. In particular, we studied 4 different types of predictions (Methods): gene fusion (Enright et al., 1999), phylogenetic profiles (Pellegrini et al., 1999), distant conservation of sequence patterns and structure relationships (Espadaler et al., 2005b), and structural interologs (Aragues et al., 2006). We calculated the overlap between the different experimental and prediction methods in terms of interactions (Table 7A) and proteins (Table 7B), observing a high overlap between prediction methods based on genome analyses (i.e. gene fusion events and phylogenetic profiles) and a very low overlap between all other prediction methods. This minimal overlap between interaction predictions is explained by the different types of input data used by each method and the type of proteins for which the methods are capable of predicting interactions. For example, the method based on structural interologs predicts interactions for proteins with known 3D structure, while STRING predictions from gene fusion events were mainly applied to prokaryotes. Most proteins with known 3D structure are from eukaryotes (Berman et al., 2000), and therefore, the two methods rarely predict similar interactions. Moreover, there is low overlap between predicted interactions and those obtained by experimental high-throughput methods, both in terms of interactions and proteins. These results indicate that different methods identify interactions for different proteins. For example, there are many species for which no yeast two-hybrid experiments have been carried out, while many predictions can be 'transferred' to those species on the basis of genome analysis, resulting in a low overlap at the interaction and protein level between the two methods.
Figure 4: The percentage of shared GO terms is plotted as a function of the number of common interaction partners.
Table 7: Pairwise overlaps of protein interactions and proteins for four interaction prediction methods, two types of high-throughput methods (yeast two-hybrid assays and affinity purification methods), and curated data (in vitro and in vivo).
For each method, cells show the overlap with other methods in terms of (A) interactions and (B) proteins.
In parentheses, the percentage that the overlap represents over the method from the pair with fewer interactions or proteins is shown. This table reflects the overlaps in the interaction network unified by NCBI geneID identifiers.
We evaluated whether interacting proteins according to different prediction methods tended to share biological process, molecular function and cellular component according to GO (Table 8). We observed that the method that best captures functional relationships between proteins is the one based on gene fusion events (Methods): 85% of the predicted interacting pairs belong to the same biological process. Moreover, all prediction methods detected a considerable number of colocalized proteins. For example, 87% of interacting proteins according to the prediction method based on structural interologs had the same cellular location.

Discussion
We presented the data integration approach of PIANA, a software framework designed for creating, managing and analyzing protein-protein interaction networks. PIANA was created to address nomenclature and integration issues common in protein interaction repositories and network visualization tools. Moreover, the modular approach of PIANA makes it a useful resource for bioinformaticians wishing to avoid the low-level details related to working with protein interaction networks.
Many areas of biological research are hampered by the difficulties found in accessing all biological information available. In particular, protein-protein interaction analysis is usually biased by the input sources of data. PIANA is one of the very few protein interaction platforms where all interactions from all external databases can be found for a protein of interest, regardless of the type of identifier used as input or the name given to the protein by the researcher who submitted the interactions. We presented a detailed analysis of the protein-protein interactions in the integrated network, in terms of their distribution across different databases and detection methods. We showed that most interactions appear in just one database and the overlap in terms of interactions is below 50% between most repositories, reinforcing the need for tools that unify all known interactions into a single network. Moreover, this integrated network has been shown to agree with properties previously reported about protein-protein interaction networks retrieved from just one database/detection method, such as its capability of predicting the function of proteins. Besides, the overlap between different experimental and prediction methods for protein-protein interaction identification was low, both in terms of interactions and proteins for which at least one interaction has been described. Despite this low overlap, interaction prediction approaches such as those based on gene fusion events and structural interologs were successful at identifying pairs of proteins within the same GO biological process. However, more in-depth studies should be undertaken to evaluate the ability to annotate proteins based on interaction predictions.
Our analysis of protein interaction data in the public domain is similar to the studies of Herzel et al. (Futschik et al., 2007) and Pandey and coworkers (Mathivanan et al., 2006). However, our study includes protein interactions for all species, as well as predicted interactions from diverse methods. Moreover, we have analyzed the overlap between diverse experimental and prediction methods. The main conclusions from the studies in (Futschik et al., 2007) and (Mathivanan et al., 2006) are confirmed for interactions for organisms other than human. However, we found a higher overlap between the different interaction repositories, probably due to recent efforts in data exchange. Moreover, the total number of interactions in the experimental human integrated network is 110,457, compared to the 154,000-369,000 interactions estimated by Marcotte and coworkers (Hart et al., 2006). PIANA's approach to data integration strikes a good balance between reliability and flexibility, while giving a good coverage of the information available. Two potential improvements to the current integration approach are: (i) the implementation of more sophisticated gene name disambiguation (Schijvenaars et al., 2005; Xu et al., 2007); and (ii) the capability of detecting highly similar protein sequences (e.g. via sequence alignments) and thus, transferring interactions and identifiers between similar proteins. The data integration techniques described here could also be of help for areas other than protein-protein interactions, such as gene expression studies or regulatory networks.
Table 8: This table shows the fraction of predicted interacting proteins with the following properties: colocalized according to GO cellular component terms; same biological process according to GO biological process terms; and same molecular function according to GO molecular function terms. An interaction was considered to respect the GO restriction if both proteins shared a GO term when retrieving GO parents up to level 3. Interactions were used for the study only if both proteins had at least one GO term assigned. Interactions where a protein interacts with itself were discarded from this study.