Benchmarking Human Protein Complexes to Investigate Drug-Related Systems and Evaluate Predicted Protein Complexes

Protein complexes are key entities that perform cellular functions, and some human diseases have been shown to be associated with specific human protein complexes. Human protein complexes are widely used for protein function annotation, inference of the human protein interactome, disease gene prediction, and so on. It is therefore highly desirable to build an up-to-date catalogue of human complexes to support research in these applications. As expected, protein complexes from different databases are highly redundant. In this paper, we designed a set of concise operations to compile these redundant human complexes and built a comprehensive catalogue called CHPC2012 (Catalogue of Human Protein Complexes). CHPC2012 achieves a higher coverage of proteins and protein complexes than the individual databases. Its complexes are also verified to be of high quality, as its co-complex protein associations have a high overlap with protein-protein interactions (PPI) in various existing PPI databases. We demonstrated two distinct applications of CHPC2012: investigating the relationship between protein complexes and drug-related systems, and evaluating the quality of predicted protein complexes. In particular, CHPC2012 provides more insights into drug development. For instance, proteins involved in multiple complexes (the overlapping proteins) are potential drug targets; the drug-complex network can be used to investigate multi-target drugs and drug-drug interactions; and the disease-specific complex-drug networks provide new clues for drug repositioning. With this up-to-date reference set of human protein complexes, we believe that the CHPC2012 catalogue can enhance studies of protein interactions, protein functions, human diseases, drugs, and related fields of research. CHPC2012 complexes can be downloaded from http://www1.i2r.a-star.edu.sg/xlli/CHPC2012/CHPC2012.htm.


A drug-complex network
The drug-complex network for our CHPC2012 complexes consists of 2648 nodes (1835 drugs and 813 complexes) and 9916 edges, as shown in Figure S1. In this figure, orange diamonds represent drugs and green circles represent protein complexes in CHPC2012.

Human protein complexes in Gene Ontology (GO)
We processed the "cellular-component" sub-ontology of GO and collected 486 human complexes. Table S1 shows the co-complex associations for these 486 GO complexes, as well as for the three aforementioned databases, i.e., CORUM, HPRD and PINdb. The GO complexes have the lowest fraction of co-complex protein associations found in the HPRD and BioGrid PPI databases, which indicates that the raw set of GO complexes has the lowest quality. We also processed the GO complexes with our Algorithm 1 (see the main manuscript). Table S2 shows that this does not improve the quality of the GO complexes in terms of the percentage of co-complex associations found in existing PPI databases, whereas the quality of the complexes in the other three databases (CORUM, HPRD and PINdb) improves significantly. Note that CHPC2012 is obtained by integrating the CORUM, HPRD and PINdb databases. The results in Tables S1 and S2 demonstrate that the quality of the GO complexes is low, which is the main reason we did not include GO complexes when building CHPC2012.

Parameter Settings
Algorithm 1 in our main manuscript has two parameters, namely overlap thres and merge thres.
The parameter overlap thres is used to determine whether two complexes are redundant (i.e., whether they match each other). A previous study [1] set this parameter to 0.5, and we followed that suggestion in our experiments. As a specific example, suppose that two complexes each have 8 proteins. They are considered to match when they have at least 6 proteins in common: if the intersection has 6 proteins, the union has 10 proteins and the Jaccard coefficient is 0.6; if the intersection has 5 proteins, the union has 11 proteins and the Jaccard coefficient is 0.455, which is below the threshold [1].
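The arithmetic in this example can be checked with a minimal sketch. The function name jaccard, the constant OVERLAP_THRES and the protein identifiers P1 to P11 are hypothetical, and treating a coefficient equal to the threshold as a match (>=) is an assumption; this is an illustration of the Jaccard test, not the implementation of Algorithm 1.

```python
def jaccard(complex_a, complex_b):
    """Jaccard coefficient |A & B| / |A | B| of two protein sets."""
    a, b = set(complex_a), set(complex_b)
    union = a | b
    if not union:
        return 0.0  # two empty complexes: define the coefficient as 0
    return len(a & b) / len(union)


OVERLAP_THRES = 0.5  # value used in the text, following [1]

# Hypothetical protein identifiers; every complex has 8 proteins.
a = {f"P{i}" for i in range(1, 9)}    # P1..P8
b6 = {f"P{i}" for i in range(3, 11)}  # P3..P10, shares 6 proteins with a
b5 = {f"P{i}" for i in range(4, 12)}  # P4..P11, shares 5 proteins with a

j6 = jaccard(a, b6)  # 6 shared / 10 in the union
j5 = jaccard(a, b5)  # 5 shared / 11 in the union

# Assumption: ">=" is used for the comparison against the threshold.
print(round(j6, 3), j6 >= OVERLAP_THRES)
print(round(j5, 3), j5 >= OVERLAP_THRES)
```

With these toy sets, the pair sharing 6 proteins passes the 0.5 threshold and the pair sharing 5 proteins does not, which agrees with the example above.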
The parameter merge thres is used to determine whether or not to merge two redundant complexes. We process redundant complexes cautiously: if merge thres is set too low, two different complexes may be merged arbitrarily merely because they share some protein components, which could generate false positive protein complexes.
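How the two thresholds might interact can be sketched as follows. This is only an illustration under stated assumptions: it assumes that merge thres is compared against the same Jaccard coefficient as overlap thres, that both comparisons use >=, and it does not model what Algorithm 1 does with redundant pairs that are not merged. The function name classify_pair and the labels it returns are hypothetical; the default of 0.8 is the merge thres value used in our experiments.

```python
def classify_pair(complex_a, complex_b, overlap_thres=0.5, merge_thres=0.8):
    """Label a pair of complexes as 'distinct', 'redundant' or 'merge'.

    Assumption: both thresholds are applied to the Jaccard coefficient
    of the two protein sets.
    """
    a, b = set(complex_a), set(complex_b)
    union = a | b
    score = len(a & b) / len(union) if union else 0.0
    if score < overlap_thres:
        return "distinct"   # not considered matching
    if score < merge_thres:
        return "redundant"  # matching, but too different to merge
    return "merge"          # matching and similar enough to merge


# Hypothetical protein identifiers.
c1 = {f"P{i}" for i in range(1, 9)}    # P1..P8
c2 = {f"P{i}" for i in range(3, 11)}   # P3..P10, Jaccard 6/10 with c1
c3 = {f"P{i}" for i in range(1, 11)}   # P1..P10
c4 = {f"P{i}" for i in range(1, 10)}   # P1..P9, Jaccard 9/10 with c3
c5 = {"Q1", "Q2"}                      # shares nothing with c1

labels = [classify_pair(c1, c2), classify_pair(c3, c4), classify_pair(c1, c5)]
print(labels)
```

Under these assumptions, lowering merge_thres toward overlap_thres moves pairs such as (c1, c2) from "redundant" to "merge", which is the arbitrary merging that the paragraph above warns against.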
As introduced in the main manuscript, co-complex protein associations are defined as all the pairwise links between proteins within the same complex. We can assess the quality of a set of protein complexes by mapping its co-complex associations to existing PPI databases: a set of protein complexes with a higher percentage of co-complex associations overlapping with existing PPI databases tends to have higher quality [2].
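The definition above can be sketched in a few lines. The function names co_complex_associations and ppi_overlap_ratio and the toy proteins A to E are hypothetical, and protein pairs are assumed to be unordered; this is a minimal sketch of the quality measure, not the exact procedure used to produce the tables and figures.

```python
from itertools import combinations


def co_complex_associations(complexes):
    """All unordered protein pairs that co-occur in at least one complex."""
    pairs = set()
    for members in complexes:
        # Sorting gives each unordered pair a single canonical form (p < q).
        for p, q in combinations(sorted(set(members)), 2):
            pairs.add((p, q))
    return pairs


def ppi_overlap_ratio(complexes, ppi_pairs):
    """Fraction of co-complex associations that also appear as PPIs."""
    associations = co_complex_associations(complexes)
    if not associations:
        return 0.0
    known = {tuple(sorted(pair)) for pair in ppi_pairs}
    return len(associations & known) / len(associations)


# Toy data with hypothetical protein names.
toy_complexes = [{"A", "B", "C"}, {"C", "D"}]
toy_ppi = [("B", "A"), ("C", "D"), ("A", "E")]

ratio = ppi_overlap_ratio(toy_complexes, toy_ppi)
print(ratio)
```

In this toy case the complexes yield four associations (A-B, A-C, B-C, C-D), two of which (A-B and C-D) appear in the PPI list.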
CHPC2012 contains a different number of complexes for different values of merge thres. We therefore performed additional experiments to investigate how the value of merge thres affects the quality of the final list of protein complexes. In particular, Figure S2 shows the percentage (ratio) of co-complex associations of CHPC2012 that overlap with existing PPI databases, i.e., HPRD and BioGrid. The overall trend of the ratio curves is clear: the ratio (i.e., the quality of CHPC2012) increases as the value of merge thres increases. Two additional observations can be made from Figure S2.
Firstly, the ratio increases quite rapidly with merge thres when merge thres is small (i.e., in [0.5, 0.65]). Smaller values of merge thres lead to lower quality of CHPC2012, indicating that merging complexes with small values of merge thres is indeed arbitrary.
Secondly, the ratio increases much more slowly when merge thres is relatively large (i.e., in [0.7, 0.95]). This indicates that the quality of CHPC2012 is stable once merge thres is sufficiently large. Therefore, we prefer to set merge thres in the range [0.7, 0.95]. On the other hand, the numbers of co-complex associations and proteins covered by CHPC2012 decrease as merge thres increases, i.e., the coverage of CHPC2012 decreases. To balance the coverage against the ratio of co-complex associations, we finally set merge thres to 0.8 in our experiments.