Proteome Data Improves Protein Function Prediction in the Interactome of Helicobacter pylori *

Helicobacter pylori is a common pathogen that is estimated to infect half of the human population, causing several diseases such as duodenal ulcer. Despite one of the first pathogens to be sequenced, its proteome remains poorly characterized as about one-third of its proteins have no functional annotation. Here, we integrate and analyze known protein interactions with proteomic and genomic data from different sources. We find that proteins with similar abundances tend to interact. Such an observation is accompanied by a trend of interactions to appear between proteins of similar functions, although some show marked cross-talk to others. Protein function prediction with protein interactions is significantly improved when interactions from other bacteria are included in our network, allowing us to obtain putative functions of more than 300 poorly or previously uncharacterized proteins. Proteins that are critical for the topological controllability of the underlying network are significantly enriched with genes that are up-regulated in the spiral compared with the coccoid form of H. pylori. Determining their evolutionary conservation, we present evidence that 80 protein complexes are identical in composition with their counterparts in Escherichia coli, while 85 are partially conserved and 120 complexes are completely absent. Furthermore, we determine network clusters that coincide with related functions, gene essentiality, genetic context, cellular localization, and gene expression in different cellular states.

Helicobacter pylori (H. pylori) 1 is a pervasive pathogen that is uniquely adapted to life in the acidic environment of the human stomach and associated with gastric inflammation and duodenal ulcer (1,2). Persisting in such an environment by tightly associating with epithelial cells, H. pylori affects an estimated half of the human population. As a consequence, H. pylori is notorious for causing low-level inflammation and duodenal ulcer as well as stomach carcinoma and MALT (mucosa-associated lymphoid tissue) lymphoma (1-3), causing 700,000 deaths annually worldwide (4).
The genome of H. pylori reference strain 26695 was completely sequenced in 1997 (5) and encodes ϳ1,587 proteins with about 34% remaining uncharacterized (6). Given its impact on world health, a concerted effort is required to understand this significant number of proteins and their role in infection and disease.
Interactions between proteins are needed for almost all biological processes, helping to understand pathways as well as linking poorly or uncharacterized proteins. Only a few comprehensive bacterial interactome studies have been published to date, such as Escherichia coli (7), Campylobacter jejuni (8) and Mycobacterium tuberculosis (9). In particular, protein interactions of H. pylori were among the first to be determined in bacteria (10), an interactome that has been recently expanded (11), capturing roughly 70% of the proteome. While such interactomes have been detected using yeast two-hybrid methods, a few studies also identified bacterial protein complexes (12)(13)(14)(15).
Several studies have attempted to characterize the proteome of H. pylori. Bumann et al. (16) found more than 1,800 protein spots on 2-dimensional gels, of which 200 were identified. Similarly, Jungblut et al. (17) found up to 1,800 protein spots on 2-dimensional gels. 152 were identified, including 27 proteins that corresponded to hitherto hypothetical proteins (17). Govorun et al. (18) analyzed the proteomes of four H. pylori clinical isolates and identified 126 proteins. More recently, Jungblut et al. (19) used intensive prefractionation to identify a total of 567 proteins (36.6% of the proteome). Recently, we have identified 1,190 and 1,143 proteins by 2D-LC-MS and GeLC-MS, respectively (20,21), representing roughly 72% of the H. pylori proteome.
As proteomes and interactomes have been determined independently, their relationship remains unclear. Here, we integrate proteomic quantitative measurements in a network of roughly 3,000 protein-protein interactions (10,11). Our analyses of diverse datasets allow us to explore the role of abundance in both the proteome and interactome as well as the structure and functionality of networked patterns. Investigating the proteomes of spiral and coccoid forms of H. pylori, we find that proteins that are critical for the control of the underlying interactome are significantly enriched with genes that are differentially expressed in the spiral form. Such observations potentially point to single proteins that play a role in the adaptability of the pathogen to different physiological conditions. Furthermore, we predict the function of more than 300 previously poorly annotated genes as well as protein complexes and functional network clusters in H. pylori. As a consequence, our integration and analysis of various large-scale datasets provide new insights into the proteome, interactome, and physiology to significantly improve our knowledge of this important pathogen.

EXPERIMENTAL PROCEDURES
Essential Genes-We collected essentiality data from several comprehensive genetic studies in H. pylori (22,23). Furthermore, we added genes that were essential for H. pylori colonization (24,25).
Relative and Absolute Protein Quantification-Data on relative changes in protein abundance between coccoid and spiral cells were extracted from our previous study (21). Briefly, four biological replicates of coccoid and spiral H. pylori cells were measured by LC-MS using a stable isotope labeling in cell culture based approach, capturing 1,143 proteins.
As for proteomic abundances, we collected data from a previous study as well (20). Nine replicates were measured by LC-MS/MS analysis without sample fractionation, capturing the abundance values of 1,190 proteins in H. pylori.
Protein-protein Interactions in H. pylori-We collected a total of 3,002 protein interactions from two high-throughput studies. In particular, we considered a set called PIM1 from (10) and PIM2 from (11) that both were determined by yeast two-hybrid approaches. We also identified 1,466 interactions that were classified as core data (i.e. high confidence) as they represent the overlap of PIM1 and PIM2 (10,11).
Clustering Analysis-We used the Markov Clustering (26) algorithm (MCL) to identify clusters of interactions in the combined core H. pylori network. Applying different combinations of parameters, we automatically assessed each cluster's ability to significantly enrich coherent proteins. In particular, we utilized functional annotations from the Comprehensive Microbial Resource (27), gene ontology (GO) (28) and Kyoto Encyclopedia of Genes and Genomes (KEGG) database (29). Furthermore, we utilized gene essentiality from (22)(23)(24)(25). For microarray analyses, we utilized 16 sets of gene expression analyses of H. pylori (27) from the Comprehensive Microbial Resource and considered three cases: genes up-regulated (ϩ), genes down-regulated (Ϫ), and genes differently regulated (ϩ or Ϫ). Each experiment is identified by a number, a title, and the author (Table S4). For GO term enrichment analysis, we used the TopGo python library (30). For other annotations, we used Fisher's exact test and considered clusters if they enriched genes with p Ͻ 0.05.
Functional Classes of Proteins-H. pylori proteins were grouped according to broad functional classes that were defined by clusters of orthologous groups (COGs) (31,32) since COGs provide a consistent classification of bacterial genes based on orthologous groups.
Enrichment Analysis-Binning proteins with a certain characteristic d (e.g. with a given number of interactions), we calculated the fraction of proteins that had a feature i in each group d, f i (d). As a null model, we randomly sampled protein sets with feature i of the same size 10,000 times and calculated the corresponding random fraction, f i,r (d). The enrichment/depletion of proteins with feature i in a group d is then defined as Interactions Between Functional Classes-Proteins of H. pylori were grouped according to their protein abundance. Focusing on a set of protein interactions, we counted the occurrence of different abundance group combinations (33). For each combination of abundance groups i, j, we determined its probability p o ͑i, j͒ ϭ n ij N , where N is the total number of interactions between the underlying abundance groups. As a null model, we determined an expected probability of interactions between classes i, j: p e ͑i, j͒ ϭ Specifically, v i is the number of viable proteins in group i (i.e. proteins of group i that are involved in at least one interaction in the underlying set), and J i,j is the number of genes that are involved in both groups. Combining these probabilities, we determined a log-odds ratio: r ϭ Ϫ p e ͒ Ϫ1 . For large samples, we estimated the variance of the odds distribution In particular, we calculated a p value for the significance of a link between two groups by a Z-test, where z ϭ r and considered each link with p Ͻ 0.05 (33).
Bacterial Meta-protein Interaction Data-We used 2,231 binary interactions between E. coli proteins that we have previously determined through yeast two-hybrid screens (7). As for other yeast twohybrid screen sets, we utilized 12,012 interactions in Campylobacter jejuni (8) (33), 783 interactions in Bacillus subtilis (37), and 8,042 interactions in M. tuberculosis (9).
Utilizing all-versus-all BLASTP searches with the InParanoid script (38) in protein sets of two species, sequence pairs with mutually best scores were selected as central orthologous pairs. Proteins of both species that showed such an elevated degree of homology were clustered around these central pairs, forming orthologous groups. The quality of the clustering was further assessed by a standard bootstrap procedure. We only considered the central orthologous sequence pair with a confidence level of 100% as the real orthologous relationship. Protein sequence information of bacterial organisms was retrieved from Uniprot (39).
Functional Prediction of Unknown Proteins in H. pylori-We modeled the prediction of a functional class of a protein i as a Potts model (40). In particular, we considered functional annotation of proteins in H. pylori using COG classes as of the EggNOG database (24). All proteins without a functional annotation as well as proteins that were either classified as unknown or had a general function (such as membrane protein or ABC transporter) were randomly assigned a function out of the remaining 23 classes. In particular, we minimized the following global function is the adjacency matrix of the interaction network that accounts for unclassified proteins. In particular, J ij ϭ 1 if unclassified proteins i and j interact and vice versa. ␦(i,j) is the discrete ␦ function, where ␦ ϭ 1 if unclassified proteins i and j have the same function (i.e. i ϭ j ) and vice versa. As a consequence, the first term allows us to optimize the number of interactions between unclassified proteins if they are pre-dicted to have the same function. Depending on the assigned function to an otherwise unclassified protein, the second term aims to optimize support for the assigned function of protein i. In particular, we determine the number of classified proteins h i ( i ) that interact with unclassified protein i with the same function that was assigned to unclassified protein i. To minimize E, we applied a simulated annealing approach that features an effective temperature T. After initially assigning random functions to all unclassified proteins, we randomly selected a protein, changed its function to a different class, and determined the energy of the new configuration. If the difference of energies ⌬E Յ 0, the new configuration was accepted. If ⌬E Ͼ 0, the new configuration was accepted with probability p ϭ e Ϫ⌬E/T . To obtain stabilized functional configurations, we repeated such a Monte Carlo step 10,000 times (40). Subsequently, we increased the inverse of T by 0.01 in each step and repeated such Monte Carlo steps. Since minimum energy solutions are not unique, we repeated such runs of simulated annealing 100 times and considered the fraction of times an unclassified protein i was observed in a certain functional state as an estimate of the probability that protein i belongs to class .
Heterogeneity of Functional Prediction-The Simpson s-index considers the fractions with which a given protein was assigned to a functional class. In particular, we calculated its heterogeneity of functional fractions as a Simpson diversity (41) index defined as where p i is the fraction with which a given protein was Determination of Critical, Intermediate, and Redundant Proteins-We defined a set S ʕ V of nodes in a network G ϭ (V, E) as a minimum dominating set (MDSet) if every node v ʦ V is either an element of S or adjacent to an element of S. In a binary integer linear programming problem (ILP) we assigned a binary variable x v ϭ 1 when a protein v ʦ V that participates in interactions E in a protein interaction network G is an element of the MDSet, and x v ϭ 0 otherwise. The smallest set of MDSet nodes is obtained by min͚ v ʦ V x v , subject to the constraint x v ϩ ͚ w ʦ ⌫͑v͒ x w Ն 1 where ⌫(v) was the set of interaction partners of protein v. However, many optimal solutions exist that provide MDSets of the same size. Such characteristics suggest the existence of subset of nodes that always (critical nodes), never (redundant nodes), and sporadically appear in MDSets (intermediate nodes). To find such subsets, our objective is to determine if v ʦ MDSet always appear in the MDSet of any solution. For each v ʦ MDSet, we create an ILP as before and assume that x v ϭ 0 (i.e. not participating in the MDSet). After solving the ILP, we determine the size of the corresponding MDSet N v that we obtained with x v ϭ 0. If N v Ͼ N, v is a critical node and intermediate otherwise. For all nodes that did not participate in the original MDSet, vԫMDSet, we need to check if they always appear outside MDSets. For vԫMDSet, we create an ILP as before and assume that x v ϭ 1 (i.e. participating in the MDSet). After solving the ILP, we determine the size of the corresponding MDSet N v that we obtained with x v ϭ 1. If N v Ͼ N, v is a redundant node and intermediate otherwise (43,44). To solve these ILP problems, we utilized a branch-and-bound algorithm (45) as implemented by the lpSolve library.
Betweeness Centrality-As a global measure of its centrality, we calculated a node's betweenness, indicating a node's appearance in shortest paths through the whole network. In particular, we defined where s st was the number of shortest paths between proteins s and t, while s st (v) was the number of shortest paths running through v.

RESULTS
Proteome Versus Interactome-We combine interaction datasets that have been determined by yeast two-hybrid approaches and obtain an interactome of H. pylori that connects 1,060 proteins (ϳ70% of the proteome) through roughly 3,000 interactions (10,11). Furthermore, a "core" interactome, capturing high-confidence interactions, connects 759 proteins (49% of the proteome) through 1,466 interactions. In Fig. 1A, we label each protein with its functional class, essentiality, and abundance in this high-quality core network of protein interactions. As for estimating absolute quantities, we utilize data from our previous study (20) where we measured the abundance of proteins in H. pylori without sample fractionation with a LC-MS approach. Only accounting for proteins with at least three unique peptides, we obtain abundance values of 1,130 proteins that correspond to 831 interacting proteins in the combined interactions network of H. pylori. In Fig. 1B, we determine the functions of the most abundant proteins by binning proteins according to their abundance in three groups. Utilizing functional annotations from the EggNOG database (46), we find that most abundant proteins are involved in metabolite biosynthesis, transport and catabolism, protein turnover, translation and energy production. Fig. 2A indicates that essential proteins are significantly overrepresented among highly abundant proteins (Student's t test, p Ͻ 10 Ϫ2 ). To account for interactions between proteins, we determine the proteins' propensity to interact with proteins of certain abundance levels. In particular, we calculate each protein's abundance specific Z-score and group proteins in bins of certain Z-scores. In Fig. 2B, we determine the enrichment of interactions between proteins that appear in a given Z-score bin, utilizing the combined network of all protein interactions. Generally, we observe that proteins predominately interact with proteins of similar abundance. Still, our results further indicate that proteins in low-abundance bins The Functional Cross-talk of the Interactome-Resembling other interactome hairballs (Fig. 1A) the H. pylori interactome does not show clear functional clustering. However, the interaction data are well supported by interactions between proteins of the same functional group (Fig. 3). This observation can be used to validate the reliability (or at least plausibility) of an interaction dataset since random interactions would provide no significant enrichment. Interestingly, we additionally observe some unexpected functional cross-talk. For example, ribosomal proteins and proteins involved in translation interact with proteins involved in motility more often than expected by chance (groups I and K in Fig. 3). Similarly, motility proteins also interact with proteins involved in amino acid metabolism.
Predicting Functions of Proteins Using a Bacterial Metainteractome-To investigate the functional predictive power of our initial network of experimentally determined interactions in H. pylori, we randomly pick 80% of all functionally annotated proteins 1,000 times to predict the functions of the remaining 20% in each random run. Using a stochastic model (40), we represent every protein by a profile that reflects the probability of having a certain function. Applying different probability thresholds for the presence of a functional annotation, we determine receiver operating characteristic curves and consider the corresponding area under the curve as a measure of prediction quality (47) (Fig. 4A). To increase the predictive power of the underlying protein interaction network, we augment our network in H. pylori with protein interactions from other bacteria (36,48). Specifically, we consider interactions that have at least one interacting protein with a functionally annotated ortholog in H. pylori, while its interacting counterpart is at least functionally annotated in the corresponding organism. Focusing on the same previously sampled sets of proteins, we predict the functions of the corresponding 20% by utilizing the augmented network. Notably, we observe a significant shift toward increased values of the area under the receiver operating characteristic curve (p Ͻ 10 Ϫ50 , Student's t test), suggesting that the augmentation of the original network with interactions from other bacteria significantly improves the quality of functional predictions (Fig. 4A). Since each protein is represented by a profile of function-specific probabilities, we calculate the Simpson s-index (41) as a measure of heterogeneity of predicted functions. Such a measure tends to be 1 if one function dominates the distribution of fractions (i.e. has a high probability). In turn, the s-index approaches 0 if probabilities are equally distributed. Since our sampling approach randomly picks a subset of proteins and predicts functions based on the remaining proteins in both the original interaction network of H. pylori and the augmented network, we directly compare the impact of the augmented network on the homogeneity of functional prediction. In Fig. 4B, we calculate the mean s-indices of each protein, suggesting that functional predictions of the majority of proteins benefit from the addition of the bacterial metainteractome. Based on our observations that interactions from other bacteria have a considerable benefit on our ability to predict functions, we apply our approach to the functional prediction of 337 poorly characterized or previously unknown H. pylori proteins. While we determine the probability that a given protein has a particular function, we assess the significance of our predictions by randomly sampling known functions 100 times. Applying a Z-test, we determine a corrected p value for each score (49) that we consider significant if FDRϽ0.05. The heatmap in Fig. 4C shows the range of functions predicted for these proteins, including a sizeable fraction to be involved in transcriptional and translational activities. In Supplemental Table S1, we present the functional profiles of all proteins in the order in which they appear in Fig. 4C.
Control of the H. pylori Protein Interaction Network-Considering the network of protein-protein interactions in H. pylori, we aim at the elucidation of proteins that are important for the topological controllability of the underlying network (43, FIG. 3. Functional crosstalk in the interactome of H. pylori. Determining the prevalence of interactions between functional groups, we observe that the majority of interactions appear between proteins of the same functional class. 44,50). In particular, networks are dominated by minimum dominating sets (MDSet) that can be determined by an ILP. Such a method allows us to find the smallest set of nodes where each non-MDSet node is adjacent to a node in the MDSet. However, many different configurations of MDSets exist that have the same number of critical proteins. As such, an assumption implies sets of nodes that always, partially, or never participate in MDSets. Therefore, we define proteins as critical if they always participated in the MDSet of a given configuration (Fig. 5A). Furthermore, we consider redundant nodes that never appeared in MDSets while intermediate nodes sporadically occur in MDSets. Applying an algorithm that allows us to determine such sets of nodes (43,44), we observe that the percentage of critical nodes is roughly Ͻ10%, while intermediate nodes constitute Ͻ30% of all proteins (Fig. 5B). The mean degree of critical proteins far exceeds the corresponding values of intermediate and redundant proteins that are close to the mean degree of all proteins in the underlying interaction networks (Fig. 5B).
As for other topological characteristics, we calculate the betweeness centrality of all nodes in the underlying network. Defining the top 20% of proteins with highest betweeness centrality as a set of bottleneck nodes, we calculate the enrichment of such proteins in sets of critical, intermediate, and redundant proteins. Given all proteins in the underlying interaction network, we sample sets of proteins by randomly shuffling their labels, generating nonoverlapping, random sets of critical, intermediate, and redundant proteins. We observe

FIG. 4. Functional prediction of unknown proteins in H. pylori using a bacterial meta-interactome. (A)
To assess the quality of our classification procedure, we randomly sample 20% of all functionally annotated proteins in H. pylori and utilize the remainder to predict their functions. To measure prediction quality we calculate the area under the receiver operating characteristic curve, suggesting that the addition of the bacterial meta-interactome allows for better functional prediction (p Ͻ 10 Ϫ50 , Student's t test). (B) We consider all randomized samples and calculate the mean s-indices of each gene of unknown function (circles) in both the original network of H. pylori and the augmented network. In the scatter-plot the homogeneity of the functional prediction of the majority of genes (78.6%) benefit from including the bacterial meta-interactome. (C) Combining the network of protein interactions of H. pylori and the bacterial meta-interactome, we predict the functions of 337 proteins with unknown or poorly characterized functions (FDR Ͻ 0.05).
that critical proteins in all organisms are strongly enriched with bottlenecks (p Ͻ 10 Ϫ4 ). Albeit insignificantly, intermediate proteins are enriched with bottleneck nodes as well, while critical proteins hardly are bottlenecks (p Ͻ 10 Ϫ4 , Fig. 5C).
Additionally, we compare protein levels in the spiral and coccoid cells of H. pylori based on previously published proteomic data (21) to link the generated interaction network with cell physiology. These cellular forms were analyzed by LC-MS, allowing the comparison of relative changes between the two states with high accuracy. Determining the fold change and the corresponding p value using a Student's t test of proteins comparing the spiral and coccoid expression levels, we generate a Volcano plot where we label each protein as critical, intermediate or redundant (Fig. 5D). Qualitatively, we observe that critical proteins seem to have a higher abundance in the spiral form of H. pylori. As a corollary, we con-sider a set of regulated genes defined as proteins with Ϫ0.5 Յ fold change Ն 0.5 and p Ͻ 0.05 (21). Randomly sampling sets of regulated genes 10,000 times (Fig. 5E), we observe that critical proteins are significantly enriched with regulated genes (p Ͻ 0.05) while redundant proteins are found diluted (p Ͻ 0.05). Assuming that critical proteins play a role in the transition between the spiral and coccoid forms, we perform an analysis of their functions. Fig. 5F indicates that functions of critical proteins mostly revolve around transcriptional and posttranslational modification functions. In Supplemental Table S2, we annotate each protein with its role, fold change comparing coccoid to spiral form, and functional annotation.
Genomic Organization and the Interactome-Bacterial genomes are typically organized through functional gene clusters such as operons that encode functional units such as We define proteins with a fold change of Ͼ0.5 and ϽϪ0.5 (p Ͻ 0.05) as regulated proteins (shaded areas), suggesting that critical, regulated proteins predominantly appear as being present in the spiral form. (E) As a corollary, we randomly sample sets of regulated proteins 10,000 times. We observe that critical proteins are significantly enriched with regulated genes. (F) We determine the enrichment of functions in the set of critical proteins by randomly sampling their functions. We observe that critical proteins predominately appear in transcriptional and posttranslational modification functions. protein complexes. Interdependence of genomic loci and protein interaction maps has been demonstrated for the T. pallidum interactome (33) as well as for phage interactomes (51,52). To reveal such genomic links we map protein interactions onto operons in the H. pylori genome (53) (Fig. 6). For instance, the well characterized urease gene cluster (operons 4721-4722) reveals interactions among the enzyme's core components (UreA-UreB), between the urease accessory factors UreE-UreG and UreF-UreH, and finally between UreA and UreH (see Fig. 6 in (11)). Another example is observed in operon 4,987, a gene cluster that encodes enzymes involved in tryptophan biosynthesis. A comparison with experimental protein structures from orthologs shows that our proteinprotein interaction (PPI) studies (11) capture most PPI interactions that are important to assemble the enzyme complexes (Fig. 6, right panel): TrpA-TrpB and TrpE-TrpG are organized as heterotetramers with two subunits of each protein. TrpD is a homodimer, and TrpCF is a single protein that functions as a monomer. Likewise, operon 4888 encodes for enzymes involved in molybdopterin biosynthesis while the protein interaction map accurately reflects the organization of the enzyme complex.
An example for enriched crosstalk between different gene clusters is found between two operons (4960 and 5013), encoding ribosomal proteins or products that are related to protein translation and tRNA modification, respectively. The hypothetical protein HP1414, the first gene in the miaA operon, binds the ribosomal proteins L19 and S16 as well as the hypothetical protein HP1150 (which belongs to COG1837, a family of putative RNA-binding proteins). In fact, HP1414 is the H. pylori homologue of the ribosomal silencing factor RsfS (ϭRsfA) that we previously showed to bind to ribosomal protein L14, preventing association of the small and large ribosomal subunit (54). L19 is located in the direct neighborhood of L14 in the ribosome-forming bridges (B8 and B6) to the small ribosomal subunit (55), potentially representing a novel or additional hotspot for RsfS action. Both operons are functionally associated since both encode for products that are involved in protein translation.
One more example of interconnected operons is found between operon 4815 and 5035 that encode several uncharacterized (HP0469-HP0465) and flagellar rod proteins (FliE, FlgC, FlgB), respectively, suggesting that the 4815 operon may be involved in motility. Involvement of HP0466 in flagellar biosynthesis has already been suggested by others based on its interaction with FlgB and homology comparisons of the operon member HP0465 with motility accessory factors of C. jejuni (56). Moreover, transposon insertion into the HP0466 locus causes a colonization defect (24) whereas for HP0468 no functional information is available.
Protein Complexes in H. pylori-Protein functions are often mediated by protein complexes that are defined as stable assemblies of multiple proteins. Since complexes have not been studied systematically in H. pylori, we utilize extensive experimental data on protein complexes from E. coli (57) and M. pneumoniae (14) to predict homologous complexes in H. pylori. Utilizing orthologous protein information from the COG database (31,32), we find that H. pylori shares 786 orthologous proteins with E. coli but only 260 with M. pneumoniae (Fig. 7A). As an example for different levels of complex conservation, we observe that the murein tripeptide transporter, a hetero-pentamer, is well conserved in H. pylori, while only three out of five subunits in the periplasmic nitrate reductase are present. In contrast to E. coli, the cascade complex is completely missing from H. pylori (Fig. 7B). In Fig. 7C, we count the number of complexes with different degrees of conservation in H. pylori using reference sets of E. coli complexes from the EcoCyc database (58) and the dataset of (57). Furthermore, we use protein complex information from M. pneumoniae (14). The degree of conservation (Fig.  7A) prompts us to focus on E. coli, indicating that E. coli may be a good model for some processes in epsilon-proteobacteria but not for others. Using a reference set of 285 well-studied E. coli complexes from EcoCyc (58), we predict 80 H. pylori complexes to be identical (in composition) with their counterparts in E. coli. Another 85 complexes are partially conserved while 120 are completely absent. All predicted complexes in H. pylori are available in Supplemental Table S3.
Functional Integration of Gene Expression-To identify functionally relevant network clusters, we systematically analyze the H. pylori high-quality core protein interaction net- work to identify subnetworks that overrepresent certain functional groups, using functional terms from the Comprehensive Microbial Resource (27), GO (28), KEGG (29), gene essentiality, genetic context, cellular localization, and gene expression data. Some of these clusters are illustrated in Fig. 8 while detailed results can be found in Supplemental Table S4. For instance, cluster 16 consists of nine proteins that are highly interconnected by interactions. Involved genes are co-expressed under different conditions: Six genes of the cluster are up-regulated when the growth conditions are shifted to low pH values while three members are up-regulated under limited iron accessibility in the stationary phase. Finally, three proteins have an increased expression level when H. pylori is grown in contact with liver cells versus medium alone. While cluster members belong to very different pathways (e.g. Cag17 and Cag20 belong to the type IV secretion system, IspA is a geranyltranstransferase, CeuE a periplasmic ironbinding protein, and Ggt is a gamma-glutamyltranspeptidase), they are connected by interactions and gene expression. While such discordant expression patterns are found in other clusters as well (e.g. clusters 17, 26, and 36), our results suggest the presence of conditions under which these genes are co-expressed, allowing proteins to interact. Clusters 26 and 36 are enriched for proteins related to chemotaxis and motility. Notably, our screens detect all three CheV paralogs in the H. pylori genome (HP0019, HP0393, and HP0616) to bind to the hemolysin secretion protein precursor HylB (cluster 36). Moreover, we find that HP0019 and HP0393 interact with the methyl-accepting chemotaxis transducer (TlpC) but not HP0616. Cluster 6 shows that this combination strategy unearthed additional interesting aspects that cannot be detected when one parameter is analyzed in isolation. While seven members that belong to three different operons are connected, GO assignments of HP0164 (signal-transducing protein, histidine kinase) and OmpR (response regulator HP0166) suggest an involvement of the cluster members in two-component signaling. DISCUSSION Given that H. pylori is a major human pathogen causing millions of ulcers and other health problems each year surprisingly little is known about its molecular biology. To fill this FIG. 8. Selected interaction clusters with various enriched functional terms, genomic context, co-expression, co-localization, and phenotypes. We depict interaction clusters that we derive from the combined H. pylori core protein interaction map with significant enrichments (p Ͻ 0.05). All results can be found in Supplemental Table S5. Node labels represent protein names (if available) or the locus number of the corresponding protein. Co-expressed genes based on operons are given in the legends by the transcription unit number as used in (53) and gene expression data (Supplemental Table S5). For each cluster, we provide a separate legend highlighting the enriched properties. In particular, clusters 16 and 17 are enriched for differentially, co-expressed proteins when growth is shifted to low pH values conditions. Clusters 13 and 65 are enriched for essential genes, while cluster 23 is enriched with ribosomal proteins. Clusters 26 and 36 include flagellar/chemotaxis proteins, and cluster 6 is enriched for intra-and interconnected operons. gap, we investigate the proteome and interactome in a more systematic way.
Investigating the abundance of proteins in H. pylori, we find that highly abundant proteins revolve around translational, posttranslational modification, protein turnover, metabolite biosynthesis, transport, catabolism, and energy production functions. Furthermore, we find that abundant proteins are typically encoded by essential genes.
Combining protein interactions with protein abundances, we observe that proteins of similar abundance preferably interact with each other. While we find some interactions between proteins of different abundances, our results clearly confirm assumptions that interacting pairs of proteins are usually present in roughly stoichiometric ratios.
Notably, H. pylori still encodes a large number (ϳ500) of uncharacterized proteins. Among proteins of known function, we find that interactions usually connect proteins of similar activity. Based on such characteristics, we utilize a bacterial meta-interactome of closely related bacteria to predict the functions of unknown proteins. In particular, we account for interactions that are conserved in other closely related bacteria. Such an augmentation of our initial network of protein interactions allows us to increase the accuracy of our classification method significantly and to predict the function of more than 300 proteins with previously poorly annotated or unknown function. Resembling the spectrum of functions of abundant proteins, we find that the majority of proteins thus obtained mostly revolve around translational and posttranslational modification functions.
Utilizing our network of protein interactions in H. pylori, we determine sets of proteins that topologically control the underlying network. In particular, we find sets of critical, intermediate, and redundant proteins that always, partially, or never appear in different control configurations of the underlying network. In particular, each control configuration features a minimum dominating set (MDSet) so that every node is either an element of the MDSet or adjacent to a protein of the MDSet. Notably, critical proteins appear to be enriched with regulated genes that are significantly present in the spiral form of H. pylori. The spiral form of H. pylori is mostly dividing while the coccoid is a nonculturable but viable form. The observation that genes that are overexpressed in the spiral compared with the coccoid form are enriched with critical proteins suggests that the underlying topology network plays a role in the switch of the two bacterial forms. Notably, critical proteins are also enriched with proteins of high betweeness, representing central topological proteins with a propensity to connect different, disparate parts of the network. Therefore, we surmise that critical proteins may assume the role of levers that allow the bacteria to activate certain functions to change between forms as well as integrate different parts of the network to carry out the transformation from coccoid to spiral form. As a corollary, we hypothesize that such proteins carry functions that contribute to the spiral form. Indeed, we find that critical proteins are mostly enriched with transcriptional and posttranslational modification functions.
As for protein complexes, we integrate protein complex information of E. coli and M. pneumoniae and infer potential complexes in H. pylori by determining evolutionarily conserved complex components. As expected, we find a higher rate of conserved complexes when we consider E. coli protein complexes. Such a result may be rooted in the fact that E. coli has almost six times as many proteins than M. pneumoniae. Furthermore, its protein complexes are better investigated than their counterparts in Mycoplasma, suggesting that E. coli is a better model for protein complexes as H. pylori shares significantly more orthologs with E. coli than M. pneumoniae. Moreover, we integrate the interactome with functional and expression profiles of genes in H. pylori, allowing us to find significant protein clusters. Our analysis reveals an abundance of different network clusters that combine certain functions that integrate the placement of operons of cluster members as well. Such an observation clearly suggests that expression, function, and operon regulation are driving forces of the observed network clusters.