The Human-Bacterial Pathogen Protein Interaction Networks of Bacillus anthracis, Francisella tularensis, and Yersinia pestis

Background Bacillus anthracis, Francisella tularensis, and Yersinia pestis are bacterial pathogens that can cause anthrax, lethal acute pneumonic disease, and bubonic plague, respectively, and are listed as NIAID Category A priority pathogens for possible use as biological weapons. However, the interactions between human proteins and proteins in these bacteria remain poorly characterized leading to an incomplete understanding of their pathogenesis and mechanisms of immune evasion. Methodology In this study, we used a high-throughput yeast two-hybrid assay to identify physical interactions between human proteins and proteins from each of these three pathogens. From more than 250,000 screens performed, we identified 3,073 human-B. anthracis, 1,383 human-F. tularensis, and 4,059 human-Y. pestis protein-protein interactions including interactions involving 304 B. anthracis, 52 F. tularensis, and 330 Y. pestis proteins that are uncharacterized. Computational analysis revealed that pathogen proteins preferentially interact with human proteins that are hubs and bottlenecks in the human PPI network. In addition, we computed modules of human-pathogen PPIs that are conserved amongst the three networks. Functionally, such conserved modules reveal commonalities between how the different pathogens interact with crucial host pathways involved in inflammation and immunity. Significance These data constitute the first extensive protein interaction networks constructed for bacterial pathogens and their human hosts. This study provides novel insights into host-pathogen interactions.


Introduction
Bacillus anthracis, Francisella tularensis and Yersinia pestis are known to cause pathogenesis, in part, by evading or suppressing immune responses. For instance, it is well recognized that anthrax lethal toxin (LT) is a key player in the B. anthracis pathogenic mechanism that induces macrophage apoptosis [1] and cleavage of MAPK at specific recognition sites [2]. Y. pestis suppresses local inflammation and impairs macrophage phagocytic activity through a complex type III secretion system (T3SS) and its associated protein LcrV [3]. F. tularensis either fails to induce an immune response or causes immune suppression by inducing transforming growth factor (TGF-b) [4]. Both Y. pestis and F. tularensis are Gram-negative bacteria that synthesize lipopolysaccharide (LPS) with poor Tolllike receptor 4 (TLR4)-stimulating activity, although F. tularensis can signal via TLR2 [5]. Thus, all three pathogens share similar mechanisms of pathogenesis that involve modulation of immune responses. Traditional microbiology and immunology approaches have characterized only a few pathogenic proteins for each microbe, resulting in a limited understanding of pathogenicity and evasion mechanisms.
In contrast to investigating either the host or the pathogen, focusing on interactions between host and pathogen proteins may uncover hidden associations that have not been detected by traditional methods. To uncover host-pathogen protein interactions on a genome-wide scale for these three immune-evading systems and to define a target set of proteins for understanding mechanisms of pathogenicity, we designed a high-throughput yeast two-hybrid assay aimed at characterizing protein-protein interactions (PPIs) between human and bacterial proteins. We generated DNA-binding domain libraries for each pathogen and activation domain libraries containing human proteins in a haploid Mata strain of Saccharomyces cerevisiae. We tested for activation of the two-hybrid reporter genes using a similar protocol that was previously used for identifying interactions between proteins in Plasmodium falciparum [6]. We then sequenced positive colonies to identify interacting partners (see Figure 1A). In total, we performed more than 250,000 screens across the three pathogens. We obtained 3,073 PPIs between 1,748 human proteins and 943 B. anthracis proteins, 1,383 PPIs between 999 human proteins and 349 F. tularensis proteins, and 4,059 PPIs between 2,108 human proteins and 1,218 Y. pestis proteins. We used an independent computational analysis to study the network properties (degree and centrality) of the human proteins that interact with pathogen proteins in our dataset. Additionally, we used a graph-alignment algorithm to identify conserved subsets of human-pathogen PPIs found across multiple networks.
These data constitute the first extensive protein interaction networks constructed for bacterial pathogens and their human hosts. Typically, data detailing host-pathogen interactions is ascertained from small-scaled experiments that are designed to target specific proteins, complexes, or pathways of interest. This is evident from the number of interactions between host and bacterial pathogens currently available in seven public resources [7,8,9,10,11,12,13]. For example, these databases only contain one human-B. anthracis interaction, no human-F. tularensis interactions, and seven human-Y. pestis interactions.

Results and Discussion
In total we identified 3,911, 1,942, and 5,157 PPIs for the human-B. anthracis, human-F. tularensis, and human-Y. pestis networks respectively. We filtered this set of PPIs by removing human proteins that interact with large number of pathogen proteins identified by multiple screens with other pathogens (unpublished data), reasoning that such interactions are likely to be false positives. This step yielded a final set of 3,073, 1,383, and 4,059 PPIs for the human-B. anthracis, human-F. tularensis, and human-Y. pestis networks respectively (see Table 1). We found that 888 human-B. anthracis, 167 human-F. tularensis, and 2,205 human-Y. pestis PPIs contain pathogen proteins that are labeled as ''putative'', ''hypothetical'', or ''uncharacterized''. See Figure 1B for a comparison of the sets of human proteins found to interact with each of these pathogens.

Bacterial pathogens have evolved to interact with human hubs and bottlenecks
Several recent studies [14,15] have suggested that viral proteins have evolved to preferentially interact with protein hubs (proteins with many interacting partners) and bottlenecks (proteins that lie in shortest paths between many pairs of proteins) in the human PPI network. We hypothesized that bacterial proteins interact with human proteins with high degree and centrality, since pathogens may have evolved to control and disrupt essential complexes and pathways governing the host response. Our analysis supports this hypothesis. More specifically, Figure 2(a) displays a log-log plot of the degree distributions of six sets of proteins in a human PPI network collated from multiple databases [7,8,9,10,11,12,13]. These plots show that across almost the entire range of degrees, human proteins interacting with bacterial pathogens tend to have higher degree than proteins that do not interact with any bacterium. The betweenness centrality results display the same trend (see Figure 2(b)). We used Gene Set Enrichment Analysis (GSEA) [16] to test whether the gaps we observe in Figure 2 between the curve for all non-pathogen interactors and the other curves are statistically significant. GSEA yields p-values less than 10 26 for both degree and centrality for all sets, supporting the conclusions we draw from Figure 2. To address the possibility that the observed patterns may be artifacts of experimental biases or errors in the human PPI network, we followed an earlier approach for viral-human PPIs [15]: we repeated the GSEA analyses using two subsets of the human PPI network: (i) interactions detected by small-scale experiments and (ii) interactions observed by largescale studies. We obtained statistically-significant results in both cases (see Table 2).

Bacterial pathogens target host defense pathways
Since conserved interaction networks between bacterial proteins and the host may be indicative of putative novel targets for broadbased immunotherapeutic development, we asked if human proteins interacting with multiple pathogens may be involved in functions related to host response. Since manipulation of immune responses in the host has been linked to infection by all three pathogens [17,18,19], we identified 60 human immune modulation proteins using annotations from the Gene Ontology [20] and their bacterial interactors (see Figure 3). While many of the proteins in the human-respiratory pathogen interaction map play a role in apoptosis, they are also important effectors of immune response signaling. Thus, the double role in apoptosis and immune response regulation should be considered when interpreting these results. This network includes interactions among sets of bacterial and human proteins involved in innate immunity (i.e., TLR4 and TLR7), inflammation (IL-8RB, NF-kB and Bcl-6), recruitment of inflammatory cells, regulatory function, maturation and activation of T cells (i.e., CXCR4, STAT3, NOTCH2, and LCK). For example, LCK is a tyrosine kinase expressed in T cells associated with the cytoplasmic tail of CD4 and CD8 co-receptors. Functionally, LCK is a crucial regulator of T cell activation [21]. Of note, LCK interacts with proteins from all three pathogens, suggesting that these bacteria may have developed conserved mechanisms of impairing effector T cell responses by targeting and possibly disrupting LCK signaling, which is required for inducing acquired immune responses and immune-mediated protection against infectious diseases.
CXC-chemokine receptor 4 (CXCR4) is the major coreceptor for human immunodeficiency virus in CD4 + T cells and a promising new target for developing anti-HIV drugs [22]. We find that CXCR4 interacts with the yscP protein, a known virulence factor from Y. pestis and a secreted component of the Yop secretion system [23]. The natural ligand for CXCR4 is CXCL12 or SDF1 (stromal cell-derived factor-1) -a chemokine involved in the recruitment of down-modulatory FOXP3 + regulatory T cells (Treg) into inflamed tissues [24]. In addition, STAT3 is required for expression of FOXP3 in Treg [25]. Our data demonstrate that STAT3 interacts with the Y1119 protein of Y. pestis. In turn, we show that TGF-b1, a down-modulatory cytokine produced by Treg, interacts both with Y. pestis and F. tularensis proteins. The Schu4 strain of F. tularensis has been shown to suppress inflammation in infected mice, and this inhibition has been attributed to induction of TGF-b, another member of the apoptosis PPI network [4]. Similar patterns have been observed in B. anthracis and Y. pestis [26,27]. The existence of a putative mechanism of down-regulating immune responses by targeting regulatory pathways merits closer attention.

Comparative analysis of human-pathogen networks
Encouraged by the evidence in our data suggesting that all three pathogens target proteins involved in host response to infection, we sought to perform a more systematic comparative analysis of the three host-pathogen PPI networks. In preparation for computing conserved modules of host-pathogen PPIs, we used Inparanoid [28] to identify orthologous proteins and OrthoMCL [29] to identify paralogous proteins. From the Inparanoid algorithm we identify a total of 686, 1,179, and 834 pairs of orthologous clusters for the B. anthracis-F. tularensis, B. anthracis-Y. pestis, and F. tularensis-Y. pestis comparisons respectively. We find that 181,505, and 184 of these clusters from the respective comparisons contain more than one protein from either organism. Of these, 93,210, and 129 clusters contain at least one pathogen protein from both organisms that was observed to interact with a human protein in our dataset (see Table 3). We find 1,900 clusters of human proteins from the OrthoMCL analysis.
First, we performed simple intersections of the detected hostpathogen PPIs. We looked for interologs [30] i.e., a pair of humanbacterial PPIs where the bacterial proteins are orthologous and the human proteins are related. More specifically, we searched for three types of configurations: i. a triple of proteins a, b, and c, where a is a human protein that interacts with a protein b in one of the three bacteria and with a protein c in another bacterium in our data and b and c are orthologs of each other, ii. a quadruple of proteins a, b, c, and d, where a, b, and c are as defined before, d is a human protein, a and d interact with each other physically in the human PPI network, a interacts with b in our data and d interacts with c in our data, and iii. a quadruple of proteins a, b, c, and d, where a, b, and c are as defined before, d is a human protein, a and d are paralogs, a interacts with b in our data and d interacts with c in our data. As can be seen from Table 4, our interaction data contains a very small number of interologs.
Since, simple intersections of host-pathogen PPIs did not yield substantial information on conserved PPI networks, we applied four published algorithms for identifying conserved protein interaction modules (CPIMs) amongst the three human-pathogen networks and homology relationships previously identified: Graemlin [31], Match-and-Split [32], NetworkBLAST [33], and GraphHopper [34]. These methods were originally designed to identify conserved modules between intra-species PPI networks. Graemlin requires the user to provide the topology of expected conserved modules as positive examples. Thus, Graemlin was not directly applicable to our scenario since there are no such examples available for these systems. Using the Match-and-Split algorithm we were not able to identify any CPIMs in any of the comparisons. In the case of NetworkBLAST, where there are a number of user parameters that can be adjusted, e.g., complex density and false negative rates, we tried different combinations of values. For the parameter complex density, we varied the input value from 0.50 to 0.95, adjusting values by 0.05 for each test. We performed the same procedure for testing the range of 0 to 0.80 for the FN ratios. Varying the parameters for the NetworkBLAST algorithm had no effect on the identified CPIMs in our case, yielding three CPIMs for the B. anthracis-F. tularensis comparison, ten CPIMs for the B. anthracis-Y. pestis comparison, and two CPIMs for the F. tularensis-Y. pestis comparison. Using the GraphHopper [34] algorithm we were able to identify many more significant CPIMs. In total we identified 39 CPIMs for the B. anthracis-F. tularensis comparison, 64 for the B. anthracis-Y. pestis comparison, and 41 for the F. tularensis-Y. pestis comparison. Table 5 displays the number of identified CPIMs for each of the algorithms. We discuss two sets of CPIMs computed by GraphHooper below related antigen presentation and immune modulation.  The major histocompatibility complex (MHC) proteins are responsible for presenting antigens to T cells. Antigen processing and presentation is crucial for activating T cells and mounting protective immune responses. Our analysis captures CPIMs containing human proteins enriched in both antigen processing and presentation functions (Figure 4(a) shows the network for the B. anthracis -Y. pestis system). We find an interaction between the human HLA-B protein and the B. anthracis pagA protein. HLA-B is an MHC class I protein responsible for presenting antigen fragments to CD8 + T-cells. The pathogen pagA protein, along with the lethal factor and oedema factor, is one of three proteins composing the anthrax toxin. Functionally, the pagA protein facilitates the translocation of enzymatic toxins across the cell membrane. Also interacting with HLA-B is the Y. pestis yscP protein, which is part of the Yersinia outer-membrane protein (YOP) secretion system. Members of the YOP family have been shown to interact with MHC I proteins in the closely related pathogen, Yersinia enterocolitica [35]. Other members of the MHC class I family in these CPIMs include HLA-A, HLA-C, and HLA-E. We also identify a number of interactions for human proteins belonging to MHC class II (e.g., HLA-DRA, HLA-DPB1, HLA-DQB1, and HLA-DMB), which are responsible for presenting antigens to CD4 + T cells. These MHC class II proteins interact with various pathogen proteins including pathogen membrane proteins and yet uncharacterized proteins.
The CPIMs in Figure 4(b) represent pathogen interactions with human proteins involved in immune response pathways for the B. anthracis -Y. pestis system. Each CPIM includes NF-kB, which is a transcription factor found at the crossroads of numerous immune and inflammatory pathways leading to the induction of innate and acquired immune responses. NF-kB is found downstream of the Toll family of receptors, which participate in signaling in response to infection. Pathogens have evolved to disrupt this critical process and thereby evade the host response. Inhibition of the NF-kB pathway impairs both the activation and differentiation of T cells and antigen-presenting cells. In the case of Y. pestis, the inhibition of the NF-kB pathway is necessary for rapid apoptosis in infected macrophages [36]. We find several members of the Y. pestis YOP family, including yscI, yscK, and yopD along with virulence factors such as the toxin tccC1 and the protein kinase ypkA interacting with NF-kB. Many of the other pathogen interactors of NF-kB are labeled as ''uncharacterized'' proteins. We also observe interactions between the Y. pestis proteins usg (an aspartatesemialdehyde dehydro-genase) and tar1 (a methyl-accepting chemotaxis transmembrane protein) and the human IKK-A protein. IKK-A phosphorylates inhibitors of NF-kB, leading to their degradation and resulting in NF-kB activation. We report an interaction between human NFkB-IA, a NF-kB inhibitor that binds to NF-kB and traps it in the cytoplasm, and the Y. pestis protein y3760, a putative multi-drug resistance protein. Upstream of NF-kB we demonstrate alr1-TLR4 and Y1119-TLR7 interactions. TLR4 and TLR7 are receptors for LPS and viral singlestranded RNA, respectively. It is well recognized that both Y. pestis and F. tularensis synthesize LPS with poor TLR4-stimulating activity. However, these further interactions may render NF-kB non-functional. Our findings suggest a strong interaction between bacterial proteins and proteins of the human immune system that are both crucial for effector activity and conserved.

Experimental Methods
We used a random yeast two-hybrid approach to identify physical interactions between human proteins and pathogen proteins. See Figure 1 for an overview of the experimental analytical processes used in this study. Vectors and strains. The two-hybrid vectors that we used for the random two-hybrid process are based on the Saccharomyces cerevisiae Gal4p DNA-binding domain (amino acids 1 to 147 for DBD constructs) and transcriptional activation domain (amino acids 768 to 881 for activation domain libraries). Both vectors have elements suitable for growth in both bacterial and yeast cells. Two Interactions of human proteins involved in the innate immune response. We divided the human protein into subsets based on whether they induce or prevent apoptosis, or whether they regulate apoptosis. Proteins in the group labeled ''Non-specific'' do not have an annotation more specific than ''Apoptosis'' in the Gene Ontology [20]. For clarity this image shows only interactions involving virulence factors and uncharacterized pathogen proteins. As a result, some human proteins in the figure may appear to have no interacting partners. doi:10.1371/journal.pone.0012089.g003 Summary of ortholog groups identified by Inparanoid [28]. The column marked ''# clusters (.2 proteins)'' is the number of orthologous clusters that contain more than a single protein from each organism. The column marked ''# clusters (pathogen interactors)'' is the number of orthologous clusters which contain a pathogen protein from each organism that is known to interact with a human protein in our dataset. doi:10.1371/journal.pone.0012089.t003  , the majority of inserts will be from non-coding regions or will be out of frame, and therefore of no utility in a two-hybrid assay. Using this ORF selection strategy, greater than 80% of the cloned inserts in these vectors contain open reading frames after nutritional selection. The DNA-binding domain vectors we used, pOBD.111 and pOBD.109, are slightly modified to facilitate the cloning of bacterial genomic DNA fragments that have had linkers added to their ends. The haploid yeast strain used to express the DNA-binding domain fusions, PNY200, has the following genotype: MATa trp1-901 leu2-3,112 ura3-52 his3-200 ade2 gal4 gal80. The haploid yeast strain used to express the activation domain fusions, PJ69-4A1, has the following genotype: MATa trp1-901 leu2-3,112 ura3-52 his3-200 ade2 gal4 gal80 GAL2-ADE2 LYS2::GAL1-HIS3 met2::GAL7-lacZ. The two yeast strains are derived from the same parent cell line and display high mating efficiencies. Both allow for the introduction and selection of vectors carrying the yeast selectable markers TRP1, LEU2, and URA3. The activation domain strain contains three different Gal4p-responsive reporter genes: GAL2-ADE2 and GAL1-HIS3, which are assayed by selection on yeast synthetic media lacking either adenine or histidine, respectively, and GAL7-lacZ, which can be monitored using colorimetric or luminescent assays for beta-galactosidase activity. The HIS3 reporter exhibits a low level of background His3p expression that can be counteracted by use of 3mM 3-amino-1,2,4-triazole, a competitive inhibitor of the His3p protein. These markers are unrelated except for the small GAL4 binding sites in their promoters. Since it is very unlikely that all three genes would be spuriously activated if their promoters are so distinct, the likelihood of false-positives is reduced. Generation of DNA-binding domain libraries. We cloned fragments of B. anthracis, F. tularensis, and Y. pestis genomic DNA into DNA-binding domain vectors pOBD.111 and pOBD.109 to create libraries for two-hybrid analysis. We obtained the genomic DNA from the laboratories of Dr. Kenneth Bradley (University of California, Los Angeles), Dr. Martha Furie (Stony Brook University), and Dr. James Bliska (Stony Brook University) respectively. Bacterial genomic DNA insert preparation involves the mechanical (sonication) and enzymatic (cviJI**) shearing to produce random fragments of an average size of 500 bp. We blunted single-stranded overhangs to recover fragments of desired size. We then ligated purified fragments to linkers and cotransform them into bacterial cells with an equimolar amount of linearized vector. We then transformed the entire ligation and plate onto selection plates for amplification. We pooled colonies and isolated plasmid DNA for transformation into yeast.
Preparation of DNA-binding domain fusions. In order to randomly screen each DBD library we plated an aliquot of the DNA-binding domain library on yeast synthetic media lacking tryptophan at a density that allows the selection of individual yeast colonies. After a three to four day incubation, we picked individual yeast clones into a 96 well plate containing yeast rich media (YPD). We then incubated the plate at 30u for one to two days to permit the growth of a sufficient quantity of DNA-binding domain yeast.
Random yeast two-hybrid screens. Our strategy is similar to one used by LaCount et al. [6] used to identify interactions between proteins in Plasmodium falciparum.
We generated DNA-binding domain libraries in a haploid MATa strain and the human spleen activation domain library in a MATa strain. We mated each haploid yeast culture containing a single DNA-binding domain fusion to generate diploid yeast cells that express both the DNA-binding and activation domain fusions. In contrast to LaCount et al., we used a liquid-format mating strategy in a 96-well plate (as opposed to mating on filters or agar), thus allowing for the generation of greater than 500,000 diploids (and, therefore, protein combinations). We selected two-hybrid positives on yeast minimal media lacking tryptophan and leucine (to select for mating events), and lacking histidine and adenine (to select for activation of the two-hybrid reporter genes).
The goal was to analyze the vast majority of B. anthracis, F. tularensis, and Y. pestis proteins as DNA-binding domain fusions. The DNA-binding domain libraries contain fragments sizes selected to be larger than 300 bp and with the average insert size of 500 bp (167 amino acids). We chose the 300 bp minimum because many recognizable protein domains are in this size range; in addition, this size of fragment works well in yeast two-hybrid assays.
We generated comprehensive protein interaction maps by performing a ten-fold coverage of the coding capacity of each of the pathogen genomes. We calculated the number of screens by dividing the total genomic sequence of the pathogen by the average fragment size in the DNA-binding domain library (500 bp) and multiplying by ten (fold coverage).
Analysis of positive screens. We incubated the yeast selection plates for ten days. We experienced three different outcomes: 1) Some plates exhibited no growth of yeast colonies and are discarded without further analysis; 2) Some plates exhibited growth of a very large number of colonies (from hundreds of yeast colonies to a lawn of yeast); 3) The remaining plates contained a modest number of colonies, from one to a few hundred. In the first scenario where there are no colonies returned, we assumed that there are no detectable interactors for those protein fragments. In the second case where a very large number of colonies are found, it is likely that the DNA-binding domain fusions possess inherent selfactivation ability and were not worthy of further investigation, as they did not represent protein interaction pairs. After analyzing many thousands of searches, it is our experience that DNA-binding domain fusions yielding in excess of 100 colonies per search are likely self-activators. Typically, our ORF-selected DNA-binding domain libraries contain two to five percent self-activating baits, in agreement with the frequencies observed for random fragments of Escherichia coli and bacteriophage T7.5 [37,38]. We selected colonies for further analysis and transfered them to fresh media. We amplified both the DNA-binding and activation domain inserts by PCR and sequenced the resulting products using dye-primer chemistry on capillary instruments. We used the resulting sequence information to identify the interacting protein fragments.
Filtering positive interactions. We retained interactions for positive colonies in which the insert is in the correct orientation, contains one but no more than two annotated genes, and does not contain multiple genomic fragments that had been ligated together.

Computational methods
Notation. We represented each experimentally derived human-bacterium protein-protein interaction (PPI) network as a bipartite graph B = (H, P, I), where H is the set of human proteins, P is the set of proteins in the bacterium, and I is a set of edges (interactions), each of which connects one protein in H to a protein in P. Further, we represented the set of known intra-species (human) protein-protein interactions as an undirected graph G = (V, E), where V is the set of nodes (human proteins) and E is the set of edges (interactions). We now describe in detail the tests we used to analyze each of the human-pathogen networks.
Analysis of degree in the human PPI network. The degree of a protein in a graph is the number of interactions in which it participates. We plotted the degree distributions for six sets of human proteins: (i) the set of all human proteins not interacting with a pathogen protein in our dataset, (ii)-(iv) three sets of human proteins contained within each of the human-bacteria networks, (v) the set of human proteins found to interact with at least two pathogens, and (vi) the set of human proteins found to interact with all human pathogens (B. anthracis, F. tularensis, and Y. pestis). A bias towards high-degree proteins in the last five distributions would suggest that B. anthracis, F. tularensis, Y. pestis have evolved to interact with higher degree proteins in the human PPI network.
Analysis of betweenness centrality in the human PPI network. The degree of a protein captures only its local connectivity. Betweenness centrality (BC) measures capture both global and local features of a protein's importance in a network [39]. A protein with high betweenness centrality is characteristic of a bottleneck in an interaction network (i.e., there are many paths which pass through this protein) [40]. The betweenness centrality for a protein v M V is defined as the fraction of shortest paths in G between all protein pairs (u, w) that pass through the protein v. Given u, v, w M V, let S uw denote the number of shortest paths between proteins u and w. There may be multiple equally long paths between u and w that are shorter than any other path between u and w. Let S uw (v) denote the number of these that pass through v. Then the betweenness centrality of v is To compute the betweenness centrality for each protein in G, we used the algorithm devised by Brandes [41]. This algorithm runs in time proportional to the product of the number of nodes in G and the number of edges in G. We plotted distributions for the same six sets as in the degree analysis. Again, if the distribution for the last five sets is biased toward higher values of centrality than the distribution for the first set, we could hypothesize that B. anthracis, F. tularensis, and Y. pestis have evolved to interact with proteins with high betweenness centrality in the human PPI network.
Gene set enrichment analysis. We used Gene Set Enrichment Analysis (GSEA) to determine if the human proteins interacting with B. anthracis, F. tularensis, and/or Y. pestis have significantly higher degree or betweenness centrality than randomly picked proteins in G [16]. Let L be the ranked list of the proteins in V, where we rank the proteins either by degree or by betweenness centrality. Given L and a predefined set S of proteins of interest (e.g., those interacting with B. anthracis), we used GSEA to determine whether the proteins contained in S are randomly distributed throughout L or concentrated at the top. In the ranked list L, let l i be the value (of degree or centrality) at index i; 1#i#|L|. We abuse notation and say that an index i is an element of S if the protein whose rank is i belongs to S. First, we indicates that the proteins in S have high degree or high betweenness centrality. Note that our modification of the original definition of the enrichment score [33] ensures that if S mainly contains proteins with low degree or betweenness centrality, then the score will be close to 0, since P hit (S, i)2P miss (S, i) will be negative for most indices. To compute pvalues for an observed enrichment scores, we generated a null distribution of scores by repeatedly selecting |S| random nodes in L and computing the enrichment score for each random subset of nodes. We repeated this process 1,000,000 times and estimated the p-value for s as the fraction of random sets whose enrichment score is at least as large as s.
Identifying paralogous and orthologous protein pairs. In preparation for computing conserved protein interaction modules, we computed orthologous pairs of proteins in every pair of pathogens. We used Inparanoid [28] with default parameters to define orthologous pairs of proteins. The Inparanoid algorithm outputs pairs of clusters. Each cluster in a pair contains proteins from the same organism. The protein at the center of a cluster has a weight of one and the other proteins in the cluster have a weight between zero and one, depending on their similarity to the protein at the center. In a given pair of clusters, for every pair of proteins (one from each cluster), we use the products of the weights of the two proteins as an estimate of the degree of orthology of the protein pair. In addition, we used OrthoMCL [29] with default parameters and a BLAST e-value cutoff of 10 210 to identify paralogous pairs of human proteins. We assigned a weight of one to all paralogous pairs. For the sake of convenience, we considered a human protein appearing in one humanpathogen PPI network to be paralogous to a copy of the same protein appearing in another human-pathogen network.
Conserved human-pathogen PPI modules. Given a pair of human-pathogen PPI networks B 1 and B 2 , let Z be the bipartite graph whose edges are the orthologous and paralogous pairs of proteins between B 1 and B 2 , as computed above. We used a weight of one for all edges (the PPIs) in B 1 and B 2 . For edges in Z, we used the weights defined in the previous sections. Let w e denote the weight of edges e in Z. Following the GrapHopper algorithm [34], we defined a Conserved Protein Interaction Module (CPIM) to be a triple (T 1 , T 2 , O) where T 1 and T 2 are connected subgraphs of B 1 and B 2 , respectively, and O#Z such that (a, b) M O if and only if a is a node in B 1 and b is a node in B 2 . Thus, O is the subgraph of Z induced by the nodes in T 1 and T 2 . We used two measures of quality for a CPIM: interaction score and conservation score.
We defined the interaction score of a CPIM (T 1 , T 2 , O) to be simply the total number of host-pathogen PPIs in B 1 or in B 2 and denoted this score by q(T 1 , T 2 , O). Given T 1 and T 2 , a small value of the score indicates that we could connect the proteins in T 1 and in T 2 using a small number of PPIs. The conservation score of a CPIM (T 1 , T 2 , O) measures the amount of evolutionary similarity (at the amino acid level) between the human-pathogen interaction networks T 1 and T 2 . Let P 1 (respectively, P 2 ) be the sets of nodes (both human and pathogen) in T 1 (respectively, T 2 ). We defined the conservation score of the CPIM (T 1 , T 2 , O) as i.e., the total weight of the orthologous or paralogous pairs of nodes in the CPIM divided by the total number of nodes in the CPIM. The larger this score, the more evolutionary conserved we consider T 1 and T 2 to be, since there are fewer proteins without orthologs or paralogs in the CPIM. Note that if we are given T 1 and T 2 , we can maximize this score by making O the subgraph of Z induced by P 1 and P 2 .
The GraphHopper Algorithm. We used the GraphHopper algorithm [34] to compute CPIMs. For the sake of completeness, we describe the algorithm here. Given two human-pathogen PPI networks B 1 and B 2 , GraphHopper finds CPIMs by ''hopping'' from one network to another using orthology and paralogy relationships. We did not provide PPIs between human proteins as input to GraphHopper. GraphHopper attempts to find CPIMs with high conservation and low interaction score. At a high level, the algorithm starts with a connected basis CPIM that contains four nodes and edges. Iteratively, the algorithm ''hops'' from one PPI network to another. In each iteration, GraphHopper expands the CPIM to increase the conservation score, while attempting to keep the interaction score as low as possible. We now provide details about the algorithm. Although the GraphHopper algorithm has been described earlier [34], we include these details here in order to make this work self-contained. Our inputs are two human pathogen protein interaction networks B 1 = (V 1 , E 1 ) and B 2 = (V 2 , E 2 ) and a set Z of orthologous or paralogous protein pairs.
Computing basis CPIMs. We start by constructing a basis set of CPIMs in which each CPIM (T 1 , T 2 , O) has the following properties: Thus, each basis CPIM consists of two or four host-pathogen PPIs (one or two each in T 1 and in T 2 ) and two orthology or paralogy edges. The basis set consists of all such CPIMs.
Expanding a basis CPIM. GraphHopper processes each CPIM in the basis set using the following iterative algorithm (see Figure 5). Let (T 1 , T 2 , O) be a basis CPIM. In iteration k.1, we construct a CPIM (T 1 k , T 2 k , O k ) such that , O k21 ), i.e., the new CPIM has a higher conservation score, and iii. q(T 1 k , T 2 k , O k ) . q(T 1 k21 , T 2 k21 , O k21 ) is as small as possible, i.e., the new CPIM has as few PPIs added to it as possible.
We keep either T 1 k21 or T 2 k21 fixed and ''expand'' the other graph. Without loss of generality, we assume that T 1 k = T 1 k21 and T 2 k21 is a subgraph of T 2 k in the following discussion. We construct (T 1 k , T 2 k , O k ) using the following steps: i. We identify a set P#V 2 of nodes such that each node vMP is not a node in T 2 k21 and is connected by an edge in Z to at least one node in T 1 k .
ii. For each node v M P, we use breadth-first search to compute the shortest path p v in B 2 that connects v to T 2 k21 , i.e., for each node u M T 2 k21 , we compute the shortest path between u and v in B 2 , and set p v to be the shortest of these paths. iii. We find the node v9 in P such that p v9 is the shortest among all paths computed in the previous step. iv. We set T 2 k to be the union of T 2 k21 and p v9 .
v. We set O k to be the union of O k21 and the set of edges in Z incident on v9and a node in T 1 k .
vi. We compute w(T 1 k , T 2 , O k21 ), we go to Step (i) and expand w(T 1 k , T 2 k , O k ) iv. vi. v.
while keeping T 2 k fixed. Otherwise, we stop expanding this CPIM and proceed to the next basis CPIM.
The rationale for these steps is as follows. To expand the CPIM (T 1 k21 , T 2 are available from the Bioinformatics Resource Center Portal at http://www.pathogenportal.net/prc/. The interactions have also been submitted to the IMEx (http:/www.imexconsortium.org) consortium through IntAct [9] and assigned the identifier IM-13779.