Unsupervised extraction of functional gene expression signatures in the bacterial pathogen Pseudomonas aeruginosa with eADAGE

While the large sets of publicly available gene expression data contain substantial information about relationships between mRNA expression profiles and genetic background, environment, and cellular state, cross experiment comparisons of public data are challenged by technical noise that masks biological signals. We previously showed that one could reveal biological signatures within compendia of expression data using an unsupervised neural network algorithm, called ADAGE, which excels at detecting patterns in noisy datasets. Here, we show that the generation and integration of multiple ADAGE models, resulting in an ensemble ADAGE (eADAGE), better identified biological pathways. For the bacterium Pseudomonas aeruginosa, our analysis found that on the order of 1000 samples were needed to build pathway-level gene expression signatures. The P. aeruginosa gene expression compendium contains experiments performed in 78 different media, and we used eADAGE to identify expression signatures associated with medium-type across multiple experiments. We identified a subset of media, including several complex media that were not designed to limit phosphate, in which P. aeruginosa exhibited a phosphate starvation response controlled by PhoB. Furthermore, while it was expected that PhoB activates the phosphate starvation response in low phosphate, our analyses found that PhoB was also active in moderate phosphate concentrations and predicted that activity required a second stimulus provided by a sensor kinase, KinB, which was validated in subsequent experiments including a screen of a histidine kinase knock out collection confirmed the specificity of its role in the activation of the Pho regulon. Algorithms that extract biological signal from large collections of public gene expression data, such as eADAGE, can highlight opportunities to discover mechanisms that are currently unrecognized from public data.


54
Available gene expression data are outstripping our knowledge about the organisms that we're 55 measuring. Ideally each organism's data reveals the principles underlying gene regulation and 56 consequent pathway activity changes in every condition in which gene expression is measured. 57 Extracting this information requires new algorithms, but many commonly used algorithms are 58 supervised. These algorithms require curated pathway knowledge to work effectively, and in 59 many species such resources are biased in various ways (Schnoes et al, 2013; Gillis & Pavlidis,60 2013; Greene & Troyanskaya, 2012). Annotation transfer can help, but such function 61 assignments remain challenging for many biological processes (Jiang et al, 2016). An 62 unsupervised method that doesn't rely on annotation transfer would bypass the challenges of 63 both annotation transfer and biased knowledge. 64 65 that corresponded to more biological pathways. This indicates that this method more 110 effectively identifies biological signatures from noisy public data. While ADAGE models reveal 111 biological features perturbed within an experiment, the more robust eADAGE models also 112 enable analyses that cut across an organism's gene expression compendium. 113 114 To assess the utility of the eADAGE model in making predictions of biological activity, we 115 applied it to the analysis of the Pseudomonas aerguinosa gene expression compendium which 116 included 1051 samples grown in 78 distinct medium conditions, 128 distinct strains and isolates, 117 and dozens of different environmental parameters. After grouping samples by medium type, 118 we searched for eADAGE-defined signatures that differed between medium types. This cross-119 compendium analysis identified five media that elicited a response to low-phosphate mediated 120 by the transcriptional regulator PhoB, and only one of these five media was specifically defined 121 as a condition with low phosphate. While PhoB is known to respond to low phosphate through 122 its interaction with PhoR in low concentrations (Wanner & Chang, 1987), our analyses indicated 123 that PhoB is also active at moderate phosphate concentrations. Specifically in media with 124 moderate phosphate concentrations, the eADAGE model predicted a previously undiscovered 125 role for KinB in the activation of PhoB, and our molecular analyses of P. aeruginosa confirmed 126 this prediction. Analysis of a collection of P. aeruginosa mutants defective in kinases validated 127 the specificity of the KinB-PhoB relationship. 128 129 In summary, eADAGE more precisely and robustly captures biological processes and pathways 130 from gene expression data than other unsupervised approaches. The signatures learned by 131 eADAGE support functional gene set analyses without manual pathway annotation. The 132 signatures are robust enough to enable biologists to identify not only differentially active 133 signatures within one experiment, but also cross-compendium patterns that reveal 134 undiscovered regulatory mechanisms captured within existing public data. 135 136

138
Analysis of the effects of model size on pathway characterization of ADAGE models from P. 139 aeruginosa gene expression. 140 Determining the optimal structure of a neural network is challenging. In the case of an ADAGE 141 model with a single hidden layer we term the number of nodes in the hidden layer to be the 142 model size. We evaluated the model size through both a data-driven heuristic and a knowledge-143 driven heuristic. Importantly, the data-driven heuristic requires no curated pathway 144 information and can be applied even when such resources are unavailable for an organism. 145 During ADAGE training, neural networks are trained to reconstruct the input from data with 146 noise added. The reconstruction error can be used to estimate model sizes that can be 147 supported by the available P. aeruginosa gene expression data. The reconstruction error quickly 148 decreases as model size increases and reaches a floor at model size of approximately 300 149 ( Figure 1A). Further increasing model size does not improve reconstruction, suggesting that the 150 available data are insufficient to support larger models. 151 152 While ADAGE models are constructed without the use of any curated information, we can use 153 experimentally-derived knowledge of gene functions to provide heuristic information about the 154 number and types of pathways captured by a model and determine how this varies with model 155 size. We define a functional signature learned by an ADAGE model as a set of genes that 156 contribute the highest positive or highest negative weights to a specific node (see methods for 157 detail). Therefore, one node results in two gene signatures, one on each high weight side. These 158 high-weight (HW) genes are often involved in a common biological process as demonstrated by 159 the fact that there is often a statistically significant enrichment in specific KEGG pathways 160 within each signature. For models of different sizes (10-1000 nodes), we determined the 161 number of KEGG pathways significantly associated with at least one gene signature in a model, 162 referred to as KEGG pathway coverage for that model, and found that KEGG pathway coverage 163 increased as model size increased until a model size of approximately 300 ( Figure 1B). The 164 number of pathways per node (including pathways associated with both the positive and 165 negative signatures in a node) for all nodes with at least one associated KEGG pathway 166 decreased as model size increased ( Figure EV1), suggesting that multiple pathways were 167 grouped in small models and were separated into more discrete features in large models with 168 more nodes. Though this method was unsupervised, we inferred that methods that extracted 169 signatures corresponding to known pathways better captured biological signals in the 170 compendium. Therefore, considering the data-driven and knowledge-driven heuristics together, 171 we identified a 300-node neural network model as most appropriate for the existing P. 172 aeruginosa gene expression compendium. 173 174 Analysis of the effects of sample number in the training set on ADAGE models 175 The expression compendium contains 1051 samples from 125 experiments. We aimed to 176 identify the amount of data required to saturate the method's ability to discover biologically 177 supported signatures and to identify how far the compendium could be reduced before 178 performance dropped precipitously. To assess this, we performed a subsampling analysis in 179 which we trained ADAGE models on randomly selected sets of 100, 200, 500, and 800 180 expression profiles. We examined the number of KEGG pathways associated with at least one 181 gene signature (pathway coverage) as a function of the size of the training set ( Figure 1C). In 182 the 50-node models, the size used in (Tan et  Using the subsampling strategy, we also evaluated the reconstruction error of each model on 195 its training set and a randomly chosen held out test set of 200 samples. As sample size 196 increased, training reconstruction errors increased slightly while testing reconstruction errors 197 dropped dramatically ( Figure 1D). We fitted exponential models between sample size and the 198 differences of training and testing errors (R 2 = 0.78 for 50-node models and R 2 = 0.83 for 300-199 node models). We extrapolated from these models to predict that testing errors would  200  approximately match training errors when sample size was 782 for 50-node models and 1076  201 for 300-node models. These results suggested that smaller models were less sensitive to sample 202 size, likely because they have fewer parameters to fit and also that our 1051 sample 203 compendium was sufficient to train a 300-node model. and corADAGE were all fixed to 300 nodes, which we found to be appropriate for the current P. 218 aeruginosa expression compendium. 219 220 eADAGE models exhibited greater KEGG pathway coverage than those generated by other 221 methods. We evaluated ADAGE, corADAGE, and eADAGE for the number of covered KEGG 222 pathways ( Figure 2B). Both corADAGE and eADAGE covered significantly more KEGG pathways 223 than ADAGE (t-test p-value of 1.04e-6 between corADAGE (n=10) and ADAGE (n=1000) and t-224 test p-value of 1.41e-6 between eADAGE (n=10) and ADAGE (n=1000)). Moreover, eADAGE 225 models covered, on average, 10 more pathways than corADAGE (t-test p-value of 1.99e-3, n=10 226 for both groups), confirming the critical roles of an ADAGE node's HW gene signatures in 227 defining biological pathways. Genes that participate in multiple pathways can influence 228 pathway enrichment analysis, a factor termed pathway crosstalk (Donato et al, 2013). If 229 eADAGE signatures tended to include genes that participated in many pathways, this could also 230 drive the increase in number of observed pathways. To control for this, we performed crosstalk 231 correction (Donato et al, 2013). After correction, the total number of covered pathways 232 dropped approximately by half ( Figure EV2), but eADAGE still covered significantly more 233 pathways than corADAGE (t-test p-value of 1.60e-3) and ADAGE (t-test p-value of 6.16e-07). 234 These results suggested that eADAGE effectively integrates multiple models to more broadly 235 capture pathway signals embedded in diverse gene expression compendia. 236 237 We next evaluated how specifically and completely signatures learned by the models capture 238 known KEGG pathways. We use each gene signature's FDR corrected p-value for enrichment of 239 a KEGG pathway as a combined measure, as this captures both the sensitivity and specificity. If 240 a pathway was significantly associated with multiple gene signatures in a model, we only 241 considered its most significant association. We found that 71% of pathways were more 242 significantly enriched (had lower median p-values) in corADAGE models (n=10) when compared 243 to individual ADAGE models (n=100) ( Figure EV3). This increased to 87% for eADAGE (n=10). We 244 also directly compared eADAGE and corADAGE by this measure and observed that 74% of 245 pathways were more significantly enriched in eADAGE. Our earlier evaluation of pathway-based 246 heuristics showed that different pathways were best captured at different model sizes ( Figure  247 EV3). We next compared the 300-node eADAGE model to individual models of each size. 248 Although the 300-node eADAGE models were constructed only from 300-node ADAGE models, 249 we found that 69% pathways were more significantly enriched (i.e. lower median p-values) in 250 eADAGE models than ADAGE models of any size, including those with more nodes than the 251 eADAGE models ( Figure EV3). Three example pathways that are best captured either when 252 model size is small, large, or in the middle are all well captured in the 300-node eADAGE model 253 ( Figure 2C). These results demonstrate that eADAGE's ensemble modeling procedure captures 254 signals across model sizes more effectively than individual ADAGE and corADAGE models. Thus 255 eADAGE more completely and precisely captures the gene expression signatures of biological 256 pathways. 257 258 We designed eADAGE to provide a more robust analysis framework than individual ADAGE 259 models. To assess this, we examined the percentage of models that covered each pathway 260 (coverage rate) between ADAGE and eADAGE ( Figure EV4). The pathways covered by each 261 individual ADAGE model were highly variable. Most KEGG pathways were covered by less than 262 half of individual models but more than half of eADAGE models ( Figure EV5), suggesting that 263 eADAGE models were more robust than individual ADAGE models. We excluded all pathways 264 always covered by both individual ADAGE and eADAGE models and observed that 72% of the 265 remaining pathways were covered more frequently by eADAGE than ADAGE. This suggests that 266 their associations are stabilized through the ensemble construction procedures. In summary, 267 these comparisons of eADAGE and ADAGE reveal that not only are more pathways captured 268 more specifically, but also those that are captured are captured more consistently. performed PCA and generated multiple ICA models from the same P. aeruginosa expression 275 compendium and evaluated their KEGG pathway coverage following the same procedures used 276 for eADAGE. eADAGE substantially and significantly outperforms PCA in terms of pathway 277 coverage ( Figure 2D). We observed that low-order PCs tend to be associated with more 278 pathways than high-order PCs, which is consistent with the higher variance explained by low-279 order PCs. ICA and eADAGE covered a similar number of pathways at the significance cutoff of 280 FDR 0.05. However, we observed that eADAGE represented KEGG pathways more precisely 281 than ICA. Specifically, among pathways significantly enriched in either approach, 68% pathways 282 exhibited more significant enrichment in eADAGE. Increasing the significance threshold for 283 pathway coverage demonstrates the advantage of eADAGE ( Figure 2D). 284 285 Pathway databases provide a means to compare unsupervised methods for signature discovery. 286 Not all pathways will be regulated at the transcriptional level, but those that are may be 287 extracted from gene expression data. The unsupervised eADAGE method revealed signatures 288 that corresponded to P. aeruginosa KEGG pathways better than PCA, ICA, ADAGE, and 289 corADAGE. It had higher pathway coverage (breadth), covered pathways more specifically 290 (depth), and more consistently than existing methods (robustness). 291 292 Elucidating functional signatures that are indicative of growth medium 293 Analysis of differentially expressed genes is widely used to analyze single experiments, but 294 crosscutting signatures are required to reveal general response patterns from large-scale 295 compendia. Signature-based analyses can suggest mechanisms such as crosstalk and novel 296 regulatory networks. However, in order for this to be effective, these signatures must be robust 297 and comprehensive. By capturing biological pathways more completely and robustly, eADAGE 298 enables the analysis of signatures, including those that don't correspond to any KEGG pathway, 299 across the entire compendium of P. aeruginosa. 300 301 Gene expression experiments have been used to investigate a diverse set of questions about P. 302 aeruginosa biology, and these experiments have used many different media to emphasize 303 different phenotypes. Our manual annotation showed that 78 different base media were used 304 across the gene expression compendium (Table EV1). While the compendium contains 125 305 different experiments, it is exceedingly rare for investigators to use multiple base lab media 306 within the same experiment. There were only two examples in the entire compendium (Table  307 EV1). Further, other than LB, which is used in 43.6% (458/1051) of the samples in the 308 compendium, most media are only represented by a handful of samples. Comparing samples 309 across the compendium can help shed light on the influence of medium on P. aeruginosa even 310 if the medium wasn't widely used. 311 312 For biological evaluation, we built a single new eADAGE model with 300 nodes. The model's 313 weight matrix (Table EV2), positive and negative gene signatures for each node (Table EV3), and  314 signature activities for each sample in the compendium (Table EV4) are provided. To compare 315 the P. aeruginosa response to the different media in the compendium, we looked for signatures 316 that were differentially active in samples grown in different media and media groups (see 317 methods). Table EV5 lists signatures that are active in a specific medium above a stringent 318 threshold and Table EV6 lists signatures that are most active in a group of media (a complete  319 list of signature-media group associations is in Table EV7). 320 321 Distinct aspects of the response to low phosphate are captured among the most active 322 signatures 323 The two signatures with the highest pan-media activation scores are Node164pos and 324 Node108neg (Table EV6), therefore we examined them to better understand their roles in the 325 transcriptional response to different media. To evaluate the basis for the high activation scores, 326 we examined their underlying activities across all media (Node164pos is shown as an example 327 in Figure 3A). This revealed that they were highly active in King's A medium, Peptone medium, 328 and NGM+<0.1mM phosphate (NGMlowP), but not in NGM+25mM phosphate (NGMhighP). 329 The difference in their activities between NGMlowP and NGMhighP suggested that these 330 signatures respond to phosphate concentrations. Interestingly, the other two media (Peptone 331 and King's A) in which P. aeruginosa gene expression leads to consistently high activities also 332 had low phosphate concentrations (0.4 mM) relative to other media in the compendium. For 333 example, commonly used LB has a phosphate concentration of approximately 4.5 mM (Bertani,334 2004) and many others have concentrations above 20 mM. 335 336 Consistent with the observation that Node164pos is only active in low phosphate media, a 337 KEGG pathway enrichment analysis of Node164pos genes suggested a strong enrichment in 338 phosphate acquisition related pathways ( The transcript levels of genes in Node164pos are higher in Peptone, King's A, and NGMlowP 346 medium relative to the other samples in the compendium including NGMhighP ( Figure 3B). 347 348 Among the highest weight genes in Node164pos is a gene that encodes alkaline phosphatase 349 (PhoA), an enzyme with an activity that can be easily measured using a colorimetric assay. As 350 expected, PhoA activity (blue color) was high when P. aeruginosa was grown on NGMlowP and 351 not when grown on NGMhighP ( Figure 4A). The same trend was observed in another medium, 352 MOPS, with the same low and high phosphate concentrations. Also consistent, PhoA was not 353 active on the phosphate-replete medium LB. Although King's A and Peptone are not considered 354 phosphate-limited media, their phosphate concentrations were low enough to provoke PhoA 355 activity, as predicted by Node164pos's signature-medium relationship ( Figure 4B). Furthermore, 356 PhoA activity was dependent on PhoB and the PhoB-activating histidine kinase PhoR, which is 357 consistent with previous publications (Bielecki et al, 2015). These results provide striking 358 evidence that low phosphate media, including Peptone and King's A, induced PhoB activity as 359 predicted by the eADAGE analysis and previous characterizations of the P. aeruginosa 360 phosphate response. 361 362 Unlike Node164pos, Node108neg is not characterized by KEGG pathways associated with 363 phosphate starvation (Table EV6). Of the thirty-two genes in Node108neg, more than half of 364 the genes are shared with Node164pos, suggesting a relationship between the two signatures. 365 Node108neg was also enriched of PhoB regulated genes (FDR q-value of 5.2e-9 in 366 hypergeometric test), but contains a much narrower set of PhoB regulon. Of the PhoB-367 regulated genes present in Node108neg, we found almost all of the genes (six of seven) that 368 were also regulated by TctD, a transcriptional repressor described by Haussler and colleagues 369 (Bielecki et al, 2015). TctD binding to these promoters prevents their activation by PhoB. 370 Therefore, we predicted that Node108neg represented a group of PhoB-activated genes that 371 were also negatively regulated by TctD. This is consistent with the observation that 372 Node108neg was the most differentially active signature in the RNA-Seq experiment that 373 compares ∆tctD and the wild type (E-GEOD-64056). eADAGE learned the relationship between 374 PhoB and TctD directly from an expression compendium that did not contain any samples of 375 tctD mutants, suggesting that the signature was derived from samples with differences in TctD  376 activity. This demonstrates the value of KEGG-independent eADAGE signatures for the analysis 377 of large-scale compendia. 378 379 We evaluated whether the PhoB and TctD signals were also extracted by PCA, ICA, or ADAGE 380 using ten models constructed by each method except PCA, which is deterministic and produces 381 a single model models. The PhoB regulon was captured with less fidelity by ICA and ADAGE than 382 by eADAGE, as reflected by the smaller overlap with the PhoB regulon (Table EV8). PCA 383 captured a strong PhoB signal in its 19th principle component. However, it did not learn the 384 subtler TctD signal. Component19pos was the only signature that was highly active in the low 385 phosphate media group (Table EV9). In summary, the other methods were able to capture 386 some of this signature but in a manner that was less complete or failed to separate TctD. As predicted by Node164pos activity, PhoA activity was evident and was KinB dependent on PIA 400 medium ( Figure 4B). Notably, PhoA activity was still dependent on PhoB and PhoR as it was on 401 Peptone and King's A. Over time, the ∆phoR mutant developed PhoA activity on all three media, 402 but the ∆kinB mutant on PIA did not ( Figure 4C). Recovery of PhoA activity in the ∆phoR mutant 403 suggests that there are PhoR-independent paths for PhoB activation. The co-dependence on 404 KinB and PhoR suggest that these kinases do not perform redundant functions but rather 405 regulate PhoB in conjunction with each other. To determine if the deletion of kinases non-406 specifically altered PhoB activation, we screened 63 in-frame deletion mutants each lacking a 407 histidine kinase (Table EV9)  (WT) showed PhoA activity at 0.5 mM, ∆kinB did not ( Figure 4D). Here we develop heuristics that target two aspects of the problem: the model needs to be well 439 supported by the amount of available data and the extracted features should well resemble 440 known biological processes. Our data-driven heuristics can be applied to organisms for which 441 gene-process annotations are lacking. We expect that additional data will support larger models, 442 especially data that measure experimental conditions that are not tested in the existing 443 compendium. 444 445 We also contribute a novel eADAGE algorithm. This algorithm combines multiple ADAGE 446 models into one ensemble model to address model variability due to stochasticity and local 447 minima. The algorithm is inspired by consensus clustering, which reconciles the differences in 448 cluster assignments in multiple runs. Comparable approaches have also been applied for ICA, 449 where researchers have used the centrotypes of multiple ICA models as the final model 450 (Frigyesi et al, 2006 There are now abundant public gene expression data. Cross-compendium analyses provide the 508 opportunity to efficiently use existing data to identify regulatory patterns that are evident 509 across multiple experiments, datasets, and labs. To tap this potential, we will require algorithms 510 that robustly integrate these diverse datasets in a manner that is not tied to only aspects of 511 biology that are well understood. We expect that robust unsupervised data integration 512 methods, like eADAGE, will play a key role in this process. 513 514

515
Data processing 516 We followed the same procedures for data collection, processing, and normalization from (Tan 517 et Construction of ADAGE models 528 We constructed ADAGE models as described in (Tan et al, 2016b). To summarize the process 529 and outputs, we constructed a denoising autoencoder for the gene expression compendium. 530 Denoising autoencoders model the data in a lower dimension than the input space, and the 531 models are trained with random gene expression measurements set to zero. Thus an ADAGE 532 model must learn gene-gene dependencies to fill in this missing information. Once the ADAGE 533 model is trained, each node in the hidden layer contains a weight vector. These positive and 534 negative weights represent the strength of each gene's connection to that node. 535 536 Gene signatures as sign-specific high-weight gene sets 537 In previous work (Tan et al, 2016b) we defined high-weight (HW) genes as those in the 538 extremes of the weight distribution on the positive or negative side of a node. Here, we use a 539 more granular definition that accounts for sign specificity. Each node's gene weights are 540 approximately normal and centered at zero in ADAGE models (Tan et al, 2015(Tan et al, , 2016b. We 541 defined positive HW genes as those that were more than 2.5 standard deviations from the 542 mean on the positive side, and negative HW genes as those that were more than 2.5 standard 543 deviations from the mean on the negative side. After this split, a model with n nodes provides 544 2n gene signatures. Because a node is simply named by the order that it occurs in a model, we 545 named two gene signatures derived from one node as "NodeXXpos" and "NodeXXneg". 546 547

KEGG pathway enrichment analysis 548
To evaluate the biological relevance of gene signatures extracted by an ADAGE model, we 549 tested how they relate to known KEGG pathways (Kanehisa, 2000). We tested a signature's 550 association with each KEGG pathway using hypergeometric test and corrected the p-value by 551 the number of KEGG pathways we tested following the Benjamini-Hochberg procedure. We 552 used a false discovery rate of 0.05 as the significance cutoff. 553 554 Genes can be annotated to multiple pathways. To control for this effect in our analysis, we also 555 performed a parallel analysis after applying crosstalk correction as described in (Donato et al,556 2013). This approach uses expectation maximization to map each gene to the pathway in which 557 it has the greatest predicted impact. A gene-to-pathway membership matrix, defined using 558 KEGG pathway annotations, initially makes the assumption that each gene's role in all of its 559 assigned pathways remains constant independent of context. We then applied pathway 560 crosstalk correction using genes' weights for each node in the ADAGE model. We used the 561 expectation maximization algorithm to maximize the log-likelihood of observing the 562 membership matrix given each node's weight vector. This process inferred an underlying gene-563 to-pathway impact matrix and iteratively estimated the probability that a particular gene g 564 contributed the greatest fraction of its impact to some pathway P. Upon convergence, we 565 assigned each gene to the pathway in which it had the maximum impact. The resulting pathway 566 definitions do not share genes. We then used these corrected definitions for an analysis parallel 567 to the KEGG process described above. 568 569 Reconstruction error calculation 570 The training objective of ADAGE is to take a sample with added noise and return the originally 571 measured expression values. The error between the reconstructed data and the initial data is 572 the 'reconstruction error.' To summarize the difference over all genes we used cross-entropy 573 between the original sample and the reconstruction, which has been widely used with these 574 methods and in this domain (Vincent et al, 2008;Tan et al, 2016b). This matches the statistic 575 used during training of the model. To calculate reconstruction error for a model, we use the 576 mean reconstruction error across samples. 577 578 Model size and sample size heuristics 579 One important parameter of a denoising autoencoder model is the number of nodes in the 580 hidden layer, which we refer to as the model size. To evaluate the impact of model size and 581 choose the most appropriate size, we built 100 ADAGE models at each model size of 10, 50, 100, 582 200, 300, 500, 750, and 1000, using different random seeds. The random seed determines the 583 initialization statuses of the weight matrix and bias vectors in ADAGE construction and thus 584 different random seeds will result in training stopped at different local minimums. Other 585 training parameters were kept the same and set to the values identified as suitable for a gene 586 expression compendium (Tan et al, 2015). In total, 800 ADAGE models with 100 at each model 587 size were generated in the model size evaluation experiment. 588 589 To evaluate the impact of sample size on the performance of ADAGE models, we randomly 590 generated subsets of the P. aeruginosa expression compendium with sample size of 100, 200, 591 500, and 800. We then trained 100 ADAGE models at each sample size, each with a different 592 combination of 10 different random subsets and 10 different random training initializations. To 593 evaluate each model, we randomly selected 200 samples not used during training as its testing 594 set. We performed this subsampling analysis at model size 50 and 300. In total, 800 ADAGE 595 models were built in the sample size evaluation experiment. 596 597 Construction of eADAGE models 598 We constructed ensemble ADAGE (eADAGE) models by combining many individual ADAGE 599 models in to a single model. For each eADAGE model we combined 100 individual ADAGE 600 models. The 100 models were trained with identical parameters but distinct random seeds. For 601 an eADAGE model of size 300, we trained 100 individual models with 300 nodes each, which 602 provided 30000 total nodes. Each node has a weight vector. We have previously observed that 603 high-weight genes provided the most information to each node (Tan et al, 2016b) We constructed PCA and ICA models and defined each model's weight matrix following the 630 same procedures in (Tan et al, 2016b). To compare with the 300-node eADAGE, we generated 631 models of matching size (300 components). For ICA, we evaluated 10 replicates. PCA provides a 632 single model. PCA and ICA models were evaluated through the KEGG pathway enrichment 633 analysis described above. 634 635 pathway coverage increases with sample size and peaks at 500 samples. 300-node models 858 cover more pathways than 50-node modes in general and maintain a slow growing trend of 859 pathway coverage at the maximum sample size. 860 D Data-driven sample size heuristics on reconstruction error. In both 50-and 300-node 861 models, the reconstruction errors on the test set get closer to the reconstruction errors on the 862 train set as sample size increases. 863 864 Figure 2: The construction and performance of eADAGE. 865 A eADAGE construction workflow. 100 individual ADAGE models were built using the same 866 input dataset (step 1). Nodes from all models were extracted (step 2) and clustered based on 867 the similarities in their associated weight vectors (step 3). Nodes derived from different models 868

Figure Legends
were rearranged by their clustering assignments (step 4). Weight vectors from nodes in the 869 same cluster were averaged and thus becoming the final weight vector of a newly constructed 870 node in an eADAGE model (step5  node were added together. When model is small, one node needs to account for multiple KEGG 908 pathways. As model size grows, more nodes become available and pathways also tend to 909 spread into different nodes. 910 911 Figure EV2: Pathway coverage comparison between individual ADAGE and ensemble ADAGE 912 after correcting pathway crosstalk effects. eADAGE models (n=10) covers significantly more 913 pathways than both corADAGE (n=10) and ADAGE (n=1000). 914 915 Figure EV3: The association significance of each KEGG pathway in the 300-node eADAGE models 916 (n = 10), 300-node corADAGE models (n=10) and ADAGE models with different number of 917 nodes (n = 100 for each model size). 918 919 Figure EV4: The coverage rate of each KEGG pathway in 300-node ADAGE models (n=1000) and 920 300-node eADAGE (n=10) models. 921 922 Figure EV5: The distribution of KEGG pathway coverage rates in 300-node ADAGE models 923 (n=1000) and 300-node eADAGE models (n=10). eADAGE shows a higher density in distribution 924 on the high coverage end. 925 926 Figure EV6: