Using genome-wide expression compendia to study microorganisms

inform our understanding of biological mechanisms in multicellular eukaryotes [18], and underlie ecological cycles of biotransformations [19]. Therefore, transcriptomic studies are commonly used to examine trait-associated genes and their regulation.
Early experiments revealed the importance of transcriptional regulation in microbes. For example, experiments in the model organisms Escherichia coli (E. coli) and Saccharomyces cerevisiae (S. cerevisiae) revealed that common gene expression responses were elicited by different environmental stressors [20,21]. Studies in Pseudomonas aeruginosa (P. aeruginosa), an opportunistic gram-negative pathogen, found transcription factors, including the global regulator LasR, that control the expression of a number of extracellular factors that contribute to virulence [6,22,23] including proteases [5,24]. Overall, by studying the global transcriptome response, we can start to understand the mechanisms of traits of interest.
Microbial transcription in response to microbial interactions and environmental cues is complex. Genome organization affects transcription through diverse mechanisms, including three-dimensional organization [25], gene proximity [26], and promoter location [27]. Transcriptional regulators interact with internal and external cues to achieve transcription programs that reflect their environment. For example, in microbial quorum sensing (QS), a cell-cell communication process that allows microbes to respond to population density through signal molecules, microbes produce and respond to signals that facilitate adaptation to varying conditions [28,29]. Farnesol, a QS signaling molecule produced by Candida albicans (C. albicans), inhibits biofilm formation [30]. Similarly, the accessory gene regulator locus in Staphylococcus aureus (S. aureus), which encodes the QS system, mediates the development of biofilm in response to nutrient availability [31]. QS regulators can also impact environmental responses by regulating other transcription factors such as the oxygen-sensitive Anr in P. aeruginosa [32]. Anr activity is higher in QS-defective strains that lack function of the LasR QS regulator. Thus, lasR mutants (LasR-), which are frequently isolated from cystic fibrosis (CF) patients, are more fit in microoxic conditions than their LasR+ counterparts [33].
In addition to environmental cues, microbes tend to grow in polymicrobial communities where they sense and transcriptionally respond to other microbes. Both competitive and cooperative behaviors [34,35] influence phenotypes [36,37], eliciting interactions like the production of public goods (cross-feeding) [38], resource consumption, interference competition [39], or coordinated production of phenotypes (increased virulence and antibiotic resistance [40]). For example, in co-infection of S. aureus and P. aeruginosa, P. aeruginosa exoproducts can select for S. aureus small colony variants that are aminoglycoside resistant [40]. Finally, these microbe-microbe interactions also depend on environmental factors. Doing et al. found that P. aeruginosa produced antifungal phenazines against C. albicans, but that this antagonistic interaction depends on phosphate availability and C. albicans fermentation [41,42]. Even two different genotypes can influence each other, as in the citrate cross-feeding found by Mould et al. [43].
Given the context-specific nature of transcription, leveraging data across many experiments allows researchers to study how microbes regulate transcription of different genes and pathways across different conditions, and so to gain a more systems-level understanding of the transcriptome. Gene expression compendia, which are integrated collections of experiments, are one solution for examining transcriptional patterns across different contexts. In this review, we describe how these compendia are constructed and the challenges faced, and we highlight analyses that use compendia to reveal patterns of interest.

Construction of microbial expression compendia
For the purposes of this review, we define an expression compendium as a heterogeneous collection containing hundreds to thousands of samples that span multiple experiments, assembled from data collected for diverse purposes. The work that we identified as meeting these criteria ranges from 8 experiments to upwards of 100 experiments, which ensures that each compendium contains enough samples to apply system-level computational tools. Notable existing microbial compendia can be found in Table 1. The construction of each of these compendia began with the collection of relevant gene expression experiments from public repositories like ArrayExpress [44], Gene Expression Omnibus (GEO) [45], Sequence Read Archive (SRA) [46] and others [47,48]. Experiments of interest were then downloaded from these public repositories. In the case of the compendia represented in Table 1, all experiments (i.e. samples deposited together) within a given compendium were measured on the same platform to avoid bias and maintain a uniform reference. Additional filtering of samples was optionally performed to remove spurious random correlation between genes [49]. Next, the samples were normalized to allow for cross-sample comparison. Overall, filtering, consistency in platform and normalization ensure that the compendium data is uniformly processed.
There are different normalization techniques available depending on the technology. As an example, the P. aeruginosa RNA-seq compendium started with expression profiles downloaded from SRA, which were then median-ratio (MR) normalized [49,50]. The authors evaluated two well-known RNA-seq normalizations, transcripts per million (TPM) and trimmed mean of M-values (TMM) [51], which corrected for spurious correlations; however, correlations between random pairs of genes were still elevated compared to MR normalization, which was their preferred strategy. These RNA-seq normalization methods address systematic variation, including differences in library size (i.e. sequencing depth) [52] and gene length [53], allowing for comparisons between samples and between genes. Similarly, systematic variation also exists in measurements made using array technology, though the sources are different and include differences in preparation protocol (e.g., total quantity of starting RNA, dye labeling) or differences in processing (e.g., different scanners or runs). One of the well-established normalization methods for the Affymetrix GeneChip system, which most of the compendia in Table 1 used, is RMA [54], which is a quantile method. Bolstad et al. [55] reported that, compared to other global normalization methods for single-label arrays, RMA successfully reduced bias at reasonable compute speed. A similar review of two-color array technology, performed by Yang et al. [56], showed that different global or location-based normalization methods should be performed depending on the set of control spots. In a couple of cases where the compendium integrated across different platforms, such as two different array technologies or a combination of array and RNA-seq, studies used quantile normalization [57][58][59].
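The median-ratio idea mentioned above can be illustrated with a minimal sketch. It follows the commonly described median-of-ratios scheme (per-gene geometric mean as a pseudo-reference, per-sample median ratio as a size factor) and is not the exact pipeline used for the P. aeruginosa compendium [49,50]. The function name `median_ratio_normalize` and the toy counts are assumptions made up for illustration.

```python
import math
from statistics import median


def median_ratio_normalize(counts):
    """Median-ratio normalize a genes x samples table of raw counts.

    counts: list of rows, one per gene; each row holds one count per sample.
    Returns (normalized_counts, size_factors).
    """
    n_samples = len(counts[0])

    # Per-gene log geometric mean, using only genes observed in every sample
    # (a single zero count would make the geometric mean zero).
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has non-zero counts in every sample")
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]

    # Size factor of a sample: median ratio of its counts to the gene-wise
    # geometric means, computed on the log scale.
    size_factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - g for row, g in zip(usable, log_geo_means)]
        size_factors.append(math.exp(median(log_ratios)))

    normalized = [[c / s for c, s in zip(row, size_factors)] for row in counts]
    return normalized, size_factors


# Toy data (made up): sample 2 is sample 1 sequenced at twice the depth.
toy_counts = [
    [10, 20],
    [100, 200],
    [5, 10],
    [0, 3],  # excluded from size-factor estimation because of the zero
]
normalized, size_factors = median_ratio_normalize(toy_counts)
```

In this toy case the second sample's size factor is twice the first, so the depth difference disappears after division. Using the median makes the size factor insensitive to a minority of strongly changing genes.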
Regardless of the technology used, expression levels between samples can vary due to technical reasons, mentioned above, and so it's important to use normalization methods to adjust for these differences in order to compare between two gene expression profiles for applications such as gene function prediction, transcription regulatory network (TRN) inference and feature extraction.
Most of the existing compendia in Table 1 did not apply batch correction. In one case, where the compendium combined array and RNA-seq data, ComBat [58] was applied. While normalization is necessary in the context of compendia and facilitates cross-sample comparisons, batch correction is an optional step, and its application depends on the experiments included in the compendium. See section 'Challenges integrating across experiments' for a discussion of batch correction.
As more transcriptome data are generated, repositories like refine.bio [59], COLOMBOS [60,61], PILGRM [62], and M3D [63] are being developed to provide easily downloadable compendia where the data has been uniformly processed for different bacterial species. In general, the abundance of data has facilitated the generation of compendia to study transcriptional patterns across experiments.

Systems-level models
The construction of compendia, which contain hundreds to thousands of samples, has opened the door to the development of computational approaches, especially machine learning methods that have been successful at prediction tasks [64] and pattern extraction [65] in computer science, to discover transcriptional patterns in microbes.
Compendia can contribute to helping us gain a systems-level understanding of microbial biology. One major goal for systems biology is to model how information is encoded, specifically to reverse engineer the hierarchy of the transcriptomic regulatory network (TRN) [66][67][68][69][70][71]. Knowing the organization of a regulatory network allows us to control or optimize parts of the system, a necessary step for many biotechnological advances [72][73][74][75]. This task requires a large amount of heterogeneous data, which compendia provide, to identify shared patterns looking across a variety of interventions [76].
Dimensionality reduction methods can also be deployed to extract key patterns in data and reveal the transcriptional relationships between sets of genes [77]. Applying dimensionality reduction models to compendia allows users to study changes in gene sets and reveal more subtle and possibly undiscovered signals that could be masked by strong signals (i.e. a large fraction of genes representing the same pathway) [78,79]. For example, ADAGE, a denoising autoencoder trained on a P. aeruginosa compendium, captured regulation patterns and biological processes [80]. Tan et al. showed that co-operonic genes were weighted highly in the same latent variables and, similarly, that KEGG gene sets were enriched in some latent variables. They also showed that function prediction using the ADAGE weight matrix was more accurate than prediction using a randomly permuted gene weight matrix. Furthermore, the latent representation of the gene expression data detected existing subtle expression differences [80] and also revealed a new aspect of low phosphate response that depends on the media [81]. These latent variables were also shown to detect pathway-pathway relationships, i.e. pathways that co-occur in the same latent variable [82]. A similar dimensionality reduction analysis applied a sparse autoencoder to a yeast (S. cerevisiae) compendium, where Chen et al. found that latent variables represented pathways and other layers of biological abstraction such as transcription factor complexes and signaling pathways [83]. In other studies, applying independent component analysis (ICA) to a compendium of transcriptome data revealed transcription modules [68][69][70]. Specifically, Poudel et al. identified differentially active modules in S. aureus that varied based on the growth media, which revealed metabolic regulators that respond to shifts in available nutrients [69].
These unsupervised approaches summarize patterns in the expression compendia and can abstract different layers of a biological system, which is useful for understanding the interaction between different molecular processes as well as for generating new hypotheses. Webtools were developed to facilitate the exploration of the summarized data, like ADAGE [84], as well as to search through the experiments available in compendia such as the ones found in COLOMBOS [61,85,86], PILGRM [62] and others [87] in order to direct future research.

Methodologies to leverage compendia
With the breadth of transcriptional patterns captured by compendia, recent approaches demonstrate how compendia can be used to put new experiments in the context of existing ones as well as to leverage the aggregation of patterns available to study genomic patterns. Lee et al. developed a general framework for distinguishing between common and experiment-specific differentially expressed genes, called SOPHIE (Specific cOntext Pattern Highlighting In Expression data) [88]. This approach compares gene expression changes in a target experiment with changes in a background set of experiments, thereby allowing researchers to interpret and prioritize patterns in differentially expressed genes. The authors demonstrated that SOPHIE successfully prioritized genes with small differences in expression that were directly due to the perturbation being studied and not due to condition-specific secondary effects. In general, reanalysis and mining of the experiments within these compendia can be facilitated by tools like SOPHIE [88] or algorithms like GAUGE [89], which automate sample group detection for downstream statistical analyses. Overall, approaches like SOPHIE can find patterns that generalize across compendia.
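The comparison of a target experiment against a background set can be sketched as a per-gene z-score. This is a simplified illustration of the idea only, not SOPHIE's implementation [88]; the function name `specificity_scores`, the gene names, and all values are hypothetical.

```python
from statistics import mean, stdev


def specificity_scores(target_lfc, background_lfc):
    """Score how unusual each gene's change is relative to background experiments.

    target_lfc: dict gene -> log2 fold change in the experiment of interest.
    background_lfc: dict gene -> list of log2 fold changes of that gene in
        other (background) experiments from the compendium.
    Returns dict gene -> z-score of the target change against the background.
    """
    scores = {}
    for gene, lfc in target_lfc.items():
        background = background_lfc[gene]
        spread = stdev(background)
        if spread == 0:
            continue  # z-score undefined; skip genes with a constant background
        scores[gene] = (lfc - mean(background)) / spread
    return scores


# Hypothetical values for illustration only.
target = {"geneA": 3.0, "geneB": 0.8}
background = {
    "geneA": [2.5, 3.5, 3.0, 2.0, 4.0],      # geneA changes a lot in most experiments
    "geneB": [0.0, 0.1, -0.1, 0.05, -0.05],  # geneB is normally stable
}
scores = specificity_scores(target, background)
ranked = sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
```

Here geneA has the larger fold change but behaves as it does in most background experiments, whereas the small change in geneB is unusual and therefore ranks first. This mirrors the prioritization of small but experiment-specific differences described above.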

Condition-specific responses
Transcriptional profiling is a snapshot of an organism's state, which includes numerous diverse processes. Understanding the information that is captured in these profiles is important for understanding how microbes sense and respond to their environment. For example, Kim et al. inferred E. coli cellular and environmental state, like growth phase or aerobic conditions, from a gene expression compendium and identified pathways that are associated with the genes that are most predictive of these cellular states [58]. Other studies used gene expression to annotate the functional roles of genes [90][91][92][93][94]. For example, Troyanskaya et al. introduced a method called MAGIC, which predicts whether two proteins are functionally related using multiple data types including gene expression data. They demonstrated that MAGIC function predictions were consistent with GO terms using S. cerevisiae expression data [92]. Overall, by using these compendia to make predictions we can learn what genes are involved in different environmental conditions or processes, which can improve our understanding of microbial condition-specific responses.
Importantly, the identification of conditional regulons requires the study of a response of interest across multiple conditions. The diversity of condition-specific responses has been elucidated in targeted studies that have examined expression profiles in response to multiple stimuli such as various stressors [95]. However, the comprehensive mapping of condition-specific responses is often beyond the scope of an individual experiment. The reanalysis and meta-analysis of publicly available data revealed subsets due to the natural differences in how separate groups studied related phenomena in a way that informed each other [81,93,96]. For example, through compendium-wide analysis of the low phosphate response, Tan et al. identified a condition-specific element of the low phosphate signaling cascade [81]. In another example, Huttenhower et al. developed an approach that provided condition-specific context for gene function predictions. They suggested a novel connection between S. cerevisiae sporulation response and the introduction of xylose metabolism genes [93]. These results would not have stood out from any individual experiment but were clear when the larger compendium was analyzed.

Inspiration from non-microbial expression compendia
Non-microbial gene expression compendia have also been generated and used for a variety of purposes, many of which may inspire future endeavors for microbial compendia [97][98][99][100][101][102][103]. A human-based gene, ortholog, or k-mer based tool could facilitate rapid searches of the microbial compendia to identify samples from different experiments with similar expression profiles. Transfer learning has also successfully transferred knowledge contained in publicly available data sets and databases to rare disease samples [99,102]. Such methods could be applied to better unravel pathway-level patterns for rare microbial species. Lastly, human compendia have been leveraged to identify alternative splicing [100], lessons which may be applied to the discovery of polycistronic transcripts directly from RNA-seq reads. Further research is needed to explore how lessons learned from human transcriptome compendia can best apply to microbial transcriptomics.
These studies demonstrate that the versatile data available in compendia provide a valuable resource for gaining a systems-level understanding of transcriptional signaling as well as for making predictions. Additionally, low-dimensional representations of compendia capture transcriptional patterns that can reveal coordinated activity of gene sets and pathways and allow researchers to generate new hypotheses [41,81]. Finally, new methods are being developed to further leverage the benefits of compendia to improve different types of analyses.

Challenges integrating across experiments
While compendia are rich community resources that can be leveraged to gain new insights into transcription, two challenges make integration across experiments a difficult endeavor: batch effects and strain variation. Batch effects introduced by technical sources (lab that produced the data, sequencing depth) or biological sources (experimental conditions) can either obscure or highlight biological signals, while strain-level genome differences can lead to reduced detection of transcription due to incomplete read mapping.

Batch effects
In general, batch effects can disrupt detection of biological signal [104][105][106][107]. Consequently, it might be expected that compendia, which integrate many different types of experiments together, require batch correction. However, a recent study by Lee et al. [108] examined the effect of technical sources of variability in a compendium setting. They simulated gene expression compendia with varying amounts of technical variability and assessed the ability to detect the original underlying structure in the data after noise was added and then after batch correction was applied. In general, they found that for a compendium with a few sources of technical variation, batch correction can be effective; however, with many more sources of technical variation, batch correction is not necessary and can even start to remove some of the desired biological signal. If correction is applied to a compendium where the experiment-specific noise is largely independent, more of the biological information is removed, since biological signals are consistent while noise is experiment specific.
In the case where a compendium contains a few sources of technical variability, like different platforms [58], the dominant signal is the variability between platforms and applying batch correction methods should recover the underlying biological signal. In contrast, in the case where a compendium contains many sources of variability, like many different types of experiments each contributing independent sources of noise, then the aggregation of each experiment-specific source of variability washes out from the underlying biological signal that is consistent across experiments. In this scenario, applying batch correction methods will remove more of the biological signal.
For the cases where batch correction is effective, commonly established methods like Limma [109] and ComBat [110] allow scientists to set sources of variability as covariates [58]. Limma removes technical noise by first fitting a linear model, using lmFit, which describes the relationship between the input gene expression and the experimental design labels such as batch assignments and covariates. The resulting model is a coefficient matrix that contains weights for the contribution of the noise component contained in the total observed gene expression matrix. This estimated contribution can be subtracted out from the input expression data. Similarly, ComBat also assumes that the input gene expression signal contains an additive batch effect component that can be removed by estimating the batch effect using empirical Bayes and subtracting it out.
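The additive model underlying both methods can be sketched in a few lines. This is a deliberately reduced stand-in, assuming batch is the only term in the design (no biological covariates and no empirical Bayes shrinkage), so it is neither Limma nor ComBat. The function name `remove_additive_batch_effect` and the toy values are illustrative assumptions.

```python
from collections import defaultdict


def remove_additive_batch_effect(expression, batches):
    """Subtract a per-gene, per-batch offset from an expression table.

    expression: list of rows, one per gene; each row holds one (log-scale)
        value per sample.
    batches: list of batch labels, one per sample.
    Returns a corrected table of the same shape.
    """
    by_batch = defaultdict(list)
    for j, label in enumerate(batches):
        by_batch[label].append(j)

    corrected = []
    for row in expression:
        batch_means = {
            b: sum(row[j] for j in idx) / len(idx) for b, idx in by_batch.items()
        }
        # Reference level: the average of the batch means, so that removing the
        # offsets leaves the gene's overall level roughly unchanged.
        reference = sum(batch_means.values()) / len(batch_means)
        offsets = {b: m - reference for b, m in batch_means.items()}
        corrected.append(
            [value - offsets[label] for value, label in zip(row, batches)]
        )
    return corrected


# Toy example (made up): one gene, two batches with a shift of 5 between them.
toy_expression = [[1.0, 2.0, 6.0, 7.0]]
toy_batches = ["a", "a", "b", "b"]
toy_corrected = remove_additive_batch_effect(toy_expression, toy_batches)
```

After correction the two batches share the same mean while within-batch differences are kept. If a biological condition coincides with a batch, this subtraction removes that biology as well, which is one way to see why correction can erase desired signal.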

Strain variation
Microbial strain variation further hinders integration across experiments. Strain variation refers to genomic variation that occurs at the sub-species level and can take the form of single nucleotide variants (SNVs) or indels of different sizes, distinct complements of accessory genes, and genomic rearrangements [111]. While strain variation is a critical component of understanding a species' ultimate phenotypic variation, each form of variation causes distinct challenges for integrating expression data across strain types. For example, SNVs decrease the average nucleotide identity between the reference sequence used for read quantification and the sample, which can decrease mapping rates nonuniformly across samples [112]. Similarly, the reference sequence may not contain the same set of genes as is present in the sample. This is because many microbial species have a large number of accessory genes, genes which are not universal within that species. For example, accessory genes comprise 20 % of the genome for some staphylococci [113][114][115]. When the reference sequence does not contain the same genes as are present in a sample, this can lead to decreased mapping rates and unobserved gene expression [116]. Lastly, genomic rearrangements or insertions may cause difficulties for counting spanning reads that are present in a sample but not represented in a reference [117]. However, integrating strain variation is important not only to understand within-species phenotypic diversity, but also because accessory genes can modify function of the core genome [118]. Even given these challenges, different approaches have been developed to take advantage of publicly available microbial expression data sets in the face of strain variation. For example, P. aeruginosa has five major lineages detected upon genome analyses of over a thousand strains [119] with two major clades that many strains belong to including the widely studied strains PAO1 and PA14 [120]. 
Strains PAO1 and PA14 contain different sets of accessory genes. One common solution is to only consider core genes since they are shared across strain type [121][122][123][124][125]. In order to include accessory genes, separate compendia can be generated so that major strain types (PAO1 and PA14) are separated but there are PAO1-specific genes within the PAO1 compendium [49,126].
Most compendia are comprised of a single strain of microorganism (Table 1). This can be achieved by relying on the metadata associated with experiments available in the data repository or using information provided in publications to collect experiments from a single strain. However, metadata are notoriously incompletely recorded [127] and difficult to harmonize across studies [128], which may lead to inappropriate inclusion or exclusion of samples in a compendium. Notably, less than half of the publicly available microbial RNA-seq data has been submitted to GEO, ArrayExpress, or Expression Atlas. These three platforms provide detailed and standardized metadata that can be accessed programmatically and easily used in high-throughput computational analyses [129]. An alternative approach is to verify the strain annotation using taxonomy assignments provided in the SRA Run Browser analysis tab, or to perform assignment using a tool like sourmash gather, which selects the minimum set of reference genomes in a database necessary to cover the reads in a sample [130].
Alternatively, a pangenome could be used as a reference so that core genes are collapsed across strain types while accessory genes are included in the analysis [116]. Using these pangenomes as a reference balances computational cost and fidelity to sample genomes, and can take advantage of databases designed to address similar problems for metagenomic sample processing [131]. This approach was pioneered for the analysis of S. aureus strains directly from metatranscriptomes, as no reference genome was available with which to perform read quantification. This approach may be successful for building species-wide compendia but needs further research. Indeed, one substantial drawback would be the negation of spanning reads, as pangenomes are typically built from genes and not operons. The increasing use of metatranscriptomics to contextualize a species' function presents computational challenges but also opportunities to identify unique transcriptional signatures in their native and highly complex environment, such as Haemophilus influenzae during viral infection, or S. aureus' host defense response in the nares [132,133].
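Selecting a minimum set of references that covers a sample can be illustrated with a toy greedy set cover over k-mers. This is a conceptual sketch only: it compares exact k-mer sets, whereas real tools work on compressed sketches of much larger data, and it does not reproduce the sourmash implementation. The sequences, strain names, and function names are made up.

```python
def kmers(sequence, k):
    """Return the set of all k-length substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def greedy_gather(sample_kmers, reference_kmers):
    """Greedily pick references covering the most not-yet-covered sample k-mers.

    sample_kmers: set of k-mers observed in the sample's reads.
    reference_kmers: dict reference name -> set of k-mers in that genome.
    Returns a list of (reference name, newly covered k-mer count) in pick order.
    """
    remaining = set(sample_kmers)
    picks = []
    while remaining and reference_kmers:
        best = max(
            reference_kmers, key=lambda name: len(reference_kmers[name] & remaining)
        )
        gained = len(reference_kmers[best] & remaining)
        if gained == 0:
            break  # the rest of the sample matches nothing in the database
        picks.append((best, gained))
        remaining -= reference_kmers[best]
    return picks


# Hypothetical mini "genomes"; strain_X and strain_Y share their first half.
K = 4
database = {
    "strain_X": kmers("ACGTACGGTCA", K),
    "strain_Y": kmers("ACGTACGCCTA", K),
    "unrelated": kmers("TTTTTTTTTTT", K),
}
sample = kmers("ACGTACGGTCA", K)  # reads drawn from strain_X
picks = greedy_gather(sample, database)
```

Because strain_X explains every k-mer in the sample, it is the only reference selected, even though strain_Y shares half of its k-mers. A sample annotated as one strain but best covered by another would be flagged for review.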
Overall, despite some challenges to constructing compendia, there are existing solutions that make compendia analysis possible and the benefits of the biological discoveries we can glean make it worth it. Additionally, it is worth noting that a recent analysis of a normalized and quality filtered compendium of over two thousand samples found strong gene expression correlations between coregulated genes, even without batch correction, for genes present in the reference genome [49].

Future directions
As more transcriptomic data is generated, it is feasible to construct compendia for a wide array of organisms that include measurements of diverse conditions. Consequently, developing and systematizing strategies for how compendia can be analyzed to best learn from the broad array of conditions that they represent is critical. Our review integrates and synthesizes current strategies in this area. Looking ahead, hurdles still exist.
One methodological concern with using compendia, which integrate multiple experiments, is: when do we batch correct and what methods would be the most effective? Batch effects are the systemic sources of variance present in and between individual experiments. These sources of variance, which include noise generated by different experimental designs or different labs running the experiment, can confound the biological patterns that we are interested in detecting. Despite the presence of hundreds of technical sources of variability in compendia, previous compendia analyses have successfully extracted relevant biological patterns [70,80,81,83]. One hint for why these approaches work comes from a study by Lee et al. that found that a consistent biological signal was preserved in compendia when technical noise was independent between experiments. Applying batch correction in this setting was harmful, not helpful. Efforts to correct for technical artifacts in this setting would best focus on those that span multiple experiments, an area that remains relatively under-explored.
Current studies have done a lot of work to integrate bulk transcriptomics data in microbes using dimensionality reduction methods [68][69][70][81][82][83][84]. In the future, we can extend this work to new data types and computational models. One possible avenue to explore would be integrating data from multiple resolutions. Single-cell data provide an opportunity to examine variability between cells; however, technical limitations mean that the earliest single-cell data are likely to focus on the easiest to assay settings. It may require many years before the available single-cell data measure as many conditions as existing bulk assays. Bringing these resolutions together in compendia will require certain foundational work: for example, of the type performed by Doing et al. [49] to assess the mapping, quantification and normalization which remains to be done for microbes in the single-cell context.
Often these strategies are used in an interpretive manner, but emerging approaches may support genome-wide predictions of physiological state. One area of interest is latent space arithmetic, an approach that simulates new samples by using vector arithmetic to manipulate samples in an encoded space, in order to predict the response to perturbations [134,135]. This strategy might reveal the effect of a perturbation in a specific, never-before-tested setting. Perhaps it could inform the design of cell-state-specific targets, providing increased specificity for anti-microbial agents by targeting microbes in more virulent states.
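Latent space arithmetic can be sketched as follows: estimate a perturbation offset from encoded control and perturbed samples, add it to an encoded sample from an untested setting, and decode the result. The encoder and decoder here are hypothetical placeholders standing in for a trained model such as an autoencoder; all function names and values are assumptions for illustration and do not reproduce the methods of [134,135].

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(values) / n for values in zip(*vectors)]


def predict_perturbed(baseline, controls, perturbed, encode, decode):
    """Predict a perturbed profile for `baseline` by shifting it in latent space.

    The shift is the difference between the mean latent vectors of perturbed
    and control samples taken from other experiments.
    """
    z_control = mean_vector([encode(x) for x in controls])
    z_perturbed = mean_vector([encode(x) for x in perturbed])
    offset = [p - c for p, c in zip(z_perturbed, z_control)]
    z_new = [z + d for z, d in zip(encode(baseline), offset)]
    return decode(z_new)


# Placeholder encoder/decoder (hypothetical). The two latent dimensions are
# the mean and the difference of two gene values; decode inverts encode.
def toy_encode(x):
    return [(x[0] + x[1]) / 2, x[0] - x[1]]


def toy_decode(z):
    return [z[0] + z[1] / 2, z[0] - z[1] / 2]


controls = [[1.0, 1.0], [2.0, 2.0]]
perturbed = [[3.0, 1.0], [4.0, 2.0]]  # the perturbation raises gene 1 by 2 units
prediction = predict_perturbed(
    [5.0, 5.0], controls, perturbed, toy_encode, toy_decode
)
```

With this linear toy model the predicted profile for the new baseline shows the same two-unit increase in the first gene. With a nonlinear trained model, the decoded prediction would instead depend on where the baseline sits in the latent space, and its accuracy would need to be validated experimentally.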
All compendia that we examined focused on one genome or transcriptome at a time. The future might include pan-genome approaches [116] to identify strain specific genes that induce novel transcriptional effects in different contexts. The methods and work that we discuss present the potential for a major conceptual shift in microbiology: instead of examining transcriptional profiles from an individual experiment in isolation, investigators studying compendia analyze patterns across experiments to reveal novel relationships that further our understanding of systems-level biology. Work to date has focused on a relatively narrow slice of what is possible, both from the point of view of data types and analytical approaches, and it has already been a fruitful strategy to identify and understand novel biological mechanisms.

Discussion
With advancements in high throughput sequencing technology more transcriptome data has become available, presenting opportunities for integration of diverse experiments into compendia. Recent successes of computational methods, especially unsupervised machine learning approaches, have demonstrated that biologically meaningful patterns can be extracted from microbial compendia. Given these recent advances, as well as tools developed in the analysis of human expression compendia, we anticipate development in the computational tool space will continue to drive biological discovery from microbial compendia.
While computational approaches for using heterogeneous compendia have been around for approximately 15 years [79], there remains work to be done to evaluate the computational methods that are most suitable for capturing the transcriptional patterns in compendia. Given the success to date of unsupervised learning methods [68][69][70][81,83,84,88], and the work that has been done in this space in human expression compendia [136,137], we anticipate that future development and evaluation of these methods will prove useful in the analysis of microbial expression compendia. A comprehensive analysis using human compendia showed that different models and model architectures captured different pathways, revealing that the use of multiple analysis methods led to more complete biological representations [136]. Similarly, there has been some assessment of microbial compendia examining pathway representation using different forms of dimensionality reduction methods [81,83] and expression changes captured using variational autoencoders [108]. However, a comprehensive evaluation equivalent to that undertaken in human compendia is needed to assess the information captured in microbial compendia: what types of signals are captured when the model architecture, regularization, penalty functions, and connectivity between layers are varied? This information will determine what model, or range of models, is appropriate for downstream analyses. As new feature extraction models continue to be developed to improve the information captured by and the interpretability of these models, such as through the incorporation of prior information [138], such assessment becomes important to help guide researchers on the computational strategy they use.
Microbial gene expression compendia have proven to be a fruitful resource for studying systems-level changes and have been leveraged to infer TRNs [66], make predictions about phenotypes [58], and reveal coordinated gene sets [70,80,81,83]. Furthermore, compendia have been shown to improve the analysis of individual experiments [88] and to reveal specific genomic patterns [126]. The advancements in computational tools and webtools, which have made the information in some existing compendia easily accessible, are opening the door to new avenues of research, situating the study of transcription in a global context.

Declaration of Competing Interest
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Casey S. Greene is a consultant for Arcadia Science, which aims to use non-traditional model organisms to make discoveries and develop new technologies.