Evaluating bacterial and functional diversity of human gut microbiota by complementary metagenomics and metatranscriptomics

It is well accepted that dysbiosis of microbiota is associated with disease; however, the biological mechanisms that promote susceptibility or resilience to disease remain elusive. One of the major limitations of previous microbiome studies has been the lack of complementary metatranscriptomic (functional) data to complement the interpretation of metagenomics (bacterial abundance). The purpose of the study was twofold, first to evaluate the bacterial diversity and differential gene expression of gut microbiota using complementary shotgun metagenomics (MG) and metatranscriptomics (MT) from same fecal sample. Second, to compare sequence data using different Illumina platforms and with different sequencing parameters as new sequencers are introduced and determine if the data are comparable on different platforms. In this study, we perform ultra-deep metatranscriptomic shotgun sequencing for a sample that we previously analyzed with metagenomics shotgun sequencing. We validated the sequencing and analysis methods using different Illumina platform, and with different sequencing and analysis parameters. Our results suggest that use of different Illumina platform did not lead to detectable bias in the sequencing data. The analysis of the sample using MG and MT approach shows that some species genes are more highly represented in the MT than in the MG, indicating that some species are highly metabolically active. Our analysis also shows that ~52% of the genes in the metagenome are in the metatranscriptome, and therefore are robustly expressed. The functions of the low and rare abundance bacterial species remain poorly understood. Our observations indicate that among the low abundant species analyzed in this study some were found to be more metabolically active compared to others and can contribute distinct profiles of biological functions that may modulate the host-microbiota and bacteria-bacteria interactions.


INTRODUCTION
The human microbiota represents a complex community of numerous and diverse microbes that is linked 27 with our development, metabolism, physiology, health, and is considered functionally comparable to an 28 introducing new DNA sequencers with versatile sequencing parameters. This has also complicated the 1 comparison of data within and among the samples. Thus, there is a need to compare the sequencing data 2 from same samples using different platforms. Many previous studies employed targeted amplicon 3 sequencing of the conserved prokaryotic 16S ribosomal RNA (16S rRNA) gene (Human Microbiome Project 4 2012, Huse et al 2012, Stulberg et al 2016). This method identifies operational taxonomic units (OTUs) and 5 are correlated with bacterial taxa; however, assignment of taxa defined by OTUs is commonly limited to the 6 genus level due to low accuracy at the species level. In contrast, metagenomics shotgun sequencing 7 (MGS), which is employed in our study, can determine taxonomic annotations at the species level. 8 Although the association of multiple diseases with dysbiosis of the microbiome has been 9 established, the elucidation of the underlying biologic mechanisms that promote pathological phenotypes 10 has been elusive in most cases. A major limitation of both targeted amplicon and metagenome shotgun 11 sequencing is that bacterial functions are predicted based on the genome sequence of the associated taxa. 12 However, it is well established that there is differential bacterial gene expression at the transcriptional level 13 in response to environmental and dietary exposures. For example, it has been reported that there is a set 14 of constitutively expressed core genes that mediate core microbial functions as well as a highly regulated 15 subset of genes that respond to unique environmental influences (Booijink et al 2010, Ursell & Knight 2013. 16 In addition, some bacteria may exist in an inert state or spore form and thus not contribute to the biological 17 response (Franzosa et al 2014). Thus, an analysis of bacterial gene expression with metatranscriptomics 18 approach could provide additional insight into the biological functions of specific microbiomes. 19 The gut microbiota is composed of highly abundant few species and less abundant many rare 20 bacterial species, thus to understand the complex functions of the microbiota it is essential to understand 21 the functions of both the high-and low-abundant bacterial species. Analyses of MG and MT data are often 22 challenged by the sequencing depth, parameters, and sequencing platforms, which limits the power of 23 functional classification and abundance estimation, this in turn hampers the downstream data analyses of 24 differentially expressed genes. The unique feature of our study is that we are comparing the sequencing 25 reads at different depths, platform, read length, read and contig based comparison for MG and MT for the 26 same sample. To develop a comprehensive understanding of the ecological functions of a microbiome, it 27 is essential to determine not only the metatranscriptome, but also to ascertain the functional contributions 28 of both the abundant and the rare species in a microbiome. To investigate these questions, we analyzed 29 both the metagenome and the metatranscriptome using shotgun sequencing which can determine the 30 abundance of gene transcripts relative to the abundance of the genome. This allowed us to identify both 31 over-and under-expressed transcripts. In this study, we identified biological functions in both rare and 32 abundant bacterial species using metagenomic and metatranscriptomic methods optimized and validated 33 in our laboratory.  The study was approved by the Institutional Review Board   2   of the University of Illinois at Chicago, and the experimental methods were performed in accordance with   3   the approved guidelines. A 33 year, male subject without known medical conditions provided the signed   4 informed consent and self-collected stool in a EasySampler Stool Collection kit (Alpco Diagnostics). The 5 fecal sample was immediately aliquoted into sterile 1.5 ml Eppendorf safe-lock tubes and stored at -80ºC 6 till further DNA and RNA isolation was carried out. 7 8 RNA isolation from fecal sample and mRNA enrichment: The objective of the study was to perform 9 matched metagenome and metatranscriptome studies of the same fecal sample. We investigated the same 10 fecal sample we previously analyzed by metagenomics sequencing. Total RNA was isolated using the 11 PowerMicrobiome RNA Isolation Kit (Catalog # 26000-50, MO BIO Laboratories, Inc) from a fecal sample. 12 For efficient lysis of the microbes in the sample, 200 µL of Phenol/Chloroform/Isoamyl alcohol (25:24:1) 13 (Catalogue #327115000, Acros Organics) was added to the reagents provided with the kit. The contents 14 were vortexed for 1-2 min with a table top vortexer and homogenized twice at speed 10 for 5 min with air- 15 cooling using the Bullet Blender Storm Homogenizer (Catalogue # BBY24M, Next Advance Inc). Total RNA 16 was isolated with the manufacturer's recommended procedure including the on-column DNase treatment 17 (to remove the potentially co-isolated DNA). The RNA was eluted with 1×TE, pH 8.0, and stored at -80ºC. 18 The quality and quantity of the DNA was accessed using a spectrophotometer (NanoPhotometer Pearl, 19 Denville Scientific, Inc), agarose gel electrophoresis, fluorometer (Qubit® RNA Broad Range assay, Life 20 Technologies Corporation), and Agilent RNA 6000 Nano Kit on 2100 Bioanalyzer instrument (Agilent 21 Technologies, Inc.). Total RNA was enriched for mRNA by subtractive hybridization using the 22 MICROBExpress™ Bacterial mRNA Enrichment Kit following manufacturers recommended protocol 23 (Ambion, Life Technologies). The mRNA enrichment and rRNA depletion was analyzed using an Agilent 24 RNA 6000 Nano Kit on 2100 Bioanalyzer instrument (Agilent Technologies, Inc.).

26
Fecal metatranscriptome library preparation and shotgun sequencing: The enriched mRNA was 27 mechanically fragmented to a size range of ~200 bp with an ultrasonicator using the adaptive focused 28 acoustics with the following manufacturer recommended protocols (Covaris S220 instrument, Covaris Inc). 29 The fragmentation of mRNA was assessed using Agilent RNA 6000 Pico Kit on 2100 Bioanalyzer  The sequence reads were processed and analyzed using the CLC Genomics workbench version 7.5 9 (Qiagen, Aarhus, Denmark). Raw reads were trimmed to a minimum Phred quality score of 20. Raw reads 10 were filtered by mapping against human reference genome to remove human sequences. The non-human 11 reads were de novo assembled using the CLC assembler using a word size (k-mer) of 50, minimum contig 12 length 200bp, to construct the de bruijn graphs. De novo assembly was used to map reads back to the 13 contigs (mismatch cost 2, insertion cost 3, deletion cost 2, length fraction 0.8, similarity fraction 0.8).
14 Taxonomic and functional annotations of the reads and contigs were obtained using the automated 15 annotation pipeline at MG-RAST web server using the default parameters (best hit classification, maximum 16 e-value 1e-5 cutoff, and minimum 60% identity cutoff) using M5NR and KEGG databases (Meyer et al 2008, 17 Mitra et al 2011). The limma analysis was used to identify species and KEGG functional pathways that were 18 differentially abundant between metagenome (MG) and metatranscriptome (MT) (Praveen et al 2015). 19 Limma uses an empirical Bayes method to test the differential expression based on the fitting of each 20 species/gene to a linear model (Smyth 2004). This provides the rich features for complex experimental 21 designs and overcomes the small sample size problem, in addition to providing enhanced biological

30
Ultra-deep metatranscriptomic shotgun sequencing (MTS) 31 In our previous study of ultra-deep metagenome shotgun sequencing (MGS) we demonstrated effective 32 identification of abundant species (defined as >1% relative abundance) with as few as 500 reads; however, 33 the detection of low abundance or rare species required high numbers of sequence reads. For example, 34 with a total of 163.7 million sequence reads generated by metagenome shotgun sequencing (MGS), the 35 rarefaction curve did not show saturation for the identification of additional species (Ranjan et al 2016) . 36 Based on these data, in the current study of the metatranscriptome we performed ultra-deep MTS sequencing. We performed optimization and validation of our sequencing protocol using multiple 1 sequencing platforms and analytic strategies ( Fig. 1). High quality total RNA was isolated (Supplementary sequencing parameters. In total, we obtained a total of 139.6 million sequence reads by combining the 10 HiSeq and MiSeq sequence data in silico (HS100+MS151+MS301) ( Table 1). 11 12 Comparison of analytic strategies 13 In our previous analysis of ultra-deep MGS data, we observed a substantial increase in the average length 14 of the assembled contigs (904 bp) compared with the average read length 170 bp., and the average N50 15 length of the contigs was 6,262 bp (Ranjan et al 2016). Therefore, we compared the effect of analyzing the 16 reads versus assembled contigs in the metatranscriptome (MT) data. In the MT data, the average contig 17 length was 268 bp which was modestly longer than the average read length of 136 bp (Table 1). The short 18 length of the assembled MT contigs compared to the metagenomic (MG) contigs is likely due to the smaller 19 size of the microbial transcripts compared to the larger size of the genomes. In terms of reproducibility, we 20 did not detect significant differences between the number of reads or assembled contigs among the 12 21 replicate libraries as analyzed by Shapiro-Wilk normality test (data not shown). Thus, the assembly of the 22 contigs generated a modest increase in length compared with average read length of the MT reads. 23 Next, we compared the bacterial taxonomic assignments based on read and contig analyses.  Fig. 5A). We observe that the increase in number of reads resulted in 35 increase of depth of coverage, whereas no significant increase in contig length was detected. In summary, 36 we previously showed that a contig based analysis is more specific for species identification (Ranjan et al 37 2016) in the MGS dataset; however, these data suggest that a read based analysis is more comprehensive 1 for identification of both genera and species in metatranscriptome data. 2 To determine if different numbers of reads were skewing the analyses, we generated datasets that 3 contained an equal number of reads. We randomly sampled 30 million reads from the HiSeq 100 PE, MiSeq 4 151 PE (MS151) and MiSeq 301 PE (MS301) data, and the reads were assembled into contigs. More 5 contigs were generated in MS301 (97,631) compared to HS100 (8,253) and MS151 (42,153), most likely 6 because of a longer sequencing read length. However, there was no substantial increase in the average 7 length of contigs most likely due to the limitation based on transcript length (Supplementary Table 3). We 8 observed a similar abundance profile of bacterial phyla, genera and species as in the complete datasets 9 indicating that differences in read number were not skewing the assignment of taxa in the contig analyses  Fig. 2A). 1245 bacterial species were shared among the MG and MT ( Fig. 2A), representing 16 the metabolically active species, in the sample at this particular time point. In the phylum Firmicutes, 17 Bacteroidetes, Actinobacteria, Proteobacteria, Fusobacteria, and Verrucomicrobia 356, 117, 138, 439, 23, 18 and 6 species were shared, respectively. This accounted for 60% to 92% of the species shared between 19 the MG and MT defined phyla (Fig. 2B). The detection of MG sequences lacking corresponding MT reads 20 suggests unexpressed genes or even dormant bacteria. As expected, very few sequences were unique to 21 the MT, and they were present in extremely low abundance (< 0.001%) presumably because transcripts 22 are not expressed in the absence of the genome, and likely these sequences were not identified in MG 23 because of relatively low abundance (Supplementary Table 6). Most (50%) of the sequences identified in 24 the phylum proteobacteria were closely related to uncultured bacterial sequences. To determine the relative 25 transcriptional activity of individual taxa and individual genes, we compared the relative abundance in the 26 combined MT data (HS100-MS151-MS301) to our previously reported MG data for the same sample 27 (Ranjan et al 2016). In an analysis of the MT at the phyla level, we observed that the abundance of 28 Bacteroidetes transcripts was high, whereas the abundance of transcripts representing Firmicutes, 29 Actinobacteria, Fusobacteria, and Verrucomicrobia was low. This was observed across all the sequencing 30 platforms and read lengths (Fig. 2C). The abundance of the Fusobacteria and Verrucomicrobia was 31 approximately 100-fold lower than the other Phyla (note Y-axis scale).  Supplementary Fig. 10). This implies that the identified functions are similar in either the read or contig 5 based analysis of the MT data with slight variations. 6 We investigated the MG and MT data at the species level. Interestingly, we observed that few of 7 the species (for example, Faecalibacterium prausnitzii, Bacteroides spp., B. thetaiotaomicron, B. vulgatus,   8 B. ovatus among others) had a higher relative representation in MT than MG, indicating that these species 9 are highly transcriptionally active ( Fig. 3 and Supplementary Fig. 11). However, the species B. fragilis did 10 not have increased transcriptional activity as compared to other Bacteroides spp. As shown in a scatter 11 plot, F. prausnitzii, Bacteroides spp., and Alistipes putredinis were highly transcriptionally active at a 12 significant level (log fold difference ≥3, p adj. <0.05) whereas Clostridium saccharolyticum, Eubacterium 13 rectale and Ruminococcus obeum (log fold difference ≥-1, p adj. <0.05) were low in transcriptional activity 14 (Fig. 4A). 15 We compared the abundance of KEGG functions detected in the MT data to the predicted functions 16 in the MG data. The analysis revealed that genes involved in translation, carbohydrate metabolism, and 17 transcription were highly abundant in MT (log2 fold change >3, p< 0.05), compared to low abundance of 18 glycan biosynthesis and metabolism, metabolism of cofactors and vitamins, replication and repair, 19 membrane transport and amino acid metabolism (log2 fold change >-2, p adj. < 0.05) (Fig. 4B). Translation 20 and amino acid metabolism showed the largest differential expression with a fold change of >±5 (p adj. 21 <0.05), respectively. We observed similar patterns at the more specific levels 2, 3 and 4 (Supplementary 22 Fig. 12-15). In this fecal sample, in total we detected 1916 functions at KEGG level 4 assignments in MG, 23 compared to 1067 in MT. The MG and MT data shared 52% (1014) of the total functions, revealing the 24 shared functional genes involved in active physiological functions of the gut microbiota which can be 25 detected in MG and MT in a given time point (Fig. 5). Our analysis indicated that MG and MT overlapping 26 genes are metabolically active genes. Genes which are only detected in the MT are even more 27 metabolically active. On the other hand, if genes were detected only in MG and not in the MT, this may also 28 suggest that genes may be present but not active in a given time.  Fig. 16). We further focused our analysis on Fusobacteria 2 and Verrucomicrobia, as these phyla are present in low abundance (<1% and <0.1% abundance, 3 respectively) and not well characterized in the gut microbiota (Fig. 2C). 4 In phyla -Firmicutes, Bacteroidetes, Actinobacteria, and Proteobacteria, the genes involved in 5 carbohydrate metabolism were abundant, followed by amino acid metabolism and translation. There were 6 no translation and/or transcription functions detected in Fusobacteria and Verrucomicrobia (Supplementary 7 Fig. 17). However, Fusobacteria and Verrucomicrobia contributed towards the expression of specific genes 8 involved in carbohydrate and amino acid metabolism pathways compared to other phyla (Figs. 8 and 9, 9 Supplementary Fig. 18). For example, the genes glgB (1,4-alpha-glucan branching enzyme), pgi (glucose-10 6-phosphate isomerase) involved in starch, and sucrose metabolism and glycolysis/gluconeogenesis were 11 highly expressed by Fusobacteria (Supplementary Fig. 18). Also, the genes involved in oxidative 12 phosphorylation such as atpD (F-type H+-transporting ATPase subunit beta), ppa (inorganic 13 pyrophosphatase) and nuoE (NADH-quinone oxidoreductase subunit E) were also enriched in Fusobacteria 14 (Figs. 6, and Supplementary Fig. 18). On the other hand, the phylum Verrucomicrobia was enriched for 15 genes invloved in alanine, aspartate and glutamate metabolism [gdhA: glutamate dehydrogenase (NADP+), 16 purB: adenylosuccinate lyase], ABC transporters [msmX: maltose/maltodextrin transport system ATP- 17 binding protein] and amino sugar and nucleotide sugar metabolism [npdA: NAD-dependent deacetylase] 18 ( Fig. 7 and Supplementary Fig. 18). These results show the high abundance of transcripts contributed by 19 the rare abundant bacterial species in the community may contribute unique biological functions to the 20 microbiome that have the potential to affect the host physiology. 21 23 The Shannon diversity index for estimating the bacterial diversity in MG (5.4 ± 0.1) and MT (4.9 ± 24 0.1) was significantly different (p<0.05), however no significant difference was observed in species 25 evenness (0.7 ± 0.0). Similarly, the index for diversity of functional genes in MG (6.7 ± 0.0) and MT (6.0 ± 26 0.3) was significantly different (p<0.05), also a significant difference was observed in functional evenness 27 in MG (0.89 ± 0.01) and MT (0.93 ± 0.01). The Shannon diversity index analysis at both taxonomic and 28 functional level indicated that the MG was more diverse than the MT, most likely due to unexpressed genes 29 or dormant bacteria ( Supplementary Fig. 19).

31
Mapping the genomic and transcriptomic KEGG pathways 32 We mapped the predicted (MG) and expressed (MT) functions onto pathways using KEGG Mapper suite. 33 Almost all (more than 99%) of the functions identified by MT were also identified in MG (Fig. 8 and   34 Supplementary Fig. 20). However, some functions were identified only in the MG dataset suggesting that 35 not all of the predicted functions in the metagenome are expressed, which supports the notion that the 36 metagenome may not be an accurate proxy of microbiota function. The genes are in the (meta)genomes; 37 they could be expressed under different conditions; therefore, they define the functional potential of the 1 organisms. Linear regression analysis was applied to the MT and MG data examined from the perspective 2 of species and function. The linear regression analysis at the species level was correlated among the MG 3 and MT and 58% of the variation in the MT can be explained by the species composition of the MG 4 (Spearman's r = 0.83; r 2 =0.58=58%) (Fig. 9A). A similar correlation was observed at functional level 4 in 5 MG and MT (Spearman's r = 0.76; r 2 =0.53=53%) (Fig. 9B). In other words, more than 50% of the variation 6 in the microbial community MT can be explained by MG composition at species level, or conversely, 7 approximately 50% of transcriptional activity is regulated and presumably dependent on host or 8 environmental factors.  First, our study shows that the different Illumina platforms do not contribute detectable bias in our 31 analyses (Fig. 2). To validate the technical reproducibility of the sequencing and data analysis methods, 32 we generated 12 replicates of a single sample that generated a similar number of reads, total bases and 33 assembled contigs (Table 1). In addition, our analysis identified a reproducible number of both phyla and 34 species ( Supplementary Figs. 2 and 4, respectively). Furthermore, the functional analysis identified similar 35 abundance of KEGG annotations at all functional levels from 1-4 (see Supplementary Figs. 6-9). Our 36 investigation of the effect of contig assembly showed that assembly only modestly increased length, presumably due to the short length of the mRNA transcripts. This similar observation has also been reported 1 in a forest floor community metatranscriptomics (Hesse et al 2015). This suggests that emerging 2 technologies that produce longer read lengths, particularly in view of their increased error rates, although 3 useful for metagenomics studies, may not be preferable for metatranscriptomic studies. 4 Our investigation of the effects of contig assembly showed that the relative abundance of some 5 taxa was modified by assembly. For example, analysis of assembled reads resulted in greater abundance 6 of Bacteroidetes and lesser abundance of Firmicutes, Actinobacteria and Proteobacteria (Supplementary 7 Fig. 5). Similar differences were also observed at the level of genus and species. Interestingly, we also 8 observed similar changes in relative abundance of Bacteroidetes and Firmicutes in our previous analysis 9 of taxa assignment in our metagenomics data (Ranjan et al 2016). Our results also show that the assembly 10 of reads into contigs can decrease the detection of taxa. Overall, the results suggest that reads are the 11 most comprehensive, and contigs are more specific, method to annotate taxa. 12 Most previous microbiota studies have not been performed with matched metagenome and 13 metatranscriptome datasets of the same sample, thus there is huge knowledge gap in understanding the 14 role of gene expression of the microbiota in human health and diseases. Our comparison of the predicted 15 functions in the metagenome in this sample, with the expressed functions in metatranscriptome, identified 16 more than 1000 functions, which included carbohydrate metabolism, nucleotide metabolism, amino acid 17 metabolism, translation etc., (Fig. 5). The diversity analysis also suggest that the actual metabolically active 18 bacterial species and functions are in fact less diverse compared to predicted metagenome diversity (both 19 taxonomic and functional) (Supplementary Fig. 19). 20 It is well established that the diverse community of bacteria in a microbiome is composed of a small

SUPPLEMENTARY FIGURE AND TABLE LEGENDS
Supplementary Figure 1. Fecal metatranscriptome library preparation. High quality total RNA from a fecal sample was isolated and analyzed by agarose gel electrophoresis and Bioanalyzer (A); Total RNA was enriched for mRNA by depleting the rRNA by subtractive hybridization method (B), the enriched mRNA was fragmented by Covaris (C); A library was prepared using Illumina compatible adaptor (D); In addition, 12 libraries from the same mRNA were prepared for multiplexing (E). The quality of RNA, mRNA and the libraries was analyzed on 2100 Bioanalyzer Instrument.