Metatranscriptomics as a tool to identify fungal species and subspecies in mixed communities - a proof of concept under laboratory conditions.

High-throughput sequencing (HTS) enables the generation of large amounts of genome sequence data at a reasonable cost. Organisms in mixed microbial communities can now be sequenced and identified in a culture-independent way, usually using amplicon sequencing of a DNA barcode. Bulk RNA-seq (metatranscriptomics) has several advantages over DNA-based amplicon sequencing: it is less susceptible to amplification biases, it captures only living organisms, and it enables a larger set of genes to be used for taxonomic identification. Using a model mock community comprising 17 fungal isolates, we evaluated whether metatranscriptomics can accurately identify fungal species and subspecies in mixed communities. Overall, 72.9% of the RNA transcripts were classified, from which the vast majority (99.5%) were correctly identified at the species level. Of the 15 species sequenced, 13 were retrieved and identified correctly. We also detected strain-level variation within the Cryptococcus species complexes: 99.3% of transcripts assigned to Cryptococcus were classified as one of the four strains used in the mock community. Laboratory contaminants and/or misclassifications were diverse, but represented only 0.44% of the transcripts. Hence, these results show that it is possible to obtain accurate species- and strain-level fungal identification from metatranscriptome data as long as taxa identified at low abundance are discarded to avoid false-positives derived from contamination or misclassifications. This study highlights both the advantages and current challenges in the application of metatranscriptomics in clinical mycology and ecological studies.


INTRODUCTION
Microscopic fungal species, such as yeasts and some filamentous fungi, do not live in isolation, and are most commonly found within mixed microbial communities containing multiple species and strains. Assessing the diversity of fungi in mixed communities is important because different fungal taxa may exhibit distinctive phenotypes, and consequently may have different pathogenicity or functional roles. For example, in the rhizosphere, changes in fungal community composition have been associated with shifts in nutrient cycling (Hannula et al. 2017). Humans also harbor, or are exposed to, a diverse fungal community that provides a source of opportunistic pathogens (Bandara et al. 2019;Huffnagle and Noverr 2013;Seed 2014). Strainlevel fungal diversity may influence therapeutic responsiveness and needs further investigation. Although it is typically assumed that invasive fungal infections are caused by a single strain, multiple Candida strains have been observed during the course of a single episode of infection (Soll et al. 1988). Furthermore, nearly 20% of patients with cryptococcosis are infected by multiple strains with different phenotypes and virulence traits (Desnos-Ollivier et al. 2015;Desnos-Ollivier et al. 2010).
Despite its importance, fungal taxonomic diversity is poorly characterized. From over two million fungal species estimated to exist, less than 8% have been described (Hawksworth and Lucking 2017). Even wellknown fungal species are often overlooked during routine diagnostic procedures, surveillance and biodiversity surveys (Brown et al. 2012;Enaud et al. 2018;Yahr et al. 2016). This is in part due to challenges in the detection and classification of these organisms, especially microscopic and cryptic species, such as the etiologic agents of cryptococcosis. Currently, two species complexes are recognized: Cryptococcus neoformans and Cryptococcus gattii (Kwon-Chung et al. 2002). Seven major haploid lineages are found within these two species complexes (C. neoformans species complex: VNI, VNII, VNIV, and C. gattii species complex: VGI, VGII, VGIII and VGIV) and their recognition as distinct biological species has been debated (Hagen et al. 2015;Kwon-Chung et al. 2017;Ngamskulrungroj et al. 2009). Being able to distinguish closely-related lineages is important because their phenotype, virulence and ecophysiology can vary substantially. For example, the closely related laboratory strains JEC21 and B-3501 of C. neoformans var. neoformans (VNIV) are 99.5% identical at the genomic sequence level but differ substantially in thermotolerance and virulence (Loftus et al. 2005). Likewise, different virulence and antifungal tolerance traits were observed within lineages of C. gattii VGIII (Firacative et al. 2016).
The introduction of high-throughput sequencing (HTS) marked a new era in mycological research, where the vast diversity of fungi can be studied without the need for culturing (Nilsson et al. 2019). To date, amplicon sequencing of genetic markers (metabarcoding) has been the most popular HTS approach to identify fungal species in mixed communities. Despite its indisputable utility, metabarcoding surveys are affected by PCR amplification biases, and even abundant species can go undetected due to primer mismatch (Marcelino and Verbruggen 2016;Nilsson et al. 2019;Tedersoo et al. 2015). In addition, DNA fragments from dead organisms inflate biodiversity estimates in metabarcoding surveys (Carini et al. 2016). Stool samples, for instance, naturally contain food-derived DNA, which cannot be distinguished from the genetic material of the resident gut microbiota when using DNA-based methods. These challenges can be circumvented by directly sequencing actively transcribed genes via RNA-Seq (Wang et al. 2009), hence avoiding the amplification step, and obtaining an unbiased characterization of the living microbial community. Metatranscriptomics has been used to identify RNA viruses in a range of animal samples (Shi et al. 2016;Shi et al. 2017;Wille et al. 2018;Zhang et al. 2018) and to characterize the functional profile of microbial communities (Bashiardes et al. 2016;Kuske et al. 2015). Studies applying metatranscriptomics to mycorrhizal communities have provided valuable insights into the functional roles of fungi in these symbiotic systems (Gonzalez et al. 2018;Liao et al. 2014). However, links between functional and species-level taxonomy have been infrequently sought, likely because fungal identification from metatranscriptome data is considered unreliable below the phylum level (Nilsson et al. 2019). It is therefore currently unknown whether it is feasible to use metatranscriptomics to identify fungi at the species and subspecies level within a mixed community. Challenges can be expected at both the molecular (e.g. RNA isolation and sequencing) and computational levels (e.g. wrong taxonomic assignments and lack of reference data for species identification). This information is fundamental to the investigation of the potential of metatranscriptomics in diagnostics and ecological studies.
Herein, we evaluated the utility of metatranscriptomics as a tool for the simultaneous identification of fungal species using a defined mock community created under laboratory conditions. We focused on the molecular aspects of metatranscriptome sequencing, and therefore created a proof-of-concept data set containing 15 species of Ascomycetes and Basidiomycetes for which draft or complete genome sequences were available. In addition, we investigated whether strains belonging to sister species, such as the C. neoformans and C. gattii species complexes, could be identified correctly using metatranscriptomics. Rather than focusing on genetic markers, we sought to classify fungal species using the information from all expressed genes, using the totality of NCBI's nucleotide collection as a reference database. This study paves the way to apply state-of-the art techniques in fungal biodiversity surveys and clinical diagnostics.

METHODS
A defined fungal community was constructed from 17 isolates, including 15 fungal species and three strains of the C. neoformans species complex in addition to one strain of C. gattii (Table 1). Fungal strains were obtained from the Westmead Mycology Culture Collection, and were originally derived from clinical isolates, environmental strains or laboratory lineages (Additional file 1: Table S1). As our goal was to obtain a proof-of-concept of the molecular aspects of metatranscriptome sequencing, we only used fungal species containing complete or draft genome sequences in the NCBI RefSeq database (Pruitt et al. 2007). This avoids analytic complexities due to the lack of reference sequence data (see Discussion). Strains were cultured on Sabouraud agar at 27°C for 72 h. A sweep of colonies was made with a disposable inoculating loop and dispersed in Phosphate-Buffered Saline buffer (PBS). Fungal cells were quantified in a Neubauer chamber and their concentration adjusted such that the fungal mixture contained equal concentrations of each species (10 8 cells/mL). RNA was isolated with the RNeasy Plus kit (Qiagen), following the manufacture's protocol, with an initial freeze-thaw step in liquid nitrogen to disrupt fungal cells. The quantity and quality of the RNA extract was determined with the Nanodrop Spectrophotometer (Thermo Scientific) and the Agilent 2200 TapeStation. As some residual DNA was detected, the RNA extract was further treated with DNase I (Qiagen). Ribosomal depletion (Ribo-Zero Gold technology), library preparation and sequencing (Illumina HiSeq HT, 125 bp Paired End) were performed by the Australian Genomics Research Facility. The raw sequence data were deposited in the NCBI Short Read Archive (accession PRJNA521097).
Sequence reads containing more than five ambiguous positions or with average quality scores ≤25 were filtered from the data set using prinseq-lite v.0.20.4 (Schmieder and Edwards 2011) with the options -ns_max_n 5 -min_ qual_mean 25 -out_format 3. Assembly of sequence reads into contigs was performed with Trinity v.2.5.1 (Grabherr et al. 2011). Contigs were mapped to the NCBI nucleotide collection using KMA v1.1.7 (Clausen et al. 2018), a novel approach that has proven to be more accurate than other mapping software. Prior to mapping, NCBI's taxonomic identifier codes (taxids) were appended to each sequence record in the nucleotide collection, and the reference database was indexed using KMA's options -NI -Sparse TG. Contigs were then mapped to the indexed database with the options -mem_mode -and -apm f. Matches to the reference database with low support (i.e. coverage < 20 and depth < 0.05) were excluded from the analyses. The species-level taxonomic classifications were based on NCBI's taxonomy identifiers (taxids) to minimize the issue of changing species nomenclature (Federhen 2012). Species names were manually checked in MycoBank (Robert et al. 2013), and the only two discordances observed between the NCBI taxonomy database and MycoBank were Candida pseudohaemulonis (which is referred to as C. pseudohaemulonii in MycoBank), and Candida glycerinogenes, which was not found in MycoBank. The glycerolproducing C. glycerinogenes has been described elsewhere using a different spelling (C. glycerolgenesisalso not in MycoBank) (Wang et al. 1999), but the C. glycerinogenes spelling used here has been used in subsequent publications concerning this species (e.g. Chen et al. 2008;Ji et al. 2016) and is a recognized species name in the NCBI taxonomy database. Subspecies-level classification within the Cryptococcus neoformans and C. gattii species complexes was made following the ISHAM consensus MLST scheme for the C. neoformans and C. gattii species complexes ) and manually examined.
Abundance was estimated at the level of sequence reads and transcripts. For read-level abundances, sequence reads were mapped to transcripts using Bowtie2 v.2.3.3.1 (Langmead and Salzberg 2012) and quantified in Transcripts Per Million (TPM) with RSEM v.1.2.31 (Li and Dewey 2011), using the Trinity pipeline. For transcript-level abundances, the depth values estimated within KMA were used, which is the total number of nucleotides (in each contig) covering the reference sequence divided by the length of the reference sequence. The number and length of assembled contigs for each taxon is likely a better proxy for species abundance than read-level abundances (which are subject to gene expression), and therefore were used for graphic representation and analyses. For simplicity, we refer as 'abundance' the transcript-level abundance, unless otherwise stated.
It is possible that species with larger and gene-rich genomes express a greater number of transcripts. To test for this potential correlation, genome sizes and the estimated number of proteins were obtained from the Fungal Genome Size Database (Kullman et al. 2005), Loftus et al. (2005), Munoz et al. (2018) and NCBI's Genome database (Additional file 1: Table S1). The correlation coefficients between genome size, number of proteins and abundance of transcripts were estimated using Person's correlation and visualized using the R package ggpubr v.0.2 (Kassambara 2017).

RESULTS
RNA sequencing yielded a total of 26,558,491 paired end reads, of which 98.3% passed quality control. Overall, 277,404 contigs (transcripts) were obtained, from which 202,219 (72.9%) were classified. The majority of the sequence reads (80.2%) mapped to a classified contig. Of the 15 fungal species sequenced, 13 were retrieved and correctly classified at the species level ( Fig. 1, Table 2, Additional file 1: Table S2). The two false-negatives were Debaryomyces hansenii and Schizosaccharomyces pombe; these may have been misclassified as another fungus or were lost due to cell pooling inaccuracy and/or RNA extraction biases. A small proportion of bacterial transcripts (0.03%) and other eukaryotic microbes (0.4%, including 31 fungi that were not present in the mock community) was also observed ( Table 2, Additional file 1: Table S2), which likely represent laboratory contaminants and misclassifications (see Discussion). However, these were present at a consistently lower frequency than true members of the mock community, with the most common -Candida glycerinogenesonly present in 0.08% of the transcripts. Some of the transcripts were assigned to entries in GenBank that do not have a species-level classification (e.g. Candida sp. and Pichia sp.). These assignments were considered misclassifications here, although it is possible that the species in our mock community are the correct species-level identity of these GenBank sequences.
Overall, the commonest species detected was C. neoformans, which was expected as it comprised three strains in the mock community and therefore was three times more abundant than other fungal species. Transcripts belonging to Candida tropicalis and Pichia kudriavzevii (former Candida krusei)were also common (19.2 and 18.8%, respectively), while C. albicans, C. orthopsilosis and C. glabrata (other causes of candidaemia in humans) were detected at lower abundance (2.0-2.9%). There was no relationship between abundance of transcripts and phylogenetic relatedness. Genomes with low GC content can be overrepresented in metagenomic sequencing (Shakya et al. 2013). Conversely, some of the species detected here in high abundance (Cryptococcus neoformans and Clavispora lusitaniae) have a higher GC content than most other fungal species (Dujon 2010), suggesting that GC bias is unlikely to affect our results. No correlation between abundance of transcripts and genome size or estimated number of proteins was observed (p > 0.05, Additional file 2). Molecular type and strain-level variation within the Cryptococcus neoformans and C. gattii species complexes was also detected, with contigs matching to C. gattii VGI WM 276, C. neoformans var. grubii VNI H99 and C. neoformans var. neoformans VNIV strains B-3501A and JEC21 (Fig. 2, Additional file 1: Table S3). A proportion of the transcripts (1.6%) matched with equal probability scores to both strains of C. neoformans var. neoformans (B-3501A and JEC21, Additional file 1: Table S2 and Table  S3). From the transcripts classified as Cryptococcus spp., 99.3% were classified as one of the four Cryptococcus strains (or both B-3501A and JEC21) used in the mock community. It is possible that misclassifications occurred within the strains analyzed. For example, transcripts originally from JEC21 might have been classified as B-3501A and vice versa. As it is not possible to know from which strain the transcripts originated, these possible misclassifications would be undetected.
The total costs for RNA extraction, library preparation and sequencing were AUD $806 (~USD $558). The library preparation was the most expensive process, at AUD $400 per sample.

DISCUSSION
Our metatranscriptomics approach yielded taxonomic identification of fungi from a defined mock community with high success, while false-positives were detected at far lower abundance. This proof-of-concept study therefore indicates that it is possible to obtain accurate species-and strain-level identifications for fungi from metatranscriptome data, as long as taxa identified at low abundance are removed from the analyses to avoid false-positives derived from contamination or misclassifications.
Taxonomic classification at species and strain levels using metabarcoding and metagenomic data has been considered inaccurate (Nilsson et al. 2019;Sczyrba et al. 2017;Yamamoto et al. 2014), raising the question of how our metatranscriptomics approach succeeded in identifying closely-related fungal strains. Metabarcoding relies on a single or multiple genetic marker(s) (Banchi et al. 2018;e.g. McGuire et al. 2013;Schmidt et al. 2013), which does not contain sufficient phylogenetic information to differentiate some closely related fungal lineages (Balasundaram et al. 2015;Nilsson et al. 2008). Metatranscriptomics, on the other hand, potentially yields data on all expressed coding sequences. Classifications derived from metagenomes are likely to be equally accurate as the ones obtained from metatranscriptomes, except that dead organisms might also be sequenced. Additionally, we used a new alignment method (KMA) that is both highly accurate and fast (Clausen et al. 2018), allowing us to map sequences against the complete NCBI nucleotide collection. Metatranscriptomics is also less susceptible to amplification bias, no information about the community members is required a priori, and it only detects functionally active members of a microbial community. These advantages make metatranscriptomics a promising tool in biodiversity surveys, functional assessments of microbial communities, pathogen detection and biosecurity surveillance (e.g. Kuske et al. 2015;Shi et al. 2016;Wille et al. 2018).
We constructed a model community to establish a proof-of-concept for fungal metatranscriptomics, using only organisms for which draft or complete genome sequences were available and with fresh cultures processed under ideal laboratory conditions. Clearly, routine applications of metatranscriptomics have additional hurdles that should be taken into consideration. First, RNA is more unstable than DNA and quickly degrades unless the sample is frozen or embedded in a stabilizing reagent, which can introduce biases in the study (Reck et al. 2015). Second, the community complexity may affect taxonomic identification, as the number of sequence reads per species is inversely proportional to the number of species in the community. This is especially Fig. 1 Relative abundance of transcripts assigned to microbial species in the metatranscriptome of the mock community used in this study. The size of the rectangles represents the relative abundance of different species. See Table 2   problematic in communities containing a diverse and abundant bacterial community, which can dominate the metatranscriptome. Finally, the lack of reference sequences is a major issue, especially when working with poorly characterized systems. Considering that complete genome sequences are available for only a fraction of the over 2 million species estimated to exist (Hawksworth and Lucking 2017), it is likely that the entire NCBI nucleotide collection is the most comprehensive database available to identify fungi in metagenomes and metatranscriptomes. Importantly, it is possible to extract informative genes from metatranscriptome data and subsequently perform phylogenomic analyses to identify rare and novel taxa with more certainty (e.g. Shi et al. 2017;Wille et al. 2018;Zhang et al. 2018). Even though false-positives were present at low abundance, they do pose a challenge in the interpretation of metatranscriptomic and metagenomic data. Falsepositives generally result from spurious classifications and laboratory contaminants, which may be common in laboratory reagents (Salter et al. 2014). Metatranscriptomics is less sensitive to laboratory contamination than DNA-based metagenomics or metabarcoding, as only living microorganisms are sequenced. Nevertheless, contamination can occur at various stages of the library preparation and is routinely observed in RNA-Seq studies Strong et al. 2014). Misclassifications occur because some genome regions are very similar (or identical) across closely-related species, which cannot be differentiated. This might be the case for some false-positives that are closely-related to some of the species used in the mock community, like Candida parapsilosis (a false-positive) which is closely related to Candida orthopsilosis (true positive). Errors in reference databases can also result in misclassifications. Sequences attributed to incorrectly-classified species are not uncommon in GenBank and result in downstream classification errors (Li et al. 2018). It is also not unusual to find bacterial regions misassembled into eukaryotic genomes (e.g. Koutsovoulos et al. 2016), which can result in sequences from common laboratory contaminants being classified as a eukaryote. Filtering out organisms found in low abundance is an option to reduce the incidence of false-positives in downstream analyses. In this study, filtering organisms for which the abundance of transcripts is lower than 0.1% would eliminate false-positives, at the cost of excluding one true-positive from the analyses ( Table 2). The threshold for this abundance-filtering depends on the desired balance between precision and recall. The application of an abundance-filtering step might not be feasible when sequencing depth (per microbial species) is limited. Species present at low abundance will be represented by a small number of transcripts and so are more likely to be misclassified or undetected.
The abundance disparity across species and the fact that we did not retrieve two species from the mock community (D. hansenii and S. pombe) shows that the method and our experimental design have limitations, even when performed under ideal laboratory conditions. The abundance of transcripts could vary according to genome size, number of coding sequences, and gene expression. However, we found no correlation between the  abundance of transcripts and genome size or number of genes (Additional file 2), suggesting that gene expression prevails over gene numbers in defining transcript abundance. Imprecise estimates of cell abundance and RNA extraction biases could also have influenced abundance estimates. It is widely recognized that biases in extracting genetic material from mixed communities affect abundance and diversity estimates (Martin-Laurent et al. 2001), although we are unaware of features of D. hansenii and S. pombe that could have influenced the extraction of their RNA. Metabarcoding studies have suggested that performing DNA extraction in triplicate minimizes biases for bacteria, but it had no effect in fungal communities (Feinstein et al. 2009). To our knowledge, the effect of RNA extraction bias in metatranscriptomics has yet to be studied. It is possible that competition between fungal species influences gene expression, although we believe that this would have a negligible effect given that the samples were processed immediately after pooling the species together. As a single metatranscriptome library was sequenced, it is difficult to estimate the impact of these potential biases, and more insights may be obtained by analyzing communities with a different composition and different levels of complexity. RNA extraction bias depends on the species composition of the sample (for example, whether some species have harder cell walls that may hinder RNA extraction), potential inhibitors found in the study system and sample processing (including human error). Alternatively, as metagenomics surveys are not affected by gene expression, they might be more appropriate for studies where it is important to quantify species abundance. Although fungal species and their genes can be confidently identified, it remains challenging to link some genes with particular species using metatranscriptomics. A large portion of fungal genomes are highly similar among species, making it difficult, if not impossible, to infer which species in the community are expressing which genes. Recently, a method was developed to perform species-level functional profiling of metagenome data (Franzosa et al. 2018). This method, however, relies on a reference database of complete genomes that currently contains few fungal representatives, limiting its application in fungal metagenomics. Contrary to metatranscriptomics, metagenomics yields coding and noncoding sequences, which can facilitate linking genes to species if sequencing depth is large enough to assemble large parts of fungal genomes (e.g. Olm et al. 2019).

CONCLUSIONS
In sum, we show that metatranscriptomics is a viable alternative approach to identify fungal species and subspecies in mixed samples from a model fungal community. The major advantages of metatranscriptomics over other HTS technologies include the selective sequencing of living organisms and the ability to detect a wide range of microorganisms in one step, which has multiple applications in biological research, surveillance and diagnosis. There is an increasing literature reporting that virulence and antimicrobial tolerance traits vary within species (e.g. Firacative et al. 2016;Rizzetto et al. 2013) and that multiple strains or species can co-infect a host (e.g. Desnos-Ollivier et al. 2010;Soll et al. 1988). The high discriminatory power obtained for closely-related lineages of Cryptococcus provides a good example of where metatranscriptomics would be valuable in precision medicine, where therapy practices are defined according to strain-specific pathogenicity and drug susceptibility traits. However, it must also be acknowledged that metatranscriptomics has limitations that are common to high-throughput sequencing methods, as it is susceptible to DNA/RNA extraction biases, contamination and misclassifications. These limitations can be minimized if appropriate controls are in place (e.g. abundance filtering before statistical analyses). Besides its application to identify well-known fungal species, metatranscriptomics can help to identify novel functional roles of fungi (e.g. Gonzalez et al. 2018;Liao et al. 2014) and novel species when used within a phylogenomic framework.

Additional files
Additional file 1: Table S1. Metadata of the strains used to construct the mock fungal community, including genome size and protein count. Table S2. Mapping results. Table S3