Comparative genomics of Coniophora olivacea reveals different patterns of genome expansion in Boletales

Coniophora olivacea is a basidiomycete fungus belonging to the order Boletales that produces brown-rot decay on dead wood of conifers. The Boletales order comprises a diverse group of species including saprotrophs and ectomycorrhizal fungi that show important differences in genome size. In this study we report the 39.07-megabase (Mb) draft genome assembly and annotation of C. olivacea. A total of 14,928 genes were annotated, including 470 putatively secreted proteins enriched in functions involved in lignocellulose degradation. Using similarity clustering and protein structure prediction we identified a new family of 10 putative lytic polysaccharide monooxygenase genes. This family is conserved in basidiomycota and lacks of previous functional annotation. Further analyses showed that C. olivacea has a low repetitive genome, with 2.91% of repeats and a restrained content of transposable elements (TEs). The annotation of TEs in four related Boletales yielded important differences in repeat content, ranging from 3.94 to 41.17% of the genome size. The distribution of insertion ages of LTR-retrotransposons showed that differential expansions of these repetitive elements have shaped the genome architecture of Boletales over the last 60 million years. Coniophora olivacea has a small, compact genome that shows macrosynteny with Coniophora puteana. The functional annotation revealed the enzymatic signature of a canonical brown-rot. The annotation and comparative genomics of transposable elements uncovered their particular contraction in the Coniophora genera, highlighting their role in the differential genome expansions found in Boletales species.


Background
Coniophora olivacea is a basidiomycete fungus belonging to the order Boletales. C. olivacea produces brown-rot decay on dead wood of conifers (softwood) and, less frequently, on hardwood species. In addition, C. olivacea also damages wood buildings or construction materials. The genome sequence of its sister species C. puteana was made public in 2012 [1] and contributed to the understanding of genomic differences between brown and white-rot fungi. White-rot fungi are efficient lignin degraders, whereas brown-rot fungi attack cell wall carbohydrates leaving lignin undigested. The main responsible of this behavior are lignin-degrader peroxidases, which are abundant in white-rot species and particularly contracted in brown-rot and mycorrhizal fungi [2]. The Boletales order comprises a diverse group of species including saprotrophs and ectomycorrhizal species such as Suillus sp. or Pisolithus sp. During the last 6 years, up to 12 Boletales genomes have been sequenced and annotated [1,3,4]. Information that emerged from these studies showed important differences in genomic characteristics between the species belonging to this group, whose predicted common ancestor was dated 84 million years ago. Evolution from this boletales ancestor (supposed to be a brown-rot saprotroph) lead to the diversification and the appearance of ectomycorrhizae, which shows a particular contraction of the number of plant cell wall-degrading enzymes coding genes (PCWDE) [4,5]. In addition, Boletales show important differences in their genome size and gene content. For example, the smallest assembled Boletales genome spans 38.2 Mb and has 13,270 annotated genes (Hydnomerulius pinastri), but the largest (Pisolithus tinctorius) spans 71.0 Mb and has 22,701 genes [4]. Previous studies in saprophytic basidiomycetes have shown that species with higher genome sizes tend to have more transposable elements [6]. Also, it has been described that species associated with plants (pathogenic and symbiotic) have genomes with expanded TE families [1,7], although this trend varies between the three basidiomycete phyla [8]. In this paper, we describe the draft genome sequence and annotation of the brown-rot C. olivacea, and we compare it with the genomes of C. puteana as well as with that of three other Boletales showing important differences in genome sizes (Serpula lacrymans, Pisolithus tinctorius and Hydnomerulius pinastri). The results show that C. olivacea displays enzymatic machinery characteristic of brown-rot fungi encoded in a compact genome, carrying a small number of repetitive sequences. The comparative analysis with other Boletales shows that both ancient and modern LTR-retrotransposon amplification events have greatly contributed to the genome expansion along the evolution of Boletales.

Fungal strains and culture conditions
Coniophora olivacea MUCL 20566 was obtained from the Spanish Type Culture Collection and was cultured in SMY submerged fermentation (10 g of sucrose, 10 g of malt extract and 4 g of yeast extract per litre).

Nucleic acid extraction
Mycelia were harvested, frozen, and ground in a sterile mortar in the presence of liquid nitrogen. High molecular weight DNA was extracted using the phenol-chloroform protocol described previously [9]. DNA sample concentrations were measured using a Qubit® 2.0 Fluorometer (Life Technologies, Madrid, Spain), and DNA purity was measured using a NanoDrop™ 2000 (Thermo-Scientific, Wilmington, DE, USA). DNA quality was verified by electrophoresis in 0.7% agarose gels. Total RNA was extracted from 200 mg of deep-frozen tissue using Fungal RNA E.Z.N.A Kit (Omega Bio-Tek, Norcross, GA, USA), and its integrity was verified using the Agilent 2100 Bioanalyzer system (Agilent Technologies, Santa Clara, CA, USA).

Genome and transcriptome sequencing and assembly
A detailed description is provided in Additional file 1: Text S1. Briefly, the C. olivacea MUCL 20566 genome was sequenced using Illumina HiSeq-1 TB Regular 2 × 151 bp 0.309 kb. Sequenced reads were QC filtered for artifact contamination using BBDuk from the BBMap package (https://sourceforge.net/projects/bbmap/) and subsequently assembled with Velvet 1.2.07 [10]. The result -pair library with an insert size of 3000 +/− 300 bp in silico that was then assembled together with the original Illumina library with AllPathsLG [11]. Raw sequences were deposited in SRA (Sequence Read Archive) NCBI database under accession number SRP086489. Strand-specific RNASeq libraries were created and quantified by qPCR. Sequencing was performed using an Illumina HiSeq-2500 instrument. Reads were filtered and trimmed to remove artifacts and low quality regions using BBDuk. Transcriptome was de novo assembled using Trinity [12] and used to assist annotation and assess the completeness of the corresponding genome assembly using alignments of at least 90% identity and 85% coverage.

Whole-genome alignment
The genome assemblies of C. olivacea MUCL 20566 and C. puteana (http://genome.jgi.doe.gov/Conpu1/Conpu1.home.html) were aligned using the Promer tool from the MUMmer 3.0 package [13]. Genome rearrangements were identified in the alignment with dnadiff tool from the same package.

Genome annotation
The annotation of the C. olivacea MUCL 20566 assembly was performed using the Joint Genome Institute pipeline [14] to predict and functionally annotate protein-coding genes and other features such as tRNAs or putative microRNA precursors. The SECRETOOL pipeline [15] was used to identify putatively secreted proteins, considering the presence of signal peptides, cleavage sites, transmembrane domains and the GPI (glycosylphosphatidylinositol) membrane anchor. Carbohydrateactive enzymes (CAZys) were annotated based on BLAST [16] and HMMER [17] searches against sequence libraries and HMM (Hidden Markov Models) profiles of the CAZy database [18] functional modules. Protein structure predictions were carried out with Phyre2 [19]. Raw sequencing reads, genome assembly, transcriptome assembly, gene predictions and functional annotations are publicly available in the C. olivacea genome portal of Mycocosm database (http://genome.jgi.doe.gov/Conol1/Conol1.home.html).

Insertion age of LTR-retrotransposons
Full-length LTR-retrotransposons were identified using LTRharvest [23] followed by BLASTX against Repbase [24]. Long Terminal Repeats were extracted and aligned with MUSCLE [25]. Alignments were trimmed using tri-mAl [26] and used to calculate Kimura's 2P distances. The insertion age was calculated following the approach described in [27] using the fungal substitution rate of 1.05 × 10 −9 nucleotides per site per year [6,28].

Identification of gene families
All-by-all BLASTP followed by MCL (Markov Cluster Algorithm) clustering [29] was carried out with C. olivacea protein models using a threshold value of e −5 and an inflation value of 2. We considered gene families carrying four or more genes for further analyses.

Phylogenetic analyses
The predicted proteomes of the following species were all-by-all BLASTP followed by MCL clustering was carried out with a dataset containing the proteomes of all the species. The clusters carrying only one protein per species were identified, and the proteins were aligned using MAFFT [30]. The alignments were concatenated after discarding poorly aligned positions with Gblocks [31]. The phylogeny was constructed using RaxML [32] with 100 rapid bootstraps under PROTGAMMAWAGF substitution model. Phylogenetic reconstruction of Gypsy reverse-transcriptases was carried out as follows: Reverse transcriptase RV1 domains were extracted from LTRretrotransposons of the TE consensus library using Exonerate [33] and aligned with MUSCLE. The alignments were trimmed using trimAl with the default parameters, and an approximate maximum likelihood tree was constructed using FastTree [34].

C. olivacea assembly and annotation
The nuclear genome of C. olivacea was sequenced with 137 X coverage and assembled into 863 scaffolds accounting for 39.07 Mb, 90.3% of the genome size estimation based on k-mer spectrum (43.28 Mb). The mitochondrial genome was assembled into two contigs accounting for 78.54 kb. The assembly completeness was 99.78% according to the Core Eukaryotic Genes Mapping Approach (CEGMA [35]), with only one missing accession (KOG1322, GDP-mannose pyrophosphorylase). We assembled 66,567 transcripts (mean lenght = 2,744 nt, median = 2,154 nt) of which 97.8% could be mapped to the genome. The C. olivacea assembled genome was more fragmented than its close relative C. puteana ( Table 1). The total repeat content was 2.91% of which 2.15% corresponded to transposable elements, 0.64% to simple repeats, and 0.12% to low complexity regions. The estimation of repeat content from low-coverage Illumina data (3.8X) yielded 6% of the genome size covered by transposable elements (Additional file 2: Table S1). We used transcriptomic information, ab initio predictions and similarity searches to predict a total of 14,928 genes-84.5% of them having a strong transcriptome support (spanning more than 75% of the gene length). In addition, 88.3% of the annotated genes had significant similarity to proteins from the NCBI nr database and 46.6% to the manually curated proteins from the Swiss-Prot database (cutoff e −05 ) [36]. A total of 7,841 predicted proteins (52.3%) carried Pfam domains and 1,471 (9.8%) carried signal peptide, of which 470 were predicted to be secreted using the more stringent SECRETOOL pipeline.
The multigene phylogeny based on 1,677 conserved single copy genes displayed different classes, orders and families in branches congruent with previous phylogenetic data [37] and with very high support. C. olivacea was placed in a branch next to its sequenced closer species C. puteana representing the Coniophoraceae family in the order Boletales (Fig. 1).
The whole-genome protein-based alignment between the two Coniophoraceae species spanned 52.7% of the C. olivacea and 48.0% of C. puteana assemblies. It shows evidence of macrosynteny between the two species (Fig. 2a, Additional file 3: Fig. S1), with an average similarity of 78.4% in the aligned regions ( Fig. 2b) and numerous inversions (1,027 regions). The good conservation between both genomes in protein coding regions was evidenced by the amount of orthologous genes obtained using the reciprocal best hit approach (7,468 genes with more than 70% identity over 50% of protein sequences) and by the number of C. olivacea proteins yielding significant tBLASTN hits against the C. puteana genome (13,572 genes, cutoff e-5, Fig. 2c). For the remaining 1,352 C. olivacea-specific (orphan) genes, only 48 could be functionally annotated based on KOG (Eukaryotic Orthologous Groups), KEGG (Kyoto Encyclopedia of Genes and Genomes), GO (Gene Ontology) or InterPro databases.

Carbohydrate-active enzymes of C. olivacea
The annotated proteome was screened for the presence of carbohydrate-active enzymes (CAZy). A total of 397 proteins were annotated and classified into different CAZy classes and associated modules. The CAZyme profile of C. olivacea was very similar to that of C. puteana although small differences were found in the glycoside hydrolases (GH, Additional file 4: Table S2). Some families such as GH5, GH18 or GH31 were smaller than in C. puteana. Similar to other brown-rot basidiomycetes, C. olivacea lacked Class II peroxidases (Auxiliar Activities AA2) and displayed a reduced set of other cellulolytic enzymes such GH6 (1), GH7 (1) and CBM1 (2) and AA9 (6).
Functional characteristics of C. olivacea predicted secretome Using SECRETOOL pipeline we predicted 470 putatively secreted proteins in C. olivacea and 504 in C.
puteana. An enrichment analysis of gene ontology (GO) terms was performed to determine what gene functions were over-represented in the secreted proteins. Thirty GO terms were significantly enriched including 24 corresponding to molecular functions, four to biological processes and two to cellular components ( Table 2). The most enriched molecular function was "feruloyl esterase activity," which is responsible for plant cell-wall degradation. "Polysaccharide catabolic process" was the most enriched GO term within the biological processes, and "extracellular region" within the cellular components (Table 2).

Analysis of putatively secreted multigene families
Using all-by-all BLASTP followed by MCL we clustered by similarity the 1,471 proteins carrying signal peptides in C. olivacea. We used all proteins carrying signal peptides rather than only SECRETOOL predictions in order to obtain larger protein clusters. Up to 60% of the 1,471 proteins grouped in clusters were formed by 2 to 59 genes (Additional file 5: Table S3), showing the same distribution as the whole proteome (p = 0.6032, Wilcoxon test, 61% of the 14,928 predicted genes were found in clusters containing 2 to 157 members). For further analysis of the secreted genes found in clusters, we focused on the 70 clusters (families) formed by four or more gene members. Using the KOG, KEGG, InterPro and GO databases, we could assign functions to 45 out of the 70 gene families (Table 3). Cytochrome P450, hydrophobins and aspartic-peptidases were the largest gene families. In addition, 17 CAZys clusters were found including glycoside hydrolases (GH), carbohydrate esterases (CE), carbohydrate-binding modules (CBMs) and redox enzymes classified as auxiliary activities (AA). 25 clusters lacked functional annotation, and some of them had a high number of genes (clusters 2, 6 and 7 in Table 3). All of these genes belonging to families with unknown function were further analyzed with Phyre2 to predict their protein structure and used for PSI-BLAST (Position-Specific Iterated BLAST) analysis. Using this approach, two gene families were functionally annotated with high confidence (96.3-97.4% confidence for individual protein predictions): one as a copper-dependent lytic polysaccharide monooxygenase (LPMO, also known as AA9; cluster 16), and the other as thaumatin-lyke xylanase inhibitor (tlxi, cluster 48). The Cluster16 containing putative LPMOs was particularly interesting. This was formed by 10 genes coding for small proteins ranging from 130 to 162 amino acids with three exons (with the exception of protein ID839457 that shows only two). All these genes coded for proteins that have a signal peptide but lack of known conserved functional domains. Six were confidently annotated as LPMOs by Phyre2, and four of them were predicted to be secreted by SECRETOOL. In addition, this family of unknown proteins is conserved in all the agaricomycetes shown in Fig. 1. Interestingly, four members of this family appear as a tandem located in C. olivacea scaffold_124 (scaf-fold_426:4800-12,000).

Impact of repeat content on C. olivacea genome size and other Boletales
To study the role that TEs have played in the evolution of the Boletales genomes, we annotated and quantified the TE content in five species showing important differences in genome size: C. olivacea (39.  by TEs, respectively (Fig. 3, Table 4). In addition to higher TE content, species with larger genome assembly size showed higher TE diversity as reflected by the higher number of TE families, which ranged between 43 in C. olivacea to 432 in P. tinctorius. The TEs found belong to seven out of the nine TE orders described by Wicker et al [38]: LTR, DIRS (Dictyostelium Intermediate Repeat Sequences), PLE (Penelopelike Elements), LINE (Long Interspersed Nuclear Elements), SINE (Small Interspersed Nuclear Elements), TIR (Terminal Inverted Repeats) and Helitrons. Two of the orders (LTR and TIRS, which contain long terminal repeats or terminal inverted repeats, respectively) were present in the five species. Class I TEs were primarily responsible for the observed genome size differences-especially the elements belonging to LTR in the Gypsy superfamily, which accounted for more than 15% of the assembly in S. lacrymans and P. tinctorius, but less than 3% in H. pinastri, C. olivacea and C. puteana. Of all the LTR/Gypsy families detected by TEdenovo, we observed that those elements belonging to the Chromoviridae group (carrying a Chromatin organization domain, PF00385, in the N-terminal region after the integrase, Fig. 4) were the most abundant LTR-retrotransposons in these five species, ranging from 44 to 83% of the total Gypsy coverage. LTR-retrotransposons in the Copia  superfamily were also particularly abundant in S. lacrymans and P. tinctorius (accounting for 2.4-6% of the total assembly size). Remarkably, non-coding LTRretrotransposons such as TRIM (Terminal-repeat Retrotransposons In Miniature) and LARD (Large Retrotransposon Derivatives) were also found in three out of the five genomes, but in lower amounts (<1% of the genome, Table 4). LINE, SINE, DIRS and PLE elements were also found in low copy numbers, but none of these were present in the five species. Regarding Class II transposons, TIR order was the most important in terms of abundance    and copy number with elements encoding DDE transposases present in the five species. The second most important were MITEs (Miniature Inverted-repeat Transposable Elements) and other non-coding elements carrying structural features (classified as TIR/unknown in Table 1). Rolling-circle helitrons were found in H. pinastri, S. lacrymans and P. tinctorius, while putative Mavericks were present only in this latter one.

Phylogenetic reconstruction of the LTR reversetranscriptases
To understand the phylogenetic relationship between the LTR-retrotransposon familes in the five analyzed genomes, we inferred a maximum likelihood phylogeny of the LTR reverse-transcriptases of the Gypsy consensus sequences (Fig. 5). Three main clades were obtained (A, B and C). Clades A and B were formed, almost exclusively, by families found in the P. tinctorius genome. Moreover, while clade B is formed mostly by distantly related families, the profile of clade A suggests that an important fraction of the families underwent recent diversification. All LTR families found in the other four species grouped in clade C along with the remaining families of P. tinctorius. This clade contained several retrotransposon sub-clades sharing closely related families from three to five species.

Age of the LTR-retrotransposon amplification bursts in the Boletales
LTR-retrotransposons carrying conserved domains as well as intact Long Terminal Repeats (putative autonomous elements) were subjected to further study to investigate their amplification dynamics over the course of evolution. Based on the nucleotide divergence between the two LTRs, we estimated the time of insertion of each element using a substitution rate of 1.05 × 10 −9 nucleotide substitutions per site per year. The number of intact, putative autonomous LTR-retrotransposons varied greatly in the five species ranging from 26 elements in C. olivacea to 944 in P. tinctorius. The LTR profiles of C. olivacea, C. puteana and S. lacrymans showed recent peaks of amplification with insertion dates at 0-5 million years (MY). LTR amplification in H. pinastri showed a peak at 10-15 MY ago, whereas the profile of P. tinctorium pointed to a much older amplification burst showing a maximum peak at 25-30 MY ago and few recent retrotransposition events (Fig. 6).

Discussion
Genomic and proteomic characteristics of C. olivacea We report the 39.07 Mb draft genome assembly and annotation of brown-rot basidiomycete C. olivacea. In terms of genome size, this species is slightly smaller than C. puteana, but it falls in the range of other brown-rot basidiomycetes such as Hydnomerulius pinastri (38.3 Mb) [4] or Serpuyla lacrymans (47.0 Mb). As expected for closely related species, C. olivacea and C. puteana show macrosynteny, although due to the short scaffold lengths it is impossible to establish comparisons at a chromosome scale. We found very good conservation of protein-coding genes, although C. olivacea has up to 1,352 orphan genes-most of these are supported by structure and RNA evidence (i.e., no homology to any other known gene). In this sense, the higher number of annotated genes in C. olivacea relative to C. puteana is probably related to the higher amount of assembled RNA contigs used to assist the annotation of the former (resulting from the higher RNAseq depth). The presence of about 10% of orphan genes is common in fungal genomes, and these genes often lack an in silico functional annotation as we found for C. olivacea [39,40]. Wood-decaying species require a complex enzymatic machinery to degrade lignin and obtain nutrients. According to the CAZy enzymes identified in the genome, the C. olivacea proteome carries the main signatures of canonical brown-rot: (i) it completely lacks Class II peroxidases-enzymes primarily involved in lignin degradation [41], and (ii) it carries a reduced set of enzymes involved in degradation of crystalline cellulose. In fact, its profile is very similar to that of C. puteana, displaying only minor differences in several enzyme groups. As previously seen in other wood-degrading fungi, the in silico secretome of C. olivacea is enriched in functions related to lignocellulose degradation [42]. Our analysis showed that most intracellular and secreted proteins are members of multi-gene families of diverse size originating from gene duplications. The number of gene families that could not be functionally annotated by standard similarity-based methods was high, a phenomenon that is frequently observed in fungi.
To overcome this drawback, we used an alternative approach that combines similarity with structural information (Phyre-2). We then assigned a putative function to two multi-gene families conserved across the basidiomycete phylogeny but for which a putative function had not been previously proposed. Of special interest is the newly identified family of putative copper-dependent lytic polysaccharide monooxygenases (AA9, LPMO). The LPMOs are recently discovered enzymes used by microbes to digest crystalline polysaccharides [43]. They increase the saccharification yield of commercial enzyme cocktails [44]. Nevertheless, despite the promising results obtained in silico, experimental assays will be necessary to confirm the function of the members of this newly described gene family.

Impact of TEs in the evolution of Boletales genomes
The results of TE annotation in the five Boletales showed how different patterns of LTR-retrotransposon amplifications have shaped the architecture of their genomes. The expansion of LTR/Gypsy retrotransposons belonging to Chromoviridae occurred mainly in the species with large genomes, whereas the smaller genomes have a small amount of these families (ie, three families in C. olivacea and C. puteana). Chromoviruses are the most common LTR-retrotransposons in fungi [45], and the key to their success might be the presence of a chromo-integrase, which is thought to guide the integration of these elements into heterochromatic regions [46]. Heterochromatin is gene-poor, and it is silenced by epigenetic mechanisms such as DNA methylation and RNAi [47]. Thus, integration of these elements in such regions would allow them to skip purifying selection and increase their probability to persist in the genome. In fact, this could be the reason for the longer prevalence of Gypsy over Copia LTR-retrotransposons in most fungal species-the latter tend to integrate at random locations including euchromatic regions where transposon fixation is more difficult [48]. The LTR-retrotransposon amplification bursts of the Boletales indicate that elements from both Coniophora species are young and thus putatively active, and the profile of S. lacrymans also indicates a very strong activity of young copies with a progressive decrease in the amplification signals of older elements. Our findings suggest that the latter three species are currently in a period of genome expansion. Despite the different profile of H. pinastri and P. tinctorius we cannot rule out the same hypothesis, as both assemblies contain high gap content (7.7% and 13.3%, respectively). This fact usually leads to an underestimation in the amount of young retrotransposons [6], as they are difficult to assemble due to their repetitive nature and high sequence identity. In fact, we show that due to this reason the assembly-based TE quantification underestimated LTR content in C. olivacea in comparison to non-assembly based quantification (Additional file 2: Table S1). The profile of P. tinctorius is intriguing. This ectomycorrhizal (ECM) species undergoes a massive expansion of LTR-retrotransposons in the Gypsy superfamily (similar to that found for other symbiotic species in Agaricomycotina [7,49]; however, the majority of elements are very old  and still carry structural and coding domains necessary for transposition. The phylogeny of Gypsy reverse-transcriptases suggests that many P. tinctorius-specific families are distantly related to the other four species. In fact, its impressive retrotransposon content might be partially explained by the amplification and diversification of ancestral families (giving rise to clades A and B in Fig. 5). Our phylogenetic reconstruction suggests that such ancestral families were also present in other boletales but didn't proliferate in the genome (ie, H. pinastri or C. puteana). Whether genome defense mechanisms or lifestyle constraints are responsible of this phenomenon is still to be demonstrated. In this regards, it is interesting to note that the LTR-mediated genome amplification of P. tinctorius roughly coincides with the estimated origins of ECM symbiosis in Boletales [4]. Of the four Class I TE orders found, only the LTR elements were present in the five species. The most plausible scenario is that the elements from the other three orders (DIRS, LINE, and PLE) were lost by random drift in some of the species. Alternatively, they might be present in some genomes but in the form of very ancient and degenerated copies that are not detectable. Similarly, this patchy distribution was also found in class II elements (ie, helitrons were absent in the Coniophora genus and present in the remaining three species). Previous studies have shown that besides the conserved presence of LTR and TIR orders, the remaining TE groups tend to be present in variable amounts in basidiomycetes [6].