Genome expansion by allopolyploidization in the fungal strain Coniochaeta 2T2.1 and its exceptional lignocellulolytic machinery

Background Particular species of the genus Coniochaeta (Sordariomycetes) exhibit great potential for bioabatement of furanic compounds and have been identified as an underexplored source of novel lignocellulolytic enzymes, especially Coniochaeta ligniaria. However, there is a lack of information about their genomic features and metabolic capabilities. Here, we report the first in-depth genome/transcriptome survey of a Coniochaeta species (strain 2T2.1). Results The genome of Coniochaeta sp. strain 2T2.1 has a size of 74.53 Mbp and contains 24,735 protein-encoding genes. Interestingly, we detected a genome expansion event, resulting ~ 98% of the assembly being duplicated with 91.9% average nucleotide identity between the duplicated regions. The lack of gene loss, as well as the high divergence and strong genome-wide signatures of purifying selection between copies indicates that this is likely a recent duplication, which arose through hybridization between two related Coniochaeta-like species (allopolyploidization). Phylogenomic analysis revealed that 2T2.1 is related Coniochaeta sp. PMI546 and Lecythophora sp. AK0013, which both occur endophytically. Based on carbohydrate-active enzyme (CAZy) annotation, we observed that even after in silico removal of its duplicated content, the 2T2.1 genome contains exceptional lignocellulolytic machinery. Moreover, transcriptomic data reveal the overexpression of proteins affiliated to CAZy families GH11, GH10 (endoxylanases), CE5, CE1 (xylan esterases), GH62, GH51 (α-l-arabinofuranosidases), GH12, GH7 (cellulases), and AA9 (lytic polysaccharide monoxygenases) when the fungus was grown on wheat straw compared with glucose as the sole carbon source. Conclusions We provide data that suggest that a recent hybridization between the genomes of related species may have given rise to Coniochaeta sp. 2T2.1. Moreover, our results reveal that the degradation of arabinoxylan, xyloglucan and cellulose are key metabolic processes in strain 2T2.1 growing on wheat straw. Different genes for key lignocellulolytic enzymes were identified, which can be starting points for production, characterization and/or supplementation of enzyme cocktails used in saccharification of agricultural residues. Our findings represent first steps that enable a better understanding of the reticulate evolution and “eco-enzymology” of lignocellulolytic Coniochaeta species.

Plant biomass is a carrier of energy with high relevance both ecologically and for biotechnology. Several studies have attempted production of commodity chemicals from agricultural residues [12,13]. However, one bottleneck in this process is low saccharification efficiency, due largely to the recalcitrant nature of plant polymers [14]. Recently, mining of fungal genomes, transcriptomes, and proteomes has unveiled new enzymes and/or mechanisms that enhance the saccharification of plant polysaccharides [15,16]. For example, Hüttner et al. [17] and Qin et al. [18] merged genomics and transcriptomics to elucidate the lignocellulolytic machinery in Malbranchea cinnamomea (thermophilic ascomycete) and Irpex lacteus (white-rot basidiomycete), respectively. Currently, the saccharification process is carried out using commercial enzyme cocktails obtained from Trichoderma reesei strains [19]. It has been reported that the supplementation of exogenous enzymes (or secretomes) to T. reesei-derived cocktails can improve the saccharification efficiency [20,21]. Moreover, Harris et al. [22] showed that co-expression of a lytic polysaccharide monoxygenase (LPMO) in a commercial T. reesei strain resulted in enhancing conversion of plant biomass. LPMOs (e.g., CAZy families AA9, AA11, AA13, and AA16) are metalloenzymes that catalyze the oxidative cleavage of (1,4)-linked glycosidic bonds of plant polysaccharide surfaces [23]. These proteins have been identified and characterized in several fungal species (e.g., Neurospora crassa, Podospora anserina, Thielavia terrestris, and Myceliophthora thermophila) [24,25]. However, their presence and function in Coniochaeta species have yet to be explored.
In this study, we analyzed the genome and transcriptome of Coniochaeta sp. strain 2T2.1 to identify its lignocellulolytic machinery. This fungus was isolated from a heat pretreated wheat straw-degrading microbial consortium, where it plays a key role in the degradation of plant polysaccharides, along with bacteria belonging to the genera Sphingobacterium and Klebsiella [26,27]. Through genome sequencing, we discovered that 2T2.1 experienced a massive genome duplication event. Changes in genome size have been observed occasionally across members of the Ascomycota and can be caused by several processes including: transposable element expansion spontaneous changes in ploidy, allopolyploidization and autopolyploidization. These last events can hypothetically result in whole-genome duplication (WGD) [28][29][30][31]. WGD has the potential to increase fitness for specific functions through diversification of gene function and evolution by selection. Typically, WGD causes genome instability, leading to massive gene loss, genome rearrangements and sequence divergence [32][33][34]. Consequently, our study sought to answer three main questions: (i) what is the origin of the genome duplication event in 2T2.1? (ii) What lignocellulolytic machinery is present in its genome and how does it differ from other fungal species? (iii) What type of lignocellulolytic enzymes (especially LPMOs) are significantly upregulated during growth on wheat straw compared with glucose? The results of our study expand our "eco-enzymology" (defined here as the study of enzymes and their role in microbial interactions and the modification of surrounding environments) understanding of this fungus and enable the discovery of novel enzymes useful in saccharification of agricultural residues.

Morphological and genomic features of Coniochaeta sp. 2T2.1
On potato dextrose agar (PDA), Coniochaeta sp. strain 2T2.1 formed unique black mycelial colonies without evidence of two colony types. In liquid mineral medium supplemented with wheat straw, it grew in a yeast-like form (Fig. 1). The genome of Coniochaeta sp. 2T2.1 was sequenced using PacBio technology at the Joint Genome Institute (JGI) and assembled using Falcon, a diploidaware PacBio assembler [35]. This generated a contiguous, but highly duplicated final assembly with a size of 74.53 Mbp, at read coverage depth of 122.9× with 95 scaffolds larger than 2 Kbp (N50 of 2.67 Mbp and L50 of 11 scaffolds). The three largest scaffolds are all around 4.4 Mb. The proportion of reads with circular intermediates (see methods) that could potentially cause artificial contigs/duplicated content was extremely low (~ 0.3%), indicating that duplicated regions were unlikely to arise due to mis-assembly. Furthermore, junctions between the duplications on the same scaffolds were well supported by PacBio read mapping, indicating a high-quality assembly. The 2T2.1 genome contains 24,735 gene models with an average of 390 amino acids per protein. Around 28% of the total gene models had assigned KEGG functions. From these, some proteins were predicted to be enzymes involved in carbohydrates (1098), amino acids (909), lipids (859), and xenobiotics (806) metabolism. In addition, Pfam domains were located on ~ 67% of genes (16,503 out of 24,735) and ~ 86% (21,299) were supported by transcriptomic data (Additional file 1: Table S1). Other main features of the 2T2.1 genome can be found at JGI-MycoCosm genome portal (https ://genom e.jgi.doe.gov/ Conio c1/).

Evidence for a genome expansion in Coniochaeta sp. 2T2.1
Unlike other members of the Coniochaetaceae family, strain 2T2.1 displayed a massive genome expansion, resulting in 97.91% of the assembly being duplicated. Duplicated content was identified as regions with at least three genes in each fragment, and at least 50% of genes between fragments were homologous to each other (blastp e value ≤ 1e−20 and alignment coverage for both query and target > 80%). This approach revealed that 24,198 (97.83%) of gene models were contained in duplicated regions and 537 genes were found in regions present only once in the assembly. Around 1.55 Mb of the genome is unpaired. For a list of all proteins and their duplication status, see Additional file 2: Table S2. Consistent with genome duplication, much of the assembly is syntenic with other regions in the 2T2.1 genome, although synteny breaks and inversions can be observed (Fig. 2a). To identify the source of this duplication event, we compared genome assembly and gene features to what is typically observed in assemblies of varying ploidy (i.e., haploid, diploid, and dikaryotic lineages). We found that in representative diploid and dikaryotic lineages, over 85% of the total duplicated content was > 95% identical (Rhizoclosmatium globosum; diploid: 88.47%, Puccinia striiformis f. sp. tritici; dikaryon: 88.66%) (Fig. 2b). However, 2T2.1 showed a different pattern from these fungi, as only 2.45% of total duplicated content was > 95% identical. Instead, in 2T2.1, we observed 91.9% nucleotide identity on average (92.33% of duplicated content was between 88.5 and 92.5% identity).
Comparing duplicated protein content also shows a dissimilarity of 2T2.1 to patterns observed in other lineages of varying ploidy ( Fig. 3; Additional file 3: Fig S1). While allelic proteins from diploid/dikaryotic fungi  [26] and growth on Potato Dextrose Agar (PDA) (left) and in liquid medium using wheat straw as the sole carbon source (micrograph on the right) (labeled in blue in Fig. 3) were frequently > 98% identical to one another, Coniochaeta sp. 2T2.1 showed both a higher diversity amongst copies and a depletion of nearly identical copies. For example, in P. striiformis (dikaryon), nearly half (44.75%) of all bidirectional best blast hits (BBHs) were 99.75-100% identical in amino acid sequence to each other, while in 2T2.1, this was only 2.46%. Altogether, the features that we observed in 2T2.1 were largely inconsistent with what is typically observed in diploid/dikaryotic assemblies. Since the material for the genome and transcriptome sequencing arose from an isolated colony and only a single mitochondrial sequence was detected, the duplicated content that we observed is unlikely to be due to contamination with a closely related strain. Therefore, we hypothesized that a whole-genome duplication (WGD) event may have occurred either through (i) a within-species WGD (autopolyploidization) or (ii) recent hybridization of two closely related species (allopolyploidization). However, nucleotide conservation (calculated using nucmer [38]) between 2T2.1 and its closest relatives, genome-sequenced, was substantially lower (Coniochaeta sp. PMI546: 85.97% and Lecythophora sp. AK0013: 86.73%). Due to the absence of available genomes closely related to 2T2.1, methods such as phylogeny reconstruction [33] are currently unable to resolve whether this duplication occurred through autopolyploidization or allopolyploidization. Furthermore, duplicated genes appear similarly diverged from close relatives, as calculation of synonymous divergence Length (x-axis) and percent identity at the nucleic acid level (y-axis) between duplicated regions in Coniochaeta sp. 2T2.1 (red) and representative haploid (C. lignaria, grey), dikaryotic (P. striiformis f. sp. tritici, blue) [36] and diploid (R. globosum, purple) fungi [37]. Each dot represents a single duplicated region [29,39] between 2T2.1 duplicates and their orthologues in Lecythophora sp. AK0013 did not yield any separation of potential parents (Additional file 3: Fig S2).
Consequently, we developed a different method for separating recent allopolyploidization events from autopolyploidization in 2T2.1. In cases of autopolyploidization, since duplicates are originally at (or near) 100% identity to each other, we expect little or no fitness cost of losing duplicated content (or perhaps even a fitness gain) across most genes in the genome. Therefore, one should observe a rapid accumulation of deleterious mutations and pseudogenization following autopolyploidization, a signature that can be captured by exploring the patterns of nonsynonymous (d N ) and synonymous (d S ) substitutions across duplicated content. For instance, if copies demonstrate high rates of pseudogenization (d N /d S ~ 1.0) genome wide, this would suggest autopolyploidization. In contrast, if we observe high rates of purifying selection, this would suggest a recent allopolyploidization, as copies have not coexisted for long enough to accumulate deleterious mutations and become pseudogenes. In the case of Coniochaeta sp. 2T2.1, in addition to absence of gene loss despite copies having diverged on average by 8.1% (or 91.9% identity), we observed a strong signature of genome-wide purifying selection. This profile was highly correlated with that observed when comparing single-copy orthologues across different Coniochaeta/Lecythophora species (R 2 ≥ 0.945; Fig. 4). In other words, the d N /d S distribution across duplicated genes in 2T2.1 looks the same as between orthologues across species, indicating that the source of the duplication was likely a hybridization event (allopolyploidization) instead of autopolyploidization.

Clusters of orthologous genes and phylogeny reconstruction
Clusters of orthologous genes were analyzed across the genome of 2T2. 1 Fig. 3 Unique pattern of sequence divergence between duplicates is observed in Coniochaeta sp. 2T2.1 (red) compared to haploid (black) and diploid/dikaryotic (blue) fungi. For each genome, a self-BLASTp was conducted to identify duplicates by reciprocal best blast hits (BBHs; min e value 1e−5). The fraction of bidirectional best blast hits (BBHs) at varying identity levels (steps = 0.25%) are then plotted (y-axis, grey = 0) for each lineage (x-axis). Only published PacBio genomes and close relatives of 2T2.1 were included. Despite being dispersed across most of the fungal kingdom, a consistent pattern is observed based on ploidy regardless of phylogenetic neighborhood A total of 215 and 141 clusters of orthologous genes were shared between 2T2.1 with PMI546 and AK0013, respectively. Moreover, 994 clusters of genes (containing 2199 proteins) were unique in 2T2.1 (Fig. 5b). From these, 87 proteins were affiliated to carbohydrate-active enzymes (CAZymes) and 27 of these were related specifically to lignocellulases (families AA11, AA4, GH43, GH16, GH5, CE1, GH141, GH3, GH31, and CBM16) (Additional file 4: Table S3). For phylogeny reconstruction, we used 2552 single-copy orthologous genes identified using mcl [40] which produced a robust and highly supported tree (RAxML and FastTree) and reveal Lecythophora sp. AK0013 as the earliest diverging Coniochaeta species that has so far been identified. In addition, Lecythophora/Coniochaeta species were found to be evolutionarily closer to N. crassa, P. anserina, and M. thermophila than Fusarium oxysporum, T. reesei, and Aspergillus chrysogenum ( Fig. 5a; Additional file 3: Fig.  S3).  Table S1). To make 2T2.1 comparable to other fungi that did not experience a WGD, only one copy was kept for each duplicated gene. Here, we found that the AA8, CBM24, and GH127 families were significantly enriched in the Lecythophora/Coniochaeta linage. Next, we determined which gene families from strain 2T2.1 were enriched or depleted (two standard deviations above or below the mean) in abundance in 2T2.1 compared to other fungal genomes. The results showed that genes for lignocellulases from families GH43 (α-arabinosidases/β-xylosidases), GH16 (xyloglucanases/ endoglucanases), CE1, CE3 (acetyl xylan esterases), GH11 (endoxylanases), AA4 (vanillyl-alcohol oxidases), and AA1_2 (ferroxidases) were highly abundant in 2T2.1 (more than five genes) compared with the other Lecythophora/Coniochaeta genomes (Table 1). Moreover, genes for CAZy families CBM24, GH76, CE1, GH47, GH31, GH71, AA8, GH55, AA3, GH11, AA4, AA1_2, AA12, AA3_3, GH13_40, GH45, and GH5_5 were highly abundant in 2T2.1 (more than five genes) compared with the other fungi outside of the Coniochaetaceae. Including all the duplicated content of 2T2.1, the results showed that 122 CAZy families were differentially abundant (two standard deviations above or below) compared to the whole dataset (Coniochaetaceae-derived plus other fungal genomes). Complete counts of all genes belonged to each CAZy family across genomes used in this study     WS and PTWS compared with Glu (Table 3). Regarding AA11 and AA13 genes, we observed that four and two transcripts, respectively, were significantly upregulated (padj-value ≤ 0.05 and Log2 FC ≥ 2) in WS compared with Glu (Additional file 6: Table S5).

Discussion
Despite their diverse lifestyles, widespread distribution in different environments [1,[43][44][45], and lignocellulolytic microbial consortia [46,47]  , and dilute-acid-pretreated wheat straw solids (PTWS). Asterisks represent putative secreted enzymes that were significantly upregulated (padj-value ≤ 0.05 and Log2 FC ≥ 8) in WS and PTWS compared with glucose (Glu) cultures; s, d and t letters represent single, duplicate and triplicate genes within the 2T2.1 genome. b Structural 3D modeling of five selected AA9 proteins that were significantly and highly upregulated (padj-value ≤ 0.05 and Log2 FC ≥ 8) on wheat straw (WS) compared with glucose (Glu) cultures. Phyre2 [41] and EZmol [42] web portals were used to predict the putative 3D structural conformation. The molecular size of these proteins (JGI-IDs 1170506, 980755, 1220247, 1175568, and 1230134) ranged between 22 and 29 kDa with different isoelectric points (from 4.56 to 7.51). We identified predicted metal-binding and histidine brace sites based on the structural position and comparison with the best protein for modeling (Additional file 7: Table S6). In the five AA9 proteins, these sites were identified and contain generally two to three histidines (green), one to two tyrosines (red) and one residue of glutamine (blue) (ii) an unusual genome duplication event. With respect to lignocellulolytic machinery, genes encoding proteins from CAZy families GH43, GH16, CE1, GH11, AA1_2, and AA4 were highly enriched in the genome of 2T2.1 compared with other fungal genomes, even after removing the duplicated gene content. With nearly double the number of genes in 2T2.1 compared to related fungi, the enrichment of CAZymes in 2T2.1 is even more substantial (Additional file 5: Table S4). Glycosyl hydrolases (GHs) are key in the breakdown of internal and external linkages of arabinoxylan and xyloglucan [49], while AA1_2 and AA4 proteins could be involved in conversion of lignin. Moreover, 2T2.1 contains 13 CE1-encoding genes, whereas in the genome of M. thermophila, we found only four of these [50]. Fungal acetyl xylan esterases (EC 3.1.1.72) from CAZy family CE1 hydrolyze ester bonds to liberate acetic acid from acetylated arabinoxylan and xylooligosaccharides. It has been reported that these enzymes enhance the hydrolysis of pretreated wheat straw and giant reed (Arundo donax) [51]. Moreover, using Fisher's exact test, we found that genes encoding CAZy family GH127 enzymes were significantly enriched in Lecythophora/Coniochaeta genomes. These types of enzymes are mostly found in bacteria (e.g., Bifidobacterium longum), and many have β-l-arabinofuranosidase activity and can act on pectin, arabinoxyloglucan, and glycoproteins that are widely distributed in plant cell walls [52,53]. Thus, proteins of the GH127 family could play an important role in plant-fungal interactions within Lecythophora/Coniochaeta species. In addition, we found that one transcript associated with this family was significantly and highly upregulated on wheat straw compared with glucose cultures.
Regarding the genome duplication, we provide arguments, suggesting that 2T2.1 arose due to a hybridization of two related Coniochaeta-like species. Considering (i) the substantial diversity between the duplicated regions (91.9% identity on average; Fig. 2b), (ii) the inability of diploid-aware assemblers to phase haplotypes, and (iii) the higher diversity amongst copies and a depletion of nearly identical ones (Fig. 3), it is unlikely that these patterns emerged due to diploidization/dikaryosis. Regarding dikaryosis, this is even less likely as vegetative dikaryons have not been observed in Ascomycota. Alternatively, if the duplication had been caused by autopolyploidization, over the time, it would take the resulting copies to diverge to the extent we observe we would have expected to see the canonical gene loss and genome rearrangement patterns observed in other fungi (e.g., Rhizopus delamar 99-880) [30]. Even in the unlikely event that insufficient time has elapsed for rampant gene loss and rearrangements to occur, we should see elevated rates of pseudogenization given the 8% average divergence between copies, which is also not observed. In contrast, gene content was found to be highly conserved in 2T2.1 and a strong genome-wide consensus of purifying selection across copies was detected, similar to what was seen when comparing single-copy orthologues across different species (Fig. 4). As we would not expect nearly all genes in the genome to persist after autopolyploidization and simultaneously be experiencing purifying selection, these features indicate that the most likely source of this duplication event is a hybridization of two different Coniochaeta species (allopolyploidization). In addition, this likely occurred in the very recent past, as minimal gene loss has occurred. Previous studies revealed that highly selective environments could force hyphal fusion between unrelated fungi [54,55]. Since our strain was isolated from the highly selective wheat straw environment, [26,27], it is possible that to effectively break down plant biomass, two Coniochaeta/Lecythophora species were forced to fuse together. Alternatively, it is possible that the hybrid can more aggressively break down lignocellulose and is, therefore, more fit in this environment than either parent alone. Moreover, although we have not explicitly explored sexual reproduction here, we have not observed reproductive structures in 2T2.1 and it contains two copies of the same mating type (MAT 1-2-1) (JGI protein IDs 71119 and 1224076). Based on this evidence, we expect that 2T2.1 is heterothallic (i.e., not selffertile). However, given the limited sampling of this clade, identifying an opposite-mate closely related enough to 2T2.1 to explore fertility of this hybrid is challenging and remains to be addressed. Through comparing expression profiles of lignocellulolytic enzymes from 2T2.1 grown on wheat straw (raw and/or pretreated) and glucose, we were able to identify several upregulated enzymes which have potential for plant biomass saccharification processes. Remarkably, some of these were associated with endoxylanases (GH10 and GH11), feruloyl (CE1), and acetyl xylan esterases (CE5), which is consistent with what has been reported in M. cinnamomea grown on wheat bran and xylan [17]. Feruloyl esterases (EC 3.1.1.73) are responsible for the disruption of the ester bond in the lignin-ferulate-arabinoxylan complex. They act as auxiliary enzymes that assist other enzymes in gaining access to their site of action and, therefore, are likely key to lignocellulolytic activity [56]. Interestingly, α-l-arabinofuranosidases (GH51 and GH62) were also upregulated on 2T2.1 in wheat straw cultures. These enzymes are predicted to cleave the arabinose side chain into arabinoxylan. Qin et al. [18] reported upregulation of family GH61 enzymes in I. lacteus during growth on corn stover, whereas de Gouvêa et al. [16] showed that family GH51 enzymes are upregulated in Aspergillus fumigatus when the fungus was grown on steam-exploded bagasse compared with fructose. Moreover, Kolbusz et al. [15] studied the CAZy expression profile of M. thermophila during cultivation on different types of complex biomass in comparison with glucose. They reported the overexpression of nine enzymes involved in xylan deconstruction (five GH11, one GH62, one CE1, and two CE5) and seven cellulolytic enzymes (three AA9, two GH7, one GH6, and one GH12). In our study, we observed that five significantly and highly upregulated transcripts were associated with endoglucanases (GH12), cellobiohydrolases (GH7), and LPMOs (AA9). These enzymes may comprise the core of the cellulolytic machinery in Coniochaeta sp. 2T2.1. Based on this evidence, we suggest that 2T2.1 contains a complete set of enzymes required for exceptionally powerful lignocellulolytic activity. Based on the TPM data, we suggested that the high expression values in raw (WS) over pretreated wheat straw (PTWS) and glucose could be correlated with the highly complex interactions/bonds of the polysaccharides and lignin found in WS. Therefore, the fungal strategy to breakdown this challenging material might be largely based on increased expression and secretion of specific CAZymes.
Fungal LPMOs were first identified in saccharification experiments using pretreated corn stover [22]. Since their discovery, LPMOs have been included in all modern commercial enzyme cocktails (e.g., Cellic CTec3 ™ ) [19,57]. These copper-dependent enzymes boost the activity of classical GHs and cleave glycosidic bonds in cellulose, xylan, xyloglucan, glucomannan, and starch. In our study, after removing duplicate gene content in the 2T2.1 genome, we identified genes for 26 LPMOs (20 AA9-encoding genes). In the genomes of C. ligniaria NRRL30616 and C. pulveracea CAB683, 23 and 24 LPMOs were identified [7,9], respectively, whereas in I. lacteus, 17 LPMOs were detected that are potentially involved in stimulating (hemi)cellulose degradation [18]. An average plant biomass-degrading fungus has 10 AA9-encoding genes in its genome. Nevertheless, some fungi possess more than 30 different AA9-encoding genes (e.g., Chaetomium globosum), indicating a potentially important role of the LPMOs in their lifestyle [58]. For instance, some species of Coniochaeta are plant pathogens that could potentially use LPMOs as pathogenicity factors, similar to what was been reported in the maize pathogen Colletotrichum graminicola [59]. LPMOs in Coniochaeta species could additionally play a role in the decomposition of organic matter in soils. Several factors may be involved in the amplification and diversification of genes encoding LPMOs in 2T2.1. For instance, preference with respect to electron donor, adaptation to minimize undesirable oxidation events and physiochemical preferences [60].
Based on our transcriptomic analysis, we observed that some AA9-encoding genes were highly and significantly upregulated on WS versus Glu. To start characterization of these key LPMOs, we modeled their 3D structure using fungal-derived reported proteins. It is important to mention that LPMOs have low sequence identity, but share the same fold (immunoglobulin-like β-sandwich structure) [24,60,61]. To break (1,4)-linked glycosidic bonds of plant polysaccharide surfaces, LPMOs activate oxygen in a reducing agent-dependent manner, at a copper-containing active site known as the "histidine brace". Unlike GHs, which have substratebinding grooves or tunnels, LPMOs position their active site at the center of a flat surface. Based on 3D modeling, we identified these sites within five upregulated LPMOs, suggesting a similar structure and/or function with other fungal LPMOs. Notably, protein 1230134 showed a high percentage of identity (80%) with an AA9 family protein from M. thermophila [62]. In addition, the 3D model of protein 1175568 was reconstructed based on an AA9 protein from T. terrestris (Additional file 7: Table S6). Finally, it is important to note that our research team has recently developed a method for the genetic transformation of strain 2T2.1 using hygromycin as the selectable marker [63]. This method will be very useful for overexpressing lignocellulolytic enzymes that were detected in this study.

Conclusions
This study reports genomic and transcriptomic features of Coniochaeta sp. strain 2T2.1 isolated from a wheat straw-degrading microbial consortium. Interestingly, this fungus experienced an unusual genome duplication resulting from a recent hybridization event between two closely related species. This phenomenon is hypothesized to increase fitness in plant biomass deconstruction. Based on our results, we confirm that strain 2T2.1 has a very complete potential to degrade plant biomass and we highlight the relevance of some CAZy families in these processes (e.g., GH11, GH10, GH62, GH51, AA9, CE1, and CE5). The data presented in this study enable a better understanding of genomic features and metabolic potential of lignocellulolytic Coniochaeta species and identify novel proteins useful in saccharification of agricultural residues.

Isolation of Coniochaeta sp. 2T2.1 and DNA/RNA extraction
The Coniochaeta sp. strain 2T2.1 was originally isolated on PDA from a lignocellulolytic microbial consortium [26,27]. After 3-4 days of cultivation (30 °C at 250 rpm) in defined mineral medium (MM) [25 mM KH 2 PO 4 , 25 mM Na 2 HPO 4 , 0.1% (NH 4 ) 2 SO 4, and 0.1% Hutner mineral base] containing 1% (w/w) ground, autoclaved wheat straw (final pH 6.8), the growth of strain 2T2.1 on the substrate was identified using a BX60 microscope (Olympus Life Science, Waltham, MA, USA) with Nomarski interference contrast (Fig. 1). Coniochaetalike fungi form masses of conidia on hyphae, resulting in a yeast-like appearance in liquid culture. The liquid culture was transferred to a yeast extract-peptone-dextrose (YPD) agar and a single colony was isolated and used for reinoculation. To extract fungal genomic DNA, strain 2T2.1 was cultivated at 30 °C under shaking conditions in 50 ml of YPD broth containing 50 μg/ml kanamycin. Total DNA extraction was performed using the OmniPrep kit for fungi (G-Biosciences, St. Louis, MO). Total RNA was then extracted after growth (OD 600 nm of 1.0) on nine different cultures media and conditions: YPD (aerobic and microaerophilic conditions); YPD containing 1.5% (w/v) agar, yeast-peptone (YP); YP plus 1 M NaCl; MM containing 5 mM furfural, 4 mM HMF, and 3 mM benzaldehyde; MM containing glucose and NH 4 as a nitrogen source; and MM with NO 3 as nitrogen source and corn stover dilute-acid hydrolysate. Cell pellets were collected by centrifugation. In cases where 2T2.1 was grown on solid medium, cells were scraped off the plate. Subsequently, cells were suspended in 1.0 ml RNALater solution (Qiagen, Venlo, Netherlands) and stored at − 80 °C. Total RNA was isolated using the Qiagen RNAEasy plant mini kit (Qiagen) followed by DNase digestion, and quantified using the Qubit RNA HS assay (ThermoFisher Scientific, Waltham, MA, USA). RNA quality was also assessed visually using RNA bleach gels. The RNA isolated from the above nine cultures was pooled in equal quantities for use in genome annotation.

Genome and transcriptome sequencing, assembly, and annotation
For genome sequencing, 5 µg of genomic DNA was used to generate unamplified > 10 Kbp libraries. The sheared DNA fragments were then prepared using Pacific Biosciences SMRTbell template preparation kit. Pacific Biosciences hairpin adapters were ligated to the fragments to create the SMRTbell template for sequencing. The SMRTbell templates were then purified using exonuclease treatments and size-selected using AMPure PB beads. PacBio sequencing primer was then annealed to the SMRTbell template library and sequencing polymerase was bound to them using Sequel Binding kit v2.0. The prepared SMRTbell template libraries were then sequenced on a Pacific Biosystem's Sequel sequencer using v3 sequencing primer, 1 M v2 SMRT cells, and version 2.1 sequencing chemistry with 1 × 360 and 1 × 600 sequencing movie run times. Filtered sub-read data were then assembled together with Falcon version 1.8.8 [35].
Plate-based RNA sample preparation was performed using TruSeq Stranded mRNA HT Sample Prep Kit. Total RNA starting material was 1 µg per sample and 8 cycles of PCR was used for library amplification. The prepared library was then quantified using KAPA Biosystem's next-generation sequencing library qPCR kit and run on a Roche LightCycler 480 real-time PCR instrument. The quantified library was then multiplexed with other libraries, and the pool of libraries was then prepared for sequencing on the Illumina HiSeq sequencing platform utilizing a TruSeq paired-end cluster kit, v4, and Illumina's cBot instrument to generate a clustered flow cell for sequencing. Sequencing of the flow cell was performed on the Illumina HiSeq 2500 sequencer using HiSeq TruSeq SBS sequencing kits, v4, following a 2 × 150 indexed run recipe. The raw fastq file reads were filtered and trimmed using the JGI pipeline and assembled into consensus sequences using Trinity version 2.3.2 [64]. Fungal genome annotation was performed using the JGI pipeline and is available via the JGI-MycoCosm genome portal (http://genom e.jgi.doe.gov/Conio c1) [65].

Analysis of Coniochaeta sp. 2T2.1 genome with respect to duplication
To explore the duplication event in Coniochaeta sp. 2T2.1, we first identified segmentally duplicated regions. These were selected as duplicated genome fragments with a minimum of three genes in each fragment and at least 50% of genes between fragments being homologs to each other (blastp e value ≤ 1e−20 and alignment coverage for both query and target > 80%). As we are unable to assign parents to scaffolds due to potential genome rearrangements and similar divergence of duplicates to close relatives (see below), genes in duplicated regions were assigned "copy 1" and "copy 2" designations based on their alphanumeric position in the assembly (Additional file 2: Table S2). The percent assembly in duplication was then calculated as the total sum length of segmentally duplicated regions divided by the total assembly length. To calculate average similarity of 2T2.1 to close phylogenetic relatives (Lecythophora sp. AK0013 and Coniochaeta sp. PMI546) and representative lineages of varying ploidy, we used nucmer with default parameters from the mummer version 4.4.0 software package [38] and coordinates for all syntenic regions > 2000 bp were extracted using show-coords parameters -l -o -d -c -r -L 2000 -T. For comparison to assemblies of varying ploidy, potentially repetitive sequences (same position mapping to multiple locations) were removed. Since synteny is sometimes interrupted by unique sequence in one of the two copies, neighboring syntenic regions were extended if interrupted by less than 5 kb of non-syntenic sequence. If extended, % identity was averaged across duplicated regions. % of all duplicated content above 95% identity, or between 88.5 and 92.5% was calculated by dividing the sum length of duplicated content in regions at the specified identity levels by the total length of all duplicated content. Whole-genome DNA synteny for the visualization of duplicated content within in 2T2.1 was calculated using VISTA [66] and is available interactively at https :// mycoc osm.jgi.doe.gov/vista _embed /?viewM ode=dotPl ot&organ ism= Conio c1&?&run= 47620 -mbZ aH OBh&xdset =6678&ydset =6730&cutoff =50. As selfalignment will always generate a diagonal line of synteny across the plot, this is uninformative and is automatically removed by VISTA.
To explore patterns of sequence divergence between duplicates in haploid, diploid/dikaryotic and 2T2.1, we included other published fungal genomes deposited on JGI-MycoCosm genome portal that were sequenced using PacBio [36,37,[67][68][69][70][71][72][73], as well as close relatives of 2T2.1. For each genome, a self-BLASTp was conducted using all predicted proteins prior to removal of duplicates to identify orthologues by reciprocal best blast hits (minimum e value 1e−5). While the previous publications already identified P. coronata f. sp. avenae and P. striiformis f. sp. tritici assemblies to be dikaryotic [36,69], diploid PacBio assemblies were identified by: (1) analyzing the fraction of associate bases determined by Falcon [35], where any assembly with > 2% associate bases was considered a potential diploid and (2) calculating the fraction of 'alleles' present in each genome, where models were determined to be allelic if a secondary models were detected in regions on smaller scaffolds that were > 95% identical at the nucleic acid level and > 50% of the smaller scaffold was covered by these regions. In all cases included here (Linderina pennispora ATCC12442, Catenaria anguillulae PL171, and Rhizoclosmatium globosum JEL800), the percent of associate bases was > 20%, and correspondingly, > 20% of models were determined to be allelic (L. pennispora: 24.72%, R. globosum: 30.99%, and C. anguillulae: 37.09%), indicating that these assemblies are likely diploid. In contrast, in 2T2.1, the percent of associated bases determined by Falcon was 0.53% and only 18 of the 24,735 models (0.073%) fit our criteria to be considered potentially allelic.
Using mcl-identified orthologous gene clusters (see clustering of orthologous genes and phylogenomic comparisons, below), we further conducted an analysis of d N /d S across duplicated single-copy genes in 2T2.1. Following a similar approach to Mondo et al. [74], we aligned protein sequences using MUSCLE [75], converted to codon alignments using PAL2NAL [76] and then calculated pairwise d N /d S using the YN00 model [77] implemented in PAML v4.8 [78]. d N /d S distributions were similarly calculated between single-copy genes in related pairs of species (Lecythophora sp. AK0013 and Coniochaeta sp. PMI546, Coniochaeta sp. PMI546 and C. lignaria CBS111746, Coniochaeta sp. PMI546 and C. lignaria NRRL30616). To quantify similarities between genome-wide d N /d S distribution patterns in homeologs of 2T2.1 and orthologues across different species, QQ plot analysis was conducted using the EnvStats v2.3.1 package implemented in R version 3.5.1. The same approach was used when attempting to separate parents through comparing d S [29,39] between 2T2.1 duplicates and Lecythophora sp. AK0031, where any mcl cluster containing a single member from AK0031 and two copies in 2T2.1 were used. AK0031 was chosen for this analysis as it had the highest nucleotide conservation to 2T2.1 based on nucmer results.

Clustering of orthologous genes and phylogenomic comparisons
To perform phylogenomic comparisons, we selected 14 fungal genomes (including four from the Lecythophora/Coniochaeta lineage; and eight other Ascomycota, and two Basidiomycota species) that have been deposited on JGI-MycoCosm genome portal (Additional file 1: Table S1). The filtered protein models of each taxon were downloaded, and clusters of orthologous genes among the five Lecythophora/Coniochaeta genomes were detected using the software OrthoVenn [79]. Unique clusters of proteins found in the genome of Coniochaeta sp. 2T2.1 were then annotated using the dbCAN web server [80]. A species tree of Coniochaeta was generated using 2522 orthologous genes identified using mcl [40] that were aligned with MAFFT [81]. mcl clusters can be viewed interactively here: https ://mycoc osm.jgi. doe.gov/clm/run/Conio c1-Study .2509;zFSsa D?organ ism=Conio c1. Informative sites for phylogenetic purposes were extracted (1,096,767) from the alignment of each orthologous set using GBLOCKs [82], and then, maximum-likelihood phylogeny was re-constructed using both FastTree [83] and RAxML with (100 bootstrap replicates) [84]. Both phylogeny-reconstruction methods used the gamma rate distribution, WAGF substitution model and resulted in nearly fully supported phylogenies that showed the same topology.

CAZyme genome profile
Annotation of CAZymes in all the genomes evaluated in this study was performed using a combination of BLAST and HMMER searches conducted against the CAZy database [85]. To avoid an overestimation on the number of CAZymes detected in enriched/ depleted in the Coniochaetaceae, we removed secondary duplicated gene copies (see methods section: analysis of Coniochaeta sp. 2T2.1 genome with respect to duplication) for each CAZy family. For list of secondary duplicates, see Additional file 5: Table S4. Following family assignment, we identified CAZyme families that differed significantly (FDR corrected p ≤ 0.05) in abundance in Lecythophora/Coniochaeta genomes (Coniochaeta sp. 2T2.1, C. ligniaria CBS111746, C. ligniaria NRRL30616, Coniochaeta sp. PMI546 and Lecythophora sp. AK0013) compared with other fungal genomes using Fisher's exact test (two-tailed). To explore additional expansions/contractions in 2T2.1, we also determined which CAZy families from 2T2.1 were two standard deviations above or below of the mean counts compared to other Lecythophora/Coniochaeta genomes (CBS111746, NRRL30616, PMI546, and AK0013) and the other fungal genomes. The same analysis was also conducted including duplicated content (Additional file 5: Table S4). Moreover, LPMOs from family AA9 were extracted from 2T2.1, C. ligniaria NRRL30616 (Conlig1), T. reesei (Trire2), P. anserina (Podans1), and Phanerochaete chrysosporium (Phchr2) genomes and used for phylogeny reconstruction using the protocol listed above (see methods section: clustering of orthologous genes and phylogenomic comparisons). In addition, SignalP v.4.1 [86] was used to detect signal peptide cleavage sites in the AA9 proteins.

Transcriptomic analysis of Coniochaeta sp. 2T2.1 growing on different carbon sources
Strain 2T2.1 was cultivated in triplicate in 50 ml of MM containing either: 1% w/v raw wheat straw (autoclaved and cooled before inoculation) (WS), 1% w/v dilute-acidpretreated wheat straw solids (PTWS), or 1% w/v glucose (Glu). For cultures containing WS or PTWS, flasks were gently shaken and solids were allowed to settle, and then, the liquid fraction was removed by pipetting. The total RNA was extracted as described above when the cultures reached an optical density of 1.0 (OD 600 nm). Stranded RNAseq libraries were created and quantified by qPCR. RNA sequencing was performed using an Illumina HiSeq HiSeq-2500 1TB 1 × 101 instrument. Using BBDuk (https ://sourc eforg e.net/proje cts/bbmap /), raw reads were evaluated for artifact sequence by kmer matching (kmer = 25), allowing one mismatch and detected artifact were trimmed from the 3′ end of the reads. RNA spikein reads, PhiX reads, and reads containing any Ns were removed. Quality trimming was performed using the Phred trimming method set at Q6. Finally, reads under the length threshold were removed (minimum length 25 bases or 1/3 of the original read length-whichever is longer). Filtered reads from each library were aligned to the 2T2.1 reference genome (Conioc1) using HISAT2 version 2.1.0 [87]. HISAT2 searches for up to N distinct, primary alignments for each read, where N equals the integer specified with the − k parameter. Primary alignments mean alignments, whose alignment score is equal or higher than any other alignments. It is possible that multiple distinct alignments have the same score. However, for Coniochaeta sp. 2T2.1, we set k = 1, meaning that only unique primary alignments were included in downstream analysis. Across all libraries, 97.62% to 99.27% of reads mapped uniquely to the 2T2.1 genome, indicating that duplicated regions were sufficiently diverged to allow accurate read mapping. FeatureCounts [88] was then used to generate the raw gene counts file using gff3 gene models. Only primary hits assigned to the reverse strand were included in the gene counts (Additional file 8: Table S7 contains libraries and raw counts). Raw gene counts were used to evaluate the level of similarity between biological replicates using Pearson's correlation. DESeq 2 (version 1.18.1) [89] was subsequently used to determine which genes were differentially expressed between pairs of conditions. A table with the Log2 FC (fold change), adjusted pval (padj-value) and whether the gene is significantly and differentially expressed (TRUE/ FALSE/NA) for each pair of conditions was then generated. In addition, FPKM (fragments per kilobase million) and TPM (transcripts per kilobase million) normalized gene counts were obtained using the RNAseq gene expression analysis pipeline at the JGI.