Complete Genome Sequences and Genome-Wide Characterization of Trichoderma Biocontrol Agents Provide New Insights into their Evolution and Variation in Genome Organization, Sexual Development, and Fungal-Plant Interactions

ABSTRACT Trichoderma spp. represent one of the most important fungal genera to mankind and in natural environments. The genus harbors prolific producers of wood-decaying enzymes, biocontrol agents against plant pathogens, plant-growth-promoting biofertilizers, as well as model organisms for studying fungal-plant-plant pathogen interactions. Pursuing highly accurate, contiguous, and chromosome-level reference genomes has become a primary goal of fungal research communities. Here, we report the chromosome-level genomic sequences and whole-genome annotation data sets of four strains used as biocontrol agents or biofertilizers (Trichoderma virens Gv29-8, Trichoderma virens FT-333, Trichoderma asperellum FT-101, and Trichoderma atroviride P1). Our results provide comprehensive categorization, correct positioning, and evolutionary detail of both nuclear and mitochondrial genomes, including telomeres, AT-rich blocks, centromeres, transposons, mating-type loci, nuclear-encoded mitochondrial sequences, as well as many new secondary metabolic and carbohydrate-active enzyme gene clusters. We have also identified evolutionarily conserved core genes contributing to plant-fungal interactions, as well as variations potentially linked to key behavioral traits such as sex, genome defense, secondary metabolism, and mycoparasitism. The genomic resources we provide herein significantly extend our knowledge not only of this economically important fungal genus, but also fungal evolution and basic biology in general. IMPORTANCE Telomere-to-telomere and gapless reference genome assemblies are necessary to ensure that all genomic variants are studied and discovered, including centromeres, telomeres, AT-rich blocks, mating type loci, biosynthetic, and metabolic gene clusters. Here, we applied long-range sequencing technologies to determine the near-completed genome sequences of four widely used biocontrol agents or biofertilizers: Trichoderma virens Gv29-8 and FT-333, Trichoderma asperellum FT-101, and Trichoderma atroviride P1. Like those of three Trichoderma reesei wild isolates [QM6a, CBS999.97(MAT1-1) and CBS999.97(MAT1-2)] we reported previously, these four biocontrol agent genomes each contain seven nuclear chromosomes and a circular mitochondrial genome. Substantial intraspecies and intragenus diversities are also discovered, including single nucleotide polymorphisms, chromosome shuffling, as well as genomic relics derived from historical transposition events and repeat-induced point (RIP) mutations.

About 3 decades ago, the Trichoderma genus was divided into five taxonomic sections: Hypocreanum, Longibrachiatum, Pachybasium, Saturnisporum, and Trichoderma (31). Thanks to the rapid development of next-generation sequencing (NGS) and genome-wide annotation technology, researchers have gained more insights into the genomes of several Trichoderma species, including plant-cell-wall degradation enzymes, effector-like proteins and genes putatively involved in the biosynthesis of SMs (2,3,10,(32)(33)(34)(35)(36)(37). Because the lengths of NGS reads are too short to exclude unidentified nucleotides and assembly errors, further validation is required before these NGS-based genome sequences and resulting annotation data sets can be used systematically. In a worst-case scenario, false-positive assemblies can result in incorrect assignments of gene gain or loss. NGS-based genome sequences also cannot reveal accurate information about genome synteny, diversity, or evolution (21,38).
As in the filamentous fungal model organism Neurospora crassa (Sordariales, Ascomycota) (44)(45)(46) and QM6a (21), the putative centromeric loci in all seven Trichoderma genomes are the longest AT-rich blocks and the longest regions of each chromosome lacking an openreading frame (ORF) or putative protein-encoding genes (Fig. S4-S8 and Table S3). Using BLASTN with an E value of 1e-8 (identity .80%) (47), all putative centromeric loci contain an array of repeats that are either short repetitive sequences or the relics derived from historical transposition and RIP events. Notably, there are no or very few authentic transposons in all putative centromeres (Fig. S4-S8 and Table S2). This scenario is consistent with our recent finding that all seven putative centromeres of QM6a and CBS1-1 generated no or only a few RIP mutations upon sexual crossing (26).
Extensive gross chromosome rearrangements between the genomes of different Trichoderma species. Conserved synteny between the nearly complete Trichoderma genomes was revealed by ideogrammatic representations ( Fig. 1 and Fig. S1) or by CIRCOS plots (48) (Fig. S2). Genomes of the same species (i.e., QM6a, CBS1-2, and CBS1-2 or Gv29-8 and FT-333) or of the same section/clade (i.e., P1 and FT-101) display a higher degree of chromosome synteny than when different Trichoderma species or sections are compared, respectively (Fig. S2). Disruption of synteny often (but not always) occurs at long AT-rich blocks or even within centromeres ( Fig. 1) (see Discussion).
Transcription factors. In terms of overall numbers of transcription factors (TF) genes, the ranking is Table 1). The main differences are due to specific gain or loss of two transcription factor subfamilies, i.e., the fungal Zn(2)-Cys(6) binuclear cluster domain subfamily (InterPro identity: 111138) and the fungal-specific TF domain family (InterPro identity: 007219). Compared with the genomes of Gv29-8, FT-333, FT-101, and P1, the three T. reesei genomes possess a speciesspecific PAS-fold protein, but lack a helix-turn-helix TF (Table S4). The physiological functions of these unique TFs need to be further explored. It is important to note that gene numbers of other TF families are nearly identical among the seven genomes. Thus, these TFs constitute the "core" transcriptional regulatory circuitry of the genus Trichoderma.
Predicted signal peptide proteins. Signal peptides (SPs) are short amino acid sequences in the amino termini of many newly synthesized proteins that target to membranes or membrane-embedded export machines. The SignalP server has been widely used to predict the presence and location of SP cleavage sites in amino acid sequences of SP proteins (55,56). The three T. reesei genomes each possesses 840 to 870 SP proteins. In contrast, those of Gv29-8, FT-333, FT-101, and P1 encode 1,072, 995, 1,050, and 1,041 SP proteins, respectively. GO indicates that all seven Trichoderma genomes are highly enriched with SP proteins having catalytic, oxidoreductive, or FIG 3 Comparison of the nucleotide sequences within and around the mating-type loci in P1, FT-101, CBS1-1, CBS1-2, Gv29-8 and FT-333. The tracks between two strains are color-coded to indicate nucleotide sequence identity. The mating-type genes (mat1-1-1, mat1-1-2, mat1-1-3 and mat1-2-1) are dissimilar in sequence, but they are found at the same loci on the third chromosomes and are all flanked by two evolutionarily conserved genes, the DNA lyase apn2 and the complex I intermediate-associated protein 30 gene cia30.
hydrolytic activities (Table 1). SP proteins revealed by the SignalP server are not necessarily cleaved or cleavable in vivo. Further experimental investigations are needed to validate if they are indeed secreted proteins.
To address the intriguing question of how SM-BGCs form and are regulated, we compared the biosynthetic genes of the five best-characterized SM-BGCs (see reviews of [2,10]) to explore their commonalities and differences (Table S6). First, BGCs for siderophore (SID), ferrichrome (FRC), and conidial green pigment (CGP) are found in all four Trichoderma species we examined here. SID-BGCs and FRC-BGCs exhibit gene order conservation and contain none or only one AT-rich block (.500 bp) (Fig. S11 and S12). There are only three evolutionarily conserved genes in CGP-BGCs. Their neighboring sequences are quite variable and contain more AT-rich blocks. The neighboring genes of CGP-BGC in P1 are more conserved with those in FT-101, but they are completely different from those in Gv29-8 and CBS1-2 (Fig. S13). Second, the sorbicillinoid (SOR)-BGC is T. reesei-specific, as only T. reesei can secrete yellow sorbicillinoid compounds (59). The 59 and 39 termini of SOR-BGC are physically linked to usk1 and a well-characterized CAZ-GC harboring three CAZyme genes: axe1 (acetyl xylan esterase), cip1 (a CBM-containing auxiliary factor), and cel61a (previously named endoglucanase IV) (32) (Fig. S14). The entire chromosomal region (i.e., usk1-SOR-BGC-axe1-cip1-cel61a) contains 14 protein-encoding genes, but no AT-rich blocks. Although Gv29-8, P1, and FT-101 each possesses eight to 10 orthologs of those 14 protein-encoding genes, their orthologs are scattered across at least five different chromosomes (Fig. S14). Third, FT-101 possesses a terpene cyclase (TrA_010949) that is highly similar in amino acid sequence to the protein product of the T. brevicompactum tri5 trichothecene synthase gene (2). This putative tri5 gene and seven novel genes together constitute a potential trichothecene (TRI) or TRI-like BGC in FT-101 (i.e., SM-BGC-7.3) (Fig. S15). We were unable to identify or annotate this putative tri5 gene in P1, CBS1-2, or Gv29-8. It is also noteworthy that the protein products of these seven novel genes in FT-101 are different from the biosynthetic proteins encoded by the canonical TRI-BGCs in T. arundinaceum and T. brevicompactum (60,61). Interestingly, six orthologs of these seven novel genes also cluster together in P1, whereas CBS1-2 and Gv29-8 each possesses three and five orthologs of these seven novel genes, respectively (Table S6). Fourth, T. virens has two species-specific BGCs: viridin (VIR)-BGC and gliotoxin (GTX)-BGC. These two SM-BGCs produce a group of furano-steroidal antibiotics and GTX or GTX-like compounds to compete with and restrict the growth of plant pathogenic fungi, respectively (36,62). CBS1-2, P1, and FT-101 possess none or only two orthologs of the VIR biosynthetic genes, respectively (Table S6). Both GTX-BGCs in Gv29-8 and FT-333 contain 13 gli genes and are located at the far-right subtelomeres of their first chromosomes. Ten long AT-rich blocks separate these 13 gli genes. The order of these long AT-rich blocks is conserved in FT-333 and Gv29-8, but they exhibit highly variable lengths and sequences (Fig. S16A). Although T. reesei does not produce GTX during vegetative growth, QM6a and both CBS999.97 strains each possesses a gene cluster with only six and seven gli orthologs, respectively, i.e., SM-BGC-6.4 in CBS1-2 (Table S6 and Fig. S16B). gliH, an essential GTX biosynthetic gene, does not exist in any of the three T. reesei genomes. QM6a also lacks gliC, which encodes a CPY450 oxidoreductase that catalyzes the formation of C-S bonds. These C-S bonds constitute the internal disulfide bridge of GTX and other exopolysaccharide-type fungal toxins (63). The order of the gli genes in T. reesei is different from that in Gv29-8. There are no AT-rich blocks between the gli orthologs in T. reesei (Fig. S16B). These results suggest that the ancestral genomes of T. virens and T. reesei might have undergone multistep processes to acquire and/or reorganize their GTX or GTX-like BGCs. Further investigations will be needed to address the physiological roles of the incomplete GTX-BGCs in T. reesei.
T. reesei Sor7 and T. virens GliT are secreted proteins. Most of our knowledge about the molecular mechanisms of GTX biosynthesis has come from studies of the GTX-BGC in Aspergillus fumigatus, an opportunistic fungal pathogen. A. fumigatus GliT is an intracellular FAD-dependent dithiol oxidase that converts the reduced GTX (i.e., dithio-GTX) to GTX, thus conferring host self-resistance toward highly reactive dithio-GTX and tolerance to exogeneous GTX (64,65). We found that the GliT proteins in FT-333 hosts an SP, whereas that of A. fumigatus does not (Fig. S17). The GliC proteins in FT-333 and Gv29-8, like that of A. fumigatus, are predicted to possess signal peptides. In contrast, that of CBS1-2 does not (Fig. S18). Accordingly, the biosynthetic and regulatory mechanisms of GTX-or GTX-like BGCs in T. virens and T. reesei likely differ from those of A. fumigatus. Our annotation results also reveal that the Sor7 FAD-linked oxireductase of T. reesei hosts an SP (Fig. S19). Because the SP-hosting proteins predicted by the SignalP server are not necessarily cleaved or cleavable in vivo, we applied a proteomics approach to analyze them in the culture filtrate (CFs) of CBS1-2, FT-333, Gv29-8, FT-101, and P1, respectively, including by means of ammonium sulfate precipitation, in-solution trypsin digestion and LC-MS-MS analysis (SI, Materials and Methods). Although most of the secreted proteins we identified in CFs are functionally unannotated, our results confirm that T. virens GliT and T. reesei Sor7 are indeed secreted proteins (SD, DS11 and Table S7). Further investigations will be needed to reveal the functional roles of these two FAD-linked oxireductases in regulating biosynthesis and the functions of GTX and sorbicillinoids, respectively.
Our proteomics approach also identified two fungal secreted proteins, i.e., T. virens small protein 1 (Sml) and T. asperellum KatG2 catalase-peroxidase 2. T. virens Sm1 is a secreted elicitor-like protein that induces plant defense responses and systemic resistance by triggering production of reactive oxygen species. Sm1 lacks toxic activity against plants and microbes (66). The functional roles of KatG2 in T. asperellum remain unclear. KatG2 was originally identified as an extracellular protein in several phytopathogenic fungi (67). In Fusarium graminearum (Hypocreales, Ascomycota), KatG2 is exclusively located on the cell wall of invading hyphal cells and contributes to its pathogenicity by alleviating oxidative stress in the vicinity of invasion hyphae (68).
Genome-wide transcriptomic analyses reveal evolutionarily conserved and diverse genes contributing to fungal-plant interactions. Trichoderma spp. can colonize plant roots, both externally and internally. Induction of plant defense via fungal-plant interactions is considered one of the most important mechanisms of Trichoderma-mediated biological control. Michael Kolomiets and colleagues performed RNA sequencing (RNA-seq) analysis on the roots of maize seedlings grown in hydroponic conditions and treated with T. virens Gv29-8 at 6 h and 30 h (69,70). Notably, the time points represented fungal-plant recognition at 6 h and advanced fungal colonization at 30 h. Gv29-8 undergoes global repression of transcription upon recognition of maize roots and then induces expression of a broad spectrum of genes during root colonization (69). Using the complete genome sequence of Gv29-8 as a reference, we reanalyzed the previously acquired transcriptomic data sets (69,70) to reveal 365 and 2,082 T. virens Gv29-8 genes that are transcriptionally upregulated (fold change [FC] $ 3 and P , 0.05) at the 6-h and 30-h time points, respectively (Table S8 and SD, DS12 and DS13). The protein products of 365 putative fungal-plant recognition genes (FPRGs) include 42 predicted SP proteins, 13 proteases, 14 CAZymes, and 75 membrane proteins. Three SM-BGCs and one CAZ-GC (1.4) were specifically upregulated at 6 h. The protein products of 2,082 putative advanced fungal colonization genes (AFCGs) include 349 predicted SP proteins, 100 proteases, 158 CAZymes, 433 membrane proteins, and 433 TFs. Almost 80% of SM-BGCs and CAZ-GCs in T. virens have at least one gene that was transcriptionally upregulated at 30 h. Notably, most genes in GTX-BGC-1.1 and CPG-BGC-10.1 are AFCGs, whereas all those in VIR-BGC-5.5 and SID-BGC-6.4 are not (Table S8). Further investigations will be needed to reveal whether and why CPG, GTX, or related SMs have nonantibiotic functions during advanced fungal colonization. Moreover, we found by genome-wide BLASTp searches that 190 (52%) FPRGs and 1,037 (50%) AFCGs are evolutionarily conserved in the genomes of CBS1-2, FT-101, P1, and FT-333, respectively (Fig. 2D, Table S9 and SD, DS12-DS14), prompting us to infer that different Trichoderma species might utilize both conserved and diverse SMs, proteins or signaling pathways to mediate fungal-plant interactions.
Mitogenomes and nuclear-encoded mitochondrial sequences. Our results also reveal the circular mitogenomes of all seven Trichoderma strains (SD, DS9). We reported recently that T. reesei mitochondria are inherited maternally (27), so the mitogenomes of CBS1-1 and CBS1-2 should be identical. Indeed, we found that their sequence differences were attributable to sequencing read errors in 18 polyhomonucleotide runs. The lengths of the other six mitogenomes vary significantly, although both number and order of proteinencoding genes, tRNAs and rRNA are all well-conserved. We found that mobile genetic elements play key roles in shaping the mitogenomes of Trichoderma (Fig. S20-S22 and SD, DS15).
Although our results of Trichoderma mitogenomes are consistent with those of previous studies (2,21,71,72), we have discovered that the mitogenomes of Gv29-8 and FT-333 each harbor three nuclear-encoded mitochondrial sequences (NUMTs). NUMTs, first detected in the mouse genome (73), have now been found in several other eukaryotes (74)(75)(76). NUMT integration has been implicated in increasing genetic diversity and facilitating genome evolution (77,78). NUMTs have yet to be explicitly reported among filamentous fungi.  (Fig. S23). The corresponding sequences of these three NUMTs localize within two mitochondrial NADH dehydrogenase subunit genes (nad5 and nad6) and a mitochondrial noncoding sequence, respectively (Fig. S21). Second, these three NUMTs are unlikely to represent mitochondrial DNA contamination arising from sequence assembly errors because the corresponding nucleotide sequences were observed in several long raw reads generated by the nanopore sequencer. Third, all three NUMTs in FT-333 and Gv29-8 are located within an AT-rich block (;1,500 bp in length) that lacks protein-coding sequences (Fig. S24). It was reported previously that mammalian NUMTs tend to be inserted near retrotransposons and that the insertion sites often represent DNA sequences with high DNA "bendability" and lie immediately adjacent to AT-rich sequences (79). Therefore, filamentous fungi and mammalian cells may display an evolutionarily conserved origin for NUMTs (79).

DISCUSSION
Although .50 different Trichoderma genomes have been determined by NGS technology, almost all of the NGS genomic drafts are far from nearly complete or chromosome-level assemblies due to the shortcomings of short NGS reads and the exceedingly low-complexity nature of Trichoderma genomes. In contrast, the seven nearly complete Trichoderma genomes we have determined by TGS technology represent the highest quality yet achieved. The greatest benefits of precise genome assemblies are that they enable accurate determination of structural components in each genome (e.g., telomeres, centromeres, interspersed AT-rich blocks and authentic transposable elements), as well as the chromosome synteny between different Trichoderma species. First, the results of our comparative genomic analysis reveal that all seven genomes likely underwent extensive transposon invasions followed by RIP mutations (Fig. S3), explaining why the overall numbers of authentic transposons (i.e., which have not been extensively mutated or degenerated by RIP) in these seven genomes are relatively low (Table S2). Second, we reported previously that the longest AT-rich blocks in all seven QM6a chromosomes are likely their centromeres, as they collectively harbor 24 centromere-specific and conserved satellite repeats (21). Here, we further demonstrate that the putative centromeric loci in all seven Trichoderma genomes are not only the longest AT-rich blocks but also the longest regions of each chromosome lacking an open-reading frame (ORF) or putative protein-encoding genes (Fig. S4-S8 and Table S3). Third, the genomes of different Trichoderma species exhibit extensive chromosome shuffling. Disruption of chromosome synteny often but not always occurs near or within AT-rich blocks or even the predicted centromeres (Fig. 1). One possibility is that mobilization of ancient retrotransposons might induce chromosome rearrangements. Alternatively, chromosome translocations via recombination of AT-rich blocks or centromere scission (80) may drive chromosome shuffling to reshape the genomes of different species. In this regard, retrotransposition and RIP might act as a critical driver of genome reorganization, eventually leading to genome diversification within a species (microevolution) and between species (macroevolution). Consequently, the newly generated strains harboring chromosome translocations may fail to undergo successful sexual reproduction with the parental wild-type strain because extensive chromosomal heterozygosity results in recombination suppression during meiosis, i.e., via nonhomologous pairing, synaptic adjustments or shifts in recombination (or chiasma) positions (81,82).
Locally, the order of biosynthetic genes, AT-rich blocks, and retrotransposons in SM-BGCs provides detailed insights into the mechanisms of genome evolution. Our data indicate that evolutionarily conserved SM-BGCs often lack or only contain a few longer AT-rich blocks ($500 bp). In contrast, the GTX-BGCs in FT-333 and Gv29-8 have 10 such blocks. This unique property of GTX-BGCs had not been reported before because NGS sequence reads are too short to assemble chromosomal regions of low sequence complexity (36). We suggest that GTX-BGCs represent versatile systems for future experimental evolution studies to reveal the functional impacts of AT-rich blocks for the expression and regulation of GTX biosynthesis. Also noteworthy is that orthologs of GTX biosynthetic genes are randomly allocated across the genomes of T. atroviride and T. asperellum. The gene order of the putative GTX-BGC in T. reesei also differs from that of the authentic GTX-BGCs in T. virens (Fig. S16). We conclude that diverse and/or multistep mechanisms were involved in the formation of different SM-BGCs during evolution (83), such as lateral gene transfer, unequal crossing over, or transposon-mediated gene transfer. Lateral gene transfer (also called horizontal gene transfer) represents the acquisition of genetic material from another organism (e.g., bacteria), whereas unequal crossing over involves the exchange of sequences between chromosomes. Because the GTX-BGCs of both FT-333 and Gv29-8 are located in the subtelomeres of their first chromosomes, it would be interesting to determine if BIR (42,43) or other molecular mechanisms might play critical roles in generating or maintaining this T. virens-specific gene cluster.
High-quality and near complete genome sequences also ensure accurate and comprehensive genome annotation. Based on NGS-based genome sequences, it was reported previously that T. reesei and its ancestor might have undergone rapid gene loss relative to several other of the most common Trichoderma species (37). Notably, a smaller gene inventory might result from less intensive expansion of certain gene families rather than profound loss of distinct protein-encoding genes. In contrast, our genome-wide annotation data indicate that QM6a, CBS1-1, CBS1-2, and FT-101 encode more near-universal single-copy orthologs (NUSCOs) than the genomes of P1, Gv29-8, and FT-333, with the BUSCO protein metrics of QM6a, CBS1-1, CBS1-2, and FT-101 being higher respectively than those of P1, Gv29-8, and FT-333 (Table 1). BUSCOs are ideal for quantifying genome and proteome completeness, as the assumptions for these NUSCOs are evolutionarily sound (84). The genome sizes of QM6a, CBS1-1, and CBS1-2 are much smaller than those of FT-101, P1, Gv29-8, and FT-333. Thus, the genomes of T. reesei and its ancestor might not have experienced rapid gene loss. Instead, compared with the genomes of QM6a and the two CBS999.97 strains, the genomes of P1, Gv29-8, and FT-333 appear to have experienced both gene loss and extensive gene family expansions during evolution. In contrast, the genome of FT-101 only underwent extensive gene family expansions and not gene loss (Table 1). These findings represent a striking example of why TGS-based genomes are superior to NGSbased genomes for comparative and functional studies. Because the overall numbers of interspersed AT-rich blocks ($500 bp) in QM6a, CBS1-1, and CBS1-2 are much lower (50% to 70%) than those in FT-101, P1, Gv29-8, and FT-333, we further infer that the larger genome sizes of these latter and their more extensive gene family expansions might have arisen from more profound mobilization and RIP of ancient retrotransposons in their ancestral genomes.
In this study, we have shown by comparative genomic and transcriptomic analyses that different Trichoderma species might utilize both conserved and diverse proteins or signaling pathways to mediate fungal-plant interactions (Fig. 2D). Our genome-wide annotation results also have revealed a comprehensive array of SM-BGCs and CAZ-BGCs in each of the four Trichoderma species we considered herein (Fig. 4). The majority of SM-BGCs, CAZ-BGCs, and effector-like SP proteins are poorly understood and, consequently, their corresponding products and physiological functions are largely unknown. Because a variety of genome mining methods and tools have been developed to guide characterization and activation of gene clusters (85)(86)(87), the nearly complete genome resources provided in this study will help discover and/or functionally characterize new SMs, CAZymes, and effector-like SP proteins, as well as their roles in and underlying evolutionary mechanisms of mycoparasitism, fungal-plant interactions and biocontrol activities.
Conclusion. Our study highlights that high-quality and nearly complete genome sequences are necessary tools for accurate comparative and functional genomic analyses. The data we have presented herein provide numerous insights into fundamental biology and industrial biotechnology.
Whole-genome DNA sequencing, RNA sequencing, and whole-genome gene prediction. Isolation of genomic DNA and RNA, as well as PacBio single-molecular real-time (SMRT) genome sequencing and assembly, were carried out as described previously (21,38,88). Oxford Nanopore direct DNA library preparation and sequencing were performed at the Genomic Core Lab of the Institute of Molecular Biology, Academia Sinica. In brief, small fragments of purified genomic DNA were removed by using the Ampure cleanup kit (Beckman Coulter). High-quality DNA quantification was conducted using a Qubit fluorometer (Thermo Fisher Scientific). The average length of genomic DNA was scaled using Femto Pulse (Agilent). We used 1 mg of genomic DNA for library construction. The library contained three or four DNA samples barcoded using the EXP-NBD104 Rapid barcoding kit (Oxford Nanopore Technologies), and adaptors were added with the SQK-LSK109 ligation sequencing kit (Oxford Nanopore Technologies). The library was loaded onto a R9.4.1 Flow Cell system (FLO-MIN106) and processed for 48 h on the MinION platform. The sequencing process was controlled using the MinKNOW software (version 3.6.5; Oxford Nanopore Technologies). All experimental procedures were carried out according to the manufacturer's instructions. FAST5 data files were generated upon completion of sequencing. These files were converted into FASTQ files in Guppy (v3.6.0). All reads were split into separate FASTQ files based on their barcode by means of the qcat tool (v1.1.0), and then Canu (v2.1.1) (89) was used to perform whole-genome assembly on the corresponding FASTQ files for each sample. The parameter of corOutCoverage was set to 60 based on sequencing read depth. Finally, we used medaka (v1.2.0) to polish the assemblies with the raw reads. Because Canu generated expected numbers of contigs for each sample, and a comparison of the genome to wild-type QM6a indicated no broken contigs, we did not perform any manual finishing or validation. The completeness of genome assemblies was evaluated in BUSCO (v4.0.6) (90) against the database of fungi_odb10 (OrthoDB; https://www.orthodb.org). All other experimental procedures have been described previously in detail (38), including PacBio SMRT genome sequencing, NGS-based RNA sequencing, as well as genome-wide gene prediction in Funannotate (91). BUSCO was also used to quantitatively measure the completeness of genome assemblies and gene predictions. We selected evolutionarily informed expectations of gene content based on the Ascomycota odb9 database from OrthoDB (https://www.orthodb.org). Genome or protein matrix scores .95% for model organisms are generally deemed complete reference genomes (92).
Analysis of genome-wide synteny. For comparative genome analyses and to identify duplicated regions, we identified orthologous gene pairs using the annotation results generated in Funannotate v1.8 (91). Alignments were performed using BLASTP with an expect value (E) # 1e-20. CIRCOS (48) was used to display sequence similarity and conservation. The inner ribbon track of CIRCOS outputs is used to show synteny, whereas the exterior tracks quantify the degree of sequence conservation between the T. reesei CBS1-2 genome and those of the three other Trichoderma species.
Comparative genomic analysis. Funannotate (91) not only parses the protein-coding models from the annotation, but also identifies numbers and classes of carbohydrate-active enzymes (CAZymes), proteases, transcriptional factors, secreted proteins, polyketide synthases (PKSs), and nonribosomal peptide synthetases (NRPSs). To predict secreted proteins, we downloaded the SignalP 4.1 (93), TMHMM 2.0 (94), and big-PI Fungal Predictor (95) programs into "Funannotate" and then applied them with default settings. In this study, effector candidate proteins were defined as predicted secreted proteins (with a signal peptide present, but no transmembrane domains or glycosylphosphatidylinositol anchors) having lengths of ,300 amino acids. To assess if effector candidates presented similarity to known proteins, we performed BLASTP analysis (cutoff E-value 10 25 ) using the Swiss-Prot database (downloaded October 22, 2016).
CAZymes were reannotated using the dbCAN2 meta server (http://bcb.unl.edu/dbCAN2) (96). We have developed a software tool "IBM-CAZ" to predict potential CAZyme gene clusters (CAZ-GCs) with the following requirements: a potential CAZ-GC must contain $3 CAZyme genes or $2 CAZyme genes with $1 other specific signature genes (namely, transporters or transcription factors), and be present within #2 intergenic distances. Although the prediction requirements in IMB-CAZ are more stringent than those of dbCAN2 (96) and dbCAN-PUL (97), IBM-CAZ was able to identify all four previously identified CAZ-GCs in Trichoderma reesei QM6a, the ancestor of all currently used cellulase-producing mutants (32).
Proteomics analysis of secreted proteins. To identify proteins secreted by different Trichoderma strains, we germinated 1 mL of conidia (OD 600nm = 0.3) and cultured it at 25°C in 250 mL potato dextrose broth (PDB) medium in shake flasks at 120 rpm for 24 h. The vegetative mycelia and conidia were removed from culture medium by filtering through a 0.22-mm filter (Merck Millipore, Darmstadt, Germany). The proteins in the culture filtrate (CF) were precipitated by adding ammonium sulfate (AS) to 80% saturation at 4°C. After centrifugation at 10,000 Â g for 30 min at 4°C, the AS precipitates were recovered and then resuspended in 500 mL of 50 mM Tris-HCl (pH 7.5). Protein concentration was determined using Bradford reagent (Sigma-Aldrich).
For in-solution trypsin digestion, the target proteins (20 mg) were reconstituted in 100 mM Tris-HCl (pH 8.0) and 5% acetonitrile, mixed with 1/10 volume of 100 mM 1,4-dithiothreitol in 100 mM Tris-HCl (pH 8.0), and incubated at 37°C for 1 h. Then, 1/10 volume of 550 mM iodoacetamide in 100 mM Tris-HCl (pH 8.0) was added to the mixture, gently vortexed, and incubated at room temperature for 1 h. Promega Sequencing Grade Modified Trypsin was added to give a final substrate:trypsin ratio of 50:1. The digestion reaction was carried out overnight at 37°C and then the reaction was stopped by adding 5% formic acid to adjust the pH of the solution to below pH 6.0. We determined pH by placing 1-mL aliquots onto pH paper. The peptide digestion products were desalted using C 18 Zip-Tips (Millipore, ZTC 18M 096). The peptide solutions were dried down in a SpeedVac vacuum concentrator (Thermo Fisher Scientific) and stored at 220°C before undergoing mass spectrometric analysis.
Liquid chromatography-mass spectrometry (LC-MS-MS) was applied to analyze trypsin-digested peptides. LC-MS analysis was performed by Shu-Yu Lin (Institute of Biological Chemistry, Academia Sinica) using an EASY-nLC 1200 system linked to a Thermo Orbitrap Fusion Lumos mass spectrometer equipped with a Nanospray Flex ion source (Thermo Fisher Scientific) located at the Academia Sinica Common Mass Spectrometry Facilities (https://www.ibc.sinica.edu.tw/facilities/mass-spectrometry-facilities/). All related experimental procedures were described previously in detail (98). Proteomic data were searched against our whole-genome annotation data sets (see below) using the Mascot search engine (v.2.6.2; Matrix Science, Boston, MA, USA) in Proteome Discoverer (v 2.2.0.388; Thermo Fisher Scientific, Waltham, MA, USA). We used the search criterion trypsin digestion. The fixed modification was set as carbamidomethyl (C), and variable modifications were set as oxidation (M), acetylation (protein N-terminal) allowing up to two missed cleavages, and mass accuracy of 10 ppm for the parent ion and 0.6 Da for the fragment ions. The false discovery rate (FDR) was calculated with the Proteome Discoverer Percolator function, and identifications with an FDR . 1% were rejected.
Data availability. The complete genome sequences, annotation, and raw data sets have been deposited in the National Center for Biotechnology Information (https://www.ncbi.nlm.nih.gov/bioproject/) under accession number PRJNA700774 (Table S1). All study data are included in the article as well as in SI and SD appendixes. All genomic databases data and 15 supplemental data sets are also available at the following websites: https://github.com/tfwangasimb/Trichoderma-biocontrol/releases.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only. SUPPLEMENTAL FILE 1, PDF file, 3.5 MB.