Genome-scale phylogeny and comparative genomics of the fungal order Sordariales

The order Sordariales is taxonomically diverse, and harbours many species with different lifestyles and large economic importance. Despite its importance, a robust genome-scale phylogeny, and associated comparative genomic analysis of the order is lacking. In this study, we examined whole-genome data from 99 Sordariales, including 52 newly sequenced genomes, and seven outgroup taxa. We inferred a comprehensive phylogeny that resolved several contentious relationships amongst families in the order, and cleared-up intrafamily relationships within the Podosporaceae . Extensive comparative genomics showed that genomes from the three largest families in the dataset ( Chae-tomiaceae , Podosporaceae and Sordariaceae ) differ greatly in GC content, genome size, gene number, repeat percentage, evolutionary rate, and genome content affected by repeat-induced point mutations (RIP). All genomic traits showed phylogenetic signal, and ancestral state reconstruction revealed that the variation of the properties stems primarily from within-family evolution. Together, the results provide a thorough framework for understanding genome evolution in this important group of fungi.


Introduction
The order Sordariales (Ascomycota) is one of the most taxonomically diverse groups within the Sordariomycete fungi (Huhndorf et al., 2004).The order is of economic and ecological importance and contains species inhabiting a wide variety of natural habitats (Huang et al., 2021;Huhndorf et al., 2004;Hyde, 2020).The order also includes well-known model-organisms such as Neurospora crassa and Podospora anserina, both of which have been key-players in important scientific discoveries (Davis & Perkins, 2002;Gladieux et al., 2020;Roche et al., 2014;Silar, 2020).It furthermore contains species producing a diversity of biologically active secondary metabolites with interesting drug-like properties (Charria-Girón et al., 2022;Noumeur et al., 2020), and the highest known number of thermophilic species, which have large industrial relevance (Hutchinson et al., 2019;Patel & Rawat, 2021;van den Brink et al., 2015).
The order Sordariales was first described in 1960 by Chadefaud and validated by Hawksworth and Eriksson based on morphological data (Chadefaud & Emberger, 1963;Hawksworth & Eriksson, 1986), after which Huhndorf et al (2004) made an initial attempt to resolve the phylogenetic relationships within the order based on LSU sequence analysis.Since then, several studies have been performed on the Sordariales in order to delimitate its families and their largest genera (e.g.Cai et al., 2006;Kruys et al., 2015;Miller & Huhndorf, 2005;Wang et al., 2019).Past phylogenetic studies of the Sordariales have utilized manytaxa/few-genes approaches that have substantially advanced our understanding of phylogenetic relationships inside the order.Over time, however, lack of resolution has remained an issue and many parts of the tree have remained poorly resolved (see e.g.Ament-Velásquez et al., 2020;Huang et al., 2021;Marin-Felix & Miller, 2022;Wang et al., 2019).
In 2004, Huhndorf et al. (2004) divided Sordariales into three families: Chaetomiaceae, Lasiosphaeriaceae and Sordariaceae.Since then, several lasiosphaeriaceous taxa were reassigned to establish the additional families Diplogelasinosporaceae, Naviculisporaceae, and Schizotheciaceae, with the remaining species placed in Lasiosphaeriaceae sensu lato (Marin-Felix et al., 2020).Around the same time, the family Podosporaceae was introduced to accommodate the Podospora type species and was further divided into three main clades (Wang et al., 2019).However, the branching order of these three clades and the taxonomy of the Podospora type species have to date remained unresolved (e.g.Ament-Velásquez et al., 2020;Silar, 2020;Vogan et al., 2021).Recently, Huang et al. (2021) introduced an additional five new families to the order (Huang et al., 2021), although ongoing debates about this taxonomic classification remain (Charria-Girón et al., 2022;Marin-Felix and Miller, 2022).Thus, obtaining and investigating additional sequence data has been a high priority to further determine the phylogenetic and taxonomic affinities in the Sordariales (Huang et al., 2021;Hyde, 2020;Marin-Felix and Miller, 2022).
In addition to resolving taxonomic ambiguities, a robust phylogenetic framework of Sordariales facilitates comparative genomic analysis, which is required to identify the key factors underlying genome evolution in Sordariales.Properties such as genome size, gene number and GC content can vary widely amongst different organisms (Li and Du, 2014).Such genomic properties are often found to correlate with lifestyles and co-vary with each other.For example, recent comparisons of trait values between subdivisions Saccharomycotina and Pezizomycotina showed that evolutionary rate, GC content, genome size, and number of proteincoding genes were highly variable (Shen et al., 2020).Furthermore, covariation is seen between numerous genomic traits and repeat-induced point mutation (RIP), a process unique to fungi that gives rise to multiple G:C -> A:T mutations in repeated sequences (see e.g.Borkovich et al., 2004;Galagan and Selker, 2004;Gladyshev, 2017;Clutterbuck, 2011).Another example is the strong correlation between genome size and transposable-element content.At the dawn of large-scale sequencing, this was one of the first traits to show covariation in eukaryotes (Gregory, 2005), and today the correlation is well established in fungi (Fouché et al., 2022;Mat Razali et al., 2019;Oggenfuss et al., 2021).
An overview of genomic features across the Sordariales will furthermore increase knowledge on the impact of fungal lifestyle on genome evolution and covariation of genomic properties.Podospora anserina, for example, has a coprophilic nature, believed to select for rapid sexual reproduction (Geydan et al., 2012;Griffiths, 1992;Scheckhuber and Osiewacz, 2008), leading to short generation times in the family.Many studies have found negative correlations between evolutionary rate and generation time in eukaryotes (e.g.Thomas et al., 2010;Welch et al., 2008), but no research has been done comparing the evolutionary rates of Podosporaceae to sister groups with known longer generation times.Similarly, the Chaetomiaceae is well known for having the highest described number of thermophilic species (Geydan et al., 2012).Genomes of three thermophilic fungal species appear to be characterized by a relatively small size, high G:C content, and low levels of repetitive DNA (Patel and Rawat, 2021;van Noort et al., 2013), but how general these characteristics are for thermophilic versus nonthermophilic species is not yet known.
In this study, we have used whole-genome data from 99 Sordariales strains that span the diversity of nine families (Huang et al., 2021;Marin-Felix and Miller, 2022).Together with seven outgroup genomes, this creates a more comprehensive phylogenomic dataset than previously available.We used the resulting phylogeny as a framework for analyses of relatedness and evolution of genomic properties in this group of important fungi.

Genome sequencing and assessment of genome assemblies
A variety of different methods were used to cultivate the fungi, extract DNA, sequence, assemble and annotate the 52 genomes (Supplementary methods, Methods datafile; see Table S1 for the respective methods for each genome).After assembly and annotation of all genomes, we assessed the quality of the genome assemblies using BUSCO V5.2.2 with the Augustus gene predictor for eukaryote runs (Manni et al., 2021a(Manni et al., , 2021b;;Stanke et al., 2006).Using BUSCO, each assembly's completeness was assessed on the basis of the presence or absence of a set of 3817 predefined orthologs from the sordariomycetes_odb10 database (Kriventseva et al., 2019).
Potential contaminations of the assemblies were determined using the Blobtools2 workflow (Challis et al., 2020) as follows: first, the FASTA assembly files were used to create basic, per-sequence statistics (e.g.length, GC proportion).Next, contigs of each assembly were queried using BLAST v2.11.0+ (Altschul, 1997) with the blastn algorithm and the settings specified in the BlobToolKitPipeline (E = 1x10 -25 , max target sequences = 10, max hsps = 1, 16 threads).The BlobTools approach provides taxonomic annotation for each sequence in an assembly.To avoid taxonomic inference for longer contigs being dependent on a single region, the script divides contigs longer than 100 kb into chunks before running blastn.By using the setting "-taxrule bestsumorder", the taxonomic assignment is then based on the total bitscore obtained from a single database search with the NCBI taxonomy new_taxdump database (date: 20220129; Challis et al., 2020;Schoch et al., 2020).Based on information given by blastn searches and BUSCO analysis, the BlobToolKit viewer was used to interactively explore each assembly for contamination.

Phylogenetic datamatrix
To construct the phylogenetic datamatrix, we used the set of 3817 single-copy, full-length BUSCO genes from the 99 representatives of Sordariales and the seven outgroups.We removed BUSCO genes whose taxon occupancy was ≤50 % (i.e., when the gene was present in <45 Sordariales genomes).This removal resulted in a dataset with 3800 (i.e., 99.55 % of the original 3817) BUSCO genes.
To account for the underlying codon structure of our protein-coding nucleotide sequences, the combined sequences were aligned using an amino acid guided nucleotide alignment with MACSE V2.05 (Ranwez et al., 2018).The standard translation code was used (Elzanowski and Ostell, 2019).The nucleotide sequences were individually trimmed using TrimAl V1.4 with the default settings of the "gappy-out" parameter (Capella-Gutierrez et al., 2009).To validate our ortholog selection pipeline, a random subset of 40 gene alignments (selected using the bash command shuf -n 40) was visually examined, which revealed no ambiguously aligned sequences.
IQ-TREE (v2.1.4_beta)was run on a single node with three logical cores for each of the 3800 BUSCO gene alignments.The options "-m TEST -runs 10" were used to find the best-fitting substitution model, using the best-scoring ML gene tree under 10 independent tree searches (Minh et al., 2020b).A concatenation-based tree was created with IQ-TREE on a single node with three logical cores under the GTR + F + I + G4 model, as 1582 of the 3800 genes favoured this as best fitting model (Suppl.Table S2).The support for each internal branch was evaluated using 1000 ultrafast bootstrap (UFBS) replicates, the seed was manually set to 33822.Next, we inferred a coalescence-based phylogeny using the set of 3800 individual ML gene trees and ASTRAL-III V5.7.8 (Zhang et al., 2018).The reliability of each internal branch was evaluated with local posterior probability (PP; default settings).
Subsequently, gene concordance factors (gCF; setting -gcf) and site concordance factors (sCF; setting -scf 100 -T 10) were estimated for each branch using IQ-TREE (Minh et al., 2020a).Here, gCF is defined as the percentage of gene trees containing that branch, while the sCF is defined as the percentage of decisive alignment sites supporting a branch in the reference tree (Lanfear, 2018;Minh et al., 2020a).The gCF and sCF concordance factors complement ultrafast bootstrap and posterior probability estimates by offering additional information about the underlying variability in the data (Minh et al., 2020a).All measures of support were evaluated and compared between the coalescence-and concatenation-based tree.Phylogenetic trees were visualized using the online iTOL tool (Letunic and Bork, 2021; V6.5.8).

Genomic properties
For a given genome, the length was estimated as the total number of base pairs in the genome assembly (total scaffold length).The number of genes was calculated by using the assemblies in nucleotide format and the gff annotation files, removing genes that contained introns that overlap with CDS (Suppl.Table S6) and subsequently using the script gag (Hall et al., 2014) to create fasta files of the coding sequences.Quality of gene number estimations was analysed with BUSCO completeness of the coding regions, using the sordariomycetes_odb10 database and default settings.The evolutionary rate is given as a number of nucleotide substitutions per site, taken as a sum of path distances from the base of the Sordariales to the tip on the concatenation-based tree.The repeat content of each genome was assessed with RepeatModeler (V1.0.8) and RepeatMasker (V4.1.0);by first building a database with the BuildDatabase function, followed by running RepeatModeler with default settings and subsequently running RepeatMasker with the settings -pa 10 -gccalc -gff (Smit et al., 2013;Smit and Hubley, 2008).Lastly, the extent of RIP in the genomes was assessed with the web-based tool RIPper with the option "RIP profile".In this program, a sliding window approach is used with a 1000 bp window size and 500 bp slide size in order to assess the percentage of RIP affected regions in the genomes (van Wyk et al., 2019).
Taxon sampling was most dense across the Chaetomiaceae, Podosporaceae and Sordariaceae.Therefore, we performed comparative genomic analyses amongst these three groups.Tests of normality were performed for each genome property using SPSS V28, and the results showed that these were non-normally distributed within each group, and over the entire dataset.As such, statistical comparisons of trait values amongst these three families were done using Kruskal Wallis Tests for independent samples with SPSS V28 (IBM Corp, 2021).Genome properties were plotted using R version 4.2.1 in Rstudio, with package ggplot2 V3.3.6 (R Core Team, 2022;Rstudio Team, 2022;Wickham, 2016).

Assessment of phylogenetic signal and ancestral state reconstruction
Pearson's correlation coefficient was used to test for correlations among the trait variables on the entire phylogeny without outgroups.To correct for phylogenetic dependence of species traits, the R package ape V5.6-2 (Paradis and Schliep, 2019) was used to compute phylogenetically independent contrasts (Felsenstein, 1985).We analysed phylogenetic signal of trait variables with four indices: Abouheif's Cmean, Moran's I, Blomberg's K and Pagel's λ, using R-package phylosignal V1.3 with the function phylosignal (Keck et al., 2016;Münkemüller et al., 2012).Both Abouheif's Cmean and Moran's I are developed in the context of spatial correlation and are not useable for effect size measure.However, stronger deviations from zero indicate stronger relationships between trait values and the phylogeny.The other two measures of phylogenetic signal, Blomberg's K and Pagel's λ, specifically relate to a Brownian motion model of trait evolution.For these indices, a value close to zero indicates phylogenetic independence and values closer to one indicate strong phylogenetic signal.These four indices give information about the general presence or absence of phylogenetic signal within the phylogeny.However, traits rarely evolve similarly across the phylogeny and phylogenetic signal can be scale dependent and vary among clades.Therefore, we computed local Moran's I values with phylosignal V1.3 to detect hotspots of phylogenetic correlation (Anselin, 1996;Keck et al., 2016).
Lastly, we computed ancestral character states for the traits across internal nodes using the R package Phytools V0.6.44 with function ContMap (Revell, 2012).The mapping was done with default settings, which estimates states at internal nodes using ML with FastAnc, and then interpolates the states along each edge using equation [2] of Felsenstein (1985).The input tree was derived from the concatenationbased tree with branch lengths, using P. atrogrisum as outgroup.

A genome-wide dataset of the order Sordariales
With this study we present a rich genomic resource for the order Sordariales.This dataset will prove valuable for further taxonomic investigations within the order, and provide a starting platform for comparative genomic analysis in this important group of filamentous Ascomycetes.Assembly quality statistics were obtained from the genomes and are summarized in Suppl.Table S1.Of the 106 whole genome assemblies, 100 genomes contained over 90 % of the BUSCO genes from N. Hensen et al. the sordariomycete_odb10 dataset.In total, 17 out of the 3817 Sordariomycete BUSCO genes were found in less than 50 % of the Sordariales taxa and were deleted from the phylogenomic dataset.
BlobToolKit analysis was used to detect potential contamination in the genome assemblies.The vast majority of sequences showed no sign of contamination, but 17 genomes contained sequences of potential nonfungal origin.Manual exploration of the potentially contaminated sequences showed that, in 16 of the 17 genomes, regions without taxonomic assignment, or with best hits outside of the kingdom fungi, often showed high similarity to a broad range of taxa, including fungi.Here, we expect that the best mapping to other phyla was most likely due to the sequences being highly conserved rather than of non-fungal origin.
The total sequence length of these hits was < 0.5 % of the total genome length in all 16 genomes (Suppl.Table S1).In the seventeenth potentially contaminated genome, however, that of Arnium arizonense, 12.6 % of the genome assembly was classified as non-fungal.The majority of the blastn hits were to Firmicutes bacteria, without additional hits to the kingdom fungi.BUSCO completeness for the A. arizonense genome was >90 % for both genome assembly and coding region sequences, indicating good quality and estimates of gene numbers.Furthermore, all potential contamination in this genome was located outside of identified BUSCO genes.Therefore, A. arizonense was retained in the dataset, though we note that e.g.gene numbers may be influenced by contamination in this genome.

A highly supported phylogeny of the order Sordariales
Phylogenetic inference using both concatenation-and coalescentbased approaches of the 3800 genes yielded a robust, well-resolved and comprehensive phylogeny of the Sordariales order (Fig. 1).The vast majority of internodes in both the concatenation-based and the coalescent-based phylogenies received strong (≥95 % UFBS and PP) support (98 % of the nodes; three nodes without strong support in both trees).The three cases of low UFSB and PP support were all found on intrafamily branches.The majority of the nodes were congruent between the phylogenies inferred using both approaches, with 92 congruent internodes (92.4 %; Fig. S1).There were no incongruences between the two inference methods amongst the branches leading to the nine families.The majority of nodes (70 nodes, 67 %) received strong support from the majority of single gene trees (high gCF values).Only six out of 105 branches showed low sCF values, all low sCF values were found within families.(<33 %; Figs.S2-S6).The high interfamily gCF and sCF values indicate that the family phylogeny is well supported, and that only low levels of conflicting signals are present.The higher levels of gene tree discordance and site discordance were found within the genus Neurospora (Sordariaceae), within the Chaetomiaceae and amongst deep interfamily nodes in the phylogeny.Such discordance can have multiple biological sources, including incomplete lineage sorting and introgression.In places where branch lengths are extremely short, such as along the short branches within the genus Neurospora, either technical or biological processes can increase the discordance (Mallet et al., 2016;Minh et al., 2020aMinh et al., , 2020b)).
Due to the high levels of support, our phylogeny has robustly resolved relationships that were only weakly supported in previous analyses based on smaller datasets.Our phylogeny strongly supported the division of Sordariales into at least nine clades corresponding to the families accepted by Huang et al. (2021).Four interfamily relationships were strongly supported with all measures of branch support (100 % UFSB and PP, with the majority of single gene trees supporting the branching pattern).Congruent with Huang et al (2021) our phylogeny consistently favoured placement of Chaetomiaceae and Podosporaceae as sister groups.In congruence with Marin-Felix and Miller (2022), our analysis confirms the grouping of Lasiosphaeridaceae and Schizotheciaceae as sister clades.The interfamily relationships among Bombardiaceae, Naviculisporiaceae and Lasiosphaeriaceae and the relationship between Sordariaceae and the rest of the tree were supported by high levels of both UFBS and PP (100 % UFBS and PP).
Our phylogeny has furthermore cleared up intrafamily relationships within the Podosporaceae.Previous studies of this family have utilized four marker genes (rpb2, tub2, ITS and LSU) and divided the family into three main clades.These three main clades, characterized in our tree by Podospora fimiseda, P. australis, and P. anserina (Suppl.Fig. S2) are strongly supported (Ament-Velásquez et al., 2020;Wang et al., 2019; this study).However, in the four-marker phylogenies, the relationship of the clades was dependent on the variation in only one marker with strong phylogenetic signal (rpb2; Ament-Velásquez et al., 2020).With our genomic approach used herein, we were able to clarify the relationships amongst the clades.In short, our analysis confirmed the sister relationship of the clades containing P. australis and P. anserina (Suppl.Fig. S2) as initially described by Wang et al. (2019).The degree of conflict between markers for this internode does not seem to be dependent on a single gene (rpb2) with strong phylogenetic signal.Instead, we find an overall strong support for the branching relationships of all three clades, with 100 % UFSB and PP.In total 84.84 % of gene trees support the branching pattern between the two clades, indicating almost no conflicting signal in the underlying gene trees.This robust resolution of the relationships amongst the genera provides a better basis for discussions to minimize taxonomic conflict within Podosporaceae.
We note that the genome sequences of the taxa sampled are unevenly distributed across the families of Sordariales (specifically, e.g. Neurospora spp.are over-represented compared to other taxa).Next to increasing sequence length, extensive taxon sampling is one of the most important determinants of accurate phylogenetic estimation (Heath et al., 2008).Previous studies of the order have focused on many taxa, but few loci, and produced highly conflicting results of the Sordariales phylogeny.While our dataset contained 99 Sordariales species and 3800 genes, future studies may add data from additional taxa that are underrepresented in our study, and it is possible that the phylogeny may change as a result.This caveat notwithstanding, utilizing a genome-wide dataset has allowed us to establish a well-resolved phylogeny of the Sordariales.Between the concatenation-based and coalescence-based approaches only a few incongruences were found.All of the incongruences were found within families, mainly within species (complexes) and were accompanied by higher gene tree discordance.The vast majority of branches were highly supported with all phylogenetic reconstructions.Very few relationships were not fully resolved, with conflicting branches mainly present within one species or species complex.

Analysis of genomic properties
We performed comparative genomics for six genomic traits: genome size, gene number, GC content, evolutionary rate, repeat content and RIP-affected genome content.To this end, we first controlled for the possible effect of differences in genome assembly quality by looking at the correlation between N50 and each of the genomic properties.The only correlation we found was between N50 and gene number (Suppl.Table S4).Accordingly, to further assess the quality of the gene number analysis, we assessed BUSCO completeness of the coding regions for Sordariales genomes (Suppl.Table S1).From the coding sequences, 97 of the 101 examined genomes contained more than 90 % of the complete single-copy BUSCO genes, indicating reliable gene number estimations.Only a few duplicated or fragmented BUSCO genes were found, indicating high quality of the genome assemblies in general.
In our comparative genomic analysis, we found a wide distribution of trait values among the investigated taxa (Fig. 2, specific data of each species is given in Suppl.Table S3).Genome size ranged from 28 to 58 Mb, gene numbers from 7066 to 14970, GC content from 44 % to 60 %, evolutionary rate from 0.27 to 0.63 substitutions per site.Repeat content ranged from 1.58 % for Canariomyces arenarius to 41.64 % in T. antarcticum (both members of Chaetomiaceae), while the RIP-affected genome content was estimated to range from below 1 % in a number of taxa, to 40.96 % in Trichocladium antarcticum.

Comparative genomics in Chaetomiaceae, Podosporaceae and Sordariaceae
Despite its large taxonomical diversity, genome sequencing in the order Sordariales has thus far focussed mainly on families containing well-known model-organisms (i.e.Sordariaceae, Podosporaceae), and Chaetomiaceae, whose members produce a diversity of secondary metabolites.As a result, a large amount of whole genome sequencing data is present for these three families, while only a few genomes have been fully sequenced in others.We examined the evolution of different genomic properties amongst the three most densely sampled groups: the Chaetomiaceae, Podosporaceae, and Sordariaceae.Evolution of genomic properties was analysed using non-parametric tests for independent samples (Fig. 2) and ancestral state reconstruction (Fig. 3).Comparisons of the trait values for the genome properties amongst the three families showed that all properties were variable.
All three families were found to be significantly different in GC content (p < 0.05; Fig. 2C).For the other traits there was no significant difference amongst trait values between Podosporaceae and Chaetomiaceae.The Chaetomiaceae displayed the highest overall GC content and intermediate numbers of genes.They contained the smallest average genome sizes, with intermediate levels of repeat content, RIP affected genome content and evolutionary rates.The Podosporaceae had the The Sordariaceae differed significantly from the other two families in having the largest genome sizes, highest percentage of repeats and highest levels of RIP in their genomes.This family displayed intermediate gene numbers, while the GC content and evolutionary rates were significantly lower than those of Podosporaceae and Chaetomiaceae.Overall, the Sordariaceae are overrepresented in the dataset, with many closely related Neurospora strains having been sequenced.These closely related strains, often belonging to one single species, all display very similar genomic characteristics.As a result, the trends described above are valid for the current dataset, but it is possible that the inferences of trends of genomic properties change as species outside Neurospora are added to the Sordariaceae, or as taxon sampling becomes more even over all families.
Uneven taxon sampling can influence trends of genomic properties over multiple genomes.Additionally, on the level of single genomes, estimation of genomic properties can be influenced by assembly quality.
Repeat content and RIP content in particular are highly dependent on genome assembly quality (Testa et al., 2016;van Wyk et al., 2019van Wyk et al., , 2021)).As an example in our dataset, in Neurospora crassa, previous reports indicate RIP affected genome content to be around 15 % (van Wyk et al., 2019).Two of our three N. crassa genomes display similar levels of RIP and repeat levels (14.24 % repeat content in neucra-01 and 11.30 % repeat content for neucra-03), but the third (neucra-02) displays only 4.81 % repeat content.While BUSCO analysis shows high completeness of all three assemblies (Suppl.Table S1), neucra-02 is missing about 3-4 Mb in assembly size compared to neucra-01 and neucra-03.One hypothesis is that the genome size of neucra-02 has been streamlined to contain lower levels of repetitive elements and RIP.However, since repetitive elements represent a prevalent part of gaps in genome assemblies (Peona et al., 2018(Peona et al., , 2021;;Sotero-Caio et al., 2017;Tørresen et al., 2019;Weissensteiner and Suh, 2019), we are currently unable to determine whether this is an artefact of genome assembly or a genomic trait of neucra-02.Thus, future research should focus on sampling more taxa, as well as ensuring high quality data.This will reduce the chance of false trends and genomic assembly artefacts in genome analysis.In particular, analysis of a wider variety of N. crassa genomes is needed to determine whether the lower levels of repetitive elements are a genomic trait of the neucra-02 strain or an assembly artefact.

Variation of genomic properties stems primarily from within-family evolution
Analysis of standard Pearson's correlations amongst genomic properties revealed that half of the correlations were different before (i.e., standard Pearson's correlations) and after (i.e., phylogenetically independent contrasts) accounting for phylogeny (Fig. 4).After accounting for phylogeny, negative correlations remained between GC content and genome size (p < 0.01), GC-and repeat-content (p < 0.05), and GC-and RIP affected genome content (p < 0.05).RIP affected genome content was furthermore positively correlated with repeat content (p < 0.01) and genome size (p < 0.01).Lastly, bimodality of GC content was positively correlated with genome size (p < 0.05), and negatively correlated with evolutionary rate (p < 0.05) and RIP affected genome content (p < 0.01).All correlations had r values below 0.6, indicating relatively weak correlation between the traits (Fig. 4).The other correlations found before accounting for phylogeny became insignificant at the p < 0.05 level, indicating that phylogenetic signal, rather than trait correlation, could be the cause of the significant correlation values.We quantified phylogenetic signal to determine how trait variation was correlated with the phylogenetic relatedness of species (Suppl.Table S5).All four indices (Blomberg's K, Pagel's λ, Abouheif's Cmean, and Moran's I) indicated significant phylogenetic signal in all traits.Pagel's λ ranged from 0.299 for RIP affected genome content to 1.01 for evolutionary rate.Here, lower values indicate less phylogenetic signal in the trait and a value of λ = 1 indicates that there is strong phylogenetic signal and the variation observed in the trait does not deviate from expectations under a Brownian motion model of evolution.The results indicate that genomic trait values are phylogenetically linked in the Sordariales, but the fraction of trait variation explained by the phylogeny is variable.Local Moran's I tests further showed that phylogenetic signal in traits is concentrated within various families, indicating that variation in genomic properties stems from within-family evolution of properties.The local autocorrelation tests showed that the signal is strongest in the Chaetomiaceae family, followed by the Sordariaceae.The Podosporaceae showed less signals of local autocorrelation for most of the traits, but significant positive autocorrelation was found for evolutionary rate for most species.
The genome-scale phylogeny was then used to infer the ancestral character states and reconstruct the evolution for each property.Values for genomic traits at the last common ancestors to Sordariaceae, Podosporaceae, and Chaetomiaceae were inferred to be quite similar (Fig. 3).For example, the inferred state values of GC content for the Sordariaceae last common ancestor and the last common ancestor of both Podosporaceae and Chaetomiaceae are 52.17% and 53.18 % respectively, while there is a significant difference in average GC content amongst the three groups (Fig. 2c).The same trend is also observed for the other traits.This pattern further supports the hypothesis -similar to the phylogenetic signal-that the differences in genomic properties amongst Chaetomiaceae, Podosporaceae, and Sordariaceae stem from evolution after divergence of the three families.

Potential causes of divergence of genomic properties
In Chaetomiaceae, Podosporaceae and Sordariaceae, several traits covary.Phylogenetic signal and ancestral state reconstruction analysis indicated that variation in genomic properties has stemmed from within-family evolution of properties.For each of the three families, we examined potential causes of divergence of genomic properties.The majority of Chaetomiaceae showed a trend of high GC content, small genome size and predominantly low levels of repeats.One possible explanation for these trends in Chaetomiaceae could be the thermophilic lifestyle of at least part of the species in the dataset.Our current dataset contains ten species with known optimal growth temperatures, of which seven have been classified as thermophilic (van den Brink et al., 2015).Earlier research into prokaryotic thermophiles indicated that their genomes are GC rich and streamlined, most likely caused by GC-rich codons encoding for more thermostable amino acids (Wu et al., 2012;Hu et al., 2022) and optimization of energy utilization in stressful environments by reduction of genome size (Giovannoni et al., 2014) known thermophilic species in our dataset showed the general pattern of high GC content and small genome sizes.This result further supports earlier findings that high temperature environments may lead to similar changes in fungi and prokaryotes (van Noort et al., 2013).Deviations from the overall genomic pattern were seen in some species, which could be related to lower optimal growth temperatures.Indeed, the nonthermophilic species T. antarcticum contained the highest level of repeats in the dataset, relatively low GC content, and large genome size.Since the optimal growth temperature is unknown for the majority of Chaetomiaceae species in the dataset, future research in the field of thermophilia and how it affects genome evolution in this group of fungi will be an interesting endeavour to pursue.On average, the Podosporaceae clade showed the highest evolutionary rate.In other lineages, such as both animals and plants, evolutionary rate is negatively associated with generation time (Thomas et al., 2010;Welch et al., 2008;Yue et al., 2010).Assuming equal mutation rates per generation, species with rapid sexual reproduction replicate their genomes more frequently, leading to a higher number of mutations per time unit.Although generation time is unknown for many fungi, the coprophilic lifestyle of the Podosporaceae is believed to select for rapid sexual reproduction rather than prolonged mycelial maintenance (see e. g.Geydan et al., 2012).Thus, the high evolutionary rates in Podosporaceae may stem at least partially from the rapid generation time observed in this group of fungi.
The Sordariaceae contained the largest number of genomes in our dataset.They had the largest average genome size of the three main families, with low GC content and gene numbers, and a wide variety of repeat contents indicating widespread presence of RIP.RIP is known to have profound impact on genome evolution (Borkovich et al., 2004;Clutterbuck, 2011;Galagan and Selker, 2004;Gladyshev, 2017) and could thus explain the trends seen in genomic traits.For example, research across the Ascomycota showed that RIP affected genomes have low GC levels, and genome size and RIP mutation content are moderately correlated (van Wyk et al., 2021).Relatively low gene numbers can furthermore be caused by the apparent lack of functional gene duplication, with RIP slowing the creation of new genes (Galagan et al., 2003).High levels of RIP could thus explain the overall trend of low GC content and gene number as well as the large genome size in the Sordariaceae.

Concluding remarks
Even though recent phylogenetic studies have substantially advanced our understanding of Sordariales evolution, the absence of whole genome data caused uncertainties to remain in the phylogeny (Ament-Velásquez et al., 2020;Marin-Felix and Miller, 2022).In this study, 99 Sordariales genomes were used to create a whole-genome phylogeny and research genome evolution in the order, of which 52 newly sequenced as part of JGI community science programs.High UFSB and PP support values, together with high interfamily gCF and sCF values indicate that the family relationships are well supported, despite uneven taxon sampling.The few low intrafamily gCF and sCF values indicate that intrafamily conflict remains and it is possible that both the inference of the species phylogeny and the trends of genomic properties within families will change as more taxa are added.Addition of new taxa is especially important for underrepresented groups in the dataset, as taxonomic conflicts often remain in underrepresented families (Marin-Felix and Miller, 2022).Despite these caveats, using genome-scale information to infer the phylogeny of the order enabled us to resolve several controversies surrounding the phylogenetic relationships of the families within the order and to test the robustness of the inference of several contentious branches.For example, our study robustly supported Chaetomiaceae as a sister group to Podosporaceae.Within the Podosporaceae our study resolved the ambiguities surrounding the sister relationship of the clades containing P. australis and P. anserina.
The well-supported phylogeny was next used as a framework for comparative genomic analysis in the Sordariales.The three largest families in our dataset (Chaetomiaceae, Podosporaceae, and Sordariaceae) showed clear trends in most of the investigated genomic properties.
Local autocorrelation and ancestral state reconstruction of the traits further revealed that the variation of the properties stems primarily from within-family evolution.We hypothesize that some of the trends in genomic properties seen in the families Chaetomiaceae, Podosporaceae, and Sordariaceae may be caused by ecological, life-history, and molecular traits.Unfortunately, many ecological traits are analysed only amongst a subset of species.It therefore remains difficult to determine whether the described trends are caused by ecological and/or life history traits or if they are a signal of phylogeny alone.Future addition of genomes of underrepresented clades, together with research into the relation amongst genomic properties and ecological traits, will provide more knowledge about the key-drivers of genome evolution in this group of fungi.

Fig. 2 .
Fig. 2. Overview of genomic properties.a.Next to the maximum likelihood phylogeny of Sordariales, we plotted (from left to right) genome size, gene number, evolutionary rate (substitutions per site), repeat content (%), RIP affected area (%), and GC content (%).Continuous trait values were plotted using a heatmap from white (lowest value), to black (highest value).The detailed values of all six properties for each taxon are given in TableS3.Families from top to bottom: Sordariaceae.Lasiosphaeriaceae.Naviculisporiaceae.Bombardiaceae.Lasiosphaeridaceae.Schizotheciaceae.Podosporaceae.Chaetomiaceae.The tree was rooted with Leotiomycetes.B Individual graphs showing the distribution of genomic properties in the tree largest families of the concatenation-based tree.Families from left to right in the individual graphs: C: Chaetomiaceae (n = 22).P: Podosporaceae (n = 17).S: Sordariaceae (n = 41).*difference is significant at p < 0.05; ** difference is significant at p < 0.01.

Fig. 3 .
Fig. 3. Evolution of genomic properties across the phylogeny.The continuous properties were reconstructed on the species phylogeny and their ancestral states were visualized on the concatenation-based tree rooted with Phialemonium atrogrisum.Heatmap bars denote ancestral state values from small (light yellow) to large (red).Five ancestral state values are shown.Two for the most recent common ancestor of the families Sordariaceae, Podosporaceae and Chaetomiaceae, and three for the last inferred ancestor of each of the families.(For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)