Comparative genomics and divergence time estimation of the anaerobic fungi in herbivorous mammals

The anaerobic gut fungi (AGF) or Neocallimastigomycota inhabit the rumen of herbivorous mammals where they support plant fiber degradation. These obligate anaerobes have large and AT-biased genomes, which have limited genome-wide studies and the phylogenomic analyses of these fungi. Using newly generated genomes and transcriptomes of 27 Neocallimastigomycota taxa, we have explored their evolutionary relationships, divergence time, and gene content. The most recent common ancestor of the AGF is estimated to have diverged 73.5±5 million years ago (Mya), which coincides with the estimated ages of grasses (Poaceae) and evolution of mammalian herbivory. Comparative genomics identified lineage-specific genes and protein family domains including three that may have been acquired through horizontal gene transfer from animal hosts, plants, and rumen gut bacteria. The concordance of independently estimated divergence times of Neocallimastigomycota fungi, grasses, and ruminate lineages suggest AGF were important in shaping the success of modern ruminants and enabling their efficient acquisition of energy from recalcitrant plant material.


Introduction
Diverse microbes inhabit the digestive tract of ruminant mammals and contribute to degradation of ingested plant fiber, which liberates nutrients for their hosts. Large scale rumen metagenomic sequencing and analyses have produced hundreds of novel bacterial genomes enabling discovery of plant biomass degrading enzymes and patterns of genomic evolution 1,2 . However, eukaryotic members of the rumen microbial community have been less intensely studied 3,4

. The phylum
Neocallimastigomycota is composed of anaerobic gut fungi (AGF) associated with herbivorous mammals and reptiles and are important contributing members to the rumen environment 5 .
Adaptation to living in the rumen requires ability to grow in the anoxic and prokaryotesdominated environment. The AGF have acquired changes as part of the adaptation to ruminate hosts including loss of the mitochondria and gain of a hydrogenosome, the loss of respiratory capacities, and the substitution of ergosterol with tetrahymanol in the cell membrane 6 .
Importantly, the AGF have a remarkably efficient plant biomass degradation machinery, which may be critical for competing with other microbes for resources and establishing growth in the herbivorous gut. Such capacity is reflected in the possession of an impressive arsenal of plant biomass degradation enzymes and the production of the cellulosomes-extracellular structures that harbor multiple extracellular enzymes bound to scaffoldins 4 . These metabolic and structural adaptations improve survivability, fitness, and competitiveness of the AGF in the herbivorous gut, but the genetic and evolutionary origins of these changes remain undescribed 3,7 . Genomic investigations of the AGF have identified a massive number of carbohydrate active enzymes coded that were likely acquired through Horizontal Gene Transfer (HGT) events from multiple bacterial lineages 3,4,7 . Multiple examples of HGT from bacteria to fungi have been documented [8][9][10][11] . However, HGT events in fungi that have an eukaryote origin are still rare with only a few described cases from animals 12 , oomycetes 13 , or plants 14 . The rumen is an intriguing context to explore patterns of HGT where degradative enzymes break down sorts of cells liberating DNA and RNA. Competing organisms can find advantage by evolving or acquiring enzymes that operate efficiently in an anaerobic environment to obtain nutrients from recalcitrant plant fibers.
Multilocus phylogeny or phylogenomics has not yet been applied to evaluate evolutionary relationships and divergence time of the AGF. Using a collection of genomes and transcriptomes from 27 different AGF taxa representing seven out of the ten recognized genera, we reconstruct a robust phylogenomic tree of the AGF and estimated their divergence time. We compared the genomes or transcriptomes of AGF and their non-rumen Chytridiomycota relatives to identify unique and shared gene content. This study examines the relatively recent divergence of the Neocallimastigomycota, their evolutionary relationships to other fungal relatives in the tree of life, and the concordance of the timing of the AGF evolution with the emergence of herbivory in mammals and the grasses they consume. As the AGF are also highly effective at plant biomass degradation, we also explored the potential contribution of HGT to the genomic evolution of these fungi and identified their possible donor lineages in animals, bacteria, and plants.

Divergence time estimation and phylogenomic relationship of Neocallimastigomycota
Phylogenomic analysis placed the 27 AGF taxa into a single monophyletic clade with strong support of Bayesian posterior probability (1.0/1.0) and maximum likelihood bootstrap value (100%) ( Fig. 1 and Supp. Fig. 1). All AGF genera (Anaeromyces, Caecomyces, Feramyces, Neocallimastix, Orpinomyces, Pecoramyces, and Piromyces) included in this study formed individual monophyletic clades that were also supported by both Bayesian (Fig. 1) and maximum likelihood analyses (Supp. Fig. 1). A conflict in the tree topology between the two phylogenetic reconstructions is the placement of the Caecomyces clade. This lineage is sister to the rest of the Neocallimastigomycota in the maximum likelihood tree (Supp. Fig. 1), while the Caecomyces position is swapped with Piromyces in the Bayesian phylogeny (Fig. 1). This is likely due to short internode distances, which suggest a rapid radiation of the ancestors of these two genera.
The relative short bar of the highest-probability density (HPD) on the node of the AGF clade ( Fig. 1) suggests the integrative natural history of this group of fungi and the outperforming resolving power of the genome-wide data in the molecular dating analysis.
The divergence time of the Neocallimastigomycota clade is estimated at the end of Cretaceous 73.5 (±5) Mya (Fig. 1). This estimate corresponds to the age of the grasses  Mya), previously estimated using molecular (nuclear and chloroplast) markers with calibrations of fossil pollen, dinosaur coprolite, and the breakup time of the Gondwana 21-25 . The inferred divergence time of AGF is also consistent with the timing of a major diet change of placental mammals. Loss of the chitinase gene diversity in mammalian lineages was likely necessary as part of the diet transition from primarily insectivore to herbivorous and omnivorous mammalian 5 Fig. 1. Bayesian phylogenomic Maximum Clade Credibility tree of Neocallimastigomycota with divergence time estimation. All clades are fully supported by Bayesian posterior probabilities (BPP). Mean ages and 95% highestprobability density ranges (blue bars) are denoted on each node. The emergence time of the grasses is shaded in green box (within the Cretaceous). The red arrow indicates the dietary radiation time of placental mammals from insectivores to herbivores or omnivores (at the boundary of Cretaceous and Paleogene).  lineages that occurred around the boundary of Cretaceous-Paleogene, which also overlaps with the estimated divergence time of the AGF clade ( Fig. 1) 26 . The chronogram displays a characteristic long branch leading to the AGF which bridges from boundary of Cryogenian-Ediacaran (~647 Mya) to the end Cretaceous (~73.5 Mya). This suggests the distinction of the AGF and that the modern lineages of AGF did not diverge from the most common ancestor until Cretaceous and further diversified in the Paleogene period (Fig. 1).
On the other hand, our analysis identified that 106 Pfam domains have been lost in AGF genomes and transcriptomes, most of which are related to oxidation reactions on cytochromes and mitochondria (Supp. Table 1). In addition, domains involved in the biosynthesis of nicotinic acid, uric acid, purine catabolism, photolyase, pathways of ureidoglycolate and kynurenine are also found to be missing in AGF species ( Fig. 2 and Supp. Table 1). Similar patterns were revealed in the homologous gene comparisons (Supp. Fig. 2). A permissive criterion, allowing some missing copies, found a total of 2,728 gene families shared between AGF and chytrids. We discovered that 1,709 additional gene families are shared among AGF genomes (each gene present in at least 22 out of the total 27 taxa) but missing from other chytrids, while another 367 families are missing in AGF lineages and present in the other chytrid lineages.

Genomic interactions within the gut system of mammalian herbivores
We focused on three unique Neocallimastigomycota Pfam domains ("Rhamnogal_lyase", "Gal_lectin", and "Cthe_2159") that have never been encountered in the fungal kingdom.
Phylogenetic analyses strongly suggest that these three domains were horizontally acquired from plants, animals, and bacteria respectively.

a) A plant-like rhamnogalacturonate lyase ("Rhamnogal_lyase")
In plants, the rhamnogalacturonate lyases are involved in the fruit ripening-related process, cellwall modification, and lateral-root and root-hair formation 27,28 . The Pfam database classifies two types of domains for rhamnogalactoside degrading activity: "Rhamnogal_lyase" and "RhgB_N".
They are both N-terminal catalytic domains on the rhamnogalacturonan lyase protein (polysaccharide lyase family 4, PL4) and flanked persistently by the group of "fn3_3" and   Table 1).

Heatmap of the genome−wide enriched Pfam domains (natural logarithm)
"CBM-like" domains, with a particular function to degrade the rhamnogalacturonan I (RG-I) backbone of pectin. The "Rhamnogal_lyase" domain exists in members of plants and plantpathogenic bacteria (e.g. Erwinia chrysanthemi), whereas the "RhgB_N" domain has a wider distribution and can be found in bacteria, fungi, and oomycetes 29 . Searches against current databases (e.g., EnsEMBL, MycoSosm, Pfam) suggest that the "Rhamnogal_lyase" domain found in AGF represents the first documented observation of this domain in Fungi. The phylogenetic tree including available PL4 proteins shows that AGF "Rhamnogal_lyase" domains are more similar to the plant homologs and distantly related to the clades of fungi and oomycetes, which implies the novelty and potential foreign origin of these AGF domains (Fig.   3). The presence of "Rhamnogal_lyase" domain suggests that the AGF may possess a plant-like ability to soften, modify, and degrade the plant pectin within the rumen. In addition, the phylogenetic tree groups the "Rhamnogalacturonate lyase A" with the "RhgB_N"-containing proteins, which indicates that they are probably synonymous names.

b) An animal-like galactose binding lectin domain ("Gal_Lectin")
Among the Fungi, the "Gal_Lectin" domain is uniquely found in AGF but is absent in chytrids and all other fungal genomes. Phylogenetic analysis recovered a monophyletic clade of AGF "Gal_Lectin" domains embedded with animal homologs (Fig. 4a). The three distinct animal subclades represent "Gal_Lectin"-containing proteins with different annotated functions (Fig.   4b)

c) A bacteria-like biomass-binding and putatively polysaccharide lyase domain ("Cthe_2159")
Multiple biomass-binding (and degrading) domains were identified uniquely in AGF but missing in the other chytrid lineages (Fig. 2). One domain,"Cthe_2159", was described as a novel polysaccharide lyase-like protein and originally identified from the thermophilic and biomassdegrading bacterium Clostridium thermocellum 30

Discussion
Our results provide new insights to ruminant gut fungi evolution. The analyses find that AGF have diverged relatively recently (73.5±5 Mya, Fig. 1), an age that is in remarkable concordance with salient events in plant (evolution of the Poaceae) and mammalian (transition from insectivore to herbivore and omnivore) evolution. Grass evolution enabled the herbivory transition, and the adaptations drove an increase in developmental and morphological complexity of the digestive tract, compartmentalization, and the development of dedicated anaerobic fermentation chambers (e.g., rumen and caecum) in the herbivorous alimentary tract to improve biomass degradation efficiency 31 . This transition to plant-based (or plant-exclusive) diets required additional partnership with microbes since mammals lack cellulolytic and hemicellulolytic enzymes necessary to liberate sugars for absorption 5 Table 1). Interestingly, three of these incidents that resulted in the acquisition of Pfam domains are missing from the rest of the examined genomes of the fungal kingdom. These Pfam domains were horizontally acquired from animals, bacteria, and plants separately  and highlighted how HGT has contributed to broaden the lignocellulolytic capacities and potentially increase the fungal abilities in cell recognition with immune advantages in the rumen.
The "Rhamnogal_lyase" (PL4 family) and "Cthe_2159" function in pectin binding or degradation activities, and the possession of both suggests that AGF may have specialized ability to deconstruct pectin distinctly from other fungi. Pectin is abundant in primary cell walls and the middle lamella in both dicotyledonous plants (making up 20-35% dry weight) and grasses (2-10%) serving as a protection of plant cells from degrading enzymes produced by animals [33][34][35][36] .
Efforts have shown that removal of pectin can effectively increase the exposed cell wall surface, and improve the accessibility of other polysaccharides (cellulose and hemicellulose) masked by pectin 37 . "Rhamnogal_lyase" has been suggested to play important roles for cell wall modification in plants 27 , while in the view of pathogenic bacteria, it can be used to disorganize the plant tissue for invasion purposes 38 . The gain of "Rhamnogal_lyase" indicates a key synapomorphy of the modern AGF taxa, which could be an important contributor to the AGF ability to utilize the polysaccharide in plant cell walls. The "Cthe_2159" was recently identified with the resolved crystal structure showing the ability to bind cellulosic and pectic substrates 30 .
Although the right-handed parallel -helix structure is conserved across bacteria and some archaea (a common fold to CE, GH, and PL enzyme families), the "Cthe_2159" does not align with any characterized enzyme. Three Calcium ions binding sites of the "Cthe_2159" exhibit similarity with pectate lyases in the PL9 family. Utilization of "Cthe_2159" and "Rhamnogal_lyase" may have contributed to the AGF efficient machinery by degrading the pectin that bind plant cells together so that more surface of the ingested plant materials can be exposed to diverse polysaccharide enzymes in the rumen. These Pfam domains could account for the superior performance that AGF have to weaken forage fiber and degrade polysaccharides 39,40 . The AGF may benefit or depend on these acquired domains in its capacity in the primary degradation of ingested forage, which leaves other microbes to process partially digest remains 41 .
The third AGF-unique Pfam domain, "Gal_Lectin", bears the phylogenetic hallmark of being acquired from an animal donor. Animals use galactose-binding lectins to recognize foreign entities 42 and participate in anti-microbial defenses 43,44 . Interestingly, the animal origin of the AGF "Gal_Lectin" domain is similar to the extracellular motif of the PC-1 (Fig. 4), an integral membrane protein functioning in cell recognition 45,46 . In vitro, PC-1 shows binding ability to carbohydrate matrices and collagens type I, II, and IV 47 . As such, we postulate that the acquisition of the animal-like "Gal_Lectin" domain contributes to the AGF abilities of cell-cell recognition and interaction with other microbes in the rumen. Syntenic relationship of the gene copies shows that the AGF copies are flanked by the Glyco_transf_34 domain, which is not found in any other animal homologs (Fig. 4b). The AGF-equipped Glyco_transf_34 belongs to the galactosyl transferase GMA12/MNN10 family and may help catalyze the transfer of the sugar moieties in cooperating with the adjacent "Gal_Lectin" domain. Our investigation found that HGT has contributed to the genome evolution of AGF with donors from both bacteria and eukaryotic lineages of animals and plants. HGT may have played important roles in the ability of these fungi to acquire new functions and thrive in the anaerobic gut as a key member of the microbial community degrading plant material in animal hosts.

Transcriptome and genome datasets
We generated the transcriptomes of 22 strains of Neocallimastigomycota fungi from cow, sheep, horse, and goat feces, and rumen fluid of fistulated cows in the Stillwater, OK area (Murphy and Youssef et al. in review; Table 1). These strains were maintained under anaerobic conditions using the modified Hungate method as described previously [48][49][50][51] . Total volume of RNA was harvested from the growing fungal strains and processed for transcriptomics sequencing, which was performed using an external commercial service provided by Novogene (Beijing, China).
The RNAseq data were assembled into de novo transcript assemblies using Trinity (v2.6.   (Table S2), and the 1,165 available fungal genomes from the ongoing 1KFG project 17,18,53,54 . To prioritize AGF genes that may have been laterally acquired from these hosts, a Python script 12 and similarity search tool BLAT 81 was applied to filter out genetic elements in AGF with higher similarity to animal or plant homologs than any fungal ones, excluding the AGF themselves. Candidate genes for lateral transfer were ranked by the combination of the two strategies. The candidate genes with an assigned functional or biological process annotation were analyzed with priority using phylogenetic reconstruction to infer their potential origin.

Identification of homologous sequences and potential origin of HGT candidate loci
Three Pfam domains "Rhamnogal_lyase", "Gal_Lectin", and "Cthe_2159" were identified to be unique to the AGF genomes as compared to the Chytridiomycota fungi or all other fungal members (identified using following approaches). To predict the donor lineages for these putative HGT events, we searched more broadly for homologues in genome databases of Plant, The "RhgB_N" domain is distantly related to "Rhamnogal_lyase" (24% similarity between copies from the Aspergillus nidulans and An. robustus), and both are involved in the degradation of the pectin rhamnogalacturonan I region. Members of the "RhgB_N" sequences were obtained from the Pfam database classified in the "RhgB_N" (PF09284) family 29 along with previously annotated members of the rhamnogalacturonate lyase families A, B, and C from GenBank [84][85][86] . A single dataset of "RhgB_N" and "Rhamnogal_lyase" family members from animals, fungi, plants, and bacteria was constructed from these searches. Domain names were confirmed using NCBI's conserved domain search tool (cutoff=1E -5 ) with unaligned FASTA sequences 87 . Similarly, homologs of the "Gal_Lectin" and "Cthe_2159" were obtained by searching for similar sequences in the previously described genome databases and the categorized Pfam database (families of "Gal_Lectin (PF02140)" and "Cthe_2159 (PF14262)").
Homologous sequences containing the "Cthe_2159" domain were only identified in Archaea and Bacteria, while the AGF copies are the first eukaryotic representatives identified with this domain. Highly similar sequences (>90%) were filtered using CD-HIT v4.6.4 followed by multiple sequence alignment with MUSCLE v3.8.31 88,89 . Sequence and phylogenetic analyses were performed on the High-Performance Computing Center (HPCC) at the University of California, Riverside.

Phylogenetic trees of candidate HGT and homologs
In total, 747 sequences of the rhamnogalacturanate degradation proteins (including both "Rhamnogal_lyase" and "RhgB_N") were included in the alignment. For the other two domains, "Gal_Lectin" and "Cthe_2159", the alignments include 297 and 234 unique variants respectively. Both the upstream and downstream flanking regions of the studied Pfam domain sequences were trimmed using the Mesquite software 90 . Selection of appropriate substitutional model, maximum-likelihood phylogenetic tree reconstruction, and ultrafast bootstrapping (1000 replicates) were conducted using the IQ-TREE v1.5.5 package 59,91,92 . Sequence logos of the "Gal_Lectin" domains were generated using the WebLogo tool for each clade 93 . The protein homology of the animal subclades of the "Gal_Lectin" phylogenetic tree was suggested by similarity searches using BLASTp against the available non-redundant (nr) database 64 .