Dynamics of genomic innovation in the unicellular ancestry of animals

Which genomic innovations underpinned the origin of multicellular animals is still an open debate. Here, we investigate this question by reconstructing the genome architecture and gene family diversity of ancestral premetazoans, aiming to date the emergence of animal-like traits. Our comparative analysis involves genomes from animals and their closest unicellular relatives (the Holozoa), including four new genomes: three Ichthyosporea and Corallochytrium limacisporum. Here, we show that the earliest animals were shaped by dynamic changes in genome architecture before the emergence of multicellularity: an early burst of gene diversity in the ancestor of Holozoa, enriched in transcription factors and cell adhesion machinery, was followed by multiple and differently-timed episodes of synteny disruption, intron gain and genome expansions. Thus, the foundations of animal genome architecture were laid before the origin of complex multicellularity – highlighting the necessity of a unicellular perspective to understand early animal evolution. DOI: http://dx.doi.org/10.7554/eLife.26036.001


Introduction
The transition from a unicellular organism to the first multicellular animal, more than 600 million years ago (Budd and Jensen, 2017;dos Reis et al., 2015), marks one of the most radical evolutionary innovations within the eukaryotes. Although multicellularity has independently evolved multiple times in the eukaryotic lineage, the highest levels of organismal complexity, body plan diversity and developmental regulation are found in the Metazoa (Grosberg and Strathmann, 2007). Key advances in the study of animal origins have been made by comparing the genomes of early branching metazoa, such as cnidarians, ctenophores or sponges (Putnam et al., 2007;Srivastava et al., 2010a;Moroz et al., 2014;Srivastava et al., 2008;Fortunato et al., 2014), with their closest unicellular relatives in the Holozoa clade, such as the choanoflagellates Monosiga brevicollis and Salpingoeca rosetta Fairclough et al., 2013), and the filasterean Capsaspora owczarzaki   (Figure 1). By focusing on the transition, it is possible to determine which genomic innovations occurred at the origin of metazoa, and whether it required the invention of novel genes or structural features.
We now know that the animal ancestor was already a genomically complex organism, with a rich complement of genes encoding proteins related to a multicellular lifestyle. These include transcription factors, extracellular matrix components and intricate signaling pathways that were previously considered animal-specific, but were already poised to be co-opted for multicellularity when animals emerged (Fairclough et al., 2013;Richter and King, 2013;Manning et al., 2008;Suga et al., 2012;de Mendoza et al., 2013;Sebé-Pedró s et al., 2017). Suggestively, detailed analyses of the transcriptomic and proteomic regulatory dynamics of Capsaspora and Salpingoeca showed that these genes are frequently implicated in the transition to life stages reminiscent of multicellularity -aggregative in Capsaspora , Sebé-Pedró s et al., 2016a, and clonal colonies in Salpingoeca (Fairclough et al., 2013). Furthermore, the genome architectures of extant Metazoa are, in many aspects, markedly different from most other eukaryotes: they have larger genomes (Elliott and Gregory, 2015a), containing more (Csuros et al., 2011) and longer introns (Elliott and Gregory, 2015a) that can sustain alternative splicing-rich transcriptomes (McGuire et al., 2008;Irimia and Roy, 2014), have richer complements of repetitive sequences such as transposable elements (Elliott and Gregory, 2015b) and are structured in ancient patterns of gene linkage associated with transcriptional co-regulation Simakov et al., 2013) -e.g., the Homeobox clusters (Ferrier, 2016). The relationship between these patterns of genome evolution and multicellularity is, however, unclear: these traits are not exclusive of animals (cf. (Curtis et al., 2012;Shoguchi et al., 2013;Michael, 2014;de Mendoza et al., 2015;French-Italian Public Consortium for Grapevine Genome Characterization et al., 2007)); the existence of secondarily reduced genomes in animals (smaller, gene-compact, less repetitive) in animals blurs their link with organismal complexity (Simakov and Kawashima, 2017;Seo et al., 2001;Petrov et al., 1996); and non-adaptive scenarios can explain the emergence of genomic complexities as a consequence drift-enhancing population-genetic environments (Lynch and Conery, 2003;Lynch, 2002, Lynch, 2007. Establishing the timeline of genome architecture evolution in the eLife digest Hundreds of millions of years ago, some single-celled organisms gained the ability to work together and form multicellular organisms. This transition was a major step in evolution and took place at separate times in several parts of the tree of life, including in animals, plants, fungi and algae. Animals are some of the most complex organisms on Earth. Their single-celled ancestors were also quite genetically complex themselves and their genomes (the complete set of the organism's DNA) already contained many genes that now coordinate the activity of the cells in a multicellular organism.
The genome of an animal typically has certain features: it is large, diverse and contains many segments (called introns) that are not genes. By seeing if the single-celled relatives of animals share these traits, it is possible to learn more about when specific genetic features first evolved, and whether they are linked to the origin of animals. Now, Grau-Bové et al. have studied the genomes of several of the animal kingdom's closest single-celled relatives using a technique called whole genome sequencing. This revealed that there was a period of rapid genetic change in the single-celled ancestors of animals during which their genes became much more diverse. Another 'explosion' of diversity happened after animals had evolved. Furthermore, the overall amount of the genomic content inside cells and the number of introns found in the genome rapidly increased in separate, independent events in both animals and their single-celled ancestors.
Future research is needed to investigate whether other multicellular life forms -such as plants, fungi and algae -originated in the same way as animal life. Understanding how the genetic material of animals evolved also helps us to understand the genetic structures that affect our health. For example, genes that coordinate the behavior of cells (and so are important for multicellular organisms) also play a role in cancer, where cells break free of this regulation to divide uncontrollably.
ancestry of Metazoa is thus essential to understand to which extent genomic complexity is linked to multicellularity.
Overall, gene content has been extensively studied in the unicellular ancestry of animals, but less attention has been devoted to the evolution of genome architecture in this period -covering features such as the repetitive content, intron creation and synteny conservation (although cf. Irimia et al., 2012)). This bias is partly due to the multi-million year gap separating animals from their unicellular relatives and the limited genome sampling of unicellular holozoans. We now know several examples of the effects of such limitations. For instance, our view of the transcription factor repertoire of the animal ancestor was confounded by the gene losses of Monosiga, which only became evident when Capsaspora genome was analyzed (Sebé-Pedró s et al., 2011); and the same happened with the ancestral animal diversity of cadherin and integrin adhesion systems before genomes from choanoflagellates and Capsaspora were analyzed (Nichols et al., 2012;Sebé-Pedró s et al., 2010). Therefore, comparative genomics studies are highly sensitive to taxonomic biases, meaning that rare genomic changes can remain elusive, and more frequent events can manifest saturated evolutionary signals. To overcome these limitations, we analyze the genomes of the third lineage of close unicellular relatives of animals, the Teretosporea, composed of Ichthyosporea and Corallochytrium limacisporum .
As the earliest-branching holozoan clade, Teretosporea are in a key phylogenetic position to complement our current view of premetazoan evolution. Interestingly, they display a developmental mode that radically differs from choanoflagellates and filastereans: many ichthyosporeans have a

Filasterea
Single-celled filopodiated amoebas, aggregative stage (Capsaspora) Figure 1. Evolutionary framework and genome statistics of the study. (A) Schematic phylogenetic tree of eukaryotes, with a focus on the Holozoa. The adjacent table summarizes genome assembly/annotation statistics. Data sources: red asterisks denote Teretosporea genomes reported here; double asterisks denote organisms sequenced for this study; † previously sequenced genomes Fairclough et al., 2013;; ‡ organisms for which transcriptomic data exists but no genome is available . (B) Overview of the phenotypic traits of each group of unicellular Holozoa, focusing on their multicellular-like characteristics. For further details, see Mendoza et al., 2002;Marshall et al., 2008;Glockling et al., 2013). multinucleate coenocytic stage (Mendoza et al., 2002;Marshall et al., 2008), and Corallochytrium develops colonies by binary, palintomic, cell division (Raghukumar, 1987). In both cases, completion of the life cycle frequently involves release of propagules that restart the clonal proliferation (Mendoza et al., 2002;Marshall et al., 2008). In addition, the ichthyosporean Creolimax fragrantissima exhibits many features reminiscent of animals, such as transcriptional regulation of cell type differentiation or synchronized nuclei division during its development .
Our aim is to provide new insights into the evolutionary dynamics of the genome in the ancestral unicellular lineage leading to animals, at two broad levels: gene family origin and diversification, and conservation of genome architectural features. We address the origin of the large and intron-rich animal genomes, changes in gene linkage (microsynteny), and ancient patterns of gene family diversification. The leitmotiv of these analyses is to identify and date genomic novelties along the ancestry of Metazoa, aiming to understand the foundations of the transition to multicellularity. The emerging picture from this comparative study is one of punctuated, differently-timed bursts of innovation in genome content and structure, occurring in the unicellular ancestry of animals.

Four new genomes of unicellular relatives of animals
We obtained the complete nuclear genome sequences of Corallochytrium limacisporum and the ichthyosporeans Chromosphaera perkinsii, Pirum gemmata and Abeoforma whisleri. For all these taxa, we sequenced genomic DNA from axenic cultures using Illumina paired-end and mate-pair reads, which were assembled using Spades (Nurk et al., 2013). Gene annotation was performed using a combination of de novo gene predictions and transcriptomic evidence derived from RNA sequencing experiments (see Methods). Of the four genomes presented here, Corallochytrium (24.1 Mb) and Chromosphaera (34.6 Mb) have the highest completeness and contiguity ( Figure 1). Specifically, Corallochytrium has 7535 genes and 83.4% of the BUSCO paneukaryotic gene set (a proxy to genome completeness (Simão et al., 2015)), and 75% of the assembly length is covered by 86 scaffolds (L75 statistic). Chromosphaera has 12,463 annotated genes comprising 85.5% of the BUSCO set, and its L75 statistic is 187 scaffolds. In contrast, Abeoforma and Pirum have larger genome assemblies (101.9 and 84.4 Mb), but these are fragmented (L75 = 25,133 and 25,440 scaffolds) and incomplete (11.9% and 17.0% of BUSCO). These lower contiguities are reflected in their partial gene predictions ( Figure 1A, Figure 1-figure supplement 1), which consequently hindered the detection of BUSCO orthologs.
Overall, together with Capsaspora, the two choanoflagellates and three already available ichthyosporeans, our expanded dataset now comprises 10 genomes from all unicellular Holozoa lineageseight more than in previous genome analyses (Fairclough et al., 2013;. The new Chromosphaera (gen. nov.) helps resolve the phylogeny of Holozoa To have a robust phylogenetic framework for our comparative analyses, we investigated the phylogenetic relationships between holozoans with a phylogenomic analysis based on the dataset developed in Torruella et al. (2015). We classified the newly identified Chromosphaera perkinsii (gen. nov., sp. nov.) as a member of Ichthyosporea, in the order Dermocystida, as it clusters with Sphaerothecum destruens in our phylogenomic analysis (Figure 2; BS = 100%, BPP = 1). Therefore, Chromosphaera, isolated from shallow marine sediments in Hawai'i, is the first described putatively freeliving dermocystid Ichthyosporea. Indeed, all described dermocystids are strict vertebrate parasites, whereas ichthyophonids are typical animal commensals or parasites (although free-living species have been described and some have been identified in environmental surveys of marine microbial eukaryotic diversity) Glockling et al., 2013).
A parsimonious scenario for genome size evolution would imply an holozoan ancestor with a fairly compact genome, in line with the values of Corallochytrium, Capsaspora and Chromosphaera (24.1-34.6 Mb), followed by secondary genome expansions in ichthyosporeans (the stem lineage of ichthyophonids, and then again in individual species) and possibly Salpingoeca (55.4 Mb). The largest unicellular holozoan assembled genomes fall short of the inferred C-values of ancestral Metazoa (~300 Mb) (Simakov and Kawashima, 2017), thus indicating another genome expansion at the origin of multicellularity.
Transposable element (TE) invasions partially explain the inflations in genome size and can carry the signal of the independent expansions (Elliott and Gregory, 2015b). Indeed, 5-9% of the genome of Salpingoeca, Sphaeroforma, Abeoforma and Pirum are covered by TEs, whereas other holozoans are below 2.5% ( Figure 3A). Unicellular holozoan have diverse TE complements, ranging between 42 families in Corallochytrium to >400 in Pirum or Abeoforma (Figure 3-figure supplement 1; [Carr and Suga, 2014]); and~31% of these families are shared with metazoan genomes (Figure 3-figure supplement 2). In Salpingoeca, Pirum and Abeoforma, we found species-specific small sets of TE families, sharing high sequence identity, that accounted for the vast majority of copies ( Figure 3B). This signaled recent TE invasions, and, therefore, independent contributions to genome expansion. There were hints of older TE propagation events in Sphaeroforma and Pirum, with a long tail of low-similarity TE copies ( Figure 3B). In Abeoforma and Pirum, TEs and other simple repeats comprised up to 17% and 34% of the genome, accompanied by unusually AT-biased nucleotide compositions ( Figure 1A). However, the exact repetitive fraction of Abeoforma and Pirum genomes cannot be exactly quantified: their highly repetitive nature has contributed to their fragmented and incomplete assemblies ( Figure 1A, Figure 1-figure supplement 1) (Treangen and Salzberg, 2011), which hinders the annotation of TEs and simple repeats. Finally, the smaller genomes of Corallochytrium and Chromosphaera were largely depleted of repetitive/satellite regions and TEs (1.8% and 3.8% of their genomes). This finding, together with their reduced intron content (see below, Figure 4) suggests a secondary streamlining process.

Synteny conservation across holozoan lineages is rare, except in Capsaspora
Ancestral conservation of gene linkage at the local level (microsynteny) is common in Metazoa, frequently due to coordinated cis-regulation Simakov et al., 2013). Following this reasoning, we analyzed the microsyntenic gene pairs of unicellular holozoan genomes ( Figure 3C), expecting higher degrees of conservation within lineages than across them. This hypothesis held true for the Salpingoeca-Monosiga genome pair, but we found little or no conservation in almost all inter-specific comparisons of ichthyosporeans and Corallochytrium. There were, however, two exceptions: Creolimax-Sphaeroforma (sibling species; 907 syntenic orthologous genes) and, to a lesser extent, Chromosphaera-Corallochytrium (72 genes). In the case of the closely-related Pirum and Abeoforma, their fragmented genomes hindered the gene order analyses and yielded low synteny conservation values.
In contrast, the analysis of microsynteny in Capsaspora revealed remarkable across-lineage conservation with the distant teretosporeans Chromosphaera and Corallochytrium (142 and 129 genes, respectively). Moreover, and to a lesser degree, Capsaspora also retains a few shared linked gene pairs with Trichoplax, the cnidarians Aiptasia sp., Nematostella vectensis, and the sponges Amphimedon queenslandica and Oscarella carmela ( Figure 3C   Density plots indicate the sequence similarity profile of the TE complement in each organism. Embedded pie-charts denote the relative abundance, in nucleotides, of the main TE superclasses in each genome: retrotransposons (SINE, LINE and LTR), DNA transposons (DNA) and unknown. N c : total number TE copies in the genome; N f : number of families to which these belong; P 25f and P 75f : percentage of most-frequent TE families that account for 25% and 75% of the total number of TE copies, respectively. (C) Heatmap of pairwise microsynteny conservation between 10 unicellular holozoan genomes. Species ordered according the number of shared syntenic genes (Euclidean distances, Ward clustering). At the right: selected pairwise comparisons of syntenic single-copy orthologs between unicellular holozoan genomes. Numbers denote number of syntenic genes, total number of single-copy orthologs, and proportions (%) of syntenic genes per the compared orthologs. Circle segments are scaffolds sharing ortholog pairs, connected by gray lines. (D) Phylogenetic distances between unicellular holozoans and four selected animals: Homo sapiens, Nematostella vectensis, Trichoplax adhaerens and Amphimedon queenslandica. Red asterisks denote organisms that have lower phylogenetic distances to metazoans than one (single asterisk) or both choanoflagellates (double asterisks) (p value < 0.05 in Wilcoxon rank sum test). † indicates significantly higher distances between Corallochytrium and metazoans.  notable example of ancestral microsynteny is that of integrins: heterodimeric transmembrane proteins involved in cell-to-matrix adhesion and signaling in animals that are also present in unicellular Holozoa Sebé-Pedró s et al., 2010). Indeed, integrin-a and integrin-b genes from Corallochytrium (one pair) and Capsaspora (four pairs) are in a conserved head-to-head arrangement of likely holozoan origin. Incidentally, Capsaspora's pairs of collinear a/b integrins coexpress during its life cycle (Sebé-Pedró s et al., 2013), a typical cause of microsynteny conservation in animals . Overall, gene linkage of most extant holozoans appears to be markedly different from their common ancestor, with specific gene pairings arising in Metazoa Simakov et al., 2013), choanoflagellates and some ichthyophonids. In contrast, Capsaspora harbors a relatively slow-evolving genome in terms of synteny conservation.
Coding sequence conservation patterns vary across holozoan lineages Finally, we examined the level of coding sequence conservation between unicellular holozoans and animals. We aimed to contrast the patterns of conservation at the structural level (outlined above) with those of the genic regions. Using 143 phylogenies of paneukaryotic orthologous genes, we examined the pairwise distances between unicellular holozoans and Homo sapiens (bilaterian), Amphimedon (sponge), Nematostella (sea anemone) and Trichoplax (placozoan) ( Figure 3D). In all comparisons, Capsaspora, Chromosphaera and Ichthyophonus accumulated fewer amino-acidic substitutions per alignment position than choanoflagellates since their divergence from animals (p<0.05 in Wilcoxon rank sum test). Conversely, Corallochytrium was singled out as the taxon with more cumulative amino acid differences with animals. Thus, the analysis of coding sequence conservation across holozoans-a genomic trait fundamentally unrelated to synteny-also attests to Capsaspora's slower pace of genome change.  Intron evolution in Holozoa: two independent 'great intronization events' Evolution of intron structure Intron-rich genomes are a hallmark of Metazoa. Indeed, the last common ancestor (LCA) of Metazoa is inferred to have had the highest intron density among eukaryotes, due to a process of continuous intron gain starting in the last eukaryotic common ancestor (LECA) (Csuros et al., 2011;Carmel et al., 2007). The high intron density of multicellular animals has been linked to their higher organismal complexity, as it enables frequent alternative splicing (AS) and richer transcriptomes (Rogozin et al., 2012;Barbosa-Morais et al., 2012;Irimia et al., 2009;Nilsen and Graveley, 2010), provides physical space for transcription regulatory sites (Le Hir et al., 2003;Sebé-Pedró s et al., 2016b), and facilitates the diversification of gene families by exon shuffling (Liu et al., 2005). The dominance of weak splice sites inferred at the intron-rich ancestral Metazoa reinforces the proposed role of alternative splicing as an important source of transcriptomic innovation at the dawn of animal multicellularity (Csuros et al., 2011;Irimia et al., 2007).
Our expanded set of unicellular holozoan genomes provides an ideal framework to investigate the emergence of the high intron densities found in animal genomes. Our survey of intron richness across eukaryotes identifies a high number of introns per gene in many ichthyosporeans, choanoflagellates and animals ( Figure 4A). Moreover, Creolimax and Ichthyophonus harbor longer introns than most protistan eukaryotes, similar in length to those of some animals ( Figure 4B). These similarities between ichthyosporeans and animals suggest two possible scenarios: (1) an early intronization event at the origin of Holozoa followed by reduction in some unicellular lineages (e.g., Capsaspora or Corallochytrium); or (2) independent episodes of intron proliferation in Metazoa, Choanoflagellata and Ichthyosporea. To test these hypotheses, we assembled a set of 342 paneukaryotic orthologs from 40 complete genomes and analyzed the conservation of their intron sites according to the maximum likelihood method developed by Csűrö s and Mikló s (2006) (Figure 5-figure supplement 1). This analysis supports the second hypothesis and reveals two independent periods of intense intron gain in unicellular holozoans: at LCA of Metazoa and Choanoflagellata, and in the branch leading to ichthyophonid Ichthyosporea ( Figure 5A-B). After animals and choanoflagellates diverged, intron gains independently persisted in both lineages.
Our reconstruction shows that, since the origin of introns in the LECA, most ancestors were dominated by intron loss while a few remain in an equilibrium, static or dynamic (consistent with previous studies [Csuros et al., 2011;Rogozin et al., 2012]) ( Figure 5B). A prolonged process of intron gain can be observed, however, in the lines of descent from the LECA (4.9-5.5 introns per kbp of coding sequence) to Ichthyophonida (6.9 introns/CDS kbp) and Metazoa LCAs (8.7 introns/CDS kbp), interrupted by phases of stasis with slight intron loss, such as in the Filozoa or Holozoa LCAs ( Figure 5A-B).
The existence of independent intronization events in ancestral holozoans is supported by a hierarchical clustering analysis of the intron presence/absence profile across extant and ancestral genomes ( Figure 6A; Ward clustering from Spearman correlation-based distances). First, most intron-rich animals form a cluster with Salpingoeca and Monosiga that also includes the LCAs of Metazoa and Metazoa + Choanoflagellata. Second, ichthyosporeans and Corallochytrium, although phylogenetically closely-related to each other, are highly divergent in their pattern of shared introns: the introndense Creolimax and Sphaeroforma form an independent cluster that differs from the Holozoa LCA; whereas Corallochytrium and Chromosphaera undergo independent secondary simplifications (from 5.5 introns/CDS kbp in the Teretosporea LCA, to 0.0 and 0.7, respectively). In contrast, Ichthyophonus (intron-rich) and Capsaspora have lower intron loss rates and are more similar to older eukaryotic ancestors, from Holozoa to the LECA ( Figure 6A). In Ichthyophonus, retention is accompanied by a high gain rate, giving intron densities similar to some modern animals (7.1 intron/CDS kbp). In contrast, Capsaspora (3.5 intron/CDS kbp) appears to have undergone little ancestral reconfiguration of its gene architecture: there is an equilibrium between few losses and gains at the root of Filozoa ( Figure 5A), and 85.5% of its introns are of holozoan or earlier origin ( Figure 6B). Interestingly, introns with regulatory sites from Capsaspora (identified in [Sebé-Pedró s et al., 2016b]) have a similar, ancestral-biased, age distribution (Fisher's exact test, p-value=1; Figure 6B). This hints at a decoupling between the evolutionary dynamics of introns and regulatory sites, despite sharing physical space in the genome.

C. Conservation of NMD and SR machinery
NMD machinery* SR splicing factors † Figure 5. Intron evolution. (A) Rates of intron gain and loss per lineage, including extant genomes and ancestral reconstructed nodes. Diameter and color of circles denote the number of introns per kbp of coding sequence at each ancestral node. Bolder edges mark the lines of descent between the LECA and Metazoa/Ichthyophonida, which were characterized by continued high intron densities (see text). Red and green bars represent the inferred number of intron gains (green) and losses (red) in ancestral nodes. (B) Difference between intron site gains and losses in selected ancestors, including animals (left; from Metazoa to Unikonta/Amorphea) and unicellular holozoans (right). For each ancestor, we specify the variance-to-mean ratio of the inferred number of introns from 100 bootstrap replicates (higher values, denoted by lighter purple, indicate less reliable inferences; see Methods. The color code denotes modes of intron evolution: dominance of gains (green), losses (pink) and stasis (light gray). (C) Conservation of the NMD machinery and SR splicing factors in unicellular holozoans (up) and selected ancestors (down). Black dots indicate the presence of an ortholog, and empty dots partial conservation. For the NMD machinery, each column summarizes the presence of multiple gene families (number between brackets). † denotes the ancestral eukaryotic origin of TRA2 according to (Plass et al., 2008). Complete survey at the species and gene levels available as Figure 4-figure supplements 2 and 3. Figure 5-source data 1, 2 and 3. DOI: 10.7554/eLife.26036.018 The following source data and figure supplements are available for figure 5: Source data 1. Rates of gain and loss of intron sites for extant and ancestral eukaryotes, calculated for a rates-across-sites Markov model for intron evolution with branch-specific gain and loss rates (Csurö s, 2008). DOI: 10.7554/eLife.26036.019 Source data 2. Reconstruction of intron site evolutionary histories, using a rates-across-sites Markov model for intron evolution, with branch-specific gain and loss rates (Csurö s, 2008). DOI: 10.7554/eLife.26036.020 Source data 3. Reconstruction of the evolution of the NMD machinery (He and Jacobson, 2015) and key SR splicing factors (Plass et al., 2008).

Consequences of intron gains in early holozoan evolution
The evolutionary implications of intron gain episodes in Holozoa remain an open question. High intron densities have been linked to inefficient purifying selection: according to the mutational-hazard hypothesis, the lower effective population sizes of animals preclude the loss of slightly deleterious intronic sequence -which can constitute an impediment to genome replication or precise transcription (Csuros et al., 2011;Lynch and Conery, 2003;Lynch, 2002, Lynch, 2006. Whether this population-genetic effect is also connected with the intron gains in Creolimax, Sphaeroforma and Ichthyophonus, however, is unclear: their specific effective population sizes are not known, but estimates from their close relative Sphaeroforma tapetis are in line with typical unicellular eukaryotes (in the 10 6 to 10 7 range [Marshall and Berbee, 2010]) and thus higher than most animals (Lynch, 2006). Alternatively, holozoans' intron gains could be linked to adaptive roles related to alternative splicing (AS): intron-dense genomes exhibit AS-rich transcriptomes (Irimia and Roy, 2014), which can    increase proteomic diversity (Barbosa-Morais et al., 2012;Nilsen and Graveley, 2010;Bush et al., 2017) or fine-tune gene expression regulation (Lareau et al., 2007;He and Jacobson, 2015). Transcriptomes of complex animals frequently feature exon skipping events that conduce to multiple protein isoforms per gene (McGuire et al., 2008;Irimia and Roy, 2014;Bush et al., 2017). In contrast, the AS profiles of Creolimax and Capsaspora are dominated by intron retention (affecting 24.9% and~33% of their genes, respectively), which can disrupt the transcripts' open reading frames de Mendoza et al., 2015). Intron retention is present in virtually all intron-bearing eukaryotes, pointing at an early origin in evolution (Irimia and Roy, 2014). Consequently, AS events in the intron-rich Creolimax were proposed to be involved in down-regulation of gene expression (de  by a mechanism akin to the nonsense-mediated decay (NMD) pathway that operates in other eukaryotes (Lareau et al., 2007;He and Jacobson, 2015;Braunschweig et al., 2014;Kerényi et al., 2008).
In order to explore the relationship between intron evolution and AS-based transcriptome regulation, we surveyed the conservation in unicellular holozoans of the NMD protein complex and key splicing factors involved in AS ( Figure 5C, Figure 5-figure supplement 2 and 3). The core NMD toolkit (consisting of the Upf1-3, Smg1, Smg5/6/7 and Smg8/9 genes; the release factors 1 and 3; and the exon-junction complex [EJC] [He and Jacobson, 2015]) has a pan-eukaryotic distribution ( Figure 5C- Figure 5-figure supplement 2A), as previously reported for the wider spliceosomal molecular machinery (Collins and Penny, 2005). The NMD toolkit was also fully conserved in the LCAs of Ichthyophonida, Metazoa and Metazoa + Choanoflagellata -which underwent the abovereported intron gain episodes ( Figure 5A). Similarly, the SR splicing factors (serine/arginine-rich proteins, termed SRSF1-9 and TRA2A/B in humans), which are involved in splice site recognition in metazoan AS (Plass et al., 2008;Sanford et al., 2005), also appeared early in eukaryotic evolution and were conserved in LCAs ranging from Opisthokonta to Metazoa ( Figure 5C, Figure 5-figure supplement 2B). Interestingly, Corallochytrium secondarily lost part of its NMD machinery and SR splicing factors (e.g., it lacks three out of four EJC components, and only possesses one canonical SR gene) concomitantly with its acute intron losses -a process that mirrors the depletion of splicing factors in the intron-depleted ascomycete Saccharomyces cerevisiae (Plass et al., 2008). Thus, we found that the intron gain episodes of the LCAs of ichthyophonids and animals occurred in ancestral holozoans that were potentially able to perform NMD of aberrant transcripts.

Timing of gene family diversification in holozoa
The Monosiga genome paper by King et al. (2008) revealed that much of the innovation in gene content seen in the transition to multicellularity is rooted in pervasive 'tinkering' with preexisting gene families, notably by rearrangements of protein domains. This mechanism, combined with gene duplication, allows for a functional diversification of gene families by tuning the interactions with other components of the cell-its substrate specificities, sub-cellular localization or partnerships with other proteins within larger complexes. Albeit protein domain rearrangements are not uncommon in eukaryotes (Basu et al., 2008, Basu et al., 2009Leonard and Richards, 2012), this process is specifically credited with the diversification of many gene families involved in complex signaling and/or multicellular integrated lifestyle in Metazoa (Suga et al., 2012;Simakov and Kawashima, 2017;Sebé-Pedró s et al., 2010;Tordai et al., 2005;Ekman et al., 2007;Hynes, 2012;Deshmukh et al., 2010;Grau-Bové et al., 2015).
Here, we present a comprehensive study of gene diversification in Holozoa, using our taxon-rich genomic dataset to reconstruct its effect in the animal ancestry. We thus performed a comparative analysis of protein domain architectures across eukaryotes, using the rates of domain rearrangement (or shuffling) as a proxy for gene family diversification. We compared the phylogenetic distribution of protein domain co-occurrences across species and gene families (using a dataset comprising 26,377 gene families or clusters of orthologs derived from 40 eukaryotic species (see Methods). We inferred rates of domain rearrangement at ancestral nodes of the eukaryotic tree using a probabilistic birth-and-death model (Csűrö s and Mikló s, 2006) to reconstruct the content of specific protein domain architectures in ancestral genomes (available as Figure 7). In our approach, pairs of domains can create novel combinations ('gain') that diversify existing gene families, or dissociate domains ('loss'), which results in decreased diversity of multi-domain proteins. Heatmap to the right represents the log-ratio value of the diversification rate for selected sub-sets of functionally-related protein domains relevant to multicellularity: green indicates higher-than-average diversification; pink less; white asterisks indicate two-fold or more increases or decreases (all comparisons relative to the whole set of protein domains). Source Data Figure 7-source data 1, 2, 3 and 4. DOI: 10.7554/eLife.26036.026 The following source data and figure supplement are available for figure 7: Source data 1. Rates of gain and loss of protein domain pairs within a given orthogroup for extant and ancestral eukaryotes, calculated for a phylogenetic birth-and-death probabilistic model that accounts for gains, losses and duplications (Csurö s, 2010 Shuffling of protein domain architectures is common in the holozoan ancestors We assessed the frequency of protein domain rearrangements by quantifying the rates of domain pair gain and loss at each node of the eukaryotic tree (number of gained or lost domain pairs relative to the total number of pairs in that node) ( Figure 7A-B). Gains and losses are frequent but unequally distributed across organisms and over time, with a majority of nodes showing a tendency towards destruction or creation of domain combinations. Out of 73 analyzed organisms, 20 show a strong bias towards gains, 32 a bias towards losses (>5% difference in either sense), and 64 show combined rates of gain and loss of >10% ( Figure 7A). In contrast, the ancestral reconstruction of individual protein domain evolution (based on Dollo parsimony) showed that losses dominate in most nodes, both extant and ancestral -with the exception of animals and their ancestors (Figure 7-figure supplement 1) (Zmasek and Godzik, 2011).
In this scenario of pervasive domain rearrangements, we identified a consistent pattern of creation of protein domain architectures in the lineage leading to Metazoa -specifically, the line of descent from the opisthokont to the bilaterian LCA ( Figure 7A-B). This tendency was most acute at three points in animal prehistory: the Holozoa LCA, the Filozoa LCA (Capsaspora, animals and choanoflagellates) and the Metazoa LCA. Conversely, unicellular holozoans outside the animal lineage were dominated by secondary simplification (e.g., the LCAs of choanoflagellates or ichthyosporeans, as well as some individual species such as Sphaeroforma, Ichthyophonus or Corallochytrium) or by dynamic stasis (e.g., Capsaspora, Creolimax or Chromosphaera). Our analysis thus shows that the increased diversity of protein organizations in animals has its roots in successive events of domain shuffling during their unicellular holozoan prehistory, even if this period was dominated by a relative stasis in terms of the emergence of new protein domain families ( Figure 7A and Then, we questioned whether these expansions were more frequent in protein domains related to typical multicellular functions, such as the extracellular matrix (ECM), transcription factors (TF) or signaling pathways Richter and King, 2013;de Mendoza et al., 2013;Hynes, 2012;de Mendoza et al., 2014). We found that gene families carrying TF-and ECM-related domains had consistently higher diversification rates not only in Metazoa but also in their unicellular ancestors ( Figure 7B, right panel; asterisks indicate two-fold differences). We thus identify a continuous process of protein diversity gain involving multicellularity-related genes in animal ancestors ranging from the LCA of Obazoa (Opisthokonta + Apusomonadida) to the LCA of Metazoa.

A unique mode of transcription factor diversification in premetazoan ancestors
Next, we analyzed the dynamics of the bursts of innovation in protein domain architectures in the unicellular ancestry of Metazoa, particularly regarding TFs and ECM-related genes. Specifically, we examined the degree of protein domain promiscuity across gene families (i.e., whether a specific domain combination is re-used in multiple gene families) in different ancestors, to measure changes in the specificity of protein domain architecture diversity.
We measured domain promiscuity by modeling each proteome as a network graph, where vertices represented protein domains that were linked by edges if they co-occurred in a given gene family (with !90% probability for the ancestral reconstructions; Methods and Figure 8). In this context, highly promiscuous domains would join multiple gene families within the network, whereas gene family-specific domains would form independent clusters. This effect can be investigated by computing the network modularity: a parameter describing the degree of isolation of 'modules' (here, groups of co-occurring domains) within a network given their connections to other 'modules' ( Figure 8C).
We identified a general tendency for multi-domain protein families to diversify by acquisition of highly promiscuous domains also present in other families. This result was based on two observations. First, network modularities were high in most analyzed genomes (within the 0.7-1 range; consistent with previous observations (Itoh et al., 2007;Xie et al., 2011)) but they were generally lower in animals than in their unicellular relatives and ancestors ( Figure 8A). Second, there was a strong negative relationship between modularity and the number of protein domains per gene family (Spearman's rank correlation coefficient, s =À0.96, p<0.001, Figure 8B). Therefore, at the genome level, gene family diversification tends to reduce modularity due to the use of highly promiscuous protein domains, as it has been frequently reported in animals (Simakov and Kawashima, 2017;Basu et al., 2008). This same effect was observed when we analyzed subsets of the proteome networks sharing a common function: the diversification of gene families with domains related to the ECM, signaling, ubiquitination or protein-protein interactions occurs by acquisition of promiscuous domains that reduce their modularity (with s in the range À0.32 to À0.84 and p<0.001; Figure 8figure supplement 1A-D), and this reduction is frequently stronger in animals than in their unicellular relatives and ancestors (Figure 8-figure supplement 1E-H). The high promiscuity of domains mediating protein-protein interactions has already been reported in previous analyses (Basu et al., 2008;Zmasek and Godzik, 2012), thus confirming the validity of our approach.
However, the analysis of the transcription factor domain sub-networks exhibited an opposite signal: animal TF genes have more exclusive domains than their unicellular ancestors or relatives (reflected by higher modularities; Figure 8A, lower panel). Also, there was no negative relationship between the number of domains per community and the network modularity ( s =0.12,    Figure 8. Protein domain architecture networks. (A and B) Modularity and community size of the global network of domain pairs (upper panels) and the TF subnetwork (lower panels), with !90% probability. The modularity parameter measures the fraction of the intra-community edges in the network, minus the expected value in a random network (takes values from 0 to 1; see Materials and methods and [Newman and Girvan, 2004]). Panels at the left show the observed modularity of the protein domain (sub)networks of various genomes (Holozoa and selected ancestors; dots are taxa-colored). Purple box plots represent the distribution of simulated modularities from 100 rewirings of the original organism-specific networks, while keeping a constant vertex degree distribution. Panels to the right show the relationship between modularities and the number of domains/community, both for actual genomes (orange) and simulated rewired networks (purple density plot, see Methods). Monotonic dependence between modularity and domains/community was tested for each set of data (global, TF and their respective simulations) using Spearman's rank correlation coefficient ( s ), and linear regression fits are included for clarity. Note that simulated TF subnetworks are less modular and have more domains/community than the original ones, signaling their higher-than-expected modularities. Note that the scales of the vertical axes change between upper and lower panels. (C) Example of protein domain co-occurrence network. Vertices represent domains, linked by edges if they co-occur within the same gene family. Two subnetworks are highlighted in yellow (domain pairs occurring in TF genes) or green (same for signaling genes).    p-value=0.32), meaning that the addition of new domains to TF genes occurred in a gene family-specific manner ( Figure 8B). This implies that the expanded TF repertoires of animal genomes (de  preferentially diversify their protein domain architectures by acquiring new, not promiscuous, domains. In summary, we identify a distinct dynamics of protein domain rearrangements for TF families in the LCA of Metazoa: new domains tend to be acquired in a family-specific manner (as opposed to reuse of promiscuous domains), contributing to the functional specialization of the animal TF repertoire.

Gene family-specific protein domain diversification: TFs and collagen IV
Our ancestral reconstruction of protein domain architectures (Figure 7) allowed us to investigate the evolutionary origin of specific domain organizations within gene families and examine their diversification pattern in the ancestry of animals (Table 1). For example, we recovered many examples of gene family-specific domain diversification in novel animal TFs ( Figure 10): Homeobox families (OAR, PBC/X, SIX, CUT, Pou, HNF or PAX families), TALE Homeobox (Homeobox_KN domain; Meis/Knox families), MH (MH1 and MH2 domains), bZIPs (Jun), C4 zinc finger (nuclear hormone receptors), Ets (Ets with modified SAM motifs) and HMG-box (SOX). Interestingly, the functions of accessory domains were often related to regulation of TF multimerisation or the DNA-binding affinities of the protein (de Sebé-Pedró s et al., 2011;Holland et al., 2007;Holland, 2013). These TF families appeared as isolated clusters when we sorted protein domains by their pattern of co-occurrence in the reconstructed Metazoa LCA ( Figure 9A). Furthermore, we detected an unexpected premetazoan origin for some TF classes as per their domain combinations (Figure 10) Figure 10. Domain combinations that appear in transcription factor (TF) families in unicellular premetazoans, from the LCA of Unikonta/Amorphea to the LCA of Metazoa. First and second columns indicate the TF family and its inferred evolutionary origin, respectively (from [de ). Subsequent columns list (i) the p-value of a Fisher's exact test for the relative enrichment of that TF family in that node of the tree (compared to other domains that rearrange there; p-values<0.05 in green); and (ii) the accessory domains that appear within each TF family. Figure 7-source data 2, validated two case-in-point examples by phylogenetic analysis, in order to illustrate the distinct pattern of TF domain diversification: the LIM Homeobox (LIM-HD) and p300/CBP transcriptional coactivators. LIM homeobox genes have been classified as an animal-specific non-TALE family (Srivastava et al., 2010b). However, we identified LIM-associated homeobox genes in multiple ichthyosporeans, Corallochytrium and Capsaspora. We classified these candidate genes according to HomeoDB (Zhong andHolland, 2011) using (Holland et al., 2007) as a phylogenetic reference. Our analysis identified bona fide LIM-HD homologs with 1-2 LIM domains in Corallochytrium, Chromosphaera, Ichthyophonus, Amoebidium and Capsaspora (which had 1-2 LIM domains and a homeodomain); together with many LIM-devoid homologs in Creolimax, Sphaeroforma, Pirum and Abeoforma ( Figure 9C). None of the unicellular holozoan LIM-HD genes could be confidently assigned to animal LIM homeodomain subfamilies (Lhx1/5,Lhx3/4,Lmx,Islet,Lhx2/9,Lhx6/8), probably because they emerged before LIM-HD radiation in animals. As such, they also predate the establishment of the LIM code of cell type specification, which has been shown to control neuronal differentiation via combinatorial expression of LIM-HD subfamilies, in animals from Caenorhabditis elegans to mammals or the sea walnut Mnemiopsis (Simmons et al., 2012;Thor et al., 1999;Gadd et al., 2011). Given that transcriptionally regulated cell type specification has already been demonstrated in Creolimax (de , the presence of LIM-HD paralogs in ichthyosporeans will require further examination, as it raises the possibility of a conserved or convergent regulatory role in cell differentiation. The p300/CBP TF is a transcriptional activator that contributes to distal enhancer demarcation by histone acetylation in bilaterian animals and Nematostella (Gaiti et al., 2017a). Most eukaryotes have a consensus architecture composed of a central HAT/KAT11 domain (acetylase) flanked by three zinc fingers of TAZ (2) and ZZ (1) types (DNA-binding motifs) ( Figure 9D). Animal p300/CBP homologs typically include an additional 3-domain structure, N-terminal to the acetylase domain, composed of KIX-Bromodomain-DUF902. KIX recognizes and binds to CREB in animals (a cAMPresponsibe bZIP TF), and the Bromodomain is responsible for interaction with acetylated histones. We identified this protein domain architecture in both Capsaspora and ichthyosporeans, which also have the CREB gene (Sebé-Pedró s et al., 2011). Intriguingly, Capsaspora's epigenome contains p300/CBP-specific histone acetylation marks, but its relatively compact genome lacks distal enhancers (Sebé-Pedró s et al., 2016b).
Finally, in stark contrast to TF domain-specific diversifications, clusters of co-occurring protein domains in ECM-related genes were dominated by highly promiscuous domains shared between different gene families ( Figure 9B). This pattern explains the lower network modularity of animal ECM genes ( Figure 8-figure supplement 1). Among the most promiscuous domains, we found epidermal growth factor-related domains (EGF-CA, EGF), type III fibronectin or protein tyrosine kinase motifs, consistent with previous observations (Cromar et al., 2014). These domains are part of multiple, functionally different gene families: structural laminins, immunoglobulins, the Notch/Delta signaling system, LDL receptors or GPCR signaling genes (pink highlight, Figure 9B).
The diversification of collagen genes, however, is a counterexample to the promiscuous domain shuffling at the ECM: like many TFs, collagens typically contain repetitive motifs with unique domains conferring functional specificity (Hynes, 2012). This includes, for example, structural fibrillar collagens (COLFI domains and further specialization within metazoans), type XV/XVIII (endostatin/NC10 domains), type IV collagen or type IV-like spongins (specific of invertebrate metazoans); there are also non-structural genes like collectin receptors (Lectin-C) or the C1q complement subcomponent (C1q) (Hynes, 2012;Aouacheria et al., 2006;Heino, 2007;Fahey and Degnan, 2012;Exposito et al., 2008). Most collagen genes appeared and expanded in Metazoa, concomitantly with the ECM structures they associate with (Hynes, 2012;Fidler et al., 2017). We found, however, a remarkable exception: a canonical type IV collagen gene in the filasterean Ministeria vibrans, a naked filose amoeba devoid of basement membrane or ECM (Patterson et al., 1993;Cavalier-Smith and Chao, 2003). Cross-linked type IV collagens are part of the structural core of animal basement membranes (to date, all of its components had been described as exclusive to animals) (Hynes, 2012;Fidler et al., 2017). This Ministeria ortholog is composed of a pair of C4 domains at the C-terminus and multiple collagen Gly-X-Y repeats. Phylogenetic analysis of C4 showed that this domain arrangement appeared from two duplicated motifs within the same protein, and its order is thoroughly conserved in animals and Ministeria ( Figure 9E). Thus, a canonical type IV collagen was already present in the common ancestor of filastereans, choanoflagellates and animals -which was unicellular and most likely lacked ECM or basement membrane-like structures. The essential role of collagen IV in the organization of extant metazoans' tissues (Fidler et al., 2017) would therefore require a co-option from an earlier function in a unicellular context, as it has been previously proposed for other ECM components such as the integrin adhesome (Sebé-Pedró s et al., 2010) or cadherins ).

Discussion
We have investigated the evolutionary dynamics of key genomic traits in the unicellular ancestry of Metazoa, in the first comparative genomic study that simultaneously includes all unicellular holozoan lineages, and more than one species per lineage: animals, seven Teretosporea genomes (six ichthyosporeans and Corallochytrium), Capsaspora, and two choanoflagellates (Salpingeoca and Monosiga). Our enhanced taxon sampling, including four newly sequenced genomes, allows us to perform both within-and across-lineage comparisons, thus covering the different time scales at which the evolution of coding and non-coding genome features occurred.
Dating the origin of animal-like protein domain architectures, intron architecture and genome size We have identified continued process of gene innovation in terms of protein domain architectures in the animal ancestry, peaking at the LCA of Holozoa. This burst of diversification, enriched in TFs and ECM-related domains ( Figure 7B), set the foundations of the animal-like gene tool-kits of unicellular holozoans that have been reported in previous studies of gene family evolution regarding signaling pathways (Suga et al., 2012;Grau-Bové et al., 2015, 2013, cell adhesion systems Nichols et al., 2012;Sebé-Pedró s et al., 2010) and transcription factors, often involved in developmental processes (de Sebé-Pedró s et al., 2011). The expansion of protein diversity in early holozoans provided fertile ground for the frequent co-option of ancestral genes for multicellular functions in Metazoa . Overall, our probabilistic reconstruction of the genome content of unicellular animal ancestors (available as Figure 7source data 7) provides a useful framework for targeted analysis of gene evolution and protein domain architecture evolution. As case-in-point examples of our approach, we have established the premetazoan origin of the transcription factors LIM Homeobox (present in Ichthyopsorea and Capsapsora) and p300/CBP-like (all unicellular Holozoa) ( Figure 9C-E), and canonical Type IV collagens, a key element of the animal ECM (Hynes, 2012) (present in the filasterean amoeba Ministeria vibrans).
We have also investigated the time of origin of intron-rich genomes in Holozoa. We detect three independent episodes of massive intron gain: (1) at the root of Metazoa, (2) the shared LCA between Metazoa and Choanoflagellata, and (3) the root of ichthyophonid Ichthyosporea (Creolimax, Sphaeroforma and Ichthyophonus). Furthermore, since the early origin of introns in the earliest eukaryotes (Irimia and Roy, 2014), the ancestry of both animals and ichthyophonids maintained a state of high intron density. The evolutionary implications of this circumstance, however, remain an open question. First, the independent intron gain episodes of animals and unicellular holozoans are mirrored by two different modes of alternative splicing dominating in each clade: animal transcriptomes are rich in isoform-producing exon skipping (McGuire et al., 2008;Irimia and Roy, 2014), whereas most of the alternatively spliced transcripts of Capsaspora (Sebé-Pedró s et al., 2013) and Creolimax (de  originate by intron retention and are thus more similar to the putative ancestral eukaryote than to Metazoa (Irimia and Roy, 2014). Second, we here show that the holozoan LCAs that underwent intron invasions (Ichthyophonida, Metazoa and Metazoa + Choanoflagellata) all possessed the essential NMD machinery and a rich complement of assisting splicing factors ( Figure 5C). Thus, they were in principle able to reduce the costs imposed by slightly deleterious intron invasions, as predicted by the mutational-hazard hypothesis (Lynch and Conery, 2003;Lynch, 2002, Lynch, 2006. And third, the protracted state of high intron density in the ancestry of Metazoa and Ichthyophonida could have contributed to maintaining high levels of transcriptome variability that could in turn be co-opted for potentially adaptive, regulated AS events (Irimia and Roy, 2014;Koonin et al., 2013). However, we cannot at present elucidate the relative importance of adaptation and population-genetic effects in the holozoans' intron gain episodes: further transcriptomic analyses of unicellular holozoans are required to confirm that intron retention is their ancestrally prevalent AS mode Irimia and Roy, 2014;de Mendoza et al., 2015); and the scant data on unicellular holozoans' population genetics hampers the interpretation of genome architecture evolution under the light of the mutational-hazard hypothesis (Lynch and Conery, 2003;Lynch, 2002).
We also addressed the evolution of genome size across holozoans. The emergence of larger genomes in Metazoa cannot be explained solely by intron gain and gene family expansion (Elliott and Gregory, 2015a). Unfortunately, other factors such as the contribution of TE invasions ( Figure 3B) or the extension of intron sites are not possible to date at the holozoan-wide evolutionary scale due to the lack of conserved signals. A possible way out of the conundrum is to study the conserved functions in the non-coding parts of the genome. For example, the compact genome of Capsaspora (median intergenic regions: 373 bp) has intragenic cis-regulatory elements key to its temporal regulation of cell differentiation (Sebé-Pedró s et al., 2016b), but the putative regulatory functions in the larger intergenic regions of Creolimax, Sphaeroforma and Salpingoeca (median intergenic 900-1200 bp) remain uncharacterized. It is tantalizing to note that (1) Creolimax and Salpingoeca exhibit temporal differentiation of cell types (Fairclough et al., 2013;de Mendoza et al., 2015), and (2) their intergenic median sizes are in line with those of Amphimedon (885 bp) (Figure 1-source data 1), a demosponge with bilaterian-like promoters and enhancers, including distal regulation (Gaiti et al., 2017a, Gaiti et al., 2017b. However, the ancestral gene linkages conserved across Metazoa, frequently due to common cis-regulation , appear to be animal innovations absent in unicellular holozoans (Figure 3-figure supplement 1). We thus propose that homologous regulatory regions would be rarely conserved between animals and unicellular holozoans; and only common types of regulatory elements could be expected, e.g. distal enhancers or developmental promoters.

Independence of genome features in premetazoan evolution
Overall, our results show that extant holozoan genomes have been shaped by both differential retention of ancestral states and secondary innovations, for the multiple genomic traits analyzed here, namely genome size, intron density, synteny conservation, protein domain diversity and gene content (reviewed in ). We can thus conclude that the genomes of unicellular premetazoans were shaped by independent evolutionary pressures on different traits, as has been seen in Metazoa (Simakov and Kawashima, 2017).
Our findings can help to delimit the implicit trade-offs of choosing a unicellular model organism for functional and comparative studies with Metazoa, taking into account the loss of animal-like genomic traits relevant to different analyses. For example, phylogenetic distances between orthologous genes are shorter between some ichthyosporeans and animals than between choanoflagellates and animals ( Figure 3D), yet choanoflagellates are more similar to the animal ancestor in terms of intron structure ( Figure 6A) and have lower rates of protein domain diversity loss ( Figure 7B). Interestingly, Capsaspora emerges as a well-suited model with a slow pace of genomic change attested for multiple traits: intron evolution, coding sequence conservation, gene order and (possibly) genome size. Its remarkable microsynteny conservation with Corallochytrium and Chromosphaera indicates the existence of ancestral holozoan gene linkages that have been disrupted, and rewired, in extant choanoflagellates, ichthyosporeans and animals ( Figure 3C). However, Capsaspora's lack of close sister groups hampers comparative studies of faster-evolving genomic features, be it the regulatory circuitry (Sebé-Pedró s et al., 2016b), or co-option of genetic tool-kits for its unique aggregative development .
The new genomes from Ichthyosporea and Corallochytrium analyzed here provide novel insights into the reconstruction of premetazoan genomes. The Teretosporea clade has a deeper sampling than other unicellular holozoans and exhibit a mixture of slow-and fast-evolving genomic traits, which provides novel insights into the independence of genomic characters during premetazoan evolution. For example, Ichthyophonus tends to retain the ancestral intron/exon structure ( Figure 6A) and is relatively similar to animals in terms of coding sequence conservation ( Figure 3D), but it harbors a secondarily expanded genome with disrupted gene linkage ( Figure 3A, C). Another example is Corallochytrium and Chromosphaera, both with massive simplifications of intron content ( Figure 5A), but higher synteny conservation with the inferred ancestral Holozoa ( Figure 3C). Also, the diversity of protein domain combinations of Chromosphaera is the highest among ichthyosporeans (in line with values of animals and holozoan ancestors; Figure 7A) and phylogenetic distances to animal orthologs are comparatively low ( Figure 3D). These studies of genome history in holozoans are key to our interpretation of functional genomics analyses. For example, Creolimax and Sphaeroforma are close species with a broadly conserved life cycle (Glockling et al., 2013), and they could therefore be an apt model to test hypotheses of cell type evolution in Holozoa -for example, whether new cell types emerge as lineage-specific transcriptomic specializations, as proposed by . This investigation would benefit from taking into account their high microsynteny when analyzing co-regulated gene modules, while considering that Sphaeroforma's multiple TE invasions could blur the conservation of non-coding regulatory elements in the intergenic regions ( Figure 3A-C).

Genomic innovation in the animal ancestry
Our analysis of ten unicellular holozoans has uncovered the timing of genome evolution in the ancestry of Metazoa, at both the architectural and gene content levels. In particular, we have observed that holozoan genomes evolved under temporally uncoupled dynamics for synteny reorganization, intron gains, TE propagation, coding sequence conservation and gene family diversification. Some of these traits have independent effects in extant holozoans, e.g., different episodes of intron gain or genome expansion in ichthyosporeans and animals. Yet, other traits exhibit conserved dynamics across the unicellularity/multicellularity divide: the diversification of ECM and TF gene familiesincluding molecular tool-kits essential for multicellularity-extends back to the LCA of Holozoa; and the high intron densities in premetazoans suggest a continued state of transcriptome variability, cooptable for regulation or protein innovation, in the unicellular prehistory of Metazoa. Overall, our timeline of holozoan genome evolution offers a framework to investigate when and how premetazoan genomic elements-gene tool-kits, linkages and structure, and the non-coding sequences that harbor epigenomic regulatory elements-were functionally co-opted in multicellular animals.
RNA-seq data was produced for Chromosphaera and Abeoforma. RNA extractions were performed from confluent axenic cultures grown in three 25 ml flasks for 5 days. RNA was extracted using Trizol reagent (Life Technologies, Carlsbad, CA, USA) with a further step of Dnase I (Roche) to avoid contamination by genomic DNA, then purified using RNeasy columns (Qiagen). We sequenced PE libraries of 125 bp with an insert size of 250 bp, yielding 168Á10 6 reads for Chromosphaera and 178Á10 6 for Abeooforma; which were constructed using the Trueseq Sequencing Kit v4 (Illumina, San Diego, CA). The libraries were sequenced in one lane of an Illumina HiSeq 2000 at the CRG genomics unit (Barcelona). All transcriptome sequencing data has been deposited in NCBI SRA using the BioProject accession PRJNA360056.

Genome assembly
Genomic PE and MP libraries were quality-checked using FastQC v0.11.2 (Andrews, 2014) and trimmed accordingly with Trimmomatic v0.33 (Bolger et al., 2014) to remove remnant adapter sequences (ad hoc) and the low-quality 5' read ends (sliding window = 4 and requiring a minimum Phred quality = 30). A minimum length equal to the original read length was required. During the quality-trimming process, libraries of unpaired forward reads were kept as single-end reads (SE Genome assemblies were performed using Spades v3.6.2 (Nurk et al., 2013) with the BayesHammer error correction algorithm (Nikolenko et al., 2013). For each organism, PE data were analyzed using Kmergenie (Chikhi and Medvedev, 2014) to determine the optimal k-mer length for the assembly process, which was used in the Spades assembly in combination with smaller and larger values, including the maximum possible odd length below the maximum read length after trimming. The optimized assemble parameters for each genome were as follows: Pirum,max. read length = 125,k = 55,123;Abeoforma,max. read length = 100,k = 47,91;Chromosphaera,max. read length = 125,k = 91,121;Corallochytrium,max. read length = 100,k = 41,63,91. In the cases of Corallochytrium and Chromosphaera genomes, Spades was run in careful mode, taking into account PE, SE and MP data in the same run. In the cases of the highly repetitive Abeoforma and Pirum genomes, an initial Spades assembly of PE and SE libraries was combined with MP libraries using the Platanus v1.2.1 scaffolding module (Kajitani et al., 2014). Each assembly was later processed using the GapCloser module from SOAPdenovo assembler with PE data, in order to extend the scaffolded contigs by shortening N stretches (Luo et al., 2012). Genome assembly statistics (genome size, N50, L75) were calculated using Quast v2.3 , and completeness was assessed using the BUSCO v1.1 (Simão et al., 2015) database of universal eukaryotic genes, based on the predicted transcripts.
SNAP and GeneMark-ES annotations were iterated for three times on the final genome assemblies, using the output of each step as a training set for the next one (the first SNAP prediction was done using the standard minimal HMM; GeneMark-ES was omitted for Abeoforma and Pirum due to its highly fragmented gene bodies, which impaired intron delimitation).
Transcriptome assemblies were produced to support PASA and Augustus gene predictions. RNAseq PE libraries were assembled using genome-guided Trinity v2.0.6 and STAR v2.5 (for Corallochytrium, Chromosphaera and Ichthyophonus) or de novo Trinity (Pirum and Abeoforma, assemblies from Grabherr et al., 2011;Dobin and Gingeras, 2015). In the case of the Corallochytrium, Chromosphaera and Ichthyophonus genome-guided assemblies, quality control was performed as indicated above for the genomic libraries, using the RNA-seq data generated for this study (Chromosphaera) or in  (Ichthyophonus accession: PRJNA264423; Corallochytrium accession: PRJNA262632). A minimum k-mer coverage = 2 was used in all Trinity assemblies. Transcriptome assemblies were annotated with Transdecoder using Pfam release 29 protein domain database, in order to obtain mRNA and translated peptides. Next, PASA annotations were obtained from assembled transcripts, mapped to the genome using GMAP and BLAT v35 (Kent, 2002;Wu et al., 2016). Only high quality mapping was accepted, with a minimum of 95% identity and 75% transcript coverage. We then trained Augustus independently, using protein and mRNA predictions (mapped to the genome with Scipio 1.4 (Keller et al., 2008), BLAT and GMAP), followed by an optimization round of the species-specific parameters. After the training, an Augustus prediction was performed using the optimized parameters.
Finally, all annotations were consolidated using Evidence Modeler. In this step, gene models from PASA and Augustus were given higher relative weights than ab initio-predicted models (10 and 5 times more reliability, respectively).

Phylogenomic analysis
We used an improved version of the dataset published by Torruella et al. (Torruella et al., 2015), adding nine single-copy protein domains to the previous version (which included 78 alignments) according to the methodology developed in (Torruella et al., 2012). Since Abeoforma and Pirum genome assemblies were fragmented and contained partial gene models, we used transcriptome assemblies from  instead. We compiled a 57-taxa dataset of Unikonta/Amorphea species (hereby termed BVD57 taxa matrix; including Holozoa, Holomycota, Breviatea, Apusomonadida and Amoebozoa; 24,021 amino acid positions). This dataset represents a~10% increase in the number of aligned positions, compared to the original S70 dataset from .
We used the BVD57 dataset to build ML phylogenetic trees using IQ-TREE v1.5.1 (Nguyen et al., 2015), under the LG model with a 7-categories free-rate distribution, and a frequency mixture model with 60 frequency component profiles based on CAT (LG + R7+C60) (Quang et al., 2008).
LG + R7 was selected as the best-fitting model according to the IQ-TREE TESTNEW algorithm as per the Bayesian information criterion (BIC), and the C60 CAT approximation was added because of its higher rate of true topology inference (Quang et al., 2008). Statistical supports were drawn from 1000 ultrafast bootstrap values with a 0.99 minimum correlation as convergence criterion (Minh et al., 2013) and 1000 replicates of the SH-like approximate likelihood ratio test (Guindon et al., 2010), for all models stated above. Furthermore, 500 non-parametric bootstrap replicates were computed for the LG + R7+PMSF CAT approximation (as this was the only CAT approximation for which non-parametric bootstraps could be calculated in a feasible computation time).
We then used the same alignment to build a Bayesian inference tree with Phylobayes MPI v1.5 (Lartillot et al., 2013), using the LG exchange rate matrix with a 7-categories gamma distribution and the non-parametric CAT model (Lartillot and Philippe, 2004) (LG+G7 + CAT). A G7 distribution was considered to be the closest approximation to the free-rates R7 distribution of the IQ-TREE ML analysis (as free-rates distributions are not implemented in Phylobayes). We removed constant sites to reduce computation time. We ran two independent chains for 1231 generations until convergence was achieved (maximum discrepancy <0.1) with a burn-in value of 32% (381 trees). The adequate burn-in value was selected by sequentially increasing the number of burn-in trees, until we achieved (1) a minimum value of the maximum discrepancy statistic, and (2) the highest possible effective size for the log-likelihood parameter. The bpcomp analysis of the sampled trees yielded a maximum discrepancy = 0.095 and a mean discrepancy = 0.001. The tracecomp parameter analysis gave an effective size for the log-likelihood parameter = 37; and the minimum effective size = 11 (for the alpha statistic).

Generation of a species tree and ortholog datasets for comparative analyses
Our comparative genomics analyses are based on a dataset of 42 complete eukaryotic genomes, with a focus on unicellular and multicellular Holozoa, and using relevant outgroups from the Holomycota, Apusomonadida, Amoebozoa, Viridiplantae, Stramenopila, Alveolata, Rhizaria and Excavata groups. The complete list of species, abbreviations and data sources is available as Figure 2source data 1.
Since ancestral state reconstruction requires the assumption of an explicit species tree, we classified the 42 genomes in our dataset according to a consensus of phylogenomic studies Derelle et al., 2015;He et al., 2014) and our own results. We remained agnostic about the internal topology of SAR (Burki et al., 2016), Fungi , the contentious hypotheses for the root of eukaryotes (namely, 'Opimoda-Diphoda' or 'Excavata-first') (Derelle et al., 2015;He et al., 2014) and the earliest-branching animal group (Porifera or Ctenophora) (Whelan et al., 2015;Simion et al., 2017). All these cases were recorded in our species tree as polytomic branchings.
We inferred two different ortholog datasets using the predicted proteins from the afore-mentioned genomes, using Orthofinder v0.4.0 with a MCL inflation = 2.1 (Emms and Kelly, 2015). The first database included 40 eukaryotic species (excluding the low-quality gene models of Pirum and Abeoforma), whose genes were classified in 162,559 clusters of orthologs, 26,377 of which contained >1 gene (henceforth, 'orthocluster'). The second database included all available unicellular holozoan genomes (i.e., six ichthyosporeans, two choanoflagellates, Corallochytrium and Capsaspora) and yielded 58,516 orthoclusters, 11,925 of which contained >1 gene.

Retrieval of homologous sequences
Retrieval of homologous protein sequences was performed by querying orthologs or protein domain HMM profiles (depending on the gene family; see below) against a database of protein sequences from 69 selected eukaryotic genomes and transcriptomes (Figure 2-source data 1). Since Abeoforma and Pirum genome assemblies were fragmented and contained broken gene models, we used transcriptome assemblies from  instead.
In the case of LIM homeodomain genes, we queried the genomes/transcriptomes of all available unicellular holozoans (see taxon sampling above) using the homeobox HMM (PF00046), and restricted the subsequent phylogenetic analysis (see below) to sequences that clustered with known LIM-HD genes from the HomeoDB database in blastp searches (Zhong and Holland, 2011).
Protein alignments and phylogenetic analyses (LIM homeobox, type IV collagen and CBP/p300, Smg5/6/7, eIF4A3, eRF3, SRSF1-9/TRA2 splicing factors) Protein alignments were built with MAFFT v7.245 (Katoh and Standley, 2013), using the G-INS-i algorithm optimized for global homology for single-domain alignments (LIM homeobox, type IV collagen, CBP/p300 and SRSF1-9/TRA2) or the E-INS-i for multiple local homology for whole-protein alignments (Smg5/6/7, eIF4A3 and eRF3). All alignments were run for up to 10 6 cycles of iterative refinement. Then, the resulting alignments were manually examined, curated and trimmed (a process that included the removal of non-homologous amino acid positions and, eventually, non-essential sequences containing too few aligned positions that could disrupt the subsequent phylogenetic analysis). If necessary, the alignment and trimming process was repeated to incorporate the changes from manual curation.
For IQ-TREE (Nguyen et al., 2015) analyses, the best-scoring ML tree was searched for up to 100 iterations, starting from 100 initial parsimonious trees; statistical supports for the bipartitions were drawn from 1000 ultra-fast bootstrap (Minh et al., 2013) replicates with a 0.99 minimum correlation as convergence criterion, and 1000 replicates of the SH-like approximate likelihood ratio test. For MrBayes analyses, we ran two independent runs of four chains each (three cold, one heated) for a variable number of generations until run convergence was achieved (at different values depending on the gene family), sampling every 100 steps and running a diagnostic convergence analysis every 1000 steps. Convergence was deemed to occur when, using a 25% relative burn-in value, the average standard deviation of split frequencies was <0.01. Final number of generations for each gene family: 7.2Á10 7 generations for Collagen IV; 1.2Á10 7 for LIM Homeobox; and 9.9Á10 6 for HAT/KAT11; 6.4Á10 7 for Smg5/6/7; 1.6Á10 7 for eRF3; 7Á10 6 for eIF4A3.

Analysis of repetitive elements
Repetitive regions were annotated in Holozoa genomes using RepeatMasker open-4.0.5 (Smit et al., 2015) and annotations from the 20150807 release of GIRI RepBase database (Bao et al., 2015). We used the Eukaryota-specific database, with either the slow high-sensitivity search mode (unicellular holozoans) or the default search mode (metazoans); and stored the genome coordinates of TEs, low complexity repeats, tRNA genes, simple repeats and satellite regions. Internal similarity of each genome's TE complements was analyzed with blastn self-alignments of all TEs (considering a minimum 70% identity and 80 bp alignment length), and the distribution of percentage identity values was plotted using R.

Analysis of gene microsynteny by ortholog pair collinearity
We used the frequency of collinear ortholog pairs as a proxy to estimate microsynteny across holozoans. Specifically, we retrieved all sets of single-copy orthologs for each pairwise species comparison within our set of 10 unicellular holozoan genomes. We then defined collinear gene pairs for each species pairs if the same two orthologs were adjacent in both genomes (irrespective of individual gene orientation to account for possible local inversions, as in (Putnam et al., 2007)). To account for spurious conservation of gene order, we assigned random positions to each gene using the bedtools v2.24.0 shuffle utility (Quinlan and Hall, 2010) in 100 independent rounds, for which the number of spurious conserved syntenic pairs was recorded. Then, we calculated the gene synteny ratio r of each species pair i-j as follows: where c denotes the number of syntenic orthologs between i and j; s is the number of spurious syntenic orthologs averaged over 100 random replicates; and N is the number of comparable ortholog pairs between i and j. Values are normalized to the 0-1 interval using the maximum values of the dataset as a reference, i.e. Sphaeroforma and Creolimax. A heatmap representing the degree of similarity in pairwise species comparisons was produced using the synteny ratio (R gplots library (Warnes et al., 2016)). Species were clustered according to their mean synteny. The same analysis was performed using the database of 40 eukaryotic genomes, which excluded Abeoforma and Pirum. In this case, the maximum values used as a reference were the Nematostella-Aiptasia pair. For specific selected species comparisons, syntenic pairs were plotted onto the genome scaffolds using Circos v0.67 (Krzywinski et al., 2009).

Analysis of coding sequence conservation
From our ortholog database using 40 eukaryotic genomes (excluding Pirum and Abeoforma, which had lower-quality gene annotations due to their fragmented assemblies), we selected 143 orthoclusters present in all unicellular holozoans, plus Amphimedon queenslandica, Trichoplax adhaerens, Homo sapiens and Nematostella vectensis (as representative animal genomes). We aligned each group of orthologs using MAFFT G-INS-i (Katoh and Standley, 2013), trimmed the alignments using trimAL automated algorithm (Capella-Gutiérrez et al., 2009), and inferred maximum likelihood trees for each ortholog group using RAxML v8.2.0 (Stamatakis, 2014) and the LG amino acid substitution model. Then, for each tree, we recorded all pairwise phylogenetic distances between species as measured by substitutions per alignment position using the cophenetic module of the ape v3.5 R library (Paradis et al., 2004;Core Team, 2015). We retrieved distances between each unicellular holozoan ortholog and, separately, Amphimedon, Trichoplax, Homo and Nematostella orthologs. For each inter-species comparison, we tested the significance of differences in phylogenetic distances between unicellular holozoans, using the non-parametric Wilcoxon rank sum test from the R stats library (Core Team, 2015).

Comparative analysis of intron content
Intron content of a subset of 40 eukaryotic genomes (excluding Abeoforma and Pirum, which had lower-quality gene annotations due to their fragmented assemblies) was analyzed using a set of single-copy orthologous genes, and used to reconstruct ancestral states as described by Csű rö s et al. (Csuros et al., 2011;Csurö s et al., 2007. We then selected orthocluster present as single-copies in 80% of our species dataset, allowing for paralog genes to occur in just one species per group (if that was the case, the best-scoring copy in BLAST alignments was kept). This yielded a group of 342 nearly paneukaryotic genes, whose protein translations were then aligned using MAFFT v7.245 G-INS-i algorithm (Katoh and Standley, 2013) and annotated with their intron coordinates (retrieved from their respective genome annotations). With this information, we reconstructed the ancestral states of each intron using the Malin implementation of the probabilistic model of intron evolution developed by Csű rö s et al. (Csűrö s and Mikló s, 2006;, starting from the standard null model, running 1000 optimization rounds (likelihood convergence threshold = 0.001) and assuming a consensus eukaryotic phylogeny (see Generation of a species tree for comparative analyses).
Conserved intron sites (defined as unambiguously aligned in 80% of the orthologs, maximum of 10% of gap positions) were used to calculate the rates of intron loss and gain for each node of the tree. These rates were used to calculate a table of intron sites with a certain probability of presence, gain or loss at every node of the tree (which, when summed, give the number of introns that are present, gained or lost at that node (Csűrö s and Mikló s, 2006)). We computed 100 bootstrap replicates in Malin to assess uncertainty about inferred rate parameters and evolutionary history. In particular, we calculated the variance-to-mean ratio of the inferred number of introns in each ancestor with 100 bootstrap replicates (with values higher than one indicating more dispersed results and less reliable inferences).
For each node i, we calculate the percentage of introns gained (p G,i )or lost (p L,i ) as a percentage of the total number of introns at that node. Then, the gain/loss ratio of a node, r i , was calculated as follows: p G;i > p L;i ! r i ¼ log 10 p G;i p L;i p L;i < p L:i ! r i ¼ log 10 p G;i p L;i À1 ! Â À1 We represented the presence and absence of intron sites at each lineage (extant and ancestral), and the number of introns shared between species (only extant), using heatmaps (R gplots library (Warnes et al., 2016)). Inter-species distances were calculated using the pairwise counts of shared introns and the Spearman correlation algorithm, which was used to perform Ward hierarchical clustering as implemented in R stats library (Core Team, 2015). We used the same algorithms to calculate distances of intron presence probability profiles, and subsequent clustering. For Capsaspora, the phylostratigraphy of intron sites was combined with the nucleosome-free sites identified by ATAC-seq analysis in (Sebé-Pedró s et al., 2016b), which were assumed to be putative regulatory sites. Then, we compared phylostrata distribution ('ancestral' versus 'recent' Capsaspora-specific sites) for introns with and without regulatory sites, using a Fisher's exact test: 74 recent introns and 465 ancestral introns lacked putative regulatory sites (!50% ATAC site overlap with the intron sequence, calculated using bedtools v2.24.0 intersect utility (Quinlan and Hall, 2010)), while 3 and 22 recent and ancestral introns had regulatory sites.

Comparative analysis of protein domain architecture evolution
Protein domain architectures of the 40 eukaryotic species subset (excluding Abeoforma and Pirum, which had lower-quality gene annotations due to their fragmented assemblies) were computed using Pfamscan and the 29th release of the Pfam database (Punta et al., 2012). For each protein, the domain architecture was decomposed into all possible directed binary domain pairs (ignoring repeated consecutive domains; i.e. from protein A-B-B-C, the pairs A-B, A-C and B-C were built), and linked to its presence in its corresponding orthocluster (see Generation of a species tree and ortholog datasets for comparative analyses section). The final output was a numerical profile of species distribution for each combination of domain pairs in orthoclusters (considering that a cluster can contain more than one pair, and a pair can be present in more than one cluster, and thence the number of occurrences is recorded).
The numerical profile was analyzed using the general phylogenetic birth-and-death model developed by Csű rö s and Mikló s (Csűrö s and Mikló s, 2006) as implemented in Count (Csurö s, 2010). This allows the comparative analysis and ancestral reconstruction of discretized quantitative properties of genomes, assuming a specific species tree (see Comparative analysis of intron content). We used a gain-loss-duplication model with unconstrained gain/loss and duplication/loss ratios in all lineages, assuming a Poisson distribution of orthocluster size at the LECA (root) and no rate variation categories. In this context, 'gain' was defined as the acquisition of a new pairwise domain combination in an orthocluster; a 'duplication' as the propagation of the combination (by gene duplication or convergent domain rearrangements); and 'loss' as pair dissociation. Starting from the standard null model, we ran 100 optimization rounds (convergence threshold = 0.1).
To analyze the modularity of the protein domain networks (and subnetworks) for each genome, we 1) calculated the community structure of each network using Louvain iterative clustering to obtain communities of domain pairs (undirected graphs), and 2) calculated the global network modularity according to these communities. The modularity parameter measures the fraction within-community edges minus the expected value obtained from a network with the same communities but random vertex connections (Newman and Girvan, 2004). A maximum value of 1 indicates a strong community structure, while a minimum value of 0 indicates that within-community edges are as frequent as expected in a random network. For these analyses we used the relevant algorithms from the igraph R library v1.0.1 (Core Team, 2015;Csárdi and Nepusz, 2006). Function-oriented domain subnetworks were obtained by retrieving orthologous groups that contained relevant domains, which were obtained from previous studies (transcription factors from (de Weirauch and Hughes, 2011), signaling domains from , ECM-related domains from Sebé-Pedró s et al., 2010;Hynes, 2012), ubiquitination from ) and pfam2go annotations (for the subsets mentioned above, and also for protein-binding domains) (Mitchell et al., 2015). Monotonic statistical dependence between modularity and the number of domains per community was tested using Spearman's rank correlation coefficient ( s ) for all network or subnetwork (for original and simulated data).

Comparative analysis of individual protein domain evolution
We mapped the presence of individual protein domains across our dataset of 40 eukaryotic species (excluding Abeoforma and Pirum), as predicted by Pfamscan and the 29th release of the Pfam database (Punta et al., 2012). Using this numerical profile of domain presence in extant genomes, we computed the gains and losses at ancestral nodes using the Dollo parsimony algorithm as implemented in Count (Csurö s, 2010).

Accession numbers
Genome sequencing and assembly data from Corallochytrium, Abeoforma, Pirum and Chromosphaera has been deposited in NCBI using the BioProject accession PRJNA360047. Transcriptome sequencing data from Abeoforma and Chromosphaera has been deposited in NCBI using the Bio-Project accession PRJNA360056.