The ancestral animal genetic toolkit revealed by diverse choanoflagellate transcriptomes

The changes in gene content that preceded the origin of animals can be reconstructed by comparison with their sister group, the choanoflagellates. However, only two choanoflagellate genomes are currently available, providing poor coverage of their diversity. We sequenced transcriptomes of 19 additional choanoflagellate species to produce a comprehensive reconstruction of the gains and losses that shaped the ancestral animal gene repertoire. We find roughly 1,700 gene families with origins on the animal stem lineage, of which only a core set of 36 are conserved across animals. We find more than 350 gene families that were previously thought to be animal-specific actually evolved before the animal-choanoflagellate divergence, including Notch and Delta, Toll-like receptors, and glycosaminoglycan hydrolases that regulate animal extracellular matrix (ECM). In the choanoflagellate Salpingoeca helianthica, we show that a glycosaminoglycan hydrolase modulates rosette colony size, suggesting a link between ECM regulation and morphogenesis in choanoflagellates and animals. Data Availability Raw sequencing reads: NCBI BioProject PRJNA419411 (19 choanoflagellate transcriptomes), PRJNA420352 (S. rosetta polyA selection test) Transcriptome assemblies, annotations, and gene families: https://dx.doi.org/10.6084/m9.figshare.5686984 Protocols: https://dx.doi.org/10.17504/protocols.io.kwscxee


15
The biology of the first animal, the "Urmetazoan," has fascinated and confounded biologists for more 16 than a century (Dujardin, Usui et al., 1999;Frank and Kemler, 2002) that were previously thought to be animal-specific, but are found in a subset of choanoflagellates within 219 our data set (Supp. Figure 10, Methods). 220 We also found evidence that numerous gene families and pathways that originated in animals arose 221 through shuffling of more ancient protein domains and genes that were already present in the   Figure 16). Therefore, GAG hydrolases are not animal-specific, 292 but instead evolved earlier, in a similar time frame to the chondroitin sulfate biosynthetic pathway.

293
Interestingly, at least 18 of the 21 choanoflagellates in this study also encode a GAG lyase, which in 294 animals are only found in sponges and cnidarians, suggesting that a GAG lyase was present early in      (Simakov and Kawashima, 2017)], about 80% of which were also present in the Urchoanozoan and the 320 remainder of which evolved on the animal stem lineage (Supp. Table 5). The patchwork ancestry of the 321 Urmetazoan genome is illustrated by the fact that many gene families responsible for animal 322 development, immunity and multicellular organization evolved through shuffling of protein domains that 323 first originated in the choanozoan stem lineage together with ancient or animal-specific domains (e.g. the 324 TGF-b ligand and receptor; Supp. Figure 10, Supp. Figure 14). In addition, other gene families found in 325 the Urchoanozoan were subsequently combined into new pathways in the animal stem lineage along with 326 newly evolved genes (e.g., the TLR and Hedgehog pathways; Figure 4, Supp. Table 9). On the other 327 hand, the history of the Urmetazoan genome is not simply one of innovation and co-option, as roughly 328 1,650 Urchoanozoan genes were lost, including genes for the synthesis of nine essential amino acids 329 (Supp. Figure 12, Supp.

354
The origin of multicellularity in animals provided novel niches for bacteria to exploit, requiring the 355 evolution of new mechanisms for mediating interactions with pathogenic and commensal bacteria.

356
Nonetheless, the roots of animal innate immunity clearly predate animal origins. We have found that hypothesize that the core TLR/NF-κB pathway functioned in prey sensing, immunity, or more complex 364 processes in the Urchoanozoan that subsequently formed the basis of a self-defense system in animals.
Because critical pathway members linking TLR and NF-κB appear to be animal innovations (e.g.,

366
MyD88), the animal signaling pathway may have evolved to diversify downstream signaling processes to 367 tailor responses in a multicellular context. This pathway diversification may have included the evolution 368 of roles in development, as TLRs have been implicated in both NF-κB-dependent and NF-κB-369 independent developmental signaling (in addition to their functions in immunity) in bilaterians and in the 370 cnidarian N. vectensis (Leulier and

374
Our study provides an exhaustive and detailed view of the changes in gene content that laid the 375 foundation for animal origins. Changes in how Urmetazoan gene families were regulated also likely 376 played a role in animal evolution, as aspects of animal phosphoproteome remodeling are shared with C.  Table 1). In order to reduce the possibility of contamination during the isolation 419 procedure, we generated sterile air flow across the microscope and micromanipulator apparatus using a 420 HEPA-type air purifier (HAP412BN, Holmes, Boca Raton, Florida, United States). Choanoflagellates are co-isolated with diverse bacterial prey, which serve as a food source. In order to 431 limit bacterial diversity to the species that led to optimal choanoflagellate growth, we treated each culture 432 with a panel of ten different antibiotics (Supp. solutions before use by filtration through a 0.22 µm syringe filter (Thermo Fisher Scientific) in a laminar flow hood. We initially treated each culture with all ten antibiotics. We selected initial treatments that decreased bacterial density and diversity, and then repeatedly diluted the cultures into fresh medium with 438 the same antibiotic until no further change in bacterial density or diversity was observed. We then re-439 treated each of these cultures with an additional test of all ten antibiotics, as their modified bacterial 440 communities often responded differently from their initial communities. We repeated successive rounds 441 of treatment until no further reduction in bacterial density or diversity was observed (Supp. Table 1).

442
We tested a range of concentrations of different temperatures and growth media (Supp.   Table 3). We tested buffered medium for two freshwater species 457 that experienced lowered pH during growth, C. hollandica (pH 5.5) and Salpingoeca punica (pH 5), using

491
We pelleted cells in 50 ml conical tubes at 3220 x g in a refrigerated centrifuge at 4º C, removed the 492 first 47.5 ml of supernatant by pipetting, and the last 2.5 ml by aspiration. When we harvested more than 493 50 ml for a culture, we spun in separate tubes, removed all but 2.5 ml of supernatant, resuspended, 494 combined into a single 50 ml conical tube, and repeated the centrifugation as above. We flash froze 495 pellets in liquid nitrogen and stored them at -80º C. To reduce the possibility of cross-contamination, we 496 harvested and centrifuged each culture separately, we used disposable plastic pipette tubes, conical tubes, We isolated total RNA from cell pellets with the RNAqueous kit (Ambion, Thermo Fisher Scientific).

503
We modified the manufacturer's protocol to double the amount of lysis buffer, in order to increase RNA 504 yield and decrease degradation. We performed both optional steps after adding lysis buffer: we spun for 3 505 minutes at top speed at 1º C and passed the supernatant through a 25 gauge syringe needle several times.

506
We used the minimum suggested volumes in the two elution steps (40 µl and 10 µl). We measured RNA 507 concentration using a NanoDrop ND-1000 spectrophotometer (Thermo Fisher Scientific). 508 For all species except C. hollandica, we immediately proceeded to digest genomic DNA using the 509 TURBO DNA-free kit (Ambion, Thermo Fisher Scientific), following the manufacturer's protocol with a 510 30 minute incubation. After digestion, we removed DNase with DNase Inactivation Reagent for all 511 species except S. punica, whose RNA extract was incompatible with the reagent. We instead removed 512 DNase from S. punica by extracting with two volumes of pH 8 phenol:chloroform:isoamyl alcohol, 513 removing residual phenol with two volumes of chloroform:isoamyl alcohol, and precipitating with 0.3 M 514 sodium acetate pH 5.2 (all three from Sigma-Aldrich), 25 µg of GlycoBlue (Thermo Fisher Scientific) as 515 a carrier and two volumes of 100% ethanol. We washed the pellet in 70% ethanol and resuspended in 50 516 µl of nuclease-free water. For Didymoeca costata, after DNase removal with the Inactivation Reagent, the 517 RNA still appeared to be slightly contaminated with protein, so we performed a pH 8 phenol:chloroform 518 extraction to remove it. For C. hollandica, we observed significant total RNA degradation in the presence 519 of DNase buffer. Instead, to remove genomic DNA we performed three successive rounds of extraction 520 with pH 4.5 phenol:chloroform:isoamyl alcohol (Sigma-Aldrich), followed by the chloroform:isoamyl 521 and precipitation steps described above. To reduce the possibility of cross-contamination among samples, 522 we performed RNA isolation and DNase digestion for a single culture at a time, we used disposable materials when possible, and we cleaned all other materials (bench tops, centrifuges, dry baths, pipettes, 524 etc.) thoroughly with ELIMINase before use.

525
We evaluated total RNA on Bioanalyzer 2100 RNA Pico chips (Agilent Technologies, Santa Clara, 526 California, United States) with four criteria to be considered high-quality: (1) four distinct ribosomal 527 RNA peaks (16S and 23S for bacteria, 18S and 28S for choanoflagellates), (2) low signal in all other 528 regions, as a non-ribosomal signal is evidence of degradation, (3) at least a 1:1 ratio of 28S ribosomal area 529 to 18S ribosomal area, since 28S ribosomal RNA is likely to degrade more easily than is 18S ribosomal   differ substantially between the two data sets: 12,737,031 reads mapped for the two-round data, and 22 12,585,647 for the four-round data. Similarly, 10,509,262 reads from the two-round data mapped to S. 568 rosetta transcripts and 10,181,522 for the four-round data. We also asked whether additional rounds of 569 polyA selection would cause increased RNA breakage due to pipetting or heating during the selection 570 process [e.g., (Kingston, 2001)], leading to lower coverage of the 5' ends of transcripts (because the poly-571 dT sequence binds to the 3' end of RNA molecules). We estimated the loss of 5' transcript ends due to 572 shear to affect less than roughly 5% of transcripts (Supp. Figure 2a).

573
The four-round method removed roughly an order of magnitude more non-polyadenylated RNA than 574 the two-round method (Supp. Figure 2b). We observed that the four-round data set had a slightly lower 575 overall read quality. To address this, we tested whether a difference in read quality between the two data 576 sets could account for the difference in read mapping by randomly resampling the two-round data set to 577 contain the same number of either Phred-like quality 20 or Phred-like quality 30 bases as the four-round 578 data set, but neither resampling affected the results. We also tested whether transcript assembly quality 579 would suffer in the four-round data set by assembling both data sets de novo with Trinity release 2012-03-   . Table 1). We performed four rounds of polyA selection (instead of 592 the standard two) and introduced two further modifications to the standard protocol: first, we repeated the 593 Agencourt AMPure XP (Beckman Coulter, Indianapolis, Indiana, United States) bead clean-up step to 594 enhance adapter removal, and second, we used 1.5 µl less volume in all bead elution steps, in order to 595 reduce loss during the protocol. We prepared libraries from 5 RNA samples at a time, and the libraries 596 were later multiplexed into groups of 6 or 7 per sequencing lane. To allow us to detect evidence of 597 potential cross-contamination during either process, we ensured that the groupings for sample preparation 598 and sequencing were distinct (Supp. Table 1).

599
We estimated library concentration using the Qubit dsDNA HS Assay (Thermo Fisher Scientific) and  We trimmed sequence reads using Trimmomatic version 0.30 (Lohse et al., 2012) with two separate 616 filters: (1) removal of TruSeq adapter sequence and (2) trimming very low quality bases from the ends of 617 each read. To implement these filters, we ran Trimmomatic in three phases. In the first phase, we clipped 618 palindromic adapters using the directive ILLUMINACLIP:2:40:15 and discarded resulting reads shorter 619 than 25 bases with MINLEN:25. This resulted in two data sets: one containing reads whose mate pair 620 remained in the set, and the other composed of reads whose pair was removed due to adapter 621 contamination. In the second phrase, we clipped simple adapters from the remaining paired data set using 622 the directive ILLUMINACLIP:2:40:15, imposed a minimum Phred-like quality cutoff of 5 on the first ten 623 and last ten bases using LEADING:5 and TRAILING:5, subjected the read to a minimum sliding window 624 quality using SLIDINGWINDOW:8:5 and discarded resulting reads shorter than 25 bases with 625 MINLEN:25. The third phase operated on the remaining unpaired reads from the first phase, and 626 implemented the same directives as the second phase. We used a permissive minimum quality of 5 in 627 order to remove very low quality bases, as these might interfere with read error correction in the 628 subsequent processing step. We discarded reads less than 25 in length because they were shorter than the 629 k-mer size of the Trinity assembler (see De novo transcriptome assembly below). In all adapter clipping 630 operations, we used sequences appropriate to the index used for multiplexed sequencing. The number of 631 sequence reads and total bases remaining after trimming for each library are given in Supp. We performed error correction on trimmed reads using Reptile v1.1 (Yang et al., 2010) following the 636 authors' instructions, with the modifications described below. We began by using the 'fastq-converter.pl' 637 script to convert from FASTQ and to discard reads with more than 1 ambiguous character "N" in any 638 window of 13 bases. For reads with one "N", we chose the character "a" as the substitute for "N", as all of 639 the characters in our input reads were in upper case (A, C, G, or T); thus, we could later recognize "N" 640 bases converted in this step. Next, we tuned parameters using the 'seq-analy' utility following the  Table 1). All other parameters were left at their defaults to run Reptile.

647
We noticed that the locations within reads of errors identified by Reptile fell into two general classes: 648 sporadic errors not located adjacent to any other error, and clustered errors, in which several adjacent 649 bases within the same k-mer window were identified as errors. In some extreme cases, every single base 650 within a sequence read was identified as a target for error correction; we observed this phenomenon in the 651 set of read corrections for every species. We reasoned that this was an unintended consequence of the 652 iteration-to-exhaustion approach (step 2d) of the Reptile algorithm. Therefore, we designed a method to 653 correct sporadic errors, but not clustered errors. For each species, we began by grouping each read according to the total number of errors identified. Within each group, we built a distribution of the 655 number of adjacent errors within the same k-mer window. For sporadic errors, this number should be 656 close to 0, but for clustered errors, the number could be up to the k-mer size minus one. There was a clear 657 pattern within each of these distributions, with some errors identified with no neighbors (sporadic errors), 658 a smaller number identified with 1 neighbor, and an increasing number beginning at 2 or more neighbors 659 (clustered errors). We used these empirical distributions to set the maximum allowable amount of  Figure   692 3), most contigs from one species had no matches in any other species. Within the contigs that did have 693 cross-species matches (Supp. Figure 18a), we observed a large number that were identical or nearly-694 identical, which were likely cross-contaminants, and another set of matches distributed around roughly 695 80% identity, likely representing highly conserved genes. The two cases were separated at roughly 96% 696 identity. After exploring the distribution of match lengths in a similar manner (Supp. Figure 18b), we 697 considered matches at 96% or greater identity of at least 90 bases in length to be cross-contaminated.
Next, we identified the sources of cross-contaminated contigs by comparing the number of reads 699 mapping from both species for each match. We first masked contigs with Tandem Repeats Finder version  Figure 18c), we retained only contigs for the species in a pair with 10 times or more reads 707 mapping, and discarded all other contigs, with one exception: if a contig had at least 10,000 reads 708 mapping, we did not discard it, regardless of read mapping ratio. We observed that many contigs 709 encoding conserved genes (for example, α-tubulin and elongation factor 1α) also tended to be the most 710 highly expressed, and thus the read mapping ratio was often close to 1 for these contigs. We identified as 711 cross-contaminated and removed between 1.7% and 8.8% of contigs for each species (Supp. Table 1). We 712 note that our procedure would also be expected to discard sequences from any bacterial species that were 713 present in two different choanoflagellate cultures. For a more detailed examination of the cross-714 contamination removal process, see (Richter, 2013).

722
Furthermore, we also observed many contigs whose predicted proteins were a strict subset of the 723 predicted proteins from another contig. For example, contig 1 might encode predicted proteins A and B, 724 and contig 2 might encode two predicted proteins exactly matching A and B, and a third predicted protein 725 C. We removed both types of redundancy (exact matches and subsets) from the set of predicted proteins, 726 and we also removed the contigs from which they were predicted (Supp.

745
To estimate the completeness of the predicted protein set for each species, we searched our data for 758 759

761
To identify gene families, we chose a set of representative animals and outgroup species (Supp . Table   762 3). We used the 19 choanoflagellates we sequenced and the complete predicted protein sets from the M.    Figure 4a). If these gene families represented true orthologous groups and were therefore present 788 in the ancestral Choanozoan, they would require several independent loss events within one of the two 789 groups; we reasoned that it was more parsimonious that some proportion of these genes evolved in only 790 one of the two groups and that the isolated BLAST hits from the other group represented false positives.

791
To address this problem, we produced membership probabilities for each protein within each gene family,

828
In the text, we precede estimated gene family counts with a tilde. We selected a conservative 829 probability threshold of 10% in order to minimize the number of gene families that might erroneously be assigned as specific to a given group (e.g., animals or choanozoans). This choice may have resulted in 831 some gene families that are truly specific to a group instead being incorrectly assigned as shared with

891
Since the focus of this study was not phylogenetic tree construction, and because the topic has been 892 thoroughly addressed elsewhere, we selected species trees from prior publications rather than building 893 one based on our data. Furthermore, because our major findings were based on comparisons among

Group-specific core gene families found in all extant members
To identify animal-specific or choanoflagellate-specific gene families that were also present in all 918 species within either group, we required every species in the group to pass the 10% minimum probability 919 criterion. These core gene families are subject to several potential technical artifacts. First, an incomplete 920 genome or transcriptome assembly could result in a species appearing to lack a gene family. Second, gene 921 families containing repeated or repetitive protein domains (e.g. EGF or Ankyrin) might cause 922 inappropriate inferences of sequence homology in our BLAST-based approach. Third, a gene family 923 which duplicated on the lineage leading to the last common ancestor of a group could produce two gene 924 families, among which paralogs are incorrectly partitioned, resulting in one or more species appearing to 925 lack one of the two families. Thus, the lists of core gene families should not necessarily be considered as 926 exhaustive, especially for serially duplicated or repeat-containing gene families.

927
We annotated animal-specific pan-animal gene families (Table 1) Figure 19). We accepted Pfam domains present in at least 10% of proteins in a gene 997 family. Domains of unknown function in the Pfam database had names beginning with "DUF". To ensure 998 that these domain names were not assigned a function in a more recent version of the Pfam database 999 published after our initial annotation, we checked against Pfam version 31.0 and considered all domains 1000 whose names no longer began with "DUF" as having an assigned function.
We   (Heldin et al., 1997). Although none of these combinations was present in encoded in separate proteins (Supp. Figure 14). None of these four domains is found in eukaryotes outside Choanozoa (kinases are ancient domains, and the TGF_beta domain is found only in animals).

1119
We identified GAG synthesis genes in S. helianthica using the list of human genes from Data S3 in 1120 (Woznica et al., 2017). We first standardized the identifiers of the human sequences to those in NCBI 1121 release GRCh38 of the human genome. Next, we ran blastp version 2.2.26 (Altschul, 1997) against S.

1122
helianthica with a maximum E value of 10 -5 and selected the matching S. helianthica protein with the 1123 lowest E value. If a single S. helianthica protein hit two human proteins, we selected the lowest E value 1124 hit (and used the next best hit, if any, to the other human protein). In three cases, there was a S.

1185
We performed all counts blind to the identity of each sample. We counted 30 colonies per treatment.

1186
We also tested two additional commercially available GAG cleavage enzymes (all from Sigma-1187 Aldrich). We found that bovine hyaluronidase type VIII, a GAG hydrolase, had no effect on S.

1188
helianthica, but that it also formed visible precipitates at the concentrations we used for type IV-S (≥ 300 1189 µg/ml), which likely prevented its enzymatic activity. We also found no effects in S. helianthica for 1190 Streptomyces hyalurolyticus hyaluronidase, a GAG lyase, which is thought to have no enzymatic activity 1191 on chondroitin sulfate (Ohya and Kaneko, 1970;Park et al., 1997  Top, the phylogenetic relationships among the species whose gene contents were compared. Each colored node represents the last common ancestor of a group of species. Bottom, a heat map of the 13,358 orthologous gene families inferred to have been present in at least one of six nodes representing common ancestors of interest: Ureukaryote, Uropisthokont, Urholozoan, Urchoanozoan, Urchoanoflagellate, and Urmetazoan (the full heat map for all gene families is shown in Supp. Figure 6). Each row represents a gene family. Gene families are sorted by their presence in each group of species, indicated by colored bars and boxes (eukaryotes, opisthokonts, holozoans, choanozoans, choanoflagellates and animals) and subsequently clustered within groups by uncentered Pearson correlation.     The representation is the same as in Figure 2, with one major difference: gene family origin, indicated on the left, indicates that the average gene family probability within the group is at least 0.1, but does not imply that it was present in the last common ancestor of the group (as is the case in Figure 2).  Figure 9. RNAi components in are present in choanoflagellates, but have been lost multiple times in different lineages. A phylogenetic tree of choanoflagellates with inferred losses of Argonaute (blue) and Dicer (red). For each species with moderate evidence for either Argonaute or Dicer, the expression level of that gene is shown as a percentile FPKM rank within each species. Species lacking a gene have no expression value. Overall, there have been multiple losses of both Argonaute and Dicer, and RNAi machinery appears to be entirely absent in the clade containing the two previously sequenced species, Monosiga brevicollis and Salpingoeca rosetta. In the species with Argonaute, it is highly expressed, and Dicer always shows a lower relative expression level.  These include the pathways for leucine, valine, isoleucine, histidine, lysine, threonine, tryptophan, phenylalanine and methionine, which were all more complete in the Urchoanozoan than in the Urmetazoan. Purple: pathway components present in the Urchoanozoan and the Urmetazoan. Blue: pathway components present in the Urchoanozoan but absent from the Urmetazoan.   Figure 15. Alignment of gene family 6840, which contains animal SARM1 proteins. Only a portion of the alignment, surrounding the glutamic acid residue that is necessary but not sufficient for SARM1 function (yellow box), is depicted. The Homo sapiens SARM1 protein, which was not part of our data set of gene families, but which we included in the alignment, is shown at the top. Choanoflagellate sequences are surrounded by a green box. Positions at the top are given with respect to the human SARM1 protein. Each species is listed with its identifier followed by the length of the protein that was used to build the full alignment.